Working with Sequence Data (FASTA)

In this hands-on module, we will learn how to work with the FASTA data format. This is one of the most fundamental file formats in bioinformatics, used extensively for storing biological sequencing data (DNA, RNA, and protein sequences).

After going through this module, students should be able to:

Identify and understand valid FASTA files
Read FASTA files into Python data structures
Parse sequence headers and extract metadata
Write sequences to FASTA files

FASTA Format Basics

FASTA (pronounced “fast-A”) is a text-based format used to represent nucleotide or protein sequences. It was developed in the 1980s and remains one of the most widely used formats in bioinformatics today. FASTA files are simple, human-readable, and can be easily processed with basic text manipulation tools.

A FASTA file consists of one or more sequence records. Each record contains:

A header line (also called a definition line) that starts with a > (greater-than) symbol
One or more sequence lines containing the actual sequence data

Here is an example of a simple FASTA file with three protein sequences downloaded from Uniprot:

>sp|P02144|MYG_HUMAN Myoglobin OS=Homo sapiens OX=9606 GN=MB PE=1 SV=2
 MGLSDGEWQLVLNVWGKVEADIPGHGQEVLIRLFKGHPETLEKFDKFKHLKSEDEMKASE
 DLKKHGATVLTALGGILKKKGHHEAEIKPLAQSHATKHKIPVKYLEFISECIIQVLQSKH
 PGDFGADAQGAMNKALELFRKDMASNYKELGFQG

>sp|P68871|HBB_HUMAN Hemoglobin subunit beta OS=Homo sapiens OX=9606 GN=HBB PE=1 SV=2
 MVHLTPEEKSAVTALWGKVNVDEVGGEALGRLLVVYPWTQRFFESFGDLSTPDAVMGNPK
 VKAHGKKVLGAFSDGLAHLDNLKGTFATLSELHCDKLHVDPENFRLLGNVLVCVLAHHFG
 KEFTPPVQAAYQKVVAGVANALAHKYH

 >sp|P01308|INS_HUMAN Insulin OS=Homo sapiens OX=9606 GN=INS PE=1 SV=1
 MALWMRLLPLLALLALWGPDPAAAFVNQHLCGSHLVEALYLVCGERGFFYTPKTRREAED
 LQVGQVELGGGPGAGSLQPLALEGSLQKRGIVEQCCTSICSLYQLENYCN

In this example:

Each sequence starts with a > symbol followed by header information
The header contains a UniProt-specific identifier and additional descriptive information (see Uniprot’s documentation)
The sequence lines contain amino acid single-letter codes
Sequences can span multiple lines (as shown above)
Blank lines between records are optional but often used for readability

Sequences are expected to follow the standard IUB/IUPAC amino acid and nucleic acid codes, with some important exceptions:

Lower-case letters are allowed
A single hyphen or dash can be used to represent a gap in the sequence
In amino acid sequences, * is sometimes used to represent a stop codon

There are multiple filename extensions associated with FASTA files:

Extension	Meaning	Notes
.fasta, .fas, .fa	generic FASTA	Any generic FASTA file
.fna	FASTA nucleic acid	Used generically to specify nucleic acids
.ffn	FASTA nucleotide of gene regions	Contains coding regions for a genome
.faa	FASTA amino acid	Contains amino acid sequences
.mpfa	FASTA amino acids	Contains multiple protein sequences
.frn	FASTA non-coding RNA	Contains non-coding RNA regions for a genome, e.g. tRNA, rRNA

Working with FASTA files

Let’s get some practice working with FASTA files. We’ll use VSCode for these exercises. Open a VSCode RemoteSSH session (Cmd+Shift+P -> RemoteSSH) and create a new terminal (Cmd+Shift+P -> Terminal: Create New Terminal).

Within the terminal inside VSCode on your class VM, navigate to your mbs-337 project directory and create a new directory called working-with-bio-data.

Then, let’s create our input file:

Copy and paste the contents of this file into a new file called sequences.fasta

Helpful Linux Commands

There are a few linux commands that we can do from the command line to get some high-level information about our FASTA file.

First, let’s take a look at the FASTA file:

# Print the first 8 lines
[mbs-337]$ head -n 8 sequences.fasta

# Print the lines starting with '>'
[mbs-337]$ grep ">" sequences.fasta

# Count the number of lines starting with '>'
[mbs-337]$ grep -c ">" sequences.fasta

Read FASTA from File

Good news! The Biopython library that we installed earlier in the course makes working with sequence data very straightforward. Let’s activate our Python3 virtual environment, and create a new file called fasta_ex.py.

For reading FASTA files, we will use SimpleFastaParser from Biopython. SimpleFastaParser is a lightweight, memory-efficient generator that yields (header, sequence) tuples. This makes it great for working with large FASTA files.

Tuple vs List

Tuples and lists are both ordered items or values. The key differences between them are:

Tuples are unchangeable, and are contained in parentheses ()

Lists are changeable, and are contained in square brackets []

First, import SimpleFastaParser from Bio.SeqIO.FastaIO and read in our FASTA file:

from Bio.SeqIO.FastaIO import SimpleFastaParser

with open('sequences.fasta', 'r') as f:
    for values in SimpleFastaParser(f):
        print(values)
        break # stop after the first

('sp|P36222|CH3L1_HUMAN Chitinase-3-like protein 1 OS=Homo sapiens OX=9606 GN=CHI3L1 PE=1 SV=2', 'MGVKASQTGFVVLVLLQCCSAYKLVCYYTSWSQYREGDGSCFPDALDRFLCTHIIYSFAN...')

For each sequence record, a tuple of strings is returned: the FASTA header line (without the leading ‘>’ character), and the sequence itself.

We can access the data within the tuples by iterating over both the header and the sequence:

from Bio.SeqIO.FastaIO import SimpleFastaParser

with open('sequences.fasta', 'r') as f:
    for header, sequence in SimpleFastaParser(f):
        print(f"Header: {header}")
        print(f"Sequence: {sequence}")
        break

Header: sp|P36222|CH3L1_HUMAN Chitinase-3-like protein 1 OS=Homo sapiens OX=9606 GN=CHI3L1 PE=1 SV=2
Sequence: MGVKASQTGFVVLVLLQCCSAYKLVCYYTSWSQYREGDGSCFPDALDRFLCTHIIYSFANISNDHIDTWEWNDVTLYGMLNTLKNRNPNLKTLLSVGGWNFGSQRFSKIASNTQSRRTFIKSVPPFLRTHGFDGLDLAWLYPGRRDKQHFTTLIKEMKAEFIKEAQPGKKQLLLSAALSAGKVTIDSSYDIAKISQHLDFISIMTYDFHGAWRGTTGHHSPLFRGQEDASPDRFSNTDYAVGYMLRLGAPASKLVMGIPTFGRSFTLASSETGVGAPISGPGIPGRFTKEAGTLAYYEICDFLRGATVHRILGQQVPYATKGNQWVGYDDQESVKSKVQYLKDRQLAGAMVWALDLDDFQGSFCGQDLRFPLTNAIKDALAAT...

Formatting the data in a tuple format like this allows us to easily perform actions on our sequence data. For example, let’s say I wanted to make my sequence headers easier to read. I could use the .split() method to break the header string into pieces:

from Bio.SeqIO.FastaIO import SimpleFastaParser

with open('sequences.fasta', 'r') as f:
    for header, sequence in SimpleFastaParser(f):
        parts = header.split()
        print(parts)
        break

['sp|P36222|CH3L1_HUMAN', 'Chitinase-3-like', 'protein', '1', 'OS=Homo', 'sapiens', 'OX=9606', 'GN=CHI3L1', 'PE=1', 'SV=2']

Our header looks quite different now! Using the default settings of .split(), we now have a list of each substring contained in our header string, each separated where there was a whitespace in the original string.

Tip

You can always learn more about a function or method using Python3’s interactive mode. For example, we could get additional information about the syntax and options associated with the .split() string method by doing the following:

>>> help(str.split)

Help on method_descriptor:

split(self, /, sep=None, maxsplit=-1) unbound builtins.str method
    Return a list of the substrings in the string, using sep as the separator string.

    sep
        The separator used to split the string.

        When set to None (the default value), will split on any whitespace
        character (including \n \r \t \f and spaces) and will discard
        empty strings from the result.
    maxsplit
        Maximum number of splits.
        -1 (the default value) means no limit.

    Splitting starts at the front of the string and works to the end.

Breaking down the first highlighted line:

self means the string you’re calling .split() on. Essentially, it means:
"hello world".split() # here, self == "hello world"
The / means: Everything before this must be passed to .split() positionally, not within the parentheses.
sep is the separator. If you provide it, the string will be split on that separator. If you don’t provide a separator, the default behavior (None) is to split on any whitespace.
maxsplit controls how many splits are allowed. The default (-1) means no limit.

So, now that we know how .split() works on strings, let’s see what happens when we separate on the | character:

from Bio.SeqIO.FastaIO import SimpleFastaParser

with open('sequences.fasta', 'r') as f:
    for header, sequence in SimpleFastaParser(f):
        parts = header.split('|')
        print(parts)
        break

['sp', 'P36222', 'CH3L1_HUMAN Chitinase-3-like protein 1 OS=Homo sapiens OX=9606 GN=CHI3L1 PE=1 SV=2']

From here, I could format my data so that the sequence information only contains the accession ID by grabbing the second item within this list. I could also perform an action on the sequences to determine the length of each sequence:

from Bio.SeqIO.FastaIO import SimpleFastaParser

with open('sequences.fasta', 'r') as f:
    for header, sequence in SimpleFastaParser(f):
        parts = header.split('|')
        print(f"Accession: {parts[1]}, Length: {len(sequence)}")

We can use SimpleFastaParser to build a list of dictionaries with the same structure:

from Bio.SeqIO.FastaIO import SimpleFastaParser

sequences = []

with open('sequences.fasta', 'r') as f:
    for header, sequence in SimpleFastaParser(f):
        parts = header.split('|')
        entry = {
            "id": parts[1],
            "description": header,
            "sequence": sequence,
            "length": len(sequence)
        }
        sequences.append(entry)

print(sequences[0])

Now we can do things like:

Count how many proteins are longer than 500 amino acids

count = 0

for seq in sequences:
    if seq['length'] > 500:
        count += 1  # this is shorthand for "count = count + 1"

print(f"Number longer than 500 aa: {count}")

Find the longest sequence

longest_length = 0
longest_seq = None  # placeholder string

for seq in sequences:
    if seq['length'] > longest_length:
        longest_length = seq['length']
        longest_seq = seq

print(longest_seq["description"])

Write FASTA to File

Let’s say we are only interested in continuing our analysis on a particular set of sequences within this FASTA file. We have a CSV file that contains a list of accessions, and we want to create a new FASTA file containing only the matching sequences (and their headers).

For this activity, you’ll need to copy/paste the contents of this file into a new file called accessions.csv. Notice that this CSV has a header row called ‘accession’.

Read in the CSV

First, we’ll use the csv module to read in the accessions as a list:

import csv
from Bio.SeqIO.FastaIO import SimpleFastaParser

# Read accessions from CSV into a list
accessions_to_keep = []
with open('accessions.csv', 'r') as f:
    reader = csv.DictReader(f)
    for row in reader:
        accessions_to_keep.append(row['accession'])

In the above code, csv.DictReader treats the first row of f as the header. For every following row, it creates a dictionary where the key == ‘accession’ and the value is the accession itself:

{'accession': 'P36222'}
{'accession': 'Q13258'}
etc.

We then iterate over this reader, where each row is now a dictionary, and append the value associated with each ‘accession’ key to an empty list called accessions_to_keep. After the loop finishes, the list looks like:

['P36222', 'Q13258', 'Q6W5P4', 'Q8TAX7']

Filter the sequences

Now, we read in sequences.fasta with SimpleFastaParser and also create a new output file called filtered.fasta where we will write our output to.

For each sequence in sequences.fasta, we extract the accession. If it’s in our accessions_to_keep list, we write the header and sequence to our new file, making sure that we properly format our output to match FASTA requirements:

# Read input FASTA and output matches to a new output fasta
with open('sequences.fasta', 'r') as infile, open('filtered.fasta', 'w') as outfile:
    for header, sequence in SimpleFastaParser(infile):

        # Grab the accession ID
        parts = header.split('|')
        accession = parts[1]

        # Filter by accession ID, write output to new fasta file
        if accession in accessions_to_keep:
            outfile.write(f">{header}\n")
            outfile.write(f"{sequence}\n")

Notice the special formatting that we have to do when we write our output file:

Our header line must start with >
We add a newline character (\n) to make sure that a new line is created before we continue the loop

When you run this script, filtered.fasta will contain only the sequences whose primary accession appears in accessions.csv. You can confirm by checking the number of headers and comparing to the number of accessions in your CSV.