Working with Sequence Data (FASTQ)

In this hands-on module, we will learn how to work with the FASTQ data format. FASTQ extends FASTA by storing quality scores alongside sequence data from sequencing instruments.

After going through this module, students should be able to:

  • Identify and understand valid FASTQ files

  • Use Linux commands to safely examine FASTQ structure

  • Read FASTQ files in Python using Biopython and specify the correct quality encoding

  • Understand why quality encoding matters and when to use SeqIO.parse

FASTQ Format Basics

FASTQ is an extension of the FASTA format that includes quality scores along with sequence data. It is typically used to store raw sequence reads from high-throughput sequencing technologies like Illumina platforms. FASTQ files additionally contain per-base quality scores indicating the confidence of each base call.

A FASTQ file ends in the .fq or .fastq file extension, and consists of one or more sequence records. Each record contains exactly four lines:

  • Line 1: A sequence identifier starting with @ (at symbol)

  • Line 2: The raw sequence data

  • Line 3: A separator line that starts with + (plus symbol), optionally followed by the same identifier as line 1

  • Line 4: Quality scores encoded as ASCII characters; must have same number of characters as line 2

Here is an example of a FASTQ file downloaded from the NCBI Sequence Read Archive (SRA). This example contains exactly one sequence record:

@SRR20340966.1 1 length=150
CNGCTGAGTTGCTCCTCCAAGGCCTCCAAACGAAGCTTCAAAGCGCTTCCATTCACCTCCTGCTCCTCCATGCGTCTGCCCAGCTCCACCGTCTGCGCACAGGCAGCTTCCACAGCAGGTGGAATTGCAGGTTCGACCACAGGAGGCGCT
+SRR20340966.1 1 length=150
F#FFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFF,FFF:,:FFFFF:FFFFFFFF::FFFFFF:FFFFF,FFF,:FFFFFFFFFFFFF

In this example:

  • Line 1 starts with @ and contains the read identifier in NCBI’s FASTQ format

  • Line 2 contains the sequence itself (nucleotides: A, T, G, C, N for unknown/ambiguous)

  • Line 3 starts with + and optionally repeats the identifier (often just +)

  • Line 4 contains quality scores as ASCII characters (one character per base)

Note on FASTQ Quality Score Encoding:

Quality scores in FASTQ files are encoded using ASCII characters, but different sequencing platforms and software versions use different encoding schemes. This is crucial to understand when working with FASTQ files from different sources.

Quality scores represent the probability that a base call is incorrect:

Phred Quality Score (Q)

Probability of incorrect base call

Base call accuracy

10

1 in 10

90%

20

1 in 100

99%

30

1 in 1000

99.9%

40

1 in 10,000

99.99%

50

1 in 100,000

99.999%

60

1 in 1,000,000

99.9999%

Different Platforms Use Different ASCII Encodings:

The quality score is then converted to a printable ASCII character by adding an offset. Different platforms use different offsets:

../_images/fastq-quality-encoding.png

Different sequencers use different Phred quality score encodings.

While you may come across older datasets (Solexa/Illumina 1.3-1.7) that use Phred+64 offset, the current standard is Phred+33 (Sanger/Illumina 1.8+).

EXERCISE

See if you can answer the following questions about our example FASTQ sequence record:

@SRR20340966.1 1 length=150
CNGCTGAGTTGCTCCTCCAAGGCCTCCAAACGAAGCTTCAAAGCGCTTCCATTCACCTCCTGCTCCTCCATGCGTCTGCCCAGCTCCACCGTCTGCGCACAGGCAGCTTCCACAGCAGGTGGAATTGCAGGTTCGACCACAGGAGGCGCT
+SRR20340966.1 1 length=150
F#FFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFF,FFF:,:FFFFF:FFFFFFFF::FFFFFF:FFFFF,FFF,:FFFFFFFFFFFFF
  1. Can you tell which ASCII encoding scheme the above FASTQ file is using?

  2. Which ASCII character corresponds to the worst Phred score for this encoding?

  3. What is the Phred quality score of the majority of the base calls above?

  4. What is the Phred quality score of the second base call?

  5. Do you notice anything interesting about that second base call?

Attention

Many bioinformatics tools can auto-detect the encoding, but it is still essential for you to know your data source. Choosing the wrong encoding can lead to bad quality control and disastrous consequences down the line!

Working with FASTQ files

Let’s get some practice working with FASTQ files. We’ll use VSCode for these exercises. Open a VSCode RemoteSSH session (Cmd+Shift+P -> RemoteSSH) and create a new terminal (Cmd+Shift+P -> Terminal: Create New Terminal).

We’ll create an example FASTQ file within our working-with-bio-data directory. Copy and paste the contents of this file into a new file called raw_reads.fastq

Helpful Linux Commands

Now let’s examine our FASTQ file. Your gut instinct may be to do a similar grep command that we used with the FASTA file on the sequence identifier: grep "@" raw_reads.fastq. The problem with this is that @ could appear in line 4 since this line contains the ASCII-encoded quality scores, and @ is a valid ASCII symbol. We need a method that understands the structure of the file, not just the characters in it.

Introducing awk

awk is a powerful text-processing command that works on one line at a time. Unlike grep, which simply searches for matching text, awk allows us to perform actions on specific lines within our file. The basic form of an awk command looks like this:

awk '{ action }' filename

where awk automatically reads the file line by line, and then applies the { action } to every line (unless told otherwise).

It also provides a built-in variable called NR, which stands for Number of Records. By default, one record = one line, and NR can therefore be used to represent the line number, starting at 1.

Since FASTQ files have a strict structure in which every read uses exactly 4 lines, that means that header lines always appear on lines 1, 5, 9, 13, etc. We can express this pattern mathematically:

NR % 4 == 1

Let’s test it out with an awk command:

awk 'NR % 4 == 1' raw_reads.fastq

This command will print every fourth line, starting with the first line, and thus prints only the FASTQ header lines and nothing else. Now we need a way to count these header lines to determine how many reads are in the file.

Introducing | and wc

In Linux, the pipe symbol (|) is used to connect two commands together. It will redirect the output of the command on its left as the input to command on its right. It allows chaining multiple commands together in a single line.

In addition, the wc command (short for word count) can be used to count the number of words (or lines when we add the -l option) in a file from standard input:

To demonstrate this, we can combine awk and wc -l using a pipe:

awk 'NR % 4 == 1' raw_reads.fastq | wc -l

The number printed is the total number of read identifiers (and therefore sequencing reads) in the FASTQ file.

These command line tricks can be very useful to get a quick glance of the data we’re working with. For more advanced operations on our FASTQ files, we use Python and Biopython.

Read FASTQ from File

We can read FASTQ files in Python using SeqIO.parse from Biopython. Unlike the lightweight SimpleFastaParser we used for FASTA, SeqIO.parse returns full SeqRecord objects that include the sequence and its quality scores. This is appropriate when you need both sequence and quality information (e.g., for quality control or filtering).

Attention

Memory and large files: SeqIO.parse is an iterator, so it does not load the entire file into memory at once. However, each record is a full SeqRecord with sequence and quality data. For very large FASTQ files (e.g., billions of reads), consider whether you need all records in memory at once—if not, iterate and process in chunks; if you do (e.g., sorting), be aware of memory limits.

You must tell Biopython which FASTQ quality encoding variant your file uses. If you use the wrong format, quality values will be wrong. Common format strings:

  • fastq-sanger — Phred+33 (Sanger, Illumina 1.8+). Use this for most modern Illumina data.

  • fastq-illumina — Phred+64 (older Illumina 1.3–1.7).

  • fastq-solexa — Solexa/Illumina legacy encoding.

We can actually use the information contained in the read identifier to help us identify the quality encoding variant:

@SRR20340966.1 1 length=150

The string SRRxxxxxxxx corresponds to a SRA Run Accession. We can copy that run accession here to get more information about our data.

Our sample raw_reads.fastq was sequenced on the Illumina NovaSeq 6000 and was published in 2023. It is very likely that this data uses the modern Phred+33 (Illumina 1.8+) encoding. We also determined above that the existence of the # character confirms we are working with Phred+33 encoding. Therefore, we will use fastq-sanger.

Activate your Python virtual environment and create a file called fastq_ex.py:

from Bio import SeqIO

with open('raw_reads.fastq', 'r') as f:
    for record in SeqIO.parse(f, 'fastq-sanger'):
        print(f"ID: {record.id}")
        print(f"Sequence: {record.seq}")
        print(f"Annot: {record.letter_annotations}")
        break
ID: SRR20340966.1
Sequence: CNGCTGAGTTGCTCCTCCAAGGCCTCCAAACGAAGCTTCAAAGCGCTTCCATTCACCTCCTGCTCCTCCATGCGTCTGCCCAGCTCCACCGTCTGCGCACAGGCAGCTTCCACAGCAGGTGGAATTGCAGGTTCGACCACAGGAGGCGCT
Annot: {'phred_quality': [37, 2, 37, 37, 37, 37, 37, 37, 37, 37, 37, 37, 37, 37, 37, 37, 37, 37, 37, 37, 37, 37, 37, 37, 37, 37, 37, 37, 37, 37, 37, 37, 37, 37, 37, 37, 37, 37, 37, 37, 37, 37, 37, 37, 37, 37, 37, 37, 37, 37, 37, 37, 37, 37, 37, 37, 37, 37, 37, 37, 37, 37, 37, 37, 37, 37, 37, 37, 37, 37, 37, 37, 37, 37, 37, 37, 37, 37, 37, 37, 37, 37, 37, 37, 37, 37, 37, 37, 37, 37, 37, 37, 37, 37, 37, 37, 11, 37, 37, 37, 25, 11, 25, 37, 37, 37, 37, 37, 25, 37, 37, 37, 37, 37, 37, 37, 37, 25, 25, 37, 37, 37, 37, 37, 37, 25, 37, 37, 37, 37, 37, 11, 37, 37, 37, 11, 25, 37, 37, 37, 37, 37, 37, 37, 37, 37, 37, 37, 37, 37]}

Each record has .id, .seq, and .letter_annotations (e.g., record.letter_annotations['phred_quality'] for a list of integer Phred scores per base).

Let’s see what would happen if we tried to use fastq-illumina instead of fastq-sanger:

from Bio import SeqIO

with open('raw_reads.fastq', 'r') as f:
    for record in SeqIO.parse(f, 'fastq-illumina'):
        print(f"ID: {record.id}")
        print(f"Sequence: {record.seq}")
        print(f"Length: {len(record.seq)}")
        print(f"Annot: {record.letter_annotations}")
        break
Traceback (most recent call last):
    File "/home/ubuntu/mbs-337/working-with-bio-data/fastq_ex.py", line 4, in <module>
        for record in SeqIO.parse(f, 'fastq-illumina'):
    File "/home/ubuntu/mbs-337/myenv/lib/python3.12/site-packages/Bio/SeqIO/QualityIO.py", line 1113, in __next__
        raise InvalidCharError(quality_string, invalid_index, details)
Bio.SeqIO.QualityIO.InvalidCharError: Invalid character (#) or (0x23) in quality string not in correct range (are you sure you're using the right QualityIO parser?) with context: [F#FFF...]

The output gives us an error because SeqIO.parse was expecting Phred+64 ASCII codes, and the # character is not included in Phred+64.

EXERCISE

Let’s say this FASTQ file contains some RNA-Seq data we just got back from the sequencer. The first thing we need to do is check the quality of our data, and record it in a human-readable format like JSON. We’ll read in the FASTQ, define two Pydantic models (see Unit 2): one for each read summary (sequence identifier, sequence, total bases, average Phred quality), and one top-level model whose reads field holds a list of those summaries.

The structure we want looks like this:

{
    "reads": [
        {
            "id": "SRR20340966.1 1 length=150",
            "sequence": "CNGCTGAGTTGCTCCTCCAAGGCCTCCAAAC...",
            "total_bases": 150,
            "average_phred": 35.25
        },
        ...
    ]
}

Let’s start by defining two Pydantic models: one for each read summary, and one (the top-level model) whose reads field will hold the list of read summaries:

import json
from Bio import SeqIO
from pydantic import BaseModel

# Define ReadSummary model
class ReadSummary(BaseModel):
    id: str
    sequence: str
    total_bases: int
    average_phred: float

# Define FastqSummary model (we'll create an instance of this with our list;
# .model_dump() will then convert that instance to a dict for JSON)
class FastqSummary(BaseModel):
    reads: list[ReadSummary]

Now, let’s create an empty list to store our ReadSummary instances for each read. We’ll use SeqIO.parse to extract the following from our FASTQ file:

  • The sequence identifier

  • The sequence itself

  • The length of the sequence

  • The average phred quality of the sequence

See if you can figure out this part before checking your work below:

Hint: Recall from the earlier example the format of record.letter_annotations:

{'phred_quality': [37, 2, 37, 37, 37, 37, 37, 37, 37, 37, 37, 37, 37, 37, 37, 37, 37, 37, 37, 37, 37, 37, 37, 37, 37, 37, 37, 37, 37, 37, 37, 37, 37, 37, 37, 37, 37, 37, 37, 37, 37, 37, 37, 37, 37, 37, 37, 37, 37, 37, 37, 37, 37, 37, 37, 37, 37, 37, 37, 37, 37, 37, 37, 37, 37, 37, 37, 37, 37, 37, 37, 37, 37, 37, 37, 37, 37, 37, 37, 37, 37, 37, 37, 37, 37, 37, 37, 37, 37, 37, 37, 37, 37, 37, 37, 37, 11, 37, 37, 37, 25, 11, 25, 37, 37, 37, 37, 37, 25, 37, 37, 37, 37, 37, 37, 37, 37, 25, 25, 37, 37, 37, 37, 37, 37, 25, 37, 37, 37, 37, 37, 11, 37, 37, 37, 11, 25, 37, 37, 37, 37, 37, 37, 37, 37, 37, 37, 37, 37, 37]}
# Create list of ReadSummary instances
reads_list = []
with open('raw_reads.fastq', 'r') as f:
    for record in SeqIO.parse(f, 'fastq-sanger'):

        # Calculate average phred quality
        phred_scores = record.letter_annotations['phred_quality']
        average_phred = sum(phred_scores) / len(phred_scores)

        # Append information to reads_list
        reads_list.append(ReadSummary(
            id=record.id,
            sequence=str(record.seq),
            total_bases=len(record.seq),
            average_phred=round(average_phred, 2)
        ))

We should end up with a list of ReadSummary instances that contain the information we need regarding each read:

print(reads_list[0])
id='SRR20340966.1' sequence='CNGCTGAGTTGCTCCTCCAAGGCCTCCAAACGAAGCTTCAAAGCGCTTCCATTCACCTCCTGCTCCTCCATGCGTCTGCCCAGCTCC
ACCGTCTGCGCACAGGCAGCTTCCACAGCAGGTGGAATTGCAGGTTCGACCACAGGAGGCGCT' total_bases=150 average_phred=35.51

Now we create a FastqSummary instance from our list (so we have one top-level object with a reads field), convert that instance to a dictionary with .model_dump(), and write the result to fastq_summary.json.

data = FastqSummary(reads=reads_list)

with open('fastq_summary.json', 'w') as outfile:
    json.dump(data.model_dump(), outfile, indent=2)

print(f"Wrote {len(data.reads)} reads to fastq_summary.json")

Your new JSON file should look like this:

{
    "reads": [
        {
            "id": "SRR20340966.1",
            "sequence": "CNGCTGAGTTGCTCCTCCAAGGCCTCCAAACGAAGCTTCAAAGCGCTTCCATTCACCTCCTGCTCCTCCATGCGTCTGCCCAGCTCCACCGTCTGCGCACAGGCAGCTTCCACAGCAGGTGGAATTGCAGGTTCGACCACAGGAGGCGCT",
            "total_bases": 150,
            "average_phred": 35.51
        },
        {
            "id": "SRR20340966.2",
            "sequence": "CAGGATTCCCCAGCCTAGCGCCGGCTGCGCGGTGGGCCTTCCACCAAAGCCTTCCTCCAAGCTCCATACTGAATGTCAAACTCCGATTTCGTCAACTCTGAGCGAGGGGTCGCAGGATCTTCTGGTCTCCTGCCGGTCTCACGCCAATAC",
            "total_bases": 150,
            "average_phred": 36.75
        }
    ]
}

Congratulations! You’ve just performed a real-world bioinformatics task and successfully saved your raw sequencing quality metrics to a new JSON file.

In this exercise, we:

  1. Defined two Pydantic models:

  • ReadSummary (one per read)

  • FastqSummary (the top-level model with a reads field).

  1. Built a list of ReadSummary instances (reads_list)

  2. Created a FastqSummary instance with that list (data = FastqSummary(reads=reads_list))

  3. Used data.model_dump() to convert that instance to a plain Python dictionary and json.dump() to write the JSON file.

  4. Pydantic validates the types (e.g., total_bases is an integer, average_phred is a float) automatically.

When you run the script, fastq_summary.json will contain one object per read with the identifier, full sequence, total base count, and average Phred score. You can open the JSON file to inspect the structure or use it as input for another script.

Additional Resources