Homework 04

Due Date: Tuesday, February 10 by 11:00am CST

From Sequence to Structure

This homework builds on Unit 3 (FASTA, FASTQ, and mmCIF). You will use the same tools (Biopython's SimpleFastaParser, SeqIO, and MMCIFParser) to solve new problems: FASTA statistics and length-based filtering, FASTQ quality filtering and output, and mmCIF multi-chain analysis and coordinate extraction.

Input files:

FASTA: A multi-sequence FASTA file named immune_proteins.fasta. Download with:

wget https://github.com/TACC/mbs-337-sp26/raw/refs/heads/main/docs/unit03/sample-data/immune_proteins.fasta.gz
gunzip immune_proteins.fasta.gz

FASTQ: A FASTQ file named sample1_rawReads.fastq. Download with:

wget https://github.com/TACC/mbs-337-sp26/raw/refs/heads/main/docs/unit03/sample-data/sample1_rawReads.fastq.gz
gunzip sample1_rawReads.fastq.gz

mmCIF: The hemoglobin structure 4HHB. Download with:

wget https://files.rcsb.org/download/4HHB.cif.gz
gunzip 4HHB.cif.gz

Exercise 1: Count residues in FASTA file

Create a Python script called exercise1.py that reads immune_proteins.fasta and prints:

The total number of sequences in the file

The total number of residues in the file

The accession ID and length of the longest sequence in the file

The accession ID and length of the shortest sequence in the file

Example output:

Num Sequences: 321
Total Residues: 196434
Longest Accession: P78527 (4128 residues)
Shortest Accession: Q9HCY8 (104 residues)

Requirements:

Use SimpleFastaParser from Bio.SeqIO.FastaIO.
Correctly extract the accession from each header.
Match the example output format exactly.
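
As a rough illustration of the accession and length bookkeeping, the sketch below works on hand-written (title, sequence) pairs shaped like the tuples SimpleFastaParser yields (the title is the header line without the leading ">"). The UniProt-style `db|ACCESSION|ENTRY_NAME` header layout, the helper name `accession_from_title`, and the demo records are all assumptions made for this example; check the real headers in immune_proteins.fasta before relying on it.

```python
def accession_from_title(title: str) -> str:
    """Return the accession from a FASTA title line (no leading '>')."""
    fields = title.split("|")
    if len(fields) >= 3:
        # Assumed UniProt-style layout: db|ACCESSION|ENTRY_NAME description
        return fields[1]
    # Fallback: treat the first whitespace-delimited token as the ID
    return title.split()[0]


# Hypothetical (title, sequence) pairs, shaped like SimpleFastaParser output
records = [
    ("sp|P00001|DEMO1_HUMAN Demo protein one", "MKTAYIAKQR"),
    ("sp|P00002|DEMO2_HUMAN Demo protein two", "MKT"),
    ("sp|P00003|DEMO3_HUMAN Demo protein three", "MKTAYI"),
]

total_residues = sum(len(seq) for _, seq in records)
longest = max(records, key=lambda rec: len(rec[1]))
shortest = min(records, key=lambda rec: len(rec[1]))

print(f"Num Sequences: {len(records)}")
print(f"Total Residues: {total_residues}")
print(f"Longest Accession: {accession_from_title(longest[0])} ({len(longest[1])} residues)")
print(f"Shortest Accession: {accession_from_title(shortest[0])} ({len(shortest[1])} residues)")
```

In your own script the records would come from the parser rather than a hard-coded list, and for a large file you may prefer to track the running totals inside the loop instead of storing every sequence.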

Exercise 2: Write New FASTA File

Create a Python script called exercise2.py that reads immune_proteins.fasta using SimpleFastaParser again. This time, your script should write a new FASTA file called long_only.fasta containing only the sequences that are at least 1000 residues long. Each output record must be valid FASTA, with the original header format preserved.

Requirements:

Use SimpleFastaParser from Bio.SeqIO.FastaIO.
Output long_only.fasta must be a valid FASTA and should contain 33 sequences.
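
One possible way to keep the headers intact is to write the title back out unchanged after a ">" and then emit the sequence. The sketch below does this into an in-memory buffer so it runs without any input file. The helper name `write_fasta_record`, the 60-column line wrap, and the demo records are assumptions for illustration only; wrapping is a common convention, not something this assignment specifies.

```python
import io

MIN_LENGTH = 1000  # threshold taken from the exercise text


def write_fasta_record(handle, title: str, sequence: str, width: int = 60) -> None:
    """Write '>' plus the unmodified title, then the sequence wrapped to `width` columns."""
    handle.write(f">{title}\n")
    for start in range(0, len(sequence), width):
        handle.write(sequence[start:start + width] + "\n")


# Hypothetical records standing in for SimpleFastaParser output
records = [
    ("sp|P00001|DEMO1_HUMAN Long demo protein", "M" * 1000),
    ("sp|P00002|DEMO2_HUMAN Short demo protein", "M" * 999),
]

out = io.StringIO()  # stand-in for open("long_only.fasta", "w")
kept = 0
for title, sequence in records:
    # "at least 1000" means the boundary value itself is kept
    if len(sequence) >= MIN_LENGTH:
        write_fasta_record(out, title, sequence)
        kept += 1

print(f"Wrote {kept} record(s)")
```

Note the boundary case: a sequence of exactly 1000 residues passes the filter, while one of 999 does not.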

Exercise 3: FASTQ quality filter and write

Create a Python script called exercise3.py that:

Reads sample1_rawReads.fastq using SeqIO.parse with the correct format.
Keeps only reads whose average Phred quality is at least 30.
Writes those filtered reads out to a new FASTQ file named sample1_cleanReads.fastq.
Prints to the terminal the total number of reads in the original file and the number of reads that passed quality control.

Example output:

Total reads in original file: 500
Reads passing filter: 483

Requirements:

Use Bio.SeqIO for reading and writing (see Bio.SeqIO.write() documentation).
Specify the correct quality encoding when reading.
Match the example output format exactly.
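
To make the "average Phred quality" criterion concrete, here is a minimal sketch that decodes a raw quality string by hand. It assumes an ASCII offset of 33 (Sanger-style encoding); that offset, the helper name `mean_phred`, and the demo reads are assumptions, so confirm the encoding of your own file. When you parse with Bio.SeqIO, the decoded per-base scores are already available as a list of integers in `record.letter_annotations["phred_quality"]`, so no manual decoding is needed there.

```python
MIN_MEAN_QUALITY = 30  # threshold taken from the exercise text


def mean_phred(quality_string: str, offset: int = 33) -> float:
    """Average Phred score of a quality string, assuming the given ASCII offset."""
    scores = [ord(ch) - offset for ch in quality_string]
    if not scores:
        return 0.0  # treat an empty read as failing the filter
    return sum(scores) / len(scores)


# Hypothetical (read ID, quality string) pairs
reads = [
    ("read1", "IIIIIIII"),  # every base scores 40
    ("read2", "????5555"),  # half 30, half 20
    ("read3", "IIII????"),  # half 40, half 30
]

passing = [name for name, qual in reads if mean_phred(qual) >= MIN_MEAN_QUALITY]

print(f"Total reads in original file: {len(reads)}")
print(f"Reads passing filter: {len(passing)}")
```

The filter is applied to the mean over the whole read, so a read can pass even if a few individual bases fall below 30.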

Exercise 4: mmCIF multi-chain summary (4HHB)

Create a Python script called exercise4.py that:

Parses 4HHB.cif with MMCIFParser from Bio.PDB.
Iterates over the full structure hierarchy (all models, all chains).
For each chain, prints:
- Chain ID
- Number of non-hetero residues in that chain
- Number of atoms in the non-hetero residues of that chain

Example Output:

Chain A: 141 residues, 1069 atoms
Chain B: 146 residues, 1123 atoms
Chain C: 141 residues, 1069 atoms
Chain D: 146 residues, 1123 atoms

Note

You may get multiple lines of the following warning printed to the terminal:

PDBConstructionWarning: WARNING: Chain D is discontinuous at line xxx

This is normal behavior when working with multiple chains and is safe to ignore.

Requirements:

Use 4HHB.cif.
Use MMCIFParser from Bio.PDB.MMCIFParser to read the mmCIF file.
Remember the object hierarchy in Biopython PDB: structure → models → chains → residues → atoms. Use nested for loops to walk this hierarchy.
Increment your residue counter, and loop over a residue's atoms to count them, only when the residue is non-hetero.
Match the example output format exactly.
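
The non-hetero test can be sketched without loading a structure. In Bio.PDB, a residue's ID is a tuple of (hetero flag, sequence number, insertion code), and the hetero flag is a single space for standard residues, "W" for water, and "H_" followed by the residue name for other hetero groups. The example below applies that rule to hand-written ID tuples; the helper name `is_standard` and the sample IDs are made up for illustration and are not taken from 4HHB.

```python
def is_standard(residue_id: tuple) -> bool:
    """True when the hetero flag (first field of the ID) is a single space."""
    hetero_flag = residue_id[0]
    return hetero_flag == " "


# Hypothetical residue IDs, shaped like the tuples Bio.PDB residues carry
residue_ids = [
    (" ", 1, " "),        # standard residue
    (" ", 2, " "),        # standard residue
    ("H_HEM", 142, " "),  # hetero group (a ligand)
    ("W", 201, " "),      # water
]

standard_count = sum(1 for rid in residue_ids if is_standard(rid))
print(f"{standard_count} of {len(residue_ids)} residues are non-hetero")
```

In the real script the same check would sit inside the innermost chain loop, guarding both the residue counter and the atom loop.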

What to Turn In

Create a homework04 directory in your Git repository (on your VM).
Add all four Python scripts (exercise1.py through exercise4.py) to this directory.
Place the generated output files in homework04/output_files.
Add a README.md in homework04 that:
- Describes what each script does
- Explains where to get the input files
- Includes a section on AI usage (if applicable — see note below)
Commit and push your work to GitHub.

Generated files: Running your scripts should produce long_only.fasta (Exercise 2) and sample1_cleanReads.fastq (Exercise 3). Include these in a subdirectory within homework04 called output_files.

Expected directory layout:

my-mbs337-repo/
├── homework04/
│   ├── exercise1.py
│   ├── exercise2.py
│   ├── exercise3.py
│   ├── exercise4.py
│   ├── output_files/
│   │   ├── long_only.fasta
│   │   └── sample1_cleanReads.fastq
│   └── README.md

Note on Using AI

The use of AI to complete this assignment is not recommended, but it is permitted with the following restrictions:

The use of LLMs (such as ChatGPT or Copilot) or any other AI must be rigorously cited. Any code blocks or text generated by an AI model should be clearly marked as such, with in-code comments describing what was generated, how it was generated, and why you chose to use AI in that instance. The homework README must also contain a section that summarizes where AI was used in the assignment.

Additional Resources

Unit 3: FASTA
Unit 3: FASTQ
Unit 3: mmCIF
Biopython SeqIO
Biopython PDB
RCSB PDB — download mmCIF files (e.g. 4HHB)
Please find us in the class Slack channel if you have any questions!