Homework 06
Due Date: Tuesday, February 24 by 11:00am CST
Unit 5 Containerization
This homework applies to the first two parts of Unit 5 (Introduction to Containers and Advanced Containers). You will package the scripts from the first three exercises of Homework 4 and the script from Homework 5 into a single container, run the container multiple times to generate output from each script, push the container to Docker Hub, and finally write a README that describes how to build and use the container.
Part 1A: Input files
FASTA: A multi-sequence FASTA file named immune_proteins.fasta. Download with:
wget https://github.com/TACC/mbs-337-sp26/raw/refs/heads/main/docs/unit03/sample-data/immune_proteins.fasta.gz
gunzip immune_proteins.fasta.gz
FASTQ: A FASTQ file named sample1_rawReads.fastq. Download with:
wget https://github.com/TACC/mbs-337-sp26/raw/refs/heads/main/docs/unit03/sample-data/sample1_rawReads.fastq.gz
gunzip sample1_rawReads.fastq.gz
mmCIF: The hemoglobin structure 4HHB. Download with:
wget https://files.rcsb.org/download/4HHB.cif.gz
gunzip 4HHB.cif.gz
Part 1B: Scripts to containerize
Script 1: Count residues in FASTA file (from Homework 4 - Exercise 1)
Rename the Python script exercise1.py to fasta_stats.py and have it read immune_proteins.fasta
and write to a text file called immune_proteins_stats.txt instead of printing to the console. To make it
more flexible for container use, the script should take the input FASTA file and output text file as command-line
arguments in addition to the log level (see general requirements below) using argparse. The output text file
should contain exactly the same information as described in Homework 4 - Exercise 1.
Script 2: Write a new FASTA file (from Homework 4 - Exercise 2)
Rename the Python script exercise2.py to fasta_filter.py and have it read immune_proteins.fasta
and write out a new FASTA file called long_only.fasta containing only the sequences with at least
1000 residues (or a specified minimum length). Each output record must be valid FASTA with the original header format
preserved. To make it more flexible for container use, the script should take the input FASTA file,
output FASTA file, and minimum sequence length as command-line arguments in addition to the log level
(see general requirements below) using argparse.
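As a rough illustration of the filtering step only, the sketch below parses FASTA in plain Python and keeps records that meet a minimum length. The function names (read_fasta, filter_fasta) are hypothetical, and the original exercise may have used a different parser (for example Biopython), so treat this as one possible shape rather than the expected solution. Sequences are written on a single line, which is still valid FASTA, and the header line is passed through unchanged.

```python
from typing import Iterator, TextIO, Tuple


def read_fasta(handle: TextIO) -> Iterator[Tuple[str, str]]:
    """Yield (header_line, sequence) pairs from an open FASTA handle."""
    header = None
    chunks = []
    for line in handle:
        line = line.strip()
        if not line:
            continue  # skip blank lines
        if line.startswith(">"):
            if header is not None:
                yield header, "".join(chunks)
            header, chunks = line, []
        elif header is not None:
            chunks.append(line)
    if header is not None:
        yield header, "".join(chunks)


def filter_fasta(in_handle: TextIO, out_handle: TextIO, min_length: int = 1000) -> int:
    """Write records with at least min_length residues; return how many were kept."""
    kept = 0
    for header, seq in read_fasta(in_handle):
        if len(seq) >= min_length:
            # The header is written exactly as it was read
            out_handle.write(f"{header}\n{seq}\n")
            kept += 1
    return kept
```

In a full script, the two handles would come from opening the input and output paths given on the command line, and min_length would come from the minimum-length argument.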
Script 3: FASTQ quality filter and write (from Homework 4 - Exercise 3)
Rename the Python script exercise3.py to fastq_filter.py and have it read sample1_rawReads.fastq
and write out a new FASTQ file called sample1_cleanReads.fastq containing only the reads where the average
Phred score is greater than or equal to 30 (or a specified threshold). To make it more flexible for container use,
the script should take the input FASTQ file, output FASTQ file, encoding, and Phred score threshold as command-line
arguments in addition to the log level (see general requirements below) using argparse. Instead of printing the
total number of reads and the number of reads that passed quality control to the console, write this information
to the log.
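The quality test itself comes down to decoding each quality character and averaging. The sketch below assumes the encoding argument is turned into an ASCII offset (33 for Phred+33, 64 for the older Phred+64); the function names and the way the offset is passed are illustrative choices, not part of the assignment.

```python
def mean_phred(quality: str, offset: int = 33) -> float:
    """Return the average Phred score of one FASTQ quality string."""
    if not quality:
        return 0.0  # treat an empty read as failing any positive threshold
    return sum(ord(ch) - offset for ch in quality) / len(quality)


def passes_quality(quality: str, threshold: float = 30.0, offset: int = 33) -> bool:
    """Return True when the read's average Phred score is >= threshold."""
    return mean_phred(quality, offset) >= threshold
```

For example, under Phred+33 the character "I" decodes to 40 and "?" decodes to 30, so a read whose quality string is all "?" sits exactly on the default threshold and is kept.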
Script 4: mmCIF Summary Script (from Homework 5)
Just copy the Python script you created for Homework 5 (mmcif_summary.py). This script should only need minor
modifications to take the input CIF file and output JSON file as command-line arguments to make it more flexible
for container use. Its output should still be the same JSON format as described in Homework 5.
General Requirements checklist for all scripts
Use the shebang line #!/usr/bin/env python3 at the top of each script
At least 1 function plus main()
Properly formatted if __name__ == "__main__" statement
Type hints on all functions (parameters and return types)
Docstrings with description, Args, and Returns for every function
Logging at one or more levels
argparse for log level and other parameters as described above
socket used in logging
At least one try/except for error handling
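To show how several of these checklist items can fit together, here is a minimal sketch of a script skeleton. It is an assumption-laden placeholder: the short flags (-i, -o, -l), the line-counting task, and the log format are all invented for illustration, and including the hostname from socket.gethostname() in the log format is only one reading of "socket used in logging".

```python
#!/usr/bin/env python3
import argparse
import logging
import socket
from typing import List, Optional


def parse_args(argv: Optional[List[str]] = None) -> argparse.Namespace:
    """
    Parse command-line arguments.

    Args:
        argv: Arguments to parse; None means use sys.argv[1:].

    Returns:
        Namespace with input, output, and loglevel attributes.
    """
    parser = argparse.ArgumentParser(description="Count lines in a text file.")
    parser.add_argument("-i", "--input", required=True, help="input file path")
    parser.add_argument("-o", "--output", required=True, help="output file path")
    parser.add_argument("-l", "--loglevel", default="WARNING",
                        choices=["DEBUG", "INFO", "WARNING", "ERROR", "CRITICAL"],
                        help="logging level (default: WARNING)")
    return parser.parse_args(argv)


def configure_logging(level: str) -> None:
    """
    Set up logging so every message carries the hostname.

    Args:
        level: Name of the logging level, e.g. "INFO".

    Returns:
        None
    """
    host = socket.gethostname().replace("%", "%%")  # escape for %-style format
    logging.basicConfig(level=level,
                        format=f"[%(asctime)s {host}] %(levelname)s: %(message)s")


def count_lines(path: str) -> int:
    """
    Count the lines in a text file (placeholder for the real work).

    Args:
        path: File to read.

    Returns:
        Number of lines in the file.
    """
    with open(path, "r", encoding="utf-8") as handle:
        return sum(1 for _ in handle)


def main(argv: Optional[List[str]] = None) -> int:
    """
    Run the tool.

    Args:
        argv: Arguments to parse; None means use sys.argv[1:].

    Returns:
        Exit status: 0 on success, 1 on an I/O error.
    """
    args = parse_args(argv)
    configure_logging(args.loglevel)
    try:
        n_lines = count_lines(args.input)
        with open(args.output, "w", encoding="utf-8") as out:
            out.write(f"lines: {n_lines}\n")
    except OSError as err:
        logging.error("I/O failure: %s", err)
        return 1
    logging.info("Wrote summary of %s to %s", args.input, args.output)
    return 0
```

A complete script would end with the usual if __name__ == "__main__" guard that calls main(); it is left out here so the functions can be imported and exercised on their own.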
Part 2: Build the container
Write a Dockerfile to containerize all of the scripts described above.
Requirements
Use an official Python base image (e.g., python:3.12)
Install all necessary dependencies
Copy all scripts into the container at /code
Make sure all the scripts are executable (hint: chmod ugo+x <file>)
Add /code to the PATH so the scripts can be run from anywhere in the container
Part 3: Run each script
Run each script in the container to generate the expected output files. You can use the same input files
described above (immune_proteins.fasta, sample1_rawReads.fastq, and 4HHB.cif).
Requirements
Run each script with docker run from outside the container (not interactively) with the proper command-line arguments to specify the input and output files (and other parameters as needed by each script).
Make sure to mount $PWD (directory containing the input files and where output files will be written) to a directory in the container (/data) using the -v flag of docker run.
Make sure to use the -u flag of docker run to run the container with the ubuntu user ID so that output files are owned by you and not root.
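The three requirements above combine into a single docker run invocation. The sketch below only assembles and prints such a command; it does not run Docker. The helper name, the --rm flag, and the script flags (-i, -o, -m) are assumptions made for illustration, and the image name is the example tag used later in this assignment.

```python
import os
import shlex
from typing import List, Optional


def docker_run_command(image: str, script_args: List[str],
                       host_dir: Optional[str] = None,
                       mount_point: str = "/data") -> List[str]:
    """Build a non-interactive docker run command as an argument list."""
    host_dir = host_dir or os.getcwd()        # same role as $PWD in the shell
    user = f"{os.getuid()}:{os.getgid()}"     # same as $(id -u):$(id -g)
    return ["docker", "run", "--rm",
            "-u", user,                         # outputs owned by you, not root
            "-v", f"{host_dir}:{mount_point}",  # bind-mount the data directory
            image, *script_args]


# Hypothetical flags; substitute whatever your argparse setup defines.
cmd = docker_run_command(
    "username/my_bio_tools:1.0",
    ["fasta_filter.py", "-i", "/data/immune_proteins.fasta",
     "-o", "/data/long_only.fasta", "-m", "1000"],
)
print(shlex.join(cmd))
```

Note that the input and output paths are given as the container sees them (under /data), not as they appear on the host.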
Part 4: Push to Docker Hub
When you have successfully built and run your container, push it to Docker Hub so that others can use it.
Requirements
Create a Docker Hub account if you don’t have one already.
Tag your container with your Docker Hub username and a repository name and version of your choosing (e.g., username/my_bio_tools:1.0).
Part 5: README
The README should describe how a user who has just cloned your repository can run your tools in a container from start to finish. Take special care to describe how to:
Build the image from a Dockerfile
Get the input data from the web (data should be cited)
Mount the data inside the container at run time
Run the containerized code as a specific user to avoid permission issues
Describe the available parameters for each script and how to specify them at run time
Describe the expected output files and where to find them after running the container
Include a section on AI usage (if applicable; see note below)
What to Turn In
Create a homework06 directory in your Git repository (on your VM).
Add fasta_stats.py, fasta_filter.py, fastq_filter.py, and mmcif_summary.py to this directory.
Add your 4 output files (e.g., immune_proteins_stats.txt, long_only.fasta, sample1_cleanReads.fastq, and 4HHB_summary.json) in an output_files directory.
Add a README.md in homework06.
Commit and push your work to GitHub.
Expected directory layout:
my-mbs337-repo/
└── homework06/
    ├── Dockerfile
    ├── README.md
    ├── fasta_filter.py
    ├── fasta_stats.py
    ├── fastq_filter.py
    ├── mmcif_summary.py
    └── output_files/
        ├── 4HHB_summary.json
        ├── immune_proteins_stats.txt
        ├── long_only.fasta
        └── sample1_cleanReads.fastq
Note on Using AI
The use of AI to complete this assignment is not recommended, but it is permitted with the following restrictions:
The use of LLMs (like ChatGPT, Copilot, etc.) or any other AI must be rigorously cited. Any code blocks or text generated by an AI model should be clearly marked as such with in-code comments describing what was generated, how it was generated, and why you chose to use AI in that instance. The homework README must also contain a section that summarizes where AI was used in the assignment.
Additional Resources
Please find us in the class Slack channel if you have any questions!