Homework 07
Due Date: Tuesday, March 3 by 11:00am CST
Unit 6 Databases and APIs
This homework applies to all of Unit 6 (Databases, Redis, APIs, and the NCBI API).
You will first start up the Redis database server in a container as we did in class
and then you will write a single, well-structured Python script called get_ncbi_genbank_records.py
that uses BioPython and the NCBI API to retrieve records from GenBank and stores them in the Redis database.
The script should also dump the records from the Redis database to a TEXT file. Finally, write a README that
describes how to start up the container and run the script.
Part 1: Redis Database
Normally with just a single container, you would use docker run to start up the container. In this case,
we will use docker compose to start up the container. This simplifies the process of starting it up
since you won’t have to keep writing out the long docker run command with all the necessary options.
Create a
docker-compose.ymlfile in yourhomework07directory that defines a single service calledredis-db(it should look very similar to the docker compose file that we showed in class minus the “summarize-data” service).Set the
imageto theredis:8.6.0image from Docker Hub.Set the
container_nametoredis.Set the
portsto map port 6379 on the host to port 6379 in the container.Create a
redis-datadirectory in yourhomework07directory. Set thevolumesto map that local directory to the/datadirectory in the container.Set the
userto your UID and GID (e.g.,1000:1000).Set the
commandto start the Redis server withredis-server --appendonly yes --appendfsync everysecto enable data persistence.Start up the container with
docker compose up -d(the-dflag runs the redis service in the background) and verify that it is running withdocker ps.You can stop the container with
docker compose downwhen you are done using it.
Part 2: Script
Create a Python script called get_ncbi_genbank_records.py that does the following:
Using
Entrez.esearchfromBio, search the NCBI protein database for records matching the search term “Arabidopsis thaliana AND AT5G10140” and retrieve the list of GI numbers for the matching records (make sure to set max return option,retmax=30to limit the number of results).Using
Entrez.efetchfromBio, retrieve the full GenBank records for the list of GI numbers obtained in step 1 (NOTE: theidparameter can be a single GI number or a comma-separated string of GI numbers).Parse the GenBank records using
SeqIO.parsefromBioand store the resulting record objects in a list.Connect to the Redis database running in the container using the
redisPython package and store each GenBank record in the Redis database with therecord.idas the key and the value being a JSON string containing the record’s ID (record.id), name (record.name), description (record.description), and sequence (str(record.seq)).After storing the records in Redis, retrieve all the records from Redis and write them to an output text file called
genbank_records.txtthat looks like the following:ID: AAV51219.1 Name: AAV51219 Description: flowering locus C protein [Arabidopsis thaliana] Sequence: MGRKKLEIKRIENKSSRQVTFSKRRNGLIEKARQLSVLCDASVALLVVSASGKLYSFSSGDNLVKILDRYGKQHADDLKALDHQSKALNYGSHYELLELVDSKLVGSNVKNVSIDALVQLEEHLETALSVTRAKKTELMLKLVENLKEKEKMLKEENQVLASQMENNHHVGAEAEMEMSPAGQISDNLPVTLPLLN ID: NP_001078563.1 Name: NP_001078563 Description: K-box region and MADS-box transcription factor family protein [Arabidopsis thaliana] Sequence: MGRKKLEIKRIENKSSRQVTFSKRRNGLIEKARQLSVLCDASVALLVVSASGKLYSFSSGDNLVKILDRYGKQHADDLKALDHQSKALNYGSHYELLELVDSKLVGSNVKNVSIDALVQLEEHLETALSVTRAKKTELMLKLVENLKEKEKMLKEENQVLASQIFLG ...
In addition to the
loglevelcommand-line argument for setting the logging level, add a command-line argument for specifying the output file name (default should begenbank_records.txt). BONUS: add a command-line argument for specifying the search term (default should be “Arabidopsis thaliana AND AT5G10140”).
Requirements checklist
Script name:
get_ncbi_genbank_records.pyUse the shebang line
#!/usr/bin/env python3at the top of the scriptAt least 2 functions plus
main()Properly formatted
if __name__ == "__main__"statementType hints on all functions (parameters and return types)
Docstrings with description, Args, and Returns for every function
Logging at at least 1 levels
argparse for log level
socket used in logging
At least one try/except for error handling
Output TEXT file matches the required format
.gitignore
In this assignment, you have created a directory called redis-data that is persisting
data for the Redis database. We don’t actually need (or want) to store that in our GitHub
repository, so if you haven’t already, create a file called .gitignore in your root GitHub
repository. This files tells Git which files and directories to ignore when you commit and push
your work. Add redis-data to your .gitignore file to ignore that directory. For Python projects,
you might want it to look like:
env
venv
.venv
redis-data
What to Turn In
Create a
homework07directory in your Git repository (on your VM).Add
docker-compose.ymlandget_ncbi_genbank_records.pyto this directory.Add your output file (e.g.,
genbank_records.txt) in anoutput_filesdirectory.Add a
README.mdinhomework07that:Describes how to start up the Redis container with
docker composeDescribes what the script does and how to run it (including example commands)
Includes a section on AI usage (if applicable — see note below)
Commit and push your work to GitHub.
Expected directory layout:
my-mbs337-repo/
├── homework07
├── docker-compose.yml
├── get_ncbi_genbank_records.py
├── output_files
│ └── genbank_records.txt
Note on Using AI
The use of AI to complete this assignment is not recommended, but it is permitted with the following restrictions:
The use of LLMs (like ChatGPT, Copilot, etc) or any other AI must be rigorously cited. Any code blocks or text that are generated by an AI model should be clearly marked as such with in-code comments describing what was generated, how it was generated, and why you chose to use AI in that instance. The homework README must also contain a section that summarizes where AI was used in the assignment.
Additional Resources
Please find us in the class Slack channel if you have any questions!