Homework 07
===========

**Due Date: Tuesday, March 3 by 11:00am CST**

Unit 6 Databases and APIs
-------------------------

This homework applies to all of Unit 6 (Databases, Redis, APIs, and the NCBI API). You will first
start up the Redis database server in a container as we did in class. Then you will write a single,
well-structured Python script called ``get_ncbi_genbank_records.py`` that uses BioPython and the
NCBI API to retrieve records from GenBank and store them in the Redis database. The script should
also dump the records from the Redis database to a TEXT file. Finally, write a README that describes
how to start up the container and run the script.

Part 1: Redis Database
~~~~~~~~~~~~~~~~~~~~~~

Normally with just a single container, you would use ``docker run`` to start up the container.
In this case, we will use ``docker compose`` instead. This simplifies startup because you won't
have to keep writing out the long ``docker run`` command with all the necessary options.

* Create a ``docker-compose.yml`` file in your ``homework07`` directory that defines a single service
  called ``redis-db`` (it should look very similar to the docker compose file that we showed in
  `class <../unit05/docker_compose.html#write-a-compose-file>`_ minus the "summarize-data" service).
* Set the ``image`` to the ``redis:8.6.0`` image from Docker Hub.
* Set the ``container_name`` to ``redis``.
* Set the ``ports`` to map port 6379 on the host to port 6379 in the container.
* Create a ``redis-data`` directory in your ``homework07`` directory. Set the ``volumes`` to map that
  local directory to the ``/data`` directory in the container.
* Set the ``user`` to your UID and GID (e.g., ``1000:1000``).
* Set the ``command`` to start the Redis server with
  ``redis-server --appendonly yes --appendfsync everysec`` to enable data persistence.

* Start up the container with ``docker compose up -d`` (the ``-d`` flag runs the redis service in the
  background) and verify that it is running with ``docker ps``.
* You can stop the container with ``docker compose down`` when you are done using it.

Part 2: Script
~~~~~~~~~~~~~~

**Create a Python script called** ``get_ncbi_genbank_records.py`` **that does the following:**

1. Using ``Entrez.esearch`` from ``Bio``, search the NCBI protein database for records matching the
   search term "Arabidopsis thaliana AND AT5G10140" and retrieve the list of GI numbers for the
   matching records (make sure to set the maximum-return option, ``retmax=30``, to limit the number
   of results).
2. Using ``Entrez.efetch`` from ``Bio``, retrieve the full GenBank records for the list of GI numbers
   obtained in step 1 (NOTE: the ``id`` parameter can be a single GI number or a comma-separated
   string of GI numbers).
3. Parse the GenBank records using ``SeqIO.parse`` from ``Bio`` and store the resulting record objects
   in a list.
4. Connect to the Redis database running in the container using the ``redis`` Python package and
   store each GenBank record in the Redis database. Use ``record.id`` as the key; the value should be
   a JSON string containing the record's ID (``record.id``), name (``record.name``), description
   (``record.description``), and sequence (``str(record.seq)``).
5. After storing the records in Redis, retrieve all the records from Redis and write them to an
   output text file called ``genbank_records.txt`` that looks like the following:

   .. code-block:: text

      ID: AAV51219.1
      Name: AAV51219
      Description: flowering locus C protein [Arabidopsis thaliana]
      Sequence: MGRKKLEIKRIENKSSRQVTFSKRRNGLIEKARQLSVLCDASVALLVVSASGKLYSFSSGDNLVKILDRYGKQHADDLKALDHQSKALNYGSHYELLELVDSKLVGSNVKNVSIDALVQLEEHLETALSVTRAKKTELMLKLVENLKEKEKMLKEENQVLASQMENNHHVGAEAEMEMSPAGQISDNLPVTLPLLN

      ID: NP_001078563.1
      Name: NP_001078563
      Description: K-box region and MADS-box transcription factor family protein [Arabidopsis thaliana]
      Sequence: MGRKKLEIKRIENKSSRQVTFSKRRNGLIEKARQLSVLCDASVALLVVSASGKLYSFSSGDNLVKILDRYGKQHADDLKALDHQSKALNYGSHYELLELVDSKLVGSNVKNVSIDALVQLEEHLETALSVTRAKKTELMLKLVENLKEKEKMLKEENQVLASQIFLG

      ...

6. In addition to the ``loglevel`` command-line argument for setting the logging level, add a
   command-line argument for specifying the output file name (default should be
   ``genbank_records.txt``).

**BONUS**: add a command-line argument for specifying the search term (default should be
"Arabidopsis thaliana AND AT5G10140").

Requirements checklist
``````````````````````

* Script name: ``get_ncbi_genbank_records.py``
* Use the shebang line ``#!/usr/bin/env python3`` at the top of the script
* At least **2 functions** plus ``main()``
* Properly formatted ``if __name__ == "__main__"`` statement
* **Type hints** on all functions (parameters and return types)
* **Docstrings** with description, Args, and Returns for every function
* **Logging** at a minimum of **1 level**
* **argparse** for log level and output file name
* **socket** used in logging
* At least **one try/except** for error handling
* Output TEXT file matches the required format

.. admonition:: .gitignore

   In this assignment, you have created a directory called ``redis-data`` that persists data for
   the Redis database. We don't actually need (or want) to store that in our GitHub repository, so
   if you haven't already, create a file called ``.gitignore`` in the root of your GitHub
   repository. This file tells Git which files and directories to ignore when you commit and push
   your work. Add ``redis-data`` to your ``.gitignore`` file to ignore that directory. For Python
   projects, you might want it to look like:

   .. code-block:: text

      env
      venv
      .venv
      redis-data

What to Turn In
---------------

1. Create a ``homework07`` directory in your Git repository (on your VM).
2. Add ``docker-compose.yml`` and ``get_ncbi_genbank_records.py`` to this directory.
3. Add your output file (e.g., ``genbank_records.txt``) in an ``output_files`` directory.
4. Add a ``README.md`` in ``homework07`` that:

   * Describes how to start up the Redis container with ``docker compose``
   * Describes what the script does and how to run it (including example commands)
   * Includes a section on AI usage (if applicable; see note below)

5. Commit and push your work to GitHub.

**Expected directory layout:**

.. code-block:: text

   my-mbs337-repo/
   ├── homework07
       ├── docker-compose.yml
       ├── get_ncbi_genbank_records.py
       ├── output_files
       │   └── genbank_records.txt

Note on Using AI
----------------

The use of AI to complete this assignment is not recommended, but it is permitted with the
following restrictions:

The use of LLMs (like ChatGPT, Copilot, etc.) or any other AI must be rigorously cited. Any code
blocks or text generated by an AI model should be clearly marked as such with in-code comments
describing what was generated, how it was generated, and why you chose to use AI in that instance.
The homework README must also contain a section that summarizes where AI was used in the assignment.


Additional Resources
--------------------

* `Unit 6: Introduction to Databases and Persistence <../unit06/intro_to_redis.html>`_
* `Unit 6: Introduction to APIs <../unit06/intro_to_apis.html>`_
* `Unit 6: iNaturalist, RCSB PDB, and NCBI APIs <../unit06/bio_apis.html>`_
* Docker Compose Docs
* Docker Hub
* Redis Docs
* Redis Python Library
* NCBI APIs documentation
* BioPython documentation
* BioPython Tutorial and Cookbook
* Please find us in the class Slack channel if you have any questions!
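
As one more reference for Part 2, the JSON value from step 4 and the text layout from step 5 can be
sketched with the standard library alone. This is only a minimal sketch under stated assumptions:
the function names ``record_to_json`` and ``format_record`` and the JSON key names are hypothetical
choices, not requirements of the assignment, and the sequence in the demo is truncated for brevity.
Searching NCBI, parsing records, and talking to Redis are deliberately left out.

.. code-block:: python

   import json


   def record_to_json(rec_id: str, name: str, description: str, sequence: str) -> str:
       """Serialize the four record fields into a JSON string.

       The key names used here are an assumption made for illustration.
       """
       return json.dumps(
           {
               "id": rec_id,
               "name": name,
               "description": description,
               "sequence": sequence,
           }
       )


   def format_record(json_value: str) -> str:
       """Turn a stored JSON string back into the four-line text block."""
       rec = json.loads(json_value)
       return (
           f"ID: {rec['id']}\n"
           f"Name: {rec['name']}\n"
           f"Description: {rec['description']}\n"
           f"Sequence: {rec['sequence']}\n"
       )


   if __name__ == "__main__":
       # Demo values taken from the example output; sequence truncated for brevity.
       value = record_to_json(
           "AAV51219.1",
           "AAV51219",
           "flowering locus C protein [Arabidopsis thaliana]",
           "MGRKKLEIKRIENKSS",
       )
       print(format_record(value))

In a full script, the string returned by ``record_to_json`` would be what gets stored under each
``record.id`` key, and ``format_record`` would be applied to each value read back from the database.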