Homework 05

Due Date: Tuesday, February 17 by 11:00am CST

Unit 4 Best Practices: mmCIF Summary Script

This homework applies everything from Unit 4 (Code Organization, Documentation, Logging, and Error Handling). You will build a single, well-structured Python script called mmcif_summary.py that reads an mmCIF structure file, computes per-chain residue statistics, and writes the result to a JSON file in a specified format.

Input file: Use the hemoglobin structure 4HHB (same as in Homework 04). Download with:

wget https://files.rcsb.org/download/4HHB.cif.gz
gunzip 4HHB.cif.gz

Create a Python script named mmcif_summary.py that:

  1. Parses an mmCIF file (e.g., 4HHB.cif) using MMCIFParser from Bio.PDB.

  2. For each chain in the structure, computes:

    • total_residues — total number of residues in the chain

    • hetero_residue_count — number of hetero residues (waters, ligands, ions, etc.)

    • standard_residues — number of standard (non-hetero) residues

  3. Writes the summary to a JSON file in the exact format shown below:

{
  "chains": [
    {
      "chain_id": "A",
      "total_residues": 198,
      "standard_residues": 141,
      "hetero_residue_count": 57
    },
    {
      "chain_id": "B",
      "total_residues": 205,
      "standard_residues": 146,
      "hetero_residue_count": 59
    },
    {
      "chain_id": "C",
      "total_residues": 201,
      "standard_residues": 141,
      "hetero_residue_count": 60
    },
    {
      "chain_id": "D",
      "total_residues": 197,
      "standard_residues": 146,
      "hetero_residue_count": 51
    }
  ]
}
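
The counting and output steps above can be sketched as follows. This is a minimal sketch under stated assumptions, not a reference solution. It assumes Biopython's convention that each residue id is a tuple (hetfield, resseq, icode) whose first element is blank for standard residues and non-blank for hetero residues (for example "W" for water or "H_HEM" for heme). The helper names summarize_residue_ids and write_summary are hypothetical. The example works on plain id tuples so it runs without a structure file; with a parsed chain you would pass [residue.id for residue in chain].

```python
import json
from typing import Iterable


def summarize_residue_ids(chain_id: str, residue_ids: Iterable[tuple]) -> dict:
    """Count total, standard, and hetero residues for one chain.

    Args:
        chain_id: Chain identifier, e.g. "A".
        residue_ids: Biopython-style (hetfield, resseq, icode) tuples.

    Returns:
        A dict with the four keys used in the required JSON format.
    """
    total = 0
    hetero = 0
    for hetfield, _resseq, _icode in residue_ids:
        total += 1
        # Assumption: a blank hetfield marks a standard residue.
        if hetfield.strip():
            hetero += 1
    return {
        "chain_id": chain_id,
        "total_residues": total,
        "standard_residues": total - hetero,
        "hetero_residue_count": hetero,
    }


def write_summary(chains: list, path: str) -> None:
    """Write the per-chain summaries to a JSON file.

    Args:
        chains: List of per-chain summary dicts.
        path: Output file path.

    Returns:
        None.
    """
    with open(path, "w") as handle:
        json.dump({"chains": chains}, handle, indent=2)


# Toy ids standing in for a parsed chain: two standard residues,
# one heme group, and one water molecule.
toy_ids = [(" ", 1, " "), (" ", 2, " "), ("H_HEM", 142, " "), ("W", 143, " ")]
summary = summarize_residue_ids("A", toy_ids)
```

Because json.dump preserves dictionary insertion order, building each chain dict in the order shown keeps the keys in the same order as the required format.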

Requirements checklist

  • Script name: mmcif_summary.py

  • At least 3 functions plus main()

  • Properly formatted if __name__ == "__main__" statement

  • Type hints on all functions (parameters and return types)

  • Docstrings with description, Args, and Returns for every function

  • Logging at a minimum of 3 levels

  • argparse used to set the log level from the command line

  • socket used in logging (e.g., to include the hostname in log messages)

  • At least one try/except for error handling

  • Output JSON matches the required format

  • Use MMCIFParser from Bio.PDB; iterate over the first model and all chains
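
The logging, argparse, socket, and try/except items in the checklist could fit together roughly as sketched below. This is one possible arrangement under stated assumptions, not the required design: the function names parse_args, configure_logging, and check_input_file, the --loglevel flag, and the log format are all hypothetical choices. It assumes socket is used to put the hostname in each log line.

```python
import argparse
import logging
import os
import socket
from typing import Optional


def parse_args(argv: Optional[list] = None) -> argparse.Namespace:
    """Parse command-line arguments (argv=None reads sys.argv)."""
    parser = argparse.ArgumentParser(
        description="Summarize residues per chain in an mmCIF file."
    )
    parser.add_argument(
        "-l", "--loglevel",
        default="WARNING",
        choices=["DEBUG", "INFO", "WARNING", "ERROR", "CRITICAL"],
        help="logging verbosity (default: WARNING)",
    )
    return parser.parse_args(argv)


def configure_logging(level_name: str) -> None:
    """Configure the root logger with the hostname in every message."""
    hostname = socket.gethostname()
    logging.basicConfig(
        level=getattr(logging, level_name),
        format=f"[%(asctime)s {hostname}] %(funcName)s - %(levelname)s: %(message)s",
        force=True,  # replace any handlers configured earlier
    )


def check_input_file(path: str) -> int:
    """Return the size of the input file in bytes, logging along the way."""
    logging.debug("Checking input file %s", path)
    try:
        size = os.path.getsize(path)
    except OSError as err:
        # Log the problem, then re-raise so the caller can decide what to do.
        logging.error("Cannot read %s: %s", path, err)
        raise
    if size == 0:
        logging.warning("%s is empty", path)
    logging.info("%s is %d bytes", path, size)
    return size


# An explicit argument list is used here for demonstration only;
# a real script would call parse_args() with no arguments.
args = parse_args(["--loglevel", "DEBUG"])
configure_logging(args.loglevel)
```

Restricting --loglevel with choices means a mistyped level is rejected by argparse before any logging is configured.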

Type Hints

Your arguments and return values may involve Biopython objects (e.g., Structure, Chain, Residue) from Bio.PDB. You can use object as the type hint to indicate that the parameter is some object provided by the Biopython library.

For built-in types (str, list, dict, int, etc.) and your own data structures, use full type hints as usual.
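
As an illustration of the two cases above, the sketch below hints a Biopython parameter as object and a built-in return value in full, with a docstring that has a description, Args, and Returns. The function name get_chain_ids is hypothetical, and the stand-in objects exist only so the example runs without Biopython or an input file; it assumes that iterating over a Bio.PDB model yields chains that each have an id attribute.

```python
from types import SimpleNamespace


def get_chain_ids(model: object) -> list[str]:
    """Collect the chain identifiers found in one model.

    Args:
        model: A Bio.PDB Model object (hinted as object), assumed to be
            iterable over chains that each expose an ``id`` attribute.

    Returns:
        The chain identifiers in iteration order.
    """
    return [chain.id for chain in model]


# Stand-in for a parsed model: any iterable of objects with an ``id``.
fake_model = [SimpleNamespace(id="A"), SimpleNamespace(id="B")]
chain_ids = get_chain_ids(fake_model)
```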

Tip

You may see PDBConstructionWarning messages when parsing some mmCIF files (e.g., “Chain D is discontinuous”). These are safe to ignore for this assignment.

What to Turn In

  1. Create a homework05 directory in your Git repository (on your VM).

  2. Add mmcif_summary.py to this directory.

  3. Add your JSON summary (e.g., 4HHB_summary.json) to an output_files directory inside homework05.

  4. Add a README.md in homework05 that:

    • Describes what the script does and how to run it (including example commands)

    • Explains where to get the input file (4HHB.cif)

    • Includes a section on AI usage (if applicable — see note below)

  5. Commit and push your work to GitHub.

Expected directory layout:

my-mbs337-repo/
├── homework05/
│   ├── mmcif_summary.py
│   ├── output_files/
│   │   └── 4HHB_summary.json
│   └── README.md

Note on Using AI

The use of AI to complete this assignment is not recommended, but it is permitted with the following restrictions:

The use of LLMs (like ChatGPT, Copilot, etc.) or any other AI must be rigorously cited. Any code blocks or text generated by an AI model should be clearly marked as such with in-code comments describing what was generated, how it was generated, and why you chose to use AI in that instance. The homework README must also contain a section that summarizes where AI was used in the assignment.

Additional Resources