Code Organization
=================

Following standard Python3 code organization practices will make our code easier to read by
other developers, and by our future selves who are looking back to see what we did. After going
through this module, students should be able to:

* Organize code into ``main()`` functions
* Import functions into other scripts without executing the ``main()`` block
* Write functions in a generalizable way so they are reusable
* Use a shebang in their Python3 scripts to make them executable

Main Function
--------------

In many Python programs, you will find the developer has organized their code into a ``main()``
function. Then, you will see a conditional statement that looks like the example below:

.. code-block:: python3

   def main():
       # application code goes here
       print("Hello World!")

   if __name__ == '__main__':
       main()

In this code, there is a function called ``main()`` that prints the phrase ``Hello World!``.
There is also a conditional ``if`` statement that checks whether the value of ``__name__``
matches the string ``'__main__'``. When the ``if`` statement evaluates to ``True``, Python will
execute ``main()``.

This code pattern is quite common in Python files that you want to be **executed as a script**
and **imported in another module**. To understand how this works, let's explore how the Python
interpreter sets ``__name__`` depending on how the code is being executed.

Execution Modes in Python
~~~~~~~~~~~~~~~~~~~~~~~~~

There are two primary ways that you can instruct the Python interpreter to execute or use code:

1. You can execute the Python file **directly as a script** using the command line (what we have
   been doing thus far):

   .. code-block:: console

      [mbs-337]$ python3 my_script.py

2. You can **import** the code from one Python file into another file or into the interactive
   interpreter:

   .. code-block:: python3

      import my_script

Python needs a way to understand which of these is happening.
To do this, it will automatically create a special variable called ``__name__``.

**What's in a** ``__name__`` **?**

``__name__`` is a special string variable that Python sets for you, and its value answers the
following question: *"How is this file being used right now?"*

Python then follows a simple rule:

* If the file is run **directly**, then:

  .. code-block:: text

     __name__ == "__main__"

* If the file is **imported**, then:

  .. code-block:: text

     __name__ == "<module name>"

To demonstrate this, copy/paste the contents below into a file called ``test.py``:

.. code-block:: python3

   print("This is my file to test Python's execution methods.")
   print("The variable __name__ tells me which context this file is running in.")
   print("The value of __name__ is:", __name__)

When we run this directly, we get the following output:

.. code-block:: text
   :emphasize-lines: 3

   This is my file to test Python's execution methods.
   The variable __name__ tells me which context this file is running in.
   The value of __name__ is: __main__

Now, let's create a second file called ``use_test.py``:

.. code-block:: python3

   import test

When we run this, we get this output:

.. code-block:: text
   :emphasize-lines: 3

   This is my file to test Python's execution methods.
   The variable __name__ tells me which context this file is running in.
   The value of __name__ is: test

When a Python file is imported, ``__name__`` becomes the name of the **module**!

.. admonition:: Important!

   Did you notice that Python executed all of the code within ``test.py`` when we imported the
   ``test`` module? This is important behavior that we need to be aware of. More on this later...

.. tip::

   You'll see the words **file**, **module**, and **script** used throughout this doc. Sometimes
   they are used interchangeably, but today we should highlight the differences between them:

   * **File**: Any Python file that contains code; ends in the ``.py`` extension.
   * **Script**: A Python file that you intend to *execute from the command line*.
   * **Module**: A Python file that you intend to *import into a script*.

Best Practices for Python Main Functions
~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~

These four best practices will help ensure that your code can be run as a script *and* imported
as a module:

1. Put most of your code into a function or class
2. Use ``__name__`` to control execution of your code
3. Create a function called ``main()`` to contain the code you want to run
4. Call other functions from ``main()``

Put Most Code Into a Function or Class
++++++++++++++++++++++++++++++++++++++

Let's consider the code we wrote in the last unit to print the number of residues per chain in
an mmCIF file:

.. code-block:: python3

   from Bio.PDB.MMCIFParser import MMCIFParser

   parser = MMCIFParser()

   with open('1MBN.cif', 'r') as f:
       structure = parser.get_structure('myoglobin', f)

   for model in structure:
       for chain in model:
           chain_id = chain.get_id()
           num_residues = 0
           for residue in chain:
               num_residues += 1
           print(f"Chain {chain_id}: {num_residues} residues")

Our goal is to rewrite this so that it can be run as a script *and* imported into another script
as a module. If we were to import this code as-is into a new script, Python would execute all of
it at import time. Instead, we want the user to be able to control the execution of this code.

The best way to do this is to put as much of the code as possible into a **function** or
**class**. When Python encounters a ``def`` or ``class`` keyword, it stores those definitions for
later use and only executes them when you tell it to.

Let's start by putting the "summarize chains" logic in a function:

.. code-block:: python3

   from Bio.PDB.MMCIFParser import MMCIFParser

   def summarize_chains(structure):
       # Print residue count for each chain in each model.
       for model in structure:
           for chain in model:
               chain_id = chain.get_id()
               num_residues = 0
               for residue in chain:
                   num_residues += 1
               print(f"Chain {chain_id}: {num_residues} residues")

Now we have reusable logic in ``summarize_chains()``, but we still need a way to control when the
file actually runs the full workflow (parsing the file and calling ``summarize_chains()``).

Create a ``main()`` Function
+++++++++++++++++++++++++++++

Now we need to put the step-by-step workflow (create parser, open file, get structure) in a
single function called ``main()``. This will give us one clear **entry point** to run our code.

Define **constants** at the top of the file (with other global configuration) so that they are
easy to find and change, and so that ``main()`` only contains the workflow steps:

.. code-block:: python3

   from Bio.PDB.MMCIFParser import MMCIFParser

   CIF_FILE = "1MBN.cif"
   STRUCTURE_ID = "myoglobin"

   def summarize_chains(structure):
       # Print residue count for each chain in each model.
       for model in structure:
           for chain in model:
               chain_id = chain.get_id()
               num_residues = 0
               for residue in chain:
                   num_residues += 1
               print(f"Chain {chain_id}: {num_residues} residues")

   def main():
       # Create parser, open file, create structure object, call summarize_chains()
       parser = MMCIFParser()
       with open(CIF_FILE, 'r') as f:
           structure = parser.get_structure(STRUCTURE_ID, f)
       summarize_chains(structure)

Use ``__name__`` to Control Execution
++++++++++++++++++++++++++++++++++++++

Then, at the bottom of the file, we'll write a conditional statement that will only call
``main()`` when ``__name__ == "__main__"`` so that running the file as a script runs the
workflow, but importing the file does not:

.. code-block:: python3

   from Bio.PDB.MMCIFParser import MMCIFParser

   CIF_FILE = "1MBN.cif"
   STRUCTURE_ID = "myoglobin"

   def summarize_chains(structure):
       # Print residue count for each chain in each model.
       for model in structure:
           for chain in model:
               chain_id = chain.get_id()
               num_residues = 0
               for residue in chain:
                   num_residues += 1
               print(f"Chain {chain_id}: {num_residues} residues")

   def main():
       # Create parser, open file, create structure object, call summarize_chains()
       parser = MMCIFParser()
       with open(CIF_FILE, 'r') as f:
           structure = parser.get_structure(STRUCTURE_ID, f)
       summarize_chains(structure)

   if __name__ == "__main__":
       main()

Call Other Functions From ``main()``
++++++++++++++++++++++++++++++++++++++

Your ``main()`` function should orchestrate the workflow by calling your other functions.

* Keep the actual steps inside helper functions (``summarize_chains()``)
* Use ``main()`` to run them in order

Doing this makes the file both a standalone script and a module that can be loaded into other
scripts.

Let's put this code in a module called ``mmcif_analysis.py``. If this code is imported into
another Python3 script, that other script will have access to the ``summarize_chains()`` function
and the ``main()`` function, but it will not automatically execute either.

Let's create another file called ``use_mmcif_analysis.py`` containing the following code:

.. code-block:: python3

   import mmcif_analysis

When we run this script, no output is printed to the terminal. Why is that?

.. toggle::

   When you import the module, ``__name__`` is set to the module name (``mmcif_analysis``)
   instead of ``"__main__"``, so the condition ``if __name__ == "__main__"`` evaluates to
   ``False``. ``main()`` is not called, and the workflow in ``mmcif_analysis.py`` no longer runs
   automatically on import.

If you *do* want the same workflow to run from the importing script, call ``main()`` explicitly
after the import:

.. code-block:: python3

   # within use_mmcif_analysis.py
   import mmcif_analysis

   mmcif_analysis.main()  # run the same workflow (parse 1MBN.cif and print chain summary)

You now also have access to any other functions defined in the ``mmcif_analysis`` module.
For example, you can parse your own mmCIF file and pass the resulting structure to
``summarize_chains()``:

.. code-block:: python3

   import mmcif_analysis
   from Bio.PDB.MMCIFParser import MMCIFParser

   parser = MMCIFParser()
   with open("other.cif", "r") as f:
       structure = parser.get_structure("other", f)

   mmcif_analysis.summarize_chains(structure)

Refactoring
------------

**Refactoring** is when you reorganize your code while preserving its original behavior.
Refactoring code is analogous to factoring in mathematics. For example:

``f(x) = x^2 + x`` can be written as ``f(x) = x(x+1)``

or in the opposite direction:

``f(x) = x(x+1)`` → ``f(x) = x^2 + x``

The expression changes, but the result does not. In software engineering, we refactor our code so
that it is better organized, more readable, and easier to reason about.

EXERCISE
~~~~~~~~

Now let's consider another script that we wrote to write FASTQ quality metrics to a JSON file:

.. code-block:: python3
   :linenos:

   import json
   from Bio import SeqIO
   from pydantic import BaseModel

   # Define ReadSummary model
   class ReadSummary(BaseModel):
       id: str
       sequence: str
       total_bases: int
       average_phred: float

   # Define FastqSummary model
   class FastqSummary(BaseModel):
       reads: list[ReadSummary]

   # Create list of ReadSummary instances
   reads_list = []

   with open('raw_reads.fastq', 'r') as f:
       for record in SeqIO.parse(f, 'fastq-sanger'):
           reads_list.append(ReadSummary(
               id=record.id,
               sequence=str(record.seq),
               total_bases=len(record.seq),
               average_phred=sum(record.letter_annotations['phred_quality']) / len(record.letter_annotations['phred_quality'])
           ))

   data = FastqSummary(reads=reads_list)

   with open('fastq_summary.json', 'w') as outfile:
       json.dump(data.model_dump(), outfile, indent=2)

When looking at our code, we should always ask ourselves the following:

1. Can I succinctly describe what this code is doing?
2. Can I reorganize my code in some way that improves its readability and makes it easier for
   others to reason about what it does?

.. note::

   Let the following software development principle guide your thinking:

   **Single Responsibility Principle (SRP):** A function, class, or module should have a single,
   well-defined job or responsibility.

The first thing we can do is put our Pydantic models into a separate module called
``models.py``. Keeping models like these in their own module gives that module a single
responsibility: describe the data.

.. code-block:: python3
   :linenos:
   :caption: models.py

   from pydantic import BaseModel

   class ReadSummary(BaseModel):
       id: str
       sequence: str
       total_bases: int
       average_phred: float

   class FastqSummary(BaseModel):
       reads: list[ReadSummary]

The next thing we should do is refactor our analysis logic into small functions with single jobs:

.. code-block:: python3
   :linenos:
   :emphasize-lines: 3, 25
   :caption: fastq_summary.py

   import json
   from Bio import SeqIO
   from models import ReadSummary, FastqSummary


   def summarize_record(record) -> ReadSummary:
       # Convert one FASTQ record into a ReadSummary instance
       phred_scores = record.letter_annotations['phred_quality']
       average_phred = sum(phred_scores) / len(phred_scores)

       return ReadSummary(
           id=record.id,
           sequence=str(record.seq),
           total_bases=len(record.seq),
           average_phred=round(average_phred, 2)
       )


   def summarize_fastq_file(fastq_file: str, encoding: str) -> FastqSummary:
       # Read a FASTQ file and return a FastqSummary instance
       reads_list = []

       with open(fastq_file, 'r') as f:
           for record in SeqIO.parse(f, encoding):
               reads_list.append(summarize_record(record))

       return FastqSummary(reads=reads_list)


   def write_summary_to_json(summary: FastqSummary, output_file: str) -> None:
       # Write FastqSummary to a JSON file
       with open(output_file, 'w') as outfile:
           json.dump(summary.model_dump(), outfile, indent=2)

We have now decomposed the analysis into smaller, more focused functions. Note that
``summarize_record()`` also rounds the average Phred score to two decimal places, which is a
small behavior change on top of the reorganization. Next, we need to define a ``main()`` function
that will handle the configuration and execution of these functions:

.. code-block:: python3
   :linenos:
   :emphasize-lines: 2,3

   def main():
       summary = summarize_fastq_file(FASTQ_FILE, ENCODING)
       write_summary_to_json(summary, OUTPUT_JSON)

   if __name__ == '__main__':
       main()

Putting this all together, we end up with a file that can be run as a standalone script or
imported as a module:

.. code-block:: python3
   :linenos:
   :caption: fastq_summary.py

   import json
   from Bio import SeqIO
   from models import ReadSummary, FastqSummary

   # -------------------------
   # Constants (configuration)
   # -------------------------
   FASTQ_FILE = 'raw_reads.fastq'
   OUTPUT_JSON = 'fastq_summary.json'
   ENCODING = 'fastq-sanger'

   # -------------------------
   # Functions
   # -------------------------
   def summarize_record(record) -> ReadSummary:
       # Convert one FASTQ record into a ReadSummary instance
       phred_scores = record.letter_annotations['phred_quality']
       average_phred = sum(phred_scores) / len(phred_scores)

       return ReadSummary(
           id=record.id,
           sequence=str(record.seq),
           total_bases=len(record.seq),
           average_phred=round(average_phred, 2)
       )


   def summarize_fastq_file(fastq_file: str, encoding: str) -> FastqSummary:
       # Read a FASTQ file and return a FastqSummary instance
       reads_list = []

       with open(fastq_file, 'r') as f:
           for record in SeqIO.parse(f, encoding):
               reads_list.append(summarize_record(record))

       return FastqSummary(reads=reads_list)


   def write_summary_to_json(summary: FastqSummary, output_file: str) -> None:
       # Write FastqSummary to a JSON file
       with open(output_file, 'w') as outfile:
           json.dump(summary.model_dump(), outfile, indent=2)


   def main():
       summary = summarize_fastq_file(FASTQ_FILE, ENCODING)
       write_summary_to_json(summary, OUTPUT_JSON)


   if __name__ == '__main__':
       main()

We now have a cleaner, easier-to-read application that is simple to reason about. We could also
use it as a module if we wanted:

.. code-block:: python3
   :linenos:
   :caption: my_fastq_analysis.py

   from fastq_summary import summarize_fastq_file, write_summary_to_json

   summary = summarize_fastq_file('new_reads.fastq', 'fastq-sanger')
   write_summary_to_json(summary, 'new_output.json')

Shebang
-------

So far, we have been running Python programs like this:

.. code-block:: console

   [mbs-337]$ python3 fastq_summary.py

The above command explicitly tells the operating system to use the Python3 interpreter to run the
file.

A **shebang** is a special instruction that we put at the top of a script that tells the
operating system which interpreter should be used to run the script. You will often see these
used in Python, Perl, Bash, C shell, and a number of other scripting languages. In our case, we
want to use the following shebang, which should appear on the first line of our Python3 scripts:

.. code-block:: python3

   #!/usr/bin/env python3

* ``#!``: This tells the operating system that this file is a script and it needs some
  interpreter to run it
* ``/usr/bin/env``: This is the path to ``env``, which is a utility on Unix-like systems that
  finds interpreters on your system. So instead of hard-coding the exact location of Python3, we
  ask ``env`` to find it for us
* ``python3``: This is the interpreter that we want to use to run the script

Conceptually, this shebang is saying: "*Find Python 3 on this system and use it to run this
file.*"

We also need to make the script executable using the Linux command ``chmod``:

.. code-block:: console

   [mbs-337]$ chmod u+x fastq_summary.py

This enables you to call the Python3 code within as a standalone executable without invoking the
interpreter on the command line:

.. code-block:: console

   [mbs-337]$ ./fastq_summary.py

This is helpful for requesting a Python version (e.g. Python3) for a script that may be executed
on multiple different machines or in various environments.
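
The two requirements above (a shebang on the first line and the user-execute permission) can also be checked from within Python. The following is a minimal sketch, assuming a Unix-like system; the helper names ``has_python3_shebang()``, ``add_user_execute()``, and ``is_user_executable()`` and the file name ``demo_script.py`` are hypothetical and exist only for this illustration.

```python
#!/usr/bin/env python3
import os
import stat
import tempfile

SHEBANG = "#!/usr/bin/env python3"


def has_python3_shebang(path):
    # True if the first line of the file is exactly the env-based python3 shebang
    with open(path, "r") as f:
        return f.readline().rstrip("\n") == SHEBANG


def add_user_execute(path):
    # Same effect as `chmod u+x path`: add the owner-execute bit to the current mode
    current_mode = os.stat(path).st_mode
    os.chmod(path, current_mode | stat.S_IXUSR)


def is_user_executable(path):
    # Inspect the permission bits directly instead of trying to run the file
    return bool(os.stat(path).st_mode & stat.S_IXUSR)


def main():
    # Work in a throwaway directory so no real files are touched
    with tempfile.TemporaryDirectory() as tmpdir:
        script = os.path.join(tmpdir, "demo_script.py")  # hypothetical file name
        with open(script, "w") as f:
            f.write(SHEBANG + "\nprint('Hello World!')\n")

        print("Has shebang:", has_python3_shebang(script))
        print("User-executable before chmod:", is_user_executable(script))
        add_user_execute(script)
        print("User-executable after chmod:", is_user_executable(script))


if __name__ == "__main__":
    main()
```

In day-to-day work the ``chmod`` command is simpler. The sketch only makes visible that "executable" is a permission bit stored with the file, separate from the shebang line inside it.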
Essentially, a shebang lets you tell the computer which program should run your script, so the
script can be executed directly instead of calling Python explicitly.

Other Tips
----------

As our Python3 scripts become longer and more complex, we should put more thought into how
different contents of the script are ordered. As a rule of thumb, try to organize the different
sections of your Python3 code into this order:

.. code-block:: python3

   # Shebang

   # Imports

   # Global variables / constants

   # Class definitions

   # Function definitions

   # Main function definition

   # Call to main function

Other general tips for writing code that is easy to read can be found in the
`PEP 8 Style Guide `_, including:

* Use four spaces per indentation level (no tabs)
* Limit lines to 79 characters, wrap and indent where needed
* Avoid extraneous whitespace unless it improves readability
* Be consistent with naming variables and functions

  * Classes are usually ``CapitalWords``
  * Constants are usually ``ALL_CAPS``
  * Functions and variables are usually ``lowercase_with_underscores``
  * Consistency is key

* Use functions to improve organization and reduce redundancy
* Document and comment your code

.. note::

   Beyond individual Python3 scripts, there is a lot more to learn about organizing *projects*
   which may consist of many files. We will get into that later in the semester.

Additional Resources
--------------------

* Many of the materials in this module were adapted from
  `COE 332: Software Engineering & Design `_
* `PEP 8 Style Guide `_
* `Defining Main Functions in Python `_
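
As a closing illustration, the section ordering recommended under *Other Tips* could look like the small script below. This is only a sketch: the ``SCORES`` values, the ``ScoreSummary`` class, and the ``summarize_scores()`` function are hypothetical and chosen purely to fill each section with something runnable.

```python
#!/usr/bin/env python3
# Shebang (first line), then imports
import statistics

# Global variables / constants
SCORES = [30, 32, 28, 35]  # hypothetical quality scores, for illustration only


# Class definitions
class ScoreSummary:
    def __init__(self, count, mean):
        self.count = count
        self.mean = mean


# Function definitions
def summarize_scores(scores):
    # Build a ScoreSummary from a list of numeric scores
    return ScoreSummary(count=len(scores), mean=statistics.mean(scores))


# Main function definition
def main():
    summary = summarize_scores(SCORES)
    print(f"{summary.count} scores, mean {summary.mean}")


# Call to main function
if __name__ == "__main__":
    main()
```

Because the call to ``main()`` is guarded by the ``__name__`` check, another script can import ``summarize_scores()`` from this file without triggering the print statement.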