Development Environment ======================= The development environment, in this context, refers to the workspace and tools that you use to write, test, debug, and deploy your projects. A good development environment strives to help you be more productive and efficient. It should also provide you with the necessary tools to quickly iterate on your ideas and implement changes. The final projects for this class should be generally developed and deployed on the class virtual machines. However, the same concepts described below could be applied to development locally on your own laptop, and / or deployment to cloud providers like AWS, Azure, or Google Cloud. After going through this module, students should be able to: * Set up the files and folders necessary to work on the final project * Describe each component and why they are important to the overall system * Describe the general development cycle from a new idea all the way to production * Write targets for the major steps into a Makefile * Set up test and production deployment environments * Find and interpret logs for debugging purposes * Apply semantic versioning to code and container images File Organization ----------------- Consider the DNA Sequence Classifier previously described in this class: .. figure:: images/dashboard.png :width: 600px :align: center Sample dashboard The project contains the following functionality: 1. A dashboard for users to upload sequences and view results 2. A database for storing and recalling uploaded sequences 3. A machine learning model for classifying sequences An example file organization scheme for developing this project may look like: .. code-block:: text dna-seq-classifier/ ├── .gitignore ├── Dockerfile ├── README.md ├── app.py ├── docker-compose.yml ├── model/ │ ├── svm_linear_model.pkl │ ├── svml_predict.py │ ├── svml_train_model.py │ └── train_cols.pkl ├── redis-data/ │ └── dump.rdb └── requirements.txt In this example, the file ``app.py`` is the main entrypoint to the dashboard. It is the standard name for a plotly dash app. All of the logic for rendering the dashboard, interaction with the database, and inference using the classifier model, are contained in that file. The ``model/`` subdirectory has four files: * ``svml_train_model.py`` was used to train a model and write pickel files * ``svm_linear_model.pkl`` is the pickeled model file that is used for inference * ``train_cols.pkl`` is the pickeled file that contains the column names for the model * ``svml_predict.py`` is the script that loads the model and makes predictions - imported into ``app.py`` and used for inference The ``redis-data/`` subdirectory is where the Redis database will store backup data in a file called ``dump.rdb``. It is mounted as a volume in the Redis container at runtime. The ``requirements.txt`` file is a standard way to list all of the Python dependencies for the project. We have explored a few different ways in this class, but currently this is probably the most universal way to do it especially when dependencies are not too complicated. The ``Dockerfile`` containerize the relevant source code for the dashboard along with all of its dependencies. We don't have a separate Dockerfile for the Redis container because it is just a stock image from Docker Hub. The ``docker-compose.yml`` file is used to orchestrate the dashboard and Redis containers together. It specifies how the two containers should be built, run, and connected. The ``README.md`` file serves as the main point of documentation for the project. It should contain instructions for the developer on how to deploy the project, and it should contain instructions for the user on how to interact with the dashboard and interpret results. .. tip:: Make sure to put your folders under version control and commit regularly as you work. Don't save it all until the end. QUESTIONS ~~~~~~~~~ * Is the data set used to train the model part of this repository? Why or why not? * Should we commit ``dump.rdb`` to GitHub? Why or why not? * What other file might I want to include in ``redis-data/``? * Is using a ``requirements.txt`` file a good idea? Are there better alternatives? * What might be listed in the ``.gitignore``? * What other notable files are *not* present? EXERCISE ~~~~~~~~ Set up a new directory on your class virtual machine with an organization scheme similar to the one above. Ignore the ``model/`` files for now, pull a sample ``Dockerfile`` from a `previous lesson <../unit08/dash_production.html#containerizing-our-dash-app-with-docker>`_, and use the following for ``app.py``: .. code-block:: python :linenos: from dash import Dash, html app = Dash() server = app.server app.layout = [html.Div(children='Hello, Dash!')] if __name__ == '__main__': app.run(host='0.0.0.0', port=8050, debug=True) (We will use this as a starting point for setting up automations and CI/CD pipelines in the next several modules.) Development Cycle ----------------- We previously talked at great length about why it is a good idea to containerize an app / service that you develop. Some of the main reasons include portability to other machines and reproducibility. In this class, development and deployment are generally done on the same virtual machine (on Jetstream), but this is not always the case in the real world. But, one drawback is that containerization is an added step in our development cycle. The development cycle for a containerized app such as this tends to follow the form: 1. Edit some source code (e.g. add a new feature to the dashboard) 2. Delete any running container with the old source code 3. Re-build the image 4. Start up a new container 5. Test the new code / functionality 6. Repeat This 6-step cycle is great for iterating on the frontend dashboard or the supporting back end code (e.g. ML model) either independently or simultaneously. However, watch out for potential error sources. For example if you stop the Redis container, the frontend dashboard may immediately break and will need to be restarted. Or, consider that the model is loaded into memory so any changes to the backend code will not be reflected in the running container until it is restarted. Makefile Automation ------------------- Makefiles are a useful automation tool for building, running, testing, and pushing code for your projects. They are also useful to assign short keywords to long, frequently-used commands that you would otherwise need to type out. Now that our development cycle is becoming longer and more complicated, it is worth investing some time in Makefiles to speed up the development process. Here, we will set up a Makefile to help with the 6-step cycle above. Using certain keywords (called "targets") we will create shortcuts to cleaning up running containers, re-building docker images, and running new containers. Targets are listed in a file called ``Makefile`` in this format: .. code-block:: text target: prerequisite(s) recipe Targets are short keywords, and recipes are shell commands. For example, a common objective might be to list what containers are currently running on a certain port (e.g. port 8050), and format the output in a nice, readable way. The command to do this may become tedious to type out, so we can assign it to a target in a Makefile like this: .. code-block:: text filter: docker ps --filter "expose=8050" --format "table {{.Names}}\t{{.Image}}\t{{.Ports}}\t{{.Status}}" Put this text in a file called ``Makefile`` in your current directory, then run the following in the terminal: .. code-block:: console [mbs337-vm]$ make filter And that will list any the Docker container with the 8050 port exposed. Makefiles can be further abstracted with variables to make them a little bit more flexible. Consider the following Makefile: .. code-block:: text PORT ?= 8050 filter: docker ps --filter "expose=${PORT}" --format "table {{.Names}}\t{{.Image}}\t{{.Ports}}\t{{.Status}}" Here we have added a variable ``PORT`` at the top so we can easily customize the behavior. We can also create targets for building and running our containers. For example, the following targets expand our current Makefile with functionality for building and running the dashboard container, and a target for stopping all containers: .. code-block:: text PORT ?= 8050 NAME ?= "username/project:tag" filter: docker ps --filter "expose=${PORT}" --format "table {{.Names}}\t{{.Image}}\t{{.Ports}}\t{{.Status}}" build: docker build -t ${NAME} . run: build docker run -d -p 8050:8050 ${NAME} stop: docker stop $(shell docker ps -q) all: stop build run Note the syntax of the ``run`` target - it has a prerequisite of ``build``, which means that the build target will be run first before the run target is executed. This ensures that we are always running the latest version of code in the current working directory. Try some of the targets out in the terminal: .. code-block:: console [mbs337-vm]$ make build [mbs337-vm]$ make run [mbs337-vm]$ make stop .. warning:: Makefile syntax is expecting a tab character before the recipe, not spaces! EXERCISE ~~~~~~~~ Prepare a docker compose file for your project that orchestrates two containers: a redis database and a plotly dash frontend. Then, write a Makefile that, at a minimum: 1. Removes running containers in your namespace 2. Builds and runs new containers / services 3. Cycles through steps 1-2 listed above in one command .. note:: Docker compose automates much of the build and run process for us. But the commands can be tedious to type. Makefiles can be used to automate any arbitrary command that is part of your development cycle. EXERCISE ~~~~~~~~ Plan and write a new feature for your dashboard. For example, download some sequence data as csv and display it in a table in the dashboard. To get started, consider the following code: .. code-block:: python :linenos: :emphasize-lines: 2-3,8,12-14 from dash import Dash, html import dash_ag_grid as dag import pandas as pd app = Dash() server = app.server df = pd.read_csv('https://archive.ics.uci.edu/ml/machine-learning-databases/molecular-biology/promoter-gene-sequences/promoters.data') app.layout = [ html.Div(children='Hello, Dash!'), dag.AgGrid( rowData=df.to_dict('records'), columnDefs=[{"field": i} for i in df.columns] ) ] if __name__ == '__main__': app.run(host='0.0.0.0', port=8050, debug=True) How would you go about putting this new code in a container and make it available for testing? What other considerations should be made to get this to work in a container? Test and Production Deployments ------------------------------- Often when you are working on a project that is in "production" (like a dashboard that is live on the web) but you are adding new features quickly, it is a good idea to have a separate test or "staging" deployment that allows you to see and test new code changes before they are live to everyone else. One simple way to test your changes in is to deploy them onto another port on the same virtual machine - for example port 8051 instead of 8050. To do this in practice, simply change the port number in your Docker compose file. Even better, write a second Docker compose file for your staging environment that is identical to the production environment except for the port number(s). After verifying that the staging deployment looks good, then it is generally safe to deploy into production. EXERCISE ~~~~~~~~ Write a new Docker compose file for your staging environment, and add new Makefile targets to build and deploy to staging. Container Logs -------------- When you are developing and testing your code, it is inevitable that you will encounter errors. When this happens, it is important to know how to find and interpret logs to debug the problem. For example, if your dashboard container is not displaying or working as expected, you should know how to check the container logs. First identify the container ID with the following command: .. code-block:: console [mbs337-vm]$ docker ps If the containers are exited because an error caused them to crash, you may need to: .. code-block:: console [mbs337-vm]$ docker ps -a Then, use the name or ID of the container to check the logs: .. code-block:: console [mbs337-vm]$ docker logs EXERCISE ~~~~~~~~ Introduce an error into your dashboard app.py. 1. First try a SyntaxError - what happens when you try to deploy in a container? Can you find the corresponding log? 2. Next try something more subtle - e.g. put a typo in the path to the download data or remove one of the import statements at the beginning. What type of error is raised? Versioning ---------- We have not spent much time discussing versioning in this class other than to see do not use the tag 'latest' when versioning your repos or Docker images. There is a well-accepted standard for versioning called 'Semantic Versioning'. It follows the specification: Given a version number **MAJOR.MINOR.PATCH**, increment the: * **MAJOR** version when you make incompatible API changes, * **MINOR** version when you add functionality in a backwards compatible manner, and * **PATCH** version when you make backwards compatible bug fixes. You can assign a tag to the current state of a repository on the command line by doing: .. code-block:: console [mbs337-vm]$ git tag -a 0.1.0 -m "first release" [mbs337-vm]$ git push origin 0.1.0 .. tip:: Do you have a new software system that just kind of works and has a little bit of functionality, but you don't know what version tag to assign it? A good place to start is version 0.1.0. Additional Resources -------------------- * `Semantic Versioning `_ * `Useful .gitignore Builder `_ * `README Tips `_