Homework 10

Due Date: Tuesday, April 14 by 11:00am CDT

Hands-on Lab: MLOps with a Linear Classifier

Expanding on the homework assignment you did last week, you will now take the linear classifier you built and put it into production. As a reminder, your homework last week was to develop a linear classifier for the UCI Breast Cancer Wisconsin Dataset. The dataset contains many features of cell nuclei from breast cancer biopsies.

Perform the following steps inside a Python script (not in a Jupyter Notebook). Save the code and the model(s) generated into your homework repository for this assignment.

Part 1: Fit Two Models

As before, write the appropriate code to import the data, split the data into training and test datasets, and fit a linear classifier. After validating the accuracy of the model, save the model to file using the pickle module.
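As a rough sketch of this first step (not a required solution), the functions below load the data, split it, fit a classifier on the raw features, and pickle the result. Two assumptions are made here that the assignment does not specify: scikit-learn's bundled copy of the dataset (load_breast_cancer) stands in for however you imported the data last week, and SGDClassifier stands in for whichever linear classifier you chose. The function names, split size, and random seed are illustrative only.

```python
import pickle

from sklearn.datasets import load_breast_cancer
from sklearn.linear_model import SGDClassifier
from sklearn.metrics import accuracy_score
from sklearn.model_selection import train_test_split


def load_and_split(test_size=0.2, random_state=1):
    """Return X_train, X_test, y_train, y_test."""
    # Assumption: scikit-learn's bundled copy of the dataset is used here;
    # substitute your own import code from last week if it differs.
    X, y = load_breast_cancer(return_X_y=True)
    return train_test_split(
        X, y, test_size=test_size, stratify=y, random_state=random_state
    )


def fit_classifier(X_train, y_train, random_state=1):
    """Fit a linear classifier on the raw (non-normalized) features."""
    # Assumption: SGDClassifier is one possible linear classifier.
    clf = SGDClassifier(random_state=random_state)
    clf.fit(X_train, y_train)
    return clf


def evaluate(model, X_test, y_test):
    """Return the fraction of test samples classified correctly."""
    return accuracy_score(y_test, model.predict(X_test))


def save_model(model, path):
    """Serialize a fitted model to disk with pickle."""
    with open(path, "wb") as f:
        pickle.dump(model, f)
```

A script would typically call these in order from a main() function: split, fit, evaluate, then save_model(clf, "classifier.pkl").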

Next, write the appropriate code to import the data, and create a pipeline which performs data normalization using the StandardScaler, then classification with a linear classifier. Fit the pipeline to your training data, and finally save that pipeline to a new file using the pickle module.
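The pipeline step might look something like the sketch below. It chains a StandardScaler and a classifier with scikit-learn's make_pipeline, so the scaling parameters learned from the training data are stored in the same pickle as the classifier. As assumptions not taken from the assignment, the data comes from scikit-learn's bundled load_breast_cancer, the classifier is SGDClassifier, and the function names and seed are hypothetical.

```python
import pickle

from sklearn.datasets import load_breast_cancer
from sklearn.linear_model import SGDClassifier
from sklearn.model_selection import train_test_split
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler


def build_pipeline(random_state=1):
    """Chain normalization and classification into a single estimator."""
    # Assumption: SGDClassifier is one possible linear classifier.
    return make_pipeline(StandardScaler(), SGDClassifier(random_state=random_state))


def train_and_save_pipeline(path, random_state=1):
    """Fit the pipeline, pickle it to `path`, and return its test accuracy."""
    # Assumption: scikit-learn's bundled copy of the dataset is used here.
    X, y = load_breast_cancer(return_X_y=True)
    X_train, X_test, y_train, y_test = train_test_split(
        X, y, test_size=0.2, stratify=y, random_state=random_state
    )
    pipe = build_pipeline(random_state)
    # The scaler is fit on the training data only; the test data is
    # transformed with those same parameters when scoring.
    pipe.fit(X_train, y_train)
    accuracy = pipe.score(X_test, y_test)
    with open(path, "wb") as f:
        pickle.dump(pipe, f)
    return accuracy
```

Because the scaler lives inside the pipeline, the inference script can pass raw sample data directly to the loaded pipeline without normalizing it separately.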

The code above should all be in the same Python script, but the model and pipeline should be written to two different, appropriately named files. The Python script should be organized into functions and should follow Python best practices.

In the README.md, include a section that describes the process you went through to prepare data for and fit each of the two models, and note the performance of each model on the test data.

Part 2: Deploy Two Models

In a new Python script, use the pickle module to load the model and the pipeline you created in Part 1. Then, write code to take sample data from a user, for example as a csv file. The sample data should be in the same format as the original data available here, but it may be as few as one sample.
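One way to accept the sample data is sketched below using only the standard library. The --sample_data flag matches the usage example in this assignment; everything else is an assumption made for illustration: the function names are hypothetical, and the csv file is assumed to have one header row followed by numeric feature columns only. If your original data has a different layout (for instance, ID or diagnosis columns), adjust the parsing accordingly.

```python
import argparse
import csv


def parse_args(argv=None):
    """Parse command-line arguments; argv=None means use sys.argv."""
    parser = argparse.ArgumentParser(
        description="Predict on user-supplied sample data."
    )
    parser.add_argument(
        "--sample_data", required=True, help="Path to a csv file of samples"
    )
    return parser.parse_args(argv)


def load_samples(path):
    """Return the samples as a list of rows, each a list of floats."""
    # Assumption: one header row, then numeric feature columns only.
    with open(path, newline="") as f:
        reader = csv.reader(f)
        next(reader)  # skip the header row
        return [[float(value) for value in row] for row in reader if row]
```

In a script, args = parse_args() followed by load_samples(args.sample_data) would yield rows that can be passed to a fitted model's predict method.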

The objective is to be able to run just this new Python script, have it take sample data from the user, load the model and pipeline, and make two predictions on the sample data: one using the original model, and one using the pipeline, which normalizes the data before classifying. For example, usage may look like:

$ python inference.py --sample_data sample_data.csv
Your sample data contains 1 entry:
non-normalized model predicts: [malignant]
normalized model in pipeline predicts: [benign]

(The above is a rough example; your output may look very different.)
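A minimal sketch of the prediction step that could produce output like the example above is shown here. The function names and the LABELS mapping are hypothetical. In particular, the mapping assumes the labels were encoded as in scikit-learn's copy of the dataset (0 = malignant, 1 = benign); check this against how you encoded the labels during training.

```python
import pickle

# Assumption: integer labels follow scikit-learn's copy of the dataset.
LABELS = {0: "malignant", 1: "benign"}


def load_pickle(path):
    """Load a pickled object; only unpickle files you trust."""
    with open(path, "rb") as f:
        return pickle.load(f)


def describe_predictions(samples, model_path, pipeline_path):
    """Return the report lines for both the model and the pipeline."""
    model = load_pickle(model_path)
    pipeline = load_pickle(pipeline_path)

    # Both objects receive the same raw samples; the pipeline applies
    # its own normalization internally before classifying.
    model_labels = [LABELS[int(p)] for p in model.predict(samples)]
    pipeline_labels = [LABELS[int(p)] for p in pipeline.predict(samples)]

    noun = "entry" if len(samples) == 1 else "entries"
    return [
        f"Your sample data contains {len(samples)} {noun}:",
        f"non-normalized model predicts: [{', '.join(model_labels)}]",
        f"normalized model in pipeline predicts: [{', '.join(pipeline_labels)}]",
    ]
```

Returning the lines rather than printing them directly keeps the function easy to test; the script's main function can simply print each line.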

What to Turn In

  1. Create a homework10/ directory in your homework repository

  2. Include the script used to fit / validate the model / pipeline (e.g. training.py)

  3. Include the script used to load the model / pipeline and make predictions on sample data (e.g. inference.py)

  4. Include the pickled model and pipeline files generated in Part 1 and named appropriately

  5. Include a small sample data file in csv format that can be used to test the inference script

  6. Include a requirements.txt file that lists the dependencies for your project

  7. Add a README.md in homework10/ that:

    • Generally describes the purpose of the homework project

    • Describes the data preparation, model fitting, and performance of each model (as described in Part 1)

    • Provides instructions to the user on how to run the inference script (as described in Part 2)

    • Includes a section on AI usage (if applicable; see the note below)

Expected directory layout:

my-mbs337-repo/
└── homework10/
    ├── README.md
    ├── inference.py
    ├── classifier.pkl
    ├── requirements.txt
    ├── sample_data.csv
    ├── normalizer_and_data_classifier_pipeline.pkl
    └── training.py

Note on Using AI

The use of AI to complete this assignment is not recommended, but it is permitted with the following restrictions:

The use of LLMs (like ChatGPT, Copilot, etc.) or any other AI must be rigorously cited. Any code blocks or text generated by an AI model should be clearly marked as such with in-code comments describing what was generated, how it was generated, and why you chose to use AI in that instance. The homework README must also contain a section that summarizes where AI was used in the assignment.

Additional Resources