Homework 09

Due Date: Tuesday, April 7 by 11:00am CDT

Hands-on Lab: Linear Classification of Breast Cancer Malignancy

For this homework, you will develop a linear classifier for the UCI Breast Cancer Wisconsin Dataset. The dataset contains many features of cell nuclei from breast cancer biopsies.

Perform the following steps inside a Jupyter notebook, using a combination of code cells and markdown cells. Save the notebook in your homework repository for this assignment. The instructors will open the notebook, execute the code cells, and read the explanatory text in your markdown cells.

Part 1: Retrieve the Data

Import the breast cancer dataset from sklearn as follows:

>>> from sklearn.datasets import load_breast_cancer
>>> data = load_breast_cancer()

Write code cells to perform the following steps, and accompany them with markdown cells that explain the process and answer the questions below:

  1. Examine the features, target, and shape of the dataset as we did with the iris example.

  • How many features are there?

  • What do the features represent?

  • What are the different classes in the target variable?

  • How many samples are in the dataset?
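As a starting point, the inspection step might look like the sketch below. It only prints the attributes of the object returned by load_breast_cancer(); interpreting the output and answering the questions in your own words is left to your markdown cells.

```python
from sklearn.datasets import load_breast_cancer

data = load_breast_cancer()

# The returned object behaves like a dictionary; list what it contains
print(data.keys())

# Shape of the feature matrix (n_samples, n_features) and of the target vector
print(data.data.shape)
print(data.target.shape)

# Names of the feature columns and of the target classes
print(data.feature_names)
print(data.target_names)

# DESCR holds a plain-text description of the dataset; print the beginning
print(data.DESCR[:500])
```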

Part 2: Prepare the Data

Import the necessary function to split the data into training and test datasets as follows:

>>> from sklearn.model_selection import train_test_split

Write code cells to perform the following steps, and accompany them with markdown cells that explain the process and answer the questions below:

  1. Set X and y variables in preparation for linear classification.

  2. Split the X and y variables into training and test datasets. Make sure your split is reproducible and that it roughly preserves the proportion of benign and malignant tumors in both the training and test sets.

  • What proportion of the data is in the training set?

  • Why was that proportion chosen?
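One possible way to get a reproducible, class-balanced split is sketched below. The test_size=0.3 and random_state=42 values are illustrative assumptions, not values required by the assignment; choose and justify your own proportion.

```python
from sklearn.datasets import load_breast_cancer
from sklearn.model_selection import train_test_split

data = load_breast_cancer()
X = data.data      # feature matrix
y = data.target    # class labels

X_train, X_test, y_train, y_test = train_test_split(
    X,
    y,
    test_size=0.3,     # assumed value for illustration only
    stratify=y,        # keep the class proportions similar in both splits
    random_state=42,   # assumed seed; any fixed integer makes the split reproducible
)

# Compare the sizes of the two splits
print(X_train.shape, X_test.shape)

# The labels are 0/1, so the mean is the fraction of samples labeled 1 in each split
print(y_train.mean(), y_test.mean())
```

Passing stratify=y asks train_test_split to sample each class separately, which is what keeps the two fractions printed at the end close to each other.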

Part 3: Fit a Linear Classifier

Import a linear classifier that uses stochastic gradient descent as the optimization algorithm as follows:

>>> from sklearn.linear_model import SGDClassifier

Write code cells to perform the following steps, and accompany them with markdown cells that explain the process and answer the questions below:

  1. Fit a linear classifier to the training data using the Perceptron algorithm.

  • According to the documentation, what sort of data is this algorithm designed to work with?

  • Does the breast cancer dataset fit that description?
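A minimal sketch of the fitting step follows. It assumes the Perceptron algorithm is selected through the loss="perceptron" option of SGDClassifier, and it repeats an illustrative split (test_size=0.3, random_state=42 are assumptions) so that the cell runs on its own.

```python
from sklearn.datasets import load_breast_cancer
from sklearn.linear_model import SGDClassifier
from sklearn.model_selection import train_test_split

data = load_breast_cancer()
X_train, X_test, y_train, y_test = train_test_split(
    data.data,
    data.target,
    test_size=0.3,          # assumed value for illustration only
    stratify=data.target,
    random_state=42,        # assumed seed
)

# loss="perceptron" selects the perceptron loss; random_state fixes the
# shuffling done by stochastic gradient descent so the fit is reproducible
clf = SGDClassifier(loss="perceptron", random_state=42)
clf.fit(X_train, y_train)

# A fitted linear classifier stores one weight per feature plus an intercept
print(clf.coef_.shape)
print(clf.intercept_)
```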

Part 4: Validation and Assessment

Import the necessary functions to check the accuracy of your model and to plot a confusion matrix as follows:

>>> from sklearn.metrics import accuracy_score
>>> from sklearn.metrics import ConfusionMatrixDisplay

Write code cells to perform the following steps, and accompany them with markdown cells that explain the process and answer the questions below:

  1. Check the accuracy of your model on the test data set.

  2. Check the accuracy of your model on the training data set.

  3. Plot a confusion matrix for your model.

  • How does the model perform with respect to the different classes in the target variable?

  • Do you think one type of misclassification is more important to minimize?

What to Turn In

  1. Create a homework09/ directory in your homework repository

  2. Include the Jupyter notebook in a clean, logically organized format for the instructors to read and execute

  3. Include a requirements.txt file that lists the dependencies for your notebook

  4. Add a README.md in homework09/ that:

    • Generally describes the purpose of the homework project

    • Briefly describes how to launch a Jupyter server and open the notebook

    • Includes a section on AI usage (if applicable — see note below)

Expected directory layout:

my-mbs337-repo/
└── homework09/
    ├── Notebook.ipynb
    ├── README.md
    └── requirements.txt

Note on Using AI

The use of AI to complete this assignment is not recommended, but it is permitted with the following restrictions:

The use of LLMs (like ChatGPT, Copilot, etc.) or any other AI must be rigorously cited. Any code blocks or text generated by an AI model should be clearly marked as such, with in-code comments describing what was generated, how it was generated, and why you chose to use AI in that instance. The homework README must also contain a section that summarizes where AI was used in the assignment.

Additional Resources