Homework Assignment 4#

Using the Perceptron, SVMs, and PCA with Toxicity Data#

✅ Put your name here.#

✅ Put your GitHub username here.#

Goals for this homework assignment#

By the end of this assignment, you should be able to:

Use git and the branching functionality to track your work and turn in your assignment
Read in data and prepare it for modeling
Build, fit, and evaluate an SVC model of data
Use PCA to reduce the number of important features
Build, fit, and evaluate an SVC model of PCA-transformed data
Train a perceptron and compare to SVC model

Assignment instructions:#

Work through the following assignment, making sure to follow all of the directions and answer all of the questions.

There are 63 points possible on this assignment. Point values for each part are included in the section headers.

This assignment is due by 11:59 pm on Friday, December 5. It should be pushed to your repo (see Part 1) AND submitted to D2L.

Imports#

It’s useful to put all of the imports you need for this assignment in one place. Read through the assignment to figure out which imports you’ll need or add them here as you go.

# Put all necessary imports here

Part 1: Git Repo Management and Branching (6 points)#

For this assignment, you’ll add your work to the cmse202-f25-turnin repository you created in class so that you can track your progress on the assignment and preserve the final version that you turn in.

✅ Do the following:

Navigate to your cmse202-f25-turnin local repository and create a new directory called hw-04
Move this notebook into that new directory in your repository.
Create a new branch called hw04_branch.
“Check out” the new branch (so that you’ll be working on that branch).
Double check to make sure you are actually on that branch.
Once you’re certain you’re working on your new branch, add this notebook to your repository, then make a commit and push it to GitHub. You may need to use git push origin hw04_branch to push your new branch to GitHub.

Finally, ✅ Do this: Before you move on, put the command that your instructor should run to clone your repository in the markdown cell below.

# Put your answer here

Important: Double check you’ve added your Professor and your TA as collaborators to your “turnin” repository (you should have done this in the previous homework assignment).

Also important: Make sure that the version of this notebook that you are working on is the same one that you just added to your repository! If you are working on a different copy of the notebook, none of your changes will be tracked!

If everything went as intended, the file should now show up on your GitHub account in the “cmse202-f25-turnin” repository inside the hw-04 directory that you just created within the new branch hw04_branch.

Periodically, you’ll be asked to commit your changes to the repository and push them to the remote GitHub location. Of course, you can always commit your changes more often than that, if you wish. It can be good to get into a habit of committing your changes any time you make a significant modification, or when you stop working on the project for a bit.

Part 2. Loading the dataset: Toxicity data (10 points)#

The dataset contains information about molecules, along with an indication of whether each molecule is toxic or nontoxic.

The goal of this assignment is to use this dataset to practice using the Perceptron classifier, SVMs, and PCA tools we’ve covered in class. Since the goal of the assignment is to develop models, we have supplied a clean dataset without any missing values.

The data#

✅ Do This: To get started, you’ll need to download the associated data.csv file: https://raw.githubusercontent.com/hoolagans/CMSE202_FS24/refs/heads/main/data.csv

Once you’ve downloaded the data, open the file using a text viewer or other tool on your computer and take a look at the data to get a sense of the information it contains. If you are curious about this dataset, it came from the following link: Toxicity Data.

2.1 Load the data#

✅ Task 2.1 (3 points): Read the data.csv file into your notebook. We’re going to use the “Class” column as the classes that we’ll be trying to predict with our classification models. Replace the values in “Class” with 1 and 0 by assigning “Toxic” to 1 and “NonToxic” to 0.

Once you’ve loaded in the data, display the head of the DataFrame to make sure it looks reasonable.
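The label replacement described above can be sketched on a tiny made-up DataFrame. The column names `feat_a` and `feat_b` and all of the values are hypothetical stand-ins; with the real file you would start from something like `pd.read_csv("data.csv")` instead.

```python
import pandas as pd

# Tiny made-up stand-in for data.csv (hypothetical columns and values).
df = pd.DataFrame({
    "feat_a": [0.1, 0.4, 0.3, 0.9],
    "feat_b": [1.2, 0.7, 0.5, 1.1],
    "Class": ["Toxic", "NonToxic", "NonToxic", "Toxic"],
})

# Map the string labels to integers: "Toxic" -> 1, "NonToxic" -> 0.
df["Class"] = df["Class"].map({"Toxic": 1, "NonToxic": 0})

# Display the first rows to check that the result looks reasonable.
print(df.head())
```

One thing to watch for with `map`: any value that is not a key in the dictionary becomes NaN, so a quick look at `df["Class"].unique()` afterwards can catch spelling mismatches.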

# Put your code here

2.2 Plotting the Data#

✅ Task 2.2 (2 points): Use the seaborn pairplot function to view the distributions of the different classes across the different feature pairs. There are too many features to display here, so just display the last 5 features in the dataset. Use the “hue” option to color the points based on the “Class” so you can easily identify the different class distributions. You should get a 5x5 frame of plots.

# Put your code here

✅ Question 2.1 (2 points): Looking at the plots, does it look like we should be able to reasonably find a classifier to separate the classes? Record your observations.

✎ Erase this and put your observations here.

2.3 Separating the “features” from the “labels”#

As we’ve seen when working with sklearn, it can be much easier to work with the data if we have separate variables that store the features and the labels.

✅ Task 2.3 (1 point): Split your DataFrame so that you have two separate DataFrames, one called features, which contains all of the feature columns, and one called labels, which contains all of the integer “Class” labels. Display both of these new DataFrames to make sure they look correct.

# Put your code here

✅ Question 2.2 (2 points): What accuracy would you achieve if you produced a model that just predicted “NonToxic” for the entire dataset? Would you consider this model to be useful? Why or why not? (Include the code you used to determine this along with your written answer below.)
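As a hint about the arithmetic involved, here is a minimal sketch on made-up labels (not the real class balance). A model that always predicts one class is correct exactly when the true label is that class, so its accuracy equals that class’s share of the data.

```python
import pandas as pd

# Hypothetical labels: 1 = Toxic, 0 = NonToxic (not the real class balance).
labels = pd.Series([0, 0, 0, 1, 0, 1, 0, 0, 0, 0], name="Class")

# Accuracy of "always predict 0" is the fraction of labels that are 0.
baseline_accuracy = (labels == 0).mean()

# Such a model never predicts 1, so it finds none of the Toxic molecules.
toxic_found = 0
toxic_total = int((labels == 1).sum())

print(f"Always-NonToxic accuracy: {baseline_accuracy:.2f}")
print(f"Toxic molecules identified: {toxic_found} of {toxic_total}")
```

This baseline is a useful reference point: a trained classifier is only interesting if it beats this number or catches Toxic cases that the baseline misses.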

# Put your code here

✎ Erase this and put your answer here.

🛑 STOP#

Pause to commit your changes to your Git repository!

Take a moment to save your notebook, commit the changes to your Git repository hw04_branch using the commit message “Committing Part 2”, and push the changes to GitHub.

Part 3. Building an SVC model (7 points)#

Now, to tackle this classification problem, we will use a support vector machine. Of course, we could easily replace this with any sklearn classifier we choose, but for now we will just use an SVC with an RBF kernel.

3.1 Splitting the data#

But first, we need to split our data into training and testing data!

✅ Task 3.1 (2 points): Split your data into a training and testing set with a training set representing 80% of your data. For reproducibility, set the random_state argument to 12. Print the lengths to show you have the right number of entries.
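A minimal sketch of an 80/20 split with a fixed seed, using randomly generated arrays as stand-ins for the real features and labels. The variable names are illustrative choices, not names the assignment requires.

```python
import numpy as np
from sklearn.model_selection import train_test_split

# Made-up stand-ins for the real features and labels: 100 rows, 5 columns.
rng = np.random.default_rng(0)
features = rng.normal(size=(100, 5))
labels = rng.integers(0, 2, size=100)

# train_size=0.8 keeps 80% of the rows for training; random_state fixes the shuffle.
train_features, test_features, train_labels, test_labels = train_test_split(
    features, labels, train_size=0.8, random_state=12
)

print(len(train_features), len(test_features), len(train_labels), len(test_labels))
```

With 100 rows this prints 80 20 80 20. The same call works on pandas DataFrames and Series.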

# Put your code here

3.2 Modeling the data and evaluating the fit#

As you have done this a number of times at this point, we ask you to do most of the analysis for this problem in one cell.

✅ Task 3.2 (4 points): Build an RBF kernel SVC model with C=10, fit it to the training set, and use the test features to predict the outcomes. Evaluate the fit using the confusion matrix and classification report.

Note: Double-check the documentation on the confusion matrix because the way sklearn outputs false positives and false negatives may be different from what most images on the web indicate.
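To make the layout concrete, here is a small sketch with hand-written labels. In sklearn’s `confusion_matrix`, rows correspond to the true labels and columns to the predicted labels, so for 0/1 labels the flattened matrix reads true negatives, false positives, false negatives, true positives.

```python
from sklearn.metrics import confusion_matrix

# Hand-written labels for illustration only: 1 = Toxic, 0 = NonToxic.
y_true = [0, 0, 0, 0, 1, 1, 1, 1]
y_pred = [0, 1, 1, 0, 1, 0, 1, 1]

# Rows are true classes, columns are predicted classes.
cm = confusion_matrix(y_true, y_pred)
print(cm)

# For binary 0/1 labels, ravel() yields the counts in this order.
tn, fp, fn, tp = cm.ravel()
print(f"TN={tn} FP={fp} FN={fn} TP={tp}")
```

Here two NonToxic samples are predicted Toxic (false positives, top-right entry) and one Toxic sample is predicted NonToxic (false negative, bottom-left entry).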

# Put your code here

✅ Question 3.1 (1 point): How accurate is your model? Looking at the Confusion Matrix, is the model doing anything useful?

✎ Erase this and put your answer here.

🛑 STOP#

Pause to commit your changes to your Git repository!

Take a moment to save your notebook, commit the changes to your Git repository hw04_branch using the commit message “Committing Part 3”, and push the changes to GitHub.

Part 4. Building an SVC model using transformed data (10 points)#

As we saw in class, we can synthesize new features that can allow us to take a problem that is challenging to separate linearly and make it easier to classify. In some cases, we may even be able to make data linearly separable.

This dataset has many features and is very challenging, so it would be unreasonable to expect you to come up with new features here; I’ve done this step for you. You will just load in this new dataset and begin exploring it in this section.

4.1 Loading in transformed data#

✅ Task 4.1 (3 points): Load in the new dataset titled ‘transformed.csv’ from ‘https://raw.githubusercontent.com/hoolagans/CMSE202_FS24/refs/heads/main/transformed.csv’ and then create a seaborn pairplot using the entire dataframe with the points colored based on the class label. (Note: you will also need to replace the labels with 1 and 0 just as you did in the earlier section with the other dataset.)

# Put your code here

✅ Question 4.1 (2 points): Looking at the pairplots, does it seem that these transformed features will make the data easier to classify?

✎ Erase this and put your answer here.

4.2 Modeling the data and evaluating the fit#

Now, using the new data, create training and testing splits and then fit the data using an SVC model.

✅ Task 4.2 (4 points): Build a linear kernel SVC model with C=1.0, fit it to the training set, and use the test set to evaluate the model. Evaluate the fit using the confusion matrix and classification report. (Note: use seed 12 again when splitting the data and use train_size=0.8.)
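The overall split, fit, predict, and evaluate pattern can be sketched as follows. The data here comes from `make_classification`, a synthetic generator, so the printed numbers say nothing about the toxicity data; the variable names are illustrative.

```python
from sklearn.datasets import make_classification
from sklearn.metrics import classification_report, confusion_matrix
from sklearn.model_selection import train_test_split
from sklearn.svm import SVC

# Synthetic 10-feature binary classification data (a stand-in, not transformed.csv).
X, y = make_classification(n_samples=300, n_features=10, random_state=0)

X_train, X_test, y_train, y_test = train_test_split(
    X, y, train_size=0.8, random_state=12
)

# Linear-kernel support vector classifier with regularization strength C=1.0.
model = SVC(kernel="linear", C=1.0)
model.fit(X_train, y_train)

# Predict on the held-out rows and summarize the results.
y_pred = model.predict(X_test)
cm = confusion_matrix(y_test, y_pred)
print(cm)
print(classification_report(y_test, y_pred))
```

Swapping in a different kernel or C value only changes the `SVC(...)` line; the rest of the workflow stays the same.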

# Put your code here

✅ Question 4.2 (1 point): How accurate is your model? Did it perform better with this new transformed data?

✎ Erase this and put your answer here.

🛑 STOP#

Pause to commit your changes to your Git repository!

Take a moment to save your notebook, commit the changes to your Git repository hw04_branch using the commit message “Committing Part 4”, and push the changes to GitHub.

Part 5. Finding and using the best hyperparameters (9 points)#

At this point, we have fit an SVC model on two datasets and determined its performance, but is it the best model? We can use GridSearchCV to find the best model (given our choices of parameters). Once we do that, we will use that “best” model for making predictions.

In this section, continue using the transformed data.

5.1 Performing a grid search#

✅ Task 5.1 (4 points): Using the parameters C = 0.1, 1.0, 10.0, 100.0, 1000.0 and gamma = 0.01, 0.1, 1.0, 10.0 with the linear, rbf, and sigmoid kernels, use GridSearchCV with the SVC() model to find the best fit parameters. Once you’ve run the grid search, print the “best params” that the grid search found (hint: there’s an attribute associated with the GridSearchCV object that stores this information). Note that this code could take a while to run since it is repeatedly training your SVM.
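A minimal sketch of how such a search can be set up, run here on a small synthetic dataset so it finishes quickly. The dataset, the 5-fold setting, and the variable names are assumptions made for illustration.

```python
from sklearn.datasets import make_classification
from sklearn.model_selection import GridSearchCV
from sklearn.svm import SVC

# Small synthetic dataset (a stand-in for the transformed training data).
X_train, y_train = make_classification(n_samples=120, n_features=5, random_state=0)

# Every combination of these values is tried: 5 * 4 * 3 = 60 candidates.
param_grid = {
    "C": [0.1, 1.0, 10.0, 100.0, 1000.0],
    "gamma": [0.01, 0.1, 1.0, 10.0],
    "kernel": ["linear", "rbf", "sigmoid"],
}

# cv=5 means each candidate is scored with 5-fold cross-validation.
search = GridSearchCV(SVC(), param_grid, cv=5)
search.fit(X_train, y_train)

# The winning combination is stored on the fitted search object.
print(search.best_params_)
```

Note that the linear kernel does not use gamma, so the linear candidates that differ only in gamma are equivalent fits. That is harmless for a grid this size but does add to the run time.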

# Put your code here

✅ Question 5.1 (1 point): How do the “best params” results of the grid search compare to what you used in Part 4? Did the hyperparameter(s) change? What kernel did the grid search determine was the best option?

✎ Erase this and put your answer here.

5.2 Evaluating the best fit model#

Now that we have found the “best params”, let’s determine how good the fit is.

✅ Task 5.2 (2 points): Use the test features to predict the outcomes for the best model. Evaluate the fit using the confusion matrix and classification report.

Note: Double-check the documentation on the confusion matrix because the way sklearn outputs false positives and false negatives may be different from what most images on the web indicate.

# Put your code here

✅ Question 5.2.1 (1 point): How accurate is this “best” model? What evidence are you using to determine that? How many false positives and false negatives does it predict?

✎ Erase this and put your answer here.

✅ Question 5.2.2 (1 point): How does the model compare to the state-of-the-art performance achieved in the article at this link Paper? Look at Table 2 and Table 3 in the paper.

✎ Erase this and put your answer here.

🛑 STOP#

Pause to commit your changes to your Git repository!

Take a moment to save your notebook, commit the changes to your Git repository hw04_branch using the commit message “Committing Part 5”, and push the changes to GitHub.

Part 6. Using Principal Components (11 points)#

The full model uses all 10 transformed features to predict the results, and you likely found that the model is decently accurate using all 10 features, but not perfect. Could we get the same level of accuracy (or better) using fewer features? When datasets start to get very large and complex, applying some sort of feature reduction method can reduce the computational resources needed to train the model and, in some cases, actually improve the accuracy.

When performing feature reduction, one could simply try to identify which features seem most important and drop the ones that aren’t, but performing a Principal Component Analysis (PCA) to find the combinations of features that account for the most variance in the data can be more effective.

6.1 Running a Principal Component Analysis (PCA)#

Since we have 10 total features to start with, let’s see how well we can do with just 1 feature. Reduce the feature count to 1 principal component.

✅ Task 6.1 (3 points): Using PCA() and the associated fit() method, run a principal component analysis on your training features using 1 component. Transform both the test and training features using the result of your PCA. Print the explained_variance_ratio_.
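The fit-on-train, transform-both pattern can be sketched as follows. The arrays are random stand-ins for the real training and test features, so the printed ratio is not meaningful for the toxicity data.

```python
import numpy as np
from sklearn.decomposition import PCA

# Random stand-ins for the real training and test features (10 columns each).
rng = np.random.default_rng(0)
train_features = rng.normal(size=(80, 10))
test_features = rng.normal(size=(20, 10))

# Learn the principal direction from the training data only.
pca = PCA(n_components=1)
pca.fit(train_features)

# Apply the same learned projection to both sets.
train_pca = pca.transform(train_features)
test_pca = pca.transform(test_features)

# Fraction of the training variance captured by each kept component.
print(pca.explained_variance_ratio_)
print(train_pca.shape, test_pca.shape)
```

Fitting on the training set alone keeps information from the test set out of the projection, which is why the test features are only transformed and never used in `fit`.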

# Put your code here

✅ Question 6.1 (1 point): What is the total explained variance ratio captured by this simple 1-component PCA? How well do you think a model with just 1 feature will perform? Why?

✎ Erase this and put your answer here.

6.2 Fit and Evaluate an SVC model#

Using the PCA transformed features, we need to train and test a new SVC model. You’ll want to perform the GridSearchCV again since there may be a better choice for the kernel and the hyperparameters.

✅ Task 6.2 (2 points): Using the PCA transformed training data, build and train an SVC model using the GridSearchCV tool to make sure you’re using the best kernel and hyper-parameter combination. Predict the classes using the PCA transformed test data. Evaluate the model using the classification report, and the confusion matrix. (Note: use the same parameter options as we used in the previous hyperparameter search.)

# Put your code here

✅ Question 6.2 (1 point): How accurate is this model? What evidence are you using to determine that? How does it compare to the full feature model?

✎ Erase this and put your answer here.

6.3 Repeat your analysis with more components#

What if we increase the number of principal components to 4? What happens now?

✅ Task 6.3 (2 points): Repeat your analysis from 6.1 and 6.2 using 4 components instead. As part of your analysis, print the explained variance ratio of each component as well as the sum of these values.

# Put your code here

✅ Question 6.3 (2 points): What is the total explained variance ratio captured by this PCA? How accurate is this model? What evidence are you using to determine that?

✎ Erase this and put your answer here.

🛑 STOP#

Pause to commit your changes to your Git repository!

Take a moment to save your notebook, commit the changes to your Git repository hw04_branch using the commit message “Committing Part 6”, and push the changes to GitHub.

Part 7. Revisiting the Perceptron classifier (10 points)#

In class you implemented your own perceptron class. Fortunately, there is a perceptron classifier already built into scikit-learn, so in this portion of the assignment we will be exploring scikit-learn’s perceptron.

✅ Do this: Run the following cell to import the Perceptron class.

from sklearn.linear_model import Perceptron

✅ Task 7.1 (4 points): Create an instance of the Perceptron object using alpha=0.001 and penalty='l2'. Then, use the fit() method to train the classifier using the training features and labels dataset you’ve been using in the assignment up to this point. Finally, use the predict() method to predict the labels for the test features and print the accuracy score.
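A minimal sketch of the same create, fit, predict, and score sequence, again on synthetic data from `make_classification` rather than the assignment’s dataset. The variable names are illustrative.

```python
from sklearn.datasets import make_classification
from sklearn.linear_model import Perceptron
from sklearn.metrics import accuracy_score
from sklearn.model_selection import train_test_split

# Synthetic binary classification data (a stand-in for the real features and labels).
X, y = make_classification(n_samples=300, n_features=10, random_state=0)
X_train, X_test, y_train, y_test = train_test_split(
    X, y, train_size=0.8, random_state=12
)

# alpha scales the regularization term; penalty selects which term is used.
clf = Perceptron(alpha=0.001, penalty="l2")
clf.fit(X_train, y_train)

# Compare predictions on the held-out rows with the true labels.
y_pred = clf.predict(X_test)
acc = accuracy_score(y_test, y_pred)
print(f"Test accuracy: {acc:.3f}")
```

In scikit-learn’s Perceptron, `alpha` only has an effect when a `penalty` is set, which is why the two arguments are given together here.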

# Put your code here

✅ Question 7.1 (1 point): How good of a job did the Perceptron classifier do classifying this dataset? How does it compare to the SVC model you built in the previous parts of this assignment?

✎ Erase this and put your answer here.

✅ Task 7.2 (4 points): Now perform a grid search as you did with the support vector classifier earlier in this assignment. Here you will want to search over penalty = l2, l1, elasticnet and alpha = 0.0001, 0.001, 0.01, 0.1. Find and return the best parameters, the confusion matrix, and the classification report.
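One way to go from a finished search to test-set predictions is sketched below on synthetic data. It relies on GridSearchCV’s default behavior of refitting the best candidate on the full training set, so calling `predict` on the search object uses that refit model. The dataset and variable names are assumptions.

```python
from sklearn.datasets import make_classification
from sklearn.linear_model import Perceptron
from sklearn.metrics import classification_report, confusion_matrix
from sklearn.model_selection import GridSearchCV, train_test_split

# Synthetic stand-in data, split the same way as elsewhere in the assignment.
X, y = make_classification(n_samples=300, n_features=10, random_state=0)
X_train, X_test, y_train, y_test = train_test_split(
    X, y, train_size=0.8, random_state=12
)

# 3 penalties * 4 alpha values = 12 candidates.
param_grid = {
    "penalty": ["l2", "l1", "elasticnet"],
    "alpha": [0.0001, 0.001, 0.01, 0.1],
}

search = GridSearchCV(Perceptron(), param_grid, cv=5)
search.fit(X_train, y_train)
print(search.best_params_)

# predict() on the search object delegates to the refit best estimator.
y_pred = search.predict(X_test)
cm = confusion_matrix(y_test, y_pred)
print(cm)
print(classification_report(y_test, y_pred))
```

The same `search.predict(...)` call works for the SVC grid searches in Parts 5 and 6.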

# Put your code here

✅ Question 7.2 (1 point): How do these results compare to the results when using a support vector classifier now that we optimized the parameters? Did the perceptron do better or worse?

✎ Erase this and put your answer here.

🛑 STOP#

Pause to commit your changes to your Git repository!

Take a moment to save your notebook, commit the changes to your Git repository hw04_branch using the commit message “Committing Part 7”, and push the changes to GitHub.

Part 1. Continued#

Now that you’ve finished your new “development” on your 202 turn-in repo, you can merge your work back into your main branch.

✅ Do the following:

Switch back to your main branch.
Merge your hw04_branch with your main branch.
Finally, push the changes to GitHub.

Congratulations, you’re done!#

Submit this assignment by uploading it to the course Desire2Learn web page. Go to the “Homework Assignments” folder, find the submission folder for Homework 4, and upload your notebook.
