Homework Assignment 4#

Using the Perceptron, SVMs, and PCA with Seeds Data#

✅ Put your name here.


✅ Put your GitHub username here.



Goals for this homework assignment#

By the end of this assignment, you should be able to:

  • Use git and the branching functionality to track your work and turn in your assignment

  • Read in data and prepare it for modeling

  • Build, fit, and evaluate an SVC model of data

  • Use PCA to reduce the number of important features

  • Build, fit, and evaluate an SVC model of PCA-transformed data

  • Train a perceptron and compare to SVC model

Assignment instructions:#

Work through the following assignment, making sure to follow all of the directions and answer all of the questions.

There are 65 points possible on this assignment. Point values for each part are included in the section headers.

This assignment is due by 11:59 pm on Monday, December 2. It should be pushed to your repo (see Part 1) AND submitted to D2L.

Imports#

It’s useful to put all of the imports you need for this assignment in one place. Read through the assignment to figure out which imports you’ll need or add them here as you go.

# Put all necessary imports here
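
For reference, a minimal set of imports covering the tasks below might look like this (a sketch; trim or extend it to match the tools you actually end up using):

import pandas as pd
import matplotlib.pyplot as plt
import seaborn as sns
from sklearn.model_selection import train_test_split, GridSearchCV
from sklearn.svm import SVC
from sklearn.decomposition import PCA
from sklearn.linear_model import Perceptron
from sklearn.metrics import confusion_matrix, classification_report, accuracy_score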

Part 1: Git Repo Management and Branching (6 points)#

For this assignment, you’re going to add your work to the cmse202-f24-turnin repository you created in class so that you can track your progress on the assignment and preserve the final version that you turn in. In order to do this, you need to:

✅ Do the following:

  1. Navigate to your cmse202-f24-turnin local repository and create a new directory called hw-04.

  2. Move this notebook into that new directory in your repository.

  3. Create a new branch called hw04_branch.

  4. “Check out” the new branch (so that you’ll be working on that branch).

  5. Double check to make sure you are actually on that branch.

  6. Once you’re certain you’re working on your new branch, add this notebook to your repository, then make a commit and push it to GitHub. You may need to use git push origin hw04_branch to push your new branch to GitHub.

Finally, ✅ Do this: Before you move on, put the command that your instructor should run to clone your repository in the markdown cell below.

# Put your answer here
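
A template for that command (replace <username> with your actual GitHub username):

git clone https://github.com/<username>/cmse202-f24-turnin.git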

Important: Double check you’ve added your Professor and your TA as collaborators to your “turnin” repository (you should have done this in the previous homework assignment).

Also important: Make sure that the version of this notebook that you are working on is the same one that you just added to your repository! If you are working on a different copy of the notebook, none of your changes will be tracked!

If everything went as intended, the file should now show up on your GitHub account in the “cmse202-f24-turnin” repository inside the hw-04 directory that you just created within the new branch hw04_branch.

Periodically, you’ll be asked to commit your changes to the repository and push them to the remote GitHub location. Of course, you can always commit your changes more often than that, if you wish. It can be good to get into a habit of committing your changes any time you make a significant modification, or when you stop working on the project for a bit.


Part 2. Loading the dataset: Seeds data (7 points)#

The dataset contains information about seeds along with the type of seed.

The goal of this assignment is to use this dataset to practice using the Perceptron classifier, SVMs, and PCA tools we’ve covered in class. Since the goal of the assignment is to develop models, we have supplied a clean dataset without any missing values.

The data#

✅ Do This: To get started, you’ll need to download the associated seeds.tsv file: https://raw.githubusercontent.com/hoolagans/CMSE202_FS24/main/seeds.tsv

Once you’ve downloaded the data, open the file using a text editor or other tool on your computer and take a look at the data to get a sense of the information it contains. If you are curious about this dataset, it came from the following link: Seeds Data.

2.1 Load the data#

✅ Task 2.1 (2 points): Read the seeds.tsv file into your notebook. When loading the dataset, assign the following names to the features: [“F1”,”F2”,”F3”,”F4”,”F5”,”F6”,”F7”,”Class”]. We’re going to use the “Class” column as the set of classes that we’ll be trying to predict with our classification models.

Once you’ve loaded in the data, display the DataFrame to make sure it looks reasonable. You should have 8 columns and 210 rows.

# Put your code here
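
One possible approach (a sketch; it assumes the file sits next to the notebook and that pandas has been imported as pd):

# The file has no header row, so we supply the column names ourselves.
# If the spacing turns out to be irregular, sep=r"\s+" is a more forgiving choice than sep="\t".
seeds = pd.read_csv("seeds.tsv", sep="\t",
                    names=["F1", "F2", "F3", "F4", "F5", "F6", "F7", "Class"])
seeds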

2.2 Plotting the Data#

✅ Task 2.2 (2 points): Use the seaborn pairplot function to view the distributions of the different classes across the different feature pairs. You should use the “hue” option to set the points to be colored based on the “Class” so you can easily identify the different class distributions.

# Put your code here
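
A minimal sketch, assuming the DataFrame from the previous step is named seeds:

# Color each point by its class so the class distributions are easy to compare
sns.pairplot(seeds, hue="Class")
plt.show()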

Question 2.1 (2 points): Looking at the plots, does it look like we should be able to reasonably find a classifier to separate the classes? Record your observations.

Erase this and put your observations here.

2.3 Separating the “features” from the “labels”#

As we’ve seen when working with sklearn, it can be much easier to work with the data if we have separate variables that store the features and the labels.

✅ Task 2.3 (1 point): Split your DataFrame so that you have two separate DataFrames, one called features, which contains all of the seed features, and one called labels, which contains all of the integer “Class” labels. Display both of these new DataFrames to make sure they look correct.

# Put your code here
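
One way to do this (a sketch, again assuming the DataFrame is named seeds):

features = seeds.drop(columns="Class")  # everything except the label column
labels = seeds["Class"]                 # just the integer class labels
display(features)  # display() is available inside Jupyter notebooks
display(labels)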

Question 2.2 (2 points): How balanced are the classes? Does it matter whether the classes are balanced? Why or why not? (Include the code you used to determine this along with your written answer below.)

Erase this and put your answer here.

# Put your code here
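
A quick way to check the balance (assuming the labels variable from Task 2.3):

# Count how many rows belong to each class
labels.value_counts()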

🛑 STOP#

Pause to commit your changes to your Git repository!

Take a moment to save your notebook, commit the changes to your Git repository hw04_branch using the commit message “Committing Part 2”, and push the changes to GitHub.



Part 3. Building an SVC model (5 points)#

Now, to tackle this classification problem, we will use a support vector machine. Of course, we could easily replace this with any sklearn classifier we choose, but for now we will just use an SVC with an rbf kernel.

3.1 Splitting the data#

But first, we need to split our data into training and testing data!

✅ Task 3.1 (2 points): Split your data into a training and testing set, with the training set representing 70% of your data. For reproducibility, set the random_state argument to 12. Print the lengths to show you have the right number of entries.

# Put your code here
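
A sketch of the split (assuming the features/labels variables from Part 2):

train_features, test_features, train_labels, test_labels = train_test_split(
    features, labels, train_size=0.7, random_state=12)
print(len(train_features), len(test_features))
print(len(train_labels), len(test_labels))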

3.2 Modeling the data and evaluating the fit#

Since you have done this a number of times by this point, we ask you to do most of the analysis for this problem in one cell.

✅ Task 3.2 (4 points): Build an rbf kernel SVC model with C=1.0, fit it to the training set, and use the test features to predict the outcomes. Evaluate the fit using the confusion matrix and classification report.

Note: Double-check the documentation on the confusion matrix because the way sklearn outputs false positives and false negatives may be different from what most images on the web indicate.

# Put your code here
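
A minimal sketch of that workflow (variable names follow the Task 3.1 sketch):

svc = SVC(kernel="rbf", C=1.0)
svc.fit(train_features, train_labels)      # fit on the training set
predictions = svc.predict(test_features)   # predict on the test features
print(confusion_matrix(test_labels, predictions))
print(classification_report(test_labels, predictions))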

Question 3.1 (1 point): How accurate is your model? What evidence are you using to determine that? How many false positives and false negatives does it predict for each class?

Erase this and put your answer here.


🛑 STOP#

Pause to commit your changes to your Git repository!

Take a moment to save your notebook, commit the changes to your Git repository hw04_branch using the commit message “Committing Part 3”, and push the changes to GitHub.



Part 4. Finding and using the best hyperparameters (8 points)#

At this point, we have fit one model and determined its performance, but is it the best model? We can use GridSearchCV to find the best model (given our choices of parameters). Once we do that, we will use that “best” model for making predictions.
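
As a reference, a minimal sketch of such a grid search (the parameter grid below is an illustrative assumption, not a required set of values; variable names follow the Part 3 sketch):

param_grid = {"C": [0.1, 1, 10, 100],
              "gamma": [0.001, 0.01, 0.1, 1],
              "kernel": ["rbf", "linear"]}
grid = GridSearchCV(SVC(), param_grid)  # 5-fold cross-validation by default
grid.fit(train_features, train_labels)
print(grid.best_params_)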

4.2 Evaluating the best fit model#

Now that we have found the “best params”, let’s determine how good the fit is.

✅ Task 4.2 (2 points): Use the test features to predict the outcomes for the best model. Evaluate the fit using the confusion matrix and classification report.

Note: Double-check the documentation on the confusion matrix because the way sklearn outputs false positives and false negatives may be different from what most images on the web indicate.

# Put your code here
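
A short sketch (assuming the fitted GridSearchCV object from above is named grid):

# GridSearchCV refits the best model on the full training set,
# so predict() already uses the best estimator
best_predictions = grid.predict(test_features)
print(confusion_matrix(test_labels, best_predictions))
print(classification_report(test_labels, best_predictions))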

Question 4.2 (1 point): How accurate is this “best” model? What evidence are you using to determine that? How many false positives and false negatives does it predict?

Erase this and put your answer here.


🛑 STOP#

Pause to commit your changes to your Git repository!

Take a moment to save your notebook, commit the changes to your Git repository hw04_branch using the commit message “Committing Part 4”, and push the changes to GitHub.



Part 5. Using Principal Components (13 points)#

The full model uses all 7 features to predict the results, and you likely found that the model is decently accurate using all 7 features, but not perfect. Could we get the same level of accuracy (or better) using fewer features? When datasets start to get very large and complex, applying some sort of feature reduction method can reduce the computational resources needed to train the model and, in some cases, actually improve the accuracy.

When performing feature reduction, one could simply try to identify which features seem most important and drop the ones that aren’t, but performing a Principal Component Analysis (PCA) to determine the components that contribute the most to the model (through the variance they account for) can be more effective.

5.1 Running a Principal Component Analysis (PCA)#

Since we have 7 total features to start with, let’s see how well we can do with just 1 feature. Reduce the feature count to 1 principal component. We’ll see how well we can predict the classes of the seeds dataset with just 1 feature!

✅ Task 5.1 (3 points): Using PCA() and the associated fit() method, run a principal component analysis on your training features using 1 component. Transform both the test and training features using the result of your PCA. Print the explained_variance_ratio_.

# Put your code here
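
A minimal sketch (variable names follow the earlier parts):

pca = PCA(n_components=1)
pca.fit(train_features)                   # learn the component from training data only
train_pca = pca.transform(train_features)
test_pca = pca.transform(test_features)   # reuse the same transformation on the test set
print(pca.explained_variance_ratio_)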

Question 5.1 (1 point): What is the total explained variance ratio captured by this simple 1-component PCA? How well do you think a model with just 1 feature will perform? Why?

Erase this and put your answer here.

5.2 Fit and Evaluate an SVC model#

Using the PCA-transformed features, we need to train and test a new SVC model. You’ll want to perform the GridSearchCV again since there may be a better choice for the kernel and the hyper-parameters.

✅ Task 5.2 (2 points): Using the PCA-transformed training data, build and train an SVC model using the GridSearchCV tool to make sure you’re using the best kernel and hyper-parameter combination. Predict the classes using the PCA-transformed test data. Evaluate the model using the classification report and the confusion matrix.

# Put your code here
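
A sketch, reusing the illustrative param_grid from the Part 4 sketch and the transformed features from Task 5.1:

grid_pca = GridSearchCV(SVC(), param_grid)
grid_pca.fit(train_pca, train_labels)
pca_predictions = grid_pca.predict(test_pca)
print(confusion_matrix(test_labels, pca_predictions))
print(classification_report(test_labels, pca_predictions))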

Question 5.2 (1 point): How accurate is this model? What evidence are you using to determine that? How many false positives and false negatives does it predict? How does it compare to the full feature model?

Erase this and put your answer here.

5.3 Repeat your analysis with more components#

You probably found that the model with 1 feature didn’t actually do too badly, which is great given how few features we’re using, but it’s still not as good as just using all of the features. Can we do better?

What if we increase the number of principal components to 4? What happens now?

✅ Task 5.3 (2 points): Repeat your analysis from 5.1 and 5.2 using 4 components instead. As part of your analysis, print the explained variance ratio for each component as well as the sum of these values.

# Put your code here

Question 5.3 (4 points): What is the total explained variance ratio captured by this PCA? How accurate is this model? What evidence are you using to determine that? How many false positives and false negatives does it predict? How does it compare to the 1-component PCA model? To the full feature model?

Erase this and put your answer here.


🛑 STOP#

Pause to commit your changes to your Git repository!

Take a moment to save your notebook, commit the changes to your Git repository hw04_branch using the commit message “Committing Part 5”, and push the changes to GitHub.



Part 6. How well does PCA work? (14 points)#

Clearly, the number of components we use in our PCA matters. Let’s investigate how it matters by systematically building a model for each number of selected components. While this might seem a bit unnecessary for such a relatively small dataset, this approach can be very useful for more complex datasets and models!

6.1 Accuracy vs. Components#

To systematically explore how well PCA improves our classification model, we will write a function that creates the PCA and the SVC model, fits the training data, predicts the labels using the test data, and returns the accuracy score and the explained variance ratio. Your function will take as input:

  • the number of requested PCA components

  • the training feature data

  • the testing feature data

  • the training data labels

  • the test data labels

and it should return the accuracy score for an SVC model fit to the PCA-transformed features and the total explained variance ratio (i.e., the sum of the explained variance for each component).

✅ Task 6.1 (4 points): Create this function, which you will use in the next section.

# Put your code here
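
One way to structure such a function (a sketch; the name pca_svc_accuracy and the parameter grid inside it are illustrative assumptions):

def pca_svc_accuracy(n_components, train_features, test_features, train_labels, test_labels):
    # Reduce the features to the requested number of components
    pca = PCA(n_components=n_components)
    train_pca = pca.fit_transform(train_features)
    test_pca = pca.transform(test_features)
    # Grid-search an SVC on the transformed training data
    param_grid = {"C": [0.1, 1, 10, 100],
                  "gamma": [0.001, 0.01, 0.1, 1],
                  "kernel": ["rbf", "linear"]}
    grid = GridSearchCV(SVC(), param_grid)
    grid.fit(train_pca, train_labels)
    # Score on the test data and total up the explained variance
    accuracy = accuracy_score(test_labels, grid.predict(test_pca))
    total_variance = pca.explained_variance_ratio_.sum()
    return accuracy, total_variance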

6.2 Compute accuracies#

Now that you have created a function that returns the accuracy for a given number of components, we will use it to plot how the accuracy of your SVC model changes when we increase the number of components used in the PCA.

✅ Task 6.2 (2 points): Going from 1 to 7 components, use your function above to compute and store (as lists) the accuracy of your models and the total explained variance ratio of your models.

Note: you’ll be running many grid searches to do this, so it might take your computer a bit of time to run all of these models. Please be patient. It shouldn’t take more than a couple of minutes!

# Put your code here
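
A sketch of that loop (assuming the pca_svc_accuracy function from the Task 6.1 sketch):

accuracies = []
variances = []
component_counts = range(1, 8)  # 1 through 7 components
for n in component_counts:
    acc, var = pca_svc_accuracy(n, train_features, test_features, train_labels, test_labels)
    accuracies.append(acc)
    variances.append(var)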

6.3 Plot accuracy vs number of components#

Now that we have those numbers, it makes sense to look at the accuracy vs # of components.

✅ Task 6.3 (2 points): Plot the accuracy vs # of components.

# Put your code here
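
A minimal matplotlib sketch (assuming the lists from Task 6.2):

plt.plot(list(component_counts), accuracies, marker="o")
plt.xlabel("Number of PCA components")
plt.ylabel("Test accuracy")
plt.show()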

✅ Question 6.1 (3 points): What do you observe about the accuracy as a function of the number of PCA components you use? One goal of using dimension reduction strategies is to develop a model with the fewest features while maximizing the accuracy. Given that motivation, what number of principal components would you choose and why?

Erase this and put your answer here.

6.4 Plot total explained variance vs number of components#

What if we look at total explained variance as a function of # of components?

✅ Task 6.4 (2 points): Plot the total explained variance ratio vs # of components.

# Put your code here
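
The same kind of plot works here (assuming the variances list from Task 6.2):

plt.plot(list(component_counts), variances, marker="o")
plt.xlabel("Number of PCA components")
plt.ylabel("Total explained variance ratio")
plt.show()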

✅ Question 6.2 (1 point): Based on your answer from question 6.1 and the plot above, what is the explained variance for the number of principal components that you chose?

Erase this and put your answer here.


🛑 STOP#

Pause to commit your changes to your Git repository!

Take a moment to save your notebook, commit the changes to your Git repository hw04_branch using the commit message “Committing Part 6”, and push the changes to GitHub.



Part 7. Revisiting the Perceptron classifier (10 points)#

In class you implemented your own perceptron class. Fortunately, there is a perceptron classifier already built into scikit-learn, so in this portion of the assignment we will be exploring scikit-learn’s Perceptron.

✅ Do this: Run the following cell to import the Perceptron class.

from sklearn.linear_model import Perceptron

✅ Task 7.1 (4 points): Create an instance of the Perceptron object using alpha=0.001 and penalty='l2'. Then, use the fit() method to train the classifier using the training features and labels from the seeds dataset you’ve been using in the assignment up to this point. Finally, use the predict() method to predict the labels for the test features and print the accuracy score.

# Put your code here
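
A minimal sketch (variable names follow the earlier train/test split):

perceptron = Perceptron(alpha=0.001, penalty="l2")
perceptron.fit(train_features, train_labels)
perceptron_predictions = perceptron.predict(test_features)
print(accuracy_score(test_labels, perceptron_predictions))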

✅ Question 7.1 (1 point): How good a job did the Perceptron classifier do classifying this dataset? How does it compare to the SVC model you built in the previous parts of this assignment?

Erase this and put your answer here.

✅ Task 7.2 (4 points): Now perform a grid search as you did with the support vector classifier earlier in this assignment. Here you will want to search over penalty values of 'l2', 'l1', and 'elasticnet', and alpha values of 0.0001, 0.001, 0.01, and 0.1. Find and return the best parameters, the confusion matrix, and the classification report.

# Put your code here
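
A sketch of that search (the grid matches the values listed above; the variable names are assumptions):

perceptron_grid = {"penalty": ["l2", "l1", "elasticnet"],
                   "alpha": [0.0001, 0.001, 0.01, 0.1]}
perceptron_search = GridSearchCV(Perceptron(), perceptron_grid)
perceptron_search.fit(train_features, train_labels)
print(perceptron_search.best_params_)
best_perceptron_predictions = perceptron_search.predict(test_features)
print(confusion_matrix(test_labels, best_perceptron_predictions))
print(classification_report(test_labels, best_perceptron_predictions))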

✅ Question 7.2 (1 point): How do these results compare to the results from the support vector classifier, now that we have optimized the parameters? Did the perceptron do better or worse?

Erase this and put your answer here.


🛑 STOP#

Pause to commit your changes to your Git repository!

Take a moment to save your notebook, commit the changes to your Git repository hw04_branch using the commit message “Committing Part 7”, and push the changes to GitHub.



Part 1. Continued#

Now that you’ve finished your new “development” on your 202 turn-in repo, you can merge your work back into your main branch.

✅ Do the following:

  1. Switch back to your main branch.

  2. Merge your hw04_branch with your main branch.

  3. Finally, push the changes to GitHub.

Congratulations, you’re done!#

Submit this assignment by uploading it to the course Desire2Learn web page. Go to the “Homework Assignments” folder, find the submission folder for Homework 4, and upload your notebook.

© Copyright 2024, Department of Computational Mathematics, Science and Engineering at Michigan State University