Homework 4#

Perceptron, SVM, and PCA#

✅ Put your name here.

#

✅ Put your GitHub username here.

#

Goal for this homework assignment#

We have worked through some basics of the perceptron, SVM, and PCA in the pre-class and in-class assignments. In this homework assignment, we will:

  • Continue to use git as the version control tool

  • Work on unfamiliar data

  • Use perceptron to classify data

  • Use SVM to classify data

  • Use principal component analysis to facilitate classification

This assignment is due by 11:59 pm on Friday, April 25th. Note that ONLY the copy on GITHUB will be graded. There are 60 standard points possible in this assignment including points for Git commits/pushes. The distribution of points can be found in the section headers.


Part 1: Git repository (6 points)#

You’re going to add this assignment to the cmse202-s25-turnin repository you previously created. The history of your progress on the assignment will be tracked via git commits.

✅ Do the following:

  1. Navigate to your cmse202-s25-turnin local repository and create a new directory called hw-04

  2. Move this notebook into that new directory in your repository.

  3. Double check to make sure your file is in the correct directory.

  4. Once you’re certain that the file and directory are correct, add this notebook to your repository, then make a commit and push it to GitHub. You may need to use git push origin hw04 to push your file to GitHub.

Finally, ✅ Do this: Before you move on, put the command that your instructor should run to clone your repository in the markdown cell below. Points for this part will be given for correctly setting up the repository and branch as described above, and for the git commits/pushes mentioned throughout the assignment.

Put your answer here

Important: Double check you’ve added your Professor and your TA as collaborators to your “turnin” repository (you should have done this in the previous homework assignment).

Also important: Make sure that the version of this notebook that you are working on is the same one that you just added to your repository! If you are working on a different copy of the notebook, none of your changes will be tracked!

If everything went as intended, the file should now show up on your GitHub account in the “cmse202-s25-turnin” repository inside the hw-04 directory that you just created.

Periodically, you’ll be asked to commit your changes to the repository and push them to the remote GitHub location. Of course, you can always commit your changes more often than that, if you wish. It can be good to get into a habit of committing your changes any time you make a significant modification, or when you stop working on the problems for a bit.


Part 2: Deal with unfamiliar data (35 points)#

Warm up with the perceptron for binary classification#

2.1 Load up the dataset#

This data is obtained from Kaggle/diabetes. It contains multiple measured values and a label for whether the patient is diagnosed as diabetic.

  • Use commands to download the dataset from https://raw.githubusercontent.com/huichiayu/cmse202-s25-supllemental_data/refs/heads/main/HW04/diabetes_prediction_dataset.csv

  • Use Pandas to load in the data and briefly examine it.

  • Successfully loading the data gets 2 pt. (See the sketch below for one way to do it.)
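A minimal load-up sketch, assuming pandas can read the raw GitHub URL above directly; the dataframe name diabetes_df is just a placeholder.

# A minimal sketch, assuming the raw GitHub URL above is reachable; pandas can read it
# directly, or you can download the file first with curl/wget and read the local copy.
import pandas as pd

url = ("https://raw.githubusercontent.com/huichiayu/cmse202-s25-supllemental_data/"
       "refs/heads/main/HW04/diabetes_prediction_dataset.csv")
diabetes_df = pd.read_csv(url)   # placeholder variable name

print(diabetes_df.shape)         # number of patients (rows) and features (columns)
diabetes_df.head()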

# put your code here

How many patients are in this dataset? What are the features of the patients?

Put your answer here

Use the perceptron class you built in the Day 18 and Day 19 assignments to classify whether patients are diabetic.#

  • You should see that some features are non-numeric.

  • The first is gender. Find its categories and convert them to numeric values in your dataframe.

  • The second is smoking_history; convert those string labels to numeric values as well.

  • Note that since the perceptron is a binary classifier, which only determines on which side of the dividing hyperplane a data point lies, we should also convert the labels to +1 and -1.

  • Completing the data conversion gets 5 pt. (See the sketch below this list.)
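A minimal conversion sketch, assuming the columns are named gender, smoking_history, and diabetes as in the Kaggle file; check your dataframe and adjust the names if they differ.

# A minimal sketch; column names 'gender', 'smoking_history', and 'diabetes' are assumptions.
import numpy as np

# inspect the categorical values first
print(diabetes_df["gender"].unique())
print(diabetes_df["smoking_history"].unique())

# map each string category to an integer code
diabetes_df["gender"] = diabetes_df["gender"].astype("category").cat.codes
diabetes_df["smoking_history"] = diabetes_df["smoking_history"].astype("category").cat.codes

# convert the 0/1 diabetes label to -1/+1 for the perceptron
diabetes_df["diabetes"] = np.where(diabetes_df["diabetes"] == 1, 1, -1)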

# put your code here

Now all feature variables are numeric.#

🛑 STOP (1 Point)#

Pause, save and commit your changes to your Git repository!

Take a moment to save your notebook, commit the changes to your Git repository with a meaningful commit message.


2.2 Binary perceptron classifier#

Copy your perceptron class to the cell below.

  • DO NOT use the one from statsmodels. We want to test the perceptron you built.

  • Note that your predict method should output +1 or -1 for positive or negative values, respectively.

  • A functional perceptron classifier gets 4 pt. (A reference sketch follows this list.)
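If you no longer have your Day 18/19 class handy, here is a minimal from-scratch sketch of the kind of perceptron this part assumes (features X of shape (n_samples, n_features), labels y in {+1, -1}); your own class may differ in naming and details.

# A minimal from-scratch perceptron sketch; your Day 18/19 class may look different.
import numpy as np

class Perceptron:
    def __init__(self, learning_rate=0.01, n_iters=100):
        self.learning_rate = learning_rate
        self.n_iters = n_iters
        self.weights = None
        self.bias = 0.0

    def fit(self, X, y):
        X, y = np.asarray(X, dtype=float), np.asarray(y)
        self.weights = np.zeros(X.shape[1])
        self.bias = 0.0
        for _ in range(self.n_iters):
            for xi, yi in zip(X, y):
                # update weights only when a sample is misclassified
                if yi * (np.dot(xi, self.weights) + self.bias) <= 0:
                    self.weights += self.learning_rate * yi * xi
                    self.bias += self.learning_rate * yi

    def predict(self, X):
        X = np.asarray(X, dtype=float)
        # +1 for non-negative activations, -1 otherwise
        return np.where(np.dot(X, self.weights) + self.bias >= 0, 1, -1)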

# copy your perceptron class to this cell
  • Split the data into 70-30 train-test sets. 1 pt

  • Train your perceptron.

  • Show the accuracy of your perceptron. 2 pt (See the sketch below this list.)
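A minimal split-and-train sketch, assuming the converted diabetes_df and the Perceptron sketch above; 'diabetes' is assumed to be the label column.

# A minimal sketch of the 70-30 split, training, and training accuracy.
import numpy as np
from sklearn.model_selection import train_test_split

X = diabetes_df.drop(columns=["diabetes"]).values
y = diabetes_df["diabetes"].values

X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.3, random_state=42)

clf = Perceptron(learning_rate=0.01, n_iters=50)
clf.fit(X_train, y_train)

train_acc = np.mean(clf.predict(X_train) == y_train)
print(f"Training accuracy: {train_acc:.3f}")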

# put your code here
  • Use the test set to evaluate the accuracy of your perceptron. What is your accuracy? (2 pt)

# put your code here
  • There may be ways to increase the accuracy, such as increasing the number of training iterations or adjusting the learning rate. Try to train the best perceptron you can; record the parameter values and the best accuracy you obtain. (3 pt) (A tuning sketch follows.)
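One possible tuning sketch: a small grid over learning rate and iteration count, keeping whichever combination gives the best test accuracy. The parameter values below are illustrative, not prescribed.

# A minimal tuning sketch, assuming X_train/X_test/y_train/y_test and the Perceptron class above.
import numpy as np

best_acc, best_params = 0.0, None
for lr in [0.0001, 0.001, 0.01, 0.1]:          # illustrative learning rates
    for n_iters in [10, 50, 100]:              # illustrative iteration counts
        clf = Perceptron(learning_rate=lr, n_iters=n_iters)
        clf.fit(X_train, y_train)
        acc = np.mean(clf.predict(X_test) == y_test)
        if acc > best_acc:
            best_acc, best_params = acc, (lr, n_iters)

print(f"Best test accuracy {best_acc:.3f} with (learning_rate, n_iters) = {best_params}")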

# put your code here

🛑 STOP (1 Point)#

Pause, save and commit your changes to your Git repository!

Take a moment to save your notebook, commit the changes to your Git repository with a meaningful commit message.


2.3 Next, we shall test the perceptron’s capability for multi-class classification.#

  • Download the dataset from https://raw.githubusercontent.com/huichiayu/cmse202-s25-supllemental_data/refs/heads/main/HW04/Telecust1.csv.

  • This is a customer category dataset (Kaggle/Customer Classification). Each customer has several feature variables.

  • There are five categories of customers, which are non-numeric. Convert those string labels to numeric values.

  • Successfully loading the data gets 2 pt. (See the sketch below this list.)
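A minimal load-and-encode sketch, assuming the Telecust1.csv URL above and a label column named custcat (a common name in Kaggle customer-classification data); verify the actual column name with cust_df.columns.

# A minimal sketch; 'custcat' is an assumed label-column name -- check your dataframe.
import pandas as pd

url = ("https://raw.githubusercontent.com/huichiayu/cmse202-s25-supllemental_data/"
       "refs/heads/main/HW04/Telecust1.csv")
cust_df = pd.read_csv(url)
print(cust_df.head())

# encode the string customer categories as integer codes
cust_df["custcat"] = cust_df["custcat"].astype("category").cat.codes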

# Download and load the dataset. Convert non-numerical labels to numerics.
# put your code here

2.4 Multi-class perceptron classification#

  • As we know, the perceptron is a binary classifier. For multi-class classification, we can use the One-vs-Rest (OvR) strategy.

  • In this case, let’s train five individual perceptrons.

  • Each classifier treats its own class as “positive” and all others as “negative.”

  • When classifying a new sample, each classifier gives a “score,” and the class with the highest score is chosen.

Copy your perceptron into the code cell below. We need to add a score method, which outputs the dot product of the weights and the features, as opposed to the previous binary predict method. The score method should output a signed floating-point score, not +1 or -1. This can be done by removing the sign thresholding, i.e., directly outputting the dot product.

  • A functioning score() method gets 2 pt. (A minimal sketch follows.)
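A minimal sketch of the added score() method, building on the Perceptron sketch earlier in this notebook; it returns the raw signed activation instead of thresholding it to +1/-1.

# A minimal sketch; assumes the Perceptron class sketched above is available.
import numpy as np

class PerceptronOvR(Perceptron):
    def score(self, X):
        X = np.asarray(X, dtype=float)
        # no sign thresholding: return the signed activation directly
        return np.dot(X, self.weights) + self.bias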

# put your modified perceptron class here
  • Now let’s do a train-test split of the data with test_size = 0.3.

  • Since we are training 5 perceptrons, we should have 5 class label sets. For instance, in the label set for category A, the label value is +1 if the sample is type A and -1 otherwise.

  • Setting up the label sets gets 4 pt. (See the sketch below this list.)
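A minimal sketch of the split and the five One-vs-Rest label sets, assuming the encoded cust_df from above with the assumed label column custcat.

# A minimal sketch of the 70-30 split and the OvR label sets (names are placeholders).
import numpy as np
from sklearn.model_selection import train_test_split

X_c = cust_df.drop(columns=["custcat"]).values
y_c = cust_df["custcat"].values

Xc_train, Xc_test, yc_train, yc_test = train_test_split(X_c, y_c, test_size=0.3, random_state=42)

classes = np.unique(yc_train)
# one +1/-1 label vector per class: +1 for "this class", -1 for everything else
ovr_labels = {c: np.where(yc_train == c, 1, -1) for c in classes}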

# put your code here
  • Use the training set and the 5 training label sets to train your 5 perceptrons. Report the training accuracy of each of the five.

  • Efficiently training the five perceptrons with a loop gets 5 pt. (See the sketch below this list.)
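A minimal sketch of training one perceptron per class in a loop, using the score-capable class sketched above; ovr_models and ovr_labels are placeholder names from the previous sketch.

# A minimal training-loop sketch for the five OvR perceptrons.
import numpy as np

ovr_models = {}
for c in classes:
    clf = PerceptronOvR(learning_rate=0.01, n_iters=50)
    clf.fit(Xc_train, ovr_labels[c])
    ovr_models[c] = clf
    train_acc = np.mean(clf.predict(Xc_train) == ovr_labels[c])
    print(f"class {c}: OvR training accuracy = {train_acc:.3f}")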

# put your code here
  • Use the test set to examine the accuracy.

  • For each test sample, there should be 5 output scores, one from each perceptron. The predicted label is the one corresponding to the highest score.

  • Report your accuracy. (3 pt) (See the sketch below this list.)
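A minimal prediction sketch: stack the five score vectors and, for each test sample, pick the class whose perceptron returns the highest score.

# A minimal OvR prediction sketch, assuming ovr_models, classes, Xc_test, and yc_test from above.
import numpy as np

scores = np.column_stack([ovr_models[c].score(Xc_test) for c in classes])
y_pred = classes[np.argmax(scores, axis=1)]

test_acc = np.mean(y_pred == yc_test)
print(f"OvR test accuracy: {test_acc:.3f}")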

# put your code here

How good is your multi-class perceptron classifier?

Put your answer here

🛑 STOP (1 Point)#

Pause, save and commit your changes to your Git repository!

Take a moment to save your notebook, commit the changes to your Git repository with a meaningful commit message.


Part 3: SVM classifiers (19 points)#

3.1 SVM#

Let’s re-use the customer category data. There are five categories with multiple feature variables.

  • Use the sklearn library to build an SVM classifier. Since we do not know what the best parameters are, perform a GridSearch for them.

  • NOTE: Because the dataset contains a large number of points, GridSearch can take a long time to run. Thus, let’s use only the first 200 data points for GridSearch. You can start with a grid-search parameter set like the one in the image below. However, NOTE that if a kernel cannot find a hyperplane that separates the data points, the GridSearch call may stall. You will need to manually remove that kernel from the parameter set and re-run GridSearch.

  • As in the previous section, make a 70-30 train-test split and train your SVM classifier.

  • Completing the GridSearch and extracting the best parameters gets 5 pt. (See the sketch below this list.)
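A minimal GridSearch sketch, assuming the customer train/test split from Part 2; the starting parameter grid below is an assumption, not the required one, and any kernel that stalls should be removed from it.

# A minimal sketch: grid search on the first 200 training points, then retrain on the full split.
from sklearn.model_selection import GridSearchCV
from sklearn.svm import SVC

param_grid = {                      # assumed starting grid; adjust as needed
    "C": [0.1, 1, 10, 100],
    "gamma": ["scale", 0.1, 0.01],
    "kernel": ["linear", "rbf"],
}

grid = GridSearchCV(SVC(), param_grid, cv=3)
grid.fit(Xc_train[:200], yc_train[:200])
print("Best parameters:", grid.best_params_)

best_svc = SVC(**grid.best_params_)
best_svc.fit(Xc_train, yc_train)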

# put your code here.
  • Examine and report the accuracy of this SVC. Draw a confusion matrix. 2 pt (See the sketch below.)
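A minimal accuracy-and-confusion-matrix sketch for the tuned SVC, assuming best_svc and the test split from above.

# A minimal sketch of the accuracy report and confusion matrix.
import matplotlib.pyplot as plt
from sklearn.metrics import accuracy_score, ConfusionMatrixDisplay

yc_pred = best_svc.predict(Xc_test)
print("SVC test accuracy:", accuracy_score(yc_test, yc_pred))

ConfusionMatrixDisplay.from_predictions(yc_test, yc_pred)
plt.show()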

# put your code here

Does the SVM classifier work much better than your perceptron?

Put your answer here

🛑 STOP (1 Point)#

Pause, save and commit your changes to your Git repository!

Take a moment to save your notebook, commit the changes to your Git repository with a meaningful commit message.


3.2 PCA#

Although we only have 11 feature variables in the dataset, let’s examine how much principal component analysis (PCA) can accelerate the classification. We will increase the number of PCA components from 1 to 11. For each case, we will perform a GridSearch and use the test set to examine the accuracy.

  • Write code to loop over n_components = 1 through 11. 4 pt

  • Record the accuracy of each case and plot accuracy versus n_components. Also record the run time of each case and plot time versus n_components. 2 pt (See the sketch below this list.)
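A minimal sketch of the PCA sweep, assuming the customer train/test split and the subset-based GridSearch pattern from Part 3.1; for each n_components it fits PCA on the training data, searches a small (assumed) grid, then records the test accuracy and the elapsed time.

# A minimal PCA-sweep sketch (parameter grid and names are assumptions).
import time
import numpy as np
import matplotlib.pyplot as plt
from sklearn.decomposition import PCA
from sklearn.model_selection import GridSearchCV
from sklearn.svm import SVC

param_grid = {"C": [0.1, 1, 10], "kernel": ["linear", "rbf"]}
accuracies, runtimes = [], []

for n in range(1, 12):
    start = time.time()
    pca = PCA(n_components=n)
    Xtr = pca.fit_transform(Xc_train)
    Xte = pca.transform(Xc_test)

    grid = GridSearchCV(SVC(), param_grid, cv=3)
    grid.fit(Xtr[:200], yc_train[:200])            # grid search on a subset, as in 3.1
    svc = SVC(**grid.best_params_).fit(Xtr, yc_train)

    accuracies.append(np.mean(svc.predict(Xte) == yc_test))
    runtimes.append(time.time() - start)

fig, (ax1, ax2) = plt.subplots(1, 2, figsize=(10, 4))
ax1.plot(range(1, 12), accuracies, marker="o")
ax1.set_xlabel("n_components"); ax1.set_ylabel("test accuracy")
ax2.plot(range(1, 12), runtimes, marker="o")
ax2.set_xlabel("n_components"); ax2.set_ylabel("run time (s)")
plt.tight_layout(); plt.show()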

# put your code here

Please answer the following questions.

  • How good is the overall accuracy of this SVM classifier? 1 pt

  • If the performance is not good, what do you think the cause is? 2 pt

Put your answer here

  • Describe the curves of time vs n_components and accuracy vs n_components. 1 pt

  • Explain why the curves behave the way they do in the figures. 2 pt

Put your answer here

🛑 STOP (1 Point)#

Pause, save and commit your FINAL changes to your Git repository!

Take a moment to save your notebook, commit the changes to your Git repository with a meaningful commit message.


Assignment wrap-up#

Please fill out the form that appears when you run the code below. You must completely fill this out in order to receive credit for the assignment!

from IPython.display import HTML
HTML(
"""
<iframe 
	src="https://forms.office.com/r/mB0YjLYvAA" 
	width="800px" 
	height="600px" 
	frameborder="0" 
	marginheight="0" 
	marginwidth="0">
	Loading...
</iframe>
"""
)

Congratulations, you’re done!#

© Copyright 2025, Department of Computational Mathematics, Science and Engineering at Michigan State University