In-Class Assignment: Principal Component Analysis (PCA)#
Day 22#
CMSE 202#
✅ Put your name here
#✅ Put your group member names here
#
Image from http://lazyprogrammer.me/
PCA Explained Visually: http://setosa.io/ev/principal-component-analysis/
Problem Description#
In Machine Learning, a good “rule of thumb” is to have more than 10 times the number of training samples as dimensions in your feature vector. So if your feature vector has 4 features, then you should have 40 samples for training (this doesn’t count the samples for testing).
This is also called the “curse of dimensionality” and can sometimes be counter intuitive. Basically we want more dimensions to ensure that we have captured the patterns we want to learn in the model. However, too many dimensions requires more training data. This can be a big problem when you have high dimensional data.
For example, the face images provided by sci-kit learn, which we visited briefly at the end of the last class period and are using today in class, have 50 rows and 37 columns; this results in a feature vector length of 1850. By the Machine Learning rule of thumb, we should have 18,500 examples in our training set (we only have at most 1560 examples and that does not include the ones we have set aside for testing. When we ignore this rule of thumb, we can end up with less than desirable results!
In this assignment we are going to use Principal Component Analysis (PCA) to to reduce the size of our feature vector and only select features which are the “most” descriptive.
Agenda for today’s class#
Imports:#
# imports for the day
import numpy as np
from sklearn.datasets import fetch_lfw_people
import matplotlib.pylab as plt
from ipywidgets import interact
from sklearn.model_selection import train_test_split
from sklearn.model_selection import GridSearchCV
from sklearn.metrics import classification_report
from sklearn.metrics import confusion_matrix, ConfusionMatrixDisplay
from sklearn.svm import SVC
import pickle
import time
from sklearn.decomposition import PCA
1. Review of Principle Component Analysis (as introduced in the pre-class assignment)#
As a class, we’ll spend a bit of time talk talking about how PCA works and why it might be useful for machine learning (as indicated by the “Problem Description” at the top of the notebook).
2. Review of Face Recognition from last time#
✅ For each of the following steps; run the code and discuss as a group what the code is doing. Look carefully at the results, is the model working well? Why/why not?
First, lets download and view the data:
sk_data = fetch_lfw_people(min_faces_per_person=50, resize=0.4)
feature_vectors = sk_data.data
class_labels = sk_data.target
categories = sk_data.target_names
n_samples, n_features = feature_vectors.shape
N, h, w = sk_data.images.shape
n_classes = len(categories)
def browse_images(images, labels, categories):
n = len(images)
def view_image(i):
plt.imshow(images[i], cmap=plt.cm.gray, interpolation='nearest')
plt.title('%s' % categories[labels[i]])
plt.axis('off')
plt.show()
interact(view_image, i=(0,n-1))
browse_images(sk_data.images, sk_data.target, sk_data.target_names)
Step 1: Splitting the dataset into training and testing sets#
Make train and test datasets as before.
# put your answer here
Step 2: Train an SVM Classifier based on the training dataset.#
As you may have learned in the last class, this training step with the 'rbf'
kernel can take a while (in some cases, it might take several minutes, depending on your computer and the complexity of the problem). To save some time, the instructors modified the code from last time to save the results of the training data to a file and have made this file available to you. Try running the model yourself first, but if you find that it is taking a long time (more than a couple minutes), you can interrupt your notebook and then use the saved data by setting rerun_training
to False
and downloading the trained model.
You should be able to download the pre-built trained model from here:
https://raw.githubusercontent.com/msu-cmse-courses/cmse202-supplemental-data/main/data/full_face_model.p
Once you’ve downloaded the file and make sure it’s in the same location as this notebook, you should be able to run the provided code. Again, if you which to run the classifer yourself, you can change the value of rerun_training
.
rerun_training = True
filename = 'full_face_model.p'
# Train a SVM classification model NOTE This can take ~ 5 minutes or more!!!!
#make some temporary variables so you can change this easily
tmp_vectors = train_vectors
tmp_labels = train_labels
start = time.time()
if rerun_training:
print("Fitting the classifier to the training set")
param_grid = {'C': [10.0, 50.0, 100.0, 500.0, 1000.0, 5000.0, 10000.0],
'gamma': [0.0005, 0.001, 0.005, 0.01]}
clf = GridSearchCV(SVC(kernel='rbf', class_weight='balanced'), param_grid, n_jobs= -1)
clf = clf.fit(tmp_vectors, tmp_labels)
print("Best estimator found by grid search:")
print(clf.best_estimator_)
#save the model to a file
pickle.dump(clf, open(filename, 'wb'))
else:
#read the model from a file
print("reading pickle file.")
clf = pickle.load(open(filename, 'rb'))
print("Best estimator found by grid search:")
print(clf.best_estimator_)
end = time.time()
print("Runtime",end - start)
✅ Question: What library and functions are used to save and load a python object in this example?
✎ Do This - Erase the contents of this cell and replace it with your answer to the above question! (double-click on this text to edit this cell, and hit shift+enter to save the text)
Step 3. Show the results of the classification on the testing dataset.#
# Quantitative evaluation of the model quality on the test set
#make some temporary variables so you can change this easily
predict_vectors = test_vectors
true_labels = test_labels
print("Predicting people's names on the test set")
pred_labels = clf.predict(predict_vectors)
print(classification_report(true_labels, pred_labels, zero_division=0))
# print(confusion_matrix(true_labels, pred_labels, labels=range(n_classes)))
ConfusionMatrixDisplay.from_estimator(clf, test_vectors, test_labels);
# ATTENTION: if the above line throws an error when you run this cell, try changing your Python kernel to 3.10
# If it still doesn't work, comment out the line and uncomment the "print(confusion_matrix(...))" line and re-run
def plot_gallery(images, true_titles, pred_titles, h, w, n_row=5, n_col=5):
"""Helper function to plot a gallery of portraits"""
plt.figure(figsize=(1.8 * n_col, 2.4 * n_row))
plt.subplots_adjust(bottom=0, left=.01, right=.99, top=.90, hspace=.35)
for i in range(n_row * n_col):
plt.subplot(n_row, n_col, i + 1)
plt.imshow(images[i].reshape((h, w)), cmap=plt.cm.gray)
plt.title('Pred='+str(categories[pred_titles[i]]), size=9)
plt.xlabel('Actual='+str(categories[true_titles[i]]), size=9)
plt.xticks(())
plt.yticks(())
plot_gallery(test_vectors, test_labels, pred_labels, h,w)
✅ Question: How does the classifier do based on the parameters it found from the GridSearchCV? Which labels seem to have the highest precision, recall, and f1-scores values? Why might this be the case?
✎ Do This - Erase the contents of this cell and replace it with your answer to the above question! (double-click on this text to edit this cell, and hit shift+enter to save the text)
3. Feature Selection using Principal Component Analysis#
Now we are going to go through the entire program above and use Principal Component Analysis (PCA) to select a much smaller set of better features. The following code will reduce the original picture feature vector to a feature vector with 4 (n_components). This is often called unsupervised feature extraction:
n_components = 4 # This is much less than the original n_features
print("Extracting the top %d eigenfaces from %d faces" % (n_components, train_vectors.shape[0]))
#Set up the pca object with the number of compoents we want to find
pca = PCA(n_components=n_components, whiten=True)
#Fit the training data to the pca model.
_ = pca.fit(train_vectors)
We now have a pca
object that has our data fit to it. We can use this object to transform the original training vectors consisting of the entire image into new training vectors using the transform
function:
pca_train_vectors = pca.transform(train_vectors)
pca_test_vectors = pca.transform(test_vectors)
print("Training set changed from a size of: ", train_vectors.shape, ' to: ', pca_train_vectors.shape)
print("Testing set changed from a size of: ", test_vectors.shape, ' to: ', pca_test_vectors.shape)
✅ Do This: We have now reduced the size of our feature vectors. Modify the code in Section 2 to substitute these two new reduced size vectors for the full size vectors (or, if you find it easier, create a new cell and copy and paste the code from above). This modification should replace both the training vectors and the testing vectors. Make sure you also consider the following:
Recalculate the
clf
model and do not load it from memory – this is important!You probably want to change the filename so you do not overwrite the instructors file
Make sure your changes ensure that the tmp_vectors and predict_vectors reference the new smaller vectors
✅ Question: What is the weighted average precision that the SVM algorithm achieves with only the first 4 feature vectors? And how long did this take to run?
✎ Do This - Erase the contents of this cell and replace it with your answer to the above question! (double-click on this text to edit this cell, and hit shift+enter to save the text)
4. Picking the “best” Feature Vectors#
If you remember from your pre-class assignment the PCA algorithm finds a transform of the data such that the first component contains the “most” information the second component contains the “second most” important information. How much information each component contains is actually included in the pca
object and can be expressed as a ratio from 0 (no information) to 1 (all information). Lets plot these values below:
#Lets plot the variance of the eigen values
plt.plot(pca.explained_variance_ratio_, marker="o")
Another way to look at this is we can sum up the total ratios and see how much our new set of features represents the “variance” in the original data:
total_variance = np.sum(pca.explained_variance_ratio_)*100
print("These %d eigenvectors account for a total of %d percent of the total variance in the original dataset"
% (n_components, total_variance))
✅ Question: How many components would we need to represent 90% of the variance in our original data? (Hint: modify Section 3 and change n_components to find a total_variance > 90)
✎ Do This - Erase the contents of this cell and replace it with your answer to the above question! (double-click on this text to edit this cell, and hit shift+enter to save the text)
✅ Do This: generate a plot that creates a bar graph that compares the weighted average precision achieved for each of the following:
original - The original image as a complete feature vector.
4 - The four “best” PCA components
90% - The components representing the top 90% of the PCA variance
#Put your code here
5. Eigenfaces (just for fun and knowledge)#
An eigenvector is a transform from the original image space into a new space along a single axis. You can actually think of these vectors as a weighted sum of the components of an original image. So the length of the eigenvectors are the same size as the original data. Let’s plot the gallery of the most significant eigenvectors:
eigenfaces = pca.components_.reshape((n_components, h, w))
def plot_gallery(images, titles, h, w, n_row=3, n_col=5):
"""Helper function to plot a gallery of portraits"""
plt.figure(figsize=(1.7 * n_col, 2.3 * n_row))
plt.subplots_adjust(bottom=0, left=.01, right=.99, top=.90, hspace=.35)
for i in range(n_row * n_col):
plt.subplot(n_row, n_col, i + 1)
plt.imshow(images[i].reshape((h, w)), cmap=plt.cm.gray)
plt.title(titles[i], size=12)
plt.xticks(())
plt.yticks(())
eigenface_titles = ["eigenface %d" % i for i in range(eigenfaces.shape[0])]
plot_gallery(eigenfaces, eigenface_titles, h, w, n_row=1, n_col=4)
These vectors are often called eigenfaces. Each image gives you information about where the greatest amount of variation occurs in the images. Some of the eigenfaces tell you more about the lighting in the image or about the orientation of the face and less about the actual facial features, but others highlight where the most variation is from face to face.
✅ Do this: Spend some time reflecting on the eigenfaces and see if you can make sense of what information of the original imagines might be captured by these eigenfaces. Talk with your group. If you’re curious about what more of the eigenfaces look like, try change the values for n_row
and n_col
in the last line of the code cell above.
6. Got extra time left over? Let’s inspect our choices for the GridSearchCV parameters#
What parameters were used the GridSearchCV
call? What values were found to produce the best fit? Is there anything concerning about the parameter ranges and the “best values”? What about the choice of kernel? Are there other kernels we could or should have tried?
✅ Do This: If you have extra time, consider changing some of the GridSearchCV
parameters and see if your results change.
Congratulations, we’re done!#
Now, you just need to submit this assignment by uploading it to the course Desire2Learn web page for today’s submission folder (Don’t forget to add your names in the first cell).
© Copyright 2023, The Department of Computational Mathematics, Science and Engineering at Michigan State University