Jupyter Notebook#
Lecture 20 - PCA#
# Everyone's favorite standard imports
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
%matplotlib inline
import time
import seaborn as sns
# ML imports we've used previously
from sklearn.model_selection import train_test_split
from sklearn.metrics import mean_squared_error
1. PCA on Penguins#
Artwork by @allison_horst
For this lab, we are going to again use the Palmer Penguins data set by Allison Horst, Alison Hill, and Kristen Gorman. You should have done this in a previous notebook, but if you don’t have the package installed to get the data, you can run
pip install palmerpenguins
to have access to the data.
from palmerpenguins import load_penguins
penguins = load_penguins()
penguins = penguins.dropna()
#Shuffle the data
# penguins = penguins.sample(frac=1)
penguins.head()
Before we get to the full version, let’s just take a look at two of the columns: flipper length and bill length. A nice thing we can do is to also color the data by which species label the data point has.
sns.scatterplot(x = penguins.bill_length_mm,
y = penguins.flipper_length_mm,
hue = penguins.species)
Before we run PCA, we’re going to restrict ourselves to the columns that are numeric.
penguins_num = penguins.select_dtypes(np.number)
penguins_num.head()
We will also standardize the data to make the visualization easier: we shift every column to have mean 0 and rescale it to have standard deviation 1.
p_normalized = (penguins_num - penguins_num.mean())/penguins_num.std()
p_normalized.head()
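As a quick sanity check on the standardization recipe, here is a minimal sketch on made-up data. The `toy` dataframe and its column names are hypothetical stand-ins, not the penguin columns; the point is only that each column should end up with mean 0 and standard deviation 1.

```python
import numpy as np
import pandas as pd

# Hypothetical stand-in for penguins_num: two numeric columns on very different scales
rng = np.random.default_rng(0)
toy = pd.DataFrame({
    "length_mm": rng.normal(45, 5, size=200),
    "mass_g": rng.normal(4200, 800, size=200),
})

# Same recipe as above: subtract column means, divide by column standard deviations
toy_normalized = (toy - toy.mean()) / toy.std()

# Column means should be numerically 0 and column standard deviations 1
print(toy_normalized.mean().round(6))
print(toy_normalized.std().round(6))
```

One thing to keep in mind: pandas' `.std()` uses the sample standard deviation (`ddof=1`), while scikit-learn's `StandardScaler` divides by the population standard deviation (`ddof=0`), so the two approaches give very slightly different numbers.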
PCA with just two input columns#
To try to draw pictures similar to what we just saw on the slides, we’ll first focus on two of the columns.
penguins_subset2 = p_normalized[['bill_length_mm', 'flipper_length_mm']]
penguins_subset2
We run PCA using the PCA class from scikit-learn.
from sklearn.decomposition import PCA
# Set up the PCA object
pca = PCA(n_components=2)
# Fit it using our data and transform the data into PC coordinates in one step.
# Note that fit_transform returns a NumPy array, not a dataframe.
pca_df = pca.fit_transform(penguins_subset2.values)
plt.scatter(pca_df[:,0], pca_df[:,1])
The pca.components_ attribute stores information about the lines we are going to project our data onto. Specifically, each row is a unit vector giving the direction of one of these lines.
pca.components_
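To see what the rows of `components_` look like, here is a small sketch on synthetic data. `toy_X`, `toy_pca`, and `W` are names made up for this illustration. It checks that each row has length 1 and that the rows are perpendicular to each other.

```python
import numpy as np
from sklearn.decomposition import PCA

rng = np.random.default_rng(1)
# Hypothetical correlated 2-D data standing in for the two penguin columns
x = rng.normal(size=300)
toy_X = np.column_stack([x, 0.7 * x + 0.5 * rng.normal(size=300)])

toy_pca = PCA(n_components=2).fit(toy_X)
W = toy_pca.components_           # shape (2, 2): one direction per row

print(np.linalg.norm(W, axis=1))  # length of each row
print(W @ W.T)                    # dot products between rows; close to the identity matrix
```

Because each row is a direction vector `(dx, dy)`, the slope of the corresponding line is `dy/dx`, which is the `comp[1]/comp[0]` used when drawing the lines.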
sns.scatterplot(data = penguins_subset2,
x = 'bill_length_mm',
y = 'flipper_length_mm',
hue = penguins.species)
for i, comp in enumerate(pca.components_):
    slope = comp[1]/comp[0]
    plt.plot(np.array([-2,2]), slope*np.array([-2,2]))
plt.axis('square')
A common way to look at the relative importance of the PCs is to draw these components as vectors with length based on the explained variance.
pca.explained_variance_
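If it helps to see where these numbers come from, the sketch below (again on synthetic data with made-up names) compares `explained_variance_` with the sample variance of each column of the transformed data. It also prints `explained_variance_ratio_`, which expresses the same information as fractions of the total variance.

```python
import numpy as np
from sklearn.decomposition import PCA

rng = np.random.default_rng(2)
# Hypothetical correlated 2-D data, not the penguin columns
x = rng.normal(size=300)
toy_X = np.column_stack([x, 0.7 * x + 0.5 * rng.normal(size=300)])

toy_pca = PCA(n_components=2)
scores = toy_pca.fit_transform(toy_X)

# Sample variance (ddof=1) of each PC column, next to what the PCA object reports
print(scores.var(axis=0, ddof=1))
print(toy_pca.explained_variance_)

# Fractions of the total variance; with every component kept these sum to 1
print(toy_pca.explained_variance_ratio_)
```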
sns.scatterplot(data = penguins_subset2,
x = 'bill_length_mm',
y = 'flipper_length_mm',
hue = penguins.species)
for i, (comp, var) in enumerate(zip(pca.components_, pca.explained_variance_)):
    slope = comp[1]/comp[0]
    plt.plot(np.array([-2,2]), slope*np.array([-2,2]))
    comp = comp * var  # scale component by its variance explanation power
    plt.plot(
        [0, comp[0]],
        [0, comp[1]],
        label=f"Component {i}",
        linewidth=5,
        color=f"C{i + 2}",
    )
plt.axis('square')
The next important pieces are the PCs themselves, which we can get from the pca object as follows. I’m going to put them in a dataframe to make drawing and visualization easier. Basically, \(PC_1\) is our \(Z_1\) in the slides, and \(PC_2\) is the \(Z_2\).
# The fit_transform function fits the PCA, then takes in bill,flipper data points
# and returns a PC1,PC2 coordinate for each one.
penguins_pca = pca.fit_transform(penguins_subset2)
penguins_pca = pd.DataFrame(data = penguins_pca, columns = ['PC1', 'PC2'])
penguins_pca.shape
penguins.species
This is the scatterplot of the data points transformed into the PC space.
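# penguins_pca has a fresh 0..n-1 index, but penguins kept its original row labels after dropna().
# Passing the species as a plain array avoids seaborn matching rows up by mismatched index labels.
sns.scatterplot(data = penguins_pca, x = 'PC1', y = 'PC2', hue = penguins.species.values)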
✅ Do this: What are the PC coordinates for the first data point (index 0)? Which quadrant would this point be drawn in?
# Your answer here
The PC’s can be thought of as how far along their associated line the point would be projected. Here’s one way to draw all the projections.
sns.scatterplot(data = penguins_subset2,
x = 'bill_length_mm',
y = 'flipper_length_mm',
hue = penguins.species)
# Show points projected onto the 1st PC line
X1 = penguins_pca.PC1*pca.components_[0,0]
Y1 = penguins_pca.PC1*pca.components_[0,1]
plt.scatter(X1,Y1, marker = '+', color = 'violet')
# Show points projected onto the 2nd PC line
X2 = penguins_pca.PC2*pca.components_[1,0]
Y2 = penguins_pca.PC2*pca.components_[1,1]
plt.scatter(X2,Y2, marker = 'o', color = 'pink')
plt.axis('square')
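The projections drawn above can also be checked numerically. The following is a sketch on synthetic data with made-up names (`toy_X`, `scores`, `proj1`, `proj2`), assuming the default `PCA` settings (no whitening). It shows that a PC score is the dot product of the centered point with the component direction, and that multiplying the score by the direction gives the projected point on that PC's line.

```python
import numpy as np
from sklearn.decomposition import PCA

rng = np.random.default_rng(3)
# Hypothetical correlated 2-D data, not the penguin columns
x = rng.normal(size=300)
toy_X = np.column_stack([x, 0.7 * x + 0.5 * rng.normal(size=300)])

toy_pca = PCA(n_components=2)
scores = toy_pca.fit_transform(toy_X)
W = toy_pca.components_

# A PC score is the dot product of the centered point with the component direction
centered = toy_X - toy_pca.mean_
manual_scores = centered @ W.T
print(np.allclose(manual_scores, scores))

# Projection of every point onto each PC line, as (x, y) coordinates
proj1 = scores[:, [0]] * W[0]
proj2 = scores[:, [1]] * W[1]

# With both components kept, the two projections add back up to the centered point
print(np.allclose(proj1 + proj2, centered))
```

In the penguin example the data was already standardized, so the column means are essentially 0 and the centering step makes no visible difference. That is why multiplying PC1 by the first row of `pca.components_` lands directly on the line through the origin.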
Below is code that emphasizes the projected points.
✅ Do this: The value of index below just picks out a different point in our data set. Mess around with this number. How do the X and star points move around as you change index?
sns.scatterplot(data = penguins_subset2,
x = 'bill_length_mm',
y = 'flipper_length_mm',
hue = penguins.species)
plt.axis('square')
for i, (comp, var) in enumerate(zip(pca.components_, pca.explained_variance_)):
    slope = comp[1]/comp[0]
    plt.plot(np.array([-2,2]), slope*np.array([-2,2]))
#===========
# Emphasize one point and its projections
#===========
index = 300 #<---------- play with this!
# Here's one data point
plt.scatter([penguins_subset2.iloc[index,0]],
[penguins_subset2.iloc[index,1]],
marker = 'D', color = 'black', s = 100, label = 'Data pt')
# Here's the projection of that point on PC1 (X shape)
plt.scatter([X1[index]], [Y1[index]],
marker = 'X', color = 'purple', s = 100, label = 'Project: PC1')
# And here's the projection of that point on PC2 (star)
plt.scatter([X2[index]], [Y2[index]],
marker = '*', color = 'purple', s = 100, label = 'Project: PC2')
plt.legend()
Everything we just did is great for understanding what the PCA is doing, but in reality, we’re usually going to be looking at the data in the transformed space.
✅ Do this: Make a scatter plot of PC1 and PC2. Color the points by penguins.species. What do you notice about how the points have moved from the (bill, flipper) scatter plot?
# Your code here
Penguins PCA with all columns#
We used only two columns above for visualization, but we can instead use all the input columns to run our PCA.
penguins_num.head()
pca = PCA(n_components=4)
penguins_pca_all = pca.fit_transform(penguins_num)
penguins_pca_all = pd.DataFrame(data = penguins_pca_all,
columns = ['PC1', 'PC2', 'PC3', 'PC4'])
penguins_pca_all
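Note that the cell above fits the PCA on penguins_num rather than on p_normalized, so the columns keep their original units. As a rough illustration of why that can matter, here is a sketch on made-up data (the `toy` columns are hypothetical, not the penguin measurements) that compares the explained variance ratios with and without standardizing first.

```python
import numpy as np
import pandas as pd
from sklearn.decomposition import PCA

rng = np.random.default_rng(4)
# Hypothetical independent columns on very different scales (not the penguin data)
toy = pd.DataFrame({
    "length_mm": rng.normal(45, 5, size=300),
    "depth_mm": rng.normal(17, 2, size=300),
    "mass_g": rng.normal(4200, 800, size=300),
})

# PCA on the raw columns: the column with the largest variance dominates
raw_ratio = PCA(n_components=3).fit(toy).explained_variance_ratio_

# PCA after standardizing: every column contributes on the same footing
toy_std = (toy - toy.mean()) / toy.std()
std_ratio = PCA(n_components=3).fit(toy_std).explained_variance_ratio_

print("raw:         ", raw_ratio.round(3))
print("standardized:", std_ratio.round(3))
```

In this toy setting nearly all of the raw variance sits in the first component simply because grams produce much bigger numbers than millimeters. Whether that is what you want depends on the question being asked.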
✅ Do this: Make a scatter plot of PC1 and PC2 using this new model, and again color the points by penguins.species. What do you notice about how the PC plot has changed from the previous setting?
# your code here
Congratulations, we’re done!#
Written by Dr. Liz Munch, Michigan State University
This work is licensed under a Creative Commons Attribution-NonCommercial 4.0 International License.