Jupyter Notebook

Lecture 20 - PCA

# Everyone's favorite standard imports
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
%matplotlib inline
import time

import seaborn as sns

# ML imports we've used previously
from sklearn.model_selection import train_test_split
from sklearn.metrics import mean_squared_error

1. PCA on Penguins

Palmer Penguins Picture

Artwork by @allison_horst

For this lab, we are going to again use the Palmer Penguins data set by Allison Horst, Alison Hill, and Kristen Gorman. You should have done this in a previous notebook, but if you don’t have the package installed to get the data, you can run

pip install palmerpenguins

to have access to the data.

from palmerpenguins import load_penguins
penguins = load_penguins()
penguins = penguins.dropna()

#Shuffle the data
# penguins = penguins.sample(frac=1)
penguins.head()

Before we get to the full version, let’s just take a look at two of the columns: flipper length and bill length. A nice thing we can do is to also color the data by which species label the data point has.

sns.scatterplot(x = penguins.bill_length_mm, 
                y = penguins.flipper_length_mm, 
                hue = penguins.species)

Before we get to PCA, we're going to restrict ourselves to the columns that are numeric.

penguins_num = penguins.select_dtypes(np.number)
penguins_num.head()

We will also standardize the data to make the visualization easier. This means shifting each column to have mean 0 (mean centering) and then scaling it to have standard deviation 1.

p_normalized = (penguins_num - penguins_num.mean())/penguins_num.std()
p_normalized.head()
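As a quick sanity check on the standardization formula, here is a minimal sketch on made-up data. The `toy` DataFrame and its column names are hypothetical stand-ins for `penguins_num`, not the real penguin measurements.

```python
import numpy as np
import pandas as pd

# Hypothetical stand-in for penguins_num: two columns on very different scales
rng = np.random.default_rng(0)
toy = pd.DataFrame({
    "length_mm": rng.normal(45, 5, size=200),
    "mass_g": rng.normal(4200, 800, size=200),
})

# Same formula as above: subtract each column's mean, divide by its std
toy_normalized = (toy - toy.mean()) / toy.std()

# Every column should now have mean ~0 and standard deviation 1
print(toy_normalized.mean().round(6))
print(toy_normalized.std().round(6))
```

Note that pandas' `.std()` divides by n - 1 by default, so the columns come out with sample standard deviation exactly 1.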

PCA with just two input columns

To try to draw pictures similar to what we just saw on the slides, we’ll first focus on two of the columns.

penguins_subset2 = p_normalized[['bill_length_mm', 'flipper_length_mm']]
penguins_subset2

We run PCA using the PCA class from scikit-learn.

from sklearn.decomposition import PCA
# Set up the PCA object
pca = PCA(n_components=2)

# Fit it using our data and get the transformed (PC) coordinates.
# fit_transform does the fitting too, so a separate pca.fit call isn't needed.
pca_df = pca.fit_transform(penguins_subset2.values)
plt.scatter(pca_df[:,0], pca_df[:,1])

The pca.components_ attribute stores the directions of the lines we are going to project our data onto. Specifically, each row is a vector pointing along one of these lines.

pca.components_
sns.scatterplot(data = penguins_subset2, 
                x = 'bill_length_mm', 
                y = 'flipper_length_mm', 
                hue = penguins.species)

for i, comp in enumerate(pca.components_):
    slope = comp[1]/comp[0]
    plt.plot(np.array([-2,2]), slope*np.array([-2,2]))
    
plt.axis('square')
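To see where directions like these come from, here is a minimal NumPy sketch that computes them as eigenvectors of the covariance matrix on made-up correlated data. This is one textbook route to the same directions, not necessarily how scikit-learn computes them internally, and the signs of the rows may differ from what `pca.components_` reports.

```python
import numpy as np

rng = np.random.default_rng(1)

# Hypothetical correlated 2D data standing in for the two standardized columns
x = rng.normal(size=300)
X = np.column_stack([x, 0.8 * x + 0.6 * rng.normal(size=300)])
X = X - X.mean(axis=0)  # center each column

cov = np.cov(X, rowvar=False)            # 2x2 sample covariance matrix
eigvals, eigvecs = np.linalg.eigh(cov)   # eigenvalues ascending, eigenvectors in columns
order = np.argsort(eigvals)[::-1]        # reorder so the largest variance comes first
components = eigvecs[:, order].T         # one direction per row

# Each row has length 1, and the two rows are perpendicular
lengths = np.linalg.norm(components, axis=1)
dot = components[0] @ components[1]
print(components)
print(lengths, dot)
```

Because each row is a unit vector, `comp[1]/comp[0]` in the plotting loop above is just the slope of the line through the origin in that direction.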

A common way to look at the relative importance of the PCs is to draw the components as vectors whose lengths are scaled by the explained variance.

pca.explained_variance_
sns.scatterplot(data = penguins_subset2, 
                x = 'bill_length_mm', 
                y = 'flipper_length_mm', 
                hue = penguins.species)

for i, (comp, var) in enumerate(zip(pca.components_, pca.explained_variance_)):
    slope = comp[1]/comp[0]
    plt.plot(np.array([-2,2]), slope*np.array([-2,2]))
    
    comp = comp * var  # scale component by its variance explanation power
    plt.plot(
        [0, comp[0]],
        [0, comp[1]],
        label=f"Component {i}",
        linewidth=5,
        color=f"C{i + 2}",
    )

plt.axis('square')
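The explained variance has a concrete meaning: it is the variance of the data after projecting onto each direction. Here is a minimal NumPy sketch on made-up data that checks this, using the covariance eigenvalues as a stand-in for `pca.explained_variance_`.

```python
import numpy as np

rng = np.random.default_rng(2)

# Hypothetical correlated 2D data, centered
x = rng.normal(size=300)
X = np.column_stack([x, 0.8 * x + 0.6 * rng.normal(size=300)])
X = X - X.mean(axis=0)

# Directions and variances from the covariance matrix, largest first
eigvals, eigvecs = np.linalg.eigh(np.cov(X, rowvar=False))
order = np.argsort(eigvals)[::-1]
eigvals, components = eigvals[order], eigvecs[:, order].T

# Project the data onto each direction and measure the variance (n - 1 denominator)
scores = X @ components.T
score_var = scores.var(axis=0, ddof=1)

# Fraction of the total variance captured by each direction
ratio = eigvals / eigvals.sum()
print(score_var, eigvals)
print(ratio)
```

The two printed variance arrays should agree, and the ratios sum to 1, which is why a long arrow for the first component means most of the spread lies along that line.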

The next important pieces are the PCs themselves, which we can get from the pca object as follows. I’m going to put them in a dataframe to make drawing and visualization easier. Basically, \(PC_1\) is our \(Z_1\) in the slides, and \(PC_2\) is the \(Z_2\).

# The transform function takes in bill,flipper data points, 
# and returns a PC1,PC2 coordinate for each one. 
penguins_pca = pca.fit_transform(penguins_subset2)
penguins_pca = pd.DataFrame(data = penguins_pca, columns = ['PC1', 'PC2'])
penguins_pca.shape
penguins.species
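Under the hood, the PC coordinates of a point are just dot products of the centered point with each component direction. Here is a minimal NumPy sketch of that idea on made-up data, using an SVD to get the directions. The variable names are hypothetical, and signs may differ from scikit-learn's output.

```python
import numpy as np

rng = np.random.default_rng(3)

# Hypothetical correlated 2D data standing in for (bill, flipper)
x = rng.normal(size=300)
X = np.column_stack([x, 0.8 * x + 0.6 * rng.normal(size=300)])

mean = X.mean(axis=0)
Xc = X - mean                                   # center the data first

# Rows of Vt are the principal directions, ordered by decreasing variance
U, S, Vt = np.linalg.svd(Xc, full_matrices=False)

# All PC coordinates at once: one row per point, columns are PC1 and PC2
scores = Xc @ Vt.T

# The same thing for a single point, written out as two dot products
point = Xc[0]
pc1 = point @ Vt[0]
pc2 = point @ Vt[1]
print(scores[0], pc1, pc2)
```

The sign of each coordinate tells you which side of the origin the point lands on along that line, which is what decides its quadrant in the PC plot.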

This is the scatterplot of the data points transformed into the PC space.

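# penguins_pca has a fresh 0..n-1 index, while penguins kept its original row
# labels after dropna(), so pass the species as a plain array to match rows by position.
sns.scatterplot(data = penguins_pca, x = 'PC1', y = 'PC2', hue = penguins.species.values)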
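One thing to watch for when coloring `penguins_pca` by `penguins.species`: the two objects do not share row labels, because `dropna()` keeps the original index while a DataFrame built from an array gets a fresh one. pandas aligns on labels, and depending on the seaborn version, a Series passed as `hue` may be matched the same way. Here is a minimal sketch of the pitfall on a tiny made-up table (all names hypothetical).

```python
import numpy as np
import pandas as pd

# Hypothetical table with one missing value, standing in for the raw penguins data
raw = pd.DataFrame({
    "species": ["A", "B", "C", "D"],
    "value": [1.0, np.nan, 3.0, 4.0],
})
clean = raw.dropna()  # keeps the original row labels 0, 2, 3

# A frame built from a NumPy array gets fresh row labels 0, 1, 2
scores = pd.DataFrame(clean[["value"]].to_numpy(), columns=["PC1"])

# Matching by label: rows are paired up by index, so labels get scrambled
misaligned = scores.assign(species=clean["species"])

# Matching by position: converting to an array drops the index
aligned = scores.assign(species=clean["species"].to_numpy())

print(misaligned)
print(aligned)
```

Passing `.values` (or calling `reset_index(drop=True)` on the cleaned data) avoids the mismatch.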

Do this: What are the PC coordinates for the first data point (index 0)? Which quadrant would this point be drawn in?

# Your answer here

Each PC coordinate can be thought of as how far along its associated line the point lands when it is projected onto that line. Here’s one way to draw all the projections.

sns.scatterplot(data = penguins_subset2, 
                x = 'bill_length_mm', 
                y = 'flipper_length_mm', 
                hue = penguins.species)


# Show points projected onto the 1st PC line
X1 = penguins_pca.PC1*pca.components_[0,0]
Y1 = penguins_pca.PC1*pca.components_[0,1]

plt.scatter(X1,Y1, marker = '+', color = 'violet')


# Show points projected onto the 2nd PC line
X2 = penguins_pca.PC2*pca.components_[1,0]
Y2 = penguins_pca.PC2*pca.components_[1,1]

plt.scatter(X2,Y2, marker = 'o', color = 'pink')
plt.axis('square')
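The multiplication used above (PC coordinate times component vector) places each point on its PC line. Here is a minimal NumPy sketch on made-up data checking two facts about those projections: what is left over after projecting onto PC1 is perpendicular to the PC1 line, and in 2D the two projections add back up to the original centered point.

```python
import numpy as np

rng = np.random.default_rng(4)

# Hypothetical correlated 2D data, centered
x = rng.normal(size=300)
X = np.column_stack([x, 0.8 * x + 0.6 * rng.normal(size=300)])
Xc = X - X.mean(axis=0)

# Principal directions (rows of Vt) and PC coordinates
_, _, Vt = np.linalg.svd(Xc, full_matrices=False)
scores = Xc @ Vt.T

# Projection of every point onto each PC line: coordinate times unit direction
proj1 = np.outer(scores[:, 0], Vt[0])
proj2 = np.outer(scores[:, 1], Vt[1])

# What is left after projecting onto PC1 has no part along the PC1 direction
residual_along_pc1 = (Xc - proj1) @ Vt[0]

# With two components in 2D, the two projections sum to the original point
reconstruction_error = np.abs(proj1 + proj2 - Xc).max()
print(np.abs(residual_along_pc1).max(), reconstruction_error)
```

Both printed numbers should be zero up to floating point error.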

Below is code that emphasizes the projected points.

Do this: The value of index below picks out a single point in our data set. Mess around with this number. How do the X and star points move around as you change index?

sns.scatterplot(data = penguins_subset2, 
                x = 'bill_length_mm', 
                y = 'flipper_length_mm', 
                hue = penguins.species)
plt.axis('square')

for i, (comp, var) in enumerate(zip(pca.components_, pca.explained_variance_)):
    slope = comp[1]/comp[0]
    plt.plot(np.array([-2,2]), slope*np.array([-2,2]))

#===========
# Emphasize one point and its projections
#===========

index = 300 #<---------- play with this!

# Here's one data point
plt.scatter([penguins_subset2.iloc[index,0]],
            [penguins_subset2.iloc[index,1]], 
            marker = 'D', color = 'black', s = 100, label = 'Data pt')

# Here's the projection of that point on PC1 (X shape)
plt.scatter([X1[index]], [Y1[index]], 
           marker = 'X', color = 'purple', s = 100, label = 'Project: PC1')

# And here's the projection of that point on PC2 (star)
plt.scatter([X2[index]], [Y2[index]], 
           marker = '*', color = 'purple', s = 100, label = 'Project: PC2')

plt.legend()

Everything we just did is great for understanding what the PCA is doing, but in reality, we’re usually going to be looking at the data in the transformed space.

Do this: Make a scatter plot of PC1 and PC2. Color the points by penguins.species. What do you notice about how the points have moved from the (bill, flipper) scatter plot?

# Your code here

Penguins PCA with all columns

We used only two columns above for visualization, but we can instead use all the input columns to run our PCA.

penguins_num.head()
pca = PCA(n_components=4)
penguins_pca_all = pca.fit_transform(penguins_num)
penguins_pca_all = pd.DataFrame(data = penguins_pca_all, 
                                columns = ['PC1', 'PC2', 'PC3', 'PC4'])
penguins_pca_all
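Notice that the cell above fits on penguins_num, the unstandardized columns, rather than p_normalized. PCA is sensitive to the scale of each column, and the following minimal NumPy sketch on made-up data shows how. The `small` and `large` columns are hypothetical, chosen so that one has a spread about 1000 times the other.

```python
import numpy as np

rng = np.random.default_rng(5)
n = 500

# Two related hypothetical columns on very different scales
small = rng.normal(0, 1, size=n)
large = 1000 * (0.5 * small + rng.normal(0, 1, size=n))
X = np.column_stack([small, large])


def first_direction(data):
    """Return the first principal direction of data (rows are samples)."""
    centered = data - data.mean(axis=0)
    _, _, Vt = np.linalg.svd(centered, full_matrices=False)
    return Vt[0]


# On the raw data, the first direction points almost entirely along `large`
raw_dir = first_direction(X)

# After standardizing, both columns contribute equally (up to sign)
X_std = (X - X.mean(axis=0)) / X.std(axis=0, ddof=1)
std_dir = first_direction(X_std)

print(raw_dir)
print(std_dir)
```

Keep this in mind when comparing PC plots made from raw and standardized inputs.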

Do this: Make a scatter plot of PC1 and PC2 using this new model, and again color the points by penguins.species. What do you notice about how the PC plot has changed from the previous setting?

# your code here

Congratulations, we’re done!

Written by Dr. Liz Munch, Michigan State University

Creative Commons License
This work is licensed under a Creative Commons Attribution-NonCommercial 4.0 International License.