Lecture 20 - PCR#

# Everyone's favorite standard imports
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
%matplotlib inline
import time

import seaborn as sns

# ML imports we've used previously
from sklearn.model_selection import train_test_split
from sklearn.metrics import mean_squared_error

PCR on Hitters Data#

Loading in the data#

Ok, here we go: let’s play with the baseball data set again. Note that this cleanup is the same as in previous lectures.

df = pd.read_csv('../../DataSets/Hitters.csv').dropna().drop('Player', axis = 1)
df.info()
dummies = pd.get_dummies(df[['League', 'Division', 'NewLeague']])
y = df.Salary

# Drop the column with the dependent variable (Salary) and the columns for which we created dummy variables
X_ = df.drop(['Salary', 'League', 'Division', 'NewLeague'], axis = 1).astype('float64')

# Define the feature set X.
X = pd.concat([X_, dummies[['League_N', 'Division_W', 'NewLeague_N']]], axis = 1)

X.info()

Principal Component Regression#

Ok, let’s do Principal Components Regression. Scikit-learn doesn’t have a built-in function for PCR (that is, PCA followed by linear regression), but it’s easy enough to do it ourselves.

First step, let’s figure out the PCA function.

from sklearn.decomposition import PCA
from sklearn.preprocessing import scale #<--- this does the scaling of variables for us
pca = PCA()
print(X.shape)
X_PC = pca.fit_transform(scale(X))
print(X_PC.shape)

“But Dr. Munch, you said PCA was supposed to do dimension reduction, why is my feature output the same size?”

Glad you asked, young data scientist. The PCA command outputs all of the PCs, all the way up through \(p = 19\), the original number of dimensions.
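If you want a feel for how much each of those 19 PCs actually matters, one quick optional check is to look at the explained variance ratios stored on the fitted pca object:

# Fraction of the total variance captured by each PC, in decreasing order
print(pca.explained_variance_ratio_)

# Cumulative fraction of variance explained by the first k PCs
print(np.cumsum(pca.explained_variance_ratio_))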

So, if I want the first three PCs, I can get them out as follows. I’ll put them in a data frame just to add column labels, but you don’t need to do that.

First3PCs = X_PC[:,:3]

pd.DataFrame(First3PCs, columns = ['Z1','Z2', 'Z3'])

Now we can just do regression on the PCs like before.

from sklearn.linear_model import LinearRegression

regr = LinearRegression()
regr.fit(X_PC[:,:3], y)
mean_squared_error(y,regr.predict(X_PC[:,:3]))

Do this: My code above contains the rookie mistake of only reporting training error. Write modified code to return the 10-fold CV error of linear regression on the first 3 PCs.

# Your code here #



# You'll probably want this....
from sklearn.model_selection import cross_val_score
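If you get stuck, here is one possible sketch (just one way to set it up, not the official solution): cross_val_score with scoring='neg_mean_squared_error' returns the negative MSE on each fold, so negate and average to get a 10-fold CV estimate.

# A sketch of one possible approach, not necessarily the only (or best) one.
# cross_val_score returns *negative* MSE for each of the 10 folds with this
# scoring choice, so flip the sign and average to get the CV MSE.
scores = cross_val_score(LinearRegression(), X_PC[:, :3], y,
                         cv=10, scoring='neg_mean_squared_error')
cv_mse = -scores.mean()
print(cv_mse)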

Do this: Take the code you figured out above to get the score for 10-fold CV on the first 3 PCs, and include it in the for-loop below to see how the MSE changes as the number of PCs you use changes.

n = len(X_PC)
regr = LinearRegression()
mse = []

# Calculate MSE using CV for the 19 principal components, adding one component at a time.
for i in np.arange(1, 20): # i is the number of principal components to use each time
    # ====
    score = 0 # Your code from above goes here!
    # ====

    mse.append(score)
    
# Plot results: MSE against the number of components used (1 through 19)
plt.plot(np.arange(1, 20), mse, '-v')
plt.xlabel('Number of principal components in regression')
plt.ylabel('MSE')
plt.title('Predicting Salary')
plt.xlim(xmin=-1);
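If you also want a numeric summary to go with the plot, one simple heuristic (assuming your mse list ends up holding one CV score per component count, 1 through 19) is to find the minimizer:

# The component count with the lowest CV MSE. This is only a heuristic;
# a nearly-as-good score with fewer components may be the better choice.
best_k = np.argmin(mse) + 1   # +1 because mse[0] corresponds to 1 component
print(best_k)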

Q: Based on the graph you generated above, how many PCs do you think you should use?

Note: Based on the graphs I generated, I can see a few different options for what I might choose as the number of principal components. This is one of those cases where you may well have a different answer and/or reasoning than your neighbor, so I encourage you to talk this one through with your group.

Your answer here

Q: Of the models you’ve built so far (Ridge, Lasso, and PCR), which would you choose to use and why?

Note: This goes in the no-one-right-answer bucket. Go argue with your group.

Your answer here


Congratulations, we’re done!#

Written by Dr. Liz Munch, Michigan State University

This work is licensed under a Creative Commons Attribution-NonCommercial 4.0 International License.