Jupyter Notebook

Jupyter Notebook#

Lecture 20 - PCR#

# Everyone's favorite standard imports
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
%matplotlib inline
import time

import seaborn as sns

# ML imports we've used previously
from sklearn.model_selection import train_test_split
from sklearn.metrics import mean_squared_error

PCR on Hitters Data#

Loading in the data#

Ok, here we go, let’s play with a baseball data set again. Note this cleanup is all the same as previously.

url = "https://msu-cmse-courses.github.io/CMSE381-S26/_downloads/7c9ee5baa268e98bd5a1cade1809d7fe/Hitters.csv"
df = pd.read_csv(url).dropna().drop('Player', axis = 1)
df.info()
dummies = pd.get_dummies(df[['League', 'Division', 'NewLeague']])

<class 'pandas.core.frame.DataFrame'>
Index: 263 entries, 1 to 321
Data columns (total 20 columns):
 #   Column     Non-Null Count  Dtype  
---  ------     --------------  -----  
 AtBat      263 non-null    int64  
 Hits       263 non-null    int64  
 HmRun      263 non-null    int64  
 Runs       263 non-null    int64  
 RBI        263 non-null    int64  
 Walks      263 non-null    int64  
 Years      263 non-null    int64  
 CAtBat     263 non-null    int64  
 CHits      263 non-null    int64  
 CHmRun     263 non-null    int64  
CRuns      263 non-null    int64  
CRBI       263 non-null    int64  
CWalks     263 non-null    int64  
League     263 non-null    object 
Division   263 non-null    object 
PutOuts    263 non-null    int64  
Assists    263 non-null    int64  
Errors     263 non-null    int64  
Salary     263 non-null    float64
NewLeague  263 non-null    object 
dtypes: float64(1), int64(16), object(3)
memory usage: 43.1+ KB

y = df.Salary

# Drop the column with the independent variable (Salary), and columns for which we created dummy variables
X_ = df.drop(['Salary', 'League', 'Division', 'NewLeague'], axis = 1).astype('float64')

# Define the feature set X.
X = pd.concat([X_, dummies[['League_N', 'Division_W', 'NewLeague_N']]], axis = 1)

X.info()

<class 'pandas.core.frame.DataFrame'>
Index: 263 entries, 1 to 321
Data columns (total 19 columns):
 #   Column       Non-Null Count  Dtype  
---  ------       --------------  -----  
 AtBat        263 non-null    float64
 Hits         263 non-null    float64
 HmRun        263 non-null    float64
 Runs         263 non-null    float64
 RBI          263 non-null    float64
 Walks        263 non-null    float64
 Years        263 non-null    float64
 CAtBat       263 non-null    float64
 CHits        263 non-null    float64
 CHmRun       263 non-null    float64
CRuns        263 non-null    float64
CRBI         263 non-null    float64
CWalks       263 non-null    float64
PutOuts      263 non-null    float64
Assists      263 non-null    float64
Errors       263 non-null    float64
League_N     263 non-null    bool   
Division_W   263 non-null    bool   
NewLeague_N  263 non-null    bool   
dtypes: bool(3), float64(16)
memory usage: 35.7 KB

Principal Component Regression#

Ok, let’s do Principal Components Regression. Scikit-learn doesn’t have a built in function to do PCR (aka PCA and then regression) but it’s just as easy for us to do it ourselves.

First step, let’s figure out the PCA function.

from sklearn.decomposition import PCA
from sklearn.preprocessing import scale #<--- this does the scaling of variables for us

pca = PCA()
print(X.shape)
X_PC = pca.fit_transform(scale(X))
print(X_PC.shape)

(263, 19)
(263, 19)

“But Professor, you said PCA was supposed to do dimension reduction, why is my feature output the same size?”

Glad you asked, young data scientist. The PCA command outputs all of the PCs, all the way up through \(p=19\) the original number of dimensions.

So, if I want the first three PCs, I can get them out as follows. I’ll put it in a data frame just to add column labels, but you don’t need to do that.

First3PCs = X_PC[:,:3]

pd.DataFrame(First3PCs, columns = ['Z1','Z2', 'Z3'])

	Z1	Z2	Z3
0	-0.009649	-1.870522	1.265145
1	0.411434	2.429422	-0.909193
2	3.466822	-0.825947	0.555469
3	-2.558317	0.230984	0.519642
4	1.027702	1.573537	1.331382
...	...	...	...
258	-0.331167	0.165663	0.775890
259	3.246183	0.543315	1.209150
260	-1.094609	0.751386	-0.935937
261	1.977682	2.184517	-0.282012
262	1.833002	0.301097	-1.055323

263 rows × 3 columns

Now we can just do regression on the PCs like before.

from sklearn.linear_model import LinearRegression

regr = LinearRegression()
regr.fit(X_PC[:,:3], y)
mean_squared_error(y,regr.predict(X_PC[:,:3]))

117234.531905816

✅ Do this: My code above contains the rookie mistake of only reporting training error. Write modified code to return the 10-fold CV error of linear regression on the first 3 PCs.

# Your code here #

# You'll probably want this....
from sklearn.model_selection import cross_val_score

✅ Do this: Take the code you figured out above to get the score for 10-fold CV on the first 3 PCs, and include it in the for-loop below to see how the MSE changes as the number of PCs you use changes.

n = len(X_PC)
regr = LinearRegression()
mse = []

# Calculate MSE using CV for the 19 principle components, adding one component at a time.
for i in np.arange(1, 20): # i is the number of principal components to use each time
    # ====
    score = 0 # Your code from above goes here!
    # ====

    mse.append(score)
    
# Plot results    
plt.plot(mse, '-v')
plt.xlabel('Number of principal components in regression')
plt.ylabel('MSE')
plt.title('Predicting Salary')
plt.xlim(xmin=-1);

../../../_images/71c15e2858175c63853871d78772eec194b2c07998e1ed9010d876e684aebdde.png

✅ Q: Based on the graph you generated above, how many PCs do you think you should use?

Note: Based on graphs I generated, I can see a few different options for what I might decide to use for number of principal components. This is one of those cases where you potentially have a different answer and/or reasoning from your neighbor, so I enourage you to talk this one through with your group.

##Your answer here##

✅ Q: Of the models you’ve built so far (Ridge, Lasso, and PCR), which would you choose to use and why?

Note: This goes in the no-one-right-answer bucket. Go argue with your group.

##Your answer here##

Congratulations, we’re done!#

Initially created by Dr. Liz Munch, modified by Dr. Lianzhang Bao, Michigan State University

This work is licensed under a Creative Commons Attribution-NonCommercial 4.0 International License.