Jupyter Notebook#
Lecture 20 - PCR#
# Everyone's favorite standard imports
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
%matplotlib inline
import time
import seaborn as sns
# ML imports we've used previously
from sklearn.model_selection import train_test_split
from sklearn.metrics import mean_squared_error
PCR on Hitters Data#
Loading in the data#
Ok, here we go, let’s play with a baseball data set again. Note this cleanup is all the same as previously.
url = "https://msu-cmse-courses.github.io/CMSE381-S26/_downloads/7c9ee5baa268e98bd5a1cade1809d7fe/Hitters.csv"
df = pd.read_csv(url).dropna().drop('Player', axis = 1)
df.info()
dummies = pd.get_dummies(df[['League', 'Division', 'NewLeague']])
<class 'pandas.core.frame.DataFrame'>
Index: 263 entries, 1 to 321
Data columns (total 20 columns):
# Column Non-Null Count Dtype
--- ------ -------------- -----
0 AtBat 263 non-null int64
1 Hits 263 non-null int64
2 HmRun 263 non-null int64
3 Runs 263 non-null int64
4 RBI 263 non-null int64
5 Walks 263 non-null int64
6 Years 263 non-null int64
7 CAtBat 263 non-null int64
8 CHits 263 non-null int64
9 CHmRun 263 non-null int64
10 CRuns 263 non-null int64
11 CRBI 263 non-null int64
12 CWalks 263 non-null int64
13 League 263 non-null object
14 Division 263 non-null object
15 PutOuts 263 non-null int64
16 Assists 263 non-null int64
17 Errors 263 non-null int64
18 Salary 263 non-null float64
19 NewLeague 263 non-null object
dtypes: float64(1), int64(16), object(3)
memory usage: 43.1+ KB
y = df.Salary
# Drop the column with the independent variable (Salary), and columns for which we created dummy variables
X_ = df.drop(['Salary', 'League', 'Division', 'NewLeague'], axis = 1).astype('float64')
# Define the feature set X.
X = pd.concat([X_, dummies[['League_N', 'Division_W', 'NewLeague_N']]], axis = 1)
X.info()
<class 'pandas.core.frame.DataFrame'>
Index: 263 entries, 1 to 321
Data columns (total 19 columns):
# Column Non-Null Count Dtype
--- ------ -------------- -----
0 AtBat 263 non-null float64
1 Hits 263 non-null float64
2 HmRun 263 non-null float64
3 Runs 263 non-null float64
4 RBI 263 non-null float64
5 Walks 263 non-null float64
6 Years 263 non-null float64
7 CAtBat 263 non-null float64
8 CHits 263 non-null float64
9 CHmRun 263 non-null float64
10 CRuns 263 non-null float64
11 CRBI 263 non-null float64
12 CWalks 263 non-null float64
13 PutOuts 263 non-null float64
14 Assists 263 non-null float64
15 Errors 263 non-null float64
16 League_N 263 non-null bool
17 Division_W 263 non-null bool
18 NewLeague_N 263 non-null bool
dtypes: bool(3), float64(16)
memory usage: 35.7 KB
Principal Component Regression#
Ok, let’s do Principal Components Regression. Scikit-learn doesn’t have a built in function to do PCR (aka PCA and then regression) but it’s just as easy for us to do it ourselves.
First step, let’s figure out the PCA function.
from sklearn.decomposition import PCA
from sklearn.preprocessing import scale #<--- this does the scaling of variables for us
pca = PCA()
print(X.shape)
X_PC = pca.fit_transform(scale(X))
print(X_PC.shape)
(263, 19)
(263, 19)
“But Professor, you said PCA was supposed to do dimension reduction, why is my feature output the same size?”
Glad you asked, young data scientist. The PCA command outputs all of the PCs, all the way up through \(p=19\) the original number of dimensions.
So, if I want the first three PCs, I can get them out as follows. I’ll put it in a data frame just to add column labels, but you don’t need to do that.
First3PCs = X_PC[:,:3]
pd.DataFrame(First3PCs, columns = ['Z1','Z2', 'Z3'])
| Z1 | Z2 | Z3 | |
|---|---|---|---|
| 0 | -0.009649 | -1.870522 | 1.265145 |
| 1 | 0.411434 | 2.429422 | -0.909193 |
| 2 | 3.466822 | -0.825947 | 0.555469 |
| 3 | -2.558317 | 0.230984 | 0.519642 |
| 4 | 1.027702 | 1.573537 | 1.331382 |
| ... | ... | ... | ... |
| 258 | -0.331167 | 0.165663 | 0.775890 |
| 259 | 3.246183 | 0.543315 | 1.209150 |
| 260 | -1.094609 | 0.751386 | -0.935937 |
| 261 | 1.977682 | 2.184517 | -0.282012 |
| 262 | 1.833002 | 0.301097 | -1.055323 |
263 rows × 3 columns
Now we can just do regression on the PCs like before.
from sklearn.linear_model import LinearRegression
regr = LinearRegression()
regr.fit(X_PC[:,:3], y)
mean_squared_error(y,regr.predict(X_PC[:,:3]))
117234.531905816
✅ Do this: My code above contains the rookie mistake of only reporting training error. Write modified code to return the 10-fold CV error of linear regression on the first 3 PCs.
# Your code here #
# You'll probably want this....
from sklearn.model_selection import cross_val_score
✅ Do this: Take the code you figured out above to get the score for 10-fold CV on the first 3 PCs, and include it in the for-loop below to see how the MSE changes as the number of PCs you use changes.
n = len(X_PC)
regr = LinearRegression()
mse = []
# Calculate MSE using CV for the 19 principle components, adding one component at a time.
for i in np.arange(1, 20): # i is the number of principal components to use each time
# ====
score = 0 # Your code from above goes here!
# ====
mse.append(score)
# Plot results
plt.plot(mse, '-v')
plt.xlabel('Number of principal components in regression')
plt.ylabel('MSE')
plt.title('Predicting Salary')
plt.xlim(xmin=-1);
✅ Q: Based on the graph you generated above, how many PCs do you think you should use?
Note: Based on graphs I generated, I can see a few different options for what I might decide to use for number of principal components. This is one of those cases where you potentially have a different answer and/or reasoning from your neighbor, so I enourage you to talk this one through with your group.
##Your answer here##
✅ Q: Of the models you’ve built so far (Ridge, Lasso, and PCR), which would you choose to use and why?
Note: This goes in the no-one-right-answer bucket. Go argue with your group.
##Your answer here##
Congratulations, we’re done!#
Initially created by Dr. Liz Munch, modified by Dr. Lianzhang Bao, Michigan State University

This work is licensed under a Creative Commons Attribution-NonCommercial 4.0 International License.