
Ch 6.3: PCR

Lecture 20 - CMSE 381
Michigan State University
Dept of Computational Mathematics, Science & Engineering
Wed, Mar 11, 2026
Announcements

Last time:

This lecture:

Announcements:

Course schedule for weeks 11-20, listing topics like Logistic Regression, PCA, and key dates including Midterm and Spring Break.

Section 1

Previously…
Shrinkage

Find β to minimize

RSS = \sum_{i=1}^{n} \Bigl( y_i - \beta_0 - \sum_{j=1}^{p} \beta_j x_{ij} \Bigr)^2

subject to:

Least Squares:
No constraints

Ridge:
\sum_{j=1}^{p} \beta_j^2 \le s

The Lasso:
\sum_{j=1}^{p} |\beta_j| \le s
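In practice these constrained problems are usually fit in their penalized (Lagrangian) form, where a larger penalty α plays the role of a smaller budget s. A minimal scikit-learn sketch on made-up, standardized toy data (not the course's data):

```python
import numpy as np
from sklearn.linear_model import Ridge, Lasso

rng = np.random.default_rng(0)
X = rng.normal(size=(100, 5))                                   # toy standardized predictors
y = X @ np.array([3.0, 0.0, -2.0, 0.0, 1.0]) + rng.normal(size=100)

# Larger alpha corresponds to a smaller budget s in the constraint form.
ridge = Ridge(alpha=10.0).fit(X, y)
lasso = Lasso(alpha=0.5).fit(X, y)

print("ridge coefficients:", ridge.coef_)   # shrunk toward zero, none exactly zero
print("lasso coefficients:", lasso.coef_)   # some coefficients set exactly to zero
```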
Two plots showing standardized ridge regression coefficients shrinking toward zero as λ increases (left) or as the norm ratio decreases (right).

Two plots showing standardized Lasso regression coefficients shrinking to exactly zero as λ increases (left) or as the norm ratio decreases (right).

Linear transformation of predictors

Original Predictors:
X_1, \cdots, X_p

New Predictors:
Z_1, \cdots, Z_M

Z_m = \sum_{j=1}^{p} \phi_{jm} X_j
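Since each Z_m is just a linear combination of the columns of X, the whole transformation is one matrix product. A small numpy sketch, assuming the loadings φ_{jm} are stored in a hypothetical p × M matrix Phi (toy data, not the course's):

```python
import numpy as np

rng = np.random.default_rng(1)
X = rng.normal(size=(50, 4))      # n = 50 observations of p = 4 predictors (toy data)
Phi = rng.normal(size=(4, 2))     # hypothetical loadings phi_{jm}, shape p x M with M = 2

Xc = X - X.mean(axis=0)           # center each predictor first
Z = Xc @ Phi                      # Z[i, m] = sum_j phi_{jm} * Xc[i, j]
print(Z.shape)                    # (50, 2): the M new predictors Z_1, ..., Z_M
```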
The goal:
PCA - First PC

Scatter plots of 2D data with projection lines at 45, 90, 135, and 180 degrees, each paired with a histogram of the data projected onto that line.
Projection onto first PC

Scatter plot of Ad Spending versus Population with a first principal component line and
dashed lines showing data projections.

Z_1 = 0.839 \cdot (\mathtt{pop} - \overline{\mathtt{pop}}) + 0.544 \cdot (\mathtt{ad} - \overline{\mathtt{ad}})
Drawing points in PC space

Two plots illustrating PCA: data projected onto the first principal component line (left) and the
resulting PC scores (right).
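A hedged sketch of how one might reproduce this kind of picture with scikit-learn: PCA on a two-column data matrix returns the loading vector (the analogue of the (0.839, 0.544) direction above, possibly with the sign flipped), and fit_transform returns the PC scores plotted on the right. The pop/ad values below are made up, not the Advertising data:

```python
import numpy as np
from sklearn.decomposition import PCA

rng = np.random.default_rng(2)
# Toy stand-in for the (population, ad spending) data.
pop = rng.normal(40, 10, size=100)
ad = 0.6 * pop + rng.normal(0, 4, size=100)
X = np.column_stack([pop, ad])

pca = PCA(n_components=1)
scores = pca.fit_transform(X)     # Z_1 for each observation (the PC scores)
phi = pca.components_[0]          # loading vector (phi_11, phi_21); sign is arbitrary

print("loadings:", phi)                       # direction of maximal variance
print("first few scores:", scores[:5].ravel())
```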

What will you learn from this lecture?

Section 2

Principal Components Regression
So you've found your PCA coefficients

Now what?
What are we assuming?
Interpretation of PCR coefficients

Original Predictors:
X_1, \cdots, X_p

New Predictors:
Z_1, \cdots, Z_M

Z_m = \sum_{j=1}^{p} \phi_{jm} X_j

Learned model:

y = \theta_0 + \theta_1 Z_1 + \cdots + \theta_M Z_M
Picking M

Two plots showing standardized PCR coefficients (left) and
cross-validation MSE (right) versus the number of principal
components.
Do PCR with the Hitters data
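A minimal PCR sketch in scikit-learn: standardize, compute principal components, regress on the first M of them, and pick M by cross-validated MSE as in the plot above. The X and y below are synthetic stand-ins; in the lab, substitute the prepared Hitters design matrix and Salary response:

```python
import numpy as np
from sklearn.decomposition import PCA
from sklearn.linear_model import LinearRegression
from sklearn.model_selection import GridSearchCV
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import StandardScaler

# Stand-in data; replace with the cleaned Hitters predictors and Salary.
rng = np.random.default_rng(381)
X = rng.normal(size=(200, 19))
y = X[:, :3] @ np.array([4.0, -2.0, 1.0]) + rng.normal(size=200)

pcr = Pipeline([
    ("scale", StandardScaler()),   # standardize predictors before PCA
    ("pca", PCA()),                # build components Z_1, ..., Z_M
    ("ols", LinearRegression()),   # least squares on the first M components
])

# Pick M by 10-fold cross-validated MSE.
grid = GridSearchCV(
    pcr,
    param_grid={"pca__n_components": range(1, X.shape[1] + 1)},
    scoring="neg_mean_squared_error",
    cv=10,
)
grid.fit(X, y)
print("chosen M:", grid.best_params_["pca__n_components"])
print("CV MSE:", -grid.best_score_)
```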

Bias-Variance Trade-off

Example with simulated data: n = 50 observations of p = 45 predictors.

Y is a function of all predictors: plot of training and test MSE versus number of components, illustrating the bias-variance trade-off.

Y is a function of 2 predictors: plot of squared bias, variance, and test MSE versus number of components, demonstrating the bias-variance decomposition.
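A rough way to reproduce the first scenario (not the book's exact simulation): generate n = 50 training observations of p = 45 predictors where Y depends on all of them, then track test MSE as the number of components M grows.

```python
import numpy as np
from sklearn.decomposition import PCA
from sklearn.linear_model import LinearRegression
from sklearn.pipeline import Pipeline

rng = np.random.default_rng(0)
n, p = 50, 45
beta = rng.normal(size=p)                       # Y depends on all predictors

X_train, X_test = rng.normal(size=(n, p)), rng.normal(size=(1000, p))
y_train = X_train @ beta + rng.normal(size=n)
y_test = X_test @ beta + rng.normal(size=1000)

for M in (1, 5, 15, 30, 45):
    pcr = Pipeline([("pca", PCA(n_components=M)), ("ols", LinearRegression())])
    pcr.fit(X_train, y_train)
    mse = np.mean((pcr.predict(X_test) - y_test) ** 2)
    print(f"M = {M:2d}: test MSE = {mse:.2f}")   # bias falls, variance rises with M
```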
Comparison to results on shrinkage

Y is a function of all predictors: plot of training and test MSE versus number of components (PCR), alongside two plots showing MSE, squared bias, and variance versus λ (left) and training R² (right), illustrating the bias-variance trade-off.

Y is a function of 2 predictors: plot of squared bias, variance, and test MSE versus number of components (PCR), alongside two plots showing MSE, squared bias, and variance for the Lasso versus λ (left) and training R² (right).
Properties of PCR

TL;DR

PCR

Scatter plot of Ad Spending versus Population
with a first principal component line and
dashed lines showing data projections.

Next time

Course schedule for weeks 11-20, listing topics like Logistic Regression, PCA, and key dates including
Midterm and Spring Break.