
Ch 6.2: Shrinkage - Ridge regression

Lecture 17 - CMSE 381
Michigan State University
Dept of Computational Mathematics, Science & Engineering
Wed, Feb 25, 2026
Announcements

Last time:

This time:

Announcements:

Screenshot of the course schedule for
lectures 11 to 20.

Section 1

Last time
Subset selection

Algorithm diagram showing one way to break up the subset selection procedure into steps for choosing models based on training and testing scores.
Algorithm diagram showing the forward stepwise selection procedure for building a model by adding variables one at a time.
Algorithm diagram showing the backward stepwise selection procedure for building a model by removing variables one at a time.
What should you learn from this lecture?

Section 2

Ridge Regression
Goal

$Y = \beta_0 + \beta_1 X_1 + \beta_2 X_2 + \beta_3 X_3 + \beta_4 X_4$
Ridge regression

Before:
$$\mathrm{RSS} = \sum_{i=1}^{n}\left(y_i - \beta_0 - \sum_{j=1}^{p}\beta_j x_{ij}\right)^2$$
After:
$$\sum_{i=1}^{n}\left(y_i - \beta_0 - \sum_{j=1}^{p}\beta_j x_{ij}\right)^2 + \lambda\sum_{j=1}^{p}\beta_j^2 = \mathrm{RSS} + \lambda\sum_{j=1}^{p}\beta_j^2$$
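To make the penalized criterion concrete, here is a minimal NumPy sketch of the objective above; the function and variable names are our own, not from the lecture.

```python
import numpy as np

def ridge_objective(beta0, beta, X, y, lam):
    """Ridge criterion: RSS plus the L2 penalty on the slopes.

    Note that the intercept beta0 is NOT penalized.
    """
    residuals = y - (beta0 + X @ beta)   # y_i - beta0 - sum_j beta_j x_ij
    rss = np.sum(residuals ** 2)
    return rss + lam * np.sum(beta ** 2)
```

Setting `lam = 0` recovers ordinary least squares; larger values of `lam` push the slope estimates toward zero.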
Example from the Credit data

$\mathrm{RSS} + \lambda\sum_{j=1}^{p}\beta_j^2$
Plot of ridge regression coefficient
paths for the Credit data, showing
how the coefficient estimates shrink
toward zero as lambda increases.
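A sketch of how a coefficient-path plot like this can be produced with scikit-learn. The file name `Credit.csv` and the columns used are assumptions about a local copy of the Credit data, not something fixed by the lecture.

```python
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
from sklearn.linear_model import Ridge
from sklearn.preprocessing import StandardScaler

# Assumed local copy of the Credit data with the usual ISLR column names.
credit = pd.read_csv("Credit.csv")
X = credit[["Income", "Limit", "Rating", "Student"]].copy()
X["Student"] = (X["Student"] == "Yes").astype(int)  # dummy-encode Student
y = credit["Balance"]

X_std = StandardScaler().fit_transform(X)  # ridge is scale-sensitive

# Refit the ridge model over a grid of lambda values (sklearn calls it alpha).
lambdas = np.logspace(-2, 5, 100)
coefs = [Ridge(alpha=lam).fit(X_std, y).coef_ for lam in lambdas]

plt.plot(lambdas, coefs)
plt.xscale("log")
plt.xlabel("lambda")
plt.ylabel("standardized coefficients")
plt.legend(X.columns)
plt.show()
```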
Same Setting, Different Plot

$\mathrm{RSS} + \lambda\sum_{j=1}^{p}\beta_j^2$, where $\|\beta\|_2 = \sqrt{\sum_{j=1}^{p}\beta_j^2}$

Plot of ridge regression coefficient
paths for the Credit data, showing
how coefficient estimates shrink as
the relative L2 norm decreases from
left to right.
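This version of the plot uses the relative norm $\|\hat{\beta}_\lambda^R\|_2 / \|\hat{\beta}\|_2$ on the x-axis. Continuing the sketch above (so `X_std`, `y`, `coefs`, and `lambdas` are as defined there):

```python
from sklearn.linear_model import LinearRegression

# Unpenalized least-squares fit gives the denominator ||beta_hat||_2.
beta_ols = LinearRegression().fit(X_std, y).coef_
rel_norm = [np.linalg.norm(c) / np.linalg.norm(beta_ols) for c in coefs]

plt.plot(rel_norm, coefs)
plt.xlabel("relative L2 norm of the ridge coefficients")
plt.ylabel("standardized coefficients")
plt.show()
```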

Scale equivariance (or lack thereof)

Scale equivariant: multiplying a variable by $c$ (using $cX_j$ in place of $X_j$) just returns the coefficient multiplied by $1/c$ (that is, $\frac{1}{c}\beta_j$). Least squares has this property, but ridge regression does not: the penalty $\lambda \sum_j \beta_j^2$ changes when the predictors are rescaled.
Solution: Standardize predictors

$$\tilde{x}_{ij} = \frac{x_{ij}}{\sqrt{\frac{1}{n}\sum_{i=1}^{n}\left(x_{ij} - \bar{x}_j\right)^2}}$$
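A small synthetic demo of the point above, using toy data of our own rather than the lecture's: rescaling a predictor changes the least-squares coefficient by exactly $1/c$, while the ridge coefficients change in a more complicated way, which is why standardizing first matters.

```python
import numpy as np
from sklearn.linear_model import LinearRegression, Ridge

rng = np.random.default_rng(0)
X = rng.normal(size=(100, 2))
y = 3.0 * X[:, 0] - 2.0 * X[:, 1] + rng.normal(size=100)

X_scaled = X.copy()
X_scaled[:, 0] *= 100.0  # e.g. record the first predictor in cents, not dollars

# Least squares is scale equivariant: the first coefficient shrinks by 1/100.
print(LinearRegression().fit(X, y).coef_)
print(LinearRegression().fit(X_scaled, y).coef_)

# Ridge is not: the penalized coefficients change non-proportionally.
print(Ridge(alpha=10.0).fit(X, y).coef_)
print(Ridge(alpha=10.0).fit(X_scaled, y).coef_)

# Standardizing each column, as in the formula above, removes the ambiguity.
X_std = (X_scaled - X_scaled.mean(axis=0)) / X_scaled.std(axis=0)
```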
Using Cross-Validation to find λ

LOOCV choice of λ for ridge regression and Credit data

Plot of LOOCV error versus lambda for ridge regression on the Credit
data, with dashed vertical lines indicating the selected lambda value.
Coding
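One possible sketch of this coding step, using scikit-learn: with its default `cv=None`, `RidgeCV` scores every candidate $\lambda$ by an efficient leave-one-out cross-validation, matching the LOOCV selection shown above. The synthetic data here is just a stand-in for the Credit data.

```python
import numpy as np
from sklearn.linear_model import RidgeCV

# Synthetic stand-in for a real dataset such as Credit.
rng = np.random.default_rng(1)
X = rng.normal(size=(200, 4))
y = X @ np.array([3.0, -2.0, 0.0, 1.0]) + rng.normal(size=200)

# RidgeCV with cv=None (the default) uses efficient leave-one-out CV
# to pick the best lambda from the supplied grid.
lambdas = np.logspace(-4, 4, 100)
ridge_cv = RidgeCV(alphas=lambdas).fit(X, y)

print("LOOCV-selected lambda:", ridge_cv.alpha_)
print("coefficients at that lambda:", ridge_cv.coef_)
```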

Bias-Variance tradeoff

Plot illustrating the bias-variance tradeoff for simulated data: squared bias (black), variance (green), and test mean squared error (purple).
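Curves like these can be approximated by simulation. A minimal sketch under our own toy setup (not the book's simulated data): repeatedly draw training sets, fit ridge at each $\lambda$, and estimate the squared bias and variance of the prediction at one fixed test point.

```python
import numpy as np
from sklearn.linear_model import Ridge

rng = np.random.default_rng(0)
n, p, reps = 50, 20, 200
beta_true = rng.normal(size=p)
X_test = rng.normal(size=(1, p))            # one fixed test point
f_test = (X_test @ beta_true).item()        # true regression function there

for lam in np.logspace(-2, 3, 11):
    preds = []
    for _ in range(reps):                   # fresh training set each rep
        X = rng.normal(size=(n, p))
        y = X @ beta_true + rng.normal(size=n)
        preds.append(Ridge(alpha=lam).fit(X, y).predict(X_test)[0])
    preds = np.array(preds)
    bias2 = (preds.mean() - f_test) ** 2    # squared bias at the test point
    var = preds.var()                       # variance over training sets
    mse = bias2 + var + 1.0                 # + irreducible noise variance
    print(f"lambda={lam:9.3f}  bias^2={bias2:.3f}  var={var:.3f}  test MSE~{mse:.3f}")
```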
More Bias-Variance Tradeoff

Second plot illustrating the bias-variance tradeoff for simulated data: squared bias (black), variance (green), and test mean squared error (purple).
Advantages of Ridge

Ridge vs. Least Squares:
Ridge vs. Subset Selection:
Look back and look ahead

Screenshot of the course schedule for lectures 11 to 20.
  • What is regularization? Why do we need it?
  • What are the two basic types of regularization methods? How are they implemented mathematically in linear regression?
  • How do you fit a ridge regression model in Python?
  • How do you control the model flexibility & bias-variance tradeoff when using regularization?
  • How do you find the right amount of regularization using cross-validation? How do you do this in Python?
  • What additional precautions do you need to take when using regularization (compared to least squares)?
  • What are the advantages of regularization compared to Least Squares?
  • What are the advantages of regularization compared to subset selection?