
Ch 6.2: Shrinkage - The Lasso

Lecture 18 - CMSE 381
Michigan State University
Dept. of Computational Mathematics, Science & Engineering
Fri, Feb 27, 2026
Announcements

Last time:

This time:

Announcements:

Screenshot of the course schedule for lectures 11 to 20.
What should you learn from the previous and this lecture?

Section 1

Last time - Ridge Regression
Goal

Y = β0 + β1X1 + β2X2 + β3X3 + β4X4
Ridge regression

Before:
$$\mathrm{RSS} = \sum_{i=1}^{n} \Big( y_i - \beta_0 - \sum_{j=1}^{p} \beta_j x_{ij} \Big)^2$$
After:
$$\sum_{i=1}^{n} \Big( y_i - \beta_0 - \sum_{j=1}^{p} \beta_j x_{ij} \Big)^2 + \lambda \sum_{j=1}^{p} \beta_j^2 = \mathrm{RSS} + \lambda \sum_{j=1}^{p} \beta_j^2$$
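The shrinkage effect of the penalty term can be sketched with scikit-learn, where the tuning parameter $\lambda$ is called `alpha`. The Credit data is not bundled here, so this minimal sketch uses simulated data:

```python
# Minimal sketch: as the ridge penalty grows, the L2 norm of the
# coefficient vector shrinks toward zero (simulated data, not Credit).
import numpy as np
from sklearn.linear_model import Ridge

rng = np.random.default_rng(0)
X = rng.normal(size=(100, 4))
y = X @ np.array([3.0, -2.0, 0.5, 0.0]) + rng.normal(scale=0.5, size=100)

norms = []
for alpha in [0.01, 1.0, 100.0, 10000.0]:   # lambda is "alpha" in sklearn
    model = Ridge(alpha=alpha).fit(X, y)
    norms.append(np.linalg.norm(model.coef_))

print(norms)  # strictly decreasing: heavier penalty, smaller coefficients
```

Tracing `norms` over a grid of penalties is exactly what the coefficient-path plots below visualize.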

Plot of ridge regression coefficient paths for the Credit data, showing how the coefficient estimates shrink toward zero as lambda increases.

Plot of ridge regression coefficient paths for the Credit data, showing how coefficient estimates shrink as the relative L2 norm decreases from left to right.
Scale equivariance (or lack thereof)

Scale equivariant: multiplying a variable by a constant $c$ (using $cX_j$) simply returns the coefficient multiplied by $1/c$ (i.e., $\beta_j / c$). Least squares has this property; ridge regression does not, because the penalty depends on the scale of the coefficients.
Solution: standardize predictors
$$\tilde{x}_{ij} = \frac{x_{ij}}{\sqrt{\tfrac{1}{n} \sum_{i=1}^{n} (x_{ij} - \bar{x}_j)^2}}$$
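The standardization formula above divides each predictor by its sample standard deviation, so every column ends up on the same scale. A minimal sketch with NumPy (made-up data):

```python
# Minimal sketch of standardizing predictors: divide each column by its
# (1/n) sample standard deviation, as in the formula above.
import numpy as np

rng = np.random.default_rng(1)
# three predictors on wildly different scales
X = rng.normal(loc=5.0, scale=[1.0, 10.0, 0.1], size=(200, 3))

sd = X.std(axis=0)      # population-style (1/n) standard deviation per column
X_tilde = X / sd

print(X_tilde.std(axis=0))  # every column now has standard deviation 1
```

Note that, following the textbook's formula, only the scale is adjusted; the mean is not subtracted.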

Section 2

The Lasso
Same goal as before

Y = β0 + β1X1 + β2X2 + β3X3 + β4X4
The lasso

Least Squares:
$$\mathrm{RSS} = \sum_{i=1}^{n} \Big( y_i - \beta_0 - \sum_{j=1}^{p} \beta_j x_{ij} \Big)^2$$
Ridge:
$$\sum_{i=1}^{n} \Big( y_i - \beta_0 - \sum_{j=1}^{p} \beta_j x_{ij} \Big)^2 + \lambda \sum_{j=1}^{p} \beta_j^2 = \mathrm{RSS} + \lambda \sum_{j=1}^{p} \beta_j^2$$

The Lasso:

$$\sum_{i=1}^{n} \Big( y_i - \beta_0 - \sum_{j=1}^{p} \beta_j x_{ij} \Big)^2 + \lambda \sum_{j=1}^{p} |\beta_j| = \mathrm{RSS} + \lambda \sum_{j=1}^{p} |\beta_j|$$
Subsets with lasso

$$\sum_{i=1}^{n} \Big( y_i - \beta_0 - \sum_{j=1}^{p} \beta_j x_{ij} \Big)^2 + \lambda \sum_{j=1}^{p} |\beta_j| = \mathrm{RSS} + \lambda \sum_{j=1}^{p} |\beta_j|$$
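Because the L1 penalty can force coefficient estimates to be exactly zero, the lasso effectively selects a subset of the variables. A minimal sketch on simulated data (two signal predictors, two pure-noise predictors):

```python
# Minimal sketch: with a large enough penalty, lasso zeroes out the
# irrelevant predictors exactly, performing variable selection.
import numpy as np
from sklearn.linear_model import Lasso

rng = np.random.default_rng(2)
X = rng.normal(size=(100, 4))
# only the first two predictors actually matter
y = X @ np.array([3.0, -2.0, 0.0, 0.0]) + rng.normal(scale=0.5, size=100)

model = Lasso(alpha=0.5).fit(X, y)
print(model.coef_)  # last two entries are exactly 0.0, not just small
```

This is the key contrast with ridge: ridge shrinks coefficients toward zero but (generically) never sets any to exactly zero.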
An example on Credit data set

Lasso: plot of lasso coefficient paths for the Credit data, showing how some coefficient estimates shrink exactly to zero as lambda increases.

Ridge: plot of ridge regression coefficient paths for the Credit data, showing how the coefficient estimates shrink toward zero as lambda increases.
More example on Credit data set

Lasso: plot of lasso coefficient paths for the Credit data, showing how some coefficient estimates shrink exactly to zero and produce variable subsets.

Ridge: plot of ridge regression coefficient paths for the Credit data, showing how coefficient estimates shrink as the relative L2 norm decreases from left to right.

Why on earth can the lasso select variables (while ridge cannot)?

Alternative (constrained) formulation of lasso & ridge regression

Lasso: $$\min_{\beta} \sum_{i=1}^{n} (y_i - \hat{y}_i)^2 \quad \text{subject to} \quad \sum_{j=1}^{p} |\beta_j| \le s$$

Ridge: $$\min_{\beta} \sum_{i=1}^{n} (y_i - \hat{y}_i)^2 \quad \text{subject to} \quad \sum_{j=1}^{p} \beta_j^2 \le s$$

Contour plot illustrating the geometric difference between lasso and ridge constraints, showing
why lasso can set coefficients exactly to zero while ridge generally cannot.

Bias-Variance tradeoff

Plot illustrating the bias-variance tradeoff for simulated data: squared bias (black), variance (green), and test mean squared error (purple).
Using Cross-Validation to find λ

10-fold CV choice of λ for lasso and simulated data

Plot of 10-fold cross-validation error and lasso coefficient behavior for
simulated data, with dashed vertical lines indicating the selected
lambda and only the signal variables remaining nonzero near the
chosen model.
Coding example
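The lecture's notebook is not included here, but scikit-learn's `LassoCV` automates the 10-fold CV search for $\lambda$ described above. A minimal sketch on simulated data (not the textbook's exact setup):

```python
# Minimal sketch of choosing lambda by 10-fold cross-validation with
# LassoCV, on simulated data with only two signal variables.
import numpy as np
from sklearn.linear_model import LassoCV

rng = np.random.default_rng(3)
X = rng.normal(size=(200, 10))
beta = np.zeros(10)
beta[:2] = [4.0, -3.0]                  # only two variables carry signal
y = X @ beta + rng.normal(size=200)

model = LassoCV(cv=10, random_state=0).fit(X, y)

print(model.alpha_)                      # the CV-selected penalty
print(np.count_nonzero(model.coef_))     # the signal variables survive
```

The vertical dashed line in the plot above corresponds to `model.alpha_`; the fitted coefficients at that penalty are `model.coef_`.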

Ridge vs Lasso

Ridge Regression:
Lasso:
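The contrast can be sketched by fitting both models to the same simulated data: ridge shrinks every coefficient but keeps all of them nonzero, while lasso drives the irrelevant ones exactly to zero. (Penalty values here are illustrative choices, not tuned.)

```python
# Minimal sketch contrasting the two penalties on the same data.
import numpy as np
from sklearn.linear_model import Ridge, Lasso

rng = np.random.default_rng(4)
X = rng.normal(size=(100, 5))
# two signal predictors, three pure-noise predictors
y = X @ np.array([2.0, -1.5, 0.0, 0.0, 0.0]) + rng.normal(scale=0.5, size=100)

ridge = Ridge(alpha=10.0).fit(X, y)
lasso = Lasso(alpha=0.3).fit(X, y)

print(np.count_nonzero(ridge.coef_))  # all 5 nonzero: shrunk, not selected
print(np.count_nonzero(lasso.coef_))  # only the 2 signal variables remain
```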
TL;DR - Original formulation

Least Squares:
$$\mathrm{RSS} = \sum_{i=1}^{n} \Big( y_i - \beta_0 - \sum_{j=1}^{p} \beta_j x_{ij} \Big)^2$$
Ridge:
$$\sum_{i=1}^{n} \Big( y_i - \beta_0 - \sum_{j=1}^{p} \beta_j x_{ij} \Big)^2 + \lambda \sum_{j=1}^{p} \beta_j^2 = \mathrm{RSS} + \lambda \sum_{j=1}^{p} \beta_j^2$$

The Lasso:

$$\sum_{i=1}^{n} \Big( y_i - \beta_0 - \sum_{j=1}^{p} \beta_j x_{ij} \Big)^2 + \lambda \sum_{j=1}^{p} |\beta_j| = \mathrm{RSS} + \lambda \sum_{j=1}^{p} |\beta_j|$$
Next time

Screenshot of the course schedule for lectures 11 to 20.