
Ch 6.2: Shrinkage - The Lasso

Lecture 18 - CMSE 381
Michigan State University
Dept. of Computational Mathematics, Science & Engineering
Fri, Feb 27, 2026
Announcements

Last time:

This time:

Announcements:

Screenshot of the course schedule for lectures 11 to 20.
What should you learn from the previous and this lecture?

Section 1

Last time - Ridge Regression
Goal

Y = β0 + β1X1 + β2X2 + β3X3 + β4X4
Ridge regression

Before:
$$\mathrm{RSS} = \sum_{i=1}^{n} \Big( y_i - \beta_0 - \sum_{j=1}^{p} \beta_j x_{ij} \Big)^2$$
After:
$$\sum_{i=1}^{n} \Big( y_i - \beta_0 - \sum_{j=1}^{p} \beta_j x_{ij} \Big)^2 + \lambda \sum_{j=1}^{p} \beta_j^2 = \mathrm{RSS} + \lambda \sum_{j=1}^{p} \beta_j^2$$
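The shrinkage effect of the penalty term can be sketched with scikit-learn, where the tuning parameter $\lambda$ is called `alpha`. The Credit data is not bundled here, so this minimal sketch uses simulated data:

```python
# Minimal sketch: as the ridge penalty grows, the L2 norm of the
# coefficient vector shrinks toward zero (simulated data, not Credit).
import numpy as np
from sklearn.linear_model import Ridge

rng = np.random.default_rng(0)
X = rng.normal(size=(100, 4))
y = X @ np.array([3.0, -2.0, 0.5, 0.0]) + rng.normal(scale=0.5, size=100)

norms = []
for alpha in [0.01, 1.0, 100.0, 10000.0]:   # lambda is "alpha" in sklearn
    model = Ridge(alpha=alpha).fit(X, y)
    norms.append(np.linalg.norm(model.coef_))

print(norms)  # strictly decreasing: heavier penalty, smaller coefficients
```

Tracing `norms` over a grid of penalties is exactly what the coefficient-path plots below visualize.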

Plot of ridge regression coefficient paths for the Credit data, showing how the coefficient estimates shrink toward zero as lambda increases.

Plot of ridge regression coefficient paths for the Credit data, showing how coefficient estimates shrink as the relative L2 norm decreases from left to right.
Scale equivariance (or lack thereof)

Scale equivariant: multiplying a variable by a constant $c$ (using $cX_j$) simply returns the coefficient multiplied by $1/c$ (i.e., $\beta_j / c$). Least squares has this property; ridge regression does not, because the penalty depends on the scale of the coefficients.
Solution: standardize predictors
$$\tilde{x}_{ij} = \frac{x_{ij}}{\sqrt{\tfrac{1}{n} \sum_{i=1}^{n} (x_{ij} - \bar{x}_j)^2}}$$
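The standardization formula above divides each predictor by its sample standard deviation, so every column ends up on the same scale. A minimal sketch with NumPy (made-up data):

```python
# Minimal sketch of standardizing predictors: divide each column by its
# (1/n) sample standard deviation, as in the formula above.
import numpy as np

rng = np.random.default_rng(1)
# three predictors on wildly different scales
X = rng.normal(loc=5.0, scale=[1.0, 10.0, 0.1], size=(200, 3))

sd = X.std(axis=0)      # population-style (1/n) standard deviation per column
X_tilde = X / sd

print(X_tilde.std(axis=0))  # every column now has standard deviation 1
```

Note that, following the textbook's formula, only the scale is adjusted; the mean is not subtracted.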

Section 2

The Lasso
Same goal as before

Y = β0 + β1X1 + β2X2 + β3X3 + β4X4
The lasso

Least Squares:
$$\mathrm{RSS} = \sum_{i=1}^{n} \Big( y_i - \beta_0 - \sum_{j=1}^{p} \beta_j x_{ij} \Big)^2$$
Ridge:
$$\sum_{i=1}^{n} \Big( y_i - \beta_0 - \sum_{j=1}^{p} \beta_j x_{ij} \Big)^2 + \lambda \sum_{j=1}^{p} \beta_j^2 = \mathrm{RSS} + \lambda \sum_{j=1}^{p} \beta_j^2$$

The Lasso:

$$\sum_{i=1}^{n} \Big( y_i - \beta_0 - \sum_{j=1}^{p} \beta_j x_{ij} \Big)^2 + \lambda \sum_{j=1}^{p} |\beta_j| = \mathrm{RSS} + \lambda \sum_{j=1}^{p} |\beta_j|$$
Subsets with lasso

$$\sum_{i=1}^{n} \Big( y_i - \beta_0 - \sum_{j=1}^{p} \beta_j x_{ij} \Big)^2 + \lambda \sum_{j=1}^{p} |\beta_j| = \mathrm{RSS} + \lambda \sum_{j=1}^{p} |\beta_j|$$
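Because the L1 penalty can force coefficient estimates to be exactly zero, the lasso effectively selects a subset of the variables. A minimal sketch on simulated data (two signal predictors, two pure-noise predictors):

```python
# Minimal sketch: with a large enough penalty, lasso zeroes out the
# irrelevant predictors exactly, performing variable selection.
import numpy as np
from sklearn.linear_model import Lasso

rng = np.random.default_rng(2)
X = rng.normal(size=(100, 4))
# only the first two predictors actually matter
y = X @ np.array([3.0, -2.0, 0.0, 0.0]) + rng.normal(scale=0.5, size=100)

model = Lasso(alpha=0.5).fit(X, y)
print(model.coef_)  # last two entries are exactly 0.0, not just small
```

This is the key contrast with ridge: ridge shrinks coefficients toward zero but (generically) never sets any to exactly zero.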
An example on Credit data set

Lasso: plot of lasso coefficient paths for the Credit data, showing how some coefficient estimates shrink exactly to zero as lambda increases.

Ridge: plot of ridge regression coefficient paths for the Credit data, showing how the coefficient estimates shrink toward zero as lambda increases.
More example on Credit data set

Lasso: plot of lasso coefficient paths for the Credit data, showing how some coefficient estimates shrink exactly to zero and produce variable subsets.

Ridge: plot of ridge regression coefficient paths for the Credit data, showing how coefficient estimates shrink as the relative L2 norm decreases from left to right.

Why on earth can the lasso select variables (while ridge cannot)?

Alternative (constrained) formulation of lasso & ridge regression

Lasso: $$\min_{\beta} \sum_{i=1}^{n} (y_i - \hat{y}_i)^2 \quad \text{subject to} \quad \sum_{j=1}^{p} |\beta_j| \le s$$

Ridge: $$\min_{\beta} \sum_{i=1}^{n} (y_i - \hat{y}_i)^2 \quad \text{subject to} \quad \sum_{j=1}^{p} \beta_j^2 \le s$$

Contour plot illustrating the geometric difference between lasso and ridge constraints, showing
why lasso can set coefficients exactly to zero while ridge generally cannot.

Bias-Variance tradeoff

Plot illustrating the bias-variance tradeoff for simulated data: squared bias (black), variance (green), and test mean squared error (purple).
Using Cross-Validation to find λ

10-fold CV choice of λ for lasso and simulated data

Plot of 10-fold cross-validation error and lasso coefficient behavior for
simulated data, with dashed vertical lines indicating the selected
lambda and only the signal variables remaining nonzero near the
chosen model.
Coding example
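The lecture's notebook is not included here, but scikit-learn's `LassoCV` automates the 10-fold CV search for $\lambda$ described above. A minimal sketch on simulated data (not the textbook's exact setup):

```python
# Minimal sketch of choosing lambda by 10-fold cross-validation with
# LassoCV, on simulated data with only two signal variables.
import numpy as np
from sklearn.linear_model import LassoCV

rng = np.random.default_rng(3)
X = rng.normal(size=(200, 10))
beta = np.zeros(10)
beta[:2] = [4.0, -3.0]                  # only two variables carry signal
y = X @ beta + rng.normal(size=200)

model = LassoCV(cv=10, random_state=0).fit(X, y)

print(model.alpha_)                      # the CV-selected penalty
print(np.count_nonzero(model.coef_))     # the signal variables survive
```

The vertical dashed line in the plot above corresponds to `model.alpha_`; the fitted coefficients at that penalty are `model.coef_`.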

Ridge vs Lasso

Ridge Regression:
Lasso:
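The contrast can be sketched by fitting both models to the same simulated data: ridge shrinks every coefficient but keeps all of them nonzero, while lasso drives the irrelevant ones exactly to zero. (Penalty values here are illustrative choices, not tuned.)

```python
# Minimal sketch contrasting the two penalties on the same data.
import numpy as np
from sklearn.linear_model import Ridge, Lasso

rng = np.random.default_rng(4)
X = rng.normal(size=(100, 5))
# two signal predictors, three pure-noise predictors
y = X @ np.array([2.0, -1.5, 0.0, 0.0, 0.0]) + rng.normal(scale=0.5, size=100)

ridge = Ridge(alpha=10.0).fit(X, y)
lasso = Lasso(alpha=0.3).fit(X, y)

print(np.count_nonzero(ridge.coef_))  # all 5 nonzero: shrunk, not selected
print(np.count_nonzero(lasso.coef_))  # only the 2 signal variables remain
```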
TL;DR - Original formulation

Least Squares:
$$\mathrm{RSS} = \sum_{i=1}^{n} \Big( y_i - \beta_0 - \sum_{j=1}^{p} \beta_j x_{ij} \Big)^2$$
Ridge:
$$\sum_{i=1}^{n} \Big( y_i - \beta_0 - \sum_{j=1}^{p} \beta_j x_{ij} \Big)^2 + \lambda \sum_{j=1}^{p} \beta_j^2 = \mathrm{RSS} + \lambda \sum_{j=1}^{p} \beta_j^2$$

The Lasso:

$$\sum_{i=1}^{n} \Big( y_i - \beta_0 - \sum_{j=1}^{p} \beta_j x_{ij} \Big)^2 + \lambda \sum_{j=1}^{p} |\beta_j| = \mathrm{RSS} + \lambda \sum_{j=1}^{p} |\beta_j|$$
Next time

Screenshot of the course schedule for lectures 11 to 20.