HW 4

# As always, we start with our favorite standard imports. 

import numpy as np
import pandas as pd
import matplotlib.pyplot as plt 
import seaborn as sns
%matplotlib inline

import statsmodels.api as sm
import statsmodels.formula.api as smf
from sklearn.metrics import root_mean_squared_error
from sklearn.linear_model import LinearRegression
from sklearn.model_selection import train_test_split
from sklearn.model_selection import GridSearchCV, KFold
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import StandardScaler
from sklearn.linear_model import RidgeCV
from sklearn.linear_model import LassoCV
from sklearn.cross_decomposition import PLSRegression
from sklearn.decomposition import PCA

Homework 4 Spring 2026

6.6.0 (2+2+2+2+2+2= 12 points) Conceptual questions
- (a) What is the definition of scale equivariant?
- (b) Why is it important to standardize the predictors when using ridge regression or the lasso?
- (c) What is the difference between ridge regression and the lasso?
- (d) What is an advantage of using ridge regression or the lasso over least squares linear regression?
- (e) What is the purpose of PCA?
- (f) What does the first principal component maximize?
6.6.9 (a-g) In this exercise, we will predict the number of applications received using the other variables in the College data set.

Grading distribution

6.6.0 (12 points)
6.6.9 (44 points)

6.6.0 Conceptual questions

(a) What is the definition of scale equivariant?

### YOUR ANSWER HERE###

(b) Why is it important to standardize the predictors when using ridge regression or the lasso?

### YOUR ANSWER HERE###

(c) What is the difference between ridge regression and the lasso?

### YOUR ANSWER HERE###

(d) What is an advantage of using ridge regression or the lasso over least squares linear regression?

### YOUR ANSWER HERE###

(e) What is the purpose of PCA (Principal Component Analysis)?

### YOUR ANSWER HERE###

(f) What does the first principal component maximize?

### YOUR ANSWER HERE###

6.6.9

In this exercise, we will predict the number of applications received using the other variables in the College data set.

## Load the dataset
url = "https://msu-cmse-courses.github.io/CMSE381-S26/_downloads/cc29ec6408d657de88bc7fe6de6b1170/College.csv"
college_df = pd.read_csv(url)
college_df = college_df.set_index('Unnamed: 0')

## One-hot encode the categorical variable
college_df = pd.get_dummies(college_df, drop_first=True)

college_df.head()

	Apps	Accept	Enroll	Top10perc	Top25perc	F.Undergrad	P.Undergrad	Outstate	Room.Board	Books	Personal	PhD	Terminal	S.F.Ratio	perc.alumni	Expend	Grad.Rate	Private_Yes
Unnamed: 0
Abilene Christian University	1660	1232	721	23	52	2885	537	7440	3300	450	2200	70	78	18.1	12	7041	60	True
Adelphi University	2186	1924	512	16	29	2683	1227	12280	6450	750	1500	29	30	12.2	16	10527	56	True
Adrian College	1428	1097	336	22	50	1036	99	11250	3750	400	1165	53	66	12.9	30	8735	54	True
Agnes Scott College	417	349	137	60	89	510	63	12960	5450	450	875	92	97	7.7	37	19016	59	True
Alaska Pacific University	193	146	55	16	44	249	869	7560	4120	800	1500	76	72	11.9	2	10922	15	True

# Convert entire dataframe to float before modeling
college_df = college_df.astype(float)

print(college_df.dtypes)

Apps           float64
Accept         float64
Enroll         float64
Top10perc      float64
Top25perc      float64
F.Undergrad    float64
P.Undergrad    float64
Outstate       float64
Room.Board     float64
Books          float64
Personal       float64
PhD            float64
Terminal       float64
S.F.Ratio      float64
perc.alumni    float64
Expend         float64
Grad.Rate      float64
Private_Yes    float64
dtype: object

(a) (5 points): Split the data into a training set and a test set. (This split will only be used for the linear regression; all other models are cross-validated on the whole set.)
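As a reminder of the mechanics only, here is a minimal sketch of a train/test split on synthetic data. The frame `demo_df`, its column names, the 25% test fraction, and the seed are all hypothetical choices for illustration, not requirements of this assignment.

```python
import numpy as np
import pandas as pd
from sklearn.model_selection import train_test_split

# Synthetic stand-in table (hypothetical columns), not the College data.
rng = np.random.default_rng(0)
demo_df = pd.DataFrame(rng.normal(size=(100, 4)), columns=["y", "x1", "x2", "x3"])

# Separate the response from the predictors.
X_demo = demo_df.drop(columns="y")
y_demo = demo_df["y"]

# Hold out 25% of the rows; random_state makes the split reproducible.
Xd_train, Xd_test, yd_train, yd_test = train_test_split(
    X_demo, y_demo, test_size=0.25, random_state=0
)
print(Xd_train.shape, Xd_test.shape)
```

Fixing `random_state` means rerunning the cell gives the same split, which keeps the test errors reported later comparable.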

###YOUR CODE HERE###

print(X_train.dtypes)


(b) (5 points): Fit a linear model using least squares on the training set, and report the test error obtained.
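The general pattern (fit on the training rows, score on the held-out rows) can be sketched on synthetic data as follows. The data, the names, and the choice of RMSE as the error measure are assumptions made for illustration; the square root of `mean_squared_error` used here matches what `root_mean_squared_error` returns in scikit-learn 1.4 and later.

```python
import numpy as np
from sklearn.linear_model import LinearRegression
from sklearn.metrics import mean_squared_error
from sklearn.model_selection import train_test_split

# Synthetic regression problem (illustrative only): noise sd is 0.5.
rng = np.random.default_rng(1)
X_demo = rng.normal(size=(200, 5))
true_coef = np.array([3.0, -2.0, 0.0, 1.5, 0.0])
y_demo = X_demo @ true_coef + rng.normal(scale=0.5, size=200)

Xd_train, Xd_test, yd_train, yd_test = train_test_split(
    X_demo, y_demo, test_size=0.25, random_state=1
)

# Least squares fit on the training rows only.
ols = LinearRegression().fit(Xd_train, yd_train)

# Test error: RMSE on rows the model has not seen.
ols_rmse = np.sqrt(mean_squared_error(yd_test, ols.predict(Xd_test)))
print(f"test RMSE: {ols_rmse:.3f}")
```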

###YOUR CODE HERE###

✅ Question (b): Documenting Your Solution Process (3 points)

Please answer the following clearly and completely:

Prior Knowledge vs. External Resources (1 point)
Indicate which parts of Question (b) you completed using your own prior knowledge, and which parts you completed using external resources (e.g., generative AI, past assignments, Stack Overflow, Google, etc.).

###YOUR ANSWER HERE###

Required Documentation (2 points)
- For any part where you used generative AI, you must include the exact prompts you entered and the corresponding AI outputs. Copy and paste them directly.
- For any part where you used other external resources, list those sources.
- For parts completed without external resources, briefly state what prior knowledge you relied on (no detailed explanation required).

Responses that do not include prompts and AI outputs (when applicable) will not receive full credit.

### YOUR ANSWER HERE##
#YOUR PROMPTS##

##AI OUTPUTS##

(c) (5 points): Fit a ridge regression model on the training set, with λ chosen by cross-validation. Report the test error obtained.
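One possible way to set this up is sketched below on synthetic data, assuming a `StandardScaler` followed by `RidgeCV` inside a `Pipeline`. Note that scikit-learn calls the penalty λ `alpha`. The grid of candidate values, the 5 folds, and all variable names are hypothetical choices.

```python
import numpy as np
from sklearn.linear_model import RidgeCV
from sklearn.metrics import mean_squared_error
from sklearn.model_selection import train_test_split
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import StandardScaler

# Synthetic predictors on very different scales (illustrative only).
rng = np.random.default_rng(2)
X_demo = rng.normal(size=(200, 6)) * np.array([1, 10, 100, 1, 10, 100])
y_demo = 2.0 * X_demo[:, 0] + 0.1 * X_demo[:, 1] + rng.normal(size=200)

Xd_train, Xd_test, yd_train, yd_test = train_test_split(
    X_demo, y_demo, test_size=0.25, random_state=2
)

# Candidate penalties on a log grid (an assumed range).
alphas = np.logspace(-3, 3, 50)
ridge_pipe = Pipeline([
    ("scale", StandardScaler()),               # standardize before penalizing
    ("ridge", RidgeCV(alphas=alphas, cv=5)),   # 5-fold CV over the alpha grid
])
ridge_pipe.fit(Xd_train, yd_train)

best_alpha = ridge_pipe.named_steps["ridge"].alpha_
ridge_rmse = np.sqrt(mean_squared_error(yd_test, ridge_pipe.predict(Xd_test)))
print(f"selected alpha: {best_alpha:.4g}, test RMSE: {ridge_rmse:.3f}")
```

If the selected `alpha` lands on either end of the grid, widening the grid is usually worth a try.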

###YOUR CODE HERE###

(d) (5 points): Fit a lasso model on the training set, with λ chosen by cross-validation. Report the test error obtained, along with the number of non-zero coefficient estimates.
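A minimal sketch of the same idea for the lasso, again on synthetic data: `LassoCV` builds its own grid of `alpha` values (scikit-learn's name for λ) and picks one by cross-validation. The data, the names, and the 5-fold setting are assumptions for illustration.

```python
import numpy as np
from sklearn.linear_model import LassoCV
from sklearn.metrics import mean_squared_error
from sklearn.model_selection import train_test_split
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import StandardScaler

# Synthetic data: only the first 3 of 10 predictors matter (illustrative only).
rng = np.random.default_rng(3)
X_demo = rng.normal(size=(200, 10))
y_demo = 5.0 * X_demo[:, 0] - 4.0 * X_demo[:, 1] + 3.0 * X_demo[:, 2] + rng.normal(size=200)

Xd_train, Xd_test, yd_train, yd_test = train_test_split(
    X_demo, y_demo, test_size=0.25, random_state=3
)

lasso_pipe = Pipeline([
    ("scale", StandardScaler()),   # standardize before penalizing
    ("lasso", LassoCV(cv=5)),      # alpha grid chosen automatically, 5-fold CV
])
lasso_pipe.fit(Xd_train, yd_train)

lasso = lasso_pipe.named_steps["lasso"]
n_nonzero = int(np.sum(lasso.coef_ != 0))  # coefficients the lasso kept
lasso_rmse = np.sqrt(mean_squared_error(yd_test, lasso_pipe.predict(Xd_test)))
print(f"alpha: {lasso.alpha_:.4g}, non-zero coefficients: {n_nonzero}, test RMSE: {lasso_rmse:.3f}")
```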

###YOUR CODE HERE###

✅ Question (d): Documenting Your Solution Process (3 points)

Please answer the following clearly and completely:

Prior Knowledge vs. External Resources (1 point)
Indicate which parts of Question (d) you completed using your own prior knowledge, and which parts you completed using external resources (e.g., generative AI, past assignments, Stack Overflow, Google, etc.).

###YOUR ANSWER HERE###

Required Documentation (2 points)
- For any part where you used generative AI, you must include the exact prompts you entered and the corresponding AI outputs. Copy and paste them directly.
- For any part where you used other external resources, list those sources.
- For parts completed without external resources, briefly state what prior knowledge you relied on (no detailed explanation required).

Responses that do not include prompts and AI outputs (when applicable) will not receive full credit.

### YOUR ANSWER HERE##
#YOUR PROMPTS##

##AI OUTPUTS##

(e) (5 points): Fit a PCR model on the training set, with M chosen by cross-validation. Report the test error obtained, along with the value of M selected by cross-validation.
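scikit-learn has no dedicated PCR estimator, so one common construction, sketched here on synthetic data, chains `StandardScaler`, `PCA`, and `LinearRegression` in a `Pipeline` and lets `GridSearchCV` choose the number of components M. The step names, the fold settings, and the scoring choice are hypothetical.

```python
import numpy as np
from sklearn.decomposition import PCA
from sklearn.linear_model import LinearRegression
from sklearn.model_selection import GridSearchCV, KFold
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import StandardScaler

# Synthetic data (illustrative only).
rng = np.random.default_rng(4)
X_demo = rng.normal(size=(150, 6))
y_demo = X_demo[:, 0] - 2.0 * X_demo[:, 3] + rng.normal(scale=0.5, size=150)

# PCR = standardize, project onto M principal components, then least squares.
pcr_pipe = Pipeline([
    ("scale", StandardScaler()),
    ("pca", PCA()),
    ("ols", LinearRegression()),
])

# Try every M from 1 up to the number of predictors.
param_grid = {"pca__n_components": list(range(1, X_demo.shape[1] + 1))}
cv = KFold(n_splits=5, shuffle=True, random_state=4)
pcr_search = GridSearchCV(
    pcr_pipe, param_grid, cv=cv, scoring="neg_root_mean_squared_error"
)
pcr_search.fit(X_demo, y_demo)

pcr_best_M = pcr_search.best_params_["pca__n_components"]
pcr_cv_rmse = -pcr_search.best_score_  # scores are negated RMSE
print(f"selected M: {pcr_best_M}, CV RMSE: {pcr_cv_rmse:.3f}")
```

Because the scaler and PCA sit inside the pipeline, they are refit on each training fold, so the held-out fold does not leak into the preprocessing.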

###YOUR CODE HERE###

(f) (5 points): Fit a PLS model on the training set, with M chosen by cross-validation. Report the test error obtained, along with the value of M selected by cross-validation.

###YOUR CODE HERE###

✅ Question (f): Documenting Your Solution Process (3 points)

Please answer the following clearly and completely:

Prior Knowledge vs. External Resources (1 point)
Indicate which parts of Question (f) you completed using your own prior knowledge, and which parts you completed using external resources (e.g., generative AI, past assignments, Stack Overflow, Google, etc.).

###YOUR ANSWER HERE###

Required Documentation (2 points)
- For any part where you used generative AI, you must include the exact prompts you entered and the corresponding AI outputs. Copy and paste them directly.
- For any part where you used other external resources, list those sources.
- For parts completed without external resources, briefly state what prior knowledge you relied on (no detailed explanation required).

Responses that do not include prompts and AI outputs (when applicable) will not receive full credit.

### YOUR ANSWER HERE##
#YOUR PROMPTS##

##AI OUTPUTS##

(g) (5 points): Comment on the results obtained. How accurately can we predict the number of college applications received? Is there much difference among the test errors resulting from these five approaches?

###YOUR COMMENT HERE###
