HW 4
# As always, we start with our favorite standard imports.
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
import seaborn as sns
%matplotlib inline
import statsmodels.api as sm
import statsmodels.formula.api as smf
from sklearn.metrics import root_mean_squared_error
from sklearn.linear_model import LinearRegression
from sklearn.model_selection import train_test_split
from sklearn.model_selection import GridSearchCV, KFold
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import StandardScaler
from sklearn.linear_model import RidgeCV
from sklearn.linear_model import LassoCV
from sklearn.cross_decomposition import PLSRegression
from sklearn.decomposition import PCA
Homework 4 Spring 2026
6.6.0 (2+2+2+2+2+2 = 12 points) Conceptual questions
(a) What is the definition of scale equivariant?
(b) Why is it important to standardize the predictors when using ridge regression or the lasso?
(c) What is the difference between ridge regression and the lasso?
(d) What is an advantage of using ridge regression or the lasso over least squares linear regression?
(e) What is the purpose of PCA?
(f) What does the first principal component maximize?
6.6.9 (a-g) In this exercise, we will predict the number of applications received using the other variables in the College data set.
Grading distribution
6.6.0 (12 points)
6.6.9 (44 points)
6.6.0 Conceptual questions
(a) What is the definition of scale equivariant?
### YOUR ANSWER HERE###
(b) Why is it important to standardize the predictors when using ridge regression or the lasso?
### YOUR ANSWER HERE###
(c) What is the difference between ridge regression and the lasso?
### YOUR ANSWER HERE###
(d) What is an advantage of using ridge regression or the lasso over least squares linear regression?
### YOUR ANSWER HERE###
(e) What is the purpose of PCA (Principal Component Analysis)?
### YOUR ANSWER HERE###
(f) What does the first principal component maximize?
### YOUR ANSWER HERE###
6.6.9
In this exercise, we will predict the number of applications received
using the other variables in the College data set.
## Load the dataset
url = "https://msu-cmse-courses.github.io/CMSE381-S26/_downloads/cc29ec6408d657de88bc7fe6de6b1170/College.csv"
college_df = pd.read_csv(url)
college_df = college_df.set_index('Unnamed: 0')
## One-hot encode the categorical variable
college_df = pd.get_dummies(college_df, drop_first=True)
college_df.head()
| Unnamed: 0 | Apps | Accept | Enroll | Top10perc | Top25perc | F.Undergrad | P.Undergrad | Outstate | Room.Board | Books | Personal | PhD | Terminal | S.F.Ratio | perc.alumni | Expend | Grad.Rate | Private_Yes |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| Abilene Christian University | 1660 | 1232 | 721 | 23 | 52 | 2885 | 537 | 7440 | 3300 | 450 | 2200 | 70 | 78 | 18.1 | 12 | 7041 | 60 | True |
| Adelphi University | 2186 | 1924 | 512 | 16 | 29 | 2683 | 1227 | 12280 | 6450 | 750 | 1500 | 29 | 30 | 12.2 | 16 | 10527 | 56 | True |
| Adrian College | 1428 | 1097 | 336 | 22 | 50 | 1036 | 99 | 11250 | 3750 | 400 | 1165 | 53 | 66 | 12.9 | 30 | 8735 | 54 | True |
| Agnes Scott College | 417 | 349 | 137 | 60 | 89 | 510 | 63 | 12960 | 5450 | 450 | 875 | 92 | 97 | 7.7 | 37 | 19016 | 59 | True |
| Alaska Pacific University | 193 | 146 | 55 | 16 | 44 | 249 | 869 | 7560 | 4120 | 800 | 1500 | 76 | 72 | 11.9 | 2 | 10922 | 15 | True |
# Convert entire dataframe to float before modeling
college_df = college_df.astype(float)
print(college_df.dtypes)
Apps float64
Accept float64
Enroll float64
Top10perc float64
Top25perc float64
F.Undergrad float64
P.Undergrad float64
Outstate float64
Room.Board float64
Books float64
Personal float64
PhD float64
Terminal float64
S.F.Ratio float64
perc.alumni float64
Expend float64
Grad.Rate float64
Private_Yes float64
dtype: object
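The two preprocessing steps above (one-hot encoding with `drop_first=True`, then casting to float) can be seen in isolation on a tiny frame. This is only a sketch: `toy_df` and its values are hypothetical and are not taken from College.csv.

```python
import pandas as pd

# Hypothetical toy data (not from College.csv): 'Private' mimics the single
# categorical column, 'Apps' stands in for any numeric column.
toy_df = pd.DataFrame({"Private": ["Yes", "No", "Yes"], "Apps": [100, 250, 175]})

# drop_first=True keeps one indicator column (Private_Yes) and drops the
# redundant Private_No column, which is always 1 - Private_Yes.
toy_encoded = pd.get_dummies(toy_df, drop_first=True)

# The indicator column is bool or uint8 depending on the pandas version;
# casting the whole frame to float gives every column the same numeric dtype.
toy_float = toy_encoded.astype(float)
print(toy_float.dtypes)
```

Casting to float up front avoids mixed bool/int columns when the frame is later passed to the modeling libraries.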
(a) (5 points): Split the data into a training set and a test set. (This split will only be used for the linear regression; all other models are cross-validated with the whole set.)
###YOUR CODE HERE###
print(X_train.dtypes)
---------------------------------------------------------------------------
NameError Traceback (most recent call last)
Cell In[11], line 1
----> 1 print(X_train.dtypes)
NameError: name 'X_train' is not defined
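As a reminder of how `train_test_split` behaves, here is a minimal sketch on synthetic data. The arrays `X_demo` and `y_demo`, the 80/20 split, and the `random_state` value are all assumptions made for illustration; they are not part of the assignment.

```python
import numpy as np
from sklearn.model_selection import train_test_split

# Synthetic stand-in data (assumption): 100 rows, 3 predictors.
rng = np.random.default_rng(0)
X_demo = rng.normal(size=(100, 3))
y_demo = X_demo @ np.array([2.0, -1.0, 0.5]) + rng.normal(scale=0.1, size=100)

# Hold out 20% of the rows; fixing random_state makes the split reproducible.
X_tr, X_te, y_tr, y_te = train_test_split(
    X_demo, y_demo, test_size=0.2, random_state=42
)
print(X_tr.shape, X_te.shape)
```

Note that the `print(X_train.dtypes)` check above expects a variable named `X_train` with a `.dtypes` attribute, so in the homework the predictors should be passed as a pandas DataFrame rather than a NumPy array.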
(b) (5 points): Fit a linear model using least squares on the training set, and report the test error obtained.
###YOUR CODE HERE###
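The general pattern for this kind of question (fit on the training rows, score on the held-out rows) might look like the sketch below. It uses synthetic data, so `X_demo`, `y_demo`, and the coefficients are hypothetical. The RMSE is computed as the square root of `mean_squared_error` so the sketch also runs on scikit-learn versions older than 1.4; `root_mean_squared_error`, imported at the top of this notebook, returns the same quantity on newer versions.

```python
import numpy as np
from sklearn.linear_model import LinearRegression
from sklearn.metrics import mean_squared_error
from sklearn.model_selection import train_test_split

# Synthetic stand-in data (assumption): linear signal plus Gaussian noise.
rng = np.random.default_rng(0)
X_demo = rng.normal(size=(200, 4))
y_demo = X_demo @ np.array([1.5, -2.0, 0.0, 0.7]) + rng.normal(scale=0.5, size=200)

X_tr, X_te, y_tr, y_te = train_test_split(
    X_demo, y_demo, test_size=0.2, random_state=0
)

# Fit ordinary least squares on the training rows only.
ols = LinearRegression().fit(X_tr, y_tr)

# Test error: RMSE on rows the model never saw during fitting.
test_rmse = np.sqrt(mean_squared_error(y_te, ols.predict(X_te)))
print(f"test RMSE: {test_rmse:.3f}")
```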
✅ Question (b): Documenting Your Solution Process (3 points)
Please answer the following clearly and completely:
Prior Knowledge vs. External Resources (1 point)
Indicate which parts of Question (b) you completed using your own prior knowledge, and which parts you completed using external resources (e.g., generative AI, past assignments, Stack Overflow, Google, etc.).
###YOUR ANSWER HERE###
Required Documentation (2 points)
For any part where you used generative AI, you must include the exact prompts you entered and the corresponding AI outputs. Copy and paste them directly.
For any part where you used other external resources, list those sources.
For parts completed without external resources, briefly state what prior knowledge you relied on (no detailed explanation required).
Responses that do not include prompts and AI outputs (when applicable) will not receive full credit.
### YOUR ANSWER HERE##
#YOUR PROMPTS##
##AI OUTPUTS##
(c) (5 points): Fit a ridge regression model on the training set, with λ chosen by cross-validation. Report the test error obtained.
###YOUR CODE HERE###
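One possible shape for a cross-validated ridge fit is sketched below on synthetic data. The grid of penalties, the 5-fold setting, and all variable names are assumptions chosen for illustration. scikit-learn calls the penalty λ `alpha`, and the scaler is placed inside the pipeline so that standardization is learned from the training rows only.

```python
import numpy as np
from sklearn.linear_model import RidgeCV
from sklearn.metrics import mean_squared_error
from sklearn.model_selection import train_test_split
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import StandardScaler

# Synthetic stand-in data (assumption): predictors on very different scales.
rng = np.random.default_rng(1)
X_demo = rng.normal(size=(200, 4)) * np.array([1.0, 10.0, 100.0, 1000.0])
y_demo = X_demo @ np.array([1.0, 0.1, 0.01, 0.001]) + rng.normal(scale=0.5, size=200)

X_tr, X_te, y_tr, y_te = train_test_split(
    X_demo, y_demo, test_size=0.2, random_state=0
)

# Candidate penalties (sklearn's name for lambda is alpha).
alphas = np.logspace(-3, 3, 50)

# Standardize, then let RidgeCV choose alpha by 5-fold cross-validation.
ridge_pipe = Pipeline([
    ("scale", StandardScaler()),
    ("ridge", RidgeCV(alphas=alphas, cv=5)),
])
ridge_pipe.fit(X_tr, y_tr)

best_alpha = ridge_pipe.named_steps["ridge"].alpha_
ridge_rmse = np.sqrt(mean_squared_error(y_te, ridge_pipe.predict(X_te)))
print(f"chosen alpha: {best_alpha:.4g}, test RMSE: {ridge_rmse:.3f}")
```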
(d) (5 points): Fit a lasso model on the training set, with λ chosen by cross-validation. Report the test error obtained, along with the number of non-zero coefficient estimates.
###YOUR CODE HERE###
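A lasso fit follows the same pattern, with the extra step of counting the coefficients that were not shrunk exactly to zero. The sketch below is on synthetic data where only three of ten predictors carry signal; the data, the fold count, and the names are hypothetical.

```python
import numpy as np
from sklearn.linear_model import LassoCV
from sklearn.metrics import mean_squared_error
from sklearn.model_selection import train_test_split
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import StandardScaler

# Synthetic stand-in data (assumption): 10 predictors, only the first 3 matter.
rng = np.random.default_rng(2)
X_demo = rng.normal(size=(200, 10))
true_coef = np.array([3.0, -2.0, 1.5] + [0.0] * 7)
y_demo = X_demo @ true_coef + rng.normal(scale=0.5, size=200)

X_tr, X_te, y_tr, y_te = train_test_split(
    X_demo, y_demo, test_size=0.2, random_state=0
)

# LassoCV builds its own grid of penalties and picks one by 5-fold CV.
lasso_pipe = Pipeline([
    ("scale", StandardScaler()),
    ("lasso", LassoCV(cv=5, random_state=0)),
])
lasso_pipe.fit(X_tr, y_tr)

lasso = lasso_pipe.named_steps["lasso"]
n_nonzero = int(np.sum(lasso.coef_ != 0))
lasso_rmse = np.sqrt(mean_squared_error(y_te, lasso_pipe.predict(X_te)))
print(f"alpha: {lasso.alpha_:.4g}, non-zero coefficients: {n_nonzero}, "
      f"test RMSE: {lasso_rmse:.3f}")
```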
✅ Question (d): Documenting Your Solution Process (3 points)
Please answer the following clearly and completely:
Prior Knowledge vs. External Resources (1 point)
Indicate which parts of Question (d) you completed using your own prior knowledge, and which parts you completed using external resources (e.g., generative AI, past assignments, Stack Overflow, Google, etc.).
###YOUR ANSWER HERE###
Required Documentation (2 points)
For any part where you used generative AI, you must include the exact prompts you entered and the corresponding AI outputs. Copy and paste them directly.
For any part where you used other external resources, list those sources.
For parts completed without external resources, briefly state what prior knowledge you relied on (no detailed explanation required).
Responses that do not include prompts and AI outputs (when applicable) will not receive full credit.
### YOUR ANSWER HERE##
#YOUR PROMPTS##
##AI OUTPUTS##
(e) (5 points): Fit a PCR model on the training set, with M chosen by cross-validation. Report the test error obtained, along with the value of M selected by cross-validation.
###YOUR CODE HERE###
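scikit-learn has no dedicated PCR estimator, so one common approach is to chain `StandardScaler`, `PCA`, and `LinearRegression` in a pipeline and search over the number of components M with `GridSearchCV`. The sketch below does this on synthetic data; the data, the fold setup, and the scoring choice are assumptions for illustration.

```python
import numpy as np
from sklearn.decomposition import PCA
from sklearn.linear_model import LinearRegression
from sklearn.metrics import mean_squared_error
from sklearn.model_selection import GridSearchCV, KFold, train_test_split
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import StandardScaler

# Synthetic stand-in data (assumption): 6 predictors.
rng = np.random.default_rng(3)
X_demo = rng.normal(size=(200, 6))
y_demo = X_demo @ np.array([2.0, -1.0, 0.5, 0.0, 0.0, 1.0]) + rng.normal(scale=0.5, size=200)

X_tr, X_te, y_tr, y_te = train_test_split(
    X_demo, y_demo, test_size=0.2, random_state=0
)

# PCR = standardize -> project onto the first M principal components -> OLS.
pcr_pipe = Pipeline([
    ("scale", StandardScaler()),
    ("pca", PCA()),
    ("ols", LinearRegression()),
])

# Try every M from 1 up to the number of predictors.
param_grid = {"pca__n_components": list(range(1, X_tr.shape[1] + 1))}
cv = KFold(n_splits=5, shuffle=True, random_state=0)
pcr_search = GridSearchCV(
    pcr_pipe, param_grid, cv=cv, scoring="neg_root_mean_squared_error"
)
pcr_search.fit(X_tr, y_tr)

best_M_pcr = pcr_search.best_params_["pca__n_components"]
pcr_rmse = np.sqrt(mean_squared_error(y_te, pcr_search.predict(X_te)))
print(f"selected M: {best_M_pcr}, test RMSE: {pcr_rmse:.3f}")
```

Because `GridSearchCV` refits the best pipeline on all training rows by default, `pcr_search.predict` uses the model with the selected M.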
(f) (5 points): Fit a PLS model on the training set, with M chosen by cross- validation. Report the test error obtained, along with the value of M selected by cross-validation.
###YOUR CODE HERE###
✅ Question (f): Documenting Your Solution Process (3 points)
Please answer the following clearly and completely:
Prior Knowledge vs. External Resources (1 point)
Indicate which parts of Question (f) you completed using your own prior knowledge, and which parts you completed using external resources (e.g., generative AI, past assignments, Stack Overflow, Google, etc.).
###YOUR ANSWER HERE###
Required Documentation (2 points)
For any part where you used generative AI, you must include the exact prompts you entered and the corresponding AI outputs. Copy and paste them directly.
For any part where you used other external resources, list those sources.
For parts completed without external resources, briefly state what prior knowledge you relied on (no detailed explanation required).
Responses that do not include prompts and AI outputs (when applicable) will not receive full credit.
### YOUR ANSWER HERE##
#YOUR PROMPTS##
##AI OUTPUTS##
(g) (5 points): Comment on the results obtained. How accurately can we predict the number of college applications received? Is there much difference among the test errors resulting from these five approaches?
###YOUR COMMENT HERE###