HW 4

# As always, we start with our favorite standard imports. 

import numpy as np
import pandas as pd
import matplotlib.pyplot as plt 
import seaborn as sns
%matplotlib inline

import statsmodels.api as sm
import statsmodels.formula.api as smf
from sklearn.metrics import root_mean_squared_error
from sklearn.linear_model import LinearRegression
from sklearn.model_selection import train_test_split
from sklearn.model_selection import GridSearchCV, KFold
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import StandardScaler
from sklearn.linear_model import RidgeCV
from sklearn.linear_model import LassoCV
from sklearn.cross_decomposition import PLSRegression
from sklearn.decomposition import PCA

Homework 4 Spring 2026

  • 6.6.0 (2+2+2+2+2+2 = 12 points) Conceptual questions

    • (a) What is the definition of scale equivariant?

    • (b) Why is it important to standardize the predictors when using ridge regression or the lasso?

    • (c) What is the difference between ridge regression and the lasso?

    • (d) What is an advantage of using ridge regression or the lasso over least squares linear regression?

    • (e) What is the purpose of PCA?

    • (f) What does the first principal component maximize?

  • 6.6.9 (a-g) In this exercise, we will predict the number of applications received using the other variables in the College data set.

Grading distribution

  • 6.6.0 (12 points)

  • 6.6.9 (44 points)

6.6.0 Conceptual questions

  • (a) What is the definition of scale equivariant?

### YOUR ANSWER HERE###
  • (b) Why is it important to standardize the predictors when using ridge regression or the lasso?

### YOUR ANSWER HERE###
  • (c) What is the difference between ridge regression and the lasso?

### YOUR ANSWER HERE###
  • (d) What is an advantage of using ridge regression or the lasso over least squares linear regression?

### YOUR ANSWER HERE###
  • (e) What is the purpose of PCA (Principal Component Analysis)?

### YOUR ANSWER HERE###
  • (f) What does the first principal component maximize?

### YOUR ANSWER HERE###

6.6.9

In this exercise, we will predict the number of applications received using the other variables in the College data set.

## Load the dataset
url = "https://msu-cmse-courses.github.io/CMSE381-S26/_downloads/cc29ec6408d657de88bc7fe6de6b1170/College.csv"
college_df = pd.read_csv(url)
college_df = college_df.set_index('Unnamed: 0')

## One-hot encode the categorical variable
college_df = pd.get_dummies(college_df, drop_first=True)

college_df.head()
                              Apps  Accept  Enroll  Top10perc  Top25perc  F.Undergrad  P.Undergrad  Outstate  Room.Board  Books  Personal  PhD  Terminal  S.F.Ratio  perc.alumni  Expend  Grad.Rate  Private_Yes
Unnamed: 0
Abilene Christian University  1660    1232     721         23         52         2885          537      7440        3300    450      2200   70        78       18.1           12    7041         60         True
Adelphi University            2186    1924     512         16         29         2683         1227     12280        6450    750      1500   29        30       12.2           16   10527         56         True
Adrian College                1428    1097     336         22         50         1036           99     11250        3750    400      1165   53        66       12.9           30    8735         54         True
Agnes Scott College            417     349     137         60         89          510           63     12960        5450    450       875   92        97        7.7           37   19016         59         True
Alaska Pacific University      193     146      55         16         44          249          869      7560        4120    800      1500   76        72       11.9            2   10922         15         True
# Convert entire dataframe to float before modeling
college_df = college_df.astype(float)

print(college_df.dtypes)
Apps           float64
Accept         float64
Enroll         float64
Top10perc      float64
Top25perc      float64
F.Undergrad    float64
P.Undergrad    float64
Outstate       float64
Room.Board     float64
Books          float64
Personal       float64
PhD            float64
Terminal       float64
S.F.Ratio      float64
perc.alumni    float64
Expend         float64
Grad.Rate      float64
Private_Yes    float64
dtype: object

(a) (5 points): Split the data into a training set and a test set. (This split is only used for the linear regression; all other models are cross-validated on the whole data set.)

###YOUR CODE HERE###
print(X_train.dtypes)
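As a reminder of how `train_test_split` behaves, here is a minimal sketch on a small synthetic dataframe. The data, the column names, and the variable names (`demo_df`, `X_tr`, `X_te`, and so on) are made up for illustration, and the 30% test fraction and `random_state=42` are arbitrary choices, not values required by this assignment.

```python
import numpy as np
import pandas as pd
from sklearn.model_selection import train_test_split

# Synthetic stand-in for a numeric dataframe with one response column (hypothetical data).
rng = np.random.default_rng(0)
demo_df = pd.DataFrame(rng.normal(size=(100, 4)), columns=["y", "x1", "x2", "x3"])

# Separate the response from the predictors.
X_demo = demo_df.drop(columns="y")
y_demo = demo_df["y"]

# Hold out 30% of the rows; fixing random_state makes the split reproducible.
X_tr, X_te, y_tr, y_te = train_test_split(
    X_demo, y_demo, test_size=0.3, random_state=42
)

print(X_tr.shape, X_te.shape)
```

The same call pattern applies to any dataframe whose columns are all numeric, which is why the College data was converted to float above.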

(b) (5 points): Fit a linear model using least squares on the training set, and report the test error obtained.

###YOUR CODE HERE###
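For reference, the general shape of a least squares fit followed by a test error calculation can be sketched as below. This uses synthetic data with known coefficients rather than the College data, and all names (`X_demo`, `ols`, `test_rmse`) are hypothetical. The error is reported as root mean squared error, computed here with `np.sqrt(mean_squared_error(...))`, which should match `root_mean_squared_error` from the imports at the top.

```python
import numpy as np
from sklearn.linear_model import LinearRegression
from sklearn.metrics import mean_squared_error
from sklearn.model_selection import train_test_split

# Synthetic data: a linear signal with known coefficients plus a little noise.
rng = np.random.default_rng(1)
X_demo = rng.normal(size=(200, 3))
y_demo = X_demo @ np.array([2.0, -1.0, 0.5]) + rng.normal(scale=0.1, size=200)

# Fit on the training rows only; the test rows are held back for evaluation.
X_tr, X_te, y_tr, y_te = train_test_split(
    X_demo, y_demo, test_size=0.25, random_state=0
)
ols = LinearRegression().fit(X_tr, y_tr)

# Test RMSE: square root of the mean squared prediction error on the held-out rows.
test_rmse = np.sqrt(mean_squared_error(y_te, ols.predict(X_te)))
print(f"test RMSE: {test_rmse:.3f}")
```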

✅ Question (b): Documenting Your Solution Process (3 points)

Please answer the following clearly and completely:

  1. Prior Knowledge vs. External Resources (1 point)
    Indicate which parts of Question (b) you completed using your own prior knowledge, and which parts you completed using external resources (e.g., generative AI, past assignments, Stack Overflow, Google, etc.).

###YOUR ANSWER HERE###
  2. Required Documentation (2 points)

    • For any part where you used generative AI, you must include the exact prompts you entered and the corresponding AI outputs. Copy and paste them directly.

    • For any part where you used other external resources, list those sources.

    • For parts completed without external resources, briefly state what prior knowledge you relied on (no detailed explanation required).

Responses that do not include prompts and AI outputs (when applicable) will not receive full credit.

### YOUR ANSWER HERE##
#YOUR PROMPTS##

##AI OUTPUTS##

(c) (5 points): Fit a ridge regression model on the training set, with λ chosen by cross-validation. Report the test error obtained.

###YOUR CODE HERE###
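One possible pattern for choosing the ridge penalty by cross-validation is sketched below on synthetic data. Note that scikit-learn calls the penalty λ `alpha`. The grid of candidate values, the 5-fold setting, and every variable name here are assumptions made for illustration only. The scaler is placed in a `Pipeline` so that the predictors are standardized before the penalty is applied.

```python
import numpy as np
from sklearn.linear_model import RidgeCV
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import StandardScaler

# Synthetic predictors on very different scales (hypothetical data).
rng = np.random.default_rng(2)
X_demo = rng.normal(size=(150, 5)) * np.array([1, 10, 100, 1000, 0.1])
y_demo = X_demo[:, 0] + 0.05 * X_demo[:, 1] + rng.normal(scale=0.5, size=150)

# Candidate penalties on a log scale; RidgeCV keeps the one with the best CV score.
alphas = np.logspace(-3, 3, 50)
ridge_pipe = Pipeline([
    ("scale", StandardScaler()),              # standardize the predictors first
    ("ridge", RidgeCV(alphas=alphas, cv=5)),  # 5-fold CV over the alpha grid
])
ridge_pipe.fit(X_demo, y_demo)

best_alpha = ridge_pipe.named_steps["ridge"].alpha_
print(f"selected alpha: {best_alpha:.4f}")
```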

(d) (5 points): Fit a lasso model on the training set, with λ chosen by cross-validation. Report the test error obtained, along with the number of non-zero coefficient estimates.

###YOUR CODE HERE###
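The sketch below shows one way to pick the lasso penalty by cross-validation and then count the non-zero coefficients. It runs on synthetic data in which only two of eight predictors carry signal, so the data, the names, and the 5-fold setting are all illustrative assumptions. As with ridge, scikit-learn exposes λ as `alpha`.

```python
import numpy as np
from sklearn.linear_model import LassoCV
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import StandardScaler

# Synthetic data: only the first two of eight predictors affect the response.
rng = np.random.default_rng(3)
X_demo = rng.normal(size=(200, 8))
y_demo = 3.0 * X_demo[:, 0] - 2.0 * X_demo[:, 1] + rng.normal(scale=0.5, size=200)

lasso_pipe = Pipeline([
    ("scale", StandardScaler()),                # standardize before penalizing
    ("lasso", LassoCV(cv=5, random_state=0)),   # alpha grid is chosen automatically
])
lasso_pipe.fit(X_demo, y_demo)

# Pull out the fitted lasso step to inspect the chosen penalty and the coefficients.
lasso = lasso_pipe.named_steps["lasso"]
n_nonzero = int(np.sum(lasso.coef_ != 0))
print(f"selected alpha: {lasso.alpha_:.4f}, non-zero coefficients: {n_nonzero}")
```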

✅ Question (d): Documenting Your Solution Process (3 points)

Please answer the following clearly and completely:

  1. Prior Knowledge vs. External Resources (1 point)
    Indicate which parts of Question (d) you completed using your own prior knowledge, and which parts you completed using external resources (e.g., generative AI, past assignments, Stack Overflow, Google, etc.).

###YOUR ANSWER HERE###
  2. Required Documentation (2 points)

    • For any part where you used generative AI, you must include the exact prompts you entered and the corresponding AI outputs. Copy and paste them directly.

    • For any part where you used other external resources, list those sources.

    • For parts completed without external resources, briefly state what prior knowledge you relied on (no detailed explanation required).

Responses that do not include prompts and AI outputs (when applicable) will not receive full credit.

### YOUR ANSWER HERE##
#YOUR PROMPTS##

##AI OUTPUTS##

(e) (5 points): Fit a PCR model on the training set, with M chosen by cross-validation. Report the test error obtained, along with the value of M selected by cross-validation.

###YOUR CODE HERE###
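scikit-learn has no single PCR estimator, so one common approach is to chain standardization, `PCA`, and `LinearRegression` in a `Pipeline` and let `GridSearchCV` pick the number of components M. The sketch below does this on synthetic data. The step names, the variable names, and the shuffled 5-fold `KFold` are assumptions made for illustration.

```python
import numpy as np
from sklearn.decomposition import PCA
from sklearn.linear_model import LinearRegression
from sklearn.model_selection import GridSearchCV, KFold
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import StandardScaler

# Synthetic data with six predictors (hypothetical).
rng = np.random.default_rng(4)
X_demo = rng.normal(size=(120, 6))
y_demo = X_demo @ rng.normal(size=6) + rng.normal(scale=0.5, size=120)

# PCR as a pipeline: standardize, project onto M components, then least squares.
pcr = Pipeline([
    ("scale", StandardScaler()),
    ("pca", PCA()),
    ("ols", LinearRegression()),
])

# Try every M from 1 up to the number of predictors.
param_grid = {"pca__n_components": list(range(1, X_demo.shape[1] + 1))}
cv = KFold(n_splits=5, shuffle=True, random_state=0)
search = GridSearchCV(pcr, param_grid, cv=cv, scoring="neg_root_mean_squared_error")
search.fit(X_demo, y_demo)

best_M = search.best_params_["pca__n_components"]
cv_rmse = -search.best_score_  # the scorer is negated, so flip the sign back
print(f"selected M: {best_M}, CV RMSE: {cv_rmse:.3f}")
```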

(f) (5 points): Fit a PLS model on the training set, with M chosen by cross-validation. Report the test error obtained, along with the value of M selected by cross-validation.

###YOUR CODE HERE###

✅ Question (f): Documenting Your Solution Process (3 points)

Please answer the following clearly and completely:

  1. Prior Knowledge vs. External Resources (1 point)
    Indicate which parts of Question (f) you completed using your own prior knowledge, and which parts you completed using external resources (e.g., generative AI, past assignments, Stack Overflow, Google, etc.).

###YOUR ANSWER HERE###
  2. Required Documentation (2 points)

    • For any part where you used generative AI, you must include the exact prompts you entered and the corresponding AI outputs. Copy and paste them directly.

    • For any part where you used other external resources, list those sources.

    • For parts completed without external resources, briefly state what prior knowledge you relied on (no detailed explanation required).

Responses that do not include prompts and AI outputs (when applicable) will not receive full credit.

### YOUR ANSWER HERE##
#YOUR PROMPTS##

##AI OUTPUTS##

(g) (5 points): Comment on the results obtained. How accurately can we predict the number of college applications received? Is there much difference among the test errors resulting from these five approaches?

###YOUR COMMENT HERE###