Chapters 5 and 6 Assigned Problems

# As always, we start with our favorite standard imports. 

import numpy as np
import pandas as pd
import matplotlib.pyplot as plt 
import seaborn as sns
%matplotlib inline

import statsmodels.api as sm
import statsmodels.formula.api as smf
from sklearn.linear_model import LinearRegression

Homework 3 Spring 2026

  • 5.4.3

    • Note: “Explain” means no code is necessary. You can use words, pictures, and/or pseudocode to show me that you understand how this procedure works.

  • 5.4.5: The Default data set is on the DataSets page

    • Hint: For part (c), you do not need to regenerate the data. You have the same data set, but you are generating a new split of the data. This could be as easy as setting a new seed.

  • 5.4.8 (a-e)

    • Added part f: Repeat part (c) using \(k\)-fold CV for \(k=5,10,15,20\). Plot your results for error vs. degree for all these plus the LOOCV version. What do you notice?

    • You don’t need to plot this for the training error; it’s annoyingly difficult to get that out of the easy-mode version of \(k\)-fold CV. Just do it for test error. If you really want to try to get the training error plotted, too, take a look here.

  • 6:

    • Forward and backward selection

Below are the training and testing errors from fitting linear regression on different subsets of the variables from the Auto data set to predict mpg.

Variables                                       Train Score   Test Score
null model                                      60.76         60.73
(cylinders,)                                    24.02         24.15
(horsepower,)                                   23.94         24.19
(weight,)                                       18.68         18.84
(acceleration,)                                 49.87         50.26
(cylinders, horsepower)                         20.85         21.13
(cylinders, weight)                             18.38         18.55
(cylinders, acceleration)                       23.94         24.38
(horsepower, weight)                            17.84         18.03
(horsepower, acceleration)                      22.46         22.70
(weight, acceleration)                          18.25         18.61
(cylinders, horsepower, weight)                 17.76         17.99
(cylinders, horsepower, acceleration)           20.06         20.44
(cylinders, weight, acceleration)               18.13         18.54
(horsepower, weight, acceleration)              17.84         18.16
(cylinders, horsepower, weight, acceleration)   17.76         18.13

  • (Parts a, b, and c) For each of the three subset selection methods discussed in class ((a) best subset selection, (b) forward selection, and (c) backward selection), do the following:

    • Describe the steps taken in the algorithm to arrive at a conclusion for the best possible model.

    • Be sure to say what \(M_k\) is for \(k= 0 , 1, \cdots, 4\).

    • What is the best model returned by the algorithm?

    • How many models do you have to train to arrive at the conclusion?

  • (d) Are your answers to (a), (b), and (c) the same? Do we expect them to be?

Note that I am not assuming you need to code any of these options; you only need to calculate by hand.

Grading distribution

  • 5.4.3 (13 points)

  • 5.4.5 (36 points)

  • 5.4.8 (43 points)

  • 6 (8 points)

5.4.3

We now review k-fold cross-validation.
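As a warm-up for thinking about the mechanics, here is a minimal sketch of how scikit-learn's KFold hands out indices. The sizes (10 observations, 5 folds) and the seed are made up for illustration and have nothing to do with the homework data.

```python
import numpy as np
from sklearn.model_selection import KFold

# Ten toy observations split into k = 5 folds (both numbers are illustrative).
n_obs, k = 10, 5
indices = np.arange(n_obs)

kf = KFold(n_splits=k, shuffle=True, random_state=0)
held_out = []
for fold, (train_idx, val_idx) in enumerate(kf.split(indices), start=1):
    # Each fold is held out exactly once; a model would be fit on the other k - 1 folds.
    print(f"fold {fold}: train on {train_idx}, validate on {val_idx}")
    held_out.extend(val_idx.tolist())

# Check that every observation lands in exactly one validation fold.
every_obs_held_out_once = sorted(held_out) == list(range(n_obs))
```

Printing the index arrays is only meant to show who is in which fold; your written explanation should still describe what happens with the model and the errors on each fold.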

  • (a) (5 points): Explain how k-fold cross-validation is implemented.

###YOUR ANSWER HERE###

  • (b) (8 points): What are the advantages and disadvantages of k-fold cross-validation relative to:

    • i. The validation set approach?

    • ii. LOOCV?

###YOUR ANSWER HERE###

5.4.5

In Chapter 4, we used logistic regression to predict the probability of default using income and balance on the Default data set. We will now estimate the test error of this logistic regression model using the validation set approach. Do not forget to set a random seed before beginning your analysis.

(a) (8 points) Fit a logistic regression model that uses income and balance to predict default.

from sklearn.linear_model import LogisticRegression 
from sklearn.model_selection import train_test_split
url = "https://msu-cmse-courses.github.io/CMSE381-S26/_downloads/0bf0b0b65f603971cd33a04ad934449c/Default.csv"
Default = pd.read_csv(url)
Default.head()
default student balance income
0 No No 729.526495 44361.62507
1 No Yes 817.180407 12106.13470
2 No No 1073.549164 31767.13895
3 No No 529.250605 35704.49394
4 No No 785.655883 38463.49588
###YOUR CODE HERE###

(b)(11 points): Using the validation set approach, estimate the test error of this model. In order to do this, you must perform the following steps:

  • i. Split the sample set into a training set and a validation set.

  • ii. Fit a multiple logistic regression model using only the training observations.

  • iii. Obtain a prediction of default status for each individual in the validation set by computing the posterior probability of default for that individual, and classifying the individual to the default category if the posterior probability is greater than 0.5.

  • iv. Compute the validation set error, which is the fraction of the observations in the validation set that are misclassified.
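The four steps above can be sketched on synthetic data as follows. This is a rough illustration, not the solution: the data are randomly generated stand-ins (not the Default data set), and the 50/50 split, the seeds, and the names X_toy and y_toy are all assumptions made for the example.

```python
import numpy as np
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split

# Synthetic stand-in data (NOT the Default data set): two features, binary label.
rng = np.random.default_rng(0)
X_toy = rng.normal(size=(500, 2))
noise = rng.normal(scale=0.5, size=500)
y_toy = (X_toy[:, 0] + 0.5 * X_toy[:, 1] + noise > 0).astype(int)

# i. Split into a training set and a validation set (50/50 is an arbitrary choice).
X_train, X_val, y_train, y_val = train_test_split(
    X_toy, y_toy, test_size=0.5, random_state=1
)

# ii. Fit logistic regression using only the training observations.
clf = LogisticRegression().fit(X_train, y_train)

# iii. Posterior probability of class 1, then classify with a 0.5 threshold.
prob_pos = clf.predict_proba(X_val)[:, 1]
y_pred = (prob_pos > 0.5).astype(int)

# iv. Validation set error: fraction of misclassified validation observations.
val_error = np.mean(y_pred != y_val)
print(val_error)
```

Changing random_state in train_test_split gives a different split of the same data, which is the idea behind the hint for part (c).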

  • i (2 points). Split the sample set into a training set and a validation set.

### YOUR ANSWER HERE###
  • ii (4 points). Fit a multiple logistic regression model using only the training observations.

### YOUR ANSWER HERE###
  • iii (3 points). Obtain a prediction of default status for each individual in the validation set by computing the posterior probability of default for that individual, and classifying the individual to the default category if the posterior probability is greater than 0.5.

  • iv (2 points). Compute the validation set error, which is the fraction of the observations in the validation set that are misclassified.

###YOUR ANSWER HERE###

(c) (4 points): Repeat the process in (b) three times, using three different splits of the observations into a training set and a validation set. Comment on the results obtained.

###YOUR CODE HERE###

(d) (13 points): Now consider a logistic regression model that predicts the probability of default using income, balance, and a dummy variable for student. Estimate the test error for this model using the validation set approach. Comment on whether or not including a dummy variable for student leads to a reduction in the test error rate.
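If the phrase "dummy variable" is rusty, here is a minimal sketch of turning a Yes/No column into a 0/1 column. The toy frame and the column name student_yes are hypothetical, and the balance values are made up for the example.

```python
import pandas as pd

# Hypothetical toy frame; values are invented for illustration only.
toy = pd.DataFrame(
    {
        "student": ["No", "Yes", "No", "Yes"],
        "balance": [700.0, 800.0, 1000.0, 500.0],
    }
)

# Dummy variable: 1 if the person is a student, 0 otherwise.
toy["student_yes"] = (toy["student"] == "Yes").astype(int)
print(toy)
```

pd.get_dummies is another way to do this; the explicit comparison above just makes the 0/1 coding easy to see.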

###YOUR CODE HERE### (10 points)

✅ Question (d): Documenting Your Solution Process (3 points)

Please answer the following clearly and completely:

  1. Prior Knowledge vs. External Resources (1 point)
    Indicate which parts of Question (d) you completed using your own prior knowledge, and which parts you completed using external resources (e.g., generative AI, past assignments, Stack Overflow, Google, etc.).

###YOUR ANSWER HERE###
  2. Required Documentation (2 points)

    • For any part where you used generative AI, you must include the exact prompts you entered and the corresponding AI outputs. Copy and paste them directly.

    • For any part where you used other external resources, list those sources.

    • For parts completed without external resources, briefly state what prior knowledge you relied on (no detailed explanation required).

Responses that do not include prompts and AI outputs (when applicable) will not receive full credit.

### YOUR ANSWER HERE##
#YOUR PROMPTS##

##AI OUTPUTS##

5.4.8 (a-e, added part f)

We will now perform cross-validation on a simulated data set.

(a)(3 points): Generate a simulated data set as follows:

rng = np.random.default_rng(1)
x = rng.normal(size=100)
y = x - 2 * x**2 + rng.normal(size=100)

In this data set, what is n and what is p? Write out the model used to generate the data in equation form.

###YOUR ANSWER HERE###

(b) (3 points): Create a scatterplot of X against Y. Comment on what you find.

###YOUR ANSWER HERE###

(c)(13 points): Set a random seed, and then compute the LOOCV errors that result from fitting the following four models using least squares:

  • \(Y = \beta_0 + \beta_1 X + \varepsilon\)

  • \(Y = \beta_0 + \beta_1 X + \beta_2 X^2 + \varepsilon\)

  • \(Y = \beta_0 + \beta_1 X+ \beta_2 X^2+ \beta_3 X^3 + \varepsilon\)

  • \(Y = \beta_0 + \beta_1 X+ \beta_2 X^2+ \beta_3 X^3+ \beta_4 X^4 + \varepsilon\)

Note: you may find it helpful to use pd.DataFrame to create a single data set containing both X and Y.

from sklearn.preprocessing import PolynomialFeatures 

d = 4

X_poly = PolynomialFeatures(degree=d, include_bias=False).fit_transform(x[:,None])

X = pd.DataFrame(X_poly, columns=[f'x^{i}' for i in range(1,d+1)])

y = pd.Series(y)
from sklearn.model_selection import cross_val_score
from sklearn.model_selection import KFold
##YOUR CODE HERE## (10 points)
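In case the cross_val_score interface is unfamiliar, here is a minimal sketch of LOOCV for a single straight-line fit on toy data. The data, the seed, and the names x_toy and y_toy are assumptions for the example; this is not the simulated data from part (a), and it only fits one model rather than the four asked for.

```python
import numpy as np
from sklearn.linear_model import LinearRegression
from sklearn.model_selection import LeaveOneOut, cross_val_score

# Toy data with a linear trend (illustrative only).
rng = np.random.default_rng(0)
x_toy = rng.normal(size=30)
y_toy = 1.0 + 3.0 * x_toy + rng.normal(size=30)

# LeaveOneOut gives n folds of size 1. The default R^2 score is undefined
# on a single held-out point, so ask for (negated) mean squared error instead.
scores = cross_val_score(
    LinearRegression(),
    x_toy[:, None],          # sklearn expects a 2D feature array
    y_toy,
    cv=LeaveOneOut(),
    scoring="neg_mean_squared_error",
)

# sklearn reports negated MSE (higher is better), so flip the sign back.
loocv_mse = -scores.mean()
print(loocv_mse)
```

The sign flip is the part that trips people up most often, so it is worth checking that your reported errors are positive.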

✅ Question (c): Documenting Your Solution Process (3 points)

Please answer the following clearly and completely:

  1. Prior Knowledge vs. External Resources (1 point)
    Indicate which parts of Question (c) you completed using your own prior knowledge, and which parts you completed using external resources (e.g., generative AI, past assignments, Stack Overflow, Google, etc.).

###YOUR ANSWER HERE##

  2. Required Documentation (2 points)

    • For any part where you used generative AI, you must include the exact prompts you entered and the corresponding AI outputs. Copy and paste them directly.

    • For any part where you used other external resources, list those sources.

    • For parts completed without external resources, briefly state what prior knowledge you relied on (no detailed explanation required).

Responses that do not include prompts and AI outputs (when applicable) will not receive full credit.

### YOUR ANSWER HERE##
#YOUR PROMPTS##

##AI OUTPUTS##

(d) (4 points): Repeat (c) using another random seed, and report your results.

Are your results the same as what you got in (c)? Why?

##YOUR CODE HERE##

###YOUR COMMENTS HERE###

(e) (4 points): Which of the models in (c) had the smallest LOOCV error? Is this what you expected? Explain your answer.

###YOUR ANSWER HERE###

Added part f (16 points): Repeat part (c) using \(k\)-fold CV for \(k=5,10,15,20\). Plot your results for error vs. degree for all these plus the LOOCV version. What do you notice?
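For anyone who does want training error alongside test error, one possible route (an assumption on my part about what is convenient, not a required approach) is scikit-learn's cross_validate with return_train_score=True. The sketch below uses toy data, a single straight-line model, and only two values of k.

```python
import numpy as np
from sklearn.linear_model import LinearRegression
from sklearn.model_selection import KFold, cross_validate

# Toy data (illustrative only, not the data from part (a)).
rng = np.random.default_rng(0)
x_toy = rng.normal(size=60)
y_toy = 2.0 - x_toy + rng.normal(size=60)

results = {}
for k in (5, 10):
    cv = KFold(n_splits=k, shuffle=True, random_state=0)
    out = cross_validate(
        LinearRegression(),
        x_toy[:, None],
        y_toy,
        cv=cv,
        scoring="neg_mean_squared_error",
        return_train_score=True,   # also score each fit on its own training folds
    )
    # Store (mean train MSE, mean test MSE) with the sign flipped back to positive.
    results[k] = (-out["train_score"].mean(), -out["test_score"].mean())

print(results)
```

Looping over polynomial degree as well as k, and collecting the test errors into something plottable, is left to you.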

###YOUR CODE HERE## (13 points)

✅ Question (f): Documenting Your Solution Process (3 points)

Please answer the following clearly and completely:

  1. Prior Knowledge vs. External Resources (1 point)
    Indicate which parts of Question (f) you completed using your own prior knowledge, and which parts you completed using external resources (e.g., generative AI, past assignments, Stack Overflow, Google, etc.).

###YOUR ANSWER HERE##

  2. Required Documentation (2 points)

    • For any part where you used generative AI, you must include the exact prompts you entered and the corresponding AI outputs. Copy and paste them directly.

    • For any part where you used other external resources, list those sources.

    • For parts completed without external resources, briefly state what prior knowledge you relied on (no detailed explanation required).

Responses that do not include prompts and AI outputs (when applicable) will not receive full credit.

### YOUR ANSWER HERE##
#YOUR PROMPTS##

##AI OUTPUTS##

Forward and backward selection

Below are the training and testing errors from fitting linear regression on different subsets of the variables from the Auto data set to predict mpg.

Variables                                       Train Score   Test Score
null model                                      60.76         60.73
(cylinders,)                                    24.02         24.15
(horsepower,)                                   23.94         24.19
(weight,)                                       18.68         18.84
(acceleration,)                                 49.87         50.26
(cylinders, horsepower)                         20.85         21.13
(cylinders, weight)                             18.38         18.55
(cylinders, acceleration)                       23.94         24.38
(horsepower, weight)                            17.84         18.03
(horsepower, acceleration)                      22.46         22.70
(weight, acceleration)                          18.25         18.61
(cylinders, horsepower, weight)                 17.76         17.99
(cylinders, horsepower, acceleration)           20.06         20.44
(cylinders, weight, acceleration)               18.13         18.54
(horsepower, weight, acceleration)              17.84         18.16
(cylinders, horsepower, weight, acceleration)   17.76         18.13

  • (Parts a, b, and c) (2 + 2 + 2 = 6 points): For each of the three subset selection methods discussed in class ((a) best subset selection, (b) forward selection, and (c) backward selection), do the following:

    • Describe the steps taken in the algorithm to arrive at a conclusion for the best possible model.

    • Be sure to say what \(M_k\) is for \(k= 0 , 1, \cdots, 4\).

    • What is the best model returned by the algorithm?

    • How many models do you have to train to arrive at the conclusion?

  • (d)(2 points): Are your answers to (a), (b), and (c) the same? Do we expect them to be?

Note that I am not assuming you need to code any of these options; you only need to calculate by hand.
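To make the bookkeeping concrete, here is a minimal sketch of the greedy forward step on a HYPOTHETICAL table with three made-up predictors a, b, and c. The error values are invented for illustration and are not taken from the table above; the function name forward_selection is also just a placeholder. The sketch only builds the path of models \(M_0, M_1, \cdots\) using training error; choosing among the \(M_k\) at the end is a separate step.

```python
# HYPOTHETICAL training errors for every subset of three made-up predictors.
toy_train_error = {
    frozenset(): 10.0,
    frozenset({"a"}): 6.0,
    frozenset({"b"}): 5.0,
    frozenset({"c"}): 8.0,
    frozenset({"a", "b"}): 4.0,
    frozenset({"a", "c"}): 3.5,
    frozenset({"b", "c"}): 4.5,
    frozenset({"a", "b", "c"}): 3.0,
}
predictors = ["a", "b", "c"]


def forward_selection(errors, predictors):
    """Greedy forward path: at each step add the one predictor that lowers error most."""
    current = frozenset()
    path = [current]          # path[k] plays the role of M_k
    n_fit = 1                 # the null model counts as one fit
    while len(current) < len(predictors):
        candidates = [current | {p} for p in predictors if p not in current]
        n_fit += len(candidates)
        current = min(candidates, key=lambda s: errors[s])
        path.append(current)
    return path, n_fit


path, n_fit = forward_selection(toy_train_error, predictors)

# For comparison: the best two-variable model over ALL pairs (what best subset would pick).
best_of_size_2 = min((s for s in toy_train_error if len(s) == 2), key=toy_train_error.get)

print([sorted(s) for s in path], n_fit, sorted(best_of_size_2))
```

In this toy table the greedy path and the exhaustive search disagree at size two, which is worth keeping in mind when you get to part (d).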

  • (a) best subset selection

#YOUR ANSWER HERE##

  • (b) forward selection

###YOUR ANSWER HERE###

  • (c) backward selection

###YOUR ANSWER HERE###

  • (d) Are your answers to (a), (b), and (c) the same? Do we expect them to be?

###YOUR ANSWER HERE###