Homework 2 Spring 2026
Wed Jan 21 and Fri Jan 23, we covered Sec 3.1 - Simple Linear Regression
3.7.8 (a)(i,ii,iii, modified iv below) and (b)
Modified version of a.iv:
What are the predicted values for the inputs?
Compute the RSS and MSE using these predicted values.
3.7.13
Warning for part (b): np.random.normal takes the standard deviation as input for scale, not the variance.
A note on code. The book’s use of the statsmodels package is slightly different from the examples provided in the Jupyter notebooks in class. In particular, the book’s lab examples and the homework statement implicitly use
import statsmodels.api as sm
while we use
import statsmodels.formula.api as smf
This results in slight differences in code, in particular whether the function call you use is OLS or ols. You may use whichever works for you; the answers should be the same.
Mon Jan 26, we cover Sec 3.2 - Multiple Linear Regression
3.7.1
(Modified version of 3.7.9): Using the Auto data set, we will predict \(Y = \texttt{mpg}\) using all other variables except name and origin.
Generate the correlation matrix between all variables. Are there any pairs that are particularly highly correlated?
Using statsmodels, create a linear model predicting mpg from all other variables except name and origin.
Is there a relationship between the predictors and the response? Justify your answer.
Which predictors appear to have a statistically significant relationship to the response?
What does the coefficient for the year variable suggest?
Grading distribution
You should be able to break down the innards however you want.
3.7.8 - 21 points
3.7.13 - 38 points
3.7.1 - 6 points
3.7.9ish - 11 points
# As always, we start with our favorite standard imports.
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
import seaborn as sns
%matplotlib inline
import statsmodels.api as sm
import statsmodels.formula.api as smf
from sklearn.linear_model import LinearRegression
✅ Q 3.7.8 (21 points): This question involves the use of simple linear regression on the Auto data set.
# Load data
# First, we're going to do all the data loading and cleanup we figured out last time.
url = "https://msu-cmse-courses.github.io/CMSE381-S26/_downloads/d75c3811a83a66f8c261e5b599ef9e44/Auto.csv"
auto = pd.read_csv(url)
auto = auto.replace('?', np.nan)
auto = auto.dropna()
auto.horsepower = auto.horsepower.astype('int')
auto.shape
(392, 9)
✅ Q 3.7.8 (a) (5 points):
(a) Use the sm.OLS() function to perform a simple linear regression with mpg as the response and horsepower as the predictor. Use the summary() function to print the results.
###YOUR sm.OLS() CODE HERE###
✅ Q 3.7.8 (a)(i,ii,iii) (6 points): Comment on the (a) output. For example:
- i. Is there a relationship between the predictor and the response?
- ii. How strong is the relationship between the predictor and the response?
- iii. Is the relationship between the predictor and the response positive or negative?
YOUR ANSWER TO (i),(ii), and (iii)
(i)
(ii)
(iii)
✅ Q 3.7.8 (a) (7 points): (modified iv below)
What are the predicted values for the inputs?
Compute the RSS and MSE using these predicted values.
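As a reminder of the definitions, the sketch below computes the residual sum of squares and the mean squared error for a handful of made-up numbers (`y_obs` and `y_hat` are hypothetical arrays, not Auto data). It assumes the convention MSE = RSS / n; some software instead divides by the residual degrees of freedom, so check which quantity you are reporting.

```python
import numpy as np

# Hypothetical observed responses and predictions (illustration only).
y_obs = np.array([3.0, 5.0, 7.0, 9.0])
y_hat = np.array([2.5, 5.5, 7.0, 8.0])

residuals = y_obs - y_hat      # e_i = y_i - yhat_i
rss = np.sum(residuals ** 2)   # residual sum of squares
mse = rss / len(y_obs)         # mean squared error, here taken as RSS / n

print(rss, mse)
```

With a fitted statsmodels result, the `fittedvalues` attribute holds the predictions for the training inputs and can play the role of `y_hat` here.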
###YOUR CODE OR ANSWER HERE###
✅ Q 3.7.8 (b) (3 points):
(b) Plot the response and the predictor in a new set of axes ax. Use the ax.axline() method or the abline() function defined in the lab to display the least squares regression line.
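For reference, this is one way `ax.axline()` can be used: it draws an unbounded line through a point with a given slope. The data, the names `x_demo` and `y_demo`, and the use of `np.polyfit` to get a slope and intercept are all assumptions made for this sketch; in the homework the coefficients would come from your fitted model instead.

```python
import numpy as np
import matplotlib.pyplot as plt

# Hypothetical data and coefficients, for illustration only.
rng = np.random.default_rng(0)
x_demo = rng.uniform(0, 10, size=40)
y_demo = 1.0 + 2.0 * x_demo + rng.normal(scale=1.5, size=40)

# np.polyfit returns coefficients from highest degree to lowest.
slope, intercept = np.polyfit(x_demo, y_demo, deg=1)

fig, ax = plt.subplots(figsize=(6, 4))
ax.scatter(x_demo, y_demo, alpha=0.7, label="data")
# axline draws an infinite line through the point xy1 with the given slope.
ax.axline((0, intercept), slope=slope, color="red", label="least squares line")
ax.set_xlabel("x")
ax.set_ylabel("y")
ax.legend()
```

Passing the point `(0, intercept)` works because the fitted line crosses the vertical axis at the intercept.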
###YOUR CODE HERE####
✅ Question 3.7.8 Documenting Your Solution Pathway (3 points)
Answer the following questions in detail:
Indicate which portions of Question 3.7.8 you answered with prior knowledge, and which portions you answered with the help of outside sources (e.g. generative AI, past assignments, Stack Overflow, Google, etc.). (1 point)
For the parts where you DID use external resources, document what sources you used and how they informed your approach. If you used generative AI, be sure to include the most important prompts and outputs. For the parts where you DID NOT use external resources, what prior knowledge did you recall to solve the problem? (2 points)
###YOUR ANSWER HERE (You can use markdown)###
# 1.
# 2.
✅ 3.7.13 (a) (2 points):
In this exercise you will create some simulated data and will fit simple linear regression models to it. Make sure to use the default random number generator with seed set to 1 prior to starting part (a) to ensure consistent results.
(a) Using the normal() method of your random number generator, create a vector, x, containing 100 observations drawn from a \(N(0, 1)\) distribution. This represents a feature, \(X\).
###YOUR CODE HERE####
✅ 3.7.13 (b) (2 points): Using the normal() method, create a vector, eps, containing 100 observations drawn from a \(N(0, 0.25)\) distribution, that is, a normal distribution with mean zero and variance 0.25.
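Because the `scale` argument of `normal()` is a standard deviation, a target variance has to be converted with a square root first. The sketch below checks this empirically with a deliberately different variance (4.0, a hypothetical value chosen for illustration) and an arbitrary seed.

```python
import numpy as np

rng = np.random.default_rng(12345)  # arbitrary seed for this illustration

target_variance = 4.0  # hypothetical value, not the one in the exercise
draws = rng.normal(loc=0.0, scale=np.sqrt(target_variance), size=100_000)

print(draws.var())  # close to the target variance
print(draws.std())  # close to the square root of the target variance
```

Passing the variance directly as `scale` would instead produce draws whose variance is the square of the intended value.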
####YOUR CODE HERE####
✅ 3.7.13 (c) (3 points): Using x and eps, generate a vector y according to the model
\[Y = -1 + 0.5X + \varepsilon. \tag{3.39}\]
What is the length of the vector \(y\)? What are the values of \(\beta_0\) and \(\beta_1\) in this linear model?
###YOUR CODE HERE###
✅ 3.7.13 (d) (2 points): Create a scatterplot displaying the relationship between x and y. Comment on what you observe.
####YOUR CODE HERE###
✅ 3.7.13 (e) (4 points): Fit a least squares linear model to predict \(y\) using \(x\). Comment on the model obtained. How do \(\hat\beta_0\) and \(\hat\beta_1\) compare to \(\beta_0\) and \(\beta_1\)?
####YOUR CODE HERE####
✅ 3.7.13 (f) (5 points): Display the least squares line on the scatterplot obtained in (d). Draw the population regression line on the plot, in a different color. Use the legend() method of the axes to create an appropriate legend.
###YOUR CODE HERE###
✅ 3.7.13 (g) (5 points): Now fit a polynomial regression model that predicts \(y\) using \(x\) and \(x^2\). Is there evidence that the quadratic term improves the model fit? Explain your answer.
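If you use the formula interface, a squared term can be written with `I(...)`, which protects the arithmetic from being interpreted as formula syntax. The following sketch uses made-up data with a genuinely quadratic signal (the frame `demo` and its coefficients are hypothetical and differ from the exercise), so the quadratic term should come out clearly significant here.

```python
import numpy as np
import pandas as pd
import statsmodels.formula.api as smf

# Hypothetical data with a real quadratic component (unlike the exercise).
rng = np.random.default_rng(0)
demo = pd.DataFrame({"x": rng.normal(size=200)})
demo["y"] = (
    1.0 + 2.0 * demo["x"] + 1.5 * demo["x"] ** 2
    + rng.normal(scale=0.5, size=200)
)

# I(...) makes the parser treat x**2 as arithmetic, not formula syntax.
quad_fit = smf.ols("y ~ x + I(x**2)", data=demo).fit()

print(quad_fit.params)           # Intercept, x, then the squared term
print(quad_fit.pvalues.iloc[-1]) # p-value for the squared term
```

The p-value of the squared term, together with any change in \(R^2\) relative to the linear fit, is the kind of evidence the question asks you to weigh.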
###YOUR CODE AND ANSWER HERE###
✅ 3.7.13 (h) (4 points): Repeat (a)–(f) after modifying the data generation process in such a way that there is less noise in the data. The model (3.39) (Equation in 3.7.13 (c)) should remain the same. You can do this by decreasing the variance of the normal distribution used to generate the error term \(\varepsilon\) in (b). Describe your results.
###YOUR CODE AND ANSWER HERE###
✅ 3.7.13 (i) (4 points): Repeat (a)–(f) after modifying the data generation process in such a way that there is more noise in the data. The model (3.39) should remain the same. You can do this by increasing the variance of the normal distribution used to generate the error term \(\varepsilon\) in (b). Describe your results.
###YOUR CODE AND ANSWER HERE###
✅ 3.7.13 (j) (2 points): 1) What are the confidence intervals for \(\beta_0\) and \(\beta_1\) based on the original data set?
###YOUR CODE AND ANSWER HERE###
✅ 3.7.13 (j) (2 points): 2) What are the confidence intervals for \(\beta_0\) and \(\beta_1\) based on the noisier data set?
###YOUR CODE AND ANSWER HERE###
✅ 3.7.13 (j) (2 points): 3) What are the confidence intervals for \(\beta_0\) and \(\beta_1\) based on the less noisy data set? Comment on your results.
###YOUR CODE AND ANSWER HERE###
✅ 3.7.13 (j) (1 point): 4) Please comment on how the three confidence interval results above compare.
###YOUR ANSWER HERE###
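For the confidence-interval questions in (j), a fitted statsmodels result exposes a `conf_int()` method. The sketch below shows its output on made-up data (the frame `demo` and its true coefficients are hypothetical); the added `width` column is just one convenient way to compare intervals across fits.

```python
import numpy as np
import pandas as pd
import statsmodels.formula.api as smf

# Hypothetical data, for illustration only.
rng = np.random.default_rng(0)
demo = pd.DataFrame({"x": rng.normal(size=100)})
demo["y"] = 4.0 - 1.0 * demo["x"] + rng.normal(scale=1.0, size=100)

fit = smf.ols("y ~ x", data=demo).fit()

# conf_int returns one row per coefficient; the two columns are the
# lower and upper bounds. alpha=0.05 gives 95% intervals.
ci = fit.conf_int(alpha=0.05)
ci.columns = ["lower", "upper"]
ci["width"] = ci["upper"] - ci["lower"]
print(ci)
```

Repeating this for fits on data sets with different noise levels makes the change in interval width easy to read off.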
✅ 3.7.1 (6 points):
Describe the null hypotheses to which the p-values given in Table 3.4 correspond. Explain what conclusions you can draw based on these p-values. Your explanation should be phrased in terms of sales, TV, radio, and newspaper, rather than in terms of the coefficients of the linear model.

###YOUR ANSWER HERE###
✅ Question 3.7.1 Documenting Your Solution Pathway (3 points)
Answer the following questions in detail:
Indicate which portions of Question 3.7.1 you answered with prior knowledge, and which portions you answered with the help of outside sources (e.g. generative AI, past assignments, Stack Overflow, Google, etc.). (1 point)
For the parts where you DID use external resources, document what sources you used and how they informed your approach. If you used generative AI, be sure to include the most important prompts and outputs. For the parts where you DID NOT use external resources, what prior knowledge did you recall to solve the problem? (2 points)
###YOUR ANSWER HERE (You can use markdown)###
### 1.
### 2.
✅ Modified version of 3.7.9 (11 points): Using the Auto data set, we will predict \(Y = \texttt{mpg}\) using all other variables except name.
# Load data
# First, we're going to do all the data loading and cleanup we figured out last time.
url = "https://msu-cmse-courses.github.io/CMSE381-S26/_downloads/d75c3811a83a66f8c261e5b599ef9e44/Auto.csv"
auto = pd.read_csv(url)
auto = auto.replace('?', np.nan)
auto = auto.dropna()
auto.horsepower = auto.horsepower.astype('int')
auto.shape
(392, 9)
✅ Modified version of 3.7.9 (i) (3 points):
- Generate the correlation matrix between all variables. Are there any pairs that are particularly highly correlated?
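A correlation matrix can be produced directly from a pandas DataFrame. The sketch below uses a made-up frame (`toy`, with hypothetical columns `a`, `b`, `c`, and a text column `label`) to show how a non-numeric column can be skipped; it assumes a pandas version in which `corr` accepts `numeric_only`.

```python
import numpy as np
import pandas as pd

# Hypothetical toy frame with one text column, loosely mimicking mixed-type data.
rng = np.random.default_rng(0)
a = rng.normal(size=200)
toy = pd.DataFrame({
    "a": a,
    "b": 2.0 * a + rng.normal(scale=0.1, size=200),  # strongly tied to a
    "c": rng.normal(size=200),                        # unrelated to a and b
    "label": ["item"] * 200,                          # non-numeric column
})

# numeric_only=True leaves the text column out of the computation.
corr = toy.corr(numeric_only=True)
print(corr.round(2))
```

Entries near 1 or -1 off the diagonal flag pairs of variables that move closely together, which is what the question asks you to look for.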
####YOUR CODE HERE###
✅ Modified version of 3.7.9 (ii) (2 points):
- Using statsmodels, create a linear model predicting mpg from all other variables except name.
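When many predictors are involved, one option is to build the formula string from the column names rather than typing it out. This is only a sketch on a made-up frame: `toy`, its columns, and the `excluded` set are hypothetical names, and you would substitute the columns relevant to your own model.

```python
import numpy as np
import pandas as pd
import statsmodels.formula.api as smf

# Hypothetical toy frame; column names are made up for this sketch.
rng = np.random.default_rng(0)
toy = pd.DataFrame(rng.normal(size=(100, 3)), columns=["x1", "x2", "x3"])
toy["tag"] = "t"  # a column we want to leave out of the model
toy["resp"] = (
    1.0 + 0.5 * toy["x1"] - 2.0 * toy["x2"]
    + rng.normal(scale=0.3, size=100)
)

excluded = {"resp", "tag"}  # the response itself plus anything to drop
predictors = [col for col in toy.columns if col not in excluded]
formula = "resp ~ " + " + ".join(predictors)
print(formula)

multi_fit = smf.ols(formula, data=toy).fit()
print(multi_fit.params)
```

The same list of predictor names can also be used to select columns for the array-style `sm.OLS` interface if you prefer that route.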
###YOUR CODE HERE##
✅ Modified version of 3.7.9 (iii) (1 point):
Is there a relationship between the predictors and the response? Justify your answer.
##YOUR ANSWER HERE###
✅ Modified version of 3.7.9 (iv) (1 point):
- Which predictors appear to have a statistically significant relationship to the response?
###YOUR ANSWER HERE###
✅ Modified version of 3.7.9 (v) (1 point):
- What does the coefficient for the year variable suggest?
###YOUR ANSWER HERE###
✅ Question 3.7.9 Documenting Your Solution Pathway (3 points)
Answer the following questions in detail:
Indicate which portions of Question 3.7.9 you answered with prior knowledge, and which portions you answered with the help of outside sources (e.g. generative AI, past assignments, Stack Overflow, Google, etc.). (1 point)
For the parts where you DID use external resources, document what sources you used and how they informed your approach. If you used generative AI, be sure to include the most important prompts and outputs. For the parts where you DID NOT use external resources, what prior knowledge did you recall to solve the problem? (2 points)
### YOUR ANSWER HERE###
#1.
#2.