Lecture 6 - Multiple Linear Regression#

In the last few lectures, we have focused on single input linear regression, that is, fitting models of the form

\[ Y = \beta_0 + \beta_1 X + \varepsilon \]

In this lab, we will continue to use two different tools for linear regression.

  • Scikit-learn is arguably the most widely used tool for machine learning in Python

  • Statsmodels provides many of the statistical tests we’ve been learning in class

# As always, we start with our favorite standard imports. 

import numpy as np
import pandas as pd
import matplotlib.pyplot as plt 
%matplotlib inline
import seaborn as sns
import statsmodels.formula.api as smf
from sklearn.linear_model import LinearRegression

Multiple linear regression#

Next we get some code up and running that can do linear regression with multiple input variables, that is, when the model is of the form

\[ Y = \beta_0 + \beta_1 X_1 + \beta_2 X_2 + \cdots + \beta_pX_p + \varepsilon \]
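Both scikit-learn and statsmodels solve the same least-squares problem under the hood. As a minimal sketch (using synthetic data with made-up coefficients, not the lab's data set), the coefficients can be recovered directly with NumPy by appending a column of ones to the inputs for the intercept:

```python
import numpy as np

rng = np.random.default_rng(0)
n = 200
X1 = rng.normal(size=n)
X2 = rng.normal(size=n)
# Synthetic response with known coefficients: beta_0 = 2, beta_1 = 3, beta_2 = -1
y = 2 + 3 * X1 - 1 * X2 + rng.normal(scale=0.1, size=n)

# Design matrix: a column of ones (intercept) plus the two input columns
A = np.column_stack([np.ones(n), X1, X2])
beta_hat, *_ = np.linalg.lstsq(A, y, rcond=None)
print(beta_hat)  # close to [2, 3, -1]
```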
from sklearn.datasets import load_diabetes
diabetes = load_diabetes(as_frame=True)
diabetes_df = pd.DataFrame(diabetes.data, columns = diabetes.feature_names)
diabetes_df['target'] = pd.Series(diabetes.target)

diabetes_df
age sex bmi bp s1 s2 s3 s4 s5 s6 target
0 0.038076 0.050680 0.061696 0.021872 -0.044223 -0.034821 -0.043401 -0.002592 0.019907 -0.017646 151.0
1 -0.001882 -0.044642 -0.051474 -0.026328 -0.008449 -0.019163 0.074412 -0.039493 -0.068332 -0.092204 75.0
2 0.085299 0.050680 0.044451 -0.005670 -0.045599 -0.034194 -0.032356 -0.002592 0.002861 -0.025930 141.0
3 -0.089063 -0.044642 -0.011595 -0.036656 0.012191 0.024991 -0.036038 0.034309 0.022688 -0.009362 206.0
4 0.005383 -0.044642 -0.036385 0.021872 0.003935 0.015596 0.008142 -0.002592 -0.031988 -0.046641 135.0
... ... ... ... ... ... ... ... ... ... ... ...
437 0.041708 0.050680 0.019662 0.059744 -0.005697 -0.002566 -0.028674 -0.002592 0.031193 0.007207 178.0
438 -0.005515 0.050680 -0.015906 -0.067642 0.049341 0.079165 -0.028674 0.034309 -0.018114 0.044485 104.0
439 0.041708 0.050680 -0.015906 0.017293 -0.037344 -0.013840 -0.024993 -0.011080 -0.046883 0.015491 132.0
440 -0.045472 -0.044642 0.039062 0.001215 0.016318 0.015283 -0.028674 0.026560 0.044529 -0.025930 220.0
441 -0.045472 -0.044642 -0.073030 -0.081413 0.083740 0.027809 0.173816 -0.039493 -0.004222 0.003064 57.0

442 rows × 11 columns
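One thing worth noticing about this data set (a side observation, not required for the lab): `load_diabetes` returns features that are already standardized — each column is mean-centered and rescaled so that its sum of squares is 1. A quick check:

```python
import numpy as np
import pandas as pd
from sklearn.datasets import load_diabetes

diabetes = load_diabetes(as_frame=True)
df = pd.DataFrame(diabetes.data, columns=diabetes.feature_names)

# Each feature column has mean ~0 ...
print(df.mean().round(6))
# ... and sum of squares 1 (each column is divided by std * sqrt(n))
print((df**2).sum().round(6))
```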

We first model \(\texttt{target} = \beta_0 + \beta_1 \cdot \texttt{s1} + \beta_2 \cdot \texttt{s5}\) using scikit-learn.

X = diabetes_df[['s1','s5']].values
y = diabetes_df['target'].values

multireg = LinearRegression() #<----- notice I'm using exactly the same command as above
multireg.fit(X,y)

print(multireg.coef_)
print(multireg.intercept_)
[-175.71107989 1006.71695008]
152.13348416289585
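Once the model is fit, `predict` evaluates \(\hat\beta_0 + \hat\beta_1 x_1 + \hat\beta_2 x_2\) for new inputs. A self-contained sketch (the input values 0.02 and -0.01 are made up for illustration; they are on the data set's standardized scale):

```python
import numpy as np
from sklearn.datasets import load_diabetes
from sklearn.linear_model import LinearRegression

diabetes = load_diabetes(as_frame=True)
X = diabetes.data[['s1', 's5']].values
y = diabetes.target.values

multireg = LinearRegression().fit(X, y)

# A hypothetical new observation (s1, s5)
x_new = np.array([[0.02, -0.01]])

# predict() is just intercept_ + x @ coef_
manual = multireg.intercept_ + x_new @ multireg.coef_
print(multireg.predict(x_new), manual)
```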

Q: What are the values for \(\beta_0\), \(\beta_1\), and \(\beta_2\)?

###Your answer here###

Q: Write an interpretation for the \(\beta_2\) value in this data set.

###Your Answer Here###

Q: We next model \(\texttt{target} = \beta_0 + \beta_1 \cdot \texttt{s1} + \beta_2 \cdot \texttt{s5}\) using statsmodels. Do you get the same model?

# multiple least squares with statsmodel
multiple_est = smf.ols('target ~ s1 + s5', diabetes_df).fit()
multiple_est.summary()
OLS Regression Results
Dep. Variable: target R-squared: 0.329
Model: OLS Adj. R-squared: 0.326
Method: Least Squares F-statistic: 107.6
Date: Sun, 25 Jan 2026 Prob (F-statistic): 9.63e-39
Time: 17:22:44 Log-Likelihood: -2459.0
No. Observations: 442 AIC: 4924.
Df Residuals: 439 BIC: 4936.
Df Model: 2
Covariance Type: nonrobust
coef std err t P>|t| [0.025 0.975]
Intercept 152.1335 3.011 50.528 0.000 146.216 158.051
s1 -175.7111 73.872 -2.379 0.018 -320.898 -30.524
s5 1006.7170 73.872 13.628 0.000 861.530 1151.904
Omnibus: 10.295 Durbin-Watson: 2.025
Prob(Omnibus): 0.006 Jarque-Bera (JB): 10.497
Skew: 0.356 Prob(JB): 0.00526
Kurtosis: 2.748 Cond. No. 30.2


Notes:
[1] Standard Errors assume that the covariance matrix of the errors is correctly specified.
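Reading numbers off the printed summary works, but the fitted results object also exposes them as attributes, which can be handy for the questions below. A short sketch (refitting the same model so the cell is self-contained):

```python
import statsmodels.formula.api as smf
from sklearn.datasets import load_diabetes

# .frame holds the features plus the 'target' column in one DataFrame
df = load_diabetes(as_frame=True).frame

est = smf.ols('target ~ s1 + s5', df).fit()

print(est.params)      # the fitted beta values
print(est.pvalues)     # p-value for each coefficient
print(est.rsquared)    # the R-squared from the summary table
print(est.conf_int())  # 95% confidence intervals
```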

###YOUR ANSWER HERE###

Q: What is the predicted model?

##Your ANSWER HERE##

Q: How much trust can we place in the estimates?

###YOUR ANSWER HERE### 

Q: Run the linear regression to predict target using all the other variables.

##Your code here###
###YOUR ANSWER HERE###

Q: What do you notice about the different terms? Are some more related than others?

###YOUR ANSWER HERE###

Q: Earlier you determined the p-value for the s1 variable when we only used s1 to predict target. What changed about the p-value for s1 now that it is part of a regression using all the variables? Why?

### Your code here for p-value###
##YOUR ANSWER HERE##


Great, you got to here! Hang out for a bit, there’s more lecture before we go on to the next portion.

# We're going to use the (fake) data set used in the beginning of the book. 
url = "https://msu-cmse-courses.github.io/CMSE381-S26/_downloads/27085b33898cca7fb57dc6897c2deccd/Advertising.csv"
advertising_df = pd.read_csv(url, index_col = 0)
advertising_df
TV Radio Newspaper Sales
1 230.1 37.8 69.2 22.1
2 44.5 39.3 45.1 10.4
3 17.2 45.9 69.3 9.3
4 151.5 41.3 58.5 18.5
5 180.8 10.8 58.4 12.9
... ... ... ... ...
196 38.2 3.7 13.8 7.6
197 94.2 4.9 8.1 9.7
198 177.0 9.3 6.4 12.8
199 283.6 42.0 66.2 25.5
200 232.1 8.6 8.7 13.4

200 rows × 4 columns
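Before fitting anything, a quick look at the pairwise correlations can hint at which advertising budgets track `Sales` most closely, and at how correlated the predictors are with each other. A sketch using `DataFrame.corr()` on the same CSV loaded above:

```python
import pandas as pd

url = "https://msu-cmse-courses.github.io/CMSE381-S26/_downloads/27085b33898cca7fb57dc6897c2deccd/Advertising.csv"
advertising_df = pd.read_csv(url, index_col=0)

# Pairwise Pearson correlations between all four columns
corr = advertising_df.corr()
print(corr['Sales'].sort_values(ascending=False))
```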

Q1: Hypothesis Test#

Do this: Use the statsmodels package to fit the model

\[ \texttt{Sales} = \beta_0 + \beta_1 \cdot \texttt{TV} + \beta_2\cdot \texttt{Radio} + \beta_3\cdot \texttt{Newspaper} \]

What is the equation for the model learned?

##Your code here###
##YOUR ANSWER HERE##

Do this: Use the summary command for the trained model class to determine the F-statistic for this model.

##YOUR CODE HERE##

Do this: What are the null and alternative hypotheses for the test this statistic is used for?

###YOUR ANSWER HERE###

Do this: What is your conclusion of the test given this F-score?

##YOUR ANSWER HERE##


Great, you got to here! Hang out for a bit, there’s more lecture before we go on to the next portion.

Q2: Subsets of variables#

Q: List all 6 subsets of the three variables being used (that is, every non-empty subset except the full set of all three, which you already fit above).

Your answer here

Do this: Below is a command to get the RSS for the statsmodel linear fit. For each of the subsets listed above, what is the RSS for the learned model? Which is smallest?

def statsmodelRSS(est):
    # Returns the RSS for a fitted statsmodels OLS results object
    return np.sum(est.resid**2)

# print(statsmodelRSS(est))
# Equivalently, statsmodels stores the RSS directly on the fitted results:
est.ssr
556.8252629021874
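As a sanity check on the helper (using the full three-variable model, whose RSS is the 556.83 shown above; the six subsets themselves are left for you):

```python
import numpy as np
import pandas as pd
import statsmodels.formula.api as smf

url = "https://msu-cmse-courses.github.io/CMSE381-S26/_downloads/27085b33898cca7fb57dc6897c2deccd/Advertising.csv"
advertising_df = pd.read_csv(url, index_col=0)

def statsmodelRSS(est):
    # Returns the RSS for a fitted statsmodels OLS results object
    return np.sum(est.resid**2)

est = smf.ols('Sales ~ TV + Radio + Newspaper', advertising_df).fit()
print(statsmodelRSS(est))
# The helper agrees with the built-in attribute est.ssr
```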
##YOUR ANSWER HERE##

Congratulations, we’re done!#

Initially created by Dr. Liz Munch, modified by Dr. Lianzhang Bao and Dr. Firas Khasawneh, Michigan State University.
This work is licensed under a Creative Commons Attribution-NonCommercial 4.0 International License.