Lecture 6 - Multiple Linear Regression#
In the last few lectures, we have focused on single-input linear regression, that is, fitting models of the form

$$
y = \beta_0 + \beta_1 x + \varepsilon.
$$
In this lab, we will continue to use two different tools for linear regression:

- Scikit-learn is arguably the most used tool for machine learning in Python.
- Statsmodels provides many of the statistical tests we've been learning in class.
# As always, we start with our favorite standard imports.
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
%matplotlib inline
import seaborn as sns
import statsmodels.formula.api as smf
from sklearn.linear_model import LinearRegression
Multiple linear regression#
Next we get some code up and running that can do linear regression with multiple input variables, that is, when the model is of the form

$$
y = \beta_0 + \beta_1 x_1 + \beta_2 x_2 + \cdots + \beta_p x_p + \varepsilon.
$$
from sklearn.datasets import load_diabetes
diabetes = load_diabetes(as_frame=True)
diabetes_df = pd.DataFrame(diabetes.data, columns = diabetes.feature_names)
diabetes_df['target'] = pd.Series(diabetes.target)
diabetes_df
| | age | sex | bmi | bp | s1 | s2 | s3 | s4 | s5 | s6 | target |
|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | 0.038076 | 0.050680 | 0.061696 | 0.021872 | -0.044223 | -0.034821 | -0.043401 | -0.002592 | 0.019907 | -0.017646 | 151.0 |
| 1 | -0.001882 | -0.044642 | -0.051474 | -0.026328 | -0.008449 | -0.019163 | 0.074412 | -0.039493 | -0.068332 | -0.092204 | 75.0 |
| 2 | 0.085299 | 0.050680 | 0.044451 | -0.005670 | -0.045599 | -0.034194 | -0.032356 | -0.002592 | 0.002861 | -0.025930 | 141.0 |
| 3 | -0.089063 | -0.044642 | -0.011595 | -0.036656 | 0.012191 | 0.024991 | -0.036038 | 0.034309 | 0.022688 | -0.009362 | 206.0 |
| 4 | 0.005383 | -0.044642 | -0.036385 | 0.021872 | 0.003935 | 0.015596 | 0.008142 | -0.002592 | -0.031988 | -0.046641 | 135.0 |
| ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... |
| 437 | 0.041708 | 0.050680 | 0.019662 | 0.059744 | -0.005697 | -0.002566 | -0.028674 | -0.002592 | 0.031193 | 0.007207 | 178.0 |
| 438 | -0.005515 | 0.050680 | -0.015906 | -0.067642 | 0.049341 | 0.079165 | -0.028674 | 0.034309 | -0.018114 | 0.044485 | 104.0 |
| 439 | 0.041708 | 0.050680 | -0.015906 | 0.017293 | -0.037344 | -0.013840 | -0.024993 | -0.011080 | -0.046883 | 0.015491 | 132.0 |
| 440 | -0.045472 | -0.044642 | 0.039062 | 0.001215 | 0.016318 | 0.015283 | -0.028674 | 0.026560 | 0.044529 | -0.025930 | 220.0 |
| 441 | -0.045472 | -0.044642 | -0.073030 | -0.081413 | 0.083740 | 0.027809 | 0.173816 | -0.039493 | -0.004222 | 0.003064 | 57.0 |
442 rows × 11 columns
We first model \(\texttt{target} = \beta_0 + \beta_1 \cdot \texttt{s1} + \beta_2 \cdot \texttt{s5}\) using scikit-learn.
X = diabetes_df[['s1','s5']].values
y = diabetes_df['target'].values
multireg = LinearRegression() #<----- notice I'm using exactly the same command as above
multireg.fit(X,y)
print(multireg.coef_)
print(multireg.intercept_)
[-175.71107989 1006.71695008]
152.13348416289585
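As a sanity check, we can rebuild a prediction by hand from the fitted intercept and coefficients and compare it to what `.predict()` returns. This is a minimal sketch that reloads the data so it runs on its own:

```python
import numpy as np
from sklearn.datasets import load_diabetes
from sklearn.linear_model import LinearRegression

# Reload the diabetes data; .frame includes the 'target' column
df = load_diabetes(as_frame=True).frame

X = df[['s1', 's5']].values
y = df['target'].values

reg = LinearRegression().fit(X, y)

# Build the prediction for the first row by hand:
# beta_0 + beta_1 * s1 + beta_2 * s5
manual = reg.intercept_ + reg.coef_[0] * X[0, 0] + reg.coef_[1] * X[0, 1]
print(np.isclose(manual, reg.predict(X[:1])[0]))  # True
```

The agreement confirms that `coef_` and `intercept_` are exactly the \(\beta\) values of the linear model.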
✅ Q: What are the values for \(\beta_0\), \(\beta_1\), and \(\beta_2\)?
###Your answer here###
✅ Q: Write an interpretation for the \(\beta_2\) value in this data set.
###Your Answer Here###
✅ Q: We next model \(\texttt{target} = \beta_0 + \beta_1 \cdot \texttt{s1} + \beta_2 \cdot \texttt{s5}\) using statsmodels. Do you get the same model?
# multiple least squares with statsmodel
multiple_est = smf.ols('target ~ s1 + s5', diabetes_df).fit()
multiple_est.summary()
| Dep. Variable: | target | R-squared: | 0.329 |
|---|---|---|---|
| Model: | OLS | Adj. R-squared: | 0.326 |
| Method: | Least Squares | F-statistic: | 107.6 |
| Date: | Sun, 25 Jan 2026 | Prob (F-statistic): | 9.63e-39 |
| Time: | 17:22:44 | Log-Likelihood: | -2459.0 |
| No. Observations: | 442 | AIC: | 4924. |
| Df Residuals: | 439 | BIC: | 4936. |
| Df Model: | 2 | | |
| Covariance Type: | nonrobust | | |
| | coef | std err | t | P>\|t\| | [0.025 | 0.975] |
|---|---|---|---|---|---|---|
| Intercept | 152.1335 | 3.011 | 50.528 | 0.000 | 146.216 | 158.051 |
| s1 | -175.7111 | 73.872 | -2.379 | 0.018 | -320.898 | -30.524 |
| s5 | 1006.7170 | 73.872 | 13.628 | 0.000 | 861.530 | 1151.904 |
| Omnibus: | 10.295 | Durbin-Watson: | 2.025 |
|---|---|---|---|
| Prob(Omnibus): | 0.006 | Jarque-Bera (JB): | 10.497 |
| Skew: | 0.356 | Prob(JB): | 0.00526 |
| Kurtosis: | 2.748 | Cond. No. | 30.2 |
Notes:
[1] Standard Errors assume that the covariance matrix of the errors is correctly specified.
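One way to check numerically whether the two libraries agree is to compare statsmodels' fitted parameters directly against scikit-learn's. A self-contained sketch:

```python
import numpy as np
import statsmodels.formula.api as smf
from sklearn.datasets import load_diabetes
from sklearn.linear_model import LinearRegression

df = load_diabetes(as_frame=True).frame

# Fit the same two-variable model with both libraries
sk = LinearRegression().fit(df[['s1', 's5']], df['target'])
sm_est = smf.ols('target ~ s1 + s5', df).fit()

# The intercept and slopes should match to numerical precision
print(np.isclose(sk.intercept_, sm_est.params['Intercept']))   # True
print(np.allclose(sk.coef_, sm_est.params[['s1', 's5']].values))  # True
```

Both libraries are solving the same least-squares problem, so any differences are only floating-point noise.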
####YOUR ANSWER HERE###
✅ Q: What is the predicted model?
##Your ANSWER HERE##
✅ Q: How much trust can we place in the estimates?
###YOUR ANSWER HERE###
✅ Q: Run the linear regression to predict target using all the other variables.
##Your code here###
###YOUR ANSWER HERE###
✅ Q: What do you notice about the different terms? Are some more related than others?
###YOUR ANSWER HERE###
✅ Q: Earlier you determined the p-value for the s1 variable when we only used s1 to predict target. What changed about the p-value for s1 now where it is part of a regression using all the variables. Why?
### Your code here for p-value###
##YOUR ANSWER HERE##
Great, you got to here! Hang out for a bit, there’s more lecture before we go on to the next portion.
# We're going to use the (fake) data set used in the beginning of the book.
url = "https://msu-cmse-courses.github.io/CMSE381-S26/_downloads/27085b33898cca7fb57dc6897c2deccd/Advertising.csv"
advertising_df = pd.read_csv(url, index_col = 0)
advertising_df
| | TV | Radio | Newspaper | Sales |
|---|---|---|---|---|
| 1 | 230.1 | 37.8 | 69.2 | 22.1 |
| 2 | 44.5 | 39.3 | 45.1 | 10.4 |
| 3 | 17.2 | 45.9 | 69.3 | 9.3 |
| 4 | 151.5 | 41.3 | 58.5 | 18.5 |
| 5 | 180.8 | 10.8 | 58.4 | 12.9 |
| ... | ... | ... | ... | ... |
| 196 | 38.2 | 3.7 | 13.8 | 7.6 |
| 197 | 94.2 | 4.9 | 8.1 | 9.7 |
| 198 | 177.0 | 9.3 | 6.4 | 12.8 |
| 199 | 283.6 | 42.0 | 66.2 | 25.5 |
| 200 | 232.1 | 8.6 | 8.7 | 13.4 |
200 rows × 4 columns
Q1: Hypothesis Test#
✅ Do this: Use the statsmodels package to fit the model
$$
\texttt{Sales} = \beta_0 + \beta_1 \cdot \texttt{TV} + \beta_2\cdot \texttt{Radio} + \beta_3\cdot \texttt{Newspaper}
$$
What is the equation for the model learned?
##Your code here###
##YOUR ANSWER HERE##
✅ Do this: Use the summary command for the trained model class to determine the F-statistic for this model.
##YOUR CODE HERE##
✅ Do this: What are the null and alternative hypotheses for the test this statistic is used for?
###YOUR ANSWER HERE##
✅ Do this: What is your conclusion of the test given this F-score?
##YOUR ANSWER HERE##
Great, you got to here! Hang out for a bit, there’s more lecture before we go on to the next portion.
Q2: Subsets of variables#
✅ Q: List all 6 subsets of sizes 1 and 2 of the three variables being used (the size-3 model is the full fit from Q1).
Your answer here
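If you would rather enumerate the subsets programmatically than by hand, `itertools.combinations` is one way to do it (a sketch; only sizes 1 and 2 are listed, since the full three-variable model was already fit in Q1):

```python
from itertools import combinations

variables = ['TV', 'Radio', 'Newspaper']

# All subsets of sizes 1 and 2 of the three variables
subsets = [list(c) for r in (1, 2) for c in combinations(variables, r)]
for s in subsets:
    print(s)
print(len(subsets))  # 6
```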
✅ Do this: Below is a command to get the RSS for the statsmodel linear fit. For each of the subsets listed above, what is the RSS for the learned model? Which is smallest?
def statsmodelRSS(est):
    # Returns the RSS (sum of squared residuals) for a fitted statsmodels OLS result
    return np.sum(est.resid**2)

# Note: `est` must be a model you have already fit with smf.ols(...).fit();
# the built-in attribute est.ssr returns the same value as statsmodelRSS(est).
est.ssr
# print(statsmodelRSS(est))
556.8252629021874
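As a quick check that the helper matches statsmodels' built-in `.ssr` attribute, here is a self-contained sketch using the diabetes data from earlier (so no file download is needed):

```python
import numpy as np
import statsmodels.formula.api as smf
from sklearn.datasets import load_diabetes

df = load_diabetes(as_frame=True).frame

def statsmodelRSS(est):
    # RSS = sum of squared residuals of a fitted OLS model
    return np.sum(est.resid**2)

est = smf.ols('target ~ s1 + s5', df).fit()
print(np.isclose(statsmodelRSS(est), est.ssr))  # True
```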
##YOUR ANSWER HERE##
Congratulations, we’re done!#
Initially created by Dr. Liz Munch, modified by Dr. Lianzhang Bao and Dr. Firas Khasawneh, Michigan State University.

This work is licensed under a Creative Commons Attribution-NonCommercial 4.0 International License.