Lecture 6 - Multiple Linear Regression#

In the last few lectures, we have focused on single input linear regression, that is, fitting models of the form

\[ Y = \beta_0 + \beta_1 X + \varepsilon \]

In this lab, we will continue to use two different tools for linear regression.

  • Scikit-learn is arguably the most widely used tool for machine learning in Python

  • Statsmodels provides many of the statistical tests we’ve been learning in class

# As always, we start with our favorite standard imports. 

import numpy as np
import pandas as pd
import matplotlib.pyplot as plt 
%matplotlib inline
import seaborn as sns
import statsmodels.formula.api as smf
from sklearn.linear_model import LinearRegression

Multiple linear regression#

Next we get some code up and running that can do linear regression with multiple input variables, that is, when the model is of the form

\[ Y = \beta_0 + \beta_1 X_1 + \beta_2 X_2 + \cdots + \beta_pX_p + \varepsilon \]
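Both scikit-learn and statsmodels solve the same least-squares problem under the hood. As a minimal sketch (using synthetic data with made-up coefficients, not the lab's data set), the coefficients can be recovered directly with NumPy by appending a column of ones to the inputs for the intercept:

```python
import numpy as np

rng = np.random.default_rng(0)
n = 200
X1 = rng.normal(size=n)
X2 = rng.normal(size=n)
# Synthetic response with known coefficients: beta_0 = 2, beta_1 = 3, beta_2 = -1
y = 2 + 3 * X1 - 1 * X2 + rng.normal(scale=0.1, size=n)

# Design matrix: a column of ones (intercept) plus the two input columns
A = np.column_stack([np.ones(n), X1, X2])
beta_hat, *_ = np.linalg.lstsq(A, y, rcond=None)
print(beta_hat)  # close to [2, 3, -1]
```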
from sklearn.datasets import load_diabetes
diabetes = load_diabetes(as_frame=True)
diabetes_df = pd.DataFrame(diabetes.data, columns = diabetes.feature_names)
diabetes_df['target'] = pd.Series(diabetes.target)

diabetes_df
age sex bmi bp s1 s2 s3 s4 s5 s6 target
0 0.038076 0.050680 0.061696 0.021872 -0.044223 -0.034821 -0.043401 -0.002592 0.019907 -0.017646 151.0
1 -0.001882 -0.044642 -0.051474 -0.026328 -0.008449 -0.019163 0.074412 -0.039493 -0.068332 -0.092204 75.0
2 0.085299 0.050680 0.044451 -0.005670 -0.045599 -0.034194 -0.032356 -0.002592 0.002861 -0.025930 141.0
3 -0.089063 -0.044642 -0.011595 -0.036656 0.012191 0.024991 -0.036038 0.034309 0.022688 -0.009362 206.0
4 0.005383 -0.044642 -0.036385 0.021872 0.003935 0.015596 0.008142 -0.002592 -0.031988 -0.046641 135.0
... ... ... ... ... ... ... ... ... ... ... ...
437 0.041708 0.050680 0.019662 0.059744 -0.005697 -0.002566 -0.028674 -0.002592 0.031193 0.007207 178.0
438 -0.005515 0.050680 -0.015906 -0.067642 0.049341 0.079165 -0.028674 0.034309 -0.018114 0.044485 104.0
439 0.041708 0.050680 -0.015906 0.017293 -0.037344 -0.013840 -0.024993 -0.011080 -0.046883 0.015491 132.0
440 -0.045472 -0.044642 0.039062 0.001215 0.016318 0.015283 -0.028674 0.026560 0.044529 -0.025930 220.0
441 -0.045472 -0.044642 -0.073030 -0.081413 0.083740 0.027809 0.173816 -0.039493 -0.004222 0.003064 57.0

442 rows × 11 columns
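One thing worth noticing about this data set (a side observation, not required for the lab): `load_diabetes` returns features that are already standardized — each column is mean-centered and rescaled so that its sum of squares is 1. A quick check:

```python
import numpy as np
import pandas as pd
from sklearn.datasets import load_diabetes

diabetes = load_diabetes(as_frame=True)
df = pd.DataFrame(diabetes.data, columns=diabetes.feature_names)

# Each feature column has mean ~0 ...
print(df.mean().round(6))
# ... and sum of squares 1 (each column is divided by std * sqrt(n))
print((df**2).sum().round(6))
```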

We first model \(\texttt{target} = \beta_0 + \beta_1 \cdot \texttt{s1} + \beta_2 \cdot \texttt{s5}\) using scikit-learn.

X = diabetes_df[['s1','s5']].values
y = diabetes_df['target'].values

multireg = LinearRegression() #<----- notice I'm using exactly the same command as above
multireg.fit(X,y)

print(multireg.coef_)
print(multireg.intercept_)
[-175.71107989 1006.71695008]
152.13348416289585
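Once the model is fit, `predict` evaluates \(\hat\beta_0 + \hat\beta_1 x_1 + \hat\beta_2 x_2\) for new inputs. A self-contained sketch (the input values 0.02 and -0.01 are made up for illustration; they are on the data set's standardized scale):

```python
import numpy as np
from sklearn.datasets import load_diabetes
from sklearn.linear_model import LinearRegression

diabetes = load_diabetes(as_frame=True)
X = diabetes.data[['s1', 's5']].values
y = diabetes.target.values

multireg = LinearRegression().fit(X, y)

# A hypothetical new observation (s1, s5)
x_new = np.array([[0.02, -0.01]])

# predict() is just intercept_ + x @ coef_
manual = multireg.intercept_ + x_new @ multireg.coef_
print(multireg.predict(x_new), manual)
```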

Q: What are the values for \(\beta_0\), \(\beta_1\), and \(\beta_2\)?

###Your answer here###

Q: Write an interpretation for the \(\beta_2\) value in this data set.

###Your Answer Here###

Q: We next model \(\texttt{target} = \beta_0 + \beta_1 \cdot \texttt{s1} + \beta_2 \cdot \texttt{s5}\) using statsmodels. Do you get the same model?

# multiple least squares with statsmodel
multiple_est = smf.ols('target ~ s1 + s5', diabetes_df).fit()
multiple_est.summary()
OLS Regression Results
Dep. Variable: target R-squared: 0.329
Model: OLS Adj. R-squared: 0.326
Method: Least Squares F-statistic: 107.6
Date: Sun, 25 Jan 2026 Prob (F-statistic): 9.63e-39
Time: 17:22:44 Log-Likelihood: -2459.0
No. Observations: 442 AIC: 4924.
Df Residuals: 439 BIC: 4936.
Df Model: 2
Covariance Type: nonrobust
coef std err t P>|t| [0.025 0.975]
Intercept 152.1335 3.011 50.528 0.000 146.216 158.051
s1 -175.7111 73.872 -2.379 0.018 -320.898 -30.524
s5 1006.7170 73.872 13.628 0.000 861.530 1151.904
Omnibus: 10.295 Durbin-Watson: 2.025
Prob(Omnibus): 0.006 Jarque-Bera (JB): 10.497
Skew: 0.356 Prob(JB): 0.00526
Kurtosis: 2.748 Cond. No. 30.2


Notes:
[1] Standard Errors assume that the covariance matrix of the errors is correctly specified.
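Reading numbers off the printed summary works, but the fitted results object also exposes them as attributes, which can be handy for the questions below. A short sketch (refitting the same model so the cell is self-contained):

```python
import statsmodels.formula.api as smf
from sklearn.datasets import load_diabetes

# .frame holds the features plus the 'target' column in one DataFrame
df = load_diabetes(as_frame=True).frame

est = smf.ols('target ~ s1 + s5', df).fit()

print(est.params)      # the fitted beta values
print(est.pvalues)     # p-value for each coefficient
print(est.rsquared)    # the R-squared from the summary table
print(est.conf_int())  # 95% confidence intervals
```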

###YOUR ANSWER HERE###

Q: What is the predicted model?

##Your ANSWER HERE##

Q: How much trust can we place in the estimates?

###YOUR ANSWER HERE### 

Q: Run the linear regression to predict target using all the other variables.

##Your code here###
###YOUR ANSWER HERE###

Q: What do you notice about the different terms? Are some more related than others?

###YOUR ANSWER HERE###

Q: Earlier you determined the p-value for the s1 variable when we only used s1 to predict target. What changed about the p-value for s1 now that it is part of a regression using all the variables? Why?

### Your code here for p-value###
##YOUR ANSWER HERE##


Great, you got to here! Hang out for a bit, there’s more lecture before we go on to the next portion.

# We're going to use the (fake) data set used in the beginning of the book. 
url = "https://msu-cmse-courses.github.io/CMSE381-S26/_downloads/27085b33898cca7fb57dc6897c2deccd/Advertising.csv"
advertising_df = pd.read_csv(url, index_col = 0)
advertising_df
TV Radio Newspaper Sales
1 230.1 37.8 69.2 22.1
2 44.5 39.3 45.1 10.4
3 17.2 45.9 69.3 9.3
4 151.5 41.3 58.5 18.5
5 180.8 10.8 58.4 12.9
... ... ... ... ...
196 38.2 3.7 13.8 7.6
197 94.2 4.9 8.1 9.7
198 177.0 9.3 6.4 12.8
199 283.6 42.0 66.2 25.5
200 232.1 8.6 8.7 13.4

200 rows × 4 columns
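Before fitting anything, a quick look at the pairwise correlations can hint at which advertising budgets track `Sales` most closely, and at how correlated the predictors are with each other. A sketch using `DataFrame.corr()` on the same CSV loaded above:

```python
import pandas as pd

url = "https://msu-cmse-courses.github.io/CMSE381-S26/_downloads/27085b33898cca7fb57dc6897c2deccd/Advertising.csv"
advertising_df = pd.read_csv(url, index_col=0)

# Pairwise Pearson correlations between all four columns
corr = advertising_df.corr()
print(corr['Sales'].sort_values(ascending=False))
```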

Q1: Hypothesis Test#

Do this: Use the statsmodels package to fit the model

\[ \texttt{Sales} = \beta_0 + \beta_1 \cdot \texttt{TV} + \beta_2\cdot \texttt{Radio} + \beta_3\cdot \texttt{Newspaper} \]

What is the equation for the model learned?

##Your code here###
##YOUR ANSWER HERE##

Do this: Use the summary command for the trained model class to determine the F-statistic for this model.

##YOUR CODE HERE##

Do this: What are the null and alternative hypotheses for the test this statistic is used for?

###YOUR ANSWER HERE###

Do this: What is your conclusion of the test given this F-score?

##YOUR ANSWER HERE##


Great, you got to here! Hang out for a bit, there’s more lecture before we go on to the next portion.

Q2: Subsets of variables#

Q: List all 6 subsets of the three variables being used (that is, every non-empty subset except the full set of all three, which you already fit above).

Your answer here

Do this: Below is a command to get the RSS for the statsmodel linear fit. For each of the subsets listed above, what is the RSS for the learned model? Which is smallest?

def statsmodelRSS(est):
    # Returns the RSS for a fitted statsmodels OLS results object
    return np.sum(est.resid**2)

# print(statsmodelRSS(est))
# Equivalently, statsmodels stores the RSS directly on the fitted results:
est.ssr
556.8252629021874
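As a sanity check on the helper (using the full three-variable model, whose RSS is the 556.83 shown above; the six subsets themselves are left for you):

```python
import numpy as np
import pandas as pd
import statsmodels.formula.api as smf

url = "https://msu-cmse-courses.github.io/CMSE381-S26/_downloads/27085b33898cca7fb57dc6897c2deccd/Advertising.csv"
advertising_df = pd.read_csv(url, index_col=0)

def statsmodelRSS(est):
    # Returns the RSS for a fitted statsmodels OLS results object
    return np.sum(est.resid**2)

est = smf.ols('Sales ~ TV + Radio + Newspaper', advertising_df).fit()
print(statsmodelRSS(est))
# The helper agrees with the built-in attribute est.ssr
```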
##YOUR ANSWER HERE##

Congratulations, we’re done!#

Initially created by Dr. Liz Munch, modified by Dr. Lianzhang Bao and Dr. Firas Khasawneh, Michigan State University.
This work is licensed under a Creative Commons Attribution-NonCommercial 4.0 International License.