Lecture 6 - Multiple Linear Regression#
In the last few lectures, we have focused on single input linear regression, that is, fitting models of the form

$$
y = \beta_0 + \beta_1 x + \epsilon.
$$
In this lab, we will continue to use two different tools for linear regression:

- Scikit-learn is arguably the most used tool for machine learning in Python.
- Statsmodels provides many of the statistical tests we've been learning in class.
# As always, we start with our favorite standard imports.
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
%matplotlib inline
import seaborn as sns
import statsmodels.formula.api as smf
from sklearn.linear_model import LinearRegression
Multiple linear regression#
Next we get some code up and running that can do linear regression with multiple input variables, that is, when the model is of the form

$$
y = \beta_0 + \beta_1 x_1 + \beta_2 x_2 + \cdots + \beta_p x_p + \epsilon.
$$
# Load the diabetes data set and assemble it into a single data frame with the target column
from sklearn.datasets import load_diabetes
diabetes = load_diabetes(as_frame=True)
diabetes_df = pd.DataFrame(diabetes.data, columns = diabetes.feature_names)
diabetes_df['target'] = pd.Series(diabetes.target)
diabetes_df
We first model $\texttt{target} = \beta_0 + \beta_1 \cdot \texttt{s1} + \beta_2 \cdot \texttt{s5}$ using scikit-learn.
X = diabetes_df[['s1','s5']].values
y = diabetes_df['target'].values
multireg = LinearRegression() #<----- notice I'm using exactly the same command as above
multireg.fit(X,y)
print(multireg.coef_)
print(multireg.intercept_)
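As a quick sanity check, you can compare a few predictions from the fitted model to the observed values. This is a minimal sketch using the `X` and `y` defined above.

# Sketch: predict the first few rows and compare against the observed target values
print(multireg.predict(X[:5]))
print(y[:5])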
✅ Q: What are the values for \(\beta_0\), \(\beta_1\), and \(\beta_2\)? Write an interpretation for the \(\beta_2\) value in this data set.
Your answer here
We next model $\texttt{target} = \beta_0 + \beta_1 \cdot \texttt{s1} + \beta_2 \cdot \texttt{s5}$ using statsmodels. Do you get the same model?
# multiple least squares with statsmodel
multiple_est = smf.ols('target ~ s1 + s5', diabetes_df).fit()
multiple_est.summary()
✅ Q: What is the predicted model? How much trust can we place in the estimates?
Your answer here
✅ Q: Run the linear regression to predict `target` using all the other variables. What do you notice about the different terms? Are some more related than others?
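If you want a starting point, here is one minimal sketch; the name `full_est` and the formula-building step are just one way to do it.

# Sketch: build a formula using every column except `target`, then fit with statsmodels
all_features = [c for c in diabetes_df.columns if c != 'target']
full_est = smf.ols('target ~ ' + ' + '.join(all_features), diabetes_df).fit()
full_est.summary()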
Your answer here
✅ Q: Earlier you determined the p-value for the `s1` variable when we only used `s1` to predict `target`. What changed about the p-value for `s1` now that it is part of a regression using all the variables? Why?
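If it helps to see the two numbers side by side, here is a hedged sketch that refits both models and pulls out the p-value on `s1` from each.

# Sketch: compare the p-value on s1 from the single-variable fit and the all-variables fit
s1_only = smf.ols('target ~ s1', diabetes_df).fit()
all_vars = smf.ols('target ~ ' + ' + '.join(c for c in diabetes_df.columns if c != 'target'),
                   diabetes_df).fit()
print('p-value for s1 alone:        ', s1_only.pvalues['s1'])
print('p-value for s1 with all vars:', all_vars.pvalues['s1'])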
# Your answer here
Great, you got to here! Hang out for a bit, there’s more lecture before we go on to the next portion.
# We're going to use the (fake) data set used in the beginning of the book.
advertising_df = pd.read_csv('../../DataSets/Advertising.csv', index_col = 0)
advertising_df
Q1: Hypothesis Test#
✅ Do this: Use the `statsmodels` package to fit the model

$$
\texttt{Sales} = \beta_0 + \beta_1 \cdot \texttt{TV} + \beta_2\cdot \texttt{Radio} + \beta_3\cdot \texttt{Newspaper}
$$
What is the equation for the model learned?
# Your code here
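If you're not sure where to start, here is a minimal sketch. It assumes the column names in Advertising.csv match the formula above (Sales, TV, Radio, Newspaper); adjust the capitalization if your file uses lowercase names.

# Sketch: fit the three-variable advertising model with statsmodels
# Column names are assumed to match the formula above
adv_est = smf.ols('Sales ~ TV + Radio + Newspaper', advertising_df).fit()
adv_est.summary()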
✅ Do this: Use the `summary` command for the trained model class to determine the F-statistic for this model.

- What are the null and alternative hypotheses for the test this statistic is used for?
- What is your conclusion of the test given this F-score?
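The F-statistic is printed in the summary table, but you can also pull it out of the fitted results directly. This sketch assumes the fit from Q1 is stored as `adv_est`.

# Sketch: the F-statistic and its p-value are attributes of the fitted results
print('F-statistic:', adv_est.fvalue)
print('p-value:    ', adv_est.f_pvalue)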
Great, you got to here! Hang out for a bit, there’s more lecture before we go on to the next portion.
Q2: Subsets of variables#
✅ Q: List all 6 subsets of the three variables being used.
Your answer here
✅ Do this: Below is a function to get the RSS for the statsmodels linear fit. For each of the subsets listed above, what is the RSS for the learned model? Which is smallest?
def statsmodelRSS(est):
    # Returns the RSS for a fitted statsmodels OLS model (same value as est.ssr)
    return np.sum(est.resid**2)

# print(statsmodelRSS(est))
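One way to organize the comparison is to loop over the subsets with itertools and reuse the helper above. This is a sketch that assumes the column names Sales, TV, Radio, and Newspaper; it covers the six one- and two-variable subsets, since the full three-variable model was already fit in Q1.

from itertools import combinations

variables = ['TV', 'Radio', 'Newspaper']  # assumed column names; adjust to match your CSV

# Fit a model for each one- and two-variable subset and report its RSS
for k in range(1, len(variables)):
    for subset in combinations(variables, k):
        formula = 'Sales ~ ' + ' + '.join(subset)
        est = smf.ols(formula, advertising_df).fit()
        print(f'{formula:30s} RSS = {statsmodelRSS(est):.1f}')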
Congratulations, we’re done!#
Written by Dr. Liz Munch, Michigan State University
This work is licensed under a Creative Commons Attribution-NonCommercial 4.0 International License.