Lecture 6 - Multiple Linear Regression#

In the last few lectures, we have focused on single input linear regression, that is, fitting models of the form

\[ Y = \beta_0 + \beta_1 X + \varepsilon \]

In this lab, we will continue to use two different tools for linear regression.

  • Scikit-learn is arguably the most widely used tool for machine learning in Python

  • Statsmodels provides many of the statistical tests we’ve been learning in class

# As always, we start with our favorite standard imports. 

import numpy as np
import pandas as pd
import matplotlib.pyplot as plt 
%matplotlib inline
import seaborn as sns
import statsmodels.formula.api as smf
from sklearn.linear_model import LinearRegression

Multiple linear regression#

Next we get some code up and running that can do linear regression with multiple input variables, that is when the model is of the form

\[ Y = \beta_0 + \beta_1 X_1 + \beta_2 X_2 + \cdots + \beta_pX_p + \varepsilon \]
from sklearn.datasets import load_diabetes
diabetes = load_diabetes(as_frame=True)
diabetes_df = pd.DataFrame(diabetes.data, columns = diabetes.feature_names)
diabetes_df['target'] = pd.Series(diabetes.target)

diabetes_df

We first fit the model \( \texttt{target} = \beta_0 + \beta_1 \cdot \texttt{s1} + \beta_2 \cdot \texttt{s5} \) using scikit-learn.

X = diabetes_df[['s1','s5']].values
y = diabetes_df['target'].values

multireg = LinearRegression() #<----- notice I'm using exactly the same command as above
multireg.fit(X,y)

print(multireg.coef_)
print(multireg.intercept_)
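As a quick sanity check on what `coef_` and `intercept_` mean, a prediction from the fitted model is just the intercept plus the dot product of the coefficients with a row of inputs. A minimal sketch (re-loading the data so the cell is self-contained):

```python
import numpy as np
from sklearn.datasets import load_diabetes
from sklearn.linear_model import LinearRegression

diabetes = load_diabetes(as_frame=True)
X = diabetes.data[['s1', 's5']].values
y = diabetes.target.values

multireg = LinearRegression().fit(X, y)

# A prediction is intercept + coef . x for any row x
row = X[0]
manual = multireg.intercept_ + multireg.coef_ @ row
print(np.isclose(manual, multireg.predict(row.reshape(1, -1))[0]))
```

This is exactly the equation \( \hat{y} = \hat\beta_0 + \hat\beta_1 \cdot \texttt{s1} + \hat\beta_2 \cdot \texttt{s5} \) written in code.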

Q: What are the values for \(\beta_0\), \(\beta_1\), and \(\beta_2\)? Write an interpretation for the \(\beta_2\) value in this data set.

Your answer here

We next fit the model \( \texttt{target} = \beta_0 + \beta_1 \cdot \texttt{s1} + \beta_2 \cdot \texttt{s5} \) using statsmodels. Do you get the same model?

# multiple least squares with statsmodel
multiple_est = smf.ols('target ~ s1 + s5', diabetes_df).fit()
multiple_est.summary()
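Both libraries are solving the same least squares problem, so the estimates should agree up to numerical precision. A sketch of checking this directly, pulling the statsmodels coefficients out of the `params` attribute:

```python
import numpy as np
import statsmodels.formula.api as smf
from sklearn.datasets import load_diabetes
from sklearn.linear_model import LinearRegression

diabetes = load_diabetes(as_frame=True)
df = diabetes.data.copy()
df['target'] = diabetes.target

sk = LinearRegression().fit(df[['s1', 's5']], df['target'])
sm_est = smf.ols('target ~ s1 + s5', df).fit()

# sm_est.params is a Series indexed by 'Intercept', 's1', 's5'
print(np.allclose(
    [sm_est.params['Intercept'], sm_est.params['s1'], sm_est.params['s5']],
    [sk.intercept_, *sk.coef_],
))
```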

Q: What is the predicted model? How much trust can we place in the estimates?

Your answer here

Q: Run the linear regression to predict target using all the other variables. What do you notice about the different terms? Are some more related than others?

Your answer here
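One way to set up the regression on all ten predictors without typing every name is to build the formula string programmatically (a sketch; the conclusions about the terms are still yours to draw from the summary):

```python
import statsmodels.formula.api as smf
from sklearn.datasets import load_diabetes

diabetes = load_diabetes(as_frame=True)
df = diabetes.data.copy()
df['target'] = diabetes.target

# Build 'target ~ age + sex + bmi + ...' from the column names
predictors = [c for c in df.columns if c != 'target']
formula = 'target ~ ' + ' + '.join(predictors)
full_est = smf.ols(formula, df).fit()
print(len(full_est.params))  # intercept plus ten coefficients
```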

Q: Earlier you determined the p-value for the s1 variable when we used only s1 to predict target. What changed about the p-value for s1 now that it is part of a regression using all the variables? Why?

Your answer here


Great, you got to here! Hang out for a bit, there’s more lecture before we go on to the next portion.

# We're going to use the (fake) data set used in the beginning of the book. 
advertising_df = pd.read_csv('../../DataSets/Advertising.csv', index_col = 0)
advertising_df

Q1: Hypothesis Test#

Do this: Use the statsmodels package to fit the model

\[ \texttt{Sales} = \beta_0 + \beta_1 \cdot \texttt{TV} + \beta_2\cdot \texttt{Radio} + \beta_3\cdot \texttt{Newspaper} \]

What is the equation for the model learned?

# Your code here

Do this: Use the summary method of the fitted model to determine the F-statistic for this model.

  • What are the null and alternative hypotheses for the test this statistic is used for?

  • What is your conclusion of the test given this F-score?


Great, you got to here! Hang out for a bit, there’s more lecture before we go on to the next portion.

Q2: Subsets of variables#

Q: List all 6 subsets of the three variables being used.

Your answer here

Do this: Below is a function to compute the RSS for a fitted statsmodels model. For each of the subsets listed above, what is the RSS for the learned model? Which is smallest?

def statsmodelRSS(est):
    # Returns the RSS for a fitted statsmodels OLS results object.
    # (statsmodels also stores this directly as est.ssr.)
    return np.sum(est.resid**2)

# Example usage, once you have a fitted model called est:
# print(statsmodelRSS(est))
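Rather than fitting the six models one at a time, you can loop over the subsets with `itertools.combinations`. A sketch of the pattern, shown here on a small synthetic frame so it runs on its own (swap in `advertising_df` and its `Sales` column for the actual exercise):

```python
from itertools import combinations

import numpy as np
import pandas as pd
import statsmodels.formula.api as smf

# Synthetic stand-in for the Advertising data
rng = np.random.default_rng(0)
df = pd.DataFrame(rng.normal(size=(50, 3)), columns=['TV', 'Radio', 'Newspaper'])
df['Sales'] = 3 + 2 * df['TV'] + rng.normal(size=50)

results = {}
for k in (1, 2):  # the six proper, non-empty subsets of three variables
    for subset in combinations(['TV', 'Radio', 'Newspaper'], k):
        est = smf.ols('Sales ~ ' + ' + '.join(subset), df).fit()
        results[subset] = np.sum(est.resid**2)  # same value as est.ssr

for subset, rss in sorted(results.items(), key=lambda item: item[1]):
    print(subset, rss)
```

The formula string is built with `' + '.join(subset)`, so the same loop works for any collection of column names.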

Congratulations, we’re done!#

Written by Dr. Liz Munch, Michigan State University.
This work is licensed under a Creative Commons Attribution-NonCommercial 4.0 International License.