Lecture 6 - Multiple Linear Regression#
In the last few lectures, we have focused on single input linear regression, that is, fitting models of the form

$$
y = \beta_0 + \beta_1 x + \epsilon.
$$
In this lab, we will continue to use two different tools for linear regression:

- Scikit-learn is arguably the most used tool for machine learning in Python.
- Statsmodels provides many of the statistical tests we've been learning in class.
# As always, we start with our favorite standard imports.
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
%matplotlib inline
import seaborn as sns
import statsmodels.formula.api as smf
from sklearn.linear_model import LinearRegression
Multiple linear regression#
Next we get some code up and running that can do linear regression with multiple input variables, that is, when the model is of the form

$$
y = \beta_0 + \beta_1 x_1 + \beta_2 x_2 + \cdots + \beta_p x_p + \epsilon.
$$
# Load the diabetes data set and assemble it into a single data frame with the target column
from sklearn.datasets import load_diabetes
diabetes = load_diabetes(as_frame=True)
diabetes_df = pd.DataFrame(diabetes.data, columns = diabetes.feature_names)
diabetes_df['target'] = pd.Series(diabetes.target)
diabetes_df
We first model $\texttt{target} = \beta_0 + \beta_1 \cdot \texttt{s1} + \beta_2 \cdot \texttt{s5}$ using scikit-learn.
X = diabetes_df[['s1','s5']].values
y = diabetes_df['target'].values
multireg = LinearRegression() #<----- notice I'm using exactly the same command as above
multireg.fit(X,y)
print(multireg.coef_)
print(multireg.intercept_)
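As a quick sanity check, you can compare a few predictions from the fitted model to the observed values. This is a minimal sketch using the `X` and `y` defined above.

# Sketch: predict the first few rows and compare against the observed target values
print(multireg.predict(X[:5]))
print(y[:5])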
✅ Q: What are the values for \(\beta_0\), \(\beta_1\), and \(\beta_2\)? Write an interpretation for the \(\beta_2\) value in this data set.
Your answer here
We next model $\texttt{target} = \beta_0 + \beta_1 \cdot \texttt{s1} + \beta_2 \cdot \texttt{s5}$ using statsmodels. Do you get the same model?
# multiple least squares with statsmodel
multiple_est = smf.ols('target ~ s1 + s5', diabetes_df).fit()
multiple_est.summary()
✅ Q: What is the predicted model? How much trust can we place in the estimates?
Your answer here
✅ Q: Run the linear regression to predict `target` using all the other variables. What do you notice about the different terms? Are some more related than others?
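If you want a starting point, here is one minimal sketch; the name `full_est` and the formula-building step are just one way to do it.

# Sketch: build a formula using every column except `target`, then fit with statsmodels
all_features = [c for c in diabetes_df.columns if c != 'target']
full_est = smf.ols('target ~ ' + ' + '.join(all_features), diabetes_df).fit()
full_est.summary()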
Your answer here
✅ Q: Earlier you determined the p-value for the `s1` variable when we only used `s1` to predict `target`. What changed about the p-value for `s1` now that it is part of a regression using all the variables? Why?
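If it helps to see the two numbers side by side, here is a hedged sketch that refits both models and pulls out the p-value on `s1` from each.

# Sketch: compare the p-value on s1 from the single-variable fit and the all-variables fit
s1_only = smf.ols('target ~ s1', diabetes_df).fit()
all_vars = smf.ols('target ~ ' + ' + '.join(c for c in diabetes_df.columns if c != 'target'),
                   diabetes_df).fit()
print('p-value for s1 alone:        ', s1_only.pvalues['s1'])
print('p-value for s1 with all vars:', all_vars.pvalues['s1'])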
# Your answer here
Great, you got to here! Hang out for a bit, there’s more lecture before we go on to the next portion.
# We're going to use the (fake) data set used in the beginning of the book.
advertising_df = pd.read_csv('../../DataSets/Advertising.csv', index_col = 0)
advertising_df
Q1: Hypothesis Test#
✅ Do this: Use the `statsmodels` package to fit the model

$$
\texttt{Sales} = \beta_0 + \beta_1 \cdot \texttt{TV} + \beta_2\cdot \texttt{Radio} + \beta_3\cdot \texttt{Newspaper}
$$
What is the equation for the model learned?
# Your code here
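If you're not sure where to start, here is a minimal sketch. It assumes the column names in Advertising.csv match the formula above (Sales, TV, Radio, Newspaper); adjust the capitalization if your file uses lowercase names.

# Sketch: fit the three-variable advertising model with statsmodels
# Column names are assumed to match the formula above
adv_est = smf.ols('Sales ~ TV + Radio + Newspaper', advertising_df).fit()
adv_est.summary()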
✅ Do this: Use the `summary` command for the trained model class to determine the F-statistic for this model.

- What are the null and alternative hypotheses for the test this statistic is used for?
- What is your conclusion of the test given this F-score?
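The F-statistic is printed in the summary table, but you can also pull it out of the fitted results directly. This sketch assumes the fit from Q1 is stored as `adv_est`.

# Sketch: the F-statistic and its p-value are attributes of the fitted results
print('F-statistic:', adv_est.fvalue)
print('p-value:    ', adv_est.f_pvalue)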
Great, you got to here! Hang out for a bit, there’s more lecture before we go on to the next portion.
Q2: Subsets of variables#
✅ Q: List all 6 subsets of the three variables being used.
Your answer here
✅ Do this: Below is a function to get the RSS for the statsmodels linear fit. For each of the subsets listed above, what is the RSS for the learned model? Which is smallest?
def statsmodelRSS(est):
    # Returns the RSS for a fitted statsmodels OLS model (same value as est.ssr)
    return np.sum(est.resid**2)

# print(statsmodelRSS(est))
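One way to organize the comparison is to loop over the subsets with itertools and reuse the helper above. This is a sketch that assumes the column names Sales, TV, Radio, and Newspaper; it covers the six one- and two-variable subsets, since the full three-variable model was already fit in Q1.

from itertools import combinations

variables = ['TV', 'Radio', 'Newspaper']  # assumed column names; adjust to match your CSV

# Fit a model for each one- and two-variable subset and report its RSS
for k in range(1, len(variables)):
    for subset in combinations(variables, k):
        formula = 'Sales ~ ' + ' + '.join(subset)
        est = smf.ols(formula, advertising_df).fit()
        print(f'{formula:30s} RSS = {statsmodelRSS(est):.1f}')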
Congratulations, we’re done!#
Written by Dr. Liz Munch, Michigan State University
This work is licensed under a Creative Commons Attribution-NonCommercial 4.0 International License.