Jupyter - Day 6 - Section 001#

Lecture 6 - Multiple Linear Regression#

In the last few lectures, we have focused on single input linear regression, that is, fitting models of the form

\[ Y = \beta_0 + \beta_1 X + \varepsilon \]

In this lab, we will continue to use two different tools for linear regression.

  • Scikit-learn is arguably the most widely used tool for machine learning in Python.

  • Statsmodels provides many of the statistical tests we’ve been learning in class.

# As always, we start with our favorite standard imports. 

import numpy as np
import pandas as pd
import matplotlib.pyplot as plt 
%matplotlib inline
import seaborn as sns
import statsmodels.formula.api as smf
from sklearn.linear_model import LinearRegression

Multiple linear regression#

Next we get some code up and running that can do linear regression with multiple input variables, that is, when the model is of the form

\[ Y = \beta_0 + \beta_1 X_1 + \beta_2 X_2 + \cdots + \beta_p X_p + \varepsilon \]

# Load the diabetes data set that ships with scikit-learn as pandas objects.
from sklearn.datasets import load_diabetes
diabetes = load_diabetes(as_frame=True)
diabetes_df = pd.DataFrame(diabetes.data, columns = diabetes.feature_names)
diabetes_df['target'] = pd.Series(diabetes.target)


We first model \(\texttt{target} = \beta_0 + \beta_1 \cdot \texttt{s1} + \beta_2 \cdot \texttt{s5}\) using scikit-learn.

X = diabetes_df[['s1','s5']].values
y = diabetes_df['target'].values

multireg = LinearRegression()  # <----- the same estimator class we used for single-input regression
multireg.fit(X, y)  # fit target on the two columns s1 and s5
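To answer the question below, it helps to know where scikit-learn stores the estimates. The following is a minimal, self-contained sketch (it rebuilds the data frame and refits the model so it runs on its own): after fitting, `intercept_` holds \(\beta_0\) and `coef_` holds one slope per column of `X`, in the same order as the columns. The `features` list is just a helper name introduced here for illustration.

```python
import pandas as pd
from sklearn.datasets import load_diabetes
from sklearn.linear_model import LinearRegression

# Rebuild the same data frame so this cell runs on its own.
diabetes = load_diabetes(as_frame=True)
diabetes_df = diabetes.data.copy()
diabetes_df['target'] = diabetes.target

features = ['s1', 's5']  # helper list (not in the original notebook)
X = diabetes_df[features].values
y = diabetes_df['target'].values

multireg = LinearRegression()
multireg.fit(X, y)

# intercept_ is beta_0; coef_ has one entry per column of X, in column order.
print('beta_0 (intercept):', multireg.intercept_)
for name, beta in zip(features, multireg.coef_):
    print(f'coefficient on {name}:', beta)
```

Because `coef_` follows the column order of `X`, the first entry is \(\beta_1\) (for `s1`) and the second is \(\beta_2\) (for `s5`).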


Q: What are the values for \(\beta_0\), \(\beta_1\), and \(\beta_2\)? Write an interpretation for the \(\beta_2\) value in this data set.

Your answer here

We next model \(\texttt{target} = \beta_0 + \beta_1 \cdot \texttt{s1} + \beta_2 \cdot \texttt{s5}\) using statsmodels. Do you get the same model?

# Multiple least squares with statsmodels
multiple_est = smf.ols('target ~ s1 + s5', diabetes_df).fit()
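The fitted statsmodels object carries the estimates along with the quantities needed to judge them. As a rough sketch (self-contained, so it reloads the data), the attributes below are the ones most relevant to the questions that follow; `multiple_est.summary()` shows the same information in one table.

```python
import statsmodels.formula.api as smf
from sklearn.datasets import load_diabetes

# Rebuild the data frame so this cell runs on its own.
diabetes = load_diabetes(as_frame=True)
diabetes_df = diabetes.data.copy()
diabetes_df['target'] = diabetes.target

multiple_est = smf.ols('target ~ s1 + s5', diabetes_df).fit()

print(multiple_est.params)      # estimates for Intercept, s1, s5
print(multiple_est.bse)         # standard error of each estimate
print(multiple_est.pvalues)     # p-values for the t-test of each coefficient being 0
print(multiple_est.conf_int())  # confidence intervals (95% by default)
```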

Q: What is the predicted model? How much trust can we place in the estimates?

Your answer here

Q: Run the linear regression to predict target using all the other variables. What do you notice about the different terms? Are some more related than others?

Your answer here
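One possible way to set up the regression on all the other variables is to build the formula string from the column names rather than typing it out. This is only a sketch; `predictors`, `formula`, and `full_est` are names introduced here for illustration. The pairwise correlation table at the end is one simple way to look at how the inputs relate to each other.

```python
import statsmodels.formula.api as smf
from sklearn.datasets import load_diabetes

# Rebuild the data frame so this cell runs on its own.
diabetes = load_diabetes(as_frame=True)
diabetes_df = diabetes.data.copy()
diabetes_df['target'] = diabetes.target

# Every column except the response becomes a predictor.
predictors = [c for c in diabetes_df.columns if c != 'target']
formula = 'target ~ ' + ' + '.join(predictors)

full_est = smf.ols(formula, diabetes_df).fit()
print(full_est.summary().tables[1])  # coefficient table: estimate, std err, t, p-value, CI

# Pairwise correlations between the predictors.
print(diabetes_df[predictors].corr().round(2))
```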

Q: Earlier you determined the p-value for the s1 variable when we only used s1 to predict target. What changed about the p-value for s1 now that it is part of a regression using all the variables? Why?

# Your answer here
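To make the comparison concrete, the two fits can be placed side by side. The sketch below (with illustrative names `s1_alone` and `full_est`) prints the estimate, standard error, and p-value for `s1` in each model; interpreting the difference is left to you.

```python
import statsmodels.formula.api as smf
from sklearn.datasets import load_diabetes

# Rebuild the data frame so this cell runs on its own.
diabetes = load_diabetes(as_frame=True)
diabetes_df = diabetes.data.copy()
diabetes_df['target'] = diabetes.target

predictors = [c for c in diabetes_df.columns if c != 'target']

s1_alone = smf.ols('target ~ s1', diabetes_df).fit()
full_est = smf.ols('target ~ ' + ' + '.join(predictors), diabetes_df).fit()

# Same variable, two different models.
for label, est in [('s1 only', s1_alone), ('all variables', full_est)]:
    print(label,
          '| estimate:', est.params['s1'],
          '| std err:', est.bse['s1'],
          '| p-value:', est.pvalues['s1'])
```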

Great, you got to here! Hang out for a bit, there’s more lecture before we go on to the next portion.

# We're going to use the (fake) data set from the beginning of the book.
# You may need to find the data on the course website and change the path below!
advertising_df = pd.read_csv('../../../DataSets/Advertising.csv', index_col = 0)

Q1: Hypothesis Test#

Do this: Use the statsmodels package to fit the model \( \texttt{Sales} = \beta_0 + \beta_1 \cdot \texttt{TV} + \beta_2\cdot \texttt{Radio} + \beta_3\cdot \texttt{Newspaper} \). What is the equation for the model learned?

# Your code here

Do this: Use the summary() method of the fitted model to determine the F-statistic for this model.

  • What are the null and alternative hypotheses for the test this statistic is used for?

  • What is your conclusion of the test given this F-statistic?
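As a rough illustration of where the F-statistic comes from, the sketch below fits the same kind of model on synthetic stand-in data, not the Advertising file. The column names `TV`, `Radio`, `Newspaper`, and `Sales` are assumed from the model equation above (check the capitalization in your copy of the CSV), and the coefficients used to generate the data are made up. It reads the F-statistic from the fitted model and recomputes it by hand as \(F = \frac{(\mathrm{TSS} - \mathrm{RSS})/p}{\mathrm{RSS}/(n - p - 1)}\).

```python
import numpy as np
import pandas as pd
import statsmodels.formula.api as smf

# Synthetic stand-in data (NOT the Advertising data set); all numbers are made up.
rng = np.random.default_rng(0)
n = 200
fake_df = pd.DataFrame({
    'TV': rng.uniform(0, 300, n),
    'Radio': rng.uniform(0, 50, n),
    'Newspaper': rng.uniform(0, 100, n),
})
fake_df['Sales'] = (3 + 0.05 * fake_df['TV'] + 0.2 * fake_df['Radio']
                    + rng.normal(0, 1.5, n))

est = smf.ols('Sales ~ TV + Radio + Newspaper', fake_df).fit()
print(est.params)  # beta_0 (Intercept) and the three slopes
print('F-statistic:', est.fvalue, ' p-value:', est.f_pvalue)

# Recompute F by hand from the sums of squares.
rss = est.ssr            # residual sum of squares
tss = est.centered_tss   # total sum of squares around the mean of Sales
f_by_hand = ((tss - rss) / est.df_model) / (rss / est.df_resid)
print('F by hand:', f_by_hand)
```

Here `est.df_model` is \(p\), the number of predictors, and `est.df_resid` is \(n - p - 1\).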

Great, you got to here! Hang out for a bit, there’s more lecture before we go on to the next portion.

Q2: Subsets of variables#

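Q: List all 6 nonempty proper subsets of the three variables being used (that is, every choice of one or two of them; the full set of three was already fit in Q1).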

Your answer here

Do this: Below is a command to get the RSS for the statsmodels linear fit. For each of the subsets listed above, what is the RSS for the learned model? Which is smallest?
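One way to get the RSS from a statsmodels fit is the `ssr` attribute (sum of squared residuals), which equals `np.sum(est.resid**2)`. The sketch below is an assumption about how the loop over subsets could look, not necessarily the command intended for this lab. It runs on synthetic stand-in data with assumed column names `TV`, `Radio`, `Newspaper`, and `Sales`; swap in `advertising_df` (and its actual column names) to answer the question.

```python
from itertools import combinations

import numpy as np
import pandas as pd
import statsmodels.formula.api as smf

# Synthetic stand-in data (NOT the Advertising data set); all numbers are made up.
rng = np.random.default_rng(0)
n = 200
fake_df = pd.DataFrame({
    'TV': rng.uniform(0, 300, n),
    'Radio': rng.uniform(0, 50, n),
    'Newspaper': rng.uniform(0, 100, n),
})
fake_df['Sales'] = (3 + 0.05 * fake_df['TV'] + 0.2 * fake_df['Radio']
                    + rng.normal(0, 1.5, n))

variables = ['TV', 'Radio', 'Newspaper']
rss_by_subset = {}
for k in (1, 2):  # subsets of size one and two
    for subset in combinations(variables, k):
        formula = 'Sales ~ ' + ' + '.join(subset)
        est = smf.ols(formula, fake_df).fit()
        rss_by_subset[subset] = est.ssr  # residual sum of squares

for subset, rss in rss_by_subset.items():
    print(subset, rss)

best = min(rss_by_subset, key=rss_by_subset.get)
print('smallest RSS:', best)
```

Note that adding a variable to a least squares model can never increase the training RSS, so a two-variable model always has an RSS no larger than either of the one-variable models nested inside it.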


Congratulations, we’re done!#

