Lecture 5 - Assessing Coefficient Accuracy#
In today’s lecture, we are focusing on simple linear regression, that is, fitting models of the form \(y = \beta_0 + \beta_1 x + \varepsilon\).
In this lab, we will use two different tools for linear regression:
- Scikit-learn is arguably the most widely used tool for machine learning in Python.
- Statsmodels provides many of the statistical tests we’ve been learning in class.
# As always, we start with our favorite standard imports.
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
%matplotlib inline
import seaborn as sns
# Importing the Linear Regression we learned last time from sklearn
from sklearn.linear_model import LinearRegression
Simulating data#
Ok, let’s run an example like was shown in class where we see the distribution of possible values.
# Here's code that decides on my function
def myFunc(x, b0=2, b1=5):
    return b0 + b1*x
# Here's a command that generates 100 random data points from f(x) + epsilon
def makeData(n = 100):
    X = np.random.uniform(-2,2,n)
    y = myFunc(X) + np.random.normal(size = n)
    return X,y
# Every time you run this cell, you get slightly different data
X,y = makeData()
plt.scatter(X,y)
# Which means that every time you run this cell, you get a slightly different choice of coefficients
# for the model learned
X,y = makeData()
X = X.reshape([len(X),1])
y = y.reshape([len(y),1])
reg = LinearRegression()
reg.fit(X,y)
print( 'y = '+ str(round(reg.intercept_[0],4)) +' + ' + str(round(reg.coef_[0,0],4)) + " * x_1" )
✅ Q: If we assume that our data is coming from the setting \(y = f(x) + \varepsilon\) with \(\varepsilon \sim \mathcal{N}(0, \sigma^2)\), what is \(\mathrm{Var}(\varepsilon)\)?
Your answer here
# So now, let's just train our linear model lots of times, and collect the resulting coefficients
beta0_list = []
beta1_list = []
for i in range(100):
    X,y = makeData()
    X = X.reshape([len(X),1])
    y = y.reshape([len(y),1])
    reg = LinearRegression()
    reg.fit(X,y)
    beta1_list.append(reg.coef_[0,0])
    beta0_list.append(reg.intercept_[0])
print(beta1_list)
✅ Q:
- Make a histogram of beta1_list and, separately, of beta0_list.
- What is the mean of each list? How does this compare to the actual line we used to generate the data?
- What is the standard deviation of each list?
# Your code here
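If you get stuck, here is one possible approach (a sketch; it uses the beta0_list and beta1_list built above, and the true values b0 = 2 and b1 = 5 come from myFunc):
# Histograms of the estimated coefficients, side by side
fig, axes = plt.subplots(1, 2, figsize=(10, 4))
axes[0].hist(beta0_list)
axes[0].set_title('beta0 estimates')
axes[1].hist(beta1_list)
axes[1].set_title('beta1 estimates')
plt.show()
# Compare the means to the true coefficients and check the spread
print('mean of beta0_list:', np.mean(beta0_list))
print('mean of beta1_list:', np.mean(beta1_list))
print('std of beta0_list:', np.std(beta0_list, ddof=1))
print('std of beta1_list:', np.std(beta1_list, ddof=1))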
Variance in estimation#
Now let’s figure out the variance of the linear regression estimates. First off, we know that \(\sigma^2 = \mathrm{Var}(\varepsilon) = 1\), but let’s pretend we didn’t make up our own fake data.
# Start with a single linear regression model
reg = LinearRegression()
reg.fit(X,y)
print( 'y = '+ str(round(reg.intercept_[0],4)) +' + ' + str(round(reg.coef_[0,0],4)) + " * x_1" )
We can estimate \(\sigma\) using the residual standard error (RSE):
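For reference, this is the formula the next cell implements, where \(n\) is the number of data points and \(\hat y_i\) are the model’s predictions:
\(\mathrm{RSE} = \sqrt{\dfrac{\mathrm{RSS}}{n-2}} = \sqrt{\dfrac{\sum_{i=1}^n (y_i - \hat y_i)^2}{n-2}}\)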
predicted = reg.predict(X)
residuals = y - predicted
RSS = np.sum(residuals**2)
RSE = np.sqrt(RSS/(len(y)-2))
Then the following code computes the standard error of each coefficient.
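For reference, these are the standard error formulas from class that the code implements:
\(\mathrm{SE}(\hat\beta_0)^2 = \sigma^2 \left[\dfrac{1}{n} + \dfrac{\bar x^2}{\sum_{i=1}^n (x_i - \bar x)^2}\right], \qquad \mathrm{SE}(\hat\beta_1)^2 = \dfrac{\sigma^2}{\sum_{i=1}^n (x_i - \bar x)^2}\)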
# We're estimating sigma^2 by RSE^2
sigma_sq = RSE**2
# We have n = 100 data points
n = 100
# We can calculate the standard error of beta_0 and beta_1 using the formulas we learned in class
x_bar = np.mean(X)
denom = np.sum((X - x_bar)**2)
beta0_var = sigma_sq * (1/n + x_bar**2/denom)
SE_beta0 = np.sqrt(beta0_var)
print(f"Standard error of beta0: {SE_beta0}")
beta1_var = sigma_sq/denom
SE_beta1 = np.sqrt(beta1_var)
print(f"Standard error of beta1: {SE_beta1}")
While we had to work a bit to write out the standard errors, we can use the statsmodels library instead of sklearn to get these values directly.
import statsmodels.formula.api as smf
mydata = pd.DataFrame({'X':X.flatten(), 'y':y.flatten()})
linreg_smf = smf.ols('y ~ X', data = mydata).fit()
linreg_smf.summary().tables[1]
✅ Q: What are \(SE(\hat \beta_0)\) and \(SE(\hat \beta_1)\)? Are they the same as what we calculated above?
Your answer here.
Great, you got to here! Hang out for a bit, there’s more lecture before we go on to the next portion.
2. Assessing Coefficient Estimate Accuracy#
The Dataset#
We will be using the Diabetes data set again, which you looked at in the last class. In case you’ve forgotten, there is information about the data set in the documentation.
from sklearn.datasets import load_diabetes
diabetes = load_diabetes(as_frame=True)
diabetes_df = pd.DataFrame(diabetes.data, columns = diabetes.feature_names)
diabetes_df['target'] = pd.Series(diabetes.target)
diabetes_df
Like last time, we’re now going to fit simple linear regression models \(\texttt{target} = \beta_0 + \beta_1 \cdot \texttt{s1}\) and \(\texttt{target} = \beta_0 + \beta_1 \cdot \texttt{s5}\), where the variables are
- \(\texttt{s1}\): tc, total serum cholesterol
- \(\texttt{s5}\): ltg, possibly log of serum triglycerides level
Let’s start by looking at using s5 to predict target.
Just for completeness, here’s our code to do linear regression from last time using sklearn.
from sklearn.linear_model import LinearRegression
# sklearn actually likes being handed numpy arrays more than
# pandas dataframes, so we'll extract the bits we want and just pass it that.
X = diabetes_df['s5'].values
X = X.reshape([len(X),1])
y = diabetes_df['target'].values
y = y.reshape([len(y),1])
# This code works by first creating an instance of
# the linear regression class
reg = LinearRegression()
# Then we pass in the data we want it to use to fit.
reg.fit(X,y)
# and we can get the coefficients we want out of the model from the following code.
print(reg.coef_)
print(reg.intercept_)
# I can do some fancy printing if I really want to
lineString = str(round(reg.coef_[0,0],4)) + "x_1 + " + str(round(reg.intercept_[0],4))
print( 'y = ', lineString)
However, today we’re interested in statistical tests, so we’ll be using the statsmodels package. It has more options for statistical tests, but fewer models available, which is why we will be using a bit of both in this class.
import statsmodels.formula.api as smf
Instructor note: There is a difference here; the book uses
import statsmodels.api as sm
model = sm.OLS(y, X)
which requires you to explicitly include an intercept column in X. On homework problems and the like you can use this code or follow the book; either is acceptable.
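For reference, here is a minimal sketch of the book’s version, assuming the X and y arrays from the sklearn cell above (X_const and model are just illustrative names):
import statsmodels.api as sm
# add_constant prepends a column of ones so the model includes an intercept
X_const = sm.add_constant(X)
model = sm.OLS(y, X_const).fit()
model.summary().tables[1]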
# Notice that the code is intentionally written to look
# more like R than like Python, but it still works!
# Double check..... the coefficients here should be
# about the same as those found by scikit-learn
est = smf.ols('target ~ s5', diabetes_df).fit()
est.summary().tables[1]
✅ Q: What are \(SE(\hat \beta_0)\) and \(SE(\hat \beta_1)\)?
Your answer here.
✅ Q: If we instead use s1 to predict the target, are \(SE(\hat \beta_0)\) and \(SE(\hat \beta_1)\) higher or lower than what you found for the s5 prediction? Is this reasonable? Try plotting your predictions against scatter plots of the data to compare.
# Your code here.
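If you want a starting point, here is one possible sketch (est_s1 and xs are just illustrative names; it reuses smf, diabetes_df, np, pd, and plt from above):
est_s1 = smf.ols('target ~ s1', diabetes_df).fit()
print(est_s1.summary().tables[1])
# Scatter plot of the data with the fitted line on top
xs = np.sort(diabetes_df['s1'].values)
plt.scatter(diabetes_df['s1'], diabetes_df['target'], alpha=0.3)
plt.plot(xs, est_s1.predict(pd.DataFrame({'s1': xs})), color='red')
plt.xlabel('s1')
plt.ylabel('target')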
✅ Q: What are the confidence intervals for \(\hat \beta_1\) in the two cases (the prediction using s1 and the prediction using s5)? Which is wider, and why?
Hint: Check out the est.conf_int command, or you can find this in the summary tables you’ve been using earlier.
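For example, for the s5 fit from above (the alpha argument sets the significance level, so alpha=0.05 gives 95% intervals):
est.conf_int(alpha=0.05)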
Your answer here.
✅ Q: What is the conclusion of the hypothesis test \(H_0: \beta_1 = 0\) vs. \(H_a: \beta_1 \neq 0\) at a significance level of \(\alpha = 0.05\)?
Your answer here
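If you’d rather pull the p-values out programmatically than read them off the table, one option (a sketch, using the est fit from above):
print(est.pvalues)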
Oh hey look, there’s another table with information stored by the statsmodels results class.
est.summary().tables[0]
✅ Q: What is \(R^2\) for the two models?
Your answer here
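You can also grab this directly from the fitted results rather than reading the table (a sketch, using the est fit from above):
print(est.rsquared)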
Congratulations, we’re done!#
Written by Dr. Liz Munch, Michigan State University
This work is licensed under a Creative Commons Attribution-NonCommercial 4.0 International License.