Lecture 5 - Assessing Coefficient Accuracy#
In today’s lecture, we are focusing on simple linear regression, that is, fitting models of the form \(y = \beta_0 + \beta_1 x + \varepsilon\).
In this lab, we will use two different tools for linear regression:
- Scikit-learn is arguably the most widely used tool for machine learning in Python.
- Statsmodels provides many of the statistical tests we’ve been learning in class.
# As always, we start with our favorite standard imports.
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
%matplotlib inline
import seaborn as sns
# Importing the Linear Regression we learned last time from sklearn
from sklearn.linear_model import LinearRegression
Simulating data#
Ok, let’s run an example like was shown in class where we see the distribution of possible values.
# Here's code that decides on my function
def myFunc(x, b0=2, b1=5):
    return b0 + b1*x
# Here's a command that generates 100 random data points from f(x) + epsilon
def makeData(n = 100):
    X = np.random.uniform(-2,2,n)
    y = myFunc(X) + np.random.normal(size = n)
    return X,y
# Every time you run this cell, you get slightly different data
X,y = makeData()
plt.scatter(X,y)
# Which means that every time you run this cell, you get a slightly different choice of coefficients
# for the model learned
X,y = makeData()
X = X.reshape([len(X),1])
y = y.reshape([len(y),1])
reg = LinearRegression()
reg.fit(X,y)
print( 'y = '+ str(round(reg.intercept_[0],4)) +' + ' + str(round(reg.coef_[0,0],4)) + " * x_1" )
✅ Q: If we assume that our data is coming from the setting \(y = f(x) + \varepsilon\) with \(\varepsilon \sim \mathcal{N}(0, \sigma^2)\), what is \(\mathrm{Var}(\varepsilon)\)?
Your answer here
# So now, let's just train our linear model lots of times, and collect the resulting coefficients
beta0_list = []
beta1_list = []
for i in range(100):
    X,y = makeData()
    X = X.reshape([len(X),1])
    y = y.reshape([len(y),1])
    reg = LinearRegression()
    reg.fit(X,y)
    beta1_list.append(reg.coef_[0,0])
    beta0_list.append(reg.intercept_[0])
print(beta1_list)
✅ Q:
- Make a histogram of beta1_list and, separately, of beta0_list.
- What is the mean of each list? How does this compare to the actual line we used to generate the data?
- What is the standard deviation of each list?
# Your code here
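If you get stuck, here is one possible approach (a sketch; it uses the beta0_list and beta1_list built above, and the true values b0 = 2 and b1 = 5 come from myFunc):
# Histograms of the estimated coefficients, side by side
fig, axes = plt.subplots(1, 2, figsize=(10, 4))
axes[0].hist(beta0_list)
axes[0].set_title('beta0 estimates')
axes[1].hist(beta1_list)
axes[1].set_title('beta1 estimates')
plt.show()
# Compare the means to the true coefficients and check the spread
print('mean of beta0_list:', np.mean(beta0_list))
print('mean of beta1_list:', np.mean(beta1_list))
print('std of beta0_list:', np.std(beta0_list, ddof=1))
print('std of beta1_list:', np.std(beta1_list, ddof=1))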
Variance in estimation#
Now let’s figure out the variance of the linear regression estimates. First off, we know that \(\sigma^2 = \mathrm{Var}(\varepsilon) = 1\), but let’s pretend we didn’t make up our own fake data.
# Start with a single linear regression model
reg = LinearRegression()
reg.fit(X,y)
print( 'y = '+ str(round(reg.intercept_[0],4)) +' + ' + str(round(reg.coef_[0,0],4)) + " * x_1" )
We can estimate \(\sigma\) using the residual standard error (RSE):
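For reference, this is the formula the next cell implements, where \(n\) is the number of data points and \(\hat y_i\) are the model’s predictions:
\(\mathrm{RSE} = \sqrt{\dfrac{\mathrm{RSS}}{n-2}} = \sqrt{\dfrac{\sum_{i=1}^n (y_i - \hat y_i)^2}{n-2}}\)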
predicted = reg.predict(X)
residuals = y - predicted
RSS = np.sum(residuals**2)
RSE = np.sqrt(RSS/(len(y)-2))
Then the following code computes the standard error of each coefficient.
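For reference, these are the standard error formulas from class that the code implements:
\(\mathrm{SE}(\hat\beta_0)^2 = \sigma^2 \left[\dfrac{1}{n} + \dfrac{\bar x^2}{\sum_{i=1}^n (x_i - \bar x)^2}\right], \qquad \mathrm{SE}(\hat\beta_1)^2 = \dfrac{\sigma^2}{\sum_{i=1}^n (x_i - \bar x)^2}\)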
# We're estimating sigma^2 by RSE^2
sigma_sq = RSE**2
# We have n = 100 data points
n = 100
# We can calculate the standard error of beta_0 and beta_1 using the formulas we learned in class
x_bar = np.mean(X)
denom = np.sum((X - x_bar)**2)
beta0_var = sigma_sq * (1/n + x_bar**2/denom)
SE_beta0 = np.sqrt(beta0_var)
print(f"Standard error of beta0: {SE_beta0}")
beta1_var = sigma_sq/denom
SE_beta1 = np.sqrt(beta1_var)
print(f"Standard error of beta1: {SE_beta1}")
While we had to work a bit to write out the standard errors, we can use the statsmodels library instead of sklearn to get these values directly.
import statsmodels.formula.api as smf
mydata = pd.DataFrame({'X':X.flatten(), 'y':y.flatten()})
linreg_smf = smf.ols('y ~ X', data = mydata).fit()
linreg_smf.summary().tables[1]
✅ Q: What are \(SE(\hat \beta_0)\) and \(SE(\hat \beta_1)\)? Are they the same as what we calculated above?
Your answer here.
Great, you got to here! Hang out for a bit, there’s more lecture before we go on to the next portion.
2. Assessing Coefficient Estimate Accuracy#
The Dataset#
We will be using the Diabetes data set again, which you looked at in the last class. In case you’ve forgotten, there is information about the data set in the documentation.
from sklearn.datasets import load_diabetes
diabetes = load_diabetes(as_frame=True)
diabetes_df = pd.DataFrame(diabetes.data, columns = diabetes.feature_names)
diabetes_df['target'] = pd.Series(diabetes.target)
diabetes_df
Like last time, we’re now going to fit simple linear regression models \(\texttt{target} = \beta_0 + \beta_1 \cdot \texttt{s1}\) and \(\texttt{target} = \beta_0 + \beta_1 \cdot \texttt{s5}\), where the variables are
- \(\texttt{s1}\): tc, total serum cholesterol
- \(\texttt{s5}\): ltg, possibly log of serum triglycerides level
Let’s start by looking at using s5 to predict target.
Just for completeness, here’s our code to do linear regression from last time using sklearn.
from sklearn.linear_model import LinearRegression
# sklearn actually likes being handed numpy arrays more than
# pandas dataframes, so we'll extract the bits we want and just pass it that.
X = diabetes_df['s5'].values
X = X.reshape([len(X),1])
y = diabetes_df['target'].values
y = y.reshape([len(y),1])
# This code works by first creating an instance of
# the linear regression class
reg = LinearRegression()
# Then we pass in the data we want it to use to fit.
reg.fit(X,y)
# and we can get the coefficients we want out of the model from the following code.
print(reg.coef_)
print(reg.intercept_)
# I can do some fancy printing if I really want to
lineString = str(round(reg.coef_[0,0],4)) + "x_1 + " + str(round(reg.intercept_[0],4))
print( 'y = ', lineString)
However, today we’re interested in statistical tests, so we’ll be using the statsmodels package. It has more options for statistical tests, but fewer models available, which is why we will be using a bit of both in this class.
import statsmodels.formula.api as smf
Instructor note: There is a difference here; the book uses
import statsmodels.api as sm
model = sm.OLS(y, X)
which requires you to explicitly include an intercept column in X. On homework problems and the like you can use this code or follow the book; either is acceptable.
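For reference, here is a minimal sketch of the book’s version, assuming the X and y arrays from the sklearn cell above (X_const and model are just illustrative names):
import statsmodels.api as sm
# add_constant prepends a column of ones so the model includes an intercept
X_const = sm.add_constant(X)
model = sm.OLS(y, X_const).fit()
model.summary().tables[1]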
# Notice that the code is intentionally written to look
# more like R than like Python, but it still works!
# Double check..... the coefficients here should be
# about the same as those found by scikit-learn
est = smf.ols('target ~ s5', diabetes_df).fit()
est.summary().tables[1]
✅ Q: What are \(SE(\hat \beta_0)\) and \(SE(\hat \beta_1)\)?
Your answer here.
✅ Q: If we instead use s1 to predict the target, are \(SE(\hat \beta_0)\) and \(SE(\hat \beta_1)\) higher or lower than what you found for the s5 prediction? Is this reasonable? Try plotting your predictions against scatter plots of the data to compare.
# Your code here.
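If you want a starting point, here is one possible sketch (est_s1 and xs are just illustrative names; it reuses smf, diabetes_df, np, pd, and plt from above):
est_s1 = smf.ols('target ~ s1', diabetes_df).fit()
print(est_s1.summary().tables[1])
# Scatter plot of the data with the fitted line on top
xs = np.sort(diabetes_df['s1'].values)
plt.scatter(diabetes_df['s1'], diabetes_df['target'], alpha=0.3)
plt.plot(xs, est_s1.predict(pd.DataFrame({'s1': xs})), color='red')
plt.xlabel('s1')
plt.ylabel('target')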
✅ Q: What are the confidence intervals for \(\hat \beta_1\) in the two cases (the prediction using s1 and the prediction using s5)? Which is wider, and why?
Hint: Check out the est.conf_int command, or you can find this in the summary tables you’ve been using earlier.
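For example, for the s5 fit from above (the alpha argument sets the significance level, so alpha=0.05 gives 95% intervals):
est.conf_int(alpha=0.05)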
Your answer here.
✅ Q: What is the conclusion of the hypothesis test \(H_0: \beta_1 = 0\) vs. \(H_a: \beta_1 \neq 0\) at a significance level of \(\alpha = 0.05\)?
Your answer here
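If you’d rather pull the p-values out programmatically than read them off the table, one option (a sketch, using the est fit from above):
print(est.pvalues)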
Oh hey look, there’s another table with information stored by the statsmodels results class.
est.summary().tables[0]
✅ Q: What is \(R^2\) for the two models?
Your answer here
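You can also grab this directly from the fitted results rather than reading the table (a sketch, using the est fit from above):
print(est.rsquared)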
Congratulations, we’re done!#
Written by Dr. Liz Munch, Michigan State University
This work is licensed under a Creative Commons Attribution-NonCommercial 4.0 International License.