Jupyter - Lecture 7

Jupyter - Lecture 7#

Even More Linear Regression#

In the last few lectures, we have focused on linear regression, that is, fitting models of the form

\[ Y = \beta_0 + \beta_1 X_1 + \beta_2 X_2 + \cdots + \beta_pX_p + \varepsilon \]

In this lab, we will continue to use two different tools for linear regression.

Scikit learn is arguably the most used tool for machine learning in python
Statsmodels provides many of the statisitcial tests we’ve been learning in class

This lab will cover two ideas:

Categorical variables and how to represent them as dummy variables.
How to build interaction terms and pass them into your favorite model.

# As always, we start with our favorite standard imports. 

import numpy as np
import pandas as pd
import matplotlib.pyplot as plt 
%matplotlib inline
import seaborn as sns
import statsmodels.formula.api as smf

More questions to ask of your model (Continued from last time)#

Q3: How well does the model fit?#

This section talks about interpretation of $R^2$ and RSE, but is on slides only.

Q4: Making predictions#

advertising_df = pd.read_csv('../../DataSets/Advertising.csv', index_col = 0)
# I need to sort the rows by TV to make plotting work better later 
advertising_df= advertising_df.sort_values(by=['TV'])
advertising_df.head()

# I want to just learn Sales using TV
est = smf.ols('Sales ~ TV', advertising_df).fit()
est.params

# Here is a table giving us the CI and PI information 
alpha = 0.1
# alpha = 0.05
# alpha = 0.01
# alpha = 0.001
advert_summary = est.get_prediction(advertising_df).summary_frame(alpha=alpha)
advert_summary.head()

# And here is some code that will draw these beasts for us....
plt.rcParams['figure.figsize'] = [15, 15]
plt.rcParams['font.size'] = 16


x = advertising_df['TV']
y = advertising_df['Sales']


# Plot the original data 
plt.scatter(x,y, label = 'Data')

# Plot the fitted values, AKA f_hat
# you can get this in two different places, same answer
# plt.plot(x,advert_summary['mean'], color = 'orange', label = 'f hat')
plt.plot(x,est.fittedvalues, color = 'orange', label = 'f hat')



plt.plot(x,advert_summary['obs_ci_lower'], 'r', lw=2, 
         label = r'Prediction Band')
plt.plot(x,advert_summary['obs_ci_upper'], 'r', lw=2)

plt.plot(x, advert_summary['mean_ci_lower'],'g--', lw=1, 
         label = r'Confidence Region')
plt.plot(x, advert_summary['mean_ci_upper'], 'g--', lw=1)

plt.title('Alpha = '+ str(alpha))



plt.legend()

Playing with multi-level variables#

The wrong way#

Ok, we’re going to do this incorrectly to start. You pull in the Auto data set. You were so proud of yourself for remembering to fix the problems with the horsepower column that you conveniently forgot that the column with information about country of origin (origin) has a bunch of integers in it, representing:

1: American
2: European
3: Japanese.

Auto_df = pd.read_csv('../../DataSets/Auto.csv')
Auto_df = Auto_df.replace('?', np.nan)
Auto_df = Auto_df.dropna()
Auto_df.horsepower = Auto_df.horsepower.astype('int')


Auto_df.columns

You then go on your merry way building the model $$ \texttt{mpg} = \beta_0 + \beta_1 \cdot \texttt{origin}. $$

from sklearn.linear_model import LinearRegression

X = Auto_df.origin.values
X = X.reshape(-1, 1)
y = Auto_df.mpg.values

regr = LinearRegression()

regr.fit(X,y)

print('beta_1 = ', regr.coef_[0])
print('beta_0 = ', regr.intercept_)

✅ Q: What does your model predict for each of the three types of cars?

# Your code here

✅ Q: Is it possible for your model to predict that both American and Japanese cars have mpg below European cars?

Your answer here.

The right way#

Ok, so you figure out your problem and decide to load in your data and fix the origin column to have names as entries.

convertOrigin= {1: 'American', 2:'European', 3:'Japanese'}

# This command swaps out each number n for convertOrigin[n], making it one of
# the three strings instead of an integer now.
Auto_df.origin = Auto_df.origin.apply(lambda n: convertOrigin[n])
Auto_df

Below is a quick code that automatically generates our dummy variables. Yay for not having to code that mess ourselves!

origin_dummies_df = pd.get_dummies(Auto_df.origin, prefix='origin')
origin_dummies_df

✅ Q: What is the interpretation of each column in the origin_dummies_df data frame?

Your answer here

I pass these new dummy variables into my scikit-learn linear regression model and get the following coefficients

X = origin_dummies_df.values
y = Auto_df.mpg

regr = LinearRegression()

regr.fit(X,y)

print('Coefs = ', regr.coef_)
print('Intercept = ', regr.intercept_)

✅ Q: Now what does your model predict for each of the three types of cars?

# Your code here

Ooops#

✅ Q: Aw man, I didn’t quite do what we said for the dummy variables in class. We talked about having only two dummy variables for a three level variable. Copy my code below here and fix it to have two variables instead of three.

Are your coefficients different now?
Are your predictions for each of the three origins different now?
Does it matter which two levels you used for your dummy variables?

# Your code here

Another right way#

Ok, fine, I’ll cave, I made you do it the hard way but you got to see how the innards worked, so maybe it’s not all bad ;)

First off, we can force sklearn to drop the first variable, so you don’t have to do it manually every time. But you do need to know how to interpret the outputs!

# Even easier right way.... Note the only difference is the drop_first=True
origin_dummies_df = pd.get_dummies(Auto_df.origin, prefix='origin',drop_first=True)
print(origin_dummies_df.head())

y = Auto_df.mpg
regr = LinearRegression()
regr.fit(X,y)

In statsmodels, it can automatically split up the categorical variables in a data frame, so it does the hard work for you. Note that here I’m plugging in the original Auto_df data frame, no processing of the categorical variables on my end at all.

est = smf.ols('mpg ~ origin', Auto_df).fit()
est.summary().tables[1]

✅ Q: What is the model learned from the above printout? Be specific in terms of your dummy variables.

Your answer here

Congratulations, we’re done!#

Written by Dr. Liz Munch, Michigan State University
This work is licensed under a Creative Commons Attribution-NonCommercial 4.0 International License.