Jupyter Notebook#
Lecture 8: The Last of the Linear Regression#
In the last few lectures, we have focused on linear regression, that is, fitting models of the form $Y = \beta_0 + \beta_1 X_1 + \cdots + \beta_p X_p + \varepsilon$.
In this lab, we will continue to use two different tools for linear regression.
Scikit-learn is arguably the most used tool for machine learning in Python.
Statsmodels provides many of the statistical tests we’ve been learning in class.
# As always, we start with our favorite standard imports.
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
%matplotlib inline
import seaborn as sns
import statsmodels.formula.api as smf
Dummy Variables for Multi-level Categorical Inputs#
The wrong way#
Ok, we’re going to do this incorrectly to start. You pull in the Auto data set. You were so proud of yourself for remembering to fix the problems with the horsepower column that you conveniently forgot that the column with information about country of origin (origin) has a bunch of integers in it, representing:
1: American
2: European
3: Japanese
Auto_df = pd.read_csv('../../DataSets/Auto.csv')
Auto_df = Auto_df.replace('?', np.nan)
Auto_df = Auto_df.dropna()
Auto_df.horsepower = Auto_df.horsepower.astype('int')
Auto_df.head()
# Take a look at the origin column below. It's currently numbers.
You then go on your merry way building the model $\texttt{mpg} = \beta_0 + \beta_1 \cdot \texttt{origin}$.
from sklearn.linear_model import LinearRegression
X = Auto_df.origin.values
X = X.reshape(-1, 1)
y = Auto_df.mpg.values
regr = LinearRegression()
regr.fit(X,y)
print('beta_1 = ', regr.coef_[0])
print('beta_0 = ', regr.intercept_)
✅ Q: What does your model predict for each of the three types of cars?
# Your code here
✅ Q: Is it possible for your model to predict that both American and Japanese cars have mpg below European cars?
Your answer here.
The right way#
Ok, so you figure out your problem and decide to load in your data and fix the origin column to have names as entries.
convertOrigin= {1: 'American', 2:'European', 3:'Japanese'}
# This command swaps out each number n for convertOrigin[n], making it one of
# the three strings instead of an integer now.
Auto_df.origin = Auto_df.origin.apply(lambda n: convertOrigin[n])
Auto_df.head()
# Check to see this in the origin column below
Below is a quick bit of code that automatically generates our dummy variables. Yay for not having to code that mess ourselves!
origin_dummies_df = pd.get_dummies(Auto_df.origin, prefix='origin')
origin_dummies_df
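To see what get_dummies produces without scrolling through the whole data set, here is a minimal sketch on a made-up Series. The four values are invented for illustration and are not taken from Auto.csv. The dtype=int argument is optional: recent pandas versions return True/False columns unless you ask for integers.

```python
import pandas as pd

# Toy stand-in for Auto_df.origin; these four values are made up.
toy_origin = pd.Series(['American', 'European', 'Japanese', 'American'])

# One indicator column per level; dtype=int gives 0/1 instead of booleans.
toy_dummies = pd.get_dummies(toy_origin, prefix='origin', dtype=int)
print(toy_dummies)

# Each row has exactly one 1, in the column matching that row's level.
print(toy_dummies.sum(axis=1))
```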
Just for fun let’s check this on a data point:
# Here's the original data row 393. What is the origin of the car?
Auto_df.loc[393,:]
# Here's the entry in the dummy variable data frame. Check that this matches above!
origin_dummies_df.loc[393,:]
✅ Q: What is the interpretation of each column in the origin_dummies_df data frame?
Your answer here
We mentioned in class that you really only need 2 dummy variables to encode a 3-level categorical variable. It turns out it is also easy to do this with the get_dummies command.
# All I'm adding is the drop_first=True argument. This will drop the first column
origin_dummies_df = pd.get_dummies(Auto_df.origin, prefix='origin', drop_first=True)
origin_dummies_df
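Which column counts as "first"? For a column of strings, pandas sorts the levels, so the alphabetically first level is the one dropped and becomes the baseline. A small sketch on made-up values (not the Auto data) shows this:

```python
import pandas as pd

# Made-up values, deliberately not in alphabetical order.
toy_origin = pd.Series(['Japanese', 'American', 'European', 'American'])

toy_dummies = pd.get_dummies(toy_origin, prefix='origin',
                             drop_first=True, dtype=int)
print(toy_dummies)

# Rows where every remaining dummy is 0 belong to the dropped (baseline) level.
is_baseline = toy_dummies.sum(axis=1) == 0
print(toy_origin[is_baseline])
```

So with the origin names used in this notebook, a row of all zeros should mean an American car.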
# Check a few rows to make sure you understand how to interpret the entries for the dummy variables.
row_no = 12
print('Dummy variables:')
print(origin_dummies_df.loc[row_no,:])
print('\nOriginal data:')
print(Auto_df.loc[row_no,:])
I pass these new dummy variables into my scikit-learn linear regression model and get the following coefficients:
X = origin_dummies_df.values
y = Auto_df.mpg
regr = LinearRegression()
regr.fit(X,y)
print('Coefs = ', regr.coef_)
print('Intercept = ', regr.intercept_)
✅ Q: What model is learned? Be specific about what the input variables are in your model.
Your answer here
✅ Q: Now what does your model predict for each of the three types of cars?
# Your code here
Another right way#
Ok, fine, I’ll cave, I made you do it the hard way but you got to see how the innards worked, so maybe it’s not all bad ;)
statsmodels can automatically split up the categorical variables in a data frame, so it does the hard work for you. Note that here I’m plugging in the Auto_df data frame with its origin column of strings, with no dummy-variable encoding on my end at all. Just a warning, though: if we hadn’t done that preprocessing at the beginning and the entries of origin were still 1’s, 2’s, and 3’s, statsmodels wouldn’t know any better and would just treat the column like a quantitative input.
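If the column is still coded as integers, one way around this is to wrap it in C() inside the formula, which asks statsmodels to treat it as categorical. The sketch below uses a tiny made-up data frame (toy_df and its six rows are invented for illustration) to compare the two behaviors.

```python
import pandas as pd
import statsmodels.formula.api as smf

# Made-up data with origin still coded as the integers 1, 2, 3.
toy_df = pd.DataFrame({
    'origin': [1, 1, 2, 2, 3, 3],
    'mpg': [20.0, 22.0, 27.0, 29.0, 30.0, 32.0],
})

# Without C(): origin is treated as a single quantitative input.
numeric_fit = smf.ols('mpg ~ origin', toy_df).fit()
print(numeric_fit.params)

# With C(): origin is treated as categorical, with level 1 as the baseline.
categorical_fit = smf.ols('mpg ~ C(origin)', toy_df).fit()
print(categorical_fit.params)
```

The first fit has one slope for origin; the second has one coefficient per non-baseline level.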
est = smf.ols('mpg ~ origin', Auto_df).fit()
est.summary().tables[1]
✅ Q: What is the model learned from the above printout? Be specific in terms of your dummy variables.
Your answer here
Great, you got to here! Hang out for a bit, there’s more lecture before we go on to the next portion.
Interaction Terms#
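We’re going to use the auto data set to train the model $\texttt{mpg} = \beta_0 + \beta_1 \cdot \texttt{weight} + \beta_2 \cdot \texttt{horsepower} + \beta_3 \cdot \texttt{weight} \cdot \texttt{horsepower}$.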
# Reloading the data set just in case.
Auto_df = pd.read_csv('../../DataSets/Auto.csv')
Auto_df = Auto_df.replace('?', np.nan)
Auto_df = Auto_df.dropna()
Auto_df.horsepower = Auto_df.horsepower.astype('int')
# We only need the horsepower, weight, and mpg columns so I'm dropping everything else.
Auto_df = Auto_df[['horsepower', 'weight', 'mpg']]
Auto_df.head()
First, I’m going to generate a column that is weight times horsepower and add it to the data frame.
Auto_df['horse_x_wt'] = Auto_df.horsepower * Auto_df.weight
Auto_df.head()
regr = LinearRegression()
X = Auto_df[['horsepower', 'weight', 'horse_x_wt']]
y = Auto_df.mpg
regr.fit(X,y)
print('Coefs = ', regr.coef_)
print('Intercept = ', regr.intercept_)
✅ Do this: What is the model learned? Be specific about your variables in the equation.
Your answer here
print('y = ', round(regr.intercept_,2),
      ' + ', round(regr.coef_[0],4), '*x_horse + ',
      round(regr.coef_[1],4), '*x_weight + ',
      round(regr.coef_[2],4), '*x_horse*x_weight')
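Building the product column by hand works fine for two inputs. As an alternative sketch, scikit-learn's PolynomialFeatures can generate the interaction column for you. The three rows below are made up for illustration and are not from Auto.csv.

```python
import numpy as np
from sklearn.preprocessing import PolynomialFeatures

# Made-up (horsepower, weight) rows, for illustration only.
X_toy = np.array([[100.0, 3000.0],
                  [150.0, 3500.0],
                  [ 90.0, 2200.0]])

# degree=2 with interaction_only=True keeps x1, x2 and x1*x2 (no squares).
# include_bias=False leaves out the constant column, since LinearRegression
# fits its own intercept.
interact = PolynomialFeatures(degree=2, interaction_only=True, include_bias=False)
X_with_interaction = interact.fit_transform(X_toy)
print(X_with_interaction)
```

The resulting array could be passed to LinearRegression in place of the data frame with the hand-built horse_x_wt column.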
Let’s do this with statsmodels now instead. One option is to pass in the data frame that I built above, which already has my multiplied column in it.
# train the model
est = smf.ols('mpg ~ weight + horsepower + horse_x_wt', Auto_df).fit()
est.summary().tables[1]
However, it will also let me tell the model to build the interaction term without having to build the column myself.
# Taking the interaction column out of the data frame
Auto_df_no_interact = Auto_df[['horsepower', 'weight', 'mpg']]
Auto_df_no_interact.head()
est = smf.ols('mpg ~ weight + horsepower + weight*horsepower', Auto_df_no_interact).fit()
est.summary().tables[1]
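A note on the formula syntax: in statsmodels formulas, a*b is shorthand for a + b + a:b, where a:b is the interaction term alone. The formula above therefore lists weight and horsepower redundantly, which is harmless. The sketch below checks the equivalence on synthetic data; the numbers are randomly generated for illustration and have nothing to do with the real Auto data.

```python
import numpy as np
import pandas as pd
import statsmodels.formula.api as smf

# Synthetic data, generated only to compare the two formula spellings.
rng = np.random.default_rng(0)
n = 60
toy_df = pd.DataFrame({
    'weight': rng.uniform(1.5, 5.0, n),
    'horsepower': rng.uniform(0.5, 2.0, n),
})
toy_df['mpg'] = (40 - 3 * toy_df.weight - 5 * toy_df.horsepower
                 + 0.5 * toy_df.weight * toy_df.horsepower
                 + rng.normal(0, 1, n))

# '*' expands to both main effects plus the interaction.
star_fit = smf.ols('mpg ~ weight*horsepower', toy_df).fit()

# ':' is the interaction term alone, so the main effects are listed explicitly.
colon_fit = smf.ols('mpg ~ weight + horsepower + weight:horsepower', toy_df).fit()

print(star_fit.params)
print(colon_fit.params)
```

Both fits should report the same four coefficients, with the interaction labeled weight:horsepower.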
I’m going to reload the data for you and keep the acceleration column too.
# Reloading the data set again.
Auto_df = pd.read_csv('../../DataSets/Auto.csv')
Auto_df = Auto_df.replace('?', np.nan)
Auto_df = Auto_df.dropna()
Auto_df.horsepower = Auto_df.horsepower.astype('int')
# We need the horsepower, weight, acceleration, and mpg columns, so I'm dropping everything else.
Auto_df = Auto_df[['horsepower', 'weight', 'acceleration', 'mpg']]
Auto_df.head()
✅ Do this: Now use statsmodels to build a model for mpg that includes interaction terms among horsepower, weight, and acceleration. Which interaction terms are adding value to the model?
# Your code here
Congratulations, we’re done!#
Written by Liz Munch, Michigan State University
This work is licensed under a Creative Commons Attribution-NonCommercial 4.0 International License.