Jupyter - Lecture 7#
Even More Linear Regression#
In the last few lectures, we have focused on linear regression, that is, fitting models of the form \(Y = \beta_0 + \beta_1 X_1 + \cdots + \beta_p X_p + \varepsilon\).
In this lab, we will continue to use two different tools for linear regression.
Scikit-learn is arguably the most used tool for machine learning in Python.
Statsmodels provides many of the statistical tests we’ve been learning in class.
This lab will cover two ideas:
Categorical variables and how to represent them as dummy variables.
How to build interaction terms and pass them into your favorite model.
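As a preview of the second idea, here is a minimal sketch on made-up data. The names `x1`, `x2` and the coefficients are assumptions for illustration, not part of the lab's data sets. An interaction term is just the product of two columns, added to the design matrix as one more feature.

```python
import numpy as np

rng = np.random.default_rng(0)
n = 200
x1 = rng.uniform(0, 10, n)
x2 = rng.uniform(0, 10, n)

# Synthetic response with a true interaction coefficient of 0.5
y = 1.0 + 2.0 * x1 - 1.0 * x2 + 0.5 * x1 * x2 + rng.normal(0, 0.1, n)

# Design matrix: intercept, x1, x2, and the interaction column x1 * x2
X = np.column_stack([np.ones(n), x1, x2, x1 * x2])

# Ordinary least squares; the last entry estimates the interaction coefficient
beta_hat, *_ = np.linalg.lstsq(X, y, rcond=None)
print(beta_hat)
```

In a statsmodels formula the same product column can be requested with `x1:x2`, or with `x1*x2` to get both main effects plus the interaction.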
# As always, we start with our favorite standard imports.
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
%matplotlib inline
import seaborn as sns
import statsmodels.formula.api as smf
More questions to ask of your model (Continued from last time)#
Q3: How well does the model fit?#
This section talks about interpretation of \(R^2\) and RSE, but is on slides only.
Q4: Making predictions#
advertising_df = pd.read_csv('../../DataSets/Advertising.csv', index_col = 0)
# I need to sort the rows by TV to make plotting work better later
advertising_df = advertising_df.sort_values(by=['TV'])
advertising_df.head()
# I want to just learn Sales using TV
est = smf.ols('Sales ~ TV', advertising_df).fit()
est.params
# Here is a table giving us the CI and PI information
alpha = 0.1
# alpha = 0.05
# alpha = 0.01
# alpha = 0.001
advert_summary = est.get_prediction(advertising_df).summary_frame(alpha=alpha)
advert_summary.head()
# And here is some code that will draw these beasts for us....
plt.rcParams['figure.figsize'] = [15, 15]
plt.rcParams['font.size'] = 16
x = advertising_df['TV']
y = advertising_df['Sales']
# Plot the original data
plt.scatter(x,y, label = 'Data')
# Plot the fitted values, AKA f_hat
# you can get this in two different places, same answer
# plt.plot(x,advert_summary['mean'], color = 'orange', label = 'f hat')
plt.plot(x,est.fittedvalues, color = 'orange', label = 'f hat')
plt.plot(x, advert_summary['obs_ci_lower'], 'r', lw=2,
         label='Prediction Band')
plt.plot(x, advert_summary['obs_ci_upper'], 'r', lw=2)
plt.plot(x, advert_summary['mean_ci_lower'], 'g--', lw=1,
         label='Confidence Region')
plt.plot(x, advert_summary['mean_ci_upper'], 'g--', lw=1)
plt.title('Alpha = '+ str(alpha))
plt.legend()
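Why is the prediction band always wider than the confidence band? For simple linear regression, the standard error formulas differ only by an extra 1 under the square root, which accounts for the noise in a single new observation. Here is a minimal sketch on synthetic data; the data and the names `se_mean` and `se_obs` are illustrative and not taken from the Advertising data.

```python
import numpy as np

rng = np.random.default_rng(1)
n = 50
x = np.sort(rng.uniform(0, 300, n))
y = 7.0 + 0.05 * x + rng.normal(0, 3.0, n)

# Least-squares fit of y = b0 + b1 * x (polyfit returns the slope first)
b1, b0 = np.polyfit(x, y, 1)
resid = y - (b0 + b1 * x)
s = np.sqrt(np.sum(resid**2) / (n - 2))   # residual standard error (RSE)

sxx = np.sum((x - x.mean())**2)
leverage = 1.0 / n + (x - x.mean())**2 / sxx

se_mean = s * np.sqrt(leverage)        # drives the confidence band for f(x)
se_obs = s * np.sqrt(1.0 + leverage)   # drives the prediction band for a new y

print(se_mean.min(), se_mean.max())
print(se_obs.min(), se_obs.max())
```

Each band is the fitted value plus or minus a t critical value (with n - 2 degrees of freedom) times the corresponding standard error, so a smaller alpha widens both bands, and both are narrowest near the mean of x.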
Playing with multi-level variables#
The wrong way#
Ok, we’re going to do this incorrectly to start. You pull in the Auto data set. You were so proud of yourself for remembering to fix the problems with the horsepower column that you conveniently forgot that the column with information about country of origin (origin) has a bunch of integers in it, representing:
1: American
2: European
3: Japanese
Auto_df = pd.read_csv('../../DataSets/Auto.csv')
Auto_df = Auto_df.replace('?', np.nan)
Auto_df = Auto_df.dropna()
Auto_df.horsepower = Auto_df.horsepower.astype('int')
Auto_df.columns
You then go on your merry way building the model \( \texttt{mpg} = \beta_0 + \beta_1 \cdot \texttt{origin} \).
from sklearn.linear_model import LinearRegression
X = Auto_df.origin.values
X = X.reshape(-1, 1)
y = Auto_df.mpg.values
regr = LinearRegression()
regr.fit(X,y)
print('beta_1 = ', regr.coef_[0])
print('beta_0 = ', regr.intercept_)
✅ Q: What does your model predict for each of the three types of cars?
# Your code here
✅ Q: Is it possible for your model to predict that both American and Japanese cars have mpg below European cars?
Your answer here.
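One way to check your reasoning is a small sketch with made-up group means (not the Auto data, and the numbers are assumptions chosen for illustration). A model that is linear in the integer code must place the three predictions on a line, so the prediction for code 2 is always exactly halfway between the predictions for codes 1 and 3.

```python
import numpy as np

rng = np.random.default_rng(2)

# Hypothetical data: the middle group (code 2) has the highest true mean
codes = np.repeat([1, 2, 3], 20)
true_means = np.array([20.0, 30.0, 22.0])
y = true_means[codes - 1] + rng.normal(0, 1.0, codes.size)

# Fit y = b0 + b1 * code, treating the code as an ordinary number
b1, b0 = np.polyfit(codes, y, 1)
pred = {c: b0 + b1 * c for c in (1, 2, 3)}
print(pred)

# The middle prediction is forced to be the average of the outer two
print(np.isclose(pred[2], (pred[1] + pred[3]) / 2))
```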
The right way#
Ok, so you figure out your problem and decide to load in your data and fix the origin column to have names as entries.
convertOrigin= {1: 'American', 2:'European', 3:'Japanese'}
# This command swaps out each number n for convertOrigin[n], making it one of
# the three strings instead of an integer now.
Auto_df.origin = Auto_df.origin.apply(lambda n: convertOrigin[n])
Auto_df
Below is a quick bit of code that automatically generates our dummy variables. Yay for not having to code that mess ourselves!
origin_dummies_df = pd.get_dummies(Auto_df.origin, prefix='origin')
origin_dummies_df
✅ Q: What is the interpretation of each column in the origin_dummies_df data frame?
Your answer here
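To compare against your answer, here is a toy sketch of what `pd.get_dummies` does on a four-row column. The series below is made up for illustration; only the level names match the ones used above.

```python
import pandas as pd

# Toy column with the same three level names used above
origin = pd.Series(['American', 'Japanese', 'European', 'American'], name='origin')

# One indicator column per level, named prefix_level, in sorted level order
dummies = pd.get_dummies(origin, prefix='origin')
print(dummies)

# Each row has exactly one indicator switched on
print(dummies.sum(axis=1).tolist())
```

Depending on the pandas version, the indicator columns may show up as True/False or as 1/0; either way they behave as 0/1 values once passed to a regression.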
I pass these new dummy variables into my scikit-learn linear regression model and get the following coefficients.
X = origin_dummies_df.values
y = Auto_df.mpg
regr = LinearRegression()
regr.fit(X,y)
print('Coefs = ', regr.coef_)
print('Intercept = ', regr.intercept_)
✅ Q: Now what does your model predict for each of the three types of cars?
# Your code here
Ooops#
✅ Q: Aw man, I didn’t quite do what we said for the dummy variables in class. We talked about having only two dummy variables for a three level variable. Copy my code below here and fix it to have two variables instead of three.
Are your coefficients different now?
Are your predictions for each of the three origins different now?
Does it matter which two levels you used for your dummy variables?
# Your code here
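Once you have tried it yourself, here is a sketch of why the third dummy is redundant when the model also has an intercept. It uses plain NumPy on six made-up rows (the labels and mpg values are hypothetical) rather than the Auto data.

```python
import numpy as np

# Toy labels: 0 = American, 1 = European, 2 = Japanese (hypothetical data)
labels = np.array([0, 0, 1, 1, 2, 2])
y = np.array([20.0, 22.0, 27.0, 29.0, 30.0, 32.0])

D = np.eye(3)[labels]                 # one indicator column per level
ones = np.ones((labels.size, 1))

X_three = np.hstack([ones, D])        # intercept + all three dummies
X_two = np.hstack([ones, D[:, 1:]])   # intercept + two dummies (American dropped)

# The three dummy columns add up to the intercept column, so X_three has
# 4 columns but only rank 3: its coefficients are not uniquely determined.
print(np.linalg.matrix_rank(X_three), np.linalg.matrix_rank(X_two))

beta, *_ = np.linalg.lstsq(X_two, y, rcond=None)
print(beta)           # baseline group mean, then differences from the baseline
print(X_two @ beta)   # fitted values equal the group means
```

Both design matrices span the same column space, so the predictions agree; what changes with the choice of dropped level is only how the coefficients are read.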
Another right way#
Ok, fine, I’ll cave, I made you do it the hard way but you got to see how the innards worked, so maybe it’s not all bad ;)
First off, we can have pandas drop the first dummy variable for us via the drop_first=True argument of get_dummies, so you don’t have to do it manually every time. But you do need to know how to interpret the outputs!
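# Even easier right way.... Note the only difference is the drop_first=True
origin_dummies_df = pd.get_dummies(Auto_df.origin, prefix='origin', drop_first=True)
print(origin_dummies_df.head())
# Rebuild X from the new two-column dummy frame; otherwise the fit below
# would silently reuse the old three-column X
X = origin_dummies_df.values
y = Auto_df.mpg
regr = LinearRegression()
regr.fit(X, y)
print('Coefs = ', regr.coef_)
print('Intercept = ', regr.intercept_)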
Statsmodels can automatically split up the categorical variables in a data frame, so it does the hard work for you. Note that here I’m plugging in the Auto_df data frame (with its string-valued origin column) directly, with no dummy-variable processing on my end at all.
est = smf.ols('mpg ~ origin', Auto_df).fit()
est.summary().tables[1]
✅ Q: What is the model learned from the above printout? Be specific in terms of your dummy variables.
Your answer here
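If the row labels in that table look cryptic, this toy sketch may help. It fits the same kind of formula on six made-up rows (`toy_df` and its values are hypothetical) and prints the parameter names next to the group means.

```python
import pandas as pd
import statsmodels.formula.api as smf

# Hypothetical six-row data set with a string-valued categorical column
toy_df = pd.DataFrame({
    'origin': ['American', 'American', 'European', 'European', 'Japanese', 'Japanese'],
    'mpg': [20.0, 22.0, 27.0, 29.0, 30.0, 32.0],
})

toy_est = smf.ols('mpg ~ origin', toy_df).fit()

# Parameter names show which level each dummy variable switches on
print(toy_est.params)

# Group means, for comparison with the intercept and the coefficients
print(toy_df.groupby('origin')['mpg'].mean())
```

The level that does not appear in the parameter names (here the alphabetically first one) is the baseline absorbed into the intercept.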
Congratulations, we’re done!#
Written by Dr. Liz Munch, Michigan State University
This work is licensed under a Creative Commons Attribution-NonCommercial 4.0 International License.