
Lecture 8: The Last of the Linear Regression#

In the last few lectures, we have focused on linear regression, that is, fitting models of the form

\[ Y = \beta_0 + \beta_1 X_1 + \beta_2 X_2 + \cdots + \beta_pX_p + \varepsilon \]

In this lab, we will continue to use two different tools for linear regression.

  • Scikit-learn is arguably the most widely used tool for machine learning in Python

  • Statsmodels provides many of the statistical tests we’ve been learning in class

# As always, we start with our favorite standard imports. 

import numpy as np
import pandas as pd
import matplotlib.pyplot as plt 
%matplotlib inline
import seaborn as sns
import statsmodels.formula.api as smf

Dummy Variables for Multi-level Categorical Inputs#

The wrong way#

Ok, we’re going to do this incorrectly to start. You pull in the Auto data set. You were so proud of yourself for remembering to fix the problems with the horsepower column that you conveniently forgot that the column with information about country of origin (origin) has a bunch of integers in it, representing:

  • 1: American

  • 2: European

  • 3: Japanese.

url ="https://msu-cmse-courses.github.io/CMSE381-S26/_downloads/d75c3811a83a66f8c261e5b599ef9e44/Auto.csv"
Auto_df = pd.read_csv(url)
Auto_df = Auto_df.replace('?', np.nan)
Auto_df = Auto_df.dropna()
Auto_df.horsepower = Auto_df.horsepower.astype('int')


Auto_df.head()

# Take a look at the origin column below. It's currently numbers.
mpg cylinders displacement horsepower weight acceleration year origin name
0 18.0 8 307.0 130 3504 12.0 70 1 chevrolet chevelle malibu
1 15.0 8 350.0 165 3693 11.5 70 1 buick skylark 320
2 18.0 8 318.0 150 3436 11.0 70 1 plymouth satellite
3 16.0 8 304.0 150 3433 12.0 70 1 amc rebel sst
4 17.0 8 302.0 140 3449 10.5 70 1 ford torino

You then go on your merry way building the model \( \texttt{mpg} = \beta_0 + \beta_1 \cdot \texttt{origin} \).

from sklearn.linear_model import LinearRegression

X = Auto_df.origin.values
X = X.reshape(-1, 1)
y = Auto_df.mpg.values

regr = LinearRegression()

regr.fit(X,y)

print('beta_1 = ', regr.coef_[0])
print('beta_0 = ', regr.intercept_)
beta_1 =  5.476547480191433
beta_0 =  14.811973615412484
##ANSWER##
# This is the statsmodels version, but I want to use scikit-learn here
# because statsmodels hides the dummy variable machinery
est = smf.ols('mpg ~ origin', Auto_df).fit()
est.summary().tables[1]
coef std err t P>|t| [0.025 0.975]
Intercept 14.8120 0.716 20.676 0.000 13.404 16.220
origin 5.4765 0.405 13.531 0.000 4.681 6.272

Q: What does your model predict for each of the three types of cars?

# Your code here
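A quick way to see it, using the coefficients printed above (a sketch that plugs the three integer codes straight into the fitted line):

```python
import numpy as np

# Coefficients printed by the scikit-learn fit above
beta_0 = 14.811973615412484
beta_1 = 5.476547480191433

# Integer codes: 1 = American, 2 = European, 3 = Japanese
origins = np.array([1, 2, 3])

# Plug each code into the fitted line
predictions = beta_0 + beta_1 * origins
print(predictions)
```

Because origin enters as a number, the three predictions are forced to be equally spaced, about 5.48 mpg apart, which is worth keeping in mind for the next question.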

Q: Is it possible for your model to predict that both American and Japanese cars have mpg below European cars?

Your answer here.

The right way#

Ok, so you figure out your problem and decide to load in your data and fix the origin column to have names as entries.

convertOrigin = {1: 'American', 2: 'European', 3: 'Japanese'}

# This command swaps out each number n for convertOrigin[n], making it one of
# the three strings instead of an integer now.
Auto_df.origin = Auto_df.origin.apply(lambda n: convertOrigin[n])
Auto_df.head()

# Check to see this in the origin column below
mpg cylinders displacement horsepower weight acceleration year origin name
0 18.0 8 307.0 130 3504 12.0 70 American chevrolet chevelle malibu
1 15.0 8 350.0 165 3693 11.5 70 American buick skylark 320
2 18.0 8 318.0 150 3436 11.0 70 American plymouth satellite
3 16.0 8 304.0 150 3433 12.0 70 American amc rebel sst
4 17.0 8 302.0 140 3449 10.5 70 American ford torino

Below is a quick bit of code that automatically generates our dummy variables. Yay for not having to code that mess ourselves!

origin_dummies_df = pd.get_dummies(Auto_df.origin, prefix='origin')
origin_dummies_df
origin_American origin_European origin_Japanese
0 True False False
1 True False False
2 True False False
3 True False False
4 True False False
... ... ... ...
392 True False False
393 False True False
394 True False False
395 True False False
396 True False False

392 rows × 3 columns

Just for fun let’s check this on a data point:

# Here's the original data row 393. What is the origin of the car?
Auto_df.loc[393,:]
mpg                  44.0
cylinders               4
displacement         97.0
horsepower             52
weight               2130
acceleration         24.6
year                   82
origin           European
name            vw pickup
Name: 393, dtype: object
# Here's the entry in the dummy variable data frame. Check that this matches above!
origin_dummies_df.loc[393,:]
origin_American    False
origin_European     True
origin_Japanese    False
Name: 393, dtype: bool

Q: What is the interpretation of each column in the origin_dummies_df data frame?

Your answer here

We mentioned in class that you really only need 2 dummy variables to encode a 3 level categorical variable. It turns out it is also easy to do this with the get_dummies command.

# All I'm adding is the drop_first=True argument. This will drop the first column
origin_dummies_df = pd.get_dummies(Auto_df.origin, prefix='origin', drop_first=True)
origin_dummies_df
origin_European origin_Japanese
0 False False
1 False False
2 False False
3 False False
4 False False
... ... ...
392 False False
393 True False
394 False False
395 False False
396 False False

392 rows × 2 columns

# Check a few rows to make sure you understand how to interpret the entries for the dummy variables.
row_no = 12

print('Dummy variables:')
print(origin_dummies_df.loc[row_no,:])
print('\nOriginal data:')
print(Auto_df.loc[row_no,:])
Dummy variables:
origin_European    False
origin_Japanese    False
Name: 12, dtype: bool

Original data:
mpg                              15.0
cylinders                           8
displacement                    400.0
horsepower                        150
weight                           3761
acceleration                      9.5
year                               70
origin                       American
name            chevrolet monte carlo
Name: 12, dtype: object

I pass these new dummy variables into my scikit-learn linear regression model and get the following coefficients

X = origin_dummies_df.values
y = Auto_df.mpg

regr = LinearRegression()

regr.fit(X,y)

print('Coefs = ', regr.coef_)
print('Intercept = ', regr.intercept_)
Coefs =  [ 7.56947179 10.41716352]
Intercept =  20.033469387755094

Q: What model is learned? Be specific about what the input variables are in your model.

Your answer here

Q: Now what does your model predict for each of the three types of cars?

# Your code here
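A sketch using the coefficients printed above, where American is the baseline level (both dummies zero):

```python
# Coefficients printed by the scikit-learn fit above
intercept = 20.033469387755094
beta_european = 7.56947179
beta_japanese = 10.41716352

# With drop_first=True, American is the baseline: both dummies are 0
predictions = {
    'American': intercept,
    'European': intercept + beta_european,   # origin_European = 1
    'Japanese': intercept + beta_japanese,   # origin_Japanese = 1
}
for origin, mpg in predictions.items():
    print(f'{origin}: {mpg:.2f} mpg')
```

Unlike the integer-coded model, each group now gets its own fitted mean, with no forced spacing between the three.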

Another right way#

Ok, fine, I’ll cave, I made you do it the hard way but you got to see how the innards worked, so maybe it’s not all bad ;)

Statsmodels can automatically split up the categorical variables in a data frame, so it does the hard work for you. Note that here I’m plugging in the original Auto_df data frame, with no processing of the categorical variables on my end at all. Just a warning, though: if we hadn’t done that preprocessing at the beginning and the entries of origin were still 1’s, 2’s, and 3’s, statsmodels wouldn’t know any better and would just treat the column as a quantitative input.
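If you do get stuck with integer codes, patsy’s C() wrapper tells statsmodels to treat a column as categorical anyway. Here is a minimal sketch on a made-up toy data frame (not the Auto data):

```python
import pandas as pd
import statsmodels.formula.api as smf

# Toy data with an integer-coded category (hypothetical numbers)
df = pd.DataFrame({'origin': [1, 1, 2, 2, 3, 3],
                   'mpg':    [16., 18., 27., 29., 30., 32.]})

# Without C(), origin is treated as quantitative: one slope coefficient
quant = smf.ols('mpg ~ origin', df).fit()
print(quant.params.index.tolist())

# With C(), patsy builds the dummy variables for us
categ = smf.ols('mpg ~ C(origin)', df).fit()
print(categ.params.index.tolist())
```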

est = smf.ols('mpg ~ origin', Auto_df).fit()
est.summary().tables[1]
coef std err t P>|t| [0.025 0.975]
Intercept 20.0335 0.409 49.025 0.000 19.230 20.837
origin[T.European] 7.5695 0.877 8.634 0.000 5.846 9.293
origin[T.Japanese] 10.4172 0.828 12.588 0.000 8.790 12.044

Q: What is the model learned from the above printout? Be specific in terms of your dummy variables.

Your answer here

Stop Icon

Great, you got to here! Hang out for a bit, there’s more lecture before we go on to the next portion.

Interaction Terms#

We’re going to use the Auto data set to train the model

\[ \texttt{mpg} = \beta_0 + \beta_1\cdot \texttt{weight} + \beta_2\cdot \texttt{horsepower} + \beta_3\cdot \texttt{weight x horsepower}. \]
# Reloading the data set just in case.
Auto_df = pd.read_csv(url)
Auto_df = Auto_df.replace('?', np.nan)
Auto_df = Auto_df.dropna()
Auto_df.horsepower = Auto_df.horsepower.astype('int')

# We only need the horsepower, weight, and mpg columns so I'm dropping everything else.
Auto_df = Auto_df[['horsepower', 'weight', 'mpg']]

Auto_df.head()
horsepower weight mpg
0 130 3504 18.0
1 165 3693 15.0
2 150 3436 18.0
3 150 3433 16.0
4 140 3449 17.0

First, I’m going to generate a column that is weight times horsepower and add it to the data frame.

Auto_df['horse_x_wt'] = Auto_df.horsepower * Auto_df.weight
Auto_df.head()
horsepower weight mpg horse_x_wt
0 130 3504 18.0 455520
1 165 3693 15.0 609345
2 150 3436 18.0 515400
3 150 3433 16.0 514950
4 140 3449 17.0 482860
regr = LinearRegression()
X = Auto_df[['horsepower', 'weight', 'horse_x_wt']]
y = Auto_df.mpg
regr.fit(X,y)

print('Coefs = ', regr.coef_)
print('Intercept = ', regr.intercept_)
Coefs =  [-2.50838171e-01 -1.07724112e-02  5.35537776e-05]
Intercept =  63.55794002419705

Do this: What is the model learned? Be specific about your variables in the equation.

Your answer here
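Reading the printed coefficients back off (rounded; recall the columns of X were horsepower, weight, and horse_x_wt), the fitted model is

```latex
\[ \widehat{\texttt{mpg}} = 63.558 - 0.2508 \cdot \texttt{horsepower} - 0.01077 \cdot \texttt{weight} + 5.355\times 10^{-5} \cdot \texttt{horsepower} \cdot \texttt{weight} \]
```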

Let’s do this with statsmodels now instead. One option is to pass in the data frame that I built above, since it already has my multiplied column in it.

# train the model
est = smf.ols('mpg ~ weight + horsepower + horse_x_wt', Auto_df).fit()
est.summary().tables[1]
coef std err t P>|t| [0.025 0.975]
Intercept 63.5579 2.343 27.127 0.000 58.951 68.164
weight -0.0108 0.001 -13.921 0.000 -0.012 -0.009
horsepower -0.2508 0.027 -9.195 0.000 -0.304 -0.197
horse_x_wt 5.355e-05 6.65e-06 8.054 0.000 4.05e-05 6.66e-05

However, it will also let me tell the model to build the interaction term without having to build the column myself.

# Taking the interaction column out of the data frame
Auto_df_no_interact = Auto_df[['horsepower', 'weight', 'mpg']]
Auto_df_no_interact.head()
horsepower weight mpg
0 130 3504 18.0
1 165 3693 15.0
2 150 3436 18.0
3 150 3433 16.0
4 140 3449 17.0
est = smf.ols('mpg ~ weight + horsepower + weight*horsepower', Auto_df_no_interact).fit()
est.summary().tables[1]
coef std err t P>|t| [0.025 0.975]
Intercept 63.5579 2.343 27.127 0.000 58.951 68.164
weight -0.0108 0.001 -13.921 0.000 -0.012 -0.009
horsepower -0.2508 0.027 -9.195 0.000 -0.304 -0.197
weight:horsepower 5.355e-05 6.65e-06 8.054 0.000 4.05e-05 6.66e-05
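One note on the formula syntax: in patsy, weight*horsepower already expands to weight + horsepower + weight:horsepower, so listing the main effects again is harmless but redundant, while : by itself adds only the product term. A sketch on a toy data frame (made-up numbers, not the Auto data):

```python
import pandas as pd
import statsmodels.formula.api as smf

# Toy data (hypothetical numbers)
df = pd.DataFrame({'x1': [1., 2., 3., 4., 5., 6.],
                   'x2': [2., 1., 4., 3., 6., 5.],
                   'y':  [3., 4., 9., 10., 19., 20.]})

# x1*x2 expands to x1 + x2 + x1:x2
full = smf.ols('y ~ x1*x2', df).fit()
print(full.params.index.tolist())

# x1:x2 alone adds only the interaction term
only_interaction = smf.ols('y ~ x1:x2', df).fit()
print(only_interaction.params.index.tolist())
```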

I’m going to reload the data for you and keep the acceleration column too.

# Reloading the data set again. 
Auto_df = pd.read_csv(url)
Auto_df = Auto_df.replace('?', np.nan)
Auto_df = Auto_df.dropna()
Auto_df.horsepower = Auto_df.horsepower.astype('int')

# We only need the horsepower, weight, acceleration, and mpg columns so I'm dropping everything else.
Auto_df = Auto_df[['horsepower', 'weight', 'acceleration', 'mpg']]

Auto_df.head()
horsepower weight acceleration mpg
0 130 3504 12.0 18.0
1 165 3693 11.5 15.0
2 150 3436 11.0 18.0
3 150 3433 12.0 16.0
4 140 3449 10.5 17.0

Do this: Now use statsmodels to build the model

\[ \texttt{mpg} = \beta_0 + \beta_1\cdot \texttt{wt} + \beta_2\cdot \texttt{horse} + \beta_3 \cdot \texttt{accel} + \beta_4\cdot \texttt{wt x horse} + \beta_5 \cdot \texttt{wt x accel}. \]

Which interaction terms are adding value to the model?

# Your code here
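One way to set up the formula (a sketch; in the notebook you would pass Auto_df instead — the randomly generated frame here only keeps the example self-contained):

```python
import numpy as np
import pandas as pd
import statsmodels.formula.api as smf

# Made-up stand-in data with the same column names as Auto_df
rng = np.random.default_rng(0)
df = pd.DataFrame({'weight': rng.uniform(2000, 5000, 50),
                   'horsepower': rng.uniform(50, 200, 50),
                   'acceleration': rng.uniform(8, 25, 50)})
df['mpg'] = 50 - 0.005 * df.weight - 0.05 * df.horsepower + rng.normal(0, 1, 50)

# weight*horsepower and weight*acceleration give both interactions
# plus all three main effects (patsy drops the duplicated weight term)
est = smf.ols('mpg ~ weight*horsepower + weight*acceleration', df).fit()
print(est.pvalues.round(3))  # small p-values flag terms adding value
```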

Congratulations, we’re done!#

Initially created by Dr. Liz Munch, modified by Dr. Lianzhang Bao and Dr. Firas Khasawneh, Michigan State University Creative Commons License
This work is licensed under a Creative Commons Attribution-NonCommercial 4.0 International License.

##ANSWER##
#This cell runs the converter which removes ANSWER fields, renames the notebook and cleans out output fields. 

from jupyterinstruct import InstructorNotebook
import os
this_notebook = os.path.basename(globals()['__vsc_ipynb_file__'])

studentnotebook = InstructorNotebook.makestudent(this_notebook)

InstructorNotebook.validate(studentnotebook)
Myfilename CMSE381-Lec08-Review-INSTRUCTOR.ipynb
WARNING: Instructor file name is wrong CMSE381-Lec08_Review-INSTRUCTOR.ipynb != CMSE381-Lec08-Review-INSTRUCTOR.ipynb
CMSE381-Lec08_Review.ipynb
Validating Notebook ./CMSE381-Lec08_Review.ipynb
   ERROR: Image LINK not found - http://creativecommons.org/licenses/by-nc/4.0/
1