Lecture 8: The Last of the Linear Regression#
In the last few lectures, we have focused on linear regression, that is, fitting models of the form

$$Y = \beta_0 + \beta_1 X_1 + \beta_2 X_2 + \cdots + \beta_p X_p + \varepsilon.$$
In this lab, we will continue to use two different tools for linear regression.
Scikit-learn is arguably the most used tool for machine learning in Python.
Statsmodels provides many of the statistical tests we’ve been learning in class.
# As always, we start with our favorite standard imports.
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
%matplotlib inline
import seaborn as sns
import statsmodels.formula.api as smf
Dummy Variables for Multi-level Categorical Inputs#
The wrong way#
Ok, we’re going to do this incorrectly to start. You pull in the Auto data set. You were so proud of yourself for remembering to fix the problems with the horsepower column that you conveniently forgot that the column with information about country of origin (origin) has a bunch of integers in it, representing:
- 1: American
- 2: European
- 3: Japanese
url ="https://msu-cmse-courses.github.io/CMSE381-S26/_downloads/d75c3811a83a66f8c261e5b599ef9e44/Auto.csv"
Auto_df = pd.read_csv(url)
Auto_df = Auto_df.replace('?', np.nan)
Auto_df = Auto_df.dropna()
Auto_df.horsepower = Auto_df.horsepower.astype('int')
Auto_df.head()
# Take a look at the origin column below. It's currently numbers.
| | mpg | cylinders | displacement | horsepower | weight | acceleration | year | origin | name |
|---|---|---|---|---|---|---|---|---|---|
| 0 | 18.0 | 8 | 307.0 | 130 | 3504 | 12.0 | 70 | 1 | chevrolet chevelle malibu |
| 1 | 15.0 | 8 | 350.0 | 165 | 3693 | 11.5 | 70 | 1 | buick skylark 320 |
| 2 | 18.0 | 8 | 318.0 | 150 | 3436 | 11.0 | 70 | 1 | plymouth satellite |
| 3 | 16.0 | 8 | 304.0 | 150 | 3433 | 12.0 | 70 | 1 | amc rebel sst |
| 4 | 17.0 | 8 | 302.0 | 140 | 3449 | 10.5 | 70 | 1 | ford torino |
You then go on your merry way building the model

$$\texttt{mpg} = \beta_0 + \beta_1 \cdot \texttt{origin}.$$
from sklearn.linear_model import LinearRegression
X = Auto_df.origin.values
X = X.reshape(-1, 1)
y = Auto_df.mpg.values
regr = LinearRegression()
regr.fit(X,y)
print('beta_1 = ', regr.coef_[0])
print('beta_0 = ', regr.intercept_)
beta_1 = 5.476547480191433
beta_0 = 14.811973615412484
##ANSWER##
# This is the statsmodels version, but I want to use scikit-learn here
# because statsmodels hides the dummy variable machinery.
est = smf.ols('mpg ~ origin', Auto_df).fit()
est.summary().tables[1]
| | coef | std err | t | P>\|t\| | [0.025 | 0.975] |
|---|---|---|---|---|---|---|
| Intercept | 14.8120 | 0.716 | 20.676 | 0.000 | 13.404 | 16.220 |
| origin | 5.4765 | 0.405 | 13.531 | 0.000 | 4.681 | 6.272 |
✅ Q: What does your model predict for each of the three types of cars?
# Your code here
✅ Q: Is it possible for your model to predict that both American and Japanese cars have mpg below European cars?
Your answer here.
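For a concrete check, here is a quick sketch (one way, not the only way, to think about the two questions above): since the model is just $\beta_0 + \beta_1 \cdot \texttt{origin}$, the three predictions are forced to be equally spaced, so the prediction for European (code 2) always lands exactly between American (1) and Japanese (3).
# A quick illustration using the regression fit above: the predictions for
# the three origin codes are beta_0 + beta_1 * code, so they are equally spaced.
codes = np.array([1, 2, 3]).reshape(-1, 1)
for code, pred in zip([1, 2, 3], regr.predict(codes)):
    print(f'origin code {code}: predicted mpg = {pred:.2f}')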
The right way#
Ok, so you figure out your problem and decide to load in your data and fix the origin column to have names as entries.
convertOrigin = {1: 'American', 2: 'European', 3: 'Japanese'}
# This command swaps out each number n for convertOrigin[n], making it one of
# the three strings instead of an integer now.
Auto_df.origin = Auto_df.origin.apply(lambda n: convertOrigin[n])
Auto_df.head()
# Check to see this in the origin column below
| | mpg | cylinders | displacement | horsepower | weight | acceleration | year | origin | name |
|---|---|---|---|---|---|---|---|---|---|
| 0 | 18.0 | 8 | 307.0 | 130 | 3504 | 12.0 | 70 | American | chevrolet chevelle malibu |
| 1 | 15.0 | 8 | 350.0 | 165 | 3693 | 11.5 | 70 | American | buick skylark 320 |
| 2 | 18.0 | 8 | 318.0 | 150 | 3436 | 11.0 | 70 | American | plymouth satellite |
| 3 | 16.0 | 8 | 304.0 | 150 | 3433 | 12.0 | 70 | American | amc rebel sst |
| 4 | 17.0 | 8 | 302.0 | 140 | 3449 | 10.5 | 70 | American | ford torino |
Below is a quick bit of code that automatically generates our dummy variables. Yay for not having to code that mess ourselves!
origin_dummies_df = pd.get_dummies(Auto_df.origin, prefix='origin')
origin_dummies_df
| | origin_American | origin_European | origin_Japanese |
|---|---|---|---|
| 0 | True | False | False |
| 1 | True | False | False |
| 2 | True | False | False |
| 3 | True | False | False |
| 4 | True | False | False |
| ... | ... | ... | ... |
| 392 | True | False | False |
| 393 | False | True | False |
| 394 | True | False | False |
| 395 | True | False | False |
| 396 | True | False | False |
392 rows × 3 columns
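One small note: as you can see above, recent versions of pandas return these dummies as booleans. If you would rather see 0/1 integers (matching the notation from lecture), get_dummies takes a dtype argument; a minimal sketch:
# Same dummy variables as above, but stored as 0/1 integers instead of booleans.
pd.get_dummies(Auto_df.origin, prefix='origin', dtype=int).head()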
Just for fun let’s check this on a data point:
# Here's the original data row 393. What is the origin of the car?
Auto_df.loc[393,:]
mpg 44.0
cylinders 4
displacement 97.0
horsepower 52
weight 2130
acceleration 24.6
year 82
origin European
name vw pickup
Name: 393, dtype: object
# Here's the entry in the dummy variable data frame. Check that this matches above!
origin_dummies_df.loc[393,:]
origin_American False
origin_European True
origin_Japanese False
Name: 393, dtype: bool
✅ Q: What is the interpretation of each column in the origin_dummies_df data frame?
Your answer here
We mentioned in class that you really only need 2 dummy variables to encode a 3-level categorical variable. It turns out it is also easy to do this with the get_dummies command.
# All I'm adding is the drop_first=True argument, which drops the first level's column (origin_American).
origin_dummies_df = pd.get_dummies(Auto_df.origin, prefix='origin', drop_first=True)
origin_dummies_df
| | origin_European | origin_Japanese |
|---|---|---|
| 0 | False | False |
| 1 | False | False |
| 2 | False | False |
| 3 | False | False |
| 4 | False | False |
| ... | ... | ... |
| 392 | False | False |
| 393 | True | False |
| 394 | False | False |
| 395 | False | False |
| 396 | False | False |
392 rows × 2 columns
# Check a few rows to make sure you understand how to interpret the entries of the dummy variables.
row_no = 12
print('Dummy variables:')
print(origin_dummies_df.loc[row_no,:])
print('\nOriginal data:')
print(Auto_df.loc[row_no,:])
Dummy variables:
origin_European False
origin_Japanese False
Name: 12, dtype: bool
Original data:
mpg 15.0
cylinders 8
displacement 400.0
horsepower 150
weight 3761
acceleration 9.5
year 70
origin American
name chevrolet monte carlo
Name: 12, dtype: object
I pass these new dummy variables into my scikit-learn linear regression model and get the following coefficients.
X = origin_dummies_df.values
y = Auto_df.mpg
regr = LinearRegression()
regr.fit(X,y)
print('Coefs = ', regr.coef_)
print('Intercept = ', regr.intercept_)
Coefs = [ 7.56947179 10.41716352]
Intercept = 20.033469387755094
✅ Q: What model is learned? Be specific about what the input variables are in your model.
Your answer here
✅ Q: Now what does your model predict for each of the three types of cars?
# Your code here
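If you’re stuck, here is a sketch of one possible approach: each origin corresponds to a setting of the two dummies, with American as the baseline level (both dummies zero).
# One possible approach (a sketch): predict at the three dummy settings.
settings = np.array([[0, 0],   # American (baseline)
                     [1, 0],   # European
                     [0, 1]])  # Japanese
for name, pred in zip(['American', 'European', 'Japanese'], regr.predict(settings)):
    print(f'{name}: predicted mpg = {pred:.2f}')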
Another right way#
Ok, fine, I’ll cave, I made you do it the hard way but you got to see how the innards worked, so maybe it’s not all bad ;)
Statsmodels can automatically split up the categorical variables in a data frame, so it does the hard work for you. Note that here I’m plugging in the original Auto_df data frame, with no processing of the categorical variables on my end at all. Just a warning, though: if we hadn’t done that preprocessing at the beginning and the entries of origin were still 1’s, 2’s, and 3’s, statsmodels wouldn’t know any better and would just treat the column as a quantitative input.
est = smf.ols('mpg ~ origin', Auto_df).fit()
est.summary().tables[1]
| | coef | std err | t | P>\|t\| | [0.025 | 0.975] |
|---|---|---|---|---|---|---|
| Intercept | 20.0335 | 0.409 | 49.025 | 0.000 | 19.230 | 20.837 |
| origin[T.European] | 7.5695 | 0.877 | 8.634 | 0.000 | 5.846 | 9.293 |
| origin[T.Japanese] | 10.4172 | 0.828 | 12.588 | 0.000 | 8.790 | 12.044 |
✅ Q: What is the model learned from the above printout? Be specific in terms of your dummy variables.
Your answer here
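By the way, if you ever do find yourself with an integer-coded column like the original origin, you can wrap it in C() inside the formula to force statsmodels (via patsy) to treat it as categorical; a minimal sketch, re-reading the raw data:
# Re-read the raw data, where origin is still coded 1/2/3.
Auto_raw = pd.read_csv(url).replace('?', np.nan).dropna()
# C(origin) forces categorical treatment, so the dummy variables get built for us.
est_c = smf.ols('mpg ~ C(origin)', Auto_raw).fit()
est_c.summary().tables[1]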
Great, you got to here! Hang out for a bit, there’s more lecture before we go on to the next portion.
Interaction Terms#
We’re going to use the Auto data set to train the model

$$\texttt{mpg} = \beta_0 + \beta_1 \cdot \texttt{horsepower} + \beta_2 \cdot \texttt{weight} + \beta_3 \cdot \texttt{horsepower} \cdot \texttt{weight}.$$
# Reloading the data set just in case.
Auto_df = pd.read_csv(url)
Auto_df = Auto_df.replace('?', np.nan)
Auto_df = Auto_df.dropna()
Auto_df.horsepower = Auto_df.horsepower.astype('int')
# We only need the horsepower, weight, and mpg columns so I'm dropping everything else.
Auto_df = Auto_df[['horsepower', 'weight', 'mpg']]
Auto_df.head()
| | horsepower | weight | mpg |
|---|---|---|---|
| 0 | 130 | 3504 | 18.0 |
| 1 | 165 | 3693 | 15.0 |
| 2 | 150 | 3436 | 18.0 |
| 3 | 150 | 3433 | 16.0 |
| 4 | 140 | 3449 | 17.0 |
First, I’m going to generate a column that is weight times horsepower and add it to the data frame.
Auto_df['horse_x_wt'] = Auto_df.horsepower * Auto_df.weight
Auto_df.head()
| | horsepower | weight | mpg | horse_x_wt |
|---|---|---|---|---|
| 0 | 130 | 3504 | 18.0 | 455520 |
| 1 | 165 | 3693 | 15.0 | 609345 |
| 2 | 150 | 3436 | 18.0 | 515400 |
| 3 | 150 | 3433 | 16.0 | 514950 |
| 4 | 140 | 3449 | 17.0 | 482860 |
regr = LinearRegression()
X = Auto_df[['horsepower', 'weight', 'horse_x_wt']]
y = Auto_df.mpg
regr.fit(X,y)
print('Coefs = ', regr.coef_)
print('Intercept = ', regr.intercept_)
Coefs = [-2.50838171e-01 -1.07724112e-02 5.35537776e-05]
Intercept = 63.55794002419705
✅ Do this: What is the model learned? Be specific about your variables in the equation.
Your answer here
Let’s do this with statsmodels now instead. One option is to pass in the data frame that I built above, since it already has my multiplied column in it.
# train the model
est = smf.ols('mpg ~ weight + horsepower + horse_x_wt', Auto_df).fit()
est.summary().tables[1]
| | coef | std err | t | P>\|t\| | [0.025 | 0.975] |
|---|---|---|---|---|---|---|
| Intercept | 63.5579 | 2.343 | 27.127 | 0.000 | 58.951 | 68.164 |
| weight | -0.0108 | 0.001 | -13.921 | 0.000 | -0.012 | -0.009 |
| horsepower | -0.2508 | 0.027 | -9.195 | 0.000 | -0.304 | -0.197 |
| horse_x_wt | 5.355e-05 | 6.65e-06 | 8.054 | 0.000 | 4.05e-05 | 6.66e-05 |
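One way to read that interaction coefficient: the effect of weight on mpg is no longer a constant but depends on horsepower,

$$\frac{\partial\, \texttt{mpg}}{\partial\, \texttt{weight}} = \beta_2 + \beta_3 \cdot \texttt{horsepower} \approx -0.0108 + \left(5.355 \times 10^{-5}\right) \cdot \texttt{horsepower},$$

so the per-pound mpg penalty shrinks as horsepower increases.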
However, statsmodels will also let me tell the model to build the interaction term without having to build the column myself.
# Taking the interaction column out of the data frame
Auto_df_no_interact = Auto_df[['horsepower', 'weight', 'mpg']]
Auto_df_no_interact.head()
| | horsepower | weight | mpg |
|---|---|---|---|
| 0 | 130 | 3504 | 18.0 |
| 1 | 165 | 3693 | 15.0 |
| 2 | 150 | 3436 | 18.0 |
| 3 | 150 | 3433 | 16.0 |
| 4 | 140 | 3449 | 17.0 |
est = smf.ols('mpg ~ weight + horsepower + weight*horsepower', Auto_df_no_interact).fit()
est.summary().tables[1]
| | coef | std err | t | P>\|t\| | [0.025 | 0.975] |
|---|---|---|---|---|---|---|
| Intercept | 63.5579 | 2.343 | 27.127 | 0.000 | 58.951 | 68.164 |
| weight | -0.0108 | 0.001 | -13.921 | 0.000 | -0.012 | -0.009 |
| horsepower | -0.2508 | 0.027 | -9.195 | 0.000 | -0.304 | -0.197 |
| weight:horsepower | 5.355e-05 | 6.65e-06 | 8.054 | 0.000 | 4.05e-05 | 6.66e-05 |
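A note on the formula syntax: in these formulas, weight*horsepower expands to weight + horsepower + weight:horsepower (both main effects plus the interaction), and duplicate terms are dropped, so the shorter formula below fits exactly the same model.
# Equivalent fit: '*' already includes both main effects plus the ':' interaction.
est_short = smf.ols('mpg ~ weight * horsepower', Auto_df_no_interact).fit()
est_short.summary().tables[1]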
I’m going to reload the data for you and keep the acceleration column too.
# Reloading the data set again.
Auto_df = pd.read_csv(url)
Auto_df = Auto_df.replace('?', np.nan)
Auto_df = Auto_df.dropna()
Auto_df.horsepower = Auto_df.horsepower.astype('int')
# This time we keep the horsepower, weight, acceleration, and mpg columns and drop everything else.
Auto_df = Auto_df[['horsepower', 'weight', 'acceleration', 'mpg']]
Auto_df.head()
| | horsepower | weight | acceleration | mpg |
|---|---|---|---|---|
| 0 | 130 | 3504 | 12.0 | 18.0 |
| 1 | 165 | 3693 | 11.5 | 15.0 |
| 2 | 150 | 3436 | 11.0 | 18.0 |
| 3 | 150 | 3433 | 12.0 | 16.0 |
| 4 | 140 | 3449 | 10.5 | 17.0 |
✅ Do this: Now use statsmodels to build a model predicting mpg from horsepower, weight, and acceleration, including their pairwise interaction terms.
Which interaction terms are adding value to the model?
# Your code here
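If you want a hint, here is a sketch of one possible approach; the **2 shorthand is a patsy feature that expands to all main effects plus every pairwise interaction.
# One possible approach (a sketch): fit all three predictors plus every
# pairwise interaction, then inspect the p-values on the interaction rows.
est_all = smf.ols('mpg ~ (horsepower + weight + acceleration)**2', Auto_df).fit()
est_all.summary().tables[1]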
Congratulations, we’re done!#
Initially created by Dr. Liz Munch, modified by Dr. Lianzhang Bao and Dr. Firas Khasawneh, Michigan State University
This work is licensed under a Creative Commons Attribution-NonCommercial 4.0 International License.
##ANSWER##
# This cell runs the converter, which removes ANSWER fields, renames the notebook, and cleans out output fields.
from jupyterinstruct import InstructorNotebook
import os
this_notebook = os.path.basename(globals()['__vsc_ipynb_file__'])
studentnotebook = InstructorNotebook.makestudent(this_notebook)
InstructorNotebook.validate(studentnotebook)
Myfilename CMSE381-Lec08-Review-INSTRUCTOR.ipynb
WARNING: Instructor file name is wrong CMSE381-Lec08_Review-INSTRUCTOR.ipynb != CMSE381-Lec08-Review-INSTRUCTOR.ipynb
CMSE381-Lec08_Review.ipynb
Validating Notebook ./CMSE381-Lec08_Review.ipynb
ERROR: Image LINK not found - http://creativecommons.org/licenses/by-nc/4.0/
1