
Lecture 8: The Last of the Linear Regression#

In the last few lectures, we have focused on linear regression, that is, fitting models of the form

\[ Y = \beta_0 + \beta_1 X_1 + \beta_2 X_2 + \cdots + \beta_pX_p + \varepsilon \]

In this lab, we will continue to use two different tools for linear regression.

  • Scikit-learn is arguably the most widely used tool for machine learning in Python

  • Statsmodels provides many of the statistical tests we’ve been learning in class

# As always, we start with our favorite standard imports. 

import numpy as np
import pandas as pd
import matplotlib.pyplot as plt 
%matplotlib inline
import seaborn as sns
import statsmodels.formula.api as smf

Dummy Variables for Multi-level Categorical Inputs#

The wrong way#

Ok, we’re going to do this incorrectly to start. You pull in the Auto data set. You were so proud of yourself for remembering to fix the problems with the horsepower column that you conveniently forgot that the column with information about country of origin (origin) has a bunch of integers in it, representing:

  • 1: American

  • 2: European

  • 3: Japanese.

url ="https://msu-cmse-courses.github.io/CMSE381-S26/_downloads/d75c3811a83a66f8c261e5b599ef9e44/Auto.csv"
Auto_df = pd.read_csv(url)
Auto_df = Auto_df.replace('?', np.nan)
Auto_df = Auto_df.dropna()
Auto_df.horsepower = Auto_df.horsepower.astype('int')


Auto_df.head()

# Take a look at the origin column below. It's currently numbers.
mpg cylinders displacement horsepower weight acceleration year origin name
0 18.0 8 307.0 130 3504 12.0 70 1 chevrolet chevelle malibu
1 15.0 8 350.0 165 3693 11.5 70 1 buick skylark 320
2 18.0 8 318.0 150 3436 11.0 70 1 plymouth satellite
3 16.0 8 304.0 150 3433 12.0 70 1 amc rebel sst
4 17.0 8 302.0 140 3449 10.5 70 1 ford torino

You then go on your merry way building the model \( \texttt{mpg} = \beta_0 + \beta_1 \cdot \texttt{origin} \).

from sklearn.linear_model import LinearRegression

X = Auto_df.origin.values
X = X.reshape(-1, 1)
y = Auto_df.mpg.values

regr = LinearRegression()

regr.fit(X,y)

print('beta_1 = ', regr.coef_[0])
print('beta_0 = ', regr.intercept_)
beta_1 =  5.476547480191433
beta_0 =  14.811973615412484
##ANSWER##
# This is the statsmodels version, but I want to use scikit-learn here
# because statsmodels hides the dummy variable machinery
est = smf.ols('mpg ~ origin', Auto_df).fit()
est.summary().tables[1]
coef std err t P>|t| [0.025 0.975]
Intercept 14.8120 0.716 20.676 0.000 13.404 16.220
origin 5.4765 0.405 13.531 0.000 4.681 6.272

Q: What does your model predict for each of the three types of cars?

# Your code here
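A quick way to see it, using the coefficients printed above (a sketch that plugs the three integer codes straight into the fitted line):

```python
import numpy as np

# Coefficients printed by the scikit-learn fit above
beta_0 = 14.811973615412484
beta_1 = 5.476547480191433

# Integer codes: 1 = American, 2 = European, 3 = Japanese
origins = np.array([1, 2, 3])

# Plug each code into the fitted line
predictions = beta_0 + beta_1 * origins
print(predictions)
```

Because origin enters as a number, the three predictions are forced to be equally spaced, about 5.48 mpg apart, which is worth keeping in mind for the next question.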

Q: Is it possible for your model to predict that both American and Japanese cars have mpg below European cars?

Your answer here.

The right way#

Ok, so you figure out your problem and decide to load in your data and fix the origin column to have names as entries.

convertOrigin = {1: 'American', 2: 'European', 3: 'Japanese'}

# This command swaps out each number n for convertOrigin[n], making it one of
# the three strings instead of an integer now.
Auto_df.origin = Auto_df.origin.apply(lambda n: convertOrigin[n])
Auto_df.head()

# Check to see this in the origin column below
mpg cylinders displacement horsepower weight acceleration year origin name
0 18.0 8 307.0 130 3504 12.0 70 American chevrolet chevelle malibu
1 15.0 8 350.0 165 3693 11.5 70 American buick skylark 320
2 18.0 8 318.0 150 3436 11.0 70 American plymouth satellite
3 16.0 8 304.0 150 3433 12.0 70 American amc rebel sst
4 17.0 8 302.0 140 3449 10.5 70 American ford torino

Below is a quick bit of code that automatically generates our dummy variables. Yay for not having to code that mess ourselves!

origin_dummies_df = pd.get_dummies(Auto_df.origin, prefix='origin')
origin_dummies_df
origin_American origin_European origin_Japanese
0 True False False
1 True False False
2 True False False
3 True False False
4 True False False
... ... ... ...
392 True False False
393 False True False
394 True False False
395 True False False
396 True False False

392 rows × 3 columns

Just for fun let’s check this on a data point:

# Here's the original data row 393. What is the origin of the car?
Auto_df.loc[393,:]
mpg                  44.0
cylinders               4
displacement         97.0
horsepower             52
weight               2130
acceleration         24.6
year                   82
origin           European
name            vw pickup
Name: 393, dtype: object
# Here's the entry in the dummy variable data frame. Check that this matches above!
origin_dummies_df.loc[393,:]
origin_American    False
origin_European     True
origin_Japanese    False
Name: 393, dtype: bool

Q: What is the interpretation of each column in the origin_dummies_df data frame?

Your answer here

We mentioned in class that you really only need 2 dummy variables to encode a 3 level categorical variable. It turns out it is also easy to do this with the get_dummies command.

# All I'm adding is the drop_first=True argument. This will drop the first column
origin_dummies_df = pd.get_dummies(Auto_df.origin, prefix='origin', drop_first=True)
origin_dummies_df
origin_European origin_Japanese
0 False False
1 False False
2 False False
3 False False
4 False False
... ... ...
392 False False
393 True False
394 False False
395 False False
396 False False

392 rows × 2 columns

# Check a few rows to make sure you understand how to interpret the entries for the dummy variables.
row_no = 12

print('Dummy variables:')
print(origin_dummies_df.loc[row_no,:])
print('\nOriginal data:')
print(Auto_df.loc[row_no,:])
Dummy variables:
origin_European    False
origin_Japanese    False
Name: 12, dtype: bool

Original data:
mpg                              15.0
cylinders                           8
displacement                    400.0
horsepower                        150
weight                           3761
acceleration                      9.5
year                               70
origin                       American
name            chevrolet monte carlo
Name: 12, dtype: object

I pass these new dummy variables into my scikit-learn linear regression model and get the following coefficients

X = origin_dummies_df.values
y = Auto_df.mpg

regr = LinearRegression()

regr.fit(X,y)

print('Coefs = ', regr.coef_)
print('Intercept = ', regr.intercept_)
Coefs =  [ 7.56947179 10.41716352]
Intercept =  20.033469387755094

Q: What model is learned? Be specific about what the input variables are in your model.

Your answer here

Q: Now what does your model predict for each of the three types of cars?

# Your code here
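A sketch using the coefficients printed above, where American is the baseline level (both dummies zero):

```python
# Coefficients printed by the scikit-learn fit above
intercept = 20.033469387755094
beta_european = 7.56947179
beta_japanese = 10.41716352

# With drop_first=True, American is the baseline: both dummies are 0
predictions = {
    'American': intercept,
    'European': intercept + beta_european,   # origin_European = 1
    'Japanese': intercept + beta_japanese,   # origin_Japanese = 1
}
for origin, mpg in predictions.items():
    print(f'{origin}: {mpg:.2f} mpg')
```

Unlike the integer-coded model, each group now gets its own fitted mean, with no forced spacing between the three.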

Another right way#

Ok, fine, I’ll cave, I made you do it the hard way but you got to see how the innards worked, so maybe it’s not all bad ;)

Statsmodels can automatically split up the categorical variables in a data frame, so it does the hard work for you. Note that here I’m plugging in the original Auto_df data frame, with no processing of the categorical variables on my end at all. Just a warning, though: if we hadn’t done that preprocessing at the beginning and the entries of origin were still 1’s, 2’s, and 3’s, statsmodels wouldn’t know any better and would just treat the column as a quantitative input.
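If you do get stuck with integer codes, patsy’s C() wrapper tells statsmodels to treat a column as categorical anyway. Here is a minimal sketch on a made-up toy data frame (not the Auto data):

```python
import pandas as pd
import statsmodels.formula.api as smf

# Toy data with an integer-coded category (hypothetical numbers)
df = pd.DataFrame({'origin': [1, 1, 2, 2, 3, 3],
                   'mpg':    [16., 18., 27., 29., 30., 32.]})

# Without C(), origin is treated as quantitative: one slope coefficient
quant = smf.ols('mpg ~ origin', df).fit()
print(quant.params.index.tolist())

# With C(), patsy builds the dummy variables for us
categ = smf.ols('mpg ~ C(origin)', df).fit()
print(categ.params.index.tolist())
```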

est = smf.ols('mpg ~ origin', Auto_df).fit()
est.summary().tables[1]
coef std err t P>|t| [0.025 0.975]
Intercept 20.0335 0.409 49.025 0.000 19.230 20.837
origin[T.European] 7.5695 0.877 8.634 0.000 5.846 9.293
origin[T.Japanese] 10.4172 0.828 12.588 0.000 8.790 12.044

Q: What is the model learned from the above printout? Be specific in terms of your dummy variables.

Your answer here

Stop Icon

Great, you got to here! Hang out for a bit, there’s more lecture before we go on to the next portion.

Interaction Terms#

We’re going to use the Auto data set to train the model

\[ \texttt{mpg} = \beta_0 + \beta_1\cdot \texttt{weight} + \beta_2\cdot \texttt{horsepower} + \beta_3\cdot \texttt{weight x horsepower}. \]
# Reloading the data set just in case.
Auto_df = pd.read_csv(url)
Auto_df = Auto_df.replace('?', np.nan)
Auto_df = Auto_df.dropna()
Auto_df.horsepower = Auto_df.horsepower.astype('int')

# We only need the horsepower, weight, and mpg columns so I'm dropping everything else.
Auto_df = Auto_df[['horsepower', 'weight', 'mpg']]

Auto_df.head()
horsepower weight mpg
0 130 3504 18.0
1 165 3693 15.0
2 150 3436 18.0
3 150 3433 16.0
4 140 3449 17.0

First, I’m going to generate a column that is weight times horsepower and add it to the data frame.

Auto_df['horse_x_wt'] = Auto_df.horsepower * Auto_df.weight
Auto_df.head()
horsepower weight mpg horse_x_wt
0 130 3504 18.0 455520
1 165 3693 15.0 609345
2 150 3436 18.0 515400
3 150 3433 16.0 514950
4 140 3449 17.0 482860
regr = LinearRegression()
X = Auto_df[['horsepower', 'weight', 'horse_x_wt']]
y = Auto_df.mpg
regr.fit(X,y)

print('Coefs = ', regr.coef_)
print('Intercept = ', regr.intercept_)
Coefs =  [-2.50838171e-01 -1.07724112e-02  5.35537776e-05]
Intercept =  63.55794002419705

Do this: What is the model learned? Be specific about your variables in the equation.

Your answer here
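Reading the printed coefficients back off (rounded; recall the columns of X were horsepower, weight, and horse_x_wt), the fitted model is

```latex
\[ \widehat{\texttt{mpg}} = 63.558 - 0.2508 \cdot \texttt{horsepower} - 0.01077 \cdot \texttt{weight} + 5.355\times 10^{-5} \cdot \texttt{horsepower} \cdot \texttt{weight} \]
```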

Let’s do this with statsmodels now instead. One option is to pass in the data frame that I built above, since it already has my multiplied column in it.

# train the model
est = smf.ols('mpg ~ weight + horsepower + horse_x_wt', Auto_df).fit()
est.summary().tables[1]
coef std err t P>|t| [0.025 0.975]
Intercept 63.5579 2.343 27.127 0.000 58.951 68.164
weight -0.0108 0.001 -13.921 0.000 -0.012 -0.009
horsepower -0.2508 0.027 -9.195 0.000 -0.304 -0.197
horse_x_wt 5.355e-05 6.65e-06 8.054 0.000 4.05e-05 6.66e-05

However, it will also let me tell the model to build the interaction term without having to build the column myself.

# Taking the interaction column out of the data frame
Auto_df_no_interact = Auto_df[['horsepower', 'weight', 'mpg']]
Auto_df_no_interact.head()
horsepower weight mpg
0 130 3504 18.0
1 165 3693 15.0
2 150 3436 18.0
3 150 3433 16.0
4 140 3449 17.0
est = smf.ols('mpg ~ weight + horsepower + weight*horsepower', Auto_df_no_interact).fit()
est.summary().tables[1]
coef std err t P>|t| [0.025 0.975]
Intercept 63.5579 2.343 27.127 0.000 58.951 68.164
weight -0.0108 0.001 -13.921 0.000 -0.012 -0.009
horsepower -0.2508 0.027 -9.195 0.000 -0.304 -0.197
weight:horsepower 5.355e-05 6.65e-06 8.054 0.000 4.05e-05 6.66e-05
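One note on the formula syntax: in patsy, weight*horsepower already expands to weight + horsepower + weight:horsepower, so listing the main effects again is harmless but redundant, while : by itself adds only the product term. A sketch on a toy data frame (made-up numbers, not the Auto data):

```python
import pandas as pd
import statsmodels.formula.api as smf

# Toy data (hypothetical numbers)
df = pd.DataFrame({'x1': [1., 2., 3., 4., 5., 6.],
                   'x2': [2., 1., 4., 3., 6., 5.],
                   'y':  [3., 4., 9., 10., 19., 20.]})

# x1*x2 expands to x1 + x2 + x1:x2
full = smf.ols('y ~ x1*x2', df).fit()
print(full.params.index.tolist())

# x1:x2 alone adds only the interaction term
only_interaction = smf.ols('y ~ x1:x2', df).fit()
print(only_interaction.params.index.tolist())
```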

I’m going to reload the data for you and keep the acceleration column too.

# Reloading the data set again. 
Auto_df = pd.read_csv(url)
Auto_df = Auto_df.replace('?', np.nan)
Auto_df = Auto_df.dropna()
Auto_df.horsepower = Auto_df.horsepower.astype('int')

# We only need the horsepower, weight, acceleration, and mpg columns so I'm dropping everything else.
Auto_df = Auto_df[['horsepower', 'weight', 'acceleration', 'mpg']]

Auto_df.head()
horsepower weight acceleration mpg
0 130 3504 12.0 18.0
1 165 3693 11.5 15.0
2 150 3436 11.0 18.0
3 150 3433 12.0 16.0
4 140 3449 10.5 17.0

Do this: Now use statsmodels to build the model

\[ \texttt{mpg} = \beta_0 + \beta_1\cdot \texttt{wt} + \beta_2\cdot \texttt{horse} + \beta_3 \cdot \texttt{accel} + \beta_4\cdot \texttt{wt x horse} + \beta_5 \cdot \texttt{wt x accel}. \]

Which interaction terms are adding value to the model?

# Your code here
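One way to set up the formula (a sketch; in the notebook you would pass Auto_df instead — the randomly generated frame here only keeps the example self-contained):

```python
import numpy as np
import pandas as pd
import statsmodels.formula.api as smf

# Made-up stand-in data with the same column names as Auto_df
rng = np.random.default_rng(0)
df = pd.DataFrame({'weight': rng.uniform(2000, 5000, 50),
                   'horsepower': rng.uniform(50, 200, 50),
                   'acceleration': rng.uniform(8, 25, 50)})
df['mpg'] = 50 - 0.005 * df.weight - 0.05 * df.horsepower + rng.normal(0, 1, 50)

# weight*horsepower and weight*acceleration give both interactions
# plus all three main effects (patsy drops the duplicated weight term)
est = smf.ols('mpg ~ weight*horsepower + weight*acceleration', df).fit()
print(est.pvalues.round(3))  # small p-values flag terms adding value
```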

Congratulations, we’re done!#

Initially created by Dr. Liz Munch, modified by Dr. Lianzhang Bao and Dr. Firas Khasawneh, Michigan State University Creative Commons License
This work is licensed under a Creative Commons Attribution-NonCommercial 4.0 International License.

##ANSWER##
#This cell runs the converter which removes ANSWER fields, renames the notebook and cleans out output fields. 

from jupyterinstruct import InstructorNotebook
import os
this_notebook = os.path.basename(globals()['__vsc_ipynb_file__'])

studentnotebook = InstructorNotebook.makestudent(this_notebook)

InstructorNotebook.validate(studentnotebook)
Myfilename CMSE381-Lec08-Review-INSTRUCTOR.ipynb
WARNING: Instructor file name is wrong CMSE381-Lec08_Review-INSTRUCTOR.ipynb != CMSE381-Lec08-Review-INSTRUCTOR.ipynb
CMSE381-Lec08_Review.ipynb
Validating Notebook ./CMSE381-Lec08_Review.ipynb
   ERROR: Image LINK not found - http://creativecommons.org/licenses/by-nc/4.0/
1