Jupyter - Lecture 7#

Even More Linear Regression#

In the last few lectures, we have focused on linear regression, that is, fitting models of the form

\[ Y = \beta_0 + \beta_1 X_1 + \beta_2 X_2 + \cdots + \beta_pX_p + \varepsilon \]

In this lab, we will continue to use two different tools for linear regression.

  • Scikit-learn is arguably the most widely used tool for machine learning in Python

  • Statsmodels provides many of the statistical tests we’ve been learning in class

This lab will cover two ideas:

  • Categorical variables and how to represent them as dummy variables.

  • How to build interaction terms and pass them into your favorite model.

# As always, we start with our favorite standard imports. 

import numpy as np
import pandas as pd
import matplotlib.pyplot as plt 
%matplotlib inline
import seaborn as sns
import statsmodels.formula.api as smf

More questions to ask of your model (Continued from last time)#

Q3: How well does the model fit?#

This section covers the interpretation of \(R^2\) and RSE, but it is on the slides only.

Q4: Making predictions#

url = "https://msu-cmse-courses.github.io/CMSE381-S26/_downloads/27085b33898cca7fb57dc6897c2deccd/Advertising.csv"
advertising_df = pd.read_csv(url, index_col = 0)
# I need to sort the rows by TV to make plotting work better later 
advertising_df= advertising_df.sort_values(by=['TV'])
advertising_df.head()
      TV  Radio  Newspaper  Sales
131  0.7   39.6        8.7    1.6
156  4.1   11.6        5.7    3.2
79   5.4   29.9        9.4    5.3
57   7.3   28.1       41.4    5.5
127  7.8   38.9       50.6    6.6
# I want to just learn Sales using TV
est = smf.ols('Sales ~ TV', advertising_df).fit()
est.params
Intercept    7.032594
TV           0.047537
dtype: float64
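
Before we get to intervals, a quick look back at Q3 for this particular fit: statsmodels stores \(R^2\) directly on the results object, and the residual standard error is just the square root of mse_resid. A minimal sketch using the est we just fit:

# R^2 and RSE for the Sales ~ TV fit above
print('R^2:', est.rsquared)
print('RSE:', np.sqrt(est.mse_resid))
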
# Here is a table giving us the CI and PI information 
alpha = 0.1
# alpha = 0.05
# alpha = 0.01
# alpha = 0.001
advert_summary = est.get_prediction(advertising_df).summary_frame(alpha=alpha)
advert_summary.head()
       mean   mean_se  mean_ci_lower  mean_ci_upper  obs_ci_lower  obs_ci_upper
0  7.065869  0.456216       6.311932       7.819806      1.628140     12.503598
1  7.227494  0.448345       6.486566       7.968422      1.791553     12.663435
2  7.289291  0.445348       6.553316       8.025267      1.854023     12.724559
3  7.379611  0.440981       6.650852       8.108370      1.945316     12.813907
4  7.403379  0.439835       6.676515       8.130244      1.969338     12.837421
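
The same machinery handles values of TV that aren’t in the data set. Here is a minimal sketch predicting Sales for a hypothetical market with a TV budget of 100 (the value is just for illustration):

# Prediction, confidence interval, and prediction interval at TV = 100
new_point = pd.DataFrame({'TV': [100]})
est.get_prediction(new_point).summary_frame(alpha=alpha)
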
# And here is some code that will draw these beasts for us....
plt.rcParams['figure.figsize'] = [10, 10]
plt.rcParams['font.size'] = 16


x = advertising_df['TV']
y = advertising_df['Sales']


# Plot the original data 
plt.scatter(x,y, label = 'Data')

# Plot the fitted values, AKA f_hat
# you can get this in two different places, same answer
# plt.plot(x,advert_summary['mean'], color = 'orange', label = 'f hat')
plt.plot(x,est.fittedvalues, color = 'orange', label = 'f hat')



plt.plot(x,advert_summary['obs_ci_lower'], 'r', lw=2, 
         label = r'Prediction Band')
plt.plot(x,advert_summary['obs_ci_upper'], 'r', lw=2)

plt.plot(x, advert_summary['mean_ci_lower'],'g--', lw=1, 
         label = r'Confidence Region')
plt.plot(x, advert_summary['mean_ci_upper'], 'g--', lw=1)

plt.title('Alpha = '+ str(alpha))



plt.legend()
[Figure: scatter of the Advertising data with the fitted line f hat, the prediction band, and the confidence region, for alpha = 0.1]

Playing with multi-level variables#

The wrong way#

Ok, we’re going to do this incorrectly to start. You pull in the Auto data set. You were so proud of yourself for remembering to fix the problems with the horsepower column that you conveniently forgot that the column with information about country of origin (origin) has a bunch of integers in it, representing:

  • 1: American

  • 2: European

  • 3: Japanese.

url = "https://msu-cmse-courses.github.io/CMSE381-S26/_downloads/d75c3811a83a66f8c261e5b599ef9e44/Auto.csv"
Auto_df = pd.read_csv(url)
Auto_df = Auto_df.replace('?', np.nan)
Auto_df = Auto_df.dropna()
Auto_df.horsepower = Auto_df.horsepower.astype('int')


Auto_df.columns
Index(['mpg', 'cylinders', 'displacement', 'horsepower', 'weight',
       'acceleration', 'year', 'origin', 'name'],
      dtype='object')

You then go on your merry way building the model

\[ \texttt{mpg} = \beta_0 + \beta_1 \cdot \texttt{origin}. \]

from sklearn.linear_model import LinearRegression

X = Auto_df.origin.values
X = X.reshape(-1, 1)
y = Auto_df.mpg.values

regr = LinearRegression()

regr.fit(X,y)

print('beta_1 = ', regr.coef_[0])
print('beta_0 = ', regr.intercept_)
beta_1 =  5.476547480191433
beta_0 =  14.811973615412484

Q: What does your model predict for each of the three types of cars?

# Your code here

Q: Is it possible for your model to predict that both American and Japanese cars have mpg below European cars?

Your answer here.

The right way#

Ok, so you figure out your problem and decide to load in your data and fix the origin column to have names as entries.

convertOrigin = {1: 'American', 2: 'European', 3: 'Japanese'}

# This command swaps out each number n for convertOrigin[n], making it one of
# the three strings instead of an integer now.
Auto_df.origin = Auto_df.origin.apply(lambda n: convertOrigin[n])
Auto_df
      mpg  cylinders  displacement  horsepower  weight  acceleration  year    origin                       name
0    18.0          8         307.0         130    3504          12.0    70  American  chevrolet chevelle malibu
1    15.0          8         350.0         165    3693          11.5    70  American          buick skylark 320
2    18.0          8         318.0         150    3436          11.0    70  American         plymouth satellite
3    16.0          8         304.0         150    3433          12.0    70  American              amc rebel sst
4    17.0          8         302.0         140    3449          10.5    70  American                ford torino
..    ...        ...           ...         ...     ...           ...   ...       ...                        ...
392  27.0          4         140.0          86    2790          15.6    82  American            ford mustang gl
393  44.0          4          97.0          52    2130          24.6    82  European                  vw pickup
394  32.0          4         135.0          84    2295          11.6    82  American              dodge rampage
395  28.0          4         120.0          79    2625          18.6    82  American                ford ranger
396  31.0          4         119.0          82    2720          19.4    82  American                 chevy s-10

392 rows × 9 columns

Below is a quick bit of code that automatically generates our dummy variables. Yay for not having to code that mess ourselves!

origin_dummies_df = pd.get_dummies(Auto_df.origin, prefix='origin')
origin_dummies_df
     origin_American  origin_European  origin_Japanese
0               True            False            False
1               True            False            False
2               True            False            False
3               True            False            False
4               True            False            False
..               ...              ...              ...
392             True            False            False
393            False             True            False
394             True            False            False
395             True            False            False
396             True            False            False

392 rows × 3 columns

Q: What is the interpretation of each column in the origin_dummies_df data frame?

Your answer here
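
As an aside, if you want these dummy columns to sit next to the original predictors, one way is to concatenate them onto the data frame; a minimal sketch (the name Auto_with_dummies is just for illustration):

# Attach the dummy columns to the original data frame (the indices line up)
Auto_with_dummies = pd.concat([Auto_df, origin_dummies_df], axis=1)
Auto_with_dummies[['mpg', 'origin', 'origin_American', 'origin_European', 'origin_Japanese']].head()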

I pass these new dummy variables into my scikit-learn linear regression model and get the following coefficients

X = origin_dummies_df.values
y = Auto_df.mpg

regr = LinearRegression()

regr.fit(X,y)

print('Coefs = ', regr.coef_)
print('Intercept = ', regr.intercept_)
Coefs =  [-1.99394386e+13 -1.99394386e+13 -1.99394386e+13]
Intercept =  19939438589658.2
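
Those enormous coefficients are a warning sign: the three dummy columns always add up to one, so together with the intercept the columns of the design matrix are linearly dependent, and the individual coefficients aren’t pinned down. A quick sanity check:

# Each row has exactly one True, so the three dummy columns sum to 1 everywhere
print((origin_dummies_df.sum(axis=1) == 1).all())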

Q: Now what does your model predict for each of the three types of cars?

# Your code here

Ooops#

Q: Aw man, I didn’t quite do what we said for the dummy variables in class. We talked about having only two dummy variables for a three-level variable. Copy my code from above into the cell below and fix it to use two dummy variables instead of three.

  • Are your coefficients different now?

  • Are your predictions for each of the three origins different now?

  • Does it matter which two levels you used for your dummy variables?

# Your code here
##YOUR ANSWER HERE###

##ANSWER## Coefficients are different (in particular, there’s a different number of them) but the predictions end up the same.

Another right way#

Ok, fine, I’ll cave. I made you do it the hard way, but you got to see how the innards worked, so maybe it’s not all bad ;)

First off, we can ask pandas to drop the first level when it builds the dummy variables, so you don’t have to do it manually every time. But you do need to know how to interpret the outputs!

# Even easier right way.... Note the only difference is the drop_first=True
origin_dummies_df = pd.get_dummies(Auto_df.origin, prefix='origin', drop_first=True)
print(origin_dummies_df.head())

X = origin_dummies_df.values   # rebuild X from the new two-column dummies
y = Auto_df.mpg
regr = LinearRegression()
regr.fit(X,y)
   origin_European  origin_Japanese
0            False            False
1            False            False
2            False            False
3            False            False
4            False            False
LinearRegression()

Statsmodels can automatically split up the categorical variables in a data frame, so it does the hard work for you. Note that here I’m plugging in the original Auto_df data frame, with no processing of the categorical variables on my end at all.

est = smf.ols('mpg ~ origin', Auto_df).fit()
est.summary().tables[1]
                       coef  std err       t  P>|t|  [0.025  0.975]
Intercept           20.0335    0.409  49.025  0.000  19.230  20.837
origin[T.European]   7.5695    0.877   8.634  0.000   5.846   9.293
origin[T.Japanese]  10.4172    0.828  12.588  0.000   8.790  12.044

Q: What is the model learned from the above printout? Be specific in terms of your dummy variables.

##Your code or answer here###
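
One last aside: by default the formula interface picks the first level alphabetically (American here) as the baseline. If you would rather use a different reference level, you can say so in the formula via patsy’s Treatment; a minimal sketch (the choice of Japanese is just for illustration):

# Refit with Japanese as the baseline level instead of American
est2 = smf.ols("mpg ~ C(origin, Treatment(reference='Japanese'))", Auto_df).fit()
est2.params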

Congratulations, we’re done!#

Initially created by Dr. Liz Munch, modified by Dr. Lianzhang Bao and Dr. Firas Khasawneh, Michigan State University. Creative Commons License
This work is licensed under a Creative Commons Attribution-NonCommercial 4.0 International License.