Jupyter Notebook#

Lec 21 - Polynomial Regression and Step Functions#

In this module we are going to implement polynomial regression and step functions as discussed in class.

# Everyone's favorite standard imports
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt

%matplotlib inline


# ML imports we've used previously
# import sklearn
from sklearn.model_selection import train_test_split
from sklearn.metrics import mean_squared_error
from sklearn.preprocessing import PolynomialFeatures

0. Loading in the data#

We’re going to use the Wage data used in the book, so note that many of your plots can be checked by looking at figures in the book.

df = pd.read_csv('../../DataSets/Wage.csv', index_col=0)
df.head()
df.info()
df.describe()

Here’s the plot we used multiple times in class to look at a single variable: age vs wage. I’ve also split the points so that the people making above and below $250,000 (a wage value of 250) are drawn in different colors.

plt.scatter(df.age[df.wage <= 250], df.wage[df.wage <= 250], marker='*', label='<= 250')
plt.scatter(df.age[df.wage > 250], df.wage[df.wage > 250], label='> 250')
plt.legend()

plt.xlabel('Age')
plt.ylabel('Wage')
plt.show()

1. Linear Regression#

Before we do anything fancy, let’s just do some linear regression. It’s not going to be a great fit to our data; the point is just to see how we can draw the function the model learns.

If I want to learn a linear model predicting wage from age, I can do the same thing we’ve done for the last month.

from sklearn.linear_model import LinearRegression
X = df.age.values.reshape(-1,1)
y = df.wage

linreg = LinearRegression()

linreg.fit(X,y)

Do this: What is the equation learned by the model?

# your code and/or answer here

Now I could plot this by taking the equation I just learned and applying it to some vector of values. However, it is even easier to do this by simply predicting the outputs for a lot of the \(x\)-inputs and then drawing the function on top of the data.

Do this: Use the predict function to get the function outputs and then plot this using the code below.

t_age = np.linspace(20,80, 100).reshape(-1,1)

# your code here for y_wage
y_wage = np.zeros(100) # <----- replace this with your code. 
                            # The zeros are just there so it 
                            # runs before you fix it. 
# If you edit the y_wage stuff above, you'll get the linear regression line drawn here.


# This is all the stuff to plot the data
plt.scatter(df.age[df.wage <= 250], df.wage[df.wage <= 250], marker='*', label='<= 250')
plt.scatter(df.age[df.wage > 250], df.wage[df.wage > 250], label='> 250')
plt.legend()

plt.xlabel('Age')
plt.ylabel('Wage')

# This is what I'm adding to draw the line 
plt.plot(t_age, y_wage, color = 'red')
plt.show()
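
If the predict-on-a-grid idea feels abstract, here is a minimal sketch of it on made-up data (not the Wage data). The names `x_demo`, `y_demo`, `demo_reg`, `grid`, and `y_grid` are invented for this illustration; the only assumption is that a fitted `LinearRegression` stores its slope in `coef_` and its intercept in `intercept_`.

```python
import numpy as np
from sklearn.linear_model import LinearRegression

# Made-up stand-in for (age, wage): points lying exactly on y = 30 + 1.5 x.
x_demo = np.linspace(20, 80, 50)
y_demo = 30.0 + 1.5 * x_demo

# sklearn wants a 2D feature matrix, hence the reshape.
demo_reg = LinearRegression().fit(x_demo.reshape(-1, 1), y_demo)

# The learned equation is y = intercept_ + coef_[0] * x.
print(f"y = {demo_reg.intercept_:.2f} + {demo_reg.coef_[0]:.2f} * x")

# To draw the function, predict on an evenly spaced grid of inputs.
grid = np.linspace(20, 80, 100).reshape(-1, 1)
y_grid = demo_reg.predict(grid)
```

Passing `grid` and `y_grid` to `plt.plot` draws the fitted line, exactly as in the plotting cell above.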

1. Polynomial Regression#

Our first step is to build a polynomial regression model using the age data to predict wage. So, as in class, we are in the single-predictor world here (age is the only input), and we are going to fit the degree-\(p\) polynomial model
\[ \texttt{wage} = \beta_0 + \beta_1 \texttt{age} + \beta_2 \texttt{age}^2 + \cdots + \beta_p \texttt{age}^p + \varepsilon. \]

The trick here is to build a matrix \(X\) which has a column containing \(\texttt{age}\), one with \(\texttt{age}^2\), one with \(\texttt{age}^3\), etc. Then we hand this to your favorite regression tool (it doesn’t need to know it’s getting polynomial matrix inputs; it just sees a matrix of features and does its thing).
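
To see what that matrix looks like, here is a tiny sketch on two made-up input values. The names `demo_x`, `demo_poly`, and `demo_powers` are invented for this illustration and are not used elsewhere in the notebook.

```python
import numpy as np
from sklearn.preprocessing import PolynomialFeatures

# Two made-up inputs, as a column vector (one feature, two samples).
demo_x = np.array([[2.0], [3.0]])

# Degree 3, no constant column: the regression adds its own intercept.
demo_poly = PolynomialFeatures(3, include_bias=False)
demo_powers = demo_poly.fit_transform(demo_x)

# Each row holds x, x^2, x^3 for one sample.
print(demo_powers)
```

With `include_bias=False` there is no column of ones, which is the choice that makes sense when the regression model fits its own intercept.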

Here’s the code we learned in Lecture 14 for building the data frame of powers of input \(X\).

p = 3
poly = PolynomialFeatures(p, include_bias = False)
X_powers = poly.fit_transform(X)

# X_powers

model = LinearRegression()
model.fit(X_powers,y)
print(f"Coefs: {model.coef_}")
print(f"Intercept: {model.intercept_}")

Do this: What is the equation of the model learned?

your answer here

Do this: As before, in order to plot we want to determine y_wage from the t_age input below by using the predict command. However, note that to do this, you’re going to have to pass in the polynomial matrix for the t_age column.

t_age = np.linspace(10,90,100).reshape(-1,1)

# Make the polynomial features from t_age 
t_powers = None # <--- Replace this with your code to make the polynomial features of t_age

# Now predict the y values from the model and the t_powers
y_wage = None # <--- Replace this with your code to predict the y values from the model and the t_powers
# Note: this should look like the slide from class.
plt.scatter(df.age[df.wage <=250], df.wage[df.wage<=250],marker = '*')
plt.scatter(df.age[df.wage >250], df.wage[df.wage>250])

plt.xlabel('Age')
plt.ylabel('Wage')

plt.plot(t_age, y_wage, c= 'red')
plt.show()
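
The key point in this exercise is that new inputs must go through the same polynomial transformation before calling predict. Below is a rough sketch of that pattern on a made-up cubic (not the Wage data); every name in it (`x_train`, `cubic`, `cubic_model`, `x_grid`, and so on) is invented for this illustration.

```python
import numpy as np
from sklearn.linear_model import LinearRegression
from sklearn.preprocessing import PolynomialFeatures

# Made-up training data lying exactly on y = 1 + 2x - 0.5x^2 + 0.1x^3.
x_train = np.linspace(-3, 3, 40).reshape(-1, 1)
x_flat = x_train.ravel()
y_train = 1 + 2 * x_flat - 0.5 * x_flat**2 + 0.1 * x_flat**3

# Build the powers of x once, then fit an ordinary linear model on them.
cubic = PolynomialFeatures(3, include_bias=False)
cubic_model = LinearRegression().fit(cubic.fit_transform(x_train), y_train)

# New inputs need the SAME transformation before predicting.
x_grid = np.linspace(-3, 3, 100).reshape(-1, 1)
grid_powers = cubic.transform(x_grid)
y_grid_hat = cubic_model.predict(grid_powers)
```

Calling `predict` directly on `x_grid` would fail here, since the model was trained on three columns and `x_grid` has only one.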

2. Step functions#

Now let’s try to use step functions to learn a model using age to predict wage. Like with the polynomial example from last time, all we’re going to do is build a data frame or feature matrix that has the step function values in each column, and then pass that matrix to our favorite linear modeling function.

First, we want to get a dataframe with the cuts.

df_cut, bins = pd.cut(df.age, 4, retbins = True, right = False)

Note that df_cut is a pandas Series with each data point now represented by the interval that contains it.

df_cut

Here I’m just printing it out in a column next to the age information that was used to generate it.

pd.DataFrame({'age': df.age, 'df_cut': df_cut}).head(10)

The bins output gives me the \(c_i\) knots as follows.

print(bins)
# This is how it matches with our notation.
print(r'c_1 = ', bins[0])
print(r'c_2 = ', bins[1])
print(r'c_3 = ', bins[2])
print(r'c_4 = ', bins[3])
print(r'c_5 = ', bins[4])
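
For a smaller picture of what `pd.cut` returns, here is a sketch on eight made-up ages. The names `demo_ages`, `demo_cut`, and `demo_bins` are invented for this illustration.

```python
import pandas as pd

# Eight made-up ages, NOT the Wage data.
demo_ages = pd.Series([18, 25, 33, 41, 52, 60, 71, 80])

# Ask for 4 equal-width bins; right=False makes each interval closed on the left.
demo_cut, demo_bins = pd.cut(demo_ages, 4, retbins=True, right=False)

print(demo_cut.cat.categories)            # the 4 intervals
print(demo_bins)                          # the 5 bin edges
print(demo_cut.value_counts(sort=False))  # how many ages land in each interval
```

Four bins always come with five edges, which is why the cell above prints `c_1` through `c_5`.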

Do this: For each of the functions \(C_0(X)\), \(C_1(X)\), \(C_2(X)\), \(C_3(X)\), \(C_4(X)\), \(C_5(X)\) (following our notation in class), determine the domains where they have value 1.

Your answer here

  • \(C_0(X)\):

  • \(C_1(X)\):

  • \(C_2(X)\):

  • \(C_3(X)\):

  • \(C_4(X)\):

  • \(C_5(X)\):

We can use the dummy variable trick to turn the df_cut output into something closer to what we are using. Below is my code that generates the data frame storing \(C_i(X)\) for all our entries.

df_steps_dummies = pd.get_dummies(df_cut) # This gives us entries with true/false
df_steps = df_steps_dummies.apply(lambda x: x * 1) # This converts those to either 0 or 1.
df_steps
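
As a sanity check on the dummy variable trick, here is a sketch on a handful of made-up ages (the names `demo_ages`, `demo_cut`, and `demo_steps` are invented for this illustration). Each row should contain exactly one 1, marking the interval that age falls into.

```python
import pandas as pd

# Eight made-up ages, NOT the Wage data.
demo_ages = pd.Series([18, 25, 33, 41, 52, 60, 71, 80])
demo_cut = pd.cut(demo_ages, 4, right=False)

# One indicator column per interval, converted to 0/1 integers.
demo_steps = pd.get_dummies(demo_cut).astype(int)

print(demo_steps)
print(demo_steps.sum(axis=1).unique())  # every row sums to 1
```

Here `.astype(int)` is one alternative to the `apply(lambda x: x * 1)` conversion used above; both should leave a matrix of 0s and 1s.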

Q: Which of the functions \(C_i(X)\) for \(i=0,\ldots, 5\) have columns represented in this matrix? Note: it’s not all of them.

  • Your answer here

Do this: Pass this matrix to a linear regression model and use it to predict wage. What is the equation for your learned model? Be specific in terms of the \(C_i\) functions you learned earlier.

# Your code here #

Assuming you stored your linear regression model as linreg, the following code sets up the inputs we need to plot the learned function. Once the plot is drawn, check that the answers you gave above match what you see in the graph.

t_age = pd.Series(np.linspace(20,80,100))
t_df_cut = pd.cut(t_age, bins, right = False) #<-- I'm explicitly passing the same bins learned above so that the procedure is the same. 

t_dummies = pd.get_dummies(t_df_cut)
t_step = t_dummies.apply(lambda x: x * 1)

t_step.head()

Do this: Above, I applied the same step function transformation to the input t_age values. Now use the learned linear regression model to predict y_wage so that we can graph it.

# your code here

# y_wage = .......
plt.scatter(df.age[df.wage <=250], df.wage[df.wage<=250],marker = '*')
plt.scatter(df.age[df.wage >250], df.wage[df.wage>250])
plt.xlabel('Age')
plt.ylabel('Wage')

plt.plot(t_age, y_wage,color='red')
plt.show()

Congratulations, we’re done!#

Written by Dr. Liz Munch, Michigan State University

Creative Commons License
This work is licensed under a Creative Commons Attribution-NonCommercial 4.0 International License.