Jupyter - Day 22 - Section 001

Lec 22 - Step Functions for Classification

Today we will play with the step functions again! But for classification!

# Everyone's favorite standard imports
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
%matplotlib inline
import time


# ML imports we've used previously
from sklearn.model_selection import train_test_split
from sklearn.metrics import mean_squared_error
from sklearn.linear_model import LinearRegression

import statsmodels.api as sm

Loading in the data

We’re going to use the Wage data used in the book, so note that many of your plots can be checked by looking at figures in the book.

df = pd.read_csv('../../../DataSets/Wage.csv', index_col =0 )
df.head()

df.info()

df.describe()

Here’s the plot we used multiple times in class to look at a single variable: age vs wage

# Split the points at wage = 250 so the high earners stand out
plt.scatter(df.age[df.wage <= 250], df.wage[df.wage <= 250], marker='*', label='<= 250')
plt.scatter(df.age[df.wage > 250], df.wage[df.wage > 250], label='> 250')
plt.legend()

plt.xlabel('Age')
plt.ylabel('Wage')
plt.show()

Classification version of step functions

Now we can try out the classification version of the problem. Let’s build a classifier that predicts whether a person of a given age will make more than $250,000. You already built the matrix of step function features on Day 21, so once it is recreated here we just have to hand it to LogisticRegression to do its thing.

✅ Do this: You will need to first create the dummy variables that represent the step functions. You will need to use pd.cut and pd.get_dummies, or you can copy the relevant code from Day 21’s notebook!

# put your code here
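As a rough guide to what this step can look like, here is a minimal sketch on synthetic ages rather than the Wage data. The cut points in `bins` and the names `age_cut` and `age_step` are assumptions made for illustration; the edges you chose on Day 21 may well be different.

```python
import numpy as np
import pandas as pd

# Synthetic stand-in for df.age (assumption: in the notebook you would use df.age)
rng = np.random.default_rng(0)
age = pd.Series(rng.integers(18, 81, size=200), name="age")

# Hypothetical cut points; the edges used on Day 21 may differ
bins = [0, 35, 50, 65, 200]

# right=False gives left-closed intervals such as [35, 50)
age_cut = pd.cut(age, bins, right=False)

# One indicator column per interval, converted from True/False to 1/0
age_step = pd.get_dummies(age_cut).astype(int)

print(age_step.head())
print(age_step.sum(axis=0))  # how many people fall in each interval
```

Each row has exactly one 1, marking the interval that person's age falls in. Keeping the name `bins` matters here because the plotting cell later in this notebook reuses it.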

✅ Do this: Pass the dummy variables to a logistic regression model and use it to predict the probability of wage being greater than 250. What is the equation for your learned model? Be specific in terms of the $C_i$ functions you learned earlier. Complete the code below.

from sklearn.linear_model import LogisticRegression
y = np.array(df.wage>250) #<--- this makes sure I 
                          #     just have true/false input
                          #     so that we're doing classification

# put your code below to fit a logistic regression model #
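One possible shape for the fit, sketched on synthetic data so it runs on its own. The ages, the made-up probabilities in `p_high`, and the cut points in `bins` are all assumptions for illustration; in the notebook you would use your own dummy variables and the `y` defined above.

```python
import numpy as np
import pandas as pd
from sklearn.linear_model import LogisticRegression

# Synthetic ages and labels (assumption: stand-ins for df.age and df.wage > 250)
rng = np.random.default_rng(0)
n = 2000
age = pd.Series(rng.integers(18, 81, size=n))
p_high = np.where((age >= 35) & (age < 65), 0.10, 0.02)  # made-up rates
y = rng.random(n) < p_high

# Step-function features with hypothetical cut points
bins = [0, 35, 50, 65, 200]
X = pd.get_dummies(pd.cut(age, bins, right=False)).astype(int)

# Fit on the plain array so the Interval column names are not used as feature names
clf = LogisticRegression()
clf.fit(X.to_numpy(), y)

print(clf.intercept_, clf.coef_)

# Predicted probability of the True class for every row
probs = clf.predict_proba(X.to_numpy())[:, 1]
```

With a fit like this, the learned model has the form $P(\text{wage} > 250 \mid \text{age}) = 1 / (1 + e^{-(\beta_0 + \sum_i \beta_i C_i(\text{age}))})$, where $\beta_0$ is `clf.intercept_` and the $\beta_i$ are the entries of `clf.coef_`. Because exactly one $C_i$ equals 1 for any age, the predicted probability is constant within each interval. Note that `LogisticRegression` applies L2 regularization by default, so the coefficients are slightly shrunk relative to an unpenalized fit.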

If all goes well, you should be able to run the code below and plot the prediction.

# Build the same step features for the x-values we want to draw
t_age = pd.Series(np.linspace(20,80,100))
t_df_cut = pd.cut(t_age, bins, right = False) #<-- `bins` here is from the initial cut
t_dummies = pd.get_dummies(t_df_cut)
t_step = t_dummies.apply(lambda x: x * 1)

# Predict on these to get the line we can draw
f = clf.predict_proba(t_step)

below = df.age[df.wage <=250]
above = df.age[df.wage >250]

# Uncomment these two lines to overlay the data; leave them commented to see the function better
# plt.scatter(above,np.ones(above.shape[0]),marker = '|', color = 'orange')
# plt.scatter(below,np.zeros(below.shape[0]),marker = '|', color = 'blue')

plt.xlabel('Age')
plt.ylabel('P[Wage > 250]')
plt.plot(t_age,f[:,1])
plt.show()

Congratulations, we’re done!

Initially created by Dr. Liz Munch, adapted by Dr. Mengsen Zhang, Michigan State University

This work is licensed under a Creative Commons Attribution-NonCommercial 4.0 International License.