Jupyter - Day 22 - Section 001#
Lec 22 - Step Functions for Classification#
Today we will work with step functions again, this time for classification!
# Everyone's favorite standard imports
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
%matplotlib inline
import time
# ML imports we've used previously
from sklearn.model_selection import train_test_split
from sklearn.metrics import mean_squared_error
from sklearn.linear_model import LinearRegression
import statsmodels.api as sm
Loading in the data#
We’re going to use the Wage data set from the book, so many of your plots can be checked against the figures in the book.
df = pd.read_csv('../../../DataSets/Wage.csv', index_col=0)
df.head()
df.info()
df.describe()
Here’s the plot we used multiple times in class to look at a single variable: age vs wage.
plt.scatter(df.age[df.wage <= 250], df.wage[df.wage <= 250], marker='*', label='<= 250')
plt.scatter(df.age[df.wage > 250], df.wage[df.wage > 250], label='> 250')
plt.legend()
plt.xlabel('Age')
plt.ylabel('Wage')
plt.show()
Classification version of step functions#
Now we can try out the classification version of the problem. Let’s build a classifier that predicts whether a person of a given age will make more than $250,000 (that is, wage > 250). You already built the matrix of step function features in Day 21’s notebook, so here we just have to hand it to LogisticRegression to do its thing.
✅ Do this:
First create the dummy variables that represent the step functions. You will need to use pd.cut and pd.get_dummies, or you can copy the relevant code from Day 21’s notebook!
# put your code here
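One possible way to build the step features is sketched below. It is a minimal sketch under stated assumptions, not the Day 21 solution: the ages are synthetic stand-ins for df.age, the choice of 4 equal-width bins is arbitrary, and the names age_cut and age_step are hypothetical. The only name the later plotting code relies on is bins.

```python
import numpy as np
import pandas as pd

# Synthetic stand-in for df.age (assumption: the notebook uses the Wage data)
rng = np.random.default_rng(0)
age = pd.Series(rng.integers(18, 81, size=200), name="age")

# Cut the age range into 4 equal-width, left-closed intervals.
# retbins=True also returns the bin edges so the same cut can be reused later.
age_cut, bins = pd.cut(age, 4, retbins=True, right=False)

# One indicator column per interval, converted to 0/1 integers
age_step = pd.get_dummies(age_cut).astype(int)

print(bins)
print(age_step.head())
```

Each row of age_step has a single 1 in the column of the interval containing that person's age, which is exactly the set of indicator functions \(C_i\) evaluated at that age.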
✅ Do this: Pass the dummy variables to a logistic regression model and use it to predict the probability of wage being greater than 250. What is the equation for your learned model? Be specific in terms of the \(C_i\) functions you learned earlier. Complete the code below.
from sklearn.linear_model import LogisticRegression
# Boolean labels (True when wage > 250), so that we're doing classification
y = np.array(df.wage > 250)
# put your code below to fit a logistic regression model #
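As a rough guide to what the fitted model looks like, assume \(K\) bins with indicators \(C_1, \dots, C_K\). Logistic regression on these features then has the form \(\Pr(\text{wage} > 250 \mid \text{age}) = \frac{1}{1 + \exp\left(-\left(\beta_0 + \sum_{i=1}^{K} \beta_i C_i(\text{age})\right)\right)}\), where \(\beta_0\) corresponds to clf.intercept_ and the \(\beta_i\) to clf.coef_. The sketch below fits such a model on synthetic data; the ages, the labels, and the 4-bin cut are all assumptions made for illustration, and LogisticRegression is used with its default L2 penalty.

```python
import numpy as np
import pandas as pd
from sklearn.linear_model import LogisticRegression

# Synthetic ages and labels (assumption: stand-ins for df.age and df.wage > 250)
rng = np.random.default_rng(0)
age = pd.Series(rng.integers(18, 81, size=500), name="age")
p_true = np.where((age >= 35) & (age < 65), 0.3, 0.05)
y = rng.random(500) < p_true

# Step features: one 0/1 indicator column per age interval
age_cut, bins = pd.cut(age, 4, retbins=True, right=False)
age_step = pd.get_dummies(age_cut).astype(int)

# Fit on a plain array to sidestep non-string column-name handling
clf = LogisticRegression()
clf.fit(age_step.to_numpy(), y)

print("beta_0:", clf.intercept_)
print("beta_1..beta_K:", clf.coef_)

# Predicted probability of the positive class for each training row
proba = clf.predict_proba(age_step.to_numpy())[:, 1]
```

Because each row activates exactly one indicator, the predicted probability takes at most one distinct value per bin, which is why the fitted curve is a step function.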
If all goes well, you should be able to run the code below and plot the prediction.
# Build the same step features for the x-values we want to draw
t_age = pd.Series(np.linspace(20,80,100))
t_df_cut = pd.cut(t_age, bins, right=False)  # <-- `bins` here is from the initial cut
t_dummies = pd.get_dummies(t_df_cut)
t_step = t_dummies.apply(lambda x: x * 1)
# Predict on these to get the line we can draw
f = clf.predict_proba(t_step)
below = df.age[df.wage <=250]
above = df.age[df.wage >250]
# Uncomment the next two lines to overlay the data; leave them commented to see the function better
# plt.scatter(above,np.ones(above.shape[0]),marker = '|', color = 'orange')
# plt.scatter(below,np.zeros(below.shape[0]),marker = '|', color = 'blue')
plt.xlabel('Age')
plt.ylabel('P[Wage > 250]')
plt.plot(t_age,f[:,1])
plt.show()
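The plotting code above depends on cutting the grid of ages with the same bin edges used for training. The sketch below illustrates why that matters, again under assumptions: the ages are a synthetic stand-in covering 18 through 80, and 4 bins are used. Grid points inside the original edges land in exactly one bin, while points outside them get no bin at all, so their feature rows are all zeros.

```python
import numpy as np
import pandas as pd

# Deterministic stand-in for df.age covering 18..80 (assumption)
age = pd.Series(np.tile(np.arange(18, 81), 3))
age_cut, bins = pd.cut(age, 4, retbins=True, right=False)

# Reuse the saved edges on the plotting grid
t_age = pd.Series(np.linspace(20, 80, 100))
t_cut = pd.cut(t_age, bins, right=False)
t_step = pd.get_dummies(t_cut).astype(int)

# Every grid point inside the edges falls in exactly one bin
rows_ok = (t_step.sum(axis=1) == 1).all()
print("all grid rows valid:", rows_ok)

# Values outside the edges become NaN, which get_dummies encodes as all zeros
outside = pd.cut(pd.Series([10.0, 95.0]), bins, right=False)
outside_step = pd.get_dummies(outside).astype(int)
print(outside_step)
```

An all-zero row would be scored using only the intercept, so it is worth keeping the plotting grid inside the range of the training ages.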
Congratulations, we’re done!#
Initially created by Dr. Liz Munch, adapted by Dr. Mengsen Zhang, Michigan State University
This work is licensed under a Creative Commons Attribution-NonCommercial 4.0 International License.