Jupyter - Day 22 - Secion 001

Jupyter - Day 22 - Secion 001#

Lec 22 - Step Functions for Classification#

Today we will play with the step functions again! But for classification!

# Everyone's favorite standard imports
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
%matplotlib inline
import time


# ML imports we've used previously
from sklearn.model_selection import train_test_split
from sklearn.metrics import mean_squared_error
from sklearn.linear_model import LinearRegression

import statsmodels.api as sm

Loading in the data#

We’re going to use the Wage data used in the book, so note that many of your plots can be checked by looking at figures in the book.

url = "https://msu-cmse-courses.github.io/CMSE381-S26/_downloads/2d664388c56e51af2d83cea1fe3027f4/Wage.csv"
df = pd.read_csv('../../../DataSets/Wage.csv', index_col =0 )
df.head()

	year	age	sex	maritl	race	education	region	jobclass	health	health_ins	logwage	wage
231655	2006	18	1. Male	1. Never Married	1. White	1. < HS Grad	2. Middle Atlantic	1. Industrial	1. <=Good	2. No	4.318063	75.043154
86582	2004	24	1. Male	1. Never Married	1. White	4. College Grad	2. Middle Atlantic	2. Information	2. >=Very Good	2. No	4.255273	70.476020
161300	2003	45	1. Male	2. Married	1. White	3. Some College	2. Middle Atlantic	1. Industrial	1. <=Good	1. Yes	4.875061	130.982177
155159	2003	43	1. Male	2. Married	3. Asian	4. College Grad	2. Middle Atlantic	2. Information	2. >=Very Good	1. Yes	5.041393	154.685293
11443	2005	50	1. Male	4. Divorced	1. White	2. HS Grad	2. Middle Atlantic	2. Information	1. <=Good	1. Yes	4.318063	75.043154

df.info()

<class 'pandas.core.frame.DataFrame'>
Index: 3000 entries, 231655 to 453557
Data columns (total 12 columns):
 #   Column      Non-Null Count  Dtype  
---  ------      --------------  -----  
 0   year        3000 non-null   int64  
 1   age         3000 non-null   int64  
 2   sex         3000 non-null   object 
 3   maritl      3000 non-null   object 
 4   race        3000 non-null   object 
 5   education   3000 non-null   object 
 6   region      3000 non-null   object 
 7   jobclass    3000 non-null   object 
 8   health      3000 non-null   object 
 9   health_ins  3000 non-null   object 
 10  logwage     3000 non-null   float64
 11  wage        3000 non-null   float64
dtypes: float64(2), int64(2), object(8)
memory usage: 304.7+ KB

df.describe()

	year	age	logwage	wage
count	3000.000000	3000.000000	3000.000000	3000.000000
mean	2005.791000	42.414667	4.653905	111.703608
std	2.026167	11.542406	0.351753	41.728595
min	2003.000000	18.000000	3.000000	20.085537
25%	2004.000000	33.750000	4.447158	85.383940
50%	2006.000000	42.000000	4.653213	104.921507
75%	2008.000000	51.000000	4.857332	128.680488
max	2009.000000	80.000000	5.763128	318.342430

Here’s the plot we used multiple times in class to look at a single variable: age vs wage

plt.scatter(df.age[df.wage <=250], df.wage[df.wage<=250],marker = '*', label = '< 250')
plt.scatter(df.age[df.wage >250], df.wage[df.wage>250], label = '> 250')
plt.legend()

plt.xlabel('Age')
plt.ylabel('Wage')
plt.show()

../../../_images/e30cd0affc0dd3667bfa8c7c166175fd350c2491937a3c951095ac9e45537f1a.png

Classification version of step functions#

Now we can try out the classification version of the problem. Let’s build the classifier that predicts whether a person of a given age will make more than $250,000. You already made the matrix of step function features, so we just have to hand it to LogisticRegression to do its thing.

✅ Do this: You will need to first create the dummy variables that represent the step functions. You will need to use pd.cut and pd.get_dummies, or you can copy the relevant code from Day 21’s notebook!

# put your code here

✅ Do this: Pass the dummy variables to a logistic regression model and use it to predict the probability of wage being greater than 250. What is the equation for your learned model? Be specific in terms of the $C_i$ functions you learned earlier. Complete the code below.

from sklearn.linear_model import LogisticRegression
y = np.array(df.wage>250) #<--- this makes sure I 
                          #     just have true/false input
                          #     so that we're doing classification

# put your code below to fit a logistic regression model #

If all goes well, you should be able to run the below code and plat the prediction.

# Build the same step features for the x-values we want to draw
t_age = pd.Series(np.linspace(20,80,100))
t_df_cut = pd.cut(t_age, bins, right = False) #<-- the `bins`` here is from the initial cut
t_dummies = pd.get_dummies(t_df_cut)
t_step = t_dummies.apply(lambda x: x * 1)

# Predict on these to get the line we can draw
f = clf.predict_proba(t_step)

---------------------------------------------------------------------------
NameError                                 Traceback (most recent call last)
Cell In[8], line 3
      1 # Build the same step features for the x-values we want to draw
      2 t_age = pd.Series(np.linspace(20,80,100))
----> 3 t_df_cut = pd.cut(t_age, bins, right = False) #<-- the `bins`` here is from the initial cut
      4 t_dummies = pd.get_dummies(t_df_cut)
      5 t_step = t_dummies.apply(lambda x: x * 1)

NameError: name 'bins' is not defined

below = df.age[df.wage <=250]
above = df.age[df.wage >250]

# Comment this out to see the function better
# plt.scatter(above,np.ones(above.shape[0]),marker = '|', color = 'orange')
# plt.scatter(below,np.zeros(below.shape[0]),marker = '|', color = 'blue')

plt.xlabel('Age')
plt.ylabel('P[Wage >= 250]')
plt.plot(t_age,f[:,1])
plt.show()

---------------------------------------------------------------------------
NameError                                 Traceback (most recent call last)
Cell In[9], line 10
      8 plt.xlabel('Age')
      9 plt.ylabel('P[Wage >= 250]')
---> 10 plt.plot(t_age,f[:,1])
     11 plt.show()

NameError: name 'f' is not defined

../../../_images/0a25ebef47f204dc2b4e5a5917aebce2176b2f8fc05041447a8c27eb51a75230.png

Congratulations, we’re done!#

Initially created by Dr. Liz Munch, modified by Dr. Lianzhang Bao and Dr. Firas Khasawneh, Michigan State University

This work is licensed under a Creative Commons Attribution-NonCommercial 4.0 International License.