Jupyter - Day 11 - Section 001#

Lec 11: More Logistic Regression#

In this module we are going to test out the logistic regression classification method we discussed in class, but now we have:

  • more than one input variable (multiple logistic regression), and

  • more than one level for the output variable (multinomial logistic regression).

import numpy as np 
import matplotlib.pyplot as plt
import pandas as pd 
%matplotlib inline
import seaborn as sns

from sklearn.model_selection import train_test_split
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import accuracy_score
from sklearn.metrics import confusion_matrix, ConfusionMatrixDisplay

Same as last time, we’re going to use the Default data set from the ISLR book as included in their R package. I’ve included a CSV on the DataSets page for you to use.

url = "https://msu-cmse-courses.github.io/CMSE381-S26/_downloads/0bf0b0b65f603971cd33a04ad934449c/Default.csv"
Default = pd.read_csv(url)
Default.head(10)
  default student      balance        income
0      No      No   729.526495  44361.625070
1      No     Yes   817.180407  12106.134700
2      No      No  1073.549164  31767.138950
3      No      No   529.250605  35704.493940
4      No      No   785.655883  38463.495880
5      No     Yes   919.588531   7491.558572
6      No      No   825.513331  24905.226580
7      No     Yes   808.667504  17600.451340
8      No      No  1161.057854  37468.529290
9      No      No     0.000000  29275.268290
# Here's all the entries in the "default" column
list(Default['default'].unique())
['No', 'Yes']

Multiple Logistic Regression#

We’re going to be training models of the form

\[ p(X) = \frac{\exp(\beta_0 + \beta_1 X_1 + \cdots + \beta_p X_p)}{1+\exp(\beta_0 + \beta_1 X_1 + \cdots + \beta_p X_p )} \]

This time, we are going to use all three input variables to predict default. First things first, student is a categorical input variable. Let’s deal with that.

Do this: Add a dummy variable column called student_Yes to your data frame and remove the student column. Before moving on, make sure my check below prints out True.

# Your code here
# Check to see if you did that right!
# If your data frame is updated properly, this should print out True
list(Default.columns) == ['default', 'balance', 'income', 'student_Yes']
True
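If you get stuck, here is one possible approach, sketched with pandas’ get_dummies (the astype(int) line is only needed on newer pandas versions, where dummy columns default to booleans):

# One possible sketch: get_dummies builds the student_Yes column and
# drops the original student column in a single step.
Default = pd.get_dummies(Default, columns=['student'], drop_first=True)
# Newer pandas returns booleans; convert to 0/1 ints for modeling.
Default['student_Yes'] = Default['student_Yes'].astype(int)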

If you did all that right, the following should get us our train/test split of our inputs and outputs.

X = Default[['balance', 'income', 'student_Yes']]
y = Default['default']


X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.3, random_state=42)

Do this: Set up a logistic regression model using inputs balance, income, and student_Yes.

  • Train this model on the training set above.

# Your code and such here
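If you need a starting point, here is a minimal sketch; the name multi_logreg is just a placeholder, and max_iter is raised because sklearn’s default of 100 iterations may not converge on this data:

# A minimal sketch: fit a logistic regression on all three inputs.
multi_logreg = LogisticRegression(max_iter=1000)  # placeholder name
multi_logreg.fit(X_train, y_train)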

Do this: What is the equation of your model?

##YOUR ANSWER HERE###
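As a hint, the fitted coefficients live in the intercept_ and coef_ attributes of the trained model. A sketch, assuming your model is stored as multi_logreg as in the placeholder above:

# Sketch: print the fitted intercept and coefficients so the equation
# for p(X) can be written out by hand.
print('beta_0:', multi_logreg.intercept_)
print('beta_1, beta_2, beta_3:', multi_logreg.coef_)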

Do this: Use the test set to evaluate your model. What is the error?

# Your code here
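One way to get the test error, sketched under the same multi_logreg assumption:

# Sketch: predict on the test set and compute the misclassification rate.
y_pred = multi_logreg.predict(X_test)
print('Test error:', 1 - accuracy_score(y_test, y_pred))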

Do this: Take a look at the confusion matrix for this model. What do you notice?

# Your code here
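A sketch using the confusion matrix tools imported at the top (y_pred here is the test-set prediction from the previous step):

# Sketch: build and display the confusion matrix for the test set.
cm = confusion_matrix(y_test, y_pred, labels=multi_logreg.classes_)
ConfusionMatrixDisplay(cm, display_labels=multi_logreg.classes_).plot()
plt.show()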

Multinomial Logistic Regression#

Now we’ve got both multiple inputs and multiple levels for the output. In this case we’re going to use the iris data set, specifically the version that ships with sklearn.

Take a moment to look at the documentation or the Wikipedia page so you know what’s going on in this data set.

from sklearn.datasets import load_iris
# just to remind me what the data is about.
iris_df = load_iris(as_frame=True)['frame']
print(iris_df.head())
print(load_iris()['target_names'])
   sepal length (cm)  sepal width (cm)  petal length (cm)  petal width (cm)  target
0                5.1               3.5                1.4               0.2       0
1                4.9               3.0                1.4               0.2       0
2                4.7               3.2                1.3               0.2       0
3                4.6               3.1                1.5               0.2       0
4                5.0               3.6                1.4               0.2       0
['setosa' 'versicolor' 'virginica']
# I'm going to load in the data set and do a bit of processing so that it's in a nice format
iris_df = load_iris(as_frame=True)['frame']
target_names = load_iris()['target_names']
iris_df['species'] = target_names[load_iris()['target']]
iris_df.drop('target', axis=1, inplace=True)
iris_df.head()
   sepal length (cm)  sepal width (cm)  petal length (cm)  petal width (cm) species
0                5.1               3.5                1.4               0.2  setosa
1                4.9               3.0                1.4               0.2  setosa
2                4.7               3.2                1.3               0.2  setosa
3                4.6               3.1                1.5               0.2  setosa
4                5.0               3.6                1.4               0.2  setosa

I’ve done all the hard work for you. The cell below does everything:

  • Splits the data into X and y matrices.

  • Does a train/test split

  • Sets up the logistic regression classifier

  • Predicts the outputs for the test data

  • Reports an accuracy.

It might spit out a convergence warning for you; don’t worry about it.

X = iris_df[['sepal length (cm)', 'sepal width (cm)', 'petal length (cm)', 'petal width (cm)']]
y = iris_df['species']
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.10, random_state=11)

logreg = LogisticRegression()
logreg.fit(X_train, y_train)

Y_pred = logreg.predict(X_test)
err = 1-accuracy_score(y_test, Y_pred)
print(f'Error rate: {err*100:.2f}%')
Error rate: 13.33%
/Users/bao/anaconda3/lib/python3.11/site-packages/sklearn/linear_model/_logistic.py:460: ConvergenceWarning: lbfgs failed to converge (status=1):
STOP: TOTAL NO. of ITERATIONS REACHED LIMIT.

Increase the number of iterations (max_iter) or scale the data as shown in:
    https://scikit-learn.org/stable/modules/preprocessing.html
Please also refer to the documentation for alternative solver options:
    https://scikit-learn.org/stable/modules/linear_model.html#logistic-regression
  n_iter_i = _check_optimize_result(

Do this:

  • What do the columns of the predict_proba output below correspond to?

  • Check your answer on the first few entries.

logreg.predict_proba(X_test)
array([[3.23924808e-05, 8.25704662e-02, 9.17397141e-01],
       [7.91126112e-05, 9.53513667e-02, 9.04569521e-01],
       [6.40111553e-04, 5.17334095e-01, 4.82025794e-01],
       [4.25255067e-03, 6.73997132e-01, 3.21750318e-01],
       [1.06305350e-04, 1.80263562e-01, 8.19630133e-01],
       [9.66935618e-01, 3.30642735e-02, 1.08400518e-07],
       [6.00710617e-03, 9.13850378e-01, 8.01425162e-02],
       [9.89089492e-01, 1.09104437e-02, 6.45741115e-08],
       [9.61923265e-01, 3.80765024e-02, 2.32541492e-07],
       [3.35813423e-03, 7.76634637e-01, 2.20007229e-01],
       [4.79867686e-04, 5.64867409e-01, 4.34652723e-01],
       [8.53525793e-03, 9.78920110e-01, 1.25446319e-02],
       [1.07610909e-02, 7.22993337e-01, 2.66245572e-01],
       [3.13141872e-04, 1.90952691e-01, 8.08734167e-01],
       [5.69867064e-05, 1.09670418e-01, 8.90272595e-01]])
##YOUR CODE HERE##
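Hint, as a sketch: the columns follow the order of logreg.classes_, so the class with the largest probability in each row should agree with predict:

# Sketch: the column order of predict_proba matches logreg.classes_.
print(logreg.classes_)
probs = logreg.predict_proba(X_test)
# Largest-probability class per row should match the predicted labels.
print(logreg.classes_[probs[:3].argmax(axis=1)])
print(logreg.predict(X_test)[:3])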

Q: Which of the classes is most often misclassified?

# Your code here
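One possible sketch: count the off-diagonal entries of the confusion matrix row by row, since each row corresponds to a true class:

# Sketch: rows of the confusion matrix are true classes, so the
# off-diagonal counts in each row are that class's misclassifications.
cm = confusion_matrix(y_test, Y_pred, labels=logreg.classes_)
misclassified = cm.sum(axis=1) - np.diag(cm)
for name, count in zip(logreg.classes_, misclassified):
    print(f'{name}: misclassified {count} time(s)')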

Congratulations, we’re done!#

Initially created by Dr. Liz Munch, modified by Dr. Lianzhang Bao and Dr. Firas Khasawneh, Michigan State University

This work is licensed under a Creative Commons Attribution-NonCommercial 4.0 International License.