Jupyter Notebook#

Lec 11: More Logistic Regression#

In this module we are going to test out the logistic regression classification method we discussed in class, but now we have:

  • more than one input variable (multiple logistic regression), and

  • more than one level for the output variable (multinomial logistic regression).

import numpy as np 
import matplotlib.pyplot as plt
import pandas as pd 
%matplotlib inline
import seaborn as sns

from sklearn.model_selection import train_test_split
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import accuracy_score
from sklearn.metrics import confusion_matrix, ConfusionMatrixDisplay

Same as last time, we’re going to use the Default data set from the ISLR book as included in their R package. I’ve included a csv on the DataSets page for you to use.

Default = pd.read_csv('../../DataSets/Default.csv')
Default.head(10)
# Here are all the distinct entries in the "default" column
list(Default['default'].unique())

Multiple Logistic Regression#

We’re going to be training models of the form

\[ p(X) = \frac{\exp(\beta_0 + \beta_1 X_1 + \cdots + \beta_p X_p)}{1+\exp(\beta_0 + \beta_1 X_1 + \cdots + \beta_p X_p )} \]
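To make the formula concrete, here is a minimal sketch that evaluates p(X) directly with numpy. The function name logistic_probability and the coefficient values are made up for illustration; they are not fitted to the Default data.

```python
import numpy as np

def logistic_probability(X, beta0, beta):
    """Evaluate p(X) = exp(z) / (1 + exp(z)), where z = beta0 + X @ beta."""
    z = beta0 + np.asarray(X) @ np.asarray(beta)
    # 1 / (1 + exp(-z)) is algebraically the same as exp(z) / (1 + exp(z))
    return 1.0 / (1.0 + np.exp(-z))

# Hypothetical coefficients for three inputs (illustration only)
beta0 = -1.0
beta = np.array([0.5, -0.25, 2.0])

# Two example rows; the second one is chosen so that z = 0
X_example = np.array([[0.0, 0.0, 0.0],
                      [2.0, 4.0, 0.5]])

p = logistic_probability(X_example, beta0, beta)
print(p)
```

Whenever the linear part z equals 0 the probability is exactly 0.5, which is the usual decision threshold.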

This time, we are going to use all three input variables to predict default. First things first, student is a categorical input variable. Let’s deal with that.

Do this: Add a dummy variable column called student_Yes to your data frame and remove the student column. Before moving on, make sure my check below prints out True.
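If you have not built a dummy variable before, here is one possible approach on a tiny made-up data frame (the values are invented, and toy is a hypothetical name). It assumes the categorical column holds the strings 'Yes' and 'No'.

```python
import pandas as pd

# Tiny made-up data frame, not the real Default data
toy = pd.DataFrame({
    'default': ['No', 'Yes', 'No'],
    'student': ['No', 'Yes', 'Yes'],
    'balance': [730.0, 1800.0, 1070.0],
})

# get_dummies replaces 'student' with indicator columns appended at the end.
# drop_first=True drops the first level ('No'), leaving only student_Yes.
toy = pd.get_dummies(toy, columns=['student'], drop_first=True, dtype=int)

print(list(toy.columns))
print(toy)
```

Keeping a single indicator column is enough for a two-level variable, since student_No would just be 1 - student_Yes.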

# Your code here
# Check to see if you did that right!
# If your data frame is updated properly, this should print out True
list(Default.columns) == ['default', 'balance', 'income', 'student_Yes']

If you did all that right, the following should get us our train/test split of our inputs and outputs.

X = Default[['balance', 'income', 'student_Yes']]
y = Default['default']


X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.3, random_state=42)

Do this: Set up a logistic regression model using inputs balance, income, and student_Yes.

  • Train this model on the training set above. What is the equation of your model?
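As a hint for reading off the equation, the sketch below fits a model on synthetic data (not the Default data) and pulls out the fitted coefficients. The names X_toy, y_toy, and model are hypothetical. In sklearn, intercept_ holds the estimate of \(\beta_0\) and coef_ holds the estimates of \(\beta_1, \ldots, \beta_p\), in the same order as the input columns.

```python
import numpy as np
from sklearn.linear_model import LogisticRegression

# Synthetic data with three inputs (illustration only)
rng = np.random.default_rng(0)
X_toy = rng.normal(size=(200, 3))
z = 1.0 + 2.0 * X_toy[:, 0] - 1.0 * X_toy[:, 1]
y_toy = (z + rng.logistic(size=200) > 0).astype(int)

model = LogisticRegression()
model.fit(X_toy, y_toy)

beta0 = model.intercept_[0]   # estimate of beta_0
betas = model.coef_[0]        # estimates of beta_1, ..., beta_p
print(f'p(X) = exp(z) / (1 + exp(z)), with z = {beta0:.3f} '
      + ' '.join(f'{b:+.3f}*X{j + 1}' for j, b in enumerate(betas)))

# Plugging the coefficients into the formula reproduces predict_proba
p_manual = 1.0 / (1.0 + np.exp(-(beta0 + X_toy @ betas)))
print(np.allclose(p_manual, model.predict_proba(X_toy)[:, 1]))
```

Note that LogisticRegression applies L2 regularization by default, so the fitted coefficients are shrunk slightly relative to an unpenalized fit.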

# Your code and such here

Do this: Use the test set to evaluate your model. What is the error?
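Here "error" means the misclassification rate, which is one minus the accuracy. A minimal sketch with made-up labels (y_true and y_hat are hypothetical names, not outputs from the Default model):

```python
from sklearn.metrics import accuracy_score

# Made-up true labels and predictions (illustration only)
y_true = ['No', 'No', 'Yes', 'No', 'Yes']
y_hat = ['No', 'No', 'No', 'No', 'Yes']

# Four of the five predictions match, so accuracy is 0.8
error = 1 - accuracy_score(y_true, y_hat)
print(f'Error rate: {error * 100:.2f}%')
```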

# Your code here

Do this: Take a look at the confusion matrix for this model. What do you notice?

# Your code here
## Note this data is imbalanced, so accuracy is not the best metric
C = confusion_matrix(y_test, Y_pred)
ConfusionMatrixDisplay(confusion_matrix=C, display_labels=logreg.classes_).plot()

# In this case,
#   75 people who defaulted were predicted not to default (the company loses money), and
#   20 people who didn't default were predicted to default (they don't get a loan).
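To see why accuracy can mislead on imbalanced data, here is a small sketch with made-up labels (not the Default results). In sklearn's confusion_matrix, rows are the true labels and columns are the predicted labels, in the order given by labels.

```python
import numpy as np
from sklearn.metrics import confusion_matrix

# Made-up imbalanced labels: 8 'No' and 2 'Yes' (illustration only)
y_true = np.array(['No'] * 8 + ['Yes'] * 2)
y_hat = np.array(['No'] * 7 + ['Yes'] + ['No', 'Yes'])

# Rows are true labels, columns are predicted labels
C = confusion_matrix(y_true, y_hat, labels=['No', 'Yes'])
tn, fp, fn, tp = C.ravel()

accuracy = (tn + tp) / C.sum()
recall_yes = tp / (tp + fn)   # fraction of true 'Yes' cases that were caught

print(C)
print(f'accuracy = {accuracy:.2f}, recall for Yes = {recall_yes:.2f}')
```

The accuracy looks respectable, yet only half of the true 'Yes' cases are found, which is the kind of pattern to look for in the matrix above.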

Multinomial Logistic Regression#

Now we have both multiple inputs and more than two output levels. For this we’re going to use the iris data set, in the version that ships with sklearn.

Take a moment to look at the documentation or the Wikipedia page so you know what’s going on in this data set.

from sklearn.datasets import load_iris
# I'm going to load in the data set and do a bit of processing so that it's in a nice format
iris = load_iris(as_frame=True)
iris_df = iris['frame']
# Map the integer target codes (0, 1, 2) to the species names
iris_df['species'] = iris['target_names'][iris['target'].to_numpy()]
iris_df.drop('target', axis=1, inplace=True)
iris_df.head()

I’ve done all the hard work for you. The cell below does everything:

  • Splits the data into X and y matrices.

  • Does a train/test split

  • Sets up the logistic regression classifier

  • Predicts the outputs for the test data

  • Reports an accuracy.

It might spit out a convergence warning for you. Don’t worry about it.

X = iris_df[['sepal length (cm)', 'sepal width (cm)', 'petal length (cm)', 'petal width (cm)']]
y = iris_df['species']
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.10, random_state=11)

logreg = LogisticRegression()
logreg.fit(X_train, y_train)

Y_pred = logreg.predict(X_test)
err = 1-accuracy_score(y_test, Y_pred)
print(f'Error rate: {err*100:.2f}%')
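If you do see the convergence warning and want it gone, one common option is to give the solver more iterations. The sketch below is an assumption about what helps here, not part of the original cell: it refits on the full iris data with max_iter=1000 (an arbitrary choice) and reports how many iterations the solver used.

```python
from sklearn.datasets import load_iris
from sklearn.linear_model import LogisticRegression

# Full iris data as plain arrays (no train/test split in this sketch)
X_iris, y_iris = load_iris(return_X_y=True)

# The default is max_iter=100; 1000 is an arbitrary, more generous limit
clf = LogisticRegression(max_iter=1000)
clf.fit(X_iris, y_iris)

# n_iter_ records how many iterations the solver actually ran
print(clf.n_iter_)
```

Standardizing the inputs first (for example with StandardScaler) is another common way to help the solver converge.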

Do this:

  • What do the columns of the output of the predict_proba function below correspond to?

  • Check your answer on the first few entries.

logreg.predict_proba(X_test)
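One way to check your answer is to compare the probabilities against the classifier's classes_ attribute. The sketch below does this on synthetic three-class data with made-up class names 'a', 'b', 'c' (not the iris data), so you can repeat the same check on logreg yourself.

```python
import numpy as np
from sklearn.datasets import make_blobs
from sklearn.linear_model import LogisticRegression

# Synthetic 3-class data with hypothetical string labels
X_toy, labels = make_blobs(n_samples=150, centers=3, random_state=0)
y_toy = np.array(['a', 'b', 'c'])[labels]

clf = LogisticRegression(max_iter=1000).fit(X_toy, y_toy)
proba = clf.predict_proba(X_toy[:5])

print(clf.classes_)      # the order of the classes
print(proba.round(3))    # one row per sample, one column per class

# Each row sums to 1, and the largest entry picks the predicted label
pred_from_proba = clf.classes_[proba.argmax(axis=1)]
print(pred_from_proba)
print(clf.predict(X_toy[:5]))
```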

Q: Which of the classes is most often misclassified?
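A confusion matrix answers this kind of question: the off-diagonal entries in each row count the misclassified samples of that true class. Here is a sketch on made-up labels with hypothetical classes 'A', 'B', 'C' (these are not results from the iris model).

```python
import numpy as np
from sklearn.metrics import confusion_matrix

classes = ['A', 'B', 'C']

# Made-up true labels and predictions (illustration only)
y_true = ['A', 'A', 'A', 'B', 'B', 'B', 'B', 'C', 'C', 'C']
y_hat = ['A', 'A', 'A', 'B', 'C', 'B', 'C', 'C', 'C', 'B']

# Rows are true classes, columns are predicted classes
C = confusion_matrix(y_true, y_hat, labels=classes)

# Misclassified count per true class = row total minus the diagonal entry
misclassified = C.sum(axis=1) - np.diag(C)
rate = misclassified / C.sum(axis=1)

worst = classes[int(np.argmax(misclassified))]
print(C)
print(dict(zip(classes, misclassified.tolist())))
print(f'Most often misclassified: {worst}')
```

Comparing the per-class rate rather than the raw count is the safer choice when the classes have different sizes.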

# Your code here

Congratulations, we’re done!#

Written by Dr. Liz Munch, Michigan State University

This work is licensed under a Creative Commons Attribution-NonCommercial 4.0 International License.