Lec 10: Logistic Regression#
import numpy as np
import matplotlib.pyplot as plt
import pandas as pd
%matplotlib inline
import seaborn as sns
Getting a feel for the data#
We’re going to use the `Default` data set from the ISLR book, as included in their R package. I’ve included a csv on the DataSets page for you to use.
Default = pd.read_csv('../../DataSets/Default.csv')
Default.head()
# Here's all the entries in the "default" column
list(Default['default'].unique())
Classification using Logistic Regression#
Our goal is to predict `default`, a categorical variable taking as values the strings `No` and `Yes`.
For this module, we will largely use the tools from `sklearn` for classification. One of the big perks of the `sklearn` module is that there is a great deal of uniformity in its classes, so once we have a handle on how to interact with one kind of classification tool, very minor tweaks in the code will allow us to use a new model. In fact, much of what we do today should look very similar, in terms of syntax, to the linear regression lab from a few weeks ago.
For our first try at classification, we’ll use `LogisticRegression` from the `sklearn.linear_model` module. I’m a huge fan of the `sklearn` documentation since it includes a great deal of info on the math behind what we’re doing as well as explanations of the code:
from sklearn.linear_model import LogisticRegression
Let’s first predict `default` using `balance`.
Our first job is to extract the portion of the dataframe that we want to use.
X = Default[['balance']]
Y = Default['default']
print(X.shape)
print(Y.shape)
Once we have our data, we create an instance of the model class we want, in this case `LogisticRegression`, and fit the model to the data. Note that `random_state=0` ensures that rerunning the following box will return the same answer every time.
clf = LogisticRegression(random_state=0)
clf.fit(X,Y)
One thing that will be helpful later is the `.classes_` attribute, which stores the possible values of \(Y\) being predicted. Take note of the order of these values; it will matter later!
clf.classes_
Great, that was easy! Once we’ve fit the model, the main task is to understand how to extract information from it.
✅ Do this: Extract the coefficients and intercept from the trained model. What is the equation, in terms of the variables used, that you are modeling? Be specific about what probability you are modeling! (Hint: You might need to take a look at the documentation to figure out how to get the coefficients and intercept, but you should notice that `sklearn` has a pattern for how it does this.)
# Your code here
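If you get stuck, here’s a minimal sketch of one possible approach, using the standard `coef_` and `intercept_` attributes that `sklearn` uses to store fitted parameters:

# One possible approach: sklearn stores the fitted parameters
# in the .coef_ and .intercept_ attributes after calling .fit
beta_0 = clf.intercept_[0]   # intercept
beta_1 = clf.coef_[0, 0]     # coefficient on balance
print(beta_0, beta_1)
# With these, the fitted model has the form
# Pr(default = Yes | balance) = e^(beta_0 + beta_1 * balance) / (1 + e^(beta_0 + beta_1 * balance))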
While it’s good to know what equation we’re modeling with, the big perk here is that your `sklearn` class will evaluate the model on your data points for you. Yay!
✅ Do this: Use the `predict_proba` function to determine the probabilities \(Pr(Y = \texttt{Yes} \mid X)\) for the data set. What shape is the output matrix? Why that shape? What do the columns represent?
# Your code here
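If you want something to check your work against, a sketch along these lines should do it (remember that the column order matches `clf.classes_`):

# One possible approach: each row of the output is
# [Pr(Y = No | x), Pr(Y = Yes | x)], with columns ordered as in clf.classes_
probs = clf.predict_proba(X)
print(probs.shape)   # one row per data point, one column per class
probs[:5]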
Of course this gives us the probability of each label for each data point, but we really would like to have the prediction itself.
✅ Do this: Use the `predict` function to determine the predictions for each input data point in the original \(X\) matrix and store the output as `Yhat`. How many predictions are different than the actual `default` values? What’s the percent error for the model?
# Your code here
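A minimal sketch of one way to count the misclassifications, assuming you store the predictions as `Yhat`:

# One possible approach
Yhat = clf.predict(X)
n_wrong = (Yhat != Y).sum()               # number of misclassified points
print(n_wrong, 100 * n_wrong / len(Y))    # error count and percent error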
✅ Do this: An even easier way of figuring out the error rate is through the score. What does the output of `clf.score(X,Y)` mean, and how is it related to the number you determined above?
# Your code here
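To check your answer: `score` returns the mean accuracy on the given data, so a sketch like this should line up with the error rate above:

# One possible approach: score gives the fraction of correct predictions,
# so 1 - score is the (fractional) error rate
print(clf.score(X, Y))
print(1 - clf.score(X, Y))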
Confusion matrix#
As we saw in class, the percent error is a rather limited way of evaluating a classification model. Luckily, `sklearn` provides commands for easily computing the confusion matrix for a given model. The `confusion_matrix` command computes the confusion matrix, and `ConfusionMatrixDisplay` gives a nice visual representation.
from sklearn.metrics import confusion_matrix, ConfusionMatrixDisplay
# This code gives the confusion matrix, assuming you stored the predicted values as `Yhat`.
C = confusion_matrix(Y,Yhat)
C
# This code gives a visual representation
ConfusionMatrixDisplay(C).plot()
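As a small usage note, you can also pass the class labels to `ConfusionMatrixDisplay` via its `display_labels` parameter, so the axes show `No` and `Yes` rather than integer indices:

# Same plot, with axes labeled by the actual class names
ConfusionMatrixDisplay(C, display_labels=clf.classes_).plot()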
✅ Q: The makers of `sklearn` made a PARTICULARLY strange choice when it comes to the confusion matrix representation. What is different about the `sklearn` confusion matrix from how we saw it in class?
Your answer here
Congratulations, we’re done!#
Written by Dr. Liz Munch, Michigan State University
This work is licensed under a Creative Commons Attribution-NonCommercial 4.0 International License.