Lec 10: Logistic Regression#
import numpy as np
import matplotlib.pyplot as plt
import pandas as pd
%matplotlib inline
import seaborn as sns
Getting a feel for the data#
We’re going to use the `Default` data set from the ISLR book, as included in their R package. I’ve included a csv on the DataSets page for you to use.
Default = pd.read_csv('../../DataSets/Default.csv')
Default.head()
# Here's all the entries in the "default" column
list(Default['default'].unique())
Classification using Logistic Regression#
Our goal is to predict `default`, a categorical variable taking as values the strings `No` and `Yes`.
For this module, we will largely use the tools from `sklearn` for classification. One of the big perks of the `sklearn` module is that there is a great deal of uniformity in its classes, so once we have a handle on how to interact with one kind of classification tool, very minor tweaks in the code will allow us to use a new model. In fact, much of what we do today should look very similar, in terms of syntax, to the linear regression lab from a few weeks ago.
For our first try at classification, we’ll use `LogisticRegression` from the `sklearn.linear_model` module. I’m a huge fan of the `sklearn` documentation since it includes a great deal of info on the math behind what we’re doing as well as explanations of the code:
from sklearn.linear_model import LogisticRegression
Let’s first predict `default` using `balance`.
Our first job is to extract the portion of the dataframe that we want to use.
X = Default[['balance']]
Y = Default['default']
print(X.shape)
print(Y.shape)
Once we have our data, we create an instance of the model class we want, in this case `LogisticRegression`, and fit the model to the data. Note that `random_state=0` ensures that rerunning the following box will return the same answer every time.
clf = LogisticRegression(random_state=0)
clf.fit(X,Y)
One thing that will be helpful later is the `.classes_` attribute, which stores the possible values of \(Y\) being predicted. Take note of the order of these values; it will matter later!
clf.classes_
Great, that was easy! Once we’ve fit the model, the main task is to understand how to extract information from it.
✅ Do this: Extract the coefficients and intercept from the trained model. What is the equation, in terms of the variables used, that you are modeling? Be specific about what probability you are modeling! (Hint: You might need to take a look at the documentation to figure out how to get the coefficients and intercept, but you should notice that `sklearn` has a pattern for how it does this.)
# Your code here
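If you get stuck, here’s a minimal sketch of one possible approach, using the standard `coef_` and `intercept_` attributes that `sklearn` uses to store fitted parameters:

# One possible approach: sklearn stores the fitted parameters
# in the .coef_ and .intercept_ attributes after calling .fit
beta_0 = clf.intercept_[0]   # intercept
beta_1 = clf.coef_[0, 0]     # coefficient on balance
print(beta_0, beta_1)
# With these, the fitted model has the form
# Pr(default = Yes | balance) = e^(beta_0 + beta_1 * balance) / (1 + e^(beta_0 + beta_1 * balance))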
While it’s good to know what equation we’re modeling with, the big perk here is that your `sklearn` class will evaluate the model on your data points for you. Yay!
✅ Do this: Use the `predict_proba` function to determine the probabilities \(Pr(Y = \texttt{Yes} \mid X)\) for the data set. What shape is the output matrix? Why that shape? What do the columns represent?
# Your code here
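If you want something to check your work against, a sketch along these lines should do it (remember that the column order matches `clf.classes_`):

# One possible approach: each row of the output is
# [Pr(Y = No | x), Pr(Y = Yes | x)], with columns ordered as in clf.classes_
probs = clf.predict_proba(X)
print(probs.shape)   # one row per data point, one column per class
probs[:5]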
Of course this gives us the probability of each label for each data point, but we really would like to have the prediction itself.
✅ Do this: Use the `predict` function to determine the predictions for each input data point in the original \(X\) matrix and store the output as `Yhat`. How many predictions are different than the actual `default` values? What’s the percent error for the model?
# Your code here
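A minimal sketch of one way to count the misclassifications, assuming you store the predictions as `Yhat`:

# One possible approach
Yhat = clf.predict(X)
n_wrong = (Yhat != Y).sum()               # number of misclassified points
print(n_wrong, 100 * n_wrong / len(Y))    # error count and percent error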
✅ Do this: An even easier way of figuring out the error rate is through the score. What does the output of `clf.score(X,Y)` mean, and how is it related to the number you determined above?
# Your code here
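To check your answer: `score` returns the mean accuracy on the given data, so a sketch like this should line up with the error rate above:

# One possible approach: score gives the fraction of correct predictions,
# so 1 - score is the (fractional) error rate
print(clf.score(X, Y))
print(1 - clf.score(X, Y))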
Confusion matrix#
As we saw in class, the percent error is a rather limited way of evaluating a classification model. Luckily, `sklearn` provides commands for easily computing the confusion matrix for a given model. The `confusion_matrix` command computes the confusion matrix, and `ConfusionMatrixDisplay` gives a nice visual representation.
from sklearn.metrics import confusion_matrix, ConfusionMatrixDisplay
# This code gives the confusion matrix, assuming you stored the predicted values as `Yhat`.
C = confusion_matrix(Y,Yhat)
C
# This code gives a visual representation
ConfusionMatrixDisplay(C).plot()
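As a small usage note, you can also pass the class labels to `ConfusionMatrixDisplay` via its `display_labels` parameter, so the axes show `No` and `Yes` rather than integer indices:

# Same plot, with axes labeled by the actual class names
ConfusionMatrixDisplay(C, display_labels=clf.classes_).plot()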
✅ Q: The makers of `sklearn` made a PARTICULARLY strange choice when it comes to the confusion matrix representation. What is different about the `sklearn` confusion matrix from how we saw it in class?
Your answer here
Congratulations, we’re done!#
Written by Dr. Liz Munch, Michigan State University
This work is licensed under a Creative Commons Attribution-NonCommercial 4.0 International License.