# Jupyter - Day 15 - Section 001
# Lecture 15: K-Fold CV for Classification


In [None]:
# Everyone's favorite standard imports
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
import seaborn as sns
%matplotlib inline

from sklearn.linear_model import LinearRegression, LogisticRegression
from sklearn.metrics import mean_squared_error
from sklearn.model_selection import KFold


# 1. CV for a classification data set
![Palmer Penguins Picture](https://allisonhorst.github.io/palmerpenguins/reference/figures/lter_penguins.png)

*Artwork by @allison_horst*


For this lab, we are going to use the <a href = "https://allisonhorst.github.io/palmerpenguins/">Palmer Penguins</a> data set by Allison Horst, Alison Hill, and Kristen Gorman. This data set was originally released for R, but it is also available in Python: install the `palmerpenguins` package using `pip` and you can load it as an easily readable dataframe.



In [None]:
# You should only have to do this once:
%pip install palmerpenguins

In [None]:
# If it worked, this should load our dataset
from palmerpenguins import load_penguins
penguins = load_penguins()
penguins.head()


As always, when playing with a new data set, your first job is to just get a feel for what's in the data. We're going to use this data to predict the species of a penguin given the other information.

&#9989; **<font color=red>Questions:</font>** 
- How many penguins are in the data set? 
- What are the input variables? 
- What are the possible values of the output variable? 
- Which are categorical variables? Which are quantitative? 
- Are there any lines with missing data? How is missing data represented in this data set? 

*Your answers here*
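
The questions above can mostly be answered with a handful of pandas calls. The sketch below uses a tiny made-up dataframe named `toy` (a hypothetical stand-in, not the real penguins data) so it runs on its own; the same calls should work on `penguins`.

```python
import numpy as np
import pandas as pd

# Made-up stand-in for the penguins table; the values are for illustration only
toy = pd.DataFrame({
    "species": ["Adelie", "Gentoo", "Chinstrap", "Adelie"],
    "island": ["Torgersen", "Biscoe", "Dream", "Dream"],
    "body_mass_g": [3750.0, 5000.0, np.nan, 3450.0],
})

# Number of rows (penguins) and columns
n_rows, n_cols = toy.shape
print("rows:", n_rows, "columns:", n_cols)

# Possible values of the output variable
print(sorted(toy["species"].unique()))

# Split the columns into quantitative (numeric) and categorical (everything else)
quantitative = toy.select_dtypes(include="number").columns.tolist()
categorical = toy.select_dtypes(exclude="number").columns.tolist()
print("quantitative:", quantitative)
print("categorical:", categorical)

# Missing entries show up as NaN; count them per column
missing_per_column = toy.isna().sum()
print(missing_per_column)
```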

&#9989; **<font color=red>Do this:</font>** Spoiler alert, there are penguins with missing data. Replace the `penguins` dataframe with one where you have removed all those lines. (*Hint: this should be a one line operation*)

In [None]:
# Your code here
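
One possible way to drop incomplete rows is `DataFrame.dropna`. The sketch below shows it on a small made-up dataframe called `toy` (a hypothetical name, not the lab data); treat it as an illustration rather than the intended solution.

```python
import numpy as np
import pandas as pd

# Made-up data with two incomplete rows (index 1 and index 2)
toy = pd.DataFrame({
    "species": ["Adelie", "Gentoo", "Chinstrap", "Adelie"],
    "bill_length_mm": [39.1, np.nan, 46.5, 38.2],
    "sex": ["male", "female", None, "female"],
})

# Keep only the rows that have no missing entries in any column
toy_clean = toy.dropna()

# Sanity check: nothing missing is left
print(toy_clean.shape)
print(toy_clean.isna().any().any())
```

Note that `dropna` keeps the original row labels, so the index of `toy_clean` has gaps. Chaining `.reset_index(drop=True)` renumbers the rows if that matters for later positional indexing.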

Our next favorite thing to do with any data set is to start trying to visualize relationships between the variables. 

In [None]:
sns.pairplot(penguins)

In [None]:
# Here is another nice visualization taken from the palmerpenguins GitHub
g = sns.lmplot(x="flipper_length_mm",
               y="body_mass_g",
               hue="species",
               height=7,
               data=penguins,
               palette=['#FF8C00','#159090','#A034F0'])
g.set_xlabels('Flipper Length')
g.set_ylabels('Body Mass')

## Step 1: Set up your dataframes 

Ok, you have your penguins data frame.  
- Build a dataframe $X$ of the input variables (so without `species`), with `island` and `sex` replaced with dummy variable(s)
- Save a pandas series of the entries in `penguins.species` as $y$. 

In [None]:
# Your code here. 
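
As a rough illustration of the dummy-variable step, here is a sketch using `pd.get_dummies` on a made-up dataframe. The names `toy`, `X_toy`, and `y_toy` are hypothetical, and the choice of `drop_first=True` is an assumption, not something the lab requires.

```python
import pandas as pd

# Made-up stand-in for a cleaned penguins table
toy = pd.DataFrame({
    "species": ["Adelie", "Gentoo", "Chinstrap", "Gentoo"],
    "island": ["Torgersen", "Biscoe", "Dream", "Biscoe"],
    "bill_length_mm": [39.1, 47.5, 46.5, 50.0],
    "sex": ["male", "female", "female", "male"],
})

# Inputs: everything except the label, with the categorical columns
# replaced by 0/1 indicator columns. drop_first=True keeps one fewer
# dummy than there are levels, so no dummy column is redundant.
X_toy = pd.get_dummies(toy.drop(columns="species"),
                       columns=["island", "sex"],
                       drop_first=True)

# Output: the species labels as a pandas Series
y_toy = toy["species"]

print(X_toy.columns.tolist())
print(y_toy.tolist())
```

With three islands and two sexes, this leaves two island dummies and one sex dummy next to the numeric column.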

## Step 2: Run logistic regression

Ok, you have your penguins data with the input variables as X, and we are going to predict `penguins.species`. While `scikit-learn` cannot handle categorical input variables directly (which is why we had to put in our dummy variables ourselves), it's fine with an output variable that is categorical. The following code will fit a logistic regression on the whole data set. Of course, you know better than to report results from a model that was trained and evaluated on the same data, so in a moment we will be modifying this to get $k$-fold CV test errors. 

In [None]:
logisticmodel = LogisticRegression(max_iter = 1400) # Note, I needed to up the iterations
                                                   # to get rid of a convergence warning
logisticmodel.fit(X, y)

Also, here's some helpful code to remember how to get accuracy/error rates out of classification models in `scikit-learn`.

In [None]:
# and now we can also get the error rate on the training set. 
from sklearn.metrics import accuracy_score
yhat = logisticmodel.predict(X)
accuracy = accuracy_score(y, yhat)  # arguments are (y_true, y_pred)
# Note that accuracy is the fraction of predictions that are correct
print('Accuracy:', accuracy)
# so the fraction incorrect is
print('Error:', 1-accuracy)

# We can get the same info directly from the original model
print('\nAccuracy version 2:', logisticmodel.score(X,y))
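
To make the accuracy number concrete, the sketch below checks by hand that `accuracy_score` is just the fraction of predictions that match the true labels. The label arrays are made up for illustration.

```python
import numpy as np
from sklearn.metrics import accuracy_score

# Made-up true labels and predictions; exactly one of the five is wrong
y_true = np.array(["Adelie", "Gentoo", "Gentoo", "Chinstrap", "Adelie"])
y_pred = np.array(["Adelie", "Gentoo", "Adelie", "Chinstrap", "Adelie"])

# Accuracy from scikit-learn
acc_sklearn = accuracy_score(y_true, y_pred)

# The same quantity by hand: the mean of the elementwise "is it correct?" checks
acc_manual = np.mean(y_true == y_pred)

print("Accuracy:", acc_sklearn)
print("Error:", 1 - acc_sklearn)
```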

&#9989; **<font color=red>Do this:</font>** Ok, your job, should you choose to accept it, is to 
- Train a model predicting `species` from all the input variables using logistic regression. 
- Use $k$-fold cross validation to determine the test error. I would recommend using something like $k=5$ to start building your code, but you can up it to $k=10$ when you want to see better results. 
- *Hint: while I was building my version, I had to set `max_iter` for `LogisticRegression` pretty high to get the model to converge. However, my error results were still pretty reasonable with lower `max_iter`, ignoring the large number of pink warning boxes. Feel free to mess around with this parameter to see how it affects your output.*
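
For reference, one common way to write the $k$-fold CV estimate of the classification error is the following. The notation is an assumption introduced here, not taken from the lecture: $F_1, \dots, F_k$ are the folds, $n_i$ is the number of points in fold $F_i$, and $\hat{y}_j$ is the prediction for point $j$ from the model trained on all folds except the one containing $j$.

$$
\mathrm{Err}_i = \frac{1}{n_i} \sum_{j \in F_i} I\left(y_j \neq \hat{y}_j\right),
\qquad
\mathrm{CV}_{(k)} = \frac{1}{k} \sum_{i=1}^{k} \mathrm{Err}_i
$$

Here $I(\cdot)$ is 1 when the prediction is wrong and 0 otherwise, so each $\mathrm{Err}_i$ is one minus the accuracy on fold $i$.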

In [None]:
# Your code here
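
If you get stuck, the overall shape of a $k$-fold loop for classification might look like the sketch below. It runs on synthetic three-class data from `make_classification` rather than the penguins, and the names `X_demo`, `y_demo`, and `fold_errors`, as well as the settings `shuffle=True` and `max_iter=1000`, are assumptions chosen for illustration.

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import accuracy_score
from sklearn.model_selection import KFold

# Synthetic three-class data standing in for the penguins X and y
X_demo, y_demo = make_classification(n_samples=300, n_features=6,
                                     n_informative=4, n_redundant=0,
                                     n_classes=3, random_state=0)

k = 5
kf = KFold(n_splits=k, shuffle=True, random_state=0)

fold_errors = []
for train_idx, test_idx in kf.split(X_demo):
    # Fit on the k-1 training folds only
    model = LogisticRegression(max_iter=1000)
    model.fit(X_demo[train_idx], y_demo[train_idx])

    # Evaluate on the held-out fold; error = 1 - accuracy
    yhat = model.predict(X_demo[test_idx])
    fold_errors.append(1 - accuracy_score(y_demo[test_idx], yhat))

# The CV estimate is the average of the per-fold error rates
cv_error = np.mean(fold_errors)
print(f"{k}-fold CV error estimate: {cv_error:.3f}")
```

The arrays here are NumPy arrays, so `X_demo[train_idx]` selects rows by position. With a pandas dataframe and series, the equivalent would be `X.iloc[train_idx]` and `y.iloc[train_idx]`.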



-----
### Congratulations, we're done!
Initially developed by Dr. Liz Munch, adapted by Dr. Guanqun Cao, Michigan State University

<a rel="license" href="http://creativecommons.org/licenses/by-nc/4.0/"><img alt="Creative Commons License" style="border-width:0" src="https://i.creativecommons.org/l/by-nc/4.0/88x31.png" /></a><br />This work is licensed under a <a rel="license" href="http://creativecommons.org/licenses/by-nc/4.0/">Creative Commons Attribution-NonCommercial 4.0 International License</a>.