Lecture 12#
Validation set and Cross-Validation#
# Everyone's favorite standard imports
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
%matplotlib inline
# Train test splits for today
from sklearn.model_selection import train_test_split
1. Validation set#
The first thing we will do is try out the validation set approach. We’ve already used this some in class so this should be review.
Validation set on toy data#
First off, let’s generate some toy data just to mess around.
# Set the seed so everyone has the same numbers
np.random.seed(42)
def f(t, m = -3, b = 2):
    return m*t + b
n = 300
X_toy = np.random.uniform(0,5,n)
y_toy = f(X_toy) + np.random.normal(0,2,n)
# reshaping to deal with cranky code later
X_toy = X_toy.reshape(-1,1)
y_toy = y_toy.reshape(-1,1)
plt.scatter(X_toy,y_toy)
plt.plot(X_toy,f(X_toy),c = 'red')
Ok, so now we have our fake data set up. Extracting training and testing sets is as simple as the following single line. Note that I set the random_state
variable which means that every time you run this, you (and your neighbor) will have the same train/test split.
randomseed = 48824
X_train, X_test, y_train, y_test = train_test_split(X_toy, y_toy,
                                                    test_size=0.2,
                                                    random_state=randomseed)
What this does is extract two pairs of input/output variables \(X\) and \(y\). We will use the training pair to train our models, and then we will test (or validate) them on the test data.
One way to see what these sets are is to plot them, although usually we have much higher \(p\) (a.k.a. way more input variables) so we can’t really visualize like this normally.
plt.scatter(X_train,y_train, marker = '+', label = "Training")
plt.scatter(X_test,y_test, marker = '*', label = "Testing")
plt.legend()
✅ Do this: Set up a linear regression model using sklearn, train it on the training set, and test it on the test set. What is your mean squared error?
# Your code here
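If you get stuck, one possible sketch is below (the imports are the standard sklearn ones; the name toy_model is just for illustration).
from sklearn.linear_model import LinearRegression
from sklearn.metrics import mean_squared_error

# Fit a linear model on the training set
toy_model = LinearRegression()
toy_model.fit(X_train, y_train)

# Evaluate on the held-out test set
y_pred = toy_model.predict(X_test)
print(mean_squared_error(y_test, y_pred))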
Now that we can see what happens in one case, let’s try doing this many times.
The code below changes our random seed.
✅ Do this: The randomseed += 1 command below changes the seed every time you run the cell. Use this to generate a train/test split with the new seed and compute the MSE. What happens to the MSE?
# Running this changes the random seed to get a new split of the data
randomseed +=1
# Put your code here to generate a new train test split, run a new model, and compute the MSE
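Here is one possible sketch (assuming LinearRegression and mean_squared_error are already imported, as in the sketch above).
# Generate a new split using the updated seed
X_train, X_test, y_train, y_test = train_test_split(X_toy, y_toy,
                                                    test_size=0.2,
                                                    random_state=randomseed)

# Refit the model and compute the test MSE for this split
model = LinearRegression().fit(X_train, y_train)
print(mean_squared_error(y_test, model.predict(X_test)))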
✅ Do this: Below, create a loop repeating what you did in the last cell \(k=30\) times. Keep track of the MSE in a list and draw a histogram of the results. What do you notice? If you want to see more of a pattern, set \(k\) to be something larger, like 100.
# Your code here
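A possible sketch of the loop is below (again assuming the earlier imports; the list name mse_list is just for illustration).
# Repeat the split/fit/score procedure k times, changing the seed each time
k = 30
mse_list = []
for i in range(k):
    X_train, X_test, y_train, y_test = train_test_split(X_toy, y_toy,
                                                        test_size=0.2,
                                                        random_state=randomseed + i)
    model = LinearRegression().fit(X_train, y_train)
    mse_list.append(mean_squared_error(y_test, model.predict(X_test)))

# Histogram of the test-set MSEs across the k splits
plt.hist(mse_list)
plt.xlabel('Test set MSE')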
Great, you got to here! Hang out for a bit, there’s more lecture before we go on to the next portion.
2. Leave One Out Cross Validation (LOOCV)#
Luckily, sklearn has a simple built-in procedure to extract your LOOCV splits and pass them easily to your models. However, these work a bit differently than before. As always, the sklearn documentation and user guide are an excellent place to start.
Let’s start with a very tiny version of our data set.
X_tiny_toy = X_toy[:6]
y_tiny_toy = y_toy[:6]
plt.scatter(X_tiny_toy,y_tiny_toy)
plt.title("Tiny Toy Data")
The following code gets us the LOOCV splits for this \(n=6\) data set. Notice that trying to print loo doesn’t give us much that’s useful.
from sklearn.model_selection import LeaveOneOut
loo = LeaveOneOut()
loo.get_n_splits(X_tiny_toy)
print(loo)
The power of the function shows up when we use it in a for loop:
for train_index, test_index in loo.split(X_tiny_toy):
    print("TRAIN:", train_index, "TEST:", test_index)
    print("\n")
The major difference between the loo.split output and the train_test_split from before is that loo.split spits out indices, while train_test_split gives the data points themselves.
✅ Do this: Finish the code below to get out the training and testing sets for each LOOCV split.
for train_index, test_index in loo.split(X_tiny_toy):
    X_train = X_tiny_toy[train_index]
    # Finish the code to get the X_test, y_train, and y_test
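One way to finish this might look like the following sketch (the print is just to show the held-out point on each pass).
for train_index, test_index in loo.split(X_tiny_toy):
    X_train, X_test = X_tiny_toy[train_index], X_tiny_toy[test_index]
    y_train, y_test = y_tiny_toy[train_index], y_tiny_toy[test_index]
    print("X_test:", X_test, "y_test:", y_test)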
✅ Do this: Use the leave-one-out splits to perform a linear regression on each one, compute the mean squared error, and then average over all the values to get the LOOCV error estimation.
What do you notice about this error estimation vs. the validation set version above?
# Your code here #
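A possible sketch of this is below (assuming LinearRegression and mean_squared_error are imported as earlier; loo_mses is just an illustrative name).
# Fit a model on each leave-one-out split and record the single-point MSE
loo_mses = []
for train_index, test_index in loo.split(X_toy):
    model = LinearRegression().fit(X_toy[train_index], y_toy[train_index])
    y_pred = model.predict(X_toy[test_index])
    loo_mses.append(mean_squared_error(y_toy[test_index], y_pred))

# The LOOCV error estimate is the average over all splits
print(np.average(loo_mses))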
The easier version#
Ok, I lied to you a bit. There’s an even easier version of this whole LOOCV thing.
# This command does all your work for you
from sklearn.model_selection import cross_val_score
from sklearn.linear_model import LinearRegression

# define the cross-validation method to use
cv = LeaveOneOut()

# build the linear regression model
model = LinearRegression()

# use LOOCV to evaluate the model
scores = cross_val_score(model, X_toy, y_toy,
                         scoring='neg_mean_squared_error',
                         cv=cv)

# average the absolute values of the (negative) MSE scores to get the LOOCV MSE
np.average(np.absolute(scores))
Check out this number. Remember, LOOCV has no randomness in it, so you should get the same number here as you calculated previously.
Congratulations, we’re done!#
Written by Dr. Liz Munch, Michigan State University
This work is licensed under a Creative Commons Attribution-NonCommercial 4.0 International License.