Lecture 23: More Cubic Splines#
In this module, we are going to implement cubic splines.
# Everyone's favorite standard imports
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
%matplotlib inline
import time
# ML imports we've used previously
from sklearn.model_selection import train_test_split
from sklearn.metrics import mean_squared_error
from sklearn.linear_model import LinearRegression
Loading in the data#
We’re going to use the Wage data used in the book, so note that many of your plots can be checked by looking at figures in the book.
df = pd.read_csv('../../DataSets/Wage.csv', index_col=0)
df.head()
df.info()
df.describe()
Here’s the plot we used multiple times in class to look at a single variable: age vs. wage.
plt.scatter(df.age,df.wage)
plt.xlabel('Age')
plt.ylabel('Wage')
plt.show()
1. Splines#
Before we get to the wage dataset, we’ll build some simpler spline models. Let’s start by playing with some toy data, making heavy use of the examples provided on the scikit-learn spline page.
# Note: this bit is going to use some packages that are newly
# provided in sklearn 1.0. If you're having issues, try uncommenting
# and running the update line below.
# %pip install --upgrade scikit-learn
from sklearn.preprocessing import SplineTransformer
from sklearn.pipeline import make_pipeline
# This code block is going to make us some nasty fake data
# to try to find some sort of interpolation.
def f(x):
"""Function to be approximated by polynomial interpolation."""
return x * np.sin(x)
# whole range we want to plot
x_plot = np.linspace(-1, 11, 100)
y_plot = f(x_plot)
# Make some data. Provide a small number of points to make
# our polynomials all kinds of wiggly.
X = np.linspace(0, 10, 100)
rng = np.random.RandomState(0)
X = np.sort(rng.choice(X, size=20, replace=False))
y = f(X)
# create 2D-array versions of these arrays to feed to transformers
X = X.reshape(-1, 1)
y = y.reshape(-1, 1)
#====plotting======
# plot function
plt.plot(x_plot, y_plot,label="ground truth")
# plot training points
plt.scatter(X, y, label="training points")
plt.legend()
plt.show()
Let’s pretend you never saw the \(f\) function I used to build the data: you’re just handed these scattered data points and asked to learn a piecewise polynomial that fits them.
# plot training points
plt.scatter(X, y, label="training points")
# Plots the axes
plt.axhline(0, color="black", alpha=0.3)
plt.axvline(0, color="black", alpha=0.3)
plt.legend()
plt.show()
The SplineTransformer sets up our basis functions for us. These are the functions that we are learning coefficients for when we are doing regression.
# This sets up the spline transformer. The fit command is deciding where to put the knots
splt = SplineTransformer(n_knots=4, degree=3).fit(X)
print('The knots:')
print(splt.bsplines_[0].t)
#----Visualizing the knots-----#
# Plots the axes
plt.axhline(0, color="black", alpha=0.3)
plt.axvline(0, color="black", alpha=0.3)
# plots the original points
plt.scatter(X, y, label="training points")
# Marks where the knots are as vertical lines
knots = splt.bsplines_[0].t
plt.vlines(knots[3:-3], ymin=-4, ymax=8, linestyles="dashed")
# Uncomment below if you want to see ALL the knots; see note below.
# plt.vlines(knots, ymin=-4, ymax=8, linestyles="dashed")
plt.show()
Note that I am only drawing the middle 4 knots here. The SplineTransformer actually hands back several extra knots on either side of the input data points for technical reasons.
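We can verify the bookkeeping with a quick check (a minimal sketch, reusing the splt object fit above): with n_knots=4 and degree=3, the transformer stores n_knots + 2*degree = 10 knots in total, and the slice knots[3:-3] picks out the 4 that sit in the data range.
# Quick check of the knot bookkeeping (reuses the `splt` object fit above)
knots = splt.bsplines_[0].t
print(len(knots))    # 10 total: n_knots + 2 * degree = 4 + 2 * 3
print(knots[3:-3])   # the middle 4 knots, placed within the data range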
Next, we can peek at the basis functions it’s using based on the chosen knots. Note that we haven’t actually learned any coefficients for the functions yet; we’re just setting up what the functions are.
Warning: The sklearn code uses a different collection of basis functions to get cubic splines. For our purposes, it doesn’t matter; we just need this thing to pop out a smooth function, which it will do (at least inside the middle 4 knots).
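One reassuring property we can check (a small sketch, again reusing splt, and relying on the standard partition-of-unity property of B-spline bases): inside the training data range, the basis function values at any point sum to 1.
# Sanity check (sketch): inside the training range, the B-spline basis
# functions form a partition of unity, so each row sums to about 1
check_pts = np.linspace(X.min(), X.max(), 5).reshape(-1, 1)
print(splt.transform(check_pts).sum(axis=1))   # all approximately 1.0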
x_plot = np.linspace(-10, 20, 100)
plt.figure(figsize=(12,5))
ax = plt.gca()
ax.plot(x_plot, splt.transform(x_plot.reshape(-1,1)))
ax.legend(ax.lines, [f"Spline {n}" for n in range(6)])
ax.set_title("SplineTransformer")
# plot knots of spline
knots = splt.bsplines_[0].t
ax.vlines(knots, ymin=0, ymax=0.8, linestyles="dashed")
plt.show()
I’m going to make use of a nice function from scikit-learn that builds up a pipeline for us to use. Basically, the make_pipeline function here takes your input data, runs the SplineTransformer on it to get the features, then runs linear regression.
# B-spline with 4 + 3 - 1 = 6 basis functions
model = make_pipeline(SplineTransformer(n_knots=4, degree=3), LinearRegression())
model.fit(X, y)
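To see what the pipeline is doing under the hood, here is the same computation as two explicit steps (a sketch; the names splt2 and lr are just illustrative choices, not part of the pipeline API):
# The same two steps done by hand (sketch; splt2 and lr are illustrative names)
splt2 = SplineTransformer(n_knots=4, degree=3)
features = splt2.fit_transform(X)            # step 1: build the spline features
lr = LinearRegression().fit(features, y)     # step 2: ordinary linear regression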
Now, I can see the coefficients that linear regression learned by digging into the pipeline object returned by make_pipeline as follows.
model.named_steps['linearregression'].intercept_
model.named_steps['linearregression'].coef_
Similarly, I can find the knots as follows.
# Marks where the knots are as vertical lines
knots = model.named_steps['splinetransformer'].bsplines_[0].t
print(knots)
So a coefficient was learned for each of the spline functions drawn in the figure above. While we could try to write out the learned function by hand, that’s getting beyond messy. However, as usual, we can figure out the predicted values from the learned model on the original \(X\) input data as follows.
y_hat = model.predict(X)
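As a sanity check (a sketch built from the pieces we already dug out of the pipeline), the prediction really is just the learned linear combination of the spline basis functions:
# Check (sketch): predict == basis matrix @ coefficients + intercept
basis = model.named_steps['splinetransformer'].transform(X)
lin = model.named_steps['linearregression']
manual_y_hat = basis @ lin.coef_.T + lin.intercept_
print(np.allclose(manual_y_hat, y_hat))   # True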
We can also draw the full learned function below by using the predict function on evenly spaced \(x\) values.
# Make the figure drawn wide
plt.figure(figsize = (10,5))
# Draw knots
plt.vlines(knots[3:-3], ymin=-4, ymax=8, linestyles="dashed", label = 'Knots')
# plot training points
plt.scatter(X, y, label="Training points")
# Plot predicted values for each X
plt.scatter(X,y_hat,label = 'Predicted',marker = 's')
# Plot the full model
x_plot = np.linspace(-1, 11, 100)
spline_y_plot = model.predict(x_plot.reshape(-1,1))
plt.plot(x_plot,spline_y_plot, label = 'Learned model')
plt.legend()
plt.show()
2. Cubic spline on age to predict wage#
✅ Do this:

- Split the wage and age data into a train/test split.
- Using the code above that generates splines, build a cubic spline model to predict wage from age in the Wage data set, and train it on your training set.
- Use the trained model to predict on your testing set. What is the MSE?
- Use your trained model to draw the learned spline on the scatter plot of age and wage, as in the left side of Fig 7.5. Remember we can always do this by predicting from our model on a linear spacing of x-axis values.

Note we’re only doing regular splines with this code, not natural splines as in Fig 7.4, but the results end up pretty similar in our case. One possible approach is sketched after the code cell below.
# Your code here #
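One possible approach (a sketch, not the only solution; variable names like X_train and wage_model are just choices here):
# One possible solution sketch -- variable names are illustrative choices
X_age = df.age.values.reshape(-1, 1)
y_wage = df.wage.values

# Train/test split
X_train, X_test, y_train, y_test = train_test_split(
    X_age, y_wage, test_size=0.2, random_state=42)

# Build and train the cubic spline model
wage_model = make_pipeline(SplineTransformer(n_knots=4, degree=3),
                           LinearRegression())
wage_model.fit(X_train, y_train)

# MSE on the test set
y_pred = wage_model.predict(X_test)
print('Test MSE:', mean_squared_error(y_test, y_pred))

# Draw the learned spline over the scatter plot, as in Fig 7.5 (left)
age_grid = np.linspace(df.age.min(), df.age.max(), 100).reshape(-1, 1)
plt.scatter(df.age, df.wage, alpha=0.3, label='Data')
plt.plot(age_grid, wage_model.predict(age_grid), color='red', label='Cubic spline')
plt.xlabel('Age')
plt.ylabel('Wage')
plt.legend()
plt.show()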
Congratulations, we’re done!#
Written by Dr. Liz Munch, Michigan State University
This work is licensed under a Creative Commons Attribution-NonCommercial 4.0 International License.