Jupyter Notebook#

Lec 17: Ridge Regression#

In this module we are going to test out the ridge regression method we discussed in class from Chapter 6.2.

# Everyone's favorite standard imports
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
%matplotlib inline
import time
import seaborn as sns


# ML imports we've used previously
from sklearn.model_selection import train_test_split
from sklearn.metrics import mean_squared_error

# Fix for deprecation warnings, but we should fix so this isn't here
# import warnings 
# warnings.filterwarnings('ignore')

Loading in the data#

Ok, here we go, let’s play with a baseball data set.

hitters_df = pd.read_csv('../../DataSets/Hitters.csv')
hitters_df.head()

Annoyingly enough we have some missing values in the data.

print("Number of null values:", hitters_df["Salary"].isnull().sum())

So let’s go clean those up…..

# Print the dimensions of the original Hitters data (322 rows x 20 columns)
print("Dimensions of original data:", hitters_df.shape)

# Drop any rows the contain missing values, along with the player names
hitters_df = hitters_df.dropna().drop('Player', axis=1)

# Print the dimensions of the modified Hitters data (263 rows x 20 columns)
print("Dimensions of modified data:", hitters_df.shape)

# One last check: should return 0
print("Number of null values:", hitters_df["Salary"].isnull().sum())

hitters_df.head()

And finally, we can replace our categorical variables with dummy variables.

hitters_df = pd.get_dummies(hitters_df, drop_first = True)
hitters_df.head()
y = hitters_df.Salary

# Drop the column with the independent variable (Salary)
X = hitters_df.drop(['Salary'], axis = 1).astype('float64')

X.info()

Normalizing the data#

Our first job is to normalize the data. Right now, all our columns have different standard deviations.

np.std(X)

We’ll do it on all the columns, but I’ll draw pictures just with the Hits column for ease of visualization. Here’s that the distribution of that particular column is before doing any normalization.

sns.histplot(X.Hits)

I’m going to take all my columns and normalize them so that each column has standard deviation 1. I could do this by something like

X_Normalized = (X-X.mean())/X.std() 

however there’s a built in command in sklearn already. So I’ll do the same thing using the StandardScalar command.

from sklearn.preprocessing import StandardScaler
# First we set up the transformer to figure out what has to happen 
# to each column to get out normalized data
transformer = StandardScaler().fit(X)

# This command tells the transformer I want a pandas data frame back
# otherwise it will hand me a numpy array.
transformer.set_output(transform = 'pandas')

#Then I actually use the transform bit to do the action on all the columns. 
X_norm = transformer.transform(X)

X_norm.head()
# First we set up the transformer to figure out what has to happen 
# to each column to get out normalized data
transformer = StandardScaler().fit(X)

# This command tells the transformer I want a pandas data frame back
# otherwise it will hand me a numpy array.
transformer.set_output(transform = 'pandas')
# See below if this is throwing an error

#Then I actually use the transform bit to do the action on all the columns. 
X_norm = transformer.transform(X)

X_norm.head()
# Note if the block above is giving you an error (in particular the transform = 'pandas' above), it's possibly because you're using an older version of the numpy library. You can either update numpy or use the following code instead:

# transformer = StandardScaler().fit(X)
# X_norm = transformer.transform(X)
# X_norm = pd.DataFrame(X_norm,columns = X.columns)

# X_norm.head()

Woohoo, I have normalized data!

np.std(X_norm)

Notice that the shape of the Hits histogram is the same, we’ve just scaled the \(x\)-axis.

sns.histplot(X_norm.Hits)

Ridge Regression#

In class, we learned that doing ridge regression means that we try to find the best model according to the score

\[ RSS + \lambda \sum_{i} \beta_i^2. \]

The good news is that sklearn has a built in Ridge function.

from sklearn.linear_model import Ridge

The bad (ok, not honestly that bad) news is that they call their \(\lambda\) parameter \(\alpha\). So we’re just going to minimize

\[ RSS + \alpha \sum_{i} \beta_i^2. \]

instead. So if I pick an \(\alpha\) value, I can do ridge regression as follows.

a = 1 #<------ this is me picking an alpha value

# normalize the input
transformer = StandardScaler().fit(X)
transformer.set_output(transform = 'pandas')
X_norm = transformer.transform(X)

# Fit the regression
ridge = Ridge(alpha = a) 
ridge.fit(X_norm, y)

# Get all the coefficients
print('intercept:', ridge.intercept_)
print('\n')
print(pd.Series(ridge.coef_, index = X_norm.columns))
print('\nTraining MSE:',mean_squared_error(y,ridge.predict(X_norm)))

That is a bit annoying in terms of how much stuff I have to do, so here’s some code that does the exact same thing (note that your score and coefficients learned are exactly the same as above).

from sklearn.pipeline import make_pipeline

a = 1 #<------ this is me picking an alpha value


# The make_pipeline command takes care of the normalization and the 
# Ridge regression for you. Note that I am passing in the un-normalized
# matrix X everywhere in here, since the normalization happens internally.
model = make_pipeline(StandardScaler(), Ridge(alpha = a))
model.fit(X, y)
 
# Get all the coefficients. Notice that in order to get 
# them out of the ridge portion, we have to ask the pipeline 
# for the specific bit we want with the model.named_steps['ridge']
# in place of just ridge from above.
print('intercept:', model.named_steps['ridge'].intercept_)
print('\n')
print(pd.Series(model.named_steps['ridge'].coef_, index = X.columns))
print('\nTraining MSE:',mean_squared_error(y,model.predict(X)))

Of course, that was just me picking a random \(\alpha\) out of a hat so there’s no reason to trust that it’s a good one. I could sit here all day and move that \(\alpha\) around to see what’s going on, but why do that, when I can make a for loop!

Here’s a pile of \(\alpha\)’s for us to test on.

alphas = 10**np.linspace(4,-2,100)*0.5
alphas = np.append(alphas,0)
alphas

First off, let’s take a look at how the coefficients learned change for various choices of \(\alpha\).

Associated with each alpha value is a vector of ridge regression coefficients, which we’ll store in a matrix coefs. In this case, it is a \( 19\times 100\) matrix, with 19 rows (one for each predictor) and 100 columns (one for each value of alpha).

from sklearn.pipeline import make_pipeline

ridge = Ridge()
coefs = []

for a in alphas:
    model = make_pipeline(StandardScaler(), Ridge(alpha = a))
    model.fit(X, y)
    coefs.append(model.named_steps['ridge'].coef_)
    

coefs = pd.DataFrame(coefs,columns = X_norm.columns)
coefs.head()
   

Let’s say I just want to look at the coefficient for the Hits column.

plt.plot(alphas,coefs.Hits, label = 'Hits')
plt.xscale('log')

plt.legend()

Or just for the Runs column.

plt.plot(alphas,coefs.Runs, label = 'Runs')
plt.xscale('log')

plt.legend()

But that’s pretty annoying, so instead we look at all of the coefficients at once.

plt.plot(alphas, coefs, label = coefs.columns)
plt.xscale('log')
plt.axis('tight')
plt.xlabel('alpha')
plt.ylabel('coefficients')

plt.legend(bbox_to_anchor = (1.1,1.1))
# Fancier plot, 
# also added grey color for ease of visualization
for var in coefs.columns:
    if np.abs(coefs[var][100])<200:
        plt.plot(alphas, coefs[var], label = var, color = 'grey', linewidth = .5)
    else:
        plt.plot(alphas, coefs[var], label = var)
    
plt.xscale('log')
plt.axis('tight')
plt.xlabel('alpha')
plt.ylabel('coefficients')

plt.legend(bbox_to_anchor = (1.1,1.1))

Q: There are two variables that have higher magnitude than the rest for low \(\alpha\)’s (translation: they are either very large positive or very large negative). Which two are they from the data set? Which is which?

# Your code here #

Train/test split version#

Now we can start setting up the usual train/test splits to have at least a starting idea of how the testing error is going. The random_state=1 bit just makes it so that everyone should get the same random split.

# Split data into training and test sets
X_train, X_test , y_train, y_test = train_test_split(X, y, test_size=0.3, random_state=1)

Do this: Train a model using ridge regression with \(\alpha = 4\). What is the MSE of your model on the testing set?

# Your code here #

Do this: Ha ha nah, you can do better than that. Lets try all our alphas and take a look at the testing MSE to make a better decision about what \(\alpha\) we might want. Modify the code below to plot your testing MSE for all the alphas. What \(\alpha\) should we use to train the model?

# Modify your code from above and add it in the for loop to plot the testing MSE

ridge = Ridge()
errors = []

for a in alphas:
    # ==== Your code goes in here ==== #
    errors.append(17) #<----- random number in here so that the code runs before you fix it

#---plotting----

plt.plot(alphas,errors)
plt.title('Testing MSE')
ax=plt.gca()
ax.set_xscale('log')
ax.set_yscale('log')

RidgeCV#

Whelp, your meanie professor didn’t tell you that there’s actually a built in function to do this for you (sorry-not-sorry). Aren’t you glad you didn’t read ahead?

from sklearn.linear_model import RidgeCV

Basically, RidgeCV runs LOOCV (unless you tell it otherwise, see the documentation) on all the \(\alpha\) values you specify on an input array, and tells you the best \(\alpha\) given that.

First, I am going to normalize. I’m being careful to not just normalize all of \(X\) and then pass this matrix on since this is an example of data leakage. In that version, we are using some amount of information about the testing data (in this case it is contributing to the normalization process) to affect something in our training process. So here’s what we do instead.

X_train, X_test , y_train, y_test = train_test_split(X, y, test_size=0.3, random_state=1)

# To make sure my normalization isn't snooping, I fit the transformer only 
# on the training set 
transformer = StandardScaler().fit(X_train)
transformer.set_output(transform = 'pandas')
X_train_norm = transformer.transform(X_train)

# but in order for my output results to make sense, I have to apply the same 
# transformation to the testing set. 
X_test_norm = transformer.transform(X_test)
# I'm going to drop that 0 from the alphas because it makes 
# RidgeCV cranky
alphas = alphas[:-1]


ridgecv = RidgeCV(alphas = alphas, 
                  scoring = 'neg_mean_squared_error', 
                  )
ridgecv.fit(X_train_norm, y_train)



print('alpha chosen is', ridgecv.alpha_)

I can predict my values on the test set directly from the ridgecv model we just built.

pred = ridgecv.predict(X_test_norm)
mean_squared_error(y_test,pred)

This is exactly the same result as if I went and retrained my model using the chosen \(\alpha\) using Ridge.

ridge = Ridge(alpha = ridgecv.alpha_)
ridge.fit(X_train_norm, y_train)
mean_squared_error(y_test, ridge.predict(X_test_norm))

Do this: Why did we get a different best choice of \(\alpha\) than we found in the previous section?

Your answer here


Congratulations, we’re done!#

Written by Dr. Liz Munch, Michigan State University

Creative Commons License
This work is licensed under a Creative Commons Attribution-NonCommercial 4.0 International License.