Lecture 4 - Simple Linear Regression#
CMSE 381 - Fall 2024#
Sept 4, 2024#
In today's lecture, we are focused on simple linear regression, that is, fitting models of the form
\[ Y = \beta_0 + \beta_1 X + \varepsilon \]
In this lab, we will use two different tools for linear regression.
- Scikit-learn is arguably the most used tool for machine learning in Python.
- Statsmodels provides many of the statistical tests we've been learning in class.
0. A note on datasets and ethics#
For much of this course, I will follow the labs outlined in the textbook at the end of each section. However, there are many portions of this book that rely on the Boston
data set. Although this dataset has been a standard example for a long time, often used for teaching linear regression, it has some major issues with assumptions based around race and housing. An excellent in-depth description of issues in the data set can be found in this medium post from a few years ago. More recently, the data set has been marked as deprecated in scikit-learn 1.0, which essentially means that anyone loading it will encounter a warning, and it is marked for removal in version 1.2. For these reasons, we will not be using the dataset in this class.
1. The Dataset#
# As always, we start with our favorite standard imports.
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
%matplotlib inline
import seaborn as sns
In this module, we will be using the Diabetes
data set. While we could download a csv to put in the correct folder yadda yadda yadda, because this is a commonly used test data set, it’s available in scikit-learn
for us to use without any cleanup. Yay!
from sklearn.datasets import load_diabetes
diabetes = load_diabetes(as_frame=True)
# Notice that this loads in a lot of info into what is essentially a beastly dictionary.
print(type(diabetes))
diabetes
# But we can immediately get it into a pandas data frame for ease of use as follows
diabetes_df = pd.DataFrame(diabetes.data, columns = diabetes.feature_names)
diabetes_df['target'] = pd.Series(diabetes.target)
diabetes_df
Info about the data set#
Look up the documentation about the dataset here:
From https://scikit-learn.org/stable/datasets/toy_dataset.html#diabetes-dataset
✅ Q:

- Write a brief description of the data set.
- What do the columns `s1` through `s6` correspond to?
- Which of the available variables are quantitative? Which are categorical?
- What is the `target` that we are trying to predict?
Your answer here
2. Getting familiar with the data#
The following command should show you the top of your data frame.
diabetes_df.head()
✅ Q: Do some basic data exploration. How many data points do we have? How many variables do we have? Are there any data points with missing data?
Your answer here
✅ Q: Use the seaborn `sns.pairplot` command to look at relationships between the variables. Are there pairs of variables that appear to be related?
Your answer here
3. Simple Linear Regression#
We're now going to fit a simple linear regression to the models
\[ \texttt{target} = \beta_0 + \beta_1 \cdot \texttt{s1} \]
and
\[ \texttt{target} = \beta_0 + \beta_1 \cdot \texttt{s5} \]
where the variables are

- \(\texttt{s1}\): tc, total serum cholesterol
- \(\texttt{s5}\): ltg, possibly log of serum triglycerides level.

Let's start by looking at using `s5` to predict `target`.
from sklearn.linear_model import LinearRegression
# sklearn actually likes being handed numpy arrays more than
# pandas dataframes, so we'll extract the bits we want and just pass it that.
X = diabetes_df['s5'].values
X = X.reshape([len(X),1])
y = diabetes_df['target'].values
y = y.reshape([len(y),1])
# This code works by first creating an instance of
# the linear regression class
reg = LinearRegression()
# Then we pass in the data we want it to use to fit.
reg.fit(X,y)
What the fork, nothing seems to have happened? Well actually, we first created an instance of the regression class, which is just a collection of the model functionality waiting to be trained. When we run the fit
command with data handed in, it actually figures out the best choice of coefficients for our particular data. Once they’re found, we can extract them from the class as follows.
# We can find the intercept and coefficient information
# from the regression class as follows.
print(reg.coef_)
print(reg.intercept_)
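As a supplementary note (not part of the original lab questions): once the model is fit, scikit-learn's `LinearRegression.score` gives a quick single-number summary of fit quality, the coefficient of determination \(R^2\). A minimal sketch, reusing the same `s5` regression:

```python
from sklearn.datasets import load_diabetes
from sklearn.linear_model import LinearRegression

diabetes = load_diabetes(as_frame=True)

# Keep X two-dimensional (n_samples, n_features), as sklearn expects
X = diabetes.data[['s5']].values
y = diabetes.target.values

reg = LinearRegression()
reg.fit(X, y)

# score() returns R^2 computed on the data passed in
r2 = reg.score(X, y)
print(f"R^2 for s5 predicting target: {r2:.3f}")
```

A higher \(R^2\) means the line explains more of the variance in `target`; we'll come back to interpreting this statistic carefully later in the course.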
✅ Q:
What is the model using these coefficients? That is, write down the function \(\hat f\) explicitly.
What is the prediction by the model for \(\texttt{s5} = 0.05\)?
# Your answer here
✅ Q: Overlay a plot of your predicted model (your line) on a scatter plot of the data used. Does linear seem like a good assumption?
# Your answer here
It turns out there is a bit of a cheap trick for plotting linear regression using seaborn. This command will actually both run the linear regression (that is, find the required \(\beta_i\)’s) and plot it for you. The tradeoff is that this will only work for single variable linear regression; we’ll have to work harder when we’re doing multi-variable linear regression. They also do not provide any easy way to get the equation of the line out, so this isn’t really the best tool to use for anything other than quick and dirty visualization.
# First easy version, but hard to get out the parameters....
sns.regplot(x = diabetes_df.s5,y = diabetes_df.target)
Congratulations, we’re done!#
Written by Dr. Liz Munch, Michigan State University
This work is licensed under a Creative Commons Attribution-NonCommercial 4.0 International License.