Pre-Class Assignment: Regression#
Day 13#
CMSE 202#
✅ Put your name here
Goals for Preclass#
After this pre-class assignment you should be able to:
Generate a variety of randomized data
Construct a 1-dimensional Linear Regression fit to these data
Explain how to judge the quality of a 1-dimensional Linear Regression fit
This assignment is due by 11:59 p.m. the day before class, and should be uploaded into the appropriate submission folder on D2L. Submission instructions can be found at the end of the notebook.
Imports for the notebook#
Make sure you execute the following cell to get all of the imports you will need for this notebook.
✅ Review all of these imports. Are any of them unfamiliar to you? If so, look up the unfamiliar module(s) to learn a bit about what they do.
import matplotlib.pyplot as plt
import pandas as pd
import numpy as np
import random
import statsmodels.api as sm
from IPython.display import HTML
✎ Do This - Erase the contents of this cell and replace it with notes on any of the imports that were unfamiliar to you and what they appear to be for. (double-click on this text to edit this cell, and hit shift+enter to save the text)
1. Regression#
In this pre-class assignment, we’re going to revisit a concept that you should have explored a bit in CMSE 201: Regression.
The term “Regression” actually represents an entire class of algorithms that are used to model data. A regression model provides a way to visualize and predict how a predictor variable (usually shown on the x-axis) relates to a dependent variable (usually shown on the y-axis). It is usually a first place to start when trying to build a model of data.
Of the family of regression algorithms, the one most often taught first is called Ordinary Least Squares, often shortened to OLS. When regression was first discussed in 201, you should have also learned a bit about the OLS approach to regression.
Let’s revisit and review OLS.
A regression is used to estimate model parameters from data. In simple linear regression, the x and y data are assumed to be linearly related. That is, there exists an equation

\[ y = Ax + B \]

such that we must discover the values of \(A\), often called the slope, and of \(B\), often called the intercept, that are in some sense optimal. There are many forms of optimization, but we will focus on OLS optimization.
1.1 Ordinary Least Squares#
The OLS approach to optimizing the parameters above is to “minimize the residuals.” A residual is the vertical distance between a data point and the fitted line.
The idea is to minimize the square of the “residuals”, that is, the sum of the squared errors as measured between the line and each actual value. Thus OLS tries to minimize

\[ \sum_{i} e_i^2 \]

where \(e_i\) is the error: if \(\hat{y}_i\) is the predicted value and \(y_i\) is the actual value, then \(e_i = y_i - \hat{y}_i\).
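As a quick illustration with made-up numbers (a sketch, not data from this assignment), the sum of squared residuals is straightforward to compute with NumPy:

# Hypothetical actual and predicted values, purely for illustration
y_actual = np.array([3.0, 5.1, 6.9, 9.2])
y_predicted = np.array([3.1, 5.0, 7.0, 9.0])
residuals = y_actual - y_predicted    # e_i = y_i - y_hat_i for each point
print(np.sum(residuals**2))           # sum of squared residuals, about 0.07 here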
2. OLS and statsmodels#
In 201, when you worked with regression and fitting models to data, you used some of the built-in functions available in NumPy and SciPy, but these are just the tip of the iceberg in terms of what you can do with statistics and regression in Python.
In addition, it’s possible you’ve come across the sklearn module (no worries if you haven’t) and you might have used it for doing some regression, but, at least for some of the model work we want to do right now, the statsmodels package provides a little more statistical information and background. We’ll certainly use sklearn later in the course, but it is always good to learn some alternatives, and there are quite a number available for Python!
So what are we going to do?#
For this assignment, we are going to make some sample data and do an OLS type regression but, most importantly, we want to pay attention to some of the statistics that are generated from the model. It’s all well and good to make a linear regression model, but it is very important that we know how good that model is!
2.1 Generating random numbers - a quick aside#
To generate our sample data we need to use some random numbers. It is important that we understand what a Random Number Generator, RNG for short, is and what we can and cannot do with it.
An RNG is an algorithm for generating random numbers. If you thought about that statement at all, it might trouble you just a bit: by definition, an algorithm is a method, a reproducible method, that can be used to accomplish a task. In fact, it is even worse than that. Most RNGs take a seed value, a value that initializes the RNG. If you seed an RNG with the same value, you get the same sequence.
The random package provides access to RNG functions. A few that we might find useful are listed below (there are many more; look at the docs), with a short sketch of them in action after the list:

random.seed(): if you want to seed the RNG, you provide an argument. If not, a system-dependent (often time-based) seed is used.

random.randint(start, stop): generate an integer between start and stop, inclusive.

random.uniform(start, stop): generate a float between start and stop (the end point may or may not be included due to floating-point rounding).

random.choice(sequence): yield one of the values from the sequence (not the range, the actual values).
In NumPy we can create random numbers in similar ways by calling np.random.<METHOD> (where <METHOD> is the method we want to use for generating the random numbers or making random selections), but we can also do things a little differently and first create a random number generator object, much like we have done when we wrote our own class or used objects from other packages. Below you can find some examples of creating random numbers with NumPy.
Let’s quickly review these methods (you should be a bit familiar with generating random numbers from 201 as well).
# Create a random number generator
rng = np.random.default_rng(seed = 123)
# Create an array of 10 random integers from the domain [1, 5). 1 is included while 5 isn't
ary1 = rng.integers(1, 5, size = 10)
ary2 = rng.integers(1, 5, size = 10)
rng2 = np.random.default_rng(seed = 123)
ary3 = rng2.integers(low = 1, high = 5, size = 10)
# Create an array of 10 random floats from a uniform distribution in the domain [0, 1). 0 is included while 1 isn't
f_ar1 = rng.uniform(low = 0.0, high = 1.0, size = 10)
print(ary1)
print(ary2)
print(ary3)
print(f_ar1)
ary1 and ary2 use the same call but generate different sequences. However, if we recreate the RNG with the same seed, we get the same sequence: ary1 and ary3 are the same.
What an RNG can provide is randomness of the sequence. That is, by statistical and other measures, the next number should appear unpredictable given the existing sequence. It is the sequence, which number comes next, that an RNG tries to randomize.
It can be convenient to get the same sequence given the same seed, for testing purposes for example. But the sense of “random” from an RNG is the randomness of the sequence, not the specific numbers generated.
In the next code cell we will create random numbers drawn from a normal distribution (aka Gaussian) and create a histogram plot from them.
# Helper function: the Gaussian probability density, used below to overlay the expected curve on the histogram
def gaussian_dist(x, mu, sigma):
"""Calculate the Gaussian probability density function.
Parameters
----------
x : numpy.ndarray
Range of values
mu: float
Center of the Gaussian PDF.
sigma: float
Standard deviation of the Gaussian PDF.
Returns
-------
f : numpy.ndarray
Gaussian PDF
"""
f = np.exp( - (x - mu)**2 / (2 * sigma**2) )
f /= (sigma * np.sqrt(2 * np.pi))
return f
# Create an array of 1000 random floats from a normal distribution with mean 0 and standard deviation 0.1
mu, sigma = 0, 0.1  # mean and standard deviation
s = rng.normal(mu, sigma, 1000)
count, bins, _ = plt.hist(s, 30, density=True)
plt.plot(bins, gaussian_dist(bins, mu, sigma),linewidth=2, color='r')
plt.show()
2.2 Data#
Let’s generate some data.
✅ Do This - Using numpy and random as appropriate, let’s make the following arrays (if you get stuck, a sketch of one possible approach appears after the code cell below):

x_ary with float values from 0 to 10 in steps of 0.5 (you can use arange)

y_ary with values that are based on x_ary linearly. Let’s use \(y = 2x + 3\) as our linear relationship.

y_noisy, created by adding random noise to the values in y_ary. Draw the noise uniformly from [-1, 1].

y_fixed: use the choice method of your RNG to add either 5 or -5 to each value in y_ary.

Note: this can be done with just NumPy’s modules/methods, if you wish to avoid using the built-in Python random module.
# put your code here
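If you get stuck, here is one possible approach (a sketch, not the only solution; the seed value is arbitrary):

# One possible approach, sketched with an arbitrary seed
rng = np.random.default_rng(seed=42)
x_ary = np.arange(0, 10.5, 0.5)                          # floats from 0 to 10 in steps of 0.5
y_ary = 2 * x_ary + 3                                    # exact linear relationship y = 2x + 3
y_noisy = y_ary + rng.uniform(-1, 1, size=x_ary.size)    # uniform noise drawn from [-1, 1]
y_fixed = y_ary + rng.choice([-5, 5], size=x_ary.size)   # add either -5 or +5 to each value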
✅ Do This - Plot x_ary against each of the other three arrays and see what you have. Here’s an example that uses the matplotlib subplots functionality to make a stacked column of three plots; it uses sharey=True so that all three plots share the same y-axis range (the x-axes are already aligned by the column layout). A minimal sketch of that layout:
2.3 Doing the fit#
Let’s now make an OLS fit for these three sets: (x_ary, y_ary), (x_ary, y_noisy), and (x_ary, y_fixed).
✅ Before you try to find the fits, think about it for a minute and answer below. Which do you think will have the best fit? Which the worst fit? Why?
✎ Do this - Erase this and put your answer here.
Although we already imported this at the top of the notebook, we’re going to do the import here again for statsmodels so you can see it in context. As a common practice, though, it’s still good to put all important imports at the top of the notebook.
import statsmodels.api as sm
If you look at the statsmodels documentation for OLS you should see that the standard order of operations is to:
create the model
fit the model
look at the results.
Let’s do the “perfect fit” pair above as an example and then you’ll do the other two.
model_perfect = sm.OLS(y_ary, x_ary) # make the model
results = model_perfect.fit() # run the OLS fit
print(results.params) # print intercept and slope
Well that’s not right! We should expect two values: slope and intercept. Let’s take a look at the predicted values:
print(results.predict())
So what happened?#
So what we actually got was just a slope, computed under the assumption that the intercept was 0. But how do we get the correct intercept? The statsmodels OLS assumes that every column in the x values can be used as a parameter for regression, called multiple regression (which we’ll get to a little later). To get an intercept, we need to add a column of constant value, essentially turning the linear equation into

\[ y = Ax + B \cdot 1 \]

where the constant column of 1s carries the intercept \(B\).
Conveniently, statsmodels has a function just for that called (wait for it) add_constant. It adds a constant column to either a DataFrame or a NumPy array. Let’s do that and try the fit again:
x_with_cnst = sm.add_constant(x_ary)
print(x_with_cnst)
model = sm.OLS(y_ary, x_with_cnst)
results = model.fit()
print("Intercept and slope are:", results.params)
print(results.predict())
That’s the perfect fit we originally expected! We can also print a large number of statistics using another built-in method, summary():
print(results.summary())
What does all of this mean?#
There is a lot of information there, but the top two entries of the second column tell us about the quality of the fit. The R-squared value ranges from 0 to 1.0, with 1.0 being perfect. As expected, ours is perfect.
We’ll talk about Adjusted R-squared later when we get to multiple regression, but it also ranges from 0 to 1.0 and, in this example, is also perfect.
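Those same statistics are also available directly as attributes of the results object, which is convenient when comparing models in code:

# Pull the goodness-of-fit statistics straight from the fitted results
print("R-squared:", results.rsquared)
print("Adjusted R-squared:", results.rsquared_adj)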
We can also make plots of the fit using plot_regress_exog, which provides a comparison of the real data and the modeled data, as well as the residual plot.
There are two additional plots that plot_regress_exog produces: the partial regression plot and the CCPR plot. We will discuss those later.
You should notice that the residuals for this fit are extremely small; this is because our fit is perfect! As a result, the residuals are numerically equivalent to zero.
fig = plt.figure(figsize=(15,8))
fig = sm.graphics.plot_regress_exog(results, "x1", fig=fig)
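If you’d rather inspect the residuals as numbers instead of reading them off the plot, the fitted results expose them directly:

# Residuals (actual minus predicted); for this perfect fit they are numerically zero
print(results.resid)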
2.4 Fit the other two data sets#
Now that you’ve followed along with the example that used our perfect fit, you’re going to try to fit the other two “y” arrays.
✅ Do This - Perform linear model fits for the y_noisy and y_fixed pairs. Print the fit parameters, the predictions of the model, and the model summaries, as we have already done. Also use plot_regress_exog to produce plots of the model fits. (If you need a starting point, a sketch for the y_noisy fit appears after the two code cells below.)
Note: Make sure you use the new x_with_cnst array we created previously to ensure that you get back both a slope and a y-intercept in your best-fit parameters.
# fit the y_noisy values here
# fit the y_fixed code here
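Here is a minimal sketch for the y_noisy fit (it assumes the arrays from Section 2.2 and x_with_cnst exist); the y_fixed fit follows exactly the same pattern:

# Sketch: OLS fit of y_noisy, using the constant column so we get an intercept
model_noisy = sm.OLS(y_noisy, x_with_cnst)   # make the model
results_noisy = model_noisy.fit()            # run the OLS fit
print(results_noisy.params)                  # intercept and slope
print(results_noisy.predict())               # the model's predicted values
print(results_noisy.summary())               # full statistics table
fig = plt.figure(figsize=(15, 8))
fig = sm.graphics.plot_regress_exog(results_noisy, "x1", fig=fig)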
✅ Do This: Between the noisy and fixed data sets, which has a better linear fit? How can you tell?
✎ Do this - Erase this and put your answer here.
Follow-up Questions#
Copy and paste the following questions into the appropriate box in the assignment survey included below and answer them there. (Note: You’ll have to fill out the assignment number and go to the “NEXT” section of the survey to paste in these questions.)
When we think about doing regression modeling, what do we mean by “residuals”?
What is the value that provides some form of measurement of the “goodness of fit” between our model and our data?
Assignment Wrap-up#
Please fill out the form that appears when you run the code below. You must completely fill this out in order to receive credit for the assignment!
from IPython.display import HTML
HTML(
"""
<iframe
src="https://cmse.msu.edu/cmse202-pc-survey"
width="800px"
height="600px"
frameborder="0"
marginheight="0"
marginwidth="0">
Loading...
</iframe>
"""
)
Congratulations, you’re done with your pre-class assignment!#
Now, you just need to submit this assignment by uploading it to the course Desire2Learn web page for the appropriate pre-class submission folder (Don’t forget to add your name in the first cell).
© Copyright 2024, Department of Computational Mathematics, Science and Engineering at Michigan State University