Pre-Class Assignment: Polynomial Regression#

Day 14#

CMSE 202#

✅ Put your name here


Goals for Pre-Class Assignment#

After this pre-class assignment, you will be able to:

  1. Generate data for a polynomial regression

  2. Construct a set of polynomial regression models using statsmodels

  3. Evaluate the quality of fit for a set of models using adjusted \(R^2\) and determine the best fit

  4. Explain why that model is the best fit for this data

Our Imports#

Make sure to execute this cell!

import matplotlib.pyplot as plt
import seaborn as sns
import numpy as np
import pandas as pd

sns.set_context("notebook")
import statsmodels.api as sm
from IPython.display import HTML

1. Polynomial Regression#

Previously we focused on fitting a line to data, but as we’ve seen, a straight line may not be good enough to model the data we are working with. We can augment our linear model \( Ax + B\) with extra features. By adding features we are still doing linear regression, but the features themselves can consist of, well, almost anything.

However, to limit our focus, for this pre-class we will use polynomials. We can add values like \(x^2\) or \(x^5\) to the set of potential features used to better model our data.
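Concretely, fitting a degree-\(n\) polynomial is still linear regression because the model remains linear in its coefficients:

\[ y = \beta_0 + \beta_1 x + \beta_2 x^2 + \cdots + \beta_n x^n \]

Each power of \(x\) is just another feature column; the fit only has to find the \(\beta_i\) values.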

Do This: The question we should ask ourselves is, how many such features should we add? What are the advantages and disadvantages of adding more and more features? Think about it and answer in the cell below.

Do this - Erase this and put your answer here.

1.1 Let’s make some Data#

When we are first starting out with a new tool, it can be useful to generate our own data. Data we generate gives us the advantage of knowing what the answer should be.

Do This: Generate some data by doing the following (a sketch is provided after the list, in case you get stuck):

  • build a numpy array x_ary of values from -4 to 4 in increments of 0.2

  • generate a corresponding y_ary, using the values from x_ary, based on the formula \(x^4 + 2x^3 - 15x^2 - 12x + 36\)

  • create y_noisy by adding random (uniform) noise to y_ary in the range of -15 to 15. Later on we might make the range bigger (say -25 to 25) or smaller (say -5 to 5) for comparison.
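Here is a minimal sketch of one way to do this, assuming the imports cell above has been executed (np.linspace is used so the endpoint 4 is included exactly):

x_ary = np.linspace(-4, 4, 41)  # 41 points from -4 to 4 gives steps of exactly 0.2
y_ary = x_ary**4 + 2*x_ary**3 - 15*x_ary**2 - 12*x_ary + 36
y_noisy = y_ary + np.random.uniform(-15, 15, size=x_ary.shape)  # uniform noise in [-15, 15)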

# put your code here

1.2 Plot the data#

As always, it’s best to look at our data before we try to model it.

Do This: Plot x_ary vs both y_ary and y_noisy. Do it overlapping with colors, or side by side, whatever you think would look good. Make sure to label your axes!

# put your code here
fig = plt.figure(figsize = (10,7))
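For reference, a minimal sketch of the overlapping style, drawing on the figure created above (the colors and marker choices are just one option):

plt.plot(x_ary, y_ary, color="blue", label="y (no noise)")
plt.scatter(x_ary, y_noisy, color="orange", label="y (noisy)")
plt.xlabel("x")
plt.ylabel("y")
plt.legend()
plt.show()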

2 Making the Polynomial Features#

Ultimately it would be nice to do our work using a pandas DataFrame so that we have the opportunity to label our columns. There’s the added benefit that statsmodels works well with pandas DataFrames.

Do This: Make a DataFrame consisting of the following columns: a constant value for the intercept, the values in x_ary, and additional powers of x_ary up to 10.

You can do this one of two ways:

  1. make the DataFrame out of x_ary and add features to the DataFrame

  2. add columns to the x_ary array and then finish off by adding to a DataFrame

In the end, you have a DataFrame no matter the approach.

To state the goal for this task again, the columns of the DataFrame should be:

  • label the first column “const” and fill it with the value 1 in every row (this provides the intercept)

  • the second column should hold the x_ary values and be labeled “data”

  • the next 9 columns should be based on x_ary and have as values \(x^2\), \(x^3\), \(x^4, \ldots, x^{10}\). Give them good (but short) label names

Print the head of your DataFrame when you’re done to make sure it looks right. It should end up looking something like this (a construction sketch follows below):

(screenshot of the expected DataFrame head)
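One possible construction, as a minimal sketch (the power-column labels "x^2" through "x^10" are just one reasonable naming choice; this assumes the imports cell above has been run):

df = pd.DataFrame({"const": np.ones_like(x_ary), "data": x_ary})
for power in range(2, 11):
    df[f"x^{power}"] = x_ary**power  # add columns x^2 ... x^10
df.head()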
# put your code here

2.1 Fitting using the Polynomials#

We’ll talk about measures of “goodness” of fit during class, but one good measure for a multi-feature fit is the Adjusted R-squared value. In general, R-squared describes the proportion of the variance in the data that the model accounts for. If the R-squared is 1.0, all the variance is accounted for and you have a perfect fit; if the value is 0, the model explains none of the variance. However, with multiple features the ordinary R-squared tends to over-estimate the quality of the fit, since it can only increase as features are added. The Adjusted R-squared corrects for this by penalizing additional features, giving a value better suited to comparing multi-feature models.

We’ll leave it to you how you want to do this, but what we’d like you to try is to fit different combinations of features against y_noisy and report the Adjusted R-squared value. For example, what is the Adjusted R-squared for:

  1. just the const column

  2. the const and data columns (which should be a line)

  3. the const, data and \(x^2\) columns

  4. the const, data, \(x^2\) and \(x^3\) columns

  5. \(\ldots\)

So on and so forth. You can do them individually or in a loop and collect the results.

The object returned by the .fit() method is an instance of the statsmodels class statsmodels.regression.linear_model.RegressionResults. Run the type command on it and see. If you look on the statsmodels doc page under “Properties” (scroll down and look for that word as a title), you will find all the values you can extract from the variable returned by .fit(). For this assignment the most important of these is .rsquared_adj.

Do This: Explore a variety of models that fit the noisy data using increasingly more features. Look at the Adjusted R-squared value for each combination of features you selected and say which one is the “best”. For this assignment, consider the “best” model to be the one with the highest value of .rsquared_adj.

Note: you do not have to try an exhaustive set of models (though you could set this up with a loop), just explore a variety of combinations and reflect on the results.
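As a sketch, here is one possible loop that fits the first \(n\) columns for increasing \(n\) and reports the adjusted R-squared (this assumes the DataFrame df and y_noisy from above):

for n_cols in range(1, len(df.columns) + 1):
    features = df.iloc[:, :n_cols]  # const, then data, then higher powers
    result = sm.OLS(y_noisy, features).fit()
    print(f"{list(features.columns)}: adjusted R^2 = {result.rsquared_adj:.4f}")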

# put your code here

Questions: Which combination of features best “fit” your data? What was the Adjusted R-squared? Why might that combination produce the best fit?

Do this - Erase this and put your answer here.


3 Plot your data and your model#

Do this: Plot x_ary vs y_noisy and x_ary vs the fitted values of your best model (based on the adjusted R-squared value). Do it in the same graph. The property .fittedvalues returns a pandas Series with the fitted values (the y values for your best-fit model). Also print out the summary for the variable returned by .fit().

Your plot might end up looking something like this:

(example plot: noisy data points overlaid with the best-fit curve)
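A minimal sketch of one way to do this, assuming (hypothetically) that the model using columns up through \(x^4\) had the highest adjusted R-squared:

best_result = sm.OLS(y_noisy, df.iloc[:, :5]).fit()  # const, data, x^2, x^3, x^4
print(best_result.summary())

fig = plt.figure(figsize=(10, 7))
plt.scatter(x_ary, y_noisy, color="orange", label="noisy data")
plt.plot(x_ary, best_result.fittedvalues, color="blue", label="fitted model")
plt.xlabel("x")
plt.ylabel("y")
plt.legend()
plt.show()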
# put your code here

3.1 Are we justified in using this model?#

As we did previously, we can check whether we are justified in using this model by looking at the residual plot.

Do this: Again, using plot_regress_exog, plot the residuals as a function of the independent variable (data or x, whatever you called it).
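A minimal sketch, assuming the best_result variable from the previous step and that your x column is labeled "data":

fig = plt.figure(figsize=(12, 8))
sm.graphics.plot_regress_exog(best_result, "data", fig=fig)
plt.show()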

# put your code here

Question: Do we appear justified in using this model? Why or why not?

Do this - Erase this and put your answer here.


Follow-up Questions#

Copy and paste the following questions into the appropriate box in the assignment survey included below and answer them there. (Note: You’ll have to fill out the assignment number and go to the “NEXT” section of the survey to paste in these questions.)

  1. Which combination of features best “fit” your data? What was the Adjusted R-squared? Why might that combination produce the best fit? (you should be able to copy your answer to this question from above)

  2. Based on your plot of the residuals, do we appear justified in using this model? Why or why not? (you should be able to copy your answer to this question from above)


Assignment wrap-up#

Hopefully you were able to get through all of that. We’ll be troubleshooting any issues you had in class.

You must completely fill this out in order to receive credit for the assignment!

HTML(
"""
<iframe 
	src="https://cmse.msu.edu/cmse202-pc-survey" 
	width="800px" 
	height="600px" 
	frameborder="0" 
	marginheight="0" 
	marginwidth="0">
	Loading...
</iframe>
"""
)

Congratulations, you’re done with your pre-class assignment!#

Now, you just need to submit this assignment by uploading it to the course Desire2Learn web page for the appropriate pre-class submission folder (Don’t forget to add your name in the first cell).

© Copyright 2024, Department of Computational Mathematics, Science and Engineering at Michigan State University