In-Class Assignment: Multiple Regression#
Day 14#
CMSE 202#
✅ Put your name here
# ✅ Put your group member names here
Goals#
By the end of today’s class, you’ll have practiced:
Loading and manipulating fixed-width column data
Replacing/removing missing data entries
Performing multiple regression using all features and a reduced set of statistically significant features.
Agenda for today’s class:#
Imports#
import matplotlib.pyplot as plt
import numpy as np
import seaborn as sns
sns.set_context("notebook")
import pandas as pd
import statsmodels.api as sm
1. Working with more unfamiliar data#
We are going to work with some data collected by U.N.E.S.C.O. (the United Nations Educational, Scientific and Cultural Organization) relating to poverty and inequality around the world. There are two files you need to do the work:
poverty.dat, which is the data file itself
poverty.txt, which describes the data columns as fixed-width column data. That is, this file describes the columns of the data for each category. For example, the data in columns 1-6 of poverty.dat contain the “live birth rates per 1,000 population”.
You can download the files from here:
https://raw.githubusercontent.com/msu-cmse-courses/cmse202-supplemental-data/main/data/poverty.dat
https://raw.githubusercontent.com/msu-cmse-courses/cmse202-supplemental-data/main/data/poverty.txt
How does one deal with a “fixed width column” data file?#
Conveniently, pandas has a fixed-width column data reader. Look it up and read in the data. Check with your group members to make sure everyone can find the right function to use!
Again we find ourselves with a data file that doesn’t contain any column headers (argh!). Take a look at the poverty.txt file for column information and give the columns in your Pandas DataFrame short, but useful names. A sketch of one possible approach is below.
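For example, a minimal sketch using pandas’ fixed-width reader (the URL is from the links above; the column labels here are hypothetical, so pick your own short names based on poverty.txt):
url = "https://raw.githubusercontent.com/msu-cmse-courses/cmse202-supplemental-data/main/data/poverty.dat"

# read_fwf infers the fixed-width column boundaries by default
poverty_df = pd.read_fwf(url, header=None)

# Hypothetical short labels -- adjust this list to match the number of
# columns pandas actually finds, using poverty.txt as your guide
poverty_df.columns = ["BirthRate", "DeathRate", "InfantMort", "MaleLife",
                      "FemaleLife", "GNP", "Region", "Country"]
poverty_df.head()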
✅ Do This: Read the data into a DataFrame and display the head()
of the DataFrame. Remember that you can set the column labels by setting the .columns
attribute to a list with the appropriate column labels.
# put your code here
1.1 Examining the “type” of the data#
✅ Questions: Now look at the .dtypes
of your DataFrame and comment on anything that doesn’t immediately make sense to you. Do all of the columns have a type that matches your expectations? If not, what is it about the values in the DataFrame that is causing this?
✎ Do this - Erase this and put your answer here.
1.2 Handling missing data - Imputation#
Let’s face it, sometimes data is bad. Values are not recorded, or are mis-recorded, or are so far outside of your expectations that you suspect that there is something wrong. On the other hand, just changing the data seems like cheating. We have to work with what we have, and if we have to make changes it would be good to do that programmatically so that it is recorded for others to see.
The process of imputation is the statistical replacement of missing/bad data with substitute values.
It turns out that we have a case of missing data in our dataset! In the Gross National Product (GNP) column, some of the values are set to “*” indicating missing data. When Pandas reads in the column, the only type that makes sense when both characters and numbers are present is a string. Therefore Pandas chose to set the type to object instead of the expected int64 or float64.
Using numpy.nan as a replacement#
For better or worse, pandas assumes that “bad values” will be marked in the data as NaN, which it can then represent using NumPy’s nan. NaN is short for “Not a Number”.
If we can mark the missing data with NaN instead of “*”, we will have access to some of the imputation methods, which would allow us to replace the NaN values with various substitution values (e.g., mean, median, a specific value, etc.).
There are (at least) two ways to do this, both sketched below:
You can do a .replace on the column using a dictionary of the form {value to replace : new value, …}. If you do this, remember to save the result. After you do this, you’ll still need to change the column type from object to float64 in order to ensure that the values are numeric. Note that you cannot convert a np.nan to an integer, but you can to a float.
You can convert everything that can be converted to a number using the Pandas .to_numeric() function. Conveniently, if you use the errors argument in the function, you can force Pandas to convert any non-numbers to np.nan values. As with the previous method, you need to save the converted column in place of the column with the “*” entries. This option has the benefit of not requiring the additional step of manually changing the data type!
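A minimal sketch of both options, assuming your DataFrame is named poverty_df and the column is labeled GNP (your names may differ):
# Option 1: replace the "*" marker with np.nan, then convert the type
# (the marker may include surrounding spaces; check your data first)
poverty_df["GNP"] = poverty_df["GNP"].replace({"*": np.nan})
poverty_df["GNP"] = poverty_df["GNP"].astype("float64")

# Option 2: coerce anything non-numeric to np.nan in a single step
poverty_df["GNP"] = pd.to_numeric(poverty_df["GNP"], errors="coerce")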
✅ Do This: Convert the missing entries in the GNP column to np.nan
values and show the head of your modified DataFrame to ensure that the “NaN” values are showing up. Also print the dtypes
to show that the column has changed type.
# put your code here
Changing np.nan values#
Now that “bad values” are marked as numpy.nan
, we can use the DataFrame method fillna
to change those values. For example:
poverty_df["GNP"].fillna(0)
The above cell returns a new Series where all the np.nan values in the GNP column are replaced with 0.
You can do other things as well, for example:
# Two ways of accomplishing the same thing:

# Fill the NaN values in the Series object (the column) directly
poverty_df["GNP"].fillna(poverty_df["GNP"].mean())

# Fill the column by calling fillna on the full DataFrame, using a
# dictionary to reference the column
poverty_df.fillna({"GNP": poverty_df["GNP"].mean()})
Both of the lines in the above cell fill the same values. The first version replaces any np.nan in the GNP column with the mean of the column; the second takes a dictionary where the key is the column to change and the value is what to replace the np.nan with. Note you could replace with other values such as the median, min, max, or some other fixed value.
Remember that all of these examples return either a new Series (when working with just a column) or a DataFrame (if working with the entire element). Nothing is changed in the original unless you assign the result or use inplace=True
in the call.
Finally, if you decide that the right thing to do is remove any row with a np.nan value, you can use the .dropna method of DataFrames as shown below:
# Compare the number of rows before and after dropping rows with NaN values
poverty_df_dropped = poverty_df.dropna()
print(len(poverty_df), len(poverty_df_dropped))
What do you think?#
✅ Do This: Discuss with your group what you think is the best thing to do with the “bad values” in the DataFrame given the discussion above. Make a collective decision and record it below. Once you’ve come to a decision, modify your dataset accordingly.
✎ Do this - Erase this and put your answer here.
2. Multiple Regression#
In the past, we have limited ourselves to using a single feature or independent variable to fit a line or, as in the pre-class, created additional features based on our original feature to fit a polynomial. However, we can just as easily use all, or some combination of, the features available in our dataset to make an OLS model. This is referred to as multiple regression (you can see a brief introduction here). The question is: is it a good idea to just use all the possible features available to make a model?
✅ Do This: Discuss this idea with your group and record your answer below.
✎ Do this - Erase this and put your answer here.
2.1 Infant Mortality model#
Using the U.N.E.S.C.O. data, we can make a model of “Infant Mortality” as the dependent variable against all the other available features. As a hint, an easy way to do this is to make the sm.OLS model with “Infant Mortality” as the first argument (the dependent variable) and the entire DataFrame with “Infant Mortality” dropped as the second argument. You should also drop the “Country” column, as unique strings don’t play well in basic linear models. A sketch is below.
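A minimal sketch of that hint, assuming the DataFrame is named poverty_df and the relevant columns are labeled InfantMort and Country (your names may differ):
# Dependent variable: infant mortality
y = poverty_df["InfantMort"]

# Independent variables: everything else, minus the non-numeric country names
X = poverty_df.drop(columns=["InfantMort", "Country"])
X = sm.add_constant(X)  # include an intercept term

results = sm.OLS(y, X).fit()
results.summary()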
✅ Do This: Make an OLS model that predicts “Infant Mortality” using the other variables (making sure to drop the “Country” column as well) and display the .summary()
of that process.
# put your code here
There are several interesting things about this .summary()
. Let’s start with things you have seen before.
✅ Do This: Look for the adjusted \(R^2\) statistic. What does this adjusted \(R^2\) tell you about how well your model fits your data?
✎ Do this - Erase this and put your answer here.
Now, let’s look at something new: the “P” values associated with the features used in the model. P values are used widely in statistical testing to judge whether a result is statistically significant. Those P values that are 0 (or, typically, less than 0.05) indicate a feature that is “significant” in its ability to predict the dependent variable. Those larger than 0.05 are less significant. Of course, one should be cautious about relying solely on P-values, as they can be misused and p-hacking (intentional or not) can lead to misleading results.
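If you want to pull the P-values out programmatically, the fitted results object exposes them as a Series. A minimal sketch, assuming results is your fitted OLS results object from above:
# Sort the per-feature P-values from most to least significant
results.pvalues.sort_values()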
✅ Do This: With a healthy dose of caution in mind, review your P-values. The values you get will depend on what you did with your “bad values”, but list below the top three “most significant” features and the overall Adjusted R-squared using all the features.
✎ Do this - Erase this and put your answer here.
2.2 A “reduced” model using only the “significant” features#
Modeling data is as much a craft as it is a science. We often seek the simplest models that explain our data well because they are typically more interpretable, easier to explain, and provide information about the main influences on the system we are studying. There are times when we might want a more complex model to capture the details and nuance of the system. But for the U.N.E.S.C.O. data that we have, we are likely able to capture most of the system using a smaller number of features. These ideas are related to the pre-class modeling you did with increasingly higher powers of x.
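As a sketch of the refit (the three feature names below are placeholders; substitute whichever features were most significant in your model, and reuse the y from your earlier model):
# Replace these placeholders with your own top three features
top_features = ["GNP", "BirthRate", "FemaleLife"]

X_reduced = sm.add_constant(poverty_df[top_features])
reduced_results = sm.OLS(y, X_reduced).fit()
reduced_results.summary()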
✅ Do This: Redo the model with only the top three features you found above vs “Infant Mortality”. Display the summary.
# your code here
✅ Do This: Review this model and the one you constructed earlier in the notebook. Report how the Adjusted R-squared value changed from using only the top three vs using all the available features. How well does this reduced model appear to fit your data?
✎ Do this - Erase this and put your answer here.
3. Visualization - How well does our model fit our data?#
We have been checking how our models fit our data using both plots of the fitted values and the residuals. These plots are generated from the information stored in various attributes of the OLS results object. We will continue to use the top two plots from .graphics.plot_regress_exog
to investigate our fits. But you could also construct the plots directly using the attributes of the OLS results object.
Note that you will need one plot for each feature in the model as each figure is only produced for a given choice of feature.
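For example, assuming reduced_results is your fitted reduced model and GNP is one of its features, one figure can be made like this (repeat for each feature):
# 2x2 diagnostic figure for one feature; the top row holds the
# fitted-values plot and the residual plot
fig = sm.graphics.plot_regress_exog(reduced_results, "GNP")
fig.tight_layout()
plt.show()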
✅ Do This: Create three .graphics.plot_regress_exog
figures, one for each of the features in your reduced model. Pay special attention to the top two plots: the fitted values figure and the residual plot.
# put your code here
✅ Questions: Based on these figures, how well does it appear your reduced model fit your data? Do you have any concerns about the distribution of the residuals?
✎ Do this - Erase this and put your answer here.
Congratulations, you’re done with your in-class assignment!#
Now, you just need to submit this assignment by uploading it to the course Desire2Learn web page for today’s submission folder (Don’t forget to add your names in the first cell).
© Copyright 2024, Department of Computational Mathematics, Science and Engineering at Michigan State University