Homework 3#

Regression models#

✅ Put your name here.

#

✅ Put your GitHub username here.

#

Goal for this homework assignment#

By now, you have learned a bit about regression models. In this assignment, you will practice:

  • Using branches in Git

  • Performing linear regression

  • Performing multiple regression

  • Performing logistic regression

  • Creating a project timeline

This assignment is due by 11:59 pm on Friday, November 7th. It should be uploaded into the “Homework Assignments” submission folder for Homework 3. Submission instructions can be found at the end of the notebook. There are 90 standard points possible in this assignment, including points for Git commits/pushes. The distribution of points can be found in the section headers.


Table of contents#

Run this cell below before moving on:#

%matplotlib inline
import pandas as pd
import matplotlib.pyplot as plt
import numpy as np
import seaborn as sns
import statsmodels.api as sm
from sklearn.model_selection import train_test_split


sns.set_context("talk")

Back to ToC

Part 1: Git Branch (6 points)#

You’re going to add this assignment to the cmse202-f25-turnin repository you created so that you can track your progress on the assignment and preserve the final version that you turn in. You will do this by performing the tasks 1.1 - 1.6 below.

Important: Double-check you’ve added your Professor and your TA as collaborators to your “turnin” repository (you should have done this in the HW01 assignment).

Also important: Make sure that the version of this notebook that you are working on is the same one that you just added to your repository! If you are working on a different copy of the notebook, none of your changes will be tracked!

✅ Question 1.1 (1 point): Navigate to your cmse202-f25-turnin local repository and create a new directory called hw-03. In the cell below put the command(s) you used to do this.

Put your answer here

✅ Question 1.2 (1 point): Move this notebook into that new directory in your repository, but do not add or commit it to your repository yet. Put the command(s) you used to do this in the cell below.

Put your answer here

✅ Question 1.3 (1 point): Create a new branch called hw03_branch (The Day 16 PCA and ICA content has information on how to do this). Put the command(s) you used to do this in the cell below.

Put your answer here

✅ Question 1.4 (1 point): “Check out” the new branch (so that you’ll be working on that branch). Put the command(s) you used to do this in the cell below.

Put your answer here

✅ Question 1.5 (1 point): Double check to make sure you are actually on that branch. Put the command(s) you used to do this in the cell below.

Put your answer here

✅ Question 1.6 (1 point): Once you’re certain you’re working on your new branch, add this notebook to your repository, then make a commit and push it to GitHub. You may need to use git push origin hw03_branch to push your new branch to GitHub. Put the command(s) you used to do this in the cell below.

Put your answer here


If everything went as intended, the file should now show up on your GitHub account in the “cmse202-f25-turnin” repository inside the hw-03 directory that you just created within the new branch hw03_branch.

Periodically, you’ll be asked to commit your changes to the repository and push them to the remote GitHub location. Of course, you can always commit your changes more often than that, if you wish. It can be good to get into a habit of committing your changes any time you make a significant modification, or when you stop working on the problems for a bit.

Do this: Remember to do every Git commit/push mentioned throughout the assignment!
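If you'd like to rehearse the branch workflow safely before touching your real repository, here is a sketch that builds a throwaway practice repo. Everything below (the temp directory, the placeholder notebook file, the commit message) is illustrative only; in the actual assignment you run the branch commands inside your `cmse202-f25-turnin` clone.

```shell
# Practice run in a throwaway repo -- paths and file names are placeholders
cd "$(mktemp -d)"
git init -q demo && cd demo
git config user.email "you@example.com"
git config user.name "Your Name"

mkdir hw-03                         # Question 1.1: create the directory
touch hw-03/HW-03.ipynb             # stand-in for moving the real notebook (1.2)
git checkout -q -b hw03_branch      # create the branch and switch to it (1.3, 1.4)
git branch --show-current           # verify which branch you are on (1.5)
git add hw-03/HW-03.ipynb           # stage and commit (1.6)
git commit -q -m "Add HW3 notebook"
# git push origin hw03_branch       # needs a real remote, so commented out here
```

Note that `git checkout -b` creates and switches in one step; in newer Git, `git switch -c hw03_branch` does the same thing.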


Back to ToC

Part 2: Loading the dataset. (9 points)#

In this section, you will work with data from the California Housing Prices dataset. The dataset includes all block groups in California from the 1990 Census. On average, each block group contains approximately 1,425.5 individuals residing in a geographically compact area. As expected, the size of each area varies inversely with population density. Distances between block group centroids are calculated using latitude and longitude coordinates. Block groups with zero values for either the independent or dependent variables were excluded from the analysis. The final dataset consists of 20,640 observations across 9 variables.

Our goal is to use ordinary least squares to design regression models that fit the median house value in a California census district, given eight features. We will examine a simple one-variable linear model, a multiple regression model that uses all the provided features, and a reduced model that uses only a subset of the features.

Reference: Pace, R. Kelley, and Ronald Barry. “Sparse spatial autoregressions.” Statistics & Probability Letters 33.3 (1997): 291-297.

Question 2.1 (1 point): Do This: Download the file housing.csv from the link below, and save it into the same directory as your notebook. Then, in the cell below, put the command-line command(s) you used to download the file. If you did not use a command-line tool to download the file, write down the command(s) that would have downloaded it.

https://raw.githubusercontent.com/msu-cmse-courses/cmse202-F22-data/main/data/housing.csv

# Put the command(s) you used to download the file here.

Question 2.2 (2 points): Next, load the data using Pandas and display the first 20 rows.

# Put your code here

You should notice that the DataFrame has a non-numerical feature called “ocean_proximity”. There are also a few rows with NaN values, although you may not see them in the few rows that were displayed. We will not use the “ocean_proximity” column or any of the rows with NaN values in this assignment.
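If this cleanup pattern is unfamiliar, here is a minimal generic sketch on a toy frame. The column names echo the housing data, but the values below are invented for illustration:

```python
import numpy as np
import pandas as pd

# Toy stand-in for the housing data; the values are made up
toy = pd.DataFrame({
    "median_income": [8.3, 7.2, np.nan, 5.6],
    "median_house_value": [452600.0, 358500.0, 352100.0, 341300.0],
    "ocean_proximity": ["NEAR BAY"] * 4,
})

# Drop the text column, then drop any row that still contains a NaN
cleaned = toy.drop(columns=["ocean_proximity"]).dropna()
print(cleaned.shape)  # (3, 2): one column and one NaN row removed
```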

Question 2.3 (2 points): Do This: Drop the “ocean_proximity” column from the dataframe, and drop all the rows with NaN values.

# Put your code here

Question 2.4 (2 points): How many rows did you end up dropping from this data set? What total percentage of data was removed?

#  Put your answer here

Question 2.5 (2 points): Look at the Kaggle page where this dataset is hosted. What do the columns longitude and latitude represent?

Put your answer here.


🛑 STOP#

Pause to commit your changes to your Git repository!

Take a moment to save your notebook, commit the changes to your local git repository using the commit message “Part 2 complete”, and push the changes to GitHub.


Back to ToC

Part 3: One Variable Linear Regression (17 points)#

In this part, we’ll perform some one-variable linear regression analysis on the California Housing dataset we just downloaded.

Question 3.1 (4 points): Using the OLS() method in statsmodels.api, make a simple linear regression model that predicts “median_house_value” using “median_income” as the independent variable. Be sure to use the add_constant() method to add a column of ones to the DataFrame before using the OLS() method so that your linear model includes a constant term.

# Put your code here

Question 3.2 (2 points): Comment on the fit of your model. What are you using to judge the fit?

Put your answer here.

Question 3.3 (4 points): Plot the scatter plot of your independent and dependent data and also plot the line predicted by the regression. Include descriptive labels, titles, and legends as appropriate.

# Put your answer here

Question 3.4 (2 points): From your plot, you will notice that the dataset includes a lot of entries with a median house value of $500,000. Let’s investigate whether this affects our regression analysis.

Do this: Mask the housing dataset to exclude all data points with a median house value of at least $500,000.

# Put your code here

Question 3.5 (4 points): Now make a simple linear regression model that predicts “median_house_value” using “median_income” as the independent variable, on this new data subset. Be sure to use the add_constant() method to include a constant term in your model.

# Put your answer here

Question 3.6 (4 points): Make a scatter plot of your independent and dependent data, and also plot the line predicted by the new regression model above. Include descriptive labels, titles, and legends as appropriate.

# Put your answer here

Question 3.7 (1 point): Comment on the fit of your last model. What are you using to judge the fit?

Put your answer here:


🛑 STOP#

Pause to commit your changes to your Git repository!

Take a moment to save your notebook, commit the changes to your local git repository using the commit message “Part 3 complete”, and push the changes to GitHub.


Back to ToC

Part 4: Multiple Regression (27 points)#

In this part, we will explore multivariable regression on the same California Housing dataset from Part 3. Reload the dataset so that it once again includes the rows with a median house value of $500,000.

Question 4.1 (5 points): Using the OLS() method in statsmodels.api, make a multiple regression model that predicts “median_house_value” using the other variables, and display the .summary() of that process. Remember that you may need to use the add_constant() method to make sure OLS fits a general line \(y = k+ax_1 + bx_2 +... +hx_8\), with \(k\) constant, to the data instead of a line through the origin \(y = ax_1 + bx_2 +... +hx_8\).

# Put your code here

Question 4.2 (4 points): Answer the following two questions:

  1. What is the R-squared value you got?

  2. Based on your R-squared value, what does it tell you about the regression fit, and how the model fits the data?

Put your answers here:

Question 4.3 (2 points): Based on the output of the OLS summary, which of these features (variables) appears to be “significant” in predicting the “median_house_value”?

Put your answers here:

Question 4.4 (4 points): In the output of the OLS summary, you should have seen a note that says something like

The condition number is large, [[number]]. This might indicate that there are strong multicollinearity or other numerical problems.

Multicollinearity is a statistical phenomenon where some of the features in a model can be linearly predicted using some of the other features in the model. In other words, the features in the model are somewhat redundant. Hence, even if each feature may be deemed significant, it may still be possible to form a “reduced” model using a smaller number of features.


Do This: Design a second linear model that uses only three of the eight variables to predict the “median_house_value”, and fits the data about as well as the first linear model you designed in Question 4.1. You can choose this subset either by trial and error or by any other method you’d like.

# Put your code here

Question 4.5 (2 points): How did your reduced linear model fit the data compared to the full linear model you created in Question 4.1? Give some quantitative justification for this answer.

Put your answers here:

Question 4.6 (5 points): Now that you have your reduced model, make a heat map showing the correlations between the different variables (similar to what we did on Day 14). Be sure to include a legend!

# Put your code here.
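If you need to recall the seaborn call, here is a generic sketch on a toy frame. The column names and data are invented; with the housing data you would pass your own DataFrame's `.corr()` instead:

```python
import numpy as np
import pandas as pd
import seaborn as sns
import matplotlib.pyplot as plt

# Toy frame with three made-up columns
rng = np.random.default_rng(2)
toy = pd.DataFrame(rng.normal(size=(100, 3)), columns=["a", "b", "c"])

corr = toy.corr()  # pairwise Pearson correlations, a 3x3 table here

fig, ax = plt.subplots(figsize=(5, 4))
# annot writes the coefficient in each cell; the colorbar acts as the legend
sns.heatmap(corr, annot=True, cmap="coolwarm", vmin=-1, vmax=1, ax=ax)
plt.tight_layout()
```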

Question 4.7 (1 point): You should find that there isn’t much overlap between the high-correlation variables in the heat map and the variables you used in your (reduced) model, the opposite of what we found on Day 14. Explain why this is the case.

Put your answers here:

Question 4.8 (3 points): Create three sm.graphics.plot_regress_exog figures, each one using one of the three features in your reduced model as the independent variable, to examine the fit to the data. Pay attention to the top two plots in each instance: the fitted-values plot and the residual plot.

# Put your code here

Question 4.9 (1 point): Now use an online resource to help you make sense of these residual plots. Describe the trends that you see. Be as detailed as possible. Is there heteroscedasticity, or is the variance constant? Are there any signs of non-linearity? These are a few questions you might ask yourself to make sense of the residual plots.

Put your explanations here.


🛑 STOP#

Pause to commit your changes to your Git repository!

Take a moment to save your notebook, commit the changes to your local git repository using the commit message “Part 4 complete”, and push the changes to GitHub.


Back to ToC

Part 5. Logistic Regression (22 points)#

In this part of the homework, you will work with data from an unknown source. Our goal is to use logistic regression to identify who is more likely to buy merchandise from ads on social networks.

Question 5.1 (3 points): Do This:

  1. Download the dataset from the link below, and write the command you used in the next cell:

https://raw.githubusercontent.com/msu-cmse-courses/cmse202-F22-data/main/HW/Homework_4/ads.csv

  2. Load the data in this file into a Pandas dataframe.

  3. Display the first five rows of the dataframe.

# Put the command to download the data here 
# Put your code for reading in the dataset here

As you can see, the dataset has only a few columns. The first column is not useful since it is a unique identifier. The second column could be useful; however, we need numbers instead of strings. Hence we are left with only the last three columns: Age and EstimatedSalary will be our features, while Purchased will be our labels.

Question 5.2 (3 points):

  1. Do This: Drop the first and second columns of the dataset.

  2. Do This: Divide the rest of the dataset into a train and a test dataset using the train_test_split function from scikit-learn. The test dataset should be 25% of the original data.

# Put your code here
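As a reminder of the splitting API, here is a generic sketch on a tiny toy array (a stand-in for the Age/EstimatedSalary features and the Purchased labels):

```python
import numpy as np
from sklearn.model_selection import train_test_split

# Toy feature matrix (20 rows, 2 columns) and binary labels
X = np.arange(40).reshape(20, 2)
y = np.array([0, 1] * 10)

# test_size=0.25 holds out a quarter of the rows; random_state makes it repeatable
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.25, random_state=42
)
print(X_train.shape, X_test.shape)  # (15, 2) (5, 2)
```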

Question 5.3 (4 points):

  1. Do This: Use the Logit class to perform Logistic regression on your training dataset (don’t forget to add the constant).

  2. Do This: Print the results of your model

# Put your code here

Question 5.4 (2 points): Do you think this is a good fit? Explain your answer

Put your answer here:

Question 5.5 (4 points): Use the above model to make predictions on the test dataset. Remember that the Logit model returns continuous values between 0 and 1, while you need discrete class labels (0 or 1). Then use the accuracy_score function from scikit-learn to see how good your model is.

# Put your code here

Question 5.6 (1 point): Does the accuracy score change your opinion of the goodness of your model?

Put your answer here:

Question 5.7 (5 points): Does your model improve if you re-introduce the Gender column? Since the column is made of strings, replace Male with 0 and Female with 1. Is Gender an informative feature? Explain your answers.

# Put your code here

Put your explanation here:


🛑 STOP#

Pause to commit your changes to your Git repository!

Take a moment to save your notebook, commit the changes to your local git repository using the commit message “Part 5 complete”, and push the changes to GitHub.


Back to ToC

Part 6. Setting a project timeline. (5 points)#

You should know which project you will be working on with your group by now. You and your group will be presenting this project during the last week of class (November 24th-December 2nd). Come up with a project timeline with specific goals/checkpoints to meet as this deadline approaches. The ability to set project timelines is a very useful professional skill. You can create this timeline yourself or as a group, or you may ask generative AI to draft one for you. At the very least, try to create weekly checkpoints (roughly three).

Put your timeline here:


Back to ToC

Part 7: Assignment wrap-up (4 points)#

7.1: (1 point) Have you put your name and GitHub username at the top of your notebook?

7.2: (3 points) Now that you’ve finished your new “development” on your 202 turn-in repo, you can merge your work back into your main branch.

✅ Do the following:

  1. Switch back to your main branch.

  2. Merge your hw03_branch with your main branch.

  3. Finally, push the changes to GitHub.

NOTE: The grader will be able to see your commit messages and whether you pushed the repo at this stage, if everything has gone as planned. Double-check that things look correct on GitHub before you submit this notebook to D2L.


Congratulations, you’re done!#

Submit this assignment by uploading it to the course D2L web page. Go to the “Homework Assignments” folder, find the dropbox link for Homework 3, and upload it there.

© Copyright 2025, Department of Computational Mathematics, Science and Engineering at Michigan State University