Homework Assignment 3#
Regression models#
✅ Put your name here.
✅ Put your GitHub username here.
Goal for this homework assignment#
By now, you have learned a bit about regression models. In this assignment, you will practice:
Using branches in Git
Performing linear regression
Performing multiple regression
Performing logistic regression
Creating a project timeline
This assignment is due by 11:59 pm on Friday, April 4th. It should be uploaded into the “Homework Assignments” submission folder for Homework 3. Submission instructions can be found at the end of the notebook. There are 80 standard points possible in this assignment including points for Git commits/pushes. The distribution of points can be found in the section headers.
Part 1: Git Branch (6 points)#
For this assignment, you’re going to add this notebook to the cmse202-s25-turnin repository you created so that you can track your progress on the assignment and preserve the final version that you turn in.
✅ Do the following:
Navigate to your cmse202-s25-turnin local repository and create a new directory called hw-03.
Move this notebook into that new directory in your repository, but do not add or commit it to your repository yet.
Create a new branch called hw03_branch (the Day 16 PCA and ICA content has information on how to do this).
“Check out” the new branch (so that you’ll be working on that branch).
Double check to make sure you are actually on that branch.
Once you’re certain you’re working on your new branch, add this notebook to your repository, then make a commit and push it to GitHub. You may need to use git push origin hw03_branch to push your new branch to GitHub.
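If you would like to see the branch steps in action before touching your real repository, here is a minimal sketch that runs them in a throwaway directory. It assumes git is installed and on your PATH; the helper names run_git and demo_branch are made up for this illustration, and the scratch repository has nothing to do with your turn-in repository.

```python
import shutil
import subprocess
import tempfile


def run_git(args, cwd):
    """Run a git command in cwd and return its stripped standard output."""
    result = subprocess.run(
        ["git", *args], cwd=cwd, check=True, capture_output=True, text=True
    )
    return result.stdout.strip()


def demo_branch(branch_name="hw03_branch"):
    """Create a scratch repository, create and check out a branch, and report it."""
    if shutil.which("git") is None:
        return None  # git is not installed, so there is nothing to demonstrate
    with tempfile.TemporaryDirectory() as repo_dir:
        run_git(["init", "--quiet"], cwd=repo_dir)
        # "checkout -b" creates the branch and switches to it in one step
        run_git(["checkout", "-b", branch_name], cwd=repo_dir)
        # Ask git which branch HEAD points to, as a double check
        return run_git(["symbolic-ref", "--short", "HEAD"], cwd=repo_dir)


current_branch = demo_branch()
print(current_branch)
```

In a terminal you would type the same git subcommands directly; the Python wrapper is only there so the sketch can clean up after itself.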
Finally, ✅ Do this: Before you move on, put the command that your instructor should run to clone your repository in the markdown cell below. Points for this part will be given for correctly setting up branch, etc., above, and for doing git commits/pushes mentioned throughout the assignment.
✎ Put your answer here
Important: Double check you’ve added your Professor and your TA as collaborators to your “turnin” repository (you should have done this in the previous homework assignment).
Also important: Make sure that the version of this notebook that you are working on is the same one that you just added to your repository! If you are working on a different copy of the notebook, none of your changes will be tracked!
If everything went as intended, the file should now show up on your GitHub account in the “cmse202-s25-turnin” repository inside the hw-03 directory that you just created within the new branch hw03_branch.
Periodically, you’ll be asked to commit your changes to the repository and push them to the remote GitHub location. Of course, you can always commit your changes more often than that, if you wish. It can be good to get into a habit of committing your changes any time you make a significant modification, or when you stop working on the problems for a bit.
Points breakdown: 1 pt for the command to clone the repo, 2 pts for setting up the branch, etc., and 3 pts for Git commits/pushes throughout the homework.
Part 2: Loading the datasets (10 points)#
In Parts 2 and 3, you will be working with the California Cooperative Oceanic Fisheries Investigations oceanographic and larval fish dataset that is available at https://www.kaggle.com/datasets/sohier/calcofi?resource=download.
To get started on Part 2, you’ll need to download the following file:
https://raw.githubusercontent.com/gambre11/CMSE202/refs/heads/main/Book1.csv
✅ Question 2.1 (2 points): Do this: Save the above CSV file in the same directory as your notebook. Then, in the cell below, put the command line command(s) you used to download the file. If you did not use a command line tool to download the file, write down the command(s) that would have downloaded it.
# Put the command(s) you used to download the file here.
✅ Question 2.2 (2 points): Next, load the data using Pandas and display the first 20 rows.
# Put your code here
✅ Question 2.3 (2 points): Do you notice any entries in the dataset that are empty or have NaN values? Drop these rows from the dataframe.
# Put your code here
oceanographic_data = oceanographic_data.dropna()
len(oceanographic_data)
814247
✅ Question 2.4 (2 points): How many rows did you end up dropping from this data set? What total percentage of data was removed?
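The bookkeeping for this question can be sketched on a tiny made-up table. The rows below are invented for illustration (only the column names echo the assignment), so the numbers say nothing about the real dataset.

```python
import numpy as np
import pandas as pd

# Toy stand-in for the real data; the values are made up for illustration
toy = pd.DataFrame(
    {
        "T_degC": [10.5, np.nan, 12.1, 9.8, 11.0],
        "Salnty": [33.4, 33.6, np.nan, 33.9, 33.5],
    }
)

rows_before = len(toy)
cleaned = toy.dropna()  # drop every row that has at least one NaN
rows_dropped = rows_before - len(cleaned)
percent_dropped = 100 * rows_dropped / rows_before

print(f"Dropped {rows_dropped} of {rows_before} rows ({percent_dropped:.1f}%)")
# prints: Dropped 2 of 5 rows (40.0%)
```

The key point is to record the row count before calling dropna(), since the original length is gone once the variable is overwritten.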
✅ Question 2.5 (2 points): Look at the Kaggle page that hosts this dataset. What do the columns Salnty and T_degC represent?
✎ Put your answer here.
🛑 STOP#
Pause to commit your changes to your Git repository!
Take a moment to save your notebook, commit the changes to your local git repository using the commit message “Part 2 complete”, and push the changes to GitHub.
Part 3: One Variable Linear and Polynomial Regression (28 points)#
In this part, we’ll perform some one-variable linear and polynomial regression analysis on the California Cooperative Oceanic Fisheries Investigations oceanographic and larval fish data.
✅ Question 3.1 (6 points): Using the OLS method in statsmodels, perform a linear regression to predict Salnty from T_degC and display the results summary. Remember that you may need to use the add_constant() method to make sure OLS fits a general line \(y = ax+b\) to the data instead of a line through the origin \(y = ax\).
# Put your code here
✅ Question 3.2 (4 points): Answer the following questions:
What is the R-squared value you got?
What does your R-squared value tell you about how well the model fits the data?
✎ Put your answers here:
✅ Question 3.3 (6 points): Now make a scatter plot of T_degC (x-axis) vs. Salnty (y-axis). Plot the best fit line on the same plot. Label the axes, add a legend, and give the plot a title.
# Put your code here
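One possible layout for such a figure, sketched on synthetic data. Note that this sketch gets the line from np.polyfit rather than from a statsmodels fit, purely to keep it short; the axis labels are placeholders.

```python
import matplotlib.pyplot as plt
import numpy as np

rng = np.random.default_rng(1)
x = rng.uniform(0, 10, size=100)
y = -0.5 * x + 4.0 + rng.normal(scale=0.4, size=100)  # synthetic data

# Least-squares line; polyfit returns coefficients from highest power down
slope, intercept = np.polyfit(x, y, deg=1)

x_line = np.linspace(x.min(), x.max(), 50)
fig, ax = plt.subplots(figsize=(8, 5))
ax.scatter(x, y, s=10, alpha=0.6, label="data")
ax.plot(x_line, slope * x_line + intercept, color="red", label="best fit line")
ax.set_xlabel("x (hypothetical units)")
ax.set_ylabel("y (hypothetical units)")
ax.set_title("Synthetic data with least-squares line")
ax.legend()
```

With a very large dataset, a small marker size and some transparency (the s and alpha arguments) help keep the line visible on top of the points.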
✅ Question 3.4 (2 points): What are the slope and intercept of your fit line?
# Put code here.
✅ Do this: Question 3.5 (4 points): Use plot_regress_exog to investigate the distribution of residuals in your model fit. Make sure to create a large enough figure so that everything is easily visible.
# Put code here.
✅ Question 3.6 (6 points): Now use some online resource to help you make sense of this residual plot. Is there heteroscedasticity? Is there constant variance? Does it show signs of non-linearity? These are a few questions you might ask yourself or try to figure out in making sense of the residual plot.
✎ Put your explanations here.
Answer: It looks biased and heteroscedastic, which is not what we want to see in a residual plot for a model.
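To make "heteroscedastic" concrete, here is a small sketch that builds synthetic data whose noise grows with x and then compares the residual spread in the lower and upper halves of the x range. The data and the half-and-half comparison are illustrative choices, not a formal test.

```python
import numpy as np

rng = np.random.default_rng(3)
x = np.sort(rng.uniform(1, 10, size=1000))
noise = rng.normal(scale=0.3 * x)  # noise grows with x: heteroscedastic by construction
y = 2.0 * x + noise

slope, intercept = np.polyfit(x, y, deg=1)
residuals = y - (slope * x + intercept)

# Compare the residual spread in the lower and upper halves of x
half = len(x) // 2
spread_low = residuals[:half].std()
spread_high = residuals[half:].std()
print(f"residual std, low x: {spread_low:.2f}; high x: {spread_high:.2f}")
```

With constant variance the two numbers would be about the same; a clear gap between them is the numerical counterpart of a fan shape in the residual plot.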
🛑 STOP#
Pause to commit your changes to your Git repository!
Take a moment to save your notebook, commit the changes to your local git repository using the commit message “Part 3 complete”, and push the changes to GitHub.
Part 4: Multiple Regression (24 points)#
In this part, we’ll use multiple features to make predictions. The data comes from this Kaggle page: https://www.kaggle.com/datasets/nikhil7280/student-performance-multiple-linear-regression/data
First, download and read in this synthetic dataset of Student Performance: https://raw.githubusercontent.com/gambre11/CMSE202/refs/heads/main/Student_Performance.csv
✅ Question 4.1 (5 points): Display the data types of the data you have just read in. We want all of our data types to be integers or floats. Modify the Extracurricular Activities column so that a YES is now a 1 and a NO is now a 0.
# Put your code here
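A minimal sketch of the conversion on a toy frame. The column names come from the assignment, but the rows are made up, and the lower-casing step is an assumption added so that the capitalization of the entries does not matter.

```python
import pandas as pd

# Toy frame; the rows are invented for illustration
students = pd.DataFrame(
    {
        "Hours Studied": [7, 4, 8],
        "Extracurricular Activities": ["Yes", "No", "Yes"],
    }
)
print(students.dtypes)  # the text column is not numeric yet

# Normalize case first so "YES", "Yes", and "yes" are all handled
as_text = students["Extracurricular Activities"].str.strip().str.lower()
students["Extracurricular Activities"] = as_text.map({"yes": 1, "no": 0})
print(students.dtypes)
```

Any entry that is not in the mapping dictionary becomes NaN, so it is worth checking for missing values after the conversion.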
✅ Question 4.2 (5 points): Using the OLS method in statsmodels, perform a multivariable linear regression to predict the Performance Index based on Hours Studied, Previous Scores, Extracurricular Activities, Sample Question Papers Practiced, and Sleep Hours. Also, use the add_constant() method in statsmodels to ensure the model includes a constant term as well. Fit this model and display the summary of results. For now, only use three of these columns as independent variables; you can choose which ones. We will add the other columns later.
# Put your code here
✅ Question 4.3 (4 points): Answer the following questions:
What is your R-squared value?
Is your multiple regression model a good fit? Why or why not?
✎ Put your answers here:
✅ Question 4.4 (2 points): Perform the multivariable linear regression again, but this time with all the features/columns. Display the summary of these results.
# Put your code here
✅ Question 4.5 (2 points): How much better/worse is the full model compared to the original model you made? What are its advantages? Briefly discuss the answer.
✎ Put your answers here:
✅ Question 4.6 (3 points): Create five .graphics.plot_regress_exog figures, one for each of the features (columns of original dataframe) in your model. Pay attention to the top two plots: the fitted values figure and the residual plot.
# Put your code here.
✅ Question 4.7 (3 points): If we could only use one feature to predict Student Performance, which feature would do the best job?
Put your answer/code here
🛑 STOP#
Pause to commit your changes to your Git repository!
Take a moment to save your notebook, commit the changes to your local git repository using the commit message “Part 4 complete”, and push the changes to GitHub.
Part 5: Logistic Regression (17 points)#
In this part, we’d like to use logistic regression to classify whether a candy has chocolate or not. Logistic regression (as we’ve learned so far in class) does binary classification.
✅ Question 5.1 (2 points): We will work with data that is available at https://www.kaggle.com/datasets/fivethirtyeight/the-ultimate-halloween-candy-power-ranking/data
You’ll need to download the following file:
https://raw.githubusercontent.com/gambre11/CMSE202/refs/heads/main/candy-data.csv
Do This: In the cell below, type the code for downloading the data from inside your notebook and also code for loading the data into a pandas dataframe.
# Put your code here
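A minimal sketch of downloading from inside Python, assuming the standard library's urllib is acceptable for this step. The helper names download_if_missing and load_csv are made up for the illustration; the download is skipped when the file already exists so that rerunning the cell does not fetch it again.

```python
import urllib.request
from pathlib import Path

import pandas as pd


def download_if_missing(url, destination):
    """Download url to destination unless the file is already there."""
    destination = Path(destination)
    if not destination.exists():
        urllib.request.urlretrieve(url, destination)
    return destination


def load_csv(url, destination):
    """Fetch the CSV if needed, then read it into a DataFrame."""
    return pd.read_csv(download_if_missing(url, destination))
```

For this question, you might call load_csv("https://raw.githubusercontent.com/gambre11/CMSE202/refs/heads/main/candy-data.csv", "candy-data.csv") and then display the result.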
✅ Question 5.2 (3 points): Create a Pandas Series called labels which has data from the chocolate column of the DataFrame. Also, create a Pandas DataFrame called features which consists of all the columns besides competitorname and chocolate. Display the labels and features to make sure you did this correctly.
# Put your code here.
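The labels/features split can be sketched on a toy frame. Only the names competitorname and chocolate come from the assignment; the other columns and all of the rows are invented.

```python
import pandas as pd

# Toy rows for illustration; feature_a and feature_b are hypothetical columns
candy = pd.DataFrame(
    {
        "competitorname": ["Candy A", "Candy B", "Candy C"],
        "chocolate": [1, 0, 1],
        "feature_a": [0, 1, 0],
        "feature_b": [0.7, 0.3, 0.6],
    }
)

labels = candy["chocolate"]  # selecting one column gives a Series
# drop() returns a new DataFrame without the listed columns
features = candy.drop(columns=["competitorname", "chocolate"])

print(labels)
print(features)
```

Because drop() returns a copy, the original DataFrame keeps all of its columns.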
✅ Question 5.3 (4 points): Split your data into a training and testing set with a training set representing 80% of your data. For reproducibility, set the random_state argument to 0. Print the shapes of the training features, the testing features, the training labels, and the testing labels to show you have the right number of entries in each of the four variables.
# Put your code here
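A minimal sketch of an 80/20 split, assuming scikit-learn's train_test_split is the tool being used (the assignment text does not name it). The 20-row toy data is made up so that the expected shapes are easy to check by hand.

```python
import numpy as np
import pandas as pd
from sklearn.model_selection import train_test_split

# Toy data: 20 rows, two hypothetical feature columns, and a binary label
features = pd.DataFrame({"f1": np.arange(20), "f2": np.arange(20) * 2.0})
labels = pd.Series(np.arange(20) % 2, name="label")

train_features, test_features, train_labels, test_labels = train_test_split(
    features, labels, train_size=0.8, random_state=0
)

for name, part in [
    ("train features", train_features),
    ("test features", test_features),
    ("train labels", train_labels),
    ("test labels", test_labels),
]:
    print(name, part.shape)
```

With 20 rows, 16 land in the training set and 4 in the testing set; the features keep their two columns while the labels stay one-dimensional.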
✅ Question 5.4 (4 points): Now, train a logistic regression model using your training features and training labels. Display the summary.
# Put your code here
✅ Question 5.5 (4 points): Finally, test your logistic regression model using your testing features and testing labels. Display the fraction of testing data points that were correctly predicted.
# Put your code here
🛑 STOP#
Pause to commit your changes to your Git repository!
Take a moment to save your notebook, commit the changes to your local git repository using the commit message “Part 5 complete”, and push the changes to GitHub.
Part 6. Setting a project timeline. (5 points)#
You will know which project you will be working on as a group on Monday/Tuesday March 24th/25th. You and your group will be presenting this project during the last week of class (April 21st - 25th). Come up with a project timeline with specific goals/checkpoints to meet as this deadline approaches. The ability to set project timelines is a very useful skill to have professionally. You can create this timeline yourself, as a group, or you may ask generative AI to try to make a timeline for you. At the very least, try to create weekly checkpoints (about 3).
Put your timeline here
Part 1. Continued#
Now that you’ve finished your new “development” on your 202 turn-in repo, you can merge your work back into your main branch.
✅ Do the following:
Switch back to your main branch.
Merge your hw03_branch with your main branch.
Finally, push the changes to GitHub.
Assignment wrap-up#
Please fill out the form that appears when you run the code below. You must completely fill this out in order to receive credit for the assignment!
from IPython.display import HTML
HTML(
"""
<iframe
src="https://forms.office.com/r/mB0YjLYvAA"
width="800px"
height="600px"
frameborder="0"
marginheight="0"
marginwidth="0">
Loading...
</iframe>
"""
)
Congratulations, you’re done!#
Submit this assignment by uploading it to the course D2L web page. Go to the “Homework Assignments” folder, find the dropbox link for Homework 3, and upload it there.
© Copyright 2025, Department of Computational Mathematics, Science and Engineering at Michigan State University