Homework 3

Homework 3#

Regression models#

✅ Put your name here.#

✅ Put your GitHub username here.#

Goal for this homework assignment#

By now, you have learned a bit about regression models. In this assignment, you will practice:

Using branches in Git
Performing linear regression
Performing multiple regression
Performing logistic regression
Creating a project timeline

This assignment is due by 11:59 pm on Friday, April 3rd. It should be uploaded into the “Homework Assignments” submission folder for Homework 3. Submission instructions can be found at the end of the notebook. There are 83 standard points possible in this assignment, including points for Git commits/pushes. The distribution of points can be found in the section headers.

Table of contents#

Part 1: Git branch (6 points)
Part 2: Simple linear regression (24 points)
Part 3: Symbolic Regression (12 points)
Part 4: Multiple regression (14 points)
Part 5: Logistic regression (18 points)
Part 6: Project planning (5 points)
Part 7: Assignment wrap-up (4 points)

Run this cell below before moving on:#

%matplotlib inline
import pandas as pd
import matplotlib.pyplot as plt
import numpy as np
import StackGP as sgp
import seaborn as sns
import statsmodels.api as sm
from sklearn.model_selection import train_test_split


sns.set_context("talk")

Back to ToC

Part 1: Git Branch (6 points)#

You’re going to add this assignment to the cmse202-S26-turnin repository you created so that you can track your progress on the assignment and preserve the final version that you turn in. You will do this by performing the tasks 1.1 - 1.6 below.

Important: Double-check you’ve added your Professor and your TA as collaborators to your “turnin” repository (you should have done this in the HW01 assignment).

Also important: Make sure that the version of this notebook that you are working on is the same one that you just added to your repository! If you are working on a different copy of the notebook, none of your changes will be tracked!

✅ Question 1.1 (1 point): Navigate to your cmse202-S26-turnin local repository and create a new directory called hw-03. In the cell below put the command(s) you used to do this.

✎ Put your answer here

✅ Question 1.2 (1 point): Move this notebook into that new directory in your repository, but do not add or commit it to your repository yet. Put the command(s) you used to do this in the cell below.

✎ Put your answer here

✅ Question 1.3 (1 point): Create a new branch called hw03_branch (The Day 17 PCA and ICA content has information on how to do this). Put the command(s) you used to do this in the cell below.

✎ Put your answer here

✅ Question 1.4 (1 point): “Check out” the new branch (so that you’ll be working on that branch). Put the command(s) you used to do this in the cell below.

✎ Put your answer here

✅ Question 1.5 (1 point): Double check to make sure you are actually on that branch. Put the command(s) you used to do this in the cell below.

✎ Put your answer here

✅ Question 1.6 (1 point): Once you’re certain you’re working on your new branch, add this notebook to your repository, then make a commit and push it to GitHub. You may need to use git push origin hw03_branch to push your new branch to GitHub. Put the command(s) you used to do this in the cell below.

✎ Put your answer here

If everything went as intended, the file should now show up on your GitHub account in the “cmse202-S26-turnin” repository inside the hw-03 directory that you just created within the new branch hw03_branch.

Periodically, you’ll be asked to commit your changes to the repository and push them to the remote GitHub location. Of course, you can always commit your changes more often than that, if you wish. It can be good to get into a habit of committing your changes any time you make a significant modification, or when you stop working on the problems for a bit.

✅ Do this: Remember to do every Git commit/push mentioned throughout the assignment!

Back to ToC

Part 2: One Variable (Linear) Regression (24 points)#

In this part, we’ll perform some one-variable linear regression analysis on the supplied “Train_1.csv” dataset which we will download.

✅ Question 2.1 (1 point): Do This: Download the file Train_1.csv from the link below, and save it into the same directory as your notebook. Then, in the cell below, put the command line command(s) you used to download the file. If you did not use a command line tool to download the file, write down the command(s) that would have downloaded the file.

https://raw.githubusercontent.com/msu-cmse-courses/cmse202-supplemental-data/main/data/Train_1.csv

# Put the command(s) you used to download the file here.

✅ Question 2.2 (2 points): Next, load the data using Pandas and save it into a data frame called train_data. Then display the entire dataframe.

# Put your code here

✅ Question 2.3 (2 points): How many rows and columns are in this data set? What are your thoughts on this?

✎ Put your answer here.

✅ Question 2.4 (5 points): Using the OLS() method in statsmodels.api, make a simple linear regression model that predicts “Y” using “X” as the independent variable. Be sure to use the add_constant() method to add a column of ones to the DataFrame before using the OLS() method so that your linear model includes a constant term. Display the fitting summary once you have fit the model.
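As a rough illustration of why the constant column matters, the sketch below fits a line by ordinary least squares using only NumPy on made-up data. It is not the statsmodels workflow the question asks for, and the names and data (`x_demo`, `y_demo`) are hypothetical. The column of ones plays the same role as the column that add_constant() adds: its coefficient is the intercept.

```python
import numpy as np

# Hypothetical, noise-free data generated from y = 2 + 3x (not the homework data).
x_demo = np.array([0.0, 1.0, 2.0, 3.0, 4.0])
y_demo = 2.0 + 3.0 * x_demo

# Design matrix: a column of ones (the "constant") next to the feature column.
design = np.column_stack([np.ones_like(x_demo), x_demo])

# Ordinary least squares: coefficients minimizing ||design @ coef - y||^2.
coef, *_ = np.linalg.lstsq(design, y_demo, rcond=None)
intercept, slope = coef

# Without the ones column, the fitted line is forced through the origin.
coef_no_const, *_ = np.linalg.lstsq(x_demo[:, None], y_demo, rcond=None)

print("with constant:    intercept =", intercept, "slope =", slope)
print("without constant: slope =", coef_no_const[0])
```

Leaving out the constant changes the estimated slope because the model can no longer shift the line up or down to match the data.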

# Put your code here

✅ Question 2.5 (2 points): Comment on the fit of your model. What are you using to judge the fit?

✎ Put your answer here.

✅ Question 2.6 (4 points): Plot the scatter plot of your independent and dependent data and also plot the line predicted by the regression. Include descriptive labels, titles, and legends as appropriate.

# Put your answer here

✅ Question 2.7 (3 points): Now download the test set Test_1.csv from the link below and save it into a dataframe named test_data. Once downloaded, split the data into features and labels and then add a constant to the features.

https://raw.githubusercontent.com/msu-cmse-courses/cmse202-supplemental-data/main/data/Test_1.csv

# Put your code here

✅ Question 2.8 (4 points): Make a scatter plot of your independent and dependent data from the test set, and also plot the line predicted by the new regression model on the test data. Include descriptive labels, titles, and legends as appropriate.

# Put your answer here

✅ Question 2.9 (1 point): Comment on the fit of your model using the above plot. Do you think this is the right model for this data?

✎ Put your answer here:

🛑 STOP#

Pause to commit your changes to your Git repository!

Take a moment to save your notebook, commit the changes to your local git repository using the commit message “Part 2 complete”, and push the changes to GitHub.

Back to ToC

Part 3: Symbolic Regression (12 points)#

In this part, we will explore the same data from the previous section, but here we will be using the Symbolic Regression tool we learned in class (StackGP) to see if we can find a nonlinear model to fit the data.

✅ Question 3.1 (1 point): Use the following code to convert the data into numpy arrays compatible with Symbolic Regression. If you named your data frames something different, just replace train_data and test_data in the code with whatever names you used. You do not need to use the version with an added constant.

X_train_sr = np.array([train_data["X"].values])
y_train_sr = train_data["Y"].values

X_test_sr = np.array([test_data["X"].values])
y_test_sr = test_data["Y"].values

✅ Question 3.2 (4 points): Now using the Symbolic Regression evolve function, build a model that fits the training data (X_train_sr and y_train_sr). Set the following arguments in the evolve function:

generations=300,
popSize=300,
liveTracking=True,
liveTrackingInterval=0

Note: Setting generations and popSize to 300 will ensure it searches a large enough space to find a decent solution. The live tracking will just make it so you can visualize the training progress live.

This will take about 1 minute to run.

# put your code here

✅ Question 3.3 (1 point): Use the sgp.printGPModel function to display the best model found during the search.

Note: For examples on how to use this function, you can reference back to our previous ICA or follow example uses on this page from the StackGP documentation https://hoolagans.github.io/StackGP-Documentation/Notebooks/Evolve.html

# put your code here

✅ Question 3.4 (4 points): Now using the Symbolic Regression model create a scatter plot comparing the test data and the model predictions. You can use the evaluateGPModel function to evaluate the model on the test data (X_test_sr). Be sure to give the plot informative labels.

# put your code here

✅ Question 3.5 (2 points): Looking at the quality of the fit in the plot, does it seem to have found a good solution? Given the size of the training data, did you expect this?

✎ Put your answers here:

NOTE: Symbolic regression is highly flexible, which makes it powerful but also prone to overfitting—models can easily become overly complex, capturing noise rather than true underlying relationships. Controlling model complexity and using strong regularization or validation are essential to ensure the discovered equations generalize well.
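The trade-off described in this note can be seen with any flexible model family. The sketch below uses NumPy polynomial fits of increasing degree as a stand-in for increasingly complex symbolic expressions; it does not use StackGP, and the data, the "true" curve, and the chosen degrees are assumptions made only for the demonstration.

```python
import numpy as np

rng = np.random.default_rng(0)

def true_curve(x):
    # Hypothetical "true" relationship, chosen only for this demo.
    return np.sin(2.0 * x)

# A small noisy training set and a larger held-out test set.
x_train = np.linspace(-1.0, 1.0, 12)
y_train = true_curve(x_train) + rng.normal(scale=0.2, size=x_train.size)
x_test = np.linspace(-1.0, 1.0, 200)
y_test = true_curve(x_test) + rng.normal(scale=0.2, size=x_test.size)

def mse(y_true, y_pred):
    # Mean squared error between observations and predictions.
    return float(np.mean((y_true - y_pred) ** 2))

results = {}
for degree in (1, 3, 8):
    # Higher degree = more flexible model (stand-in for a more complex expression).
    coeffs = np.polyfit(x_train, y_train, degree)
    results[degree] = (
        mse(y_train, np.polyval(coeffs, x_train)),
        mse(y_test, np.polyval(coeffs, x_test)),
    )
    print(f"degree {degree}: train MSE = {results[degree][0]:.4f}, "
          f"test MSE = {results[degree][1]:.4f}")
```

Training error can only go down as the degree grows, because each model contains the previous one. Whether the test error follows is exactly what a held-out set lets you check.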

🛑 STOP#

Pause to commit your changes to your Git repository!

Take a moment to save your notebook, commit the changes to your local git repository using the commit message “Part 3 complete”, and push the changes to GitHub.

Back to ToC

Part 4: Multiple Regression (14 points)#

In this part, we will explore multivariable regression on the Computer Hardware dataset from the UCI Machine Learning Repository (Data Link).

Download a cleaned version of the data from here: https://raw.githubusercontent.com/msu-cmse-courses/cmse202-supplemental-data/main/data/machine.csv

✅ Question 4.1 (5 points): Using the OLS() method in statsmodels.api, make a multiple regression model that predicts "PRP" using the other variables, and display the .summary() of the fitted model. After loading the data, be sure to add a constant to the features using add_constant() before fitting.

# Put your code here

✅ Question 4.2 (2 points): Answer the following two questions:

What is the R-squared value you got?
Based on your R-squared value, what does it tell you about the regression fit, and how the model fits the data?
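For reference, R-squared compares the model's squared errors with the squared errors of a baseline that always predicts the mean. The sketch below computes it by hand with NumPy; the function name `r_squared` and the sample values are hypothetical and only meant to show the two extreme cases.

```python
import numpy as np

def r_squared(y_true, y_pred):
    """Coefficient of determination: 1 - SS_res / SS_tot."""
    y_true = np.asarray(y_true, dtype=float)
    y_pred = np.asarray(y_pred, dtype=float)
    ss_res = np.sum((y_true - y_pred) ** 2)         # variation the model leaves unexplained
    ss_tot = np.sum((y_true - y_true.mean()) ** 2)  # total variation around the mean
    return 1.0 - ss_res / ss_tot

y_demo = np.array([1.0, 2.0, 3.0, 4.0])  # hypothetical observations

perfect = r_squared(y_demo, y_demo)                            # predictions match exactly
baseline = r_squared(y_demo, np.full(y_demo.size, y_demo.mean()))  # always predict the mean

print(perfect, baseline)
```

A value near 1 means the model explains most of the variation in the target, while a value near 0 means it does little better than predicting the mean.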

✎ Put your answers here:

✅ Question 4.3 (2 points): Based on the output of the OLS summary, do any of the features appear to be insignificant for predicting “PRP”? List any features that you would remove.

✎ Put your answers here:

✅ Question 4.4 (3 points): Do This: Design a second linear model to predict “PRP” that leaves out any features you determined were insignificant.

# Put your code here

✅ Question 4.5 (2 points): How did your reduced linear model fit the data compared to the full linear model you created in Question 4.1? Give some quantitative justification for this answer.
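One quantity that can help with this kind of comparison is adjusted R-squared, which the statsmodels summary reports alongside R-squared. It penalizes extra predictors, so it is a fairer way to compare models with different numbers of features. The sketch below is a minimal illustration of the formula; the function name and the numbers passed to it are hypothetical and not taken from the homework data.

```python
def adjusted_r_squared(r2, n_obs, n_predictors):
    """Adjusted R^2 = 1 - (1 - R^2) * (n - 1) / (n - p - 1).

    n_predictors counts the features only, not the constant term.
    """
    return 1.0 - (1.0 - r2) * (n_obs - 1) / (n_obs - n_predictors - 1)

# Hypothetical numbers, chosen only to show how the function is called.
full = adjusted_r_squared(r2=0.80, n_obs=100, n_predictors=6)
reduced = adjusted_r_squared(r2=0.79, n_obs=100, n_predictors=4)

print(f"full model: {full:.4f}, reduced model: {reduced:.4f}")
```

Plain R-squared never increases when a feature is removed from an OLS model, so on its own it always favors the larger model. Adjusted R-squared is one way to account for that, and it is not the only reasonable choice.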

✎ Put your answers here:

🛑 STOP#

Pause to commit your changes to your Git repository!

Take a moment to save your notebook, commit the changes to your local git repository using the commit message “Part 4 complete”, and push the changes to GitHub.

Back to ToC

Part 5: Logistic Regression (18 points)#

In this part, we will be using logistic regression to classify whether a person has diabetes or not. Logistic regression (as we’ve learned so far in class) does binary classification.
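As a reminder of how a logistic model produces a binary decision, the sketch below passes a linear combination of a feature through the logistic (sigmoid) function and thresholds the resulting probability. The coefficients, the feature values, and the 0.5 cutoff are assumptions chosen for illustration, not values fitted to the diabetes data.

```python
import numpy as np

def sigmoid(z):
    """Logistic function: maps any real number into the interval (0, 1)."""
    return 1.0 / (1.0 + np.exp(-z))

# Hypothetical coefficients for a one-feature model: intercept and slope.
intercept, slope = -4.0, 0.5
x_demo = np.array([2.0, 8.0, 14.0])

probabilities = sigmoid(intercept + slope * x_demo)   # estimated P(class = 1)
predicted_class = (probabilities >= 0.5).astype(int)  # 0.5 is a common default cutoff

print(probabilities)
print(predicted_class)
```

Fitting a logistic regression amounts to choosing the intercept and slopes so that these probabilities match the observed labels as well as possible.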

✅ Question 5.1 (2 points): We will work with data originally from https://www.kaggle.com/datasets/aemyjutt/diabetesdataanslysis?select=diabetes.csv.

We will be using the cleaned version which can be downloaded from the link below:

https://raw.githubusercontent.com/msu-cmse-courses/cmse202-supplemental-data/main/data/diabetes.csv

Do This: In the cell below, type the code for downloading the data from inside your notebook and also code for loading the data into a pandas dataframe.

# Put your code here

✅ Question 5.2 (3 points):

Create a Pandas Series called labels which has data from the Outcome column of the DataFrame. Also, create a Pandas DataFrame called features which consists of just the columns Glucose, BloodPressure, BMI, and Age. Display the labels and features to make sure you did this correctly.

# Put your code here.

✅ Question 5.3 (5 points): Split your data into a training and testing set with a training set representing 80% of your data. For reproducibility, set the random_state argument to 541 (the 100th prime number!). Print the shapes of the training features, the testing features, the training labels, and the testing labels to show you have the right number of entries in each of the four variables.
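To make the idea of an 80/20 split concrete, the sketch below shuffles row indices with a fixed seed and cuts them at the 80% mark, using only NumPy on made-up data. It is a conceptual stand-in: train_test_split uses its own shuffling, so this will not reproduce the exact rows that function selects, and all names ending in `_demo` are hypothetical.

```python
import numpy as np

rng = np.random.default_rng(541)  # fixed seed so the shuffle is repeatable

# Hypothetical data: 10 rows, 2 features, binary labels.
features_demo = np.arange(20).reshape(10, 2)
labels_demo = np.array([0, 1, 0, 1, 0, 1, 0, 1, 0, 1])

# Shuffle the row indices, then cut them at the 80% mark.
shuffled = rng.permutation(len(features_demo))
n_train = int(0.8 * len(features_demo))
train_idx, test_idx = shuffled[:n_train], shuffled[n_train:]

# Index features and labels with the same indices so rows stay aligned.
X_train_demo, X_test_demo = features_demo[train_idx], features_demo[test_idx]
y_train_demo, y_test_demo = labels_demo[train_idx], labels_demo[test_idx]

print(X_train_demo.shape, X_test_demo.shape, y_train_demo.shape, y_test_demo.shape)
```

Checking the four shapes, as the question asks, is a quick way to confirm that features and labels were split consistently.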

# Put your code here

✅ Question 5.4 (4 points): Now, train a logistic regression model using your training features and training labels. Be sure to add a constant to the training features. Display the summary.

# Put your code here

✅ Question 5.5 (4 points): Finally, test your logistic regression model using your testing features and testing labels. Compute and display the accuracy score on the test data.
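Accuracy is the fraction of test examples whose predicted class matches the true class. If your model returns probabilities rather than class labels (as statsmodels logistic regression predictions typically do), they need to be converted to 0/1 labels first. The sketch below shows the calculation with NumPy on made-up values; the arrays and the 0.5 cutoff are assumptions for illustration.

```python
import numpy as np

# Hypothetical true labels and predicted probabilities (not from the homework data).
y_true_demo = np.array([0, 1, 1, 0, 1])
prob_demo = np.array([0.2, 0.7, 0.4, 0.1, 0.9])

y_pred_demo = (prob_demo >= 0.5).astype(int)    # convert probabilities to class labels
accuracy = np.mean(y_pred_demo == y_true_demo)  # fraction of correct predictions

print(accuracy)
```

Here four of the five predictions match the true labels, so the accuracy is 0.8.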

# Put your code here

🛑 STOP#

Pause to commit your changes to your Git repository!

Take a moment to save your notebook, commit the changes to your local git repository using the commit message “Part 5 complete”, and push the changes to GitHub.

Back to ToC

Part 6. Setting a project timeline. (5 points)#

You should know which project you will be working on with your group by now. You and your group will be presenting this project during the last week of class. Come up with a project timeline with specific goals/checkpoints to meet as this deadline approaches. The ability to set project timelines is a very useful skill to have professionally. You can create this timeline yourself or as a group. Try to create, at the very least, weekly checkpoints (~3). Write out your timeline below.

✎ Put your timeline here:

Back to ToC

Part 7: Assignment wrap-up (4 points)#

7.1: (1 point) Have you put your name and GitHub username at the top of your notebook?

7.2: (3 points) Now that you’ve finished your new “development” on your 202 turn-in repo, you can merge your work back into your main branch.

✅ Do the following:

Switch back to your main branch.
Merge your hw03_branch with your main branch.
Finally, push the changes to GitHub.

NOTE: The grader will be able to see your commit messages and whether you pushed the repo at this stage, if everything has gone as planned. Double-check that things look correct on GitHub before you submit this notebook to D2L.

Congratulations, you’re done!#

Submit this assignment by uploading it to the course D2L web page. Go to the “Homework Assignments” folder, find the dropbox link for Homework 3, and upload it there.

*Dancing gif source: https://cdn.pixabay.com/animation/2025/08/11/04/05/04-05-10-511_512.gif
