Homework Assignment 3#

Regression models#

✅ Put your name here.


✅ Put your GitHub username here.


Goal for this homework assignment#

By now, you have learned a bit about regression models. In this assignment, you will practice:

  • Using branches in Git

  • Performing linear regression

  • Performing multiple regression

  • Performing logistic regression

This assignment is due by 11:59 pm on Friday, November 15th. It should be uploaded into the “Homework Assignments” submission folder for Homework 3. Submission instructions can be found at the end of the notebook. There are 72 standard points possible in this assignment. The distribution of points can be found in the section headers.


Part 1: Git Branch (6 points)#

You’re going to add this assignment to the cmse202-f24-turnin repository you created in class so that you can track your progress on the assignment and preserve the final version that you turn in. In order to do this, you need to:

✅ Do the following:

  1. Navigate to your cmse202-f24-turnin local repository and create a new directory called hw-03

  2. Move this notebook into that new directory in your repository, but do not add or commit it to your repository yet.

  3. Create a new branch called hw03_branch (The Day 16 PCA and ICA content has information on how to do this).

  4. “Check out” the new branch (so that you’ll be working on that branch).

  5. Double check to make sure you are actually on that branch.

  6. Once you’re certain you’re working on your new branch, add this notebook to your repository, then make a commit and push it to GitHub. You may need to use git push origin hw03_branch to push your new branch to GitHub.

Finally, ✅ Do this: Before you move on, put the command that your instructor should run to clone your repository in the markdown cell below.

Put your answer here

Important: Double check you’ve added your Professor and your TA as collaborators to your “turnin” repository (you should have done this in the previous homework assignment).

Also important: Make sure that the version of this notebook that you are working on is the same one that you just added to your repository! If you are working on a different copy of the notebook, none of your changes will be tracked!

If everything went as intended, the file should now show up on your GitHub account in the “cmse202-f24-turnin” repository, inside the hw-03 directory that you just created on the new branch hw03_branch.

Periodically, you’ll be asked to commit your changes to the repository and push them to the remote GitHub location. Of course, you can always commit your changes more often than that, if you wish. It can be good to get into a habit of committing your changes any time you make a significant modification, or when you stop working on the project for a bit.


Part 2: Loading up on Portuguese Wine (13 points)#

For this homework, you’ll be working with the Wine Quality dataset from the UCI Machine Learning Repository, which contains measurements of various chemical properties of red and white wines. The dataset includes properties like fixed acidity, volatile acidity, citric acid, and other measurements important for understanding wine characteristics. This data was collected to support predictive models of wine quality, based on a range of measurable attributes. The wine data are split across two files: one for red wines and one for white wines.

While Parts 3, 4, and 5 are structured so that they can be completed independently of each other, it is recommended to finish Part 2 before moving on to these sections. To get started on Part 2, you’ll need to download the following files:

https://archive.ics.uci.edu/ml/machine-learning-databases/wine-quality/winequality-red.csv

https://archive.ics.uci.edu/ml/machine-learning-databases/wine-quality/winequality-white.csv

and a description of the files here:

https://archive.ics.uci.edu/ml/machine-learning-databases/wine-quality/winequality.names

Question 2.1 (1 point): Do this: Save the winequality-red.csv and winequality-white.csv files in the same directory as your notebook. Then, in the cell below, put the command line command(s) you used to download the files. If you did not use a command line tool to download the files, write down the command(s) that would have downloaded them.

# Put the command you used to download the wine dataset files here.

Question 2.2 (4 points): Next, load the red and white wine data into two separate Pandas DataFrames and display the first and last 5 rows of each.
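If you’re unsure where to start, here is a minimal sketch (the DataFrame names are just suggestions; note that the UCI wine quality files are semicolon-delimited):

import pandas as pd

# The UCI wine quality files use ';' as the column separator
df_red = pd.read_csv("winequality-red.csv", sep=";")
df_white = pd.read_csv("winequality-white.csv", sep=";")

# First and last 5 rows of each DataFrame
display(df_red.head(), df_red.tail())
display(df_white.head(), df_white.tail())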

# Put your code here

Question 2.3 (4 points): Let’s investigate which features of red wines are correlated by plotting a correlation heatmap using Seaborn. Create a heatmap to visualize the relationships between various chemical properties of red wine.

Remember to rotate the tick labels so they are easy to read, and use tight_layout() to avoid any label cut-offs.
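One possible approach, as a sketch (assuming the red wine DataFrame is named df_red):

import seaborn as sns
import matplotlib.pyplot as plt

# Correlation matrix of the red wine chemical properties
corr_red = df_red.corr()

plt.figure(figsize=(10, 8))
sns.heatmap(corr_red, annot=True, fmt=".2f", cmap="coolwarm")
plt.xticks(rotation=45, ha="right")  # rotate tick labels so they stay readable
plt.yticks(rotation=0)
plt.tight_layout()                   # avoid label cut-offs
plt.show()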

# Put your code here

Question 2.4 (4 points): Now let’s investigate which features of white wines are correlated by plotting a correlation heatmap using Seaborn. Create a heatmap to visualize the relationships between various chemical properties of white wine.

Remember to rotate the tick labels so they are easy to read, and use tight_layout() to avoid any label cut-offs. Which correlations differ between the red and white wines?

# Put your code here

🛑 STOP#

Pause to commit your changes to your Git repository!

Take a moment to save your notebook, commit the changes to your local git repository using the commit message “Part 2 complete”, and push the changes to GitHub.



Part 3: One Variable Linear Regression (21 points)#

In exploring the characteristics of wine, understanding the relationship between individual chemical components can provide insights into the wine’s flavor profile, quality, and potential fermentation properties. One simple yet valuable analysis is examining the relationship between citric acid and fixed acidity.

Citric acid, a natural preservative that adds freshness, often contributes to the tartness and overall acidity in wine. Meanwhile, fixed acidity is a broader measure that includes acids, such as tartaric and malic acid, giving wine its sharp, crisp taste. By performing a single-variable linear regression with citric acid as the predictor for fixed acidity, we can investigate whether higher levels of citric acid are associated with an increase in fixed acidity, potentially indicating a specific acid balance characteristic to certain wine types.

This regression analysis can reveal subtle patterns in acidity management, helping winemakers predict and control acidity levels for quality consistency, and giving scientists insight into how specific acid types interact in the broader context of wine chemistry.

Question 3.1 (3 points): Using the OLS method in statsmodels, perform a linear regression to predict the fixed acidity of a wine sample using its citric acid content and display the results summary. Remember that you may need to use the add_constant() method to ensure OLS fits a general line y = ax + b to the data rather than a line through the origin y = ax .

For this problem, make sure that you’re using the entire wine dataset, not just a subset of the data. So the first step should be to make a combined DataFrame (using pd.concat) out of the white and red wine DataFrames.
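A minimal sketch of the workflow (assuming the DataFrames from Part 2 are named df_red and df_white; the other names are just suggestions):

import pandas as pd
import statsmodels.api as sm

# Combine the red and white wine data into a single DataFrame
df_all = pd.concat([df_red, df_white], ignore_index=True)

# Predictor (citric acid) plus a constant term, and the response (fixed acidity)
X = sm.add_constant(df_all["citric acid"])
y = df_all["fixed acidity"]

results_all = sm.OLS(y, X).fit()
print(results_all.summary())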

# Put your code here

Question 3.2 (3 points): Answer the following questions:

  1. What was the equation of the best-fit linear relationship between a wine’s fixed acidity and its citric acid content?

  2. As a wine’s citric acid content increases, does its fixed acidity increase or decrease? What aspect of the regression output tells you this?

  3. Based on the p-value for citric acid, is the relationship you found between a wine’s fixed acidity and citric acid content statistically significant? Justify your answer.

Put your answers here:

Question 3.3 (4 points, 2 points per part): Now, let’s perform linear regression separately for red and white wine. To avoid confusion and ensure that results are not overwritten, make sure that the variable names for the OLS models and results are different for each wine type.

Question 3.3.Red (2 points): Using the OLS method in statsmodels, perform a linear regression to predict the fixed acidity of red wine samples using citric acid as the predictor, and display the results summary.

# Put your code here

Question 3.3.White (2 points): Using the OLS method in statsmodels, perform a linear regression to predict the fixed acidity of white wine samples using citric acid as the predictor, and display the results summary.

# Put your code here

Question 3.4 (5 points): Answer the following questions:

  1. For both red and white wines, what is the equation of the best-fit linear relationship between fixed acidity and citric acid?

  2. For both red and white wines, as citric acid content increases, does the fixed acidity increase or decrease?

  3. Based on the p-values for citric acid, is the relationship between fixed acidity and citric acid content statistically significant for each wine type?

Put your answers here:

Question 3.5 (6 points): Assuming you did everything correctly, the relationship between a wine’s fixed acidity and citric acid content may differ when you split the data by wine type (red or white). This might seem confusing at first. Part of the reason for this is because we skipped a very important step when working with unfamiliar data: visualizing the data.

Do this: Make a scatterplot showing fixed acidity vs. citric acid for the wine dataset. Color-code the points so that red and white wines are in different colors. Then, display the best fit line for each type of wine in the same color as the points, and also display the best fit line for all wines combined in a different color. Don’t forget to label your axes. When you’re done, your plot should contain two colors of points and three lines (two lines should match the colors of the two wine types, and one line for the combined data in a different color).

Hint: We’ve included a function to help you plot a line. Feel free to use it, or not.

# Put your code here

import matplotlib.pyplot as plt
import numpy as np

def plot_line(slope, intercept, xmin, xmax, color):
    """Plot the line y = slope*x + intercept from xmin to xmax in the given color."""
    xline = np.array([xmin, xmax])      # two endpoints are enough for a straight line
    yline = slope * xline + intercept   # corresponding y values
    plt.plot(xline, yline, color)
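For example, pulling the slopes and intercepts out of your fitted models and drawing everything on one plot might look roughly like this (a sketch; results_all, results_red, and results_white are placeholder names for the fitted OLS results from Questions 3.1 and 3.3, and df_all is the combined DataFrame):

# Placeholder names: replace with whatever you called your fitted results objects
b_all, b_red, b_white = results_all.params, results_red.params, results_white.params

# Scatter the two wine types in different colors
plt.scatter(df_red["citric acid"], df_red["fixed acidity"], color="red", alpha=0.3, label="red wine")
plt.scatter(df_white["citric acid"], df_white["fixed acidity"], color="gold", alpha=0.3, label="white wine")

# Best-fit lines: one per wine type plus one for the combined data
xmin, xmax = df_all["citric acid"].min(), df_all["citric acid"].max()
plot_line(b_red["citric acid"], b_red["const"], xmin, xmax, "red")
plot_line(b_white["citric acid"], b_white["const"], xmin, xmax, "gold")
plot_line(b_all["citric acid"], b_all["const"], xmin, xmax, "black")

plt.xlabel("citric acid")
plt.ylabel("fixed acidity")
plt.legend()
plt.show()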

🛑 STOP#

Pause to commit your changes to your Git repository!

Take a moment to save your notebook, commit the changes to your local git repository using the commit message “Part 3 complete”, and push the changes to GitHub.


Part 4: Multiple Regression (16 points)#

In this part, we’ll use multiple features to predict the quality of red wine samples. Specifically, we’ll explore how a combination of chemical properties—such as volatile acidity, citric acid, alcohol, and others—can be used to estimate the overall quality rating of red wines.

Question 4.1 (3 points): Using the OLS method in statsmodels, perform a multivariable linear regression to predict the quality of red wine based on volatile acidity, citric acid, and alcohol content. Be sure to use the add_constant() method to ensure OLS includes a constant term in the model. As before, make sure to display a summary of your results.
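A sketch of one way to set this up (assuming the red wine DataFrame is named df_red):

import statsmodels.api as sm

# Features: volatile acidity, citric acid, and alcohol, plus a constant term
X = sm.add_constant(df_red[["volatile acidity", "citric acid", "alcohol"]])
y = df_red["quality"]

results_multi = sm.OLS(y, X).fit()
print(results_multi.summary())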

# Put your code here

Question 4.2 (4 points): Answer the following questions:

  1. Suppose a red wine sample has a volatile acidity of 0.52, citric acid content of 0.27, and an alcohol content of 10.0. What does your linear model predict for the wine’s quality rating? Explain how you arrived at your answer. (A sketch of one way to compute this is given after this list.)

  2. For each of the features (volatile acidity, citric acid, and alcohol), specify if it is statistically significant in the model. Briefly justify your answers.
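For item 1, the prediction is just the fitted equation evaluated at the given values: quality ≈ b0 + b1*(0.52) + b2*(0.27) + b3*(10.0). A sketch of the computation (assuming the fitted results object from Question 4.1 is named results_multi):

# params is a Series indexed by feature name when the model was fit from a DataFrame
b = results_multi.params
predicted_quality = (b["const"]
                     + b["volatile acidity"] * 0.52
                     + b["citric acid"] * 0.27
                     + b["alcohol"] * 10.0)
print(predicted_quality)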

Put your answers here:

Question 4.3 (3 points): Perform the same multivariable linear regression again, but this time drop any chemical features that were not statistically significant in Question 4.1, and check how this affects the quality prediction. Display the summary of these results.

# Put your code here

Question 4.4 (4 points): Answer these questions:

  1. Qualitatively, how much better/worse is the reduced model compared to the original model? Briefly justify your answer.

  2. Explain in your own words why we might want to use a model with fewer features, even if it fits the data a bit worse than a model with more features.

Put your answers here:

Question 4.5 (2 points): Suppose we wanted to use the color of a wine (red or white) as a feature to predict its quality. Will simply including the color column in the second argument to OLS() work? If not, why not, and what could we do to fix it?

Put your answers here:


🛑 STOP#

Pause to commit your changes to your Git repository!

Take a moment to save your notebook, commit the changes to your local git repository using the commit message “Part 4 complete”, and push the changes to GitHub.


Part 5: Logistic Regression (16 points)#

In this part, we’d like to use logistic regression to classify whether a wine is red or white based on its chemical properties. Logistic regression, as we’ve learned in class, is commonly used for binary classification. Here, we’ll use it to distinguish between the two wine types, aiming for high accuracy in prediction, as mistaking one for the other could lead to a poor wine pairing experience!

Question 5.1 (4 points): Let’s start by setting up a classifier to distinguish red wines from white wines.

Do This: Add a new column called color to both the red and white wine DataFrames. For red wine samples, set color to 1, and for white wine samples, set color to 0. Then, use pd.concat() to combine the two DataFrames into one unified DataFrame.

Hint: After creating the color column in each DataFrame, use pd.concat([df_red, df_white]) to concatenate them into a single DataFrame.

Finally, split the combined DataFrame into features and labels, where features consists of all columns except color and quality, and labels is the color column.
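A minimal sketch (assuming the two DataFrames from Part 2 are named df_red and df_white; the other names are just suggestions):

import pandas as pd

# Label the two wine types: red = 1, white = 0
df_red["color"] = 1
df_white["color"] = 0

# Combine the two DataFrames into a single one
df_wine = pd.concat([df_red, df_white], ignore_index=True)

# Features: every column except color and quality; labels: the color column
features = df_wine.drop(columns=["color", "quality"])
labels = df_wine["color"]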

# Put your code here

Question 5.2 (4 points): Split your data into a training and testing set with a training set representing 75% of your data. For reproducibility, set the random_state argument to 0. Print the shapes of the training features, the testing features, the training labels, and the testing labels to show you have the right number of entries in each of the four variables.
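One common way to do this is with scikit-learn’s train_test_split (a sketch, assuming the features and labels variables from Question 5.1):

from sklearn.model_selection import train_test_split

# 75% training / 25% testing split, with a fixed random_state for reproducibility
X_train, X_test, y_train, y_test = train_test_split(features, labels, test_size=0.25, random_state=0)

print(X_train.shape, X_test.shape, y_train.shape, y_test.shape)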

# Put your code here

Question 5.3 (4 points): Now, train a logistic regression model using your training features and training labels. Display the summary.
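Since the question asks for a summary, the sketch below assumes you are using the statsmodels Logit model and the training variables from Question 5.2; if your class used a different library, adapt accordingly:

import statsmodels.api as sm

# Add a constant column to the training features and fit a logistic regression
logit_results = sm.Logit(y_train, sm.add_constant(X_train)).fit()
print(logit_results.summary())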

# Put your code here

Question 5.4 (4 points): Finally, test your logistic regression model using your testing features and testing labels. Display the fraction of testing data points that were correctly predicted.
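A sketch of one way to compute the fraction of correct predictions (assuming the statsmodels results object from Question 5.3 is named logit_results):

import statsmodels.api as sm

# Predicted probabilities on the test set, thresholded at 0.5 to get class labels
predictions = (logit_results.predict(sm.add_constant(X_test)) > 0.5).astype(int)

# Fraction of testing points that were predicted correctly
accuracy = (predictions == y_test).mean()
print(accuracy)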

# Put your code here

🛑 STOP#

Pause to commit your changes to your Git repository!

Take a moment to save your notebook, commit the changes to your local git repository using the commit message “Part 5 complete”, and push the changes to GitHub.


Part 1. Continued#

Now that you’ve finished your new “development” on your 202 turn-in repo, you can merge your work back into your main branch.

✅ Do the following:

  1. Switch back to your main branch.

  2. Merge your hw03_branch with your main branch.

  3. Finally, push the changes to GitHub.

Congratulations, you’re done!#

Submit this assignment by uploading it to the course Desire2Learn web page. Go to the “Homework Assignments” folder, find the dropbox link for Homework 3, and upload it there.

© Copyright 2024, Department of Computational Mathematics, Science and Engineering at Michigan State University