Homework 3: Loading, Cleaning, Analyzing, and Visualizing Data with Pandas (and using resources!)#
✅ Put your name here
Learning Goals#
Use pandas to load, work with, and clean data
Make meaningful visual representations of the data
Fit curves to data and evaluate model fits
Assignment instructions#
Work through the following assignment, making sure to follow all of the directions and answer all of the questions.
This assignment is due at 11:59pm on Friday, October 31, 2025
It should be uploaded into D2L Homework #3. Submission instructions can be found at the end of the notebook.
Total points possible: 80
Table of Contents#
Part 0. Academic Integrity Statement (2 points)
Part 1. Reading, describing, and cleaning data (29 points)
Part 2. Exploratory Data Analysis and Data visualization (25 points)
Part 3. Fitting curves to data. (24 points)
Part 0. Academic integrity statement (2 points)#
In the markdown cell below, paste your personal academic integrity statement. By including this statement, you are confirming that you are submitting this as your own work and not that of someone else.
✎ Put your personal academic integrity statement here.
Before we read in the data and begin working with it, let’s import the libraries that we would typically use for this task. You can always come back to this cell and import additional libraries that you need.
import pandas as pd
import matplotlib.pyplot as plt
import numpy as np
import math
from scipy.optimize import curve_fit
Part 1: Reading, describing, and cleaning data (29 total points)#
1.0 Data Source & Context (4 points)#
Before analyzing any dataset, it’s important to know where it came from and what that implies for the kinds of conclusions we can (and cannot) draw.
The dataset we’ll use in this assignment is: “Analyzing Student Academic Trends” by Saad Ali Yaseen, hosted on Kaggle (https://www.kaggle.com/datasets/saadaliyaseen/analyzing-student-academic-trends). The dataset contains 200 student records; variables include hours_studied, sleep_hours, attendance_percent, previous_scores, exam_score.
✅ Task (4 pts): Write a 3-5 sentence Context Statement that addresses:
Whether the data appears to be real-world observational, experimental, or simulated/synthetic/curated (and why you think so).
What population the data might (or might not) represent, and any limits on generalization.
What kinds of claims are reasonable (e.g., associations vs. causal claims) given the dataset’s provenance and size.
Any data quality or measurement caveats you notice (e.g., small N, unknown sampling, engineered features).
✎ Put your answer here:
1.1 Read the data (1 point)#
✅ Task
Read in the data from student_exam_scores.csv into a Pandas dataframe (0.5 pt) and display the head of the data (0.5 pt).
## your code here
1.2 Describe the data (3 points)#
1.2.1 ✅ Task (1 point)#
Use describe to display several summary statistics from the data frame.
## your code here
1.2.2 ✅ Task (2 points)#
The columns in this dataset represent the following:
student_id: A unique identifier for each student.
hours_studied: The number of hours the student studied before the exam.
sleep_hours: The average number of hours the student slept per day.
attendance_percent: The percentage of classes attended by the student.
previous_scores: The marks obtained by the student in previous assessments.
exam_score: The final exam score, representing the student's overall performance.
Using the results from describe above, write in the markdown cell below:
The minimum and maximum number of hours studied among all students. (1 pt)
The lowest and highest exam scores observed in the dataset. (1 pt)
✎ Put your answer here:
Part 1 (1pt):
Part 2 (1pt):
1.3 Isolating and performing basic statistics on data (6 points)#
1.3.1 ✅ Task (1 point)#
Display the exam_score column on its own using the name of the column.
## your code here
1.3.2 ✅ Task (1 point)#
Using .iloc, display the first ten rows of just the exam_score column.
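For reference, position-based slicing with .iloc works like this. The mini-frame below is made up for illustration; the real assignment data comes from the CSV file.

```python
import pandas as pd

# Hypothetical mini-frame standing in for the real dataset
# (the actual CSV has 200 rows; the column name matches the assignment's).
demo = pd.DataFrame({"exam_score": [30.2, 25.0, 35.8, 34.0, 40.3,
                                    28.1, 33.3, 39.9, 27.5, 31.0,
                                    36.4, 29.2]})

# .iloc slices by integer position, so the first ten rows are positions 0:10
first_ten = demo["exam_score"].iloc[0:10]
print(first_ten)
```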
## your code here
1.3.3 ✅ Task (4 points)#
Using mean and median functions, print out the mean and median exam_score values for all the data (2 pts).
## your code here
Using only the mean and median you computed above, what can you infer about the shape of each variable's distribution? For each variable you summarized, state whether the data are likely approximately symmetric, right-skewed, or left-skewed, and justify your choice based on the relationship between mean and median.
Rule of thumb:
mean \(\approx\) median \(\rightarrow\) roughly symmetric
mean > median \(\rightarrow\) right-skew (long right tail)
mean < median \(\rightarrow\) left-skew (long left tail)
Note: These are indicators, not proofs. A histogram is still the best check.
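The rule of thumb can be applied directly in code. The values below are made up to illustrate a right-skewed case; the comparison threshold is one arbitrary choice, not the only reasonable one.

```python
import pandas as pd

# Made-up example values: a long right tail pulls the mean above the median.
scores = pd.Series([50, 52, 53, 55, 55, 56, 58, 60, 85, 95])

mean, median = scores.mean(), scores.median()
print(f"mean = {mean:.1f}, median = {median:.1f}")

# Apply the rule of thumb from above (using 10% of the standard
# deviation as a rough "approximately equal" threshold)
if abs(mean - median) < 0.1 * scores.std():
    shape = "roughly symmetric"
elif mean > median:
    shape = "right-skewed (long right tail)"
else:
    shape = "left-skewed (long left tail)"
print(shape)
```

A histogram of the same data would confirm the long right tail directly.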
✅ Do this (2 pts): For the variable summarized in 1.3.3, write one sentence naming the likely shape and one sentence justifying it with your mean/median values.
✎ Put your answer here:
1.4 Filter the data using masking (8 points)#
You decide to analyze the median number of hours studied for the following groups based on their previous_scores range:
0-20
21-40
41-60
61-80
81-100
1.4.1 ✅ Task (6 pts)
For each of the above score ranges, calculate the median value of hours_studied. For each range, print a statement such as:
“The median number of hours studied for students with previous scores between (insert range here) is: x hours.”
Hint: You can create Boolean masks for value ranges as follows:
data[(data["previous_scores"] >= lower_bound) & (data["previous_scores"] <= upper_bound)]
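The hint above extends naturally to a loop over the bins. This sketch uses a tiny made-up frame in place of the real dataset; only the masking-and-median pattern matters.

```python
import pandas as pd

# Hypothetical stand-in data; the real frame comes from student_exam_scores.csv
data = pd.DataFrame({
    "previous_scores": [15, 35, 38, 45, 55, 62, 70, 78, 85, 95],
    "hours_studied":   [1.0, 2.0, 2.5, 3.0, 4.0, 4.5, 5.0, 6.0, 7.0, 8.5],
})

bins = [(0, 20), (21, 40), (41, 60), (61, 80), (81, 100)]
medians = {}
for lower_bound, upper_bound in bins:
    # Boolean mask selecting rows whose previous_scores fall in this range
    mask = (data["previous_scores"] >= lower_bound) & (data["previous_scores"] <= upper_bound)
    medians[(lower_bound, upper_bound)] = data.loc[mask, "hours_studied"].median()
    print(f"The median number of hours studied for students with previous "
          f"scores between {lower_bound} and {upper_bound} is: "
          f"{medians[(lower_bound, upper_bound)]} hours.")
```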
## your code here
1.4.2 ✅ Task (2 pts)
Suppose you want to check for an association between past assignment performance and study time. Using your masked subsets from Task 1.4.1, report how the median hours_studied changes (if at all) across increasing values (or bins) of previous_scores. Provide a 1-2 sentence descriptive summary of the observed pattern (e.g., "the median appears higher/lower/similar for higher previous_scores").
✎ Put your answer here
1.5 Clean out the NaN values (7 points)#
1.5.1 ✅ Task (2 points)#
Use the isna().sum() function to count the NaN values in each column.
# Put your answer here
1.5.2 ✅ Task (2 points)#
You observe "NaN" values in the "exam_score" column. Create a new, clean DataFrame (1 pt) by removing all rows where "exam_score" contains "NaN" values (1 pt). Use the built-in pandas function dropna() to accomplish this (you can look it up online to learn more).
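A minimal sketch of the dropna pattern, using a made-up four-row frame rather than the real data. The subset argument restricts the row-dropping to the one column of interest.

```python
import numpy as np
import pandas as pd

# Hypothetical frame with missing exam scores, standing in for the real data
df = pd.DataFrame({
    "student_id": ["S001", "S002", "S003", "S004"],
    "exam_score": [30.2, np.nan, 35.8, np.nan],
})

# dropna(subset=...) keeps only rows where exam_score is present;
# assigning to a new name leaves the original frame untouched
clean_df = df.dropna(subset=["exam_score"])
print(len(df), len(clean_df))  # row counts before and after cleaning
```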
#Put your answer here
1.5.3 ✅ Task (2 points)#
Now that you have the original and clean datasets, determine the number of rows in each of them (1 pt). Print out your results (1 pt).
#Put your answer here
1.5.4 ✅ Task (1 point)#
Let’s reflect on the cleaning process. If you hadn't dropped the "NaN" data, would it have affected your results?
✎ Put your answer here
Part 2: Exploratory Data Analysis (25 total points)#
It’s time to explore our data! You'll use basic EDA to understand distributions and relationships.
If you were not able to get part 1 done correctly, run the cell below to load the clean data set and use it in the rest of the assignment. You can also use this data set to check your work in part 1.
exam_scores_df = pd.read_csv("student_exam_scores_cleaned.csv")
print(exam_scores_df.head())
student_id hours_studied sleep_hours attendance_percent previous_scores \
0 S001 8.0 8.8 72.1 45
1 S002 1.3 8.6 60.7 55
2 S003 4.0 8.2 73.7 86
3 S004 3.5 4.8 95.1 66
4 S005 9.1 6.4 89.8 71
exam_score
0 30.2
1 25.0
2 35.8
3 34.0
4 40.3
Part 2.1: Distributions & summary statistics (8 points)#
Understanding distributions helps you reason about ranges, skewness, and typical values before modeling anything.
✅ Do this (8 points): Using matplotlib and pandas:
Create 3 subplots (use plt.subplot) in 1 row \(\times\) 3 columns to show histograms for three numeric columns: use exam_score, hours_studied, and sleep_hours. (3 pts; 1 each)
Use plt.figure with figsize=(9, 3) (or similar wide aspect). (1 pt)
Compute and print the mean and median for each of the three plotted columns. (2 pts)
On each histogram, draw vertical lines for the mean and median (e.g., plt.axvline). Add clear titles and axis labels. (2 pts)
Hints: Select columns by name, e.g., df[“exam_score”].
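One way to assemble this kind of figure is sketched below. The data here are randomly generated stand-ins for the cleaned dataset (same column names, roughly plausible ranges), so the numbers themselves are meaningless; only the plt.subplot / plt.axvline structure is the point.

```python
import matplotlib.pyplot as plt
import numpy as np
import pandas as pd

# Made-up data standing in for the cleaned dataset
rng = np.random.default_rng(0)
df = pd.DataFrame({
    "exam_score": rng.normal(35, 8, 200),
    "hours_studied": rng.uniform(1, 10, 200),
    "sleep_hours": rng.normal(7, 1.2, 200),
})

cols = ["exam_score", "hours_studied", "sleep_hours"]
fig = plt.figure(figsize=(9, 3))
for i, col in enumerate(cols, start=1):
    plt.subplot(1, 3, i)  # 1 row x 3 columns, panel i
    plt.hist(df[col], bins=20)
    # Vertical reference lines for the mean and median
    plt.axvline(df[col].mean(), color="red", label="mean")
    plt.axvline(df[col].median(), color="black", linestyle="--", label="median")
    plt.title(col)
    plt.xlabel(col)
    plt.ylabel("count")
    plt.legend()
    print(f"{col}: mean = {df[col].mean():.2f}, median = {df[col].median():.2f}")
plt.tight_layout()
```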
# Put your code here
Part 2.2: Visualizing relationships (9 points)#
✅ Do this (9 points): Use matplotlib to make three scatter plots showing:
exam_score vs. hours_studied
exam_score vs. attendance_percent
exam_score vs. sleep_hours:
For full credit:
Create 3 subplots using plt.subplot with 3 rows \(\times\) 1 column. (3 pts; 1 each)
Use plt.figure and figsize=(4, 8) (or similar tall aspect). (2 pts)
Provide an overall title (e.g., "Study Habits and Exam Performance"). (2 pts)
Provide x-axis and y-axis labels and subplot titles. (2 pts)
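A sketch of the tall three-panel layout follows. As before, the data are synthetic placeholders (score loosely tied to hours and attendance) just so the scatter plots have something to show; swap in your cleaned DataFrame.

```python
import matplotlib.pyplot as plt
import numpy as np
import pandas as pd

# Made-up data standing in for the cleaned dataset
rng = np.random.default_rng(1)
hours = rng.uniform(1, 10, 200)
attend = rng.uniform(50, 100, 200)
sleep = rng.normal(7, 1.2, 200)
score = 3 * hours + 0.1 * attend + rng.normal(0, 4, 200)
df = pd.DataFrame({"hours_studied": hours, "attendance_percent": attend,
                   "sleep_hours": sleep, "exam_score": score})

predictors = ["hours_studied", "attendance_percent", "sleep_hours"]
fig = plt.figure(figsize=(4, 8))
fig.suptitle("Study Habits and Exam Performance")  # overall title
for i, col in enumerate(predictors, start=1):
    plt.subplot(3, 1, i)  # 3 rows x 1 column, panel i
    plt.scatter(df[col], df["exam_score"], s=10, alpha=0.5)
    plt.xlabel(col)
    plt.ylabel("exam_score")
    plt.title(f"exam_score vs. {col}")
plt.tight_layout()
```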
# Put your code here
Part 2.3: Reflections (8 points)#
✅ Answer the following: Looking at the plots,
Which relationship looks strongest? Briefly justify. (2 pt)
Which plot looks the most linear? (2 pt).
Which plot do you believe would be the easiest to create a model for? Briefly justify. (2 pt)
In 3-4 sentences, compare and contrast when a histogram is useful versus when a scatter plot is useful (2 pt). Consider three scenarios:
the number/type of variables each shows,
the kinds of patterns each reveals,
the relationship between "exam scores for one class" and "hours studied".
✎ Put your answer here
Part 1 (2pt):
Part 2 (2pt):
Part 3 (2pt):
Part 4 (2pt):
Part 3: Fitting curves to data - Predicting Exam Scores (24 points)#
Now that we’ve visualized our data, let’s ask:
What is the relationship between exam score and each of these variables:
hours_studied
sleep_hours
Can we predict exam score for both of these variables with a simple linear model?
Part 3.1: Model (4 points)#
In the above plots we notice that exam_score tends to increase as hours_studied increases in value. Specifically, we can describe this relationship with a linear model:

\(exam\_score = m \cdot variable + b\)

where \(variable\) is the predictor variable (e.g., hours_studied or sleep_hours), \(m\) is the slope of the line, and \(b\) is the intercept.
✅ Do this:
Write a function called exam_model that calculates exam score based on a given predictor variable using the equation above.
The function should be structured so that it can be used with the curve_fit function in the next section.
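One way such a function could look (curve_fit expects the independent variable as the first argument, followed by the free parameters):

```python
def exam_model(variable, m, b):
    """Linear model: predicted exam score = m * variable + b.

    curve_fit passes the predictor array first, then the parameters
    it is fitting (here the slope m and the intercept b).
    """
    return m * variable + b

# Quick sanity check with made-up numbers: slope 2, intercept 10
print(exam_model(5, 2, 10))  # 2*5 + 10 = 20
```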
# Put your code here
Part 3.2: Fit the model (10 points)#
We'll now use the curve_fit function from SciPy to find the best-fitting line that predicts exam scores.
✅ Do this:
Use the dataset student_exam_scores_cleaned.csv
For each predictor variable (hours_studied and sleep_hours):
Use curve_fit with your exam_model function to find the slope (\(m\)) and intercept (\(b\)). (3 pts each, 6 total)
Print out the value of the slope \(m\): write “The value of the slope is…” (1 pt each, 2 total)
Print out the value of the intercept \(b\): write “The value of the intercept is…” (1 pt each, 2 total)
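The curve_fit call pattern is sketched below on synthetic data generated with a known slope and intercept (so the recovered values should land near them); with the real dataset you would pass the predictor and exam_score columns instead.

```python
import numpy as np
from scipy.optimize import curve_fit

def exam_model(variable, m, b):
    # Linear model: predicted score = m * variable + b
    return m * variable + b

# Synthetic stand-in: scores roughly 3*hours + 20, plus noise
rng = np.random.default_rng(42)
hours = rng.uniform(1, 10, 200)
scores = 3.0 * hours + 20.0 + rng.normal(0, 2, 200)

# curve_fit returns the best-fit parameters and their covariance matrix
params, covariance = curve_fit(exam_model, hours, scores)
m, b = params
print(f"The value of the slope is {m:.2f}")
print(f"The value of the intercept is {b:.2f}")
```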
# Put your code here
Part 3.3 Check your model (10 points)#
✅ Do this (8 points): Use matplotlib to make two plots using two predictor variables (hours_studied, and sleep_hours), to compare your model and the data.
Plot the actual data (1 point each, 2 total)
Plot your modeled data (1 point each, 2 total)
Use x and y axis labels and title (1 point each, 2 total)
Adjust the color, size, and alpha of the data points. Make sure the plot of your model is a different color than the data points so that you can see it. (1 point each, 2 total)
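An overlay of data and model for one predictor might be structured like this. The data and the "fitted" parameters below are invented for illustration; in your notebook the parameters come from curve_fit in the previous part.

```python
import matplotlib.pyplot as plt
import numpy as np

def exam_model(variable, m, b):
    return m * variable + b

# Made-up data and made-up fitted parameters, for illustration only
rng = np.random.default_rng(7)
hours = rng.uniform(1, 10, 200)
scores = 3.0 * hours + 20.0 + rng.normal(0, 3, 200)
m, b = 3.0, 20.0  # pretend these came from curve_fit

fig = plt.figure(figsize=(5, 4))
# Actual data: small, semi-transparent points
plt.scatter(hours, scores, s=15, alpha=0.4, color="steelblue", label="data")
# Model: a contrasting solid line evaluated on a smooth x grid
x_line = np.linspace(hours.min(), hours.max(), 100)
plt.plot(x_line, exam_model(x_line, m, b), color="crimson", label="model")
plt.xlabel("hours_studied")
plt.ylabel("exam_score")
plt.title("Exam score vs. hours studied: data and linear model")
plt.legend()
```

Repeating the same structure with sleep_hours on the x-axis gives the second required plot.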
# Put your code here
✅ Answer the following (2 points):#
Do you think your models are good models? Is one model better than the other? Clearly justify your answer (we are looking for more than a "yes" or "no" answer).
✎ Put your answer here
Congratulations, you’re done!#
Submit this assignment by uploading it to the course Desire2Learn web page. Go to the “Homework Assignments” section, find the submission folder link for Homework #3, and upload it there.
© 2024 Copyright the Department of Computational Mathematics, Science and Engineering.