# Homework 3: Loading, Cleaning, Analyzing, and Visualizing Data with Pandas (and using resources!)

### <p style="text-align: right;"> &#9989; Put your name here </p>

## Learning Goals

* Use pandas to load, work with, and clean data
* Make meaningful visual representations of the data
* Fit curves to data and evaluate model fits

___

## Assignment instructions

Work through the following assignment, making sure to follow all of the directions and answer all of the questions.

**This assignment is due at 11:59pm on Friday, October 31, 2025** 

It should be uploaded to the Homework #3 folder on D2L. Submission instructions can be found at the end of the notebook.

Total points possible: **80**

---
<a id="toc"></a>

## Table of Contents

[Part 0. Academic Integrity Statement](#part_0) (2 points)

[Part 1. Reading, describing, and cleaning data](#part_1)  (29 points)

[Part 2. Exploratory Data Analysis and Data visualization](#part_2)  (25 points)

[Part 3. Fitting curves to data](#part_3) (24 points)
    

---
<a id="part_0"></a>

## Part 0. Academic integrity statement (2 points)

[Back to Top](#toc)

In the markdown cell below, paste your personal academic integrity statement. By including this statement, you are confirming that you are submitting this as your own work and not that of someone else.

<font size=6 color="#009600">&#9998;</font> *Put your personal academic integrity statement here.*

Before we read in the data and begin working with it, let's import the libraries that we would typically use for this task. You can always come back to this cell and import additional libraries that you need.

In [12]:
import pandas as pd
import matplotlib.pyplot as plt
import numpy as np
import math
from scipy.optimize import curve_fit

---
<a id="part_1"></a>

## Part 1: Reading, describing, and cleaning data (29 total points)

[Back to Top](#toc)

### 1.0 Data Source & Context (4 points)

Before analyzing any dataset, it's important to know where it came from and what that implies for the kinds of conclusions we can (and cannot) draw.

The dataset we'll use in this assignment is: "Analyzing Student Academic Trends" by Saad Ali Yaseen, hosted on Kaggle (https://www.kaggle.com/datasets/saadaliyaseen/analyzing-student-academic-trends). The dataset contains 200 student records; variables include hours_studied, sleep_hours, attendance_percent, previous_scores, exam_score.


&#9989;&nbsp; **Task**
Write a 3-5 sentence Context Statement that addresses: (4 pts)

1. Whether the data appears to be real-world observational, experimental, or simulated/synthetic/curated (and why you think so).

2. What population the data might (or might not) represent, and any limits on generalization.

3. What kinds of claims are reasonable (e.g., associations vs. causal claims) given the dataset's provenance and size.

4. Any data quality or measurement caveats you notice (e.g., small N, unknown sampling, engineered features).

<font size=8 color="#009600">&#9998;</font> Put your answer here:  



### 1.1 Read the data (1 point)

&#9989;&nbsp; **Task**

Read in the data from `student_exam_scores.csv` into a Pandas dataframe (0.5 pt) and display the `head` of the data (0.5 pt).

In [14]:
## your code here

### 1.2 Describe the data (3 points)

#### 1.2.1 &#9989;&nbsp; **Task (1 point)**

Use `describe` to display several summary statistics from the data frame.

In [17]:
## your code here

#### 1.2.2 &#9989;&nbsp; **Task (2 points)**
The columns in this dataset represent the following:

- **student_id**: A unique identifier for each student.  
- **hours_studied**: The number of hours the student studied before the exam.  
- **sleep_hours**: The average number of hours the student slept per day.  
- **attendance_percent**: The percentage of classes attended by the student.  
- **previous_scores**: The marks obtained by the student in previous assessments.  
- **exam_score**: The final exam score, representing the student's overall performance.

Using the results from `describe` above, write in the markdown cell below:

1. The **minimum and maximum** number of hours studied among all students. (1 pt)  
2. The **lowest and highest** exam scores observed in the dataset. (1 pt)

<font size=8 color="#009600">&#9998;</font> Put your answer here:  

**Part 1 (1pt):**  

**Part 2 (1pt):**

### 1.3 Isolating and performing basic statistics on data (6 points)

#### 1.3.1 &#9989;&nbsp; **Task (1 point)**

Display the `exam_score` column on its own using the name of the column.

In [21]:
## your code here

#### 1.3.2 &#9989;&nbsp; **Task (1 point)**

Using `.iloc`, display the first ten rows of just the `exam_score` column.

In [24]:
## your code here

#### 1.3.3 &#9989;&nbsp; **Task (4 points)**

Using the `mean` and `median` functions, print out the mean and median `exam_score` values for all the data **(2 pts)**.

In [27]:
## your code here

Using only the mean and median you computed above, what can you infer about the shape of the variable's distribution? State whether the data are likely approximately [symmetric, right-skewed, or left-skewed](https://en.wikipedia.org/wiki/Skewness), and justify your choice based on the relationship between mean and median.

Rule of thumb:
- mean $\approx$ median $\rightarrow$ roughly symmetric
- mean > median $\rightarrow$ right-skew (long right tail)
- mean < median $\rightarrow$ left-skew (long left tail)

Note: These are indicators, not proofs. A histogram is still the best check.
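
As a small illustration of the rule of thumb above, here is a sketch on made-up numbers (the `values` array is invented for this example and is not the assignment data). A few large values pull the mean above the median, which is what a long right tail tends to look like:

```python
import numpy as np

# Made-up values for illustration only (not the assignment data):
# most values are small, and one large value forms a long right tail.
values = np.array([1, 2, 2, 3, 3, 3, 4, 4, 5, 20])

mean_value = np.mean(values)
median_value = np.median(values)

print(f"mean = {mean_value:.2f}, median = {median_value:.2f}")

# Apply the rule of thumb; this is an indicator, not a proof.
if mean_value > median_value:
    print("mean > median: consistent with right skew")
elif mean_value < median_value:
    print("mean < median: consistent with left skew")
else:
    print("mean equals median: consistent with rough symmetry")
```

With real data the mean and median are rarely exactly equal, so deciding what counts as "approximately equal" is a judgment call that depends on the spread of the data.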

&#9989;&nbsp; **Do this (2 pts):**
For the variable summarized in 1.3.3, write one sentence naming the likely shape and one sentence justifying it with your mean/median values.

<font size=8 color="#009600">&#9998;</font> Put your answer here:  



### 1.4 Filter the data using masking (8 points)

<!-- You want to know whether students' study habits have changed over time, and if so, whether they tend to study **more or less** in recent years.
 -->
You decide to analyze the **median number of hours studied** for the following groups based on their **previous_scores** range:

* 0-20  
* 21-40  
* 41-60  
* 61-80  
* 81-100  

**1.4.1** &#9989;&nbsp; **Task (6 pts)**
1. For each of the above score ranges, calculate the median value of `hours_studied`. 
2. For each range, print a statement such as:  
> "The median number of hours studied for students with previous scores between (*insert range here*) is: x hours."


**Hint:** You can create Boolean masks for value ranges as follows:
```python
data[(data["previous_scores"] >= lower_bound) & (data["previous_scores"] <= upper_bound)]
```


In [31]:
## your code here

**1.4.2** &#9989;&nbsp; **Task (2 pts)**

Suppose you want to check for an association between past assignment performance and study time.
Using your masked subsets from Task 1.4.1, report how the median `hours_studied` changes (if at all) across increasing values (or bins) of `previous_scores`. Provide a 1-2 sentence descriptive summary of the observed pattern (e.g., "the median appears higher/lower/similar for higher previous_scores").


<font size=8 color="#009600">&#9998;</font> Put your answer here

### 1.5 Clean out the NaN values (7 points)

#### 1.5.1 &#9989;&nbsp; **Task (2 points)**
Use `isna().sum()` to count the NaN values in each column.

In [35]:
# Put your answer here

#### 1.5.3 &#9989;&nbsp; **Task (2 points)**
You observe `NaN` values in the `"exam_score"` column. Create a new, clean DataFrame (1 pt) by removing all rows where `"exam_score"` contains `NaN` values (1 pt). Use the built-in pandas function `dropna()` to accomplish this (you can look it up online to learn more).
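
To show how missing values are counted and how `dropna()` behaves, here is a minimal sketch on a hypothetical toy DataFrame (the name `toy`, the columns `a` and `b`, and the values are invented for illustration). It uses the `subset` parameter of `dropna()`, which limits the NaN check to the listed columns:

```python
import numpy as np
import pandas as pd

# Hypothetical toy data with one NaN in each column (illustration only).
toy = pd.DataFrame({
    "a": [1.0, np.nan, 3.0, 4.0],
    "b": [10.0, 20.0, np.nan, 40.0],
})

# Count missing values per column.
print(toy.isna().sum())

# Drop only the rows where column "b" is NaN; a NaN in "a" is left alone.
# dropna() returns a new DataFrame and does not modify `toy`.
cleaned = toy.dropna(subset=["b"])

print("rows before:", len(toy), "rows after:", len(cleaned))
```

Calling `dropna()` with no arguments would instead drop every row that has a NaN in any column, which can remove more data than intended.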


In [38]:
#Put your answer here

#### 1.5.4 &#9989;&nbsp; **Task (2 points)**
Now that you have the original and clean datasets, determine the number of rows in each of them (1 pt). Print out your results (1 pt).

In [41]:
#Put your answer here

#### 1.5.5 &#9989;&nbsp; **Task (1 point)**
Let's reflect on the cleaning process. If you hadn't dropped the `NaN` data, would it have affected your results?

<font size=8 color="#009600">&#9998;</font> Put your answer here

---
<a id="part_2"></a>

## Part 2: Exploratory Data Analysis (25 total points)

[Back to Top](#toc)

It's time to explore our data! You'll use basic EDA to understand distributions and relationships.

If you were not able to get part 1 done correctly, run the cell below to load the clean data set and use it in the rest of the assignment. You can also use this data set to check your work in part 1.

In [45]:
exam_scores_df = pd.read_csv("student_exam_scores_cleaned.csv")

print(exam_scores_df.head())

  student_id  hours_studied  sleep_hours  attendance_percent  previous_scores  \
0       S001            8.0          8.8                72.1               45   
1       S002            1.3          8.6                60.7               55   
2       S003            4.0          8.2                73.7               86   
3       S004            3.5          4.8                95.1               66   
4       S005            9.1          6.4                89.8               71   

   exam_score  
0        30.2  
1        25.0  
2        35.8  
3        34.0  
4        40.3  


### Part 2.1: Distributions & summary statistics (8 points)

Understanding distributions helps you reason about ranges, skewness, and typical values before modeling anything.

&#9989;&nbsp; **Do this (8 points):** Using matplotlib and pandas:

1. Create 3 subplots (use plt.subplot) in 1 row $\times$ 3 columns to show histograms for three numeric columns: use exam_score, hours_studied, and sleep_hours. (3 pts; 1 each)

2. Use plt.figure with figsize=(9, 3) (or similar wide aspect). (1 pt)

3. Compute and print the mean and median for each of the three plotted columns. (2 pts)

4. On each histogram, draw vertical lines for the mean and median (e.g., plt.axvline). Add clear titles and axis labels. (2 pts)
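
The steps above can be sketched on synthetic data. In this example the `samples` dictionary, its labels, and the random values are all made up for illustration, and it uses two panels rather than the three the task asks for:

```python
import matplotlib.pyplot as plt
import numpy as np

# Synthetic samples for illustration only (not the assignment data).
rng = np.random.default_rng(0)
samples = {
    "sample A": rng.normal(loc=50, scale=10, size=200),
    "sample B": rng.exponential(scale=2.0, size=200),
}

fig = plt.figure(figsize=(8, 3))  # wide figure for side-by-side panels
for i, (name, values) in enumerate(samples.items(), start=1):
    plt.subplot(1, 2, i)  # 1 row, 2 columns, i-th panel
    plt.hist(values, bins=20, color="lightgray", edgecolor="black")
    # Vertical reference lines at the mean and the median.
    plt.axvline(np.mean(values), color="tab:red", linestyle="--", label="mean")
    plt.axvline(np.median(values), color="tab:blue", linestyle=":", label="median")
    plt.title(name)
    plt.xlabel("value")
    plt.ylabel("count")
    plt.legend()
plt.tight_layout()  # reduce overlap between panel labels
plt.show()
```

Giving the two lines different colors and line styles, plus a legend, makes it easy to tell which one is the mean and which is the median.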

> _Hints:_ Select columns by name, e.g., df["exam_score"].

In [46]:
# Put your code here

### Part 2.2: Visualizing relationships (9 points)

&#9989;&nbsp; **Do this (9 points):**  Use `matplotlib` to make three _scatter plots_ showing:

1. exam_score vs. hours_studied

2. exam_score vs. attendance_percent

3. exam_score vs. sleep_hours

For full credit:

- Create **3 subplots** using plt.subplot with **3 rows $\times$ 1 column**. (3 pts; 1 each)

- Use `plt.figure` and `figsize=(4, 8)` (or similar tall aspect). (2 pts)

- Provide an overall **title** (e.g., "Study Habits and Exam Performance"). (2 pts)

- Provide **x-axis** and **y-axis** labels and subplot titles. (2 pts)


In [48]:
# Put your code here

### Part 2.3: Reflections (8 points)
&#9989;&nbsp; **Answer the following:** Looking at the plots, 

1. Which relationship looks strongest? Briefly justify. (2 pts)
2. Which plot looks the most linear? (2 pts)
3. Which plot do you believe would be the easiest to create a model for? Briefly justify. (2 pts)
4. In 3-4 sentences, compare and contrast when a histogram is useful versus when a scatter plot is useful (2 pts). Consider three aspects:
    - the number/type of variables each shows,
    - the kinds of patterns each reveals,
    - the relationship between "exam scores for one class" and "hours studied".

<font size=8 color="#009600">&#9998;</font> Put your answer here

**Part 1 (2pt):**  

**Part 2 (2pt):**  

**Part 3 (2pt):**  

**Part 4 (2pt):**  



---
<a id="part_3"></a>

## Part 3: Fitting curves to data - Predicting Exam Scores (24 points)

[Back to Top](#toc)

Now that we've visualized our data, let's ask:

What is the relationship between **exam score** and each of these variables:

* **hours_studied**

* **sleep_hours**

Can we predict exam score from each of these variables with a simple linear model?

### Part 3.1: Model (4 points)
In the above plots we notice that `exam_score` tends to **increase** as `hours_studied` increases in value. Specifically, we can describe this relationship with a linear model:

$$ exam\_score = m \times variable + b $$

where $variable$ is the predictor variable (e.g., `hours_studied` or `sleep_hours`), $m$ is the slope of the line, and $b$ is the intercept.

&#9989;&nbsp; **Do this:**  
- Write a function called `exam_model` that calculates the exam score from a given predictor variable using the equation above.
- Structure the function so that it can be passed to the `curve_fit` function in the next section.
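
To illustrate the structure `curve_fit` expects, here is a sketch using a different model than the one in this assignment: a hypothetical two-parameter quadratic, fit to synthetic data. The function name `quadratic`, its parameters `a` and `c`, and the data are all invented for this example. The key convention is that the independent variable comes first in the function signature, followed by the parameters to be fit:

```python
import numpy as np
from scipy.optimize import curve_fit

def quadratic(x, a, c):
    """Hypothetical example model: y = a * x**2 + c.

    The first argument is the independent variable; the remaining
    arguments are the parameters that curve_fit will adjust.
    """
    return a * x**2 + c

# Synthetic data generated from known parameters plus a little noise.
rng = np.random.default_rng(1)
x_data = np.linspace(0, 5, 30)
y_data = quadratic(x_data, 2.0, 1.0) + rng.normal(scale=0.1, size=x_data.size)

# curve_fit returns the best-fit parameters (in signature order)
# and their estimated covariance matrix.
popt, pcov = curve_fit(quadratic, x_data, y_data)
a_fit, c_fit = popt

print(f"a = {a_fit:.3f}, c = {c_fit:.3f}")
```

Because the data were generated with `a = 2.0` and `c = 1.0`, the fitted values should land close to those numbers, which is a useful sanity check before fitting real data.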


In [51]:
# Put your code here

### Part 3.2: Fit the model (10 points)

We'll now use the `curve_fit` function from SciPy to find the best-fitting line that predicts exam scores.

&#9989;&nbsp; **Do this:** 

Use the dataset `student_exam_scores_cleaned.csv`.

For each predictor variable (`hours_studied` and `sleep_hours`):

- Use `curve_fit` with your `exam_model` function to find the slope ($m$) and intercept ($b$). (3 pts each, 6 total)

- Print out the value of the slope $m$ in a statement beginning "The value of the slope is..." (1 pt each, 2 total)

- Print out the value of the intercept $b$ in a statement beginning "The value of the intercept is..." (1 pt each, 2 total)

In [53]:
# Put your code here


### Part 3.3 Check your model (10 points)

&#9989;&nbsp; **Do this (8 points):**  Use matplotlib to make two plots, one for each predictor variable (`hours_studied` and `sleep_hours`), comparing your model to the data.
- Plot the actual data (1 point each, 2 total)
- Plot your modeled data (1 point each, 2 total)
- Use x and y axis labels and a title (1 point each, 2 total)
- Adjust the color, size, and alpha of the data points. Make sure your model is plotted in a different color than the data points so that you can see it. (1 point each, 2 total)

In [55]:
# Put your code here

##### &#9989;&nbsp; **Answer the following (2 points):** 
Do you think your models are good models? Is one model better than the other? Clearly justify your answer (looking for more than a "yes" or "no" answer).

<font size=8 color="#009600">&#9998;</font> Put your answer here

---

### Congratulations, you're done!
[Back to Top](#toc)

Submit this assignment by uploading it to the course Desire2Learn web page.  Go to the "Homework Assignments" section, find the submission folder link for Homework #3, and upload it there.

&#169; 2024 Copyright the Department of Computational Mathematics, Science and Engineering.