Day 10 Pre-class Assignment: Computational Models Overview and The Pandas Data Analysis Library#

✅ Put your name here

#

Goals for today’s pre-class assignment#

  • Review some types of Computational Models

  • Use the Pandas module to explore and visualize data

Assignment instructions#

This assignment is due by 11:59 p.m. the day before class, and should be uploaded into the appropriate “Pre-class assignments” submission folder. If you run into issues with your code, make sure to use Teams to help each other out and receive some assistance from the instructors. Submission instructions can be found at the end of the notebook.

It is important that you do your best to complete the pre-class assignment! Going through this assignment and trying to complete it to the best of your ability will help to make sure you’re prepared for the content that is covered in class.


Part 1: Overview of Computational Models#

So far this semester, we looked and and even built a few examples of computational models. Take a moment to think about what we mean when we talk about “models” by watching the following overview video to get an introduction to the definition of Computational Models. If the YouTube video doesn’t work, try this MediaSpace link.

Note: This video references a model for the motion of a snowball that was replaced by the Savings Account In Class assignment for Fall 2021. If you want to learn more about the model, ask your instructor!

# Video on the Pandas module  
from IPython.display import YouTubeVideo  
YouTubeVideo("7qAunwHsuj8",width=640,height=360)

✅  Question 1: In your own words, decribe what scientific and/or computational models do.

Put your answer here

✅  Question 2: What are some of the limitations of scientific and/or computational models?

Put your answer here


Part 2: Working with data in Pandas#

Pandas is an extremely useful Python library for reading and analyzing datasets in a variety of formats with a variety of data types. Managing complicated files with only the Python tools we’ve learned up to this point is often very difficult and at times, may seem impossible. Lucky for you, someone else decided to make your life easier and created Pandas!

Watch the following videos to learn a bit about how you can load and analyze data using this handy library! Pay particular attention to the part of the video that talks about “slicing” the data. If the YouTube video doesn’t work, try this MediaSpace link.

Note: In a previous semester of this course, the order of the topics covered was slightly different so there may be references to “future” content that you already have experience with. Also, the version of Python used in this video is a bit older than the version that you have, so the format of the Pandas dataframe might look a bit different in your notebook when you display it in your notebook.


Video 1#

# Video on the Pandas module  
from IPython.display import YouTubeVideo  
YouTubeVideo("A0InxIMAvlU",width=640,height=360,end=405)

Video 2#

This next video introduces plotting dataframes using the Pandas module. Pandas does have its own plotting functions which you may see referenced from time to time. You are welcome to explore them and use them, but in CMSE 201, we will focus on using Pandas (or numpy) for loading and analyzing data and Matplotlib for visualizing the data! If the video doesn’t work, try this MediaSpace Link!

Note: this video will reference the “blood pressure” dataset, which you will be working with in the code cells below.

# Video on the Pandas module  
YouTubeVideo("9nU-lCobiM8",width=640)

Useful Pandas references#

For this pre-class assignment and the assignments that follow during the next week of the course, these two references might prove to be particularly useful:


2.1: Pandas and Plotting#

Now that we understand a bit of the fundamentals for Pandas dataframes, we’re going to use a small database of data taken from patients that were admitted to the hospital with chest pains. In the dataset, there are several columns. Here are what a few of them correspond to:

  • age is patient age

  • chol is serum cholesterol in units of mg/dl

  • trestbps is the resting blood pressure of the patient upon admission

  • thalach is the maximum heart rate achieved

The cell immediately below this uses Pandas to read in the data, and you should run that cell before you do anything else and make sure you understand what it’s doing!

Importing Pandas and all of our other useful modules we know up to this point.#

As always, we should make sure we import all of the Python modules we might want to use as we work through the notebook.

Then, download heart_disease_data.csv from the course website. Make sure the file is in the same folder as your notebook! If you are having trouble with the data file, make sure to ask for help!

Pay special attention to the new import command for Pandas, it’s a bit different than what is shown in the video!

Make sure you execute this cell!

# import numpy
import numpy as np

# import matplotlib and make sure plots show up in the notebook
import matplotlib.pyplot as plt

# Look at the Pandas import -- we take a similar approach to how we import numpy
# "pd" will be the short-hand for accessing Pandas functions.
import pandas as pd

# read in some data on heart disease
# from http://www.datasets.co/dataset/Heart-Disease
heart_disease_data = pd.read_csv('heart_disease_data.csv')

2.1.1: Checking out the DataFrame#

Now let’s see what the data looks like by examining all of the columns labels and the first few rows of data using the .head() command that was explained in the video.

# Check out the top of the data structure
heart_disease_data.head()
observation_number age sex cp trestbps chol fbs restecg thalach exang oldpeak slope ca thal num
0 0 67 1 4 160 286 0 2 108 1 1.5 2 3.0 3.0 2
1 1 67 1 4 120 229 0 2 129 1 2.6 2 2.0 7.0 1
2 2 37 1 3 130 250 0 0 187 0 3.5 3 0.0 3.0 0
3 3 41 0 2 130 204 0 2 172 0 1.4 1 0.0 3.0 0
4 4 56 1 2 120 236 0 0 178 0 0.8 1 0.0 3.0 0

Let’s also test out the .describe() function to get some details about the data.

# Check out some of the properties of the data
heart_disease_data.describe()
observation_number age sex cp trestbps chol fbs restecg thalach exang oldpeak slope num
count 302.000000 302.000000 302.000000 302.000000 302.000000 302.000000 302.000000 302.000000 302.000000 302.000000 302.000000 302.000000 302.000000
mean 150.500000 54.410596 0.678808 3.165563 131.645695 246.738411 0.145695 0.986755 149.605960 0.327815 1.035430 1.596026 0.940397
std 87.324109 9.040163 0.467709 0.953612 17.612202 51.856829 0.353386 0.994916 22.912959 0.470196 1.160723 0.611939 1.229384
min 0.000000 29.000000 0.000000 1.000000 94.000000 126.000000 0.000000 0.000000 71.000000 0.000000 0.000000 1.000000 0.000000
25% 75.250000 48.000000 0.000000 3.000000 120.000000 211.000000 0.000000 0.000000 133.250000 0.000000 0.000000 1.000000 0.000000
50% 150.500000 55.500000 1.000000 3.000000 130.000000 241.500000 0.000000 0.500000 153.000000 0.000000 0.800000 2.000000 0.000000
75% 225.750000 61.000000 1.000000 4.000000 140.000000 275.000000 0.000000 2.000000 166.000000 1.000000 1.600000 2.000000 2.000000
max 301.000000 77.000000 1.000000 4.000000 200.000000 564.000000 1.000000 2.000000 202.000000 1.000000 6.200000 3.000000 4.000000

2.1.2: Intro to Accessing data in Pandas Dataframes#

To access data in a DataFrame for plotting, there are several different ways. Review the code below to see how you access data by its column header. Note: In Part 3, we will explore additional ways to access data with more control!

heart_disease_data['trestbps'] # accesses the column labeled 'trestbps'
0      160
1      120
2      130
3      130
4      120
      ... 
297    110
298    144
299    130
300    130
301    138
Name: trestbps, Length: 302, dtype: int64

To access the data by column name, you need to know the labels for each column name. If you need to know what strings pandas is using for column headers, you can find out using the .keys() command. Run the cell below to get the column header names for heart_disease_data.

Note: Using .keys() is particularly useful when there are extra spaces or other attributes of the column names that you can’t see with the standard dataframe display. Remember this if you decide to use data for your semester project!

heart_disease_data.keys()
Index(['observation_number', 'age', 'sex', 'cp', 'trestbps', 'chol', 'fbs',
       'restecg', 'thalach', 'exang', 'oldpeak', 'slope', 'ca', 'thal', 'num'],
      dtype='object')

✅  In the cell below, print another one of the columns of heart_disease_data

# put your answer here

2.2 Visualizing the data#

The following questions give you a chance to try visualizing the pandas dataframe using some of the functions you were shown in the video and one you weren’t.

Note:#

In this section, you will be plotting using matplotlib.pyplot as shown in Video 2.

✅  Question 1: Using matplotlib.pyplot, make a histogram (using plt.hist()) of the resting blood pressure of the patient upon admission. Make sure to include axis labels and a title indicating what you’re looking at.

# Put your code here

✅  Question 2: Make a scatter plot of the resting blood pressure (‘trestbps’) versus age. Make sure you put the right variables on the \(x\) and \(y\) axes.

Think back to (or refer to) the Great Lakes in-class assignment we did previously in the course and remind yourself about correlations. Do you think these values are correlated?

# Put your code here

✅  Question 3: Make a boxplot for the resting blood pressure. Like plt.hist, the code structure of boxplot is slightly different than plt.plot or plt.scatter because it contains only one dimension of data (i.e. one column of a dataframe).

# Put your code here

Part 3: Accessing Parts of Dataframes#

What if we want to access dataframes by something other than the column headers?

The code below creates a dataframe from scratch. Review the code and, when necessary, use the internet to learn what the following Pandas methods/attributes are used for (e.g. .index and .columns) and comment on what each line of code is doing below.

Example_DataFrame = pd.DataFrame([['A',0,1,2,3],['B',4,5,6,7],['C',8,9,10,11],['D',12,13,14,15]]) # put comment here

Example_DataFrame.index = Example_DataFrame[0] # put comment here

Example_DataFrame = Example_DataFrame.iloc[:,1:] # put comment here 

Example_DataFrame.index.name = None # This line is removing the column header for the index column

Example_DataFrame.columns = ['Col_1', 'Col_2', 'Col_3','Col_4'] # put comment here 

Example_DataFrame
Col_1 Col_2 Col_3 Col_4
A 0 1 2 3
B 4 5 6 7
C 8 9 10 11
D 12 13 14 15

Now that we’ve created this Pandas dataframe, the following image should serve as a reference for how one can access information within it.

3.1: Examples of using .iloc and .loc to access data#

Have you figured out the difference yet?

  • If you want to access data by its label or name, we use .loc

  • If you want to access data by its indices or position, we use .iloc

Now, here are some examples that require us to index Pandas dataframes:

✅  Example 1: Access information in the second column of Example_Dataframe.

  • Information Needed: Look at the above image and in your mind ‘highlight’ the information we want to access. We want the values under Col_2.

  • Available Tools:

    • We notice that since we want all the information in the column, we can access it by the column name.

    • However, we can also simply retreive the values in that column using .iloc. We want all the rows in the column at the 1 index.

#Index Col_2 by name
print(Example_DataFrame['Col_2'])

#Index Col_2 by .iloc, pay special attention to the use of ":" -- what's going on there?
print(Example_DataFrame.iloc[:,1])
A     1
B     5
C     9
D    13
Name: Col_2, dtype: int64
A     1
B     5
C     9
D    13
Name: Col_2, dtype: int64

Check your code: Look at the gif below. Is this the image you pictured earlier? We can check that the code we used was correct by comparing the output with the column highlighted in the original dataframe.

✅  Example 2: We access the information in the second row in a similar manner.

#Index Row B with loc
print(Example_DataFrame.loc['B'])

#Index Row B with .iloc
print(Example_DataFrame.iloc[1])
#Or
#Index Row B with .iloc
print(Example_DataFrame.iloc[1,:])

Now, we have a slightly more complicated task. We are going to access single entries or multiple entries in a single column.

✅  Example 3: Access the top left element of the Example_Dataframe.

  • Information Needed: Look at Example_Dataframe and in your mind ‘highlight’ the information we want to access. We want the value 0.

  • Available Tools:

    • Since we want a single value, we can combine the column and row access we used before, by using both the column name and .loc with the row name.

    • Or we could use the indices to retreive the value using .iloc. We want the entry in the 0th row and 0th column.

#Index Value 0 with .loc
print(Example_DataFrame['Col_1'].loc['A'])

#Index Value 0 with .iloc
print(Example_DataFrame.iloc[0,0])

We can also access multiples entries. In the final example, we want to access the entries 1,2,3 or the entries in the 0th row and the 1st, 2nd, 3rd columns.

Example_DataFrame
Example_DataFrame.iloc[0,1:]

Check your code: Look at the gif below. We can check that the code we used was correct, by comparing the output with the column highlighted in the original dataframe.

3.2: Let’s practice different ways to index using both .iloc and .loc#

Examine the Example_Dataframe and complete the questions below. For each question, follow the same process we did above: What is the task? What information do you need? What tools can you use? Make sure to check that the output matches what you expected from the original Dataframe. For the Available Tools portion in Tasks 1 and 3, make sure to describe the differences between the two methods.

✅  Question 1: Using .loc and .iloc, access the row of data containing [8,9,10,11] from the example dataframe.

  • Information Needed:??

  • Available Tools:

    • ??

    • ??

# 1

✅  Question 2: Using .iloc, access just the values [6,10,14] from the example dataframe.

  • Information Needed:??

  • Available Tools:

    • ??

#2

✅  Question 3: Using .loc and .iloc, access the value “10” from the example dataframe.

  • Information Needed:??

  • Available Tools:

    • ??

    • ??

#3

✅  Question 4: Print out the column names and index names.

#4

3.3: Understanding the Pandas Dataframes default#

By default, when you load a dataset with Pandas or create a dataframe from scratch, Pandas will define the row indices to just be numbers rather than a unique set of labels that match your data. This is how we will commonly interact with Pandas dataframes.

✅  Review the following code that constructs a dataframe without providing specific row index names and comment what each line of code is doing. Indexing will be similiar to before but now the row names are just integers representing row numbers.

DF_NoIndex = pd.DataFrame([[0,1,2,3],[4,5,6,7],[8,9,10,11],[12,13,14,15]]) #Comment here
DF_NoIndex.columns = ['Col_1', 'Col_2', 'Col_3','Col_4'] # Comment here

DF_NoIndex
#Index the third row by .loc
print(DF_NoIndex.loc[2])

#Index the third row by .iloc
print(DF_NoIndex.iloc[2])
#Index the first and second row by loc
print(DF_NoIndex.loc[0:1])

#Index the first and second row by .iloc
print(DF_NoIndex.iloc[0:2,:])

Assignment wrap-up#

Please fill out the form that appears when you run the code below. You must completely fill this out in order to receive credit for the assignment! If the survey doesn’t load, paste https://cmse.msu.edu/cmse201-pc-survey into your browser to fill out the survey!

from IPython.display import HTML
HTML(
"""
<iframe 
	src="https://cmse.msu.edu/cmse201-pc-survey" 
	width="800px" 
	height="600px" 
	frameborder="0" 
	marginheight="0" 
	marginwidth="0">
	Loading...
</iframe>
"""
)

Congratulations, you’re done!#

Submit this assignment by uploading it to the course Desire2Learn web page. Go to the “Pre-class assignments” folder, find the appropriate submission link, and upload it there.

See you in class!

© Copyright 2024, The Department of Computational Mathematics, Science and Engineering at Michigan State University