Homework 3: Analyzing, modeling, and visualizing data with Pandas and Numerically solving ODEs#

✅ Put your name here
Learning Goals#
Content Goals#
Use pandas to work with data and clean it
Make meaningful visual representations of the data
Solve ODEs numerically using solve_ivp
Practice Goals#
Read Python module documentation to learn how to accomplish new things that may be unfamiliar to you
Assignment instructions#
Work through the following assignment, making sure to follow all the directions and answer all the questions.
This assignment is due at 11:59 pm on Friday, October 25. It should be uploaded into the “Homework Assignments” submission folder for Homework #3. Submission instructions can be found at the end of the notebook.
Table of Contents and Grading#
0. Importing the modules you will need for this assignment
1. Reading, describing, and cleaning data (20 points)
2. Exploratory Data Analysis (22 points)
3. Solving ODEs with solve_ivp (28 points)
Total points possible: 70
0. Importing the modules you will need for this assignment#
In this assignment you will be using matplotlib, numpy, pandas, and solve_ivp. Of course, in order to make sure you can use these modules when you need them, you have to import them.
Put the import commands you need to include to be able to use all of the above modules in this notebook in the cell below. Make sure to execute the cell before you move on!
# Put your import commands here
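For reference, one common way to import everything listed above (bringing in solve_ivp directly from scipy.integrate) looks like this:

```python
# Imports for plotting, arrays, dataframes, and ODE solving
import matplotlib.pyplot as plt
import numpy as np
import pandas as pd
from scipy.integrate import solve_ivp
```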
1. Reading, describing, and cleaning data (20 points)#
There have been many layoffs in the world’s technology industry in the last year. In this section, you’ll load and work with some data from these layoffs. There are some blank (NA or NaN) values in the data set, which you’ll have to clean in this section of the homework.
1.1 Read the data (1 point)#
1.1.1 ✅ Task (1 point)#
Read in the data from tech_layoffs.csv into a Pandas dataframe and display the head of the data.
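If you need a reminder of the pattern, here it is on a tiny in-memory stand-in (with the real file you would pass the filename `tech_layoffs.csv` instead of a `StringIO` object; the stand-in rows are made up):

```python
import io

import pandas as pd

# Stand-in CSV text; replace io.StringIO(...) with the actual filename in your code
csv_text = "company,total_layoffs\nC2FO,15\nOtherCo,30\n"
df = pd.read_csv(io.StringIO(csv_text))
df.head()  # displays the first five rows (here, both rows)
```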
## your code here
1.2 Describe the data (3 points)#
1.2.1 ✅ Task (1 point)#
Use describe to display several summary statistics from the data frame.
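On a tiny hypothetical frame (not the layoffs data), describe produces the familiar summary rows:

```python
import pandas as pd

df = pd.DataFrame({"total_layoffs": [10, 20, 30, 40]})
summary = df.describe()  # rows: count, mean, std, min, 25%, 50%, 75%, max
```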
## your code here
1.2.2 ✅ Task (2 points)#
You should see two columns above, total_layoffs and impacted_workforce_percentage. What do the numbers in the 75% row mean, in the context of the whole layoff data set? Describe in everyday terms, avoiding jargon like “quartile” or “percentile.”
✎ Put your answer here
1.3 Display subsets of the data (4 points)#
1.3.1 ✅ Task (1 point)#
Display the industry column on its own by using the column name as an index to the DataFrame.
## your code here
1.3.2 ✅ Task (1 point)#
Using iloc, display the 261st company’s additional_notes. For reference, the 1st company in the data frame is C2FO. Important: where does indexing start in Python?
## your code here
1.3.3 ✅ Task (2 points)#
Using iloc, display company, industry, and headquarter_location columns starting with the company at row index 300, and going up to but excluding row index 320. Note: avoid using a “print” statement to display the data, so that you get the nice Pandas formatting.
## your code here
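If iloc is new to you, here is how its 0-based, end-exclusive positional slicing behaves on a small hypothetical frame (not the layoffs data):

```python
import pandas as pd

df = pd.DataFrame({
    "company": ["A Co", "B Co", "C Co", "D Co"],
    "industry": ["tech", "finance", "tech", "retail"],
    "city": ["Lansing", "Detroit", "Flint", "Troy"],
})

single_cell = df.iloc[2, 1]    # row position 2, column position 1 -> "tech"
subset = df.iloc[1:3, [0, 1]]  # rows 1 and 2 (3 is excluded), first two columns
```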
1.4 Filter the data (7 points)#
1.4.1 ✅ Task (7 points)#
Filter the data frame and put the result into a new data frame named filtered_layoffs. This new data frame should only include the columns company, total_layoffs, impacted_workforce_percentage, and status.
It should also only include Public companies with at least 10 percent of the workforce impacted by the layoffs.
Display the head of this new data frame.
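The general pattern for combining a row filter with a column subset looks something like this, shown on a hypothetical mini-frame that reuses the column names above (the values are made up):

```python
import pandas as pd

df = pd.DataFrame({
    "company": ["X", "Y", "Z"],
    "total_layoffs": [100, 50, 200],
    "impacted_workforce_percentage": [5.0, 15.0, 20.0],
    "status": ["Private", "Public", "Public"],
})

# Combine two boolean conditions with & (each condition wrapped in parentheses)
mask = (df["status"] == "Public") & (df["impacted_workforce_percentage"] >= 10)
filtered = df.loc[mask, ["company", "total_layoffs",
                         "impacted_workforce_percentage", "status"]]
```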
## your code here
1.5 Clean out the NA values (3 points)#
1.5.1 ✅ Task (2 points)#
Get rid of all the rows from filtered_layoffs with NA or NaN values. Put the result into a new dataframe called clean_layoffs. Hint: Pandas has a built-in function that does this! A quick Google search should hopefully point you in the right direction.
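For example, on a small frame with some missing values:

```python
import numpy as np
import pandas as pd

df = pd.DataFrame({"a": [1.0, np.nan, 3.0], "b": [4.0, 5.0, np.nan]})
clean = df.dropna()   # keeps only rows with no missing values
n_rows = len(clean)   # clean.shape[0] works too
```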
## your code here
1.5.2 ✅ Task (1 point)#
Using code, find the number of rows in your new clean_layoffs dataframe.
## your code here
1.6 Reflect on cleaning process (2 points)#
1.6.1 ✅ Task (2 points)#
In this cleaning process, the last step was to drop NA and NaN values. Could we have just done this step at the beginning of the cleaning process (before step 1.4), and gotten the same result? Is there a good reason for leaving this step to the end? If yes, describe what it is. If not, describe why it doesn’t matter. Feel free to experiment with this if you’re not sure and see how the outcome changes!
✎ Put your answer here
2. Exploratory Data Analysis (22 points)#
The first step in every data science project, after obtaining the dataset, is sometimes referred to as Exploratory Data Analysis (EDA). As the name suggests, it consists of exploring your acquired dataset to look for general trends and possible correlations. Basically, you want to make sure you’ve got a basic sense of the properties of the data you are working with!
In this part of the assignment, you are given a dataset containing information about stars. Note that you don’t need to know astronomy to complete this homework assignment! The focus is to explore the dataset and see what sort of information you can pull out of it, even if you don’t have a deep grasp of the discipline from which it came. In the tasks below, you will be asked to explore the data and answer some questions about it.
2.1: Read in the data (2 points)#
✅ Do this (2 points):
Use pandas to read the csv file with the stars’ data, stars.csv, and save it into a dataframe called stars_data. (1 point)
Display the data for the last 15 stars (note: this requires using something opposite to the head() method). (1 point)
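The counterpart of head() is tail(); on a toy frame:

```python
import pandas as pd

df = pd.DataFrame({"n": range(20)})
last_rows = df.tail(3)   # last 3 rows; tail(15) would give the last 15
```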
# Put your code here
# Make sure you use this variable name
stars_data =
Details about this dataset:#
The dataset contains information on 240 stars. The columns are
Temperature (K): Absolute temperature of the star measured in Kelvin
Luminosity (L/Lo): Luminosity of the star relative to the luminosity of the sun \(L_o = 3.828 \times 10^{26}\) Watts (Avg Luminosity of Sun)
Radius (R/Ro): Radius of the star relative to the radius of the sun \(R_o = 6.9551 \times 10^8\) m (Avg Radius of Sun)
Absolute Magnitude (Mv): Absolute Magnitude (which is a measurement of the star’s brightness)
Star type: Type of star; this is a limited dataset with only 6 types = “Red Dwarf”, “Brown Dwarf”, “White Dwarf”, “Main Sequence”, “Super Giants”, “Hyper Giants”
Star color: Color of the star
Spectral Class: Classification of stars (which is another way of categorizing stars based on their observed properties, compared to the “star type”)
2.2 Check for possible correlations (3 points)#
As you can see, there are four columns with numerical values and three columns with string values. Obviously, we can look for correlations only among the first four columns.
✅ Do this (2 points): Display a correlation matrix of the first four columns. Hint: Look up the corr method associated with dataframes.
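On a toy numeric frame, corr returns a square matrix of pairwise Pearson correlation coefficients:

```python
import pandas as pd

df = pd.DataFrame({
    "x": [1.0, 2.0, 3.0, 4.0],
    "y": [2.1, 3.9, 6.2, 7.8],   # roughly 2*x, so highly correlated with x
    "z": [9.0, 1.0, 5.0, 3.0],
})
corr_matrix = df.corr()  # 3x3 matrix; the diagonal is all 1.0
```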
# Put your code here
✅ Answer this question (1 point): Do you notice any high correlation between columns? You may need to remind yourself what numbers represent “high correlation”.
✎ Put your answer here
2.3 Grouping the data and looking for correlations (3 points)#
Although we cannot look for correlations between the values in the last three columns, they still provide important information! Each star is assigned a “type”, “color”, and “spectral class” based on its properties.
For now, we are going to focus on exploring the “type” column, "Star type".
✅ Do this (1 point): Remove the columns "Star color" and "Spectral Class" from the stars_data dataframe and store the result in a new dataframe called stars_data_reduced. (Hint: look up the drop method associated with dataframes or use slicing to select only the columns you want to keep.)
# Put your code here
✅ Run the following cell and then answer the questions below. (This assumes you correctly named your new reduced dataframe.)
stars_data_reduced.groupby(["Star type"]).corr()
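To see what chaining groupby with corr does, here is the same pattern on a small hypothetical two-category frame: the rows are split by category, and a separate correlation matrix is computed for each group.

```python
import pandas as pd

df = pd.DataFrame({
    "type": ["a", "a", "b", "b"],
    "u":    [1.0, 2.0, 3.0, 1.0],
    "v":    [10.0, 20.0, 5.0, 15.0],
})

# One correlation matrix per value of "type", stacked with a MultiIndex
per_group = df.groupby("type")[["u", "v"]].corr()
```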
✅ Do this (2 points):
What is the
groupbymethod doing in the code above? (you may want to refer to the documentation or look up examples online in addition to looking at this output)When looking at the correlations based on star type, do you notice any differences from the original correlation matrix you printed above? Where do you see the strongest correlations now?
✎ Put your answers here
2.4: Visual representation of correlations (10 points)#
The numbers above give us a quantitative measure of the correlations in the dataset. As you can see, Temperature, Luminosity, and Radius become more or less correlated depending on the type of star. While looking at the numbers can be useful, it can be even more helpful to visualize the correlations.
Visualization in data science is an important skill! However, one of the issues we run into from time to time is that we might have several columns of data and traditional plots can only plot two variables at a time. One way to visualize additional variables is to use color, marker shapes, or marker sizes (though using all of these options at the same time can get to be confusing!).
So, one option for plotting three columns from our dataframe at the same time is to use the x-axis, y-axis, and a color axis. You’ve likely seen a plot like this before! For this example, the x-axis and y-axis will contain numerical values, while the color axis will be used for the type of star.
We want to make a plot similar to this one:
✅ Do this (10 points): Use matplotlib to make two scatter plots like the ones above. Each plot has the following characteristics:
The entire figure should be 12 inches wide and 6 inches tall (1 point).
Temperature or Radius on the \(x\) -axis and Luminosity on the \(y\) -axis (2 points)
Make both the \(x\) and \(y\) axis log scaled (1 point).
Label the \(x\)-axis as Temperature (K) and Radius (R/Ro), as appropriate, and the \(y\)-axis as Luminosity (L/Lo) (1 point)
Add each star type to the plot, giving them unique colors, and provide a label so that a legend can be generated. Do your best to avoid “hard-coding” this part, if you can. That is, try to design your code such that it would work regardless of how many star types there were. That said, points will not be lost as long as you end up making the required plots following the specifications stated here. (4 points)
Show the legend (1 point).
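One way to avoid hard-coding the star types is to loop over the groups that groupby produces, sketched here on a hypothetical stand-in frame (the actual column names in stars.csv may differ slightly):

```python
import matplotlib
matplotlib.use("Agg")  # non-interactive backend so this runs in scripts
import matplotlib.pyplot as plt
import pandas as pd

# Hypothetical stand-in for stars_data
df = pd.DataFrame({
    "Temperature (K)":   [3000, 5000, 10000, 25000],
    "Luminosity (L/Lo)": [0.01, 1.0, 100.0, 10000.0],
    "Star type": ["Red Dwarf", "Main Sequence", "Main Sequence", "Super Giants"],
})

fig, ax = plt.subplots(figsize=(6, 6))
for star_type, group in df.groupby("Star type"):
    # One scatter call (and so one color and one legend entry) per star type
    ax.scatter(group["Temperature (K)"], group["Luminosity (L/Lo)"], label=star_type)
ax.set_xscale("log")
ax.set_yscale("log")
ax.set_xlabel("Temperature (K)")
ax.set_ylabel("Luminosity (L/Lo)")
ax.legend()
```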
Side Notes
The above plot on the left is called the Hertzsprung-Russell diagram. For details as to what the Hertzsprung-Russell diagram is, you can take a look here. Basically, this is a plot that helps astronomers understand how stars evolve. Over time, individual stars will track out a path on this diagram and when looked at for a whole group of stars, we can use this plot to estimate how old that population of stars might be. If you were to compare your plot with any of the HR diagrams on the internet you would notice that your plot is flipped! This is because the Temperature increases towards the left in those plots. If you wanted to (but it is not required), you could fix this by adding this line at the end of your code.
plt.gca().invert_xaxis()
You can choose any color scheme you want as long as it clearly differentiates the types of stars (which means avoiding any two colors that are too similar). The color scheme used in the example plot is given below in the scolor list, if you want to use it.
# A possible color scheme that matches the one above.
# You don't have to choose this, you can choose any color scheme you want.
# If you are interested in knowing more about this color scheme Google "Okabe-Ito color palette"
scolor = ["#CC79A7", "#D55E00", "#0072B2", "#F0E442", "#009E73", "#56B4E9", "#E69F00","#000000"]
# Put your code here
2.5 Interpreting the plots (4 points)#
✅ Answer the following (4 points):
Based on the visualizations, which star type shows the strongest correlation between Luminosity and Temperature (and Radius)? (1 point)
Now that you’ve looked at the two plots, which of the two plots most likely motivated astronomers to use the names: “Red Dwarf”, “Brown Dwarf”, “White Dwarf”, “Main Sequence”, “Super Giants”, “Hyper Giants”? What motivates your answer? (1 point)
Just looking at the main sequence stars, which of the two plots has the steeper slope? Justify your answer and look carefully at the range of values the data span. Do your best to estimate this from the plot. If you want, you can try to confirm your hunch by using the actual data values to calculate this. (2 points)
✎ Put your answer here
3. Solving ODEs with solve_ivp (28 total points)#
✅ 3.1 (10 points): As you might have heard at some point, it was discovered that various water sources in Michigan have been contaminated with high levels of PFAS chemicals (https://www.mlive.com/news/index.ssf/page/michigans_water_crisis_pfas.html). In fact, there was a symposium taking place in Lansing about this very issue while I was crafting this homework assignment. Many of these chemicals are introduced by manufacturing companies improperly disposing of the by-products of the manufacturing process.
If you consider the network of water sources in the state of Michigan (i.e. ponds, lakes, rivers, ground water), you might begin to understand how the introduction of a chemical into one location could lead to an eventual contamination of the entire system – eventually even our wonderful Lake Michigan could be impacted!
Let’s take this scenario as our context for thinking about how one could build a compartmental model to study the movement of a pollutant through a local pond system where an industrial factory might be dumping PFAS chemicals. Let’s say that we have three ponds, \(A\), \(B\), and \(C\), connected via streams in the following way, with the factory dumping pollutants at a rate \(p\) into Pond \(A\):
We can then define a set of ordinary differential equations that describe how the pollutant moves through the system, which might look something like this:
where \(p\) is the rate at which the pollutant is dumped into Pond \(A\) and \(q\), \(r\), and \(s\) are the “per minute” pollutant flow rates out of Ponds \(A\), \(B\), and \(C\), respectively. In this model, we’re using \(A\), \(B\), and \(C\) to represent how many pounds of pollutant are currently in the corresponding pond.
Solve this pond pollution system using solve_ivp() and assume that all of the ponds start with zero pollutant, so the initial conditions are:
\(A(0) = 0\)
\(B(0) = 0\)
\(C(0) = 0\)
Assume that the pollutant is dumped into Pond \(A\) at a rate, \(p = 0.125\) lb/min. Let \(q = r = s = 0.001\).
Run the model for 48 hours (make sure to convert this to minutes!) and use a timestep of \(\Delta t = 1\) minute.
Note: at the end of this part of the assignment you should have called solve_ivp but haven’t been asked to plot the results yet!
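As a reminder of the solve_ivp calling pattern (shown here for a simplified single-pond model, dA/dt = p − qA, not the full three-pond system you are asked to solve), using the rates given above:

```python
import numpy as np
from scipy.integrate import solve_ivp

def single_pond(t, y, p, q):
    """Pollutant enters at rate p (lb/min) and leaves at rate q*A (per minute)."""
    A = y[0]
    return [p - q * A]

p, q = 0.125, 0.001
t_end = 48 * 60                      # 48 hours converted to minutes
t_eval = np.arange(0, t_end + 1, 1)  # report the solution every minute

sol = solve_ivp(single_pond, [0, t_end], [0.0], t_eval=t_eval, args=(p, q))
```

Your version will need a derivative function returning three rates of change and a three-element initial condition list.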
# Put your code here
✅ 3.2 (6 points): Now that you have a solution to the model, make a single plot that displays the evolution of all three variables, A, B, and C, as a function of time (in hours). Each line should be a different color, and your plot should contain an \(x\)-axis label, a \(y\)-axis label, and a legend.
### Put your code here
✅ 3.3 (2 points): How are the amounts of the pollutant in the ponds evolving at late times? Does this behavior match your expectations? Why or why not?
✎ Put your answer here
✅ 3.4 (3 points): Make a second plot where you zoom in on just the first ten hours of the evolution of the system. Ensure that all of the same labels are included. Note: This should not involve re-running your model, but instead should involve plotting a subset of all of the values using NumPy array indexing.
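Selecting the first ten hours from an already-computed solution is just boolean (or slice) indexing on the time array, illustrated here with hypothetical stand-in arrays:

```python
import numpy as np

t_minutes = np.arange(0, 2881)   # stand-in for sol.t
values = 0.1 * t_minutes         # stand-in for one row of sol.y
mask = t_minutes <= 10 * 60      # first ten hours, in minutes
t_sub, v_sub = t_minutes[mask], values[mask]
```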
### Put your code here
✅ 3.5 (1 point): At what point in time does the amount of pollutant in Pond \(B\) begin to diverge noticeably from that of Pond \(C\)?
✎ Put your answer here
✅ 3.6 (6 points): Using the final values from the first 48 hours as a starting point (you should be able to extract these from your solution array from solve_ivp), run the model for another 48 hours but assume that the factory stops dumping pollutant into Pond \(A\) (p = 0). Make a plot of your results from the next 48 hours. Comment on what you observe. Do the results match your expectations? Why or why not?
# Put your code here
✎ Put your comments on the results here
Congratulations, you’re done!#
Submit this assignment by uploading it to the course Desire2Learn web page. Go to the “Homework Assignments” section, find the submission folder link for Homework #3, and upload it there.
© 2024 Copyright the Department of Computational Mathematics, Science and Engineering at Michigan State University.