Day 13: Pre-class Assignment: Visualizing data with Seaborn and using masks in NumPy

✅ Put your name here


Goals for today’s pre-class assignment

  • Use the seaborn module to visualize data

  • Practice using masks with NumPy arrays

Assignment instructions

This assignment is due by 7:59 p.m. the day before class, and should be uploaded into the appropriate “Pre-class assignments” submission folder. If you run into issues with your code, make sure to use Slack to help each other out and receive some assistance from the instructors. Submission instructions can be found at the end of the notebook.

Useful reference

For this pre-class assignment, you may find parts of the seaborn tutorial page useful.

Part 1: Visualizing data with Seaborn

When doing data science, visualizing your results can often be just as important as doing the analysis. After all, if you can’t communicate your results effectively, what good was all the work that you did?

Thus far in the course we’ve used standard matplotlib functions to visualize our results, but there is a handy package called Seaborn that tries to improve upon the basic aesthetics of matplotlib while also adding some specific functionality for doing statistical analysis of data.

In this assignment, you’re going to test out how to use seaborn to make plots and change the way they look.

The first thing we need to do is import the seaborn module:

# Import matplotlib and make sure plots will show up in the notebook
import matplotlib.pyplot as plt
%matplotlib inline

# import seaborn  ("sns" is the commonly used variable name for seaborn)
import seaborn as sns

Of course, if we want to make some nice new visualizations, we need some data to plot. Let’s revisit the Great Lakes data from a previous in-class assignment! However, instead of loading the .csv files with NumPy, let’s use Pandas!

# import pandas
import pandas as pd

# Load up the Lake Ontario data
# (remember, you need to have a copy of the data in the same location as this notebook)
ont = pd.read_csv("ont.csv")

# Display the first few rows to check what was loaded
ont.head()

Hmm, why do we have two “Unnamed” columns? Take a look at the original .csv file and determine why this is happening. Since carrying around columns that are full of “NaN”s isn’t particularly useful, we’re going to use Pandas to “drop” those columns so that the dataframe no longer contains them:

# Drop the unnamed columns; inplace=True modifies ont directly instead of returning a new dataframe
ont.drop(columns=["Unnamed: 2", "Unnamed: 3"], inplace=True)

This looks much better!
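As a minimal sketch of the two ways drop() can be used, the example below works on a small made-up dataframe. The names toy, toy_clean, and toy_inplace and the column labels are hypothetical and are not part of the lake data.

```python
import numpy as np
import pandas as pd

# Hypothetical toy data (not the lake files): one useful column, one column of NaNs
toy = pd.DataFrame({"year": [2000, 2001, 2002],
                    "empty": [np.nan, np.nan, np.nan]})

# Option 1: drop() returns a new dataframe and leaves the original untouched
toy_clean = toy.drop(columns=["empty"])

# Option 2: inplace=True modifies the dataframe directly and returns None
toy_inplace = toy.copy()
result = toy_inplace.drop(columns=["empty"], inplace=True)

print(list(toy.columns))          # original still has both columns
print(list(toy_clean.columns))    # new dataframe without the NaN column
print(list(toy_inplace.columns))  # modified copy without the NaN column
print(result)                     # None, because the change happened in place
```

Either approach works; the key point is that without inplace=True you must assign the result to a variable to keep it.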

Do the same thing with the other lake files: eri.csv, sup.csv, and mhu.csv. Make sure the dataframes have all of the right information in them and no undesired columns.

Note: you might notice that the Lake Erie file is missing a column label and that you end up with one extra “Unnamed” column, but instead of being full of NaNs, it’s the column with the actual lake levels. See if you can figure out how to rename the “Unnamed” column using the rename() function. Remember that you need to store the new dataframe that is created by the rename function.
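The rename() pattern can be sketched on a small made-up dataframe. The column labels "Unnamed: 1" and "level" here are hypothetical placeholders, so check the actual labels in your own dataframe before adapting this.

```python
import pandas as pd

# Hypothetical toy data: the second column lost its label when the file was read
toy = pd.DataFrame({"year": [2000, 2001], "Unnamed: 1": [174.1, 174.3]})

# rename() takes a dictionary mapping old labels to new labels and
# returns a new dataframe, so assign the result to keep it
toy_renamed = toy.rename(columns={"Unnamed: 1": "level"})

print(list(toy.columns))          # the original labels are unchanged
print(list(toy_renamed.columns))  # the renamed labels
```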

# Put your code here for reading in the other lake files using Pandas

Making seaborn plots

Now that you have all the data loaded up, let’s try making some of the same plots we made for the Great Lakes data in class.

First, we’ll make a plot of the Lake Ontario levels as a function of time:

sns.set()
plt.plot(ont['Lake Ontario annual averages'], ont['AnnAvg'])

You should get a plot that looks familiar but a bit different from a normal matplotlib plot.

Question: What is the sns.set() line doing? You might want to refer to the seaborn documentation. Note that the set() command only needs to be called once for the effect to be applied to all future plots.

Put your answer here

Task: Add some axis labels to the plot and then make similar plots for each of the Lake data files.

# Put your code here and create new cells as necessary

Task: Now, make a scatter plot of the Lake Ontario levels versus the Lake Erie levels and give it appropriate axis labels.

# Put your code here

Task: It turns out that seaborn offers more than one sort of plot style and you can change the style using set_style(). Refer to the documentation to see what the different options are and then try them out with your scatter plot. You should also test out the despine() function and see what that does.

# Change the seaborn style here and remake your scatterplot
# test the "despine" function as well
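For orientation, here is a minimal sketch of how set_style() and despine() fit together. It uses made-up random numbers in place of the lake levels, and the variable names are illustrative assumptions only.

```python
import matplotlib.pyplot as plt
import numpy as np
import seaborn as sns

# Hypothetical random data standing in for the lake levels
rng = np.random.default_rng(0)
x = rng.normal(size=50)
y = x + rng.normal(scale=0.5, size=50)

# Pick one of the built-in styles: darkgrid, whitegrid, dark, white, ticks
sns.set_style("whitegrid")

fig, ax = plt.subplots()
ax.scatter(x, y)
ax.set_xlabel("x (made-up values)")
ax.set_ylabel("y (made-up values)")

# despine() hides the top and right axis borders (spines) by default
sns.despine(ax=ax)
```

Note that set_style() affects plots created after it is called, so the style is set before the figure is made.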

Question: Which style do you prefer? When might you use one style versus another?

Put your answer here

Seaborn can also be really useful when you need to modify a plot to show up better in a presentation, on a poster, or in a paper. This can be done using the set_context() function.

Task: Refer to the documentation and make a new version of your scatter plot using each available context.

# Put your code here and create new cells as necessary
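One way to see what a context changes before remaking any plots is to inspect the settings that seaborn would apply. The sketch below compares the base font size for each built-in context; the dictionary name sizes is just an illustrative choice.

```python
import seaborn as sns

# plotting_context(name) returns the settings a context would apply,
# without changing the current plotting state
sizes = {name: sns.plotting_context(name)["font.size"]
         for name in ["paper", "notebook", "talk", "poster"]}
print(sizes)

# set_context() applies a context to all plots created afterwards
sns.set_context("talk")

# Switch back to the default context so later plots are unaffected
sns.set_context("notebook")
```

The contexts scale fonts and line widths together, which is why the same plot can be reused for a paper, a talk, or a poster.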

Finally, as mentioned earlier, Seaborn can be used to do data analysis as well. For example, you can use the jointplot() function to make a scatter plot similar to the one that you just made with matplotlib, but with additional information about the distribution of the data along each axis.

Task: Using the same data values that you used for the previous scatter plot, create a Seaborn jointplot. Try using the kind argument to add a best fit line to the data. The bars on the top and side of the plot show the histogram of the values for each lake.

# Put your jointplot code here

Task: Try out jointplot with other combinations of lake data as well.

# Put some more example plots here

Part 2: Masking

Masking is an extremely important and absurdly useful tool that we will be adding to our coding toolbox. Fundamentally, “masking” is a process that allows us to select specific parts of our data that meet some condition. Let’s work through some examples to better understand what this means.

Let’s start by looking at a random set of numbers.

import numpy as np
vals = np.array([3, 11, 6, 9, 7, 12, 8, 11, 5, 3, 15, 13])

Task: Write a piece of code that uses a for loop and an if statement to identify all values of vals that are above 8. Append the values that meet this condition to a new list and print them out.

# Insert your code here

What we just did was select a subset of our data; specifically, all values in vals greater than 8. Now, let’s do the same thing using masking. The code below creates a mask and then uses that mask to select all of the data points that meet our condition.

mask = vals > 8
vals_masked = vals[mask]

Task: Let’s break this code down, piece by piece. First, write a piece of code that prints mask.

# Insert your code here

Task: You should see an array of Boolean values (True or False). Compare your mask values to the values of vals. What do you notice?

Write your answer here

Task: Now print out the values of vals_masked.

# Insert your code here

Task: What values does vals_masked contain? How are these values connected to vals, mask, and the list that you created earlier with the for loop?

Write your answer here

Task: Try tinkering with the mask generation code (the first line of the masking cell above, mask = vals > 8) by changing the condition (e.g., values below 8, values above 11, etc.). Print out the mask and vals_masked values for each one to convince yourself that the “masks” you create match your expectations.

# Insert code here
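To tie the loop and mask approaches together, here is a small sketch on a separate made-up array. The names temps, warm_loop, warm_mask, and warm_vals are hypothetical and chosen only for this illustration.

```python
import numpy as np

# Hypothetical values, different from vals, used only for illustration
temps = np.array([12, 25, 18, 30, 7, 22])

# Loop version: check each value one at a time
warm_loop = []
for t in temps:
    if t >= 20:
        warm_loop.append(t)

# Mask version: the comparison is applied to every element at once
warm_mask = temps >= 20        # Boolean array with the same length as temps
warm_vals = temps[warm_mask]   # keeps only the entries where the mask is True

print(warm_mask)
print(warm_vals)
print(warm_mask.sum())         # True counts as 1, so this is the number of matches
```

Both versions select the same values; the mask does it without an explicit loop and keeps the result as a NumPy array.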

Task: Take a moment and reflect. What is a mask? What does the process of masking data or values do?

Write your answer here

Task: Before moving on, explain the concept of masking as if you were talking to someone who has never coded before.

Write your answer here

Assignment wrap-up

Please fill out the form that appears when you run the code below. You must completely fill this out in order to receive credit for the assignment!

from IPython.display import HTML

Congratulations, you’re done!

Submit this assignment by uploading it to the course Desire2Learn web page. Go to the “Pre-class assignments” folder, find the appropriate submission link, and upload it there.

See you in class!

© Copyright 2023, The Department of Computational Mathematics, Science and Engineering.