Day 9 In-class Assignment: Exploring Great Lakes Water Levels using NumPy#
✅ Put your name here
✅ Put names of your group members
In today’s activity, we’re going to use NumPy and Matplotlib to explore data on the water levels of the Great Lakes.
In recent years, rapid changes in Great Lakes water levels have led to flooding. These changes have also driven speculation that they are caused by extreme precipitation events brought on by climate change.

Today we will examine over 100 years of water level measurements for each Great Lake, using data collected by the National Oceanic and Atmospheric Administration (NOAA). You can find this dataset here: https://www.glerl.noaa.gov/data/wlevels/#observations. This also gives us an opportunity to practice working with numpy arrays and matplotlib!
Learning Goals:#
By the end of this assignment you should be able to:
Load data using NumPy so that you can visualize it using matplotlib
Work with NumPy “array” objects to compute simple statistics using built-in NumPy functions.
Use NumPy and matplotlib to look for correlation in data
Part 1: Using NumPy to explore the water level history of the Great Lakes#
# Although there are some exceptions, it is generally a good idea to keep all of your
# imports in one place so that you can easily manage them. Doing so also makes it easy
# to copy all of them at once and paste them into a new notebook you are starting.
# Bring in NumPy and Matplotlib
import numpy as np
import matplotlib.pyplot as plt
To use this notebook for your in-class assignment, you will need these files:
lake_michigan_huron.csv
lake_superior.csv
lake_erie.csv
lake_ontario.csv
These files are supplied along with this notebook on the course website. You will read data from those files and it is important that the files are in the same place as the Jupyter notebook on your computer. Work with other members of your group to be sure everyone knows where the files are.
Take a moment to look at the contents of these files with an editor on your computer. For example, *.csv files open with Excel or, even better, you can look at them with a simple text editor like Notepad or TextEdit, or just try opening them inside your Jupyter Notebook interface.
As you saw in the pre-class assignment, you can use the command below to load in the data.
# use NumPy to read data from a csv file
# second, better example of loadtxt()
mhu_date, mhu_level = np.loadtxt("lake_michigan_huron.csv", usecols = [0,1], unpack=True, delimiter=',') # example for the lake_michigan_huron.csv file
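To see what the usecols, unpack, and delimiter arguments do, here is a minimal sketch that reads a tiny made-up CSV from memory instead of from one of the lake files. The values and the names fake_csv, col_a, and col_b are invented for illustration only.

```python
import io
import numpy as np

# A tiny, made-up CSV with three columns and no header row (illustration only).
fake_csv = io.StringIO("1.0,10.0,100.0\n2.0,20.0,200.0\n3.0,30.0,300.0\n")

# usecols=[0, 1] keeps only the first two columns;
# unpack=True returns one array per kept column instead of a single 2D array.
col_a, col_b = np.loadtxt(fake_csv, usecols=[0, 1], unpack=True, delimiter=",")

print(col_a)  # first column
print(col_b)  # second column
```

Without unpack=True, the same call would return a single array with one row per line of the file, and you would have to slice out the columns yourself.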
Once you have your data, it is always a good idea to look at some of it to be sure it is what you think it is. You could use a print statement, or just type the variable name in an empty cell.
mhu_date
array([1918. , 1918.08333333, 1918.16666667, ..., 2020.75 ,
2020.83333333, 2020.91666667])
✅ What do you think this data represents?
✎ Put your answer here
✅ Next, write some code in this cell to read the data from the other files. Use descriptive variable names to store the results.
# Read in data from the remaining files.
# Print some of the values coming in from the files to ensure they look fine.
✅ Question: Before you move on, what is the variable type of the lake data you’ve loaded? Use the type() function to check on the mhu_date and mhu_level variables. Does this match your expectations?
✎ Put your answer here
Part 2: Descriptive Statistics of Data Sets#
✅ Now that you have read in the data, use NumPy’s statistics operations from the pre-class to compare various properties of the water levels for all of the lakes.
mean
median
standard deviation
2.1 Means and Medians#
What is the mean of a data set?
The mean, also referred to as the average, is calculated by adding up all of the observations in a data set, and dividing by the number of observations in the data set.
\(mean = \frac{\sum\limits_{i=1}^{N} x_i}{N}\), \(N\) = number of observations, \(x_i\) = an observation
More simply put,
mean = sum of observations / number of observations
The mean of a data set is useful because it provides a single number to describe a dataset that can be very large. However, the mean is sensitive to outliers (observations that are far from the mean), so it is best suited for data sets where the observations are close together.
What is the median of a data set?
The median of a data set is the middle value of a data set, or the value that divides the data set into two halves. Finding the median requires the data to be sorted from least to greatest. If the number of observations is odd, then the median is the middle value of the sorted data set. If the number of observations is even, then the median is the average of the two middle numbers.
Unlike the mean, because the median is the midpoint of a data set, it is not strongly affected by a small number of outliers.
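The two definitions above can be sketched from scratch in a few lines. This is an illustrative sketch only: the function names mean_by_hand and median_by_hand and both small lists are made up, and the second list differs from the first only in that its largest value is an outlier.

```python
def mean_by_hand(values):
    """Sum of observations divided by the number of observations."""
    return sum(values) / len(values)


def median_by_hand(values):
    """Middle value of the sorted data (average of the two middle values if N is even)."""
    ordered = sorted(values)
    n = len(ordered)
    middle = n // 2
    if n % 2 == 1:
        return ordered[middle]
    return (ordered[middle - 1] + ordered[middle]) / 2


typical = [2, 3, 3, 4, 5]
with_outlier = [2, 3, 3, 4, 50]  # same data, but the largest value is now an outlier

print(mean_by_hand(typical), median_by_hand(typical))
print(mean_by_hand(with_outlier), median_by_hand(with_outlier))
```

Comparing the two printed lines shows how a single extreme value shifts the mean while leaving the median where it was.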
2.1.1 Using numpy to Calculate Mean and Median#
We could calculate the mean and median from scratch, but Python packages already provide functions that do these calculations for you. Because the mean and median are so useful, many Python packages include functions to calculate them; here we are going to use the ones from numpy.
The documentation for numpy mean shows additional options, but the basic use is:
np.mean(data)
Similarly, the median is calculated by:
np.median(data)
In the cell below, calculate the mean and median of the water level for each of the four lakes using the numpy functions.
# Put your code here
2.2 Computing standard deviation by hand#
One way to describe the distribution of a data set is the standard deviation. The standard deviation is the square root of the variance of the data. The variance is a measure of how “spread out” the distribution of the data is. More specifically, it is the average of the squared differences between each observation in a data set and the mean. The standard deviation is often represented with the Greek letter sigma (\(\sigma\)).
✅ Fix the following function which is supposed to take in a list of values and calculate the standard deviation using only basic python functions. The function is already written but it doesn’t quite work. Run the cell to see. Here’s the equation for standard deviation:
\[ \sigma = \sqrt{\frac{\sum\limits_{i=1}^{N} (x_{i}-\mu)^2}{N}} \]
where the symbols in this equation represent the following:
\(\sigma\): Standard Deviation
\(\mu\): Mean
\(N\): Number of observations
\(x_{i}\): the value of dataset at position \(i\)
You may want to check in with your group to make sure you understand the notation in this equation!
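As a worked example of the equation, take the made-up data set \(x = \{2, 4, 4, 4, 5, 5, 7, 9\}\), so \(N = 8\). These numbers are chosen only for illustration and are not from the lake data. First compute the mean:

\[ \mu = \frac{2+4+4+4+5+5+7+9}{8} = \frac{40}{8} = 5 \]

Then add up the squared differences from the mean:

\[ \sum\limits_{i=1}^{N} (x_{i}-\mu)^2 = 9 + 1 + 1 + 1 + 0 + 0 + 4 + 16 = 32 \]

Finally, divide by \(N\) and take the square root:

\[ \sigma = \sqrt{\frac{32}{8}} = \sqrt{4} = 2 \]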
# Fix this function to make sure it correctly calculates the standard deviation
def std(vals):
    length = len(vals)
    mean = sum(vals)
    diffs = []
    for i in range(length):
        diffs.append(vals[i] - mean)
    return sum(diffs)**0.5
✅ Check your function for accuracy
Call your function using the variable test_list (provided below) as the input and compare your function’s output with that of np.std() to make sure you calculated standard deviation correctly.
import matplotlib.pyplot as plt
import numpy as np
test_list = [1,3,5,10,15,5]
# Put your code for comparing the answers here
2.3 Calculating Standard Deviation with Numpy#
Similarly to np.mean() and np.median(), we can calculate the standard deviation with numpy as follows:
np.std(data)
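One detail worth knowing: by default, np.std() divides by \(N\), which matches the equation in section 2.2. The optional ddof argument changes the divisor to \(N\) minus ddof, which is how the "sample" standard deviation is obtained. A short sketch on made-up numbers (not the lake data):

```python
import numpy as np

# Made-up data for illustration (not the lake data).
data = np.array([2, 4, 4, 4, 5, 5, 7, 9])

population_std = np.std(data)      # divides by N (the default, ddof=0)
sample_std = np.std(data, ddof=1)  # divides by N - 1

print(population_std)
print(sample_std)
```

If a hand-written function and np.std() disagree slightly, a mismatch in the divisor is one thing to check.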
In the cell below, calculate the standard deviation of the water level for each of the four lakes using the numpy function.
# put your code here
2.4 Visualizing the Data with matplotlib#
✅ Now, let’s see what is in the files by plotting the second column versus the first column using matplotlib. This means that the second column should go on the y-axis and the first column should go on the x-axis.
Do this for all of the files.
This is our first example of doing some (very simple!) data science: looking at some real data. As a reminder, the data came from NOAA (the link is at the top of this notebook); if you ever find data like this in the real world, you could build a notebook like this one to examine it. In fact, your projects at the end of the semester might be much larger versions of this.
# plot the water levels here
Plots like this are not very useful. If you showed them to someone else, they would have no idea what is in them. In fact, if you looked at them next week, you might not remember what is in them yourself. Let’s use a little more matplotlib to make them of professional quality. There are two things that every plot should have: a label on each of its two axes. Beyond that, there are many other options:
✅ First, remake separate figures for each of the datasets you read in and include in the plots: \(x\)-axis labels, \(y\)-axis labels, grid lines, markers, and a title.
✅ Then, make all of them in the same plot using the same formatting techniques you used in the separate plots, but also add a legend.
We are not going to tell you how to do this directly! But, we’re here to help you to figure it out. If you find yourself waiting for help from an instructor, you can also try using Google to answer your questions. Searching the internet for coding tips and tricks is a very common practice!
The Python community also provides helpful resources: they have created a comprehensive gallery of just about any plot you can think of, with an example and the code that goes with it. That gallery is here and you should be able to find many examples of how to make your plots look professional. (You just might want to bookmark that webpage…)
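As a starting point rather than a finished answer, here is a minimal sketch of the formatting options listed above, applied to made-up curves instead of the lake data. The data, labels, and marker choices are all assumptions for illustration; the calls themselves (set_xlabel, set_ylabel, set_title, grid, legend) are standard Matplotlib.

```python
import numpy as np
import matplotlib.pyplot as plt

# Made-up data for illustration (not the lake data).
x = np.linspace(0, 10, 25)
y_sin = np.sin(x)
y_cos = np.cos(x)

fig, ax = plt.subplots()

# The label given to each line is what the legend displays.
ax.plot(x, y_sin, marker="o", label="sin(x)")
ax.plot(x, y_cos, marker="s", linestyle="--", label="cos(x)")

ax.set_xlabel("x (arbitrary units)")
ax.set_ylabel("y (arbitrary units)")
ax.set_title("Two made-up curves on one set of axes")
ax.grid(True)
ax.legend()
```

The same options are also available through the pyplot interface (for example plt.xlabel() and plt.legend()), so use whichever style your group is more comfortable with.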
# Put your code here to make each plot separately. You might need to create multiple notebook cells or use "subplot"
# Make sure they are professionally constructed using all of the options above.
# Make another plot here with all the data in the same plot and include a legend
# It still might not be super useful, but at least with a legend you can tell which line is which!
✅ What observations about the data do you have? Are the lake levels higher or lower than they have been in the past? Put your answer here:
✎ In the data I see….
Part 3: Looking for correlations in data (Time Permitting)#
In the plots you have made so far you have plotted water levels versus time. This is fairly intuitive and corresponds to the way the data was given to us. Next, we are going to do something a little more abstract to seek correlations in the data, a standard goal in data science. As you have seen, there are a lot of fluctuations in the data - what do they tell us? For example, do the levels go up at certain times of year? In certain years that had more rain? Can we see evidence of global warming? While we won’t answer these questions at this point, we can look for patterns across the lakes to see if the fluctuations in levels might correspond to trends. To do this, we will plot the level of one lake versus the level of another lake (we will not plot either level against time). Note that we somewhat lose the time information because we aren’t using that array anymore.
✅ In the cell below, plot the level of one lake versus the level of another lake (time should not be involved in your plot command) - do this for several combinations. Put them in separate cells if you need to - otherwise each will be in the same plot, which might be less useful. (If you’re feeling comfortable using subplot feel free to use that.)
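If you would like to try subplot-style layouts, the sketch below shows one way to place two "one variable versus another" plots side by side. Everything here is made up for illustration: the arrays a, b, and c are synthetic (b is built to follow a closely, c is unrelated noise), and the labels and marker are arbitrary choices.

```python
import numpy as np
import matplotlib.pyplot as plt

rng = np.random.default_rng(0)  # fixed seed so the made-up data is reproducible

# Made-up series (not the lake data): b tracks a closely, c is unrelated noise.
a = np.linspace(0, 1, 50)
b = a + rng.normal(0, 0.05, size=a.size)
c = rng.normal(0, 1, size=a.size)

# One row, two columns of axes in a single figure.
fig, (ax_left, ax_right) = plt.subplots(1, 2, figsize=(10, 4))

ax_left.scatter(a, b, marker=".")
ax_left.set_xlabel("a (made-up units)")
ax_left.set_ylabel("b (made-up units)")
ax_left.set_title("b versus a")
ax_left.grid(True)

ax_right.scatter(a, c, marker=".")
ax_right.set_xlabel("a (made-up units)")
ax_right.set_ylabel("c (made-up units)")
ax_right.set_title("c versus a")
ax_right.grid(True)

fig.tight_layout()  # keep the labels of the two panels from overlapping
```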
# add your plots here (with labels, titles, legend, grid)
# what line type should you use? what are the best markers to use?
# next lake here, and so on.....
✅ In this cell, write your observations. What do you observe about the lake levels?
✎ I observed…..
3.2 Pearson Correlation Coefficient with np.corrcoef()#
Now that you have made some qualitative observations of the water levels, we are going to explore how to quantify those observations. There are many ways to measure correlation, but we are going to use the Pearson Correlation Coefficient (also referred to as “r”, “\(\rho\)”, or “correlation coefficient”). The Pearson Correlation Coefficient ranges from -1 to 1 and provides a measure of how strongly two variables are linearly related to one another.
If the correlation coefficient is close to 1 or -1, then the correlation is strong; if it is close to zero, the correlation is weak. If the correlation coefficient is negative, then the y values tend to decrease as x increases. If the correlation coefficient is positive, then the y values tend to increase as x increases. See the image below for a visualization of this!
Using np.corrcoef()#
To calculate the correlation coefficient, we are going to use another numpy function, np.corrcoef(). To use np.corrcoef(), you will call it with your x and y variables like this:
np.corrcoef(x,y)
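It will return all of the possible correlation coefficients as a 2×2 array. The diagonal entries are the correlation of each variable with itself, and the off-diagonal entries are the correlation between x and y.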
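To make the shape of that result concrete, here is a small sketch on made-up data (not the lake data). The arrays y_up and y_down are built as exact straight-line functions of x, so their correlations with x sit at the two extremes of the range.

```python
import numpy as np

# Made-up data for illustration (not the lake data).
x = np.array([1.0, 2.0, 3.0, 4.0, 5.0])
y_up = 2 * x + 1      # increases exactly linearly with x
y_down = -3 * x + 10  # decreases exactly linearly with x

matrix = np.corrcoef(x, y_up)
print(matrix.shape)   # number of rows and columns in the result
print(matrix[0, 1])   # off-diagonal entry: correlation between x and y_up

# Indexing with [0, 1] pulls out the single number you usually want.
r_down = np.corrcoef(x, y_down)[0, 1]
print(r_down)         # correlation between x and y_down
```

Real data will rarely fall exactly on a line, so expect values strictly between -1 and 1.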
✅ In the cell(s) below, choose two of the plots you made in the beginning of Part 3 and calculate the correlation coefficient on the data from each plot. Were there any differences between your qualitative observations and your quantitative calculations?
# put your code here
✎ I observed…..
ASIDE: Saving Plots#
Finally, you will need to use your plots for something. In your other classes and labs you often will need to make plots for your assignments and lab reports - now is the time to start using Python for that! Modify the code above to write the plot into a file in PNG format. Here are a couple of examples for how you can save files as a PNG file and as a PDF file:
plt.savefig('foo.png')
plt.savefig('foo.pdf')
Put your name in the filename so that we can keep track of your work.
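A slightly fuller sketch of saving a figure is shown here, assuming a made-up curve and a hypothetical filename. The dpi and bbox_inches arguments are optional extras: dpi controls the resolution of the PNG, and bbox_inches="tight" trims extra whitespace around the plot.

```python
import os
import numpy as np
import matplotlib.pyplot as plt

# Made-up data for illustration.
x = np.linspace(0, 1, 20)

fig, ax = plt.subplots()
ax.plot(x, x**2, marker="o")
ax.set_xlabel("x")
ax.set_ylabel("x squared")

# Hypothetical filename; replace "yourname" with your own name.
filename = "example_plot_yourname.png"

# plt.savefig() writes the current figure, so call it after all plotting
# and labeling commands for that figure.
plt.savefig(filename, dpi=150, bbox_inches="tight")

print(os.path.exists(filename))  # confirms that the file was written
```

The file is written to the folder the notebook is running in unless the filename includes a path.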
Assignment wrapup#
Please fill out the form that appears when you run the code below. You must completely fill this out in order to receive credit for the assignment!
from IPython.display import HTML
HTML(
"""
<iframe
src="https://cmse.msu.edu/cmse201-ic-survey"
width="800px"
height="600px"
frameborder="0"
marginheight="0"
marginwidth="0">
Loading...
</iframe>
"""
)
Congratulations, you’re done!#
Submit this assignment by uploading it to the course Desire2Learn web page. Go to the “In-class assignments” folder, find the appropriate submission folder, and upload it there. Make sure to upload your plot images as well!
If the rest of your group is still working, help them out and show them some of the things you learned!
See you next class!
© Copyright 2024, The Department of Computational Mathematics, Science and Engineering at Michigan State University