Week 01 In Class Assignment: Covid modeling


✅ Put your name here.

✅ Put your group member names here.


Welcome to your first ICA!

In this first ICA you will work through the process of completing a project with your group and learn a little about curve fitting. Most of the coding has been done for you: you need to understand the code, get it to work on your machine, and be able to make modifications to it. What you choose to do is open-ended.

You have one hour to complete this assignment. After one hour we will choose a few groups randomly to present what they did. Make nice plots and compose a narrative/story with your group in case you are picked.

There are many ways we could make predictions from data, and we’ll see some very powerful ways of doing this later in the semester. In particular, we want to predict the future of covid. That is, we want to forecast what covid will do for the rest of the semester based on recent data.

For today, let’s do something very simple: collect covid data, construct several hypotheses \(g\), fit the hypotheses \(g\) to the data, determine which \(g\) is likely to be closest to the true function and use that \(g\) to predict the future (forecast). We will use data from the New York Times, which is in the form of a CSV (comma separated values) file.

The data can be found here

Read through this code and be sure you understand (most of) it. Surely there will be a few things you haven’t seen yet, and don’t worry about that for now. Discuss the code with your group and see if together you can understand most of what it does. If this code looks very mysterious to you, contact me right away - you might need some Python catch-up!

# import libraries

import numpy as np
import seaborn as sns
import matplotlib.pyplot as plt
plt.style.use('bmh')
from scipy.optimize import curve_fit # no fancy ML library yet, just SciPy
import pandas as pd

Part 1. Get the data

# Read data, put it into a dataframe and process it a bit
# you will need to modify this so that it points to the file you downloaded
df = pd.read_csv(<filename>)

Explore the dataset

Use pandas and seaborn to explore the dataset

# Let's look at the dataset first
# Use this code cell to learn something from the dataset
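
A minimal sketch of a first look at a table like this one, using a small synthetic DataFrame in place of the downloaded file. The column names `date`, `deaths` and `deaths_avg` are assumptions (only `cases` and `cases_avg` are named in this notebook), so check `df.columns` on your own file before reusing any of it.

```python
import numpy as np
import pandas as pd

# Synthetic stand-in for the downloaded CSV (assumed column names, made-up numbers)
rng = np.random.default_rng(0)
n_days = 60
demo = pd.DataFrame({
    "date": pd.date_range("2022-01-01", periods=n_days, freq="D").astype(str),
    "cases": rng.poisson(1000, size=n_days),
    "deaths": rng.poisson(20, size=n_days),
})
# 7-day rolling means, similar in spirit to the "_avg" columns
demo["cases_avg"] = demo["cases"].rolling(7, min_periods=1).mean()
demo["deaths_avg"] = demo["deaths"].rolling(7, min_periods=1).mean()

# Dates read from a CSV arrive as strings; parse them into timestamps
demo["date"] = pd.to_datetime(demo["date"])

print(demo.dtypes)        # column types
print(demo.isna().sum())  # missing values per column
print(demo.describe())    # summary statistics
print(demo[["cases_avg", "deaths_avg"]].corr())  # how the two series co-vary
```

The same four calls on the real `df` are a reasonable starting point before plotting anything.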

✅ Question 1: What have you learned? Based on your analysis what are the most important features? What feature would you use to model and predict the spread of covid?

Answer in the cell below

✎ Put your answer here

Part 2. Modeling

The dataset contains the number of COVID cases and deaths, with their respective rolling averages, from Jan 21, 2020 until March 3, 2023. We want to predict the number of cases, so let’s take only a slice of our dataset.

Modify the code below to read in as many columns as you want to explore.

Here is an idea: today, many people use home tests for covid, and testing is imperfect. As a result, the number of reported cases likely has a large uncertainty. But (sadly), when someone dies it is pretty clear and always reported. Perhaps forecasting deaths is more meaningful? Try both?

Training and Testing dataset

Since we are using time series data our training dataset and testing dataset correspond to slices in time with consecutive values. Obviously, the testing dataset should be the last days/months.

In the cell below, divide your dataset into a training set and a testing set.

# Put your code here
# Choose your time range
cases = df["cases_avg"].iloc[1000:1085]
# It is easier to have a numeric array for the x-axis, but plots are nicer when you use a datetime format.
days = np.arange(len(cases))
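
A minimal sketch of a time-ordered split, assuming an 85-day slice like the one above and a hold-out of the last 14 days. The array `values` and the length `n_test` are hypothetical stand-ins; pick whatever hold-out length suits your forecast horizon.

```python
import numpy as np

# Hypothetical stand-in for the `cases` slice: 85 consecutive daily values
values = np.linspace(100.0, 500.0, 85)
t = np.arange(len(values))

n_test = 14  # assumed hold-out length: the last two weeks

# No shuffling: the test set is strictly later in time than the training set
t_train, t_test = t[:-n_test], t[-n_test:]
y_train, y_test = values[:-n_test], values[-n_test:]

print(len(t_train), len(t_test))  # 71 14
```

Shuffling before splitting, as is common for non-temporal data, would leak future information into the training set.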

Define your model

This is where we define our hypotheses. One is given here – a Gaussian – but you will want to try several choices. Look at your data to see what makes sense. Even better, think about your data; for example, would a polynomial be reasonable?

Have each person in your group suggest a different hypothesis.

# define the hypotheses we want to fit to
# change this to any other functions you want to try

def gaussian(x, a, b, c, d):
    return a*np.exp(-(x - b)**2/c) + d

# Keep adding your models here

def model_1():
    pass
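
As an illustration only, here are two other hypotheses written in the same form as `gaussian`. The names `quadratic` and `logistic` and their parameterizations are assumptions, not part of the assignment; the printed values hint at how each behaves when extrapolated.

```python
import numpy as np

def quadratic(x, a, b, c):
    # Second-order polynomial: a*x^2 + b*x + c
    return a * x**2 + b * x + c

def logistic(x, L, k, x0, d):
    # Sigmoid that rises from d and saturates at d + L
    return L / (1.0 + np.exp(-k * (x - x0))) + d

x = np.arange(0, 200, 50)
print(quadratic(x, -1.0, 100.0, 0.0))     # rises, returns to zero, then goes negative
print(logistic(x, 1.0, 0.1, 100.0, 0.0))  # stays between 0 and 1
```

A polynomial is unbounded far from the data, so it can predict negative case counts; that is worth keeping in mind when judging whether it is a reasonable forecast model.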

Train your model

In the next few cells you will use curve_fit to learn the parameters of your model. Pay attention to what the curve_fit function returns. Apart from just doing the curve fit, it provides quantitative information about the quality of the fit, which you can use for what is called “model selection”. That is, you can use this information to help decide which hypothesis is best.

Curve fitting works by solving an optimization problem; we will see a lot of optimization in this course. Sometimes it is necessary to help the optimization process by giving a good initial guess. That is done here with the p0 argument. If it doesn’t help, comment it out. Conversely, if curve_fit is having trouble finding a solution for one of your hypotheses, this is how you help it.

# p0 = [1e6, 36, 100, 1e5] is used here to help the algorithm find the solution by
# giving it a good initial guess for its search
# what does this syntax mean and what is being returned? discuss within your group
popt, pcov = curve_fit(..., ..., ... , p0 = [1e6, 36, 100, 1e5]) # this is the ML step: train model with the data (supervised learning, regression)

# be sure to look at what curve_fit returns: what are these?! how do you use them?!
print("\n popt is:\n {}\n\n pcov is:\n {}\n".format(popt,pcov)) 
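
To make the call signature concrete, here is a small sketch that fits the Gaussian hypothesis to synthetic data with known parameters. Every number below is made up for illustration, and the names `x_demo`, `y_demo`, `popt_demo` and `pcov_demo` are hypothetical.

```python
import numpy as np
from scipy.optimize import curve_fit

def gaussian(x, a, b, c, d):
    return a * np.exp(-(x - b)**2 / c) + d

# Synthetic data with known parameters (not real covid numbers)
rng = np.random.default_rng(1)
x_demo = np.arange(80, dtype=float)
true_params = [5.0e5, 36.0, 200.0, 1.0e5]
y_demo = gaussian(x_demo, *true_params) + rng.normal(0, 1.0e4, size=x_demo.size)

# Arguments: model function, x data, y data, then an optional initial guess p0
popt_demo, pcov_demo = curve_fit(gaussian, x_demo, y_demo, p0=[4e5, 30, 150, 5e4])

print(popt_demo)        # best-fit values of a, b, c, d, in that order
print(pcov_demo.shape)  # (4, 4): one row and one column per parameter
```

Because the true parameters are known here, you can check directly that the fit recovers them, which is not possible with the real data.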

Add all of your hypotheses to the plot. Most likely they will all look “good” - that is what the optimization algorithm tried to do after all. We need a way to examine the quality beyond just looking at the curves.

plt.figure(figsize=(12,6))
# Data
plt.plot(..., ..., 'o', label = "Data")
# Model
plt.plot(..., ..., label = "My Model") 

plt.legend()
plt.title("Covid Forecast")
# Don't forget to label your axes
plt.xlabel(...)
plt.ylabel(...)
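
One possible way to look beyond the curves, sketched under assumptions, is to score each fitted hypothesis on days it did not see during training. The synthetic data, the 60/20 split, the `quadratic` competitor and the `rmse` helper are all illustrative choices, not the assignment's prescribed method.

```python
import numpy as np
from scipy.optimize import curve_fit

def gaussian(x, a, b, c, d):
    return a * np.exp(-(x - b)**2 / c) + d

def quadratic(x, a, b, c):
    return a * x**2 + b * x + c

# Synthetic, Gaussian-shaped data (not real covid numbers)
rng = np.random.default_rng(2)
x_all = np.arange(80, dtype=float)
y_all = gaussian(x_all, 5.0e5, 36.0, 200.0, 1.0e5) + rng.normal(0, 1.0e4, x_all.size)

# Train on the first 60 days, test on the last 20
x_tr, y_tr, x_te, y_te = x_all[:60], y_all[:60], x_all[60:], y_all[60:]

def rmse(model, params, x, y):
    # Root-mean-square error of a fitted model on the points (x, y)
    return np.sqrt(np.mean((model(x, *params) - y)**2))

scores = {}
for name, model, guess in [("gaussian", gaussian, [4e5, 30, 150, 5e4]),
                           ("quadratic", quadratic, [-1.0, 1.0, 1e5])]:
    params, _ = curve_fit(model, x_tr, y_tr, p0=guess)
    scores[name] = (rmse(model, params, x_tr, y_tr), rmse(model, params, x_te, y_te))
    print(name, scores[name])  # (training RMSE, test RMSE)
```

A hypothesis whose test error is much larger than its training error fits the past but extrapolates poorly.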

Luckily, curve_fit provides us with pcov, which is the covariance matrix of the fit parameters.
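
A short sketch of how a covariance matrix is commonly read, using a hand-built 2x2 example rather than a real fit. The matrix `pcov_example` is hypothetical; with a real fit you would pass `pcov` instead.

```python
import numpy as np

# Hypothetical 2x2 covariance matrix for two fit parameters
pcov_example = np.array([[4.0, 1.0],
                         [1.0, 9.0]])

# Diagonal entries are variances; their square roots are one-sigma uncertainties
perr = np.sqrt(np.diag(pcov_example))
print(perr)  # [2. 3.]

# Off-diagonal entries, rescaled by the uncertainties, give parameter correlations
corr = pcov_example / np.outer(perr, perr)
print(corr)
```

Large diagonal entries mean a parameter is poorly constrained by the data; correlations near +1 or -1 mean two parameters can trade off against each other without changing the fit much.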

✅ Question 2: Explain in your own words what information the covariance matrix provides.

✎ Put your answer here

Model Validation

It is now time to see how good your models are. In the next code cell I provide some code showing how I will do that in the case of a Gaussian. I sample the fit parameters from a four-dimensional Gaussian (we have four parameters), using popt as its mean and pcov as its covariance. I can then evaluate the model with each sampled parameter set to see how confident we should be in the curve we just produced.

times = np.array([])
# Predict the future
days_fine = np.arange(0, 120)

plt.figure(figsize=(10,6))

# sample from the covariance matrix of fit parameters
for _ in range(1000):
    # IMPORTANT: Make sure you understand what this code snippet does. What role do popt and pcov play here?
    p1, p2, p3, p4 = np.random.multivariate_normal(popt, pcov)
    times = np.append(times,p2)
    sample = gaussian(days_fine, p1, p2, p3, p4)
    plt.plot(days_fine, sample, 'gray', alpha = 0.04)

plt.title("Monte Carlo Using Parameter Covariance Matrix")
plt.plot(days, cases, 'o')
plt.plot(days_fine, gaussian(days_fine, popt[0], popt[1], popt[2], popt[3]))
plt.xlabel("days")
plt.ylabel("cases")
# plt.ylim(0,1.2e6)
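
The gray curves above can also be summarized numerically. The sketch below, which uses hypothetical stand-ins `popt_demo` and `pcov_demo` instead of a real fit, computes a pointwise 95% band and the spread of the sampled peak position (the quantity collected in `times`). The band reflects parameter uncertainty only, under the Gaussian approximation, and says nothing about whether the hypothesis itself is right.

```python
import numpy as np

def gaussian(x, a, b, c, d):
    return a * np.exp(-(x - b)**2 / c) + d

# Hypothetical best-fit values and (diagonal) covariance, standing in for popt and pcov
popt_demo = np.array([5.0e5, 36.0, 200.0, 1.0e5])
pcov_demo = np.diag([1.0e4, 0.5, 5.0, 2.0e3])**2

rng = np.random.default_rng(3)
grid = np.arange(0, 120)

# One sampled parameter vector per row, and one curve per row
params = rng.multivariate_normal(popt_demo, pcov_demo, size=1000)
curves = np.array([gaussian(grid, *p) for p in params])

# Pointwise 2.5th and 97.5th percentiles form a 95% band around the fit
lower, upper = np.percentile(curves, [2.5, 97.5], axis=0)

# The second parameter is the peak position
peak_days = params[:, 1]
print(peak_days.mean(), peak_days.std())
```

With matplotlib, `plt.fill_between(grid, lower, upper, alpha=0.3)` would draw the band in place of the 1000 individual curves.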

Part 3. Presentation

Prepare some slides with Google Slides or MS PowerPoint to present your results. Do not worry about the slide design or template. Try to build your slides by including these:

do different slices: in particular, what if you didn’t have the most recent data point?
explore the data: is there an issue with fitting just after a weekend? (as this ICA was being written, the NYT added data each day - how does each new data point change the regression?)
discuss with your group whether you believe this prediction or not,
the code read in the cases_avg column from the NYT data - rerun with the cases column - what changes?
add a vertical line for Feb 1, 2023 (that’s when we switched back to full in-person teaching),
annotate the plot with text and an arrow that points to the vertical line with “back in person”,
any other visualization tricks you want to include.
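
For the vertical line and arrow items in the list above, a minimal matplotlib sketch on a numeric day axis follows. The data are synthetic, and `event_day` is a hypothetical index: you would need to work out which position in your own slice corresponds to the date of interest.

```python
import numpy as np
import matplotlib.pyplot as plt

# Synthetic series on a numeric day axis (not real covid numbers)
x = np.arange(60)
series = 1000 + 50 * np.sin(x / 5.0)

event_day = 31  # hypothetical index of the date of interest within the slice

fig, ax = plt.subplots(figsize=(10, 5))
ax.plot(x, series, "o", label="Data")

# Vertical dashed line at the event
ax.axvline(event_day, color="k", linestyle="--")

# Text placed to the right of the line, with an arrow pointing back at it
ax.annotate("back in person",
            xy=(event_day, 1000), xytext=(event_day + 8, 1040),
            arrowprops=dict(arrowstyle="->"))

ax.set_xlabel("days")
ax.set_ylabel("cases")
ax.legend()
```

Here `xy` is the point the arrow tip touches and `xytext` is where the label sits, both in data coordinates.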

If you have extra time, explore your own ideas with this dataset. If you have some nice results, contact an instructor to volunteer to show your results to the class.

Now that you are done, follow these steps:

Submit your notebook to D2L.
Be sure to include the names of everyone in your group.