This is the webpage for CMSE495 Data Science Capstone Course (Spring 2022)
Today we will be exploring some of the extensive datasets available at the National Oceanic and Atmospheric Administration (NOAA). Work as a team and try to complete as many of today's activities as you can. We will meet as a class again around 3:30pm to discuss what you learned.
Image From: https://www.ncdc.noaa.gov/data-access/land-based-station-data/land-based-datasets/us-climate-reference-network-uscrn
We are going to start today’s activity by doing a code review of a web spider program.
✅ DO THIS: Download the noaa_scraper.py file and this Jupyter notebook and put them in the same directory. Then import the scraper with the following commands:
%matplotlib inline
import matplotlib.pyplot as plt
from noaa_scraper import get_noaa_temperatures
✅ DO THIS: Run the get_noaa_temperatures function as follows:
air_temperatures = get_noaa_temperatures('http://www1.ncdc.noaa.gov/pub/data/uscrn/products/subhourly01/', 'Gaylord', 100)
plt.plot(air_temperatures)
# plt.axis([0,1000,-20,80])
✅ DO THIS: With your group, do a code review of the contents of the noaa_scraper.py file and figure out what it does. What are the main parts of this module, and what do they do? Be prepared to discuss this with the class.
Put your notes here
For this class we will be trying out BeautifulSoup, a Python web parsing module.
✅ DO THIS: Install the beautifulsoup4 library on your computer (the following will work on JupyterHub but should work anywhere). When you are done, help your neighbor, and raise your hand if you need help.
#!mkdir packages
#!pip install -t ./packages/ beautifulsoup4
import sys
sys.path.append('./packages/')
This second example is a web scraper program. We found this idea by reading the following blog post: Scraping US President list
✅ DO THIS: Click on the following link and review the page source with your team. Discuss which tags you need to look for to isolate just the table data. Ideally we want to create a pandas DataFrame of this data:
Chronological List of Presidents, First Ladies, and Vice Presidents of the United States
Put notes on what you find here.
The following code should download the above website and parse it into a BeautifulSoup object:
# The requests library downloads the data and stores it in the page variable
import requests
page = requests.get("https://www.loc.gov/rr/print/list/057_chron.html")
# Import BeautifulSoup and parse the downloaded page using html.parser
from bs4 import BeautifulSoup
soup = BeautifulSoup(page.content, 'html.parser')
✅ DO THIS: Explore the soup variable using Python functions such as type, dir, and help.
# Put your answer to the above here
# Print out the raw HTML using "pretty print"
print(soup.prettify())
Next, the following code finds all of the table sections in the website:
tables = soup.find_all('table')
print(type(tables))
print(len(tables))
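As a quick offline check of how find_all behaves, you can try it on a small made-up snippet (invented HTML for illustration, not the presidents page) with two tables:

```python
from bs4 import BeautifulSoup

# A made-up HTML snippet with two tables (for illustration only)
html = """
<html><body>
<table><tr><td>A</td></tr></table>
<table><tr><td>B</td></tr></table>
</body></html>
"""

snippet_soup = BeautifulSoup(html, 'html.parser')
snippet_tables = snippet_soup.find_all('table')
print(type(snippet_tables))   # a bs4 ResultSet, which behaves like a list
print(len(snippet_tables))    # one entry per table tag found
```

Because a ResultSet acts like a list, you can index it (e.g. snippet_tables[0]) just as we do with the real page below.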
The results above show that there are nine table objects in the document. We are just looking for the one that has our data in it.
✅ DO THIS: Find the one table out of the nine that has only the data we want. Make a variable table that includes only that information. Hint: it is not the first table, as we can see by running the following code.
table = tables[0]
print(table.prettify())
The rows of a table are marked by the tr (table row) tag and the columns by the td (table data) tag. The following code finds all of the rows in the table:
rows = table.find_all('tr')
rows
The first row is the column header row as can be seen by running the following code:
rows[0]
labels = []
for c in rows[0].find_all('th'):
    labels.append(c.get_text())
labels
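You can try the same header-extraction pattern offline on a hypothetical header row (the column names here are made up for the example, not taken from the real table):

```python
from bs4 import BeautifulSoup

# A hypothetical header row, just to illustrate extracting th labels
header_html = "<tr><th>YEAR</th><th>PRESIDENT</th><th>VICE PRESIDENT</th></tr>"
header_row = BeautifulSoup(header_html, 'html.parser').find('tr')

header_labels = []
for c in header_row.find_all('th'):
    header_labels.append(c.get_text())

print(header_labels)
```

The get_text() call strips away the tags and keeps only the text inside each th cell.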
✅ DO THIS: The next step is to loop through the remaining rows and save the data as a list of lists.
# Put your code here
Assuming the above works, we can convert the list of lists and the labels to a pandas DataFrame:
import pandas as pd
# Create the pandas DataFrame
df = pd.DataFrame(data, columns=labels)
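Putting the pieces together, here is one possible sketch of the whole pattern on a tiny made-up table (the table contents are invented for the example, not the real presidents data), so you can check your approach before applying it to the real page:

```python
from bs4 import BeautifulSoup
import pandas as pd

# A tiny made-up table that mimics the structure of the real one
table_html = """
<table>
  <tr><th>YEAR</th><th>PRESIDENT</th></tr>
  <tr><td>1789-1797</td><td>George Washington</td></tr>
  <tr><td>1797-1801</td><td>John Adams</td></tr>
</table>
"""

demo_table = BeautifulSoup(table_html, 'html.parser').find('table')
demo_rows = demo_table.find_all('tr')

# The first row holds the column labels (th tags)
demo_labels = [c.get_text() for c in demo_rows[0].find_all('th')]

# The remaining rows hold the data (td tags); build a list of lists
demo_data = []
for row in demo_rows[1:]:
    demo_data.append([c.get_text() for c in row.find_all('td')])

demo_df = pd.DataFrame(demo_data, columns=demo_labels)
print(demo_df)
```

Note that we skip the header row with demo_rows[1:] and switch from th to td for the data cells; the same two ideas apply to the real table, whatever your variable names are.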
The two examples above were fairly simple. However, sometimes websites get a lot more complex. This is especially true when the website includes "client side" code. This code (typically JavaScript) runs in the web browser on your local computer, not on the web server. That makes scraping difficult: to pull the data out, you often need to either figure out what the code is doing and mimic it in your Python program, or "render" the page using a JavaScript client and then parse the output.
Fortunately, there are tools that can help. Have your team do a Google search and see if you can find some Python tools specifically designed to help render dynamic websites. See if you can download, install, and test one of them.
Written by Dr. Dirk Colbry, Michigan State University
This work is licensed under a Creative Commons Attribution-NonCommercial 4.0 International License.