{ "cells": [ { "cell_type": "markdown", "metadata": {}, "source": [ "[Link to this Jupyter notebook](./Files/0204-Web_Scraping.ipynb)" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "# In-Class Assignment: Web Scraping\n", "\n", "Today we will be exploring some of the extensive datasets available at the National Oceanic and Atmospheric Administration (NOAA). Work as a team to get through as many of today's activities as you can. We will meet as a class again around 3:30pm to discuss what you learned. \n", "\n", "\n", "\n", "Image from: https://www.ncdc.noaa.gov/data-access/land-based-station-data/land-based-datasets/us-climate-reference-network-uscrn" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "### Agenda for today's class (80 minutes)\n", "\n", "\n", "1. [(10 minutes) NOAA Example](#NOAA_Example)\n", "2. [(5 minutes) Installing Beautiful Soup](#Installing_Beautiful_Soup)\n", "3. [(20 minutes) Presidential data example](#Presidential_data_example)\n", "4. [(20 minutes) Dynamic Website example](#DynamicWebsites)\n", "5. [(25 minutes) Wrap-up Discussion](#Wrapup)" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "----\n", "\n", "\n", "# 1. NOAA Example and Coding Standards\n", "\n", "We are going to start today's activity by doing a code review of a **_web spider_** program. \n", "\n", "✅ **DO THIS:** Download [noaa_scraper.py](./Files/noaa_scrapper.py) and [this Jupyter notebook](./Files/0204-Web_Scraping.ipynb) and put them in the same directory. 
Run the file via the following command:" ] }, { "cell_type": "code", "execution_count": 4, "metadata": {}, "outputs": [ { "name": "stdout", "output_type": "stream", "text": [ "running as noaa_scraper\n" ] } ], "source": [ "%matplotlib inline \n", "import matplotlib.pyplot as plt\n", "\n", "from noaa_scraper import get_noaa_temperatures" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "✅ **DO THIS:** Run the ```get_noaa_temperatures``` function as follows:" ] }, { "cell_type": "code", "execution_count": 5, "metadata": {}, "outputs": [ { "data": { "application/vnd.jupyter.widget-view+json": { "model_id": "691ca32a38264410b08bf84dcfc82a97", "version_major": 2, "version_minor": 0 }, "text/plain": [ "FloatProgress(value=0.0)" ] }, "metadata": {}, "output_type": "display_data" }, { "name": "stdout", "output_type": "stream", "text": [ "FOUND http://www1.ncdc.noaa.gov/pub/data/uscrn/products/subhourly01/2007/CRNS0101-05-2007-MI_Gaylord_9_SSW.txt\n", "downloading... ./data/CRNS0101-05-2007-MI_Gaylord_9_SSW.txt\n", "FOUND http://www1.ncdc.noaa.gov/pub/data/uscrn/products/subhourly01/2008/CRNS0101-05-2008-MI_Gaylord_9_SSW.txt\n", "downloading... ./data/CRNS0101-05-2008-MI_Gaylord_9_SSW.txt\n", "FOUND http://www1.ncdc.noaa.gov/pub/data/uscrn/products/subhourly01/2009/CRNS0101-05-2009-MI_Gaylord_9_SSW.txt\n", "downloading... ./data/CRNS0101-05-2009-MI_Gaylord_9_SSW.txt\n", "FOUND http://www1.ncdc.noaa.gov/pub/data/uscrn/products/subhourly01/2010/CRNS0101-05-2010-MI_Gaylord_9_SSW.txt\n", "downloading... ./data/CRNS0101-05-2010-MI_Gaylord_9_SSW.txt\n", "FOUND http://www1.ncdc.noaa.gov/pub/data/uscrn/products/subhourly01/2011/CRNS0101-05-2011-MI_Gaylord_9_SSW.txt\n", "downloading... ./data/CRNS0101-05-2011-MI_Gaylord_9_SSW.txt\n", "FOUND http://www1.ncdc.noaa.gov/pub/data/uscrn/products/subhourly01/2012/CRNS0101-05-2012-MI_Gaylord_9_SSW.txt\n", "downloading... 
./data/CRNS0101-05-2012-MI_Gaylord_9_SSW.txt\n", "FOUND http://www1.ncdc.noaa.gov/pub/data/uscrn/products/subhourly01/2013/CRNS0101-05-2013-MI_Gaylord_9_SSW.txt\n", "downloading... ./data/CRNS0101-05-2013-MI_Gaylord_9_SSW.txt\n", "FOUND http://www1.ncdc.noaa.gov/pub/data/uscrn/products/subhourly01/2014/CRNS0101-05-2014-MI_Gaylord_9_SSW.txt\n", "downloading... ./data/CRNS0101-05-2014-MI_Gaylord_9_SSW.txt\n", "FOUND http://www1.ncdc.noaa.gov/pub/data/uscrn/products/subhourly01/2015/CRNS0101-05-2015-MI_Gaylord_9_SSW.txt\n", "downloading... ./data/CRNS0101-05-2015-MI_Gaylord_9_SSW.txt\n", "FOUND http://www1.ncdc.noaa.gov/pub/data/uscrn/products/subhourly01/2016/CRNS0101-05-2016-MI_Gaylord_9_SSW.txt\n", "downloading... ./data/CRNS0101-05-2016-MI_Gaylord_9_SSW.txt\n", "FOUND http://www1.ncdc.noaa.gov/pub/data/uscrn/products/subhourly01/2017/CRNS0101-05-2017-MI_Gaylord_9_SSW.txt\n", "downloading... ./data/CRNS0101-05-2017-MI_Gaylord_9_SSW.txt\n", "FOUND http://www1.ncdc.noaa.gov/pub/data/uscrn/products/subhourly01/2018/CRNS0101-05-2018-MI_Gaylord_9_SSW.txt\n", "downloading... ./data/CRNS0101-05-2018-MI_Gaylord_9_SSW.txt\n" ] }, { "data": { "application/vnd.jupyter.widget-view+json": { "model_id": "04e5ba406f9145a4ac3b3c790f63d37a", "version_major": 2, "version_minor": 0 }, "text/plain": [ "FloatProgress(value=0.0, max=12.0)" ] }, "metadata": {}, "output_type": "display_data" }, { "name": "stdout", "output_type": "stream", "text": [ "reading... ./data/CRNS0101-05-2007-MI_Gaylord_9_SSW.txt\n", "reading... ./data/CRNS0101-05-2008-MI_Gaylord_9_SSW.txt\n", "reading... ./data/CRNS0101-05-2009-MI_Gaylord_9_SSW.txt\n", "reading... ./data/CRNS0101-05-2010-MI_Gaylord_9_SSW.txt\n", "reading... ./data/CRNS0101-05-2011-MI_Gaylord_9_SSW.txt\n", "reading... ./data/CRNS0101-05-2012-MI_Gaylord_9_SSW.txt\n", "reading... ./data/CRNS0101-05-2013-MI_Gaylord_9_SSW.txt\n", "reading... ./data/CRNS0101-05-2014-MI_Gaylord_9_SSW.txt\n", "reading... 
./data/CRNS0101-05-2015-MI_Gaylord_9_SSW.txt\n", "reading... ./data/CRNS0101-05-2016-MI_Gaylord_9_SSW.txt\n", "reading... ./data/CRNS0101-05-2017-MI_Gaylord_9_SSW.txt\n", "reading... ./data/CRNS0101-05-2018-MI_Gaylord_9_SSW.txt\n" ] }, { "data": { "text/plain": [ "[]" ] }, "execution_count": 5, "metadata": {}, "output_type": "execute_result" }, { "data": { "image/png": "\n", "text/plain": [ "
" ] }, "metadata": { "needs_background": "light" }, "output_type": "display_data" } ], "source": [ "air_temperatures = get_noaa_temperatures('http://www1.ncdc.noaa.gov/pub/data/uscrn/products/subhourly01/', 'Gaylord', 100)\n", "plt.plot(air_temperatures)\n", "# plt.axis([0,1000,-20,80])" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "✅ **DO THIS:** With your group, do a code review of the contents of the **noaa_scraper.py** file and figure out what it does. What are the main parts of this module and what do they do? Be prepared to discuss this with the class. " ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "Put your notes here" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "----\n", "\n", "\n", "# 2. Installing Beautiful Soup\n", "\n", "For this class we will be trying out [BeautifulSoup](https://www.crummy.com/software/BeautifulSoup/bs4/doc/), a Python web parsing module. \n", "\n", "✅ **DO THIS:** Install the ```beautifulsoup4``` library on your computer (the following works on JupyterHub and should work anywhere). When you are done, help your neighbor, and raise your hand if you need help." ] }, { "cell_type": "code", "execution_count": 6, "metadata": {}, "outputs": [], "source": [ "#!mkdir packages" ] }, { "cell_type": "code", "execution_count": 7, "metadata": {}, "outputs": [], "source": [ "#!pip install -t ./packages/ beautifulsoup4" ] }, { "cell_type": "code", "execution_count": 8, "metadata": {}, "outputs": [], "source": [ "import sys\n", "sys.path.append('./packages/')" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "----\n", "\n", "# 3. Presidential data example\n", "This second example is a **_web scraper_** program. The idea comes from the following blog post: https://blog.exploratory.io/scraping-us-presidents-list-from-web-and-transforming-it-to-be-useful-fff534470bb6\n", "\n", "✅ **DO THIS:** Click on the following link and review the page source with your team. 
Discuss which tags you need to look for to isolate only the table data. Ideally we want to create a pandas ```DataFrame``` of this data:\n", "https://www.loc.gov/rr/print/list/057_chron.html\n" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "Put notes on what you find here." ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "## Download and view html\n", "\n", "The following code downloads the above web page and parses it into a ```BeautifulSoup``` object:" ] }, { "cell_type": "code", "execution_count": 2, "metadata": {}, "outputs": [], "source": [ "#The requests library downloads the page and stores the response in the page variable\n", "import requests\n", "page = requests.get(\"https://www.loc.gov/rr/print/list/057_chron.html\")" ] }, { "cell_type": "code", "execution_count": 3, "metadata": {}, "outputs": [], "source": [ "#Import BeautifulSoup and parse the downloaded page with html.parser\n", "from bs4 import BeautifulSoup\n", "soup = BeautifulSoup(page.content, 'html.parser')" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "✅ **DO THIS:** Explore the ```soup``` variable using Python functions such as ```type```, ```dir```, and ```help```.\n" ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "#Put your answer to the above here" ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "##ANSWER##\n", "type(soup)\n", "##ANSWER##" ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "##ANSWER##\n", "dir(soup)\n", "##ANSWER##" ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "##ANSWER##\n", "help(soup)\n", "##ANSWER##" ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "#Print out the raw html using \"pretty print\" \n", "print(soup.prettify())" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "## Find the Tables\n", "\n", 
"Next, the following code finds all of the ```table``` sections in the web page:" ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "tables = soup.find_all('table')" ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "type(tables)" ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "len(tables)" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "The result above shows that there are 9 ```table``` objects in the document. We are looking for the one that holds our data. \n", "\n" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "✅ **DO THIS:** Find which of the nine tables contains only the data we want, and store it in a variable named ```table```. Hint: it is not the first table, as the following code shows. " ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "table = tables[0]\n", "print(table.prettify())" ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "##ANSWER##\n", "table = tables[3]\n", "print(table.prettify())\n", "##ANSWER##" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "## Parse out all the rows\n", "\n", "The rows of a table are determined by the ```tr``` (table row) tag and the columns are determined by the ```td``` (table data) tag. 
The following code finds all of the rows in the table:" ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "rows = table.find_all('tr')\n", "rows" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "## Get the column labels\n", "\n", "The first row is the column header row, as can be seen by running the following code:" ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "rows[0]" ] }, { "cell_type": "markdown", "metadata": {}, "source": [] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "# Collect the text of each header (th) cell in the first row\n", "labels = []\n", "for c in rows[0].find_all('th'):\n", "    labels.append(c.get_text())\n", "labels" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "## Parse Rows\n", "\n", "✅ **DO THIS:** The next step is to loop through the remaining rows and save the data as a list of lists." ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "#put your code here" ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "##ANSWER##\n", "\n", "# Build one list of cell text per data row, skipping the header row\n", "data = [] \n", "for row in rows[1:]:\n", "    myrow = []\n", "    for c in row.find_all('td'):\n", "        myrow.append(c.get_text())\n", "    data.append(myrow)\n", "\n", "##ANSWER##" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "## Convert list of lists to a pandas DataFrame\n", "\n", "Assuming the above works, we can convert the list of lists and the labels to a pandas DataFrame:" ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "import pandas as pd \n", " \n", "# Create the pandas DataFrame \n", "df = pd.DataFrame(data, columns=labels) " ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "----\n", "\n", "\n", "\n", "# 4. Dynamic Website example\n", "\n", "The two examples above were fairly simple. However, websites can get a lot more complex. 
This is especially true when the website includes \"client side\" code. This code (typically JavaScript) runs in the web browser on your local computer, not on the web server. That makes scraping harder: to pull the data out you often need to either figure out what the code is doing and mimic it in your Python program, or \"render\" the page using a JavaScript client and then parse the output. \n", "\n", "\n", "Fortunately, there are tools that can help. Have your team do a Google search and see if you can find some Python tools specifically designed to help render dynamic websites. See if you can download, install, and test one of them.\n", "\n" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "---\n", "\n", "\n", "\n", "\n", "# 5. Wrap-up Discussion" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "Written by Dr. Dirk Colbry, Michigan State University\n", "\n", 
"This work is licensed under a Creative Commons Attribution-NonCommercial 4.0 International License." ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "----" ] } ], "metadata": { "kernelspec": { "display_name": "Python 3", "language": "python", "name": "python3" }, "language_info": { "codemirror_mode": { "name": "ipython", "version": 3 }, "file_extension": ".py", "mimetype": "text/x-python", "name": "python", "nbconvert_exporter": "python", "pygments_lexer": "ipython3", "version": "3.8.8" } }, "nbformat": 4, "nbformat_minor": 2 }