{
"cells": [
{
"cell_type": "markdown",
"metadata": {},
"source": [
"# Day 05: ICA: Web Scraping\n",
"\n",
"\n",
"\n",
"Chief Executives by Steve Penley"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"### Student Identification\n",
"\n",
"Please do not modify the structure or format of this cell. Follow these steps:\n",
"\n",
"1. Look for the \"YOUR SUBMISSION\" section below\n",
"2. Replace `[Your Name Here]` with your name\n",
"3. If you worked with groupmates, include their names after your name, separated by commas\n",
"4. Do not modify any other part of this cell\n",
"\n",
"Examples:\n",
"```\n",
"Single student: John Smith\n",
"Group work: John Smith, Jane Doe, Alex Johnson\n",
"```\n",
"\n",
"YOUR SUBMISSION (edit this line only):\n",
"
✅ [Your Name Here]\n",
"\n",
"Note:\n",
"- Keep the \"✅\" symbol at the start of your submission line\n",
"- Use commas to separate multiple names\n",
"- Spell names exactly as they appear in the course roster\n",
"- This cell has been tagged as \"names\" - do not modify or remove this tag"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"## Table of Contents\n",
"\n",
"- [Learning objectives](#learning-objectives)\n",
"- [Project Goal](#project-goal)\n",
"- [Part 1: Understanding HTML Structure for Web Scraping](#part-1:-understanding-html-structure-for-web-scraping)\n",
"- [Part 2. Using BeautifulSoup to Explore HTML](#part-2.-using-beautifulsoup-to-explore-html)\n",
"- [Part 3. Implementing Your Web Scraping Code](#part-3.-implementing-your-web-scraping-code)\n",
"- [Congratulations, you're done!](#congratulations,-you're-done!)"
]
},
{
"attachments": {},
"cell_type": "markdown",
"metadata": {
"tags": []
},
"source": [
"## Learning objectives\n",
"\n",
"By the end of this assignment you should be able to:\n",
"\n",
"- Get data out of webpage using `BeautifulSoup`.\n",
"- Explain the major code elements required to deal with a webpage.\n",
"- Lint, format, and auto-document your code.\n",
"\n",
"## Project Goal\n",
"\n",
"The goal of today is to scrap the library of Congress webpage to collect data on the U.S. Presidents and put it in a `pandas.DataFrame` for future data analysis. The link we will scrape is [this](https://www.loc.gov/rr/print/list/057_chron.html). \n",
"\n",
"The original idea of this ICA comes from [this post](https://blog.exploratory.io/scraping-us-presidents-list-from-web-and-transforming-it-to-be-useful-fff534470bb6). \n",
"\n"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"----\n",
"## Part 1: Understanding HTML Structure for Web Scraping\n",
"\n",
"In web scraping, the information we want is typically defined within specific [HTML tags](https://www.w3schools.com/TAGS/default.ASP). As with any task in computational analysis, the first step is always to get to know your data.\n",
"\n",
"### 🔍 Exploring the HTML Structure\n",
"\n",
"1. Open [this page](https://www.loc.gov/rr/print/list/057_chron.html) in your browser.\n",
"2. Right-click on the page and select `Inspect` (or `Inspect Element` in some browsers).\n",
"3. In the developer tools panel that opens, you'll see the HTML structure of the page.\n",
"4. As you hover over elements in the inspector, corresponding parts of the webpage will be highlighted.\n",
"\n",
"🗒️ **Task:** Working with your groupmates, answer the following questions:\n",
"\n",
"1. What is the title of the page? \n",
" *Hint:* Look for the `
` tag. Try `Ctrl` + `F` \n",
"2. Which tag(s) contain the main content of the page?\n",
" *Hint:* Look for tags like ``, ``, or `` with specific classes or IDs.\n",
"3. How is the table of presidents structured? \n",
" *Hint:* Look for the `
` tag and its children (`
`, `
`, `
`).\n",
"4. Are there any unique identifiers (like classes or IDs) that could help us locate the table?"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"✏️ **Answer:**"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"✏️ **Instructor Answer:**\n",
"\n",
"1. The title of the page is \n",
"\n",
"``` html \n",
"\"Chronological List of Presidents, First Spouses, and Vice Presidents of the United States - Presidents of the United States: Selected Images - Research Guides at Library of Congress\"\n",
"```\n",
"\n",
"2. `` contains the main content of the page.\n",
"\n",
"3. The table is in the division `
`. The table is divided into a header and body, `` and `` respectively.\n",
"- `
` stands for a division in HTML.\n",
"- `
` stands for table row.\n",
"- `
` stands for table data.\n",
"- `
` stands for table header.\n",
"\n",
"1. The unique identifier is `
`"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"## Part 2. Using BeautifulSoup to Explore HTML\n",
"\n",
"Now that we've visually inspected the HTML, let's use BeautifulSoup to programmatically explore it. Run the following code"
]
},
{
"cell_type": "code",
"execution_count": 5,
"metadata": {},
"outputs": [],
"source": [
"import requests\n",
"from bs4 import BeautifulSoup\n",
"\n",
"url = \"https://www.loc.gov/rr/print/list/057_chron.html\"\n",
"response = requests.get(url)\n",
"soup = BeautifulSoup(response.content, 'html.parser')"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"### **Exploring the BeautifulSoup Object**\n",
"\n",
"Now that we've created our `soup` object, let's explore what it contains and how we can interact with it. This exploration will help you understand the structure of the parsed HTML and how to extract information from it.\n",
"\n",
"🖥️ **Task:** Now, use Python commands to investigate `soup` object. Here are some suggestions for exploration:\n",
"\n",
"1. Determine the type of the `soup` object. What class is it?\n",
"2. Use the `help()` function to view the documentation for the BeautifulSoup object. What methods look useful for our task of extracting table data?\n",
"3. Try to display the entire `soup` object. What happens? Why do you think this occurs?\n",
"4. Can you find a way to display just a portion of the `soup` object? Hint: Think about how you might convert it to a string and slice it.\n",
"5. Explore these commonly used BeautifulSoup methods. For each one, try to understand what it does and how it might be useful:\n",
" - `find()`\n",
" - `find_all()`\n",
" - `select()`\n",
" - `get_text()`\n",
"6. Can you use any of these methods to:\n",
" - Find the title of the webpage?\n",
" - Locate the main table in the document?\n",
" - Count how many table rows (`