Day 05: ICA: Web Scraping#
Chief Executives by Steve Penley
Student Identification#
Please do not modify the structure or format of this cell. Follow these steps:
Look for the “YOUR SUBMISSION” section below
Replace
[Your Name Here]
with your name. If you worked with groupmates, include their names after your name, separated by commas
Do not modify any other part of this cell
Examples:
Single student: John Smith
Group work: John Smith, Jane Doe, Alex Johnson
YOUR SUBMISSION (edit this line only):
✅ [Your Name Here]
Note:
Keep the “✅” symbol at the start of your submission line
Use commas to separate multiple names
Spell names exactly as they appear in the course roster
This cell has been tagged as “names” - do not modify or remove this tag
Learning objectives#
By the end of this assignment you should be able to:
Get data out of a webpage using BeautifulSoup.
Explain the major code elements required to deal with a webpage.
Lint, format, and auto-document your code.
Project Goal#
The goal of today is to scrape the Library of Congress webpage to collect data on the U.S. Presidents and put it in a pandas.DataFrame for future data analysis. The link we will scrape is this.
The original idea of this ICA comes from this post.
Part 1: Understanding HTML Structure for Web Scraping#
In web scraping, the information we want is typically defined within specific HTML tags. As with any task in computational analysis, the first step is always to get to know your data.
🔍 Exploring the HTML Structure#
Open this page in your browser.
Right-click on the page and select Inspect (or Inspect Element in some browsers).
In the developer tools panel that opens, you’ll see the HTML structure of the page.
As you hover over elements in the inspector, corresponding parts of the webpage will be highlighted.
🗒️ Task: Working with your groupmates, answer the following questions:
What is the title of the page? Hint: Look for the <title> tag. Try Ctrl+F.
Which tag(s) contain the main content of the page? Hint: Look for tags like <main>, <article>, or <body> with specific classes or IDs.
How is the table of presidents structured? Hint: Look for the <table> tag and its children (<tr>, <th>, <td>).
Are there any unique identifiers (like classes or IDs) that could help us locate the table?
✏️ Answer:
✏️ Instructor Answer:
The title of the page is <title>"Chronological List of Presidents, First Spouses, and Vice Presidents of the United States - Presidents of the United States: Selected Images - Research Guides at Library of Congress"</title>.
<body> contains the main content of the page.
The table is in the division <div class="table-responsive"><table class="table table-bordered">. The table is divided into a header and body, <thead> and <tbody> respectively.
<div> stands for a division in HTML. <tr> stands for table row. <td> stands for table data. <th> stands for table header.
The unique identifier is <table class="table table-bordered">.
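The structure described above can be illustrated on a miniature table. The sketch below parses a hand-written HTML fragment that mimics the layout of the Library of Congress page (the data in it is made up for illustration, not scraped from the real page):

```python
from bs4 import BeautifulSoup

# A miniature HTML fragment mimicking the structure described above
# (made-up sample data, not the real Library of Congress page)
html = """
<div class="table-responsive">
  <table class="table table-bordered">
    <thead>
      <tr><th>YEAR</th><th>PRESIDENT</th></tr>
    </thead>
    <tbody>
      <tr><td>1789-1797</td><td>George Washington</td></tr>
    </tbody>
  </table>
</div>
"""

soup = BeautifulSoup(html, "html.parser")
# Locate the table by the class identifier noted above
table = soup.find("table", class_="table table-bordered")
headers = [th.get_text() for th in table.find_all("th")]
rows = table.tbody.find_all("tr")
print(headers)  # ['YEAR', 'PRESIDENT']
print(len(rows))  # 1
```

The same `find` call with `class_="table table-bordered"` works on the real page, since class names are one of the most reliable hooks for locating an element.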
Part 2. Using BeautifulSoup to Explore HTML#
Now that we’ve visually inspected the HTML, let’s use BeautifulSoup to programmatically explore it. Run the following code:
import requests
from bs4 import BeautifulSoup
url = "https://www.loc.gov/rr/print/list/057_chron.html"
response = requests.get(url)
soup = BeautifulSoup(response.content, 'html.parser')
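Before parsing, it is worth checking that the download actually succeeded, since a failed request would silently produce an empty or error-page soup. requests provides `raise_for_status()` for this; the sketch below demonstrates its behavior on a hand-built `Response` object (no network call needed), as an optional safeguard rather than part of the assignment:

```python
import requests

# raise_for_status() turns HTTP 4xx/5xx status codes into a
# requests.HTTPError, so a failed download is caught before parsing.
# Demonstrated here on a hand-built Response (no network needed).
resp = requests.Response()
resp.status_code = 404

try:
    resp.raise_for_status()
    failed = False
except requests.HTTPError:
    failed = True

print(failed)  # True: a 404 page would have been caught before parsing
```

In the scraping code above, adding `response.raise_for_status()` right after `requests.get(url)` gives the same protection.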
Exploring the BeautifulSoup Object#
Now that we’ve created our soup
object, let’s explore what it contains and how we can interact with it. This exploration will help you understand the structure of the parsed HTML and how to extract information from it.
🖥️ Task: Now, use Python commands to investigate the soup
object. Here are some suggestions for exploration:
Determine the type of the soup object. What class is it?
Use the help() function to view the documentation for the BeautifulSoup object. What methods look useful for our task of extracting table data?
Try to display the entire soup object. What happens? Why do you think this occurs?
Can you find a way to display just a portion of the soup object? Hint: Think about how you might convert it to a string and slice it.
Explore these commonly used BeautifulSoup methods. For each one, try to understand what it does and how it might be useful:
find()
find_all()
select()
get_text()
Can you use any of these methods to:
Find the title of the webpage?
Locate the main table in the document?
Count how many table rows (<tr> tags) are in the document?
Extract the text from the first table row?
Challenge: Can you extract and print the name and term of the first president in the table?
🤔 Discussion Questions:
What challenges did you encounter when exploring the soup object?
How might the methods you discovered be useful for extracting the presidential data?
Based on your exploration, what strategy would you use to extract all the data from the table?
📚 Resources:
Remember, web scraping often requires experimentation. Don’t hesitate to try different approaches, and use print statements to understand what each method returns.
### ANSWER
# Put your code here
✏️ Answer: Put your strategy here.
### INSTRUCTOR ANSWER
# 1. Determine the type of the soup object
print("1. Type of soup object:", type(soup))
# 2. Use help() function (this would typically be done in an interactive environment)
# help(soup)
# 3. & 4. Display a portion of the soup object
print("\n3. & 4. First 500 characters of soup:")
print(str(soup)[:500])
# 5. & 6. Explore BeautifulSoup methods
# Find the title of the webpage
print("\n5. & 6. Webpage title:")
print(soup.find('title').get_text())
# Locate the main table
main_table = soup.find('table')
print("\nMain table found:", main_table is not None)
# Count table rows
rows = soup.find_all('tr')
print("\nNumber of table rows:", len(rows))
# Extract text from the first table row
first_row = rows[0]
print("\nText from first table row:")
print(first_row.get_text(strip=True))
# 7. Challenge: Extract name and term of the first president
# Note: We skip the header row (index 0) to get the first data row
first_president_row = rows[1]
cells = first_president_row.find_all('td')
if len(cells) >= 2:  # Ensure we have enough cells
    term = cells[0].get_text(strip=True)  # YEAR column holds the term
    name = cells[1].get_text(strip=True)  # PRESIDENT column
    print("\n7. First president:")
    print(f"Name: {name}")
    print(f"Term: {term}")
else:
    print("\n7. Couldn't extract first president's data. Check the table structure.")
# Demonstration of select() method
print("\nDemonstration of select() method:")
headers = soup.select('table th')
print("Table headers:", [header.get_text(strip=True) for header in headers])
1. Type of soup object: <class 'bs4.BeautifulSoup'>
3. & 4. First 500 characters of soup:
<!DOCTYPE html>
<html lang="en"><head><meta content="IE=Edge" http-equiv="X-UA-Compatible"/><meta content="text/html; charset=utf-8" http-equiv="Content-Type"/><title>Chronological List of Presidents, First Spouses, and Vice Presidents of the United States - Presidents of the United States: Selected Images - Research Guides at Library of Congress</title><meta content="width=device-width, initial-scale=1.0" name="viewport"/><meta content="noarchive" name="robots"><link href="https://www.loc.gov/f
5. & 6. Webpage title:
Chronological List of Presidents, First Spouses, and Vice Presidents of the United States - Presidents of the United States: Selected Images - Research Guides at Library of Congress
Main table found: True
Number of table rows: 69
Text from first table row:
YEARPRESIDENTFIRST SPOUSEVICE PRESIDENT
7. First president:
Name: George Washington
Term: 1789-1797
Demonstration of select() method:
Table headers: ['YEAR', 'PRESIDENT', 'FIRST SPOUSE', 'VICE PRESIDENT']
Part 3. Implementing Your Web Scraping Code#
Now that you’ve explored the structure of the webpage and experimented with BeautifulSoup, it’s time to write code to scrape the data. If you weren’t able to formulate a strategy, you can follow the one below.
📋 Web Scraping Strategy Outline:#
Locating the Table:
How will you find the specific table containing the presidential data?
What BeautifulSoup method(s) will you use?
Extracting Table Rows:
How will you separate the header row from the data rows?
What’s your plan for iterating through the rows?
Parsing Row Data:
How will you extract individual pieces of information (e.g., name, term) from each row?
Are there any data cleaning steps you need to consider?
Data Storage:
In what format will you store the extracted data? (e.g., list of dictionaries, pandas DataFrame)
How will you handle any potential missing data?
(Optional, time permitting) Error Handling:
What potential issues might arise during the scraping process?
How will you handle these to make your code more robust?
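The optional error-handling step might be sketched as a small fetch helper (one possible approach, with a hypothetical `fetch_soup` name, not the required solution):

```python
import requests
from bs4 import BeautifulSoup

def fetch_soup(url, timeout=30):
    """Fetch a page and return parsed soup, or None if the request fails."""
    try:
        response = requests.get(url, timeout=timeout)
        response.raise_for_status()  # treat HTTP 4xx/5xx as failures
    except requests.RequestException as err:
        # Covers connection errors, timeouts, bad URLs, and HTTP errors
        print(f"Request failed: {err}")
        return None
    return BeautifulSoup(response.content, "html.parser")

# A malformed URL fails fast without any network access
result = fetch_soup("not a real url")
print(result)  # None
```

Catching `requests.RequestException` (the base class for the library's errors) keeps the rest of the scraper free of network concerns: callers only need to check for `None`.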
🖥️ Implementation Task:#
Based on the strategy above, implement your web scraping code. Here’s a suggested structure for your implementation:
import requests
from bs4 import BeautifulSoup
import pandas as pd

def scrape_president_data(url):
    # Your code here to scrape the data
    # Remember to break down your code into smaller functions for each step
    pass

url = "https://www.loc.gov/rr/print/list/057_chron.html"
df = scrape_president_data(url)

# Display the first 10 rows of the DataFrame
display(df.head(10))
📝 Task:
Implement the scrape_president_data function, breaking it down into smaller functions for each step of the strategy if needed.
Ensure your code creates a pandas DataFrame with the scraped data.
Run your code and verify that it correctly displays the first 10 rows of the scraped data.
### ANSWER
# Use this and other code cells to write your code.
### INSTRUCTOR ANSWER
import requests
from bs4 import BeautifulSoup
import pandas as pd
def scrape_presidents_data(url):
    # Fetch the webpage content
    response = requests.get(url)
    soup = BeautifulSoup(response.content, "html.parser")

    # Find the table containing presidents' data
    table = soup.find("table")

    # Extract table headers
    headers = [header.text.strip() for header in table.find_all("th")]

    # Parse table rows and extract data
    data = []
    for row in table.find_all("tr")[1:]:  # Skip the header row
        cols = [col.text.strip() for col in row.find_all("td")]
        if cols:
            data.append(cols)

    # Create a pandas DataFrame
    df = pd.DataFrame(data, columns=headers)
    return df

url = "https://guides.loc.gov/presidents-portraits/chronological"
df = scrape_presidents_data(url)

# Display the first 10 rows of the DataFrame
display(df.head(10))
# # Optional: Save the data to a CSV file
# df.to_csv("us_presidents.csv", index=False)
# print("\nData saved to 'us_presidents.csv'")
Congratulations, you’re done!#
Submit this assignment by uploading your notebook to the course Desire2Learn web page. Go to the “In-Class Assignments” folder, find the appropriate submission link, and upload everything there. Make sure your name is on it!
© 2024 Michigan State University. This material was created for the Department of Computational Mathematics, Science and Engineering (CMSE) at Michigan State University.