{
 "cells": [
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "# Day 05: ICA: Web Scraping\n",
    "\n",
    "<img src=\"https://assets3.cbsnewsstatic.com/hub/i/r/2016/11/05/53a023c0-5967-40bd-8226-a32a439f3d00/thumbnail/620x308/8ffae95c512029a1e08507fee9391780/gallery-steve-penley-44-presidents-610.jpg\" width=600></img>\n",
    "\n",
    "Chief Executives by Steve Penley"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "### Student Identification\n",
    "\n",
    "Please do not modify the structure or format of this cell. Follow these steps:\n",
    "\n",
    "1. Look for the \"YOUR SUBMISSION\" section below\n",
    "2. Replace `[Your Name Here]` with your name\n",
    "3. If you worked with groupmates, include their names after your name, separated by commas\n",
    "4. Do not modify any other part of this cell\n",
    "\n",
    "Examples:\n",
    "```\n",
    "Single student:  John Smith\n",
    "Group work:      John Smith, Jane Doe, Alex Johnson\n",
    "```\n",
    "\n",
    "YOUR SUBMISSION (edit this line only):\n",
    "<p style='text-align: left;'> &#9989; [Your Name Here]\n",
    "\n",
    "Note:\n",
    "- Keep the \"&#9989;\" symbol at the start of your submission line\n",
    "- Use commas to separate multiple names\n",
    "- Spell names exactly as they appear in the course roster\n",
    "- This cell has been tagged as \"names\" - do not modify or remove this tag"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "## Table of Contents\n",
    "\n",
    "- [Learning objectives](#learning-objectives)\n",
    "- [Project Goal](#project-goal)\n",
    "- [Part 1: Understanding HTML Structure for Web Scraping](#part-1:-understanding-html-structure-for-web-scraping)\n",
    "- [Part 2. Using BeautifulSoup to Explore HTML](#part-2.-using-beautifulsoup-to-explore-html)\n",
    "- [Part 3. Implementing Your Web Scraping Code](#part-3.-implementing-your-web-scraping-code)\n",
    "- [Congratulations, you're done!](#congratulations,-you're-done!)"
   ]
  },
  {
   "attachments": {},
   "cell_type": "markdown",
   "metadata": {
    "tags": []
   },
   "source": [
    "## Learning objectives\n",
    "\n",
    "By the end of this assignment you should be able to:\n",
    "\n",
    "- Get data out of webpage using `BeautifulSoup`.\n",
    "- Explain the major code elements required to deal with a webpage.\n",
    "- Lint, format, and auto-document your code.\n",
    "\n",
    "## Project Goal\n",
    "\n",
    "The goal of today is to scrap the library of Congress webpage to collect data on the U.S. Presidents and put it in a `pandas.DataFrame` for future data analysis. The link we will scrape is [this](https://www.loc.gov/rr/print/list/057_chron.html). \n",
    "\n",
    "The original idea of this ICA comes from [this post](https://blog.exploratory.io/scraping-us-presidents-list-from-web-and-transforming-it-to-be-useful-fff534470bb6). \n",
    "\n"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "----\n",
    "## Part 1: Understanding HTML Structure for Web Scraping\n",
    "\n",
    "In web scraping, the information we want is typically defined within specific [HTML tags](https://www.w3schools.com/TAGS/default.ASP). As with any task in computational analysis, the first step is always to get to know your data.\n",
    "\n",
    "### 🔍 Exploring the HTML Structure\n",
    "\n",
    "1. Open [this page](https://www.loc.gov/rr/print/list/057_chron.html) in your browser.\n",
    "2. Right-click on the page and select `Inspect` (or `Inspect Element` in some browsers).\n",
    "3. In the developer tools panel that opens, you'll see the HTML structure of the page.\n",
    "4. As you hover over elements in the inspector, corresponding parts of the webpage will be highlighted.\n",
    "\n",
    "🗒️ **Task:** Working with your groupmates, answer the following questions:\n",
    "\n",
    "1. What is the title of the page? \n",
    "   *Hint:* Look for the `<title>` tag. Try `Ctrl` + `F` \n",
    "2. Which tag(s) contain the main content of the page?\n",
    "   *Hint:* Look for tags like `<main>`, `<article>`, or `<body>` with specific classes or IDs.\n",
    "3. How is the table of presidents structured? \n",
    "   *Hint:* Look for the `<table>` tag and its children (`<tr>`, `<th>`, `<td>`).\n",
    "4. Are there any unique identifiers (like classes or IDs) that could help us locate the table?"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "✏️ **Answer:**"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "✏️ **Instructor Answer:**\n",
    "\n",
    "1. The title of the page is \n",
    "\n",
    "``` html \n",
    "<title>\"Chronological List of Presidents, First Spouses, and Vice Presidents of the United States - Presidents of the United States: Selected Images - Research Guides at Library of Congress\"</title>\n",
    "```\n",
    "\n",
    "2. `<body>` contains the main content of the page.\n",
    "\n",
    "3. The table is in the division `<div class=\"table-responsive\"><table class=\"table table-bordered\">`. The table is divided into a header and body, `<thead>` and `<tbody>` respectively.\n",
    "- `<div>` stands for a division in HTML.\n",
    "- `<tr>` stands for table row.\n",
    "- `<td>` stands for table data.\n",
    "- `<th>` stands for table header.\n",
    "\n",
    "1. The unique identifier is `<table class = \"table table-bordered\">`"
   ]
  },
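  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "The table anatomy in the answer above can also be checked offline. Below is a minimal sketch, assuming a hand-written HTML fragment: the `sample_html` string and its placeholder rows are illustrative and are not copied from the Library of Congress page. It only mimics the `div.table-responsive` and `table.table-bordered` nesting.\n",
    "\n",
    "```python\n",
    "from bs4 import BeautifulSoup\n",
    "\n",
    "# Hypothetical fragment that mirrors the nesting seen in the inspector\n",
    "sample_html = '''\n",
    "<div class='table-responsive'>\n",
    "  <table class='table table-bordered'>\n",
    "    <thead>\n",
    "      <tr><th>YEAR</th><th>PRESIDENT</th></tr>\n",
    "    </thead>\n",
    "    <tbody>\n",
    "      <tr><td>Term A</td><td>President A</td></tr>\n",
    "      <tr><td>Term B</td><td>President B</td></tr>\n",
    "    </tbody>\n",
    "  </table>\n",
    "</div>\n",
    "'''\n",
    "\n",
    "sample_soup = BeautifulSoup(sample_html, 'html.parser')\n",
    "\n",
    "# Locate the table through one of its classes\n",
    "sample_table = sample_soup.find('table', class_='table-bordered')\n",
    "\n",
    "# Header cells live in <thead>, data rows live in <tbody>\n",
    "header_cells = [th.get_text(strip=True) for th in sample_table.find('thead').find_all('th')]\n",
    "body_rows = sample_table.find('tbody').find_all('tr')\n",
    "\n",
    "print(header_cells)    # ['YEAR', 'PRESIDENT']\n",
    "print(len(body_rows))  # 2\n",
    "```\n",
    "\n",
    "Matching on a single class such as `table-bordered` works because `class_` is compared against each class in the element's class list."
   ]
  },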
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "## Part 2. Using BeautifulSoup to Explore HTML\n",
    "\n",
    "Now that we've visually inspected the HTML, let's use BeautifulSoup to programmatically explore it. Run the following code"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 5,
   "metadata": {},
   "outputs": [],
   "source": [
    "import requests\n",
    "from bs4 import BeautifulSoup\n",
    "\n",
    "url = \"https://www.loc.gov/rr/print/list/057_chron.html\"\n",
    "response = requests.get(url)\n",
    "soup = BeautifulSoup(response.content, 'html.parser')"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "### **Exploring the BeautifulSoup Object**\n",
    "\n",
    "Now that we've created our `soup` object, let's explore what it contains and how we can interact with it. This exploration will help you understand the structure of the parsed HTML and how to extract information from it.\n",
    "\n",
    "🖥️ **Task:** Now, use Python commands to investigate `soup` object. Here are some suggestions for exploration:\n",
    "\n",
    "1. Determine the type of the `soup` object. What class is it?\n",
    "2. Use the `help()` function to view the documentation for the BeautifulSoup object. What methods look useful for our task of extracting table data?\n",
    "3. Try to display the entire `soup` object. What happens? Why do you think this occurs?\n",
    "4. Can you find a way to display just a portion of the `soup` object? Hint: Think about how you might convert it to a string and slice it.\n",
    "5. Explore these commonly used BeautifulSoup methods. For each one, try to understand what it does and how it might be useful:\n",
    "   - `find()`\n",
    "   - `find_all()`\n",
    "   - `select()`\n",
    "   - `get_text()`\n",
    "6. Can you use any of these methods to:\n",
    "   - Find the title of the webpage?\n",
    "   - Locate the main table in the document?\n",
    "   - Count how many table rows (`<tr>` tags) are in the document?\n",
    "   - Extract the text from the first table row?\n",
    "7. Challenge: Can you extract and print the name and term of the first president in the table?\n",
    "\n",
    "🤔 **Discussion Questions:**\n",
    "1. What challenges did you encounter when exploring the `soup` object?\n",
    "2. How might the methods you discovered be useful for extracting the presidential data?\n",
    "3. Based on your exploration, what strategy would you use to extract all the data from the table?\n",
    "\n",
    "📚 **Resources:**\n",
    "- [BeautifulSoup Documentation](https://beautiful-soup-4.readthedocs.io/en/latest/)\n",
    "\n",
    "Remember, web scraping often requires experimentation. Don't hesitate to try different approaches, and use print statements to understand what each method returns."
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 23,
   "metadata": {},
   "outputs": [],
   "source": [
    "### ANSWER\n",
    "\n",
    "# Put your code here"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {
    "execution": {
     "iopub.execute_input": "2024-09-07T21:07:21.192829Z",
     "iopub.status.busy": "2024-09-07T21:07:21.192466Z",
     "iopub.status.idle": "2024-09-07T21:07:21.196001Z",
     "shell.execute_reply": "2024-09-07T21:07:21.195112Z",
     "shell.execute_reply.started": "2024-09-07T21:07:21.192814Z"
    }
   },
   "source": [
    "✏️ **Answer:** Put your strategy here. "
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 22,
   "metadata": {},
   "outputs": [
    {
     "name": "stdout",
     "output_type": "stream",
     "text": [
      "1. Type of soup object: <class 'bs4.BeautifulSoup'>\n",
      "\n",
      "3. & 4. First 500 characters of soup:\n",
      "<!DOCTYPE html>\n",
      "<html lang=\"en\"><head><meta content=\"IE=Edge\" http-equiv=\"X-UA-Compatible\"/><meta content=\"text/html; charset=utf-8\" http-equiv=\"Content-Type\"/><title>Chronological List of Presidents, First Spouses, and Vice Presidents of the United States - Presidents of the United States: Selected Images - Research Guides at Library of Congress</title><meta content=\"width=device-width, initial-scale=1.0\" name=\"viewport\"/><meta content=\"noarchive\" name=\"robots\"><link href=\"https://www.loc.gov/f\n",
      "\n",
      "5. & 6. Webpage title:\n",
      "Chronological List of Presidents, First Spouses, and Vice Presidents of the United States - Presidents of the United States: Selected Images - Research Guides at Library of Congress\n",
      "\n",
      "Main table found: True\n",
      "\n",
      "Number of table rows: 69\n",
      "\n",
      "Text from first table row:\n",
      "YEARPRESIDENTFIRST SPOUSEVICE PRESIDENT\n",
      "\n",
      "7. First president:\n",
      "Name: George Washington\n",
      "Term: Martha Washington\n",
      "\n",
      "Demonstration of select() method:\n",
      "Table headers: ['YEAR', 'PRESIDENT', 'FIRST SPOUSE', 'VICE PRESIDENT']\n"
     ]
    }
   ],
   "source": [
    "### INSTRUCTOR ANSWER\n",
    "\n",
    "# 1. Determine the type of the soup object\n",
    "print(\"1. Type of soup object:\", type(soup))\n",
    "\n",
    "# 2. Use help() function (this would typically be done in an interactive environment)\n",
    "# help(soup)\n",
    "\n",
    "# 3. & 4. Display a portion of the soup object\n",
    "print(\"\\n3. & 4. First 500 characters of soup:\")\n",
    "print(str(soup)[:500])\n",
    "\n",
    "# 5. & 6. Explore BeautifulSoup methods\n",
    "\n",
    "# Find the title of the webpage\n",
    "print(\"\\n5. & 6. Webpage title:\")\n",
    "print(soup.find('title').get_text())\n",
    "\n",
    "# Locate the main table\n",
    "main_table = soup.find('table')\n",
    "print(\"\\nMain table found:\", main_table is not None)\n",
    "\n",
    "# Count table rows\n",
    "rows = soup.find_all('tr')\n",
    "print(\"\\nNumber of table rows:\", len(rows))\n",
    "\n",
    "# Extract text from the first table row\n",
    "first_row = rows[0]\n",
    "print(\"\\nText from first table row:\")\n",
    "print(first_row.get_text(strip=True))\n",
    "\n",
    "# 7. Challenge: Extract name and term of the first president\n",
    "# Note: We skip the header row (index 0) to get the first data row\n",
    "first_president_row = rows[1]\n",
    "cells = first_president_row.find_all('td')\n",
    "if len(cells) >= 3:  # Ensure we have enough cells\n",
    "    name = cells[1].get_text(strip=True)\n",
    "    term = cells[2].get_text(strip=True)\n",
    "    print(\"\\n7. First president:\")\n",
    "    print(f\"Name: {name}\")\n",
    "    print(f\"Term: {term}\")\n",
    "else:\n",
    "    print(\"\\n7. Couldn't extract first president's data. Check the table structure.\")\n",
    "\n",
    "# Demonstration of select() method\n",
    "print(\"\\nDemonstration of select() method:\")\n",
    "headers = soup.select('table th')\n",
    "print(\"Table headers:\", [header.get_text(strip=True) for header in headers])"
   ]
  },
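  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "For a side-by-side view of what each method returns, here is a small sketch on an in-memory HTML string, so it runs without a network connection. The `demo_html` content and all `demo_*` names are made up for illustration.\n",
    "\n",
    "```python\n",
    "from bs4 import BeautifulSoup\n",
    "\n",
    "# Hypothetical page used only to compare the methods\n",
    "demo_html = '''\n",
    "<html><head><title>Demo page</title></head>\n",
    "<body>\n",
    "<table>\n",
    "  <tr><th>YEAR</th><th>PRESIDENT</th></tr>\n",
    "  <tr><td>Term A</td><td>President A</td></tr>\n",
    "</table>\n",
    "</body></html>\n",
    "'''\n",
    "demo_soup = BeautifulSoup(demo_html, 'html.parser')\n",
    "\n",
    "demo_first = demo_soup.find('tr')            # first matching tag only\n",
    "demo_rows = demo_soup.find_all('tr')         # every matching tag\n",
    "demo_cells = demo_soup.select('table td')    # matches a CSS selector\n",
    "demo_missing = demo_soup.find('article')     # None when nothing matches\n",
    "\n",
    "print(type(demo_first).__name__)                        # Tag\n",
    "print(len(demo_rows))                                   # 2\n",
    "print([c.get_text(strip=True) for c in demo_cells])     # ['Term A', 'President A']\n",
    "print(demo_missing is None)                             # True\n",
    "```\n",
    "\n",
    "Because `find()` returns `None` when there is no match, checking its result before calling `get_text()` avoids an `AttributeError`."
   ]
  },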
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "## Part 3. Implementing Your Web Scraping Code\n",
    "\n",
    "Now that you've explored the structure of the webpage and experimented with BeautifulSoup, it's time to write code to scrape the data. If you weren't able to formulate the strategy you can follow the one below.\n",
    "\n",
    "### 📋 Web Scraping Strategy Outline:\n",
    "\n",
    "1. **Locating the Table**: \n",
    "   - How will you find the specific table containing the presidential data?\n",
    "   - What BeautifulSoup method(s) will you use?\n",
    "\n",
    "2. **Extracting Table Rows**: \n",
    "   - How will you separate the header row from the data rows?\n",
    "   - What's your plan for iterating through the rows?\n",
    "\n",
    "3. **Parsing Row Data**: \n",
    "   - How will you extract individual pieces of information (e.g., name, term) from each row?\n",
    "   - Are there any data cleaning steps you need to consider?\n",
    "\n",
    "4. **Data Storage**: \n",
    "   - In what format will you store the extracted data? (e.g., list of dictionaries, pandas DataFrame)\n",
    "   - How will you handle any potential missing data?\n",
    "\n",
    "5. **(Optional, time permitting) Error Handling**: \n",
    "   - What potential issues might arise during the scraping process?\n",
    "   - How will you handle these to make your code more robust?\n",
    "\n",
    "### 🖥️ Implementation Task:\n",
    "\n",
    "Based on the strategy above, implement your web scraping code. Here's a suggested structure for your implementation:\n",
    "\n",
    "```python\n",
    "import requests\n",
    "from bs4 import BeautifulSoup\n",
    "import pandas as pd\n",
    "\n",
    "def scrape_president_data(url):\n",
    "    # Your code here to scrape the data\n",
    "    # Remember to break down your code into smaller functions for each step\n",
    "    pass\n",
    "\n",
    "url = \"https://www.loc.gov/rr/print/list/057_chron.html\"\n",
    "df = scrape_president_data(url)\n",
    "    \n",
    "# Display the first 10 rows of the DataFrame\n",
    "display(df.head(10))\n",
    "\n",
    "```\n",
    "\n",
    "📝 **Task:** \n",
    "1. Implement the scrape_president_data function, breaking it down into smaller functions for each step of the strategy if needed.\n",
    "2. Ensure your code creates a pandas DataFrame with the scraped data.\n",
    "3. Run your code and verify that it correctly displays the first 10 rows of the scraped data."
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "### ANSWER \n",
    "\n",
    "# Use this and other code cells to write your code."
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "### INSTRUCTOR ANSWER\n",
    "\n",
    "import requests\n",
    "from bs4 import BeautifulSoup\n",
    "import pandas as pd\n",
    "\n",
    "def scrape_presidents_data(url):\n",
    "    # Fetch the webpage content\n",
    "    response = requests.get(url)\n",
    "    soup = BeautifulSoup(response.content, \"html.parser\")\n",
    "\n",
    "    # Find the table containing presidents' data\n",
    "    table = soup.find(\"table\")\n",
    "\n",
    "    # Extract table headers\n",
    "    headers = [header.text.strip() for header in table.find_all(\"th\")]\n",
    "\n",
    "    # Parse table rows and extract data\n",
    "    data = []\n",
    "    for row in table.find_all(\"tr\")[1:]:  # Skip the header row\n",
    "        cols = [col.text.strip() for col in row.find_all(\"td\")]\n",
    "        if cols:\n",
    "            data.append(cols)\n",
    "\n",
    "    # Create a pandas DataFrame\n",
    "    df = pd.DataFrame(data, columns=headers)\n",
    "    \n",
    "    return df\n",
    "\n",
    "url = \"https://guides.loc.gov/presidents-portraits/chronological\"\n",
    "df = scrape_presidents_data(url)\n",
    "\n",
    "# Display the first 10 rows of the DataFrame\n",
    "display(df.head(10))\n",
    "\n",
    "# # Optional: Save the data to a CSV file\n",
    "# df.to_csv(\"us_presidents.csv\", index=False)\n",
    "# print(\"\\nData saved to 'us_presidents.csv'\")\n"
   ]
  },
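  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "Steps 4 and 5 of the strategy outline ask about missing data and error handling. One possible approach is sketched below, under stated assumptions: the helper names `fetch_html` and `table_to_dataframe`, the 10-second timeout, and the `sample` HTML are all hypothetical choices for illustration, not part of the assignment. Short rows are padded with `None` so that every record lines up with the header.\n",
    "\n",
    "```python\n",
    "import requests\n",
    "import pandas as pd\n",
    "from bs4 import BeautifulSoup\n",
    "\n",
    "\n",
    "def fetch_html(url, timeout=10):\n",
    "    '''Return the page body as text, or None if the request fails.'''\n",
    "    try:\n",
    "        response = requests.get(url, timeout=timeout)\n",
    "        response.raise_for_status()  # raise on 4xx/5xx status codes\n",
    "    except requests.RequestException as err:\n",
    "        print(f'Request failed: {err}')\n",
    "        return None\n",
    "    return response.text\n",
    "\n",
    "\n",
    "def table_to_dataframe(html):\n",
    "    '''Parse the first <table> in html into a DataFrame, padding short rows.'''\n",
    "    soup = BeautifulSoup(html, 'html.parser')\n",
    "    table = soup.find('table')\n",
    "    if table is None:\n",
    "        raise ValueError('No <table> element found')\n",
    "\n",
    "    headers = [th.get_text(strip=True) for th in table.find_all('th')]\n",
    "    records = []\n",
    "    for row in table.find_all('tr'):\n",
    "        cells = [td.get_text(strip=True) for td in row.find_all('td')]\n",
    "        if not cells:\n",
    "            continue  # header row or empty row\n",
    "        # Pad short rows with None and drop extras so each record fits the header\n",
    "        cells = (cells + [None] * len(headers))[:len(headers)]\n",
    "        records.append(dict(zip(headers, cells)))\n",
    "    return pd.DataFrame(records, columns=headers)\n",
    "\n",
    "\n",
    "# Hypothetical input: the second data row is missing its last cell\n",
    "sample = '''\n",
    "<table>\n",
    "  <tr><th>YEAR</th><th>PRESIDENT</th><th>VICE PRESIDENT</th></tr>\n",
    "  <tr><td>Term A</td><td>President A</td><td>Vice President A</td></tr>\n",
    "  <tr><td>Term B</td><td>President B</td></tr>\n",
    "</table>\n",
    "'''\n",
    "sample_df = table_to_dataframe(sample)\n",
    "print(sample_df.shape)                               # (2, 3)\n",
    "print(sample_df['VICE PRESIDENT'].isna().tolist())   # [False, True]\n",
    "```\n",
    "\n",
    "`fetch_html` is not called here, so the sketch runs without network access. In the notebook you could pass its result to `table_to_dataframe` after checking that it is not `None`."
   ]
  },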
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "## Congratulations, you're done!\n",
    "\n",
    "Submit this assignment by uploading your notebook to the course Desire2Learn web page.  Go to the \"In-Class Assignments\" folder, find the appropriate submission link, and upload everything there. Make sure your name is on it!"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "---\n",
    "\n",
    "© 2024 Michigan State University. This material was created for the Department of Computational Mathematics, Science and Engineering (CMSE) at Michigan State University."
   ]
  }
 ],
 "metadata": {
  "kernelspec": {
   "display_name": "cmse802",
   "language": "python",
   "name": "python3"
  },
  "language_info": {
   "codemirror_mode": {
    "name": "ipython",
    "version": 3
   },
   "file_extension": ".py",
   "mimetype": "text/x-python",
   "name": "python",
   "nbconvert_exporter": "python",
   "pygments_lexer": "ipython3",
   "version": "3.11.11"
  },
  "widgets": {
   "application/vnd.jupyter.widget-state+json": {
    "state": {},
    "version_major": 2,
    "version_minor": 0
   }
  }
 },
 "nbformat": 4,
 "nbformat_minor": 4
}