Day 05: Pre-Class Assignment:
Regular expressions and web scraping#

From How to Make Scrape Art with Acrylic Paints (Step by Step)

Student Identification#

Please do not modify the structure or format of this cell. Follow these steps:

  1. Look for the “YOUR SUBMISSION” section below

  2. Replace [Your Name Here] with your name

  3. If you worked with groupmates, include their names after your name, separated by commas

  4. Do not modify any other part of this cell

Examples:

Single student:  John Smith
Group work:      John Smith, Jane Doe, Alex Johnson

YOUR SUBMISSION (edit this line only):

✅ [Your Name Here]

Note:

  • Keep the “✅” symbol at the start of your submission line

  • Use commas to separate multiple names

  • Spell names exactly as they appear in the course roster

  • This cell has been tagged as “names” - do not modify or remove this tag

Learning objectives#

At the end of the exercise, you should be able to:

  • Explain what web scraping is.

  • Use regex101 to practice using regular expressions.

  • Put together simple regular expressions to get information out of text documents.


Part 1. Set up an environment#

🗒️ Task: RealPython has an excellent article “Python Virtual Environments: A Primer”.

Read the section on Why Do You Need Virtual Environments? and provide the reasons in the cell below.

✏️ Answer:

DO THIS:

If you don’t have a cmse802 virtual environment already, follow the instructions in this link to create an environment with Python 3.11 called cmse802. Note this might take a while.

Then activate the environment and install the following packages:

  • ipykernel

  • matplotlib

  • autopep8

  • pydocstyle

  • bs4

  • pandas

  • requests

  • ipywidgets

Check required modules#

DO THIS:

  1. Open this notebook in VS Code.

  2. Open the Command Palette:

    • Windows/Linux: Ctrl+Shift+P

    • macOS: Cmd+Shift+P

  3. Type “Python: Select Interpreter” in the Command Palette and select it.

  4. VS Code will display available Python interpreters, including:

    • System-wide installations

    • Virtual environments in your project

    • Conda environments

  5. Choose the appropriate interpreter for your project (cmse802 in this case).

  6. The selected interpreter will appear in the bottom left corner of the VS Code window.

  7. VS Code will now use this interpreter for running Python code, debugging, and linting in your current workspace. If you don’t see it, close VS Code, open it back up, and try again.

  8. Run the following code block to import the modules needed for the in-class exercise. If any import raises a ModuleNotFoundError:

  • Open your terminal,

  • Activate the cmse802 environment,

  • Install the missing module via the terminal. You might have to search online for how to install specific packages.

import matplotlib, autopep8, pydocstyle, bs4, pandas, requests
from html.parser import HTMLParser  
from urllib import parse
from urllib.request import urlopen  
from urllib.request import urlretrieve
from ipywidgets import FloatProgress
from IPython.display import display
from glob import glob
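The import cell above stops at the first module that cannot be found. As a minimal sketch (not part of the original assignment), the check below reports every missing module in one pass. The helper name find_missing_modules and the REQUIRED list are hypothetical; the list simply mirrors the top-level modules imported above.

```python
from importlib.util import find_spec

# Top-level module names used by the import cell above (assumed list).
REQUIRED = ["matplotlib", "autopep8", "pydocstyle", "bs4",
            "pandas", "requests", "ipywidgets", "IPython"]


def find_missing_modules(names):
    """Return the names that cannot be located in the active environment."""
    return [name for name in names if find_spec(name) is None]


missing = find_missing_modules(REQUIRED)
if missing:
    print("Install these in the cmse802 environment:", ", ".join(missing))
else:
    print("All required modules were found.")
```

find_spec only locates a module without importing it, so the check is quick and has no side effects.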

DO THIS: Test pdoc3, pylint, and pydocstyle by running the following cell:

!pdoc3
!pylint
!pydocstyle

Part 2. Web scraping#

Web scraping is a powerful technique for collecting data from websites through automated processes. This method allows researchers to tap into the vast wealth of information available on the internet, much of which can be highly relevant to various research projects.

Web Scraping Tools#

Several tools are available for web scraping:

  • Regular expressions: Used to define search patterns within textual data.

  • Requests: A Python module for sending HTTP GET/POST requests to retrieve content.

  • Beautiful Soup: Useful for parsing HTML or XML documents into a readable format, making it easier to find specific elements on a webpage.

  • Selenium: Enables automation of various website interactions like clicking and scrolling.

  • Scrapy: A comprehensive web crawling and scraping framework for extracting structured data from web pages.
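To make the idea of parsing HTML concrete, here is a small sketch that pulls link targets out of a page. It uses the standard-library html.parser module as a stand-in for Beautiful Soup so that it runs without extra packages. The SAMPLE_HTML text and the LinkCollector class are hypothetical and only serve as an illustration; a real scraper would first download the page.

```python
from html.parser import HTMLParser

# Hypothetical page content; a real scraper would download this text first.
SAMPLE_HTML = """
<html><body>
  <h1>Contacts</h1>
  <a href="https://example.com/about">About</a>
  <a href="/people">People</a>
</body></html>
"""


class LinkCollector(HTMLParser):
    """Collect the href value of every <a> tag that is fed to the parser."""

    def __init__(self):
        super().__init__()
        self.links = []

    def handle_starttag(self, tag, attrs):
        # attrs is a list of (name, value) pairs for the current tag.
        if tag == "a":
            for name, value in attrs:
                if name == "href" and value is not None:
                    self.links.append(value)


collector = LinkCollector()
collector.feed(SAMPLE_HTML)
print(collector.links)  # ['https://example.com/about', '/people']
```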

🗒️ Question: Do you know of any other company that recently put their data behind a paywall?

✏️ Answer:


Part 3. Regular expression#

What is this#

DO THIS: If you have not used regex or forgot about much of it:

There are a lot of resources on regular expressions. Here are a few to check out if you’d like to learn more in the future.

Python re module#

For web scraping, or more generally for parsing information out of text documents, regular expressions (also referred to as regex or regexp) are frequently used. They can be thought of as a powerful language for pattern matching in text. In the following sections, we will practice using regular expressions with the re module.

The search function#

The Python module re provides support for regular expressions. A typical regular expression search in Python looks like:

match = re.search(pattern, text)

Where:

  1. pattern: is a string with the instructions of what to look for and how to look for it

  2. text: is a string on which the pattern matching will be performed

DO THIS: Run the following code:

import re

text  = 'Go green, go white!'
match = re.search('green', text )

print(type(match))
print(match)
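The Match object returned by a successful search carries more than the matched text. The sketch below reuses the same text and pattern to show a few of its methods, and what happens when nothing matches. The second search term, 'blue', is an arbitrary choice for illustration.

```python
import re

text = 'Go green, go white!'
match = re.search('green', text)

# A successful search returns a Match object describing the first hit.
print(match.group())  # the matched text: green
print(match.start())  # index where the match begins: 3
print(match.end())    # index just past the match: 8
print(text[match.start():match.end()])  # slicing recovers the same text

# An unsuccessful search returns None, so test before calling .group().
no_match = re.search('blue', text)
print(no_match is None)  # True
```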

🗒️ Coding Task: In the following code block, use the search function to search for ‘MSU’ in text and print out the search result.

### ANSWER 
### INSTRUCTOR ANSWER
match_msu = re.search('MSU', text)

print(match_msu)

Defining a pattern#

The power of regular expressions comes from the fact that three types of patterns can be represented in the expression:

  • Regular characters: e.g., ‘g’ and ‘M’

  • Metacharacters: characters with special meaning, for example:

    • [a-m] (any char a~m)

    • [^ab] (not matching a or b)

    • x|y (match either x or y)

    • \ (special sequence, see below)

    • . (any character)

    • ^ (start of the line)

    • $ (match at the end of the line)

    • * (0 or more occurrences)

    • + (1 or more occurrences)

    • ? (0 or 1 occurrence)

    • {2} (exactly 2 occurrences)

    • () (capturing group)

  • Special sequences: examples,

    • \d (any digit)

    • \s (white space)

    • \w (alphanumeric character or underscore)

    • \W (any character that is not alphanumeric or underscore)

Here is a good list of regex characters and other expressions.
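A quick way to get a feel for the quantifiers is to try each one against the same small set of strings. The sketch below is an illustration under assumed inputs: the sample strings and the four patterns are hypothetical and chosen only to make the differences visible.

```python
import re

# Hypothetical test strings: 'b' preceded by zero to three 'a' characters.
samples = ['b', 'ab', 'aab', 'aaab']

# Each pattern must match the whole string (re.fullmatch), so the
# quantifier alone decides which samples are accepted.
patterns = {
    'a*b': 'zero or more a',
    'a+b': 'one or more a',
    'a?b': 'zero or one a',
    'a{2}b': 'exactly two a',
}

results = {}
for pattern, meaning in patterns.items():
    accepted = [s for s in samples if re.fullmatch(pattern, s)]
    results[pattern] = accepted
    print(f'{pattern:6} ({meaning}): {accepted}')
```

re.fullmatch is used instead of re.search here because search would also accept a match that covers only part of the string, which would hide the differences between the quantifiers.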

🗒️ Task: Before running the next code block, what do you think will be printed out?

✏️ Answer:

text  = 'an example word-cat!!'
match = re.search(r'word-\w*', text)  # raw string keeps the backslash in \w literal

# If-statement after search() tests if it succeeded
if match:                      
    print('Found:', match.group())
else:
    print('Did not find')

🗒️ Task: The regex101 website is an excellent place to test regular expressions and to find out what a complicated regular expression means.

  • Go to regex101 and paste in the expression and text in the previous cell. No quotes!

  • Replace the * by another expression so that the result includes just one alphanumeric character after the dash (i.e., it should find word-c). In the code block below, include the working regular expression and print out the match group result.

### ANSWER
### INSTRUCTOR ANSWER
text  = 'an example word-cat!!'
match = re.search(r'word-\w?', text)

print(match.group())

The findall function#

To find all substrings that match the regular expression, you can use the findall function. The syntax is:

matches = re.findall(pattern, text)
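The sketch below contrasts search, findall, and the related finditer function on a short string. The sample text is hypothetical and not taken from the assignment.

```python
import re

# Hypothetical sample text, not taken from the assignment.
text = 'Rooms 101, 202 and 303 are open.'

# search stops at the first match; findall returns every match as a list.
first = re.search(r'\d+', text).group()
all_numbers = re.findall(r'\d+', text)
print(first)        # 101
print(all_numbers)  # ['101', '202', '303']

# finditer yields Match objects, which also carry the match positions.
positions = [(m.group(), m.start()) for m in re.finditer(r'\d+', text)]
print(positions)
```

findall is convenient when only the matched strings are needed; finditer is the better fit when the location of each match matters.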

DO THIS: Run the code block below where an emails string is defined.

emails = 'deep.purple@msu.edu, alice-b@google.com,monkey@msu.edu, sparty@msu.edu'

🗒️ Coding Task: Use the regex101 tool to experiment with regular expressions that find all occurrences of a pattern that starts with @ and ends after . (i.e., edu or com are not included). In the code block below, include both the non-working and the working regular expressions you have tried.

# put your code here
### ANSWER
### INSTRUCTOR ANSWER
pattern = r'@\w+\.'  # '\.' is a literal dot, so the match ends right after it
re.search(pattern, emails).group()

🗒️ Coding Task: In the code block below, use the working regular expression from the answer above to find the intended strings in emails with the findall function and print out all matches.

### ANSWER
### INSTRUCTOR ANSWER
print(re.findall(pattern, emails))

Compiling regular expression#

Since the search pattern in a regular expression is essentially a set of instructions (i.e., a program), you can compile it and reuse it:

compiled_pattern = re.compile(pattern)
compiled_pattern.findall(text)

🗒️ Question: What does the following pattern specify? If it is applied to the emails string, what will the output be?

compiled_pattern = re.compile(r'@[\w\d.]+\.+(com|org|edu)')

✏️ Answer:

🗒️ Coding Task: Write code that uses the emails object and finds all occurrences of the compiled_pattern specified above.

### ANSWER
### INSTRUCTOR ANSWER
compiled_pattern = re.compile(r'@[\w\d.]+\.+(com|org|edu)')
compiled_pattern.findall(emails)
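One detail of findall is easy to miss: when the pattern contains a capturing group, findall returns what the group matched rather than the whole match. The sketch below illustrates this with a simplified pattern and a hypothetical addresses string, chosen to differ from the emails string above. The non-capturing form (?:...) is standard re syntax.

```python
import re

# Hypothetical addresses, chosen to differ from the emails string above.
addresses = 'ann@cs.example.edu, bob@example.org'

# With one capturing group, findall returns only what the group matched.
capturing = re.compile(r'@[\w.]+\.(com|org|edu)')
print(capturing.findall(addresses))  # ['edu', 'org']

# (?:...) groups without capturing, so findall returns the whole match.
non_capturing = re.compile(r'@[\w.]+\.(?:com|org|edu)')
print(non_capturing.findall(addresses))  # ['@cs.example.edu', '@example.org']
```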

Part 4. Parsing a webpage with regex#

Here is an example using re to get information out of webpages.

🗒️ Coding Task: Take a look at the following code block and comment on what each line does.

### ANSWER

import re
import requests

url = "https://colbrydi.github.io/pages/contact.html"

source_code = requests.get(url)

plain_text = source_code.text

regex = re.compile(r"\(?\d{3}\)?\s?\d{3}[-.]\d{4}")

res = regex.findall(plain_text)

print(res)
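The same regular expression can be tried without a network connection. The sketch below applies it to a hypothetical snippet of page text; the phone numbers are made up, and sample_html stands in for the text that requests would return.

```python
import re

# Hypothetical page text standing in for requests.get(url).text.
sample_html = """
<p>Office: (517) 555-0100</p>
<p>Lab: 517 555.0199</p>
<p>Fax: 517-555-0123</p>
<p>Room 1234, building 567</p>
"""

# Optional "(", 3 digits, optional ")", optional whitespace,
# 3 digits, a dash or dot, then 4 digits.
phone_regex = re.compile(r"\(?\d{3}\)?\s?\d{3}[-.]\d{4}")

found = phone_regex.findall(sample_html)
print(found)  # ['(517) 555-0100', '517 555.0199']
```

The fax number is not found because the pattern allows only an optional closing parenthesis and optional whitespace between the first two digit groups, not a dash. Testing a pattern on small known inputs like this is a cheap way to discover such gaps before running it on a real page.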

🗒️ Coding Task: Use the type() function and figure out the types of objects that are returned when:

  • The get function of the requests module is called.

  • The text attribute of the source_code object is accessed.

### ANSWER
### INSTRUCTOR ANSWER
print(type(source_code))
print(type(plain_text))

Assignment wrap-up#

Please fill out the form that appears when you run the code below. You must completely fill this out in order to receive credit for the assignment! If running the cell doesn’t work in VS Code, copy the link in the src attribute and paste it into your browser. Make sure to sign in with your MSU email.

from IPython.display import HTML
HTML(
'''
<iframe 
    src="https://forms.office.com/r/AEc6LS6xKF" 
    width="800px" 
    height="600px" 
    frameborder="0" 
    marginheight="0" 
    marginwidth="0">
    Loading...
</iframe>
'''
)

Congratulations, you’re done with your pre-class assignment!#

Now, you just need to submit this assignment by uploading it to the course Desire2Learn web page for the appropriate pre-class submission folder. (Don’t forget to add your name in the first cell).

© Copyright 2024, Department of Computational Mathematics, Science and Engineering at Michigan State University