Day 05: Pre-Class Assignment:
Regular expressions and web scraping#
Image from “How to Make Scrape Art with Acrylic Paints (Step by Step)”
Student Identification#
Please do not modify the structure or format of this cell. Follow these steps:
Look for the “YOUR SUBMISSION” section below
Replace [Your Name Here] with your name
If you worked with groupmates, include their names after your name, separated by commas
Do not modify any other part of this cell
Examples:
Single student: John Smith
Group work: John Smith, Jane Doe, Alex Johnson
YOUR SUBMISSION (edit this line only):
✅ [Your Name Here]
Note:
Keep the “✅” symbol at the start of your submission line
Use commas to separate multiple names
Spell names exactly as they appear in the course roster
This cell has been tagged as “names” - do not modify or remove this tag
Learning objectives#
At the end of the exercise, you should be able to:
Explain what web scraping is.
Use regex101 to practice writing regular expressions.
Put together simple regular expressions to get information out of text documents.
Part 1. Setup an environment#
🗒️ Task: RealPython has an excellent article “Python Virtual Environments: A Primer”.
Read the section on Why Do You Need Virtual Environments? and provide the reasons in the cell below.
✏️ Answer:
✅ DO THIS:
If you don’t have a cmse802 virtual environment already, follow the instructions in this link to create an environment with Python 3.11 called cmse802. Note this might take a while.
Then activate the environment and install the following packages:
ipykernel
matplotlib
autopep8
pydocstyle
bs4
pandas
requests
ipywidgets
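Assuming you use conda, the setup steps above might look like the following in a terminal (environment name and package list taken from the instructions; adjust the commands if you use venv or another workflow):

```shell
# Create a Python 3.11 environment named cmse802 (conda assumed)
conda create -n cmse802 python=3.11

# Activate the environment
conda activate cmse802

# Install the required packages into the active environment
pip install ipykernel matplotlib autopep8 pydocstyle bs4 pandas requests ipywidgets
```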
Check required modules#
✅ DO THIS:
Open this notebook in VS Code.
Open the Command Palette:
Windows/Linux:
Ctrl+Shift+P
macOS:
Cmd+Shift+P
Type “Python: Select Interpreter” in the Command Palette and select it.
VS Code will display available Python interpreters, including:
System-wide installations
Virtual environments in your project
Conda environments
Choose the appropriate interpreter for your project (cmse802 in this case).
The selected interpreter will appear in the bottom left corner of the VS Code window.
VS Code will now use this interpreter for running Python code, debugging, and linting in your current workspace. If you don’t see it, close VS Code, open it back up, and try again.
Run the following code block to import modules needed for the in-class exercise. If any module leads to a ModuleNotFoundError:
Open your terminal,
Activate the cmse802 environment,
Install the missing module via the terminal. You might have to google how to install specific packages.
import matplotlib, autopep8, pydocstyle, bs4, pandas, requests
from html.parser import HTMLParser
from urllib import parse
from urllib.request import urlopen
from urllib.request import urlretrieve
from ipywidgets import FloatProgress
from IPython.display import display
from glob import glob
✅ DO THIS: Test pdoc3, pylint, and pydocstyle by running the following cell:
!pdoc3
!pylint
!pydocstyle
Part 2. Web scraping#
Web scraping is a powerful technique for collecting data from websites through automated processes. This method allows researchers to tap into the vast wealth of information available on the internet, much of which can be highly relevant to various research projects.
Legal Considerations#
When considering web scraping, it’s crucial to understand its legal implications. An informative article on the legality of web scraping can be found here. The article outlines three key considerations: types of data, website terms of service, and content accessibility.
Regarding data types, scraping publicly available data for non-commercial purposes is generally legal. However, it’s illegal to scrape private or personal data without consent, as well as copyrighted material that constitutes intellectual property.
It’s essential to review a website’s terms of service before scraping. Many sites outline their policies on data access, which can often be found in the robots.txt file. For example, you can check Google Scholar’s robots.txt file here.
Content behind logins or paywalls usually comes with terms of service that prohibit scraping. It’s important to note that “publicly available” refers to information anyone can access without special permissions or subscriptions, such as content on Wikipedia or Google search results. However, even publicly available content may be subject to copyright restrictions.
Web Scraping Tools#
Several tools are available for web scraping:
Regular expressions: Used to define search patterns within textual data.
Requests: A Python module that allows sending Get/Post requests to retrieve content.
Beautiful Soup: Useful for parsing HTML or XML documents into a readable format, making it easier to find specific elements on a webpage.
Selenium: Enables automation of various website interactions like clicking and scrolling.
Scrapy: A comprehensive web crawling and scraping framework for extracting structured data from web pages.
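As a small taste of how these tools work, here is a minimal sketch using Beautiful Soup (installed earlier as bs4) to parse a made-up HTML snippet. In a real project you would fetch the HTML with the requests module first; here the HTML string is hard-coded for illustration, so no network access is needed:

```python
from bs4 import BeautifulSoup

# A made-up HTML snippet standing in for a downloaded page
html = """
<html><head><title>Demo page</title></head>
<body>
  <a href="https://msu.edu">MSU</a>
  <a href="https://regex101.com">regex101</a>
</body></html>
"""

# Parse the HTML into a navigable tree
soup = BeautifulSoup(html, "html.parser")

title = soup.title.string                        # the page title text
links = [a["href"] for a in soup.find_all("a")]  # every link's href
print(title, links)
```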
🗒️ Question: Do you know of any other company that recently put their data behind a paywall?
✏️ Answer:
Part 3. Regular expression#
What is this?#
✅ DO THIS: If you have not used regex before, or have forgotten much of it, review the resources below.
There are a lot of resources on regular expressions. Here are a few to check out if you’d like to learn more in the future.
Python re
module#
For web scraping, or generally for parsing information out of text documents, regular expressions (also referred to as regex or regexp) are frequently used. A regular expression can be thought of as a powerful language for pattern matching in text. In the following sections, we will practice using regular expressions with the re
module.
The search
function#
The Python module re provides support for regular expressions. A typical regular expression search in Python looks like
match = re.search(pattern, text)
Where:
pattern: a string with the instructions of what to look for and how to look for it
text: a string on which the pattern matching will be performed
✅ DO THIS: Run the following code:
import re
text = 'Go green, go white!'
match = re.search('green', text)
print(type(match))
print(match)
🗒️ Coding Task: In the following code block, use the search
function to search for ‘MSU’ in text
and print out the search result.
### ANSWER
### INSTRUCTOR ANSWER
match_msu = re.search('MSU', text)
print(match_msu)
Defining a pattern#
The power of regular expressions comes from the fact that three types of patterns can be represented in the expression:
Regular characters: e.g., ‘g’ and ‘M’
Metacharacters: character with special meaning, examples:
[a-m] (any char a through m)
[^ab] (not matching a or b)
x|y (match either x or y)
\ (special sequence, see below)
. (any character)
^ (start of the line)
$ (match at the end of the line)
* (0 or more occurrences)
+ (1 or more occurrences)
? (0 or 1 occurrence)
{2} (exactly 2 occurrences)
() (capturing group)
Special sequences: examples,
\d (any digit)
\s (white space)
\w (alphanumeric)
\W (non-alphanumeric)
Here is a good list of regex characters and other expressions.
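The metacharacters and special sequences above can be combined; here is a small self-contained sketch (the sample strings are made up for illustration):

```python
import re

text = "cat bat rat cot"

# [bc]at : a character class, like [a-m] but listing just b and c
class_matches = re.findall(r"[bc]at", text)

# ^ and $ anchor a match to the start and end of the string
starts_with_cat = bool(re.search(r"^cat", text))
ends_with_cot = bool(re.search(r"cot$", text))

# \d combined with + (1 or more occurrences) pulls out runs of digits
digits = re.findall(r"\d+", "room 101, floor 3")

print(class_matches, starts_with_cat, ends_with_cot, digits)
```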
🗒️ Task: Before running the next code block, what do you think will be printed out?
✏️ Answer:
text = 'an example word-cat!!'
match = re.search(r'word-\w*', text)
# If-statement after search() tests if it succeeded
if match:
    print('Found:', match.group())
else:
    print('Did not find')
🗒️ Task: The regex101 website is an excellent place to test out regular expressions AND to find out what a complicated regular expression means.
Go to regex101 and paste in the expression and text from the previous cell. No quotes!
Replace the * with another expression so that the result includes just one alphanumeric character after the dash (i.e., it should find word-c). In the code block below, include the working regular expression and print out the match group result.
### ANSWER
### INSTRUCTOR ANSWER
text = 'an example word-cat!!'
match = re.search(r'word-\w?', text)
match.group()
The findall
function#
To find all texts that match the regular expression, you can use the findall
function. The syntax is:
matches = re.findall(pattern, text)
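For example, here is a small sketch of findall on a made-up string (re.IGNORECASE is an optional flag that makes the pattern case-insensitive):

```python
import re

quote = "Go green, go white!"

# findall returns every non-overlapping match as a list of strings
matches = re.findall(r"go", quote, re.IGNORECASE)
print(matches)  # ['Go', 'go']
```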
✅ DO THIS: Run the code block below where an emails
string is defined.
emails = 'deep.purple@msu.edu, alice-b@google.com,monkey@msu.edu, sparty@msu.edu'
🗒️ Coding Task: Use the regex101 tool to experiment with regular expressions that find all occurrences of a pattern that starts with @ and ends after . (i.e., edu or com are not included). In the code block below, include both non-working and working regular expressions you have tried.
# put your codes here
### ANSWER
### INSTRUCTOR ANSWER
pattern = r'@\w*'
re.search(pattern, emails).group()
🗒️ Coding Task: In the code block below, use the working regular expression from the answer above to find the intended strings in emails
with the findall
function and print out all matches.
### ANSWER
### INSTRUCTOR ANSWER
re.findall(pattern, emails)
Compiling regular expression#
Since the search pattern in a regular expression is essentially a set of instructions (i.e., a program), you can compile it and reuse it:
compiled_pattern = re.compile(pattern)
compiled_pattern.findall(text)
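A quick sketch of why compiling is useful: the compiled object can be reused on many different strings without re-specifying the pattern (the sample strings below are made up):

```python
import re

# Compile once...
digit_pattern = re.compile(r"\d+")

# ...then reuse the compiled object on multiple strings
first = digit_pattern.findall("Room 101")
second = digit_pattern.findall("3 cats, 14 dogs")
print(first, second)
```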
🗒️ Question: What does the following pattern specify? If it is applied to the emails
string, what will the output be?
compiled_pattern = re.compile(r'@[\w\d.]+\.+(com|org|edu)')
✏️ Answer:
🗒️ Coding Task: Write code that uses the emails object and finds all occurrences of the compiled_pattern specified above.
### ANSWER
### INSTRUCTOR ANSWER
compiled_pattern = re.compile(r'@[\w\d.]+\.+(com|org|edu)')
re.findall(compiled_pattern, emails)
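One subtlety worth noticing in the answer above: when a pattern contains a capturing group, findall returns only the group’s contents, not the whole match. A non-capturing group (?:...) keeps the full match instead. A sketch with a simplified variant of the pattern (single \., and \d dropped since \w already covers digits):

```python
import re

emails = 'deep.purple@msu.edu, alice-b@google.com,monkey@msu.edu, sparty@msu.edu'

# With a capturing group, findall returns only the captured text
grouped = re.findall(r'@[\w.]+\.(com|org|edu)', emails)

# With a non-capturing group, findall returns the entire match
full = re.findall(r'@[\w.]+\.(?:com|org|edu)', emails)

print(grouped)
print(full)
```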
Part 4. Parsing webpage with regex#
Here is an example using re
to get information out of webpages.
🗒️ Coding Task: Take a look at the following code block and comment on what each line does.
### ANSWER
import re
import requests

# Page to scrape
url = "https://colbrydi.github.io/pages/contact.html"
# Send a GET request to retrieve the page
source_code = requests.get(url)
# Extract the response body as a plain-text string
plain_text = source_code.text
# Compile a pattern for phone numbers such as (123) 456-7890 or 123 456.7890
regex = re.compile(r"\(?\d{3}\)?\s?\d{3}[-.]\d{4}")
# Find every phone number in the page text
res = regex.findall(plain_text)
print(res)
🗒️ Coding Task: Use the type() function to figure out the types of the objects that are returned when:
The get function of the requests module is called.
The text attribute of the source_code object is accessed.
### ANSWER
### INSTRUCTOR ANSWER
print(type(source_code))
print(type(plain_text))
Assignment wrap-up#
Please fill out the form that appears when you run the code below. You must completely fill this out in order to receive credit for the assignment! If running the cell doesn’t work in VS Code, copy the link from the iframe’s src
and paste it into your browser. Make sure to sign in with your MSU email.
from IPython.display import HTML
HTML(
'''
<iframe
src="https://forms.office.com/r/AEc6LS6xKF"
width="800px"
height="600px"
frameborder="0"
marginheight="0"
marginwidth="0">
Loading...
</iframe>
'''
)
Congratulations, you’re done with your pre-class assignment!#
Now, you just need to submit this assignment by uploading it to the course Desire2Learn web page for the appropriate pre-class submission folder. (Don’t forget to add your name in the first cell).
© Copyright 2024, Department of Computational Mathematics, Science and Engineering at Michigan State University