Day 05: Pre-Class Assignment:
Regular expressions and web scraping#
Image from “How to Make Scrape Art with Acrylic Paints (Step by Step)”
Student Identification#
Please do not modify the structure or format of this cell. Follow these steps:
Look for the “YOUR SUBMISSION” section below
Replace [Your Name Here] with your name
If you worked with groupmates, include their names after your name, separated by commas
Do not modify any other part of this cell
Examples:
Single student: John Smith
Group work: John Smith, Jane Doe, Alex Johnson
YOUR SUBMISSION (edit this line only):
✅ [Your Name Here]
Note:
Keep the “✅” symbol at the start of your submission line
Use commas to separate multiple names
Spell names exactly as they appear in the course roster
This cell has been tagged as “names” - do not modify or remove this tag
Learning objectives#
At the end of the exercise, you should be able to:
Explain what web scraping is.
Use regex101 to practice writing regular expressions.
Put together simple regular expressions to get information out of text documents.
Part 1. Setup an environment#
🗒️ Task: RealPython has an excellent article “Python Virtual Environments: A Primer”.
Read the section on Why Do You Need Virtual Environments? and provide the reasons in the cell below.
✏️ Answer:
✅ DO THIS:
If you don’t have a cmse802 virtual environment already, follow the instructions in this link to create an environment with Python 3.11 called cmse802. Note this might take a while.
Then activate the environment and install the following packages:
ipykernel
matplotlib
autopep8
pydocstyle
bs4
pandas
requests
ipywidgets
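Assuming you use conda, the setup steps above might look like the following in a terminal (environment name and package list taken from the instructions; adjust the commands if you use venv or another workflow):

```shell
# Create a Python 3.11 environment named cmse802 (conda assumed)
conda create -n cmse802 python=3.11

# Activate the environment
conda activate cmse802

# Install the required packages into the active environment
pip install ipykernel matplotlib autopep8 pydocstyle bs4 pandas requests ipywidgets
```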
Check required modules#
✅ DO THIS:
Open this notebook in VS Code.
Open the Command Palette:
Windows/Linux:
Ctrl+Shift+P
macOS:
Cmd+Shift+P
Type “Python: Select Interpreter” in the Command Palette and select it.
VS Code will display available Python interpreters, including:
System-wide installations
Virtual environments in your project
Conda environments
Choose the appropriate interpreter for your project (cmse802 in this case).
The selected interpreter will appear in the bottom left corner of the VS Code window.
VS Code will now use this interpreter for running Python code, debugging, and linting in your current workspace. If you don’t see it, close VS Code, open it back up, and try again.
Run the following code block to import modules needed for the in-class exercise. If any module leads to a ModuleNotFoundError:
Open your terminal,
Activate the cmse802 environment,
Install the missing module via the terminal. You might have to google how to install specific packages.
import matplotlib, autopep8, pydocstyle, bs4, pandas, requests
from html.parser import HTMLParser
from urllib import parse
from urllib.request import urlopen
from urllib.request import urlretrieve
from ipywidgets import FloatProgress
from IPython.display import display
from glob import glob
✅ DO THIS: Test pdoc3, pylint, and pydocstyle by running the following cell:
!pdoc3
!pylint
!pydocstyle
Part 2. Web scraping#
Web scraping is a powerful technique for collecting data from websites through automated processes. This method allows researchers to tap into the vast wealth of information available on the internet, much of which can be highly relevant to various research projects.
Legal Considerations#
When considering web scraping, it’s crucial to understand its legal implications. An informative article on the legality of web scraping can be found here. The article outlines three key considerations: types of data, website terms of service, and content accessibility.
Regarding data types, scraping publicly available data for non-commercial purposes is generally legal. However, it’s illegal to scrape private or personal data without consent, as well as copyrighted material that constitutes intellectual property.
It’s essential to review a website’s terms of service before scraping. Many sites outline their policies on data access, which can often be found in the robots.txt file. For example, you can check Google Scholar’s robots.txt file here.
Content behind logins or paywalls usually comes with terms of service that prohibit scraping. It’s important to note that “publicly available” refers to information anyone can access without special permissions or subscriptions, such as content on Wikipedia or Google search results. However, even publicly available content may be subject to copyright restrictions.
Web Scraping Tools#
Several tools are available for web scraping:
Regular expressions: Used to define search patterns within textual data.
Requests: A Python module that allows sending Get/Post requests to retrieve content.
Beautiful Soup: Useful for parsing HTML or XML documents into a readable format, making it easier to find specific elements on a webpage.
Selenium: Enables automation of various website interactions like clicking and scrolling.
Scrapy: A comprehensive web crawling and scraping framework for extracting structured data from web pages.
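As a small taste of how these tools work, here is a minimal sketch using Beautiful Soup (installed earlier as bs4) to parse a made-up HTML snippet. In a real project you would fetch the HTML with the requests module first; here the HTML string is hard-coded for illustration, so no network access is needed:

```python
from bs4 import BeautifulSoup

# A made-up HTML snippet standing in for a downloaded page
html = """
<html><head><title>Demo page</title></head>
<body>
  <a href="https://msu.edu">MSU</a>
  <a href="https://regex101.com">regex101</a>
</body></html>
"""

# Parse the HTML into a navigable tree
soup = BeautifulSoup(html, "html.parser")

title = soup.title.string                        # the page title text
links = [a["href"] for a in soup.find_all("a")]  # every link's href
print(title, links)
```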
🗒️ Question: Do you know of any other company that recently put their data behind a paywall?
✏️ Answer:
Part 3. Regular expression#
What is this?#
✅ DO THIS: If you have not used regex before, or have forgotten much of it, review the resources below.
There are a lot of resources on regular expressions. Here are a few to check out if you’d like to learn more in the future.
Python re
module#
For web scraping, or generally for parsing information out of text documents, regular expressions (also referred to as regex or regexp) are frequently used. A regular expression can be thought of as a powerful language for pattern matching in text. In the following sections, we will practice using regular expressions with the re
module.
The search
function#
The Python module re provides support for regular expressions. A typical regular expression search in Python looks like
match = re.search(pattern, text)
Where:
pattern: a string with the instructions of what to look for and how to look for it
text: a string on which the pattern matching will be performed
✅ DO THIS: Run the following code:
import re
text = 'Go green, go white!'
match = re.search('green', text)
print(type(match))
print(match)
🗒️ Coding Task: In the following code block, use the search
function to search for ‘MSU’ in text
and print out the search result.
### ANSWER
### INSTRUCTOR ANSWER
match_msu = re.search('MSU', text)
print(match_msu)
Defining a pattern#
The power of regular expressions comes from the fact that three types of patterns can be represented in the expression:
Regular characters: e.g., ‘g’ and ‘M’
Metacharacters: character with special meaning, examples:
[a-m] (any char a through m)
[^ab] (not matching a or b)
x|y (match either x or y)
\ (special sequence, see below)
. (any character)
^ (start of the line)
$ (match at the end of the line)
* (0 or more occurrences)
+ (1 or more occurrences)
? (0 or 1 occurrence)
{2} (exactly 2 occurrences)
() (capturing group)
Special sequences: examples,
\d (any digit)
\s (white space)
\w (alphanumeric)
\W (non-alphanumeric)
Here is a good list of regex characters and other expressions.
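The metacharacters and special sequences above can be combined; here is a small self-contained sketch (the sample strings are made up for illustration):

```python
import re

text = "cat bat rat cot"

# [bc]at : a character class, like [a-m] but listing just b and c
class_matches = re.findall(r"[bc]at", text)

# ^ and $ anchor a match to the start and end of the string
starts_with_cat = bool(re.search(r"^cat", text))
ends_with_cot = bool(re.search(r"cot$", text))

# \d combined with + (1 or more occurrences) pulls out runs of digits
digits = re.findall(r"\d+", "room 101, floor 3")

print(class_matches, starts_with_cat, ends_with_cot, digits)
```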
🗒️ Task: Before running the next code block, what do you think will be printed out?
✏️ Answer:
text = 'an example word-cat!!'
match = re.search(r'word-\w*', text)
# If-statement after search() tests if it succeeded
if match:
    print('Found:', match.group())
else:
    print('Did not find')
🗒️ Task: The regex101 website is an excellent place to test out regular expressions AND to find out what a complicated regular expression means.
Go to regex101 and paste in the expression and text from the previous cell. No quotes!
Replace the * with another expression so that the result includes just one alphanumeric character after the dash (i.e., it should find word-c). In the code block below, include the working regular expression and print out the match group result.
### ANSWER
### INSTRUCTOR ANSWER
text = 'an example word-cat!!'
match = re.search(r'word-\w?', text)
match.group()
The findall
function#
To find all texts that match the regular expression, you can use the findall
function. The syntax is:
matches = re.findall(pattern, text)
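For example, here is a small sketch of findall on a made-up string (re.IGNORECASE is an optional flag that makes the pattern case-insensitive):

```python
import re

quote = "Go green, go white!"

# findall returns every non-overlapping match as a list of strings
matches = re.findall(r"go", quote, re.IGNORECASE)
print(matches)  # ['Go', 'go']
```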
✅ DO THIS: Run the code block below where an emails
string is defined.
emails = 'deep.purple@msu.edu, alice-b@google.com,monkey@msu.edu, sparty@msu.edu'
🗒️ Coding Task: Use the regex101 tool to experiment with regular expressions that find all occurrences of a pattern that starts with @ and ends after . (i.e., edu or com are not included). In the code block below, include both non-working and working regular expressions you have tried.
# put your codes here
### ANSWER
### INSTRUCTOR ANSWER
pattern = r'@\w*'
re.search(pattern, emails).group()
🗒️ Coding Task: In the code block below, use the working regular expression from the answer above to find the intended strings in emails
with the findall
function and print out all matches.
### ANSWER
### INSTRUCTOR ANSWER
re.findall(pattern, emails)
Compiling regular expression#
Since the search pattern in a regular expression is essentially a set of instructions (i.e., a program), you can compile it and reuse it:
compiled_pattern = re.compile(pattern)
compiled_pattern.findall(text)
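A quick sketch of why compiling is useful: the compiled object can be reused on many different strings without re-specifying the pattern (the sample strings below are made up):

```python
import re

# Compile once...
digit_pattern = re.compile(r"\d+")

# ...then reuse the compiled object on multiple strings
first = digit_pattern.findall("Room 101")
second = digit_pattern.findall("3 cats, 14 dogs")
print(first, second)
```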
🗒️ Question: What does the following pattern specify? If it is applied to the emails
string, what will the output be?
compiled_pattern = re.compile(r'@[\w\d.]+\.+(com|org|edu)')
✏️ Answer:
🗒️ Coding Task: Write code that uses the emails object and finds all occurrences of the compiled_pattern specified above.
### ANSWER
### INSTRUCTOR ANSWER
compiled_pattern = re.compile(r'@[\w\d.]+\.+(com|org|edu)')
re.findall(compiled_pattern, emails)
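One subtlety worth noticing in the answer above: when a pattern contains a capturing group, findall returns only the group’s contents, not the whole match. A non-capturing group (?:...) keeps the full match instead. A sketch with a simplified variant of the pattern (single \., and \d dropped since \w already covers digits):

```python
import re

emails = 'deep.purple@msu.edu, alice-b@google.com,monkey@msu.edu, sparty@msu.edu'

# With a capturing group, findall returns only the captured text
grouped = re.findall(r'@[\w.]+\.(com|org|edu)', emails)

# With a non-capturing group, findall returns the entire match
full = re.findall(r'@[\w.]+\.(?:com|org|edu)', emails)

print(grouped)
print(full)
```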
Part 4. Parsing webpage with regex#
Here is an example using re
to get information out of webpages.
🗒️ Coding Task: Take a look at the following code block and comment on what each line does.
### ANSWER
import re
import requests

# Page to scrape
url = "https://colbrydi.github.io/pages/contact.html"
# Send a GET request to retrieve the page
source_code = requests.get(url)
# Extract the response body as a plain-text string
plain_text = source_code.text
# Compile a pattern for phone numbers such as (123) 456-7890 or 123 456.7890
regex = re.compile(r"\(?\d{3}\)?\s?\d{3}[-.]\d{4}")
# Find every phone number in the page text
res = regex.findall(plain_text)
print(res)
🗒️ Coding Task: Use the type() function to figure out the types of the objects that are returned when:
The get function of the requests module is called.
The text attribute of the source_code object is accessed.
### ANSWER
### INSTRUCTOR ANSWER
print(type(source_code))
print(type(plain_text))
Assignment wrap-up#
Please fill out the form that appears when you run the code below. You must completely fill this out in order to receive credit for the assignment! If running the cell doesn’t work in VS Code, copy the link from the iframe’s src
and paste it into your browser. Make sure to sign in with your MSU email.
from IPython.display import HTML
HTML(
'''
<iframe
src="https://forms.office.com/r/AEc6LS6xKF"
width="800px"
height="600px"
frameborder="0"
marginheight="0"
marginwidth="0">
Loading...
</iframe>
'''
)
Congratulations, you’re done with your pre-class assignment!#
Now, you just need to submit this assignment by uploading it to the course Desire2Learn web page for the appropriate pre-class submission folder. (Don’t forget to add your name in the first cell).
© Copyright 2024, Department of Computational Mathematics, Science and Engineering at Michigan State University