# Day 05: Pre-Class Assignment:<br> Regular expressions and web scraping

<img src="https://cdn.shopify.com/s/files/1/0551/1125/4191/files/The_Rainbow_Band.jpg" width=600>

From [How to Make Scrape Art with Acrylic Paints (Step by Step)](https://novacolorpaint.com/blogs/nova-color/make-scrape-art-with-acrylic-paints)

### Student Identification

Please do not modify the structure or format of this cell. Follow these steps:

1. Look for the "YOUR SUBMISSION" section below
2. Replace `[Your Name Here]` with your name
3. If you worked with groupmates, include their names after your name, separated by commas
4. Do not modify any other part of this cell

Examples:
```
Single student:  John Smith
Group work:      John Smith, Jane Doe, Alex Johnson
```

YOUR SUBMISSION (edit this line only):
<p style='text-align: left;'> &#9989; [Your Name Here]

Note:
- Keep the "&#9989;" symbol at the start of your submission line
- Use commas to separate multiple names
- Spell names exactly as they appear in the course roster
- This cell has been tagged as "names" - do not modify or remove this tag

## Table of Contents

- [Learning objectives](#learning-objectives)
- [Part 1. Setup an environment](#part-1.-setup-an-environment)
- [Part 2. Web scraping](#part-2.-web-scraping)
- [Part 3. Regular expression](#part-3.-regular-expression)
- [Part 4. Parsing webpage with regex](#part-4.-parsing-webpage-with-regex)
- [Assignment wrap-up](#assignment-wrap-up)
- [Congratulations, you're done with your pre-class assignment!](#congratulations,-you're-done-with-your-pre-class-assignment!)

## Learning objectives

At the end of the exercise, you should be able to:
- Explain what web scraping is.
- Use regex101 to practice writing regular expressions.
- Put together simple regular expressions to get information out of text documents.

---
## Part 1. Setup an environment

🗒️ **Task:**  RealPython has an excellent article, "[Python Virtual Environments: A Primer](https://realpython.com/python-virtual-environments-a-primer/#why-do-you-need-virtual-environments)".

Read the section on __Why Do You Need Virtual Environments?__ and provide the reasons in the cell below.

✏️ **Answer:**

&#9989; <font color=blue>**DO THIS:**</font>

If you don't have a `cmse802` virtual environment already, follow the instructions in this [link](https://conda.io/projects/conda/en/latest/user-guide/tasks/manage-environments.html) to create an environment with Python 3.11 called `cmse802`. Note this might take a while.

Then activate the environment and install the following packages:
  - `ipykernel`
  - `matplotlib`
  - `autopep8`
  - `pydocstyle`
  - `bs4`
  - `pandas`
  - `requests`
  - `ipywidgets`

### Check required modules

&#9989; <font color=blue>**DO THIS:**</font>

1. Open this notebook in VS Code.

2. Open the Command Palette:
   - **Windows/Linux**: `Ctrl+Shift+P`
   - **macOS**: `Cmd+Shift+P`

3. Type "Python: Select Interpreter" in the Command Palette and select it.

4. VS Code will display available Python interpreters, including:
   - System-wide installations
   - Virtual environments in your project
   - Conda environments

5. Choose the appropriate interpreter for your project. `cmse802` in this case.

6. The selected interpreter will appear in the bottom left corner of the VS Code window.

7. VS Code will now use this interpreter for running Python code, debugging, and linting in your current workspace. If you don't see it, close VS Code, open it back up, and try again.

8. Run the following code block to import the modules needed for the in-class exercise. If any import raises a `ModuleNotFoundError`:
  - Open your terminal,
  - Activate the `cmse802` environment,
  - Install the missing module via the terminal. You might have to search online for how to install specific packages.

In [None]:
import matplotlib, autopep8, pydocstyle, bs4, pandas, requests
from html.parser import HTMLParser  
from urllib import parse
from urllib.request import urlopen  
from urllib.request import urlretrieve
from ipywidgets import FloatProgress
from IPython.display import display
from glob import glob
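
As an optional alternative to reading import errors one at a time, the sketch below lists every missing module in a single pass. It assumes the import names match the package list above. The helper name `find_missing` is hypothetical and not part of the course materials.

```python
from importlib.util import find_spec


def find_missing(module_names):
    """Return the names in module_names that cannot be located for import."""
    # find_spec returns None when a top-level module is not installed.
    return [name for name in module_names if find_spec(name) is None]


required = ["matplotlib", "autopep8", "pydocstyle", "bs4",
            "pandas", "requests", "ipywidgets"]

missing = find_missing(required)
if missing:
    print("Missing modules:", ", ".join(missing))
else:
    print("All required modules can be imported.")
```

Any name reported as missing still has to be installed from the terminal inside the `cmse802` environment.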


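&#9989; <font color=blue>**DO THIS:**</font> Test `pdoc3`, `pylint`, and `pydocstyle` by running the following cell. If a command is not found, install the corresponding package in the `cmse802` environment first (`pdoc3` and `pylint` are not in the package list above).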

In [None]:
!pdoc3
!pylint
!pydocstyle

---
## Part 2. Web scraping

Web scraping is a powerful technique for collecting data from websites through automated processes. This method allows researchers to tap into the vast wealth of information available on the internet, much of which can be highly relevant to various research projects.

### Legal Considerations

When considering web scraping, it's crucial to understand its legal implications. An informative article on the legality of web scraping can be found [here](https://www.grepsr.com/blog/overview-web-scraping-legality/). The article outlines three key considerations: types of data, website terms of service, and content accessibility.

Regarding data types, scraping publicly available data for non-commercial purposes is generally legal. However, it's illegal to scrape private or personal data without consent, as well as copyrighted material that constitutes intellectual property.

It's essential to review a website's terms of service before scraping. Many sites outline their policies on data access, which can often be found in the [robots.txt](https://developers.google.com/search/docs/crawling-indexing/robots/intro) file. For example, you can check Google Scholar's robots.txt file [here](https://scholar.google.com/robots.txt).
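
To make the robots.txt idea concrete, here is a minimal sketch using Python's standard `urllib.robotparser` module. The rules and the `example.com` URLs are made up for illustration and do not come from any real site. The rules are parsed from an inline string so the example runs without network access.

```python
from urllib.robotparser import RobotFileParser

# Hypothetical robots.txt content, invented for this example.
sample_rules = """
User-agent: *
Allow: /public/
Disallow: /private/
""".splitlines()

parser = RobotFileParser()
parser.parse(sample_rules)  # parse the rules directly instead of downloading them

public_ok = parser.can_fetch("*", "https://example.com/public/page.html")
private_ok = parser.can_fetch("*", "https://example.com/private/data.html")

print("May fetch public page: ", public_ok)
print("May fetch private page:", private_ok)
```

For a live site, `set_url()` followed by `read()` would download the file before `can_fetch()` is called. Note that robots.txt only expresses the site's crawling preferences; the terms of service still apply.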

Content behind logins or paywalls usually comes with terms of service that prohibit scraping. It's important to note that "publicly available" refers to information anyone can access without special permissions or subscriptions, such as content on Wikipedia or Google search results. However, even publicly available content may be subject to copyright restrictions.

### Web Scraping Tools

Several tools are available for web scraping:

- [Regular expressions](https://docs.python.org/3/library/re.html): Used to define search patterns within textual data.
- [Requests](https://realpython.com/python-requests/): A Python module for sending HTTP GET/POST requests to retrieve content.
- [Beautiful Soup](https://beautiful-soup-4.readthedocs.io/en/latest/): Useful for parsing HTML or XML documents into a readable format, making it easier to find specific elements on a webpage.
- [Selenium](https://www.selenium.dev/): Enables automation of various website interactions like clicking and scrolling.
- [Scrapy](https://scrapy.org/): A comprehensive web crawling and scraping framework for extracting structured data from web pages.


🗒️ **Question:**  Do you know of any company that recently put its data behind a paywall?

✏️ **Answer:**

---

## Part 3. Regular expression

### What is this

&#9989; <font color=blue>**DO THIS:**</font> If you have not used regex before or have forgotten much of it:

- Watch [this excellent introduction video](https://www.youtube.com/watch?v=nxjwB8up2gI).

There are a lot of resources on regular expressions.  Here are a few to check out if you'd like to learn more __in the future__.

* https://docs.python.org/3/howto/regex.html
* http://www.pyregex.com/
* http://www.bogotobogo.com/python/python_regularExpressions.php
* http://howardabrams.com/regexp/

### Python `re` module

For web scraping, or more generally for parsing information out of text documents, __regular expressions__ (also referred to as **regex** or **regexp**) are frequently used. They can be thought of as a powerful language for pattern matching in text. In the following sections, we will practice using regular expressions with the `re` module.
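
Regex patterns use many backslashes, so they are usually written as Python raw strings (prefix `r`), which stop Python from interpreting sequences such as `\n` before the `re` module sees them. The short sketch below illustrates the difference. The sample text is made up for the example.

```python
import re

newline = "\n"      # one character: an actual line break
raw_newline = r"\n"  # two characters: a backslash followed by 'n'

print(len(newline), len(raw_newline))

# With a raw string, the \d reaches the regex engine unchanged.
digits = re.findall(r"\d+", "CMSE 802, Day 05")
print(digits)
```

Writing patterns as raw strings also avoids the invalid-escape-sequence warnings that recent Python versions emit for strings like `'\w'`.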

### The `search` function

The Python module **re** provides support for regular expressions. A typical regular expression search in Python looks like this:

```python
match = re.search(pattern, text)
```

Where:
1. **pattern**: is a string with the instructions of what to look for and how to look for it
1. **text**: is a string on which the pattern matching will be performed 

&#9989; <font color=blue>**DO THIS:**</font> Run the following code:

In [None]:
import re

text  = 'Go green, go white!'
match = re.search('green', text )

print(type(match))
print(match)
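
When a search succeeds, the returned match object can be queried for what was found and where. The sketch below uses the same sentence as above. The variable name `white_match` is only illustrative.

```python
import re

text = 'Go green, go white!'
white_match = re.search(r'white', text)

print(white_match.group())  # the matched text
print(white_match.start())  # index where the match begins
print(white_match.end())    # index just past the end of the match
print(white_match.span())   # (start, end) as a tuple

# Slicing the original text with the span recovers the match.
print(text[white_match.start():white_match.end()])
```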

🗒️ **Coding Task:** In the following code block, use the `search` function to search for 'MSU' in `text` and print out the search result.

In [None]:
### ANSWER 

In [None]:
### INSTRUCTOR ANSWER
match_msu = re.search('MSU', text)

print(match_msu)

### Defining a pattern

The power of regular expressions comes from the fact that three types of patterns can be represented in the expression:

- **Regular characters**: e.g., 'g' and 'M'
- **Metacharacters**: characters with special meaning, for example:
  - `[a-m]` (any character from a to m)
  - `[^ab]` (any character except a or b)
  - `x|y` (match either x or y)
  - `\` (special sequence, see below)
  - `.` (any character except a newline)
  - `^` (start of the string, or of each line with `re.MULTILINE`)
  - `$` (end of the string, or of each line with `re.MULTILINE`)
  - `*` (0 or more occurrences)
  - `+` (1 or more occurrences)
  - `?` (0 or 1 occurrence)
  - `{2}` (exactly 2 occurrences)
  - `()` (capturing group)
- **Special sequences**: for example,
  - `\d` (any digit)
  - `\s` (any whitespace character)
  - `\w` (any alphanumeric character or underscore)
  - `\W` (any character that is not alphanumeric or underscore)

[Here](https://www.shortcutfoo.com/app/dojos/python-regex/cheatsheet) is a good list of regex characters and other expressions.
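
The quantifiers in the list above are easy to mix up, so here is a small sketch that applies four of them to the same set of made-up strings. `re.fullmatch` is used so that a string counts only if the whole string fits the pattern.

```python
import re

samples = ["ct", "cat", "caat", "caaat"]

results = {}
for pattern in [r"ca*t", r"ca+t", r"ca?t", r"ca{2}t"]:
    # Keep the samples that the pattern matches from start to end.
    results[pattern] = [s for s in samples if re.fullmatch(pattern, s)]
    print(pattern, "->", results[pattern])
```

Here `ca*t` accepts all four strings, `ca+t` rejects only `ct`, `ca?t` accepts `ct` and `cat`, and `ca{2}t` accepts only `caat`.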

🗒️ **Task:** Before running the next code block, what do you think will be printed out?

✏️ **Answer:**

In [None]:
text  = 'an example word-cat!!'
match = re.search(r'word-\w*', text)  # raw string keeps \w intact for the regex engine

# If-statement after search() tests if it succeeded
if match:                      
    print('Found:', match.group())
else:
    print('Did not find')

🗒️ **Task:** The [regex101](https://regex101.com/) website is an excellent place to test regular expressions and to find out what a complicated regular expression means.

- Go to regex101 and paste in the expression and text from the previous cell. No quotes!
- Replace the `*` with another expression so that the result includes just one alphanumeric character after the dash (i.e., it should find `word-c`). In the code block below, include the working regular expression and print out the match group result.

In [None]:
### ANSWER


In [None]:
### INSTRUCTOR ANSWER
text  = 'an example word-cat!!'
match = re.search(r'word-\w?', text)

match.group()

### The `findall` function

To find every piece of text that matches a regular expression, you can use the `findall` function. The syntax is:

```python
matches = re.findall(pattern, text)
```
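
As a quick illustration of how `findall` differs from `search`, the sketch below runs both on the sentence used earlier. The pattern `[Gg]o` is chosen only for this example.

```python
import re

text = 'Go green, go white!'
pattern = r'[Gg]o'

first = re.search(pattern, text)        # a match object for the first hit only
all_hits = re.findall(pattern, text)    # a list of every matching string

print(first.group())
print(all_hits)
```

`search` stops at the first match, while `findall` scans the whole string and returns plain strings rather than match objects.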

&#9989; <font color=blue>**DO THIS:**</font> Run the code block below where an `emails` string is defined.

In [None]:
emails = 'deep.purple@msu.edu, alice-b@google.com,monkey@msu.edu, sparty@msu.edu'

🗒️ **Coding Task:** Use the [regex101](https://regex101.com/) tool to experiment with regular expressions that find all occurrences of a pattern that starts with `@` and ends just before the next `.` (i.e., `.edu` or `.com` are not included). In the code block below, include both the non-working and working regular expressions you tried.

In [None]:
# put your code here

In [None]:
### ANSWER

In [None]:
### INSTRUCTOR ANSWER
pattern = r'@\w*'
re.search(pattern, emails).group()

🗒️ **Coding Task:** In the code block below, use the working regular expression from the answer above to find the intended strings in `emails` with the `findall` function and print out all matches.

In [None]:
### ANSWER


In [None]:
## INSTRUCTOR ANSWER
re.findall(pattern, emails)


### Compiling regular expression

Since the search pattern in a regular expression is essentially a set of instructions (i.e., a program), you can compile it and reuse it:

```python
compiled_pattern = re.compile(pattern)
compiled_pattern.findall(text)
```
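
The following sketch shows one reason to compile: the same pattern object can be applied to many strings without repeating the pattern text. The pattern and the sample lines are made up for illustration.

```python
import re

# Compile once: four digits standing alone as a "word".
four_digits = re.compile(r'\b\d{4}\b')

lines = [
    'Invoice 2041 was paid.',
    'Rooms 1300 and 1425 are booked.',
    'No number here.',
]

# Reuse the compiled pattern on every line.
hits_per_line = [four_digits.findall(line) for line in lines]
for line, hits in zip(lines, hits_per_line):
    print(hits, '<-', line)
```

A compiled pattern offers the same `search` and `findall` methods as the module-level functions, minus the pattern argument.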

🗒️ **Question:** What does the following pattern specify? If it is applied to the `emails` string, what will the output be?

```python
compiled_pattern = re.compile(r'@[\w\d.]+\.+(com|org|edu)')
```

✏️ **Answer:**

🗒️ **Coding Task:** Write code that uses the `emails` object and finds all occurrences of the `compiled_pattern` specified above.

In [None]:
### ANSWER

In [None]:
### INSTRUCTOR ANSWER
compiled_pattern = re.compile(r'@[\w\d.]+\.+(com|org|edu)')
compiled_pattern.findall(emails)

---
## Part 4. Parsing webpage with regex

Here is an example using `re` to get information out of webpages.

🗒️ **Coding Task:** Take a look at the following code block and comment on what each line does.

In [None]:
### ANSWER

import re
import requests

url = "https://colbrydi.github.io/pages/contact.html"

source_code = requests.get(url)

plain_text = source_code.text

regex = re.compile(r"\(?\d{3}\)?\s?\d{3}[-.]\d{4}")

res = regex.findall(plain_text)

print(res)
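
To experiment with the same pattern without a network connection, the sketch below applies it to a small inline HTML snippet. The snippet and the phone numbers in it are fictitious and are not taken from the page above.

```python
import re

phone_regex = re.compile(r"\(?\d{3}\)?\s?\d{3}[-.]\d{4}")

# Made-up HTML with three differently formatted, fictitious numbers.
sample_html = """
<p>Office: (517) 555-0100</p>
<p>Lab: 517 555.0199</p>
<p>Fax: 517-555-0123</p>
"""

found = phone_regex.findall(sample_html)
print(found)
```

The third number is not matched because the pattern allows only an optional `)` and an optional whitespace character between the area code and the next three digits, not a dash. Testing a pattern on small known inputs like this is a useful check before running it on a full page.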

🗒️ **Coding Task:** Use the `type()` function to figure out the types of the objects that are returned when:

- The `get` function of the `requests` module is called.
- The `text` attribute of the `source_code` object is accessed.

In [None]:
### ANSWER

In [None]:
### INSTRUCTOR ANSWER
print(type(source_code))
print(type(plain_text))

---
## Assignment wrap-up
Please fill out the form that appears when you run the code below.  **You must completely fill this out in order to receive credit for the assignment!** If running the cell doesn't work in VS Code, copy the `src` link and paste it into your browser. Make sure to sign in with your MSU email.


In [None]:
from IPython.display import HTML
HTML(
'''
<iframe 
    src="https://forms.office.com/r/AEc6LS6xKF" 
    width="800px" 
    height="600px" 
    frameborder="0" 
    marginheight="0" 
    marginwidth="0">
    Loading...
</iframe>
'''
)


## Congratulations, you're done with your pre-class assignment!

Now, you just need to submit this assignment by uploading it to the appropriate pre-class submission folder on the course <a href="https://d2l.msu.edu/">Desire2Learn</a> web page. (Don't forget to add your name in the first cell.)


&#169; Copyright 2024, Department of Computational Mathematics, Science and Engineering at Michigan State University
