Pre-Class Assignment: Principal Component Analysis

Day 22

CMSE 202

✅ Put your name here


Goals for today’s pre-class assignment

  1. Introduction to Principal Component Analysis

  2. Example Application: The Breast Cancer dataset

  3. Understanding the importance of scaling data

Assignment instructions

This assignment is due by 11:59 p.m. the day before class and should be uploaded into the appropriate “Pre-class assignments” submission folder on the Desire2Learn website.


Importing modules

Run the following cell to import the modules we will be using in this pre-class assignment.

import numpy as np
import sklearn.decomposition as dec
import seaborn as sns
import pandas as pd
import matplotlib.pyplot as plt

sns.set_context("notebook")

1. Developing some intuition about principal component analysis

The following videos (developed at Georgia Tech) will help you build an understanding of, and intuition for, principal component analysis (PCA). PCA is one of the main techniques used in data science, exploratory data analysis, and modeling.

You can watch the entire course here:

https://youtu.be/Ki2iHgKxRBo?list=PLAwxTw4SYaPl0N6-e1GvyLp5-MUMUjOKo

It’s a great video series but we don’t have time to cover it all.

Do This: Watch the following video for an overview of PCA (don’t worry about any references to “readings” from their course).

from IPython.display import YouTubeVideo
YouTubeVideo("kw9R0nD69OU",width=640,height=360)

Question: PCA is trying to find the directions with maximal what?

Do This - Erase the contents of this cell and replace it with your answer to the above question! (double-click on this text to edit this cell, and hit shift+enter to save the text)

Question: What does it mean when two components are orthogonal?

Do This - Erase the contents of this cell and replace it with your answer to the above question! (double-click on this text to edit this cell, and hit shift+enter to save the text)

from IPython.display import YouTubeVideo
YouTubeVideo("_nZUhV-qhZA",width=640,height=360)

Question: This video introduces a concept of “features” in a dataset. What are the names of the two original features represented in the graph shown in this video? What parts of the graph would represent the new features after the PCA is performed?

Do This - Erase the contents of this cell and replace it with your answer to the above question! (double-click on this text to edit this cell, and hit shift+enter to save the text)

from IPython.display import YouTubeVideo
YouTubeVideo("kuzJJgPBrqc",width=640,height=360)

Question: What does it mean if the eigenvalue of a dimension is zero? How might performing PCA allow us to reduce the number of features we need to model our data while still getting accurate predictions?

Do This - Erase the contents of this cell and replace it with your answer to the above question! (double-click on this text to edit this cell, and hit shift+enter to save the text)


2. Example Application: The Breast Cancer Dataset

Let’s go back to the Breast Cancer dataset from a previous assignment. We’ll need to download the dataset, which you can do using the following URL:

https://raw.githubusercontent.com/msu-cmse-courses/cmse202-supplemental-data/main/data/WIBreastCancer_Cleaned.csv

Remember: this is real data, corresponding to real people, with real diseases. It can be easy to forget this when working with data for the purposes of learning how to use new modeling and data analysis tools. You should always strive to see the human aspect of whatever you are working on when working with data.

Question: Use Pandas to load the breast cancer dataset.

# Load the dataset using Pandas

Step A: Try to visualize the features by plotting them.

Question: Modify the following code to draw a scatterplot of just the first and second columns of the data (index 0 and 1). Note: since the features have integer values, many data points overlap exactly. To make them easier to see, you may want to add a small amount of random noise to the data, make the points transparent (e.g., alpha=0.25), or both!
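The noise idea in the note above can be sketched on made-up data. This is only an illustration under stated assumptions: the two integer features below are hypothetical stand-ins generated at random (they are not the breast cancer data), and the jitter width of 0.25 is an arbitrary choice.

```python
import numpy as np

rng = np.random.default_rng(0)  # fixed seed so the sketch is repeatable

# Hypothetical stand-in for two integer-valued features (values 1 to 10)
x = rng.integers(1, 11, size=200)
y = rng.integers(1, 11, size=200)

# Uniform noise in [-0.25, 0.25) spreads out stacked points while keeping
# each point closer to its own integer value than to any neighboring one
jitter = 0.25
x_jittered = x + rng.uniform(-jitter, jitter, size=x.shape)
y_jittered = y + rng.uniform(-jitter, jitter, size=y.shape)

# Count how many distinct plotting positions exist before and after
print(len(set(zip(x, y))), "distinct positions before jitter")
print(len(set(zip(x_jittered, y_jittered))), "distinct positions after jitter")
```

The jittered arrays can then be passed to plt.scatter in place of the original columns. Keep the noise for plotting only; the analysis itself should use the unmodified data.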

# DO THIS: Modify the code below to do a scatter plot with respect to the first two variables of the data
# i.e. all rows but just the first and second columns.

#plt.scatter(  ,   , c=target, s=30, cmap=plt.cm.rainbow);

# Don't forget to add axis labels

If done correctly, the above should show different colored dots for the malignant and benign tumors. As you can see, the two classes do not separate cleanly: there is a significant amount of overlap between them. Perhaps there are two new directions (axes) that separate the data better?

Step B: Transform the data in terms of its principal components

Now we will use a PCA algorithm. Fortunately, scikit-learn provides a simple PCA class in its sklearn.decomposition module, which we imported above as dec.

#Note: make sure that your dataframe is called df

pca = dec.PCA()
pca_data = pca.fit_transform(df.drop(columns=["label"]))

Out of curiosity, let’s print the eigenvalues, which are stored in the attribute explained_variance_. Remember from the video that small eigenvalues indicate less information and large eigenvalues indicate more. However, the raw eigenvalues are hard to interpret on their own, so let’s also print each eigenvalue’s share of the total variance, stored in explained_variance_ratio_.

print("The eigenvalues are: ", pca.explained_variance_)
print("Their ratios are: ", pca.explained_variance_ratio_)

As you can see, the last few eigenvalues are quite small, so the corresponding components carry little information.
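To make the link between the eigenvalues and their ratios concrete, here is a minimal sketch on synthetic data. The names X_demo and demo_pca and the random data are assumptions made for illustration (chosen so they do not overwrite df or pca); the sketch checks that explained_variance_ matches the eigenvalues of the sample covariance matrix and that each ratio is an eigenvalue divided by the sum of all eigenvalues.

```python
import numpy as np
import sklearn.decomposition as dec

rng = np.random.default_rng(1)

# Synthetic stand-in data: 500 samples, 3 features with different spreads
X_demo = rng.normal(size=(500, 3)) * np.array([5.0, 2.0, 0.5])

demo_pca = dec.PCA()
demo_pca.fit(X_demo)

# Eigenvalues of the sample covariance matrix, sorted largest first
cov = np.cov(X_demo, rowvar=False)
eigenvalues = np.sort(np.linalg.eigvalsh(cov))[::-1]

# Check 1: explained_variance_ agrees with the covariance eigenvalues
print(np.allclose(demo_pca.explained_variance_, eigenvalues))

# Check 2: each ratio is an eigenvalue divided by the sum of all eigenvalues
ratios = demo_pca.explained_variance_ / demo_pca.explained_variance_.sum()
print(np.allclose(ratios, demo_pca.explained_variance_ratio_))
```

Because all components are kept here, the ratios add up to 1, which is what makes them easier to read than the raw eigenvalues.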

However, these eigenvalues do not tell us which feature in the dataset is the most important. Principal Component Analysis is a global algorithm that “rotates” the data into new dimensions. Mathematically speaking, the components are the eigenvectors of the data’s covariance matrix. Let’s print the components and check that they are orthogonal.

# Each row of pca.components_ is one eigenvector; print all of them
for i, eigv in enumerate(pca.components_, start=1):
    print(f"The elements of eigenvector {i} are: ", eigv)
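The “rotation” described above can be illustrated with a short sketch, again on synthetic data. The names X_demo, demo_pca, scores, and manual_scores are assumptions for illustration. The idea is that the transformed data should equal the centered data projected onto the eigenvectors, and that a rotation leaves each point’s distance from the center unchanged.

```python
import numpy as np
import sklearn.decomposition as dec

rng = np.random.default_rng(2)

# Synthetic stand-in data: two correlated features
a = rng.normal(size=300)
X_demo = np.column_stack([a, 0.5 * a + rng.normal(scale=0.3, size=300)])

demo_pca = dec.PCA()
scores = demo_pca.fit_transform(X_demo)

# Center the data, then project onto each eigenvector (rows of components_)
centered = X_demo - demo_pca.mean_
manual_scores = centered @ demo_pca.components_.T

# The manual projection should reproduce what fit_transform returned
print(np.allclose(scores, manual_scores))

# A rotation preserves each point's distance from the center
print(np.allclose(np.linalg.norm(centered, axis=1),
                  np.linalg.norm(scores, axis=1)))
```

The distance check only holds when all components are kept; dropping components discards part of each point’s length.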

Do this: Write some code to check that all the eigenvectors are orthogonal to each other. Hint: The dot product of two orthogonal vectors is zero (or numerically close to zero).

# Put your code here

Step C: Now plot the transformed data in terms of its first two principal components

fig, ax = plt.subplots(1,2, figsize = (16, 6))
# Plot the original data

ax[0].scatter(df.iloc[:,0], df.iloc[:,1], c=df["label"], s=30, cmap=plt.cm.rainbow, alpha=0.25)
ax[0].set_title("Original Data")
ax[0].set_xlabel("clump")
ax[0].set_ylabel("Norm-Nuc")

# Plot the transformed data
ax[1].scatter(pca_data[:,0], pca_data[:,1], c=df["label"], s=30, cmap=plt.cm.rainbow, alpha=0.25)
ax[1].set_title("Transformed Data")
ax[1].set_xlabel("PCA Component 1")
ax[1].set_ylabel("PCA Component 2")

Question: Describe in words the differences between the above graphs. They represent the same data. Why might we prefer to use the transformed features plotted in Step C?

Do This - Erase the contents of this cell and replace it with your answer to the above question! (double-click on this text to edit this cell, and hit shift+enter to save the text)


3. Scaling Data

As mentioned above, PCA finds the directions with the highest variance. However, the features in your dataset may span very different ranges. For example, think of a dataset containing the salaries and heights of everyone in a company. The range of salaries is much wider than the range of heights: pay runs from less than 15 USD/hour for a clerk to millions of USD for the CEO, while employees’ heights vary by only several inches. Left unscaled, the salary feature would dominate the variance, and therefore the principal components, simply because of its units.
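The salary and height example can be sketched in code. All numbers below are made up for illustration, and the choice of StandardScaler is an assumption (it is one of several scalers in scikit-learn, not necessarily the right one for every dataset).

```python
import numpy as np
import sklearn.decomposition as dec
from sklearn.preprocessing import StandardScaler

rng = np.random.default_rng(3)
n = 200

# Hypothetical, independently generated features
salary = rng.normal(loc=60_000, scale=15_000, size=n)  # USD per year
height = rng.normal(loc=67, scale=3, size=n)           # inches
X_demo = np.column_stack([salary, height])

# Without scaling, salary's huge variance dominates the first component
raw_pca = dec.PCA().fit(X_demo)
print("Unscaled ratios:", raw_pca.explained_variance_ratio_)
print("Unscaled first component:", raw_pca.components_[0])

# StandardScaler gives each feature zero mean and unit variance
X_scaled = StandardScaler().fit_transform(X_demo)
scaled_pca = dec.PCA().fit(X_scaled)
print("Scaled ratios:  ", scaled_pca.explained_variance_ratio_)
```

Before scaling, the first component points almost entirely along the salary axis. After scaling, the two unrelated features share the variance much more evenly, which better reflects how the data were generated.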

Do this: Read the guides below

Why scale? For many machine learning algorithms, such as k-NN or k-means, scaling your data can improve performance by putting all features on a comparable footing. PCA relies heavily on the variance of each feature, so unscaled data can produce misleading results.

Read this article to understand more about feature scaling: Importance of Feature Scaling

What kinds of scalers are there? Not all scalers are created equal, and different datasets call for different scalers.

Read up to section 6.3.1.3 to learn about the different scalers you can use: Preprocessing Data

How do different scalers deal with outliers? As previously mentioned, there are different types of scalers, each suited to particular situations. Outliers are one common reason to choose one scaler over another.

Read this article to understand how the different scaling types work with outliers: Compare the effect of different scalers on data with outliers
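As a small companion to the article, the sketch below compares two scikit-learn scalers on a synthetic feature containing one extreme outlier. The data are made up, and StandardScaler and RobustScaler are picked only as two examples; the article covers more options.

```python
import numpy as np
from sklearn.preprocessing import StandardScaler, RobustScaler

rng = np.random.default_rng(4)

# Synthetic feature: 99 typical values plus one extreme outlier at the end
typical = rng.normal(loc=50, scale=5, size=99)
values = np.append(typical, 5000.0).reshape(-1, 1)

# StandardScaler uses the mean and standard deviation (both outlier-sensitive)
standard = StandardScaler().fit_transform(values)

# RobustScaler uses the median and interquartile range instead
robust = RobustScaler().fit_transform(values)

# Spread of the 99 typical points after each scaling (outlier excluded)
print("StandardScaler spread of typical points:", standard[:-1].std())
print("RobustScaler spread of typical points:  ", robust[:-1].std())
```

With StandardScaler, the single outlier inflates the standard deviation so much that the typical points are squeezed into a very narrow band. RobustScaler keeps them spread out, although the outlier itself is still present in the scaled data.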

Now do this: Answer the following questions.

  1. Why is it important to scale data?

  2. What scaler is used in the first article?

  3. How many linear scalers can you find in scikit-learn? List all the ones you find.

  4. How do you pre-process the data when it has outliers? What scaler do you use and why?

  5. List at least two non-linear transformers.

Do This - Erase the contents of this cell and replace it with your answers to the above questions! (double-click on this text to edit this cell, and hit shift+enter to save the text)


Follow-up Questions

Copy and paste the following questions into the appropriate box in the assignment survey included below and answer them there. (Note: You’ll have to fill out the assignment number and go to the “NEXT” section of the survey to paste in these questions.)

  1. When doing a principal component analysis (PCA), the goal is to find a new set of axes that maximize what?

  2. When you perform PCA, what do large eigenvalues indicate versus small eigenvalues?

  3. Why is it important to scale data?

  4. How do you deal with outliers?


Assignment wrap-up

Please fill out the form that appears when you run the code below. You must completely fill this out in order to receive credit for the assignment!

from IPython.display import HTML
HTML(
"""
<iframe 
	src="https://cmse.msu.edu/cmse202-pc-survey" 
	width="800px" 
	height="600px" 
	frameborder="0" 
	marginheight="0" 
	marginwidth="0">
	Loading...
</iframe>
"""
)

Congratulations, you’re done with your pre-class assignment!

Now, you just need to submit this assignment by uploading it to today’s submission folder on the course Desire2Learn web page. (Don’t forget to add your name in the first cell.)

© Copyright 2023 Department of Computational Mathematics, Science and Engineering at Michigan State University