Week 05: Pre-Class Assignment:
Statistics and outliers#

✅ Put your name here.#

xkcd


Goals for this week’s pre-class assignment#

In this Pre-Class Assignment you are going to use some statistics measures in the context of machine learning. The main learning goals are:

  • learn some useful methods in Python,

  • practice making nice plots,

  • become more familiar with multivariate gaussians,

  • practice some statistics,

If you are not yet a Python expert, you might look these up to help you this week:

  • np.atleast_2d

  • np.ravel

  • np.meshgrid

  • np.concatenate

  • sns.pairplot

  • np.stack

Total number of points: 55 points

This assignment is due by 11:59 p.m. the day before class, and should be uploaded into the appropriate “Pre-Class Assignments” submission folder on D2L. Submission instructions can be found at the end of the notebook.

# Import the needed libraries
import numpy as np
import matplotlib.pyplot as plt
from sklearn.datasets import load_iris
from scipy.stats import multivariate_normal

import matplotlib.ticker as mticker


import pandas as pd
import seaborn as sns

Part 1: Violin and Box Plots Using Seaborn (20 points)#

When you start a machine learning project from scratch, but you have a data set in mind, the next step is Exploratory Data Analysis (EDA). Most of EDA is looking for statistical trends in your data, and that is just what seaborn is designed to help you do!

The seaborn library offers capability for visualizing data sets and performing simple statistics on them. Making such visualizations can be very useful for exploring data sets that are simple, where “simple” means low-enough dimensionality that 2D projections on your screen are meaningful. (Or, you can perform PCA first on a higher dimensional dataset before attempting the visualization.) As with some of the other libraries, seaborn has some simple datasets built in: we’ll grab the very common iris dataset. In this problem you are going to explore one of seaborn’s capabilities.

# Load the Iris dataset
iris = load_iris()
data = iris.data
target = iris.target
feature_names = iris.feature_names
species = iris.target_names

# Convert to DataFrame for easier manipulation
df = pd.DataFrame(data, columns=feature_names)
df['species'] = [species[i] for i in target]
df

Let’s explore the pairplot, which reveals correlations in the data.

sns.pairplot(df, hue="species") 

Task 1.1: (8 points) Here, you will explore an alternate tool that gives similar information. Using the internet, look up what violin plots and box plots are, and describe in detail what they show:

There are a lot of options when making box plots. In your answer consider some of these options - the Seaborn page on boxplots has a lot of examples.

Put your answers here!

  1. A box plot has these features…. (for example, how are outliers shown?)

  2. A violin plot has these features…. (for example, what does a kernel density plot show?)

As you can see, making violin and box plots would be very time consuming with matplotlib, but very desirable for pre-ML EDA. This is exactly the niche that Seaborn fills. Seaborn has a nice gallery of ideas to choose from as well as good documentation to get you exactly what you need.

Task 1.2: (2 points) Next, make four violin plots, one for each feature of the dataset. There is a bit of code to guide you.

fig, ax = plt.subplots(2,2, figsize = (18,10))
sns.violinplot(data = df, x = "species", y = "sepal length (cm)", ax = ax[0,0])
sns.violinplot(data = df, x = "species", y = "sepal width (cm)", ax = ax[0,1])
sns.violinplot(data = df, x = "species", y = "petal length (cm)", ax = ax[1,0])
sns.violinplot(data = df, x = "species", y = "petal length (cm)", ax = ax[1,1])

Question 1.3: (4 points) What do you observe and learn from these plots?

Put your answers here!

Task 1.4: (2 points) Now make four box plots, one for each feature of the dataset.

fig, ax = plt.subplots(2,2, figsize = (18,10))
sns.boxplot(data = df, x = "species", y = "sepal length (cm)", ax = ax[0,0])
sns.boxplot(data = df, x = "species", y = "sepal width (cm)", ax = ax[0,1])
sns.boxplot(data = df, x = "species", y = "petal length (cm)", ax = ax[1,0])
sns.boxplot(data = df, x = "species", y = "petal length (cm)", ax = ax[1,1])

Question 1.5: (4 points) What do you observe and learn from these plots?

Put your answers here!


Part 2: Covariance Matrix, Linear Algebra and Multivariate Gaussians (10 points)#

You may have noticed many deer roaming around the East Lansing area. What you may know less about is that some of them carry a prion disease called “Chronic Wasting Disease”, or CWD for short. Researchers at MSU in both the CMSE and the Fisheries and Wildlife departments are developing computational models to understand the spread of this disease. One of the measurements this team uses is GPS data from collars attached to some of the deer. The data reveal what is best known as a “random walk”, and the researchers would like to make some sense out of this for their models. What information can we extract from data about deer positions? Let’s imagine the location of an individial deer comes to you as a dataset with positions of the deer taken over some time frame, and we want to use just statistics to extract information from this data.

Multivariate Gaussians enter just about every corner of ML, and they make connections between all of those corners. Here, you will explore connections between linear algebra and statistics in the context of building multivariate Gaussians.

Here, all you are going to do is make fake data, compute its statistical properties, plug them into a bivariate Gaussian and plot it with the original data. Imagine that the fake data you create be the positions of a deer carrying a GPS on its collar - let’s see if we can extract meaningful information from this data about deer movement patterns. The goal of the research is to build a model that has the same statistical properties as the data.

The multivariate Gaussian distribution is given by:

\[\mathcal N(x_1, \ldots, x_p) = \frac{1}{\sqrt{(2\pi)^p {\rm det}({\boldsymbol \Sigma})}}\exp\left(-\frac{1}{2}({\bf x} - {\bf \mu})^T{\boldsymbol \Sigma}^{-1}({\bf x} - {\boldsymbol \mu}) \right).\]

Recall that \(\bf \Sigma\) is the covariance matrix, \(p\) is the number of variables (aka predictors, aka features) and \(\mu\) is the mean.

For this problem we only need a bivariate Gaussian, of course, since the deer roam only on the ground (thankfully).

Question: (5 points) Assume you have position data, i.e. longitude and latitude coordinates, for \(N\) deers. What are the dimensions of \({\bf x}\), \({\bf \mu}\), \({\bf \Sigma}\), the product in the argument of the exponential, i.e. \(({\bf x} - \mu)^{T} {\bf \Sigma}^{-1} ({\bf x} - \mu)\), and \(\mathcal N\) ?

  1. \({\bf x}\) has size \(2 \times 1\)

  2. \({\bf \mu}\) has size \(2 \times 1\), where \(\mu = \frac{1}{N} \sum_{i = 1}^N x_i\)

  3. \({\bf \Sigma}\) has size \( 2\times 2\)

  4. \(({\bf x} - {\bf \mu})^T{\boldsymbol \Sigma}^{-1}({\bf x} - {\bf \mu} ) \) has size 1. It’s a scalar

  5. \(\mathcal N\) is a scalar so dimension = 1

ChatGPT#

Given:

  • \(X\) is the dataset with dimensions \(M \times P\) (where \(M\) is the number of observations and \(P\) is the number of variables or features).

  • \(\Sigma\) is the covariance matrix with dimensions \(P \times P\).

Let’s break down the product \((X - \mu)^T \Sigma^{-1} (X - \mu)\):

  1. \((X - \mu)\) has dimensions \(M \times P\). This is because you’re subtracting the mean \(\mu\) from each row of \(X\).

  2. \((X - \mu)^T\) has dimensions \(P \times M\).

  3. \(\Sigma^{-1}\) is the inverse of the covariance matrix and has dimensions \(P \times P\).

Now, let’s compute the product:

The product of \((X - \mu)^T\) (which is \(P \times M\)) and \(\Sigma^{-1}\) (which is \(P \times P\)) is not directly computable since the number of columns of the first matrix doesn’t match the number of rows of the second matrix.

However, if you meant the product in the context of a quadratic form for each observation, then the product for each observation \(x_i\) (a row vector) would be:

\[ (x_i - \mu)^T \Sigma^{-1} (x_i - \mu) \]

This product results in a scalar for each observation \(x_i\). When computed for all observations, you’ll get a column vector of dimension \(M \times 1\), where each entry is the result of the quadratic form for the corresponding observation.

Task 2.1: (2 points) Assume that \(N = 2\), workout the math of the product \(({\bf x} - {\bf \mu})^T{\boldsymbol \Sigma}^{-1}({\bf x} - {\bf \mu} ) \) and write the argument of the exponential for each deer. For the sake of simplicity, you can write the elements of \({\bf \Sigma}^{-1}\) as

\[\begin{split}{\bf \Sigma}^{-1} = \begin{pmatrix} \sigma_{11} & \sigma_{12}\\\sigma_{21} &\sigma_{22} \end{pmatrix}.\end{split}\]

Note that this is needed for the next part.

Put your answers here!

Let’s define

\[\begin{split}{\bf X} = \begin{pmatrix} \bf x \\ \bf y\end{pmatrix},\end{split}\]

where \(\bf x, y\) are vectors \(\mathbf x = (x_1, x_2)^T\), \(\mathbf y = (y_1, y_2)^T\) Then we have \({\bf \mu} = ( \mu_x, \mu_y )\) which leads to

\[\begin{split}{\bf X - \mu} = \begin{pmatrix} \mathbf x - \mu_x \\ \mathbf y - \mu_y \end{pmatrix} = \begin{pmatrix} \bf d_{1} \\ \bf d_{2} \end{pmatrix}\end{split}\]

the product is then

\begin{align*} ({\bf X - \mu})^{T} {\bf \Sigma}^{-1} ({\bf X - \mu}) &= \begin{pmatrix} \mathbf{d}{1} & \mathbf{d}{2} \end{pmatrix} \begin{pmatrix} \sigma_{11} & \sigma_{12} \ \sigma_{21} &\sigma_{22} \end{pmatrix} \begin{pmatrix} \mathbf{d}{1} \ \mathbf{d}{2} \end{pmatrix} \ &= \begin{pmatrix} \mathbf{d}{1} & \mathbf{d}{2} \end{pmatrix} \begin{pmatrix} \sigma_{11}\mathbf{d}{1} + \sigma{12}\mathbf{d}{2} \ \sigma{21}\mathbf{d}{1} + \sigma{22}\mathbf{d}{2} \end{pmatrix} \ & = \mathbf{d}{1}(\sigma_{11}\mathbf{d}{1} + \sigma{12}\mathbf{d}{2}) + \mathbf{d}{2}(\sigma_{21}\mathbf{d}{1} + \sigma{22}\mathbf{d}{2}) \ & = \sigma{11}\mathbf{d}1^2 + 2 \sigma{12}\mathbf{d}{2}\mathbf{d}1 + \sigma{22}\mathbf{d}{2}^2 \end{align*}

Task 2.2: (4 points) Let’s code. Complete the code below

#  Let's quickly make some fake data. 
num_deers = 100

# Generate fake position data. Here I am using a multivariate gaussian.
# DON'T USE THIS. IT'S TOO BORING. CREATE SOME DATA AND ADD SOME NOISE TO IT. 
x_data = np.random.normal(loc = (0.2, 0.5), scale = (0.1, 0.2), size = (num_deers, 2))

x_1 = np.linspace(0.1,1, num_deers)
# try playing with this function to generate "new data"
x_2 = 0.6*x_1**2 + 10 * np.sin(x_1) + 0.2*np.random.randn(x_1.size)
x_data = np.c_[x_1, x_2]

# Calculate the mean of the longitude and latitude
x_mean = x_data.mean(axis = 0)

# Calculate the covariance matrix of your dataset
x_sigma = np.cov(x_data, rowvar=False)

# Calculate its inverse
sigma_inv = np.linalg.inv(x_sigma)

# Plotting the result
num_points = 120
# Range of latitudes
x_a = np.linspace(-1.2*np.abs(np.min(x_data[:,0])), np.max(x_data[:,0])*1.2, num_points)
# Range of longitudes
y_a = np.linspace(-1.2*np.abs(np.min(x_data[:,1])), np.max(x_data[:,1])*1.2, num_points)
# Create a grid
X, Y = np.meshgrid(x_a, y_a)

fig = plt.figure(figsize=(10,10))
plt.xlabel('latitude')
plt.ylabel('longitude')
plt.grid(alpha=0.2)
plt.title('Deer Positions')

# Calculate the argument of the exponential
exp_argument = sigma_inv[0,0]*(X - x_mean[0])**2 + sigma_inv[1,1]*(Y - x_mean[1])**2  + 2.0 * sigma_inv[0,1]*(Y - x_mean[1])*(X - x_mean[0])
multivariate_gaussian = np.exp(- 0.5 * exp_argument) / np.sqrt( 4.0 * np.pi**2 * np.linalg.det(x_sigma))

# Plot the countours of the multivariate gaussian
plt.contourf(X, Y, multivariate_gaussian, levels = 10)

# Plot the deers' position
plt.scatter(x_data[:,0], x_data[:,1], c = 'w', s = 20)
plt.colorbar()

Question 2.3: (4 points) Do to the contours look like the data? Why or why not?

Put your answers here!


Part 3: Anomaly Detection (25 points)#

Part 3.1: Creating and displaying synthetic data (13 points)#

You are at a career fair on campus and you decide to meet with companies interested in hiring experts on machine learning. While talking to recruiters for one of the companies, they ask you what your experience and skill levels are. You tell them that currently they are very low, but you are in CMSE 491. Their jaws drop, and they immediately offer you a job starting upon completion of your AML class.

At your first day on the job your boss comes to your office and gives you your first assignment. Because your clearance hasn’t come through yet, she can’t tell you all of the details, but you can get your code base functional in the meantime. What she does tell you is that one of the defense agencies is receiving emails from what might be a foreign hacker entity. These emails tend to have unusual values of certain traits (features), which she can only tell you (for now) are called \(X_1\) and \(X_2\). The agency is one of the largest in the world, so there is no way a human can examine all of the email traffic for the presence of unusual values of \(X_1\) and \(X_2\); you need to develop a code that examines the emails for values of these two traits, which will come to you as real numbers, and send an email to a human if you believe it is likely from an enemy actor. The problem is that all emails, even good emails, have some of \(X_1\) and \(X_2\), so what you need to do is look for what are called “anomalies”.

Anomaly detection is a fairly large branch of ML, used widely in business to detect anything unusual, such as fraud. If you look up above at your box plots you will see dots that represent outliers - you can think of those as the anomalies in the iris data set. Who knows, maybe for those cases a completely different species of flower was accidently measured? Or, maybe that feature just has a weird distribution where there is some possiblity of “unusual” values? Usually, you need to flag those and take a closer look.

First, you obtain a dataset of values from emails known to originate from within the agency, so they are presumed “safe”. Because you have no idea what \(X_1\) and \(X_2\) are, you decide to plot them first - what else can you do? And, you want the plot to look good and reveal trends in the data to show to your boss, so you use seaborn.

Comment the code below. There are a few tricky steps in there; feel free to use the internet to help you. Include in the comments what jointplot, multivariate_normal and kdeplot are. Note the similarity to the type of information in a violin plot.

# Parameters for the multivariate normal distribution
mean_original = np.array([2, 2])
cov_original = np.array([[1, 0.5], [0.5, 1]])

total_size = int(1e4)
# Generate 99.7% of the data from a multivariate normal distribution
num_samples = int(.997*total_size)
rng = np.random.default_rng(seed = 42)
X_original = rng.multivariate_normal(mean_original, cov_original, num_samples)

# Draw contours of the multivariate 
lim_mult = 2.0
x_array = np.linspace(lim_mult*X_original[:,0].min(), lim_mult*X_original[:,0].max(), 100)
y_array = np.linspace(lim_mult*X_original[:,1].min(), lim_mult*X_original[:,1].max(), 100)
X, Y = np.meshgrid(x_array, y_array)

inv_cov = np.linalg.inv(cov_original)

exp_argument = inv_cov[0,0] * (X - mean_original[0])**2 + 2.0 * inv_cov[0,1]*(X - mean_original[0])*(Y - mean_original[1]) + inv_cov[1,1]*(Y - mean_original[1])**2
multivariate_matrix = np.exp(- 0.5 * exp_argument)/ np.sqrt(4.0 * np.pi**2 * np.linalg.det(cov_original) )

# Choose the levels to plot. 
peak = multivariate_matrix.max()
levels_to_show = peak * np.array([0.0003335 , 0.01, 0.13 , 0.6 , 1.0])
# Plot the countours of the multivariate gaussian
fig, ax = plt.subplots(1,1, figsize=(8,6))
# I am plotting the log of the distribution because it is easier to show the colorbar
cntr = ax.contourf(X, Y, np.log(multivariate_matrix), levels = np.log(levels_to_show), cmap = "Set3")
# Plot the data. We need to plot the multivariate first otherwise the points don't show.
ax.scatter(X_original[:,0], X_original[:,1], s = 5, c = 'k',  alpha = 0.1, label = "Data")

cbar = fig.colorbar(cntr,
                    ticks=np.log(levels_to_show),
                    format=mticker.FixedFormatter([r'4$\sigma$', r'3$\sigma$', r'2$\sigma$', r'$\sigma$', 'Peak']),
                    extend='both'
                    )

ax.set( xlabel = r"$X_1$", ylabel = r"$X_2$")

As you can see most of the data falls into the \(3\sigma\) distance from the center (0, 0). Let’s add some outliers

# Generate outliers in the top left corner. These are easy to spot on a 2D plot
mean = mean_original + np.array([-4, 4])
cov_matrix = np.array([[1, 0.], [0., 1]])
num_outliers = int(0.0015*total_size)
easy_outliers = np.random.multivariate_normal(mean, cov_matrix, num_outliers)

# Generate outliers that are a little closer to the real data. 
mean = mean_original + np.array([5, 5])
cov_matrix = np.array([[1, 0.0], [0.0, 1]])
num_outliers = int(0.0015*total_size)
hard_outliers = np.random.multivariate_normal(mean, cov_matrix, num_outliers)
# Combine the data and outliers
X_data = np.vstack([X_original, easy_outliers, hard_outliers])

# Put everything into a Dataframe 
data_df = pd.DataFrame( X_data, columns=["X_1", "X_2"])

Let’s now remake the original plot and show the outliers.

# Remake the previous plot and show the outliers now
fig, ax = plt.subplots(1,1, figsize=(8,6))
cntr = ax.contourf(X, Y, np.log(multivariate_matrix), levels = np.log(levels_to_show), cmap = "Set3")
# Plot the data. We need to plot the multivariate first otherwise the points don't show.
ax.scatter(X_original[:,0], X_original[:,1], s = 5, c = 'k', alpha = 0.1, label = "Data")

cbar = fig.colorbar(cntr,
                    ticks=np.log(levels_to_show),
                    format=mticker.FixedFormatter([r'4$\sigma$', r'3$\sigma$', r'2$\sigma$', r'$\sigma$', 'Peak']),
                    extend='both'
                    )

# Plot the easy to spot outliers as cyan points
ax.scatter(easy_outliers[:,0], easy_outliers[:,1], s = 10, marker = '*', c = 'c', label = 'Easy to spot outliers')

# Plot the not so easy to spot outliers as magenta points
ax.scatter(hard_outliers[:,0], hard_outliers[:,1], s = 10, marker  ='*', c = 'm',label = 'Hard to spot outliers')
# Don't forget the labels !!!
ax.set( xlabel = r"$X_1$", ylabel = r"$X_2$")

Well, that was a lot of work. Can seaborn make a similar plot with less code?

Task 3.1.1: (3 points) Run the code below and add comments

# # comment this line in detail
g = sns.JointGrid(data = data_df, x="X_1", y="X_2")
h1 = g.plot_joint(sns.histplot, pthresh = 0.001)
# s1 = g.plot_joint(sns.scatterplot, s=20, c = 'k')
k1 = g.plot_joint(sns.kdeplot, color="y", levels = levels_to_show)
m1 = g.plot_marginals(sns.histplot, bins = 'fd', kde = True)
h1.set_axis_labels( xlabel = r"$X_1$", ylabel = r"$X_2$")
# # take your time figuring this one out! change the kind to { “scatter” | “kde” | “hist” | “hex” | “reg” | “resid” }

Task 3.1.2: (10 points) Using your own words (not the documentation’s) answer the following questions

  1. What does JointGrid do?

  2. What does the option pthresh in sns.histplot do? Play with it.

  3. What does sns.kdeplot do?

  4. Why are the contours different that the ones we draw using contourf?

  5. What does bins = 'fd' do? What are other options?

Put your answers here!

Part 3.2. What is an Anomaly Anyway? (12 points)#

What you need first is a definition of “anomaly”; you decide to do this by first exploring your data. We have done this by making plots above. The plots show that most of the data (99% of it to be exact) can be described by a multivariate normal distribution centered at the coordinates given by mean_original. Thus, we can define the anomalies as all those points that are outside the normal distribution, to be precise all those points that are \(3 \sigma\) and more away from the center.

However, we would like to have a better way to find outliers when our data has more than two dimensions. As mentioned in class, one way to do this would be to calculate the distance of all the points and set a threshold value. In the next part you are going to compare two distance measures.

Task 3.2.1: (6 points) Imagine you didn’t create the data above and it was just given to you. In the next code cell calculate the mahalanobis distance, the euclidean distance, and the standard euclidean distance for each row of the data_df dataset.

Note: You need to find the distance functions in scipy.

from scipy.spatial import distance

# 3 points for getting the right cov 
mean = X_data.mean(axis = 0)
cov_mat = np.cov(X_data, rowvar=False)
inv_sigma = np.linalg.inv(cov_mat)

# 2 points for calculating correctly the distances 

# Calculate the three distances for each data point and append them to these lists
mahalanobis_distances = []
euclidean_distances = []
seuclidean_distances = []
for i in range(X_data.shape[0]):
    mahalanobis_distances.append(distance.mahalanobis(X_data[i], mean, inv_sigma))
    euclidean_distances.append(distance.euclidean(X_data[i], mean))
    seuclidean_distances.append(distance.seuclidean(X_data[i], mean, X_data.var(axis = 0)))
    
# Convert the distances to a pandas series
mahalanobis_series = pd.Series(mahalanobis_distances, name='Mahalanobis_Distance')
euclidean_series = pd.Series(euclidean_distances, name='Euclidean_Distance')
seuclidean_series = pd.Series(euclidean_distances, name='SEuclidean_Distance')
# Attach the distances to the dataframe
df = pd.concat([data_df, mahalanobis_series, euclidean_series,seuclidean_series], axis=1)
df

Task 3.2.2: (2 points) Now that you have calculated the distances you need to choose a threshold.

# Choose a threshold for each distance and plot
threshold = df["Mahalanobis_Distance"].mean() + 4 * df["Mahalanobis_Distance"].std()
M_data = df[ df["Mahalanobis_Distance"] > threshold] 
threshold = df["Euclidean_Distance"].mean() + 3.5 * df["Euclidean_Distance"].std()
E_data = df[ df["Euclidean_Distance"] > threshold] 
threshold = df["SEuclidean_Distance"].mean() + 3.5 * df["SEuclidean_Distance"].std()
SE_data = df[ df["SEuclidean_Distance"] > threshold] 

# Let's plot
fig, ax = plt.subplots(1,4, figsize = (16, 4), sharey=True)
ax[0].scatter(X_original[:,0], X_original[:,1], alpha= 0.1, c = 'k', s = 10)
ax[0].scatter(easy_outliers[:,0], easy_outliers[:,1],  c = 'b', s = 10)
ax[0].scatter(hard_outliers[:,0], hard_outliers[:,1],  c = 'b', s = 10)
ax[1].scatter(X_data[:,0], X_data[:,1], alpha= 0.1, c = 'k', s = 10)
ax[1].scatter(M_data["X_1"], M_data["X_2"],  c = 'r', s = 10)
ax[2].scatter(X_data[:,0], X_data[:,1], alpha= 0.1, c = 'k', s = 10)
ax[2].scatter(E_data["X_1"], E_data["X_2"],  c = 'r', s = 10)
ax[3].scatter(X_data[:,0], X_data[:,1], alpha= 0.1,c = 'k', s = 10)
ax[3].scatter(SE_data["X_1"], SE_data["X_2"],  c = 'r', s = 10)

for a, t in zip(ax, ["Original + outliers", "Mahalanobis outliers", "Euclidean outliers", "Standard Euclidean outliers"]):
    a.set( xlabel = r'$X_1$', title = t)
ax[0].set_ylabel(r'$X_2$')
# plt.title('Data in Zero Region Are Possible Attacks')

Task 3.2.3: (4 points) Do your distances find all the outliers? Why or why not?

Put your answers here!


Assignment wrap-up#

Please fill out the form that appears when you run the code below. You must completely fill this out in order to receive credit for the assignment!

from IPython.display import HTML
HTML(
"""
<iframe 
	src="https://forms.office.com/r/QyrbnptkyA" 
	width="800px" 
	height="600px" 
	frameborder="0" 
	marginheight="0" 
	marginwidth="0">
	Loading...
</iframe>
"""
)

© Copyright 2023, Department of Computational Mathematics, Science and Engineering at Michigan State University.