In order to successfully complete this assignment, you must follow all the instructions in this notebook and upload your edited ipynb file to D2L with your answers on or before 11:59pm on Friday September 25th.
BIG HINT: Read the entire homework before starting.
In this homework, we will download and explore some widely available datasets using the Python Scikit-learn module. We want you to start thinking of data samples as "feature vectors". Each sample in a dataset is composed of $n$ measurements. The individual measurements within a vector do not necessarily relate to one another, but each measurement $v_i \in V$ corresponds to a "similar" measurement $u_i \in U$ in another sample.
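For example, two wine samples can be written as vectors whose components line up feature by feature. A toy illustration (the feature names come from the wine dataset, but the numbers here are made up):

```python
import numpy as np

# Two hypothetical wine samples; position i in each vector measures the same feature.
features = ["alcohol", "malic_acid", "ash"]
u = np.array([13.2, 1.8, 2.4])  # sample U (made-up values)
v = np.array([12.4, 2.1, 2.2])  # sample V (made-up values)

# u[0] and v[0] are both alcohol measurements, u[1] and v[1] both malic acid, etc.
for name, ui, vi in zip(features, u, v):
    print(f"{name}: {ui} vs {vi}")
```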
The "Wine" is provided with the Sklearn library and is an easy dataset use as an example.
✅ **DO THIS:** Run the following code to download the wine dataset.
%matplotlib inline
import matplotlib.pylab as plt
import numpy as np
import sklearn.datasets as sdata
sk_data = sdata.load_wine()
Let's inspect the `sk_data` object by using the `dir` command:
dir(sk_data)
The `DESCR` attribute looks interesting. Let's print it out and see what is going on...
print(sk_data.DESCR)
✅ **DO THIS:** How many features are there in this dataset? (You could count them manually, but it is highly recommended that you write code to find the right number. We may be using similar datasets in the future, and it is generally better to write code that is portable for when things change.) Store the number of features in the vector as the Python variable `N` so that you can check your answer below:
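As a reminder of the general technique (shown on a toy array, not the wine data), the `shape` attribute of a `numpy` array reports its dimensions as a `(rows, columns)` tuple:

```python
import numpy as np

# A toy array with 3 samples (rows) and 4 measurements (columns).
toy = np.array([[1, 2, 3, 4],
                [5, 6, 7, 8],
                [9, 10, 11, 12]])

print(toy.shape)     # (3, 4)
print(toy.shape[0])  # number of rows (samples)
print(toy.shape[1])  # number of columns (features)
```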
#put your answer to the above question here.
from answercheck import checkanswer
checkanswer(N,'c51ce410c124a10e0db5e4b97fc2af39');
✅ **DO THIS:** In this dataset, how many different wines were tested using these $n$ features? Again, write code to calculate the answer instead of just "hard coding" the number. Store the size in a variable named `M`:
#put your answer to the above question here.
from answercheck import checkanswer
checkanswer(M,'8f85517967795eeef66c225f7883bdcb');
The following figure plots each feature for every wine in the dataset.
plt.figure(figsize=(20,10))
plt.plot(sk_data.data);
plt.legend(sk_data.feature_names)
Another way to look at this dataset is as a large 2D array (or matrix), which can be viewed as an image using the `imshow` function. In this case we choose the "Reds" colormap to keep with the wine theme.
plt.figure(figsize=(20,2))
plt.imshow(sk_data.data.T,cmap="Reds")
plt.colorbar()
The `pandas` library can be helpful for viewing data. Here we will just show the basics for turning the raw data stored as a `numpy` array into a `pandas.DataFrame` object. Each row is a particular wine and the columns are the individual feature measurements.
import pandas
df = pandas.DataFrame(sk_data.data, columns = sk_data.feature_names)
df
Another useful `pandas` function is `describe`, which gives some basic statistics for the measurements. Check to make sure these statistics match up with the ones provided in the dataset's `DESCR`.
df.describe()
Now that you have a feel for the type of data available in the wine dataset, we need to build a measure to compare the wines. The following is a "stub" function. Modify it to return the Euclidean distance between two input feature vectors.
HINT: a good solution is one that does not "hard-code" properties of the wine dataset (such as vector length); instead, a good function will calculate the distance between any two vectors in $R^n$ for any size of $n$.
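Recall the definition of the Euclidean distance between two vectors $u, v \in R^n$:

$$d(u,v) = \sqrt{\sum_{i=1}^{n}(u_i - v_i)^2}$$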
def dist(u,v):
    """Return the Euclidean distance between vectors u and v (your code goes here)."""
    d = 0
    return d
Let's test our function on a couple of simple examples. The following are common examples for which we know the values:
dist([0,0],[0,1]) == 1
dist([0,0, 0],[1,0, 0]) == 1
dist([0,0],[3,4]) == 5
anyvec = [1,22,3,444,5.123,69,2229,42.0]
dist(anyvec,anyvec) == 0
from answercheck import checkanswer
checkanswer(dist(sk_data.data[0,:],sk_data.data[51,:]),'7a502b88ac326e0d79fe2cc8f33efd15');
Assuming the distance measure above is working, we can calculate the distance between all pairs of wines. Notice that this graph is symmetric because the distance between wine A and wine B is the same as the distance between wine B and wine A. Also notice that the diagonal of the matrix is always zero, i.e. the distance between wine A and itself is zero.
def distance_matrix(A):
    """Compute and display the matrix of pairwise distances between the rows of A."""
    m = A.shape[0]  # number of samples; avoids hard-coding the dataset size
    distmatrix = np.zeros((m, m))
    for i in range(m):
        for j in range(m):
            distmatrix[i, j] = dist(A[i, :], A[j, :])
    plt.figure(figsize=(20, 10))
    plt.imshow(distmatrix, cmap="Reds")
    plt.colorbar()
    return distmatrix

distmatrix = distance_matrix(sk_data.data)
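As an aside, the double loop above is simple but slow for large datasets. If SciPy is available, `scipy.spatial.distance.cdist` computes the same matrix in one vectorized call (a sketch, assuming your `dist` implements the standard Euclidean distance):

```python
from scipy.spatial.distance import cdist

# Pairwise Euclidean distances between all rows of sk_data.data.
fast_distmatrix = cdist(sk_data.data, sk_data.data, metric="euclidean")

# Should agree with the looped version (up to floating point error).
print(np.allclose(fast_distmatrix, distmatrix))
```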
Datasets such as the wine dataset are considered "high dimensional" when the number of measurements in their feature vectors gets large. What counts as "large" depends a little on what you are trying to do. If you are trying to visualize the data, anything bigger than 2 or 3 dimensions is large (it is hard for the human brain to visualize data in more than three dimensions).
Later in the semester you will learn how to use "eigenvectors" and "eigenvalues" to do Principal Component Analysis (PCA) of high-dimensional data. PCA is probably the most common algorithm in a class of algorithms used for "dimensionality reduction". The purpose of these algorithms is to summarize complex, high-dimensional data in a smaller set of dimensions that fit the problem you are trying to solve. In our case, we want to visualize the $N$ feature measurements using only 2 axes.
To learn more about PCA try checking out the PCA wikipedia page.
Before we get into all of the details about how to do the PCA math using eigenvalues/eigenvectors, we will just use the PCA function available in the sklearn library.
The following code imports the PCA function and reduces the wine data (`sk_data.data`) down to its two largest principal components. Think of each principal component as a weighted sum of the original features, specifically designed to retain as much information as possible.
#Reduce the data down to two principal components to make plotting easier.
from sklearn.decomposition import PCA
reduced_data = PCA(n_components=2).fit_transform(sk_data.data)
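If you are curious how much of the data's variation the two components actually retain, the fitted `PCA` object exposes an `explained_variance_ratio_` attribute. A small sketch of one way to inspect it:

```python
# Fit the PCA object first, then ask how much variance each component captures.
pca = PCA(n_components=2)
reduced_data = pca.fit_transform(sk_data.data)
print(pca.explained_variance_ratio_)  # fraction of total variance per component
```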
Now we plot the reduced data with different colors corresponding to the three wine classes that are included with the data (`sk_data.target`):
#Strip out the three classes of data and plot
class0 = reduced_data[sk_data.target==0,:]
class1 = reduced_data[sk_data.target==1,:]
class2 = reduced_data[sk_data.target==2,:]
plt.scatter(class0[:,0],class0[:,1])
plt.scatter(class1[:,0],class1[:,1])
plt.scatter(class2[:,0],class2[:,1])
plt.xlabel('First principal component')
plt.ylabel('Second principal component')
We can now see each of the sample wines and their relationships to one another. Unfortunately, the first two principal components do not have any units, so it is hard to interpret their meaning. In the next section we will use "normalization" to clean up the data and make it easier to visualize.
One problem with the above PCA is that we treat each measurement in the feature vector as having the same units. This means some measurements get more "weight" in the PCA analysis just because they are numerically bigger. One way to fix this problem is to "normalize" all the measurements to values between zero (0) and one (1). This normalization step allows us to better compare the measurements.
Let us assume that the above data is stored in a matrix $data$ with each row ($i \in M$) representing a wine and each column ($j \in N$) representing a feature. We want to "normalize" each measurement to a value between zero (0) and one (1) using the following equation:
For each wine and each feature ($j \in N$): $$A_{i,j} = \frac{data_{i,j} - min_j}{max_j-min_j}$$
where $min_j$ is the minimum value of the $j$th feature and $max_j$ is the maximum value of the $j$th feature.
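To make the formula concrete, here is a minimal sketch that applies it to a small toy matrix (made-up numbers, not the wine data):

```python
import numpy as np

toy = np.array([[1.0, 200.0],
                [2.0, 400.0],
                [3.0, 300.0]])

# Column-wise minimum and maximum: one value per feature.
col_min = toy.min(axis=0)
col_max = toy.max(axis=0)

# Broadcasting applies the equation to every entry at once.
toy_normalized = (toy - col_min) / (col_max - col_min)
print(toy_normalized)  # each column now spans [0, 1]
```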
✅ **DO THIS:** Write a program to normalize all of the values in the `sk_data.data` dataset. Store the normalized values in a matrix $A$. HINT: avoid writing lots of loops; libraries such as `numpy`, `pandas`, and `scikit-learn` all have functions that may help turn a 20-line program into 3 lines of code.
#Put your answer to the above question here.
from answercheck import checkanswer
checkanswer(A,'85608294aee283f63b58cfdc8da99a7c');
plt.figure(figsize=(20,2))
plt.plot(A);
%matplotlib inline
import matplotlib.pylab as plt
plt.figure(figsize=(20,2))
plt.imshow(A.T, cmap='Reds');
plt.colorbar();
✅ **Do This:** Copy and paste the code from the above PCA section and replace `sk_data.data` with the normalized matrix `A`.
# YOUR CODE HERE
raise NotImplementedError()
✅ **Question:** Compare and contrast the graph generated in part 3 with the one generated in part 4. In your own words, explain why the normalized data is "better".
YOUR ANSWER HERE
Turn in your assignment using D2L no later than 11:59pm on the day of class. See links at the end of this document for access to the class timeline for your section.
Written by Dirk Colbry, Michigan State University
This work is licensed under a Creative Commons Attribution-NonCommercial 4.0 International License.