Jupyter - Lecture 1#

Python Review#

In today’s assignment, we are going to go over some of the Python commands you’ve (hopefully!) seen before.
In order to successfully complete this assignment you need to participate both individually and in groups during class. This lab loosely follows the lab content from the ISLP textbook, Ch 2.3.

IMPORTANT: If you are looking at this notebook prior to class with the intention of doing it early PLEASE DON’T! The intention of these notebooks is for group work in class, so it will be much more interesting if you don’t have it done already and can talk with the rest of the group members.

1. The Dataset#

In this module, we will be using the Auto.csv data set from the textbook website. You can download the data directly by finding the link on the data set page on the course website.

Info about the data set#

Auto: Auto Data Set#

Description

Gas mileage, horsepower, and other information for 392 vehicles.

Format

A data frame with 392 observations on the following 9 variables.

  • mpg: miles per gallon

  • cylinders: Number of cylinders between 4 and 8

  • displacement: Engine displacement (cu. inches)

  • horsepower: Engine horsepower

  • weight: Vehicle weight (lbs.)

  • acceleration: Time to accelerate from 0 to 60 mph (sec.)

  • year: Model year (modulo 100)

  • origin: Origin of car (1. American, 2. European, 3. Japanese)

  • name: Vehicle name

The original data contained 408 observations but 16 observations with missing values were removed.

Source

This dataset was taken from the StatLib library which is maintained at Carnegie Mellon University. The dataset was used in the 1983 American Statistical Association Exposition.


2. Load the data set#

# As always, we start with our favorite standard imports. 

import numpy as np
import pandas as pd
import matplotlib.pyplot as plt 
%matplotlib inline

First, load in your csv file as a pandas data frame. Save this data frame as auto.

# If you are pulling straight from the course's git repo, 
# this command will open the file already on your system.
# However, if you are doing something else to access the 
# notebook, such as downloading from the github website,
# you might need to modify this command to point to the 
# right place.
auto = pd.read_csv('../../DataSets/Auto.csv')

If that worked and you managed to load the file, the following command should show you the top of your data frame.

auto.head()

…and the following command should show you the column labels.

auto.columns

The shape attribute tells us the size of the dataframe as a (rows, columns) pair.

auto.shape
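As a quick illustration of what shape gives back, here is a minimal sketch on a made-up three-row frame. The name toy and its values are invented for this example and are not taken from Auto.csv.

```python
import pandas as pd

# Toy data, invented for illustration only (not from Auto.csv)
toy = pd.DataFrame({
    "mpg": [18.0, 15.0, 16.0],
    "cylinders": [8, 8, 8],
})

# shape is an attribute, not a method, so there are no parentheses
n_rows, n_cols = toy.shape
print(n_rows, n_cols)
```

The first number counts rows (data points) and the second counts columns (variables).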

Q: How many data points do we have? How many variables do we have?

Your answer here

3. Cleaning up the data set#

Here’s one thing this class won’t really show you….. real data is MESSY. You almost never are handed a data set that’s ready to go for analysis off the bat. You’ll have to spend a bit of time cleaning up your data before you can use the awesome tools we have in this class. So, to that end, let’s do some careful checking of this data set before we get started. My favorite place to start with any data set is the describe command.

auto.describe()

Q: What columns are missing from the describe output?

Your answer here

The next thing to check is to see if there is any missing data. Usually, this is an entry in your dataframe that shows up as np.nan, which is a special value from numpy that just means the data is missing from that cell.

We can use the isna() command to see if there are any null values around.

auto[auto.isna().any(axis=1)]
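That one-liner packs three steps together. Here is a sketch that pulls them apart on a small invented frame (toy and its values are assumptions made up for this example):

```python
import numpy as np
import pandas as pd

# Toy data, invented for illustration only
toy = pd.DataFrame({
    "mpg": [18.0, np.nan, 16.0],
    "name": ["car a", "car b", None],
})

cell_mask = toy.isna()             # True wherever a single cell is missing
row_mask = cell_mask.any(axis=1)   # True for rows with at least one missing cell
rows_with_missing = toy[row_mask]  # keep only the flagged rows
print(rows_with_missing)
```

If no cell is missing, row_mask is False everywhere and the result is an empty data frame.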

Hey cool, no NaN’s to be found! You know what, maybe it’s a good idea just to take a second look. Check out the first 40 rows of the data set:

auto.head(40)

Q: What symbol(s) is this data set using to represent missing data?

Your answer here

DO THIS: Well, it’s not the end of the world, but there are lots of built in commands in pandas that make life easier when we have np.nan used as the missing value entry. So, use the replace command to swap out the missing value they used for np.nan everywhere it shows up.
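If you need a reminder of how replace behaves, here is a minimal sketch on invented data. The "--" placeholder is an assumption made up for this example; the symbol in your data set may well be different.

```python
import numpy as np
import pandas as pd

# Toy data; the "--" placeholder is hypothetical
toy = pd.DataFrame({
    "horsepower": ["130", "--", "150"],
    "weight": [3504, 3693, 3436],
})

# replace returns a new frame; toy itself is left unchanged
cleaned = toy.replace("--", np.nan)
print(cleaned["horsepower"].isna().sum())
```

Because replace does not modify the frame in place by default, you need to assign the result back to a variable to keep it.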

#---- your code goes in here! ---#

If you did that right, the following command should show you the 5 rows that now have a np.nan entry somewhere.

# Find the rows with a NaN somewhere
auto[auto.isna().any(axis=1)]

DO THIS: Finally, let’s just get rid of those rows from the data set entirely. Overwrite the auto data frame with the version that deletes those 5 rows using the dropna command.
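As a reminder of what dropna does, here is a sketch on a made-up frame (toy and its numbers are invented for this example):

```python
import numpy as np
import pandas as pd

# Toy data, invented for illustration only
toy = pd.DataFrame({
    "mpg": [18.0, np.nan, 16.0, 17.0],
    "weight": [3504, 3693, np.nan, 3449],
})

# By default, dropna removes every row that contains at least one NaN
toy = toy.dropna()
print(toy.shape)
print(toy.index.tolist())
```

Note that the surviving rows keep their original labels, so the index can have gaps after rows are dropped.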

#---- your code goes in here! ---#
# If that worked, you now have 392 rows
auto.shape

Fixing horsepower#

One last weird data cleanup for us to do on this data set. Check out the horsepower column.

auto['horsepower']

Compare that to, for example, the weight and name columns.

auto['weight']
auto['name']

The dtype tells us what kind of data pandas thinks is contained in there. int64 is for integers, which is what horsepower should hold, but for some reason* pandas is treating it like object data, which is what pandas uses for basically anything else, like data inputs that are strings. This makes sense for the name column, but we’d like to fix it for the horsepower column now that we’ve fixed the np.nan issue. The code below returns the horsepower column with dtype: int64.

*The reason is related to the weird choice of null entry for this data set. It’s also why describe above didn’t have the horsepower column.

auto['horsepower'].astype('int')
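To see the conversion in isolation, here is a sketch on a small invented Series of number-like strings (hp and its values are made up for this example):

```python
import pandas as pd

# Toy column: numbers stored as text, invented for illustration
hp = pd.Series(["130", "165", "150"], name="horsepower")
print(hp.dtype)   # a text dtype, because every entry is a string

# astype returns a converted copy; hp itself is unchanged
hp_int = hp.astype("int")
print(hp_int.dtype)
print(hp_int.sum())
```

Once the column is numeric, arithmetic and summaries such as sum or mean work on it. The conversion raises an error if any entry cannot be parsed as an integer, which is why the missing values have to be handled first.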

DO THIS: Overwrite the horsepower column in the auto data frame with this fixed version.

#---- Your code here! ----#

4. Extracting data from a frame#

Ok, I know everyone needs a reminder on this (It’s me. I can never remember any of this without googling it…..), let’s just do a quick refresh on how to get out portions of your data table.

First, you can get a whole column (which is known as a pandas Series) like this:

auto['weight']

For more fine-grained control, there are two commands that are used: loc, and iloc.

auto.head()
# `loc` takes the labels as inputs to find a particular point
auto.loc[3,'weight']
# `iloc` takes the indices. Here's how to get the same number
auto.iloc[3,4]

In this case, the row entry is the same for both (3) because the rows happen to be labeled with their number. However, for .loc, we need the name of the column we want (weight), while for .iloc we need the number of the column (4 because we count from zero…).
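The two only agree on the row when the labels match the positions. Here is a sketch of what happens when they do not, using an invented frame whose row labels have gaps (as can happen after dropping rows). The name toy and all values are assumptions for this example.

```python
import pandas as pd

# Toy data with row labels 0, 2, 5 (invented for illustration)
toy = pd.DataFrame(
    {"mpg": [18.0, 15.0, 16.0], "weight": [3504, 3693, 3436]},
    index=[0, 2, 5],
)

by_label = toy.loc[2, "weight"]   # the row *labeled* 2, which is the second row
by_position = toy.iloc[2, 1]      # the third row, second column
print(by_label, by_position)
```

Here the same number 2 picks out different rows, so it is worth checking the index before mixing loc and iloc.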

DO THIS: Extract a data frame with rows 3,4,5, and 6, and with information on displacement, horsepower, and weight.

#---- Your code here!----#

5. Plotting#

The third-ish thing I do with a new data set that I’m trying to understand is to just start plotting random things. This is great for getting a sense of ranges for values, as well as to start looking for simple correlations.

In this class, we will use two python modules for plotting, depending on which has the tools we want:

  • matplotlib. This is basically the standard plotting tool. It does basically anything you want (albeit with a bit of pain and suffering and a few choice four letter words along the way). You’ve already seen this package in CMSE 201 at least.

  • seaborn. This is helpful for some prepackaged figure generation that we will make use of. It’s actually built on top of matplotlib, but often has simpler syntax.

# Make sure you run this to import seaborn. If this doesn't run
# for some reason, check that it's installed. 
import seaborn as sns
auto.head()

DO THIS: First, use matplotlib’s hist command to show a histogram of the weight data.

Hint: if you ever forget how to use a command, you can of course google it, but you can also type ? before the name of the command to see the help info from inside the jupyter notebook, e.g.:

?plt.hist
#----- Your code here!-----#
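For a sense of what plt.hist hands back besides the picture, here is a sketch on a short list of made-up numbers (values is invented for this example). Along with drawing the bars, the call returns the bin counts and the bin edges.

```python
import matplotlib.pyplot as plt

# Made-up numbers, for illustration only
values = [1, 2, 2, 3, 3, 3, 9]

# bins=4 splits the range from min to max into 4 equal-width bins
counts, edges, _ = plt.hist(values, bins=4)
plt.xlabel("value")
plt.ylabel("count")
print(counts)
print(edges)
```

Each bin includes its left edge but not its right edge, except the last bin, which includes both.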

Plot the same histogram using seaborn’s histplot.

#---- Your code here! ----#

The next useful tool is to see data points scattered with each other in 2 dimensions.

DO THIS: Draw a scatter plot of the weight variable vs mpg variable.

# This command should get you approximately the same thing, 
# but with the added perk of automatically labeling axes
sns.relplot(x="weight", y="mpg", data=auto);

Now here is what I think is one of the most useful tools in seaborn for use when you’re starting to understand a dataset….

sns.pairplot(auto)

Q: What is each graph on this giant grid showing you?

Your answer here


A note on the ISLP package#

The new version of the textbook has been set up in Python, and with it they have set up a Python package with the stuff needed for the labs.

In theory, you should be able to get it running using the command

pip install ISLP

however this caused a bunch of headaches when I tried to do it on my machine.

For the moment, I’m going to try to set up all labs without using this package. The issue is that their package is VERY restrictive about versions of dependent packages (numpy, matplotlib, etc), which means you would need to downgrade many standard packages on your system, or be comfortable with using conda environments. If that is something you are comfortable with, you can still follow the Installation Instructions to install the package.


Congratulations, we’re done!#

Written by Dr. Liz Munch, Michigan State University
This work is licensed under a Creative Commons Attribution-NonCommercial 4.0 International License.