In-Class Assignment: Reviewing Pandas and exploring data#

Day 5#

CMSE 202#

✅ Put your name here

#

✅ Put your group member names here

#

Goals for today’s In-class assignment#

  1. Make sure that everyone remembers how to do the basics with Pandas.

  2. Do a little data analysis with some existing data sets.

  3. Make some simple visualizations of the data.

Assignment instructions#

Work through the notebook making sure to write all necessary code and answer any questions.

This assignment is due by the end of class, and should be uploaded into the appropriate “In-Class assignments” submission folder. Submission instructions can be found at the end of the notebook.

Agenda for today:#

  1. Review of pre-class assignment

  2. Loading and exploring a dataset

  3. Visualizing your data

Useful imports (make sure to execute this cell!)#

Let’s get a few of our imports out of the way. If you find others you need to add, consider coming back and putting them here.

# necessary imports for this notebook
import pandas as pd
from pandas.plotting import scatter_matrix

1. Review of pre-class assignment#

Did anyone have any specific issues with the pre-class assignment?

Let’s take a moment to highlight some key concepts. Discuss with your table the following prompts and write down a brief definition of each of these concepts.

If you don’t feel like you have good working definitions yet, try doing a quick internet search to see if you can find a definition that makes sense to you.

Question 1: What are the features of a Pandas Series?

Do This - Write your disussion notes here.

Question 2: What are the differences between a DataFrame and a Series?

Do This - Write your disussion notes here.


2. Loading and exploring a dataset#

Though you review how one can build a DataFrame by hand in your pre-class assignment, the goal is typically to read some sort of preexisting data into a DataFrame so we can work with it.

Pandas is pretty flexible about reading in data and can read in a variety of formats. However, it sometimes needs a little help. Let’s start with a “toy” dataset, the Iris Data set .

A toy dataset is often one that has a particularly nice set of features or is of a manageable size such that it can be used as a good dataset for learning new data analysis techiques or testing out code. This allows one to focus on the code and data science methods without getting too caught up in wrangling the data. However, data wrangling is an authentic part of doing any sort of meaningful data analysis as data more often messier than not.

Although you will be working with a toy dataset today, you may still have to do a bit of wrangling along the way.

2.1 Getting used to looking for useful information on the internet#

Another authentic part of working as a computational professional is getting comfortable with searching for help on the internet when you get stuck or run into something that is unfamiliar to you. The Python data analysis/computional modeling world can be complicated and is ever-evolving. There is also a large number of publicly available toolsets with varying levels of complexity and nuance. Through practice and experience, we become better computational folks by learning how to search for better, more efficient, clearer ways to do things.

The Iris data#

The iris data set is pretty straight forward (review the wikipedia link above if you haven’t yet), especially if someone has done all of the data wrangling and cleaning for you. To make it a little more interesting we provide it in a raw form that you might encounter with other data.

✅ Do This: To get started, you’ll need to download the following two files:

https://raw.githubusercontent.com/msu-cmse-courses/cmse202-F22-data/main/data/iris.data

https://raw.githubusercontent.com/msu-cmse-courses/cmse202-F22-data/main/data/iris.names

Once you’ve done so, you should have access to the following : iris.data and iris.names. Open them both and discuss what you see. This is a good opportunity to use you favorite text editor or use something new. Feel free to ask your group members or instructors what they prefer to use.

Question 3: Describe the data and the information in the names file. What characteristics are provided there? Perhaps the iris dataset link above will help.

Do This - Erase the contents of this cell and replace it with your answer to the above question! (double-click on this text to edit this cell, and hit shift+enter to save the text). Practice using appropriate Markdown formating to make your answer’s easy to read.

2.2 Reading in a file#

Pandas supports a number of file formats, but one of the most common is the one provided by the read_csv function. Typically, the phrase csv stands for “comma separated values”, a text file format that excel and other spread sheets use. In a typical csv file, each line represents a line in a spreadsheet and each cell value is separated from the next by a comma.

However, you can use other separators as well. Look at the documentation for read_csv.

✅ Do This: Read the iris.data file into your notebook and assign appropriate column headers. Display the DataFrame to make sure it looks reasonable.

# replace this cell with code to read in the iris.data file with column headers. Use appropriate headers!

2.3 Describing your data#

The .describe() methods tells you about the data. If the data is too big to see (height and width) you can use the two options below to show more of the data. You can change the values to suit your needs.

# expand what is shown (rows and cols) by pandas, any time you change default options -- be thoughtful about your choices!
pd.set_option('display.max_rows', 500)
pd.set_option('display.max_columns', 500)

✅ Do This: Run the .describe() method on the data you read in. If you think about how you might classify the species contained in the dataset based on the four available features, can you discern anything helpful? Put your answer in the cell below.

# Put your code here

Do This - Erase the contents of this cell and replace it with your answer. Did you find any useful features? What makes those features useful in classfying the species of the irises?

2.4 Grouping your data#

You can perform operations to group elements using the .groupby() method. The result of the use of .groupby is a GroupBy object. If you were to print it you see only the class instance name as nothing is computed until an operation is performed on the groups, much like other iterators. However, you can use .describe() on the result of a .groupby() call. If you haven’t used this before, you may need to consult the documentation for .groupby()

✅ Do This: Make a GroupBy object using the iris DataFrame and group by the class/species of the data. Then run .describe() on the group result. Now take a look at the data from the group. Thinking again about classification, do you see any data features you could use to classify the data when you look at it this way?

# Put your code here

Do this - Erase this and put your answer here. Did you find any useful features? What makes those features useful in classfying the species of the irises? How did using .groupby() help (or not) in finding useful features?


3 Visualizing your data#

We are often limited in understanding our data because it is complex, has many features, or is quite large. In these situations, visualizations (plots, charts, and graphs) can help represent our data in ways that help us gain greater insight into the trends, features, and classes we want to understand.

3.1 Making a scatter matrix#

Rather than just looking at the raw values or exploring basic statistical values for our data, it would be better if we were to visualize the differences. Since the number of features is small, we can use a plotting feature of Pandas called scatter_matrix(). Did you notice that we imported scatter_matrix at the start of this notebook?

✅ Do This: Try calling the scatter_matrix function on your dataframe. Use a marker that you prefer. Look up the documentation for scatter_matrix on the Pandas website for more information! You may need to make the figure bigger to get a good look at it.

Note: There is a similar sort of plot that can be made with the seaborn package, which you may have seen before. If so, do you remember what it is?

# your code here

Question 4: Does this visualization help you to determine the features that might be useful for classification? Why or why not?

Do This - Erase the contents of this cell and replace it with your answer to the above question! (double-click on this text to edit this cell, and hit shift+enter to save the text). Practice using appropriate Markdown formating to make your answer’s easy to read.

3.2 Color coding your data#

The default scatter matrix probably isn’t as helpful as we might hope as it’s impossible to tell which points on the plot represent our classes/sepcies.

We could really use a separate color indication for each dot so we can tell the species apart. Essentially, we need to create an array such that, for each observation in the DataFrame, we can assign a color value to associate it with the appropriate species. It could be just an integer, or it could be one of the standard colors such as “red”.

✅ Do This: Create a new list, array, or Pandas Series object that maps each species to a particular color. Then recreate the scatter matrix using the c argument to give the points in the plot different colors.

# Put your code here

Hope you got something that’s a little more useful! OK, how do we read this now?

  • The diagonal shows the distribution of the four numeric variables of our example data.

  • In the other cells of the plot matrix, we have the scatterplots (i.e. correlation plot) of each variable combination of our dataframe.

Question 5: Are you better able to discern features that might be useful in classfying irises? What can you say about the features of each iris species? Can you separate one species easily? If so, using which feature(s)?

Do This - Erase the contents of this cell and replace it with your answer to the above question! (double-click on this text to edit this cell, and hit shift+enter to save the text). Practice using appropriate Markdown formating to make your answer’s easy to read.

3.3 Separating species of irises#

Now we will use the feature(s) you found above to try to separate iris species. In future parts of the course, we may explore how to do this using models, but for now, you will try to do this by slicing the DataFrame.

✅ Do This: One of these species is obviously easier to separate than the others. Can you use a Boolean mask on the dataframe and isolate that one species using the features and not the species label? Try to do so below and confirm that you were successful.

# Put your code here

✅ Do This: Can you write a sequence of Boolean masks that separate the other two as well? Are these as effective as the first? Can you think of anything you might do to improve the separation?

# Put your code here

Do This - Erase the contents of this cell and replace it with your answer. How effective your Boolean mask separation of the 3 species? Any ideas for improvements?


4 What if we didn’t have Pandas to read in this data?#

While Pandas makes life pretty easy for reading in csv and other file formats, all of this functionality is built on top of the standard Python capability. Python can natively open and read files lines by line, but working with files in this context might not be something you’re familiar with. For this part of the assignment, you’re going to explore how you can use just Python without any special packages to read in the iris data.

4.1 Code review: Loading the iris data with basic Python#

✅ Do This: To get started, you’ll need to download the following file:

https://raw.githubusercontent.com/msu-cmse-courses/cmse202-F22-data/main/code_samples/my_pandas.py

✅ Do This: Once you have the file downloaded and in the same location as this notebook open the file and read through it carefully. As you read the file add comments to the lines marked with # to explain what you understand the code to be doing.

There may be bits of Python code that you’ve not encounted before so make sure to talk with your group and search for information on the internet as you try to make sense of the code!

Jot down some of the things you learned or struggled you had in the cell below.

Do This - Erase the contents of this cell and make note of anything new you learned from reading through the code or highlight anything you don’t feel like you understand.

4.2 Using the code: importing the functions and reading in the data#

Now that you have the hand-made functions for reading in the iris data, we’re going to practice importing functions from script and then using them in your notebook.

The technique for doing this is the following:

from [PACKAGENAME] import [FUNCTION]

Where “PACKAGENAME” is the name of the script but without the “.py” part. So for my_pandas.py, the PACKAGENAME would be my_pandas. Then the FUNCTION is the name of the function from the script that you want to import, e.g. my_pandas_idx.

We’ll using this functionality in future assignments, so make sure you feel comfortable with this idea.

✅ Do This: Import the my_pandas function and the my_pandas_idx function (you can do this on one line, check to see if anyone one in your group knows how). Then, use these two functions to first read in the data and then extract just the second column from the data.

Once you’ve done this, print the first three rows of the iris data and all of the second column.

# Put your code here

5. Experimenting with generative AI#

In this part of the assignment we’re going to revisit using generative AI tools. Using your preferred AI tool, such as ChatGPT or Claude, see if you can develop code to perform the separation of the IRIS plant species better than your method above.

Basically, see if you can prompt the large language model (LLM) (in ChatCPT, Claude, or other) to develop code that will take the first four columns of the Iris dataset and predict the species.

Important note: Don’t forget to include comments in your code or in a markdown cell citing that you used ChatGPT (or another tool) to help you write the code. You should also make sure to include the prompt you used to get to your final code.

Note: This is intended to test the limits of current LLMs. You may or may not succeed in coaxing an LLM to generate a useful algorithm.

# Put your AI-assisted species identification code here
# Make sure to cite the tool you use and the key prompt you used to get to things working


# Example Citation:
# Code adapted from: 
# OpenAI. (2023). ChatGPT-3.5. Accessed from https://chat.openai.com/ on Sept 7, 2023. 
# Prompt chain:{'write python code to take first 4 columns of iris data set and predict the species of the flower'; 'do this only with threshold operations'; 'avoid using machine learned classifier'}

🛑 You made it to the end!#

Is there anything that we covered today that you’re still not feeling certain about? Talk to your group or check in with an instructor.



Congratulations, you’re done with your in-class assignment!#

Now, you just need to submit this assignment by uploading it to the course Desire2Learn web page for today’s in-class assignment submission folder. (Don’t forget to add all of the appropriate names in the first cell).

© Copyright 2024, Department of Computational Mathematcs, Science and Engineering at Michigan State University