Homework 3: Loading, Cleaning, Analyzing, and Visualizing Data with Pandas (and using resources!).

Exploring Nahuatl names of plants 🌿 and ahuacatl (🥑 avocado) production#

✅ Put your name here#

Learning Goals#

Use pandas to work with data and clean it
Make meaningful visual representations of the data
Fit curves to data and evaluate model fits

Assignment instructions#

Work through the following assignment, making sure to follow all of the directions and answer all of the questions.

This assignment is due at 11:59pm on Friday, March 27, 2026

It should be uploaded into D2L Homework #3. Submission instructions can be found at the end of the notebook.

Table of Contents#

Part 0. Academic Integrity Statement (2 points)

Part 1. Reading, describing, and cleaning data (22 points)

Part 2. Exploratory Data Analysis and Data visualization (21 points)

Part 3. Fitting curves to data (20 points)

Part 0. Academic integrity statement (2 points)#

In the markdown cell below, paste your personal academic integrity statement. By including this statement, you are confirming that you are submitting this as your own work and not that of someone else.

✎ Put your personal academic integrity statement here.

Before we read in the data and begin working with it, let’s import the libraries that we would typically use for this task. You can always come back to this cell and import additional libraries that you need.

import pandas as pd
import matplotlib.pyplot as plt
import numpy as np
import math
from scipy.optimize import curve_fit

Part 1: Reading, describing, and cleaning data (22 total points)#

This year, a new, special education abroad section of CMSE201 (Section 750, Coding across Cultures in San Miguel de Allende) will return from visiting Mexico after Spring Break, around the time that this homework is released. Mexico recognizes 63 indigenous languages. The largest of these by far is Nahuatl, with over 1.6 million speakers in Mexico. The name “Mexico” itself is derived from the Mexica people, who established Tenochtitlan, which is now Mexico City. Nahuatl and related languages are spoken throughout Mexico and extend north into what is now the United States of America and south into Central America.

It is not surprising that domesticated plants from Mexico retain their names in Nahuatl, and that these names, along with the cultural uses of the plants, have spread globally. For example, Nahuatl words like chīlli (🌶️ chile), cacaotl (🍫 chocolate), tomatl (🍅 tomato), and āhuacatl (🥑 avocado) have been incorporated into languages around the world, just as the uses of these plants have been adopted globally. Remarkably, we have a record of both Nahuatl names for plants and illustrations of them. The De la Cruz-Badiano Codex of 1552, produced shortly after the Conquest by a Nahua physician (tīcitl), demonstrated the medicinal value of the plants Nahua people used and documented botanical details about plants in Mexico. We have been studying the De la Cruz-Badiano Codex and recently wrote a manuscript applying data science techniques to analyze the Nahuatl names, text, and botanical illustrations in it. We also created a website where you can browse the text and illustrations in this manuscript, in English y español.

In the first part of the assignment, you will use the provided dataset nahuatl_names.csv to explore the Nahua classification of plants into 5 major groups:

🌱 xihuitl (“herb or leaf or green”)
🌳 quahuitl (“tree or woody”)
🌻 xochitl (“flower”)
❤️‍🩹 patli (“medicine or remedy”)
🥬 quilitl (“edible green”)

1.1 Read the data (1 point)#

✅ Task

Read in the data from nahuatl_names.csv into a Pandas dataframe (0.5 pt) and display the head of the data (0.5 pt).
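As a reminder of the general pattern, here is a minimal sketch on made-up data. The CSV text, the column names (item, kind, group), and the variable toy_df are all hypothetical stand-ins, not the assignment dataset. In your own code you would pass a file name to read_csv instead of the in-memory text used here.

```python
import io
import pandas as pd

# Hypothetical CSV content standing in for a file on disk.
# The trailing comma on the "granite" row leaves the last field empty.
csv_text = """item,kind,group
maple,plant,tree
granite,stone,
daisy,plant,flower
"""

# read_csv accepts a file path or any file-like object.
toy_df = pd.read_csv(io.StringIO(csv_text))

# head() returns the first five rows by default (here, all three).
print(toy_df.head())
```

Note that the empty field is read in as NaN, which is how pandas marks missing values.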

## your code here

1.2 Describe the data (3 points)#

1.2.1 ✅ Task (1 point)#

Use describe to display several summary statistics from the data frame.
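The sketch below shows what describe reports for text columns, using a small made-up dataframe (the names toy_df, item, kind, and group are hypothetical). When a dataframe has no numeric columns, describe summarizes the text columns with count, unique, top, and freq, and count only includes non-missing entries.

```python
import numpy as np
import pandas as pd

# Made-up data: one missing value in the "group" column.
toy_df = pd.DataFrame({
    "item": ["maple", "granite", "daisy", "oak"],
    "kind": ["plant", "stone", "plant", "plant"],
    "group": ["tree", np.nan, "flower", "tree"],
})

# For text columns, describe() reports count, unique, top, and freq.
summary = toy_df.describe()
print(summary)
```

Comparing the count row across columns is a quick way to spot columns with missing values.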

## your code here

1.2.2 ✅ Task (2 points)#

The columns in this dataset represent the following:

nahuatl_name: each entry has a unique name in Nahuatl
type: one of five classes, indicating if the name represents a:
- 🪴 plant
- 🪨 stone
- 🐆 animal
- 🐦 bird
- other
class: one of six classes represented in the Nahuatl name:
- 🌱 xihuitl (“herb or leaf or green”)
- 🌳 quahuitl (“tree or woody”)
- 🌻 xochitl (“flower”)
- ❤️‍🩹 patli (“medicine or remedy”)
- 🥬 quilitl (“edible green”)
- multiple (there are multiple classes of the above five represented in the name)

Using the results from describe above, write in the markdown cell below:

Relative to the total number of entries in nahuatl_name, how many entries are present in the type and class columns? (1 pt)
How do you explain the discrepancy between the count of the type and class columns?

Hint: refer to the results of using head on the df from the question above. Beyond the classes listed above, what else do you see? (1pt)

✎ Put your answer here:

Part 1 (1pt):

Part 2 (1pt):

1.3 Isolating and performing basic statistics on data (4 points)#

1.3.1 ✅ Task (1 point)#

Display the type column on its own using the name of the column.

## your code here

1.3.2 ✅ Task (1 point)#

Using .iloc, display the first five rows of just the type column.
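For reference, this is a small sketch of selecting a column by name versus selecting by position with .iloc. The dataframe and its columns (city, score) are made up for illustration.

```python
import pandas as pd

# Made-up data for illustration only.
toy_df = pd.DataFrame({
    "city": ["Lansing", "Detroit", "Flint", "Ann Arbor", "Holland", "Troy"],
    "score": [5, 1, 12, 6, 30, 14],
})

# Select a single column by its name; the result is a pandas Series.
city_col = toy_df["city"]

# .iloc indexes by integer position: the first three entries of that Series.
first_three = toy_df["city"].iloc[:3]

# Equivalent selection on the dataframe: rows 0-2, column position 0.
same_three = toy_df.iloc[:3, 0]

print(first_three)
```

The slice end is exclusive, as with regular Python slicing, so [:3] returns positions 0, 1, and 2.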

## your code here

1.3.3 ✅ Task (2 points)#

Using the value_counts and mode functions, print out the count of each level of type and the most abundant level.

Note: we don’t typically think of using mode on categorical data, but just as with continuous data, it gives the most prevalent value. Also note that the De la Cruz-Badiano Codex is an herbal that is mostly about plants.
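Here is a minimal sketch of both functions on a made-up categorical Series (the name colors and its values are hypothetical). Note that mode returns a Series, because ties are possible, so the first element is pulled out with [0].

```python
import pandas as pd

# Made-up categorical data.
colors = pd.Series(["red", "blue", "red", "green", "red", "blue"])

# value_counts tallies each distinct value, most frequent first.
counts = colors.value_counts()
print(counts)

# mode returns a Series of the most frequent value(s).
most_common = colors.mode()
print("Most common value:", most_common[0])
```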

## your code here

1.4 Filter the data using masking (10 points)#

1.4.1 ✅ Task (5 points)

For each of the six plant classes in class, you want to know how many entries are represented.

For each of the six classes below, calculate the total number of entries using masking. (5 pts)

🌱 xihuitl (“herb or leaf or green”)
🌳 quahuitl (“tree or woody”)
🌻 xochitl (“flower”)
❤️‍🩹 patli (“medicine or remedy”)
🥬 quilitl (“edible green”)
multiple (there are multiple classes of the above five represented in the name)

Print your results for each of the six class levels. From your mask, the function .sum() can be used to find the count of each class.

Your results should look like:

The count of class xihuitl is X
The count of class quahuitl is X
The count of class xochitl is X
The count of class patli is X
The count of class quilitl is X
The count of class multiple is X
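The idea behind counting with a mask can be sketched on made-up data as follows (toy_df, the group column, and its levels are hypothetical). A comparison produces a Series of True/False values, and summing it counts the True entries because True is treated as 1.

```python
import numpy as np
import pandas as pd

# Made-up data with two missing values.
toy_df = pd.DataFrame({"group": ["tree", np.nan, "flower", "tree", np.nan]})

counts = {}
for level in ["tree", "flower"]:
    # Boolean mask: True where the entry equals this level, False otherwise.
    mask = toy_df["group"] == level
    # Summing a boolean mask counts the True values.
    counts[level] = int(mask.sum())
    print(f"The count of group {level} is {counts[level]}")
```

Missing values compare as False, so they are not counted toward any level.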

## your code here

1.4.2 ✅ Task (5 pts)

One of the levels of class is multiple, in which multiple plant classes are represented.

Using masking, display or print out the entries in nahuatl_name that belong to the multiple class in class (3 pts)
Choose one name of the multiple class. List the name in Nahuatl and two of the plant class names that you see within it. (2pts)

Note: Within a name, xihuitl can be represented as xiuh- and quahuitl as quauh- or quauhtla

# your code here printing out "nahuatl_name" of entries of "multiple" class (3 pts)

✎ Put your answer here, choosing a nahuatl name from above and the classes you see in the name:

Part 1: Choose a Nahuatl name with multiple classes (1pt):

Part 2: Which plant classes do you see in the name? (1pt):

1.5 Clean out the NaN values (4 points)#

1.5.1 ✅ Task (1 point)#

There are so many NaN values in the class column!! Create a new, “clean” dataframe called clean_nahuatl_df (0.5 pt) that removes any entry with NaN values (0.5 pt). Use the built-in pandas function dropna to do this (google it to learn more!).
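As a quick illustration of what dropna does, here is a sketch on a made-up dataframe (toy_df, clean_toy_df, and the column names are hypothetical). By default, dropna returns a new dataframe without any row that contains at least one NaN, and leaves the original unchanged.

```python
import numpy as np
import pandas as pd

# Made-up data: the second row has a missing "group".
toy_df = pd.DataFrame({
    "item": ["maple", "granite", "daisy", "oak"],
    "group": ["tree", np.nan, "flower", "tree"],
})

# dropna() returns a copy with the incomplete rows removed.
clean_toy_df = toy_df.dropna()

# len() gives the number of rows; .shape gives (rows, columns).
print("Rows before:", len(toy_df))
print("Rows after: ", len(clean_toy_df))
print("Shape after:", clean_toy_df.shape)
```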

## your code here

1.5.2 ✅ Task (2 points)#

Now that you have the original and clean datasets, determine the number of rows in each of them (1 pt). Print out your results (1 pt).

## your code here

1.5.3 ✅ Task (1 point)#

Using the code in the cell below, you find that once you remove the entries with NaN plant classes, there are no animals left, and only a couple of stones, one “other”, and one bird. The plant class names can also refer to the color of objects, and xiuh- can mean blue-green, turquoise, or jade color. Using masking on your clean dataset, find and print out the name of the one bird with a plant class morpheme indicating “turquoise” in its name.

Your answer should be Xiuh-quechol-tototl (literally “turquoise brightly colored bird”), and you can read more about this turquoise bird and its medicinal uses here. Some believe it is the turquoise cotinga.
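Masks can also be combined when a single condition is not enough. The sketch below uses made-up data (toy_df, name, kind) and is only one possible approach: two boolean masks are joined with & (element-wise "and"), and the result is used to pull values from another column.

```python
import pandas as pd

# Made-up data for illustration only.
toy_df = pd.DataFrame({
    "name": ["bluebird", "bluebell", "redwood", "blue jay"],
    "kind": ["bird", "plant", "plant", "bird"],
})

# One mask per condition.
is_plant = toy_df["kind"] == "plant"
starts_blue = toy_df["name"].str.startswith("blue")

# Combine with & and use the mask to select values from the "name" column.
blue_plants = toy_df["name"][is_plant & starts_blue]
print(blue_plants)
```

When the comparisons are written inline rather than stored in variables, each one needs its own parentheses, for example (df["a"] == 1) & (df["b"] == 2).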

clean_nahuatl_df["type"].value_counts()

type
plant    110
stone      2
other      1
bird       1
Name: count, dtype: int64

## your code here

Part 2: Exploratory Data Analysis (21 total points) 🥑#

One of the most popular Nahuatl-named plants in recent years is the avocado, en español aguacate and in Nahuatl āhuacatl. You think to yourself that, to accommodate the global demand for guacamole (in Nahuatl āhuacamōlli), surely Mexico must be producing a lot more avocados in recent years! In fact, you wonder which other countries are top avocado producers, and find Colombia and the Dominican Republic just after Mexico.

A rich resource for the statistics of global agricultural production of most crops is the statistical service of the Food and Agriculture Organization of the United Nations (FAOSTAT). There, you find data relating to avocados in these three countries from 1961 to 2024.

The avocado data is provided to you as the ahuacatl.csv dataset.

It’s time to explore the data! Let’s visualize our data and look for correlations in our ahuacatl 🥑 dataset.

Part 2.1: Correlations (5 points)#

From 1961 to 2024, we have the avocados produced in tonnes for three countries in the columns mexico, colombia, and dominican_republic.

✅ Do this (2 points): 1) Read in the file ahuacatl.csv as a dataframe (1pt) and 2) Print or display a correlation matrix between the values in the columns of mexico, colombia, and dominican_republic.

Hint1: Look up the pandas corr function for dataframes.

Hint2: To select multiple columns, pass a list of column names inside the indexing brackets: df[["col1", "col2"]]
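To illustrate the two hints together, here is a sketch on made-up numbers (toy_df and the columns a, b, c are hypothetical). The double brackets select several columns as a new dataframe, and corr computes the pairwise Pearson correlation coefficients between them.

```python
import pandas as pd

# Made-up data: b rises with a, c falls as a rises.
toy_df = pd.DataFrame({
    "year": [2000, 2001, 2002, 2003, 2004],
    "a": [1.0, 2.0, 3.0, 4.0, 5.0],
    "b": [2.0, 4.1, 5.9, 8.2, 9.8],
    "c": [5.0, 4.0, 3.0, 2.0, 1.0],
})

# Inner brackets: a Python list of names. Outer brackets: dataframe indexing.
subset = toy_df[["a", "b", "c"]]

# Pairwise Pearson correlations; the diagonal is always 1.
corr_matrix = subset.corr()
print(corr_matrix)
```

Values near +1 or -1 indicate a strong linear relationship, and the sign gives the direction.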

# Put your code here

✅ Answer this question (3 points, 1 point for each part): In your opinion, would you consider any of the correlations between the columns to be strong? (1 pt). What direction are the correlations, positive or negative? (1 pt). In your own words, describe what you believe the correlations demonstrate for avocado production between the three countries? (1pt)

✎ Put your answer here

Part 1 (1pt):

Part 2 (1pt):

Part 3 (1pt):

Part 2.2: Visualizing the data (16 points)#

The numbers above give us a quantitative measure of the correlations in the dataset. But we need to see the data!!

Visualization in data science is a very important skill!!! In the next exercise, you will visualize the relationships between these variables as scatterplots. You will create a plot for each relationship: 1) mexico vs. year, 2) colombia vs. year, 3) dominican_republic vs. year, and 4) plots of mexico, colombia, and dominican_republic together vs. year.

We want to make a plot similar to the one below:

✅ Do this (13 points): Use matplotlib to make four plots like the ones above. For full points, do the following:

Create 4 subplots using plt.subplot. Use 2 rows and 2 columns (4 points).
Use plt.figure and the argument figsize to make a plot 8 inches wide by 5 inches tall (1 point).
Plot 1) mexico vs. year, 2) colombia vs. year, 3) dominican_republic vs. year, and 4) mexico, colombia, and dominican_republic together vs. year (1 point).
Provide a title for each subplot as follows: Mexico, Colombia, Dominican Republic, and Comparison (1 point).
Give the overall plot the title Avocado production in tonnes, 1961 to 2024 (1 point)
Provide x-axis labels as year (1 point).
Provide y-axis labels as tonnes (1 point).
For the final Comparison plot, use the label argument for each country and create a legend (2 points).
Use tight_layout to give your plot optimal sizing (1 point).
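The pieces listed above can be sketched on toy data as follows. This is a smaller layout (1 row, 2 columns) with made-up series and titles, so it is an illustration of the matplotlib calls rather than the requested figure. The matplotlib.use("Agg") line is an assumption that lets the sketch run without a display; in a notebook you can leave it out.

```python
import matplotlib
matplotlib.use("Agg")  # non-interactive backend; omit this line in a notebook
import matplotlib.pyplot as plt
import numpy as np

# Made-up data.
x = np.arange(10)
series = {"linear": x, "quadratic": x**2}

# figsize is (width, height) in inches.
fig = plt.figure(figsize=(8, 5))

# plt.subplot(rows, columns, index): indices start at 1.
plt.subplot(1, 2, 1)
plt.scatter(x, series["linear"])
plt.title("Linear")
plt.xlabel("x")
plt.ylabel("y")

plt.subplot(1, 2, 2)
for name, y in series.items():
    # label is the text that legend() will display for this series.
    plt.scatter(x, y, label=name)
plt.title("Comparison")
plt.xlabel("x")
plt.ylabel("y")
plt.legend()

# suptitle sets one title for the whole figure; tight_layout fixes spacing.
plt.suptitle("Toy data")
plt.tight_layout()
```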

# Put your code here

✅ Answer the following (3 points): Looking at the plots,

Explain how the plots account for the correlation coefficients you calculated in the previous question (1 point).
Which country is producing the most avocados overall? (1 point).
Describe any interesting patterns you see in production over time for any country (1 point).

✎ Put your answer here

Part 1 (1pt):

Part 2 (1pt):

Part 3 (1pt):

Part 3: Fitting curves to data. Avocado production as a function of year (20 points)#

Now that we have visualized our data we can formulate a question in a guided way. In this section, we will ask:

What is the relationship between year and avocado production in Mexico? Can we predict avocado production by year and into the future?

Part 3.1: Model#

In the above plots, notice that production in tonnes (mexico) increases approximately exponentially as year increases. Specifically,

\[ production(t) = a e^{b t} + c \]

Where \(production\) is the production of avocados in tonnes for Mexico represented by the column mexico, and \(t\) is the time in years represented by the column year.

a, b, and c are parameters to be modeled that represent the following:

a → initial scale (how much production “takes off”)

b → growth rate (the star of the show)

c → baseline offset (prevents forcing the curve through zero)

✅ Do this (4 points):

Write a function called exp_growth that calculates \(production\) based on \(t\) in years using the equation above. The equation should be constructed so that you can build a model using the curve_fit function in the next section.
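As a reminder of the layout curve_fit expects, here is a sketch using a different, hypothetical model (a power law, with made-up names power_law, k, and n). The independent variable comes first, followed by each parameter to be fit as its own argument. Using NumPy functions instead of the math module lets the function accept whole arrays at once.

```python
import numpy as np

def power_law(x, k, n):
    """Hypothetical model y = k * x**n, laid out the way curve_fit expects:
    independent variable first, then one argument per fit parameter."""
    return k * np.power(x, n)

# The function works on a whole array of x values in one call.
x = np.array([1.0, 2.0, 3.0])
y = power_law(x, 2.0, 2.0)
print(y)
```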

# Put your code here

Part 3.2: Fit the model#

✅ Do this (8 points):

For the time variable t, you need to subtract 1961 from year. This is because the data start in 1961 and time needs to start at 0 for the model to be fit (1 point)
To fit your model successfully, you must pass the initial parameter values argument p0 to curve_fit, as discussed in class. Use the following p0 (1 point): p0 = [2e6, 0.04, 1e5]
Now use curve_fit with your exp_growth function to find the \(a\), \(b\), and \(c\) parameters. (3 points)
Print out the value of \(a\), write “The value of a is…” (1 point)
Print out the value of \(b\), write “The value of b is…” (1 point)
Print out the value of \(c\), write “The value of c is…” (1 point)
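The workflow in the steps above can be sketched on synthetic data with a simpler, hypothetical model (a straight line; the names line, slope, and intercept are made up). It shows shifting the time variable so it starts at 0, passing initial guesses through p0, and unpacking the fitted parameters.

```python
import numpy as np
from scipy.optimize import curve_fit

def line(t, slope, intercept):
    """Hypothetical straight-line model used only for this illustration."""
    return slope * t + intercept

# Synthetic, noise-free data generated from slope=3, intercept=10.
years = np.arange(2000, 2011)
t = years - years[0]          # shift so that time starts at 0
y = 3.0 * t + 10.0

# p0 holds one initial guess per parameter, in the order they appear
# in the function signature after the independent variable.
popt, pcov = curve_fit(line, t, y, p0=[1.0, 0.0])

# popt holds the best-fit parameters; pcov is their covariance matrix.
slope_fit, intercept_fit = popt
print(f"The value of slope is {slope_fit:.3f}")
print(f"The value of intercept is {intercept_fit:.3f}")
```

Shifting time matters most for exponential models. With a growth rate of 0.04, evaluating e^(0.04 × 1961) gives a number on the order of 10^34, which makes the fit numerically fragile, whereas e^(0.04 × 0) is simply 1.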

# Put your code here

Part 3.3 Check your model (8 points)#

✅ Do this (6 points): Make a plot comparing your model and the data.

Plot the actual data (2 points)
Plot your modeled data (2 points)
Use x and y axis labels and title (1 point)
Remember to subtract 1961 from the year to zero out the time variable! (1 point)
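One way to overlay a fitted model on data is sketched below, again with a hypothetical straight-line model and synthetic noisy data rather than the assignment dataset. The model is evaluated on the shifted time variable, and both data and model are drawn on the same axes. The matplotlib.use("Agg") line is an assumption that lets the sketch run without a display; in a notebook you can leave it out.

```python
import matplotlib
matplotlib.use("Agg")  # non-interactive backend; omit this line in a notebook
import matplotlib.pyplot as plt
import numpy as np
from scipy.optimize import curve_fit

def line(t, slope, intercept):
    """Hypothetical straight-line model used only for this illustration."""
    return slope * t + intercept

# Synthetic noisy data (fixed seed so the sketch is reproducible).
rng = np.random.default_rng(0)
years = np.arange(2000, 2011)
t = years - years[0]                      # time starts at 0
y = 3.0 * t + 10.0 + rng.normal(0, 1.0, size=t.size)

# Fit, then evaluate the model at the same shifted times.
popt, pcov = curve_fit(line, t, y, p0=[1.0, 0.0])
y_model = line(t, *popt)

# Data as points, model as a line, both against the shifted time.
plt.figure()
plt.scatter(t, y, label="data")
plt.plot(t, y_model, color="red", label="fitted model")
plt.xlabel("years since 2000")
plt.ylabel("y")
plt.title("Toy data and fitted line")
plt.legend()
```

The *popt syntax unpacks the fitted parameters into the model's arguments in order.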

Your plot should look like:

# Put your code here

✅ Answer the following (2 points): Do you think your model is a good model for the years represented in the data? (1pt) Do you think your model will accurately predict avocado production in Mexico in the future? (1pt) Clearly justify your answer (looking for more than a “yes” or “no” answer).

✎ Put your answer here

Congratulations, you’re done!#

Submit this assignment by uploading it to the course Desire2Learn web page. Go to the “Homework Assignments” section, find the submission folder link for Homework #3, and upload it there.