Day 12 In-Class: Overview of data and file formats; US Population and Representation#
✅ Put your name here
#Learning Goals:#
By the end of this assignment you should be able to:
Review some basics of file formats
Load in a variety of file formats
Clean data loaded from files
Practice using online research to learn new programming skills
Sort arrays and dataframes by their values
File Formats Review#
As discussed in the pre-class, there are thousands of accepted file formats. For an abbreviated list, see this summary on Wikipedia.
In general, all computer files store information as a series of bits (binary 1’s and 0’s). Typically 8 bits are combined into a single byte. All files can be categorized as:
Text files: where the bytes represent characters using a standard encoding scheme like UTF-8.
Binary files: where the bytes represent a custom organization of information. A single binary file can contain information encoded as strings, floats, and/or ints.
Refer back to the pre-class for tips on how to identify file types and the key information needed to load data from a text file (e.g., delimiter, line feed, encoding).
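To make the text/binary distinction above concrete, here is a minimal sketch that stores the same number both ways. The value is made up for illustration, and the 4-byte little-endian layout is just one possible binary organization, not the one used by any particular file in this assignment.

```python
import struct

population = 9883640  # hypothetical value, for illustration only

# Text representation: each digit is stored as one UTF-8 encoded character.
as_text = str(population).encode("utf-8")

# Binary representation: the whole number packed into a 4-byte little-endian integer.
as_binary = struct.pack("<i", population)

print(as_text, len(as_text))      # one byte per digit
print(as_binary, len(as_binary))  # four bytes, not human-readable

# Getting the number back requires knowing which scheme was used to write it.
from_text = int(as_text.decode("utf-8"))
from_binary = struct.unpack("<i", as_binary)[0]
print(from_text, from_binary)
```

Both routes recover the same number, but only if the reader knows how the bytes were organized. That is why a binary file cannot be loaded without knowing its data type.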
Part 1. Load data files#
Download all of the files that start with us_pop_by_state*. These files contain population estimates from the US Census. (For population estimates methodology statements, see http://www.census.gov/programs-surveys/popest/technical-documentation/methodology.html.) The US Census has a fascinating 230-year history (https://www.census.gov/history/).
✅ Using the information you found in the pre-class, load all of these files into pandas data frames. When successful, you should have the population of all states in the US for 2010 through 2019 stored in pandas data frames. There are some hints and/or suggestions embedded in the comments in the cells below.
# import all necessary modules (matplotlib, numpy, pandas)
%matplotlib inline
import matplotlib.pyplot as plt
import numpy as np
# import the pandas module
import pandas as pd
# Load us_pop_by_state_2010_2011
pop1011 = pd.read_csv("us_pop_by_state_2010_2011.csv",delimiter=';')
pop1011
---------------------------------------------------------------------------
FileNotFoundError Traceback (most recent call last)
~\AppData\Local\Temp/ipykernel_100656/2628978734.py in <module>
1 # Load us_pop_by_state_2010_2011
----> 2 pop1011 = pd.read_csv("us_pop_by_state_2010_2011.csv",delimiter=';')
3 pop1011
FileNotFoundError: [Errno 2] No such file or directory: 'us_pop_by_state_2010_2011.csv'
1.1 Loading UTF-16be Data#
Load us_pop_by_state_2012_2013_encoding_utf-16be (see the tips in the pre-class if you are having trouble).
# Load us_pop_by_state_2012_2013_encoding_utf-16be
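As a rough illustration of how an encoding is passed to pandas, the sketch below writes a tiny made-up table as UTF-16 big-endian and reads it back. The file name, column names, values, and the `;` delimiter are all assumptions chosen for the example; the real Census file may be laid out differently.

```python
import os
import tempfile

import pandas as pd

# Hypothetical sample table, for illustration only (not the Census data).
sample = "region;pop_a;pop_b\nAlpha;100;110\nBeta;200;190\n"

path = os.path.join(tempfile.mkdtemp(), "sample_utf16be.csv")
with open(path, "w", encoding="utf-16-be", newline="") as f:
    f.write(sample)

# Each character here takes two bytes, so the bytes on disk differ from
# a UTF-8 encoding of the same text.
print(os.path.getsize(path), "bytes for", len(sample), "characters")

# Passing the matching encoding lets pandas decode the file correctly.
df = pd.read_csv(path, delimiter=";", encoding="utf-16-be")
print(df)
```

The key point is that the `encoding` argument has to match how the file was written. The delimiter still has to be identified separately.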
1.2 Loading Windows Linefeed Data#
Load us_pop_by_state_2014_2015_windows_linefeed
Tip: This file needs some cleaning/editing before it can be loaded.
Hint: What is the delimiter for this file? How could that lead to trouble?
# Load us_pop_by_state_2014_2015_windows_linefeed
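One way to investigate a file before loading it is to look at its raw bytes. The sketch below does this on a small made-up file with Windows (CRLF) line endings. The contents and the list of candidate delimiters are assumptions for illustration; they are not taken from the actual data file.

```python
import os
import tempfile

# Hypothetical file contents with Windows (CRLF) line endings, for illustration only.
raw = b"region,pop_a,pop_b\r\nAlpha,100,110\r\nBeta,200,190\r\n"
path = os.path.join(tempfile.mkdtemp(), "sample_crlf.csv")
with open(path, "wb") as f:
    f.write(raw)

# Reading in binary mode shows the bytes exactly as stored, including \r\n.
with open(path, "rb") as f:
    head = f.read(40)
print(head)

uses_crlf = b"\r\n" in head
print("Windows line endings:", uses_crlf)

# Count candidate delimiters on the first line to see which one separates fields.
first_line = head.split(b"\r\n")[0].decode("utf-8")
counts = {d: first_line.count(d) for d in [",", ";", "\t"]}
print(counts)
```

If the count of the likely delimiter changes from line to line, that character may also appear inside a field, which would need to be cleaned up before parsing.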
1.3 Loading Binary Data#
Below is an example of how to load a .csv binary file using NumPy.
# Load binary file with numpy
pop16 = np.fromfile('Data_Files//us_pop_by_state_2016a.csv', dtype='float32') # Population of all states in 2016
That’s a binary file, with one number per state in alphabetical order (the same order as in the other files).
Now you try loading the binary data file us_pop_by_state_2017a.yaff
This file has a 32-bit data type as well, but it is 32-bit integers instead. Try dtype='int32' when you load in the data!
# Load "us_pop_by_state_2017a.yaff"
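To see why the dtype matters, here is a small sketch that writes a few made-up integers to a raw binary file and reads them back with two different dtypes. The file name and values are hypothetical and stand in for the real data.

```python
import os
import tempfile

import numpy as np

# Hypothetical values, for illustration only.
values = np.array([1000, 2000, 3000], dtype="int32")
path = os.path.join(tempfile.mkdtemp(), "sample.bin")
values.tofile(path)  # writes the raw bytes with no header or type information

# Same bytes, two interpretations.
right = np.fromfile(path, dtype="int32")
wrong = np.fromfile(path, dtype="float32")

print(right)  # the original integers
print(wrong)  # the same bytes read as floats: not the original values
```

Because the file carries no record of its own data type, `np.fromfile()` returns numbers either way. It is up to you to supply the dtype that was used when the file was written.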
1.4 Loading the Rest#
Read the following code and make sure you understand what is happening.
# Load us_pop_by_state_2018_2019.bin
pop1819 = np.fromfile('Data_Files//us_pop_by_state_2018_2019.bin', dtype='int32')
print('Size pop1819:',pop1819.shape)
pop18 = pop1819[0:51] # Population of all states in 2018
pop19 = pop1819[51:102] # Population of all states in 2019
Size pop1819: (102,)
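The slicing above implies that the file stores 51 values for one year followed by 51 values for the next. Under that assumption, an alternative to manual slicing is to reshape the flat array, sketched here with stand-in values rather than the real data.

```python
import numpy as np

n_states = 51  # matches the slice boundaries used above

# Stand-in for pop1819: 102 hypothetical values, for illustration only.
flat = np.arange(2 * n_states, dtype="int32")

# Row 0 holds the first 51 values, row 1 the next 51.
by_year = flat.reshape(2, n_states)
first_year, second_year = by_year

print(by_year.shape)
```

Reshaping gives the same two arrays as `flat[0:51]` and `flat[51:102]`, and it raises an error if the array length is not exactly 2 × 51.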
Question: How is the pd.read_csv() function different from the np.fromfile() function? Discuss with your group members and put your answer below.
✎ Put your answer here.
An aside: Sorting and finding max/min values in numpy arrays#
Now let’s equip ourselves with some new tools that can be used to answer some of the questions below. In a numpy array, you might already know that we can find the maximum and minimum values like so:
a = np.array([2,30,111,41,9,16,17,2,1,-6,33])
print('The biggest value of a is:',a.max())
print('The smallest value of a is:',a.min())
The biggest value of a is: 111
The smallest value of a is: -6
But did you know we can also find out where the biggest and smallest entries are? We can do this with .argmin() and .argmax():
print('The biggest value of a is in position:',a.argmax())
print('The smallest value of a is in position:',a.argmin())
The biggest value of a is in position: 2
The smallest value of a is in position: 9
Similarly, we can sort an array like this:
a.sort()
print(a)
[ -6 1 2 2 9 16 17 30 33 41 111]
But this doesn’t always give us the information we need. In fact, note that this changes our array!
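If you want a sorted version without modifying the original, one option is `np.sort()`, which returns a sorted copy. A short sketch using the same array:

```python
import numpy as np

a = np.array([2, 30, 111, 41, 9, 16, 17, 2, 1, -6, 33])
original = a.copy()

b = np.sort(a)  # returns a new sorted array
unchanged = np.array_equal(a, original)

print(a)  # still in the original order
print(b)  # sorted copy
print("a unchanged:", unchanged)
```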
Sometimes we just want to know the indices that are required to sort the array. We can get these with .argsort():
a = np.array([2,30,111,41,9,16,17,2,1,-6,33])
sorting_indices = a.argsort()
sorted_a = a[sorting_indices]
print('Sorting indices:', sorting_indices)
print('Sorted array:',sorted_a)
Sorting indices: [ 9 8 0 7 4 5 6 1 10 3 2]
Sorted array: [ -6 1 2 2 9 16 17 30 33 41 111]
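Note that the first element of sorting_indices is 9 and the last one is 2, which is what we should expect: these are the positions of the smallest and largest values that .argmin() and .argmax() returned above.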
Using .argsort() is similar to defining a mask. Once you have these indices, you can use them to select elements of an array or rows of a dataframe. The difference is that a mask selects a subset of the data, whereas sorting_indices selects all of the same data in a different order.
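As a sketch of how this works in practice, the indices from one array can reorder a second, parallel array, or the rows of a dataframe. The names and numbers here are made up for illustration.

```python
import numpy as np
import pandas as pd

# Hypothetical parallel arrays, for illustration only.
names = np.array(["Alpha", "Beta", "Gamma", "Delta"])
pops = np.array([300, 100, 400, 200])

order = pops.argsort()  # indices that sort pops from smallest to largest

print(names[order])  # names reordered to match the sorted values
print(pops[order])

# The same indices can reorder dataframe rows by position...
df = pd.DataFrame({"name": names, "pop": pops})
df_by_index = df.iloc[order]

# ...or pandas can sort by a column directly.
df_sorted = df.sort_values("pop")
print(df_sorted)
```

For a dataframe, `sort_values()` is usually the more direct route. The `argsort()` approach is handy when the values and labels live in separate arrays.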
Part 2. Analyze and visualize data#
✅ Question 2.1 Determine which state had the greatest percent increase in population between 2010 and 2019. Work with your team members to break this into subproblems and find the solution. Print the answer as:
The great state of ____ had a ___% increase in population between 2010 and 2019.
# Put your code here
✅ Question 2.2 Determine which state had the second greatest percent increase in population from 2010 to 2019. Work with your team members to break this into subproblems and find the solution using Python. Print the answer as:
The great state of ____ had a ___% increase between 2010 and 2019.
# Put your code here
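The percent-increase calculation behind Questions 2.1 and 2.2 can be sketched on a toy example. The region names and numbers below are invented, so this shows the pattern only, not the answer for the Census data.

```python
import numpy as np

# Hypothetical regions and populations, for illustration only.
names = np.array(["Alpha", "Beta", "Gamma"])
pop_start = np.array([100, 200, 400])
pop_end = np.array([110, 250, 420])

# Percent increase relative to the starting population.
pct = 100 * (pop_end - pop_start) / pop_start

top = pct.argmax()      # position of the largest increase
order = pct.argsort()   # positions from smallest to largest increase
second = order[-2]      # second largest is next to last

print(f"{names[top]} had a {pct[top]:.1f}% increase.")
print(f"{names[second]} had a {pct[second]:.1f}% increase.")
```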
✅ Question 2.3 Make a plot of the population in 2019 versus the states (i.e. the state names should be on the x-axis). Order the states from the least populous to the most populous for the year 2019. Label all of your axes. (Try changing the figure size and the x-tick labels so that all of the state names are readable.)
# Put your code here
✅ Question 2.4 Make a plot of the 2019 population versus only the 10 most populous states. On the x-axis include the name of the state. Order the states from the least populous to the most populous for the year 2019. Label all of your axes.
# Put your code here
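For the plots in Questions 2.3 and 2.4, one possible pattern is sketched below with made-up data: sort with `.argsort()`, plot names against values, and rotate the tick labels. The figure size, the bar-chart style, and the 90-degree rotation are choices made for this example, not requirements.

```python
import matplotlib.pyplot as plt
import numpy as np

# Hypothetical regions and populations, for illustration only.
names = np.array(["Alpha", "Beta", "Gamma", "Delta"])
pops = np.array([300, 100, 400, 200])

order = pops.argsort()  # least populous to most populous
top_two = order[-2:]    # the last n indices are the n largest, still ascending

fig, ax = plt.subplots(figsize=(12, 4))  # a wide figure leaves room for many labels
ax.bar(names[order], pops[order])
ax.set_xlabel("Region")
ax.set_ylabel("Population")
ax.tick_params(axis="x", labelrotation=90)  # vertical labels avoid overlap
fig.tight_layout()
```

In a notebook with `%matplotlib inline`, the figure appears when the cell finishes. Slicing the sorted indices, as with `top_two`, restricts a plot to the largest entries while keeping the ascending order.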
🛑 STOP#
Check in with an instructor before you leave class!
Assignment wrapup#
Please fill out the form that appears when you run the code below. You must completely fill this out in order to receive credit for the assignment!
from IPython.display import HTML
HTML(
"""
<iframe
src="https://cmse.msu.edu/cmse201-ic-survey"
width="800px"
height="600px"
frameborder="0"
marginheight="0"
marginwidth="0">
Loading...
</iframe>
"""
)
Congratulations, you’re done!#
Submit this assignment by uploading your notebook to the course Desire2Learn web page. Go to the “In-Class Assignments” folder, find the appropriate submission link, and upload it there. Make sure your name is on it.
See you next class!
Copyright © 2021, Department of Computational Mathematics, Science and Engineering at Michigan State University, All rights reserved.