Week 02: Pre-Class Assignment: Exploratory Data Analysis

✅ Put your name here.

CA

Goals for today’s pre-class assignment

In this Pre-Class Assignment you are going to complete the data science portion of the book’s End-to-End project (chapter 2). The main learning goals are:

  • understand how to build and manage a real ML project, much like your project will be,

  • practice using some of the data science tools (e.g., Pandas),

  • learn some new tools that will help you in your project.

This assignment is due by 11:59 p.m. the day before class, and should be uploaded into the appropriate “Pre-Class Assignments” submission folder on D2L. Submission instructions can be found at the end of the notebook.


Machine Learning Housing Corp.

The author of your book thoughtfully provided the entire code base that he used to build Chapter 2. You should be able to find all of the code for your textbook on GitHub. Work through this document alongside the textbook's code so that you don’t end up writing a ton of code yourself, which is not the point. (If you want to write your own code, that is totally fine too! We are just using the code from the textbook to save time.)

Note: It will be very useful to have your textbook handy.

Follow these steps:

  1. Download the Chapter 2 notebook from GitHub.

  2. Run the notebook up to and including Part 3 (Prepare the Data for Machine Learning Algorithms), and make sure you understand what every code cell is doing.

  3. Answer the questions below.

  4. Turn in this notebook with your answers in the usual way (no need to resubmit the notebook from the textbook).

What you will do is read through the textbook’s notebook and answer questions about it. Some of the answers are in the textbook itself, some in the notebook.

Part 1. Pandas and Data

Once you are certain the textbook’s notebook is working (run all of the cells; it needs internet access to download its data), go through the first portion and answer these questions:

  1. Describe in your own words what the goals of this project are.

  2. Read through the code. See if there are interesting ideas/tricks there that you didn’t know about. What did you find?

  3. What form is the data in, and are there any problems with it? For example, are all of the potential features integers or floats? What pandas function can help you answer this question?

  4. What does .value_counts() do?

  5. What does .describe() do?

  6. What do .iloc and .loc do?
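
If you would like to experiment with these methods before writing your answers, here is a small scratch sketch. It uses a made-up DataFrame (the column names, index labels, and values are invented for illustration and are not the housing data), so treat it as a playground rather than as the textbook's code.

```python
import pandas as pd

# Made-up data for illustration only; this is not the housing dataset.
df = pd.DataFrame(
    {
        "city": ["Lansing", "Detroit", "Lansing", "Flint", "Detroit", "Lansing"],
        "rooms": [3, 5, 4, 2, 6, 3],
        "price": [120.0, 250.0, None, 90.0, 310.0, 135.0],
    },
    index=["a", "b", "c", "d", "e", "f"],
)

# Column types and non-null counts: a quick check for non-numeric or missing data.
df.info()
print(df.dtypes)

# Frequency of each distinct value in a categorical column.
city_counts = df["city"].value_counts()
print(city_counts)

# Summary statistics for the numeric columns (missing values are skipped).
summary = df.describe()
print(summary)

# .loc selects by label, .iloc selects by integer position.
by_label = df.loc["b", "price"]
by_position = df.iloc[1, 2]
print(by_label, by_position)
```

Try the same calls on the housing DataFrame in the textbook's notebook and compare what you see.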

Put your answers here!

Part 2. Histogram

Let’s move below the first 3x3 array of histograms. Answer these questions in detail.

  1. In the first set of 3x3 histograms, do you see anything there that seems odd/interesting/useful/bothersome to you? How would you deal with that problem?

  2. What does the author choose to do in terms of splitting the data into testing and training? Does the author use cross validation?

  3. What is StratifiedShuffleSplit and why would you use it? What problem does it solve for you?

  4. How is ocean_proximity handled?
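
To get a feel for stratified splitting before you answer, the sketch below runs scikit-learn's StratifiedShuffleSplit on made-up, imbalanced labels. The arrays `X` and `y`, the 80/20 class mix, and the `random_state` value are assumptions chosen for illustration; the textbook stratifies on a different quantity, which you should identify yourself.

```python
import numpy as np
from sklearn.model_selection import StratifiedShuffleSplit

# Made-up, imbalanced labels: 80 samples of class 0 and 20 of class 1.
X = np.arange(100).reshape(-1, 1)
y = np.array([0] * 80 + [1] * 20)

# One shuffled split with 20% held out; random_state is fixed for repeatability.
splitter = StratifiedShuffleSplit(n_splits=1, test_size=0.2, random_state=42)
train_idx, test_idx = next(splitter.split(X, y))

# Compare the fraction of class 1 in the full data and in each subset.
print("overall:", y.mean())
print("train:  ", y[train_idx].mean())
print("test:   ", y[test_idx].mean())
```

Compare the three printed fractions, then think about what could happen to them with a purely random split on a small dataset.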

Put your answers here!

Part 3. Visualization

Ok, let’s move into the visualization part. The author may use plotting tools you would not normally use, so let’s see what he did. (For example, how was the 3x3 histogram made? With Seaborn, or with something else?)

  1. What tool is the author using to make these plots? Straight matplotlib, or something else?

  2. Go through the code below very carefully. What are all of these options?

    housing.plot(kind="scatter", x="longitude", y="latitude", alpha=0.4,
                 s=housing["population"]/100, label="population", figsize=(10,7),
                 c="median_house_value", cmap=plt.get_cmap("jet"), colorbar=True,
                 sharex=False)

  3. The author ends up with a very nice plot that uses a real map. How did he do that? What tools did he need to do that?
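
One way to explore the options in the scatter call from question 2 is to try a similar call on a tiny, made-up DataFrame and change one argument at a time. In the sketch below, the DataFrame `pts`, its column names, and the "viridis" colormap are assumptions for illustration; the comments are deliberately brief, so check the pandas and matplotlib documentation for the details.

```python
import pandas as pd

# Made-up points for illustration only.
pts = pd.DataFrame(
    {
        "x": [0.0, 1.0, 2.0, 3.0, 4.0],
        "y": [1.0, 3.0, 2.0, 5.0, 4.0],
        "weight": [100, 400, 900, 250, 600],
        "value": [10.0, 20.0, 30.0, 40.0, 50.0],
    }
)

ax = pts.plot(
    kind="scatter",
    x="x",
    y="y",
    alpha=0.4,             # marker transparency
    s=pts["weight"] / 10,  # marker size, one value per row
    label="weight",        # legend entry
    figsize=(6, 4),        # figure size in inches
    c="value",             # column that sets each marker's color
    cmap="viridis",        # colormap that turns values into colors
    colorbar=True,         # draw a color scale beside the plot
)
```

Removing or changing one argument at a time and re-running the cell is a quick way to see what each option controls.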

Put your answers here!

Part 4. Correlations

Next, the author spends a lot of time looking for correlations. Go through this section very carefully!

  1. What is the author trying to achieve by looking at correlations? Give a very detailed answer.

  2. What does corr_matrix["median_house_value"].sort_values(ascending=False) do?

  3. What is scatter_matrix?

  4. What do the scatter plots tell you?

  5. Move into the ML portion of the notebook. Read the scikit-learn documentation to learn what the following code does and why you would use it:

    from sklearn.impute import SimpleImputer
    imputer = SimpleImputer(strategy="median")
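
If it helps to see these pieces run on something small, the sketch below applies the same correlation ranking and the same imputer to a made-up DataFrame. The column names (`target`, `up`, `down`, `gappy`) and all values are invented for illustration; this is a scratch example, not the textbook's pipeline.

```python
import numpy as np
import pandas as pd
from sklearn.impute import SimpleImputer

# Made-up numeric data for illustration only; "target" plays the role of the label.
data = pd.DataFrame(
    {
        "target": [1.0, 2.0, 3.0, 4.0, 5.0],
        "up": [2.0, 4.0, 6.0, 8.0, 10.0],         # rises with target
        "down": [5.0, 4.0, 3.0, 2.0, 1.0],        # falls as target rises
        "gappy": [1.0, np.nan, 3.0, 100.0, 5.0],  # has one missing entry
    }
)

# Pairwise Pearson correlations; pick the target column and rank the others.
corr_matrix = data.corr()
ranked = corr_matrix["target"].sort_values(ascending=False)
print(ranked)

# fit learns one statistic per column; transform fills missing entries with it.
imputer = SimpleImputer(strategy="median")
filled = imputer.fit_transform(data)
print(imputer.statistics_)
print(filled[1])
```

Look at where `down` lands in the ranking and at what replaced the missing entry in `gappy`, then relate both to your answers above.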
    

Put your answers here!

Hopefully you learned a lot of new techniques for handling real data. Think about how these will help you in your project.

Be sure to read Chapter 2 very carefully before the ICA.


Assignment wrap-up

Please fill out the form that appears when you run the code below. You must completely fill this out in order to receive credit for the assignment!

from IPython.display import HTML
HTML(
"""
<iframe 
	src="https://forms.office.com/r/QyrbnptkyA" 
	width="800px" 
	height="600px" 
	frameborder="0" 
	marginheight="0" 
	marginwidth="0">
	Loading...
</iframe>
"""
)

© Copyright 2023, Department of Computational Mathematics, Science and Engineering at Michigan State University.