Week 02: In Class Assignment: End-to-End Project

✅ Put your name here.

✅ Put your group member names here.

Machine Learning Housing Corp.

This In Class Assignment continues the work you started in the Pre Class Assignment. If you haven’t completed the Pre Class Assignment, do so now.

As we did last time, follow these steps:

  1. read this notebook first so that you know what to expect for today

  2. answer the questions below

  3. turn in this notebook with your answers in the usual way (no need to resubmit the notebook from the textbook)

Last time, you explored the nature of the problem, looked at the general shape of the data, and examined some of its properties, such as correlations and summary statistics (thanks to the convenient functionality in pandas). Now it is time to clean the data before it goes into ML algorithms.

Each of the sections below follows the process described in the textbook and the notebook that comes with it. Read through that notebook and follow its steps. As you work through it, answer the questions below.

Part 1. Data Cleaning

Let’s think about the steps you have done so far:

  • You obtained the data. (This came to you through a link in the notebook, which made this part easy!)

  • You examined the data using pandas.

  • You visualized the data in a few ways, including using maps and pairplots that reveal correlations among the features.

  • You hopefully noticed some characteristics of this data, including potential problems with it.

Next, we want to use the data for ML, but first we must repair all of these problems: sklearn will not know what to do with erroneous data! This is an important step in ML. Sometimes your data has errors or missing values, or is simply represented in a way that sklearn (or whatever library you are using) can’t process (e.g., strings rather than floats).

What to do? There are many approaches, and how you apply them generally depends on your specific problem. For example, suppose a row in your data has a missing value. A simple fix is to remove that row. But if you really need that data point and the other columns are perfectly fine, what should you do? Let’s examine these questions in the context of the data you have been working with.
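
To make the trade-off concrete, here is a minimal sketch on a made-up table. The column names and values are hypothetical and are not taken from the housing data; the sketch only contrasts removing a row with keeping it and filling the gap.

```python
import numpy as np
import pandas as pd

# Hypothetical toy table; the column names and values are illustrative only.
df = pd.DataFrame({
    "rooms": [3.0, np.nan, 5.0, 4.0],
    "price": [200.0, 250.0, 400.0, 300.0],
})

# Count the missing values in each column.
missing_per_column = df.isna().sum()

# Option A: remove every row that contains a missing value.
dropped = df.dropna()

# Option B: keep every row and fill the gap with the column median.
filled = df.fillna({"rooms": df["rooms"].median()})

print(missing_per_column)
print(len(dropped), len(filled))
```

Option A loses the perfectly good price in the affected row, while Option B keeps it at the cost of inventing a value for rooms. Which one is appropriate depends on your problem.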

  1. Your dataset has missing values. Using only pandas, you can use the methods dropna, drop, and fillna. In a markdown cell, write what each of these accomplishes and why you would use it.

  2. sklearn has a sub-package called impute that handles imputation.
    Read this article and answer these questions:

    a. What does imputation mean?
    b. What strategies are there to handle missing data?
    c. What types of imputers does sklearn provide? What does each of them do?

  3. Review the code below and, in a markdown cell, show the math behind it.

import numpy as np
from sklearn.impute import SimpleImputer
imp = SimpleImputer(missing_values=np.nan, strategy='mean')
imp.fit([[1, 2], [np.nan, 3], [7, 6]])
X = [[np.nan, 2], [6, np.nan], [7, 6]]
print(imp.transform(X))
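
If you want to check a hand calculation, one option is to look at what the imputer stored during fit. The sketch below uses a different, made-up array (X_train and X_new are hypothetical names) so that you can still work out the numbers for the cell above yourself.

```python
import numpy as np
from sklearn.impute import SimpleImputer

# Hypothetical training data, deliberately different from the cell above.
X_train = np.array([[1.0, 10.0],
                    [3.0, np.nan],
                    [np.nan, 30.0]])

imputer = SimpleImputer(missing_values=np.nan, strategy="mean")
imputer.fit(X_train)

# statistics_ holds one fill value per column, learned during fit.
print(imputer.statistics_)

# transform replaces each NaN with the stored value for its column.
X_new = np.array([[np.nan, np.nan]])
print(imputer.transform(X_new))
```

Compare the printed values with what you get by averaging the non-missing entries of each column by hand.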

Put your answers here!

1.1 Data Cleaning with Categorical Data

Let’s move to more data cleaning, with a focus on categorical data and standard transformation/scaling operations.

  1. What is “one-hot encoding” and what is it used for?

  2. What are the defaults in OneHotEncoder?

  3. Have a discussion in your group about transformations and scaling. What do these terms mean? Why are they important?

  4. Give an example in which you would use one-hot encoding.
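
As a starting point for your discussion, here is a minimal sketch of OneHotEncoder on a made-up categorical column. The labels are hypothetical and are not the categories in the housing data. Try running it and relate what you see to your answers above.

```python
import numpy as np
from sklearn.preprocessing import OneHotEncoder

# Hypothetical categorical column; the labels are illustrative only.
X_cat = np.array([["inland"], ["coast"], ["island"], ["coast"]])

encoder = OneHotEncoder()  # no arguments, so every setting is at its default
encoded = encoder.fit_transform(X_cat)

# Convert to a dense array if the result is a sparse matrix.
dense = encoded.toarray() if hasattr(encoded, "toarray") else np.asarray(encoded)

print(encoder.categories_)  # the labels found in each input column
print(dense)                # one row per sample, one column per label
```

Printing type(encoded) is a quick way to explore one of the defaults asked about in question 2.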

Put your answers here!

Part 2. Pipelines

As you have likely gathered by now, the ML process can have a lot of steps. And, as you can see, the steps are fairly common across different ML settings. It would be very nice if sklearn provided some functionality to help organize some of these steps.

  1. Read about and summarize what pipelines are.

  2. Which transformations are used in the pipeline for this project?

  3. Is it important that the transformers in the pipeline are applied in a certain order?

  4. How does the author handle exponentially distributed data?

  5. How does the author handle multimodal distributions?
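
To see the general shape of a pipeline before you read the textbook's version, here is a minimal sketch on made-up data. It is not the author's pipeline: the column names, the step names, and the choice of transformers are assumptions made for illustration.

```python
import numpy as np
import pandas as pd
from sklearn.compose import ColumnTransformer
from sklearn.impute import SimpleImputer
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import OneHotEncoder, StandardScaler

# Hypothetical toy data; the column names are illustrative only.
df = pd.DataFrame({
    "rooms": [3.0, np.nan, 5.0, 4.0],
    "income": [2.5, 4.0, np.nan, 3.5],
    "region": ["inland", "coast", "coast", "island"],
})

# A Pipeline runs its steps in the order they are listed.
numeric_steps = Pipeline([
    ("impute", SimpleImputer(strategy="median")),
    ("scale", StandardScaler()),
])

# A ColumnTransformer sends each group of columns to its own transformer.
preprocess = ColumnTransformer([
    ("num", numeric_steps, ["rooms", "income"]),
    ("cat", OneHotEncoder(handle_unknown="ignore"), ["region"]),
])

X = preprocess.fit_transform(df)
X_dense = X.toarray() if hasattr(X, "toarray") else np.asarray(X)
print(X_dense.shape)
```

The output has two scaled numeric columns followed by one column per region label. Compare this structure with the pipeline in the textbook notebook when you answer the questions above.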

Put your answers here!


Now that you are done, follow these steps:

  • Submit your notebook to D2L.

  • Be sure to include the names of everyone in your group.

© Copyright 2023, Department of Computational Mathematics, Science and Engineering at Michigan State University.