Week 03: In-Class Assignment: End-to-End Project (Part 2)

✅ Put your name here.

✅ Put your group member names here.

Machine Learning Housing Corp.

This in-class assignment completes the work you did last week in the pre-class and in-class assignments.

As we did last time, follow these steps:

  1. read this notebook first so that you know what to expect for today

  2. answer the questions below

  3. turn in this notebook with your answers in the usual way (no need to resubmit the notebook from the textbook)

Last time you explored the nature of the problem and what the data generally looked like, examined some of its properties, such as correlations and statistics (thanks to nice functionality in pandas), and cleaned the data before it goes into ML algorithms. Now it is time to apply the ML algorithms and get a prediction.

Each of the sections below follows the process described in the textbook and the notebook that comes with it. Read through that notebook and follow the steps in there. As you work through the notebook, answer the questions below.

Part 1. Regression

If you have your book handy, the ML starts on page 72.

The three ML methods you will use are:

  • LinearRegression: you have certainly used linear regression before, but examine what new options this library provides for you,

  • DecisionTreeRegressor: we have not covered decision trees yet (we will soon!) - you might want to use decision trees in your project so this is a good time to see what they do,

  • RandomForestRegressor: what does this do? (again, we will get to ensemble methods later, after decision trees); what are its methods and attributes and options?
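
All three estimators share the same `fit`/`predict` interface, which a small sketch can make concrete. The data here comes from `make_regression` and is only a synthetic stand-in (an assumption for illustration); it is not the housing data, and the variable names are not taken from the book's notebook.

```python
from sklearn.datasets import make_regression
from sklearn.ensemble import RandomForestRegressor
from sklearn.linear_model import LinearRegression
from sklearn.tree import DecisionTreeRegressor

# Synthetic stand-in data (assumption): 200 samples, 5 features.
X, y = make_regression(n_samples=200, n_features=5, noise=10.0, random_state=42)

models = {
    "linear": LinearRegression(),
    "tree": DecisionTreeRegressor(random_state=42),
    "forest": RandomForestRegressor(n_estimators=50, random_state=42),
}

predictions = {}
for name, model in models.items():
    model.fit(X, y)                       # same call for every estimator
    predictions[name] = model.predict(X)  # predictions on the training data
    print(name, predictions[name][:3].round(1))
```

Because the interface is identical, swapping one estimator for another usually means changing a single line.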

Task: Answer these questions in the markdown cell below:

  1. How many columns are in the final dataset, the one that will be fed to the ML algorithm?

  2. Why does the author pick these algorithms?

  3. How does the author choose to measure the performance of the models? Are there any sklearn libraries that help here?

  4. What does the number returned by mean_absolute_error represent?

  5. Why does the author choose to predict using the training dataset?

Put your answers here!


Part 2. Cross-validation

As already mentioned, cross-validation is a powerful technique to estimate the performance of your model.

Task: Using the book’s notebook as a reference, answer the questions below:
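
As a minimal sketch of what a cross-validation call can look like, the snippet below scores a decision tree with `cross_val_score` on synthetic data. The dataset, the choice of 10 folds, and the variable names are assumptions for illustration; the book's notebook may set things up differently.

```python
from sklearn.datasets import make_regression
from sklearn.model_selection import cross_val_score
from sklearn.tree import DecisionTreeRegressor

# Synthetic stand-in data (assumption), not the housing dataset.
X, y = make_regression(n_samples=300, n_features=5, noise=10.0, random_state=0)

# scikit-learn scorers follow a "higher is better" convention,
# so the RMSE comes back negated and the sign has to be flipped.
neg_rmses = cross_val_score(
    DecisionTreeRegressor(random_state=0), X, y,
    scoring="neg_root_mean_squared_error", cv=10,
)
rmses = -neg_rmses

print(rmses.round(1))
print("mean:", rmses.mean().round(1), "std:", rmses.std().round(1))
```

Looking at both the mean and the spread of the scores is what lets you compare models on more than a single number.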

  1. What sklearn function does the author use for cross validation?

  2. What are the inputs of this function? Does the author pass the transformed dataset or the original dirty dataset?

  3. What does each element of the arrays lin_rmses, tree_rmses, forest_rmses represent?

  4. Take those arrays from the other notebook and plot them as a stacked histogram in this notebook. What does the plot tell you?

  5. Which of the three methods performs best? Explain your answer.

  6. Which of the three models is the most precise? Explain your answer.

  7. Research the difference between cross_validate and cross_val_score. Why would you use one or the other?

Put your answers here!

# Put your code here

Part 3. Hyperparameter Tuning/Optimization

At this point, what have you done? A lot of data science, a lot of data cleaning, and a lot of exploring the accuracy of algorithms with CV. Now you can pick which estimator you want to use.

The next phase of the ML workflow is making your estimator really work for you. As you saw above, each of the algorithms comes with a host of options. The settings associated with those options are called “hyperparameters”: you choose them before training to configure the estimator, and they are separate from the parameters of your model, which are learned from the data. Like some of the other steps you have been following, hyperparameter tuning is so ubiquitous that sklearn has nice tools to help you.

The difference between a parameter of your model and the hyperparameters of your estimator can be confusing. Here is a good way to understand the difference: look at the documentation for each of the estimators. You will see that each one accepts a large number of inputs, many of which are set to some default value. (You probably should look at something like the decision tree rather than linear regression, which is too simple to have many options.) Most of these options, which you may typically ignore, are the hyperparameters that define the way the estimator is set up. For example, there isn’t one decision tree; there are infinitely many. Which one do you want to use?

To find the best hyperparameters, you might need to repeat the training hundreds or thousands of times! The payoff is a more accurate model.
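
The distinction can be illustrated with a short sketch: `max_depth` is a hyperparameter set when the estimator is constructed, while the tree structure itself is learned during `fit`. The synthetic data and the variable names `shallow` and `deep` are assumptions made for this example only.

```python
from sklearn.datasets import make_regression
from sklearn.tree import DecisionTreeRegressor

# Synthetic stand-in data (assumption), not the housing dataset.
X, y = make_regression(n_samples=200, n_features=4, noise=5.0, random_state=1)

# Hyperparameters: chosen by you, before any training happens.
shallow = DecisionTreeRegressor(max_depth=2, random_state=1)
print(shallow.get_params()["max_depth"])  # the value passed to the constructor

# Model parameters: the splits and leaf values learned from the data in fit().
shallow.fit(X, y)
deep = DecisionTreeRegressor(max_depth=None, random_state=1).fit(X, y)

# Same data, different hyperparameters, very different trees.
print(shallow.get_n_leaves(), deep.get_n_leaves())
```

Calling `get_params()` on any estimator lists every hyperparameter it accepts along with its current value.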

How do we search for the best hyperparameters?

Task: Research GridSearchCV and give a short summary of what it can do for you.

Put your answers here!

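Below is a snippet of code from the book's notebook.

    from sklearn.model_selection import GridSearchCV

    # full_pipeline, housing, and housing_labels are defined earlier in the book's notebook.
    # Keys follow the pipeline's <step>__<parameter> naming convention.
    param_grid = [
        {'preprocessing__geo__n_clusters': [5, 8, 10],
         'random_forest__max_features': [4, 6, 8]},
        {'preprocessing__geo__n_clusters': [10, 15],
         'random_forest__max_features': [6, 8, 10]},
    ]
    grid_search = GridSearchCV(full_pipeline, param_grid, cv=3,
                               scoring='neg_root_mean_squared_error', n_jobs=-1)
    grid_search.fit(housing, housing_labels)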
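
To get a feel for how much work such a search involves, `ParameterGrid` can enumerate the candidate combinations without training anything. This is only a sketch: the names `candidates` and `n_folds` are introduced here for illustration, and the grid is copied from the snippet above.

```python
from sklearn.model_selection import ParameterGrid

# Same grid as in the snippet above.
param_grid = [
    {"preprocessing__geo__n_clusters": [5, 8, 10],
     "random_forest__max_features": [4, 6, 8]},
    {"preprocessing__geo__n_clusters": [10, 15],
     "random_forest__max_features": [6, 8, 10]},
]

candidates = list(ParameterGrid(param_grid))
n_folds = 3  # matches cv=3 in the snippet above

print(len(candidates))            # number of hyperparameter combinations
print(len(candidates) * n_folds)  # pipeline fits during the search, before the final refit
print(candidates[0])              # one candidate, as a plain dictionary
```

Counting the fits up front is a cheap way to estimate how long a search will take before you launch it.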

Task: Answer these questions in the markdown cell below:

  1. Why are there two dictionaries in the param_grid list? Do they refer to different estimators in the pipeline? Why not put them all together?

  2. What does the option cv = 3 do?

  3. What does n_jobs do?

Put your answers here!

Task: Research RandomizedSearchCV and give a short summary of what it can do for you.

  • How does it differ from GridSearchCV?

  • Why would you choose one over the other?

Put your answers here!


Part 4. Conclusion

Task: Write a paragraph addressing these points. In this two-part pre-class and in-class assignment (PCA-ICA), what did you learn? How important did you find the various steps? Which steps made the biggest difference in the power of the ML approach? Where do you think you should spend the most time in your projects to get the best results?

Put your answers here!


Part 5. Explore other ML algorithms (Time Permitting)

Task: Using the same dataset, try different estimators, such as a \(k\)-nearest neighbors regressor or a support vector regressor.

# Put your code here

Question: How does your new model compare with the previous ones?

Put your answers here!


Congratulations, you’re done!

Submit this assignment by uploading it to the course Desire2Learn web page. Go to the “In-class assignments” folder, find the appropriate submission link, and upload it there.

© Copyright 2023, Department of Computational Mathematics, Science and Engineering at Michigan State University.