HW6 Assigned Problems#

# As always, we start with our favorite standard imports. 

import numpy as np
import pandas as pd
import matplotlib.pyplot as plt

from sklearn.model_selection import train_test_split, cross_val_score
from sklearn.tree import DecisionTreeClassifier, plot_tree, export_text
from sklearn.metrics import confusion_matrix, accuracy_score, ConfusionMatrixDisplay
from sklearn.ensemble import RandomForestClassifier

from sklearn import tree

# Optional: display settings
pd.set_option('display.max_columns', None)

Homework 6 Spring 2026#

In this homework, you will work with classification methods, focusing on decision trees and ensemble methods such as random forests. You will explore how model complexity affects performance, use cross-validation to select tuning parameters, and compare models using training and testing error rates.

In addition, you will interpret tree-based models and examine variable importance to better understand how predictors influence the response. As in previous assignments, be sure to clearly justify your choices, report relevant metrics, and provide brief interpretations of your results.


Question 1. Classification Tree on the OJ Data Set (30 points)#

(a) (6 points)#

Create a training set containing a random sample of 800 observations and a test set containing the remaining observations.

Report the percentage of observations that are included in the training set.

(b) (6 points)#

Fit a classification tree to the training data, with Purchase as the response and all remaining variables as predictors. To control model complexity, we restrict the tree to have a maximum depth of 3.

Report:

  • the fitted model (e.g., a plot or summary of the tree)

  • the training error rate

(c) (6 points)#

Plot the fitted classification tree and use it to answer the following:

  • How many terminal nodes does the tree have?

  • Which variables appear near the top of the tree?

  • Briefly interpret two of the main splits.

(d) (6 points)#

Use export_text() to display a text summary of the fitted classification tree.

Select one terminal node and describe:

  • the path of splits leading to the node

  • the observations in that node

  • the predicted class

(e) (6 points)#

Predict on the test set and report:

  • the confusion matrix

  • the test error rate

Based on your results, briefly comment on the model’s performance.


Question 2. Cross-Validation and Tree Size (20 points)#

(a) (15 points)#

Use cross-validation on the training set to determine an appropriate tree size. Vary the maximum depth of the tree over a range of 1 to 20 and use 5-fold cross-validation to evaluate model performance. Plot tree size versus cross-validated classification error rate.

(b) (5 points)#

Which tree size achieves the lowest cross-validated classification error rate? Briefly explain your choice based on the plot.


Question 3. Pruned vs. Unpruned Tree (20 points)#

(a) (10 points)#

Fit both an unpruned tree and a pruned (or restricted) tree using the optimal tree size identified in Question 2. If the cross-validation results do not clearly support pruning, fit a smaller reasonable tree instead and briefly justify your choice.

Report:

  • the chosen tree size for the pruned/restricted model

  • the training error rate for each model

  • the testing error rate for each model

(b) (3 points)#

Which model has the higher training error rate, the pruned or unpruned tree?

(c) (3 points)#

Which model performs better on the test set, the pruned or unpruned tree?

(d) (4 points)#

Briefly explain why the pruned and unpruned trees may perform differently on the training and test data.


Question 4. Random Forest (20 points)#

Using the same training and test sets from Question 1:

(a) (15 points)#

Fit a random forest model with Purchase as the response variable (use at least 500 trees).

Report:

  • the test error rate

  • the variable importance values

(b) (5 points)#

Compare the random forest and classification tree using their test error rates, and briefly interpret the most important predictors.


Question 5. Reflection (10 points)#

Write a short paragraph addressing the following:

  • one advantage of a single decision tree

  • one disadvantage of a single decision tree

  • why ensemble methods (e.g., random forests) can improve prediction performance


Grading distribution#

  • Question 1: 30 points

  • Question 2: 20 points

  • Question 3: 20 points

  • Question 4: 20 points

  • Question 5: 10 points

Total: 100 points

Data#

This homework uses the OJ data set. Update the file path below if needed.

url = "https://msu-cmse-courses.github.io/CMSE381-S26/_downloads/73b56f6984db80ab92105d5e86279f38/OJ.csv"
OJ_df = pd.read_csv(url, index_col=0)

# Convert the Store7 column from Yes/No to 1/0 so scikit-learn can use it.
if 'Store7' in OJ_df.columns:
    OJ_df['Store7'] = OJ_df['Store7'].map({'Yes': 1, 'No': 0})

# Preview the data
print(OJ_df.info())

OJ_df.head()
<class 'pandas.core.frame.DataFrame'>
Index: 1070 entries, 1 to 1070
Data columns (total 18 columns):
 #   Column          Non-Null Count  Dtype  
---  ------          --------------  -----  
 0   Purchase        1070 non-null   object 
 1   WeekofPurchase  1070 non-null   int64  
 2   StoreID         1070 non-null   int64  
 3   PriceCH         1070 non-null   float64
 4   PriceMM         1070 non-null   float64
 5   DiscCH          1070 non-null   float64
 6   DiscMM          1070 non-null   float64
 7   SpecialCH       1070 non-null   int64  
 8   SpecialMM       1070 non-null   int64  
 9   LoyalCH         1070 non-null   float64
 10  SalePriceMM     1070 non-null   float64
 11  SalePriceCH     1070 non-null   float64
 12  PriceDiff       1070 non-null   float64
 13  Store7          1070 non-null   int64  
 14  PctDiscMM       1070 non-null   float64
 15  PctDiscCH       1070 non-null   float64
 16  ListPriceDiff   1070 non-null   float64
 17  STORE           1070 non-null   int64  
dtypes: float64(11), int64(6), object(1)
memory usage: 158.8+ KB
None
Purchase WeekofPurchase StoreID PriceCH PriceMM DiscCH DiscMM SpecialCH SpecialMM LoyalCH SalePriceMM SalePriceCH PriceDiff Store7 PctDiscMM PctDiscCH ListPriceDiff STORE
1 CH 237 1 1.75 1.99 0.00 0.0 0 0 0.500000 1.99 1.75 0.24 0 0.000000 0.000000 0.24 1
2 CH 239 1 1.75 1.99 0.00 0.3 0 1 0.600000 1.69 1.75 -0.06 0 0.150754 0.000000 0.24 1
3 CH 245 1 1.86 2.09 0.17 0.0 0 0 0.680000 2.09 1.69 0.40 0 0.000000 0.091398 0.23 1
4 MM 227 1 1.69 1.69 0.00 0.0 0 0 0.400000 1.69 1.69 0.00 0 0.000000 0.000000 0.00 1
5 CH 228 7 1.69 1.69 0.00 0.0 0 0 0.956535 1.69 1.69 0.00 1 0.000000 0.000000 0.00 0

Question 1. Classification Tree on the OJ Data Set#

(a) (6 points)#

Create a training set containing a random sample of 800 observations and a test set containing the remaining observations.

Report:

  • the percentage of observations in the training set

###YOUR CODE HERE###
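If helpful, here is a minimal starter sketch of one way to make the 800-observation split with `train_test_split`. The synthetic `df` below is only a stand-in for `OJ_df` so the snippet runs on its own; swap in the real data frame.

```python
import numpy as np
import pandas as pd
from sklearn.model_selection import train_test_split

# Synthetic stand-in for OJ_df (same row count) so this sketch is self-contained.
rng = np.random.default_rng(0)
df = pd.DataFrame({
    'Purchase': rng.choice(['CH', 'MM'], size=1070),
    'LoyalCH': rng.random(1070),
    'PriceDiff': rng.normal(size=1070),
})

X = df.drop(columns='Purchase')
y = df['Purchase']

# train_size=800 requests exactly 800 training observations;
# the remaining 270 observations form the test set.
X_train, X_test, y_train, y_test = train_test_split(
    X, y, train_size=800, random_state=42)

pct_train = 100 * len(X_train) / len(df)
print(f"Training set: {len(X_train)} observations ({pct_train:.2f}% of the data)")
```

Passing an integer `train_size` (rather than a fraction) is what guarantees exactly 800 rows in the training set.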

(b) (6 points)#

Fit a classification tree to the training data, with Purchase as the response and all remaining variables as predictors. To control model complexity, we restrict the tree to have a maximum depth of 3.

Report:

  • the fitted model

  • the training error rate

###YOUR CODE HERE###
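A sketch of the fitting step, illustrated on scikit-learn's built-in breast-cancer data as a stand-in for the OJ training split (swap in your own `X_train`/`y_train` and the `Purchase` response):

```python
from sklearn.datasets import load_breast_cancer
from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeClassifier
from sklearn.metrics import accuracy_score

# Stand-in data; use your OJ training split from part (a) instead.
X, y = load_breast_cancer(return_X_y=True, as_frame=True)
X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=42)

# max_depth=3 caps model complexity, as the prompt requires.
clf = DecisionTreeClassifier(max_depth=3, random_state=42)
clf.fit(X_train, y_train)

# Training error rate = 1 - training accuracy.
train_err = 1 - accuracy_score(y_train, clf.predict(X_train))
print(f"Training error rate: {train_err:.3f}")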

(c) (3 + 3 points): Create a plot of the fitted tree and interpret the results.#

  1. Create a plot of the fitted tree.

###YOUR CODE HERE###
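One way to draw the tree is `plot_tree`, sketched here on the built-in breast-cancer data as a stand-in; plot the tree you fitted on the OJ training set instead. `get_n_leaves()` answers the terminal-node count directly.

```python
import matplotlib.pyplot as plt
from sklearn.datasets import load_breast_cancer
from sklearn.tree import DecisionTreeClassifier, plot_tree

# Stand-in data; substitute your fitted OJ tree.
X, y = load_breast_cancer(return_X_y=True, as_frame=True)
clf = DecisionTreeClassifier(max_depth=3, random_state=42).fit(X, y)

# filled=True colors each node by its majority class.
fig, ax = plt.subplots(figsize=(14, 7))
plot_tree(clf, feature_names=list(X.columns),
          class_names=['malignant', 'benign'], filled=True, ax=ax)
plt.show()

# get_n_leaves() gives the number of terminal nodes directly.
print("Terminal nodes:", clf.get_n_leaves())
```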

(c), continued#

  2. Interpret the results based on the plot above.

Answer the following:

  • How many terminal nodes does the tree have?

  • Which variables appear near the top of the tree?

  • Briefly interpret two of the main splits.

###YOUR ANSWER HERE###

(d) (3 + 3 points)#

  1. Use export_text() to produce a text summary of the fitted tree.

###YOUR CODE HERE###
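A sketch of the `export_text()` call, again on stand-in data (apply it to your own fitted OJ tree). In the printed rules, each `class:` line marks a terminal node, and the indented conditions above it are the path of splits leading there.

```python
from sklearn.datasets import load_breast_cancer
from sklearn.tree import DecisionTreeClassifier, export_text

# Stand-in data; call export_text on your fitted OJ tree instead.
X, y = load_breast_cancer(return_X_y=True, as_frame=True)
clf = DecisionTreeClassifier(max_depth=3, random_state=42).fit(X, y)

# feature_names makes the split conditions readable.
rules = export_text(clf, feature_names=list(X.columns))
print(rules)
```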

  2. Choose one terminal node and interpret:

  • the path of splits leading to that node

  • what type of observations fall into that node

  • the predicted class for that node

###YOUR ANSWER HERE###

(e) (3 + 3 points)#

  1. Predict the response on the test set and produce a confusion matrix comparing the test labels to the predicted test labels.

Report:

  • the confusion matrix

  • the test error rate

###YOUR CODE HERE###
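A sketch of the prediction and evaluation step on stand-in data; reuse your OJ split and the tree fitted in parts (a)-(b).

```python
from sklearn.datasets import load_breast_cancer
from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeClassifier
from sklearn.metrics import confusion_matrix, accuracy_score

# Stand-in data; substitute your OJ split and fitted tree.
X, y = load_breast_cancer(return_X_y=True)
X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=42)
clf = DecisionTreeClassifier(max_depth=3, random_state=42).fit(X_train, y_train)

y_pred = clf.predict(X_test)
cm = confusion_matrix(y_test, y_pred)   # rows = true labels, columns = predicted
test_err = 1 - accuracy_score(y_test, y_pred)

print(cm)
print(f"Test error rate: {test_err:.3f}")
```

The off-diagonal entries of the confusion matrix are the misclassified test observations, so `test_err` equals their total divided by the test-set size.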

(e), continued#

  2. Based on your results above, briefly comment on the model’s performance.

###YOUR ANSWER HERE###

Question 2. Cross-Validation and Tree Size#

(a) (12 points)#

Use cross-validation on the training set to determine an appropriate tree size. Vary the maximum depth of the tree over a range of 1 to 20 and use 5-fold cross-validation to evaluate model performance. Plot tree size versus cross-validated classification error rate.

###YOUR CODE HERE###
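One way to structure the depth search is a loop over `max_depth` values with `cross_val_score`, sketched on stand-in data; run it on your OJ training set.

```python
import numpy as np
import matplotlib.pyplot as plt
from sklearn.datasets import load_breast_cancer
from sklearn.model_selection import cross_val_score
from sklearn.tree import DecisionTreeClassifier

# Stand-in data; loop over your OJ training set instead.
X, y = load_breast_cancer(return_X_y=True)

depths = list(range(1, 21))
cv_err = []
for d in depths:
    # cross_val_score reports accuracy by default; convert to an error rate.
    scores = cross_val_score(
        DecisionTreeClassifier(max_depth=d, random_state=42), X, y, cv=5)
    cv_err.append(1 - scores.mean())

best_depth = depths[int(np.argmin(cv_err))]

plt.plot(depths, cv_err, marker='o')
plt.xlabel('Maximum depth')
plt.ylabel('5-fold CV error rate')
plt.show()
print("Depth with lowest CV error:", best_depth)
```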

✅ Question (a-b): Documenting Your Solution Process (3 points)#

Please answer the following clearly and completely:

  1. Prior Knowledge vs. External Resources (1 point)
    Indicate which parts of Question (a) you completed using your own prior knowledge, and which parts you completed using external resources (e.g., generative AI, past assignments, Stack Overflow, Google, etc.).

###YOUR ANSWER HERE###

  2. Required Documentation (2 points)

    • For any part where you used generative AI, you must include the exact prompts you entered and the corresponding AI outputs. Copy and paste them directly.

    • For any part where you used other external resources, list those sources.

    • For parts completed without external resources, briefly state what prior knowledge you relied on (no detailed explanation required).

Responses that do not include prompts and AI outputs (when applicable) will not receive full credit.

###YOUR ANSWER HERE###

###YOUR PROMPTS###

###AI OUTPUTS###

(b) (5 points)#

Which tree size achieves the lowest cross-validated classification error rate? Briefly explain your choice.

###YOUR ANSWER HERE###

Question 3. Pruned vs. Unpruned Tree#

(a) (10 points)#

Fit both an unpruned tree and a pruned (or restricted) tree using the optimal tree size identified in Question 2. If the cross-validation results do not clearly support pruning, fit a smaller reasonable tree instead and briefly justify your choice.

For each model, report both the training error rate and the testing error rate, as these will be used for comparison in part (b) and (c).

Be sure to clearly state the chosen tree size (e.g., depth or number of terminal nodes) for the pruned/restricted model.

###YOUR CODE HERE###
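A sketch of the side-by-side comparison on stand-in data. The `max_depth=3` on the restricted model is a placeholder; replace it with the depth you selected in Question 2, and use your OJ split.

```python
from sklearn.datasets import load_breast_cancer
from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeClassifier
from sklearn.metrics import accuracy_score

# Stand-in data; substitute your OJ split and the depth chosen in Question 2.
X, y = load_breast_cancer(return_X_y=True)
X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=42)

models = {
    # No max_depth: the tree grows until leaves are pure (unpruned).
    'unpruned': DecisionTreeClassifier(random_state=42),
    # Placeholder depth; use your cross-validated choice here.
    'restricted': DecisionTreeClassifier(max_depth=3, random_state=42),
}

results = {}
for name, m in models.items():
    m.fit(X_train, y_train)
    train_err = 1 - accuracy_score(y_train, m.predict(X_train))
    test_err = 1 - accuracy_score(y_test, m.predict(X_test))
    results[name] = (train_err, test_err)
    print(f"{name}: train error {train_err:.3f}, test error {test_err:.3f}")
```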

(b) (3 points)#

Compare the training error rates of the pruned and unpruned trees. Which is higher?

###YOUR ANSWER HERE###

(c) (3 points)#

Compare the test error rates of the pruned and unpruned trees. Which model performs better?

###YOUR ANSWER HERE###

(d) (4 points)#

Briefly explain why the pruned and unpruned trees may perform differently on training and test data.

###YOUR ANSWER HERE###

Question 4. Random Forest#

Using the same training and test sets from Question 1:

(a) (12 points)#

Fit a random forest model with Purchase as the response variable. Use at least 500 trees.
Report the test error rate and display the variable importance values.

###YOUR CODE HERE###
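A sketch of the random forest fit on stand-in data; reuse your OJ split from Question 1. `feature_importances_` gives the variable importance values, which are easiest to read as a sorted `pd.Series`.

```python
import pandas as pd
from sklearn.datasets import load_breast_cancer
from sklearn.model_selection import train_test_split
from sklearn.ensemble import RandomForestClassifier
from sklearn.metrics import accuracy_score

# Stand-in data; substitute the OJ split from Question 1.
X, y = load_breast_cancer(return_X_y=True, as_frame=True)
X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=42)

# n_estimators=500 satisfies the "at least 500 trees" requirement.
rf = RandomForestClassifier(n_estimators=500, random_state=42)
rf.fit(X_train, y_train)

test_err = 1 - accuracy_score(y_test, rf.predict(X_test))

# Importances are normalized to sum to 1; sort to see the top predictors.
importances = (pd.Series(rf.feature_importances_, index=X.columns)
                 .sort_values(ascending=False))

print(f"Test error rate: {test_err:.3f}")
print(importances.head(10))
```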

✅ Question (a-b): Documenting Your Solution Process (3 points)#

Please answer the following clearly and completely:

  1. Prior Knowledge vs. External Resources (1 point)
Indicate which parts of Question (a) you completed using your own prior knowledge, and which parts you completed using external resources (e.g., generative AI, past assignments, Stack Overflow, Google, etc.).

###YOUR ANSWER HERE###

  2. Required Documentation (2 points)

    • For any part where you used generative AI, you must include the exact prompts you entered and the corresponding AI outputs. Copy and paste them directly.

    • For any part where you used other external resources, list those sources.

    • For parts completed without external resources, briefly state what prior knowledge you relied on (no detailed explanation required).

Responses that do not include prompts and AI outputs (when applicable) will not receive full credit.

###YOUR ANSWER HERE###

###YOUR PROMPTS###

###AI OUTPUTS###

(b) (5 points)#

Compare the random forest with the classification tree from Question 3 using their test error rates.
Identify the most important predictors and briefly interpret the results.

###YOUR ANSWER HERE###

Question 5. Reflection (10 points)#

Write a short paragraph addressing the following:

  1. What is one advantage of a single decision tree?

  2. What is one disadvantage of a single decision tree?

  3. Why can ensemble methods such as random forests often improve prediction performance?

###YOUR ANSWER HERE###