
Ch 8.2.1, 8.2.2: Bagging and Random Forests

Lecture 25 - CMSE 381
Michigan State University
Dept of Computational Mathematics, Science & Engineering
Fri, Mar 27, 2026
Announcements

Last time:

This lecture:

Announcements:

Course schedule for weeks 21-33 covering
splines, trees, SVM, neural networks, with
Midterm 3 and project deadlines marked.

Section 1

Previously
First decision tree example

Table of baseball hitter data with columns for Hits, Years, and LogSalary.

Regression tree for baseball player salaries splitting on Years and Hits to produce three leaf nodes.
Viewing Regions Defined by Tree

Regression tree for baseball player salaries splitting on Years and Hits to produce three leaf nodes.

Scatter plot of Hits versus Years
partitioned into three regions R1, R2, and
R3 by a regression tree.

How do we actually get the tree? Two steps


We divide the predictor space, that is, the set of possible values for $X_1, X_2, \ldots, X_p$, into $J$ distinct and non-overlapping regions $R_1, R_2, \ldots, R_J$.


For every observation that falls into region $R_j$, we make the same prediction: the mean of the response values for the training observations in $R_j$.
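The two steps above can be sketched in code. This is my own illustration on made-up Hitters-style data (synthetic Years/Hits columns, not the course's notebook): fit a small regression tree and check that every point in a leaf gets the leaf's mean response as its prediction.

```python
# Sketch, assuming synthetic data in the spirit of the Hitters example.
import numpy as np
from sklearn.tree import DecisionTreeRegressor

rng = np.random.default_rng(0)
X = rng.uniform(0, 20, size=(200, 2))       # columns: Years, Hits (synthetic)
y = np.log(1 + 50 * X[:, 0] + 5 * X[:, 1])  # synthetic log-salary response

tree = DecisionTreeRegressor(max_leaf_nodes=3).fit(X, y)

# Every training point in the same leaf gets the same prediction,
# and that prediction equals the leaf's mean response.
leaves = tree.apply(X)
for leaf in np.unique(leaves):
    in_leaf = leaves == leaf
    assert np.allclose(tree.predict(X[in_leaf]), y[in_leaf].mean())
```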

A two-dimensional feature space
partitioned into five rectangular regions
R1 through R5 by vertical and horizontal
splits
Recursive binary splitting

Goal:

Find boxes $R_1, \ldots, R_J$ that minimize

$$\sum_{j=1}^{J} \sum_{i \in R_j} \left(y_i - \hat{y}_{R_j}\right)^2$$

where $\hat{y}_{R_j}$ is the mean response for the training observations in the $j$th box.

At each step, pick $j$ and $s$ so that splitting into the half-planes

$$R_1(j,s) = \{X \mid X_j < s\} \quad \text{and} \quad R_2(j,s) = \{X \mid X_j \geq s\}$$

results in the largest possible reduction in RSS, i.e. minimizes

$$\sum_{i:\, x_i \in R_1(j,s)} \left(y_i - \hat{y}_{R_1}\right)^2 + \sum_{i:\, x_i \in R_2(j,s)} \left(y_i - \hat{y}_{R_2}\right)^2$$
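One step of this search can be sketched directly: for a single feature, scan every candidate cutpoint $s$ and keep the one with the smallest two-region RSS. The data below is a made-up 1-D example of mine, not from the lecture.

```python
# Minimal sketch of one recursive-binary-splitting step (assumed toy data).
import numpy as np

def best_split(x, y):
    """Return (s, rss) for the cutpoint s minimizing the RSS of
    the split {x < s} vs {x >= s}."""
    order = np.argsort(x)
    x, y = x[order], y[order]
    best_s, best_rss = None, np.inf
    for s in (x[1:] + x[:-1]) / 2:   # midpoints between sorted values
        left, right = y[x < s], y[x >= s]
        rss = ((left - left.mean())**2).sum() + ((right - right.mean())**2).sum()
        if rss < best_rss:
            best_s, best_rss = s, rss
    return best_s, best_rss

x = np.array([1.0, 2.0, 3.0, 10.0, 11.0, 12.0])
y = np.array([5.0, 5.2, 4.8, 20.0, 19.5, 20.5])
s, rss = best_split(x, y)   # the jump in y sits between x=3 and x=10
```

Growing the full tree just repeats this greedy search inside each resulting region.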
Decision tree diagram illustrating the sequence of binary splits on X1 and X2 resulting in regions R1 through R5.

A two-dimensional feature space partitioned into five rectangular regions R1 through R5 by vertical and horizontal splits.
Pruning

Complex regression tree for
baseball player salaries
featuring multiple splits on
Years, Hits, RBI, Walks,
Runs, and Putouts
Regression tree for baseball player salaries splitting on Years and Hits to produce three leaf nodes.
Classification version

Large, complex classification tree for heart
disease predicting Yes or No based on
numerous clinical features and splits

Plot of classification error versus tree size for Training, Cross-Validation, and Test sets, with error bars.

Classification tree for heart disease predicting Yes or No based on splits of Thal, Ca, MaxHR, and ChestPain.

Evaluating the splits:
$$G = \sum_{k=1}^{K} \hat{p}_{mk}\left(1 - \hat{p}_{mk}\right)$$
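The Gini index for a node is easy to compute from its class counts. A small sketch with made-up counts: smaller $G$ means a purer node.

```python
# Sketch: Gini index G = sum_k p_mk (1 - p_mk) for one node (toy counts).
import numpy as np

def gini(counts):
    p = np.asarray(counts, dtype=float)
    p /= p.sum()                     # class proportions p_mk
    return float((p * (1 - p)).sum())

pure = gini([50, 0])    # node with a single class: G = 0
mixed = gini([25, 25])  # 50/50 node: G = 0.5, the worst case for K = 2
```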
Linear models vs trees

Two plots comparing a linear decision
boundary with a decision tree
approximation using axis-aligned
rectangular splits in feature space
Two plots showing a decision tree
perfectly capturing a rectangular boundary
with axis-aligned splits, compared to a
linear boundary
What will you learn today?

Section 2

8.2.1 Bagging
Use ensemble of trees to reduce variance

Want to do (but can’t):
Build separate models from $B$ independent training sets and average the resulting predictions:

$$\hat{f}_{avg}(x) = \frac{1}{B} \sum_{b=1}^{B} \hat{f}^{b}(x)$$

Bootstrap modification: we only have one training set, so instead draw $B$ bootstrap samples from it, fit a model $\hat{f}^{*b}$ to each, and average:

$$\hat{f}_{bag}(x) = \frac{1}{B} \sum_{b=1}^{B} \hat{f}^{*b}(x)$$
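Bagging can be sketched by hand in a few lines. This uses synthetic data of my own (not the course's): draw $B$ bootstrap samples, grow a deep tree on each, and average the $B$ predictions at a new point.

```python
# Sketch of bagging by hand (assumed synthetic sine data).
import numpy as np
from sklearn.tree import DecisionTreeRegressor

rng = np.random.default_rng(1)
X = rng.uniform(-3, 3, size=(300, 1))
y = np.sin(X[:, 0]) + rng.normal(scale=0.3, size=300)

B, n = 100, len(y)
trees = []
for b in range(B):
    idx = rng.integers(0, n, size=n)   # bootstrap sample, drawn with replacement
    trees.append(DecisionTreeRegressor().fit(X[idx], y[idx]))

# Averaging the B deep (high-variance) trees reduces variance.
x_new = np.array([[0.5]])
f_bag = np.mean([t.predict(x_new)[0] for t in trees])
```

In practice `sklearn.ensemble.BaggingRegressor` wraps exactly this loop.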

Tree version

Visual representation of bagging showing a blue data block leading to multiple red sample
blocks and green trees.
Prediction on new data point

Cartoon illustration of multiple green trees, representing an ensemble of decision trees.
Example: Heart classification data

Line plot comparing Test and OOB error
for Bagging and Random Forest models as
the number of trees increases.
Out of Bag Error Estimation

Diagram showing several red data
samples, each corresponding to an
individual green decision tree, representing
a bagging ensemble process.
Error using OOB

Line plot comparing Test and OOB error
for Bagging and Random Forest models as
the number of trees increases

Test your understanding: PollEv

Section 3

Random Forests
The idea: as in bagging, grow trees on bootstrap samples, but at each split consider only a random subset of $m$ of the $p$ predictors (a typical choice is $m \approx \sqrt{p}$). This decorrelates the trees, so averaging them reduces variance more than bagging alone.

Example on gene expression

Plot of test classification error versus
number of trees comparing m=p, m=p/2,
and m=sqrt(p) for random forests.
Coding time!
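A starting point for the coding exercise, on synthetic data of my own: in scikit-learn the only knob separating bagging from a random forest is `max_features`, the number $m$ of predictors considered at each split. Setting $m = p$ recovers bagging; $m \approx \sqrt{p}$ is the usual classification default.

```python
# Sketch comparing m = p, p/2, sqrt(p) (assumed synthetic data).
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import cross_val_score

X, y = make_classification(n_samples=400, n_features=16, n_informative=5,
                           random_state=0)
for m in [16, 8, 4]:   # p (bagging), p/2, sqrt(p)
    rf = RandomForestClassifier(n_estimators=100, max_features=m,
                                random_state=0)
    score = cross_val_score(rf, X, y, cv=5).mean()
    print(m, round(score, 3))
```

Plotting these scores against the number of trees reproduces the style of the gene-expression figure above.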

TL;DR

Next time
