Ch 2.1: What is Statistical Learning?

Lecture 2 - CMSE 381

Michigan State University
Dept of Computational Mathematics, Science & Engineering

Weds, Jan 14, 2026

Announcements

Last time:

Discussed where to find everything
- Course webpage
- D2L
Check out the syllabus!

Announcements:

First homework due Sun Jan 25
First office hours next week

Covered in this class

Input/output variables
Prediction vs inference
Reducible vs irreducible error
Overfitting
Classification vs regression
Supervised vs Unsupervised learning

Please note: no jupyter notebook for today’s class, slides only

An example data set: Advertising

Sales of a product in 200 markets, along with the amount spent on three different types of advertising
Goal: Predict Sales based on amount spent in each type of advertising
Input variables: TV, Radio, Newspaper
Output variable: Sales

Data available at msu-cmse-courses.github.io/CMSE381-S26/DataSets/DataSets.html

Notation and Big Assumption

Input variables:

X_{1}, X_{2}, \dots, X_{p}

Output variable: $Y$

Y = f (X) + \epsilon

$f$ is the systematic information that $X$ provides about $Y$ .
It is the ground truth that we want but can’t access
So our goal is to come up with an estimated model $\hat{f}$ .
$\epsilon$ is a random error term which is independent of $X$ and has mean 0
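
The model above can be sketched in Python by simulating data. This is only an illustration: the function `f_true`, the noise level, and the input range are made-up assumptions, since in a real problem $f$ is unknown.

```python
import numpy as np

rng = np.random.default_rng(0)

def f_true(x):
    # Hypothetical "ground truth" f; in real problems we never get to see this.
    return 7.0 + 0.05 * x

n = 200
x = rng.uniform(0, 300, size=n)      # input variable X
eps = rng.normal(0.0, 3.0, size=n)   # random error: mean 0, drawn independently of X
y = f_true(x) + eps                  # observed output Y = f(X) + eps

# The sample mean of eps is close to 0, but not exactly 0 in a finite sample.
print("mean of eps:", eps.mean())
```

All we would observe in practice is `x` and `y`; the job of statistical learning is to recover something close to `f_true` from them.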

Advertising Example

More examples

Section 1

Prediction vs Inference

Prediction

Given a value $X$ , try to provide an estimate for $f (X)$ .

Build a model:

\hat{Y} = \hat{f} (X)

Want to get a good guess for $f$ , which is unknown (blue solid line)
The model $\hat{f}$ is the green dashed line

Group question:

The blue solid line is $f$ . The green dashed line is $\hat{f}$ .

What are the predicted sales for the first three data points using the green dashed line $\hat{f}$ shown in the graph?
- Note: all values are approximate
- $\hat{f} (230.1) = 19$
- $\hat{f} (44.5) = 7$
- $\hat{f} (17.2) = 5$
Using the dashed green line as the predicted model $\hat{f}$ , what is the error in each of the three predictions?

Reducible vs irreducible error

All models are wrong, some are useful.

$Y - \hat{Y}$

Reducible Error

$\hat{f}$ will not be a perfect estimate for $f$ .
We can potentially improve the accuracy of $\hat{f}$ by using the most appropriate statistical learning technique

Irreducible Error

Model was $Y = f (X) + \epsilon$ ,
Variability of $\epsilon$ also affects predictions
No matter how well we estimate $f$ , we can’t get rid of this error.
Would expect this in real life though:

More on error

Given estimate $\hat{f}$ (fixed)
Set of predictors $X$ (fixed)
Prediction $\hat{Y} = \hat{f} (X)$

$E {(Y - \hat{Y})}^{2} = E {[f (X) + \epsilon - \hat{f} (X)]}^{2} = {[f (X) - \hat{f} (X)]}^{2} + \mathrm{Var} (\epsilon)$

${[f (X) - \hat{f} (X)]}^{2}$ is reducible, $\mathrm{Var} (\epsilon)$ is irreducible
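
A quick numerical check of this decomposition, as a sketch: the functions `f` and `f_hat`, the point `x0`, and the noise level `sigma` below are all hypothetical choices for illustration. With the estimate and predictor held fixed, the average squared error over many draws of the error term should come out close to the reducible part plus the variance of the error term.

```python
import numpy as np

rng = np.random.default_rng(1)

def f(x):
    # Hypothetical true function (unknown in practice).
    return 2.0 + 3.0 * x

def f_hat(x):
    # Hypothetical fixed estimate of f, deliberately a bit off.
    return 2.5 + 2.8 * x

x0 = 4.0          # fixed predictor value X
sigma = 1.5       # standard deviation of eps, so Var(eps) = sigma**2

# Draw many responses Y = f(x0) + eps at the same fixed x0.
eps = rng.normal(0.0, sigma, size=1_000_000)
y = f(x0) + eps

expected_sq_error = np.mean((y - f_hat(x0)) ** 2)   # Monte Carlo estimate of E(Y - Y_hat)^2
reducible = (f(x0) - f_hat(x0)) ** 2                # [f(X) - f_hat(X)]^2
irreducible = sigma ** 2                            # Var(eps)

print(expected_sq_error, reducible + irreducible)
```

Making `f_hat` closer to `f` shrinks `reducible` toward 0, but the total can never drop below `irreducible`.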

Inference

Want $f$ , but not for prediction (or possibly combined with prediction)

Which predictors are associated with the response?
What is the relationship between the response and each predictor?
Can the relationship between $Y$ and each predictor be adequately summarized using a linear equation? Is it more complicated?

Determine whether each scenario is prediction, inference, or both.

Application	Prediction	Inference

Predict effectiveness of vaccine
Determine the address written on the image of an envelope.
Identify risk factors for getting long covid.
Transcribe an audio file of a person talking.
Predict stock prices.

Section 2

How to estimate $f$ ?

Input: Training data

$n$ data points observed
$x_{ij}$ is the $j$ th predictor for observation $i$
$y_{i}$ is the response variable for the $i$ th observation
Training data:
- $\{(x_{1}, y_{1}), (x_{2}, y_{2}), \dots, (x_{n}, y_{n})\}$
- $x_{i} = {(x_{i 1}, x_{i 2}, \dots, x_{i p})}^{T}$
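
One common way to hold training data in code, shown here as a sketch with made-up numbers: an `n` by `p` array `X` with one row per observation, plus a length-`n` array `y`. The names `X` and `y` and the values are illustrative assumptions.

```python
import numpy as np

# Hypothetical training set with n = 4 observations and p = 3 predictors.
X = np.array([
    [1.0, 2.0, 3.0],
    [4.0, 5.0, 6.0],
    [7.0, 8.0, 9.0],
    [10.0, 11.0, 12.0],
])
y = np.array([10.0, 20.0, 30.0, 40.0])

n, p = X.shape

# The math is 1-indexed and NumPy is 0-indexed, so x_{ij} is X[i - 1, j - 1].
i, j = 2, 3
x_ij = X[i - 1, j - 1]   # x_{23}
x_i = X[i - 1]           # x_2 = (x_{21}, x_{22}, x_{23})^T
y_i = y[i - 1]           # y_2

print(n, p, x_ij, x_i, y_i)
```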

Parametric methods

Select a model

Example:

f (X) = β_{0} + β_{1} X_{1} + β_{2} X_{2} + \dots + β_{p} X_{p}

Train the model

Example:

Find the $β_{i}$'s so that

Y \approx β_{0} + β_{1} X_{1} + β_{2} X_{2} + \dots + β_{p} X_{p}

How do you decide on the coefficients?

Y \approx β_{0} + β_{1} X_{1}
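
One common answer is least squares: pick the coefficients that minimize the sum of squared differences between the observed responses and the model's predictions. The slide does not name a method, so least squares is an assumption here, and the data below are synthetic with a made-up true intercept and slope.

```python
import numpy as np

rng = np.random.default_rng(2)

# Synthetic data from a hypothetical linear truth: Y = 7 + 0.05 * X1 + eps.
n = 200
x1 = rng.uniform(0, 300, size=n)
y = 7.0 + 0.05 * x1 + rng.normal(0.0, 1.0, size=n)

# Design matrix with a column of ones so that beta_0 acts as the intercept.
A = np.column_stack([np.ones(n), x1])

# Least squares: choose beta to minimize sum_i (y_i - beta_0 - beta_1 * x_i1)^2.
beta, *_ = np.linalg.lstsq(A, y, rcond=None)
beta0_hat, beta1_hat = beta

def f_hat(x):
    # Fitted parametric model.
    return beta0_hat + beta1_hat * x

print(beta0_hat, beta1_hat)
```

The fitted values should land near the made-up truth (7 and 0.05) but not on it exactly, because of the noise in `y`.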

Example Non-parametric method: Nearest Neighbors

N_{k} (x) = Set of k nearest neighbors of x

\hat{f} (x) = \frac{1}{k} \sum_{x_{i} \in N_{k} (x)} y_{i}
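
The nearest-neighbors formula translates almost directly into code. Below is a minimal sketch for a single predictor; the helper name `knn_predict`, the tie-breaking rule, and the tiny data set are assumptions made for illustration.

```python
import numpy as np

def knn_predict(x_train, y_train, x, k):
    """Average the responses of the k training points closest to x (1-D inputs)."""
    distances = np.abs(x_train - x)
    # Indices of the k smallest distances, i.e. the set N_k(x).
    # A stable sort breaks ties in favor of the earlier training point.
    neighbors = np.argsort(distances, kind="stable")[:k]
    return y_train[neighbors].mean()

# Tiny made-up training set for illustration.
x_train = np.array([1.0, 2.0, 3.0, 6.0, 7.0])
y_train = np.array([1.0, 3.0, 2.0, 8.0, 9.0])

print(knn_predict(x_train, y_train, x=2.2, k=3))  # neighbors are x = 2, 3, 1
print(knn_predict(x_train, y_train, x=6.4, k=2))  # neighbors are x = 6, 7
```

No functional form for $f$ is chosen anywhere in this code, which is what makes the method non-parametric.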

Parametric methods: Pros and Cons

Pros

Easier to estimate parameters than to figure out a completely arbitrary function

Cons

You might have chosen the wrong function type

Overfitting

Possible fix: use more flexible models, i.e. a broader functional form

Problem: more flexible models need more parameters, which could lead to overfitting

Overfitting: Following the noise too closely
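
A small experiment that illustrates the idea, as a sketch under made-up assumptions (a sine-shaped truth, 15 noisy training points, polynomial models of increasing degree). Training error can only go down as the degree rises, because each model contains the previous one. Error on fresh test data typically stops improving and gets worse for the most flexible fit, though the exact numbers depend on the random draw.

```python
import numpy as np

rng = np.random.default_rng(3)

def f(x):
    # Hypothetical smooth truth.
    return np.sin(2 * np.pi * x)

# Small noisy training set and a larger test set from the same model.
x_train = rng.uniform(0, 1, size=15)
y_train = f(x_train) + rng.normal(0.0, 0.3, size=15)
x_test = rng.uniform(0, 1, size=500)
y_test = f(x_test) + rng.normal(0.0, 0.3, size=500)

def mse(y, y_pred):
    # Mean squared error between observed and predicted responses.
    return np.mean((y - y_pred) ** 2)

results = {}
for degree in (1, 3, 10):
    # Higher degree = more flexible model = more coefficients to estimate.
    model = np.polynomial.Polynomial.fit(x_train, y_train, deg=degree)
    results[degree] = (mse(y_train, model(x_train)), mse(y_test, model(x_test)))
    print(degree, results[degree])   # (training MSE, test MSE)
```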

Prediction Accuracy vs Model Interpretability

More flexible models allow for greater accuracy, but bring the potential for overfitting
More restrictive models are easier to understand and interpret

Supervised learning:
Training data has a response variable $y$ for every input $x$

Unsupervised learning:
Training data does not have a response variable $y$ for the inputs $x$

Regression vs Classification

Types of variables (the type of the output variable determines regression vs classification):

Quantitative
Ex: Blood pressure, temperature, volume, height, income
Qualitative / Categorical
Ex: Purchased a ticket, owns a house, job, digit in MNIST

Quantitative output: regression. Qualitative output: classification.

Section 3

Group work

(a) We collect a set of data on the top 500 firms in the US. For each firm we record profit, number of employees, industry and the CEO salary. We are interested in understanding which factors affect CEO salary.

Is this classification or regression? regression
Do we want inference or prediction? Inference
What is $n$ , the number of data points? 500
What is $p$ , the number of variables? 3

From Ex 2.4.2

(b) We are considering launching a new product and wish to know whether it will be a success or a failure. We collect data on 20 similar products that were previously launched. For each product we have recorded whether it was a success or failure, price charged for the product, marketing budget, competition price, and ten other variables.

Is this classification or regression? classification
Do we want inference or prediction? Prediction
What is $n$ , the number of data points? 20
What is $p$ , the number of variables? 13

TL;DR

Input/output variables
Prediction vs inference
Reducible vs irreducible error
Overfitting
Classification vs regression
Supervised vs Unsupervised learning

Wrap up

Next time:

Friday 1/16
- Bring Laptop!
- First homework due Sun Jan 25
- There will be a quiz this Friday

Announcements:

Office hours!
- Dr. Bao: Tue 9 am-11 am (EB 2507 L)
- Dr. Khasawneh: MWF 1:40 pm-2:00 pm (EB 2400), Tue 1:30 pm-2:30 pm (Zoom)
- Siyu: Wed 2:00 pm-3:00 pm (EB 2504)
- Haishen: WF 2:00 pm-3:00 pm (EB 2504)