Lecture 5-1: Newton Method: Part 1

Lecture 5-1: Newton Method: Part 1#

Download the original slides: CMSE382-Lec5_1.pdf

Warning

This is an AI-generated transcript of the lecture slides and may contain errors or inaccuracies. Please refer to the original course materials for authoritative content.

Review: Linear and Quadratic Approximation#

Topics Covered#

Topics:

Linear approximation theorem
Quadratic approximation theorem
Pure Newton’s method
Newton method quadratic local convergence

Announcements:

Homework 2 posted, due Thursday, Feb 12 at 11:59pm
Midterm 1 on Wednesday, Feb 18.

Linear Approximation Theorem#

Theorem (Linear Approximation Theorem):

Let \(f : U \to \mathbb{R}\) be a twice continuously differentiable function over an open set \(U \subseteq \mathbb{R}^n\),
Let \(\mathbf{x} \in U\), \(r > 0\) satisfy \(B(\mathbf{x}, r) \subseteq U\).

Then for any \(\mathbf{y} \in B(\mathbf{x}, r)\) there exists \(\boldsymbol{\xi} \in [\mathbf{x}, \mathbf{y}]\) such that

\[f(\mathbf{y}) = f(\mathbf{x}) + \nabla f(\mathbf{x})^T (\mathbf{y} - \mathbf{x}) + \frac{1}{2}(\mathbf{y} - \mathbf{x})^T \nabla^2 f(\boldsymbol{\xi}) (\mathbf{y} - \mathbf{x}).\]

Linear Approximation Visual#

\[f(\mathbf{y}) = f(\mathbf{x}) + \nabla f(\mathbf{x})^T (\mathbf{y} - \mathbf{x}) + o(\|\mathbf{x}-\mathbf{y}\|).\]

Quadratic Approximation Theorem#

Theorem (Quadratic Approximation Theorem):

Let \(f : U \to \mathbb{R}\) be a twice continuously differentiable function over an open set \(U \subseteq \mathbb{R}^n\).
Let \(\mathbf{x} \in U\), \(r > 0\) satisfy \(B(\mathbf{x}, r) \subseteq U\).

Then for any \(\mathbf{y} \in B(\mathbf{x}, r)\)

\[\begin{split}\begin{aligned} f(\mathbf{y}) = f(\mathbf{x}) &+ \nabla f(\mathbf{x})^T (\mathbf{y} - \mathbf{x}) \\ & + \frac{1}{2}(\mathbf{y} - \mathbf{x})^T \nabla^2 f(\mathbf{x})(\mathbf{y} - \mathbf{x}) \\ &+ o(\|\mathbf{y} - \mathbf{x}\|^2). \end{aligned}\end{split}\]

Newton’s Method#

Newton’s Method Introduction#

Newton’s method focuses on optimizing

\[\min \{f(\textbf{x}) \mid \textbf{x} \in \mathbb{R}^n\}\]

where \(f\) is twice continuously differentiable.

Newton’s method is a second order method
- Uses information from the Hessian
- Contrast with first order methods like gradient descent which only uses gradient information
Requires that \(\nabla^2 f (\mathbf{x})\) is positive definite for every \(\mathbf{x} \in \mathbb{R}^n\)
- A unique optimal solution \(\mathbf{x}^*\) exists.

Newton’s Method - Main Idea#

Main idea: Minimize the quadratic approximation around \(\textbf{x}_k\):

\[\textbf{x}_{k+1} = \arg\min\limits_{\textbf{x} \in \mathbb{R}^n}\left\{f(\textbf{x}_k) + \nabla f(\textbf{x}_k)^\top(\textbf{x}-\textbf{x}_k)+\frac12(\textbf{x}-\textbf{x}_k)^\top \nabla^2f(\textbf{x}_k)(\textbf{x}-\textbf{x}_k)\right\}\]

This function is not well-defined unless \(\nabla^2f(\textbf{x}_k) \succ 0\).
The unique minimizer is the unique stationary point:

\[\nabla f(\textbf{x}_k)+\nabla^2 f(\textbf{x}_k)(\textbf{x}_{k+1}-\textbf{x}_k) = \textbf{0}\]

or equivalently \(\textbf{x}_{k+1} = \textbf{x}_k -(\nabla^2 f(\textbf{x}_k))^{-1}\nabla f(\textbf{x}_k)\)

\(-(\nabla^2f(\textbf{x}_k))^{-1}\nabla f(\textbf{x}_k)\) is called the Newton direction.

Pure Newton’s Method#

Input: \(\varepsilon > 0\) tolerance parameter

Initialization: Pick \(\textbf{x}_0 \in \mathbb{R}^n\) arbitrarily.

\(\mathbf{x}_0\) too far from \(\mathbf{x}^*\) can cause divergence.

General step: For any \(k = 0, 1, 2, \ldots\), do:

Compute the Newton direction \(\mathbf{d}_k\), which is the solution to the linear system \(\nabla^2 f(\textbf{x}_k)\mathbf{d}_k = -\nabla f(\textbf{x}_k)\)
- More efficient than inverting \(\nabla^2 f(\mathbf{x}_k)\) in \(\mathbf{d}_k=-(\nabla^2f(\textbf{x}_k))^{-1}\nabla f(\textbf{x}_k)\)
Set \(\textbf{x}_{k+1} = \textbf{x}_k+\mathbf{d}_k\)
If \(\|\nabla f(\textbf{x}_{k+1})\| < \varepsilon\), stop and output \(\textbf{x}_{k+1}\).

Newton’s Method: Example#

Example: \(f(x,y) = \frac12(10x^2+y^2)+5\log(1+e^{-x-y})\)

Comparison of Newton’s Method (blue) with Gradient Descent (black)

Convergence#

Quadratic Local Convergence of Newton’s Method#

Theorem: Let \(f\) be a twice continuously differentiable function defined over \(\mathbb{R}^n\). Let \(\{\textbf{x}_k\}_{k \geq 0}\) be the sequence generated by Newton’s method, and let \(\textbf{x}^\ast\) be the unique minimizer of \(f\) over \(\mathbb{R}^n\).

Assume that:

there exists \(m > 0\) such that \(\lambda_{\min}(\nabla^2 f(\textbf{x})) \geq m\) for any \(\mathbf{x} \in \mathbb{R}^n\)
there exists an \(L\) such that \(\nabla^2 f(\textbf{x})\) is Lipschitz with constant \(L\)

Then for any \(k = 0, 1, 2, \ldots\), the following inequality holds:

\[\|\textbf{x}_{k+1}-\textbf{x}^\ast\| \leq \frac{L}{2m}\|\textbf{x}_k-\textbf{x}^\ast\|^2\]

Moreover, if \(\|\textbf{x}_0-\textbf{x}^\ast \| \leq \frac{m}{L}\), then for \(k = 0, 1, 2, \ldots\):

\[\|\textbf{x}_k-\textbf{x}^\ast \|\leq \frac{2m}{L}\left(\frac12\right)^{2^k}\]

Rapid convergence! (number of accuracy digits is doubled at each \(k\))
Can still converge if the assumptions are not met