Lecture 4-2: Gradient Method: Part 2#

Download the original slides: CMSE382-Lec4_2.pdf

Warning

This is an AI-generated transcript of the lecture slides and may contain errors or inaccuracies. Please refer to the original course materials for authoritative content.


Condition Number, Sensitivity, and Diagonal Scaling#

Topics Covered#

Topics:

  • Condition number

  • Gradient descent solution sensitivity

  • Diagonal scaling

Announcements:

  • Quiz 2 today, end of class

  • No office hours Friday


Recall: Positive Definiteness and Singular Matrices#

Recall two notions from linear algebra: positive definiteness and singularity.

An \(n\times n\) real symmetric matrix \(A\) is called:

  • Positive definite if \(\boldsymbol{x}^T A \boldsymbol{x} > 0\) for every non-zero choice of \(\boldsymbol{x}\).

    • If \(A\) is positive definite, then its diagonal entries are positive.

    • Equivalently, there is an invertible matrix \(B\) such that \(A = B^T B\).

    • \(\mathbf{A}\) is positive definite if and only if all its eigenvalues are positive.

A singular matrix is a square matrix that does not have an inverse.

  • A matrix is singular if and only if its determinant is 0.

  • A matrix has a zero eigenvalue if and only if its determinant is 0.
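
These facts are easy to check numerically. A small sketch (an illustration, not from the slides) using NumPy:

```python
import numpy as np

# A singular matrix: the second row is twice the first, so the
# rows are linearly dependent and no inverse exists.
A = np.array([[1.0, 2.0],
              [2.0, 4.0]])

print(np.linalg.det(A))        # 0 (up to round-off): A is singular
print(np.linalg.eigvalsh(A))   # one eigenvalue is 0 (the other is 5)
```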


Condition Number#

Definition: Let \(\mathbf{A}\) be a positive definite matrix. Then the condition number of \(\mathbf{A}\) is defined by

\[\kappa(\mathbf{A}) = \frac{\lambda_{\text{max}}(\textbf{A})}{\lambda_{\text{min}}(\textbf{A})}\]
  • \(\kappa(\mathbf{A}) \geq 1\), with equality if and only if all eigenvalues of \(\mathbf{A}\) are equal.

  • Matrices with large condition number are ill-conditioned and matrices with small condition number are well-conditioned.

  • A singular matrix has a zero eigenvalue, so by convention its condition number is infinite.
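
As an illustration (not from the slides), the definition can be checked with NumPy: for a symmetric positive definite matrix, the eigenvalue ratio agrees with NumPy's built-in 2-norm condition number.

```python
import numpy as np

# Condition number of an SPD matrix as the ratio of extreme eigenvalues.
A = np.array([[2.0, 1.0],
              [1.0, 2.0]])   # eigenvalues 1 and 3, so kappa = 3

eigs = np.linalg.eigvalsh(A)    # eigenvalues in ascending order
kappa = eigs[-1] / eigs[0]

print(kappa)                    # ratio lambda_max / lambda_min
print(np.linalg.cond(A, 2))     # NumPy's 2-norm condition number agrees
```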


Sensitivity of Solutions#

Sensitivity of gradient descent solution

  • For a linear system \(A \mathbf{x} = b\), the condition number measures the sensitivity of the solution \(\mathbf{x}\) to fluctuations in the observed data \(\mathbf{b}\):

\[\frac{\|\Delta \mathbf{x}\|}{\|\mathbf{x}\|} \leq \kappa(A)\frac{\| \Delta \mathbf{b}\|}{\|\mathbf{b} \|}\]
  • Gradient descent rate of convergence of \(\textbf{x}_k\) to a given stationary point \(\textbf{x}^\ast\) depends on \(\kappa(\nabla^2f(\textbf{x}^\ast))\).

Example: Solve the system

\[\begin{split}\begin{bmatrix}1+10^{-5} & 1\\1 & 1+10^{-5}\end{bmatrix}\mathbf{x} = \mathbf{b}\end{split}\]
  • If \(\mathbf{b}=\begin{bmatrix}1\\1 \end{bmatrix}\), then \(\mathbf{x} = \begin{bmatrix} 0.4999975 \\ 0.4999975 \end{bmatrix}\)

  • If \(\mathbf{b}=\begin{bmatrix}1.01\\1 \end{bmatrix}\), then \(\mathbf{x} = \begin{bmatrix} 500.50249748\\-499.49750251 \end{bmatrix}\)
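
The example is easy to reproduce; the sketch below (an illustration, not from the slides) also checks the sensitivity bound numerically. Here \(\kappa(A) \approx 2\times 10^5\), so a 1% change in \(\mathbf{b}\) can move the solution by orders of magnitude.

```python
import numpy as np

eps = 1e-5
A = np.array([[1 + eps, 1.0],
              [1.0, 1 + eps]])
b = np.array([1.0, 1.0])
b_pert = np.array([1.01, 1.0])     # perturb b by 1% in one entry

x = np.linalg.solve(A, b)
x_pert = np.linalg.solve(A, b_pert)
print(x)        # ~ [0.4999975, 0.4999975]
print(x_pert)   # ~ [500.5, -499.5]

# Check the bound ||dx|| / ||x|| <= kappa(A) * ||db|| / ||b||.
rel_x = np.linalg.norm(x_pert - x) / np.linalg.norm(x)
bound = np.linalg.cond(A) * np.linalg.norm(b_pert - b) / np.linalg.norm(b)
print(rel_x, bound)   # large relative change, but still below the bound
```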


Diagonal Scaling - Motivating Example#

Assume we want to minimize the quadratic function

\[\begin{split}f(x,y) = 1000x^2+40xy+y^2=\mathbf{x}^T A \mathbf{x}, \text{ where } A = \begin{bmatrix} 1000 & 20 \\ 20 & 1 \end{bmatrix}, \mathbf{x}=\begin{bmatrix}x\\ y \end{bmatrix}.\end{split}\]

We can write this problem as \(\min\limits_{\textbf{x}} \{\textbf{x}^\top\textbf{A}\textbf{x}\}\)

  • The condition number of \(\textbf{A}\) is \(\kappa(A)=1668.001\).

    • Using gradient descent, we would expect slow convergence.
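
To make the slowness concrete, here is a minimal gradient-descent sketch (an illustration, not from the slides) for this quadratic with exact line search. The starting point is chosen adversarially, with equal gradient components along both eigenvectors; from such starts steepest descent exhibits its worst-case zigzag rate (a lucky start can converge much faster).

```python
import numpy as np

# Gradient descent with exact line search on f(x) = x^T A x.
# grad f(x) = 2 A x, and for a quadratic the exact stepsize
# along d = -g is t = (g.g) / (2 g.A.g).
A = np.array([[1000.0, 20.0],
              [20.0, 1.0]])

# Worst-case start: equal gradient components along both eigenvectors.
lam, V = np.linalg.eigh(A)
x = V @ (1.0 / lam)

iters = 0
g = 2 * A @ x
while np.linalg.norm(g) > 1e-6 and iters < 100_000:
    t = (g @ g) / (2 * g @ A @ g)   # exact line search
    x = x - t * g
    g = 2 * A @ x
    iters += 1

print(iters)   # thousands of iterations: kappa(A) ~ 1668 makes progress slow
```

Each step shrinks \(f\) by roughly \(\left(\frac{\kappa-1}{\kappa+1}\right)^2 \approx 0.9976\) from this start, so reaching a gradient norm of \(10^{-6}\) takes on the order of \(10^4\) iterations.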


Diagonal Scaling#

Main idea: We have an ill-conditioned \(\mathbf{A}\) but we want to minimize \(\textbf{x}^\top\mathbf{A}\textbf{x}\).

Instead:

  • pick a non-singular matrix \(\mathbf{S}\)

  • set \(\textbf{x} = \mathbf{S}\textbf{y}\)

  • Then minimize the transformed problem:

\[\min\limits_{\textbf{y}} \{(\mathbf{S}\textbf{y})^\top \mathbf{A} (\mathbf{S}\textbf{y})\}\]
  • A well-chosen \(\mathbf{S}\) makes \(\mathbf{S}^\top \mathbf{A}\mathbf{S}\) better conditioned, speeding up convergence; the minimizer of the original problem is recovered as \(\textbf{x}^\ast = \mathbf{S}\textbf{y}^\ast\).

  • Since the gradient of \(g(\textbf{y}) = f(\mathbf{S}\textbf{y})\) is \(\mathbf{S}^\top \nabla f(\mathbf{S}\textbf{y})\), the gradient step in \(\textbf{y}\), mapped back through \(\textbf{x} = \mathbf{S}\textbf{y}\), becomes:

\[\textbf{x}_{k+1} = \textbf{x}_k-t_k\mathbf{S}\mathbf{S}^\top \nabla f(\textbf{x}_k)\]
  • The update depends on \(\mathbf{S}\) only through the product \(\mathbf{S}\mathbf{S}^\top\), so we set \(\mathbf{D} = \mathbf{S}\mathbf{S}^\top\) and choose the positive definite matrix \(\mathbf{D}\) directly.


Scaled Gradient Method#

Input: tolerance parameter \(\varepsilon > 0\).

Initialization: Pick \(\textbf{x}_0 \in \mathbb{R}^n\) arbitrarily.

For \(k = 0, 1, 2, \ldots\) do:

  1. Pick a scaling matrix \(\textbf{D}_k \succ 0\)

  2. Pick a stepsize \(t_k\)

    • For example, using exact line search on the function \(g(t) = f(\textbf{x}_k - t\textbf{D}_k\nabla f(\textbf{x}_k))\).

  3. Set \(\textbf{x}_{k+1} = \textbf{x}_k - t_k\textbf{D}_k\nabla f(\textbf{x}_k)\).

  4. If \(\|\nabla f(\textbf{x}_{k+1})\| \leq \varepsilon\), then STOP and \(\textbf{x}_{k+1}\) is the output.
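
A minimal NumPy sketch of this algorithm (an illustration, not from the slides), applied to the motivating quadratic \(f(\textbf{x}) = \textbf{x}^\top\mathbf{A}\textbf{x}\) with a fixed diagonal scaling built from the Hessian \(2\mathbf{A}\) and exact line search:

```python
import numpy as np

# Scaled gradient method on f(x) = x^T A x (gradient 2 A x, Hessian 2 A),
# with the fixed scaling D = inverse of the Hessian diagonal.
A = np.array([[1000.0, 20.0],
              [20.0, 1.0]])
D = np.diag(1.0 / np.diag(2 * A))   # step 1: scaling matrix D > 0

# A start for which unscaled steepest descent converges very slowly.
lam, V = np.linalg.eigh(A)
x = V @ (1.0 / lam)

iters = 0
g = 2 * A @ x
while np.linalg.norm(g) > 1e-6 and iters < 100_000:   # step 4: stopping rule
    d = -D @ g                                        # scaled descent direction
    t = -(d @ g) / (2 * d @ A @ d)                    # step 2: exact line search
    x = x + t * d                                     # step 3: update
    g = 2 * A @ x
    iters += 1

print(iters)   # far fewer iterations than unscaled gradient descent
```

The scaling reduces the effective condition number from about 1668 to \(\kappa(\mathbf{D}^{1/2}(2\mathbf{A})\mathbf{D}^{1/2}) \approx 4.4\), which is why so few iterations are needed.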


Choosing the Scaling Matrix \(\textbf{D}_k\)#

How to choose \(\textbf{D}_k\)?

  • Newton’s method: \(\textbf{D}_k = (\nabla^2 f(\textbf{x}_k))^{-1}\)

  • Diagonal method: \(\textbf{D}_k = \operatorname{diag}\left(\left(\nabla^2 f(\textbf{x}_k)\right)_{11}^{-1}, \ldots, \left(\nabla^2 f(\textbf{x}_k)\right)_{nn}^{-1}\right)\), i.e., the diagonal of the Hessian inverted entrywise.

  • These choices are popular because they often yield much faster convergence than the unscaled method, at the cost of computing (part of) the Hessian.
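
For a positive definite quadratic the Newton choice is exact: \(\nabla^2 f = 2\mathbf{A}\) is constant, so a single step with \(t = 1\) lands on the minimizer. A small sketch (an illustration, not from the slides):

```python
import numpy as np

# One Newton-scaled step on f(x) = x^T A x: with D = (2A)^{-1},
# x1 = x0 - D (2 A x0) = x0 - x0 = 0, the exact minimizer.
A = np.array([[1000.0, 20.0],
              [20.0, 1.0]])

x0 = np.array([1.0, 1.0])
g = 2 * A @ x0                 # gradient at x0
D = np.linalg.inv(2 * A)       # inverse Hessian
x1 = x0 - 1.0 * D @ g          # one step with t = 1

print(x1)   # [0, 0] up to round-off
```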