Lecture 4-2: Gradient Method: Part 2#

Download the original slides: CMSE382-Lec4_2.pdf

Warning

This is an AI-generated transcript of the lecture slides and may contain errors or inaccuracies. Please refer to the original course materials for authoritative content.


Condition Number, Sensitivity, and Diagonal Scaling#

Topics Covered#

Topics:

  • Condition number

  • Gradient descent solution sensitivity

  • Diagonal scaling

Announcements:

  • Quiz 2 today, end of class

  • No office hours Friday


Recall: Positive Definiteness and Singular Matrices#

Recall two notions from linear algebra: positive definiteness and singularity.

An \(n\times n\) real symmetric matrix \(A\) is called:

  • Positive definite if \(\boldsymbol{x}^T A \boldsymbol{x} > 0\) for every non-zero choice of \(\boldsymbol{x}\).

    • If \(A\) is positive definite, then its diagonal entries are positive.

    • Equivalently, there is an invertible matrix \(B\) such that \(A = B^T B\).

    • \(\mathbf{A}\) is positive definite if and only if all its eigenvalues are positive.

A singular matrix is a square matrix that does not have an inverse.

  • A matrix is singular if and only if its determinant is 0.

  • A matrix has a zero eigenvalue if and only if its determinant is 0.
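
These facts are easy to check numerically. A small sketch (an illustration, not from the slides) using NumPy:

```python
import numpy as np

# A singular matrix: the second row is twice the first, so the
# rows are linearly dependent and no inverse exists.
A = np.array([[1.0, 2.0],
              [2.0, 4.0]])

print(np.linalg.det(A))        # 0 (up to round-off): A is singular
print(np.linalg.eigvalsh(A))   # one eigenvalue is 0 (the other is 5)
```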


Condition Number#

Definition: Let \(\mathbf{A}\) be a positive definite matrix. Then the condition number of \(\mathbf{A}\) is defined by

\[\kappa(\mathbf{A}) = \frac{\lambda_{\text{max}}(\textbf{A})}{\lambda_{\text{min}}(\textbf{A})}\]
  • \(\kappa(\mathbf{A}) \geq 1\), with equality if and only if all eigenvalues of \(\mathbf{A}\) are equal.

  • Matrices with large condition number are ill-conditioned and matrices with small condition number are well-conditioned.

  • A singular matrix has a zero eigenvalue, so by convention its condition number is infinite.
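
As an illustration (not from the slides), the definition can be checked with NumPy: for a symmetric positive definite matrix, the eigenvalue ratio agrees with NumPy's built-in 2-norm condition number.

```python
import numpy as np

# Condition number of an SPD matrix as the ratio of extreme eigenvalues.
A = np.array([[2.0, 1.0],
              [1.0, 2.0]])   # eigenvalues 1 and 3, so kappa = 3

eigs = np.linalg.eigvalsh(A)    # eigenvalues in ascending order
kappa = eigs[-1] / eigs[0]

print(kappa)                    # ratio lambda_max / lambda_min
print(np.linalg.cond(A, 2))     # NumPy's 2-norm condition number agrees
```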


Sensitivity of Solutions#

Sensitivity of gradient descent solution

  • For a linear system \(A \mathbf{x} = b\), the condition number measures the sensitivity of the solution \(\mathbf{x}\) to fluctuations in the observed data \(\mathbf{b}\):

\[\frac{\|\Delta \mathbf{x}\|}{\|\mathbf{x}\|} \leq \kappa(A)\frac{\| \Delta \mathbf{b}\|}{\|\mathbf{b} \|}\]
  • Gradient descent rate of convergence of \(\textbf{x}_k\) to a given stationary point \(\textbf{x}^\ast\) depends on \(\kappa(\nabla^2f(\textbf{x}^\ast))\).

Example: Solve the system

\[\begin{split}\begin{bmatrix}1+10^{-5} & 1\\1 & 1+10^{-5}\end{bmatrix}\mathbf{x} = \mathbf{b}\end{split}\]
  • If \(\mathbf{b}=\begin{bmatrix}1\\1 \end{bmatrix}\), then \(\mathbf{x} = \begin{bmatrix} 0.4999975 \\ 0.4999975 \end{bmatrix}\)

  • If \(\mathbf{b}=\begin{bmatrix}1.01\\1 \end{bmatrix}\), then \(\mathbf{x} = \begin{bmatrix} 500.50249748\\-499.49750251 \end{bmatrix}\)
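
The example is easy to reproduce; the sketch below (an illustration, not from the slides) also checks the sensitivity bound numerically. Here \(\kappa(A) \approx 2\times 10^5\), so a 1% change in \(\mathbf{b}\) can move the solution by orders of magnitude.

```python
import numpy as np

eps = 1e-5
A = np.array([[1 + eps, 1.0],
              [1.0, 1 + eps]])
b = np.array([1.0, 1.0])
b_pert = np.array([1.01, 1.0])     # perturb b by 1% in one entry

x = np.linalg.solve(A, b)
x_pert = np.linalg.solve(A, b_pert)
print(x)        # ~ [0.4999975, 0.4999975]
print(x_pert)   # ~ [500.5, -499.5]

# Check the bound ||dx|| / ||x|| <= kappa(A) * ||db|| / ||b||.
rel_x = np.linalg.norm(x_pert - x) / np.linalg.norm(x)
bound = np.linalg.cond(A) * np.linalg.norm(b_pert - b) / np.linalg.norm(b)
print(rel_x, bound)   # large relative change, but still below the bound
```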


Diagonal Scaling - Motivating Example#

Assume we want to minimize the quadratic function

\[\begin{split}f(x,y) = 1000x^2+40xy+y^2=\mathbf{x}^T A \mathbf{x}, \text{ where } A = \begin{bmatrix} 1000 & 20 \\ 20 & 1 \end{bmatrix}, \mathbf{x}=\begin{bmatrix}x\\ y \end{bmatrix}.\end{split}\]

We can write this problem as \(\min\limits_{\textbf{x}} \{\textbf{x}^\top\textbf{A}\textbf{x}\}\)

  • The condition number of \(\textbf{A}\) is \(\kappa(A)=1668.001\).

    • Using gradient descent, we would expect slow convergence.
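
To make the slowness concrete, here is a minimal gradient-descent sketch (an illustration, not from the slides) for this quadratic with exact line search. The starting point is chosen adversarially, with equal gradient components along both eigenvectors; from such starts steepest descent exhibits its worst-case zigzag rate (a lucky start can converge much faster).

```python
import numpy as np

# Gradient descent with exact line search on f(x) = x^T A x.
# grad f(x) = 2 A x, and for a quadratic the exact stepsize
# along d = -g is t = (g.g) / (2 g.A.g).
A = np.array([[1000.0, 20.0],
              [20.0, 1.0]])

# Worst-case start: equal gradient components along both eigenvectors.
lam, V = np.linalg.eigh(A)
x = V @ (1.0 / lam)

iters = 0
g = 2 * A @ x
while np.linalg.norm(g) > 1e-6 and iters < 100_000:
    t = (g @ g) / (2 * g @ A @ g)   # exact line search
    x = x - t * g
    g = 2 * A @ x
    iters += 1

print(iters)   # thousands of iterations: kappa(A) ~ 1668 makes progress slow
```

Each step shrinks \(f\) by roughly \(\left(\frac{\kappa-1}{\kappa+1}\right)^2 \approx 0.9976\) from this start, so reaching a gradient norm of \(10^{-6}\) takes on the order of \(10^4\) iterations.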


Diagonal Scaling#

Main idea: We have an ill-conditioned \(\mathbf{A}\) but we want to minimize \(\textbf{x}^\top\mathbf{A}\textbf{x}\).

Instead:

  • pick a non-singular matrix \(\mathbf{S}\)

  • set \(\textbf{x} = \mathbf{S}\textbf{y}\)

  • Then minimize the transformed problem:

\[\min\limits_{\textbf{y}} \{(\mathbf{S}\textbf{y})^\top \mathbf{A} (\mathbf{S}\textbf{y})\}\]
  • A well-chosen \(\mathbf{S}\) makes \(\mathbf{S}^\top \mathbf{A}\mathbf{S}\) better conditioned, speeding up convergence; the minimizer of the original problem is recovered as \(\textbf{x}^\ast = \mathbf{S}\textbf{y}^\ast\).

  • Since the gradient of \(g(\textbf{y}) = f(\mathbf{S}\textbf{y})\) is \(\mathbf{S}^\top \nabla f(\mathbf{S}\textbf{y})\), the gradient step in \(\textbf{y}\), mapped back through \(\textbf{x} = \mathbf{S}\textbf{y}\), becomes:

\[\textbf{x}_{k+1} = \textbf{x}_k-t_k\mathbf{S}\mathbf{S}^\top \nabla f(\textbf{x}_k)\]
  • The update depends on \(\mathbf{S}\) only through the product \(\mathbf{S}\mathbf{S}^\top\), so we set \(\mathbf{D} = \mathbf{S}\mathbf{S}^\top\) and choose the positive definite matrix \(\mathbf{D}\) directly.


Scaled Gradient Method#

Input: tolerance parameter \(\varepsilon > 0\).

Initialization: Pick \(\textbf{x}_0 \in \mathbb{R}^n\) arbitrarily.

For \(k = 0, 1, 2, \ldots\) do:

  1. Pick a scaling matrix \(\textbf{D}_k \succ 0\)

  2. Pick a stepsize \(t_k\)

    • For example, using exact line search on the function \(g(t) = f(\textbf{x}_k - t\textbf{D}_k\nabla f(\textbf{x}_k))\).

  3. Set \(\textbf{x}_{k+1} = \textbf{x}_k - t_k\textbf{D}_k\nabla f(\textbf{x}_k)\).

  4. If \(\|\nabla f(\textbf{x}_{k+1})\| \leq \varepsilon\), then STOP and \(\textbf{x}_{k+1}\) is the output.
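
A minimal NumPy sketch of this algorithm (an illustration, not from the slides), applied to the motivating quadratic \(f(\textbf{x}) = \textbf{x}^\top\mathbf{A}\textbf{x}\) with a fixed diagonal scaling built from the Hessian \(2\mathbf{A}\) and exact line search:

```python
import numpy as np

# Scaled gradient method on f(x) = x^T A x (gradient 2 A x, Hessian 2 A),
# with the fixed scaling D = inverse of the Hessian diagonal.
A = np.array([[1000.0, 20.0],
              [20.0, 1.0]])
D = np.diag(1.0 / np.diag(2 * A))   # step 1: scaling matrix D > 0

# A start for which unscaled steepest descent converges very slowly.
lam, V = np.linalg.eigh(A)
x = V @ (1.0 / lam)

iters = 0
g = 2 * A @ x
while np.linalg.norm(g) > 1e-6 and iters < 100_000:   # step 4: stopping rule
    d = -D @ g                                        # scaled descent direction
    t = -(d @ g) / (2 * d @ A @ d)                    # step 2: exact line search
    x = x + t * d                                     # step 3: update
    g = 2 * A @ x
    iters += 1

print(iters)   # far fewer iterations than unscaled gradient descent
```

The scaling reduces the effective condition number from about 1668 to \(\kappa(\mathbf{D}^{1/2}(2\mathbf{A})\mathbf{D}^{1/2}) \approx 4.4\), which is why so few iterations are needed.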


Choosing the Scaling Matrix \(\textbf{D}_k\)#

How to choose \(\textbf{D}_k\)?

  • Newton’s method: \(\textbf{D}_k = (\nabla^2 f(\textbf{x}_k))^{-1}\)

  • Diagonal method: \(\textbf{D}_k = \operatorname{diag}\left(\left(\nabla^2 f(\textbf{x}_k)\right)_{11}^{-1}, \ldots, \left(\nabla^2 f(\textbf{x}_k)\right)_{nn}^{-1}\right)\), i.e., the diagonal of the Hessian inverted entrywise.

  • These choices are popular because they often yield much faster convergence than the unscaled method, at the cost of computing (part of) the Hessian.
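
For a positive definite quadratic the Newton choice is exact: \(\nabla^2 f = 2\mathbf{A}\) is constant, so a single step with \(t = 1\) lands on the minimizer. A small sketch (an illustration, not from the slides):

```python
import numpy as np

# One Newton-scaled step on f(x) = x^T A x: with D = (2A)^{-1},
# x1 = x0 - D (2 A x0) = x0 - x0 = 0, the exact minimizer.
A = np.array([[1000.0, 20.0],
              [20.0, 1.0]])

x0 = np.array([1.0, 1.0])
g = 2 * A @ x0                 # gradient at x0
D = np.linalg.inv(2 * A)       # inverse Hessian
x1 = x0 - 1.0 * D @ g          # one step with t = 1

print(x1)   # [0, 0] up to round-off
```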