Worksheet Chapter 4, Part 2

Worksheet Chapter 4, Part 2#

Topics#

This section focuses on Chapter 4.3 and 4.4 of Beck

This notebook covers:

Condition Number
Convergence of gradient descent
Diagonal Scaling

%matplotlib inline
import matplotlib.pylab as plt
import numpy as np

Recall: The gradient descent algorithm.#

We are continuing to study ways to find solutions to the minimization problem

\[\min_\mathbf{x}f(\mathbf{x}),\]

where $f(\cdot)$ is continuously differentiable over $\mathbb{R}^n$.

From last time, we went over the gradient descent algorithm, which is sketched here.

Initialization: Pick $\mathbf{x}_0\in\mathbb{R}^n$ arbitrarily
General step: For $k=0,1,2,\dots$,
- Pick a descent direction $\mathbf{d}_k$.
- Find a stepsize $t_k$ such that $f(\mathbf{x}_k+t_k\mathbf{d}_k) < f(\mathbf{x}_k)$.
- Set $\mathbf{x}_{k+1} = \mathbf{x}_k + t_k \mathbf{d}_k$.
- Stop if a stopping criterion is satisfied.

There are lots of decisions to made, in particular:

choice of descent direction.
choice of step size method.

Code from Last Time#

I’m going to copy in the code from last time so that we can use it here. I have made some modifications to the backtrack version so that it will allow us to use functions that aren’t necessarily quadratic by passing in $f$ as a function, as well as passing in a function for its gradient $g$.

# Backtracking line search function

def step_size_backtrack(f, g, x_k,s=2, alpha=1/4, beta = 0.5):
    """A function to compute the step size using backtracking line search.
    Args:
        f : The objective function. Note this should be a function of some input vector x.
        g : The gradient of the objective function. Again this should be a function of some input vector x.
        x_k : Current point
        s : Initial step size (default is 2)
        alpha : Parameter for backtracking (default is 1/4)
        beta : Step size reduction factor in backtracking (default is 0.5)

    Returns:
        t : The step size found using backtracking line search
    """

    t = s
    
    fun_val = f(x_k)
    grad_x = g(x_k)
    
    while fun_val - f(x_k - t * grad_x) < alpha*t*np.linalg.norm(grad_x)**2:   
        t = beta * t

    return t

def gradient_method_backtrack(f, g, x0, epsilon, s=2, alpha=1/4, beta=0.5, max_iter = 100):
    """A function to minimize x^t A x + 2 b^t x using the gradient method with backtracking line search.

    Args:
        f : The objective function. Note this should be a function of some input vector x.
        g : The gradient of the objective function. Again this should be a function of some input vector x. 
        x0 : Initial guess for the solution
        epsilon : Tolerance for the stopping criterion
        max_iter : Maximum number of iterations to perform (default is 100)
        s : Initial step size for backtracking line search (default is 2)
        alpha : Parameter for backtracking (default is 1/4)
        beta : Step size reduction factor in backtracking (default is 0.5)

    Returns:
        x : The vector of inputs that minimizes the objective function
        fun_val : The value of the objective function at the minimum
        x_all : A list of all iterates generated during the optimization
    """
    # Start with initial guess
    x_k = x0
    f_k = f(x_k)  # Initial function value
    
    # Store all iterates (the x_k's used)
    x_all = [x0]
    
    # Keep a counter for how many iterations
    ite = 0 
    grad_k = g(x_k)
    while np.linalg.norm(grad_k) > epsilon and ite < max_iter:
    
        grad_k = g(x_k)
        t_k = step_size_backtrack(f, g, x_k, s, alpha, beta) #<----
        d_k = -grad_k
        
        # Update x_k to the next x_k (x_{k+1} in the notation above)
        x_k = x_k + t_k * d_k
        
        # Compute the function value at the new iterate
        f_k = f(x_k)  
        
        # Append the new iterate to the list
        x_all.append(x_k)       

        ite = ite + 1
        if ite >= max_iter:
            print("Maximum number of iterations reached.")
            break
    
    x_soln = x_k
    f_soln = f_k 
    
    return x_soln, f_soln, x_all

#------
# Example usage for last class's function, x^2 + 2y^2
#------

def f(x):
    return x[0]**2 + 2 * x[1]**2

def g(x):
    return np.array([2*x[0], 4*x[1]])

x0 = np.array([1.0, 1.0])
epsilon = 1e-6

x_soln, f_soln, x_all = gradient_method_backtrack(f, g, x0, epsilon, s=2, alpha=1/4, beta=0.5, max_iter = 100)

print("Solution x:", x_soln)

# Exact line search function and gradient method

def step_size_exact_line(x_k, grad_x):
    
    numerator = np.linalg.norm(grad_x)**2   
    denominator = 2 *np.dot(grad_x, A @grad_x)
    return numerator / denominator

def gradient_method_exact_line(A, b, x0, epsilon, max_iter = 100):
    """A function to minimize x^t A x + 2 b^t x using the gradient method with exact line search.

    Args:
        A : A symmetric positive definite matrix
        b : A vector
        x0 : Initial guess for the solution
        epsilon : Tolerance for the stopping criterion
        max_iter : Maximum number of iterations to perform (default is 100)

    Returns:
        x : The vector of inputs that minimizes the objective function
        fun_val : The value of the objective function at the minimum
        x_all : A list of all iterates generated during the optimization
    """
    # Add an internal function to compute the function value for us 
    def f(x):
        return x.T @ A @ x + 2 * b.T @ x 
    
    def gradient(A, b, x):
        return 2 * (A @ x + b)
    
    # Start with initial guess
    x_k = x0
    f_k = f(x_k)  # Initial function value
    
    # Store all iterates (the x_k's used)
    x_all = [x0]
    
    # Keep a counter for how many iterations
    ite = 0 
    grad_k = 2 * (A@x_k + b)
    while np.linalg.norm(grad_k) > epsilon and ite < max_iter:
    
        grad_k = gradient(A, b, x_k)
        t_k = step_size_exact_line(x_k, grad_k)
        d_k = -grad_k
        
        # Update x_k to the next x_k (x_{k+1} in the notation above)
        x_k = x_k + t_k * d_k
        
        # Compute the function value at the new iterate
        f_k = f(x_k)
        
        # Append the new iterate to the list
        x_all.append(x_k)       

        # Print statements that might be helpful for debugging
        # print("iter_number = %3d norm_grad = %2.6f fun_val = %2.6f" %(ite, np.linalg.norm(grad_k), f_k))
        # print(x_k)
        
        ite = ite + 1
        if ite >= max_iter:
            print("Maximum number of iterations reached.")
            break
    
    x_soln = x_k
    f_soln = f_k 
    
    return x_soln, f_soln, x_all


# Below is the function call to test the gradient method implementation
A = np.array([[1, 0], [0,2]])
b = np.array([0, 0])
x0 = np.array([2, 1])
epsilon = 1e-10

x_soln, f_soln, x_all = gradient_method_exact_line(A, b, x0, epsilon)
print("Solution x:", x_soln)
print("Function value at solution f(x):", f_soln)

Today’s task#

We are going to understand how fast and when the gradient descent method finds a solution. As an example from our previous lecture, gradient descent could find a minimizer of $f(x,y) = x^2 + 2y^2$ in only a few iterations, while trying to find a minimizer of $f(x,y) = x^2 + \frac{1}{100}y^2$ took hundreds of steps.

Condition Numbers#

Consider the quadratic minimization problem

\[\min_{\mathbf{x}} f(\mathbf{x}):={1\over2}\mathbf{x}^\top\mathbf{A}\mathbf{x},\]

where $\mathbf{A}\succ \mathbf{0}$.

We can see that the optimal solution is $\mathbf{x}^* = \mathbf{0}$.

We are going to make two assumptions for the moment:

We are going to use the gradient for the descent direction, so $\mathbf{d}_k = - \nabla f(\mathbf{x}_k) = -2 \mathbf{A} \mathbf{x}_k$.
We are going to use the exact line search, which we know for quadratic functions can be solved by

\[ t_k=\arg\min_{t\geq 0}f(\mathbf{x}_k+t\mathbf{d}_k) ={\mathbf{d}_k^\top\mathbf{d}_k\over 2\mathbf{d}_k^\top\mathbf{A}\mathbf{d}_k} ={\|\nabla f(\mathbf{x})\|^2\over 2\nabla f(\mathbf{x})^\top\mathbf{A}\nabla f(\mathbf{x})}\]

At every update step, we have $\mathbf{x}_{k+1}=\mathbf{x}_k+t_k\mathbf{d}_k$.

After a lot of matrix algebra, we can compute the value of the function at the updated step:

\[f(\mathbf{x}_{k+1})={1\over2}\mathbf{x}_{k+1}\mathbf{A}\mathbf{x}_{k+1}=\left(1-{(\mathbf{d}_k^\top\mathbf{d}_k)^2\over (\mathbf{d}_k^\top\mathbf{A}\mathbf{d}_k)(\mathbf{d}_k^\top\mathbf{A}^{-1}\mathbf{d}_k)}\right)f(\mathbf{x}_k).\]

Then by the Kantorovich inequality (Lem 4.10), we can bound the inside part of this value in terms of the maximum and minimum eigenvalues of $\mathbf{A}$:

\[{(\mathbf{d}_k^\top\mathbf{d}_k)^2\over (\mathbf{d}_k^\top\mathbf{A}\mathbf{d}_k)(\mathbf{d}_k^\top\mathbf{A}^{-1}\mathbf{d}_k)}\geq {4\lambda_{\min}(\mathbf{A})\lambda_{\max}(\mathbf{A})\over \left ( \lambda_{\max}(\mathbf{A}) + \lambda_{\min}(\mathbf{A}) \right )^2} = \]

To make the notation less icky, we will write $\lambda_{m}$ for $\lambda_{min}(\mathbf{A})$ and $\lambda_{M}$ for $\lambda_{max}(\mathbf{A})$.

Then we can see that

\[ f(\mathbf{x}_{k+1}) \leq \left( 1 - {4\lambda_M \lambda_m} \over {(\lambda_M + \lambda_m)^2} \right) f(\mathbf{x}_k) = \left( {\lambda_M - \lambda_m} \over {\lambda_M + \lambda_m} \right)^2f(\mathbf{x}_k). \]

Let’s think about that term on the right side,

\[ c = \left( {\lambda_M - \lambda_m} \over {\lambda_M + \lambda_m} \right)^2 = \left( \frac{\frac{\lambda_M}{\lambda_m} - 1}{\frac{\lambda_M}{\lambda_m} + 1} \right)^2 \]

Whatever the inputs, we are squaring the result, so $c$ is a positive number. The top is also smaller than the bottom of the fraction, so $c$ is also less than 1. That means that repeating the inequality over and over, we know that

\[ f(\mathbf{x}_{k+1}) \leq c^k f(\mathbf{x}_0) \]

so the sequence of functions is bounded above by a decreasing geometric sequence. That means that the rate of convergence is controlled by this $c$ parameter: the smaller $c$ is, the faster the sequence is forced to converge.

Definition#

With that, we can define the condition number

Definition 4.12 (condition number). Let $\mathbf{A}$ be an $n\times n$ positive definite matrix. Then its condition number is defined as

\[\kappa(\mathbf{A}):={\lambda_{\max}(\mathbf{A})\over\lambda_{\min}(\mathbf{A})}.\]

Matrices with a large condition number are called ill-conditioned.
Matrices with a small condition number are called well-conditioned.

Warning: The textbook uses $\chi(\mathbf{A})$ for the condition number, but nearly every other source I have seen uses $\kappa(\mathbf{A})$, so I will use that here.

Example: Rosenbrock Function#

We will see how the condition number controls the convergence with the following example function.

The Rosenbrock function is defined as

\[f(x,y)=100(y-x^2)^2+(1-x)^2.\]

The minimizer is $(x,y)=(1,1)$ with global mimimal value $0$.

The Hessian matrix at $(x,y)=(1,1)$ is

\[\begin{split}\nabla^2f(1,1)=\begin{bmatrix}802&-400\\-400&200\end{bmatrix}\end{split}\]

Take a look at this function on desmos.

❓❓❓ Question ❓❓❓: Write code below to compute the condition number of the Hessian matrix and print it. Is this an ill-conditioned or well-conditioned matrix?

A = np.array([[802, -400],[-400, 200]])

# Write code to compute and print the condition number of A.

❓❓❓ Question ❓❓❓:

The code below for running the gradient method with backtracking line search for the Rosenbrock function and run it.

Is the found solution close to the correct minimizer?
Is the algorithm converging (i.e. finding a solution without hitting your max_iter value)? Does it converge for higher values of max_iter?

# The Rosenbrock function 
f = lambda x: 100*(x[1]-x[0]**2)**2 + (1 - x[0])**2

# The Rosenbrock gradient
g = lambda x: np.array([-400*(x[1]-x[0]**2)*x[0]-2*(1-x[0]), 200*(x[1]-x[0]**2)])

x0 = np.array([2,5])
epsilon = 1e-6

x_soln, f_soln, x_all = gradient_method_backtrack(f, g, x0, epsilon, s=2, alpha=1/4, beta=0.5, max_iter = 300)

print("Solution x:", x_soln)

Run the code below which produces a contour plot of the function and traces the iterates from the gradient method. Trace where the iterates go. Are they going in the right direction? Are they going fast enough to get to the minimum?

x = np.linspace(-1, 3, 1000)
y = np.linspace(-1, 6, 1000)
xmesh, ymesh = np.meshgrid(x,y)
fmesh = f(np.array([xmesh, ymesh]))

fig, ax = plt.subplots(figsize=(8,8))
cp = ax.contour(xmesh, ymesh, fmesh, np.logspace(0,3,10))
#ax.clabel(cp, inline=True, fontsize=10)
ax.set_xlim([-0.2,2.3])
ax.set_ylim([-0.7,1.8])
x_all_n = np.asarray(x_all)
plt.plot(x_all_n.T[0],x_all_n.T[1], marker = '.', label = 'Iterates')


plt.plot([1],[1],'go',markersize=10,label='Minimum at (1,1)')
ax.axis('square')
ax.set_title('Contour Plot')
ax.set_xlabel('x')
ax.set_ylabel('y')

plt.legend()
plt.show()

# print(x_all_n)

Sensitivity of Solutions to Linear Systems#

We are given the linear system $\mathbf{A}\mathbf{x}=\mathbf{b}$, and $\mathbf{A}$ is positive definite. We know that the solution is $\mathbf{x}=\mathbf{A}^{-1}\mathbf{b}$. However, if $\mathbf{b}$ is perturbed by a little bit and the right hand side becomes $\mathbf{b}+\Delta\mathbf{b}$. Then the solution is $\mathbf{x}+\Delta\mathbf{x}=\mathbf{A}^{-1}(\mathbf{b}+\Delta\mathbf{b})$.

We can show that

\[{\|\Delta\mathbf{x}\|_2\over\|\mathbf{x}\|_2}\leq \kappa(\mathbf{A}){\|\Delta\mathbf{b}\|_2\over\|\mathbf{b}\|_2}.\]

Consider the example system below and its solution.

A = np.array([[1+1e-5, 1],[1, 1+1e-5]])
b = np.array([1,1])

x = np.linalg.solve(A,b)
x

Add the perturbation defined below to $\mathbf{b}$ and recompute the solution of the system of equations. What do you observe? Is there a large or small difference in the solution compared to above? Why?

db = np.array([0.01,0])

#Write code to recompute solution to linear system below

✎ Enter your reply and explanation here.

Diagonal Scaling#

So now we consider some optimization problem

\[\min_{\mathbf{x}}f(\mathbf{x}).\]

But, we change the variable to $\mathbf{y}$ with $\mathbf{x}=\mathbf{S}\mathbf{y}$ with invertible $\mathbf{S}$. Then the problem becomes $$\min_{\mathbf{y}}f(\mathbf{S}\mathbf{y}).$$

Then gradient update becomes

Letting $\mathbf{x}_k=\mathbf{S}\mathbf{y}_k$, we get an update of $\mathbf{x}_k$ as

\[ \mathbf{x}_{k+1}=\mathbf{x}_k-t_k\mathbf{S}\mathbf{S}^\top \nabla f(\mathbf{x}).\]

So we don’t actually care about what $\mathbf{S}$ is, only what $\mathbf{S}\mathbf{S}^\top$ is. Let $\mathbf{D}=\mathbf{S}\mathbf{S}^\top$ be a symmetric positive definite matrix, then we obtain the scaled gradient method with the scaling matrix $\mathbf{D}$:

\[\mathbf{x}_{k+1}=\mathbf{x}_k-t_k\mathbf{D}\nabla f(\mathbf{x}_k).\]

❓❓❓ Question ❓❓❓: You’ve just chosen to use $\mathbf{d_k} = -\mathbf{D} \nabla f(\mathbf{x}_k)$ as your descent direction.

The directional derivative of $f$ at $\mathbf{x}_k$ in direction $v$ is $f'(\mathbf{x}; \mathbf{v}) = \nabla f(\mathbf{x})^\top \mathbf{v}$. So what is $f'(\mathbf{x}_k; \mathbf{d}_k)$?
We are assuming that $\mathbf{D}$ is positive definite. Assume also that $\mathbf{x}_k$ is not a stationary point, so $\nabla f(\mathbf{x}_k)\neq \mathbf{0}$. Is $f'(\mathbf{x}; \mathbf{v}) = \nabla f(\mathbf{x})^\top \mathbf{v}$ strictly positive ($>0$), positive ($\geq 0$), strictly negative ($<0$) or negative ($\leq 0$)?
Conclude from your statement above that $\mathbf{d_k} = -\mathbf{D} \nabla f(\mathbf{x}_k)$ is a descent direction

Your notes here

The scaled gradient method#

Using the above, our scaled gradient method is as follows.

Input: $\varepsilon>0$.
Initialization: Pick $\mathbf{x}_0\in\mathbb{R}^n$ arbitrarily.
General step: For $k=0,1,2\cdots$
- Pick a scaling matrix $\mathbf{D}_k\succ \mathbf{0}$.
- Pick a stepsize $t_k$ based on the exact line method $$g(t)=f(\mathbf{x}_k-t\mathbf{D}\nabla f(\mathbf{x}_k)).$$
- Set $\mathbf{x}_{k+1}=\mathbf{x}_k-t_k\mathbf{D}\nabla f(\mathbf{x}_k)$.
- Stop if $\|\nabla f(\mathbf{x}_{k+1})\|\leq \varepsilon$.

So now we’re left with figuring out how to do that first step.

How to choose $\mathbf{D}$?#

There are a multitude of ways to pick the scaling matrix $\mathbf{D}$. In this notebook we will focus on the diagonal method. Specifically, we will let $\mathbf{D}$ be a diagonal matrix with

\[D_{ii}=\frac{1}{(\nabla^2 f(\mathbf{x}_k))_{ii}}\]

Note that in the case of a quadratic function, we already know that the Hessian is

\[ \nabla^2 f(\mathbf{x}_k) = 2\mathbf{A} \]

so $$D_{ii} = \frac{1}{2\mathbf{A}_{ii}}$$

In the code, the factor 2 is absorbed into the step-size formula so you can do the same update by using instead

\[D_{ii} = \frac{1}{\mathbf{A}_{ii}}\]

as we will do below. In this case, only the product $t_k \mathbf{D}$ matters.

Example 4.15.#

Consider the problem

\[\min_{x,y}1000x^2+40xy+y^2.\]

The corresponding matrix $\mathbf{A}$, vector $\mathbf{b}$, initial iterate and stopping criterion for gradient descent for the quadratic problem set up are provided in the code below.

A = np.array([[1000, 20], [20, 1]])
b = np.array([0, 0])
x0 = np.array([1, 1000.])
epsilon = 1e-5

❓❓❓ Question ❓❓❓:

Write code to compute the condition number of $\mathbf{A}$ and print it. Is this an ill-conditioned matrix?

## Write code below.

✎ Comment on the condition number and iterations it took the algorithm to reach stopping condition.

❓❓❓ Question ❓❓❓:

What is $\mathbf{D}$?

# Save your D here 

# D = ....

❓❓❓ Question ❓❓❓:

Before we get to the diagonal scaling, write code below to run the gradient_method_exact_line function (saved at the top of this notebook) with the above choices of inputs and print the solution. Note that information on the function can be found below. How many iterations did it take?

?gradient_method_exact_line

## Enter your code below.

❓❓❓ Question ❓❓❓:

Complete the code provided below for the scaled gradient method for this quadratic problem. It uses exact line search for the step size.

def gradient_method_scaled_quadratic(A, b, D, x0, epsilon, max_iter = 100):
    """A function to minimize x^t A x + 2 b^t x using the scaled gradient method with exact line search.

    Args:
        A : A symmetric positive definite matrix
        b : A vector
        D : A symmetric positive definite scaling matrix
        x0 : Initial guess for the solution
        epsilon : Tolerance for the stopping criterion
        max_iter : Maximum number of iterations to perform (default is 100)

    Returns:
        x : The vector of inputs that minimizes the objective function
        fun_val : The value of the objective function at the minimum
        x_all : A list of all iterates generated during the optimization
    """
    # Add an internal function to compute the function and gradient value for us 
    def f(x):
        return x.T @ A @ x + 2 * b.T @ x 
    
    def gradient(A, b, x):
        return 2 * (A @ x + b)
    
    # Start with initial guess
    x_k = x0
    f_k = f(x_k)  # Initial function value
    grad_k = gradient(A, b, x_k)
    
    # Store all iterates (the x_k's used)
    x_all = [x0]
    
    # Keep a counter for how many iterations
    ite = 0 
    
    while np.linalg.norm(grad_k) > epsilon and ite < max_iter:
    
        # ----
        # Remove break before you start working 
        break 
        # ----
        
        grad_k = gradient(A, b, x_k)
        t_k = np.asarray(grad_k.T@D@grad_k/(2*grad_k.T@D@A@D@grad_k))
        # t_k = step_size_exact_line(x_k, grad_k)
        
        # Update x_k to the next x_k (x_{k+1} in the notation above)
        x_k = None #<---- Update this line
        
        # Compute the function value at the new iterate
        f_k = f(x_k)
        
        # Append the new iterate to the list
        x_all.append(x_k)       

        # Print statements that might be helpful for debugging
        # print("iter_number = %3d norm_grad = %2.6f fun_val = %2.6f" %(ite, np.linalg.norm(grad_k), f_k))
        # print(x_k)
        
        ite = ite + 1
        if ite >= max_iter:
            print("Maximum number of iterations reached.")
            break
    
    x_soln = x_k
    f_soln = f_k 
    
    return x_soln, f_soln, x_all

x, fun_val, x_all = gradient_method_scaled_quadratic(A, b, np.array([[1/1000,0],[0,1]]), x0, epsilon)
print("The solution is ", x)
print("The number of iterations is ", len(x_all)-1)

❓❓❓ Question ❓❓❓: Compare the scaled version with the exact line version above. How many iterations did the algorithm take in each case? Fewer or more than the usual gradient descent?

✎ Comment on number of iterations and compare to previous method.

Worksheet Chapter 4, Part 2

Contents

Worksheet Chapter 4, Part 2#

Topics#

Recall: The gradient descent algorithm.#

Code from Last Time#

Today’s task#

Condition Numbers#

Definition#

Example: Rosenbrock Function#

Sensitivity of Solutions to Linear Systems#

Diagonal Scaling#

The scaled gradient method#

How to choose \(\mathbf{D}\)?#

Example 4.15.#