Worksheet Chapter 4, Part 1#

Topics#

This section focuses on Sections 4.1 and 4.2 of Beck.

This notebook covers:

  • Descent Direction

  • Gradient Descent Algorithm

%matplotlib inline
import matplotlib.pyplot as plt
import numpy as np

Descent directions#

We consider the unconstrained optimization problem

\[\min_\mathbf{x}f(\mathbf{x}),\]

where \(f:\mathbb{R}^n \to \mathbb{R}\) is continuously differentiable.

Up to now, we have looked for optima analytically. Analytically finding an optimal solution of this unconstrained optimization problem is often done in the following steps:

  • Find all stationary points \(\mathbf{x}\) such that \(\nabla f(\mathbf{x})=\mathbf{0}\).

  • Compare the objective values at these stationary points.

However, finding the analytical solution of the optimization problem is often difficult because (i) the stationary points might be difficult to find, and/or (ii) there may be infinitely many stationary points.

An alternative approach is an iterative algorithm. Our iterative algorithm takes the following form:

\[\mathbf{x}_{k+1}=\mathbf{x}_k+t_k\mathbf{d}_k,\quad k=0,1,2,\cdots\]

This means you have some value for \(\mathbf{x}_0\). Then you use this along with \(t_0\) and \(\mathbf{d_0}\) to find \(\mathbf{x}_1\). Then you use the new \(\mathbf{x}_1\) with \(t_1\) and \(\mathbf{d_1}\) to find \(\mathbf{x}_2\). And so on, rinse and repeat until you decide to stop. This is written in the following pseudocode.

Schematic Descent Directions Method#

  • Initialization: Pick \(\mathbf{x}_0\in\mathbb{R}^n\) arbitrarily

  • General step: For \(k=0,1,2,\dots\),

    • Pick a descent direction \(\mathbf{d}_k\).

    • Find a stepsize \(t_k\) such that \(f(\mathbf{x}_k+t_k\mathbf{d}_k) < f(\mathbf{x}_k)\).

    • Set \(\mathbf{x}_{k+1} = \mathbf{x}_k + t_k \mathbf{d}_k\).

    • Stop if a stopping criterion is satisfied.

So now, we need to figure out how to get \(\mathbf{d}_k\) and \(t_k\) at each step in order to do the updates.
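Before getting into those choices, here is a minimal Python sketch of the schematic method above. This is only an illustration, not the version we build later in the worksheet: it hard-codes the negative gradient as the direction (a choice justified in the next section) and takes a hypothetical `pick_stepsize` callback for the stepsize rule.

```python
import numpy as np

def descent_method(f, grad, pick_stepsize, x0, epsilon=1e-8, max_iter=1000):
    """Schematic descent method; pick_stepsize(f, x, d) returns t_k."""
    x_k = np.asarray(x0, dtype=float)
    for _ in range(max_iter):
        g = grad(x_k)
        if np.linalg.norm(g) <= epsilon:    # stopping criterion
            break
        d_k = -g                            # descent direction (minus gradient)
        t_k = pick_stepsize(f, x_k, d_k)    # stepsize rule is pluggable
        x_k = x_k + t_k * d_k               # update step
    return x_k

# Example: minimize f(x, y) = x^2 + 2y^2 with a constant stepsize of 0.1
f = lambda x: x[0]**2 + 2 * x[1]**2
grad = lambda x: np.array([2 * x[0], 4 * x[1]])
x_star = descent_method(f, grad, lambda f, x, d: 0.1, [2.0, 1.0])
```

The three stepsize rules discussed below all fit this `pick_stepsize` slot.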

How to find \(\mathbf{d}_k\)#

We pick \(\mathbf{d}_k\) to be a descent direction of \(f\) at \(\mathbf{x}_k\). A vector \(\mathbf{d}\in\mathbb{R}^n\) is called a descent direction of \(f\) at \(\mathbf{x}\) if the directional derivative \(f'(\mathbf{x};\mathbf{d})\) is negative, i.e.,

\[f'(\mathbf{x};\mathbf{d}):=\nabla f(\mathbf{x})^\top\mathbf{d}<0\]

When we have one of these, we can move in the direction of \(\mathbf{d}\) because of the following lemma.

Lemma 4.2 (descent property of descent directions). Let \(f:\mathbb{R}^n\rightarrow \mathbb{R}\) be a continuously differentiable function over \(\mathbb{R}^n\). Suppose that \(\mathbf{d}\) is a descent direction of \(f\) at \(\mathbf{x}\). Then there exists \(\varepsilon>0\) such that

\[f(\mathbf{x}+t\mathbf{d})<f(\mathbf{x}),\]

for any \(t\in(0,\varepsilon]\).
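As a quick numerical sanity check of the lemma, consider \(f(x,y)=x^2+2y^2\) (the running example used later in this worksheet) with the negative gradient as the candidate direction:

```python
import numpy as np

# f(x, y) = x^2 + 2y^2, with gradient (2x, 4y)
f = lambda x: x[0]**2 + 2 * x[1]**2
grad = lambda x: np.array([2 * x[0], 4 * x[1]])

x = np.array([2.0, 1.0])
d = -grad(x)              # candidate direction

# d is a descent direction: the directional derivative is negative
print(grad(x) @ d)

# Lemma 4.2: for small enough t > 0, f decreases along d
for t in [0.1, 0.01, 0.001]:
    print(t, f(x + t * d) < f(x))
```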

Our choice for \(\mathbf{d}_k\): While there are lots of possible choices of descent directions, for now we will choose to use minus the gradient at the current point:

\[ \mathbf{d}_k = -\nabla f(\mathbf{x}_k). \]

Not only is this always a descent direction (whenever \(\nabla f(\mathbf{x}_k)\neq\mathbf{0}\)), it is also the direction of steepest descent of \(f\) at \(\mathbf{x}_k\).

Focus on quadratic functions#

While gradient descent is super powerful because it can be used even on functions we can’t solve analytically, we are going to learn how it works by setting up the algorithm to work on a quadratic function of the form

\[ f(\mathbf{x})=\mathbf{x}^\top\mathbf{A}\mathbf{x}+2\mathbf{b}^\top\mathbf{x}+c, \]

where \(\mathbf{A}\) is a symmetric matrix, \(\mathbf{b}\) is a vector, and \(c\) is a scalar.

❓❓❓ Question ❓❓❓:

  • What is \(\nabla f\) for a quadratic function in this form?

  • Given this, what is \(\mathbf{d}_k\) in terms of the input function?

  • Finish the code below to return \(\nabla f\) and \(\mathbf{d}_k\) in terms of the inputs.

Answer the question here.

def gradient(A, b, x):
    pass # <------ Replace this with a function that returns the gradient at x 

def descent_dir(grad_x):
    pass # <------ Replace this with a function that takes the gradient output from above and then returns d_k


# Example to test the function
A = np.array([[3, 2], [2, 6]])
b = np.array([2, -8])
x_k = np.array([0, 0])

grad_x = gradient(A, b, x_k)
d_k = descent_dir(grad_x)
print("Descent direction at x_k:", d_k)
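For checking your answer afterwards, here is one possible completion. It assumes \(\mathbf{A}\) is symmetric, in which case \(\nabla f(\mathbf{x}) = 2(\mathbf{A}\mathbf{x}+\mathbf{b})\):

```python
import numpy as np

def gradient(A, b, x):
    # For f(x) = x^T A x + 2 b^T x + c with A symmetric,
    # the gradient is 2(Ax + b)
    return 2 * (A @ x + b)

def descent_dir(grad_x):
    # Minus the gradient is always a descent direction
    return -grad_x

A = np.array([[3, 2], [2, 6]])
b = np.array([2, -8])
x_k = np.array([0, 0])
print(descent_dir(gradient(A, b, x_k)))   # [-4 16]
```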

How to find the stepsize \(t_k\)#

While there are many, many ways to determine step size, here we will focus on three popular options.

  • We could use a constant stepsize: \(t_k=t\) for all \(k\). However, this doesn’t give us a lot of control. Plus, what value would we even want to choose?

  • Instead, we can use an adaptive stepsize, where the choice of \(t_k\) depends on information about the function at the current \(x_k\). We will look at two options for this.

    • Find \(t_k\) by exact line search, i.e., the value \(t_k = \arg\min_{t\geq 0} f(\mathbf{x}_k+t\mathbf{d}_k)\).

    • Find \(t_k\) using backtracking line search, more on that later.

Gradient descent algorithm#

Now, we can sketch the update algorithm as follows. Before starting, we’ve decided to use \(\mathbf{d}_k = - \nabla f(\mathbf{x}_k)\), and we’ve made a decision about which version of step size method we will do. Then, the algorithm looks like this.

  • Input: \(\varepsilon>0\).

  • Initialization: Pick \(\mathbf{x}_0\in\mathbb{R}^n\) arbitrarily.

  • General step: For \(k=0,1,2,\dots\)

    • Pick a stepsize \(t_k\) based on one of the options.

    • Set \(\mathbf{x}_{k+1}=\mathbf{x}_k-t_k\nabla f(\mathbf{x}_k)\).

    • Stop if \(\|\nabla f(\mathbf{x}_{k+1})\|\leq \varepsilon\).

We will be writing and testing our code on the function

\[\begin{split}f(x,y) = x^2 + 2y^2 = \mathbf{x}^\top \begin{bmatrix} 1 & 0 \\ 0 & 2 \end{bmatrix} \mathbf{x} \end{split}\]

so the goal is to find

\[\min_{\mathbf{x}} f(\mathbf{x}) = \min_{x,y}~x^2 + 2 y^2.\]

This happens to be a case where we know the analytic solution - there is a global minimum at \((0,0)\). We can also see this in the contour plot of the function below.

def plot_our_contour(ax = None):
    if ax is None:
        fig, ax = plt.subplots()

    x = np.linspace(-0.2, 2.3, 1000)
    y = np.linspace(-0.7, 1.8, 1000)
    xmesh, ymesh = np.meshgrid(x,y)
    def f(x):
        return x[0]**2+2*x[1]**2
    fmesh = f(np.array([xmesh, ymesh]))

    cp = ax.contour(xmesh, ymesh, fmesh, np.linspace(-2,9,21))
    ax.clabel(cp, inline=True, fontsize=10)
    ax.set_xlim([-0.2,2.3])
    ax.set_ylim([-0.7,1.8])
    ax.axis('square')
    ax.set_title('Contour Plot of $f(x,y)=x^2+2y^2$')
    ax.set_xlabel('x')
    ax.set_ylabel('y')
    return ax

plot_our_contour()
plt.show();

So for now, pretend we couldn’t do that and we’ll see if the gradient method can find it for us.

Using the constant step size#

First, we are going to set up the gradient descent algorithm with the constant step size. So \(t_k\) will just be some fixed number.

❓❓❓ Question ❓❓❓: Finish the code below to run the gradient algorithm.

def gradient_method_constant(A, b, x0, epsilon, t_step, max_iter = 100):
    """A function to minimize x^t A x + 2 b^t x using the gradient method with a constant step size.

    Args:
        A : A symmetric positive definite matrix
        b : A vector
        x0 : Initial guess for the solution
        epsilon : Tolerance for the stopping criterion
        t_step : Constant step size to use in the gradient method
        max_iter : Maximum number of iterations to perform (default is 100)

    Returns:
        x : The vector of inputs that minimizes the objective function
        fun_val : The value of the objective function at the minimum
        x_all : A list of all iterates generated during the optimization
    """
    # Add an internal function to compute the function value for us 
    def f(x):
        return x.T @ A @ x + 2 * b.T @ x
   
    # Start with initial guess
    x_k = x0
    f_k = f(x_k)  # Initial function value
    
    # Store all iterates (the x_k's used)
    x_all = [x0]
    
    # Keep a counter for how many iterations
    ite = 0 
    grad_k = 2 * (A@x_k + b)
    while np.linalg.norm(grad_k) > epsilon and ite < max_iter:

        #---- 
        # REMOVE BREAK BEFORE WORKING
        # This is to prevent the cell from throwing an error until you start working on the code
        break 
        #----
        
        grad_k = gradient(A, b, x_k)
        t_k = t_step
        d_k = descent_dir(grad_k)
        
        # Update x_k to the next x_k (x_{k+1} in the notation above)
        x_k = np.nan # <------ Update x_k using t_k and d_k 

        # Compute the function value at the new iterate
        f_k = np.nan  # <------ Compute f(x_k) for the new x_k
        
        # Append the new iterate to the list
        x_all.append(x_k)       

        # Print statements that might be helpful for debugging
        # print("iter_number = %3d norm_grad = %2.6f fun_val = %2.6f" %(ite, np.linalg.norm(grad_k), f_k))
        # print(x_k)
        
        ite = ite + 1
        
        if ite >= max_iter:
            print("Maximum number of iterations reached.")
            break
    
    x_soln = x_k
    f_soln = f_k 
    
    return x_soln, f_soln, x_all


# Below is the function call to test the gradient method implementation
A = np.array([[1, 0], [0,2]])
b = np.array([0, 0])
x0 = np.array([2, 1])
epsilon = 1e-10
t_step = 0.1

x_soln, f_soln, x_all = gradient_method_constant(A, b, x0, epsilon, t_step)
print("Solution x:", x_soln)
print("Function value at solution f(x):", f_soln)

Let’s take a look at where the iterates went in this case. Run the code below to show the path taken.

# This re-runs the gradient calculation
x_soln, f_soln, x_all = gradient_method_constant(A, b, x0, epsilon, t_step = 0.1, max_iter=300)

fig, ax = plt.subplots()
ax = plot_our_contour(ax)
ax.set_title(f'Constant step size, {len(x_all)-1} iterations')

#-- This is added to show the path taken by the gradient method --#
x_all_n = np.asarray(x_all)
for i in range(len(x_all) - 1):
    ax.plot(x_all_n[i:i+2].T[0], x_all_n[i:i+2].T[1])
ax.scatter(x_all_n.T[0],x_all_n.T[1])
#---

plt.show()

❓❓❓ Question ❓❓❓: Try different choices of \(t\) in the above code. I recommend \(t \in \{0.01, 0.1, 0.2\}\). What changes with different choices of \(t\)? Warning: Large values of \(t\) can cause the iterates to diverge or the loop to run for a long time.

Answer the question here.

Using the exact line search step size#

Now, we’re going to swap out the constant step size for the exact line search version. The idea here is to follow the direction \(\mathbf{d}_k\) as far as it will go, i.e., until the function \(f\) is no longer decreasing in that direction. This can be written as

\[t_k=\arg\min_{t\geq 0}f(\mathbf{x}_k+t\mathbf{d}_k)\]

For many functions, this is not a minimization problem we can actually solve exactly (hence the backtracking version in the next section). However, in the case of quadratic functions, we have an exact formula for the step size:

\[t_k=\arg\min_{t\geq 0}f(\mathbf{x}_k+t\mathbf{d}_k) =-{\mathbf{d}_k^\top\nabla f(\mathbf{x}_k)\over 2\mathbf{d}_k^\top\mathbf{A}\mathbf{d}_k} ={\|\nabla f(\mathbf{x}_k)\|^2\over 2\nabla f(\mathbf{x}_k)^\top\mathbf{A}\nabla f(\mathbf{x}_k)}\]
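Where does this formula come from? Along the search direction, \(g(t):=f(\mathbf{x}_k+t\mathbf{d}_k)\) is a one-dimensional quadratic in \(t\) (using that \(\mathbf{A}\) is symmetric positive definite, so \(\mathbf{d}_k^\top\mathbf{A}\mathbf{d}_k>0\)):

\[\begin{split} g(t) = f(\mathbf{x}_k) + t\,\mathbf{d}_k^\top\nabla f(\mathbf{x}_k) + t^2\,\mathbf{d}_k^\top\mathbf{A}\mathbf{d}_k, \qquad g'(t_k)=0 \;\Longrightarrow\; t_k = -\frac{\mathbf{d}_k^\top\nabla f(\mathbf{x}_k)}{2\,\mathbf{d}_k^\top\mathbf{A}\mathbf{d}_k}. \end{split}\]

Substituting \(\mathbf{d}_k=-\nabla f(\mathbf{x}_k)\) gives the rightmost expression above.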

❓❓❓ Question ❓❓❓: Finish the code below to determine \(t_k\) for the exact line search from inputs \(\mathbf{x}_k\), \(\mathbf{A}\), and the gradient \(\nabla f(\mathbf{x}_k)\).

def step_size_exact_line(x_k, A, grad_x):
    
    numerator = np.nan  # <------ Compute the numerator of the exact line search formula
    denominator = np.nan  # <------ Compute the denominator of the exact line search formula
    return numerator / denominator

# Example to test the function
A = np.array([[3, 2], [2, 6]])
b = np.array([2, -8])
x_k = np.array([0, 0])
grad_x = gradient(A, b, x_k)
t_k = step_size_exact_line(x_k, A, grad_x)

print("Exact line search step size at x_k:", t_k)

❓❓❓ Question ❓❓❓: Finish the code below to write the gradient algorithm using the exact step size.

def gradient_method_exact_line(A, b, x0, epsilon, max_iter = 100):
    """A function to minimize x^t A x + 2 b^t x using the gradient method with exact line search.

    Args:
        A : A symmetric positive definite matrix
        b : A vector
        x0 : Initial guess for the solution
        epsilon : Tolerance for the stopping criterion
        max_iter : Maximum number of iterations to perform (default is 100)

    Returns:
        x : The vector of inputs that minimizes the objective function
        fun_val : The value of the objective function at the minimum
        x_all : A list of all iterates generated during the optimization
    """
    # Add an internal function to compute the function value for us 
    def f(x):
        return x.T @ A @ x + 2 * b.T @ x
     
    # Start with initial guess
    x_k = x0
    f_k = f(x_k)  # Initial function value
    
    # Store all iterates (the x_k's used)
    x_all = [x0]
    
    # Keep a counter for how many iterations
    ite = 0 
    grad_k = 2 * (A@x_k + b)
    while np.linalg.norm(grad_k) > epsilon and ite < max_iter:
        
        #---
        # DELETE THE BREAK BEFORE WORKING 
        # (this is just to prevent an infinite loop while the code is incomplete)
        break 
        #----
    
        grad_k = gradient(A, b, x_k)
        t_k = np.nan # <------ Compute t_k using exact line search
        d_k = descent_dir(grad_k)
        
        # Update x_k to the next x_k (x_{k+1} in the notation above)
        x_k = np.nan # <------ Update x_k using t_k and d_k 

        # Compute the function value at the new iterate
        f_k = np.nan  # <------ Compute f(x_k) for the new x_k
        
        # Append the new iterate to the list
        x_all.append(x_k)       

        # Print statements that might be helpful for debugging
        # print("iter_number = %3d norm_grad = %2.6f fun_val = %2.6f" %(ite, np.linalg.norm(grad_k), f_k))
        # print(x_k)
        
        ite = ite + 1
        
        if ite >= max_iter:
            print("Maximum number of iterations reached.")
            break
    
    x_soln = x_k
    f_soln = f_k 
    
    return x_soln, f_soln, x_all


# Below is the function call to test the gradient method implementation
A = np.array([[1, 0], [0,2]])
b = np.array([0, 0])
x0 = np.array([2, 1])
epsilon = 1e-10

x_soln, f_soln, x_all = gradient_method_exact_line(A, b, x0, epsilon)
print("Solution x:", x_soln)
print("Function value at solution f(x):", f_soln)

Next, we’ll try to understand what happened over the iterations of the code. Run the cell below as is to generate a contour plot of the function and a visualization of the iterates of the gradient method.

fig, ax = plt.subplots()
ax = plot_our_contour(ax)
ax.set_title(f'Exact Line Search, {len(x_all)-1} iterations')

#-- This is added to show the path taken by the gradient method --#
x_all_n = np.asarray(x_all)
for i in range(len(x_all) - 1):
    ax.plot(x_all_n[i:i+2].T[0], x_all_n[i:i+2].T[1])
ax.scatter(x_all_n.T[0],x_all_n.T[1])
#---

plt.show()

❓❓❓ Question ❓❓❓:

  • What do you notice about the path taken compared to the constant step size?

  • What do you notice about the number of iterations taken compared to the constant step size?

Answer the question here.

Using Backtracking#

The idea behind backtracking is to get decently close to the exact line search solution above without needing an analytical solution (as we have for quadratic functions) and without solving the inner minimization problem (\(t_k=\arg\min_{t\geq 0}f(\mathbf{x}_k+t\mathbf{d}_k)\)) exactly.

At this point, I assume that I am working at some fixed \(\mathbf{x}_k\) and I’ve already decided to use \(\mathbf{d}_k = -\nabla f(\mathbf{x}_k)\). Then we have to pick three things:

  • \(s>0\), an initial trial value for \(t\) that is intentionally too big; we will pull back from it until we actually find the value we want.

  • \(\alpha \in (0,1)\)

  • \(\beta \in (0,1)\).

Because \(\beta\) is between 0 and 1, the powers \(\beta^i\) shrink toward 0. So, we’re going to just keep checking \(t = s \beta^i\) for increasing values of \(i\).

i_list = list(range(10))
s = 2
beta = 0.5
plt.scatter(i_list, [s * beta**i for i in i_list])
plt.title('Exponential Decay of $s\\beta^i$');

Our goal then is to keep increasing \(i\) (or equivalently multiply \(s\beta^{i-1}\) by \(\beta\)) until some condition is satisfied. That condition is the following: we need \(i\) to be large enough so that

\[ f(\mathbf{x}_k) - f(\mathbf{x}_k + s\beta^i \mathbf{d}_k) \geq -\alpha s \beta^i \nabla f(\mathbf{x}_k)^\top \mathbf{d}_k \]

which for our choice of \(\mathbf{d}_k\) simplifies to

\[ f(\mathbf{x}_k) - f(\mathbf{x}_k - s\beta^i \nabla f(\mathbf{x}_k)) \geq \alpha s \beta^i \|\nabla f(\mathbf{x}_k)\|^2. \]

Once we’ve found a large enough \(i\) to do this, we use \(t_k = s \beta^i\).
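Before filling in the worksheet version below, here is a sketch of backtracking on a generic differentiable function. The helper name and test point are our own; it uses the general condition above with an explicit direction argument.

```python
import numpy as np

def backtracking_stepsize(f, grad_x, x, d, s=2.0, alpha=0.25, beta=0.5):
    # Shrink t = s * beta^i until the sufficient-decrease condition
    #   f(x) - f(x + t d) >= -alpha * t * grad_x @ d
    # is satisfied
    t = s
    while f(x) - f(x + t * d) < -alpha * t * (grad_x @ d):
        t = beta * t
    return t

f = lambda x: x[0]**2 + 2 * x[1]**2
x = np.array([2.0, 1.0])
g = np.array([2 * x[0], 4 * x[1]])      # gradient of f at x
t = backtracking_stepsize(f, g, x, -g)  # our choice: d = -gradient
print(t)
```

With these values, \(t=2\) and \(t=1\) fail the condition, so the loop settles on \(t=0.5\).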

❓❓❓ Question ❓❓❓: Add the while loop condition to the code below to compute the backtracking choice for \(t_k\). We’ve made some initial choices for \(s\), \(\alpha\), and \(\beta\) here.

def step_size_backtrack(A, b, x_k, s=2, alpha=1/4, beta=0.5):
    t = s
    # Add an internal function to compute the function value for us 
    def f(x):
        return x.T @ A @ x + 2 * b.T @ x
    
    fun_val = f(x_k)
    grad_x = gradient(A, b, x_k)
    
    while False:  # <------ Replace False with the backtracking condition
        t = beta * t

    return t

# Example usage 
x_k = np.array([2,1])
grad_x = gradient(A, b, x_k)
t = step_size_backtrack(A, b, x_k)

print("Backtracking step size at x_k:", t)

❓❓❓ Question ❓❓❓: Put it all together! Update the code below to do the gradient method, now with backtracking line search.

def gradient_method_backtrack(A, b, x0, epsilon, s=2, alpha=1/4, beta=0.5, max_iter = 100):
    """A function to minimize x^t A x + 2 b^t x using the gradient method with backtracking line search.

    Args:
        A : A symmetric positive definite matrix
        b : A vector
        x0 : Initial guess for the solution
        epsilon : Tolerance for the stopping criterion
        max_iter : Maximum number of iterations to perform (default is 100)
        s : Initial step size for backtracking line search (default is 2)
        alpha : Parameter for backtracking (default is 1/4)
        beta : Step size reduction factor in backtracking (default is 0.5)

    Returns:
        x : The vector of inputs that minimizes the objective function
        fun_val : The value of the objective function at the minimum
        x_all : A list of all iterates generated during the optimization
    """
   
    # Add an internal function to compute the function value for us 
    def f(x):
        return x.T @ A @ x + 2 * b.T @ x
    
    # Start with initial guess
    x_k = x0
    f_k = f(x_k)  # Initial function value
    
    # Store all iterates (the x_k's used)
    x_all = [x0]
    
    # Keep a counter for how many iterations
    ite = 0 
    grad_k = 2 * (A@x_k + b)
    while np.linalg.norm(grad_k) > epsilon and ite < max_iter:
    
        grad_k = gradient(A, b, x_k)
        t_k = 0 # <------ Fix this to compute t_k using backtracking line search
        d_k = descent_dir(grad_k)
        
        # Update x_k to the next x_k (x_{k+1} in the notation above)
        x_k = x_k + t_k * d_k
        
        # Compute the function value at the new iterate
        f_k = f(x_k)
        
        # Append the new iterate to the list
        x_all.append(x_k)       

        # Print statements that might be helpful for debugging
        # print("iter_number = %3d norm_grad = %2.6f fun_val = %2.6f" %(ite, np.linalg.norm(grad_k), f_k))
        # print(x_k)
        
        ite = ite + 1
        if ite >= max_iter:
            print("Maximum number of iterations reached.")
            break
    
    x_soln = x_k
    f_soln = f_k 
    
    return x_soln, f_soln, x_all


# Below is the function call to test the gradient method implementation
A = np.array([[1, 0], [0,2]])
b = np.array([0, 0])
x0 = np.array([2, 1])
epsilon = 1e-10

x_soln, f_soln, x_all = gradient_method_backtrack(A, b, x0, epsilon, s=2, alpha=1/4, beta=0.5)
print("Solution x:", x_soln)
print("Function value at solution f(x):", f_soln)

Run the code below which produces a contour plot and traces the iterates from the gradient method with backtracking line search.

fig, ax = plt.subplots()
ax = plot_our_contour(ax)

x_all_n = np.asarray(x_all)
for i in range(len(x_all) - 1):
    ax.plot(x_all_n[i:i+2].T[0], x_all_n[i:i+2].T[1])
ax.scatter(x_all_n.T[0],x_all_n.T[1])
ax.set_title('Backtracking Line Search, {} iterations'.format(len(x_all)-1))
plt.show()

❓❓❓ Question ❓❓❓: Changing \(\beta\): In the code cell below, rerun the gradient method with backtracking line search and regenerate the contour plot with iterates visualized for different choices of \(\beta\). Try a few choices and see how things change. Jot down your observations below.

# Enter code below.

Enter your observations on effect of different choices of \(\beta\) here.

Check your understanding#

If everything above made sense, you should be able to answer the following questions.

  • Write down the basic update step for the gradient descent algorithm and explain what it means in terms of the function.

  • Name the three choices we discussed for choosing \(t_k\). Explain the basic idea behind each.

  • True or false: For any input function, we need to have \(\mathbf{d}_k = - \nabla f(\mathbf{x}_k)\).

  • What does the \(\beta\) parameter do in the backtracking algorithm?

Additional Examples#

If you have additional time, set up your gradient descent algorithm using the three different step size versions to minimize each of the following functions. What should the answer be? Compare and contrast each result with what you know the solution should be.

  • \(f(x,y) = (x-3)^2 + (y+2)^2\). Use \(\mathbf{x}_0 = (0,0)\).

  • \(f(x,y) = x^2+{1\over100}y^2\). Use \(\mathbf{x}_0 = (0.01, 1)\).
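As a starting hint, each function can be cast into the worksheet’s \(\mathbf{x}^\top\mathbf{A}\mathbf{x}+2\mathbf{b}^\top\mathbf{x}+c\) form. Below is our reading of the two functions, worth double-checking by expanding the squares:

```python
import numpy as np

# f(x, y) = (x-3)^2 + (y+2)^2 = x^T I x + 2 [-3, 2]^T x + 13
A1, b1 = np.eye(2), np.array([-3.0, 2.0])

# f(x, y) = x^2 + y^2 / 100 = x^T diag(1, 1/100) x
A2, b2 = np.diag([1.0, 0.01]), np.zeros(2)

# Sanity check: the minimizer of x^T A x + 2 b^T x solves A x = -b
print(np.linalg.solve(A1, -b1))   # should be near (3, -2)
print(np.linalg.solve(A2, -b2))   # should be near (0, 0)
```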

## Enter your code below.

© Copyright 2025, The Department of Computational Mathematics, Science and Engineering at Michigan State University.