Convex optimization techniques such as stochastic gradient descent (SGD), Newton's method, and their variants are widely used in machine learning applications. Perhaps most notable is the widespread use of convex optimization techniques for minimizing neural network loss functions. Convex optimization is a well-studied field, with many theoretical and practical guarantees on algorithm convergence for convex objective functions. Unfortunately, given the non-convex nature of most loss landscapes, none of these theoretical analyses apply in the context of neural network training.

This means that in the context of machine learning, the efficiency of convex optimization algorithms is measured completely empirically. This experiment-based analysis poses an issue for designing effective optimization algorithms, since often the only way to measure performance is by running the algorithms on complex real-world problems. Even then, an algorithm that performs well on a handful of problems is not guaranteed to perform well on others. While authors may argue that their algorithms should heuristically perform better on some class of neural network loss functions, it is almost impossible to make any theoretical guarantees.

To address this issue, we present an empirical analysis of the optimization of four analytic objective functions by four well-known optimization algorithms. Each of the four objective functions in this paper has been carefully designed to exhibit some property that contrasts with those of a "nice" convex function. Based on literature and industry usage, the four optimization algorithms considered here are SGD, L-BFGS, AdaGrad, and Adam.

In what follows, we will first introduce the four algorithms of interest, along with a summary of their convergence properties for convex functions. Next, we will describe the construction and rationale behind the four analytic objective functions to be tested, along with the parameters of those functions that will be varied throughout the analysis. We will then describe the experiment and data collection process. Finally, we will present visualizations of the collected data and analyze the results.

For this analysis, we consider four algorithms commonly used in machine learning, specifically, for training neural networks. They are stochastic gradient descent (SGD), limited-memory BFGS (L-BFGS), adaptive gradients (AdaGrad), and adaptive moment estimation (Adam). For the sake of completeness, we will begin by summarizing these four algorithms and presenting their theoretical convergence guarantees for convex objective functions.

SGD

The difference between SGD and the standard gradient descent algorithm is that SGD only assumes access to an approximation \(g\) of the true gradient \(\nabla f\). In theory, the only condition on the approximation \(g\) is that its expected value satisfies $$ \mathbb{E}[g(x)] = \nabla f(x) $$ for all \(x\). Because \(g\) is only an approximation to \(\nabla f\), it is possible that each \(g(x^{(k)})\) could actually be an ascent direction, making the convergence of SGD non-monotone, even for strongly convex functions. Because of these random "bad directions," for a fixed step size \(\alpha^{(k)} = \alpha\), SGD can only converge to within some radius of the true optimum \(x^\star\), at which point its convergence stalls. To achieve absolute convergence, the step size must therefore be decayed. In practice, this means that after holding \(\alpha\) constant for some number of iterations, \(\alpha\) should be decayed by some factor \(\tau\).
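The loop described above can be sketched in a few lines of NumPy. This is an illustrative sketch, not the experiment code: the function name, defaults, and the deterministic quadratic example (whose true gradient is \(2x\)) are our own.

```python
import numpy as np

def sgd(g, x0, alpha=0.1, tau=0.5, decay_every=100, steps=500):
    """SGD with step-size decay: every `decay_every` iterations the
    step size is multiplied by the decay factor tau."""
    x = np.asarray(x0, dtype=float)
    for k in range(steps):
        x = x - alpha * g(x)         # g may be a stochastic estimate
        if (k + 1) % decay_every == 0:
            alpha *= tau             # shrink the stall radius
    return x

# With a deterministic gradient (here 2x), SGD reduces to gradient descent.
x_star = sgd(lambda x: 2 * x, [3.0, -2.0])
```

With a stochastic `g`, the decay schedule is what lets the iterates settle instead of bouncing inside a fixed radius of \(x^\star\).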

For a strongly convex function and a deterministic gradient, SGD reduces to standard gradient descent and its convergence is linear. That is, after \(t\) iterations, $$ |f(x^{(t)}) - f(x^\star)| \approx \mathcal{O}\left( c^t \right) $$ for some constant \(0 < c < 1\). If \(g\) is a stochastic estimate of \(\nabla f\), the convergence rate is reduced to \(\mathcal{O}\left(\frac{1}{t}\right)\). If furthermore the objective function is only convex (as opposed to strongly convex), this rate is further reduced to \(\mathcal{O}\left(\frac{1}{\sqrt{t}}\right)\).

In general, quasi-Newton methods use an approximation to the Hessian \(\nabla^2 f\) to allow for bigger step sizes in directions of low curvature. For a perfect quadratic, this allows Newton's method to "jump" straight to the minimum; for strongly convex functions, it accommodates poorly conditioned objective functions by normalizing the sub-level sets of \(f\). The classic Newton update can be derived from a second-order Taylor approximation to \(f\), and is given by: $$ x^{(k+1)} = x^{(k)} - \left(\nabla^2 f(x^{(k)})\right)^{-1}\nabla f(x^{(k)}). $$

The original Broyden-Fletcher-Goldfarb-Shanno (BFGS) algorithm iteratively refines an approximation to the Hessian matrix \(H^{(k)}\) by applying low-rank updates to its current approximation, each of which satisfies the secant condition: $$ H^{(k)}\left(x^{(k)}-x^{(k-1)}\right) = \nabla f(x^{(k)}) - \nabla f(x^{(k-1)}). $$ Intuitively, this can be thought of as iteratively refining the Hessian based on a planar fit to each observed gradient. The Newton update is defined in terms of the inverse Hessian, so BFGS avoids performing a matrix inversion by leveraging the Sherman-Morrison matrix identity: $$ (H + xy^T)^{-1} = H^{-1} - \frac{H^{-1}xy^T H^{-1}}{1 + y^T H^{-1} x} $$ where \(xy^T\) is a rank-1 matrix. By leveraging this formula, BFGS keeps the iteration cost cheap, since a rank-1 update of the inverse is significantly cheaper than a full matrix inversion.
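The inverse-update identity can be checked numerically. The snippet below is a standalone sketch with small, arbitrary example matrices (not taken from the experiment code), showing that the rank-1 correction reproduces the inverse of the updated matrix.

```python
import numpy as np

# A matrix whose inverse we already hold, plus a rank-1 change x y^T.
H = 2.0 * np.eye(3)
H_inv = np.linalg.inv(H)
x = np.array([1.0, 0.0, 1.0])
y = np.array([0.5, 1.0, 0.0])

# Sherman-Morrison: refresh the inverse in O(n^2) operations instead of
# re-inverting H + x y^T from scratch at O(n^3) cost.
Hx = H_inv @ x                    # H^{-1} x
denom = 1.0 + y @ Hx              # 1 + y^T H^{-1} x
updated_inv = H_inv - np.outer(Hx, y @ H_inv) / denom
```

Multiplying `updated_inv` by `H + np.outer(x, y)` recovers the identity matrix, confirming the formula.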

L-BFGS

For a strongly convex objective function, L-BFGS converges superlinearly to the
optimum \(x^\star\).
That is, L-BFGS is faster than linear convergence but slower than the quadratic
convergence rate \(\mathcal{O}\left( c^{2^t} \right)\) (for some constant
\(0 < c < 1\)).
When convexity assumptions are dropped, L-BFGS has no convergence guarantees.
In fact, in the presence of local maxima and saddle points, most quasi-Newton
methods can converge to these stationary points just as readily as to minima.

AdaGrad

The \(k\)th variance estimate \(G^{(k)}\) is given by $$ G^{(k)} = \operatorname{diag}\left(\sqrt{\sum_{n=1}^k \left(g^{(n)}\right)^2} + \varepsilon\right) $$ where each \(g^{(n)}\) is a previous gradient (estimate) and \(\varepsilon\) is a "fudge" factor, introduced to prevent \(G\) from becoming singular in degenerate cases. Note that since \(G^{(k)}\) is diagonal, it can be readily inverted. Also, by storing \(\left(G^{(k)} - \varepsilon I\right)^2\) (the running sum of squared gradients) and performing the square root and "fudging" operations on demand, \(G\) can be iteratively refined without tracking previous gradients. To normalize the sublevel sets and improve the conditioning of \(f\), a scaled version of \(G\) can be directly plugged in for \(H\) in the quasi-Newton update: $$ x^{(k+1)} = x^{(k)} - \alpha^{(k)} \left(G^{(k)}\right)^{-1} g(x^{(k)}) $$ where \(\alpha^{(k)}\) is a step size, and \(g\) is an approximation to \(\nabla f\) in the stochastic case and \(g = \nabla f\) in the deterministic case. More intuitively, AdaGrad can also be thought of as a trust region method, where the variance estimate \(G\) allows for larger steps in directions of low variance.
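Because \(G\) is diagonal, the whole update reduces to elementwise arithmetic. The following is a minimal sketch of that update (the function name, defaults, and the quadratic test problem are illustrative; the paper's AutoGrad-generated gradients are replaced by a hand-written one).

```python
import numpy as np

def adagrad(g, x0, alpha=0.01, eps=1e-6, steps=1000):
    """Diagonal AdaGrad: per-dimension step sizes derived from the
    accumulated (un-fudged) squared gradients."""
    x = np.asarray(x0, dtype=float)
    sq_sum = np.zeros_like(x)        # running sum of squared gradients
    for _ in range(steps):
        grad = g(x)
        sq_sum += grad ** 2          # refine the diagonal variance estimate
        x = x - alpha * grad / (np.sqrt(sq_sum) + eps)
    return x

# Illustrative run on a simple quadratic (true gradient 2x).
x_final = adagrad(lambda x: 2 * x, [3.0, -2.0], alpha=0.5, steps=5000)
```

Dividing by the square root of the accumulated squared gradients is exactly the \(\left(G^{(k)}\right)^{-1}\) multiplication, dimension by dimension.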

AdaGrad is guaranteed the same convergence rate as SGD, but the constant terms hidden by the Big-O notation are significantly better for AdaGrad when the problem is poorly conditioned.

Adam

Leveraging the variance estimate \(G\) used by AdaGrad, Adam applies momentum not only to the gradient (first-moment) estimate, but also to the variance (second-moment) estimate. The Adam update can therefore be summarized by: $$ x^{(k+1)} = x^{(k)} - \alpha \left(\sqrt{\beta_2 \left(G^{(k)}\right)^2 + (1-\beta_2)\, g^2(x^{(k)})}\right)^{-1} \left(\beta_1 \hat{g}^{(k)} + (1-\beta_1)\, g(x^{(k)})\right) $$ where \(\beta_1\) and \(\beta_2\) are the first- and second-moment momentum coefficients respectively, \(\hat{g}^{(k)}\) is the running first-moment estimate, and \(G^{(k)}\) is the \(k\)th "non-fudged" variance estimate from AdaGrad. Large momentum coefficients are most helpful for noisy and poorly conditioned problems. However, if the momentum coefficient is too large with respect to the step size \(\alpha\), Adam can fail to converge. Most interesting problems are noisy and poorly conditioned, and most algorithms tend to converge well for any well-conditioned problem. So, it is common practice to set \(\beta_1 \approx 1\) and \(\beta_2 \approx 1\), then choose the largest convergent step size \(\alpha\).
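As a concrete reference, a minimal sketch of the Adam loop in the standard exponential-moving-average form is shown below (bias correction is omitted for brevity, and the function name, defaults, and quadratic test problem are our own illustrations, not the experiment code).

```python
import numpy as np

def adam(g, x0, alpha=0.01, beta1=0.9, beta2=0.99, eps=1e-8, steps=5000):
    """Adam sketch: exponential moving averages of the first and second
    moments of the gradient (bias correction omitted)."""
    x = np.asarray(x0, dtype=float)
    m = np.zeros_like(x)   # first-moment (momentum) estimate
    v = np.zeros_like(x)   # second-moment (variance) estimate
    for _ in range(steps):
        grad = g(x)
        m = beta1 * m + (1 - beta1) * grad
        v = beta2 * v + (1 - beta2) * grad ** 2
        x = x - alpha * m / (np.sqrt(v) + eps)
    return x

# Illustrative run on a simple quadratic (true gradient 2x).
x_final = adam(lambda x: 2 * x, [3.0, -2.0])
```

Setting `beta1` and `beta2` close to 1 makes the moving averages long-lived, which is the practice recommended above for noisy, poorly conditioned problems.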

Similarly to AdaGrad, Adam converges at the same rate as SGD but with more favorable hidden constants when the problem is poorly conditioned.

In order to empirically evaluate each optimization technique, we present four analytic functions for minimization, each of which has a single global minimum but differs in its other properties. Roughly speaking, they are presented in order of increasing difficulty.

*Me Jr.* is a convex sub-quadratic function that appears to come to a sudden point. Quadratic estimators will tend to overshoot the minimum of this function when stepping. The function is defined as

$$ \sum_{i=1}^{d} |x_i|^{\frac{2k}{2k-1}} $$

where \(d\) is the dimension of the problem and \(k > 1\). A one dimensional plot of the function and its derivative looks like:
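A direct NumPy implementation of this objective might look like the following (a sketch; the function name is ours, and the paper's AutoGrad-generated gradient is not reproduced here).

```python
import numpy as np

def me_jr(x, k=2):
    """Subquadratic objective: sum of |x_i|^(2k/(2k-1)). For k=2 the
    exponent is 4/3, so the surface comes to a near-point at the origin."""
    x = np.asarray(x, dtype=float)
    return np.sum(np.abs(x) ** (2 * k / (2 * k - 1)))
```

Because the exponent is barely above 1, the gradient magnitude stays bounded away from zero near the minimum, which is what defeats quadratic step estimators.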

The basin is a convex super-quadratic function used to mimic a phenomenon observed in practice where the region surrounding an optimal point has a gradient whose magnitude goes to zero at a rapidly decreasing rate. Visually, this manifests as a 'flattening' surrounding the global minimum. The function is defined as

$$ \sum_{i=1}^{d} x_i^{2k} $$

where \(d\) is the dimension of the problem and \(k > 1\). A one dimensional plot of the function and its derivative looks like:
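In code, the basin objective is a one-liner (again a sketch with an illustrative name, not the experiment code).

```python
import numpy as np

def basin(x, k=2):
    """Superquadratic basin: sum of x_i^(2k). The gradient 2k * x_i^(2k-1)
    vanishes rapidly near the origin, producing the 'flattening'."""
    x = np.asarray(x, dtype=float)
    return np.sum(x ** (2 * k))
```

For \(k=2\) the gradient decays like \(x^3\) near the origin, so first-order methods slow to a crawl in the flat region.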

In problems with more than tens of dimensions, non-uniform curvature across dimensions becomes increasingly likely. When dimensions have opposing curvature, *saddle points* are created, and recent work suggests that saddle points, rather than local minima, are the dominant obstacle in high-dimensional optimization. The function is defined as

$$ \sum_{i=1}^{d} \left( \frac{s^4 x_i^2}{2} - \frac{s^2 x_i^4}{2} + \frac{x_i^6}{6} \right), $$

where \(d\) is the dimension of the problem and \(s\) is a constant giving the absolute location \(\pm s\) of the saddle points in each dimension. A one dimensional plot of the function and its derivative looks like:
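The stationary-point structure is easy to verify in code, since the per-dimension derivative factors as \(x(x^2 - s^2)^2\). The names below are illustrative sketches, not the experiment implementation.

```python
import numpy as np

def saddle(x, s=0.75):
    """Saddle objective: sum over dimensions of
    s^4 x^2 / 2 - s^2 x^4 / 2 + x^6 / 6."""
    x = np.asarray(x, dtype=float)
    return np.sum(s**4 * x**2 / 2 - s**2 * x**4 / 2 + x**6 / 6)

def saddle_grad(x, s=0.75):
    """Per-dimension derivative, which factors as x (x^2 - s^2)^2,
    giving stationary points at 0 and at +/- s in every dimension."""
    x = np.asarray(x, dtype=float)
    return x * (x**2 - s**2) ** 2
```

The squared factor means the gradient vanishes at \(\pm s\) without changing sign there, which is exactly the degenerate flatness that traps gradient-based methods.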

Many problems that require optimization have local minima. Using Chebyshev polynomials, we construct a function with a prescribed number of local minima per dimension, so that the total number of minima grows exponentially with increasing dimension. The function is defined as

$$ \sum_{i=1}^{d} \big[ 1 + a x_i^2 + f_{2m+1}(x_i) \big], $$ $$ \begin{align} f_0(x_i) &= 1, \\ f_1(x_i) &= x_i, \\ f_{n+1}(x_i) &= 2 x_i f_{n}(x_i) - f_{n-1}(x_i) \end{align} $$

where \(d\) is the dimension of the problem, \(a\) is a multiplier determining the relative effect of the quadratic term, and \(m\) sets the number of local minima per dimension.
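The three-term recurrence above translates directly into code. This is a sketch with our own function names (\(f_n\) is the degree-\(n\) Chebyshev polynomial of the first kind), not the experiment implementation.

```python
import numpy as np

def chebyshev(n, x):
    """Evaluate f_n(x) via the three-term recurrence
    f_{n+1} = 2 x f_n - f_{n-1}, with f_0 = 1 and f_1 = x."""
    x = np.asarray(x, dtype=float)
    f_prev, f_curr = np.ones_like(x), x
    if n == 0:
        return f_prev
    for _ in range(n - 1):
        f_prev, f_curr = f_curr, 2 * x * f_curr - f_prev
    return f_curr

def multimin(x, a=2, m=3):
    """Sum over dimensions of 1 + a x_i^2 + f_{2m+1}(x_i)."""
    x = np.asarray(x, dtype=float)
    return np.sum(1 + a * x**2 + chebyshev(2 * m + 1, x))
```

The odd-degree polynomial \(f_{2m+1}\) oscillates \(2m+1\) times on \([-1, 1]\), and the quadratic term tilts those oscillations so that only one of the resulting minima per dimension is global.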

We use the following transformations to analyze further properties of the functions.

The analytic functions that we have defined thus far all have
well-defined, deterministic gradients almost everywhere. In most
machine learning applications, a subset of the total training data
volume is used in each gradient evaluation, resulting in a stochastic
estimate of the true gradient. To simulate this reality, we add
various amounts of uniform random noise to each gradient evaluation.
SGD and other first order methods (such as Adam and AdaGrad) are
expected to still converge under these conditions.
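This noise transformation amounts to wrapping each gradient function. The sketch below (illustrative names, not the experiment code) preserves the unbiasedness condition \(\mathbb{E}[g(x)] = \nabla f(x)\) because uniform noise on \([-\text{scale}, \text{scale}]\) has zero mean.

```python
import numpy as np

rng = np.random.default_rng(0)

def noisy(grad_f, scale):
    """Wrap a deterministic gradient with additive, zero-mean uniform
    noise per component, simulating minibatch gradient estimates."""
    def g(x):
        x = np.asarray(x, dtype=float)
        return grad_f(x) + rng.uniform(-scale, scale, size=x.shape)
    return g
```

Averaging many evaluations of the wrapped gradient recovers the true gradient, which is exactly the property the convergence analyses of the first order methods rely on.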

The condition number of a sub-level set \(C\) of a function \(f\) is defined as the ratio between the maximum diameter \(W_{max}\) and the minimum diameter \(W_{min}\) of \(C\): $$ \kappa(C) = \frac{W_{max}}{W_{min}}. $$ For \(f\) convex, this is proportional to the conditioning of \(f\) as an operator. Notice that the sublevel sets of all the above functions are approximately square. This means that without modification, the problems are all well-conditioned, i.e., \(\kappa \approx 1\). To simulate poor problem conditioning, which is common in machine learning applications, we introduce various amounts of skew on \(f\). For each objective function, we run optimizations with no skew, an inverse conditioning ratio of \(\frac{W_{min}}{W_{max}} = 0.5\), and an inverse conditioning ratio of \(\frac{W_{min}}{W_{max}} = 0.01\).

Also note that each of the above functions is completely separable, in that it can be decomposed into the sum of its components in each dimension, which can be optimized separately. For Adam and AdaGrad, which use diagonal matrices to capture variance in each basis dimension, this means that all the necessary information can be captured without need for off-diagonal elements. However, as the functions are rotated to a maximum angle of 45 degrees, the objective functions become non-separable, making Adam and AdaGrad's variance approximations poor proxies for the true Hessian. To simulate non-separability, we run the optimization algorithms on each objective function with no rotation, 22.5 degree rotation, and full 45 degree rotation.
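Both the skew and rotation transformations are compositions with a linear change of variables. The 2-D sketch below is illustrative (the names and the restriction to two dimensions are ours; the experiments use 10 dimensions).

```python
import numpy as np

def rotate_2d(f, theta_degrees):
    """Compose f with a planar rotation; the rotated objective
    x -> f(R x) is non-separable for angles strictly between 0 and 90."""
    t = np.radians(theta_degrees)
    R = np.array([[np.cos(t), -np.sin(t)],
                  [np.sin(t),  np.cos(t)]])
    return lambda x: f(R @ np.asarray(x, dtype=float))

def skew_2d(f, ratio):
    """Stretch the second axis so the sublevel sets of a round f get an
    inverse conditioning ratio of approximately W_min / W_max = ratio."""
    scale = np.array([1.0, 1.0 / ratio])
    return lambda x: f(scale * np.asarray(x, dtype=float))
```

A rotationally symmetric function like the 2-D quadratic \(\sum x_i^2\) is unchanged by `rotate_2d`, while the separable-but-anisotropic objectives used in the experiments are not.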

All of the objective functions were implemented in Python, and their gradients were generated using the Python automatic differentiation tool AutoGrad. The constants for the objective functions are presented in the table below:

| Function | \(m\) | \(k\) | \(a\) | \(s\) |
|---|---|---|---|---|
| Me Jr. | N/A | 2 | N/A | N/A |
| Basin | N/A | 2 | N/A | N/A |
| Saddle Point | N/A | N/A | N/A | 0.75 |
| Multi-Min | 3 | N/A | 2 | N/A |

The four algorithms discussed have been coded in Python, with the exception of L-BFGS, for which the SciPy implementation L-BFGS-B was used.

| Algorithm | \(\alpha\) | \(\beta_1\) | \(\beta_2\) | \(\tau\) | \(\varepsilon\) |
|---|---|---|---|---|---|
| SGD | \(0.1\) | N/A | N/A | \(0.5\) | N/A |
| AdaGrad | \(0.01\) | N/A | N/A | N/A | \(10^{-6}\) |
| Adam | \(0.01\) | \(0.9\) | \(0.99\) | N/A | \(10^{-8}\) |

Each algorithm was run on each noise level, skew, and rotation of all four objective functions in 10 dimensions.

The two overall best performers on the analytic objective functions under varying noise, skew, and rotation were SGD and Adam. The following figure shows the median objective value obtained versus the number of gradient evaluations for each optimization algorithm over all values of noise, skew, and rotation.

Multiple (expected) behaviors of each optimization algorithm can be observed in this plot. For the subquadratic *Me Jr.* function, SGD converges to the radius allowed by its step size, hence the appearance of a step function. Adam performs best on the *Basin* function because its momentum allows it to continue walking closer to the minimum. Adam also performs best on the *Saddle* function because its strictly positive second-order estimate of function curvature allows it to escape saddle points. None of the optimization algorithms are able to successfully minimize the *Multimin* problem.

Some unexpected and difficult-to-explain behaviors also occur. It is unclear why SGD suddenly performs better with a specific step size on the *Multimin* problem; perhaps that step size is just the right amount to step out of a local minimum.

The scaling on the following three interactive plots does not update automatically. After pressing 'Play', then 'Pause' on the desired frame, hover over the plot and press the 'autoscale' button (to the left of the 'home' button) to correct the axis ranges. Note that clicking on the slider itself disables the 'autoscale' button until the page is refreshed.

First, we consider increasing the noise in the gradient of the objective function.

As expected, the addition of noise significantly slows the convergence of all the algorithms for all the functions. However, the addition of noise seems to allow both SGD and Adam to escape local minima in the Multimin function. This manifests in the fact that both take longer to converge initially, but ultimately manage to converge much closer to the true minimum when noise is added.

Next, we consider increasing the skew (i.e., deteriorating the conditioning) of the objective functions.

Interestingly, SGD seems to be less affected by the skew than Adam and AdaGrad. This is unexpected, since the theoretical analyses for both Adam and AdaGrad claim that they should exhibit better constants than SGD on poorly conditioned problems.

Finally, we consider various rotations of the objective functions, which correspond to non-separability. We had considered that this could negatively impact Adam and AdaGrad, but as seen below, none of the algorithms are significantly affected.

The only algorithm that exhibits any response to rotation is SGD, which fails to drop exactly to zero for *Me Jr.*, instead taking its signature stair-step path downward.

In this paper, we have presented a side-by-side convergence analysis of four common convex optimization algorithms on four analytic objective functions exhibiting specific properties, under various amounts of noise, skew, and rotation. Most of our empirical results support the theoretical results, but there are a few exceptions. Most notably, Adam and AdaGrad do not display the robustness to poor conditioning promised by their analyses. Also interestingly, the presence of noise can be beneficial to both SGD and Adam on the *Multimin* problem, allowing them to escape local minima.