Numerical Optimization Methods

0 views

Skip to first unread message

Shinyoung Gedris

unread,

Aug 5, 2024, 7:35:34 AM8/5/24

to ombiosenworl

NumericalOptimization is one of the central techniques in Machine Learning. For many problems it is hard to figure out the best solution directly, but it is relatively easy to set up a loss function that measures how good a solution is - and then minimize the parameters of that function to find the solution.

I ended up writing a bunch of numerical optimization routines back when I was first trying to learnjavascript.Since I had all this code lying around anyway, I thought that it might be fun to provide someinteractive visualizations of how these algorithms work.

The cool thing about this post is that the code is all running in the browser, meaning you caninteractively set hyper-parameters for each algorithm, change the initial location, and change what function is being called to get a better sense of how these algorithms work.

To overcome these problems, the Nelder-Mead methoddynamically adjusts the step size based off the loss of the new point. If the new point is better than any previously seen value, it expands the step size to accelerate towards the bottom. Likewise if the new point is worse it contracts the step size to converge around the minima.

Click anywhere in this graph to restart with a new initial location. This method will generate a triangleat that spot and then flip flop towards the minima at each iteration, expandingor contracting as necessary according to the settings.

While this method is incredibly simple, itactually works fairly well on low dimensional functions. It can even minimize non-differentiablefunctions like \( f\left(x\right) = \left\left\lfloor x \right\rfloor - 50\right \), which allof the rest of the methods I'm going to talk about would fail at.

One possible direction to go is to figure out what the gradient \(\nabla F(X_n) \) is at the current point, and take a step down the gradient towards the minimum. The gradient can be calculated by symbolically differentiating the loss function, or by using automatic differentiation likeTorchand TensorFlow does. Using a fixed step size \(\alpha\) this means updating the current point \(X_n\) by:

A line search can modify thelearning rate at each iteration so that both the loss always is descending, which prevents it from overshootingthe minima - and also making sure that the gradient is flattening out sufficiently , whichprevents taking too many tiny steps. Enabling line search here leads tofewer iterations with the downside that each iteration has might have to sample extra functionpoints.

The Conjugate Gradient method tries to estimate the curvature of the function being minimized byincluding the previous search direction with the current gradient to come up with a new better searchdirection.

While the theory behind this might be a little involved, the math is pretty simple.The initial search direction \( S_0 \) is the same as in gradient descent. Subsequent search directions \(S_n\) are computed by:

You can see the progress of this method below. The actual direction taken is in red, with the gradients at each iteration beingrepresented by a yellow arrow. In certain cases the search direction being used is almost 90 degrees to the gradient, which explains why Gradient Descent had such problems on this function:

The challenge here is to convert a matrix of distances between some pointsinto coordinates for each point that best approximate the required distances. One way of doing this is to minimize a function like:

Nocedai and Wright have written an excellentbook on numericaloptimization that wasmy reference for most of this. While it is a great resource, there are a couple of other techniques not coveredthat I quickly wanted to mention.

Sebastian Ruder wrote an excellent overview of gradient descent methodsthat go into more depth, especially in the case of stochastic gradient descent on large sparse models like those used to train deep neural networks.

One cool derivative free optimization method is Bayesian Optimization. Eric Brochu, Mike Vlad Cora and Nando de Freitaswrote a great introduction to Bayesian Optimization. An interesting application of Bayesian Optimization is in hyper-parameter tuning - there is even a company SigOpt that is offeringBayesian Optimization as a service for this purpose.

Mathematical optimization (alternatively spelled optimisation) or mathematical programming is the selection of a best element, with regard to some criteria, from some set of available alternatives.[1][2] It is generally divided into two subfields: discrete optimization and continuous optimization. Optimization problems arise in all quantitative disciplines from computer science and engineering[3] to operations research and economics, and the development of solution methods has been of interest in mathematics for centuries.[4]

In the more general approach, an optimization problem consists of maximizing or minimizing a real function by systematically choosing input values from within an allowed set and computing the value of the function. The generalization of optimization theory and techniques to other formulations constitutes a large area of applied mathematics.

Problems formulated using this technique in the fields of physics may refer to the technique as energy minimization,[5] speaking of the value of the function f as representing the energy of the system being modeled. In machine learning, it is always necessary to continuously evaluate the quality of a data model by using a cost function where a minimum implies a set of possibly optimal parameters with an optimal (lowest) error.

Typically, A is some subset of the Euclidean space R n \displaystyle \mathbb R ^n , often specified by a set of constraints, equalities or inequalities that the members of A have to satisfy. The domain A of f is called the search space or the choice set, while the elements of A are called candidate solutions or feasible solutions.

While a local minimum is at least as good as any nearby elements, a global minimum is at least as good as every feasible element.Generally, unless the objective function is convex in a minimization problem, there may be several local minima.In a convex problem, if there is a local minimum that is interior (not on the edge of the set of feasible elements), it is also the global minimum, but a nonconvex problem may have more than one local minimum not all of which need be global minima.

asks for the maximum value of the objective function 2x, where x may be any real number. In this case, there is no such maximum as the objective function is unbounded, so the answer is "infinity" or "undefined".

The term "linear programming" for certain optimization cases was due to George B. Dantzig, although much of the theory had been introduced by Leonid Kantorovich in 1939. (Programming in this context does not refer to computer programming, but comes from the use of program by the United States military to refer to proposed training and logistics schedules, which were the problems Dantzig studied at that time.) Dantzig published the Simplex algorithm in 1947, and also John von Neumann and other researches worked on the theoretical aspects of linear programming (like the theory of duality) around the same time.[7]

Adding more than one objective to an optimization problem adds complexity. For example, to optimize a structural design, one would desire a design that is both light and rigid. When two objectives conflict, a trade-off must be created. There may be one lightest design, one stiffest design, and an infinite number of designs that are some compromise of weight and rigidity. The set of trade-off designs that improve upon one criterion at the expense of another is known as the Pareto set. The curve created plotting weight against stiffness of the best designs is known as the Pareto frontier.

A design is judged to be "Pareto optimal" (equivalently, "Pareto efficient" or in the Pareto set) if it is not dominated by any other design: If it is worse than another design in some respects and no better in any respect, then it is dominated and is not Pareto optimal.

The choice among "Pareto optimal" solutions to determine the "favorite solution" is delegated to the decision maker. In other words, defining the problem as multi-objective optimization signals that some information is missing: desirable objectives are given but combinations of them are not rated relative to each other. In some cases, the missing information can be derived by interactive sessions with the decision maker.

Optimization problems are often multi-modal; that is, they possess multiple good solutions. They could all be globally good (same cost function value) or there could be a mix of globally good and locally good solutions. Obtaining all (or at least some of) the multiple solutions is the goal of a multi-modal optimizer.

Classical optimization techniques due to their iterative approach do not perform satisfactorily when they are used to obtain multiple solutions, since it is not guaranteed that different solutions will be obtained even with different starting points in multiple runs of the algorithm.

The satisfiability problem, also called the feasibility problem, is just the problem of finding any feasible solution at all without regard to objective value. This can be regarded as the special case of mathematical optimization where the objective value is the same for every solution, and thus any solution is optimal.

Many optimization algorithms need to start from a feasible point. One way to obtain such a point is to relax the feasibility conditions using a slack variable; with enough slack, any starting point is feasible. Then, minimize that slack variable until the slack is null or negative.

The extreme value theorem of Karl Weierstrass states that a continuous real-valued function on a compact set attains its maximum and minimum value. More generally, a lower semi-continuous function on a compact set attains its minimum; an upper semi-continuous function on a compact set attains its maximum point or view.