John,
I'm really glad you're doing this, as optimization functions are also
important to me. I'm sorry it took me so long to get to this. I can't figure
out how to make line comments in your repository, so I'll do them here.
Comments on API (most important because future changes will break
compatibility):
1. It seems your algorithms are written to take two separate functions, one
for the value and the other for the gradient. Presumably, you did it this way
because optimization routines don't always need both of them (for example, in
backtracking_line_search), so you can save some time by not needlessly
computing the gradient. However, in many complex functions, much of the work
in calculating the value and gradient is common, and so doing the
"preparation" once and then getting out both the value and the gradient can be
very helpful.
I wonder if this is a case where something like the following might make the
most sense (this is a toy problem; in real code the savings are often much
more substantial):
function myfun(x, iscalcgrad::Bool)
    dx = x .- 1
    val = sum(dx.^2)/2
    if !iscalcgrad
        return val
    end
    g = dx  # we saved some work by pre-computing dx
    return val, g
end
myfun(x) = myfun(x, false)
Unless, of course, anyone knows of cases where you just need the gradient (not
the value) and it's actually faster to calculate the gradient on its own than
it is to calculate the function value? (If so, then two Bool inputs? One for
val and one for g?)
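If such a case exists, a two-flag variant might look like this (a toy sketch;
the flag names are mine, not a proposal for the final API):

function myfun(x, calcval::Bool, calcgrad::Bool)
    dx = x .- 1                          # shared "preparation" work
    val = calcval ? sum(dx.^2)/2 : nothing
    g = calcgrad ? dx : nothing
    return val, g
end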
2. You require data points to be of type Vector, and that seems reasonable for
algorithms that use a Hessian. However, some algorithms (e.g., conjugate
gradient) don't ever construct the Hessian, and therefore can work very
naturally with arrays of arbitrary shape. For example, if you're trying to
optimize a matrix, it's nice not to have to manually convert back-and-forth
between vector and matrix representations just to make the optimization
routine happy. You may not like having the APIs differ, however, in which
case one nice solution (for bfgs, for example) would be to write two versions
of update_hessian: one that requires Vector inputs, and another that does
"sv = s[:]; yv = y[:]" first and delegates (within the last 24 hours this has
become a particularly highly-optimized operation, so it shouldn't add much
overhead).
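Concretely, the dispatch might look like this (a sketch only; update_hessian's
real signature in your code may differ):

# The Vector methods hold the existing implementation (shown here as a stub).
function update_hessian(H::Matrix, s::Vector, y::Vector)
    # ... the existing vector-based update ...
    return H
end

# The general methods flatten arbitrary-shape arrays and delegate.
update_hessian(H::Matrix, s::AbstractArray, y::AbstractArray) =
    update_hessian(H, s[:], y[:])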
3. backtracking_line_search computes the value and gradient from scratch. What
if we already know those values, say because we're in the middle of a
conjugate-gradient algorithm? You'd probably prefer to pass them in to avoid
recomputing them. Perhaps the entire algorithm should be written in terms of
an anonymous function,
f = alpha->myfun(x + alpha*g_x)
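To make that concrete, here is a minimal sketch (the argument names are mine):
the caller passes in the value f_x and the directional derivative gdotp it
already has, and the search works entirely on the 1-D slice:

function backtracking_line_search(f, x, p, f_x, gdotp;
                                  c = 1e-4, shrink = 0.5, t = 1.0)
    # assumes p is a descent direction, i.e. gdotp = dot(g_x, p) < 0
    phi = alpha -> f(x + alpha*p)        # the 1-D slice suggested above
    while phi(t) > f_x + c*t*gdotp       # Armijo sufficient-decrease test
        t *= shrink
    end
    return t
end

A conjugate-gradient driver typically has f_x and gdotp in hand already, so
nothing gets recomputed.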
Comments about the algorithms (less important, as these can be fixed as we go):
1. bfgs: is what you're calling "hessian" here really the inverse Hessian?
Either the variable name and description need to be fixed, or the algorithm
does.
http://en.wikipedia.org/wiki/BFGS_method. I also suspect
that your update_hessian function may involve more matrix multiplications (an
O(N^3) process) than is really necessary. If you expand out the product, all
of these terms are intrinsically matrix-times-vector or vector*vector'
operations, so you should avoid forming the dense combinations unless you
have no alternative. Otherwise, you're basically losing all the advantages
you get from Sherman-Morrison, because inversion has the same complexity as
matrix-matrix multiplication.
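For example, expanding the standard inverse-Hessian update
H <- (I - rho*s*y')*H*(I - rho*y*s') + rho*s*s', with rho = 1/(y'*s), into
rank-one pieces needs only a single matrix-vector product. A sketch (variable
names mine):

using LinearAlgebra

function update_inverse_hessian!(H::Matrix, s::Vector, y::Vector)
    rho = 1/dot(y, s)
    Hy = H*y                             # the one O(N^2) matrix-vector product
    yHy = dot(y, Hy)
    # only rank-one terms below; no matrix-matrix product is ever formed
    H .-= rho .* (Hy*s' .+ s*Hy')
    H .+= (rho^2*yHy + rho) .* (s*s')
    return H
end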
2. backtracking_line_search: for many algorithms you need to also check the
curvature condition (see
http://en.wikipedia.org/wiki/Wolfe_conditions). Also,
your variable names could be a bit confusing: alpha is conventionally (I
think) the name for the step size (which you call t), yet you use alpha for
something else.
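For reference, the two tests look roughly like this (a sketch; c1 and c2 are
the usual constants, e.g. 1e-4 and 0.9):

using LinearAlgebra

# t is the candidate step along descent direction p; f_x and gp = dot(g, p)
# are the value and directional derivative at the current point.
function satisfies_wolfe(f, grad, x, p, t, f_x, gp; c1 = 1e-4, c2 = 0.9)
    x_new = x + t*p
    armijo    = f(x_new) <= f_x + c1*t*gp        # sufficient decrease
    curvature = dot(grad(x_new), p) >= c2*gp     # curvature condition
    return armijo && curvature
end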
3. I'm not sure about your convergence criteria. Consider what happens if I
simply redefine my objective function as
f2(x) = 10^8 * myfun(x)
I tend to like things like this:
isconverged = old_val - new_val <= tol*(abs(old_val) + abs(new_val) + eps())
because all the units make sense. You may also want a test on the gradient, of
course, but things like norm(gradient) are problematic because the different
variables can have different scaling (e.g., different physical units).
sum(gradient'*dx) fixes the units problem, although a very short step can lead
to trouble. I'm not up on the latest in terms of good choices of convergence
criteria.
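Combining the two ideas, a sketch of a scale-aware test (tol and the names
are illustrative); both checks are unchanged if you multiply f by 10^8,
because numerator and denominator scale together:

using LinearAlgebra

function isconverged(old_val, new_val, g, dx; tol = 1e-8)
    denom = abs(old_val) + abs(new_val) + eps()
    f_ok = old_val - new_val <= tol*denom    # relative decrease in f
    g_ok = abs(dot(g, dx)) <= tol*denom      # gradient'*dx has the units of f
    return f_ok && g_ok
end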
4. You don't seem to use any "wisdom" about the step size from previous line
searches. So if a step size of 1 is a bad choice, you have to repeat all the
work to get down to a good scale the next time. But if you do that, you also
need to try growing your step size, say by a factor of 2 after the final
successful step, to make sure it doesn't just keep getting smaller. (Again,
I'm not up on the latest here.)
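The idea in code (a sketch, reusing the hypothetical backtracking_line_search
from my comment above): seed each search with twice the last accepted step, so
the scale can adapt in both directions.

using LinearAlgebra

function gradient_descent(f, grad, x; maxiter = 100)
    t = 1.0
    for iter in 1:maxiter
        g = grad(x)
        p = -g                     # steepest descent, just for illustration
        t = backtracking_line_search(f, x, p, f(x), dot(g, p); t = 2t)
        x = x + t*p
    end
    return x
end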
I'm afraid this is all the time I have for this now, but hopefully this helps
a bit!
--Tim