Revised Optimization Functions

John Myles White

Jul 3, 2012, 10:56:00 AM
to juli...@googlegroups.com
Hi all,

I've done a good deal of work over the last week on revising the optimization library I started building some time ago. You can see the state of things here:

https://github.com/johnmyleswhite/optim.jl

The quick summary is that there's a minimally functional optimize() function that selects among Nelder-Mead, L-BFGS, and Newton's method based on how much information you can provide about the input function and its first and second derivatives.
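
Roughly, the intended usage looks like this (a sketch only; the exact signatures may change as the API evolves):

x0 = randn(2)
f(x) = sum((x - 1).^2)           # objective value only
g(x) = 2 * (x - 1)               # gradient
h(x) = 2 * eye(length(x))        # Hessian

optimize(f, x0)                  # no derivatives -> Nelder-Mead
optimize(f, g, x0)               # gradient       -> L-BFGS
optimize(f, g, h, x0)            # Hessian        -> Newton's method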

I still have to implement conjugate gradients and Brent's method, but that's all that remains before we have, in principle, a first draft of a complete toolkit for unconstrained optimization.

After that work is finished, I'll implement L-BFGS-B and some other constrained optimization algorithms.

Right now, I could really use some feedback. There's a lot of code that's been built already and a good number of tests, but more are needed before I'll feel comfortable signing off on these functions for use in real work.

-- John

Jeffrey Sarnoff

Jul 3, 2012, 1:23:34 PM
to juli...@googlegroups.com
Hi John,
It would be more easily incorporated into some set-ups if init.jl used relative paths.

Jeffrey Sarnoff

Jul 3, 2012, 1:27:43 PM
to juli...@googlegroups.com
(I assume)

John Myles White

Jul 3, 2012, 1:32:51 PM
to juli...@googlegroups.com, juli...@googlegroups.com
I feel dumb asking this, but isn't it already using relative paths?

 -- John

Douglas Bates

Jul 3, 2012, 2:16:47 PM
to juli...@googlegroups.com
Perhaps Jeffrey is referring to paths that are relative to the Julia home directory.  I imagine this question will be addressed when packages are more fully developed.

Viral Shah

Jul 3, 2012, 2:39:58 PM
to juli...@googlegroups.com
Hi John,

This is indeed really cool. An optimization package might just convince a lot of people to try out julia! I'll take a look, but may not have detailed insight to offer.

-viral

John Myles White

Jul 3, 2012, 10:09:17 PM
to juli...@googlegroups.com
Thanks, Viral! It would be great if you took a look at it when you have time. It'll probably be even more valuable to get your input after I have enough tests in place to ensure the correctness of the functions; at that point, I'm sure you'd have a lot of great input about how to make the functions faster.

 -- John

Tim Holy

Jul 11, 2012, 8:30:55 AM
to juli...@googlegroups.com
John,

I'm really glad you're doing this, as optimization functions are also
important to me. I'm sorry it took me so long to get to this. I can't figure
out how to make line comments in your repository, so I'll do them here.

Comments on API (most important because future changes will break
compatibility):
1. It seems your algorithms are written to take two separate functions, one
for the value and the other for the gradient. Presumably, you did it this way
because optimization routines don't always need both of them (for example, in
backtracking_line_search), so you can save some time by not needlessly
computing the gradient. However, in many complex functions, much of the work
in calculating the value and gradient is common, and so doing the
"preparation" once and then getting out both the value and the gradient can be
very helpful.
I wonder if this is a case where something like the following might make the most
sense (this is a toy example; sometimes the savings are much more substantial):

function myfun(x, iscalcgrad::Bool)
    dx = x - 1
    val = sum(dx.^2)/2
    if !iscalcgrad
        return val
    end
    g = dx # we saved some work by pre-computing dx
    return val, g
end
myfun(x) = myfun(x, false)

Unless, of course, anyone knows of cases where you just need the gradient (not
the value) and it's actually faster to calculate the gradient on its own than
it is to calculate the function value? (If so, then two Bool inputs? One for
val and one for g?)

2. You require data points to be of type Vector, and that seems reasonable for
algorithms that use a Hessian. However, some algorithms (e.g., conjugate
gradient) don't ever construct the Hessian, and therefore can work very
naturally with arrays of arbitrary shape. For example, if you're trying to
optimize a matrix, it's nice not to have to manually convert back-and-forth
between vector and matrix representations just to make the optimization
routine happy. You may not like having the APIs be different, however, in which
case one nice solution (for bfgs, for example) would be to write two versions
of update_hessian, one that requires Vector inputs and the other which does
"sv = s[:]; yv = y[:]" first (within the last 24 hours this has become a
particularly highly-optimized operation, so it shouldn't add much overhead).

3. backtracking_line_search computes the value and gradient from scratch. What
if we already know those values, say because we're in the middle of a
conjugate-gradient algorithm? You'd probably prefer to pass them in to avoid
recomputing them. Perhaps the entire algorithm should be written in terms of
an anonymous function,

f = alpha->myfun(x + alpha*g_x)
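
so that the search itself only ever sees a one-dimensional function. A bare-bones Armijo backtracking sketch of that idea (the constant values and the direction d are placeholders, not anything from John's code):

function backtrack(phi::Function, phi0, dphi0)
    alpha = 1.0
    c1 = 1e-4                    # sufficient-decrease constant
    while phi(alpha) > phi0 + c1 * alpha * dphi0
        alpha = alpha / 2        # shrink until the Armijo condition holds
    end
    return alpha
end

# e.g., alpha = backtrack(alpha -> myfun(x + alpha*d), val, dot(g_x, d))
# where d is a descent direction.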




Comments about the algorithms (less important, as these can be fixed as we go):
1. bfgs: is what you're calling "hessian" here really the inverse Hessian?
Either the variable name and description needs to be fixed, or the algorithm
needs to be fixed. http://en.wikipedia.org/wiki/BFGS_method. I also suspect
that your hessian_update function may involve more matrix multiplications (an
O(N^3) process) than is really necessary. If you expand out the product, all
of these are intrinsically matrix-times-vector or vector*vector' operations, so
you should avoid forming the dense combinations unless you have no
alternative. Otherwise, you're basically losing all the advantages you get
from using Sherman-Morrison, because inversion has the same complexity as
matrix-matrix multiplication.

2. backtracking_line_search: for many algorithms you need to also check the
curvature condition (see http://en.wikipedia.org/wiki/Wolfe_conditions). Also,
your variable names could be a bit confusing, as alpha is conventionally (I
think) the variable name for the step size (which you call t, but then use
alpha for another meaning).

3. I'm not sure about your convergence criteria. Consider what happens if I
simply redefine my objective function as
f2(x) = 10^8 * myfun(x)
I tend to like things like this:
isconverged = old_val - new_val <= tol*(abs(old_val) + abs(new_val) + eps())
because all the units make sense. You may also want a test on the gradient, of
course, but things like norm(gradient) are problematic because the different
variables can have different scaling (e.g., different physical units).
sum(gradient'*dx) fixes the units problem, although a very short step can lead
to trouble. I'm not up on the latest in terms of good choices of convergence
criteria.

4. You don't seem to use any "wisdom" about the step size from previous line
searches. So if a step size of 1 is a bad choice, you have to repeat all the
work to get down to a good scale the next time. But if you do that, you also
need to try growing your step size, say by a factor of 2 after the final
successful step, to make sure it doesn't just keep getting smaller. (Again,
I'm not up on the latest here.)


I'm afraid this is all the time I have for this now, but hopefully this helps
a bit!

--Tim

Tim Holy

Jul 11, 2012, 8:35:54 AM
to juli...@googlegroups.com
Oh, and one more point on the API. It's often very nice (for
diagnostics/benchmarking) to return the full sequence of function values,
rather than just the last. I know you can print the sequence to the terminal,
but consider testing a whole slew of algorithms and then trying to make a
figure illustrating the performance of them all? Much better to have the data
captured in a variable.
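
In code, it could be as simple as threading a trace vector through the solver (illustrative only):

function toy_solver(f, x, niter)
    fvals = Float64[]              # trace of objective values
    for iter = 1:niter
        x = x - 0.1 * (x - 1)      # placeholder update step
        push!(fvals, f(x))         # record the value at every iterate
    end
    return x, fvals
end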

If I remember correctly, the "mincg" algorithm that Avik posted a while back
had a nice API this way. I'd recommend checking that and seeing if it gives
you ideas about the design of this package in general.

--Tim



John Myles White

Jul 11, 2012, 9:39:20 AM
to juli...@googlegroups.com
Thanks for all the input, Tim. I'll need to spend some time digesting it. There are probably some lax uses of the term Hessian to refer to its inverse at times.

One quick point that others might chime in on: I'd like to develop types (e.g. DifferentiableFunction and TwiceDifferentiableFunction) for combining functions with their gradient and Hessian/inverse Hessian, since we can then use dispatch to handle inputs based on this information.
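
Concretely, I'm imagining something like this (just a sketch; the field names are not settled):

type DifferentiableFunction
    fun::Function
    dfun::Function       # gradient
end

type TwiceDifferentiableFunction
    fun::Function
    dfun::Function       # gradient
    ddfun::Function      # Hessian (or inverse Hessian)
end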

The trouble with this approach is that it makes calling the core function a little ugly.

-- John

Patrick O'Leary

Jul 11, 2012, 9:55:17 AM
to juli...@googlegroups.com
On Wednesday, July 11, 2012 8:39:20 AM UTC-5, John Myles White wrote:
One quick point that others might chime in on: I'd like to develop types (e.g. DifferentiableFunction and TwiceDifferentiableFunction) for combining functions with their gradient and Hessian/inverse Hessian, since we can then use dispatch to handle inputs based on this information.

The trouble with this approach is that it makes calling the core function a little ugly.

Mentally sketching deuglification:
convert(::Type{Function}, f::DifferentiableFunction) = f.fun
d(f::DifferentiableFunction) = f.dfun
d(f::Function) = # insert numerical differentiation here


...etc

John Myles White

Jul 11, 2012, 10:07:01 AM
to juli...@googlegroups.com
Perfect. I hadn't even considered using convert. I had almost started hacking something together using ref(), which felt criminal to me.

 -- John

John Myles White

Jul 11, 2012, 10:28:49 AM
to juli...@googlegroups.com
Received this comment on my blog from an OR professor which is worth considering:

---

John, I don’t want to rain on your parade, but… I would not recommend implementing optimization algorithms from scratch in Julia. The rationale for this is identical to the one according to which it’s not wise to implement basic linear algebra operators from scratch in any language other than FORTRAN, and in all cases it’s better to link to good BLAS and LAPACK libraries. Much more than in the linear algebra case, the performance ratios between a library developed by an expert and a vanilla implementation of a vanilla algorithm can be very high. To get an idea of the performance variability among “expert” implementations, check http://plato.asu.edu/bench.html . It’s not uncommon to see ratios of 100x. And I can guarantee that the ratio between the slowest expert implementation and a vanilla one is much higher than 100x. Put together the two things, and you easily get slowdowns of 10^4 (when you get to a solution — not guaranteed!). Just to give you some context, IPOPT (which is probably the fastest free library, even if it’s slower than most commercial libraries) has been developed almost full time by Andreas Wachter for about 10 years with contributions from others, first at IBM Research and then at Northwestern. Besides speed considerations, there are always considerations of DRY and comparative advantage.

My recommendation would be to: i. check out http://www.coin-or.org/ (the repository for open-source optimization code); ii. target a few problem classes, perhaps Convex and Quadratic first and Linear and Mixed Integer much later; iii. identify the best-of-breed libraries; iv. link up with the core developers and get guidance in developing an API for these packages in Julia. A part of this could be a great summer of code project.

By the way, both R and MATLAB have poor optimization support. This could be a real differentiator for Julia.

---

My response was that it seems valuable to have both a pure Julia library and a wrapper to standard libraries. The problem for me is that the standard libraries are so opaque as to be magical to me. There's also a quirk of discipline in that my fields all take R and Matlab's optimization support as the industry standard for what optimization software should look like.

-- John



Tim Holy

Jul 11, 2012, 10:30:59 AM
to juli...@googlegroups.com
On Wednesday, July 11, 2012 06:39:20 AM John Myles White wrote:
> One quick point that others might chime in on: I'd like to develop types
> (e.g. DifferentiableFunction and TwiceDifferentiableFunction) for combining
> functions with their gradient and Hessian/inverse Hessian, since we can
> then use dispatch to handle inputs based on this information.

As long as they can "share the work" that sounds quite reasonable.

Perhaps I should have also mentioned that, if you're going to use the Wolfe
conditions & check the second one (curvature), I think you basically need the
gradient each and every time you ask for a function value. In that case it's
easy, you just always have your function return both and you don't have to
worry about this stuff at all.

But for algorithms that can use the Hessian, you're back to having to think
about this "output packaging" problem.

--Tim

Tim Holy

Jul 11, 2012, 10:38:10 AM
to juli...@googlegroups.com
While I do think that having some algorithms implemented in pure Julia is
feasible, I would say that at a minimum it's a good idea to survey the other
libraries and see how they structure their API and which papers they use as
the reference for their algorithms. This does take time, however, and I've
been guilty of not doing this myself enough in the past. But, the perceptive
and knowledgeable folks in the Julia community are teaching me to lean more on
other smart people!

--Tim

Nathaniel Daw

Jul 11, 2012, 10:56:44 AM
to juli...@googlegroups.com
On a quick look, I don't see that the COIN-OR site provides much that looks like a drop-in replacement for the R/Matlab-style generic nonlinear optimizers that John is targeting here.

You may want to take a look at 

http://www.di.ens.fr/~mschmidt/Software/minFunc.html

which is a source-available Matlab implementation of a few different optimization algorithms structured as a drop-in replacement for fminunc; my students swear it works better than the MathWorks version. It's somewhat overlapping with John's stuff, but looks more optimized and has (very simple) support for numerical differentiation.

By the way, I have been playing with John's Nelder-Mead code and it seems like it tends to re-visit and re-evaluate the same point multiple times. Is this intended?

Viral Shah

Jul 11, 2012, 11:10:37 AM
to juli...@googlegroups.com
Yes, this is always a conundrum. The same applies to, say, ODE solvers. There are lots of C and Fortran options, but truly integrating them always turns out to be difficult. Callbacks, objective functions, passing of arguments across the two languages, and the fact that the Julia compiler can't inline Julia functions into the solver are some reasons to have a native Julia implementation.

One other reason why people do not like writing such algorithms in a higher level language is that it effectively makes it difficult to use with other languages. Since we are likely to have a static compiler at some point, I feel that julia has a shot at becoming a language for library writers in general.

Of course, the pros and cons have to be weighed and we may even have more than one approach in the long run.

-viral

John Myles White

Jul 11, 2012, 11:23:02 AM
to juli...@googlegroups.com
You're not alone, Nathaniel: several people have convinced me that minFunc is a good source of inspiration. I still haven't had a chance to read through its source, but it seems like a perfect example of what I'm hoping to achieve in Julia.

The Nelder-Mead code could very well be broken, but are you inferring that it revisits points from the trace output? If so, there could be some complications, because the trace for Nelder-Mead is a bit impoverished relative to the underlying process, since the trace is outputting the centroid of the simplex, rather than the simplex vertices. I think that some operations in the NM method could leave the centroid unchanged across iterations.

That said, I have no real faith in that code's correctness yet. I'll take a look at it tonight. Do you have a simple example I could debug?

 -- John

Douglas Bates

Jul 11, 2012, 12:11:12 PM
to juli...@googlegroups.com
I think there is a bigger problem with Fortran-based implementations in that they expect the objective function (and gradient and Hessian functions) to be passed as Fortran subroutines. IIRC there is no easy way to pass a Julia function for a callback from compiled code and, having spent far too much of my life writing awkward wrappers to pass R closures through ghastly Fortran interfaces to be able to evaluate the objective, I think this is a good thing.  

There are actually people in the optimization community still writing Fortran-77 subroutines with those wonderful WORK, and IWORK and ... arrays from which they calculate positions for various arrays, etc.  Not only do they not want to use languages that can represent types and classes and all those good things, they think that dynamic allocation and freeing of memory is one of those suspicious new-fangled notions that real programmers should avoid.

The bottom line is that unless you can find Fortran/C/C++ optimization libraries that allow reverse communication, it will be very difficult to incorporate such code in Julia.  Better to start off with an implementation in Julia using the wide range of tools provided by the language (and perhaps borrowing some design ideas from Matlab implementations) than to try to retrofit code written in an archaic, stunted language into a modern computing environment.

Sorry for venting.

I think I mentioned this before but in case I didn't, the idea of reverse communication is to have the caller of the optimization code control the execution.  The optimization code has information on its state, etc. and all it does is update its state and return to the caller with a request for another evaluation or indicate termination.  The port subroutines underlying the R nlminb function are an example of this.
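
In code, the pattern looks roughly like this (names invented purely for illustration):

state = initialize(x0)                 # solver owns all of its state
while true
    request = update!(state)           # solver advances and asks for help
    if request == :evaluate_f
        state.fval = f(state.x)        # caller evaluates the objective
    elseif request == :evaluate_g
        state.grad = g(state.x)        # caller evaluates the gradient
    else                               # :done -> solver signals termination
        break
    end
end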

Krzysztof Kamieniecki

Jul 11, 2012, 12:18:35 PM
to juli...@googlegroups.com
I am in favor of Julia-native code, even if it duplicates FORTRAN / C code. My end goal is to have optimizers running on GPUs, and having the optimizer and objective/gradient/Hessian functions compile together will simplify that.

Also, I think the non-linear optimizers I am currently using only need to calculate (transpose(x) * Hessian * x), so I would prefer not to calculate the full Hessian.

Mason Simon

Jul 11, 2012, 1:13:08 PM
to juli...@googlegroups.com
Keep it up John! I can think of a number of reasons to replicate algorithms in Julia:
1) Discover how concisely they can be implemented. Concise code is easier to learn from and easier to build off of. If some hot new optimization algorithm comes along, maybe it can be implemented in less time in Julia than FORTRAN, and you'll have built the groundwork for it.
2) Build a reference implementation that the lower-level stuff can be tested against. Maybe you'll reveal some sneaky bug in their FORTRAN--who knows? That comment is abusing the term "DRY"; DRY means factor out commonalities in Your code, not "Don't compete with my Results, just accept them and find something else to do with Your time"…
3) And "comparative advantage"--them's fighting words :)

Viral Shah

Mar 6, 2013, 2:54:51 AM
to juli...@googlegroups.com
We actually have pretty good support for linear programming thanks to all the work by Miles, Iain and Carlo. See MathProg, SymbolicLP, LinprogGLPK, Clp, and CoinMP.

John Myles White also has an Optim package.

To me, it is essential that we have well thought out APIs, even if we do not support all the libraries, or have simple implementations in julia to begin with. Others here know a lot more than I do, and it would be nice to hear what they have to say about the choice of libraries and API design.

-viral



On 06-Mar-2013, at 12:44 PM, Guilherme Freitas wrote:

> Hi all,
>
> I am new to Julia, but I like *a lot* what I have seen. I am listing in the end of this post some thoughts that I already posted on John's blog (found his blog first, so...)
>
> Again, would love to collaborate on making optimization in Julia "happen".
>
> At a design level, I think there are three options to think about (I'll just give examples):
>
> 1. Something like SciPy, interfacing to external solvers.
> 2. Something like the GSL or NLpy, giving access to individual components of an optimization algorithm (line searchers, KKT solvers, etc.)
> 3. A modeling layer (like AMPL, CVXMOD, Pyomo, PuLP, etc.)
>
> My guess is that 1 and 2 would be "easier" and more immediately useful.
>
> Sorry for the lack of structure. It's a crazy night.
>
> ----
>
> Some thoughts (sorry for not elaborating, it’s a tough night):
>
> 1. AMPL [1] is awesome and maybe a “default” Julia optimization package can take many lessons from them. If we could use their solver interface (which I *think* is open source), we would have immediate access to a large number of industry-grade solvers already interfaced to AMPL.
>
> 2. Something like CVXOPT [2] would be beyond awesome.
>
> 3. Other Python packages to look at: Pyomo [3] (a project by William Hart at Sandia National labs) and NLpy [4] (by Dominique Orban at Ecole Polytechnique de Montreal) is also worth a look for ideas. Pyomo uses the AMPL solver interface trick I mentioned above. I think NLpy reuses also some part of AMPL. There are some interesting design decisions there. I also like the syntax of PuLP [5] quite a bit.
>
> 4. Important solvers to have: IPOPT [6] (interior point) and ALGENCAN [7] (active set) . They are certainly among the best (if not *the* best) open source, smooth nonlinear solvers out there.
>
> 5. Automatic differentiation [8, 9] would be extremely useful. There are many libraries out there, and a very good book by Andreas Griewank published by SIAM.
>
> 6. A solver for systems of smooth nonlinear equations would be great. For something readable, maybe the solvers in Kelley’s little book [10] published by SIAM would be ideal… Well documented, high-quality and very readable. The book has MATLAB implementations that could be easily ported.
>
> 7. Good linear programming support. I am guessing clp [11] by the COIN-OR foundation would be among the best. A good simplex solver that could use fractions or arbitrary precision numbers would be great too! I think GLPK [12] does that already, need to check.
>
> 8. Talking about GLPK, a lot of people would be well served by a good interface to GLPK and the GSL [13]. There is a GSL-Shell project [14] in Lua that might be worth looking at.
>
> I am not an expert in optimization, but I am an enthusiast, a user and I know a fair amount of math and programming. I would love to contribute. Send me an email if you want to coordinate something! I found out about Julia recently, and it looks neat!
>
> [1] http://www.ampl.com
> [2] http://abel.ee.ucla.edu/cvxopt/
> [3] https://software.sandia.gov/trac/coopr/wiki/Pyomo
> [4] https://github.com/dpo/nlpy
> [5] http://pythonhosted.org/PuLP/index.html
> [6] https://projects.coin-or.org/Ipopt
> [7] http://www.ime.usp.br/~egbirgin/tango/codes.php
> [8] http://www.autodiff.org/
> [9] https://en.wikipedia.org/wiki/Automatic_differentiation
> [10] http://www.ec-securehost.com/SIAM/FA01.html
> [11] http://www.coin-or.org/Clp/
> [12] https://www.gnu.org/software/glpk/
> [13] https://www.gnu.org/software/gsl/
> [14] http://www.nongnu.org/gsl-shell/

Tim Holy

Mar 6, 2013, 7:00:03 AM
to juli...@googlegroups.com
It's not a bad idea to have access to those external libraries. In the long
run, though, it would be great to have our own. Julia could easily become
_the_ platform for developing new algorithms.

A bit of context: while I haven't really advertised it, the cgdescent
algorithm should be pretty high-end. The original C version of cgdescent
(CG_DESCENT by Hager & Zhang) is as good a contender as any for "fastest
general purpose unconstrained optimization algorithm." In particular, their
line search algorithm is amazing. L-BFGS on its own is often slower than
CG_DESCENT, but L-BFGS + the Hager/Zhang linesearch is just as fast as
CG_DESCENT, and in a few isolated corner cases (involving massive precision
loss) it is even better.

I've tried to capture the strengths of their algorithm in my implementation
(which was a "clean room implementation" from their papers, given that their
algorithm is GPL), and even extend it to work with constrained optimization.
For problems of a few hundred to a few hundred-thousand variables, it simply
blows away Matlab's fminunc, often by a factor of 10, 100, or more. However,
I have no doubts there are issues with my version. I've periodically toyed
with contacting Hager & Zhang to see if they have any interest in taking a
peek and perhaps making it better, though I haven't done so yet.

With regards to your original question, there are probably two attractive
things about wrapping external libraries: it would give us access to a wider
variety of algorithms, and in the long run it would give us an easier way to
benchmark our own implementations.

However, while working on cgdescent, I came away with the idea that the single
biggest contribution anyone could make to let Julia rock for optimization
would be to write a parser for the CUTEr test suite:
http://en.wikipedia.org/wiki/CUTEr
This seems to be the de facto standard collection of test problems.

Definitely up to you what you decide to tackle, of course! Given the explosive
growth of Julia and the importance of optimization to just about everything,
it's definitely a point that could use a lot more attention. I'm very glad to
hear that you're interested.

Best,
--Tim

Dahua

Mar 6, 2013, 4:46:29 PM
to juli...@googlegroups.com
John,

I really appreciate your efforts on the optimization package. Needless to say, optimization is one of the most important things in technical computing.

Here are my two cents. I think a very important aspect (especially at the beginning) is the design of the interface. Actually, I don't think MATLAB is a good example to follow here, but I hope Julia can get it right.

Unlike matrix algebra, which has a de facto standard back-end (i.e. BLAS & LAPACK), there are many competing algorithms for optimization. A good interface should allow access to different back-end algorithms through a uniform front-end interface. By default, it may rely on a Julia-based implementation, but it should allow the plugin of different solvers (e.g. IPOPT, Knitro, etc.). Some languages (e.g. MATLAB) allow the user to supply a string or an enumerator to specify the choice of algorithm. But this is not a great idea in my opinion, because it limits the extensibility of the package (users can only choose among those algorithms that have already been implemented).

What about the following design ?

abstract AbstractOptimizer
abstract GradientBasedOptimizer <: AbstractOptimizer
# abstract types for other kinds of optimizers ...

The optimizer vendor/wrapper should implement/adapt the optimizers following the required interface. In this way, people can plug in their favourite solvers (e.g. a commercial solver if they happen to have the license).

Then the client code may look like the following

x = optimize(f, x0);          # use the default solver
x = optimize(f, x0, opts);    # use the default solver with supplied params
x = optimize(f, x0, solver);  # use a customized solver

The function f is a little bit tricky. I think optimize should be able to take in any objective function, using some clearly documented way to examine the function's properties (e.g. whether it can output the gradient and/or Hessian).

I agree with Tim that it is often much more efficient to have the same function output both the objective value & gradient, as the computation of these usually shares a common part and it is wasteful to compute those parts twice. It is also important to have a way to tell the function whether it should evaluate the gradient. Here are my thoughts on the interface.

v = f(x);       # f only evaluates the objective value
v = f(x, g);    # f evaluates the objective value and also writes the gradient to g
v = f(x, g, H); # additionally outputs the Hessian

This design has two advantages: (1) f knows whether it should evaluate the gradient simply from the signature. (2) The optimizer can allocate the gradient vector only once and reuse the same memory block every time the gradient needs to be updated (generally, the dimension of this vector will not change). This is much more efficient than creating a new vector at each iteration.
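
For example, an objective written against this convention might look like the following (illustrative only):

function f(x, g)
    dx = x - 1
    g[:] = dx               # fill the caller-supplied buffer in place
    return sum(dx.^2) / 2
end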

In terms of the choice of default implementation, it might be good to use a set of benchmarks to guide the choice. I also think the default implementation should be pure Julia (which has minimal cost when interacting with Julia functions), while passing Julia functions to the external world may incur overhead.

John Myles White

Mar 6, 2013, 4:50:18 PM
to juli...@googlegroups.com
This all sounds promising to me. I am handing in my thesis on Friday. Once that is done, I will give this design more thought. At some point it would be good to have a real-time chat between all the people involved in writing optimization code for Julia. My only remaining hesitation about wholesale adopting Tim's API, which has many strengths that mine badly lacks, is that it seems to lack a simple base case for a user with little knowledge of optimization who just wants to pass in a function and some bounds and get an answer.

Again, I will give this a lot more thought after my free time returns. Right now the academic deadlines are too pressing to do anything else.

-- John

Iain Dunning

Mar 6, 2013, 8:46:24 PM
to juli...@googlegroups.com
1. AMPL [1] is awesome and maybe a “default” Julia optimization package can take many lessons from them. If we could use their solver interface (which I *think* is open source), we would have immediate access to a large number of industry-grade solvers already interfaced to AMPL.

We have MathProg.jl which is very competitive with the speed of AMPL. It is designed for Linear, Mixed Integer, and soon Quadratic Programming - it will not be extended to nonlinear programming because we feel a different design would be needed that would sacrifice speed for the LP/MILP/QP case.
The "interface" is the .nl file format - its a nasty format, but your logic is sound.

2. Something like CVXOPT [2] would be beyond awesome.
 
A lot of the general functionality in there is already in Julia. A general convex solver isn't, and would be quite a project - but of course totally doable.

3. Other Python packages to look at: Pyomo [3] (a project by William Hart at Sandia National labs) and NLpy [4] (by Dominique Orban at Ecole Polytechnique de Montreal) is also worth a look for ideas. Pyomo uses the AMPL solver interface trick I mentioned above. I think NLpy reuses also some part of AMPL. There are some interesting design decisions there. I also like the syntax of PuLP [5] quite a bit.

See MathProg above.

4. Important solvers to have: IPOPT [6] (interior point) and ALGENCAN [7] (active set) . They are certainly among the best (if not *the* best) open source, smooth nonlinear solvers out there.

Miles and I have been tinkering with taking Julia functions, using metaprogramming to generate and compile the Jacobian. We only need to hook into the C API to get IPOPT running.

5. Automatic differentiation [8, 9] would be extremely useful. There are many libraries out there, and a very good book by Andreas Griewank published by SIAM.

See above. Our (Miles and I) feeling is that "automatic differentiation" and "symbolic differentiation" become VERY confused when metaprogramming gets in the mix. See https://github.com/IainNZ/NLTester

7. Good linear programming support. I am guessing clp [11] by the COIN-OR foundation would be among the best. A good simplex solver that could use fractions or arbitrary precision numbers would be great too! I think GLPK [12] does that already, need to check.

This is done: we support CLP and CBC (via CoinMP), and there are packages for them.

8. Talking about GLPK, a lot of people would be well served by a good interface to GLPK and the GSL [13]. There is a GSL-Shell project [14] in Lua that might be worth looking at.

This has been available from near the start, but has been superseded by CLP in my opinion.

Miles Lubin

Mar 6, 2013, 9:05:50 PM
to juli...@googlegroups.com
On Wednesday, March 6, 2013 8:46:24 PM UTC-5, Iain Dunning wrote:
1. AMPL [1] is awesome and maybe a “default” Julia optimization package can take many lessons from them. If we could use their solver interface (which I *think* is open source), we would have immediate access to a large number of industry-grade solvers already interfaced to AMPL.

We have MathProg.jl which is very competitive with the speed of AMPL. It is designed for Linear, Mixed Integer, and soon Quadratic Programming - it will not be extended to nonlinear programming because we feel a different design would be needed that would sacrifice speed for the LP/MILP/QP case.
The "interface" is the .nl file format - its a nasty format, but your logic is sound.

It is possible to extend MathProg to nonlinear optimization, but it will be a nontrivial project and will require much macro-fu. I'd put it on the long-term wish list.

 
5. Automatic differentiation [8, 9] would be extremely useful. There are many libraries out there, and a very good book by Andreas Griewank published by SIAM.

See above. Our (Miles and I) feeling is that "automatic differentiation" and "symbolic differentiation" become VERY confused when metaprogramming gets in the mix. See https://github.com/IainNZ/NLTester

I think there's big potential for using Julia for automatic differentiation in cases like simulations, because of the very powerful metaprogramming tools and the first-class access to the Julia parser. For simpler closed-form expressions, which show up in nonlinear modeling, symbolic differentiation is actually pretty easy, with the added benefit that you can just compile a function at runtime that computes derivatives efficiently.




John Myles White

Mar 6, 2013, 9:09:08 PM
to juli...@googlegroups.com
If anyone feels like setting up a symbolic differentiation system in Julia, I wrote the most barebones possible example for the Calculus.jl package.
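
For flavor, the core of such a system is just recursion over expressions; a toy version (Calculus.jl's actual code is more complete) might look like:

differentiate(s::Symbol, wrt::Symbol) = s == wrt ? 1 : 0
differentiate(c::Number, wrt::Symbol) = 0
function differentiate(ex::Expr, wrt::Symbol)
    op = ex.args[1]
    if op == :+
        Expr(:call, :+, [differentiate(a, wrt) for a in ex.args[2:end]]...)
    elseif op == :*
        u, v = ex.args[2], ex.args[3]   # two-factor product rule, for brevity
        :($(differentiate(u, wrt)) * $v + $u * $(differentiate(v, wrt)))
    else
        error("no differentiation rule for $op")
    end
end

# differentiate(:(x*x + 2*x), :x) yields an (unsimplified) derivative expression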

 -- John

Kevin Squire

Mar 6, 2013, 9:21:11 PM
to juli...@googlegroups.com


On Wednesday, March 6, 2013 6:05:50 PM UTC-8, Miles Lubin wrote:
On Wednesday, March 6, 2013 8:46:24 PM UTC-5, Iain Dunning wrote:
1. AMPL [1] is awesome and maybe a “default” Julia optimization package can take many lessons from them. If we could use their solver interface (which I *think* is open source), we would have immediate access to a large number of industry-grade solvers already interfaced to AMPL.

We have MathProg.jl which is very competitive with the speed of AMPL. It is designed for Linear, Mixed Integer, and soon Quadratic Programming - it will not be extended to nonlinear programming because we feel a different design would be needed that would sacrifice speed for the LP/MILP/QP case.
The "interface" is the .nl file format - its a nasty format, but your logic is sound.

It is possible to extend MathProg to nonlinear optimization, but it will be a nontrivial project and will require much macro-fu. I'd put it on the long-term wish list.

 
5. Automatic differentiation [8, 9] would be extremely useful. There are many libraries out there, and a very good book by Andreas Griewank published by SIAM.

I'll second this: the Griewank book is very thorough and well written.
 

See above. Our (Miles and I) feeling is that "automatic differentiation" and "symbolic differentiation" become VERY confused when metaprogramming gets in the mix. See https://github.com/IainNZ/NLTester

I think there's a big potential for using Julia for automatic differentiation for cases like simulations because of the very powerful metaprogramming tools and the first-class access to the julia parser. For more simple closed-form expressions which show up in nonlinear modeling, symbolic differentiation is actually pretty easy, with the added benefit that you can just compile a function at runtime that computes derivatives efficiently.


Forward-mode automatic differentiation is quite similar to symbolic differentiation, but I would really love to see a reverse-mode differentiation implemented in julia.  It's a bit more involved, but much more efficient for a large class of simulation and optimization problems.

Kevin

Miles Lubin

Mar 6, 2013, 9:39:22 PM
to juli...@googlegroups.com
Hi Dahua,

It's actually pretty difficult in my opinion to come up with a good (generic and powerful) interface for optimization. The properties of the function you're optimizing make a huge difference in the correct choice of algorithms. It's also important to be able to access solver options, and that's difficult to do in a generic way. Also, nobody wants to be manually coding gradients, Jacobians, or Hessians.

A practical approach, in my opinion, is to skip the awkward matlab/scipy-type interfaces (they should be there for those who want them, but I don't think they're the most useful) and
1. Provide low-level solver-specific interfaces to important codes such as Clp, GLPK (both already done), IPOPT, Couenne, commercial solvers, etc.
2. Provide a high-level optimization system/modeling language that can detect problem structure and call appropriate solvers efficiently, automatically providing sparse Jacobian/Hessian matrices. YALMIP is an example of this in Matlab. The MathProg package developed by Iain and me is a first stab at this, limited to linear/quadratic/mixed-integer (for now).
3. For problems that cannot be expressed in a closed form (simulations, etc.), provide a set of macros that can perform automatic differentiation. Only in this case does it make that much sense (to me at least) to use an interface like the one you propose.

Miles

Miles Lubin

Mar 6, 2013, 9:48:21 PM
to juli...@googlegroups.com


On Wednesday, March 6, 2013 9:21:11 PM UTC-5, Kevin Squire wrote:

I'll second this: the Griewank book is very thorough and well written.

It happens to be sitting on my desk right now :)
 
 

See above. Our (Miles and I) feeling is that "automatic differentiation" and "symbolic differentiation" become VERY confused when metaprogramming gets in the mix. See https://github.com/IainNZ/NLTester

I think there's a big potential for using Julia for automatic differentiation for cases like simulations because of the very powerful metaprogramming tools and the first-class access to the julia parser. For more simple closed-form expressions which show up in nonlinear modeling, symbolic differentiation is actually pretty easy, with the added benefit that you can just compile a function at runtime that computes derivatives efficiently.


Forward-mode automatic differentiation is quite similar to symbolic differentiation, but I would really love to see a reverse-mode differentiation implemented in julia.  It's a bit more involved, but much more efficient for a large class of simulation and optimization problems.

This isn't my specialty, so I'm not in a position to implement it from scratch (in a reasonable amount of time), but I'd be interested in helping out if someone would like to take the lead.



 

Tim Holy

Mar 6, 2013, 9:49:47 PM
to juli...@googlegroups.com
On Wednesday, March 06, 2013 01:46:29 PM Dahua wrote:
> v = f(x); # f only evaluates the objective value
> v = f(x, g); # f evaluates the objective value and as well writes the
> gradient to g
> v = f(x, g, H); # additionally outputs Hessian
>
> This design has two advantage: (1) f knows when it should evaluate the
> gradient simply by signature. (2) The optimizer can allocate the gradient
> vector only once and reuse the same memory block every time the gradient
> needs to be updated (generally, the dimension of this vector will not
> change). It is much more efficient than creating a new vector at each
> iteration.

This is how I did it in cgdescent, for those reasons, but with two differences:

- It doesn't use multiple dispatch. Instead, if you pass in g = nothing, then
the objective function knows not to compute the gradient. That might seem
weird in a language like Julia, but the reason is that I've got some objective
functions (running right at this moment, it so happens) where the objective
function is two to three screens full of code, and whether you include the
gradient or not hardly changes the size. There's no good reason to write all
that code twice (once with g and once without) when a trivial run-time check
whose overhead you can't really even measure suffices to unify them into a
single function.

- The second difference is that the output argument g is on the left side of x.
I know it doesn't really matter, but my thinking was (1) functions often have
additional parameters, and it seems natural to let x be the "barrier" between
arguments-that-are-outputs and arguments-that-are-parameters. Since optional
arguments usually go at the end, that meant putting the gradient at the
beginning. (2) at least on the left they are a little closer to being like the
LHS in an assignment statement :-).

I do agree with John, though, that assigning to the pre-allocated g has one
disadvantage: some users might accidentally toss g, and this could be
confusing. For example, consider minimizing (A*x-b)^2/2: it's very tempting to
write g = A'*(A*x-b). If you do that inside your objective function, you end
up reassigning g to the temporary created by this multiplication. Any checks
you do inside that function will pass, but then the caller simply won't see
the data; g will contain whatever it contained upon entry.

We do have a syntax for that, because we can write At_mul_B(g, A, A*x-b) and
the output of the multiplication is stored in the pre-allocated g. It's a
little uglier, however. Particularly since John really values (and is good at
writing) very readable code, I understand his hesitation, even though in this
case I think that performance simply has to win in the end. Hopefully we can
come up with a way to get readability and good performance.
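
To make the pitfall concrete, here is a toy least-squares objective (A and b are invented for illustration):

A = randn(5, 3); b = randn(5)

function grad_lost(g, x)
    r = A*x - b
    g = A'*r            # rebinds the local name g; the caller's buffer is untouched!
    return dot(r, r)/2
end

function grad_kept(g, x)
    r = A*x - b
    At_mul_B(g, A, r)   # stores A'*r in the caller's pre-allocated g
    return dot(r, r)/2
end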

(Good luck with the thesis, John!)

--Tim

Miles Lubin

Mar 6, 2013, 9:56:57 PM
to juli...@googlegroups.com

Any harm in providing interfaces for both versions, one where a preallocated gradient is passed in and one where the user returns a gradient? With second-order methods it's also important to not keep reallocating Hessians (which could be dense or sparse).

 

Dahua

Mar 6, 2013, 10:00:36 PM
to juli...@googlegroups.com
Miles,

I agree that it would be great to provide facilities for automatic differentiation, freeing people from the need to write routines that compute the gradient and/or Hessian. It would also be a great feature to have a modeling language and a mechanism to choose an appropriate solver and parameter configuration.

However, I still think it is important to provide some intermediate-level interfaces for people who understand optimization and want to do it in the "traditional" way -- that is, by writing a function that outputs both objective values and gradients.

This is actually based on my experience working in machine learning/computer vision. In such fields, the evaluation of many useful objective functions is itself very complex (e.g. various kinds of energy functions defined on images). Usually, such objective functions cannot be expressed in a nice analytic form. Instead, people have to write hundreds and sometimes thousands of lines of MATLAB code just to compute the objective over a data set. Therefore, it is not that easy for a program to compute the derivatives automatically.

To optimize such functions, people have to write the gradient in a dedicated way. For some complex objective functions, even deriving an efficient way to compute the gradient is itself sufficient for a paper in a top-tier journal or conference.

For such problems (a considerable portion of problems in AI), an interface that accepts a user-supplied gradient and a user's choice of solver is important.

Tim Holy

Mar 6, 2013, 10:08:54 PM
to juli...@googlegroups.com
On Wednesday, March 06, 2013 06:39:22 PM Miles Lubin wrote:
> Also, nobody wants to be manually
> coding gradients, Jacobians, or Hessians.

What are you talking about, I _love_ coding gradients, particularly when it
requires that I keep track of kronecker deltas 5 layers deep. It keeps the
riffraff away, and at the end of the day my chest hair is measurably longer.
(Not that that's a good thing, mind you.)

On your other point, I'm not sure that I understand the interface you're
proposing. The optimization algorithm should be reasonably "generic" (it may
have a limited problem domain, but it should still be somewhat general),
whereas the objective function is very specific. I confess I'm having a hard
time imagining how you could get away with a design that doesn't involve
thinking about how to compute the gradient (except in cases like linear
programming where the function is of a very narrowly-defined, albeit important,
form).

--Tim

Dahua

Mar 6, 2013, 10:44:12 PM
to juli...@googlegroups.com


On Wednesday, March 6, 2013 8:49:47 PM UTC-6, Tim wrote:
On Wednesday, March 06, 2013 01:46:29 PM Dahua wrote:
> v = f(x);       # f only evaluates the objective value
> v = f(x, g);    # f evaluates the objective value and as well writes the
> gradient to g
> v = f(x, g, H); # additionally outputs Hessian
>
> This design has two advantage: (1) f knows when it should evaluate the
> gradient simply by signature. (2) The optimizer can allocate the gradient
> vector only once and reuse the same memory block every time the gradient
> needs to be updated (generally, the dimension of this vector will not
> change). It is much more efficient than creating a new vector at each
> iteration.

This is how I did it in cgdescent, for those reasons, but with two differences:

- It doesn't use multiple dispatch. Instead, if you pass in g = nothing, then
the objective function knows not to compute the gradient. That might seem
weird in a language like Julia, but the reason is that I've got some objective
functions (running right at this moment, it so happens) where the objective
function is two to three screens full of code, and whether you include the
gradient or not hardly changes the size. There's no good reason to write all
that code twice (once with g and once without) when a trivial run-time check
whose overhead you can't really even measure suffices to unify them into a
single function.

I completely agree with this. It is important to not write the same thing twice (especially when it is quite complex).


- The second difference is that the output argument g is on the left side of x.
I know it doesn't really matter, but my thinking was (1) functions often have
additional parameters, and it seems natural to let x be the "barrier" between
arguments-that-are-outputs and arguments-that-are-parameters. Since optional
arguments usually go at the end, that meant putting the gradient at the
beginning. (2) at least on the left they are a little closer to being like the
LHS in an assignment statement :-).

This seems more like a matter of taste. While my original proposal seems more natural to me, I don't actually have a strong preference here. What's much more important is to have a consistent interface and stick to it.


I do agree with John, though, that assigning to the pre-allocated g has one
disadvantage: some users might accidentally toss g, and this could be
confusing. For example, consider minimizing (A*x-b)^2/2: it's very tempting to
write g = A'*(A*x-b). If you do that inside your objective function, you end
up reassigning g to the temporary created by this multiplication. Any checks
you do inside that function will pass, but then the caller simply won't see
the data; g will contain whatever it contained upon entry.

This is a fair point. However, a preallocated gradient is still the better option. If you use an interface that requires a pre-allocated array, people may make mistakes the first time they write a gradient, but they can quickly learn how to do it right (it may be good to provide some tutorials, though).

However, if an interface like "[v, g] = f(x)" is adopted, it would be extremely difficult (if not impossible) to avoid re-creating arrays at each call when performance is critical.

Miles Lubin

Mar 6, 2013, 11:19:22 PM
to juli...@googlegroups.com
Hi Dahua,

Thanks for your perspective. It's clear that optimization looks different computationally in different fields and contexts. Gradient-based unconstrained optimization looks very different from linear/nonlinear/integer programming. I was too dismissive in my previous post of a black-box function/gradient interface. Still, it seems too early to try to standardize this at the base Julia level (there's a significant refactoring of linear algebra still in the works, issue 2308).

Miles

Tim Holy

Mar 7, 2013, 4:55:13 AM
to juli...@googlegroups.com
On Wednesday, March 06, 2013 08:19:22 PM Miles Lubin wrote:
> Still, it seems too early
> to try to standardize this at the base Julia level (there's a significant
> refactoring of linear algebra still in the works, issue 2308).

I'm not sure I agree with this. I think it's a much more restricted API issue
than factorization, with really only a couple of decisions that need to be
made. We should not shy away from making some decisions and then seeing how it
works out; it's really the only way to move forward. Given the churn in the
rest of Julia, we can always change our mind for 0.3!

--Tim

Tim Holy

Mar 7, 2013, 5:04:31 AM
to juli...@googlegroups.com
Hi Dahua,

On Wednesday, March 06, 2013 07:44:12 PM Dahua wrote:
> > - The second difference is that the output argument g is on the left side
> > of x.
> > ....
>
> This seems more like a matter of taste. While my original proposal seems
> more natural to me, I don't actually have strong preference here. What's
> much more important is to have a consistent interface and stick to it.

Agreed.

> This is a fair point. However, preallocated gradient is still a better
> option. If you use an interface that requires pre-allocated array, people
> may make mistakes the first time they write a gradient. But they can
> quickly learn how to do it right (It may be good to provide some tutorials
> though).

Oh, I agree, I'm just wanting to be clear about the main weakness of the
approach I adopted. If you check out cgdescent, you'll see that it uses a
single g vector throughout, without (at least in my view) making things too
unacceptably ugly. I once played with rewriting JMW's l_bfgs in a way that
avoids so much reallocation; that turns out to be a somewhat harder task, one
that I ran out of time on before completing in a satisfactory way.

--Tim

John Myles White

Mar 7, 2013, 8:14:50 AM
to juli...@googlegroups.com
I also tried to rewrite L-BFGS to avoid so much allocation. It is tricky, but desperately needs doing. (Case in point: on many of the problems I work on, it underperforms Nelder-Mead for speed.)

Starting next week I'll try to focus on this issue again. I am in agreement with Dahua about the virtues of a simplified interface, but am interested to see whether we can create an API that offers tiers for more sophisticated users.

-- John

Dahua

Mar 7, 2013, 8:30:15 AM
to juli...@googlegroups.com
Yes, writing BFGS without reallocating arrays takes a little bit of thought about the data structure, but it is completely doable.

I actually wrote a BFGS routine in MATLAB (in a toolbox I wrote earlier) that delegates the update to a C++ mex, which does not do any reallocation.

The basic idea: 

Suppose you have to store k gradients of dimension d. Then preallocate a d-by-k matrix, and treat that like a cyclic list. 
You may use an index to indicate which is the first gradient and an integer to indicate the number of available gradients. At each iteration, you move the indicator forward. There are some corner cases that you might have to take care of, though.

In Julia, at each iteration you will have to create a vector view of a specific column and let the objective function f output the gradient to that vector view. A temporary vector view can be created using pointer_to_array.
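
A minimal sketch of the cyclic storage, with illustrative names:

type GradientHistory
    S::Matrix{Float64}    # d-by-k cyclic storage
    mm::Int               # column holding the most recent entry (0 before the first store)
    hm::Int               # number of valid entries so far
end

GradientHistory(d::Int, k::Int) = GradientHistory(zeros(d, k), 0, 0)

function store!(h::GradientHistory, g::Vector{Float64})
    k = size(h.S, 2)
    h.mm = h.mm == k ? 1 : h.mm + 1    # advance cyclically
    h.hm = min(h.hm + 1, k)            # history length saturates at k
    h.S[:, h.mm] = g                   # overwrite in place; no reallocation
    return h
end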
 
If necessary, I am glad to implement this once the interface is settled.

Dahua

Mar 7, 2013, 8:48:30 AM
to juli...@googlegroups.com
This is a reference MATLAB implementation of L-BFGS updates without re-creating arrays:


% mm: index of the column that stores last updated gradient (init to 0)
% hm: the length of available history

% backward-pass
q = g;
for j = [mm : -1 : 1, hm : -1 : mm+1]
    talpha(j) = rho(j) * (S(:,j)' * q);
    q = q - talpha(j) * Y(:,j);
end                    

% forward-pass
z = q;
for j = [mm + 1 : hm, 1 : mm]
    beta = rho(j) * (Y(:,j)' * z);
    z = z + (talpha(j) - beta) * S(:,j);
end 

% update history (in a cyclic way)
                                
if hm < cache_n                
    hm = hm + 1;
    mm = mm + 1;
elseif mm < hm
    mm = mm + 1;
else
    mm = 1;
end

new_s = x - x_pre;  % in Julia, new_s and new_y can be pre-created
new_y = g - g_pre;

S(:, mm) = new_s;
Y(:, mm) = new_y;
rho(mm) = 1.0 / (new_s' * new_y);
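
For reference, the two-loop recursion above ports to Julia along these lines (a sketch only: every array is assumed pre-allocated by the caller, and the returned q must be negated to obtain the descent direction):

function lbfgs_direction!(q, g, S, Y, rho, talpha, mm, hm)
    n = length(q)
    for i = 1:n; q[i] = g[i]; end                # q <- g, no allocation
    order = vcat(mm:-1:1, hm:-1:mm+1)            # newest to oldest
    for j in order                               # backward pass
        s = 0.0
        for i = 1:n; s += S[i, j] * q[i]; end    # S(:,j)' * q
        talpha[j] = rho[j] * s
        for i = 1:n; q[i] -= talpha[j] * Y[i, j]; end
    end
    for j in reverse(order)                      # forward pass, oldest to newest
        s = 0.0
        for i = 1:n; s += Y[i, j] * q[i]; end    # Y(:,j)' * z
        beta = rho[j] * s
        for i = 1:n; q[i] += (talpha[j] - beta) * S[i, j]; end
    end
    return q                                     # approximates inv(H) * g
end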


John Myles White

Mar 7, 2013, 8:54:05 AM
to juli...@googlegroups.com
Thanks, Dahua. I'll try to get this working over the weekend so that we at least have something faster while we debate API design.

 -- John