Optim.jl: unexpected results when using gradient-based methods


Holger Stichnoth

May 19, 2014, 6:44:02 AM5/19/14
to julia...@googlegroups.com
Hello,

I installed Julia a couple of days ago and was impressed by how easy it was to make the switch from Matlab and to parallelize my code
(something I had never done before in any language; I'm an economist with only limited programming experience, mainly in Stata and Matlab).

However, I ran into a problem when using Optim.jl for Maximum Likelihood estimation of a conditional logit model. With the default Nelder-Mead algorithm, optimize from the Optim.jl package gave me the same result that I had obtained in Stata and Matlab.

With gradient-based methods such as BFGS, however, the algorithm jumped from the starting values to completely different parameter values. This happened for all the starting values I tried, including the case in which I took a vector that is close to the optimum found by the Nelder-Mead algorithm.

The problem seems to be that the algorithm tried values so large (in absolute value) that they caused problems in the objective
function, where these parameter values enter exponential functions. As a result, the optimization based on the BFGS algorithm did not produce the expected optimum.

While I could try to provide the analytical gradient in this simple case, I was planning to use Julia for Maximum Likelihood or Simulated Maximum Likelihood estimation in cases where the gradient is more difficult to derive, so it would be good if I could make the optimizer run also with numerical gradients.

I suspect that my problems with optimize from Optim.jl could have something to do with the gradient() function. In the example below, for instance, I do not understand why the output of the gradient function includes values such as 11470.7, given that the function values differ only minimally.

Best wishes,
Holger


julia> Optim.gradient(clogit_ll,zeros(4))
[6.0554544523933395e-6,0.0,0.0,0.0]
14923.564009972584
[-6.0554544523933395e-6,0.0,0.0,0.0]
14923.565228435104
[0.0,6.0554544523933395e-6,0.0,0.0]
14923.569064311248
[0.0,-6.0554544523933395e-6,0.0,0.0]
14923.560174904109
[0.0,0.0,6.0554544523933395e-6,0.0]
14923.63413848258
[0.0,0.0,-6.0554544523933395e-6,0.0]
14923.495218282553
[0.0,0.0,0.0,6.0554544523933395e-6]
14923.58699717058
[0.0,0.0,0.0,-6.0554544523933395e-6]
14923.54224130672
4-element Array{Float64,1}:
  -100.609
   734.0
 11470.7
  3695.5
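Incidentally, the numbers in this output are consistent with a central finite difference using step h = cbrt(eps()) ≈ 6.06e-6 (my assumption about Optim's internals, not confirmed from its source): the function values differ only in the decimals, but dividing by 2h ≈ 1.2e-5 inflates those differences into the large entries of the returned gradient.

```julia
# Central-difference check: g[i] ≈ (f(x + h*e_i) - f(x - h*e_i)) / (2h).
# The f values are copied from the printed output above.
h = cbrt(eps())    # ≈ 6.0554544523933395e-6, the perturbation size shown

g1 = (14923.564009972584 - 14923.565228435104) / (2h)   # ≈ -100.609
g3 = (14923.63413848258  - 14923.495218282553) / (2h)   # ≈ 11470.7
println(g1, "  ", g3)
```

So a gradient entry of 11470.7 only requires a function-value difference of about 0.14 at this step size.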

function clogit_ll(beta::Vector)
    # Print the parameters and the return value to
    # check how gradient() and optimize() work.
    println(beta)
    println(-sum(compute_ll(beta,T,0)))

    # compute_ll computes the individual likelihood contributions
    # in the sample. T is the number of periods in the panel. The 0
    # is not used in this simple example. In related functions, I
    # pass on different values here to estimate finite mixtures of
    # the conditional logit model.
    return -sum(compute_ll(beta,T,0))
end





Andreas Noack Jensen

May 19, 2014, 7:09:04 AM5/19/14
to julia...@googlegroups.com
What is the output of versioninfo() and Pkg.installed("Optim")? Also, would it be possible to make a gist with your code?
--
Med venlig hilsen

Andreas Noack Jensen

John Myles White

May 19, 2014, 9:51:16 AM5/19/14
to julia...@googlegroups.com
If you can, please do share an example of your code. Logit-style models are in general numerically unstable, so it would be good to see how exactly you’ve coded things up.

One thing you may be able to do is use automatic differentiation via the autodiff = true keyword to optimize, but that assumes that your objective function is written in completely pure Julia code (which means, for example, that your code must not call any of the functions in Distributions.jl that are not written in Julia).

 — John

Holger Stichnoth

May 20, 2014, 11:34:21 AM5/20/14
to julia...@googlegroups.com
Hi Andreas,
hi John,
hi Miles (via julia-opt, where I mistakenly also posted my question yesterday),

Thanks for your help. Here is the link to the Gist I created: https://gist.github.com/anonymous/5f95ab1afd241c0a5962

In the process of producing a minimal (non-)working example, I discovered that the unexpected results are due to the truncation of the logit choice probabilities in the objective function. Optim.optimize() is sensitive to this when method = :l_bfgs is used. With method = :nelder_mead, everything works fine. When I comment out the truncation, :l_bfgs works as well. However, I need to increase the xtol from its default of 1e-12 to at least 1e-10, otherwise I get the error that the linesearch failed to converge.

I guess I should just do without the truncation. The logit probabilities are between 0 and 1 by construction anyway. I had just copied the truncation code from a friend who had told me that probabilities that are too close to 0 or 1 sometimes cause numerical problems in his Matlab code of the same function. With Optim.optimize(), it seems to be the other way around, i.e. moving the probabilities further away from 0 or 1 (even by tiny amounts) means that the stability of the (gradient-based) algorithm is reduced.
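A hypothetical sketch of why truncation interacts badly with numerical gradients (this is not Holger's actual code, and the truncation bounds are made up): clamping flattens the objective wherever the bound binds, so a finite-difference gradient taken there is exactly zero and gives the line search nothing to work with.

```julia
# Hypothetical truncation with made-up bounds: clamp kills the local
# slope wherever the argument falls in a clamped region.
truncate_p(p) = clamp(p, 0.01, 0.99)

h = 6.0554544523933395e-6              # step on the order of cbrt(eps())
fd(f, x) = (f(x + h) - f(x - h)) / (2h)

fd(truncate_p, 0.5)     # ≈ 1.0: interior point, slope preserved
fd(truncate_p, 0.999)   # 0.0: both evaluations clamp to 0.99
```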

So for me, the problem is solved. The problem was not with Optim.jl, but with my own code.

The only other thing that I discovered when trying out Julia and Optim.jl is that the optimization is currently considerably slower than Matlab's fminunc. From the Gist I provided above, are there any potential performance improvements that I am missing out on?

Best wishes,
Holger

Holger Stichnoth

May 20, 2014, 11:42:39 AM5/20/14
to julia...@googlegroups.com
When I set autodiff = true in the Gist I posted above, I get the message "ERROR: no method clogit_ll(Array{Dual{Float64},1},)".


Holger


John Myles White

May 20, 2014, 11:47:39 AM5/20/14
to julia...@googlegroups.com
Glad that you were able to figure out the source of your problems.

It would be good to get a sense of the amount of time spent inside your objective function vs. the amount of time spent in the code for optimize(). In general, my experience is that >>90% of the compute time for an optimization problem is spent in the objective function itself. If you instrument your objective function to produce timing information on each call, you could then see how much time optimize() adds on top of your function.

It’s also worth keeping in mind that your use of implicit finite differencing means that your objective function is being called a lot more times than theoretically necessary, so that any minor performance issue in it will very substantially slow down the solver.

Regarding your objective function's code, I suspect that the combination of global variables and memory-allocating vectorized arithmetic means that your objective function might be a good bit slower in Julia than in Matlab. Matlab seems to be a little better about garbage collection for vectorized arithmetic, and Julia is generally not able to optimize code involving global variables.
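A minimal sketch of the global-variable point (toy functions, not the actual objective): a non-constant global is untyped as far as the compiler is concerned, so every use of it goes through a slow dynamic path, while the same value passed as an argument is fully typed.

```julia
data = rand(1000)                      # non-constant global

f_global(beta) = beta[1] * sum(data)   # reads the untyped global each call
f_arg(beta, d) = beta[1] * sum(d)      # `d` has a concrete type: fast path

# Passing the data in (or closing over it) lets the compiler specialize:
f_arg([2.0], data) == f_global([2.0])  # same result, different compiled code
```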

Hope that points you in the right direction.

 — John

John Myles White

May 20, 2014, 11:49:32 AM5/20/14
to julia...@googlegroups.com
Yes, to use autodiff you need to make sure that all of the functions you call could be applied to Array{T} for all T <: Number. The typing on your code is currently overly restrictive when you define clogit_ll(beta::Vector{Float64}) and friends. If you loosen things to clogit_ll(beta::Vector), you might get autodiff to work.
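A sketch of the change John describes, with a toy objective standing in for clogit_ll: leave the signature generic and initialize accumulators from eltype(beta), so the same method applies when the optimizer passes in a Vector of Dual numbers.

```julia
# Generic signature: works for Vector{Float64} and Vector{Dual{Float64}} alike.
function toy_ll(beta::Vector)          # not beta::Vector{Float64}
    s = zero(eltype(beta))             # accumulator matches the input type
    for b in beta
        s += b * b
    end
    return s
end

toy_ll([1.0, 2.0])   # 5.0
```

With this signature, optimize(toy_ll, x0, method = :l_bfgs, autodiff = true) should no longer raise the no-method error (keyword spelling as of the 2014 Optim.jl API).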

— John

Tim Holy

May 20, 2014, 11:54:07 AM5/20/14
to julia...@googlegroups.com
On Tuesday, May 20, 2014 08:34:21 AM Holger Stichnoth wrote:
> The only other thing that I discovered when trying out Julia and Optim.jl
> is that the optimization is currently considerably slower than Matlab's
> fminunc. From the Gist I provided above, are there any potential
> performance improvements that I am missing out on?

In addition to the factors that John mentioned, I'll add that it's nontrivial
to compare performance fairly. One of the biggest issues is that the
algorithms may use different convergence criteria, and thus may have huge
differences in the number of iterations. For my problems, Optim's `cg` just so
happens to be about 100x faster than fminunc. YMMV.

Definitely worth profiling to see where the bottleneck is.

--Tim

Miles Lubin

May 21, 2014, 11:48:51 AM5/21/14
to julia...@googlegroups.com
Just to extend on what John said, I also think that if you can restructure the code to devectorize it and avoid using global variables, you'll see *much* better performance.

The way to avoid globals is by using closures, for example:

function foo(x, data)
    ...
end

...
data_raw = readcsv(file)
data = reshape(data_raw, nObs, nChoices*(1+nVar), T)

Optim.optimize(x -> foo(x, data), ...)

Holger Stichnoth

May 22, 2014, 6:38:44 AM5/22/14
to julia...@googlegroups.com
@ John: You are right, when I specify the function as clogit_ll(beta::Vector) instead of clogit_ll(beta::Vector{Float64}), autodiff = true works fine. Thanks for your help!

@ Tim: I have set the rather strict default convergence criteria of Optim.optimize to Matlab's default values for fminunc, but the speed difference is still there.

@ Miles/John: Getting rid of the global variables through closures and devectorizing made the optimization _slower_ not faster in my case: https://gist.github.com/stichnoth/7f251ded83dcaa384273. I was surprised to see this as I expected a speed increase myself.

Miles Lubin

May 22, 2014, 9:18:36 AM5/22/14
to julia...@googlegroups.com
I was able to get a nearly 5x speedup by avoiding the matrix allocation and making the accumulators type stable: https://gist.github.com/mlubin/055690ddf2466e98bba6
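A sketch of the two ideas (toy code, not the actual gist): accumulate into scalars initialized from the input's element type instead of building and then summing temporary arrays.

```julia
# Type-stable accumulation without temporaries: `zero(eltype(beta))`
# instead of `0`, and an explicit inner loop instead of `x[i, :] * beta`,
# which (in 2014-era Julia) would allocate a fresh vector per iteration.
function loglik_sketch(beta::Vector, x::Matrix)
    ll = zero(eltype(beta))
    for i in 1:size(x, 1)
        u = zero(eltype(beta))
        for k in 1:length(beta)
            u += x[i, k] * beta[k]
        end
        ll += u
    end
    return ll
end

loglik_sketch([1.0, 1.0], ones(3, 2))   # 6.0
```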

How does this compare with Matlab now?

Holger Stichnoth

May 22, 2014, 9:53:36 AM5/22/14
to julia...@googlegroups.com
Thanks, it's faster now (by roughly a factor of 3 on my computer), but still considerably slower than fminunc:

Averages over 20 runs:
Julia/Optim.optimize: 10.5s
Matlab/fminunc: 2.6s

Here are my Matlab settings:
options = optimset('Display', 'iter', ...
    'MaxIter', 2500, 'MaxFunEvals', 500000, ...
    'TolFun', 1e-6, 'TolX', 1e-6, ...
    'GradObj', 'off', 'DerivativeCheck', 'off');

startb = ones(1,nVar)';
[estim_clo, ll_clo] = ...
    fminunc(@(param)clogit_ll(param,data), ...
    startb, options);

Could the speed issue be related to the following messages that I get when I run the Julia code?
C:\Users\User\Documents\References\Software\Julia\mlubin>julia main.jl
Warning: could not import Base.foldl into NumericExtensions
Warning: could not import Base.foldr into NumericExtensions
Warning: could not import Base.sum! into NumericExtensions
Warning: could not import Base.maximum! into NumericExtensions
Warning: could not import Base.minimum! into NumericExtensions

John Myles White

May 22, 2014, 10:35:22 AM5/22/14
to julia...@googlegroups.com
How many function evaluations is Matlab performing and how many is Julia performing?

 — John

Miles Lubin

May 22, 2014, 5:59:28 PM5/22/14
to julia...@googlegroups.com
I can get another 50% speedup by:

- Running the optimization twice and timing only the second run; this is the more appropriate way to benchmark Julia because it excludes the function compilation time
- Setting autodiff = true
- Breaking up the long chains of sums, which apparently are slow
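The first point in code form (with a stand-in objective; the real benchmark would call the actual likelihood):

```julia
toy_objective(beta) = sum(beta .* beta)   # stand-in for the real objective

x0 = zeros(4)
toy_objective(x0)          # first call: includes JIT compilation time
@time toy_objective(x0)    # second call: measures execution only
```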

At this point one really needs to compare the number of function evaluations in each method, as John suggested.

John Myles White

May 22, 2014, 6:03:58 PM5/22/14
to julia...@googlegroups.com
Yeah, this case is tricky enough that we really need to get down to the lowest details:

(1) Do Julia and Matlab perform similar numbers of function evaluations?

(2) If they don't perform similar numbers of function evaluations, is one of them producing a better solution? Is the one that's producing a better solution doing more function evaluations?

(3) If they're doing different numbers of function evaluations and the one that does _fewer_ evaluations also produces a better solution, what's the reason? For example, is our line search default less effective for this problem than the Matlab line search? If you try other line search algorithms, do the results stay the same?

Unfortunately, answering all of these reliably may take us all pretty far down the rabbit hole. But they're worth pushing on systematically.

 -- John

Holger Stichnoth

May 27, 2014, 9:03:30 AM5/27/14
to julia...@googlegroups.com
Hi John, hi Miles,

Thanks to both of you. I did not have time to look into this over the weekend; I will do so in the next couple of days. I have already uploaded the Matlab files for comparison: https://gist.github.com/stichnoth/7f251ded83dcaa384273

Holger

Thibaut Lamadon

Jun 22, 2014, 7:50:16 AM6/22/14
to julia...@googlegroups.com, stic...@gmail.com
Hi guys, 

I wanted to look into this as well. The main issue, I think, is the speed of the objective function. Running @time on it suggested a large amount of byte allocation, and checking the types revealed that extracting x and y from data gave them type Any.

So I converted the data to Array{Float64,3}, and I also changed the function to store only the cumulative sum rather than the vector of likelihood contributions. I ran the objective function 50 times.

without the convert I get a total time of 9.18s
with the convert and the original function I get 4.15s
with the convert and the new function I get 1.49s
with Matlab I get 0.64s

So Matlab still appears to be 2.5 times faster. But I am guessing Matlab is using SIMD instructions when computing matrix multiplications, so we would need to use BLAS matrix multiplication in Julia to get a fair comparison.

Anyway, fixing the type of the input, and just summing inside the loop gives a 6x speed up. 
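The convert step described above, in sketch form (toy data; in the gist the untyped container presumably comes from readcsv and reshape):

```julia
raw = Any[1.0 2.0; 3.0 4.0]              # element type Any: indexing is slow
typed = convert(Array{Float64,2}, raw)   # concrete element type for the hot loop

eltype(typed)   # Float64
```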

PS: Running the full optimization 
with the convert and the new function I get 4.8s 
with my matlab I get 4s 

I could not commit to Holger's gist, so I forked it: https://gist.github.com/tlamadon/58c47c115f8cf2388e89
Please check that I have not done anything stupid; the output seemed similar.

Holger, I hope you are having a good time at home (or in Paris?). And a World Cup note: Allez les bleus!

very best,

t.

Tim Holy

Jun 22, 2014, 10:57:53 AM6/22/14
to julia...@googlegroups.com
If x1, ..., x6 or coeff are Float64 arrays, then the initialization

u1 = 0; u2 = 0; u3 = 0; u4 = 0; u5 = 0; u6 = 0

is problematic as soon as you get to

for k=1:nVar
u1 += x1[i + ni*( k-1 + nk* (t-1))]*coeff[k]
u2 += x2[i + ni*( k-1 + nk* (t-1))]*coeff[k]
u3 += x3[i + ni*( k-1 + nk* (t-1))]*coeff[k]
u4 += x4[i + ni*( k-1 + nk* (t-1))]*coeff[k]
u5 += x5[i + ni*( k-1 + nk* (t-1))]*coeff[k]
u6 += x6[i + ni*( k-1 + nk* (t-1))]*coeff[k]
end

because you're initializing them to be integers but then they get converted
into Float64. A more careful approach is to do something like this:

T = typeof(one(eltype(x1))*one(eltype(coeff)))
TT = typeof(one(T) + one(T))
u1 = u2 = u3 = u4 = u5 = u6 = zero(TT)

In general, code_typed is your friend: look for Union types.

T = Vector{Float64}
code_typed(ll, (T, T, T, T, T, T, T, T, T, T, T, T, T, Float64, Int, Int))

and you'll see Union types all over the place. (The TypeCheck package can also
help, but it didn't seem to pick up this error.) And see the Performance Tips
section of the manual.

--Tim

Thibaut Lamadon

Jun 22, 2014, 12:13:49 PM6/22/14
to julia...@googlegroups.com
Hi Tim 

is this a concern even though I declare u1::Float64 = 0; at the beginning of the function, in ll2?

t.

Tim Holy

Jun 22, 2014, 2:46:28 PM6/22/14
to julia...@googlegroups.com
I didn't look at ll2. But that one seems OK.

I didn't read the whole thread; are you timing just the execution of the
objective function, or of the whole optimization? You can't easily interpret
the latter.

--Tim
function ll2(coeff, x1, x2, x3, x4, x5, x6,
             y1, y2, y3, y4, y5, y6,
             T, nObs, nVar)

llit::Float64 = 0; #zeros(nObs,T)
ni = size(x1,1)
nk = size(x1,2)

denom::Float64 =0;
u1::Float64 = 0;
u2::Float64 = 0;
u3::Float64 = 0;
u4::Float64 = 0;
u5::Float64 = 0;
u6::Float64 = 0;
p1::Float64 = 0;
p2::Float64 = 0;
p3::Float64 = 0;
p4::Float64 = 0;
p5::Float64 = 0;
p6::Float64 = 0;

for i=1:nObs
for t=1:T
u1 = 0; u2 = 0; u3 = 0; u4 = 0; u5 = 0; u6 = 0;

for k=1:nVar
jj = i + ni*( k-1 + nk* (t-1))
u1 = u1 + x1[jj]*coeff[k]
u2 = u2 + x2[jj]*coeff[k]
u3 = u3 + x3[jj]*coeff[k]
u4 = u4 + x4[jj]*coeff[k]
u5 = u5 + x5[jj]*coeff[k]
u6 = u6 + x6[jj]*coeff[k]
end

denom = exp(u1) +
exp(u2) +
exp(u3) +
exp(u4) +
exp(u5) +
exp(u6)

p1 = exp(u1)/denom
p2 = exp(u2)/denom
p3 = exp(u3)/denom
p4 = exp(u4)/denom
p5 = exp(u5)/denom
p6 = exp(u6)/denom

llit +=
(log(p1))*y1[i,t] +
(log(p2))*y2[i,t] +
(log(p3))*y3[i,t] +
(log(p4))*y4[i,t] +
(log(p5))*y5[i,t] +
(log(p6))*y6[i,t]

end
end
return -llit
end

Thibaut Lamadon

Jun 22, 2014, 3:35:37 PM6/22/14
to julia...@googlegroups.com
Yeah, I agree; I was mostly looking at the objective function. But I also timed the optimization since that was the original point of the thread.
However, the total number of function evaluations is quite different in Matlab and Julia. 

As you said, those are different issues; better to focus on the objective function for now.

t.

John Myles White

Jun 24, 2014, 11:30:59 AM6/24/14
to julia...@googlegroups.com
Would be good to understand why we do such a different number of function evaluations. I’m assuming Matlab uses a different line search algorithm. If we can find out what it is, it might be worth implementing.

 — John

João Felipe Santos

Jun 24, 2014, 11:46:43 AM6/24/14
to julia...@googlegroups.com
MATLAB's documentation states that the line search they use is based on section 2.6 of Fletcher, R., "Practical Methods of Optimization," John Wiley and Sons, 1987.

--
João Felipe Santos

John Myles White

Jun 24, 2014, 11:52:09 AM6/24/14
to julia...@googlegroups.com
Thanks, João. Looking at 2.6, it seems like the line search algorithm is something like the Moré algorithm we provide, but not identical. If somebody feels up to digging into this, it could be fruitful. Of course, it could also be an extraordinary waste of time if Matlab's algorithm happens to be faster on this specific problem but worse for most problems. To really assess things, we need to get a big suite of benchmark problems up and running, like CUTEr.

 — John