Code generation for NLPs - results and questions


Adrian Bürger

Aug 6, 2018, 12:52:53 PM
to CasADi
Hello everyone,

I've been tinkering with CasADi C code generation for an NLP (the discretization of an OCP using direct collocation, for use within MPC). Since it might be interesting for others as well, I wanted to share my results/questions and ask whether you have ideas and/or experience on whether and how the solution speed could be improved further.

Problem description

The information for the NLP as displayed by IPOPT is as follows:

    Number of nonzeros in equality constraint Jacobian...:    83419
    Number of nonzeros in inequality constraint Jacobian.:    10795
    Number of nonzeros in Lagrangian Hessian.............:    25423

    Total number of variables............................:    18450
                         variables with only lower bounds:     1350
                    variables with lower and upper bounds:     1125
                         variables with only upper bounds:        0
    Total number of equality constraints.................:    12600
    Total number of inequality constraints...............:     4950
            inequality constraints with only lower bounds:      675
       inequality constraints with lower and upper bounds:     3600
            inequality constraints with only upper bounds:      675

Runtime comparison

The following table shows the IPOPT solution time on an i5-4570 CPU @ 3.20GHz for the NLP when using MX variables, SX variables, and C code compiled with different compilers and options (in addition to -O3 and the option -fno-omit-frame-pointer, which I was advised to include):

+------------------------------------------+-------------------------------------+----------------------------+------------+---------------------+----------------+
|Mode                                      |(1): Time in function evaluation* (s)|(2): Total solution time (s)|(3): (1)/(2)|(4): IPOPT iterations|(5): (2)/(4) (s)|
|------------------------------------------+-------------------------------------+----------------------------+------------+---------------------+----------------|
|1) MX                                     |                               103.8 |                       141  |      0.7362|                  381|          0.3701|
|2) gcc-7 -march=haswell                   |                                 7.69|                        55  |      0.1398|                  482|          0.1141|
|3) gcc-7 -march=x86-64                    |                                 5.82|                        42.6|      0.1366|                  381|          0.1118|
|4) clang-6 -march=x86-64                  |                                 3.76|                        40.7|      0.0923|                  381|          0.1068|
|5) clang-6 -march=x86-64 -fno-math-errno  |                                 3.72|                        40.6|      0.0917|                  381|          0.1066|
|6) clang-6 -march=haswell                 |                                 3.72|                        40.7|      0.0914|                  381|          0.1068|
|7) clang-6 -march=haswell -fno-math-errno |                                 3.78|                        40.6|      0.0932|                  381|          0.1066|
|8) SX (expand=True)                       |                                 3.52|                        65.3|      0.0539|                  620|          0.1053|
+------------------------------------------+-------------------------------------+----------------------------+------------+---------------------+----------------+

* Time in function evaluation according to IPOPT output

Obviously, using MX is by far the slowest option regarding duration per IPOPT iteration, as shown in (5). The next best results were obtained using C code compiled by gcc-7. C code compiled with clang-6 achieved nearly the same duration per iteration (5) as using SX variables. Neither compiling architecture-specific code nor using -fno-math-errno (which was merely a guess that it might have a positive effect) had a noticeable impact on the runtime.

Questions regarding the runtime comparison:
  • The time in function evaluation (1) as shown by IPOPT is much smaller for SX than for the other modes. However, this value appears incorrect, since the computed duration per iteration (5) is the same as in other modes of comparable speed. Might the time in function evaluation be computed differently for SX variables?
  • Interestingly, the number of iterations taken by IPOPT for 2) and 8) differs from that of all other modes. For 2) this might be a result of this particular compilation mode (I think?). For SX, can this be a result of the expansion of expressions, or is this strange behaviour?

Compiler time and memory usage

For creating the C code, I took care that the expression graph stays small (cf. https://groups.google.com/forum/#!topic/casadi-users/A-3AWqnt-RU). However, I was not able to generate a C file containing the nlpsol dependencies with a total size of less than 22.8 MB (including comments; 13.8 MB without comments). Major parts of the code consist of calls to the function casadi_copy(). The source code file(s) are attached to this post.

When using gcc-7, compilation takes approx. 2 hours on a Xeon E5-2620 v3 CPU @ 2.40GHz with a peak memory usage of approx. 30 GB. Compilation with clang-6 has a peak memory usage of only 3 GB, but takes between 18 and 20 hours to finish on an i5-4570 CPU @ 3.20GHz.

Questions regarding compiler time and memory usage:
  • Can you think of a reason for these frequent calls to the casadi_copy function? Are they intended, or can certain formulations avoid them? It doesn't seem to make a difference (at least none visible to me) whether I set up the NLP from a single, larger vector of MX variables or from a concatenation of smaller MX vectors.
  • Are there further compiler options which could be used to speed up either a) the execution time (more important) or b) the compilation time (less important), especially in the case of clang? Regarding a), I guess using -ffast-math would be a bad idea (at least with all the options it implies, right?). Regarding b), would it be possible to speed up compilation while keeping the high optimization level -O3?
I hope this information is useful for others as well, and I would be glad if you could provide some answers/hints for the questions stated above. Thank you very much!
nlp.zip

Joris Gillis

Aug 7, 2018, 9:33:39 AM
to CasADi
Dear Adrian,

Your table implies that in all cases, the time spent in the solver (probably the factorisation of the linear system) is dominant over the time spent in function evaluations.
Column 5 is then just converging to the mean cost of an iteration when function evaluations are free?
It also implies you should pick any of 2-8, and focus your efforts on bringing down the factorisation time (what other linear solvers have you tried?).

SX versus MX may not preserve order of operations, which leads to slight differences due to floating point arithmetic.

The generic advice is to keep your graphs compact by having a hierarchy of Functions (see also http://docs.casadi.org/#for-loop-equivalents ).
There's also a memory-speed trade-off internal to CasADi that may be influenced with

GlobalOptions.setMaxNumDir(n);

or with the 'max_num_dir' option. Both are fragile; the GlobalOptions variant is less likely to break.

The default n = 64 gives fast evaluation, but possibly large graphs (-> large generated C code -> slow compilation).
n = 1 gives the most compact C code, but possibly slower evaluation.

Best,
  Joris

Adrian Bürger

Aug 8, 2018, 2:00:52 PM
to CasADi
On Tuesday, August 7, 2018 at 3:33:39 PM UTC+2, Joris Gillis wrote:
Dear Adrian,

Your table implies that in all cases, the time spent in the solver (probably the factorisation of the linear system) is dominant over the time spent in function evaluations.
Column 5 is then just converging to the mean cost of an iteration when function evaluations are free?
It also implies you should pick any of 2-8, and focus your efforts on bringing down the factorisation time (what other linear solvers have you tried?).


Dear Joris,

thank you very much for your answer and for pointing this out to me! The original intention of using C code generation was to be able to use spline interpolation without being restricted to MX variables (cf. https://groups.google.com/forum/#!topic/casadi-users/YzW2-zlaYks). Not being very familiar with C and its compilers, I thought that compiling with smarter flags (possibly already known to you :-)) might have a further positive impact without having to change the overall setup. But of course you are right: the evaluation times are already small now, and the main time is spent in factorization.

The comparison in the first post was done using ma27. It appears that ma27 already does a pretty good job regarding time per iteration, and ma57 (without automatic scaling) and ma97 result in more or less equal speed. The other available HSL solvers, ma77 and ma86 (with and without multithreading), as well as mumps, are all slower by a factor of 1.5 to 2.5. According to the description at http://www.hsl.rl.ac.uk/ipopt/, I guess ma27, ma57 and maybe also ma97 match my problem well. I have neither Pardiso nor WSMP, but I read that those are more suited for larger problems. Are there other choices available? When formulating the NLP, I took care to preserve sparsity.
 
SX versus MX may not preserve order of operations, which leads to slight differences due to floating point arithmetic.

Thanks for the clarification!
 
The generic advice is to keep your graphs compact by having a hierarchy of Functions (see also http://docs.casadi.org/#for-loop-equivalents ).

I tried to use Functions thoroughly, but I didn't use maps here yet. Would you (apart from faster compilation) also expect an advantage from using maps for problems of this size, e.g. through multithreading?
 
There's also a memory-speed trade-off internal to CasADi that may be influenced with

GlobalOptions.setMaxNumDir(n);

or with the 'max_num_dir' option. Both are fragile; the GlobalOptions variant is less likely to break.

The default n = 64 gives fast evaluation, but possibly large graphs (-> large generated C code -> slow compilation).
n = 1 gives the most compact C code, but possibly slower evaluation.

That's nice to know! So, as an alternative to using MX, I could generate smaller C code during development that compiles faster, and larger code for production that takes longer to compile but runs faster. Is there a (reasonable) upper limit for this option?

Best regards, and thanks again for your support!

Adrian 
 
