Variable instantiation too long in C code


Filip Jorissen

Apr 6, 2017, 1:54:28 PM4/6/17
to CasADi
I'm compiling a fairly large C model generated using casadi 3.1. GCC was getting rather slow and required a seemingly excessive amount of memory too (e.g. 3 GB for an 80 MB .c file?). At some point GCC would no longer finish.

Profiling using the -Q flag showed that the compilation of nlp_jac_g was the culprit. This function contains a declaration of the form:

real_t w0, *w1=w+3815, *w2=w+4082, *w3=w+4349, *w4=w+4523, *w5=w+4525, *w6=w+4701, *w7=w+4717, *w9=w+4733, *w10=w+4777, w11, *w12=w+4996, w13, *w14=w+5215, w15, *w16=w+5434, w17, .... etc etc

In total there were more than 14000 variables declared on a single line of code, which apparently made GCC choke.

After splitting the code over multiple lines the compilation worked again, much faster than before, and required much less memory too.

I'd like to suggest that the code generator put at most 100 variable declarations on a single line of code?

Joris Gillis

Apr 6, 2017, 1:58:08 PM4/6/17
to CasADi
Nice find! If such a simple trick decreases compile time, we should definitely do that.

Best,
  Joris

Filip Jorissen

Apr 6, 2017, 2:00:58 PM4/6/17
to CasADi
If you have a fix for this then I can let you know how much exactly ;)

Filip Jorissen

Apr 6, 2017, 2:16:13 PM4/6/17
to CasADi
Nvm, found it. I'll make a pull request later.

Filip Jorissen

Apr 6, 2017, 2:53:10 PM4/6/17
to CasADi
It looks like I did something wrong, since I can't reproduce my earlier results. Sorry about that.

Maybe nlp_jac_g is just too large? It contains 200 000 lines of code right now, which is one fifth of all the code.

Greg Horn

Apr 6, 2017, 3:16:36 PM4/6/17
to Filip Jorissen, CasADi
Are you using SX for your NLP, or the "expand" NlpSolver option? When my NLPs got too large I switched to MX for the NLP (you can still use SX for models, stage objectives, constraints, etc.). Also look into the map/mapAccum primitives, which compile the code into for loops instead of expanding everything.


Greg Horn

Apr 6, 2017, 3:19:05 PM4/6/17
to Filip Jorissen, CasADi
Also, could you briefly describe how you used the "-Q flag" to discover that line of code? The gcc -Q flag seems to be a help flag.

Filip Jorissen

Apr 6, 2017, 3:26:39 PM4/6/17
to CasADi
I'm using MX without expanding the code. Is there some documentation of map/mapaccum that I can look at to see if it applies?

I added the -Q flag, which outputs which function GCC is currently translating. This way I saw that translation did not proceed past a certain function (nlp_jac_g). I then looked at the function body and tried to change the implementation of what looked like a very long line of C code. But this apparently was not the issue, since the long (infinite?) compilation time persisted after spreading the declaration over multiple lines. So -Q only pointed me at the function, not at the line of code.

Filip Jorissen

Apr 6, 2017, 3:35:48 PM4/6/17
to CasADi
Based on the API description I cannot use mapaccum, since my problem is formulated slightly differently from what mapaccum requires: I eliminate certain states, such that I cannot write X[k+1] = A*X[k] + B*U[k].

Greg Horn

Apr 6, 2017, 3:52:36 PM4/6/17
to Filip Jorissen, CasADi
What does your loop look like?


Filip Jorissen

Apr 6, 2017, 4:27:33 PM4/6/17
to CasADi
Essentially:
X[k] = A^k * X0 + A^(k-1)*B*U[0] + A^(k-2)*B*U[1] + ...

Greg Horn

Apr 6, 2017, 9:01:56 PM4/6/17
to Filip Jorissen, CasADi
You can formulate that with mapAccum, or I think even map then sum. Are your A matrices constants or symbolic Jacobians?


Filip Jorissen

Apr 7, 2017, 1:36:07 PM4/7/17
to CasADi
My A matrices are DMs. I'm going to try to put that computation in a map and hope that helps! I'm not sure how to use mapaccum, but I figure a map should do.

Filip Jorissen

Apr 7, 2017, 2:08:29 PM4/7/17
to CasADi
I waited some time for the compilation to finish for a problem of 'manageable' size. This resulted in the following output when using the -Q flag:

Execution times (seconds)
 phase setup             :   0.00 ( 0%) usr   0.00 ( 0%) sys   0.01 ( 0%) wall    1094 kB ( 2%) ggc
 phase parsing           :   7.11 ( 1%) usr   5.98 (54%) sys  13.09 ( 1%) wall  378689 kB (633%) ggc
 phase opt and generate  :1198.99 (99%) usr   4.98 (45%) sys1204.81 (99%) wall 3874355 kB (6473%) ggc
...
 out of ssa              :  51.11 ( 4%) usr   0.00 ( 0%) sys  51.10 ( 4%) wall       0 kB ( 0%) ggc
 expand vars             :1041.32 (86%) usr   0.20 ( 2%) sys1042.24 (86%) wall   85033 kB (142%) ggc
 integrated RA           :  34.93 ( 3%) usr   1.14 (10%) sys  36.17 ( 3%) wall  805572 kB (1346%) ggc
...
 final                   :   8.32 ( 1%) usr   0.32 ( 3%) sys   8.64 ( 1%) wall  153117 kB (256%) ggc
 rest of compilation     :   8.51 ( 1%) usr   0.02 ( 0%) sys   8.63 ( 1%) wall  231837 kB (387%) ggc
 TOTAL                   :1206.10            10.99          1218.07              59852 kB
So the 'expand vars' section seems to take a lot of time.

The answer to this question explains that the cause is essentially that there are too many variables for the compiler to loop over, which requires O(N^2) operations, where N is the number of variables.

So when functions become too large, we're in trouble! Is it possible to automatically split functions into multiple parts in casadi? Alternatively, it could make sense to throw a warning when a function contains many variables, suggesting that the user split the function into smaller parts.

Filip Jorissen

Apr 7, 2017, 4:55:07 PM4/7/17
to CasADi
A solution could be to declare a single array of real_t instead of many separate real_t variables? This is what happens in Modelica/Dymola, where they also use one large function to compute all variables.

Filip Jorissen

Apr 7, 2017, 8:28:47 PM4/7/17
to CasADi
I managed to convert all my code to use functions and maps, but I'm still getting large C code files. I did some analysis to debug this. I'll try to explain what I did:

In my code I use 7 'maps', with various numbers of inputs and 1 output; essentially, for each discrete time index i:
1) and 2) compute inputs U[i] from parameters p[i] and optimisation variables o[i]
3) compute states X[i] from U[:] (i.e. a function of all optimisation variables) and X0 (X0 is an element of p) as follows:

int intervals = 12;
MX A  = MX::sym("A",  n_statesUsed, n_states);
MX B  = MX::sym("B",  n_statesUsed, n_inputs*intervals);
MX b  = MX::sym("b",  n_statesUsed, 1);
MX X0 = MX::sym("X0", n_states, 1);
MX U  = MX::sym("U",  n_inputs*intervals, 1);

MX Xexp = mtimes(A, X0) + mtimes(B, U) + b;
Function XFun("XFun", {A, B, b, X0, U}, {Xexp});
Function XMap = XFun.map(intervals);

4) compute inequality constraints G[i] from X[i], U[i], p[i], o[i]
5 and 6) compute 2 sets of equality constraints H1[i], H2[i] from X[i], U[i], p[i], o[i]
7) compute cost C[i] from X[i], U[i], p[i], o[i]

using these I compute constraints and cost:
constraints = vertcat(vertcat(G[:]), vertcat(H1[:]), vertcat(H2[:]))
cost += C[i] // for all i

finally:

MXDict nlp = {{"x", o}, {"f", cost}, {"g", constraints}, {"p", p}};
Function solver = nlpsol("solver", "ipopt", nlp);
solver.generate_dependencies("nlp.c");

Some vector lengths: 
X0: 276
p: 276+176*intervals
o: 16*intervals 
G: 8*intervals
H1: 44*intervals
H2: 0

The expressions contained in U, G, H are not trivial, but not incredibly complex either. They come from a Modelica model that I am parsing through JModelica, and they involve some min/max/if-else/exp/power and other expressions.

When compiling this into C code for intervals=12, I get a file size of 1.4 GB for the implementation using maps and 410 MB for an earlier version of my code without functions/maps, i.e. where U, G, H, X are all inlined into 'constraints' and 'cost' using MX::substitute(). This seems odd, since the second implementation contains a lot of code duplication. Either way, GCC won't compile the resulting file.

When translating a simpler variant of the same problem with:
X0: 276
p: 276+176*intervals
o: 16*intervals 
G: 8*intervals
H1: 0
H2: 0
and U and C essentially consisting of vertcat() only, i.e. containing no equations, the C code sizes are 920 MB and 9 MB with and without maps, respectively. This seems to suggest that the XFun/XMap implementation is somehow not very efficient.

Since looking at the code for smaller problems is more manageable, I looked at a problem with:
X0: 1
p: 1 +1*intervals
o: 1*intervals
G: 1* intervals
H1: 0
H2: 0
U contains just a few equations, C = sum(o), and intervals = 12.

In the C code, in 'fwd24_XFun_12', I find an array of size 6912 that is passed on to 'fwd24_XFun_12_map' as the 7th function input ("Input 7 (fwd_i1), part 0 (x_7)"). The XMap inputs have the following sizes:

Atot: (1, 12)
Btot: (1, 288)
btot: (1, 12)
X0tot: (1, 12)
Utot: (24, 12)

fwd_i1 seems to correspond to B, and 6912 = 24*288. One twelfth of Btot is multiplied with one twelfth of Utot, which leads to 24*24 operations. This happens 12 times, which in total gives 12*24*24 = 6912 operations. I figure this is related to the AD algorithm used.

Anyway, 'fwd24_XFun_12' is in turn called from 'nlp_jac_g', which simply passes a zero(6912, 0.) vector as an input argument. Eventually this array is used in 12 loops over function 'fwd24_XFun', each containing 24 pieces of code looping over a vector of size 24. The first of these pieces of code is pasted here:


  w6 = 0.;
  /* #12: Input 7 (fwd_i1), part 0 (f0_1) */
  if (arg[7])
    copy(arg[7], 24, w5);
  else 
    fill(w5, 24, 0);
  /* #13: Input 4 (der_i4), part 0 (U) */
  if (arg[4])
    copy(arg[4], 24, w7);
  else 
    fill(w7, 24, 0);
  /* #14: @6 = mac(@5,@7,@6) */
  for (i=0, rr=(&w6); i<1; ++i) for (j=0; j<1; ++j, ++rr) for (k=0, ss=w5+j, tt=w7+i*24; k<24; ++k) *rr += ss[k*1]**tt++;

This essentially seems to compute the first term of d(B*U) = dB*U + B*dU, which would make sense if it weren't for the fact that B is a constant and therefore dB = 0. Could this computation be avoided?
Also, is it really required to copy arg[4] into w7? Can't you just pass the address of arg[4] to the for loop?

Another odd thing is that there are 24 copies of the code that I pasted above, while I would expect you only need 1. If this is a bug (which I doubt), it would explain the huge code files I get when the size of U becomes larger, i.e. 276+176*intervals = 2388. This could lead to vectors of 2388*2388*12 = 68 million entries and 2388 times more code than required.

One other concern: in all my problems, U is a mix of optimisation variables o and parameters p. There are significantly more parameters p (2388) than optimisation variables (16). For the parameters we don't care about the derivatives, since dU = dp = 0, but they seem to be computed anyway? So I suppose I should split my matrix B into a part for 'p' and a part for 'o' to get this evaluated efficiently? Or won't this help? Can't this be automated somehow?

Either way, I'll give it a try. I'll also try to look up a reference for how AD works; that should help me figure it out. Any advice to help debug this would be appreciated, and a good reference for AD would be cool too. :)

Thanks in advance, good weekend!

Filip



Filip Jorissen

Apr 11, 2017, 7:02:22 PM4/11/17
to CasADi
I found part of the problem.

My model essentially contains a function 

Xnext = XFun(B, U) = mtimes(B, U)

B is a dense DM (16 x 2088), containing 33408 elements, and U is dense too (2088 x 1). The optimisation problem has 32*12 optimisation variables, where 12 is the number of MPC control intervals. Apparently casadi needs 64 AD sweeps to compute the derivatives of these 384 variables. Function XFun is called in a map with 12 iterations.

To compute the derivative, the C code function "fwd64_XFun_12" is called with a zero vector for the 'derivatives' of B, of size 33408*64*12 = 25 657 344. Since B is dense, the corresponding sparsity vector has size 25 657 344. The resulting declaration of this sparsity vector in the C code has a length of 245 462 379 characters. Assuming 1 byte per character, this adds up to roughly 250 MB of code for one sparsity vector.

I see two solutions for this problem, but I'm not sure how straightforward they are to implement:
1) support dense matrices and vectors
2) add a flag to casadi functions that allows the user to indicate that derivatives with respect to B should never be computed. The input vector should then not be required.

I can of course reformulate my problem to mitigate this, but in the future I'd like to go from 12 to 24 intervals, 16 to 100 states, and 2088 to 20000 inputs, which would increase the vector size by a factor of 120, so this does not look like a good solution.

How can this be solved properly?

Thanks!

