Automatic vs. symbolic differentiation: CasADi vs. MATLAB's Symbolic Toolbox, Octave's symbolic package, and Python's SymPy.


Dave Goel
Aug 31, 2017, 1:02:26 PM
to CasADi

Casadi uses automatic differentiation (AD). For some reason, almost everything else uses symbolic differentiation ("SD" below). Aren't they basically the same thing? No!

Let me illustrate via a simple example, and in plain English!


In your algebra class, your teacher posed this question: y = sin(sin(...(x))), nested many times deep. z = y^3 - cos(y) + y. What is z?

If you (correctly!) answered z = y^3 - cos(y) + y, where y = sin(sin(...(x))), you'd get a big fat 0, since you just repeated the question back to the teacher. For all real-life applications, your answer is THE correct way. The teacher, however, expects you to write down the entire expression for z in terms of x. It is unwieldy to write down. Worse, if you then computed it for, say, x = 1.2, you'd be computing sin(sin(...(x))) three times, once for each occurrence of y! You can quickly imagine this graph exploding in real-life situations.


You used a "where" trick. AD uses your "where" trick by default, everywhere. I don't know why you wouldn't. It's not even a "trick"; it's a very natural thing to do. After all, every computer operation is a composition of a few basic operations such as a^b, a*b, sin(a), exp(a), etc. All you've got to do is apply basic calculus at the level of each such operation.
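Here is a minimal sketch (plain Python, no libraries; the function name is made up) of applying the chain rule at the level of each intermediate "where" variable, for the example above with two nested sins:

import math

def z_and_dz(x):
    s  = math.sin(x);  ds = math.cos(x)        # s = sin(x), and ds/dx
    y  = math.sin(s);  dy = math.cos(s) * ds   # y = sin(s), chain rule
    z  = y**3 - math.cos(y) + y                # y is computed ONCE and reused
    dz = (3*y**2 + math.sin(y) + 1) * dy       # chain rule again
    return z, dz

print(z_and_dz(1.2))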

SD, on the other hand, doesn't use "where" substitution. Symbolic believes in writing down the entire humongous expression z(x), and then computing it for x=1.2. That leads to an exponential number of repetitions of computing sin(sin(...(x))).  
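You can watch this blow-up directly in SymPy; a small sketch that counts operations in the written-out derivative (sympy.count_ops) rather than timing anything:

import sympy as sp

x = sp.Symbol('x')
y = x
for _ in range(4):                     # nesting depth k = 4
    y = sp.sin(y + y**3 / 10)
z = y**4 * sp.log(y**2) * sp.exp(y**3 - sp.tanh(y))

h = sp.diff(z, x)                      # the fully written-out derivative
print(sp.count_ops(h))                 # grows explosively with k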


Symbolic seems, to me, to be an unfortunate hold-over from that old middle school algebra class. I believe that some packages such as theano and tensorflow do try to use "where" substitutions, but they do so "after the fact." They first write the whole symbolic expression, and then go looking for nice "where" substitutions in that humongous graph. It does take them a long time, therefore,  to initially "set up" your expression. (I could be wrong here.)


So, MATLAB provides a nice symbolic toolbox. Octave provides the same as a package, called "symbolic". Octave's symbolic package uses Python's sympy under the hood, which uses SD.

Let's run a simple test. We will compose an unwieldy y(x) and then a z(y), and ask for each package to differentiate z wrt x.


Here's how it works for Casadi (Octave syntax). Casadi's run-time is almost independent of k! Casadi handles it just fine:

import casadi.*  ## assumes CasADi is on the path (prefix names with casadi. if import is unavailable)

k = 50; ## 0.004s to compute AND to display.
x = MX.sym('x');
y = x + 0;
for ii = 1:k;  y = sin(y + 0.1*y^3); endfor
z = y^4 * log(y^2) * exp(y^3 - tanh(y));

tic; h = gradient(z, x); toc  ## 0.004s

F = Function('f', {x}, {h});
tic; answer = F(1.2)  ## -0.024 when k=4
toc


Now, let's try the same thing with sympy/symbolic:


pkg load symbolic  ## Octave's symbolic package (sympy under the hood)

k = 6;
x = sym('x');
y = x + 0;
for ii = 1:k;  y = sin(y + y^3/10); endfor  ## y^3/10 rather than 0.1*y^3 keeps it exact
z = y^4 * log(y^2) * exp(y^3 - tanh(y));
## disp(z) ## Even at k = 4, a LARGE expression.
tic; h = gradient(z, x); toc
tic;
answer = double(subs(h, 6/5))  ## -0.024 for k = 4; 6/5 is the exact rational for 1.2
toc







So, there we are. We can see how sympy's cost is exponential in k.
For k = 5, casadi takes 0.004s for defining and computing. 
For k = 5, sympy takes 13s to define and 25s to compute. That's 6000 times slower!

For k = 6, casadi continues to take 0.004 and 0.004s. 
For k = 6, sympy takes 42s to define, and 96s to compute. That's 24,000 times slower!

As you can see, we dare not increase k much further for sympy, whereas with casadi we can increase k to stupidly large values!

For k = 100, casadi takes 0.01s to define and 0.004s to compute.
For k = 300, casadi takes 0.02s to define and 0.004s to compute.

Jonas Koenemann
Aug 31, 2017, 2:58:11 PM
to CasADi
Symbolic toolboxes are good at simplifications :)

>> syms x y; x^6/y*y^3/x/x/x/x/y/y/x/x
 
ans =
 
1

>> x = casadi.SX.sym('x'); y = casadi.SX.sym('y'); x^6/y*y^3/x/x/x/x/y/y/x/x

ans = 

SX(((((((((((sq((x*sq(x)))/y)*(y*sq(y)))/x)/x)/x)/x)/y)/y)/x)/x))

Dave Goel
Aug 31, 2017, 3:14:34 PM
to Jonas Koenemann, CasADi
Indeed. Casadi does provide some of that functionality via 'simplify' if it's used judiciously :)


Joel Andersson
Sep 1, 2017, 11:06:42 AM
to CasADi
Thanks for a nice illustration of AD vs. symbolic differentiation!

About simplifications, that's definitely an area where CasADi could improve. Probably not to the level of a computer algebra system like sympy, but enough to speed up a lot of user applications. I wrote down some details about this in the following GitHub issue https://github.com/casadi/casadi/issues/2069. It's not something that I'm planning to work on anytime soon, unless it's needed in a project I'm taking part in. Also note that the "simplify" command doesn't do much at the moment.

Joel

Dave Goel
Sep 1, 2017, 1:40:24 PM
to CasADi
Responding to both Jonas and Joel. Joel, thanks for the new github issue. 

It's probably worth mentioning that, when unsure, it's best to err on the side of not simplifying rather than the other way. I almost think of simplification as a non-issue.

In your big computation graph, a non-simplified expression somewhere deep inside adds a constant amount of time; it's not even a multiplicative factor.

You'd be hard-pressed to convert it into a multiplicative cost. Even harder to make it into an exponential cost like the one my example showed.  (Barring the one case where the whole global graph simplifies to something like 0 or 1.)






Filip Jorissen
Sep 4, 2017, 3:13:54 AM
to CasADi
Hi Dave,

thanks for sharing these examples, quite useful :) I'm just wondering whether this is not partly caused by the SD _implementation_ rather than SD itself. E.g. when differentiating the example expression symbolically, a LOT of common subexpressions are formed. Casadi exploits this by evaluating every common subexpression only once. Sympy might not, which leaves a lot of optimisation potential.
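(Sympy does in fact ship a common-subexpression pass, sympy.cse, that can be applied after the fact; a small sketch on a toy instance of the benchmark expression:)

import sympy as sp

x = sp.Symbol('x')
y = sp.sin(sp.sin(x + x**3 / 10))      # a small instance of the nested expression
z = y**4 * sp.log(y**2)
h = sp.diff(z, x)

repl, reduced = sp.cse(h)              # pulls shared subexpressions out of h
for s, sub in repl:
    print(s, '=', sub)
print(reduced[0])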

I'd like to share something somewhat related too, based on discussions with Joris Gillis earlier this week. I'm working on an NLP that contains about 40 million expressions (i.e. the SX-generated code is 40 million lines long - 1 GB of code). My experience is that CasADi is still very fast at handling all the expressions, but the generated code was not scaling as expected: the evaluation of the Jacobian and the Hessian was respectively 20 and 64 times more costly than a constraint evaluation (see below). My 'intuition', however, told me that the constraint Jacobian evaluation time should be of the same order as the constraint evaluation time.

                   proc           wall      num           mean             mean
                   time           time     evals       proc time        wall time
        nlp_f    14.853 [s]     14.862 [s]  1301      11.42 [ms]       11.42 [ms]
        nlp_g    14.115 [s]     14.121 [s]  1302      10.84 [ms]       10.85 [ms]
   nlp_grad_f    15.473 [s]     15.481 [s]   979      15.80 [ms]       15.81 [ms]
    nlp_jac_g  2513.154 [s]   2514.171 [s]  1086    2314.14 [ms]     2315.07 [ms]
   nlp_hess_l  6753.127 [s]   6755.846 [s]  1046    6456.14 [ms]     6458.74 [ms]
 all previous  9310.723 [s]   9314.480 [s]
callback_prep     0.017 [s]      0.017 [s]   959       0.02 [ms]        0.02 [ms]
       solver  4345.156 [s]   4346.881 [s]
     mainloop 13655.896 [s]  13661.378 [s]

This also got me thinking about AD vs SD. I figured that SD would be more efficient and with Joris I found an example that 'illustrates' this.

Suppose we have constraints g(x) = Ax, where A is a square n-by-n matrix filled with ones. The evaluation cost of g(x) is then O(n^2). Forward/backward AD would require n sweeps of an expression that costs O(n^2), so the total evaluation cost is O(n^3). With SD you'd be able to figure out during preprocessing that d(g(x)_i)/dx_j = A_i,j and then simply store jac_g = A. Storing A is an expression that takes O(n^2).
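(A small check of this example in casadi's Python interface; as discussed below, the symbolic preprocessing recovers the constant Jacobian:)

import casadi as ca
import numpy as np

n = 5
A = ca.DM.ones(n, n)          # the all-ones matrix from the example
x = ca.SX.sym('x', n)
g = ca.mtimes(A, x)           # g(x) = A x
J = ca.jacobian(g, x)         # preprocessing figures out jac_g == A
print(ca.Function('J', [x], [J])(np.zeros(n)))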

This is only half the story of course: this is a very specific example that is 1) linear, 2) square, 3) dense. For non-square, sparse problems, AD can probably do the evaluation more efficiently. Moreover, from what Joris tells me, CasADi does a pre-processing (only when using SX?) that uses AD symbolically to figure out that d(g(x)_i)/dx_j = A_i,j. So the pre-processing is O(n^3) but the actual evaluation time is O(n^2). Nonetheless, I'm seeing a difference in computation time between g(x) and jac(g(x)) as indicated above. At this point I don't know whether this is normal. However, by reformulating my problem (adding optimisation variables and therefore increasing sparsity), I now have cost(jac(g)) = 3 cost(g) instead. I'm not sure whether this changed performance indicates a bug in the dense formulation or whether it's just a logical consequence of how AD works..

So I agree that casadi/AD works very well but we should not dismiss SD too casually :) Although I do think it would be hard to beat casadi =)

Dave Goel
Sep 4, 2017, 3:56:46 AM
to CasADi
Hi Filip,

Thanks for sharing. I'm out of my depth when it comes to NLPs, but curious: what's the rough size of your parameter space (number of variables) that generates compiled code that big? Also curious: what kind of problem would call for such a huge NLP?

Do you use SX rather than MX because it takes more memory but is otherwise faster computation-wise?  

In my limited experience with casadi + neural networks, where I have some 400k free parameters (organized into 10 matrices), SX is indeed faster computation-wise. But it is much, much slower during definition. So much so that I can finish my entire optimization in 10-60 minutes via the "slow" MX computation; that's less than the time I'd spend merely defining the problem via SX. It is also for this reason that compiling-to-C doesn't seem to be of much benefit for me.


Finally, when you are a hammer, everything looks like a nail: I always wonder if I can (approximately) cast NLPs into simple cost-minimization problems, for which super-fast first-order solutions (Adam, etc.) are available. That is, to my cost function, I add a large (differentiable) cost for every missed constraint, end up with a scalar cost, and simply minimize away. For example, for a constraint such as x+y<3, I add a cost 1000 * (k>0) * k, where k = x+y-3. The NLP world must have surely thought of this (approximate) method and rejected it?
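(A minimal sketch of that construction in casadi's Python interface; the objective here is a made-up placeholder:)

import casadi as ca

x = ca.SX.sym('x')
y = ca.SX.sym('y')
objective = (x - 2)**2 + (y - 2)**2   # hypothetical original cost
k = x + y - 3                         # violation of the constraint x + y < 3
penalty = 1000 * (k > 0) * k          # zero when feasible, linear outside
total = objective + penalty           # a single scalar to minimize
print(ca.gradient(total, ca.vertcat(x, y)))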



Dave

Filip Jorissen
Sep 4, 2017, 4:26:17 AM
to CasADi
Hi Dave,

I have a (fairly dense) problem with (order of magnitude) 2000 optimisation variables and 2000 constraints. I'm optimising the control of a building.

Joris suggested trying SX to see how it performs relative to MX. The flat code made it easier to debug what casadi was actually doing. Based on this I simplified the problem, and I'm currently running some benchmarks, which suggest that MX is faster.

I haven't noticed a performance penalty for the NLP expansion, but compilation is indeed a problem. However, I need the code to be portable, so I have to generate code. For the SX formulation the code is especially large, and I had to change casadi to split the code over multiple files of at most 100k lines to avoid excessive compilation times and memory usage (>100 GB) with GCC. Compilation then takes 30 s per file with -O0.

I have thought about the cost-minimisation approach too. For some of my (thermal comfort) inequality constraints this is certainly an option, but I haven't tested it yet. For other (equality) constraints (continuity of state variables) it is less obvious and moreover it's not easy to determine what fixed constant the cost entries should be multiplied with. But it's certainly something I'd like to look into at some point! 
If you look at what ipopt does, it also implicitly adds constraints to the cost function by multiplying them with Lagrange multipliers. The difference is that the Lagrange multipliers are variables in that approach. Hence, they also need to be computed, which augments the linear system of equations that needs to be solved. This increases problem size and therefore computation time. So in cases where we don't care much about the exact values of the Lagrange multipliers, I definitely see an advantage, computation-time-wise. A disadvantage is that the large cost entries make your problem badly scaled; I'm not sure whether this is different when using Lagrange multipliers. Also, using things like (k>0)*k introduces a non-differentiability, which could (also) introduce convergence problems. I'm sure that many people have thought of the same approach, but whether you can gain a lot from it probably depends on the case.

Filip Jorissen
Sep 4, 2017, 5:52:29 AM
to CasADi
Some benchmark results for a toy problem:

Number of nonzeros in equality constraint Jacobian...:    21048
Number of nonzeros in inequality constraint Jacobian.:      144
Number of nonzeros in Lagrangian Hessian.............:     1224

Total number of variables............................:      648
                     variables with only lower bounds:       72
                variables with lower and upper bounds:      216
                     variables with only upper bounds:        0
Total number of equality constraints.................:      432
Total number of inequality constraints...............:      144
        inequality constraints with only lower bounds:      144
   inequality constraints with lower and upper bounds:        0
        inequality constraints with only upper bounds:        0


All statistics are for C code generated for an NLP with expand=true/false, compiled with GCC.
Very important remark: I ran these benchmarks while my PC was also under heavy load from some other optimisations. So the system was operating at some performance bottleneck, presumably memory bandwidth, while running this.

SX (expand=true), -O0 ==> 36.90 MB .so file

                   proc           wall      num           mean             mean
                   time           time     evals       proc time        wall time
        nlp_f     0.006 [s]      0.006 [s]    14       0.40 [ms]        0.40 [ms]
        nlp_g     0.034 [s]      0.034 [s]    15       2.29 [ms]        2.30 [ms]
   nlp_grad_f     0.009 [s]      0.009 [s]    15       0.63 [ms]        0.63 [ms]
    nlp_jac_g     0.060 [s]      0.060 [s]    15       3.99 [ms]        3.99 [ms]
   nlp_hess_l     0.113 [s]      0.113 [s]    13       8.67 [ms]        8.67 [ms]
 all previous     0.222 [s]      0.222 [s]
callback_prep     0.000 [s]      0.000 [s]    14       0.02 [ms]        0.02 [ms]
       solver     0.175 [s]      0.175 [s]
     mainloop     0.397 [s]      0.397 [s]


SX (expand=true), -O2 -march=native ==> 11.44 MB .so file

        nlp_f     0.002 [s]      0.002 [s]    14       0.14 [ms]        0.14 [ms]
        nlp_g     0.011 [s]      0.011 [s]    15       0.71 [ms]        0.72 [ms]
   nlp_grad_f     0.004 [s]      0.004 [s]    15       0.25 [ms]        0.25 [ms]
    nlp_jac_g     0.022 [s]      0.022 [s]    15       1.44 [ms]        1.44 [ms]
   nlp_hess_l     0.047 [s]      0.047 [s]    13       3.59 [ms]        3.59 [ms]
 all previous     0.085 [s]      0.085 [s]
callback_prep     0.000 [s]      0.000 [s]    14       0.02 [ms]        0.02 [ms]
       solver     0.173 [s]      0.174 [s]
     mainloop     0.259 [s]      0.259 [s]

 
 
 
MX, -O0 ==> 3.4 MB .so file

                   proc           wall      num           mean             mean
                   time           time     evals       proc time        wall time
        nlp_f     0.001 [s]      0.001 [s]    14       0.06 [ms]        0.06 [ms]
        nlp_g     0.006 [s]      0.006 [s]    15       0.37 [ms]        0.37 [ms]
   nlp_grad_f     0.002 [s]      0.002 [s]    15       0.12 [ms]        0.12 [ms]
    nlp_jac_g     0.029 [s]      0.029 [s]    15       1.95 [ms]        1.96 [ms]
   nlp_hess_l     0.167 [s]      0.167 [s]    13      12.83 [ms]       12.83 [ms]
 all previous     0.204 [s]      0.204 [s]
callback_prep     0.000 [s]      0.000 [s]    14       0.02 [ms]        0.02 [ms]
       solver     0.159 [s]      0.159 [s]
     mainloop     0.363 [s]      0.363 [s]

 
MX, -O2 -march=native ==> 2.04 MB .so file

                   proc           wall      num           mean             mean
                   time           time     evals       proc time        wall time
        nlp_f     0.000 [s]      0.000 [s]    14       0.03 [ms]        0.04 [ms]
        nlp_g     0.004 [s]      0.004 [s]    15       0.26 [ms]        0.26 [ms]
   nlp_grad_f     0.001 [s]      0.001 [s]    15       0.07 [ms]        0.07 [ms]
    nlp_jac_g     0.017 [s]      0.017 [s]    15       1.15 [ms]        1.15 [ms]
   nlp_hess_l     0.088 [s]      0.088 [s]    13       6.75 [ms]        6.76 [ms]
 all previous     0.110 [s]      0.110 [s]
callback_prep     0.000 [s]      0.000 [s]    14       0.02 [ms]        0.02 [ms]
       solver     0.165 [s]      0.165 [s]
     mainloop     0.276 [s]      0.276 [s]


Conclusions:
1) -O2 kicks ass, especially for SX
2) MX is faster for the objective evaluation and constraint evaluation.
3) MX is slower for the Jacobian and Hessian, which are more expensive.
4) You might therefore say that SX is better, but!:
5) For SX the grad_f/jac_g/hess_lag evaluation is about a factor a/b/c=1.5/2/4 slower than the f/g/g evaluation, while for MX this is a factor d/e/h=2/5/30.

My experience is that a/b/d do not change much with problem size and c/e/h increase with problem size. This makes SX especially suitable for the larger problems that I am dealing with. The big question is: why? I suspect that MX is faster for the f/g evaluations simply because the code is more compact (due to for loops) and therefore less code needs to be transferred from memory to the CPU for execution, which seems to be the bottleneck (see ***).  The question is then why does the evaluation of jac_g and hess_l become so slow? I suspect that MX simply generates less efficient code for the derivatives than what it does for SX. Whether this is a bug, I do not know, but it could be something interesting to look into further..


***My system has DDR4 2400 MHz memory, which has a bandwidth of about 30 GB/s. I have 4 DDR4 slots in quad channel, so I suspect the actual bottleneck is the CPU memory bandwidth of 76.8 GB/s. Seven optimisation processes were running, so that's about 11 GB/s per process. The compiled code for g is 1.4 MB. Transferring this code to the CPU thus requires about 0.13 ms. Executing the 88k lines of code takes only ~0.02 ms on a 3.8 GHz CPU, and possibly even less if multiple lines can be combined using the advanced instruction sets. The total time required for evaluating nlp_g is 0.71 ms, so these orders of magnitude seem reasonable. So memory bandwidth seems to be the bottleneck. When running 12 instead of 1 optimisations in parallel, the evaluation time per process goes up by a factor of 6, which also suggests that the CPU is not the bottleneck.

Joel Andersson
Sep 4, 2017, 11:07:53 AM
to CasADi
Hi Filip,

The cost of calculating the Jacobian cannot in general be assumed to be of the same order as calculating the original function. AD only offers this guarantee for Jacobian-times-vector products (forward mode AD) or vector-times-Jacobian products (reverse mode AD). For full Jacobians, AD frameworks like CasADi use different heuristics that usually try to exploit sparsity. In the case of CasADi, a graph coloring approach is used. So if your Jacobian calculation is very slow compared to the original function, you should have a look at the sparsity pattern of the Jacobian to figure out why this might be the case. You can also enable the "verbose" option for the function, so that it prints out the result of the graph coloring algorithm. You can also call the function Sparsity::spy_matlab to get a MATLAB script that visualizes the sparsity pattern, post the resulting figure in this forum, and we can try to help you :)
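For reference, a small sketch (Python interface) of pulling out and inspecting a Jacobian sparsity pattern; 'jac_sparsity.m' is an arbitrary file name:

import casadi as ca

x = ca.SX.sym('x', 3)
g = ca.vertcat(x[0] * x[1], x[1] + x[2])
J = ca.jacobian(g, x)

print(J.sparsity())                         # dimensions and nonzero count
J.sparsity().spy()                          # ASCII spy plot in the console
J.sparsity().spy_matlab('jac_sparsity.m')   # writes a MATLAB visualization script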

About SX vs. MX: MX has the potential to scale much better than SX, but it's also more unforgiving of an unsuitable problem formulation. So you need to be more careful in how you formulate the problem. To get the best speed, you can use a combination of both, most easily done via the "expand" command, which converts MX into SX. Usually, you should convert just the lowest level of your algorithm into SX, not the whole NLP graph.
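A sketch of both routes to the expansion (Python interface; nlp stands for the usual problem dict):

import casadi as ca

x = ca.MX.sym('x', 2)
f = ca.Function('f', [x], [ca.sin(x[0]) + x[1]**2])
f_sx = f.expand()        # the same mapping, rebuilt on SX

# For an NLP solver, the same effect is available as an option:
# solver = ca.nlpsol('solver', 'ipopt', nlp, {'expand': True})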

Joel

Dave Goel
Sep 4, 2017, 1:49:38 PM
to Filip Jorissen, CasADi
Hi Filip,

Thanks for explaining. Again, I'm out of my depth, but that won't stop me from offering some comments below. Take them with a grain of salt :), but if any of this ends up being helpful to you, I'd be very happy :)


If nothing else works out, you might look more seriously at the cost-minimization method. I understand that k*(k>0) is technically not differentiable at 0, but in my experience, that doesn't matter. Besides, the step (k>0) can easily be softened (for example, via x/(1+abs(x))), but I never find the need for that. When I want to impose constraints on my (independent or dependent) variables, I actually prefer (k>0) * (k+k^2) instead. One might be tempted to use things like (exp(k)-1)*(k>0), but exp can blow up.
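(A sketch of these variants in casadi's Python interface; soft_step is one plausible reading of the x/(1+abs(x)) suggestion:)

import casadi as ca

k = ca.SX.sym('k')                    # constraint violation; feasible iff k <= 0
p_linear    = (k > 0) * k             # kinked at 0
p_preferred = (k > 0) * (k + k**2)    # the (k>0)*(k+k^2) variant
soft_step   = 0.5 * (1 + k / (1 + ca.fabs(k)))   # a smooth stand-in for (k > 0)
p_soft      = soft_step * k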

The ML community has discovered super-fast first-order methods (no Hessian needed!). Also, even if you use second-order methods, a Hessian looks like a big O(n^2) object, but in reality it is always contracted against a vector, and AD systems manage to compute the contracted Hessian fast!
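(For instance, a Hessian-times-vector product can be formed without ever building the full Hessian; a casadi sketch:)

import casadi as ca

x = ca.SX.sym('x', 3)
v = ca.SX.sym('v', 3)
f = ca.dot(ca.sin(x), x)       # some scalar objective
g = ca.gradient(f, x)          # one reverse sweep
Hv = ca.jtimes(g, x, v)        # forward-over-reverse: H(x) @ v, no full H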

In principle, you have to worry about saddle points, about millions of local minima, and about function convexity. *In practice*, the upshot in the machine-learning world seems to be that:

* Large N is actually a boon!  The more, the merrier! More is more!

* With large N, (almost all) minima tend to be very close to the global one.  No real need to worry about convexity.  Besides, almost all extrema tend to be saddle points rather than minima. Speaking of: 

* Saddle points are not much of a problem at all, especially with methods like Adam+Perturbation. (No real need to worry about convexity!) 

The above are not foolproof laws, but often hold. There's some theory behind them as well. 

As a reminder, my N is 400k right now, and with MX, things converge for me to 100% in-sample accuracy within an hour, no -O3 or even -O0 needed.  I do understand that our problem-structures might be very different.



Chen Yutao
Sep 5, 2017, 1:27:48 PM
to CasADi, filip.j...@gmail.com
My experience is that the MX structure makes expressions really compact, and MX functions compile really fast. However, MX is not that suitable for block-wise operations, e.g. partial element access and assignment in a big matrix. Another problem is that MX usually does not simplify expressions, which can lead to numerical problems. For example, for the expression x/x, SX simplifies it to 1 but MX keeps the original structure. So given x=0 as an input, an MX function will return NaN where SX can return 1. This is very frustrating when integrating and computing sensitivities of dynamic systems modeled by very complex expressions. It is hard to check by hand whether there will be very small values for intermediate states that may cause numerical problems.
I hope this can be solved in the future.
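(This is easy to reproduce in the Python interface; a quick sketch, with the printed values as described above:)

import casadi as ca

xs = ca.SX.sym('x')
print(ca.Function('f', [xs], [xs / xs])(0))   # SX simplifies x/x: prints 1

xm = ca.MX.sym('x')
print(ca.Function('g', [xm], [xm / xm])(0))   # MX keeps the division: prints nan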

Yutao

Dave Goel
Sep 5, 2017, 1:46:53 PM
to Chen Yutao, CasADi, Filip Jorissen
Here are some common-sense rules I like to follow: 

Add 1e-100 to all denominators.
  
Any time you have x^(fractional power), add 1e-100 to x (because some derivative could end up with x in a denominator).

Replace any log(x)  by log(max(x,1e-100)). 

Replace any exp(x) by exp(min(x, 700)) (double-precision exp overflows to Inf just past x = 709). Same goes for things like tanh, etc., because they involve exp.

My programs immediately error out if they detect a single NaN or Inf. If I get a NaN, I did something wrong above.
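Written out in casadi's Python interface, these guards might look like this sketch:

import casadi as ca

x = ca.SX.sym('x')
safe_div = 1 / (x + 1e-100)            # guard denominators
safe_pow = (x + 1e-100)**0.5           # guard fractional powers
safe_log = ca.log(ca.fmax(x, 1e-100))  # guard log at/below zero
safe_exp = ca.exp(ca.fmin(x, 700))     # keep exp below double-precision overflow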





Dave Goel
Apr 16, 2018, 1:50:03 AM
to CasADi
CASADI vs SYMPY vs THEANO vs TENSORFLOW:

So, I also reproduced these examples in theano and tensorflow. (Code below). I verified that for each calculated k, all four implementations produce identical results. 

## For k = 500:
## CASADI: 0.03s   [independent(!!) of k??????]
## SYMPY:  forever [exponential in k]
## THEANO: forever [exponential in k]
## TF:     30s     [linear in k]

The real test comes when you make k large. As you might remember, sympy choked very rapidly as you increased k beyond 5, and the behavior is exponential. 

It seems that theano suffers from the same problem.

Someone told me that tensorflow does use AD, and upon testing, it shows.

While casadi magically seems constant  in k, tf is not too bad either: it seems to be linear in k. That is, if I increase k to 500, it takes 26s, but casadi stays at 0.04s.
                                                                                         
How does casadi manage to be independent of k????

Note that for k = 500, casadi is three orders of magnitude better than tf, but both are far, far better than everything else. [Both produce identical results].


Below is the code for theano and tf:

----
Theano:
## Can be run directly in Debian (python-theano, or pip install theano).
## Unlike casadi, will hang forever for larger k, say k=100: exponential in k.
## Verified same results for smaller k.
import theano
import theano.tensor as T

x = T.dscalar('x')

k = 20
y = x + 0
for ii in range(0, k):
    y = T.sin(y + 0.1 * y**3)

z = y**4 * T.log(y**2) * T.exp(y**3 - T.tanh(y))

gz = T.grad(z, x)             # symbolic gradient dz/dx
f = theano.function([x], gz)
print(f(1.2))
----


## TensorFlow (1.x API). To run on Debian, create a virtualenv, then pip install tensorflow.
## Again, produces the same results for x=1.2.
## Unlike theano, tf does not take forever for k=100.
## While casadi takes 0.04s, tf takes 7.4s: two orders of magnitude slower,
## but still far, far better than the rest.

import tensorflow as tf

# define the symbolic input
x = tf.placeholder("float")

k = 500
y = x + 0
for ii in range(0, k):
    y = tf.sin(y + 0.1 * y**3)
z = y**4 * tf.log(y**2) * tf.exp(y**3 - tf.tanh(y))

# the derivative of z with respect to x
gz = tf.gradients(z, x)

# launch a session for the default graph to compute dz/dx at x=1.2
sess = tf.Session()
result = sess.run(gz, {x: 1.2})
print(result)

----

Joel Andersson
Apr 16, 2018, 10:30:11 AM
to CasADi
The 0.03 s in CasADi is probably mainly Python-SWIG-CasADi calling overhead. A rough rule of thumb is 1 ns per elementary operation for well-behaved CasADi/SX evaluation (think 1 / (2.5 GHz) = 0.4 ns). That should give you an order of magnitude. In "sin(y + 0.1*y**3)", you've got some 5 operations, so say 5 ns per k. For k=500, that's still only 2.5 microseconds. You probably need a larger k to see the linear dependency on k.

Joel

