Parallel evaluation of a mathematical expression


Jason Moore

Mar 16, 2014, 9:30:23 PM
to sy...@googlegroups.com
In the mechanics package we generate very long mathematical expressions which we then want to evaluate numerically as fast as possible. To do this, we generate code in a low level language and typically use common subexpression elimination as a pre-compile optimization step*. I'm wondering whether there are possibilities for parallelizing the evaluation of these cse'd expressions. It seems that if the cse dependency graph were formed, you might be able to see groups of cse's that could be executed in parallel. I haven't found anything online yet that shows anyone doing this or even attempting it. Either I don't have the right search terms, nobody's done it, or it isn't possible to get much (or any) gain for general expressions. Does anyone know if this is possible and, if so, is there a standard way to build the dependency graph?

* This may not gain us much, as good compilers probably do this for us, but it does pay off for higher-level interpreted languages.
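For anyone not familiar with it, this is roughly what cse() gives us to work with (a toy expression here, not one of our generated ones):

from sympy import cse, symbols, sin, cos

q0, q1 = symbols('q0 q1')
exprs = [sin(q0)*cos(q1) + sin(q0)**2, sin(q0)*cos(q1) - cos(q1)**2]

# cse() returns (replacements, reduced_exprs): each replacement is a
# (new symbol, subexpression) pair, and the reduced expressions are
# rewritten in terms of those new symbols.
replacements, reduced = cse(exprs)
for sym, sub in replacements:
    print(sym, '=', sub)
print(reduced)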

Frédéric Bastien

Mar 16, 2014, 9:48:06 PM
to sympy
Someone recently contributed a modification that allows Theano elementwise operations to run in parallel on the CPU. We found that running in parallel is useless or even harmful for the addition of two vectors if they contain fewer than about 200k elements. This just shows how big the cost of starting and stopping threads via OpenMP is.

So if you don't have more computation than that in your graph, maybe it won't be possible to get a speed gain with g++'s OpenMP implementation.

If you do have enough work, OpenMP can parallelize tasks: you fill a list with tasks that are ready to run and the OpenMP back end runs them; each task can add more tasks to that list when it finishes.

Fred

Jason Moore

Mar 16, 2014, 10:10:03 PM
to sy...@googlegroups.com
Thanks Fred. That gives me a sense of scale. The longer expressions I've encountered so far in our work have 1000 to 5000 common subexpressions (although it wouldn't be hard to create problems that have hundreds of thousands of cse's). So this isn't at the scale you're talking about for vector addition, and maybe I wouldn't see any gain. If I could speed up the execution by 1.5x-5x, that would be a big help. These expression evaluations are required for simulating the multibody systems, and the simulation time is the bottleneck in my data processing routine. We'd like to be able to simulate our systems at 10-50x real-time speed so that we can get results back in hours as opposed to days. A Google search for "parallel evaluation of mathematical expressions" has turned up some papers.

Matthew Rocklin

Mar 16, 2014, 10:47:31 PM
to sy...@googlegroups.com, Max Hutchinson
I'm not sure that Fred's experience applies in this case.  Fred, am I correct in assuming that the element-wise computations that you benchmarked had FLOPs / element ratios less than ten?  I.e. you did only a few computations per element but on many elements?  In this case you're memory bound and exposing more work for the CPU doesn't buy you much.  

My understanding is that Jason lives in a bizarre world of very high computational intensity.  He performs very large computations on relatively little data.  Jason, can you give an estimate on how many operations are performed on each scalar and how many scalars you have?

High compute intensity applications traditionally do benefit substantially from multithreading.  They're also very rare and mostly found just in computations like dense linear algebra.  It's weird to find them outside of the SIMD paradigm.

I'm CCing Max Hutchinson, who knows more about these things than I, and might find this topic of interest.  Max, can you provide rule of thumb numbers for FLOP/word ratios for CPUs?  30?

Jason, your problem is somewhat rare, which stops me from knowing of a common solution.  Most parallel language attempts ask you to chunk up your computation into subtasks and then schedule those tasks dynamically using some clever heuristic; work-stealing comes to mind.  I suspect that chunking at the scale of each scalar operation would generate overwhelming overhead in a dynamically scheduled system.  There do exist various static schedulers, but alas my experience here is only with schedulers that are overkill for your particular problem.  Probably you just need some heuristic to find big, relatively isolated chunks of your call graph and wrap those up in some sort of threaded call.

Good news is that you've found a research problem :)  If this were my problem my next step would be to pay a call on some programming languages CS postdoc.  They might be thrilled to hear from you :)



Matthew Rocklin

Mar 17, 2014, 12:06:35 AM
to Max Hutchinson, sy...@googlegroups.com
Response from Max follows (for some reason he was getting bounced by the mailing list).


On Sun, Mar 16, 2014 at 8:55 PM, Max Hutchinson <maxh...@gmail.com> wrote:
tl;dr: it depends on the DAG, but improved ILP is likely possible (if difficult) and there could be room for multi-core parallelism as well.

As I understand it, we're talking about a long computation applied to short input vectors.  If the computation can be applied to many input vectors at once, independent of each other, then all levels of parallelism (multiple instructions, multiple cores, multiple sockets, multiple nodes) can be used.  This is data-parallelism, which is great! However, it doesn't sound like this is the case.

It sounds like you're thinking of building a DAG of these CSEs and trying to use task-parallelism over independent parts of it (automatically using sympy or theano or what have you).  The tension here is going to be between locality and parallelism: how much compute hardware can you spread your data across without losing the nice cache performance that your small input vectors gain you.  I'd bet that going off-socket is way too wide.  Modern multi-core architectures have core-local L2 and L1 caches, so if your input data fits nicely into L2 and your DAG isn't really local, you probably won't get anything out of multiple-cores.  Your last stand is single-core parallelism (instruction-level parallelism), which sympy et al may or may not be well equipped to influence.

To start, I'd recommend that you take a look at your DAGs and try to figure out how large the independent chunks are.  Then, estimate the amount of instruction level parallelism when you run in 'serial' (which you can do with flop-counting).  If your demonstrated ILP is less than your independent chunk size, then at least improved ILP should be possible.  Automatically splitting up these DAGs and expressing them in a low-level enough way to affect ILP is a considerable task, though.

To see if multi-core parallelism is worth it, you need to estimate how many extra L3 loads you'd incur by spreading your data over multiple L2s.  I don't have great advice for that; maybe someone else here does.  The good news is that if your problem has this level of locality, then you can probably get away with emitting C code with pthreads or even OpenMP.  Just bear in mind the thread creation/annihilation overhead (standing thread pools are your friend) and pin the threads to cores.

Good luck,
Max

Jason Moore

Mar 17, 2014, 11:21:06 AM
to sy...@googlegroups.com, Max Hutchinson
I'm still digesting what Matthew and Max wrote. Lots of new words for me :) But here is a simple example taken from the C code we generate for a two-link pendulum.

First the C code with SymPy's CSE expressions automatically generated:

#include <math.h>
#include "multibody_system_c.h"

void mass_forcing(double constants[6], // constants = [g, m0, l0, m1, l1, m2]
                  double coordinates[3], // coordinates = [q0, q1, q2]
                  double speeds[3], // speeds = [u0, u1, u2]
                  double mass_matrix[36], // computed
                  double forcing_vector[6]) // computed
{
    // common subexpressions
    double z_0 = coordinates[1];
    double z_1 = sin(z_0);
    double z_2 = constants[2]*z_1;
    double z_3 = -constants[3]*z_2 - constants[5]*z_2;
    double z_4 = coordinates[2];
    double z_5 = sin(z_4);
    double z_6 = -constants[4]*constants[5]*z_5;
    double z_7 = pow(constants[2], 2);
    double z_8 = constants[2]*constants[4]*constants[5];
    double z_9 = cos(z_0);
    double z_10 = cos(z_4);
    double z_11 = z_8*(z_1*z_5 + z_10*z_9);
    double z_12 = speeds[1];
    double z_13 = speeds[2];
    double z_14 = pow(z_12, 2);
    double z_15 = constants[2]*z_14*z_9;
    double z_16 = pow(z_13, 2);
    double z_17 = constants[4]*constants[5]*z_10;
    double z_18 = constants[0]*constants[2]*z_9;
    double z_19 = z_5*z_9;
    double z_20 = z_1*z_10;

    // mass matrix
    mass_matrix[0] = 1;
    mass_matrix[1] = 0;
    mass_matrix[2] = 0;
    mass_matrix[3] = 0;
    mass_matrix[4] = 0;
    mass_matrix[5] = 0;
    mass_matrix[6] = 0;
    mass_matrix[7] = 1;
    mass_matrix[8] = 0;
    mass_matrix[9] = 0;
    mass_matrix[10] = 0;
    mass_matrix[11] = 0;
    mass_matrix[12] = 0;
    mass_matrix[13] = 0;
    mass_matrix[14] = 1;
    mass_matrix[15] = 0;
    mass_matrix[16] = 0;
    mass_matrix[17] = 0;
    mass_matrix[18] = 0;
    mass_matrix[19] = 0;
    mass_matrix[20] = 0;
    mass_matrix[21] = constants[1] + constants[3] + constants[5];
    mass_matrix[22] = z_3;
    mass_matrix[23] = z_6;
    mass_matrix[24] = 0;
    mass_matrix[25] = 0;
    mass_matrix[26] = 0;
    mass_matrix[27] = z_3;
    mass_matrix[28] = constants[3]*z_7 + constants[5]*z_7;
    mass_matrix[29] = z_11;
    mass_matrix[30] = 0;
    mass_matrix[31] = 0;
    mass_matrix[32] = 0;
    mass_matrix[33] = z_6;
    mass_matrix[34] = z_11;
    mass_matrix[35] = pow(constants[4], 2)*constants[5];

    // forcing vector
    forcing_vector[0] = speeds[0];
    forcing_vector[1] = z_12;
    forcing_vector[2] = z_13;
    forcing_vector[3] = constants[3]*z_15 + constants[5]*z_15 + z_16*z_17;
    forcing_vector[4] = -constants[3]*z_18 - constants[5]*z_18 + z_16*z_8*(z_19 - z_20);
    forcing_vector[5] = -constants[0]*z_17 + z_14*z_8*(-z_19 + z_20);
}



Now I manually group these expression evaluations into "stacks", i.e. sets of calls that could happen in parallel (there is, of course, a more detailed dependency graph you could draw to maximize the time that your cores have a task).

// These are not computations but just value assignments.
z_0 = coordinates[1];
z_4 = coordinates[2];
z_12 = speeds[1];
z_13 = speeds[2];
mass_matrix[0] = 1;
mass_matrix[1] = 0;
mass_matrix[2] = 0;
mass_matrix[3] = 0;
mass_matrix[4] = 0;
mass_matrix[5] = 0;
mass_matrix[6] = 0;
mass_matrix[7] = 1;
mass_matrix[8] = 0;
mass_matrix[9] = 0;
mass_matrix[10] = 0;
mass_matrix[11] = 0;
mass_matrix[12] = 0;
mass_matrix[13] = 0;
mass_matrix[14] = 1;
mass_matrix[15] = 0;
mass_matrix[16] = 0;
mass_matrix[17] = 0;
mass_matrix[18] = 0;
mass_matrix[19] = 0;
mass_matrix[20] = 0;
mass_matrix[24] = 0;
mass_matrix[25] = 0;
mass_matrix[26] = 0;
mass_matrix[30] = 0;
mass_matrix[31] = 0;
mass_matrix[32] = 0;
forcing_vector[0] = speeds[0];
forcing_vector[1] = z_12;
forcing_vector[2] = z_13;

// These are computations that involve the initial values passed into the
// function, i.e. stack #1.
z_7 = pow(constants[2], 2);
z_8 = constants[2]*constants[4]*constants[5];
z_14 = pow(z_12, 2);
z_16 = pow(z_13, 2);
mass_matrix[21] = constants[1] + constants[3] + constants[5];
mass_matrix[35] = pow(constants[4], 2)*constants[5];

// Stack #2
z_1 = sin(z_0);
z_5 = sin(z_4);
z_9 = cos(z_0);
z_10 = cos(z_4);
z_2 = constants[2]*z_1;
mass_matrix[28] = constants[3]*z_7 + constants[5]*z_7;

// Stack #3
z_3 = -constants[3]*z_2 - constants[5]*z_2;
z_6 = -constants[4]*constants[5]*z_5;
z_11 = z_8*(z_1*z_5 + z_10*z_9);
z_15 = constants[2]*z_14*z_9;
z_17 = constants[4]*constants[5]*z_10;
z_18 = constants[0]*constants[2]*z_9;
z_19 = z_5*z_9;
z_20 = z_1*z_10;

// Stack #4
mass_matrix[22] = z_3;
mass_matrix[23] = z_6;
mass_matrix[27] = z_3;
mass_matrix[29] = z_11;
mass_matrix[33] = z_6;
mass_matrix[34] = z_11;
forcing_vector[3] = constants[3]*z_15 + constants[5]*z_15 + z_16*z_17;
forcing_vector[4] = -constants[3]*z_18 - constants[5]*z_18 + z_16*z_8*(z_19 - z_20);
forcing_vector[5] = -constants[0]*z_17 + z_14*z_8*(-z_19 + z_20);


So this simplified example of the dependencies among the CSE's shows that if I had enough cores available I could parallelize each stack, potentially increasing the execution speed. Instead of 31 sequential evaluations, you would have 4 rounds of parallel evaluations, ideally a 31/4 ≈ 7.75x speedup. For more complicated problems there could be thousands and thousands of these CSE's, but I'll need to generate their dependencies with code to see if things stack this nicely for the big problems. I suspect the dependency chain could be such that the higher-numbered stacks have hundreds of expressions whereas the lower stacks have fewer, or vice versa.

How do I generate a DAG for long expressions in SymPy? Is this part of the internal architecture of SymPy expressions? I don't understand how the cse() code works yet either, but it seems like this information should be computed already. I just need to visualize the graph for some of our bigger problems.
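Here's a rough sketch of the kind of thing I'm imagining, built from cse() and free_symbols on a toy expression (the grouping into levels is just my naive "stacks" idea, not a real scheduler):

from collections import defaultdict
from sympy import cse, symbols, sin, cos

# Toy expressions standing in for the generated mass matrix / forcing vector.
q0, q1, u0 = symbols('q0 q1 u0')
exprs = [(sin(q0) + cos(q1))**2 + u0*sin(q0),
         (sin(q0) + cos(q1))**2 * u0 - cos(q1)]

replacements, reduced = cse(exprs)
aux = {sym for sym, _ in replacements}

# Give each cse symbol a level: 1 if it only needs the inputs, otherwise one
# more than the deepest cse symbol it depends on.  Each level is one of the
# "stacks" above: its members could, in principle, be computed in parallel.
level = {}
for sym, sub in replacements:        # cse() emits these in dependency order
    used = sub.free_symbols & aux
    level[sym] = max(level[d] for d in used) + 1 if used else 1

stacks = defaultdict(list)
for sym, lvl in level.items():
    stacks[lvl].append(sym)

for lvl in sorted(stacks):
    print('stack', lvl, ':', stacks[lvl])
print('outputs     :', reduced)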

Also, regarding the number of scalars and the number of operations on each: here is a bigger problem with 2000 or so CSE's:

https://github.com/moorepants/dissertation/blob/master/src/extensions/arms/ArmsDynamics.c

This problem has 12 scalars that have 2000+ CSE's and there are 5840 additions and subtractions, 9847 multiplications and divisions, 14 cosines, and 14 sines. So roughly 1300 operations per scalar.
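For reference, SymPy's count_ops can produce that kind of per-operation tally directly on an expression; a quick sketch on a toy expression:

from sympy import count_ops, symbols, sin, cos

q0, q1, u0 = symbols('q0 q1 u0')
expr = sin(q0)*cos(q1) + u0**2 - cos(q1)**2

# visual=True gives a symbolic tally, something like 2*ADD + 3*MUL + ...,
# i.e. a count per operation type instead of a single number.
print(count_ops(expr, visual=True))
print(count_ops(expr))  # plain total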


Jason Moore

Mar 17, 2014, 12:13:15 PM
to Max Hutchinson, sy...@googlegroups.com
Thanks for that clarification on the locality. I understand that now.

How do I generate DAGs? Is this something SymPy does automatically? Or is there other software that does it? Or is it easy to code up myself?

How would I "help" the compiler with more information for better ILP?



On Mon, Mar 17, 2014 at 11:49 AM, Max Hutchinson <maxh...@gmail.com> wrote:
If you think about what the DAG would look like, your 'stacks' are like horizontal layers in the graph.  The width of each layer (length of each stack) gives an upper bound on the speedup, but it doesn't tell the whole story: you need a way to deal with data locality.

For example, let's look at stack #3.  You have 8 independent expressions, so it would seem like you should be able to use 8 pieces of computational hardware (let's call them cores).  However, z_6, z_11, and z_19 all depend on z_5.  Therefore, either z_6, z_11, and z_19 need to be computed local to z_5, or z_5 needs to be copied somewhere else.  The copying is much more expensive than the computing (50-100 cycles [1]), so if you only have 3 things that depend on z_5, you're going to want to just compute them all on the same core as z_5.

The complicated thing is that z_5 and z_10 both share a dependency, z_4, so they should be computed locally.  Now, we have to compute everything that depends on z_5 or z_10 on the same core.  If we don't break locality anywhere, we won't have any available parallelism.  This is the tension: copies are expensive, but without them we can't expose any parallelism and will be stuck with one core.  This is why we really need to build a DAG, not just stacks, and then try to break it into chunks with the fewest edges between them.  The number of chunks is the amount of parallelism and the number of edges is the number of copies.

Fortunately, even if the DAGs are strongly connected and you're stuck with one core there is still ILP.  In a nutshell: each core can actually do a couple operations at the same time.  The core uses a single cache, so the data is local and doesn't require copies.  The compiler is supposed to figure out ILP for you, but you might be able to help it out using all the extra information sympy/theano knows about your computation.

Max

Max Hutchinson

Mar 17, 2014, 12:39:13 PM
to Jason Moore, sy...@googlegroups.com
Before trying to improve ILP, you should probably see how well the compiler is already doing.  You can calculate a lower bound for the actual ILP by dividing your FLOP rate (FLOPs/sec) by the core frequency (cycles/sec) to get the number of FLOPs per cycle.  Then compare that to the number of execution units (or whatever they're calling them now) for your processor.
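For example (every number below is a made-up placeholder, just to show the arithmetic):

# Hypothetical numbers for one call of the generated function.
flops_per_call = 15700.0      # e.g. ~5840 add/sub + 9847 mul/div + trig
seconds_per_call = 2.0e-6     # measured wall-clock time per evaluation
core_frequency_hz = 3.0e9     # clock rate of the core running it

flop_rate = flops_per_call / seconds_per_call    # FLOPs per second
flops_per_cycle = flop_rate / core_frequency_hz  # lower bound on the ILP
print(flops_per_cycle)   # compare with the core's execution-unit count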

That number may be hard to find, and it is highly idealized, so a better comparison might be your FLOP rate against a well-known compute-bound vectorized problem: the linpack benchmark [1].  If your single-core FLOP rate is near the linpack number, there isn't going to be much room for single-core (ILP) improvement.  Be sure to run linpack on a problem large enough that it is compute-bound.

What to do to actually improve ILP is probably architecture/compiler specific and very much outside my area of expertise.  Stack Overflow or the Computational Science SE [2] might be able to help.

Hopefully someone else on this list can help with DAGs and SymPy.

Max

Jason Moore

Mar 17, 2014, 12:42:00 PM
to Max Hutchinson, sy...@googlegroups.com
Thanks Max. I'll look into the FLOP counting as you suggest.

Matthew has created visual DAGs before; I'm just not sure what software he uses for it. He'll probably chime in about that.

Matthew Rocklin

Mar 17, 2014, 12:46:32 PM
to sy...@googlegroups.com, Max Hutchinson
Logically we think of SymPy expressions as trees.  When you consider common sub-expressions, repeated branches of these trees get merged.  The result is a DAG.


Jason Moore

Mar 17, 2014, 12:52:58 PM
to sy...@googlegroups.com, Max Hutchinson

Matthew Rocklin

Mar 17, 2014, 12:55:29 PM
to sy...@googlegroups.com, Max Hutchinson
theano.printing.pydotprint

SymPy has something similar in sympy.printing.dot
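For example, a minimal sketch (it writes out DOT source that you can render with Graphviz):

from sympy import symbols, sin, cos
from sympy.printing.dot import dotprint

q0, q1 = symbols('q0 q1')
expr = sin(q0)*cos(q1) + sin(q0)**2

# dotprint returns the expression graph in Graphviz's DOT language; render
# the file with e.g. `dot -Tpng expr.dot -o expr.png`.
with open('expr.dot', 'w') as f:
    f.write(dotprint(expr))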


Ondřej Čertík

Mar 17, 2014, 1:00:36 PM
to sympy
One thing I wonder is: how accurate are your double precision results from the C code? Did you try to compare them against a high-accuracy (e.g. 100 digits) evaluation in SymPy using Floats?

The "minus" sign always causes some numerical cancellation.

In my experience, if the symbolic expressions grow into hundreds of terms and there are a few "minus" signs in there, you always get numerical cancellations that are sometimes just too big for double precision to handle, and you end up with bogus numbers, or only 1 or 2 digits of accuracy.

One has to be very careful about this.

Otherwise I think it's a great idea to parallelize the evaluation. Btw, with CSymPy I am also interested in parallelizing the symbolic manipulation, including the numerical evaluation (both double precision and higher accuracy). One can do it at the "library" level as well as at the code generation level, which is the problem you are tackling now.

Ondrej

Frédéric Bastien

Mar 17, 2014, 4:26:18 PM
to sympy
Additions and multiplications can be counted as 1 operation each, but not trigonometric functions. A rough estimate is that each of those is worth about 50 operations (the order of magnitude is right for a CPU), so this gives about 17k operations. That's in the same ballpark as your estimate :)

Can you just run this function on many inputs in parallel? It would be easier to parallelize an outer loop than this.

Fred

Jason Moore

Mar 17, 2014, 4:37:36 PM
to sy...@googlegroups.com
No, it is not possible to run the function itself on the various inputs in parallel, because the function is called inside a numerical integration routine and each input to the function requires the previous output of the integration, i.e. each call depends on the previous call.

Jason Moore

Mar 17, 2014, 5:13:57 PM
to sy...@googlegroups.com
For numerical ODE integration, I don't think double precision presents nearly as much of an accuracy issue as picking the correct step size. We usually fight the step size issue, and once that is good the accuracy is adequate for the solutions we need. I've never run into an issue with double precision being the limiting factor, so I don't really know. I think we can often get away with much lower accuracy than double precision for realistic simulations of typical multibody systems.

Wow, so a minus sign can cause errors so big that you only get 1 or 2 digits of accuracy. That's new to me. I'm not sure what minus signs you are talking about. What would you do instead of minus signs?

But like I said, numerical accuracy is generally not an issue for our systems, unless the ODE integration routine is bad.

Gilbert Gede

Mar 17, 2014, 5:26:23 PM
to sy...@googlegroups.com
Jason, I've definitely seen examples where small errors in subtraction can lead to larger errors later on - especially if things are supposed to cancel out. In our domain, I think it pops up in mass/inertia division operations. I'm not sure how often it happens with real-world examples vs. academic/idealized examples though. 

I would definitely agree that step-size (for stiff systems) is the bigger challenge in practice - although I believe providing the Jacobian of your RHS function can help with that. That might be something we want to automate in the code-gen module.

BTW, thanks for starting this discussion. Very informative.

-Gilbert


Jason Moore

Mar 17, 2014, 5:35:06 PM
to sy...@googlegroups.com
Ok, if we had some concrete examples of that cancellation problem then we could do something about it. This goes for everything that we've been implementing: if we don't have a good benchmark problem then it is hard to design something that solves it. We can look into this more when it becomes an issue, I guess.

Yes, we should have an option in the code generation to generate a function that can evaluate the Jacobian of the RHS. This can be used directly by implicit ODE integration routines, which are best suited for stiff problems.
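Something like this (a rough sketch with a made-up two-state toy system) is probably all the code generator needs to do before sending the result through the same cse/printing steps:

from sympy import Matrix, symbols, sin

# Hypothetical right hand side of a toy pendulum: qdot = u, udot = -g/l*sin(q)
q, u, g, l = symbols('q u g l')
rhs = Matrix([u, -g/l*sin(q)])
states = Matrix([q, u])

# Jacobian of the RHS with respect to the states, which is what an implicit
# (stiff) ODE integrator wants in addition to the RHS itself.
jac = rhs.jacobian(states)
print(jac)   # Matrix([[0, 1], [-g*cos(q)/l, 0]])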

Ondřej Čertík

Mar 17, 2014, 5:54:21 PM
to sympy
On Mon, Mar 17, 2014 at 3:35 PM, Jason Moore <moore...@gmail.com> wrote:
> Ok, if we had some concrete examples of that non-cancellation then we could
> do something about it. This is goes for everything that we've been
> implementing, if we don't have a good benchmark problem then it is hard to
> design something that solve it. We can look into this more when it becomes
> and issue, I guess.

Just try this:

In [1]: from sympy import *

In [2]: var("x")
Out[2]: x

In [3]: e = legendre_poly(50, x)

In [4]: e.subs(x, 0.9)
Out[4]: 1.25000000000000

In [5]: e.subs(x, S(9)/10).n()
Out[5]: -0.170037659943837

and it only gets worse:

In [6]: e = legendre_poly(100, x)

In [7]: e.subs(x, 0.9)
Out[7]: -3.65635994747142e+17

In [8]: e.subs(x, S(9)/10).n()
Out[8]: 0.102265820558719


In other words, double precision is simply not enough at all, and that's just the Legendre polynomial of order 50, which is not that big an expression; it looks like this:

In [9]: legendre_poly(50, x)
Out[9]: 12611418068195524166851562157*x**50/140737488355328 -
156050375086257748529223875175*x**48/140737488355328 +
226836112238787036521861509275*x**46/35184372088832 -
823773249709279237895181270525*x**44/35184372088832 +
4189728463575151392735706892025*x**42/70368744177664 -
7928255400303748020099876118755*x**40/70368744177664 +
5790298887862287879848224131675*x**38/35184372088832 -
6684039602901787158511168414725*x**36/35184372088832 +
24770264410753681822717859419275*x**34/140737488355328 -
18602568051449552212241926551825*x**32/140737488355328 +
1423900270604780539702468452115*x**30/17592186044416 -
712769410486857922635873160725*x**28/17592186044416 +
583174972216520118520259858775*x**26/35184372088832 -
194391657405506706173419952925*x**24/35184372088832 +
26248579962778792027330678575*x**22/17592186044416 -
5693353963757653481984400705*x**20/17592186044416 +
7838675747202566388239392275*x**18/140737488355328 -
1052956443654076082002306425*x**16/140737488355328 +
26998883170617335435956575*x**14/35184372088832 -
2052546673789621992207225*x**12/35184372088832 +
222078820442811559812585*x**10/70368744177664 -
8065816723104536070675*x**8/70368744177664 +
90048990529077755175*x**6/35184372088832 -
1067774591253886425*x**4/35184372088832 +
20146690401016725*x**2/140737488355328 -
15801325804719/140737488355328


By "minus" sign, I mean all the minus signs in there, which cause the
cancellation.

Ondrej

Jason Moore

Mar 17, 2014, 6:08:05 PM
to sy...@googlegroups.com
Thanks. I'm a dummy when it comes to C, living my life mostly in high-level languages. Can I use a floating point number? I'm using doubles in the C code because that is what the proprietary product we used to use generates for similar systems. It is pretty battle tested and I've never had issues with the double precision nor heard anyone complain about it. Maybe the kinds of expressions we generate are not as susceptible. Not sure.

Matthew Rocklin

Mar 17, 2014, 6:55:24 PM
to sy...@googlegroups.com
It seems to me like floating point is an orthogonal issue.

However, if you wanted to verify that your computations are well conditioned (if you wanted to make Ondrej happy), you could do the computation in 32-bit floating point (i.e. use float rather than double) and then verify that your answer didn't substantially change.  Sometimes we can validate our current approach by showing that a dumber approach still works fine :)
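A quick way to do that check from the Python side (a sketch; it assumes you're willing to evaluate through lambdify/NumPy rather than the generated C):

import numpy as np
from sympy import symbols, sin, cos, lambdify

q0, q1 = symbols('q0 q1')
expr = sin(q0)*cos(q1) + sin(q0)**2 - cos(q1)**2   # stand-in expression

f = lambdify((q0, q1), expr, 'numpy')

# Evaluate the same inputs in double and single precision; if the two agree
# to several digits, the expression is probably not badly ill-conditioned
# at this point in state space.
print(f(np.float64(0.9), np.float64(1.3)))
print(f(np.float32(0.9), np.float32(1.3)))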


Ondřej Čertík

Mar 17, 2014, 6:59:40 PM
to sympy
On Mon, Mar 17, 2014 at 4:08 PM, Jason Moore <moore...@gmail.com> wrote:
> Thanks. I'm a dummy when it comes to C, living my life mostly in high level
> languages. Can I use a floating point number? I'm using doubles in the C
> code because that is what this proprietary product generates for similar
> systems that we used to use. It is pretty battle tested and I've never had
> issues with the double precision nor heard anyone complain about it. Maybe
> the kinds of expressions we generate are not as susceptible. Not sure.

The example above is in Python; it has nothing to do with C. It's using Python floats, which are usually machine-precision doubles. That is what is usually called a "floating point number".

There is this famous essay "What Every Computer Scientist Should Know
About Floating-Point Arithmetic"

http://docs.oracle.com/cd/E19957-01/806-3568/ncg_goldberg.html

It's kind of long, but the gist is that +, *, and / don't lose significant accuracy, while "-" does. This is easy to see in Python:

In [16]: a = 1+2e-16

In [17]: a
Out[17]: 1.0000000000000002

We still have around 16 significant digits. But if we do "minus":

In [18]: a-1
Out[18]: 2.220446049250313e-16

We suddenly only have 1 significant digit correct, the rest is (in
general) random garbage (to be precise, the number we got is the
machine epsilon: http://en.wikipedia.org/wiki/Machine_epsilon, but
that's not the point).

The conclusion is that when you use "minus", you need to *avoid* subtracting numbers of similar magnitude, to minimize the loss. E.g. if you subtract 1 and 2e-16, then it's fine:

In [21]: 1-2e-16
Out[21]: 0.9999999999999998

No issue.

In practice, the Legendre polynomial subtracts numbers of similar magnitude, so you are totally out of luck and you simply can't use it. That's why I was asking whether you are sure that what you get for your complicated systems is actually correct and not garbage. For example, if you did the equivalent of legendre_poly(50, x) by evaluating the expression directly in C, you would get garbage. So it's not some academic exercise.

Ondrej

Ondřej Čertík

Mar 17, 2014, 7:07:59 PM
to sympy
On Mon, Mar 17, 2014 at 4:55 PM, Matthew Rocklin <mroc...@gmail.com> wrote:
> It seems to me like floating point is an orthogonal issue.
>
> However, if you wanted to verify that your computations are well conditioned
> (if you wanted to make Ondrej happy), you could do the computation in 32bit
> floating point (i.e. use float rather than double), and then verify that
> your answer didn't substantially change. Sometimes we can verify our
> current approach by showing that a more dumb approach still works fine :)

Yep, that's the way to test it. I've been bitten by this a lot, not just for Legendre polynomials, but also for modified Bessel functions:

http://en.wikipedia.org/wiki/Bessel_function#Modified_Bessel_functions_:_I.CE.B1.2C_K.CE.B1

You code up the formula using sympy, generate C code, and it doesn't work. For Bessel functions, I ended up using a rational function approximation instead of the direct formula:

https://github.com/certik/hfsolver/blob/master/src/special_functions.f90#L371

which gives about 15 digits of accuracy.

So overall I would say that if the symbolic expression becomes long
enough and contains minus signs, chances are high that double
precision will not be enough.

Ondrej

Jason Moore

Mar 17, 2014, 7:12:20 PM
to sy...@googlegroups.com
I have a basic understanding of floating point arithmetic and I've read that essay at some point in my career.

As to whether we output garbage with these methods, we typically look to well-defined benchmark problems. For example, the bicycle problem that runs as a test in the sympy test suite checks the accuracy of the evaluation to something like 14 decimal places against an independently derived benchmark produced by a number of other methods. That problem, in our sympy case, has tons of minuses in the symbolic code, but it gives the same answer as 10+ other different implementations. Now, if we didn't get the same answer, then we'd need to be looking into things like floating point arithmetic.

So, if we ever have a problem that gives dubious results, then we will surely look into floating point arithmetic as a potential cause. But so far, none of the problems we've tried to solve with this give garbage, or at least they don't give errors with respect to others' implementations.

Ondřej Čertík

Mar 17, 2014, 7:27:46 PM
to sympy
On Mon, Mar 17, 2014 at 5:12 PM, Jason Moore <moore...@gmail.com> wrote:
> I have the basic understanding of floating point arthmetic and I've read
> that essay at some point in my career.
>
> As to whether we output garbage with these methods, we typically look to
> well defined benchmark problems. For example, the bicycle problem that runs
> as a test in the sympy test suite checks the accuracy of the evaluation to
> something like 14 decimal places against an independently derived system
> from a number of other methods. That problem, in our sympy case, has tons of
> minus's in the symbolic code but it gives the same answer as 10+ other
> different implementations. Now if we didn't get the same answer, then we'd
> need to be looking into things like floating point arithmetic.
>
> So, if we ever have a problem that gives dubious results, then we will
> surely look into floating point arithmetic as a potential cause. But so far,
> all of the problems we've tried to solve with this don't give garbage, or at
> least don't give errors with respect to other's implementations.

Very cool, that's great news.

Ondrej

Vinzent Steinberg

Mar 18, 2014, 10:23:19 AM
to sy...@googlegroups.com
Regarding the cancellation issues: Isn't it possible to rewrite expressions such that they give smaller rounding errors when evaluated using floating point arithmetic? (Like x^2 - y^2 = (x + y)(x - y).)
Sympy could do this.
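For instance (a toy illustration of the kind of rewriting I mean):

from sympy import symbols, factor, horner

x, y = symbols('x y')

# The textbook case: the factored form avoids the cancellation that
# x**2 - y**2 suffers when x and y are close.
print(factor(x**2 - y**2))                # (x - y)*(x + y)

# Horner form needs fewer multiplications and can also be friendlier
# numerically when evaluating polynomials.
print(horner(x**3 - 6*x**2 + 11*x - 6))   # x*(x*(x - 6) + 11) - 6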

Does anyone know of some work on rewriting general expressions into numerically more favorable forms?

Vinzent

Ondřej Čertík

Mar 18, 2014, 2:44:54 PM
to sympy
On Tue, Mar 18, 2014 at 8:23 AM, Vinzent Steinberg
<vinzent....@gmail.com> wrote:
> Regarding the cancellation issues: Isn't it possible to rewrite expressions
> such that they give smaller rounding errors when evaluated using floating
> point arithmetic? (Like x^2 - y^2 = (x + y)(x - y).)
> Sympy could do this.

Sometimes it can help to do this, but sometimes the only solution is
some adaptive arbitrary precision,
as sympy does in the .n() function.
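E.g., tying back to the Legendre example above (a small sketch):

from sympy import legendre_poly, Symbol, Rational

x = Symbol('x')
e = legendre_poly(50, x)

# Substituting the exact rational keeps everything exact until .n() rounds
# at the end, so we get the requested digits; substituting the double 0.9
# first gives garbage, as shown earlier.
print(e.subs(x, Rational(9, 10)).n(30))   # -0.170037659943837...
print(e.subs(x, 0.9))                     # 1.25000000000000 (wrong)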

>
> Does anyone know of some work on rewriting general expressions into
> numerically more favorable forms?



Ondrej

Aaron Meurer

Mar 21, 2014, 9:18:27 PM
to sy...@googlegroups.com
Is this an issue with any subtraction of similar magnitude, or only if
it changes the exponent of the number (i.e., it gets closer to 0)? In
other words, when you say "magnitude", do you mean "value", or "order
of magnitude (log2(x))"?

Aaron Meurer

Ondřej Čertík

Mar 21, 2014, 11:52:09 PM
to sympy
On Fri, Mar 21, 2014 at 7:18 PM, Aaron Meurer <asme...@gmail.com> wrote:
> Is this an issue with any subtraction of similar magnitude, or only if
> it changes the exponent of the number (i.e., it gets closer to 0)? In
> other words, when you say "magnitude", do you mean "value", or "order
> of magnitude (log2(x))"?

Roughly speaking, if you have two numbers x*10^a and y*10^b, where 1 <= x, y < 10, then when you subtract them you are left, in the worst case, with only about |a-b| correct significant digits. So if a = b, you can get as little as 1 significant digit. If a = 1 and b = -16, then you retain all 16 significant digits. I hope I wrote it up correctly.

http://en.wikipedia.org/wiki/Loss_of_significance
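A quick numerical illustration of the loss (the two numbers are arbitrary; they agree in their first 13 digits):

from decimal import Decimal

a = 1.2345678901234567
b = 1.2345678901230000

# Exact reference computed in decimal arithmetic.
exact = Decimal('1.2345678901234567') - Decimal('1.2345678901230000')

print(a - b)    # double subtraction: only the first few digits are right
print(exact)    # 4.567E-13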

Ondrej

Aaron Meurer

Mar 22, 2014, 9:06:00 PM
to sy...@googlegroups.com
Ah, so it's more or less what you would expect. It's just like if you
were to subtract the two numbers on paper. In other words, you have to
think about lining up the decimal point. If you have 100 - 1.002e-13,
you would have

100.0000000000000000
- 000.0000000000001002

except at the end, you can only have say 15 digits after the first
nonzero digit, but here the first nonzero digit would still be on the
order of 100 (actually in the tens place), so by the time you get to
15 you lose the precision of the 2.

Of course, this all actually happens in base 2, but other than that
it's the same idea.

Am I correct?

Aaron Meurer

Ondřej Čertík

Mar 23, 2014, 12:53:12 AM
to sympy
On Sat, Mar 22, 2014 at 7:06 PM, Aaron Meurer <asme...@gmail.com> wrote:
> Ah, so it's more or less what you would expect. It's just like if you
> were to subtract the two numbers on paper. In other words, you have to
> think about lining up the decimal point. If you have 100 - 1.002e-13,
> you would have
>
> 100.0000000000000000
> - 000.0000000000001002
>
> except at the end, you can only have say 15 digits after the first
> nonzero digit, but here the first nonzero digit would still be on the
> order of 100 (actually in the tens place), so by the time you get to
> 15 you lose the precision of the 2.
>
> Of course, this all actually happens in base 2, but other than that
> it's the same idea.
>
> Am I correct?

Yep, that's it. Good numerical algorithms avoid this, but as I've shown above, taking a long symbolic formula from SymPy, like Legendre, and plugging in a double precision number simply fails miserably.

Ondrej