Basecase division oddity

38 views
Skip to first unread message

Bill Hart

unread,
Apr 9, 2013, 12:02:55 PM4/9/13
to mpir-devel, flint-devel
Hi all,

I've been working extremely hard the past few days to try and get a new basecase (classical) integer division algorithm working fast. Previously I reported timings to the MPIR list without realising that GMP has specialised basecase approximate quotient code which is way faster than the code I was comparing against. So it looked like my new code was way faster, but it wasn't.

In fact, not only so, but the division algorithm I was using for basecase divrem turned out to be way slower than GMP's basecase divrem after just a few limbs. The crossover to the divide and conquer divrem is at something like 170 limbs, making the new code useless. 

So everything looked pretty gloomy for my code and new algorithm.

All I had was a 3x2 limb approximate quotient with precomputed inverse. But GMP is based on a 3x2 divrem, giving full quotient and remainder, based on a (different) precomputed inverse.

So the other night I sat down and tried to find a full 3x2 divrem algorithm, to replace the Moller-Granlund algorithm used in GMP. 

Amazingly, a variant of my 3x2 approximate quotient algorithm exists, giving a full 3x2 divrem like Moller-Granlund. To me it looks like my variant should be slower than Moller-Granlund. And indeed, last night I was just about to give it up when I made a change that resulted in the following times (these times are for 2n x n limb division on an AMD K10, where the top limb of the dividend is zero and divisor is normalised):

GMP 5.1.1

size = 2: classical = 6.9e-08s,
size = 3: classical = 7.2e-08s,
size = 4: classical = 1.07e-07s,
size = 5: classical = 1.25e-07s,
size = 6: classical = 1.61e-07s,
size = 7: classical = 1.95e-07s,
size = 8: classical = 2.29e-07s,
size = 9: classical = 2.58e-07s,
size = 10: classical = 2.96e-07s,
size = 11: classical = 3.38e-07s,
size = 13: classical = 4.17e-07s,
size = 15: classical = 5.17e-07s,
size = 17: classical = 6.08e-07s,
size = 19: classical = 7.26e-07s,
size = 21: classical = 8.34e-07s,
size = 24: classical = 1.031e-06s,

NEW algorithm (implemented in MPIR)

size = 2: classical = 5.2e-08s,
size = 3: classical = 5.9e-08s,
size = 4: classical = 8.5e-08s,
size = 5: classical = 1.1e-07s, 
size = 6: classical = 1.48e-07s,
size = 7: classical = 1.76e-07s, 
size = 8: classical = 2e-07s, 
size = 9: classical = 2.38e-07s,
size = 10: classical = 2.9e-07s, 
size = 11: classical = 3.27e-07s, 
size = 13: classical = 4.11e-07s,
size = 15: classical = 5.04e-07s,
size = 17: classical = 6e-07s, 
size = 19: classical = 7.06e-07s,
size = 21: classical = 8.06e-07s,
size = 24: classical = 1e-06s,

(From this point on, timings are indistinguishable to within timing inconsistencies.)

These times are quite reproducible. I've also eliminated any possibility that it is due to timing differences between random generation or other features of the profiling. 

So clearly for very small sizes, the new code is up to 33% faster. But at this point I have no clue why. 

I still assume I've made a mistake somewhere. (I've found plenty in my timing code over the past few days.). But I'm being very careful, and things are indeed looking up.

I do have a proof that the new algorithm works. So that part is assured at least. And after quite a bit of work on the implementation, the MPIR test code passes with the new code. So if I have made a mistake somewhere, it is not at all obvious to me.

Here's hoping.

Bill.

Bill Hart

unread,
Apr 9, 2013, 12:52:58 PM4/9/13
to mpir-devel, flint-devel
To add to my confusion, here are the times from MPIR-2.6.0. These timings are again very reproducible.

The MPIR-2.6.0 code is identical to my new code, except for the 3x2 divrem macro. Even more odd is that MPIR-2.6.0 uses virtually identical code to GMP-5.1.1. We use the same precomputed inverse, the same divrem 3x2 macro, the same divrem basecase code. I even changed my MPIR to use the same compiler flags as GMP. But the times are better in MPIR, though still substantially worse than my new code in some cases.

size = 2: classical = 5.6e-08s,
size = 3: classical = 6.9e-08s,
size = 4: classical = 9.6e-08s,
size = 5: classical = 1.25e-07s, 
size = 6: classical = 1.52e-07s, 
size = 7: classical = 1.79e-07s, 
size = 8: classical = 2.12e-07s, 
size = 9: classical = 2.51e-07s, 
size = 10: classical = 2.95e-07s,
size = 11: classical = 3.3e-07s, 
size = 13: classical = 4.08e-07s,
size = 15: classical = 5.05e-07s,
size = 17: classical = 5.97e-07s, 
size = 19: classical = 7.13e-07s, 
size = 21: classical = 8.19e-07s,
size = 24: classical = 9.93e-07s,

How can GMP and MPIR use identical code and yet have different timings!?

I don't know whether to declare my new code a success or to blame the difference on some compiler anomaly!

One thing I haven't looked at is possible differences in longlong.h assembly macros between GMP and MPIR. But I can't see how this could be possible.

Wouldn't it be nice if we could just sit down in the morning, find a new algorithm, and write it up in the afternoon without all these complications! I've been working on division code for weeks now, and the only thing I am confident of is a new algorithm for divide and conquer division which is a 6-10% improvement over GMP.

Bill.

Bill Hart

unread,
Apr 10, 2013, 12:32:34 PM4/10/13
to mpir-devel, flint-devel
I've checked everything much more carefully and I found the issue. mpn_copyi appears to be faster in MPIR than in GMP, and I was using it to copy data into the operands before doing a divrem. I've adjusted for this in the timings below and the results look much more sane.

I also noticed the algorithm is technically not valid for size = 2, so I removed that data point.

The remaining timing differences between GMP and MPIR would be due to slight differences in the speed of mpn_submul_1, which is always run once for size - 2 limbs.

gmp-5.1.1 DIVREM

size = 3: classical = 6.7e-08s,
size = 4: classical = 9.4e-08s,
size = 5: classical = 1.18e-07s, 
size = 6: classical = 1.46e-07s, 
size = 7: classical = 1.86e-07s, 
size = 8: classical = 2.11e-07s, 
size = 9: classical = 2.47e-07s, 
size = 10: classical = 2.77e-07s, 
size = 11: classical = 3.26e-07s,
size = 13: classical = 4.03e-07s, 
size = 15: classical = 5.02e-07s, 
size = 17: classical = 5.90e-07s, 
size = 19: classical = 7.06e-07s,
size = 21: classical = 8.12e-07s,
size = 24: classical = 9.94e-07s,
size = 27: classical = 1.221e-06s,
size = 30: classical = 1.416e-06s,
size = 33: classical = 1.672e-06s,
size = 37: classical = 2.035e-06s,

MPIR-2.6.0

size = 3: classical = 6.3e-08s,
size = 4: classical = 9e-08s,
size = 5: classical = 1.17e-07s,
size = 6: classical = 1.44e-07s,
size = 7: classical = 1.69e-07s,
size = 8: classical = 2.02e-07s,
size = 9: classical = 2.4e-07s,
size = 10: classical = 2.84e-07s,
size = 11: classical = 3.16e-07s,
size = 13: classical = 3.93e-07s,
size = 15: classical = 4.89e-07s,
size = 17: classical = 5.78e-07s,
size = 19: classical = 6.92e-07s,
size = 21: classical = 7.96e-07s,
size = 24: classical = 9.69e-07s,
size = 27: classical = 1.201e-06s,
size = 30: classical = 1.434e-06s,
size = 33: classical = 1.655e-06s,
size = 37: classical = 2.008e-06s,

MPIR-git

size = 3: classical = 5.8e-08s,
size = 4: classical = 8.0e-08s,
size = 5: classical = 1.04e-07s,
size = 6: classical = 1.39e-07s,
size = 7: classical = 1.67e-07s,
size = 8: classical = 1.9e-07s,
size = 9: classical = 2.28e-07s,
size = 10: classical = 2.76e-07s,
size = 11: classical = 3.08e-07s,
size = 13: classical = 3.97e-07s,
size = 15: classical = 4.81e-07s,
size = 17: classical = 5.81e-07s,
size = 19: classical = 6.74e-07s,
size = 21: classical = 7.83e-07s,
size = 24: classical = 9.68e-07s,
size = 27: classical = 1.201e-06s,
size = 30: classical = 1.402e-06s,
size = 33: classical = 1.68e-06s,
size = 37: classical = 2.05e-06s,


The good news is that my algorithm is still faster. I've still no idea why, but I believe the timings now. It looks to be about 5-10%. I guess it just boils down to the code being able to be optimised slightly more by the C compiler, better pipelining or something like that.

I'll check branch prediction just to make sure that is not a factor, but I think the branch prediction should be similar between the two algorithms.

Bill.

Bill Hart

unread,
Apr 10, 2013, 1:52:08 PM4/10/13
to mpir-devel, flint-devel
I found the reason for the difference between GMP's 3x2 divrem macro and mine.

They had

_mask = - (mp_limb_t) ((r1) >= _q0);
(q) += _mask;
 add_ssaaaa ((r1), (r0), (r1), (r0), _mask & (d1), _mask & (d0));
   
whereas I had

    if ((r1) >= _q0) 
    {
       (q)--;
       add_ssaaaa ((r1), (r0), (r1), (r0), (d1), (d0));
    } 

I believe these are equivalent. However, the second appears to be faster on my machine.
    
Perhaps they are anticipating a change in gcc that would see the branch eliminated, or they do it to help gcc on certain architectures. 

I am using gcc-4.7.2 which seems to prefer the second version on my machine.

In fact, with this change, GMP is actually just slightly faster than my code and operation for operation does the same amount of work except for a single extra addition in my case.

So I think we'll make this change in MPIR and stick with the code we have for basecase.

Bill.





Bill Hart

unread,
Apr 10, 2013, 6:33:57 PM4/10/13
to mpir-devel, flint-devel
I'd forgotten why I was doing all this work, namely that for small sizes I have an even faster algorithm which uses my new precomputed inverse. I've now merged this faster algorithm and it has a crossover with the original MPIR basecase at size = 30 on my machine. Above that it uses the old MPIR division code.

With this improved code, the times are now as follows:

new code DIVREM

size = 3: classical = 4.6e-08s,
size = 4: classical = 7.2e-08s,
size = 5: classical = 9.7e-08s, 
size = 6: classical = 1.23e-07s,
size = 7: classical = 1.6e-07s,
size = 8: classical = 1.95e-07s,
size = 9: classical = 2.25e-07s,
size = 10: classical = 2.45e-07s,
size = 11: classical = 3.06e-07s,
size = 13: classical = 3.69e-07s,
size = 15: classical = 4.8e-07s,
size = 17: classical = 5.87e-07s,
size = 19: classical = 6.66e-07s,
size = 21: classical = 8.25e-07s,
size = 24: classical = 9.84e-07s,
size = 27: classical = 1.186e-06s,
size = 30: classical = 1.361e-06s,
size = 33: classical = 1.653e-06s,
size = 37: classical = 2.045e-06s,

gmp-5.1.1 DIVREM

size = 3: classical = 6.7e-08s,
size = 4: classical = 9.4e-08s,
size = 5: classical = 1.18e-07s,
size = 6: classical = 1.46e-07s,
size = 7: classical = 1.86e-07s,
size = 8: classical = 2.11e-07s,
size = 9: classical = 2.47e-07s,
size = 10: classical = 2.77e-07s,
size = 11: classical = 3.26e-07s,
size = 13: classical = 4.03e-07s,
size = 15: classical = 5.02e-07s,
size = 17: classical = 5.90e-07s,
size = 19: classical = 7.06e-07s,
size = 21: classical = 8.12e-07s,
size = 24: classical = 9.94e-06s,
size = 27: classical = 1.221e-06s,
size = 30: classical = 1.416e-06s,
size = 33: classical = 1.672e-06s,
size = 37: classical = 2.035e-06s,

Yes, it's really 45% faster for a 6 x 3 division. There's a bit of a tradeoff in that for a couple of sizes GMP wins. But mostly we are faster.

I'll clean up and commit this to the MPIR repo.

Bill.

Bill Hart

unread,
Apr 12, 2013, 12:46:59 PM4/12/13
to mpir-devel, flint-devel
I have now written fast basecase divapprox code (to give approximate quotient). This code might end up being removed in the end, because it is not clear it will ever get called. But it was necessary for me to do it on my way to rewriting the basecase short division code (which replicates the divapprox code as part of it).

The following timings will hopefully be representative of the short division code. but currently I'm just timing GMP's sbpi1_divappr_q against the MPIR equivalent sb_divappr_q.

new code DIVAPPR

size = 3: classical = 5.7e-08s,
size = 4: classical = 7.5e-08s,
size = 5: classical = 9.7e-08s,
size = 6: classical = 1.17e-07s,
size = 7: classical = 1.41e-07s,
size = 8: classical = 1.65e-07s,
size = 9: classical = 1.89e-07s,
size = 10: classical = 2.14e-07s,
size = 11: classical = 2.38e-07s,
size = 13: classical = 2.92e-07s,
size = 15: classical = 3.64e-07s,
size = 17: classical = 4.28e-07s,
size = 19: classical = 4.89e-07s,
size = 21: classical = 5.96e-07s,
size = 24: classical = 7.25e-07s,
size = 27: classical = 8.7e-07s,
size = 30: classical = 1.043e-06s,
size = 33: classical = 1.2e-06s,
size = 37: classical = 1.475e-06s,

gmp-5.1.1 DIVAPPR

size = 3: classical = 6.7e-08s,
size = 4: classical = 8.9e-08s,
size = 5: classical = 1.13e-07s,
size = 6: classical = 1.37e-07s, 
size = 7: classical = 1.85e-07s, 
size = 8: classical = 1.96e-07s,
size = 9: classical = 2.27e-07s,
size = 10: classical = 2.51e-07s, 
size = 11: classical = 2.99e-07s,
size = 13: classical = 3.54e-07s,
size = 15: classical = 4.35e-07s,
size = 17: classical = 5.07e-07s,
size = 19: classical = 6.1e-07s,
size = 21: classical = 7.05e-07s,
size = 24: classical = 8.36e-07s, 
size = 27: classical = 1.059e-06s, 
size = 30: classical = 1.234e-06s,
size = 33: classical = 1.449e-06s,
size = 37: classical = 1.722e-06s,

So, 17.5% - 30% improvement, with roughly 20% improvement on average.

Pretty happy with that! :-)

MPIR was already slightly faster than GMP due to previously mentioned optimisation. But the new code is uniformly faster up to around size = 30.

I have committed this on the dp1_divrem branch in github.

Speeding up sb_div_q is next!

Bill.

Bill Hart

unread,
Apr 12, 2013, 4:47:58 PM4/12/13
to mpir-devel, flint-devel
The new short division code is now working and appears to be about 20% faster than GMP up to size = 30.

I've committed this on the dp1_divrem branch. 

I'll post the timings later tonight.

Bill.

Bill Hart

unread,
Apr 12, 2013, 8:29:33 PM4/12/13
to mpir-devel, flint-devel
Here are the times for short division (2*size-by-size division with quotient only):

new code DIV

size = 3: classical = 6e-08s,
size = 4: classical = 7.8e-08s,
size = 5: classical = 9.9e-08s,
size = 6: classical = 1.17e-07s, 
size = 7: classical = 1.41e-07s,
size = 8: classical = 1.67e-07s,
size = 9: classical = 1.89e-07s,
size = 10: classical = 2.14e-07s,
size = 11: classical = 2.38e-07s,
size = 13: classical = 2.94e-07s,
size = 15: classical = 3.62e-07s,
size = 17: classical = 4.3e-07s,
size = 19: classical = 4.87e-07s,
size = 21: classical = 5.98e-07s,
size = 24: classical = 7.21e-07s,
size = 27: classical = 8.8e-07s,
size = 30: classical = 1.045e-06s,
size = 33: classical = 1.158e-06s,
size = 37: classical = 1.426e-06s,

gmp-5.1.1 DIV

size = 3: classical = 6.8e-08s,
size = 4: classical = 9e-08s,
size = 5: classical = 1.14e-07s,
size = 6: classical = 1.39e-07s,
size = 7: classical = 1.81e-07s,
size = 8: classical = 1.99e-07s,
size = 9: classical = 2.3e-07s,
size = 10: classical = 2.56e-07s,
size = 11: classical = 3.02e-07s,
size = 13: classical = 3.56e-07s,
size = 15: classical = 4.38e-07s,
size = 17: classical = 5.09e-07s,
size = 19: classical = 6.09e-07s,
size = 21: classical = 7.17e-07s,
size = 24: classical = 8.41e-07s,
size = 27: classical = 1.07e-06s,
size = 30: classical = 1.231e-06s,
size = 33: classical = 1.444e-06s,
size = 37: classical = 1.724e-06s,

So it looks like the new code is up to 25% faster and typically 20% faster.

The next step will be to merge the new divide-and-conquer code I have written, which should speed things up over a much bigger range.

Bill.
Reply all
Reply to author
Forward
0 new messages