Thanks again.
By the way, I haven't been through your root test code yet. My weekend
has been swallowed up!! But I promise it is very near the top of my
todo list now. I have some refereeing of another nature to do and a
lecture to give, and then I am about there...
>
> The idea is old, I've sent a similar contribution to gmp years ago, I
> don't know why they haven't used it.
The GMP developers don't appear to commit themselves to using
contributions. We don't either, of course, though we try hard. It is
always easier if people commit to merging the code themselves into our
repo. But this can be difficult and it isn't for everyone. I can only
guess that in the case of GMP, they want the code to be in a certain
style and to meet a certain standard otherwise it doesn't get merged
unless someone finds time to rewrite it. Did you receive any feedback
on the code from them?
Bill.
My feeling, looking through your code, is that:
* It really sits *on top of* the mpz layer
* It relies on some code for single limbs that is not abstracted
anywhere in MPIR, but should be
It should:
* be implemented *in* the new mpir_n layer
* the unsigned long portion should be abstracted out into a new long
layer (just as there is a longlong.h, there should be a long.c and
long.h)
But I'd also be interested to know: have you ever written
anything for the mpn layer, and would you be interested in working on
some code at that level, to help us towards our goals there?
Yes, that sounds great!
>
>> * Support for parallel processing (e.g. multithreaded code) and GPU's
>
> This is the part I am curious about: what are your plans there? In
> particular, at what level do you place the parallelism? The code is
> already thread-safe. Do you want to port it so it runs on the GPU, and
> each GPU core can do a separate operation (the parallelism is still at
> the level of the application)?
No, that's not the plan at this stage. I doubt it would be very useful.
> Or do you want to put the parallelism
> at a lower level so the many cores of the GPU are used for a single
> mpn operation?
Yes, this is the plan at least.
> For multi-threading, there are some easy possibilities
> in mpq (not so much in mpz, except for things like the Miller-Rabin
> test) and inside mpn, for instance in the fft and toom codes (a couple
> openmp pragmas and a second thread are enough to reduce the runtime of
> a large multiplication by 25%), assuming applications are unable to
> get enough parallelism on their own.
That's pretty much what we have in mind. However, we are also looking
at writing GPU assembly routines directly.
>
> I was surprised to see you decided to drop the nails support. A nails
> support extended to have a non-unique representation of numbers allows
> a locality of operations that can help both for parallelism and vector
> architectures, which seem to be 2 of your goals. But I guess there are
> other ways...
For the time being, nails were just an extra headache for developers
to maintain, with no real benefit at present. Attracting lots of
developers is the greater priority at this point in the project.
I suspect that (as you indicate), nails would have to be modified to
make it useful. I personally haven't thought about its implications
for parallel processing.
If you have some ideas, it would be interesting to hear them.
>
>> * Writing a new mpir_n layer (similar to mpn, but more robust)
>
> What is "more robust"?
At present the mpn layer has a somewhat inconsistent interface, which
makes it difficult for end users: some functions require inputs to be
restricted in certain ways, aliasing is sometimes allowed and
sometimes not, and so on.
Bill.
Probably we'll parallelise more than the FFT. It's hard to say where
parallel code will stop being useful, but I would think we'd get down
to parallel Karatsuba.
>
>> > For multi-threading, there are some easy possibilities
>> > in mpq (not so much in mpz, except for things like the Miller-Rabin
>> > test) and inside mpn, for instance in the fft and toom codes (a couple
>> > openmp pragmas and a second thread are enough to reduce the runtime of
>> > a large multiplication by 25%), assuming applications are unable to
>> > get enough parallelism on their own.
>>
>> That's pretty much what we have in mind.
>
> Is there a branch somewhere with people working on it? Have you
> decided on a framework (not sure that's the right term), like openmp
> for instance?
Hopefully that will sort itself out in the next two weeks. There's no
branch as such at this point, so if you are thinking you'd like to be
involved, you turned up at just the right moment. I personally won't
be starting on that for two more weeks though, as I have some other
things to finish off first.
I believe we will start by using OpenMP. However there will be three
fronts on which we will be working on parallelism:
1) OpenMP C code for "ordinary" multi-CPU machines - we'll start with
multiplication, as it is the most important - but possibly not the
FFT, as the FFT we finally want to use in MPIR is not in there yet.
My intention is to introduce new functions, mpir_mt_blah(),
mpn_mt_blah() and mpz_mt_blah(), where "mt" stands for multithreaded.
2) Use of SPEs on the Cell architecture (we won't start work on this
for four weeks, as I won't have access to the hardware again until
then)
3) GPUs, specifically via NVIDIA GPU assembly code (we have access to
hardware now)
>
>> I suspect that (as you indicate), nails would have to be modified to
>> make it useful. I personally haven't thought about its implications
>> for parallel processing.
>> If you have some ideas, it would be interesting to hear them.
>
> Oh, nothing special. If you are in a model of computation where you
> can do as many operations in parallel as you want as long as there is
> no dependency or interference between them (that is not really a good
> approximation of a GPU or anything else though), without nails, an
> operation like sub(2^n,1) or add(2^n-1,1) must still take linear time
> to propagate the carry. However, if you allow some non-determinism
> (including nail bits), you can make add_n a constant time operation
> (for instance 1: add limb by limb; 2: canonicalize even limbs, adding
> the carry to the next limb; 3: same thing for odd limbs; in the end
> some even limbs are not canonical, but that's ok), and similarly for
> sub_n (which shows that the nail representation required is not as
> simple as it looked from add_n), mul_1, etc. Although it might be
> better to group limbs into blocks of 10 limbs and have nails for each
> block instead of each limb. Now that may be completely irrelevant for
> any realistic implementation... It places the parallelism at the
> lowest level, which is why I was mentioning vector architectures, but
> I don't think any vector architecture provides the right instructions
> to make it useful. And I don't think you are trying to create some
> bignum hardware and minimizing the instruction depth either :-)
That's a nice idea. I wonder if it makes sense to have a NAILS mode
which is supported by all functions, or whether it is more useful to
implement a nails layer in low level parallel code which supports a
higher level ordinary non-nailified interface. What I mean is, suppose
someone wants to multiply two n-limb integers. First these get split
up into nailified limbs, the data is then operated on using
nail-enabled functions, and finally the data is converted back to
ordinary integers. This means that maintenance of a nails-enabled interface can
be managed per architecture and only a handful of functions need to
support it. It also means that nails can be used only where they are
certain to be faster. It also means libraries such as NTL, Pari, etc,
which cannot currently use nails at the mpn level, will still operate
with nails turned on, even on parallel architectures.
Another thing we want to support is scalar operations. One will pass
an array of integers to such a function and the same operation will be
performed on them all. We leave it to the user to figure out what to
do with such scalar functions. Such functions can be trivially
parallelised, if the interface is clever enough.
It might end up that we don't parallelise the FFT, but instead
implement a small prime FFT or multimodular multiplication for large
integers, which can be parallelised more trivially. I'm not too keen
on trying to push integer data through parallel floating point FFTs,
as we need to be sure no data is lost, and current provable bounds for
floating point FFTs are pretty pathetic.
Bill.