Thanks again.
By the way, I haven't been through your root test code yet. My weekend
has been swallowed up!! But I promise it is very near the top of my
todo list now. I have some refereeing of another nature to do and a
lecture to give, and then I am about there...
>
> The idea is old, I've sent a similar contribution to gmp years ago, I
> don't know why they haven't used it.
The GMP developers don't appear to commit themselves to using
contributions. We don't either, of course, though we try hard. It is
always easier if people commit to merging the code themselves into our
repo. But this can be difficult and it isn't for everyone. I can only
guess that in the case of GMP, they want the code to be in a certain
style and to meet a certain standard otherwise it doesn't get merged
unless someone finds time to rewrite it. Did you receive any feedback
on the code from them?
Bill.
My feeling, looking through your code, is that:
* It really sits *on top of* the mpz layer
* It relies on some code for single limbs that is not abstracted
anywhere in MPIR, but should be
It should:
* be implemented *in* the new mpir_n layer
* the unsigned long portion should be abstracted out into a new long
layer (just as there is a longlong.h, there should be a long.c and
long.h)
But I'd also be interested to know: have you ever written
anything for the mpn layer, and would you be interested in working on
some code at that level, to help us towards our goals there?
Yes, that sounds great!
>
>> * Support for parallel processing (e.g. multithreaded code) and GPU's
>
> This is the part I am curious about: what are your plans there? In
> particular, at what level do you place the parallelism? The code is
> already thread-safe. Do you want to port it so it runs on the GPU, and
> each GPU core can do a separate operation (the parallelism is still at
> the level of the application)?
No, that's not the plan at this stage. I doubt it would be very useful.
> Or do you want to put the parallelism
> at a lower level so the many cores of the GPU are used for a single
> mpn operation?
Yes, this is the plan at least.
> For multi-threading, there are some easy possibilities
> in mpq (not so much in mpz, except for things like the Miller-Rabin
> test) and inside mpn, for instance in the fft and toom codes (a couple
> openmp pragmas and a second thread are enough to reduce the runtime of
> a large multiplication by 25%), assuming applications are unable to
> get enough parallelism on their own.
That's pretty much what we have in mind. However, we are also looking
at writing GPU assembly routines directly.
>
> I was surprised to see you decided to drop the nails support. A nails
> support extended to have a non-unique representation of numbers allows
> a locality of operations that can help both for parallelism and vector
> architectures, which seem to be 2 of your goals. But I guess there are
> other ways...
For the time being, nails were just an extra headache for developers
to maintain, with no real benefit at present. Attracting lots of
developers is the greater priority at this point in the project.
I suspect that (as you indicate), nails would have to be modified to
make it useful. I personally haven't thought about its implications
for parallel processing.
If you have some ideas, it would be interesting to hear them.
>
>> * Writing a new mpir_n layer (similar to mpn, but more robust)
>
> What is "more robust"?
At present the mpn layer has a somewhat inconsistent interface, which
makes it difficult for end users: some functions require inputs to be
restricted in certain ways, aliasing is sometimes allowed and
sometimes not, and so on.
Bill.
Probably we'll parallelise more than the FFT. It's hard to say where
parallel code will stop being useful, but I would think we'd get down
to parallel Karatsuba.
>
>> > For multi-threading, there are some easy possibilities
>> > in mpq (not so much in mpz, except for things like the Miller-Rabin
>> > test) and inside mpn, for instance in the fft and toom codes (a couple
>> > openmp pragmas and a second thread are enough to reduce the runtime of
>> > a large multiplication by 25%), assuming applications are unable to
>> > get enough parallelism on their own.
>>
>> That's pretty much what we have in mind.
>
> Is there a branch somewhere with people working on it? Have you
> decided on a framework (not sure that's the right term), like openmp
> for instance?
Hopefully that will sort itself out in the next two weeks. There's no
branch as such at this point, so if you are thinking you'd like to be
involved, you turned up at just the right moment. I personally won't
be starting on that for two more weeks though, as I have some other
things to finish off first.
I believe we will start by using OpenMP. However there will be three
fronts on which we will be working on parallelism:
1) OpenMP C code for "ordinary" multi-CPU machines - we'll start with
multiplication, as it is the most important - but possibly not the
FFT, as the FFT we finally want to use in MPIR is not in there yet.
My intention is to introduce new functions, mpir_mt_blah(),
mpn_mt_blah() and mpz_mt_blah(), where "mt" stands for multithreaded.
2) Use of SPEs on the Cell architecture (we won't start work on this
for four weeks, as I won't have access to the hardware again until
then)
3) GPUs, specifically via NVIDIA GPU assembly code (we have access to
hardware now)
>
>> I suspect that (as you indicate), nails would have to be modified to
>> make it useful. I personally haven't thought about its implications
>> for parallel processing.
>> If you have some ideas, it would be interesting to hear them.
>
> Oh, nothing special. If you are in a model of computation where you
> can do as many operations in parallel as you want as long as there is
> no dependency or interference between them (that is not really a good
> approximation of a GPU or anything else though), without nails, an
> operation like sub(2^n,1) or add(2^n-1,1) must still take linear time
> to propagate the carry. However, if you allow some non-determinism
> (including nail bits), you can make add_n a constant time operation
> (for instance 1: add limb by limb; 2: canonicalize even limbs, adding
> the carry to the next limb; 3: same thing for odd limbs; in the end
> some even limbs are not canonical, but that's ok), and similarly for
> sub_n (which shows that the nail representation required is not as
> simple as it looked from add_n), mul_1, etc. Although it might be
> better to group limbs into blocks of 10 limbs and have nails for each
> block instead of each limb. Now that may be completely irrelevant for
> any realistic implementation... It places the parallelism at the
> lowest level, which is why I was mentioning vector architectures, but
> I don't think any vector architecture provides the right instructions
> to make it useful. And I don't think you are trying to create some
> bignum hardware and minimizing the instruction depth either :-)
That's a nice idea. I wonder if it makes sense to have a NAILS mode
which is supported by all functions, or whether it is more useful to
implement a nails layer in low level parallel code which supports a
higher level ordinary non-nailified interface. What I mean is, suppose
someone wants to multiply two n-limb integers. First these get split
up into nailified limbs, the data is then operated on using
nail-enabled functions, and finally the data is converted back to
ordinary integers. This means that maintenance of a nails-enabled interface can
be managed per architecture and only a handful of functions need to
support it. It also means that nails can be used only where they are
certain to be faster. It also means libraries such as NTL, Pari, etc,
which cannot currently use nails at the mpn level, will still operate
with nails turned on, even on parallel architectures.
Another thing we want to support is scalar operations. One will pass
an array of integers to such a function and the same operation will be
performed on them all. We leave it to the user to figure out what to
do with such scalar functions. Such functions can be trivially
parallelised, if the interface is clever enough.
It might end up that we don't parallelise the FFT, but instead
implement a small prime FFT or multimodular multiplication for large
integers, which can be parallelised more trivially. I'm not too keen
on trying to push integer data through parallel floating point FFTs,
as we need to be sure no data is lost, and current provable bounds for
floating point FFTs are pretty pathetic.
Bill.