First attempt at OpenMP


Bill Hart

Jun 12, 2009, 10:22:54 PM
to mpir-dev
I made a first very rough attempt to parallelise Toom4. I just inserted some

#pragma omp parallel sections

and

#pragma omp section

statements around pairs of independent pointwise multiplications in toom4.
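
Schematically the pattern is this (a minimal sketch; the operand names
here are hypothetical placeholders, not the actual toom4 variables):

#pragma omp parallel sections
  {
#pragma omp section
    mpn_mul_n (r1, u1, v1, n);   /* one pointwise mult */
#pragma omp section
    mpn_mul_n (r2, u2, v2, n);   /* an independent one, in another thread */
  }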

It slows it down nicely. I've got it running about 50% slower now.

This is for multiplying about 450 limbs with toom4. Each of the
pointwise mults is then going to be just over 100 limbs.

Hmm, oh well. It's amazing how pathetic the documentation of openmp is
on the web. It can't be this bad, or no one would use it!

I'll try toom7 tomorrow maybe, see if that is any better. The odd
thing is, 2 cpus start up, but they run at like 45% each.

Bill.

Jason Moxham

Jun 12, 2009, 10:58:03 PM
to mpir...@googlegroups.com
My guess is that overheads/syncing are killing it. What I would try is the kara
mul at sizes 100 to 30000 limbs, just to see how it performs. I'm not
suggesting we use karamul at these sizes, but just to get some understanding of
how the parallelisation goes with a simple algorithm.



Bill Hart

Jun 13, 2009, 11:12:13 AM
to mpir...@googlegroups.com
Something is definitely wrong here. I did the same thing for Toom 7
today and there's a similar slowdown.

Basically it splits into two threads OK, but the CPU loads of the two
threads always sum to 100%, e.g. 65.3% and 34.7%, or 45.5% and
54.5%, etc., and the end result is slower timings, not faster.

I must be using OpenMP incorrectly. No one would use it if it were
this bad. I'm just taking pairs of pointwise multiplications and
splitting so that one of them is executed by one thread and one by the
other.

Bill.

2009/6/13 Jason Moxham <ja...@njkfrudils.plus.com>:

Jason Martin

Jun 13, 2009, 3:16:26 PM
to mpir...@googlegroups.com
The best documentation for OpenMP is here:

http://openmp.org/wp/openmp-specifications/

Get the complete specification.

Try setting the environment variable OMP_NUM_THREADS to 2 or 4 (if
you have 4 cores available) so that it doesn't try to create too many
threads (which I found to be the biggest problem).

--jason

Jason Worth Martin
Asst. Professor of Mathematics
http://www.math.jmu.edu/~martin

Jason Martin

Jun 13, 2009, 3:18:16 PM
to mpir...@googlegroups.com
Actually, you might want to bound the thread nesting depth, too.
(Assuming that you're working with karatsuba sizes that are
recursing.) At best, OpenMP is only going to buy us something at the
very top level...
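
Something like this, as a minimal sketch (both calls are standard
<omp.h> API, the runtime equivalent of the environment variables):

#include <omp.h>

omp_set_num_threads (2);   /* same effect as OMP_NUM_THREADS=2 */
omp_set_nested (0);        /* recursive calls then won't spawn more threads */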

Jason Worth Martin
Asst. Professor of Mathematics
http://www.math.jmu.edu/~martin



Marc

Jun 13, 2009, 3:19:56 PM
to mpir-dev
Hello,

to check whether things are working properly, you could try this in
mpn_mul_fft_full_a:

tp = __GMP_ALLOCATE_FUNC_LIMBS (l);

#pragma omp parallel sections
  {
#pragma omp section
    muh = mpn_mul_fft (op, h, n, nl, m, ml, k2);      /* mu = muh+{op,h} */
#pragma omp section
    mpn_mul_fft_mersenne (tp, l, n, nl, m, ml, k1);   /* B */
  }

If CFLAGS includes -fopenmp and I set OMP_NUM_THREADS=2 at runtime, an
mpn_mul_fft_full with 10^6 limbs takes a lot less time than with a
single thread.

What implementation of openmp are you using, a recent gcc? How are you
taking the timings exactly? Karatsuba with 10000 limbs is faster with
several threads than just 1 (a bit more complicated than the example
above because you need scratch space), so I expect a small speed-up in
the toom4 range is possible.

Bill Hart

Jun 13, 2009, 9:37:16 PM
to mpir...@googlegroups.com
It's the way I'm timing it that's messed up. I was using
mpirbench, which of course uses CPU time, not wall time. Doh.

I'll switch it over to use wall time.
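
For reference, a minimal standalone sketch of the difference (build
with -fopenmp; the loop is just a hypothetical stand-in workload):
CPU time as reported by clock()/getrusage sums over all threads, so a
2-thread run can even look slower, while omp_get_wtime() gives the
elapsed wall time that actually matters.

#include <omp.h>
#include <stdio.h>
#include <time.h>

int main (void)
{
  clock_t c0 = clock ();           /* CPU time, summed across threads */
  double w0 = omp_get_wtime ();    /* wall-clock time */

  double s = 0.0;
#pragma omp parallel for reduction(+:s)
  for (int i = 0; i < 100000000; i++)
    s += 1.0 / (i + 1);            /* stand-in for a multiplication */

  printf ("cpu %.2fs wall %.2fs (s=%f)\n",
          (double) (clock () - c0) / CLOCKS_PER_SEC,
          omp_get_wtime () - w0, s);
  return 0;
}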

Bill.

2009/6/13 Marc <marc....@gmail.com>:

Bill Hart

Jun 13, 2009, 10:02:55 PM
to mpir...@googlegroups.com
The FFT is about 50% faster at 32000 limbs and nearly twice as fast at
1000000 limbs, both of which are quite acceptable speedups.

I'm quite surprised to find toom7 only about 10% faster at 2000 limbs,
at best. That certainly wouldn't represent value for money, given that
two CPU cores are being used.

I think the next experiment will be the one suggested by Jason. See
how Karatsuba performs for different sizes up to say 30000 limbs.

Bill.

2009/6/14 Bill Hart <goodwi...@googlemail.com>:

Bill Hart

Jun 13, 2009, 10:42:25 PM
to mpir...@googlegroups.com
Even at 300000 limbs karatsuba seems to run twice as slow. Seems very
bad. That's a 4s computation. And that's with OMP_NESTED=false.

I'm using speed to time it and speed shows an almost two times speedup
for the parallelised FFT. So I don't know what is going on. It must be
nesting even though I've told it not to.

Jason Moxham

Jun 13, 2009, 10:46:48 PM
to mpir...@googlegroups.com
You're comparing kara and base against kara and base with openmp.
You have ONLY kara and basecase mul, nothing else?

Bill Hart

Jun 13, 2009, 10:56:05 PM
to mpir...@googlegroups.com
I'm timing mpn_kara_mul_n using speed. It's just kara and base against
kara and base with openmp.

I tried forcing it to not nest any parallelism by having openmp
pragmas only in the outer level (it calls off to an identical copy of
mpn_kara_mul_n without openmp pragmas instead of recursing to the same
function). No change. Still twice as slow.

It's 50% slower with only 2 threads instead of 3.

They are pretty enormous operands. Perhaps it just overwhelms the
cache more quickly with two processors working at the same time.

Bill.

2009/6/14 Jason Moxham <ja...@njkfrudils.plus.com>:

Jason Moxham

Jun 13, 2009, 11:01:07 PM
to mpir...@googlegroups.com
Perhaps we are hitting main memory or L3 bandwidth? It doesn't seem likely,
but perhaps it is: basecase runs at 2.5 c/l and it takes 2 reads and
1 write per limb.

Jason Moxham

Jun 13, 2009, 11:15:07 PM
to mpir...@googlegroups.com
Yeah, I think that's it. If we run a 10^6 limb basecase, that's 10^12
limb multiply-adds and so around 3*10^12 mem ops, so adding another
thread won't help, because we are already waiting for memory to respond.
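
As a back-of-envelope formula, for an $n$-limb basecase at 3 memory
operations per limb product:

$$\text{mem ops} \approx 3n^2, \qquad n = 10^6 \;\Rightarrow\; 3\times10^{12}.$$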

Bill Hart

Jun 13, 2009, 11:18:02 PM
to mpir...@googlegroups.com
The times jump around a lot. At 1000 limbs it varies from 50% slower
to about the same speed.

At 10000 limbs it is 50% slower, twice as slow at 30000 limbs.

At 300 limbs it is only about 30% slower.

2009/6/14 Jason Moxham <ja...@njkfrudils.plus.com>:

Marc

Jun 14, 2009, 1:56:19 AM
to mpir-dev
Could you share the patches you are using, both to measure the speed
and to parallelize karatsuba? Last time I tried, just putting the last
two calls to mpn_kara_mul_n before the /* Interpolate */ comment in
parallel sections (and giving the last one a fresh buffer instead of
ws+n), the second thread increased the speed for 1000, 10000 or 300000
limbs, it did not make it slower (I used 2 threads, no nesting). For
300000, it is pretty close to a 2x speed-up.
> >> 2009/6/14 Bill Hart <goodwillh...@googlemail.com>:
> >>> The FFT is about 50% faster at 32000 limbs and nearly twice as fast at
> >>> 1000000 limbs, both of which are quite acceptable speedups.
>
> >>> I'm quite surprised to find toom7 only about 10% faster at 2000 limbs,
> >>> at best. That certainly wouldn't represent value for money, given that
> >>> two CPU cores are being used.
>
> >>> I think the next experiment will be the one suggested by Jason. See
> >>> how Karatsuba performs for different sizes up to say 30000 limbs.
>
> >>> Bill.
>
> >>> 2009/6/14 Bill Hart <goodwillh...@googlemail.com>:
> >>>> It's the way I'm timing it that's messed up. I was using the
> >>>> mpirbench, which of course uses cputime not wall time. Doh.
>
> >>>> I'll switch it over to use wall time.
>
> >>>> Bill.
>
> >>>> 2009/6/13 Marc <marc.gli...@gmail.com>:

Bill Hart

Jun 14, 2009, 10:41:32 AM
to mpir...@googlegroups.com
Sure. From line 204 of tune/time.c I just hack it to make it think it
does not have access to getrusage, but only gettimeofday:

#if 1 //&& defined( _MSC_VER)
#define HAVE_GETRUSAGE 0
#define HAVE_GETTIMEOFDAY 1
//#include "getrusage.h"
//#include "gettimeofday.h"
#endif

That does seem to give what I expect for the FFT.

I've attached the mul_n.c which does what I think you said you tried.

Bill.

2009/6/14 Marc <marc....@gmail.com>:
mul_n.c

Marc

Jun 14, 2009, 12:56:14 PM
to mpir-dev
On Jun 14, 7:41 am, Bill Hart <goodwillh...@googlemail.com> wrote:
> Sure. From line 204 of tune/time.c I just hack it to make it think it
> does not have access to getrusage, but only gettimeofday:
>
> #if 1 //&& defined( _MSC_VER)
> #define HAVE_GETRUSAGE 0
> #define HAVE_GETTIMEOFDAY 1
> //#include "getrusage.h"
> //#include "gettimeofday.h"
> #endif
>
> That does seem to give what I expect for the FFT.
>
> I've attached the mul_n.c which does what I think you said you tried.

I think your version doesn't pass the testsuite with
OMP_NUM_THREADS=2. Of the 3 recursive calls to karamul, the first
reads p while the second and third write to p, so you can't run the
first in parallel with the others unless you use some other place to
store the temporary values. Since you end up with one thread writing
to p and one reading from it, the synchronization may be causing your
slow-down.

What I tried is something like:
http://www.loria.fr/~glisse/tmp/mul_n.c
(I purposely try to avoid sending explicit code because I am not sure
I am allowed to, but this should be trivial enough)
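
For the record, one way to break that dependency, sketched with the
variable names from the snippet later in this thread (d1/d2 are new
buffers, MPN_COPY is the gmp-impl.h macro; this is an illustration only,
not necessarily what either attachment does):

/* stash the differences so section 1 no longer reads the p that
   sections 2 and 3 are writing; tp/up are fresh scratch */
MPN_COPY (d1, p, n2);
MPN_COPY (d2, p + n2, n2);
#pragma omp parallel sections
  {
#pragma omp section
    mpn_kara_mul_n (ws, d1, d2, n2, ws + n);
#pragma omp section
    mpn_kara_mul_n (p, a, b, n2, tp);
#pragma omp section
    mpn_kara_mul_n (p + n, a + n2, b + n2, n2, up);
  }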

Bill Hart

Jun 14, 2009, 12:56:22 PM
to mpir...@googlegroups.com
OK, I did what Jason suggested and made the toom3 cutoff really high,
etc, so karatsuba is always used. I also modified the bench program to
time lots of different sizes. Here are the results, which are more
what I expected to see:

1 thread:
MPIRbench.base.multiply.6400,6400 result: 170945
MPIRbench.base.multiply.19200,19200 result: 28837
MPIRbench.base.multiply.64000,64000 result: 4299
MPIRbench.base.multiply.131072,131072 result: 1391
MPIRbench.base.multiply.192000,192000 result: 722
MPIRbench.base.multiply.640000,640000 result: 106
MPIRbench.base.multiply.2097152,2097152 result: 16.5
MPIRbench.base.multiply.6400000,6400000 result: 2.65

2 threads:
MPIRbench.base.multiply.6400,6400 result: 45924
MPIRbench.base.multiply.19200,19200 result: 21737
MPIRbench.base.multiply.64000,64000 result: 4755
MPIRbench.base.multiply.131072,131072 result: 1425
MPIRbench.base.multiply.192000,192000 result: 890
MPIRbench.base.multiply.640000,640000 result: 137
MPIRbench.base.multiply.2097152,2097152 result: 17.8
MPIRbench.base.multiply.6400000,6400000 result: 3.45

3 threads:
MPIRbench.base.multiply.6400,6400 result: 39237
MPIRbench.base.multiply.19200,19200 result: 21635
MPIRbench.base.multiply.64000,64000 result: 7055
MPIRbench.base.multiply.131072,131072 result: 2332
MPIRbench.base.multiply.192000,192000 result: 1546
MPIRbench.base.multiply.640000,640000 result: 246
MPIRbench.base.multiply.2097152,2097152 result: 31.3
MPIRbench.base.multiply.6400000,6400000 result: 6.58

I think whatever I did to time.c wasn't right. So I'll stick to timing
with make bench.

I've attached my copy of runbench and timing.h which I hacked so it
would time parallel code correctly. These are from the bench
directory.

All I did to modify mul_n.c was put openmp pragmas around the central
3 calls to mpn_kara_mul_n and allocate extra scratch space:

else
  {
    mp_limb_t tp[MPN_KARA_MUL_N_TSIZE (n2)];
    mp_limb_t up[MPN_KARA_MUL_N_TSIZE (n2)];
#pragma omp parallel sections
    {
#pragma omp section
      mpn_kara_mul_n (ws, p, p + n2, n2, ws + n);     /* product of the differences */
#pragma omp section
      mpn_kara_mul_n (p, a, b, n2, tp);               /* low half, fresh scratch tp */
#pragma omp section
      mpn_kara_mul_n (p + n, a + n2, b + n2, n2, up); /* high half, fresh scratch up */
    }
  }

and of course in the appropriate gmp-mparam.h for my architecture I
changed the toom3, toom4 and toom7 cutoffs to very large values.

Naturally this is all very hackish at this stage and not much use as
no one in their right mind is going to use karatsuba for 1000 limbs.
Nothing we've tried scales beyond two or three threads either. But at
least openmp does work.

I think the goal should be to have algorithms which scale well with
the number of threads and have everything tuned so that for n = 2
threads there is at least a 50% speedup and for n > 2 threads, there
is at least an n-1 times speedup.

Bill.

2009/6/14 Bill Hart <goodwi...@googlemail.com>:
runbench
timing.h

Bill Hart

Jun 14, 2009, 1:01:37 PM
to mpir...@googlegroups.com
You're absolutely right. I missed that in my original and my new
version. Let me try and get this right.

And no problems with the code. We won't use it unless you explicitly
indicate you are contributing it to the project, say after you checked
it was OK with your institution - in the event that you did work on
something you wanted to contribute.

Bill.

2009/6/14 Marc <marc....@gmail.com>:

Marc

Jun 14, 2009, 1:14:19 PM
to mpir-dev
On Jun 14, 10:01 am, Bill Hart <goodwillh...@googlemail.com> wrote:
> And no problems with the code. We won't use it unless you explicitly
> indicate you are contributing it to the project, say after you checked
> it was OK with your institution - in the event that you did work on
> something you wanted to contribute.

Anything I post on this list, you can use. I am just explaining why I
am not posting much :-/

Bill Hart

Jun 14, 2009, 1:54:08 PM
to mpir...@googlegroups.com
I attach a version of mul_n.c which does pass tests.

Here are the timings:

1 thread:
MPIRbench.base.multiply.6400,6400 result: 170945
MPIRbench.base.multiply.19200,19200 result: 28837
MPIRbench.base.multiply.64000,64000 result: 4299
MPIRbench.base.multiply.131072,131072 result: 1391
MPIRbench.base.multiply.192000,192000 result: 722
MPIRbench.base.multiply.640000,640000 result: 106
MPIRbench.base.multiply.2097152,2097152 result: 16.5

2 threads:
MPIRbench.base.multiply.6400,6400 result: 54658
MPIRbench.base.multiply.19200,19200 result: 20500
MPIRbench.base.multiply.64000,64000 result: 4287
MPIRbench.base.multiply.131072,131072 result: 1467
MPIRbench.base.multiply.192000,192000 result: 876
MPIRbench.base.multiply.640000,640000 result: 127
MPIRbench.base.multiply.2097152,2097152 result: 18.0

3 threads:
MPIRbench.base.multiply.6400,6400 result: 39920
MPIRbench.base.multiply.19200,19200 result: 21620
MPIRbench.base.multiply.64000,64000 result: 6434
MPIRbench.base.multiply.131072,131072 result: 2310
MPIRbench.base.multiply.192000,192000 result: 1560
MPIRbench.base.multiply.640000,640000 result: 228
MPIRbench.base.multiply.2097152,2097152 result: 32.9

It looks like at around 3000 limbs with 3 threads one gets a better
than 2x speedup.

Of course at 3000 limbs there are about 6 recursions before it hits
basecase using just karatsuba. In the real world we'd be well into the
FFT range by the time that many recursions had occurred with
Toom3/4/7.

Having said that, the usual score at 131072 x 131072 when toom7 is
used is 2074, and at 192000,192000 the score is usually 1153, so we
actually beat those with parallel karatsuba. At 640000,640000 the
score is usually 246, so we don't beat that.

There is hope, but it is unclear whether we'll ever get a worthwhile
speedup at the karatsuba range using openmp.

Bill.

2009/6/14 Marc <marc....@gmail.com>:

Bill Hart

Jun 14, 2009, 1:54:46 PM
to mpir...@googlegroups.com
Oops, the attachment.

Bill.

2009/6/14 Bill Hart <goodwi...@googlemail.com>:
mul_n.c

Bill Hart

Jun 14, 2009, 2:48:42 PM
to mpir...@googlegroups.com
I tried parallelising just the toom3 (and setting the toom4 cutoff
extremely high, etc). As there are 5 pointwise mults for toom3, one
can only make two pairs of parallelised pointwise mults and use 2
threads. With a little rearrangement and addition of an extra
temporary space, this works. Here are the times:

MPIRbench.base.multiply.6400,6400 result: 34044
MPIRbench.base.multiply.19200,19200 result: 10874
MPIRbench.base.multiply.64000,64000 result: 4090
MPIRbench.base.multiply.131072,131072 result: 2140
MPIRbench.base.multiply.192000,192000 result: 1161
MPIRbench.base.multiply.640000,640000 result: 234
MPIRbench.base.multiply.2097152,2097152 result: 43.0
MPIRbench.base.multiply.6400000,6400000 result: 8.28
MPIRbench.base.multiply.19200000,19200000 result: 1.65

Again, nothing to write home about. At 131072 and 192000 bits we get
just slightly better performance than with a tuned single-threaded
implementation.

Bill Hart

Jun 14, 2009, 3:15:41 PM
to mpir...@googlegroups.com
Herewith timings for just toom4 parallelised with 2 threads:

MPIRbench.base.multiply.6400,6400 result: 173445
MPIRbench.base.multiply.19200,19200 result: 31849
MPIRbench.base.multiply.64000,64000 result: 5632
MPIRbench.base.multiply.131072,131072 result: 2107
MPIRbench.base.multiply.192000,192000 result: 1482
MPIRbench.base.multiply.640000,640000 result: 299
MPIRbench.base.multiply.2097152,2097152 result: 53.1
MPIRbench.base.multiply.6400000,6400000 result: 11.8
MPIRbench.base.multiply.19200000,19200000 result: 2.43

Here we just beat ordinary mpir at 131072 bits (2000 limbs), there is
hope at 192000 bits (3000 limbs), and the bench at 640000 bits (10000
limbs) is usually 246, so we are just ahead there.

Marc

Jun 14, 2009, 3:18:45 PM
to mpir-dev
On Jun 14, 10:54 am, Bill Hart <goodwillh...@googlemail.com> wrote:
> I attach a version of mul_n.c which does pass tests.
>
> Here are the timings:
[...]
> It looks like at around 3000 limbs with 3 threads one gets a better
> than 2x speedup.

That looks better indeed.

With 2 threads (I don't have any hardware to play with 3 threads,
which is what your implementation is targeting), at 1000 limbs, your
version is just breaking even with the single-threaded one, whereas
the one I posted (a proof of concept targeting 2 threads) already
shows a 20% speedup at least. I wonder how hard it will be to make
something that works well whatever the number of threads you give
it...

> There is hope, but it is unclear whether we'll ever get a worthwhile
> speedup at the karatsuba range using openmp.

Indeed. But I think Jason suggested this mostly (or at least
partially) as an easy way to find out how much hope there is for a
parallel toom, and it doesn't look so bad. Anyway, the most
interesting part is the FFT, which offers many possibilities for
parallelism, but where you really need to consider memory locality.

I also expect that implementations are making progress on reducing
the overhead; at some point it will be necessary to do some
comparisons of the various implementations available.

Bill Hart

Jun 14, 2009, 3:36:49 PM
to mpir...@googlegroups.com
Here are the benches for toom7 parallelised with 2 threads:

MPIRbench.base.multiply.64000,64000 result: 4149
MPIRbench.base.multiply.131072,131072 result: 2120
MPIRbench.base.multiply.192000,192000 result: 1419
MPIRbench.base.multiply.640000,640000 result: 318
MPIRbench.base.multiply.2097152,2097152 result: 62.5
MPIRbench.base.multiply.6400000,6400000 result: 14.7
MPIRbench.base.multiply.19200000,19200000 result: 3.10

Certainly at 640000 there is hope (by 2097152 we just about catch up
with the FFT). But that's only 10000 limbs. Amazing that numbers have
to be so large before there is a really clear benefit, even from 2
threads.

It does seem that parallelising the FFT is the main hope.

Bill.


2009/6/14 Marc <marc....@gmail.com>:

Bill Hart

Jun 14, 2009, 3:44:25 PM
to mpir...@googlegroups.com
2009/6/14 Marc <marc....@gmail.com>:

>
> On Jun 14, 10:54 am, Bill Hart <goodwillh...@googlemail.com> wrote:
>> I attach a version of mul_n.c which does pass tests.
>>
>> Here are the timings:
> [...]
>> It looks like at around 3000 limbs with 3 threads one gets a better
>> than 2x speedup.
>
> That looks better indeed.
>
> With 2 threads (I don't have any hardware to play with 3 threads,
> which is what your implementation is targeting), at 1000 limbs, your
> version is just breaking even with the single-threaded one, whereas
> the one I posted (a proof of concept targeting 2 threads) already
> shows a 20% speedup at least. I wonder how hard it will be to make
> something that works well whatever the number of threads you give
> it...

Very hard with these algorithms I think.

>
>> There is hope, but it is unclear whether we'll ever get a worthwhile
>> speedup at the karatsuba range using openmp.
>
> Indeed.  But I think Jason suggested this mostly (or at least
> partially) as an easy way to find out how much hope there is for a
> parallel toom, and it doesn't look so bad.

Sure.

> Anyway, the most
> interesting part is the FFT, that offers many possibilities for
> parallelism, but where you really need to consider memory locality.

Yes.

>
> I also expect that implementations are making progress on reducing the
> overhead, at some point it will be necessary to do some comparisons of
> the various implementations available.

Yes, though I think the main focus of reducing overhead is using less
memory. However this is precisely the thing which prevents the toom
algorithms being parallelised. So I am not too hopeful that this will
be useful.

I didn't actually read the GMP code in detail for Toom 4 so I don't
know precisely what they have done, and I see they have kept working
on it since. But we already have a pretty competitive benchmark in
that range, so I'm not sure there's too much in the way of overhead to
shave off.

Had we not implemented toom7 then parallelising toom4 would probably
look pretty good. But toom7 eventually beats toom4 by quite a bit, so
parallelising toom4 looks much worse.

There is also the question of whether we should have a toom13 or
something like that. This might make parallelising the toom7 look bad.

Essentially I feel we are best parallelising the FFT for now and
working our way down. There's certainly plenty of opportunity to
parallelise the FFT. The only problem there is we don't intend to use
the current FFT forever, as a much better one exists, and the other
one (in FLINT) can also be parallelised for many threads. But I
suppose if for the time being we can demonstrate a meaningful speedup
in the current FFT without too much work, it will be worth doing.

Bill.

Bill Hart

Jun 14, 2009, 3:53:47 PM
to mpir...@googlegroups.com
Oh, there is one sense in which optimising the toom code may help, and
that is in the addition of more and faster assembly code for some of
the evaluation and interpolation stuff. As the overhead there drops,
the effect of speeding up the pointwise mults becomes more noticeable
as it represents a greater proportion of the time in the first place.
So you are certainly right in that sense.

The one catch with this argument of course is that by the time
parallelising the toom code is a clear winner, the linear parts of the
toom algorithms contribute so little that optimising them becomes
almost worthless. Also, there are fewer and fewer opportunities for
such optimisation as k increases in toom-k.

I would be surprised if it were possible to improve toom7 by more than
2% by optimising the linear parts. Using 2 threads instead of 1 in the
pointwise mults maybe makes this look more like 4%. But it still isn't
enough to justify parallelising toom7, especially if the FFT is much
faster when parallelised.

Of course we won't know until we try....

Bill.

2009/6/14 Marc <marc....@gmail.com>:

Bill Hart

Jun 14, 2009, 4:10:38 PM
to mpir...@googlegroups.com
Here I have parallelised the FFT with two threads:

MPIRbench.base.multiply.192000,192000 result: 1692
MPIRbench.base.multiply.640000,640000 result: 450
MPIRbench.base.multiply.2097152,2097152 result: 83.3
MPIRbench.base.multiply.6400000,6400000 result: 17.2
MPIRbench.base.multiply.19200000,19200000 result: 6.25

Here are the standard benches:

MPIRbench.base.multiply.192000,192000 result: 1151
multiply 640000 640000
MPIRbench.base.multiply.640000,640000 result: 249
multiply 2097152 2097152
MPIRbench.base.multiply.2097152,2097152 result: 63.0
multiply 6400000 6400000
MPIRbench.base.multiply.6400000,6400000 result: 15.4
multiply 19200000 19200000
MPIRbench.base.multiply.19200000,19200000 result: 5.44

So there is a clear initial win, with diminishing returns later on.
Interesting. I'd bet good money that toom13 would make this look
nowhere near as appealing.

Bill.

2009/6/14 Bill Hart <goodwi...@googlemail.com>:

Marc

Jun 14, 2009, 4:15:17 PM
to mpir-dev
On Jun 14, 12:44 pm, Bill Hart <goodwillh...@googlemail.com> wrote:
> 2009/6/14 Marc <marc.gli...@gmail.com>:
> > I also expect that implementations are making progress on reducing the
> > overhead, at some point it will be necessary to do some comparisons of
> > the various implementations available.
>
> Yes, though I think the main focus of reducing overhead is using less
> memory. However this is precisely the thing which prevents the toom
> algorithms being parallelised. So I am not too hopeful that this will
> be useful.
>
> I didn't actually read the GMP code in detail for Toom 4 so I don't
> know precisely what they have done, and I see they are still working
> on it since. But we already have a pretty competitive benchmark in
> that range, so I'm not sure there's too much in the way of overhead to
> shave off.

Oh, I realize my post wasn't clear at all. I was talking about
comparing openmp implementations (gcc vs intel vs sun vs ..., and
maybe linux vs freebsd vs ...), not comparing toom implementations.
There is certainly a lot of freedom in how the split/merge of threads
is done, and the comparison between the single-thread and parallel
versions of toom may not be the same for better implementations of
openmp.

> Essentially I feel we are best parallelising the FFT for now and
> working our way down. There's certainly plenty of opportunity to
> parallelise the FFT. The only problem there is we don't intend to use
> the current FFT forever, as a much better one exists, and the other
> one (in FLINT) can also be parallelised for many threads.

Any idea on when you plan to integrate it in mpir?

William Stein

Jun 14, 2009, 4:15:41 PM
to mpir...@googlegroups.com
On Sun, Jun 14, 2009 at 9:18 PM, Marc<marc....@gmail.com> wrote:
>
> On Jun 14, 10:54 am, Bill Hart <goodwillh...@googlemail.com> wrote:
>> I attach a version of mul_n.c which does pass tests.
>>
>> Here are the timings:
> [...]
>> It looks like at around 3000 limbs with 3 threads one gets a better
>> than 2x speedup.
>
> That looks better indeed.
>
> With 2 threads (I don't have any hardware to play with 3 threads,
> which is what your implementation is targeting), at 1000 limbs,

I hope you guys will try some of these benchmarks/experiments on the
sage.math hardware. I have some Sun X4450s with 24 cores (4
six-core Dunnington Xeon 2.6GHz chips) and a Sun T2 with I guess 16
cores. Marc, if you want to do some serious experimentation on such
hardware let me know and I can get you an account.

-- William

Bill Hart

Jun 14, 2009, 4:17:34 PM
to mpir...@googlegroups.com
2009/6/14 Marc <marc....@gmail.com>:

>
> On Jun 14, 12:44 pm, Bill Hart <goodwillh...@googlemail.com> wrote:
>> 2009/6/14 Marc <marc.gli...@gmail.com>:
>> > I also expect that implementations are making progress on reducing the
>> > overhead, at some point it will be necessary to do some comparisons of
>> > the various implementations available.
>>
>> Yes, though I think the main focus of reducing overhead is using less
>> memory. However this is precisely the thing which prevents the toom
>> algorithms being parallelised. So I am not too hopeful that this will
>> be useful.
>>
>> I didn't actually read the GMP code in detail for Toom 4 so I don't
>> know precisely what they have done, and I see they are still working
>> on it since. But we already have a pretty competitive benchmark in
>> that range, so I'm not sure there's too much in the way of overhead to
>> shave off.
>
> Oh, I realize my post wasn't clear at all. I was talking about
> comparing openmp implementations (gcc vs intel vs sun vs ..., and
> maybe linux vs freebsd vs ...), not comparing toom implementations.
> There is certainly a lot of freedom in how the split/merge of threads
> is done, and the comparison between the single-thread and parallel
> versions of toom may not be the same for better implementations of
> openmp.

Yes, that is a very important consideration. We do currently have
access to the Windows and Intel compilers.

>
>> Essentially I feel we are best parallelising the FFT for now and
>> working our way down. There's certainly plenty of opportunity to
>> parallelise the FFT. The only problem there is we don't intend to use
>> the current FFT forever, as a much better one exists, and the other
>> one (in FLINT) can also be parallelised for many threads.
>
> Any idea on when you plan to integrate it in mpir?

It will take a year probably, as the lower level stuff needs to be
done entirely in assembly (it's currently done in C), and each of the
higher levels will need to be rewritten in a manner that is consistent
with the way the rest of MPIR is written. I also expect that we can
improve it considerably.

Bill.

Bill Hart

Jun 14, 2009, 4:19:15 PM
to mpir...@googlegroups.com
At the moment the code is only *able* to use two threads. But when we
get code that can use more threads, we'll certainly do that. Thanks.

Of course that doesn't stop Marc getting an account on there, as he
may be interested in having a play with more threads, especially as he
indicates he only has access to 2 cores atm.

Bill.

2009/6/14 William Stein <wst...@gmail.com>:

Marc

Jun 14, 2009, 4:22:15 PM
to mpir-dev
On Jun 14, 1:10 pm, Bill Hart <goodwillh...@googlemail.com> wrote:
> Here I have parallelised the FFT with two threads:
[...]
> So there is a clear initial win, with diminishing returns later on.
> Interesting. I'd bet good money that toom13 would make this look
> nowhere near as appealing.

Earlier in the thread, you said the FFT was 50% faster at 32000 limbs
and nearly twice as fast for 10^6 limbs. What part of the FFT did you
parallelise for this test?

Bill Hart

Jun 14, 2009, 4:34:46 PM
to mpir...@googlegroups.com
The same part. That was very approximate. I was taking 83.3/63.0 ~ 90/60 = 1.50.

At 1000000 limbs the bench scores are respectively:

standard: 1.48, parallel (2 threads): 2.62

I realise why the improvement factor jumps around so much. The new FFT
that we use does a Mersenne convolution mod 2^al - 1 and a Fermat
convolution mod 2^l+1. So the convolutions are completely different
lengths (and therefore take radically different times), depending on
the factor a, which varies as the number of limbs being multiplied
goes up. At 1000000 limbs we just happen to be lucky and get a = 1.

At 100000 limbs we have a = 6.

Bill.

2009/6/14 Marc <marc....@gmail.com>:

Marc

Jun 14, 2009, 4:49:04 PM
to mpir-dev
On Jun 14, 1:34 pm, Bill Hart <goodwillh...@googlemail.com> wrote:
> The same part. That was very approximate. I was taking 83.3/63.0 ~ 90/30 = 1.50.
>
> At 1000000 limbs the times are respectively:
>
> standard: 1.48 parallel(2 threads): 2.62
>
> I realise why the improvement factor jumps around so much. The new FFT
> that we use does a Mersenne convolution mod 2^al - 1 and a Fermat
> convolution mod 2^l+1. So the convolutions are completely different
> lengths (and therefore take radically different times), depending on
> the factor a, which varies as the number of limbs being multiplied
> goes up. At 1000000 limbs we just happen to be lucky and get a = 1.
>
> At 100000 limbs we have a = 6.

Ah, ok, thanks for the explanation. I was actually surprised by the 2x
speedup, but now it makes sense. There are certainly other places to
parallelize the code: the pointwise multiplications for instance, or
when we aren't squaring we could compute the transform of both
arguments in parallel.
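
As a sketch, with hypothetical helper names standing in for the real
FFT internals (the point is just that the two forward transforms are
completely independent):

#pragma omp parallel sections
  {
#pragma omp section
    fft_forward (ta, a, len);        /* transform of the first operand */
#pragma omp section
    fft_forward (tb, b, len);        /* transform of the second operand */
  }
#pragma omp parallel for
  for (i = 0; i < len; i++)          /* pointwise products split too */
    tc[i] = pointwise_mul (ta[i], tb[i]);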

Bill Hart

Jun 14, 2009, 4:49:23 PM
to mpir...@googlegroups.com
Ah, bingo, here are the times for 3000-1000000 limbs with a=1 always:

MPIRbench.base.multiply.192000,192000 result: 1686
MPIRbench.base.multiply.640000,640000 result: 449
MPIRbench.base.multiply.2097152,2097152 result: 114
MPIRbench.base.multiply.6400000,6400000 result: 30.4
MPIRbench.base.multiply.19200000,19200000 result: 8.94
MPIRbench.base.multiply.64000000,64000000 result: 2.61

So from 3000 limbs onwards, with a=1 in the FFT it is always worth
using two threads in the FFT.

Bill.

2009/6/14 Bill Hart <goodwi...@googlemail.com>:

Bill Hart

Jun 14, 2009, 4:58:28 PM
to mpir...@googlegroups.com
For very large operands, there's also Bailey's 4-step method,
which breaks one large FFT into lots of smaller ones. I once (with
David Harvey's help) parallelised an FFT in this manner using old-
fashioned pthreads. It was about a month's work, but it did work.
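
For anyone who hasn't seen it, here is a self-contained toy of the
4-step idea on complex doubles. naive_dft is a deliberately dumb O(n^2)
stand-in for a real small FFT, and R and C are tiny just so the whole
thing runs; none of this is MPIR code. Build with: gcc -fopenmp -lm.

#include <complex.h>
#include <math.h>
#include <stdio.h>

#ifndef M_PI
#define M_PI 3.14159265358979323846
#endif

#define R 4
#define C 4
#define N (R * C)

static void naive_dft (double complex *x, int n, int stride)
{
  double complex y[N];
  for (int k = 0; k < n; k++)
    {
      y[k] = 0;
      for (int j = 0; j < n; j++)
        y[k] += x[j * stride] * cexp (-2 * M_PI * I * j * k / n);
    }
  for (int k = 0; k < n; k++)
    x[k * stride] = y[k];
}

int main (void)
{
  double complex a[N];
  for (int i = 0; i < N; i++)
    a[i] = i;                        /* arbitrary test data */

  /* view a[] as an R x C matrix, row-major: entry (r,c) = a[r*C + c] */

#pragma omp parallel for             /* step 1: DFT every column */
  for (int c = 0; c < C; c++)
    naive_dft (a + c, R, C);

  for (int r = 0; r < R; r++)        /* step 2: twiddle factors */
    for (int c = 0; c < C; c++)
      a[r * C + c] *= cexp (-2 * M_PI * I * r * c / N);

#pragma omp parallel for             /* step 3: DFT every row */
  for (int r = 0; r < R; r++)
    naive_dft (a + r * C, C, 1);

  /* step 4: X[k1 + R*k2] of the length-N DFT now sits at a[k1*C + k2] */
  printf ("X[1] = %.3f%+.3fi\n", creal (a[1 * C]), cimag (a[1 * C]));
  return 0;
}

Steps 1 and 3 are loops over independent small transforms, which is
exactly where the threads (or pthreads, in the FLINT experiment above)
go to work.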

Here are links to 2 graphs showing the speedup for various sized
polynomials. The two graphs correspond to single thread vs two threads
and two threads vs four threads respectively:

http://sage.math.washington.edu/home/wbhart/FLINT-talk/msri-par2-vs-nothreads.png
http://sage.math.washington.edu/home/wbhart/FLINT-talk/msri-par4-vs-par2.png

The axes are: bottom axis log_2(poly length), vertical axis
log_2(coefficient bits).

Bill.

2009/6/14 Marc <marc....@gmail.com>:

Bill Hart

Jun 14, 2009, 5:01:11 PM
to mpir...@googlegroups.com
Oh, I forgot to explain: bright blue points are 2x faster, bright red
2x slower. The black region is break even, but the lower left corner
in each case did not use the FFT at all so is completely black.

Bill.

2009/6/14 Bill Hart <goodwi...@googlemail.com>:

Bill Hart

Jun 14, 2009, 10:55:26 PM
to mpir...@googlegroups.com
I found a way to get a better than 2x speedup with 3 threads. One uses
a karatsuba on top of toom7. So initially one splits into 2 pieces
using one layer of karatsuba and uses toom7 for the pointwise mults.
Here are the benches (I adjusted the FFT back up to cut in at 5880
limbs).

MPIRbench.base.multiply.131072,131072 result: 4079
MPIRbench.base.multiply.192000,192000 result: 2518

So from somewhere around 2000 limbs it is worthwhile doing this if one
has at least 3 cores.
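
Schematically (hypothetical names, and assuming a toom7 entry point of
the shape mpn_toom7_mul_n (rp, up, vp, n); the real code needs the
usual Karatsuba evaluation and interpolation around this, plus disjoint
scratch per section):

#pragma omp parallel sections
  {
#pragma omp section
    mpn_toom7_mul_n (p0, a0, b0, n2);   /* low half product */
#pragma omp section
    mpn_toom7_mul_n (pm, da, db, n2);   /* |a0-a1| * |b0-b1| */
#pragma omp section
    mpn_toom7_mul_n (p1, a1, b1, n2);   /* high half product */
  }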

Marc

Jun 15, 2009, 1:33:06 AM
to mpir-dev
On Jun 14, 12:36 pm, Bill Hart <goodwillh...@googlemail.com> wrote:
> Here are the benches for toom7 parallelised with 2 threads:
>
> MPIRbench.base.multiply.64000,64000 result: 4149
> MPIRbench.base.multiply.131072,131072 result: 2120
> MPIRbench.base.multiply.192000,192000 result: 1419
> MPIRbench.base.multiply.640000,640000 result: 318
> MPIRbench.base.multiply.2097152,2097152 result: 62.5
> MPIRbench.base.multiply.6400000,6400000 result: 14.7
> MPIRbench.base.multiply.19200000,19200000 result: 3.10
>
> Certainly at 640000 there is hope (by 2097152 we just about catch up
> with the FFT). But that's 100000 limbs. Amazing that numbers have to
> be so large before there is a really clear benefit, even from 2
> threads.

I tried a parallel (2 threads, but trivial to adapt to more threads)
version of toom7, and at 1000 limbs (right in the middle of the toom7
range I believe) it is 30% faster than the non-threaded toom7 in mpir.
At 10000 limbs, it is 70% faster. Does that match your observations?

Bill Hart

Jun 15, 2009, 7:37:19 AM
to mpir...@googlegroups.com
2009/6/15 Marc <marc....@gmail.com>:

For toom7 I got 318 vs the usual 246, so not even 50% faster.

Are you using gcc? I'm using gcc 4.3.1. Also, did you simply
parallelise the pointwise mults in pairs or something more
sophisticated?

Bill.

Marc

Jun 15, 2009, 1:54:13 PM
to mpir-dev
On Jun 15, 4:37 am, Bill Hart <goodwillh...@googlemail.com> wrote:
> 2009/6/15 Marc <marc.gli...@gmail.com>:
I am using gcc-4.1 (on fedora 8, x64). I added some extra temp space
and renamed a few variables so that I could put all the
multiplications together at the end. It would be possible to include
part of the preliminary computations in the parallel regions but I did
not try that (probably not enough to gain for a proof of concept, but
it will be worth trying in a serious version, and similarly for the
interpolation code).

As part of the experiment I had to de-obfuscate (i.e. de-optimize) the
code a bit, but it did not make it significantly slower.
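
In outline, the restructuring looks like this (all names hypothetical,
not the real toom7 variables; the point is that one loop then scales
with however many threads are available):

/* every product has its own destination and inputs, so one parallel
   loop can spread all 13 of them over the available threads */
#pragma omp parallel for schedule(dynamic)
  for (i = 0; i < 13; i++)
    mpn_mul_n (prod[i], eval_a[i], eval_b[i], m);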

Marc

Jun 15, 2009, 4:18:04 PM
to mpir-dev
On Jun 15, 4:37 am, Bill Hart <goodwillh...@googlemail.com> wrote:
> For toom7 I got 318 vs the usual 246, so not even 50% faster.

Er, 246 is with the FFT, right? I only compared to single-thread
toom7, not the FFT... My point was mostly about the 10^3 limbs case
(30% improvement), which is below the FFT threshold. Actually, at 10^4
limbs, it looks like I have less improvement over the FFT than you do.

Bill Hart

Jun 15, 2009, 5:02:22 PM
to mpir...@googlegroups.com
2009/6/15 Marc <marc....@gmail.com>:

Ah, you are absolutely correct. That 246 should be with the FFT. Let
me just check that....

Bill.

Bill Hart

Jun 15, 2009, 5:11:25 PM
to mpir...@googlegroups.com
I get 226 for toom7 without FFT and I got 318 with two threads at
10000 limbs. So still not 50%. So I think you actually have more
improvement for toom7 than me. It's possibly architecture dependent
and also dependent on the OpenMP implementation, not to mention how
exactly one goes about the parallelisation.

Let me try again and confirm that I didn't do anything wrong. Your
method of moving everything to the end might be better.

Bill.

2009/6/15 Bill Hart <goodwi...@googlemail.com>:

Bill Hart

Jun 15, 2009, 5:28:41 PM
to mpir...@googlegroups.com
There's some variation in the timing, but 310-320 seems typical for
combining the mults in pairs.

So your idea of putting all the multiplications at the end seems to be
a good one. It's also obviously much easier to adjust for more threads
then.

Have you tried 3 threads with your approach?

Jason Moxham

Jun 15, 2009, 5:32:29 PM
to mpir...@googlegroups.com
latest trunk on sage.math

PASSED CC=gcc CXX=g++ configure=
PASSED CC=gcc CXX=g++ configure=--enable-cxx --enable-gmpcompat
PASSED CC=gcc CXX=g++ configure=--enable-cxx --enable-gmpcompat --enable-assert --enable-alloca=debug
PASSED CC=gcc CXX=g++ configure=--enable-cxx --enable-gmpcompat --enable-assert --enable-alloca=debug --build=none-unknown-linux-gnu
PASSED CC=gcc CXX=g++ configure=--enable-fat
PASSED CC=gcc CXX=g++ configure=--enable-fat --enable-cxx --enable-gmpcompat
PASSED CC=gcc CXX=g++ configure=--enable-fat --enable-cxx --enable-gmpcompat --enable-assert --enable-alloca=debug

on cicero, an x86 pent4

PASSED CC=gcc CXX=g++ configure=
PASSED CC=gcc CXX=g++ configure=--enable-cxx --enable-gmpcompat
PASSED CC=gcc CXX=g++ configure=--enable-cxx --enable-gmpcompat --enable-assert --enable-alloca=debug
PASSED CC=gcc CXX=g++ configure=--enable-cxx --enable-gmpcompat --enable-assert --enable-alloca=debug --build=none-pc-linux-gnu
PASSED CC=gcc CXX=g++ configure=--enable-fat
PASSED CC=gcc CXX=g++ configure=--enable-fat --enable-cxx --enable-gmpcompat
PASSED CC=gcc CXX=g++ configure=--enable-fat --enable-cxx --enable-gmpcompat --enable-assert --enable-alloca=debug
PASSED CC=gcc-4.3.3 CXX=g++-4.3.3 configure=
PASSED CC=gcc-4.3.3 CXX=g++-4.3.3 configure=--enable-cxx --enable-gmpcompat
PASSED CC=gcc-4.3.3 CXX=g++-4.3.3 configure=--enable-cxx --enable-gmpcompat --enable-assert --enable-alloca=debug
PASSED CC=gcc-4.3.3 CXX=g++-4.3.3 configure=--enable-cxx --enable-gmpcompat --enable-assert --enable-alloca=debug --build=none-pc-linux-gnu
PASSED CC=gcc-4.3.3 CXX=g++-4.3.3 configure=--enable-fat
PASSED CC=gcc-4.3.3 CXX=g++-4.3.3 configure=--enable-fat --enable-cxx --enable-gmpcompat
PASSED CC=gcc-4.3.3 CXX=g++-4.3.3 configure=--enable-fat --enable-cxx --enable-gmpcompat --enable-assert --enable-alloca=debug
PASSED CC=cc CXX=c++ configure=
PASSED CC=cc CXX=c++ configure=--enable-cxx --enable-gmpcompat
PASSED CC=cc CXX=c++ configure=--enable-cxx --enable-gmpcompat --enable-assert --enable-alloca=debug
PASSED CC=cc CXX=c++ configure=--enable-cxx --enable-gmpcompat --enable-assert --enable-alloca=debug --build=none-pc-linux-gnu
PASSED CC=cc CXX=c++ configure=--enable-fat
PASSED CC=cc CXX=c++ configure=--enable-fat --enable-cxx --enable-gmpcompat
PASSED CC=cc CXX=c++ configure=--enable-fat --enable-cxx --enable-gmpcompat --enable-assert --enable-alloca=debug

on box3, a K7

PASSED CC=gcc CXX=g++ configure=
PASSED CC=gcc CXX=g++ configure=--enable-cxx --enable-gmpcompat
PASSED CC=gcc CXX=g++ configure=--enable-cxx --enable-gmpcompat --enable-assert --enable-alloca=debug
PASSED CC=gcc CXX=g++ configure=--enable-cxx --enable-gmpcompat --enable-assert --enable-alloca=debug --build=none-pc-linux-gnu
PASSED CC=gcc CXX=g++ configure=--enable-fat
PASSED CC=gcc CXX=g++ configure=--enable-fat --enable-cxx --enable-gmpcompat
PASSED CC=gcc CXX=g++ configure=--enable-fat --enable-cxx --enable-gmpcompat --enable-assert --enable-alloca=debug


but now I get this on cygwin

/bin/sh ../libtool --tag=CC --mode=compile gcc -std=gnu99 -DHAVE_CONFIG_H -I. -I/cygdrive/c/Users/jasonadmin/mpir/mpir/branches/test_stuff/../../trunk/mpn -I.. -D__GMP_WITHIN_GMP -I/cygdrive/c/Users/jasonadmin/mpir/mpir/branches/test_stuff/../../trunk -DOPERATION_`echo divis | sed 's/_$//'` -m32 -O2 -fomit-frame-pointer -mtune=nocona -march=nocona -c -o divis.lo divis.c
gcc -std=gnu99 -DHAVE_CONFIG_H -I. -I/cygdrive/c/Users/jasonadmin/mpir/mpir/branches/test_stuff/../../trunk/mpn -I.. -D__GMP_WITHIN_GMP -I/cygdrive/c/Users/jasonadmin/mpir/mpir/branches/test_stuff/../../trunk -DOPERATION_divis -m32 -O2 -fomit-frame-pointer -mtune=nocona -march=nocona -c divis.c -o divis.o
divis.c: In function `__gmpn_divisible_p':
divis.c:122: error: `SIZEOF_MP_LIMB_T' undeclared (first use in this function)
divis.c:122: error: (Each undeclared identifier is reported only once
divis.c:122: error: for each function it appears in.)
make[2]: *** [divis.lo] Error 1
make[2]: Leaving directory `/cygdrive/c/Users/jasonadmin/mpir/mpir/branches/test_stuff/box1-win64/mpn'
make[1]: *** [all-recursive] Error 1
make[1]: Leaving directory `/cygdrive/c/Users/jasonadmin/mpir/mpir/branches/test_stuff/box1-win64'
make: *** [all] Error 2
box1-win64

FAILED CC=gcc CXX=g++ configure=

Jason Moxham

Jun 15, 2009, 5:42:40 PM
to mpir...@googlegroups.com
I think the cygwin failure is OK; normally when you configure, if there is old junk
lying around it warns you, but in this case it didn't.

Jason Moxham

Jun 15, 2009, 7:45:49 PM
to mpir...@googlegroups.com
cygwin passes fat now, so I think we are ready to go.
I'm running a few more tests, but I don't expect any problems...

Jason Moxham

Jun 15, 2009, 8:41:09 PM
to mpir...@googlegroups.com
cygwin passed all tests

box1-win64

PASSED CC=gcc CXX=g++ configure=
PASSED CC=gcc CXX=g++ configure=--enable-cxx --enable-gmpcompat
PASSED CC=gcc CXX=g++ configure=--enable-cxx --enable-gmpcompat --enable-assert --enable-alloca=debug
PASSED CC=gcc CXX=g++ configure=--enable-cxx --enable-gmpcompat --enable-assert --enable-alloca=debug --build=none-pc-cygwin
PASSED CC=gcc CXX=g++ configure=--enable-fat
PASSED CC=gcc CXX=g++ configure=--enable-fat --enable-cxx --enable-gmpcompat
PASSED CC=gcc CXX=g++ configure=--enable-fat --enable-cxx --enable-gmpcompat --enable-assert --enable-alloca=debug



Marc

Jun 16, 2009, 1:44:05 AM
to mpir-dev
On Jun 15, 2:28 pm, Bill Hart <goodwillh...@googlemail.com> wrote:
> There's some variation in the timing, but 310-320 seems typical for
> combining the mults in pairs.
>
> So your idea of putting all the multiplications at the end seems to be
> a good one. It's also obviously much easier to adjust for more threads
> then.
>
> Have you tried 3 threads with your approach?

No. I expect it would be close to a 100% speedup at 10^4 limbs, but I
don't know. If you want to measure, the toom7 file (no need to modify
it for a different number of threads) is here:

http://webloria.loria.fr/~glisse/tmp/mpir/

Bill Hart

Jun 16, 2009, 9:39:13 AM
to mpir...@googlegroups.com
Great. I've merged the trunk changes into the mpir-1.2 branch and
updated all version numbers.

Jason, can you run aclocal, autoheader, autoconf and automake
on mpir-1.2 and commit, and I'll issue the release.

Bill.

2009/6/16 Jason Moxham <ja...@njkfrudils.plus.com>:

Jason Moxham

Jun 16, 2009, 9:49:36 AM
to mpir...@googlegroups.com
done

Bill Hart

Jun 17, 2009, 6:22:32 PM
to mpir...@googlegroups.com
I ran your code Marc and it performs quite well. Here are the benches
up to 3 threads (someone else is using the fourth core on my
machine at present):

1 thread: 218-228
2 threads: 308-337
3 threads: 385-419

It's actually quite close to being efficient according to the measure
I was hoping to achieve, though we are well outside the toom7 range.

I also timed, at 10000 limbs, my karatsuba on top of toom7 code, which
is really only useful at 3 threads:

1 thread: 179
2 threads: 249-250
3 threads: 451-455

Bill.

2009/6/16 Marc <marc....@gmail.com>:

Marc

Jun 17, 2009, 10:30:19 PM
to mpir-dev
On Jun 17, 3:22 pm, Bill Hart <goodwillh...@googlemail.com> wrote:
> I ran your code Marc and it performs quite well. Here are the benches
> up to 3 threads (someone is currently using the fourth core on my
> machine at present):
>
> 1 thread: 218-228
> 2 threads: 308-337
> 3 threads: 385-419

Strange, at 10^4 limbs, on this core 2 duo, with 1 thread the
multiplication takes 9ms and with 2 threads it takes about 5.4ms,
something like a 65% speedup, whereas in your measures it only gives a
45% speedup. Too bad the benefits are not as good on different
machines. Did you try it in the toom7 range? For 10^3 limbs, I have
.37ms for 1 thread and .28ms for 2 threads, a 30% speedup, but you
might have almost no speedup on your machine...

> It's actually quite close to being efficient according to the measure
> I was hoping to achieve, though we are well outside the toom7 range.
>
> I also timed, at 10000 limbs, my karatsuba on top of toom7 code, which
> is really only useful at 3 threads:
>
> 1 thread: 179
> 2 threads: 249-250
> 3 threads: 451-455

Makes sense that karatsuba would work well for 3 threads, there is
almost nothing to do outside of the 3-way parallel region, whereas for
toom it is a bit more work to include part of the evaluations and the
interpolation in the parallel region, although it should be doable
(for the evaluations there is an obvious way to do it, but it is
likely not the most efficient unless you are using 13 threads :-).

Marc

Jun 18, 2009, 12:11:37 AM
to mpir-dev
I analyzed a bit the running time of toom7 for 10^4 limbs. Looking
only at the topmost call to toom7, the evaluations account for 2% of
the running time, the interpolation for 6%, and the multiplications
for 92% (I actually measured 2, 7 and 93 but rounded it so it looked
like the total wasn't more than 100 ;-). Running the 13
multiplications in 2 threads is kind of like having only 7
multiplications, and running them in 3 threads like having only 5. The
running time would then be .08+.92*(7 or 5)/13 times the regular one,
ie a speedup of 73% with 2 threads and 130% for 3 threads, and indeed
with 2 threads I get almost 70%. I guess I should try and experiment
on some other machines to understand why it doesn't work as well on
yours.
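
In formula form, with k threads and hence ceil(13/k) rounds of
pointwise multiplications, the model is:

$$\frac{T_k}{T_1} \approx 0.08 + 0.92\cdot\frac{\lceil 13/k\rceil}{13}, \qquad \frac{T_2}{T_1}\approx 0.58, \qquad \frac{T_3}{T_1}\approx 0.43,$$

i.e. speedups of roughly 1.74x (73%) and 2.30x (130%) respectively.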

Bill Hart

Jun 18, 2009, 9:00:29 AM
to mpir...@googlegroups.com


2009/6/18 Marc <marc....@gmail.com>


> I analyzed a bit the running time of toom7 for 10^4 limbs. Looking
> only at the topmost call to toom7, the evaluations account for 2% of
> the running time, the interpolation for 6%, and the multiplications
> for 92% (I actually measured 2, 7 and 93 but rounded it so it looked
> like the total wasn't more than 100 ;-). Running the 13
> multiplications in 2 threads is kind of like having only 7
> multiplications, and running them in 3 threads like having only 5. The
> running time would then be .08+.92*(7 or 5)/13 times the regular one,
> i.e. a speedup of 73% with 2 threads and 130% for 3 threads, and indeed
> with 2 threads I get almost 70%. I guess I should try and experiment
> on some other machines to understand why it doesn't work as well on
> yours.

That's a very interesting analysis. I've been lazy and never actually conducted any timings, but the relative proportions of times roughly agree with what I had guessed.
 


On Jun 17, 7:30 pm, Marc <marc.gli...@gmail.com> wrote:
> On Jun 17, 3:22 pm, Bill Hart <goodwillh...@googlemail.com> wrote:
>
> > I ran your code Marc and it performs quite well. Here are the benches
> > up to 3 threads (someone is currently using the fourth core on my
> > machine at present):
>
> > 1 thread:   218-228
> > 2 threads: 308-337
> > 3 threads: 385-419
>
> Strange, at 10^4 limbs, on this core 2 duo, with 1 thread the
> multiplication takes 9ms and with 2 threads it takes about 5.4ms,
> something like a 65% speedup, whereas in your measures it only gives a
> 45% speedup. Too bad the benefits are not as good on different
> machines.

On Core 2 basecase multiplication has a latency of closer to 4 cycles per limb whereas on K8 we have code running at about 2.375 c/l. This means that all the cutoffs for the various algorithms are much lower for core 2 (because multiplication now takes much longer compared to other operations). Of course the relative proportions for evaluation/pointwise multiplication/interpolation may be similar on core2 and K8. I don't know how the lower cutoffs might affect things.

Another (perhaps more relevant) effect is that AMD machines are much more critically dependent on good cache locality. By splitting the data up in the way you have, we might be causing issues for the K8.

I've attached my solution. It would be interesting to compare it with your code on your core2 machine too see if my approach still works as well on that machine. It does pass the try tests, but note that it only speeds things up for an even number of limbs. This is for experimentation purposes only. As with all this code we are sharing, it obviously isn't intended to be complete or optimal. You have to rename it to toom7_mul_n.c, but it is a drop in replacement otherwise.

Bill.
 
> Did you try it in the toom7 range? For 10^3 limbs, I have .37ms
> for 1 thread and .28ms for 2 threads, a 30% speedup, but you
> might have almost no speedup on your machine...

Yes, I timed it over the Toom7 range (currently 590 limbs -> infinity). 

Firstly here are the usual MPIR benches over this range (the ones in brackets are with the FFT):

      MPIRbench.base.multiply.64000,64000 result: 5732
      MPIRbench.base.multiply.131072,131072 result: 2076
      MPIRbench.base.multiply.192000,192000 result: 1150
      MPIRbench.base.multiply.640000,640000 result: 225 (249)
      MPIRbench.base.multiply.2097152,2097152 result: 40.2 (63.7)

Here is your code:

1 thread:

      MPIRbench.base.multiply.64000,64000 result: 5580
      MPIRbench.base.multiply.131072,131072 result: 2046
      MPIRbench.base.multiply.192000,192000 result: 1153
      MPIRbench.base.multiply.640000,640000 result: 218
      MPIRbench.base.multiply.2097152,2097152 result: 40.2

2 threads:

      MPIRbench.base.multiply.64000,64000 result: 6404
      MPIRbench.base.multiply.131072,131072 result: 2696
      MPIRbench.base.multiply.192000,192000 result: 1619
      MPIRbench.base.multiply.640000,640000 result: 341
      MPIRbench.base.multiply.2097152,2097152 result: 65.2

3 threads:

      MPIRbench.base.multiply.64000,64000 result: 6478
      MPIRbench.base.multiply.131072,131072 result: 3179
      MPIRbench.base.multiply.192000,192000 result: 1796
      MPIRbench.base.multiply.640000,640000 result: 422
      MPIRbench.base.multiply.2097152,2097152 result: 83.7


1 thread:

      MPIRbench.base.multiply.64000,64000 result: 4771
      MPIRbench.base.multiply.131072,131072 result: 1774
      MPIRbench.base.multiply.192000,192000 result: 1037
      MPIRbench.base.multiply.640000,640000 result: 179
      MPIRbench.base.multiply.2097152,2097152 result: 32.9

2 threads:

      MPIRbench.base.multiply.64000,64000 result: 6137
      MPIRbench.base.multiply.131072,131072 result: 2476
      MPIRbench.base.multiply.192000,192000 result: 1492
      MPIRbench.base.multiply.640000,640000 result: 252
      MPIRbench.base.multiply.2097152,2097152 result: 48.6

3 threads:
 
      MPIRbench.base.multiply.64000,64000 result: 9034
      MPIRbench.base.multiply.131072,131072 result: 4122
      MPIRbench.base.multiply.192000,192000 result: 2491
      MPIRbench.base.multiply.640000,640000 result: 456
      MPIRbench.base.multiply.2097152,2097152 result: 88.5

So my approach is the clear winner for 3 threads, but yours is the clear winner for 2.


>
> > It's actually quite close to being efficient according to the measure
> > I was hoping to achieve, though we are well outside the toom7 range.
>
> > I also timed, at 10000 limbs, my karatsuba on top of toom7 code, which
> > is really only useful at 3 threads:
>
> > 1 thread:   179
> > 2 threads: 249-250
> > 3 threads: 451-455
>
> Makes sense that karatsuba would work well for 3 threads, there is
> almost nothing to do outside of the 3-way parallel region, whereas for
> toom it is a bit more work to include part of the evaluations and the
> interpolation in the parallel region, although it should be doable
> (for the evaluations there is an obvious way to do it, but it is
> likely not the most efficient unless you are using 13 threads :-)

The main issue is data dependence, and we've seen that slows things down. It may actually be faster to make copies of all the data before proceeding with the evaluations. But as you've noted, it is the interpolations which account for most of the rest of the time in toom7 and that will be harder to parallelise effectively. 

toom7_kara_mul_n.c

Bill Hart

Jun 18, 2009, 9:04:01 AM
to mpir...@googlegroups.com
Sorry, pressed the send button too soon:

2009/6/18 Bill Hart <goodwi...@googlemail.com>:

too see -> to see

Here are the timings with my code:

>
> 1 thread:

Bill.

Marc

Jun 18, 2009, 2:10:56 PM
to mpir-dev
On Jun 18, 6:00 am, Bill Hart <goodwillh...@googlemail.com> wrote:
> Another (perhaps more relevant) effect is that AMD machines are much more
> critically dependent on good cache locality. By splitting the data up in the
> way you have, we might be causing issues for the K8.
>
> I've attached my solution. It would be interesting to compare it with your
> code on your core2 machine too see if my approach still works as well on
> that machine.

Up to some scaling, the running time at 10^4 limbs is 4.55 for regular
toom7, 5.55 for your code single-threaded and 3.7 for your code with 2
threads. 3.7/5.55 is close to 2/3, the theoretical value, so there is
no relevant effect from caching on my machine. For comparison, my code
is at 2.7 for 2 threads and I project 1.95 for 3 threads, whereas I
project your code as 1.85 for 3 threads (a winner), although it uses
22% more cpu cycles, so the best choice might depend on how much other
stuff you want to run at the same time as this multiplication.

When I get access to an AMD machine, I'll try to reorganize things in
a more cache friendly way and see what happens.

I hope we don't end up needing different strategies on amd and
intel...

> Yes, I timed it over the Toom7 range (currently 590 limbs -> infinity).

So you only get a 10% speedup from a second thread at 10^3 limbs
(compared to 30% on core2), not worth it.

> The main issue is data dependence, and we've seen that slows things down. It may actually be faster to make copies of all the data before proceeding with the evaluations. But as you've noted, it is the interpolations which account for most of the rest of the time in toom7 and that will be harder to parallelise effectively.

Yes, I am not so worried about the overhead of evaluation+interpolation
when executing 2 things in parallel is so far from being twice as fast.
I am not sure exactly what data dependence you are talking about. The
13 multiplications write to different places, and they even take mostly
different arguments, although the writing places are adjacent so they
might be on the same page.

Bill Hart

unread,
Jun 18, 2009, 2:29:26 PM6/18/09
to mpir...@googlegroups.com


2009/6/18 Marc <marc....@gmail.com>


> On Jun 18, 6:00 am, Bill Hart <goodwillh...@googlemail.com> wrote:
>> Another (perhaps more relevant) effect is that AMD machines are much more
>> critically dependent on good cache locality. By splitting the data up in the
>> way you have, we might be causing issues for the K8.
>>
>> I've attached my solution. It would be interesting to compare it with your
>> code on your core2 machine too see if my approach still works as well on
>> that machine.

> Up to some scaling, the running time at 10^4 limbs is 4.55 for regular
> toom7, 5.55 for your code single-threaded and 3.7 for your code with 2
> threads. 3.7/5.55 is close to 2/3, the theoretical value, so there is
> no relevant effect from caching on my machine.

Oh yeah, you only have two threads, I forgot.
 
For comparison, my code
is at 2.7 for 2 threads and I project 1.95 for 3 threads, whereas I
project your code as 1.85 for 3 threads (a winner), although it uses
22% more cpu cycles, so the best choice might depend on how much other
stuff you want to run at the same time as this multiplication.

When I get access to an AMD machine, I'll try to reorganize things in
a more cache friendly way and see what happens.

If it helps I can give you access to one.
 


I hope we don't end up needing different strategies on AMD and
Intel...

Me too. Also not different strategies for differing numbers of threads!
 

> Yes, I timed it over the Toom7 range (currently 590 limbs -> infinity).

So you only get a 10% speedup from a second thread at 10^3 limbs
(compared to 30% on core2), not worth it.

> The main issue is data dependence, and we've seen that slows things down. It may actually be faster to make copies of all the data before proceeding with the evaluations. But as you've noted, it is the interpolations which account for most of the rest of the time in toom7 and that will be harder to parallelise effectively.

Yes, I am not so worried about the overhead of evaluation
+interpolation when executing 2 things in parallel is so far from
being twice faster. I am not sure exactly what data dependence you are
talking about. The 13 multiplications write to different places, and
they even take mostly different arguments, although the writing places
are adjacent so they might be on the same page.

I was merely talking about reading from the same data in the evaluation phase. One can't write to the same location without putting guards in anyway, so I was thinking more about all the cores wanting to read from the same locations at once.

Depends how it is organised though.
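For what it's worth, if those adjacent write areas ever do matter, one cheap fix would be to pad each product's destination out to a cache line boundary. A rough sketch (64-byte lines assumed; prod_bytes is a hypothetical name for one product's size in bytes):

#define CACHE_LINE_BYTES 64

/* round each product's byte size up to a whole number of cache
   lines, so no two threads ever write into the same line */
size_t padded_bytes = (prod_bytes + CACHE_LINE_BYTES - 1)
                          & ~((size_t) (CACHE_LINE_BYTES - 1));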

Bill.

Bill Hart

unread,
Jun 20, 2009, 10:16:05 AM6/20/09
to mpir...@googlegroups.com
Hi Marc, I notice the Loria site is down and I didn't make a copy of
your toom7 parallel code. I was going to try it on a Core 2 machine
with more than 2 cores. Do you have a copy of it elsewhere you can
email me, or upload to the files section of this group. Or does
someone else have a copy of it?

Bill.

2009/6/18 Bill Hart <goodwi...@googlemail.com>:

Bill Hart

unread,
Jun 20, 2009, 11:13:31 AM6/20/09
to mpir...@googlegroups.com
Here are the results on a core2 machine with lots of cores (my code only, I don't seem to have a copy of Marc's code atm):

1 thread gcc 4.2.4 (with fft):
      MPIRbench.base.multiply.64000,64000 result: 3595
      MPIRbench.base.multiply.131072,131072 result: 1293
      MPIRbench.base.multiply.192000,192000 result: 788
      MPIRbench.base.multiply.640000,640000 result: 212
      MPIRbench.base.multiply.2097152,2097152 result: 56.0
      MPIRbench.base.multiply.6400000,6400000 result: 12.1

1 thread (without fft):
      MPIRbench.base.multiply.64000,64000 result: 3597
      MPIRbench.base.multiply.131072,131072 result: 1290
      MPIRbench.base.multiply.192000,192000 result: 736
      MPIRbench.base.multiply.640000,640000 result: 145
      MPIRbench.base.multiply.2097152,2097152 result: 29.4
      MPIRbench.base.multiply.6400000,6400000 result: 6.64

My code 

3 threads gcc 4.2.4:
      MPIRbench.base.multiply.64000,64000 result: 4742
      MPIRbench.base.multiply.131072,131072 result: 2187
      MPIRbench.base.multiply.192000,192000 result: 1383
      MPIRbench.base.multiply.640000,640000 result: 291
      MPIRbench.base.multiply.2097152,2097152 result: 62.9
      MPIRbench.base.multiply.6400000,6400000 result: 13.9

Note, the libc on this machine is too old to build gcc-4.3.2. I don't think gcc-4.2.4 is a very good implementation of openmp at all.

Bill.


2009/6/20 Bill Hart <goodwi...@googlemail.com>

Bill Hart

unread,
Jun 20, 2009, 12:13:44 PM6/20/09
to mpir...@googlegroups.com
The results with gcc 4.3.0 are *much* more impressive. (I found a machine with this version of gcc on it.)

1 thread (with fft):

      MPIRbench.base.multiply.32000,32000 result: 8459
      MPIRbench.base.multiply.64000,64000 result: 3240
      MPIRbench.base.multiply.131072,131072 result: 1158
      MPIRbench.base.multiply.192000,192000 result: 657
      MPIRbench.base.multiply.640000,640000 result: 187
      MPIRbench.base.multiply.2097152,2097152 result: 50.4
      MPIRbench.base.multiply.6400000,6400000 result: 11.1

1 thread (without fft):
      MPIRbench.base.multiply.32000,32000 result: 8459
      MPIRbench.base.multiply.64000,64000 result: 3239
      MPIRbench.base.multiply.131072,131072 result: 1154
      MPIRbench.base.multiply.192000,192000 result: 656
      MPIRbench.base.multiply.640000,640000 result: 131
      MPIRbench.base.multiply.2097152,2097152 result: 26.2
      MPIRbench.base.multiply.6400000,6400000 result: 6.12

my code:

3 threads (note I lowered the toom7 cutoff to 359):

      MPIRbench.base.multiply.32000,32000 result: 17763
      MPIRbench.base.multiply.64000,64000 result: 7491
      MPIRbench.base.multiply.131072,131072 result: 2909
      MPIRbench.base.multiply.192000,192000 result: 1692
      MPIRbench.base.multiply.640000,640000 result: 299
      MPIRbench.base.multiply.2097152,2097152 result: 62.3
      MPIRbench.base.multiply.6400000,6400000 result: 14.1

It's well and truly worth using 3 threads from 500 limbs onwards!!

Bill Hart

unread,
Jun 20, 2009, 2:34:45 PM6/20/09
to mpir...@googlegroups.com
Here are the timings from Marc's code on the core2 machine with gcc 4.3.0.

Something really odd happens at 4 threads. I've checked the timings
and something really does go wrong, it's not just a temporary timing
anomaly.

2 threads:

MPIRbench.base.multiply.32000,32000 result: 10673
MPIRbench.base.multiply.64000,64000 result: 4721
MPIRbench.base.multiply.131072,131072 result: 1875
MPIRbench.base.multiply.192000,192000 result: 1073
MPIRbench.base.multiply.640000,640000 result: 211
MPIRbench.base.multiply.2097152,2097152 result: 45.9
MPIRbench.base.multiply.6400000,6400000 result: 10.7

3 threads:

MPIRbench.base.multiply.32000,32000 result: 12942
MPIRbench.base.multiply.64000,64000 result: 5861
MPIRbench.base.multiply.131072,131072 result: 2310
MPIRbench.base.multiply.192000,192000 result: 1322
MPIRbench.base.multiply.640000,640000 result: 278
MPIRbench.base.multiply.2097152,2097152 result: 58.5
MPIRbench.base.multiply.6400000,6400000 result: 15.4

4 threads:

MPIRbench.base.multiply.32000,32000 result: 13914
MPIRbench.base.multiply.64000,64000 result: 553
MPIRbench.base.multiply.131072,131072 result: 2623
MPIRbench.base.multiply.192000,192000 result: 464
MPIRbench.base.multiply.640000,640000 result: 327
MPIRbench.base.multiply.2097152,2097152 result: 70.4
MPIRbench.base.multiply.6400000,6400000 result: 16.8

Bill.

2009/6/20 Bill Hart <goodwi...@googlemail.com>:

Marc

unread,
Jun 21, 2009, 3:37:39 AM6/21/09
to mpir-dev
Hello,

for the difference you saw between gcc-4.2.4 (I'm not sure what you mean
about the old libc that prevents you from installing a new gcc) and
gcc-4.3.0, I think the cause may be that the test was on a different
machine. I made a static version of the test program that shows a 70%
improvement going from 1 to 2 threads on my machine, copied it on
sage, and there the improvement was closer to 50%, although both
processors are in the core2 family.

On sage still, I tried moving from gcc 4.2.4 to 4.4.0 and didn't
notice any improvement in the scaling. Actually, the performance is
better using the libgomp from 4.2.4 than the one from 4.4.0, though
more random.

I am not sure exactly what happens with 4 threads with my code on the
second machine you tested. In my experiments on sage, the timings
become more and more random as you increase the number of threads, but
keeping only the best time from several runs (I know, not the most
scientific thing to do), I get something consistent. Here are some
numbers:

1000 limbs (*5000)
1: 1.40 / 1.68
2: 1.03 / 1.16
3: .89 / .65
4: .82
5: .75

10000 limbs (*300)
1: 2.06 / 2.42
2: 1.29 / 1.66
3: 1.02 / .91
4: .84
5: .74
7: .64

100000 limbs (*18)
1: 2.80 / 3.70
2: 1.62 / 2.22
3: 1.25 / 1.23
4: 1.04
5: .87
7: .72

1000000 limbs (*1)
1: 3.78 / 5.04
2: 2.21 / 3.02
3: 1.76 / 1.99
4: 1.51
5: 1.27
7: 1.07

The *n number is how many operations I perform, and below are the
timings in seconds for 1, 2, etc threads, on the left for my code and
on the right for yours. Your code scales very well on that machine.
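For reference, the harness is essentially just best-of-N around omp_get_wtime (from omp.h) — a minimal sketch, with mul_under_test standing in for whatever operation is being timed:

/* keep only the fastest of `runs` attempts, each doing `reps` muls */
double best_of (int runs, int reps, void (*mul_under_test) (void))
{
    double best = 1e300;
    int i, j;
    for (i = 0; i < runs; i++)
    {
        double t0 = omp_get_wtime ();
        for (j = 0; j < reps; j++)
            mul_under_test ();
        double t = omp_get_wtime () - t0;
        if (t < best)
            best = t;
    }
    return best;
}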

Gonzalo Tornaria

unread,
Jun 21, 2009, 7:10:18 PM6/21/09
to mpir...@googlegroups.com
On Sun, Jun 21, 2009 at 4:37 AM, Marc<marc....@gmail.com> wrote:
> I am not sure exactly what happens with 4 threads with my code on the
> second machine you tested. In my experiments on sage, the timings
> become more and more random as you increase the number of threads, but
> keeping only the best time from several runs (I know, not the most
> scientific thing to do), I get something consistent. Here are some
> numbers:

I would guess the timings depend a lot on which cores the threads are
scheduled, i.e. what level of cache they share. You can use taskset(1)
to force your threads to use particular cores.

AFAIK there are 4 packages with 6 cores each:

phys 0 = core 0 4 5 6 7 8
phys 1 = core 1 9 10 11 12 13
phys 2 = core 2 14 15 16 17 18
phys 3 = core 3 19 20 21 22 23

I'm not sure how the 6 cores are organized inside each package in
terms of L2, but I think the whole package shares a 16 MB L3 cache, so
forcing your program to run on a single physical cpu (6 threads max)
may give you more consistent results.
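For example (if the numbering above is right), something like

  taskset -c 0,4,5,6,7,8 ./bench

should keep all the threads on the six cores of physical package 0.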

Gonzalo

Bill Hart

unread,
Jun 22, 2009, 9:53:57 AM6/22/09
to mpir...@googlegroups.com
I decided to try karatsuba on top of toom4. That seems to work well down to about 300 limbs on core2 with 3 threads:

Here are the usual times:

single threaded (without fft):

      MPIRbench.base.multiply.12800,12800 result: 32867
      MPIRbench.base.multiply.19200,19200 result: 17439
      MPIRbench.base.multiply.25600,25600 result: 11337
      MPIRbench.base.multiply.32000,32000 result: 8459
      MPIRbench.base.multiply.64000,64000 result: 3239

Here are the times for the new code:

toom4_kara 3 threads:

      MPIRbench.base.multiply.12800,12800 result: 57980
      MPIRbench.base.multiply.19200,19200 result: 36353
      MPIRbench.base.multiply.25600,25600 result: 23645
      MPIRbench.base.multiply.32000,32000 result: 18102
      MPIRbench.base.multiply.64000,64000 result: 7281

Only a few things left to try....

Bill.

2009/6/21 Marc <marc....@gmail.com>

Bill Hart

unread,
Jun 22, 2009, 10:48:42 AM6/22/09
to mpir...@googlegroups.com
I tried the same trick with toom3 and karatsuba itself and on this
machine at least, it is basically efficient with 3 threads down to 200
limbs with kara_kara (though there is some timing fluctuation):

toom3_kara 3 threads:

MPIRbench.base.multiply.6400,6400 result: 125389
MPIRbench.base.multiply.12800,12800 result: 59315
MPIRbench.base.multiply.19200,19200 result: 37642
MPIRbench.base.multiply.25600,25600 result: 25932

kara_kara 3 threads:

MPIRbench.base.multiply.6400,6400 result: 137269
MPIRbench.base.multiply.12800,12800 result: 64131
MPIRbench.base.multiply.19200,19200 result: 36390
MPIRbench.base.multiply.25600,25600 result: 25109

Bill.

2009/6/22 Bill Hart <goodwi...@googlemail.com>:

Bill Hart

unread,
Jun 22, 2009, 10:50:19 AM6/22/09
to mpir...@googlegroups.com
Sorry I forgot to mention that at 100 limbs (6400 bits) it scores about 92000 with the usual single threaded code.

Bill Hart

unread,
Jun 22, 2009, 11:42:52 AM6/22/09
to mpir...@googlegroups.com
I had a go at parallelising the first part of karatsuba, as there are
two completely independent evaluations that get done there. But this
slowed the whole thing down by at least a factor of 2.

The interpolation only contains code for which there are data
dependencies, so parallelising that seems out of the question.

So it seems that for 3 threads at least, 200 limbs is the cutoff, and
even then it is touch and go.

It's a shame that there is such a huge gap between 8192 and 192000 in
the benchmark, as 8192 bits is too small to show a really significant
improvement with threads. You get 50% with three threads, but that is
really a waste of CPU cycles.

Bill.

2009/6/22 Bill Hart <goodwi...@googlemail.com>:

Marc

unread,
Jun 22, 2009, 1:42:20 PM6/22/09
to mpir-dev
On Jun 22, 8:42 am, Bill Hart <goodwillh...@googlemail.com> wrote:
> I had a go at parallelising the first part of karatsuba, as there are
> two completely independent evaluations that get done there. But this
> slowed the whole thing down by at least a factor of 2.

Yes, that's just a couple of subtractions; they take almost no time. You
could include them in the section of the first multiplication though.
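Roughly like this, I mean — a sketch with hypothetical names (h = ceil(n/2)), and with the sign handling of the middle product omitted:

#pragma omp parallel sections
{
#pragma omp section
  {
    /* the two evaluation subtractions ride along with the pointwise
       mul that consumes them; the other sections start immediately */
    mpn_sub_n (e0, a_hi, a_lo, h);
    mpn_sub_n (e1, b_hi, b_lo, h);
    mpn_mul_n (p1, e0, e1, h);
  }
#pragma omp section
  mpn_mul_n (p0, a_lo, b_lo, h);
#pragma omp section
  mpn_mul_n (p2, a_hi, b_hi, h);
}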

> So it seems that for 3 threads at least, 200 limbs is the cutoff, and
> even then it is touch and go.
>
> It's a shame that there is such a huge gap between 8192 and 192000 in
> the benchmark, as 8192 bits is too small to show a really significant
> improvement with threads. You get 50% with three threads, but that is
> really a waste of CPU cycles.

I tried with your code, and 50% is the best I can get with 3 threads at
200 limbs (not 128), and only once every 10 runs...

By the way I changed your function to make it call mpn_mul_n so I
don't need 4 versions of it.

Note: I keep forgetting to set OMP_NUM_THREADS=1 before running the
testsuite, which is not thread-safe.

Bill Hart

unread,
Jun 22, 2009, 5:05:04 PM6/22/09
to mpir...@googlegroups.com


2009/6/22 Marc <marc....@gmail.com>


On Jun 22, 8:42 am, Bill Hart <goodwillh...@googlemail.com> wrote:
> I had a go at parallelising the first part of karatsuba, as there are
> two completely independent evaluations that get done there. But this
> slowed the whole thing down by at least a factor of 2.

Yes, that's just a couple subtractions, it takes almost no time. You
could include them in the section of the first multiplication though.

I tried that and it didn't really help, at least not on this machine.
 


> So it seems that for 3 threads at least, 200 limbs is the cutoff, and
> even then it is touch and go.
>
> It's a shame that there is such a huge gap between 8192 and 192000 in
> the benchmark, as 8192 bits is too small to show a really significant
> improvement with threads. You get 50% with three threads, but that is
> really a waste of CPU cycles.

I tried with your code, and 50% is the best I can get with 3 threads at
200 limbs (not 128), and only once every 10 runs...

Yes, sadly it depends heavily on the machine.
 


By the way I changed your function to make it call mpn_mul_n so I
don't need 4 versions of it.

Yep, that's the obvious thing to do. I don't think it works as well for toom3 or karatsuba though, because of the extra memory allocations that need to be done.

Bill.

Marc

unread,
Jun 22, 2009, 5:36:11 PM6/22/09
to mpir-dev
On Jun 22, 2:05 pm, Bill Hart <goodwillh...@googlemail.com> wrote:
> 2009/6/22 Marc <marc.gli...@gmail.com>
> > On Jun 22, 8:42 am, Bill Hart <goodwillh...@googlemail.com> wrote:
> > > I had a go at parallelising the first part of karatsuba, as there are
> > > two completely independent evaluations that get done there. But this
> > > slowed the whole thing down by at least a factor of 2.
>
> > Yes, that's just a couple subtractions, it takes almost no time. You
> > could include them in the section of the first multiplication though.
>
> I tried that and it didn't really help, at least not on this machine.

I know (I didn't measure but I really think the 2 subtractions are
negligible in the running time, even at 100 limbs), it is more of a
"shouldn't hurt" type of thing.

> Yes, sadly it depends heavily on the machine.

We still need to figure out a way to get reasonable results on AMD.
The slow results there are likely due to the NUMA nature of these
processors, and Intel is also moving in that direction with Nehalem
and future processors...

I will also try a different compiler. At small numbers of limbs, gcc44
is faster than gcc42 by orders of magnitude.

> > By the way I changed your function to make it call mpn_mul_n so I
> > don't need 4 versions of it.
>
> Yep, that's the obvious thing to do. I don't think it works as well for
> toom3 or karatsuba though, because of the extra memory allocations that
> need to be done.

I don't think I understand this sentence. Anyway, in the toom3/
karatsuba range, if you use TMP_ALLOC_LIMBS, it gets memory on the
stack, so it should be quite fast. Although I did notice that your
modification of karatsuba (using a separate buffer) makes it slower for
1 thread by a surprising amount.

Bill Hart

unread,
Jun 22, 2009, 6:58:01 PM6/22/09
to mpir...@googlegroups.com
2009/6/22 Marc <marc....@gmail.com>:

>
> On Jun 22, 2:05 pm, Bill Hart <goodwillh...@googlemail.com> wrote:
>> 2009/6/22 Marc <marc.gli...@gmail.com>
>> > On Jun 22, 8:42 am, Bill Hart <goodwillh...@googlemail.com> wrote:
>> > > I had a go at parallelising the first part of karatsuba, as there are
>> > > two completely independent evaluations that get done there. But this
>> > > slowed the whole thing down by at least a factor of 2.
>>
>> > Yes, that's just a couple subtractions, it takes almost no time. You
>> > could include them in the section of the first multiplication though.
>>
>> I tried that and it didn't really help, at least not on this machine.
>
> I know (I didn't measure but I really think the 2 subtractions are
> negligible in the running time, even at 100 limbs), it is more of a
> "shouldn't hurt" type of thing.
>
>> Yes, sadly it depends heavily on the machine.
>
> We still need to figure out a way to get reasonable results on AMD.
> The slow results there are likely due to the NUMA nature of these
> processors, and Intel is also moving in that direction with nehalem
> and future processors...
>

Yes, we need to figure something out. Maybe it's not possible.

> I will also try a different compiler. At small numbers of limbs, gcc44
> is faster than gcc42 by orders of magnitude.

I should also try gcc 4.4 on sage.math. I presume you have built it
there somewhere. When I tried to build gcc 4.3 it complained about
everything including the libc.

>
>> > By the way I changed your function to make it call mpn_mul_n so I
>> > don't need 4 versions of it.
>>
>> Yep, that's the obvious thing to do. I don't think it works as well for
>> toom3 or karatsuba though, because of the extra memory allocations that
>> need to be done.
>
> I don't think I understand this sentence. Anyway, in the toom3/
> karatsuba range, if you use TMP_ALLOC_LIMBS, it gets memory on the
> stack, so it should be quite fast. Although I did notice how your
> modification of karatsuba (use a separate buffer) makes it slower for
> 1 thread in a surprising proportion.

If you let mpn_mul_n do the memory allocation then, as karatsuba calls
it three times, three separate allocations will be done. Yes, it should
be fast, but it can be made slightly faster.
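Something like this is what I mean — a sketch only, where scratch_per_mul and one_mul are hypothetical stand-ins for however much space each pointwise mul needs and whatever routine does it:

mp_limb_t *scratch = TMP_ALLOC_LIMBS (3 * scratch_per_mul);

/* one allocation up front, sliced three ways, instead of an
   allocation inside each of the three calls to mpn_mul_n */
#pragma omp parallel sections
{
#pragma omp section
  one_mul (p0, a_lo, b_lo, h, scratch);
#pragma omp section
  one_mul (p1, e0, e1, h, scratch + scratch_per_mul);
#pragma omp section
  one_mul (p2, a_hi, b_hi, h, scratch + 2 * scratch_per_mul);
}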

Bill.

Marc

unread,
Jun 22, 2009, 7:24:40 PM6/22/09
to mpir-dev
On Jun 22, 2:36 pm, Marc <marc.gli...@gmail.com> wrote:
> I will also try a different compiler. At small numbers of limbs, gcc44
> is faster than gcc42 by orders of magnitude.

Sun Studio on Linux is comparable to gcc42, that is, it has an overhead
that is way too large for small operands. On the other hand, it seems
to handle huge operands slightly better with an unlimited number of
threads, and it handles OMP_NESTED=true much better.

gcc44 (and likely also gcc43, looking at the results posted here)
seems to have a trick and ends up spending very little "system" time
at execution. Things might vary wildly on a different OS.

Marc

unread,
Jun 22, 2009, 7:38:30 PM6/22/09
to mpir-dev
On Jun 22, 3:58 pm, Bill Hart <goodwillh...@googlemail.com> wrote:
> 2009/6/22 Marc <marc.gli...@gmail.com>:
> I should also try gcc 4.4 on sage.math. I presume you have built it
> there somewhere.

~glisse/bin/gcc

> When I tried to build gcc 4.3 it complained about
> everything including the libc.

Ah, asking configure for a 64bit-only compiler worked fine for me.

> If you let mpn_mul_n do the memory allocation, as the karatsuba calls
> it three times, three lots of allocation will be done. Yes it should
> be fast, but it can be made slightly faster.

I see what you mean. It might also have been better to let
each thread allocate the memory it was going to use, to improve
locality, but since you measured otherwise, I now understand the
several variants better.

Bill Hart

unread,
Jun 26, 2009, 8:20:58 PM6/26/09
to mpir...@googlegroups.com
I ran my code on the K8 to see how the different algorithms perform:

1 thread (fft):

MPIRbench.base.multiply.8192,8192 result: 117169
MPIRbench.base.multiply.6400,6400 result: 172774
MPIRbench.base.multiply.12800,12800 result: 60019
MPIRbench.base.multiply.19200,19200 result: 31860
MPIRbench.base.multiply.25600,25600 result: 20782
MPIRbench.base.multiply.64000,64000 result: 5716
MPIRbench.base.multiply.131072,131072 result: 2073
MPIRbench.base.multiply.192000,192000 result: 1147
MPIRbench.base.multiply.640000,640000 result: 246
MPIRbench.base.multiply.2097152,2097152 result: 63.5
MPIRbench.base.multiply.6400000,6400000 result: 15.5
MPIRbench.base.multiply.19200000,19200000 result: 5.47

1 thread (no fft):

MPIRbench.base.multiply.8192,8192 result: 117169
MPIRbench.base.multiply.6400,6400 result: 172774
MPIRbench.base.multiply.12800,12800 result: 60019
MPIRbench.base.multiply.19200,19200 result: 31860
MPIRbench.base.multiply.25600,25600 result: 20782
MPIRbench.base.multiply.64000,64000 result: 5716
MPIRbench.base.multiply.131072,131072 result: 2073
MPIRbench.base.multiply.192000,192000 result: 1147
MPIRbench.base.multiply.640000,640000 result: 223
MPIRbench.base.multiply.2097152,2097152 result: 40.2
MPIRbench.base.multiply.6400000,6400000 result: 8.38
MPIRbench.base.multiply.19200000,19200000 result: 1.90

3 threads (kara_toom7): efficient ~2000-10000 limbs

MPIRbench.base.multiply.64000,64000 result: 8699
MPIRbench.base.multiply.131072,131072 result: 4099
MPIRbench.base.multiply.192000,192000 result: 2561
MPIRbench.base.multiply.640000,640000 result: 476
MPIRbench.base.multiply.2097152,2097152 result: 92.6
MPIRbench.base.multiply.6400000,6400000 result: 19.2
MPIRbench.base.multiply.19200000,19200000 result: 4.29

3 threads (kara_toom4): efficient ~2000-3000 limbs

MPIRbench.base.multiply.64000,64000 result: 9329
MPIRbench.base.multiply.131072,131072 result: 3977
MPIRbench.base.multiply.192000,192000 result: 2387
MPIRbench.base.multiply.640000,640000 result: 429
MPIRbench.base.multiply.2097152,2097152 result: 77.6
MPIRbench.base.multiply.6400000,6400000 result: 15.6
MPIRbench.base.multiply.19200000,19200000 result: 3.22

3 threads (kara_toom3):

MPIRbench.base.multiply.64000,64000 result: 8838
MPIRbench.base.multiply.131072,131072 result: 3784
MPIRbench.base.multiply.192000,192000 result: 2322
MPIRbench.base.multiply.640000,640000 result: 431
MPIRbench.base.multiply.2097152,2097152 result: 74.3
MPIRbench.base.multiply.6400000,6400000 result: 14.1
MPIRbench.base.multiply.19200000,19200000 result: 2.74

3 threads (kara_kara):

MPIRbench.base.multiply.64000,64000 result: 7993
MPIRbench.base.multiply.131072,131072 result: 3315
MPIRbench.base.multiply.192000,192000 result: 1893
MPIRbench.base.multiply.640000,640000 result: 300
MPIRbench.base.multiply.2097152,2097152 result: 47.1

So basically kara_kara and kara_toom3 are never efficient, but we get
good efficiency from about 2000 limbs to about 10000 limbs on this
machine, even into the FFT region.

It's really surprising that the earlier core2 machine does so well,
whereas by comparison the penryn core2 and the K8 are only efficient
over a much smaller range.

I'm also surprised that a clear strategy for parallel support using
openmp still hasn't emerged. At this stage the strategy seems to be
to switch some algorithms on only on some machines, and to use
different algorithms for 2 and 3 threads. Anything with a larger
number of threads is quite difficult at this stage.

I strongly suspect at this stage that the cache and memory
architecture of the machine is critical for parallel stuff. We
probably need to run some tests on various machines with the same CPU
core, but different cache and memory features and see how things
perform.

Bill.

2009/6/23 Marc <marc....@gmail.com>:

Bill Hart

unread,
Jun 26, 2009, 9:22:43 PM6/26/09
to mpir...@googlegroups.com
For reference, here are the times for the core2 penryn, using gcc
4.4.0. Actually things are better than on the K8:

1 thread (fft):

MPIRbench.base.multiply.8192,8192 result: 73685
MPIRbench.base.multiply.6400,6400 result: 104910
MPIRbench.base.multiply.12800,12800 result: 36686
MPIRbench.base.multiply.19200,19200 result: 19953
MPIRbench.base.multiply.25600,25600 result: 12790
MPIRbench.base.multiply.64000,64000 result: 3618
MPIRbench.base.multiply.131072,131072 result: 1293
MPIRbench.base.multiply.192000,192000 result: 804
MPIRbench.base.multiply.640000,640000 result: 215
MPIRbench.base.multiply.2097152,2097152 result: 57.0

1 thread (no fft):

MPIRbench.base.multiply.8192,8192 result: 73685
MPIRbench.base.multiply.6400,6400 result: 104910
MPIRbench.base.multiply.12800,12800 result: 36686
MPIRbench.base.multiply.19200,19200 result: 19953
MPIRbench.base.multiply.25600,25600 result: 12790
MPIRbench.base.multiply.64000,64000 result: 3618
MPIRbench.base.multiply.131072,131072 result: 1293
MPIRbench.base.multiply.192000,192000 result: 740
MPIRbench.base.multiply.640000,640000 result: 146
MPIRbench.base.multiply.2097152,2097152 result: 29.4

3 threads (kara_toom7): efficient ~1000-3000 limbs

MPIRbench.base.multiply.64000,64000 result: 6938
MPIRbench.base.multiply.131072,131072 result: 2810
MPIRbench.base.multiply.192000,192000 result: 1666
MPIRbench.base.multiply.640000,640000 result: 328
MPIRbench.base.multiply.2097152,2097152 result: 65.9

3 threads (kara_toom4): efficient ~1000-3000 limbs

MPIRbench.base.multiply.64000,64000 result: 7034
MPIRbench.base.multiply.131072,131072 result: 2539
MPIRbench.base.multiply.192000,192000 result: 1580
MPIRbench.base.multiply.640000,640000 result: 292
MPIRbench.base.multiply.2097152,2097152 result: 54.0

3 threads (kara_toom3): efficient ~ 400-3000 limbs

MPIRbench.base.multiply.6400,6400 result: 76853
MPIRbench.base.multiply.12800,12800 result: 46307
MPIRbench.base.multiply.19200,19200 result: 29339
MPIRbench.base.multiply.25600,25600 result: 21203
MPIRbench.base.multiply.64000,64000 result: 6742
MPIRbench.base.multiply.131072,131072 result: 2504
MPIRbench.base.multiply.192000,192000 result: 1498
MPIRbench.base.multiply.640000,640000 result: 261
MPIRbench.base.multiply.2097152,2097152 result: 47.6

3 threads (kara_kara):

MPIRbench.base.multiply.6400,6400 result: 80521
MPIRbench.base.multiply.12800,12800 result: 46180
MPIRbench.base.multiply.19200,19200 result: 29699
MPIRbench.base.multiply.25600,25600 result: 20713
MPIRbench.base.multiply.64000,64000 result: 6027
MPIRbench.base.multiply.131072,131072 result: 1992
MPIRbench.base.multiply.192000,192000 result: 1205
MPIRbench.base.multiply.640000,640000 result: 188
MPIRbench.base.multiply.2097152,2097152 result: 28.9

Bill.

2009/6/27 Bill Hart <goodwi...@googlemail.com>:

Marc

unread,
Jun 26, 2009, 10:15:49 PM6/26/09
to mpir-dev
In case someone else makes the same mistake: at the end of a parallel
region, the threads are not killed; they wait until they are reused.
On the machines I tried, gcc42 and sunpro default to a passive wait
(sleep) while gcc44 defaults to an active wait (spin). The active wait
is faster, but wasteful if other applications could have used those
CPUs instead (yes, some compilers implement the obvious solution of
starting with an active wait and switching to a passive one if the
thread isn't reused soon enough). Telling gcc and sunpro to use the
same wait strategy gives them comparable results (slight advantage to
gcc44, but this is a matter of adapting a tuning parameter, not using
a different strategy).
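If I remember the knobs correctly (worth double-checking against the respective docs): with gcc44's libgomp the behaviour can be chosen with OMP_WAIT_POLICY=active or passive (and tuned with GOMP_SPINCOUNT), and Sun Studio has SUNW_MP_THR_IDLE=spin or sleep.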

jason

unread,
Jul 16, 2009, 12:15:59 PM7/16/09
to mpir-dev
Hi

What we are doing with the parallel code may not be the best way;
consider the following.

Starting threads has an overhead, so a coarse-grained approach
seems best, i.e. at the top level.
Definitions: when I say "mul_single" I mean our standard single-
threaded mul as we have at the moment, optimized for a single thread.
When I say base-mul-n, kara-mul-n, toom3-mul-n etc. I mean an n-way
threaded version of base|kara|toom3|etc that calls mul_single for its
pointwise muls.

I'm ignoring overheads, and assuming that the running time of mul_single
is proportional to n^2 in the basecase range, n^1.585 in the
kara range, n^1.465 in the toom3 range, etc.

Say we have a multiplication in the mul_single toom4 range,
so the single-threaded mul takes n^1.403. A kara-mul-3 takes
(n/2)^1.403, which gives us a speedup of 2.64x with an efficiency of
2.64/3 = 88.15%.

Say we have a multiplication in the mul_single fft range,
so the single-threaded mul takes n^1.00, and a base-mul-2 takes
2(n/2)^1.00, which gives us a speedup of 1.00x and efficiency of
1.00/2 = 50%. What about a kara-mul-2? It takes 2(n/2)^1.00, which is
the same. What about a toom3-mul-2? It takes ceil(5/2)*(n/3)^1.00,
so ignoring overheads and other complications there is no speedup
using two threads. For 3 threads we get: kara-mul-3 takes (n/2)^1.00,
which gives us a speedup of 2x and an efficiency of 66%; toom3-mul-3
takes ceil(5/3)(n/3)^1.00, which is a speedup of 1.5x and an
efficiency of 50%; toom4-mul-3 takes ceil(7/3)(n/4)^1.00, which is a
speedup of 1.33x; and toom7-mul-3 takes ceil(13/3)(n/7)^1.00, which
is a speedup of 1.4x.
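Putting the same arithmetic into one formula (my notation, nothing new): for a k-way split with m pointwise mults run on t threads, in a range where mul_single costs n^e,

  speedup = n^e / ( ceil(m/t) * (n/k)^e ) = k^e / ceil(m/t)

e.g. kara has k=2 and m=3, so with t=3 in the toom4 range (e=1.403) that's 2^1.403 / 1 = 2.64x, and in the fft range (e=1.00) it's 2x, as above.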

Taking kara-mul-3 as my example: doing it at the top level saves on the
threading overheads, but we lose by using kara at such a large n
(remember, only for the first iteration). We want to push the threaded
part down inside the mul so it's operating (in this case) where n is
kara-sized; clearly the threading overheads will dominate more there,
so in practice the best crossover will be somewhere in between, and
overheads (and the very rough approximations above) will push it up
and down. My guess is another set of tuning params.

Jason


Bill Hart

unread,
Jul 16, 2009, 2:18:50 PM7/16/09
to mpir...@googlegroups.com
If I understand what you are suggesting, you are saying that having a
karamul-3 on top of a toom7 may not be optimal. It might be better to
put a toom7 on top of a karamul-3.

Actually I don't know if this works. In practice, it seems that the
speedup you gain from threads increases the larger the operands are.
So pushing the threaded part down the chain would only
make it less efficient. The parts you can't parallelise dominate more.

I don't see why the complexity should be any different if the threaded
part is further down. The reason one normally uses karatsuba further
down is not, as far as I know, that the complexity is then better.
If you could use toom3 all the way down, the complexity would be
better. The reason is just that toom3 has relatively too much overhead
compared with karatsuba to use it lower down (the cost of all those
scalar operations is much higher compared with the pointwise
multiplications when the operands are small, than when they are
large).

Obviously the threaded case should be no different. If your operands
are large enough, you should use a parallel toom3 instead of a
parallel kara. So it might be better
to use a toom3-5 on top of a toom4 than a kara-3 on top of a toom7,
for instance.

Well, I'm not against trying what you suggest. But I don't think it will work.

Bill.

2009/7/16 jason <ja...@njkfrudils.plus.com>:

Jason Moxham

unread,
Jul 16, 2009, 7:29:14 PM7/16/09
to mpir...@googlegroups.com
On Thursday 16 July 2009 19:18:50 Bill Hart wrote:
> If I understand what you are suggesting, you are saying that having a
> karamul-3 on top of a toom7 may not be optimal. It might be better to
> put a toom7 on top of a karamul-3.
>

Yes

> Actually I don't know if this works. It actually seems that in
> practice the speedup you gain from threads increases the larger the
> operands are. So pushing the threaded part down the chain would only
> make it less efficient. The parts you can't parallelise dominate more.
>

It makes the threaded part less efficient, but we are doing less work, so
overall it's faster (in theory).


> I don't see why the complexity should be any different if the threaded
> part is further down. The reason one normally uses karatsuba further
> down, is not, as far as I know, because the complexity is then better.
> If you could use toom3 all the way down, the complexity would be
> better. The reason is just that toom3 has relatively too much overhead
> compared with karatsuba to use it lower down (the cost of all those
> scalar operations is much higher compared with the pointwise
> multiplications when the operands are small, than when they are
> large).
>

This is exactly why we don't want karamul-3 on top of toom7: we are pushing
the toom7 down (in relation to the karamul-3 on top) to where the overheads
of toom7 are bigger.


> Obviously the threaded case should be no different. If your operands
> are large enough, you should use a parallel toom3 instead of a
> parallel kara if you have large enough operands. So it might be better
> to use a toom3-5 on top of a toom4 than a kara-3 on top of a toom7,
> for instance.
>
> Well, I'm not against trying what you suggest. But I don't think it will
> work.
>

It may not; I'm just guessing. When we do a big mul we normally choose an
optimal algorithm, but if we are parallelising then we choose an algorithm
(say karamul-3) which, although it parallelises well, involves significantly
more computational work.

Bill Hart

unread,
Jul 16, 2009, 8:38:27 PM7/16/09
to mpir...@googlegroups.com
Hmm, well it is very easy to try this idea out, so I'll give it a go
when I find some more time.

I'm currently writing a Windows program to allow one to easily create
maths libraries in C. So much of the work we do in writing libraries
is repetitive and boring and a nice tool to do it for us would be fun.

It's proving to be quite complex. First thing I want is C syntax
highlighting, and I found Pygments for that. Then I need something
which does cross-platform windows, and I found wxPython for that. Then
I need a GUI designer, and I found BoaConstructor for that.

My first task is to try and get a basic windows app running with a
rich text edit box with syntax highlighting for C code, so that I can
basically open and edit a C program with full syntax highlighting.

The program I eventually want to create will allow you to add modules
and functions to a C library with a handful of clicks. It'll do all
the makefile maintenance for you, allow you to easily manage test code
for each module and function, etc.

Not sure it could be made to work easily with an existing library, as
the structure of the source code of each existing math library is
subtly different. But maybe if the user was willing to do a little
setting up, telling it where to find things it expects to find, or
registering various bits of code with it, it could be made to work. The
problem there would as usual be the nasty makefile business that goes
on in libraries like MPIR. You'd have to basically let it rewrite the
makefile for you I think. No program has any hope of being able to
parse the MPIR build system to the point where it could meaningfully
add new stuff to it.

Anyhow, the short term goal is to write something that'll let you
easily create new libraries, with a fairly fixed structure (taking
some of the good design features from some maths libraries I'm
familiar with).

So far my code just brings up a window and a file menu. Not very exciting....

Bill.

2009/7/17 Jason Moxham <ja...@njkfrudils.plus.com>:

Marc

unread,
Jul 17, 2009, 3:11:42 AM7/17/09
to mpir-dev
On Jul 16, 4:29 pm, Jason Moxham <ja...@njkfrudils.plus.com> wrote:
> It makes the threaded part less efficient but we are doing less work , so
> overall it's faster (in theory) .

The threading is also work, and you're doing it 13 times instead of 1.

> This is exactly why we dont want karamul-3 on top of toom7 , we pushing the
> toom7 down (in relation to the karamul-3 on top) where the overheads of toom7
> are bigger.
[...]
> It may not , I'm just guessing . When we do a big mul we normally choose an
> optimal algorithm , but if we are parrallizing then we choose an algorithm
> (say kararmul3) which although parrilizes well , it involves significantly
> more computational work.

There are several ways to look at things. From a pure "throughput"
point of view, you should run a single-threaded version of mpir and
use parallelism at a higher level (run enough programs on your
computer to keep it busy. That also seems to be the current GMP
philosophy (concentrate on reducing the total amount of work)). From a
pure "latency" point of view, you should have a computer with many
idle cores, and anything that lets an operation complete sooner is
good, even if it means doing twice more work spread on 3 cores (that
would be parallel karamul-3 on top of regular mul_n. It also fits with
the usual GPU code where people end up doing many times more total
work than on the CPU, but with enough parallelism to make it
worthwhile). And then you can try to compromise, delegating some work
to other threads to improve latency, but only doing so if it doesn't
increase the total amount of work too significantly (which is what I
was aiming for with the parallel toom7). All of those make sense in
different use cases.

jason

unread,
Aug 11, 2009, 9:44:02 PM8/11/09
to mpir-dev
In errno.c and rands.c there are some global variables which are not
thread-safe, so they could present a problem. They were used in the old
random functions, which are now gone. The only uses left are for
testing ("make check"), which we can easily replace, and to
report an error on division by zero or sqrt of a negative, which I
don't know was ever of any use.
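For the error variable, one cheap option would be thread-local storage — a sketch only, assuming gcc (gmp_errno being the existing global in errno.c):

#if defined (__GNUC__)
__thread int gmp_errno = 0;   /* each thread gets its own copy */
#else
int gmp_errno = 0;            /* unchanged, still not thread-safe */
#endif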

Jason