> I made a first very rough attempt to parallelise Toom4. I just inserted
> some
> #pragma parallel sections
> and
> #pragma section
> statements in around pairs of independent pointwise multiplications in
> toom4.
> It slows it down nicely. I've got it running about 50% slower now.
> This is for multiplying about 450 limbs with toom4. Each of the
> pointwise mults is then going to be just over 100 limbs.
> Hmm, oh well. It's amazing how pathetic the documentation of openmp is
> on the web. It can't be this bad, or no one would use it!
> I'll try toom7 tomorrow maybe, see if that is any better. The odd
> thing is, 2 cpus start up, but they run at like 45% each.
> Bill.
My guess is that overheads/syncing are killing it.What I would try is the kara mul but at sizes 100 to 30000 limbs , just to see how it performs. I'm not suggesting we use karamul at these sizes , but just get some understanding of how the parelllelization goes with a simple algorithm.
Something is definitely wrong here. I did the same thing for Toom 7
today and there's a similar slowdown.
Basically it splits into two threads OK, but the sum of the total CPU
percentage loads always sums to 1, e.g. 65.3% and 34.7%, or 45.5% and
54.5%, etc, and the end result is slower timings, not faster.
I must be using OpenMP incorrectly. No one would use it if it were
this bad. I'm just taking pairs of pointwise multiplications and
splitting so that one of them is executed by one thread and one by the
other.
Bill.
2009/6/13 Jason Moxham <ja...@njkfrudils.plus.com>:
> On Saturday 13 June 2009 03:22:54 Bill Hart wrote:
>> I made a first very rough attempt to parallelise Toom4. I just inserted
>> some
>> #pragma parallel sections
>> and
>> #pragma section
>> statements in around pairs of independent pointwise multiplications in
>> toom4.
>> It slows it down nicely. I've got it running about 50% slower now.
>> This is for multiplying about 450 limbs with toom4. Each of the
>> pointwise mults is then going to be just over 100 limbs.
>> Hmm, oh well. It's amazing how pathetic the documentation of openmp is
>> on the web. It can't be this bad, or no one would use it!
>> I'll try toom7 tomorrow maybe, see if that is any better. The odd
>> thing is, 2 cpus start up, but they run at like 45% each.
>> Bill.
> My guess is that overheads/syncing are killing it.What I would try is the kara
> mul but at sizes 100 to 30000 limbs , just to see how it performs. I'm not
> suggesting we use karamul at these sizes , but just get some understanding of
> how the parelllelization goes with a simple algorithm.
Try setting the environmental variable OMP_NUM_THREADS to 2 or 4 (if
you have 4 cores available) so that it doesn't try to create too many
threads (which I found to be the biggest problem).
On Sat, Jun 13, 2009 at 11:12 AM, Bill Hart<goodwillh...@googlemail.com> wrote:
> Something is definitely wrong here. I did the same thing for Toom 7
> today and there's a similar slowdown.
> Basically it splits into two threads OK, but the sum of the total CPU
> percentage loads always sums to 1, e.g. 65.3% and 34.7%, or 45.5% and
> 54.5%, etc, and the end result is slower timings, not faster.
> I must be using OpenMP incorrectly. No one would use it if it were
> this bad. I'm just taking pairs of pointwise multiplications and
> splitting so that one of them is executed by one thread and one by the
> other.
> Bill.
> 2009/6/13 Jason Moxham <ja...@njkfrudils.plus.com>:
>> On Saturday 13 June 2009 03:22:54 Bill Hart wrote:
>>> I made a first very rough attempt to parallelise Toom4. I just inserted
>>> some
>>> #pragma parallel sections
>>> and
>>> #pragma section
>>> statements in around pairs of independent pointwise multiplications in
>>> toom4.
>>> It slows it down nicely. I've got it running about 50% slower now.
>>> This is for multiplying about 450 limbs with toom4. Each of the
>>> pointwise mults is then going to be just over 100 limbs.
>>> Hmm, oh well. It's amazing how pathetic the documentation of openmp is
>>> on the web. It can't be this bad, or no one would use it!
>>> I'll try toom7 tomorrow maybe, see if that is any better. The odd
>>> thing is, 2 cpus start up, but they run at like 45% each.
>>> Bill.
>> My guess is that overheads/syncing are killing it.What I would try is the kara
>> mul but at sizes 100 to 30000 limbs , just to see how it performs. I'm not
>> suggesting we use karamul at these sizes , but just get some understanding of
>> how the parelllelization goes with a simple algorithm.
Actually, you might want to bound the thread nesting depth, too.
(Assuming that you're working with karatsuba sizes that are
recursing.) At best, OpenMP is only going to buy us something at the
very top level...
> Try setting the environmental variable OMP_NUM_THREADS to 2 or 4 (if
> you have 4 cores available) so that it doesn't try to create too many
> threads (which I found to be the biggest problem).
> On Sat, Jun 13, 2009 at 11:12 AM, Bill Hart<goodwillh...@googlemail.com> wrote:
>> Something is definitely wrong here. I did the same thing for Toom 7
>> today and there's a similar slowdown.
>> Basically it splits into two threads OK, but the sum of the total CPU
>> percentage loads always sums to 1, e.g. 65.3% and 34.7%, or 45.5% and
>> 54.5%, etc, and the end result is slower timings, not faster.
>> I must be using OpenMP incorrectly. No one would use it if it were
>> this bad. I'm just taking pairs of pointwise multiplications and
>> splitting so that one of them is executed by one thread and one by the
>> other.
>> Bill.
>> 2009/6/13 Jason Moxham <ja...@njkfrudils.plus.com>:
>>> On Saturday 13 June 2009 03:22:54 Bill Hart wrote:
>>>> I made a first very rough attempt to parallelise Toom4. I just inserted
>>>> some
>>>> #pragma parallel sections
>>>> and
>>>> #pragma section
>>>> statements in around pairs of independent pointwise multiplications in
>>>> toom4.
>>>> It slows it down nicely. I've got it running about 50% slower now.
>>>> This is for multiplying about 450 limbs with toom4. Each of the
>>>> pointwise mults is then going to be just over 100 limbs.
>>>> Hmm, oh well. It's amazing how pathetic the documentation of openmp is
>>>> on the web. It can't be this bad, or no one would use it!
>>>> I'll try toom7 tomorrow maybe, see if that is any better. The odd
>>>> thing is, 2 cpus start up, but they run at like 45% each.
>>>> Bill.
>>> My guess is that overheads/syncing are killing it.What I would try is the kara
>>> mul but at sizes 100 to 30000 limbs , just to see how it performs. I'm not
>>> suggesting we use karamul at these sizes , but just get some understanding of
>>> how the parelllelization goes with a simple algorithm.
to check whether things are working properly, you could try this in
mpn_mul_fft_full_a:
tp = __GMP_ALLOCATE_FUNC_LIMBS (l);
#pragma omp parallel sections
{
#pragma omp section
muh = mpn_mul_fft (op, h, n, nl, m, ml, k2); /* mu = muh+{op,h} */
#pragma omp section
mpn_mul_fft_mersenne (tp, l, n, nl, m, ml, k1); /* B */
}
If CFLAGS includes -fopenmp and I set OMP_NUM_THREADS=2 at runtime, an
mpn_mul_fft_full with 10^6 limbs takes a lot less time than with a
single thread.
What implementation of openmp are you using, a recent gcc? How are you
taking the timings exactly? Karatsuba with 10000 limbs is faster with
several threads than just 1 (a bit more complicated than the example
above because you need scratch space), so I expect a small speed-up in
the toom4 range is possible.
On Jun 13, 8:12 am, Bill Hart <goodwillh...@googlemail.com> wrote:
> Something is definitely wrong here. I did the same thing for Toom 7
> today and there's a similar slowdown.
> Basically it splits into two threads OK, but the sum of the total CPU
> percentage loads always sums to 1, e.g. 65.3% and 34.7%, or 45.5% and
> 54.5%, etc, and the end result is slower timings, not faster.
> I must be using OpenMP incorrectly. No one would use it if it were
> this bad. I'm just taking pairs of pointwise multiplications and
> splitting so that one of them is executed by one thread and one by the
> other.
> Bill.
> 2009/6/13 Jason Moxham <ja...@njkfrudils.plus.com>:
> > On Saturday 13 June 2009 03:22:54 Bill Hart wrote:
> >> I made a first very rough attempt to parallelise Toom4. I just inserted
> >> some
> >> #pragma parallel sections
> >> and
> >> #pragma section
> >> statements in around pairs of independent pointwise multiplications in
> >> toom4.
> >> It slows it down nicely. I've got it running about 50% slower now.
> >> This is for multiplying about 450 limbs with toom4. Each of the
> >> pointwise mults is then going to be just over 100 limbs.
> >> Hmm, oh well. It's amazing how pathetic the documentation of openmp is
> >> on the web. It can't be this bad, or no one would use it!
> >> I'll try toom7 tomorrow maybe, see if that is any better. The odd
> >> thing is, 2 cpus start up, but they run at like 45% each.
> >> Bill.
> > My guess is that overheads/syncing are killing it.What I would try is the kara
> > mul but at sizes 100 to 30000 limbs , just to see how it performs. I'm not
> > suggesting we use karamul at these sizes , but just get some understanding of
> > how the parelllelization goes with a simple algorithm.
> to check whether things are working properly, you could try this in
> mpn_mul_fft_full_a:
> tp = __GMP_ALLOCATE_FUNC_LIMBS (l);
> #pragma omp parallel sections
> {
> #pragma omp section
> muh = mpn_mul_fft (op, h, n, nl, m, ml, k2); /* mu = muh+{op,h} */
> #pragma omp section
> mpn_mul_fft_mersenne (tp, l, n, nl, m, ml, k1); /* B */
> }
> If CFLAGS includes -fopenmp and I set OMP_NUM_THREADS=2 at runtime, an
> mpn_mul_fft_full with 10^6 limbs takes a lot less time than with a
> single thread.
> What implementation of openmp are you using, a recent gcc? How are you
> taking the timings exactly? Karatsuba with 10000 limbs is faster with
> several threads than just 1 (a bit more complicated than the example
> above because you need scratch space), so I expect a small speed-up in
> the toom4 range is possible.
> On Jun 13, 8:12 am, Bill Hart <goodwillh...@googlemail.com> wrote:
>> Something is definitely wrong here. I did the same thing for Toom 7
>> today and there's a similar slowdown.
>> Basically it splits into two threads OK, but the sum of the total CPU
>> percentage loads always sums to 1, e.g. 65.3% and 34.7%, or 45.5% and
>> 54.5%, etc, and the end result is slower timings, not faster.
>> I must be using OpenMP incorrectly. No one would use it if it were
>> this bad. I'm just taking pairs of pointwise multiplications and
>> splitting so that one of them is executed by one thread and one by the
>> other.
>> Bill.
>> 2009/6/13 Jason Moxham <ja...@njkfrudils.plus.com>:
>> > On Saturday 13 June 2009 03:22:54 Bill Hart wrote:
>> >> I made a first very rough attempt to parallelise Toom4. I just inserted
>> >> some
>> >> #pragma parallel sections
>> >> and
>> >> #pragma section
>> >> statements in around pairs of independent pointwise multiplications in
>> >> toom4.
>> >> It slows it down nicely. I've got it running about 50% slower now.
>> >> This is for multiplying about 450 limbs with toom4. Each of the
>> >> pointwise mults is then going to be just over 100 limbs.
>> >> Hmm, oh well. It's amazing how pathetic the documentation of openmp is
>> >> on the web. It can't be this bad, or no one would use it!
>> >> I'll try toom7 tomorrow maybe, see if that is any better. The odd
>> >> thing is, 2 cpus start up, but they run at like 45% each.
>> >> Bill.
>> > My guess is that overheads/syncing are killing it.What I would try is the kara
>> > mul but at sizes 100 to 30000 limbs , just to see how it performs. I'm not
>> > suggesting we use karamul at these sizes , but just get some understanding of
>> > how the parelllelization goes with a simple algorithm.
The FFT is about 50% faster at 32000 limbs and nearly twice as fast at
1000000 limbs, both of which are quite acceptable speedups.
I'm quite surprised to find toom7 only about 10% faster at 2000 limbs,
at best. That certainly wouldn't represent value for money, given that
two CPU cores are being used.
I think the next experiment will be the one suggested by Jason. See
how Karatsuba performs for different sizes up to say 30000 limbs.
Bill.
2009/6/14 Bill Hart <goodwillh...@googlemail.com>:
>> to check whether things are working properly, you could try this in
>> mpn_mul_fft_full_a:
>> tp = __GMP_ALLOCATE_FUNC_LIMBS (l);
>> #pragma omp parallel sections
>> {
>> #pragma omp section
>> muh = mpn_mul_fft (op, h, n, nl, m, ml, k2); /* mu = muh+{op,h} */
>> #pragma omp section
>> mpn_mul_fft_mersenne (tp, l, n, nl, m, ml, k1); /* B */
>> }
>> If CFLAGS includes -fopenmp and I set OMP_NUM_THREADS=2 at runtime, an
>> mpn_mul_fft_full with 10^6 limbs takes a lot less time than with a
>> single thread.
>> What implementation of openmp are you using, a recent gcc? How are you
>> taking the timings exactly? Karatsuba with 10000 limbs is faster with
>> several threads than just 1 (a bit more complicated than the example
>> above because you need scratch space), so I expect a small speed-up in
>> the toom4 range is possible.
>> On Jun 13, 8:12 am, Bill Hart <goodwillh...@googlemail.com> wrote:
>>> Something is definitely wrong here. I did the same thing for Toom 7
>>> today and there's a similar slowdown.
>>> Basically it splits into two threads OK, but the sum of the total CPU
>>> percentage loads always sums to 1, e.g. 65.3% and 34.7%, or 45.5% and
>>> 54.5%, etc, and the end result is slower timings, not faster.
>>> I must be using OpenMP incorrectly. No one would use it if it were
>>> this bad. I'm just taking pairs of pointwise multiplications and
>>> splitting so that one of them is executed by one thread and one by the
>>> other.
>>> Bill.
>>> 2009/6/13 Jason Moxham <ja...@njkfrudils.plus.com>:
>>> > On Saturday 13 June 2009 03:22:54 Bill Hart wrote:
>>> >> I made a first very rough attempt to parallelise Toom4. I just inserted
>>> >> some
>>> >> #pragma parallel sections
>>> >> and
>>> >> #pragma section
>>> >> statements in around pairs of independent pointwise multiplications in
>>> >> toom4.
>>> >> It slows it down nicely. I've got it running about 50% slower now.
>>> >> This is for multiplying about 450 limbs with toom4. Each of the
>>> >> pointwise mults is then going to be just over 100 limbs.
>>> >> Hmm, oh well. It's amazing how pathetic the documentation of openmp is
>>> >> on the web. It can't be this bad, or no one would use it!
>>> >> I'll try toom7 tomorrow maybe, see if that is any better. The odd
>>> >> thing is, 2 cpus start up, but they run at like 45% each.
>>> >> Bill.
>>> > My guess is that overheads/syncing are killing it.What I would try is the kara
>>> > mul but at sizes 100 to 30000 limbs , just to see how it performs. I'm not
>>> > suggesting we use karamul at these sizes , but just get some understanding of
>>> > how the parelllelization goes with a simple algorithm.
Even at 300000 limbs karatsuba seems to run twice as slow. Seems very
bad. That's a 4s computation. And that's with OMP_NESTED=false.
I'm using speed to time it and speed shows an almost two times speedup
for the parallelised FFT. So I don't know what is going on. It must be
nesting even though I've told it not to.
Bill.
2009/6/14 Bill Hart <goodwillh...@googlemail.com>:
> The FFT is about 50% faster at 32000 limbs and nearly twice as fast at
> 1000000 limbs, both of which are quite acceptable speedups.
> I'm quite surprised to find toom7 only about 10% faster at 2000 limbs,
> at best. That certainly wouldn't represent value for money, given that
> two CPU cores are being used.
> I think the next experiment will be the one suggested by Jason. See
> how Karatsuba performs for different sizes up to say 30000 limbs.
> Bill.
> 2009/6/14 Bill Hart <goodwillh...@googlemail.com>:
>> It's the way I'm timing it that's messed up. I was using the
>> mpirbench, which of course uses cputime not wall time. Doh.
>>> to check whether things are working properly, you could try this in
>>> mpn_mul_fft_full_a:
>>> tp = __GMP_ALLOCATE_FUNC_LIMBS (l);
>>> #pragma omp parallel sections
>>> {
>>> #pragma omp section
>>> muh = mpn_mul_fft (op, h, n, nl, m, ml, k2); /* mu = muh+{op,h} */
>>> #pragma omp section
>>> mpn_mul_fft_mersenne (tp, l, n, nl, m, ml, k1); /* B */
>>> }
>>> If CFLAGS includes -fopenmp and I set OMP_NUM_THREADS=2 at runtime, an
>>> mpn_mul_fft_full with 10^6 limbs takes a lot less time than with a
>>> single thread.
>>> What implementation of openmp are you using, a recent gcc? How are you
>>> taking the timings exactly? Karatsuba with 10000 limbs is faster with
>>> several threads than just 1 (a bit more complicated than the example
>>> above because you need scratch space), so I expect a small speed-up in
>>> the toom4 range is possible.
>>> On Jun 13, 8:12 am, Bill Hart <goodwillh...@googlemail.com> wrote:
>>>> Something is definitely wrong here. I did the same thing for Toom 7
>>>> today and there's a similar slowdown.
>>>> Basically it splits into two threads OK, but the sum of the total CPU
>>>> percentage loads always sums to 1, e.g. 65.3% and 34.7%, or 45.5% and
>>>> 54.5%, etc, and the end result is slower timings, not faster.
>>>> I must be using OpenMP incorrectly. No one would use it if it were
>>>> this bad. I'm just taking pairs of pointwise multiplications and
>>>> splitting so that one of them is executed by one thread and one by the
>>>> other.
>>>> Bill.
>>>> 2009/6/13 Jason Moxham <ja...@njkfrudils.plus.com>:
>>>> > On Saturday 13 June 2009 03:22:54 Bill Hart wrote:
>>>> >> I made a first very rough attempt to parallelise Toom4. I just inserted
>>>> >> some
>>>> >> #pragma parallel sections
>>>> >> and
>>>> >> #pragma section
>>>> >> statements in around pairs of independent pointwise multiplications in
>>>> >> toom4.
>>>> >> It slows it down nicely. I've got it running about 50% slower now.
>>>> >> This is for multiplying about 450 limbs with toom4. Each of the
>>>> >> pointwise mults is then going to be just over 100 limbs.
>>>> >> Hmm, oh well. It's amazing how pathetic the documentation of openmp is
>>>> >> on the web. It can't be this bad, or no one would use it!
>>>> >> I'll try toom7 tomorrow maybe, see if that is any better. The odd
>>>> >> thing is, 2 cpus start up, but they run at like 45% each.
>>>> >> Bill.
>>>> > My guess is that overheads/syncing are killing it.What I would try is the kara
>>>> > mul but at sizes 100 to 30000 limbs , just to see how it performs. I'm not
>>>> > suggesting we use karamul at these sizes , but just get some understanding of
>>>> > how the parelllelization goes with a simple algorithm.
----- Original Message ----- From: "Bill Hart" <goodwillh...@googlemail.com>
To: <mpir-dev@googlegroups.com>
Sent: Sunday, June 14, 2009 3:42 AM
Subject: Re: First attempt at openmp
Even at 300000 limbs karatsuba seems to run twice as slow. Seems very
bad. That's a 4s computation. And that's with OMP_NESTED=false.
I'm using speed to time it and speed shows an almost two times speedup
for the parallelised FFT. So I don't know what is going on. It must be
nesting even though I've told it not to.
Bill.
2009/6/14 Bill Hart <goodwillh...@googlemail.com>:
> The FFT is about 50% faster at 32000 limbs and nearly twice as fast at
> 1000000 limbs, both of which are quite acceptable speedups.
> I'm quite surprised to find toom7 only about 10% faster at 2000 limbs,
> at best. That certainly wouldn't represent value for money, given that
> two CPU cores are being used.
> I think the next experiment will be the one suggested by Jason. See
> how Karatsuba performs for different sizes up to say 30000 limbs.
> Bill.
> 2009/6/14 Bill Hart <goodwillh...@googlemail.com>:
>> It's the way I'm timing it that's messed up. I was using the
>> mpirbench, which of course uses cputime not wall time. Doh.
>>> to check whether things are working properly, you could try this in
>>> mpn_mul_fft_full_a:
>>> tp = __GMP_ALLOCATE_FUNC_LIMBS (l);
>>> #pragma omp parallel sections
>>> {
>>> #pragma omp section
>>> muh = mpn_mul_fft (op, h, n, nl, m, ml, k2); /* mu = muh+{op,h} */
>>> #pragma omp section
>>> mpn_mul_fft_mersenne (tp, l, n, nl, m, ml, k1); /* B */
>>> }
>>> If CFLAGS includes -fopenmp and I set OMP_NUM_THREADS=2 at runtime, an
>>> mpn_mul_fft_full with 10^6 limbs takes a lot less time than with a
>>> single thread.
>>> What implementation of openmp are you using, a recent gcc? How are you
>>> taking the timings exactly? Karatsuba with 10000 limbs is faster with
>>> several threads than just 1 (a bit more complicated than the example
>>> above because you need scratch space), so I expect a small speed-up in
>>> the toom4 range is possible.
>>> On Jun 13, 8:12 am, Bill Hart <goodwillh...@googlemail.com> wrote:
>>>> Something is definitely wrong here. I did the same thing for Toom 7
>>>> today and there's a similar slowdown.
>>>> Basically it splits into two threads OK, but the sum of the total CPU
>>>> percentage loads always sums to 1, e.g. 65.3% and 34.7%, or 45.5% and
>>>> 54.5%, etc, and the end result is slower timings, not faster.
>>>> I must be using OpenMP incorrectly. No one would use it if it were
>>>> this bad. I'm just taking pairs of pointwise multiplications and
>>>> splitting so that one of them is executed by one thread and one by the
>>>> other.
>>>> Bill.
>>>> 2009/6/13 Jason Moxham <ja...@njkfrudils.plus.com>:
>>>> > On Saturday 13 June 2009 03:22:54 Bill Hart wrote:
>>>> >> I made a first very rough attempt to parallelise Toom4. I just >>>> >> inserted
>>>> >> some
>>>> >> #pragma parallel sections
>>>> >> and
>>>> >> #pragma section
>>>> >> statements in around pairs of independent pointwise multiplications >>>> >> in
>>>> >> toom4.
>>>> >> It slows it down nicely. I've got it running about 50% slower now.
>>>> >> This is for multiplying about 450 limbs with toom4. Each of the
>>>> >> pointwise mults is then going to be just over 100 limbs.
>>>> >> Hmm, oh well. It's amazing how pathetic the documentation of openmp >>>> >> is
>>>> >> on the web. It can't be this bad, or no one would use it!
>>>> >> I'll try toom7 tomorrow maybe, see if that is any better. The odd
>>>> >> thing is, 2 cpus start up, but they run at like 45% each.
>>>> >> Bill.
>>>> > My guess is that overheads/syncing are killing it.What I would try is >>>> > the kara
>>>> > mul but at sizes 100 to 30000 limbs , just to see how it performs. >>>> > I'm not
>>>> > suggesting we use karamul at these sizes , but just get some >>>> > understanding of
>>>> > how the parelllelization goes with a simple algorithm.
I'm timing mpn_kara_mul_n using speed. It's just kara and base against
kara and base with openmp.
I tried forcing it to not nest any parallelism by having openmp
pragmas only in the outer level (it calls off to an identical copy of
mpn_kara_mul_n without openmp pragmas instead of recursing to the same
function. No change. Still twice as slow.
It's 50% slower with only 2 threads instead of 3.
They are pretty enormous operands. Perhaps it just overwhelms the
cache more quickly with two processors working at the same time.
Bill.
2009/6/14 Jason Moxham <ja...@njkfrudils.plus.com>:
> Your comparing kara and base against kara and base with openmp.
> You have ONLY kara and basecase mul , nothing else?
> ----- Original Message -----
> From: "Bill Hart" <goodwillh...@googlemail.com>
> To: <mpir-dev@googlegroups.com>
> Sent: Sunday, June 14, 2009 3:42 AM
> Subject: Re: First attempt at openmp
> Even at 300000 limbs karatsuba seems to run twice as slow. Seems very
> bad. That's a 4s computation. And that's with OMP_NESTED=false.
> I'm using speed to time it and speed shows an almost two times speedup
> for the parallelised FFT. So I don't know what is going on. It must be
> nesting even though I've told it not to.
> Bill.
> 2009/6/14 Bill Hart <goodwillh...@googlemail.com>:
>> The FFT is about 50% faster at 32000 limbs and nearly twice as fast at
>> 1000000 limbs, both of which are quite acceptable speedups.
>> I'm quite surprised to find toom7 only about 10% faster at 2000 limbs,
>> at best. That certainly wouldn't represent value for money, given that
>> two CPU cores are being used.
>> I think the next experiment will be the one suggested by Jason. See
>> how Karatsuba performs for different sizes up to say 30000 limbs.
>> Bill.
>> 2009/6/14 Bill Hart <goodwillh...@googlemail.com>:
>>> It's the way I'm timing it that's messed up. I was using the
>>> mpirbench, which of course uses cputime not wall time. Doh.
>>>> to check whether things are working properly, you could try this in
>>>> mpn_mul_fft_full_a:
>>>> tp = __GMP_ALLOCATE_FUNC_LIMBS (l);
>>>> #pragma omp parallel sections
>>>> {
>>>> #pragma omp section
>>>> muh = mpn_mul_fft (op, h, n, nl, m, ml, k2); /* mu = muh+{op,h} */
>>>> #pragma omp section
>>>> mpn_mul_fft_mersenne (tp, l, n, nl, m, ml, k1); /* B */
>>>> }
>>>> If CFLAGS includes -fopenmp and I set OMP_NUM_THREADS=2 at runtime, an
>>>> mpn_mul_fft_full with 10^6 limbs takes a lot less time than with a
>>>> single thread.
>>>> What implementation of openmp are you using, a recent gcc? How are you
>>>> taking the timings exactly? Karatsuba with 10000 limbs is faster with
>>>> several threads than just 1 (a bit more complicated than the example
>>>> above because you need scratch space), so I expect a small speed-up in
>>>> the toom4 range is possible.
>>>> On Jun 13, 8:12 am, Bill Hart <goodwillh...@googlemail.com> wrote:
>>>>> Something is definitely wrong here. I did the same thing for Toom 7
>>>>> today and there's a similar slowdown.
>>>>> Basically it splits into two threads OK, but the sum of the total CPU
>>>>> percentage loads always sums to 1, e.g. 65.3% and 34.7%, or 45.5% and
>>>>> 54.5%, etc, and the end result is slower timings, not faster.
>>>>> I must be using OpenMP incorrectly. No one would use it if it were
>>>>> this bad. I'm just taking pairs of pointwise multiplications and
>>>>> splitting so that one of them is executed by one thread and one by the
>>>>> other.
>>>>> Bill.
>>>>> 2009/6/13 Jason Moxham <ja...@njkfrudils.plus.com>:
>>>>> > On Saturday 13 June 2009 03:22:54 Bill Hart wrote:
>>>>> >> I made a first very rough attempt to parallelise Toom4. I just
>>>>> >> inserted
>>>>> >> some
>>>>> >> #pragma parallel sections
>>>>> >> and
>>>>> >> #pragma section
>>>>> >> statements in around pairs of independent pointwise multiplications
>>>>> >> in
>>>>> >> toom4.
>>>>> >> It slows it down nicely. I've got it running about 50% slower now.
>>>>> >> This is for multiplying about 450 limbs with toom4. Each of the
>>>>> >> pointwise mults is then going to be just over 100 limbs.
>>>>> >> Hmm, oh well. It's amazing how pathetic the documentation of openmp
>>>>> >> is
>>>>> >> on the web. It can't be this bad, or no one would use it!
>>>>> >> I'll try toom7 tomorrow maybe, see if that is any better. The odd
>>>>> >> thing is, 2 cpus start up, but they run at like 45% each.
>>>>> >> Bill.
>>>>> > My guess is that overheads/syncing are killing it.What I would try is
>>>>> > the kara
>>>>> > mul but at sizes 100 to 30000 limbs , just to see how it performs.
>>>>> > I'm not
>>>>> > suggesting we use karamul at these sizes , but just get some
>>>>> > understanding of
>>>>> > how the parelllelization goes with a simple algorithm.
Perhaps we are hitting main memory or L3 bandwidth ? , doesn't seem likely but perhaps it is
basecase runs at 2.5c/l and it takes 2 reads and 1 write per limb
----- Original Message ----- From: "Bill Hart" <goodwillh...@googlemail.com>
To: <mpir-dev@googlegroups.com>
Sent: Sunday, June 14, 2009 3:56 AM
Subject: Re: First attempt at openmp
I'm timing mpn_kara_mul_n using speed. It's just kara and base against
kara and base with openmp.
I tried forcing it to not nest any parallelism by having openmp
pragmas only in the outer level (it calls off to an identical copy of
mpn_kara_mul_n without openmp pragmas instead of recursing to the same
function. No change. Still twice as slow.
It's 50% slower with only 2 threads instead of 3.
They are pretty enormous operands. Perhaps it just overwhelms the
cache more quickly with two processors working at the same time.
Bill.
2009/6/14 Jason Moxham <ja...@njkfrudils.plus.com>:
> Your comparing kara and base against kara and base with openmp.
> You have ONLY kara and basecase mul , nothing else?
> ----- Original Message -----
> From: "Bill Hart" <goodwillh...@googlemail.com>
> To: <mpir-dev@googlegroups.com>
> Sent: Sunday, June 14, 2009 3:42 AM
> Subject: Re: First attempt at openmp
> Even at 300000 limbs karatsuba seems to run twice as slow. Seems very
> bad. That's a 4s computation. And that's with OMP_NESTED=false.
> I'm using speed to time it and speed shows an almost two times speedup
> for the parallelised FFT. So I don't know what is going on. It must be
> nesting even though I've told it not to.
> Bill.
> 2009/6/14 Bill Hart <goodwillh...@googlemail.com>:
>> The FFT is about 50% faster at 32000 limbs and nearly twice as fast at
>> 1000000 limbs, both of which are quite acceptable speedups.
>> I'm quite surprised to find toom7 only about 10% faster at 2000 limbs,
>> at best. That certainly wouldn't represent value for money, given that
>> two CPU cores are being used.
>> I think the next experiment will be the one suggested by Jason. See
>> how Karatsuba performs for different sizes up to say 30000 limbs.
>> Bill.
>> 2009/6/14 Bill Hart <goodwillh...@googlemail.com>:
>>> It's the way I'm timing it that's messed up. I was using the
>>> mpirbench, which of course uses cputime not wall time. Doh.
>>>> to check whether things are working properly, you could try this in
>>>> mpn_mul_fft_full_a:
>>>> tp = __GMP_ALLOCATE_FUNC_LIMBS (l);
>>>> #pragma omp parallel sections
>>>> {
>>>> #pragma omp section
>>>> muh = mpn_mul_fft (op, h, n, nl, m, ml, k2); /* mu = muh+{op,h} */
>>>> #pragma omp section
>>>> mpn_mul_fft_mersenne (tp, l, n, nl, m, ml, k1); /* B */
>>>> }
>>>> If CFLAGS includes -fopenmp and I set OMP_NUM_THREADS=2 at runtime, an
>>>> mpn_mul_fft_full with 10^6 limbs takes a lot less time than with a
>>>> single thread.
>>>> What implementation of openmp are you using, a recent gcc? How are you
>>>> taking the timings exactly? Karatsuba with 10000 limbs is faster with
>>>> several threads than just 1 (a bit more complicated than the example
>>>> above because you need scratch space), so I expect a small speed-up in
>>>> the toom4 range is possible.
>>>> On Jun 13, 8:12 am, Bill Hart <goodwillh...@googlemail.com> wrote:
>>>>> Something is definitely wrong here. I did the same thing for Toom 7
>>>>> today and there's a similar slowdown.
>>>>> Basically it splits into two threads OK, but the sum of the total CPU
>>>>> percentage loads always sums to 1, e.g. 65.3% and 34.7%, or 45.5% and
>>>>> 54.5%, etc, and the end result is slower timings, not faster.
>>>>> I must be using OpenMP incorrectly. No one would use it if it were
>>>>> this bad. I'm just taking pairs of pointwise multiplications and
>>>>> splitting so that one of them is executed by one thread and one by the
>>>>> other.
>>>>> Bill.
>>>>> 2009/6/13 Jason Moxham <ja...@njkfrudils.plus.com>:
>>>>> > On Saturday 13 June 2009 03:22:54 Bill Hart wrote:
>>>>> >> I made a first very rough attempt to parallelise Toom4. I just
>>>>> >> inserted
>>>>> >> some
>>>>> >> #pragma parallel sections
>>>>> >> and
>>>>> >> #pragma section
>>>>> >> statements in around pairs of independent pointwise multiplications
>>>>> >> in
>>>>> >> toom4.
>>>>> >> It slows it down nicely. I've got it running about 50% slower now.
>>>>> >> This is for multiplying about 450 limbs with toom4. Each of the
>>>>> >> pointwise mults is then going to be just over 100 limbs.
>>>>> >> Hmm, oh well. It's amazing how pathetic the documentation of openmp
>>>>> >> is
>>>>> >> on the web. It can't be this bad, or no one would use it!
>>>>> >> I'll try toom7 tomorrow maybe, see if that is any better. The odd
>>>>> >> thing is, 2 cpus start up, but they run at like 45% each.
>>>>> >> Bill.
>>>>> > My guess is that overheads/syncing are killing it.What I would try >>>>> > is
>>>>> > the kara
>>>>> > mul but at sizes 100 to 30000 limbs , just to see how it performs.
>>>>> > I'm not
>>>>> > suggesting we use karamul at these sizes , but just get some
>>>>> > understanding of
>>>>> > how the parelllelization goes with a simple algorithm.
yeah I think thats it
if we run a 10^6 basecase this will take 3*10^12 mem ops
so adding another thread wont help because we are allready waiting for memory to respond
----- Original Message ----- From: "Jason Moxham" <ja...@njkfrudils.plus.com>
To: <mpir-dev@googlegroups.com>
Sent: Sunday, June 14, 2009 4:01 AM
Subject: Re: First attempt at openmp
> Perhaps we are hitting main memory or L3 bandwidth ? , doesn't seem likely
> but perhaps it is
> basecase runs at 2.5c/l and it takes 2 reads and 1 write per limb
> ----- Original Message ----- > From: "Bill Hart" <goodwillh...@googlemail.com>
> To: <mpir-dev@googlegroups.com>
> Sent: Sunday, June 14, 2009 3:56 AM
> Subject: Re: First attempt at openmp
> I'm timing mpn_kara_mul_n using speed. It's just kara and base against
> kara and base with openmp.
> I tried forcing it to not nest any parallelism by having openmp
> pragmas only in the outer level (it calls off to an identical copy of
> mpn_kara_mul_n without openmp pragmas instead of recursing to the same
> function. No change. Still twice as slow.
> It's 50% slower with only 2 threads instead of 3.
> They are pretty enormous operands. Perhaps it just overwhelms the
> cache more quickly with two processors working at the same time.
> Bill.
> 2009/6/14 Jason Moxham <ja...@njkfrudils.plus.com>:
>> Your comparing kara and base against kara and base with openmp.
>> You have ONLY kara and basecase mul , nothing else?
>> ----- Original Message -----
>> From: "Bill Hart" <goodwillh...@googlemail.com>
>> To: <mpir-dev@googlegroups.com>
>> Sent: Sunday, June 14, 2009 3:42 AM
>> Subject: Re: First attempt at openmp
>> Even at 300000 limbs karatsuba seems to run twice as slow. Seems very
>> bad. That's a 4s computation. And that's with OMP_NESTED=false.
>> I'm using speed to time it and speed shows an almost two times speedup
>> for the parallelised FFT. So I don't know what is going on. It must be
>> nesting even though I've told it not to.
>> Bill.
>> 2009/6/14 Bill Hart <goodwillh...@googlemail.com>:
>>> The FFT is about 50% faster at 32000 limbs and nearly twice as fast at
>>> 1000000 limbs, both of which are quite acceptable speedups.
>>> I'm quite surprised to find toom7 only about 10% faster at 2000 limbs,
>>> at best. That certainly wouldn't represent value for money, given that
>>> two CPU cores are being used.
>>> I think the next experiment will be the one suggested by Jason. See
>>> how Karatsuba performs for different sizes up to say 30000 limbs.
>>> Bill.
>>> 2009/6/14 Bill Hart <goodwillh...@googlemail.com>:
>>>> It's the way I'm timing it that's messed up. I was using the
>>>> mpirbench, which of course uses cputime not wall time. Doh.
>>>>> to check whether things are working properly, you could try this in
>>>>> mpn_mul_fft_full_a:
>>>>> tp = __GMP_ALLOCATE_FUNC_LIMBS (l);
>>>>> #pragma omp parallel sections
>>>>> {
>>>>> #pragma omp section
>>>>> muh = mpn_mul_fft (op, h, n, nl, m, ml, k2); /* mu = muh+{op,h} */
>>>>> #pragma omp section
>>>>> mpn_mul_fft_mersenne (tp, l, n, nl, m, ml, k1); /* B */
>>>>> }
>>>>> If CFLAGS includes -fopenmp and I set OMP_NUM_THREADS=2 at runtime, an
>>>>> mpn_mul_fft_full with 10^6 limbs takes a lot less time than with a
>>>>> single thread.
>>>>> What implementation of openmp are you using, a recent gcc? How are you
>>>>> taking the timings exactly? Karatsuba with 10000 limbs is faster with
>>>>> several threads than just 1 (a bit more complicated than the example
>>>>> above because you need scratch space), so I expect a small speed-up in
>>>>> the toom4 range is possible.
>>>>> On Jun 13, 8:12 am, Bill Hart <goodwillh...@googlemail.com> wrote:
>>>>>> Something is definitely wrong here. I did the same thing for Toom 7
>>>>>> today and there's a similar slowdown.
>>>>>> Basically it splits into two threads OK, but the sum of the total CPU
>>>>>> percentage loads always sums to 1, e.g. 65.3% and 34.7%, or 45.5% and
>>>>>> 54.5%, etc, and the end result is slower timings, not faster.
>>>>>> I must be using OpenMP incorrectly. No one would use it if it were
>>>>>> this bad. I'm just taking pairs of pointwise multiplications and
>>>>>> splitting so that one of them is executed by one thread and one by >>>>>> the
>>>>>> other.
>>>>>> Bill.
>>>>>> 2009/6/13 Jason Moxham <ja...@njkfrudils.plus.com>:
>>>>>> > On Saturday 13 June 2009 03:22:54 Bill Hart wrote:
>>>>>> >> I made a first very rough attempt to parallelise Toom4. I just
>>>>>> >> inserted
>>>>>> >> some
>>>>>> >> #pragma parallel sections
>>>>>> >> and
>>>>>> >> #pragma section
>>>>>> >> statements in around pairs of independent pointwise >>>>>> >> multiplications
>>>>>> >> in
>>>>>> >> toom4.
>>>>>> >> It slows it down nicely. I've got it running about 50% slower now.
>>>>>> >> This is for multiplying about 450 limbs with toom4. Each of the
>>>>>> >> pointwise mults is then going to be just over 100 limbs.
>>>>>> >> Hmm, oh well. It's amazing how pathetic the documentation of >>>>>> >> openmp
>>>>>> >> is
>>>>>> >> on the web. It can't be this bad, or no one would use it!
>>>>>> >> I'll try toom7 tomorrow maybe, see if that is any better. The odd
>>>>>> >> thing is, 2 cpus start up, but they run at like 45% each.
>>>>>> >> Bill.
>>>>>> > My guess is that overheads/syncing are killing it.What I would try
>>>>>> > is
>>>>>> > the kara
>>>>>> > mul but at sizes 100 to 30000 limbs , just to see how it performs.
>>>>>> > I'm not
>>>>>> > suggesting we use karamul at these sizes , but just get some
>>>>>> > understanding of
>>>>>> > how the parelllelization goes with a simple algorithm.
> Perhaps we are hitting main memory or L3 bandwidth ? , doesn't seem likely
> but perhaps it is
> basecase runs at 2.5c/l and it takes 2 reads and 1 write per limb
> ----- Original Message -----
> From: "Bill Hart" <goodwillh...@googlemail.com>
> To: <mpir-dev@googlegroups.com>
> Sent: Sunday, June 14, 2009 3:56 AM
> Subject: Re: First attempt at openmp
> I'm timing mpn_kara_mul_n using speed. It's just kara and base against
> kara and base with openmp.
> I tried forcing it to not nest any parallelism by having openmp
> pragmas only in the outer level (it calls off to an identical copy of
> mpn_kara_mul_n without openmp pragmas instead of recursing to the same
> function. No change. Still twice as slow.
> It's 50% slower with only 2 threads instead of 3.
> They are pretty enormous operands. Perhaps it just overwhelms the
> cache more quickly with two processors working at the same time.
> Bill.
> 2009/6/14 Jason Moxham <ja...@njkfrudils.plus.com>:
>> Your comparing kara and base against kara and base with openmp.
>> You have ONLY kara and basecase mul , nothing else?
>> ----- Original Message -----
>> From: "Bill Hart" <goodwillh...@googlemail.com>
>> To: <mpir-dev@googlegroups.com>
>> Sent: Sunday, June 14, 2009 3:42 AM
>> Subject: Re: First attempt at openmp
>> Even at 300000 limbs karatsuba seems to run twice as slow. Seems very
>> bad. That's a 4s computation. And that's with OMP_NESTED=false.
>> I'm using speed to time it and speed shows an almost two times speedup
>> for the parallelised FFT. So I don't know what is going on. It must be
>> nesting even though I've told it not to.
>> Bill.
>> 2009/6/14 Bill Hart <goodwillh...@googlemail.com>:
>>> The FFT is about 50% faster at 32000 limbs and nearly twice as fast at
>>> 1000000 limbs, both of which are quite acceptable speedups.
>>> I'm quite surprised to find toom7 only about 10% faster at 2000 limbs,
>>> at best. That certainly wouldn't represent value for money, given that
>>> two CPU cores are being used.
>>> I think the next experiment will be the one suggested by Jason. See
>>> how Karatsuba performs for different sizes up to say 30000 limbs.
>>> Bill.
>>> 2009/6/14 Bill Hart <goodwillh...@googlemail.com>:
>>>> It's the way I'm timing it that's messed up. I was using the
>>>> mpirbench, which of course uses cputime not wall time. Doh.
>>>>> to check whether things are working properly, you could try this in
>>>>> mpn_mul_fft_full_a:
>>>>> tp = __GMP_ALLOCATE_FUNC_LIMBS (l);
>>>>> #pragma omp parallel sections
>>>>> {
>>>>> #pragma omp section
>>>>> muh = mpn_mul_fft (op, h, n, nl, m, ml, k2); /* mu = muh+{op,h} */
>>>>> #pragma omp section
>>>>> mpn_mul_fft_mersenne (tp, l, n, nl, m, ml, k1); /* B */
>>>>> }
>>>>> If CFLAGS includes -fopenmp and I set OMP_NUM_THREADS=2 at runtime, an
>>>>> mpn_mul_fft_full with 10^6 limbs takes a lot less time than with a
>>>>> single thread.
>>>>> What implementation of openmp are you using, a recent gcc? How are you
>>>>> taking the timings exactly? Karatsuba with 10000 limbs is faster with
>>>>> several threads than just 1 (a bit more complicated than the example
>>>>> above because you need scratch space), so I expect a small speed-up in
>>>>> the toom4 range is possible.
>>>>> On Jun 13, 8:12 am, Bill Hart <goodwillh...@googlemail.com> wrote:
>>>>>> Something is definitely wrong here. I did the same thing for Toom 7
>>>>>> today and there's a similar slowdown.
>>>>>> Basically it splits into two threads OK, but the sum of the total CPU
>>>>>> percentage loads always sums to 1, e.g. 65.3% and 34.7%, or 45.5% and
>>>>>> 54.5%, etc, and the end result is slower timings, not faster.
>>>>>> I must be using OpenMP incorrectly. No one would use it if it were
>>>>>> this bad. I'm just taking pairs of pointwise multiplications and
>>>>>> splitting so that one of them is executed by one thread and one by the
>>>>>> other.
>>>>>> Bill.
>>>>>> 2009/6/13 Jason Moxham <ja...@njkfrudils.plus.com>:
>>>>>> > On Saturday 13 June 2009 03:22:54 Bill Hart wrote:
>>>>>> >> I made a first very rough attempt to parallelise Toom4. I just
>>>>>> >> inserted
>>>>>> >> some
>>>>>> >> #pragma parallel sections
>>>>>> >> and
>>>>>> >> #pragma section
>>>>>> >> statements in around pairs of independent pointwise multiplications
>>>>>> >> in
>>>>>> >> toom4.
>>>>>> >> It slows it down nicely. I've got it running about 50% slower now.
>>>>>> >> This is for multiplying about 450 limbs with toom4. Each of the
>>>>>> >> pointwise mults is then going to be just over 100 limbs.
>>>>>> >> Hmm, oh well. It's amazing how pathetic the documentation of openmp
>>>>>> >> is
>>>>>> >> on the web. It can't be this bad, or no one would use it!
>>>>>> >> I'll try toom7 tomorrow maybe, see if that is any better. The odd
>>>>>> >> thing is, 2 cpus start up, but they run at like 45% each.
>>>>>> >> Bill.
>>>>>> > My guess is that overheads/syncing are killing it.What I would try
>>>>>> > is
>>>>>> > the kara
>>>>>> > mul but at sizes 100 to 30000 limbs , just to see how it performs.
>>>>>> > I'm not
>>>>>> > suggesting we use karamul at these sizes , but just get some
>>>>>> > understanding of
>>>>>> > how the parelllelization goes with a simple algorithm.
Could you share the patches you are using, both to measure the speed
and to parallelize karatsuba? Last time I tried, just making the last
two calls to mpn_kara_mul_n before the /* Interpolate */ comment in
parallel sections (and giving the last one a fresh buffer instead of ws
+n), the second thread increased the speed for 1000, 10000 or 300000
limbs, it did not make it slower (I used 2 threads, no nesting). For
300000, it is pretty close to a 2x speed-up.
On Jun 13, 8:18 pm, Bill Hart <goodwillh...@googlemail.com> wrote:
> The times jump around a lot. At 1000 limbs it varies from 50% slower
> to about the same speed.
> At 10000 limbs it is 50% slower, twice as slow at 30000 limbs.
> At 300 limbs it is only about 30% slower.
> 2009/6/14 Jason Moxham <ja...@njkfrudils.plus.com>:
> > Perhaps we are hitting main memory or L3 bandwidth ? , doesn't seem likely
> > but perhaps it is
> > basecase runs at 2.5c/l and it takes 2 reads and 1 write per limb
> > ----- Original Message -----
> > From: "Bill Hart" <goodwillh...@googlemail.com>
> > To: <mpir-dev@googlegroups.com>
> > Sent: Sunday, June 14, 2009 3:56 AM
> > Subject: Re: First attempt at openmp
> > I'm timing mpn_kara_mul_n using speed. It's just kara and base against
> > kara and base with openmp.
> > I tried forcing it to not nest any parallelism by having openmp
> > pragmas only in the outer level (it calls off to an identical copy of
> > mpn_kara_mul_n without openmp pragmas instead of recursing to the same
> > function. No change. Still twice as slow.
> > It's 50% slower with only 2 threads instead of 3.
> > They are pretty enormous operands. Perhaps it just overwhelms the
> > cache more quickly with two processors working at the same time.
> > Bill.
> > 2009/6/14 Jason Moxham <ja...@njkfrudils.plus.com>:
> >> Your comparing kara and base against kara and base with openmp.
> >> You have ONLY kara and basecase mul , nothing else?
> >> ----- Original Message -----
> >> From: "Bill Hart" <goodwillh...@googlemail.com>
> >> To: <mpir-dev@googlegroups.com>
> >> Sent: Sunday, June 14, 2009 3:42 AM
> >> Subject: Re: First attempt at openmp
> >> Even at 300000 limbs karatsuba seems to run twice as slow. Seems very
> >> bad. That's a 4s computation. And that's with OMP_NESTED=false.
> >> I'm using speed to time it and speed shows an almost two times speedup
> >> for the parallelised FFT. So I don't know what is going on. It must be
> >> nesting even though I've told it not to.
> >> Bill.
> >> 2009/6/14 Bill Hart <goodwillh...@googlemail.com>:
> >>> The FFT is about 50% faster at 32000 limbs and nearly twice as fast at
> >>> 1000000 limbs, both of which are quite acceptable speedups.
> >>> I'm quite surprised to find toom7 only about 10% faster at 2000 limbs,
> >>> at best. That certainly wouldn't represent value for money, given that
> >>> two CPU cores are being used.
> >>> I think the next experiment will be the one suggested by Jason. See
> >>> how Karatsuba performs for different sizes up to say 30000 limbs.
> >>> Bill.
> >>> 2009/6/14 Bill Hart <goodwillh...@googlemail.com>:
> >>>> It's the way I'm timing it that's messed up. I was using the
> >>>> mpirbench, which of course uses cputime not wall time. Doh.
> >>>>> #pragma omp section
> >>>>> mpn_mul_fft_mersenne (tp, l, n, nl, m, ml, k1); /* B */
> >>>>> }
> >>>>> If CFLAGS includes -fopenmp and I set OMP_NUM_THREADS=2 at runtime, an
> >>>>> mpn_mul_fft_full with 10^6 limbs takes a lot less time than with a
> >>>>> single thread.
> >>>>> What implementation of openmp are you using, a recent gcc? How are you
> >>>>> taking the timings exactly? Karatsuba with 10000 limbs is faster with
> >>>>> several threads than just 1 (a bit more complicated than the example
> >>>>> above because you need scratch space), so I expect a small speed-up in
> >>>>> the toom4 range is possible.
> >>>>> On Jun 13, 8:12 am, Bill Hart <goodwillh...@googlemail.com> wrote:
> >>>>>> Something is definitely wrong here. I did the same thing for Toom 7
> >>>>>> today and there's a similar slowdown.
> >>>>>> Basically it splits into two threads OK, but the sum of the total CPU
> >>>>>> percentage loads always sums to 1, e.g. 65.3% and 34.7%, or 45.5% and
> >>>>>> 54.5%, etc, and the end result is slower timings, not faster.
> >>>>>> I must be using OpenMP incorrectly. No one would use it if it were
> >>>>>> this bad. I'm just taking pairs of pointwise multiplications and
> >>>>>> splitting so that one of them is executed by one thread and one by the
> >>>>>> other.
> >>>>>> Bill.
> >>>>>> 2009/6/13 Jason Moxham <ja...@njkfrudils.plus.com>:
> >>>>>> > On Saturday 13 June 2009 03:22:54 Bill Hart wrote:
> >>>>>> >> I made a first very rough attempt to parallelise Toom4. I just
> >>>>>> >> inserted
> >>>>>> >> some
> >>>>>> >> #pragma parallel sections
> >>>>>> >> and
> >>>>>> >> #pragma section
> >>>>>> >> statements in around pairs of independent pointwise multiplications
> >>>>>> >> in
> >>>>>> >> toom4.
> >>>>>> >> It slows it down nicely. I've got it running about 50% slower now.
> >>>>>> >> This is for multiplying about 450 limbs with toom4. Each of the
> >>>>>> >> pointwise mults is then going to be just over 100 limbs.
> >>>>>> >> Hmm, oh well. It's amazing how pathetic the documentation of openmp
> >>>>>> >> is
> >>>>>> >> on the web. It can't be this bad, or no one would use it!
> >>>>>> >> I'll try toom7 tomorrow maybe, see if that is any better. The odd
> >>>>>> >> thing is, 2 cpus start up, but they run at like 45% each.
> >>>>>> >> Bill.
> >>>>>> > My guess is that overheads/syncing are killing it.What I would try
> >>>>>> > is
> >>>>>> > the kara
> >>>>>> > mul but at sizes 100 to 30000 limbs , just to see how it performs.
> >>>>>> > I'm not
> >>>>>> > suggesting we use karamul at these sizes , but just get some
> >>>>>> > understanding of
> >>>>>> > how the parelllelization goes with a simple algorithm.
> Could you share the patches you are using, both to measure the speed
> and to parallelize karatsuba? Last time I tried, just making the last
> two calls to mpn_kara_mul_n before the /* Interpolate */ comment in
> parallel sections (and giving the last one a fresh buffer instead of ws
> +n), the second thread increased the speed for 1000, 10000 or 300000
> limbs, it did not make it slower (I used 2 threads, no nesting). For
> 300000, it is pretty close to a 2x speed-up.
> On Jun 13, 8:18 pm, Bill Hart <goodwillh...@googlemail.com> wrote:
>> The times jump around a lot. At 1000 limbs it varies from 50% slower
>> to about the same speed.
>> At 10000 limbs it is 50% slower, twice as slow at 30000 limbs.
>> At 300 limbs it is only about 30% slower.
>> 2009/6/14 Jason Moxham <ja...@njkfrudils.plus.com>:
>> > Perhaps we are hitting main memory or L3 bandwidth ? , doesn't seem likely
>> > but perhaps it is
>> > basecase runs at 2.5c/l and it takes 2 reads and 1 write per limb
>> > ----- Original Message -----
>> > From: "Bill Hart" <goodwillh...@googlemail.com>
>> > To: <mpir-dev@googlegroups.com>
>> > Sent: Sunday, June 14, 2009 3:56 AM
>> > Subject: Re: First attempt at openmp
>> > I'm timing mpn_kara_mul_n using speed. It's just kara and base against
>> > kara and base with openmp.
>> > I tried forcing it to not nest any parallelism by having openmp
>> > pragmas only in the outer level (it calls off to an identical copy of
>> > mpn_kara_mul_n without openmp pragmas instead of recursing to the same
>> > function. No change. Still twice as slow.
>> > It's 50% slower with only 2 threads instead of 3.
>> > They are pretty enormous operands. Perhaps it just overwhelms the
>> > cache more quickly with two processors working at the same time.
>> > Bill.
>> > 2009/6/14 Jason Moxham <ja...@njkfrudils.plus.com>:
>> >> Your comparing kara and base against kara and base with openmp.
>> >> You have ONLY kara and basecase mul , nothing else?
>> >> ----- Original Message -----
>> >> From: "Bill Hart" <goodwillh...@googlemail.com>
>> >> To: <mpir-dev@googlegroups.com>
>> >> Sent: Sunday, June 14, 2009 3:42 AM
>> >> Subject: Re: First attempt at openmp
>> >> Even at 300000 limbs karatsuba seems to run twice as slow. Seems very
>> >> bad. That's a 4s computation. And that's with OMP_NESTED=false.
>> >> I'm using speed to time it and speed shows an almost two times speedup
>> >> for the parallelised FFT. So I don't know what is going on. It must be
>> >> nesting even though I've told it not to.
>> >> Bill.
>> >> 2009/6/14 Bill Hart <goodwillh...@googlemail.com>:
>> >>> The FFT is about 50% faster at 32000 limbs and nearly twice as fast at
>> >>> 1000000 limbs, both of which are quite acceptable speedups.
>> >>> I'm quite surprised to find toom7 only about 10% faster at 2000 limbs,
>> >>> at best. That certainly wouldn't represent value for money, given that
>> >>> two CPU cores are being used.
>> >>> I think the next experiment will be the one suggested by Jason. See
>> >>> how Karatsuba performs for different sizes up to say 30000 limbs.
>> >>> Bill.
>> >>> 2009/6/14 Bill Hart <goodwillh...@googlemail.com>:
>> >>>> It's the way I'm timing it that's messed up. I was using the
>> >>>> mpirbench, which of course uses cputime not wall time. Doh.
>> >>>>> #pragma omp section
>> >>>>> mpn_mul_fft_mersenne (tp, l, n, nl, m, ml, k1); /* B */
>> >>>>> }
>> >>>>> If CFLAGS includes -fopenmp and I set OMP_NUM_THREADS=2 at runtime, an
>> >>>>> mpn_mul_fft_full with 10^6 limbs takes a lot less time than with a
>> >>>>> single thread.
>> >>>>> What implementation of openmp are you using, a recent gcc? How are you
>> >>>>> taking the timings exactly? Karatsuba with 10000 limbs is faster with
>> >>>>> several threads than just 1 (a bit more complicated than the example
>> >>>>> above because you need scratch space), so I expect a small speed-up in
>> >>>>> the toom4 range is possible.
>> >>>>> On Jun 13, 8:12 am, Bill Hart <goodwillh...@googlemail.com> wrote:
>> >>>>>> Something is definitely wrong here. I did the same thing for Toom 7
>> >>>>>> today and there's a similar slowdown.
>> >>>>>> Basically it splits into two threads OK, but the sum of the total CPU
>> >>>>>> percentage loads always sums to 1, e.g. 65.3% and 34.7%, or 45.5% and
>> >>>>>> 54.5%, etc, and the end result is slower timings, not faster.
>> >>>>>> I must be using OpenMP incorrectly. No one would use it if it were
>> >>>>>> this bad. I'm just taking pairs of pointwise multiplications and
>> >>>>>> splitting so that one of them is executed by one thread and one by the
>> >>>>>> other.
>> >>>>>> Bill.
>> >>>>>> 2009/6/13 Jason Moxham <ja...@njkfrudils.plus.com>:
>> >>>>>> > On Saturday 13 June 2009 03:22:54 Bill Hart wrote:
>> >>>>>> >> I made a first very rough attempt to parallelise Toom4. I just
>> >>>>>> >> inserted
>> >>>>>> >> some
>> >>>>>> >> #pragma parallel sections
>> >>>>>> >> and
>> >>>>>> >> #pragma section
>> >>>>>> >> statements in around pairs of independent pointwise multiplications
>> >>>>>> >> in
>> >>>>>> >> toom4.
>> >>>>>> >> It slows it down nicely. I've got it running about 50% slower now.
>> >>>>>> >> This is for multiplying about 450 limbs with toom4. Each of the
>> >>>>>> >> pointwise mults is then going to be just over 100 limbs.
>> >>>>>> >> Hmm, oh well. It's amazing how pathetic the documentation of openmp
>> >>>>>> >> is
>> >>>>>> >> on the web. It can't be this bad, or no one would use it!
>> >>>>>> >> I'll try toom7 tomorrow maybe, see if that is any better. The odd
>> >>>>>> >> thing is, 2 cpus start up, but they run at like 45% each.
>> >>>>>> >> Bill.
>> >>>>>> > My guess is that overheads/syncing are killing it.What I would try
>> >>>>>> > is
>> >>>>>> > the kara
>> >>>>>> > mul but at sizes 100 to 30000 limbs , just to see how it performs.
>> >>>>>> > I'm not
>> >>>>>> > suggesting we use karamul at these sizes , but just get some
>> >>>>>> > understanding of
>> >>>>>> > how the parelllelization goes with a simple algorithm.
> That does seem to give what I expect for the FFT.
> I've attached the mul_n.c which does what I think you said you tried.
I think your version doesn't pass the testsuite with
OMP_NUM_THREADS=2. Of the 3 recursive calls to karamul, the first
reads p while the second and third write to p, so you can't make the
first in parallel to the others unless you use some other place to
store the temporary values. Since you end up with one thread writing
to p and one reading from it, the synchronization may be causing your
slow-down.
What I tried is something like:
http://www.loria.fr/~glisse/tmp/mul_n.c (I purposely try to avoid sending explicit code because I am not sure
I am allowed to, but this should be trivial enough)
OK, I did what Jason suggested and made the toom3 cutoff really high,
etc, so karatsuba is always used. I also modified the bench program to
time lots of different sizes. Here are the results, which are more
what I expected to see:
and of course in the appropriate gmp-mparam.h for my architecture I
changed the toom3, toom4 and toom7 cutoffs to very large values.
Naturally this is all very hackish at this stage and not much use as
no one in their right mind is going to use karatsuba for 1000 limbs.
Nothing we've tried scales beyond two or three threads either. But at
least openmp does work.
I think the goal should be to have algorithms which scale well with
the number of threads and have everything tuned so that for n = 2
threads there is at least a 50% speedup and for n > 2 threads, there
is at least an n-1 times speedup.
Bill.
2009/6/14 Bill Hart <goodwillh...@googlemail.com>:
>> Could you share the patches you are using, both to measure the speed
>> and to parallelize karatsuba? Last time I tried, just making the last
>> two calls to mpn_kara_mul_n before the /* Interpolate */ comment in
>> parallel sections (and giving the last one a fresh buffer instead of ws
>> +n), the second thread increased the speed for 1000, 10000 or 300000
>> limbs, it did not make it slower (I used 2 threads, no nesting). For
>> 300000, it is pretty close to a 2x speed-up.
>> On Jun 13, 8:18 pm, Bill Hart <goodwillh...@googlemail.com> wrote:
>>> The times jump around a lot. At 1000 limbs it varies from 50% slower
>>> to about the same speed.
>>> At 10000 limbs it is 50% slower, twice as slow at 30000 limbs.
>>> At 300 limbs it is only about 30% slower.
>>> 2009/6/14 Jason Moxham <ja...@njkfrudils.plus.com>:
>>> > Perhaps we are hitting main memory or L3 bandwidth ? , doesn't seem likely
>>> > but perhaps it is
>>> > basecase runs at 2.5c/l and it takes 2 reads and 1 write per limb
>>> > ----- Original Message -----
>>> > From: "Bill Hart" <goodwillh...@googlemail.com>
>>> > To: <mpir-dev@googlegroups.com>
>>> > Sent: Sunday, June 14, 2009 3:56 AM
>>> > Subject: Re: First attempt at openmp
>>> > I'm timing mpn_kara_mul_n using speed. It's just kara and base against
>>> > kara and base with openmp.
>>> > I tried forcing it to not nest any parallelism by having openmp
>>> > pragmas only in the outer level (it calls off to an identical copy of
>>> > mpn_kara_mul_n without openmp pragmas instead of recursing to the same
>>> > function. No change. Still twice as slow.
>>> > It's 50% slower with only 2 threads instead of 3.
>>> > They are pretty enormous operands. Perhaps it just overwhelms the
>>> > cache more quickly with two processors working at the same time.
>>> > Bill.
>>> > 2009/6/14 Jason Moxham <ja...@njkfrudils.plus.com>:
>>> >> Your comparing kara and base against kara and base with openmp.
>>> >> You have ONLY kara and basecase mul , nothing else?
>>> >> ----- Original Message -----
>>> >> From: "Bill Hart" <goodwillh...@googlemail.com>
>>> >> To: <mpir-dev@googlegroups.com>
>>> >> Sent: Sunday, June 14, 2009 3:42 AM
>>> >> Subject: Re: First attempt at openmp
>>> >> Even at 300000 limbs karatsuba seems to run twice as slow. Seems very
>>> >> bad. That's a 4s computation. And that's with OMP_NESTED=false.
>>> >> I'm using speed to time it and speed shows an almost two times speedup
>>> >> for the parallelised FFT. So I don't know what is going on. It must be
>>> >> nesting even though I've told it not to.
>>> >> Bill.
>>> >> 2009/6/14 Bill Hart <goodwillh...@googlemail.com>:
>>> >>> The FFT is about 50% faster at 32000 limbs and nearly twice as fast at
>>> >>> 1000000 limbs, both of which are quite acceptable speedups.
>>> >>> I'm quite surprised to find toom7 only about 10% faster at 2000 limbs,
>>> >>> at best. That certainly wouldn't represent value for money, given that
>>> >>> two CPU cores are being used.
>>> >>> I think the next experiment will be the one suggested by Jason. See
>>> >>> how Karatsuba performs for different sizes up to say 30000 limbs.
>>> >>> Bill.
>>> >>> 2009/6/14 Bill Hart <goodwillh...@googlemail.com>:
>>> >>>> It's the way I'm timing it that's messed up. I was using the
>>> >>>> mpirbench, which of course uses cputime not wall time. Doh.
>>> >>>>> #pragma omp section
>>> >>>>> mpn_mul_fft_mersenne (tp, l, n, nl, m, ml, k1); /* B */
>>> >>>>> }
>>> >>>>> If CFLAGS includes -fopenmp and I set OMP_NUM_THREADS=2 at runtime, an
>>> >>>>> mpn_mul_fft_full with 10^6 limbs takes a lot less time than with a
>>> >>>>> single thread.
>>> >>>>> What implementation of openmp are you using, a recent gcc? How are you
>>> >>>>> taking the timings exactly? Karatsuba with 10000 limbs is faster with
>>> >>>>> several threads than just 1 (a bit more complicated than the example
>>> >>>>> above because you need scratch space), so I expect a small speed-up in
>>> >>>>> the toom4 range is possible.
>>> >>>>> On Jun 13, 8:12 am, Bill Hart <goodwillh...@googlemail.com> wrote:
>>> >>>>>> Something is definitely wrong here. I did the same thing for Toom 7
>>> >>>>>> today and there's a similar slowdown.
>>> >>>>>> Basically it splits into two threads OK, but the sum of the total CPU
>>> >>>>>> percentage loads always sums to 1, e.g. 65.3% and 34.7%, or 45.5% and
>>> >>>>>> 54.5%, etc, and the end result is slower timings, not faster.
>>> >>>>>> I must be using OpenMP incorrectly. No one would use it if it were
>>> >>>>>> this bad. I'm just taking pairs of pointwise multiplications and
>>> >>>>>> splitting so that one of them is executed by one thread and one by the
>>> >>>>>> other.
>>> >>>>>> Bill.
>>> >>>>>> 2009/6/13 Jason Moxham <ja...@njkfrudils.plus.com>:
>>> >>>>>> > On Saturday 13 June 2009 03:22:54 Bill Hart wrote:
>>> >>>>>> >> I made a first very rough attempt to parallelise Toom4. I just
>>> >>>>>> >> inserted
>>> >>>>>> >> some
>>> >>>>>> >> #pragma parallel sections
>>> >>>>>> >> and
>>> >>>>>> >> #pragma section
>>> >>>>>> >> statements in around pairs of independent pointwise multiplications
>>> >>>>>> >> in
>>> >>>>>> >> toom4.
>>> >>>>>> >> It slows it down nicely. I've got it running about 50% slower now.
>>> >>>>>> >> This is for multiplying about 450 limbs with toom4. Each of the
>>> >>>>>> >> pointwise mults is then going to be just over 100 limbs.
>>> >>>>>> >> Hmm, oh well. It's amazing how pathetic the documentation of openmp
>>> >>>>>> >> is
>>> >>>>>> >> on the web. It can't be this bad, or no one would use it!
>>> >>>>>> >> I'll try toom7 tomorrow maybe, see if that is any better. The odd
You're absolutely right. I missed that in my original and my new
version. Let me try and get this right.
And no problems with the code. We won't use it unless you explicitly
indicate you are contributing it to the project, say after you checked
it was OK with your institution - in the event that you did work on
something you wanted to contribute.
> On Jun 14, 7:41 am, Bill Hart <goodwillh...@googlemail.com> wrote:
>> Sure. From line 204 of tune/time.c I just hack it to make it think it
>> does not have access to getrusage, but only gettimeofday:
>> That does seem to give what I expect for the FFT.
>> I've attached the mul_n.c which does what I think you said you tried.
> I think your version doesn't pass the testsuite with
> OMP_NUM_THREADS=2. Of the 3 recursive calls to karamul, the first
> reads p while the second and third write to p, so you can't make the
> first in parallel to the others unless you use some other place to
> store the temporary values. Since you end up with one thread writing
> to p and one reading from it, the synchronization may be causing your
> slow-down.
> What I tried is something like:
> http://www.loria.fr/~glisse/tmp/mul_n.c > (I purposely try to avoid sending explicit code because I am not sure
> I am allowed to, but this should be trivial enough)
On Jun 14, 10:01 am, Bill Hart <goodwillh...@googlemail.com> wrote:
> And no problems with the code. We won't use it unless you explicitly
> indicate you are contributing it to the project, say after you checked
> it was OK with your institution - in the event that you did work on
> something you wanted to contribute.
Anything I post on this list, you can use. I am just explaining why I
am not posting much :-/
It looks like at around 3000 limbs with 3 threads one gets a better
than 2x speedup.
Of course at 3000 limbs there are about 6 recursions before it hits
basecase using just karatsuba. In the real world we'd be well into the
FFT range by the time that many recursions had occurred with
Toom3/4/7.
Having said that, the usual score at 131072 x 131072 when toom7 is
used is 2074, and at 192000,192000 the score is usually 1153, so we
actually beat those. with parallel karatsuba. At 640000,640000 the
score is usually 246, so we don't beat that.
There is hope, but it is unclear whether we'll ever get a worthwhile
speedup at the karatsuba range using openmp.
> On Jun 14, 10:01 am, Bill Hart <goodwillh...@googlemail.com> wrote:
>> And no problems with the code. We won't use it unless you explicitly
>> indicate you are contributing it to the project, say after you checked
>> it was OK with your institution - in the event that you did work on
>> something you wanted to contribute.
> Anything I post on this list, you can use. I am just explaining why I
> am not posting much :-/
> It looks like at around 3000 limbs with 3 threads one gets a better
> than 2x speedup.
> Of course at 3000 limbs there are about 6 recursions before it hits
> basecase using just karatsuba. In the real world we'd be well into the
> FFT range by the time that many recursions had occurred with
> Toom3/4/7.
> Having said that, the usual score at 131072 x 131072 when toom7 is
> used is 2074, and at 192000,192000 the score is usually 1153, so we
> actually beat those. with parallel karatsuba. At 640000,640000 the
> score is usually 246, so we don't beat that.
> There is hope, but it is unclear whether we'll ever get a worthwhile
> speedup at the karatsuba range using openmp.
>> On Jun 14, 10:01 am, Bill Hart <goodwillh...@googlemail.com> wrote:
>>> And no problems with the code. We won't use it unless you explicitly
>>> indicate you are contributing it to the project, say after you checked
>>> it was OK with your institution - in the event that you did work on
>>> something you wanted to contribute.
>> Anything I post on this list, you can use. I am just explaining why I
>> am not posting much :-/
I tried parallelising just the toom3 (and setting the toom4 cutoff
extremely high, etc). As there are 5 pointwise mults for toom3, one
can only make two pairs of parallelised pointwise mults and use 2
threads. With a little rearrangement and addition of an extra
temporary space, this works. Here are the times:
Again, nothing to write home about. At 131072 and 192000 limbs we get
just slightly better performance than with a tuned single threaded
implementation.
Bill.
2009/6/14 Bill Hart <goodwillh...@googlemail.com>:
>> It looks like at around 3000 limbs with 3 threads one gets a better
>> than 2x speedup.
>> Of course at 3000 limbs there are about 6 recursions before it hits
>> basecase using just karatsuba. In the real world we'd be well into the
>> FFT range by the time that many recursions had occurred with
>> Toom3/4/7.
>> Having said that, the usual score at 131072 x 131072 when toom7 is
>> used is 2074, and at 192000,192000 the score is usually 1153, so we
>> actually beat those. with parallel karatsuba. At 640000,640000 the
>> score is usually 246, so we don't beat that.
>> There is hope, but it is unclear whether we'll ever get a worthwhile
>> speedup at the karatsuba range using openmp.
>>> On Jun 14, 10:01 am, Bill Hart <goodwillh...@googlemail.com> wrote:
>>>> And no problems with the code. We won't use it unless you explicitly
>>>> indicate you are contributing it to the project, say after you checked
>>>> it was OK with your institution - in the event that you did work on
>>>> something you wanted to contribute.
>>> Anything I post on this list, you can use. I am just explaining why I
>>> am not posting much :-/
Here we just beat ordinary mpir at 131072 bits (2000 limbs), there is
hope at 192000 bits (3000 limbs) and the bench at 640000 bits (100000
limbs) is usually 246, so we are just ahead there.
Bill.
2009/6/14 Bill Hart <goodwillh...@googlemail.com>:
> I tried parallelising just the toom3 (and setting the toom4 cutoff
> extremely high, etc). As there are 5 pointwise mults for toom3, one
> can only make two pairs of parallelised pointwise mults and use 2
> threads. With a little rearrangement and addition of an extra
> temporary space, this works. Here are the times:
> Again, nothing to write home about. At 131072 and 192000 limbs we get
> just slightly better performance than with a tuned single threaded
> implementation.
> Bill.
> 2009/6/14 Bill Hart <goodwillh...@googlemail.com>:
>> Oops, the attachment.
>> Bill.
>> 2009/6/14 Bill Hart <goodwillh...@googlemail.com>:
>>> I attach a version of mul_n.c which does pass tests.
>>> It looks like at around 3000 limbs with 3 threads one gets a better
>>> than 2x speedup.
>>> Of course at 3000 limbs there are about 6 recursions before it hits
>>> basecase using just karatsuba. In the real world we'd be well into the
>>> FFT range by the time that many recursions had occurred with
>>> Toom3/4/7.
>>> Having said that, the usual score at 131072 x 131072 when toom7 is
>>> used is 2074, and at 192000,192000 the score is usually 1153, so we
>>> actually beat those. with parallel karatsuba. At 640000,640000 the
>>> score is usually 246, so we don't beat that.
>>> There is hope, but it is unclear whether we'll ever get a worthwhile
>>> speedup at the karatsuba range using openmp.
>>>> On Jun 14, 10:01 am, Bill Hart <goodwillh...@googlemail.com> wrote:
>>>>> And no problems with the code. We won't use it unless you explicitly
>>>>> indicate you are contributing it to the project, say after you checked
>>>>> it was OK with your institution - in the event that you did work on
>>>>> something you wanted to contribute.
>>>> Anything I post on this list, you can use. I am just explaining why I
>>>> am not posting much :-/
On Jun 14, 10:54 am, Bill Hart <goodwillh...@googlemail.com> wrote:
> I attach a version of mul_n.c which does pass tests.
> Here are the timings:
[...]
> It looks like at around 3000 limbs with 3 threads one gets a better
> than 2x speedup.
That looks better indeed.
With 2 threads (I don't have any hardware to play with 3 threads,
which is what your implementation is targeting), at 1000 limbs, your
version is just breaking even with the single-threaded one, whereas
the one I posted (a proof of concept targeting 2 threads) already
shows a 20% speedup at least. I wonder how hard it will be to make
something that works well whatever the number of threads you give
it...
> There is hope, but it is unclear whether we'll ever get a worthwhile
> speedup at the karatsuba range using openmp.
Indeed. But I think Jason suggested this mostly (or at least
partially) as an easy way to find out how much hope there is for a
parallel toom, and it doesn't look so bad. Anyway, the most
interesting part is the FFT, that offers many possibilities for
parallelism, but where you really need to consider memory locality.
I also expect that implementations are making progress on reducing the
overhead, at some point it will be necessary to do some comparisons of
the various implementations available.