Gmail Calendar Documents Reader Web more »
Recently Visited Groups | Help | Sign in
Google Groups Home
First attempt at openmp
There are currently too many topics in this group that display first. To make this topic appear first, remove this option from another topic.
There was an error processing your request. Please try again.
flag
  Messages 1 - 25 of 87 - Collapse all  -  Translate all to Translated (View all originals)   Newer >
The group you are posting to is a Usenet group. Messages posted to this group will make your email address visible to anyone on the Internet.
Your reply message has not been sent.
Your post was successful
 
From:
To:
Cc:
Followup To:
Add Cc | Add Followup-to | Edit Subject
Subject:
Validation:
For verification purposes please type the characters you see in the picture below or the numbers you hear by clicking the accessibility icon. Listen and type the numbers you hear
 
Bill Hart  
View profile  
 More options Jun 12, 10:22 pm
From: Bill Hart <goodwillh...@googlemail.com>
Date: Sat, 13 Jun 2009 03:22:54 +0100
Local: Fri, Jun 12 2009 10:22 pm
Subject: First attempt at openmp
I made a first very rough attempt to parallelise Toom4. I just inserted some

#pragma parallel sections

and

#pragma section

statements in around pairs of independent pointwise multiplications in toom4.

It slows it down nicely. I've got it running about 50% slower now.

This is for multiplying about 450 limbs with toom4. Each of the
pointwise mults is then going to be just over 100 limbs.

Hmm, oh well. It's amazing how pathetic the documentation of openmp is
on the web. It can't be this bad, or no one would use it!

I'll try toom7 tomorrow maybe, see if that is any better. The odd
thing is, 2 cpus start up, but they run at like 45% each.

Bill.


    Reply to author    Forward  
You must Sign in before you can post messages.
To post a message you must first join this group.
Please update your nickname on the subscription settings page before posting.
You do not have the permission required to post.
Jason Moxham  
View profile  
 More options Jun 12, 10:58 pm
From: Jason Moxham <ja...@njkfrudils.plus.com>
Date: Sat, 13 Jun 2009 03:58:03 +0100
Local: Fri, Jun 12 2009 10:58 pm
Subject: Re: First attempt at openmp
On Saturday 13 June 2009 03:22:54 Bill Hart wrote:

My guess is that overheads/syncing are killing it.What I would try is the kara
mul but at sizes 100 to 30000 limbs , just to see how it performs. I'm not
suggesting we use karamul at these sizes , but just get some understanding of
how the parelllelization goes  with a simple algorithm.


    Reply to author    Forward  
You must Sign in before you can post messages.
To post a message you must first join this group.
Please update your nickname on the subscription settings page before posting.
You do not have the permission required to post.
Bill Hart  
View profile  
 More options Jun 13, 11:12 am
From: Bill Hart <goodwillh...@googlemail.com>
Date: Sat, 13 Jun 2009 16:12:13 +0100
Local: Sat, Jun 13 2009 11:12 am
Subject: Re: First attempt at openmp
Something is definitely wrong here. I did the same thing for Toom 7
today and there's a similar slowdown.

Basically it splits into two threads OK, but the sum of the total CPU
percentage loads always sums to 1, e.g. 65.3% and 34.7%, or 45.5% and
54.5%, etc, and the end result is slower timings, not faster.

I must be using OpenMP incorrectly. No one would use it if it were
this bad. I'm just taking pairs of pointwise multiplications and
splitting so that one of them is executed by one thread and one by the
other.

Bill.

2009/6/13 Jason Moxham <ja...@njkfrudils.plus.com>:


    Reply to author    Forward  
You must Sign in before you can post messages.
To post a message you must first join this group.
Please update your nickname on the subscription settings page before posting.
You do not have the permission required to post.
Jason Martin  
View profile  
 More options Jun 13, 3:16 pm
From: Jason Martin <jason.worth.mar...@gmail.com>
Date: Sat, 13 Jun 2009 15:16:26 -0400
Local: Sat, Jun 13 2009 3:16 pm
Subject: Re: First attempt at openmp
The best documentation for OpenMP is here:

http://openmp.org/wp/openmp-specifications/

get the complete specification.

Try setting the environmental variable OMP_NUM_THREADS to 2 or 4 (if
you have 4 cores available) so that it doesn't try to create too many
threads (which I found to be the biggest problem).

--jason

Jason Worth Martin
Asst. Professor of Mathematics
http://www.math.jmu.edu/~martin


    Reply to author    Forward  
You must Sign in before you can post messages.
To post a message you must first join this group.
Please update your nickname on the subscription settings page before posting.
You do not have the permission required to post.
Jason Martin  
View profile  
 More options Jun 13, 3:18 pm
From: Jason Martin <jason.worth.mar...@gmail.com>
Date: Sat, 13 Jun 2009 15:18:16 -0400
Local: Sat, Jun 13 2009 3:18 pm
Subject: Re: First attempt at openmp
Actually, you might want to bound the thread nesting depth, too.
(Assuming that you're working with karatsuba sizes that are
recursing.)  At best, OpenMP is only going to buy us something at the
very top level...

Jason Worth Martin
Asst. Professor of Mathematics
http://www.math.jmu.edu/~martin

On Sat, Jun 13, 2009 at 3:16 PM, Jason


    Reply to author    Forward  
You must Sign in before you can post messages.
To post a message you must first join this group.
Please update your nickname on the subscription settings page before posting.
You do not have the permission required to post.
Marc  
View profile  
 More options Jun 13, 3:19 pm
From: Marc <marc.gli...@gmail.com>
Date: Sat, 13 Jun 2009 12:19:56 -0700 (PDT)
Local: Sat, Jun 13 2009 3:19 pm
Subject: Re: First attempt at openmp
Hello,

to check whether things are working properly, you could try this in
mpn_mul_fft_full_a:

  tp = __GMP_ALLOCATE_FUNC_LIMBS (l);

#pragma omp parallel sections
{
#pragma omp section
  muh = mpn_mul_fft (op, h, n, nl, m, ml, k2); /* mu = muh+{op,h} */

#pragma omp section
  mpn_mul_fft_mersenne (tp, l, n, nl, m, ml, k1); /* B */

}

If CFLAGS includes -fopenmp and I set OMP_NUM_THREADS=2 at runtime, an
mpn_mul_fft_full with 10^6 limbs takes a lot less time than with a
single thread.

What implementation of openmp are you using, a recent gcc? How are you
taking the timings exactly? Karatsuba with 10000 limbs is faster with
several threads than just 1 (a bit more complicated than the example
above because you need scratch space), so I expect a small speed-up in
the toom4 range is possible.

On Jun 13, 8:12 am, Bill Hart <goodwillh...@googlemail.com> wrote:


    Reply to author    Forward  
You must Sign in before you can post messages.
To post a message you must first join this group.
Please update your nickname on the subscription settings page before posting.
You do not have the permission required to post.
Bill Hart  
View profile  
 More options Jun 13, 9:37 pm
From: Bill Hart <goodwillh...@googlemail.com>
Date: Sun, 14 Jun 2009 02:37:16 +0100
Local: Sat, Jun 13 2009 9:37 pm
Subject: Re: First attempt at openmp
It's the way I'm timing it that's messed up. I was using the
mpirbench, which of course uses cputime not wall time. Doh.

I'll switch it over to use wall time.

Bill.

2009/6/13 Marc <marc.gli...@gmail.com>:


    Reply to author    Forward  
You must Sign in before you can post messages.
To post a message you must first join this group.
Please update your nickname on the subscription settings page before posting.
You do not have the permission required to post.
Bill Hart  
View profile  
 More options Jun 13, 10:02 pm
From: Bill Hart <goodwillh...@googlemail.com>
Date: Sun, 14 Jun 2009 03:02:55 +0100
Local: Sat, Jun 13 2009 10:02 pm
Subject: Re: First attempt at openmp
The FFT is about 50% faster at 32000 limbs and nearly twice as fast at
1000000 limbs, both of which are quite acceptable speedups.

I'm quite surprised to find toom7 only about 10% faster at 2000 limbs,
at best. That certainly wouldn't represent value for money, given that
two CPU cores are being used.

I think the next experiment will be the one suggested by Jason. See
how Karatsuba performs for different sizes up to say 30000 limbs.

Bill.

2009/6/14 Bill Hart <goodwillh...@googlemail.com>:


    Reply to author    Forward  
You must Sign in before you can post messages.
To post a message you must first join this group.
Please update your nickname on the subscription settings page before posting.
You do not have the permission required to post.
Bill Hart  
View profile  
 More options Jun 13, 10:42 pm
From: Bill Hart <goodwillh...@googlemail.com>
Date: Sun, 14 Jun 2009 03:42:25 +0100
Local: Sat, Jun 13 2009 10:42 pm
Subject: Re: First attempt at openmp
Even at 300000 limbs karatsuba seems to run twice as slow. Seems very
bad. That's a 4s computation. And that's with OMP_NESTED=false.

I'm using speed to time it and speed shows an almost two times speedup
for the parallelised FFT. So I don't know what is going on. It must be
nesting even though I've told it not to.

Bill.

2009/6/14 Bill Hart <goodwillh...@googlemail.com>:


    Reply to author    Forward  
You must Sign in before you can post messages.
To post a message you must first join this group.
Please update your nickname on the subscription settings page before posting.
You do not have the permission required to post.
Jason Moxham  
View profile  
 More options Jun 13, 10:46 pm
From: "Jason Moxham" <ja...@njkfrudils.plus.com>
Date: Sun, 14 Jun 2009 03:46:48 +0100
Local: Sat, Jun 13 2009 10:46 pm
Subject: Re: First attempt at openmp
Your comparing kara and base against kara and base with openmp.
You have ONLY kara and basecase mul , nothing else?


    Reply to author    Forward  
You must Sign in before you can post messages.
To post a message you must first join this group.
Please update your nickname on the subscription settings page before posting.
You do not have the permission required to post.
Bill Hart  
View profile  
 More options Jun 13, 10:56 pm
From: Bill Hart <goodwillh...@googlemail.com>
Date: Sun, 14 Jun 2009 03:56:05 +0100
Local: Sat, Jun 13 2009 10:56 pm
Subject: Re: First attempt at openmp
I'm timing mpn_kara_mul_n using speed. It's just kara and base against
kara and base with openmp.

I tried forcing it to not nest any parallelism by having openmp
pragmas only in the outer level (it calls off to an identical copy of
mpn_kara_mul_n without openmp pragmas instead of recursing to the same
function. No change. Still twice as slow.

It's 50% slower with only 2 threads instead of 3.

They are pretty enormous operands. Perhaps it just overwhelms the
cache more quickly with two processors working at the same time.

Bill.

2009/6/14 Jason Moxham <ja...@njkfrudils.plus.com>:


    Reply to author    Forward  
You must Sign in before you can post messages.
To post a message you must first join this group.
Please update your nickname on the subscription settings page before posting.
You do not have the permission required to post.
Jason Moxham  
View profile  
 More options Jun 13, 11:01 pm
From: "Jason Moxham" <ja...@njkfrudils.plus.com>
Date: Sun, 14 Jun 2009 04:01:07 +0100
Local: Sat, Jun 13 2009 11:01 pm
Subject: Re: First attempt at openmp
Perhaps we are hitting main memory or L3 bandwidth ? , doesn't seem likely
but perhaps it is
basecase runs at 2.5c/l and it takes 2 reads and 1 write per limb


    Reply to author    Forward  
You must Sign in before you can post messages.
To post a message you must first join this group.
Please update your nickname on the subscription settings page before posting.
You do not have the permission required to post.
Jason Moxham  
View profile  
 More options Jun 13, 11:15 pm
From: "Jason Moxham" <ja...@njkfrudils.plus.com>
Date: Sun, 14 Jun 2009 04:15:07 +0100
Local: Sat, Jun 13 2009 11:15 pm
Subject: Re: First attempt at openmp
yeah I think thats it
if we run a 10^6 basecase this will take 3*10^12 mem ops
so adding another thread wont help because we are allready waiting for
memory to respond


    Reply to author    Forward  
You must Sign in before you can post messages.
To post a message you must first join this group.
Please update your nickname on the subscription settings page before posting.
You do not have the permission required to post.
Bill Hart  
View profile  
 More options Jun 13, 11:18 pm
From: Bill Hart <goodwillh...@googlemail.com>
Date: Sun, 14 Jun 2009 04:18:02 +0100
Local: Sat, Jun 13 2009 11:18 pm
Subject: Re: First attempt at openmp
The times jump around a lot. At 1000 limbs it varies from 50% slower
to about the same speed.

At 10000 limbs it is 50% slower, twice as slow at 30000 limbs.

At 300 limbs it is only about 30% slower.

2009/6/14 Jason Moxham <ja...@njkfrudils.plus.com>:


    Reply to author    Forward  
You must Sign in before you can post messages.
To post a message you must first join this group.
Please update your nickname on the subscription settings page before posting.
You do not have the permission required to post.
Marc  
View profile  
 More options Jun 14, 1:56 am
From: Marc <marc.gli...@gmail.com>
Date: Sat, 13 Jun 2009 22:56:19 -0700 (PDT)
Local: Sun, Jun 14 2009 1:56 am
Subject: Re: First attempt at openmp
Could you share the patches you are using, both to measure the speed
and to parallelize karatsuba? Last time I tried, just making the last
two calls to mpn_kara_mul_n before the /* Interpolate */ comment in
parallel sections (and giving the last one a fresh buffer instead of ws
+n), the second thread increased the speed for 1000, 10000 or 300000
limbs, it did not make it slower (I used 2 threads, no nesting). For
300000, it is pretty close to a 2x speed-up.

On Jun 13, 8:18 pm, Bill Hart <goodwillh...@googlemail.com> wrote:


    Reply to author    Forward  
You must Sign in before you can post messages.
To post a message you must first join this group.
Please update your nickname on the subscription settings page before posting.
You do not have the permission required to post.
Bill Hart  
View profile  
 More options Jun 14, 10:41 am
From: Bill Hart <goodwillh...@googlemail.com>
Date: Sun, 14 Jun 2009 15:41:32 +0100
Local: Sun, Jun 14 2009 10:41 am
Subject: Re: First attempt at openmp

Sure. From line 204 of tune/time.c I just hack it to make it think it
does not have access to getrusage, but only gettimeofday:

#if 1 //&& defined( _MSC_VER)
#define HAVE_GETRUSAGE      0
#define HAVE_GETTIMEOFDAY   1
//#include "getrusage.h"
//#include "gettimeofday.h"
#endif

That does seem to give what I expect for the FFT.

I've attached the mul_n.c which does what I think you said you tried.

Bill.

2009/6/14 Marc <marc.gli...@gmail.com>:

  mul_n.c
24K Download

    Reply to author    Forward  
You must Sign in before you can post messages.
To post a message you must first join this group.
Please update your nickname on the subscription settings page before posting.
You do not have the permission required to post.
Marc  
View profile  
 More options Jun 14, 12:56 pm
From: Marc <marc.gli...@gmail.com>
Date: Sun, 14 Jun 2009 09:56:14 -0700 (PDT)
Local: Sun, Jun 14 2009 12:56 pm
Subject: Re: First attempt at openmp
On Jun 14, 7:41 am, Bill Hart <goodwillh...@googlemail.com> wrote:

> Sure. From line 204 of tune/time.c I just hack it to make it think it
> does not have access to getrusage, but only gettimeofday:

> #if 1 //&& defined( _MSC_VER)
> #define HAVE_GETRUSAGE      0
> #define HAVE_GETTIMEOFDAY   1
> //#include "getrusage.h"
> //#include "gettimeofday.h"
> #endif

> That does seem to give what I expect for the FFT.

> I've attached the mul_n.c which does what I think you said you tried.

I think your version doesn't pass the testsuite with
OMP_NUM_THREADS=2. Of the 3 recursive calls to karamul, the first
reads p while the second and third write to p, so you can't make the
first in parallel to the others unless you use some other place to
store the temporary values. Since you end up with one thread writing
to p and one reading from it, the synchronization may be causing your
slow-down.

What I tried is something like:
http://www.loria.fr/~glisse/tmp/mul_n.c
(I purposely try to avoid sending explicit code because I am not sure
I am allowed to, but this should be trivial enough)


    Reply to author    Forward  
You must Sign in before you can post messages.
To post a message you must first join this group.
Please update your nickname on the subscription settings page before posting.
You do not have the permission required to post.
Bill Hart  
View profile  
 More options Jun 14, 12:56 pm
From: Bill Hart <goodwillh...@googlemail.com>
Date: Sun, 14 Jun 2009 17:56:22 +0100
Local: Sun, Jun 14 2009 12:56 pm
Subject: Re: First attempt at openmp

OK, I did what Jason suggested and made the toom3 cutoff really high,
etc, so karatsuba is always used. I also modified the bench program to
time lots of different sizes. Here are the results, which are more
what I expected to see:

1 thread:
      MPIRbench.base.multiply.6400,6400 result: 170945
      MPIRbench.base.multiply.19200,19200 result: 28837
      MPIRbench.base.multiply.64000,64000 result: 4299
      MPIRbench.base.multiply.131072,131072 result: 1391
      MPIRbench.base.multiply.192000,192000 result: 722
      MPIRbench.base.multiply.640000,640000 result: 106
      MPIRbench.base.multiply.2097152,2097152 result: 16.5
      MPIRbench.base.multiply.6400000,6400000 result: 2.65

2 threads:
      MPIRbench.base.multiply.6400,6400 result: 45924
      MPIRbench.base.multiply.19200,19200 result: 21737
      MPIRbench.base.multiply.64000,64000 result: 4755
      MPIRbench.base.multiply.131072,131072 result: 1425
      MPIRbench.base.multiply.192000,192000 result: 890
      MPIRbench.base.multiply.640000,640000 result: 137
      MPIRbench.base.multiply.2097152,2097152 result: 17.8
      MPIRbench.base.multiply.6400000,6400000 result: 3.45

 3 threads:
      MPIRbench.base.multiply.6400,6400 result: 39237
      MPIRbench.base.multiply.19200,19200 result: 21635
      MPIRbench.base.multiply.64000,64000 result: 7055
      MPIRbench.base.multiply.131072,131072 result: 2332
      MPIRbench.base.multiply.192000,192000 result: 1546
      MPIRbench.base.multiply.640000,640000 result: 246
      MPIRbench.base.multiply.2097152,2097152 result: 31.3
      MPIRbench.base.multiply.6400000,6400000 result: 6.58

I think whatever I did to time.c wasn't right. So I'll stick to timing
with make bench.

I've attached my copy of runbench and timing.h which I hacked so it
would time parallel code correctly. These are from the bench
directory.

All I did to modify mul_n.c was put openmp pragmas around the central
3 calls to mpn_kara_mul_n and allocate extra scratch space:

      else
        {
                mp_limb_t tp[MPN_KARA_MUL_N_TSIZE (n2)];
                mp_limb_t up[MPN_KARA_MUL_N_TSIZE (n2)];
#pragma omp parallel sections
                {
#pragma omp section
          mpn_kara_mul_n (ws, p, p + n2, n2, ws + n);
#pragma omp section
          mpn_kara_mul_n (p, a, b, n2, tp);
#pragma omp section
     mpn_kara_mul_n (p + n, a + n2, b + n2, n2, up);
                }
        }

and of course in the appropriate gmp-mparam.h for my architecture I
changed the toom3, toom4 and toom7 cutoffs to very large values.

Naturally this is all very hackish at this stage and not much use as
no one in their right mind is going to use karatsuba for 1000 limbs.
Nothing we've tried scales beyond two or three threads either. But at
least openmp does work.

I think the goal should be to have algorithms which scale well with
the number of threads and have everything tuned so that for n = 2
threads there is at least a 50% speedup and for n > 2 threads, there
is at least an n-1 times speedup.

Bill.

2009/6/14 Bill Hart <goodwillh...@googlemail.com>:

...

read more »

  runbench
5K Download

  timing.h
2K Download

    Reply to author    Forward  
You must Sign in before you can post messages.
To post a message you must first join this group.
Please update your nickname on the subscription settings page before posting.
You do not have the permission required to post.
Bill Hart  
View profile  
 More options Jun 14, 1:01 pm
From: Bill Hart <goodwillh...@googlemail.com>
Date: Sun, 14 Jun 2009 18:01:37 +0100
Local: Sun, Jun 14 2009 1:01 pm
Subject: Re: First attempt at openmp
You're absolutely right. I missed that in my original and my new
version. Let me try and get this right.

And no problems with the code. We won't use it unless you explicitly
indicate you are contributing it to the project, say after you checked
it was OK with your institution - in the event that you did work on
something you wanted to contribute.

Bill.

2009/6/14 Marc <marc.gli...@gmail.com>:


    Reply to author    Forward  
You must Sign in before you can post messages.
To post a message you must first join this group.
Please update your nickname on the subscription settings page before posting.
You do not have the permission required to post.
Marc  
View profile  
 More options Jun 14, 1:14 pm
From: Marc <marc.gli...@gmail.com>
Date: Sun, 14 Jun 2009 10:14:19 -0700 (PDT)
Local: Sun, Jun 14 2009 1:14 pm
Subject: Re: First attempt at openmp
On Jun 14, 10:01 am, Bill Hart <goodwillh...@googlemail.com> wrote:

> And no problems with the code. We won't use it unless you explicitly
> indicate you are contributing it to the project, say after you checked
> it was OK with your institution - in the event that you did work on
> something you wanted to contribute.

Anything I post on this list, you can use. I am just explaining why I
am not posting much :-/

    Reply to author    Forward  
You must Sign in before you can post messages.
To post a message you must first join this group.
Please update your nickname on the subscription settings page before posting.
You do not have the permission required to post.
Bill Hart  
View profile  
 More options Jun 14, 1:54 pm
From: Bill Hart <goodwillh...@googlemail.com>
Date: Sun, 14 Jun 2009 18:54:08 +0100
Local: Sun, Jun 14 2009 1:54 pm
Subject: Re: First attempt at openmp
I attach a version of mul_n.c which does pass tests.

Here are the timings:

1 thread:
      MPIRbench.base.multiply.6400,6400 result: 170945
      MPIRbench.base.multiply.19200,19200 result: 28837
      MPIRbench.base.multiply.64000,64000 result: 4299
      MPIRbench.base.multiply.131072,131072 result: 1391
      MPIRbench.base.multiply.192000,192000 result: 722
      MPIRbench.base.multiply.640000,640000 result: 106
      MPIRbench.base.multiply.2097152,2097152 result: 16.5

2 threads:
      MPIRbench.base.multiply.6400,6400 result: 54658
      MPIRbench.base.multiply.19200,19200 result: 20500
      MPIRbench.base.multiply.64000,64000 result: 4287
      MPIRbench.base.multiply.131072,131072 result: 1467
      MPIRbench.base.multiply.192000,192000 result: 876
      MPIRbench.base.multiply.640000,640000 result: 127
      MPIRbench.base.multiply.2097152,2097152 result: 18.0

3 threads:
      MPIRbench.base.multiply.6400,6400 result: 39920
      MPIRbench.base.multiply.19200,19200 result: 21620
      MPIRbench.base.multiply.64000,64000 result: 6434
      MPIRbench.base.multiply.131072,131072 result: 2310
      MPIRbench.base.multiply.192000,192000 result: 1560
      MPIRbench.base.multiply.640000,640000 result: 228
      MPIRbench.base.multiply.2097152,2097152 result: 32.9

It looks like at around 3000 limbs with 3 threads one gets a better
than 2x speedup.

Of course at 3000 limbs there are about 6 recursions before it hits
basecase using just karatsuba. In the real world we'd be well into the
FFT range by the time that many recursions had occurred with
Toom3/4/7.

Having said that, the usual score at 131072 x 131072 when toom7 is
used is 2074, and at 192000,192000 the score is usually 1153, so we
actually beat those. with parallel karatsuba. At 640000,640000 the
score is usually 246, so we don't beat that.

There is hope, but it is unclear whether we'll ever get a worthwhile
speedup at the karatsuba range using openmp.

Bill.

2009/6/14 Marc <marc.gli...@gmail.com>:


    Reply to author    Forward  
You must Sign in before you can post messages.
To post a message you must first join this group.
Please update your nickname on the subscription settings page before posting.
You do not have the permission required to post.
Bill Hart  
View profile  
 More options Jun 14, 1:54 pm
From: Bill Hart <goodwillh...@googlemail.com>
Date: Sun, 14 Jun 2009 18:54:46 +0100
Local: Sun, Jun 14 2009 1:54 pm
Subject: Re: First attempt at openmp

Oops, the attachment.

Bill.

2009/6/14 Bill Hart <goodwillh...@googlemail.com>:

  mul_n.c
18K Download

    Reply to author    Forward  
You must Sign in before you can post messages.
To post a message you must first join this group.
Please update your nickname on the subscription settings page before posting.
You do not have the permission required to post.
Bill Hart  
View profile  
 More options Jun 14, 2:48 pm
From: Bill Hart <goodwillh...@googlemail.com>
Date: Sun, 14 Jun 2009 19:48:42 +0100
Local: Sun, Jun 14 2009 2:48 pm
Subject: Re: First attempt at openmp
I tried parallelising just the toom3 (and setting the toom4 cutoff
extremely high, etc). As there are 5 pointwise mults for toom3, one
can only make two pairs of parallelised pointwise mults and use 2
threads. With a little rearrangement and addition of an extra
temporary space, this works. Here are the times:

      MPIRbench.base.multiply.6400,6400 result: 34044
      MPIRbench.base.multiply.19200,19200 result: 10874
      MPIRbench.base.multiply.64000,64000 result: 4090
      MPIRbench.base.multiply.131072,131072 result: 2140
      MPIRbench.base.multiply.192000,192000 result: 1161
      MPIRbench.base.multiply.640000,640000 result: 234
      MPIRbench.base.multiply.2097152,2097152 result: 43.0
      MPIRbench.base.multiply.6400000,6400000 result: 8.28
      MPIRbench.base.multiply.19200000,19200000 result: 1.65

Again, nothing to write home about. At 131072 and 192000 limbs we get
just slightly better performance than with a tuned single threaded
implementation.

Bill.

2009/6/14 Bill Hart <goodwillh...@googlemail.com>:


    Reply to author    Forward  
You must Sign in before you can post messages.
To post a message you must first join this group.
Please update your nickname on the subscription settings page before posting.
You do not have the permission required to post.
Bill Hart  
View profile  
 More options Jun 14, 3:15 pm
From: Bill Hart <goodwillh...@googlemail.com>
Date: Sun, 14 Jun 2009 20:15:41 +0100
Local: Sun, Jun 14 2009 3:15 pm
Subject: Re: First attempt at openmp
Herewith timings for just toom4 parallelised with 2 threads:

      MPIRbench.base.multiply.6400,6400 result: 173445
      MPIRbench.base.multiply.19200,19200 result: 31849
      MPIRbench.base.multiply.64000,64000 result: 5632
      MPIRbench.base.multiply.131072,131072 result: 2107
      MPIRbench.base.multiply.192000,192000 result: 1482
      MPIRbench.base.multiply.640000,640000 result: 299
      MPIRbench.base.multiply.2097152,2097152 result: 53.1
      MPIRbench.base.multiply.6400000,6400000 result: 11.8
      MPIRbench.base.multiply.19200000,19200000 result: 2.43

Here we just beat ordinary mpir at 131072 bits (2000 limbs), there is
hope at 192000 bits (3000 limbs) and the bench at 640000 bits (100000
limbs) is usually 246, so we are just ahead there.

Bill.

2009/6/14 Bill Hart <goodwillh...@googlemail.com>:


    Reply to author    Forward  
You must Sign in before you can post messages.
To post a message you must first join this group.
Please update your nickname on the subscription settings page before posting.
You do not have the permission required to post.
Marc  
View profile  
 More options Jun 14, 3:18 pm
From: Marc <marc.gli...@gmail.com>
Date: Sun, 14 Jun 2009 12:18:45 -0700 (PDT)
Local: Sun, Jun 14 2009 3:18 pm
Subject: Re: First attempt at openmp
On Jun 14, 10:54 am, Bill Hart <goodwillh...@googlemail.com> wrote:

> I attach a version of mul_n.c which does pass tests.

> Here are the timings:
[...]
> It looks like at around 3000 limbs with 3 threads one gets a better
> than 2x speedup.

That looks better indeed.

With 2 threads (I don't have any hardware to play with 3 threads,
which is what your implementation is targeting), at 1000 limbs, your
version is just breaking even with the single-threaded one, whereas
the one I posted (a proof of concept targeting 2 threads) already
shows a 20% speedup at least. I wonder how hard it will be to make
something that works well whatever the number of threads you give
it...

> There is hope, but it is unclear whether we'll ever get a worthwhile
> speedup at the karatsuba range using openmp.

Indeed.  But I think Jason suggested this mostly (or at least
partially) as an easy way to find out how much hope there is for a
parallel toom, and it doesn't look so bad. Anyway, the most
interesting part is the FFT, that offers many possibilities for
parallelism, but where you really need to consider memory locality.

I also expect that implementations are making progress on reducing the
overhead, at some point it will be necessary to do some comparisons of
the various implementations available.


    Reply to author    Forward  
You must Sign in before you can post messages.
To post a message you must first join this group.
Please update your nickname on the subscription settings page before posting.
You do not have the permission required to post.
Messages 1 - 25 of 87   Newer >
« Back to Discussions « Newer topic     Older topic »

Create a group - Google Groups - Google Home - Terms of Service - Privacy Policy
©2009 Google