I put #pragma omp parallel sections and #pragma omp section statements
around pairs of independent pointwise multiplications in toom4.
It slows it down nicely. I've got it running about 50% slower now.
This is for multiplying about 450 limbs with toom4. Each of the
pointwise mults is then going to be just over 100 limbs.
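Concretely the pattern is roughly the following (just a sketch with
made-up names, and mpn_mul_n standing in for whatever each pointwise
product actually calls; this isn't the real mpn_toom4_mul code):

#include <omp.h>
#include <gmp.h>   /* mpn_mul_n(rp, ap, bp, n): rp gets the 2n-limb product */

/* Two independent pointwise products from the toom4 evaluation phase,
   each writing to its own disjoint buffer, run as a pair of sections.  */
static void two_pointwise_mults(mp_ptr r0, mp_srcptr a0, mp_srcptr b0,
                                mp_ptr r1, mp_srcptr a1, mp_srcptr b1,
                                mp_size_t tn)
{
#pragma omp parallel sections num_threads(2)
    {
#pragma omp section
        mpn_mul_n(r0, a0, b0, tn);   /* first pointwise product  */
#pragma omp section
        mpn_mul_n(r1, a1, b1, tn);   /* second pointwise product */
    }
}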
Hmm, oh well. It's amazing how pathetic the documentation of openmp is
on the web. It can't be this bad, or no one would use it!
I'll try toom7 tomorrow maybe, see if that is any better. The odd
thing is, 2 cpus start up, but they run at like 45% each.
Bill.
Very hard with these algorithms I think.
>
>> There is hope, but it is unclear whether we'll ever get a worthwhile
>> speedup at the karatsuba range using openmp.
>
> Indeed. But I think Jason suggested this mostly (or at least
> partially) as an easy way to find out how much hope there is for a
> parallel toom, and it doesn't look so bad.
Sure.
> Anyway, the most
> interesting part is the FFT, that offers many possibilities for
> parallelism, but where you really need to consider memory locality.
Yes.
>
> I also expect that implementations are making progress on reducing the
> overhead, at some point it will be necessary to do some comparisons of
> the various implementations available.
Yes, though I think the main focus of reducing overhead is using less
memory. However this is precisely the thing which prevents the toom
algorithms being parallelised. So I am not too hopeful that this will
be useful.
I didn't actually read the GMP code in detail for Toom 4, so I don't
know precisely what they have done, and I see they have kept working
on it since. But we already have a pretty competitive benchmark in
that range, so I'm not sure there's too much in the way of overhead to
shave off.
Had we not implemented toom7 then parallelising toom4 would probably
look pretty good. But toom7 eventually beats toom4 by quite a bit, so
parallelising toom4 looks much worse.
There is also the question of whether we should have a toom13 or
something like that. This might make parallelising the toom7 look bad.
Essentially I feel we are best parallelising the FFT for now and
working our way down. There's certainly plenty of opportunity to
parallelise the FFT. The only problem there is we don't intend to use
the current FFT forever, as a much better one exists, and the other
one (in FLINT) can also be parallelised for many threads. But I
suppose if for the time being we can demonstrate a meaningful speedup
in the current FFT without too much work, it will be worth doing.
Bill.
I hope you guys will try some of these benchmarks/experiments on the
sage.math hardware. I have some Sun X4450s with 24 cores (4
six-core Dunnington Xeon 2.6GHz chips) and a Sun T2 with I guess 16
cores. Marc, if you want to do some serious experimentation on such
hardware let me know and I can get you an account.
-- William
Yes, that is a very important consideration. We do currently have
access to the Windows and Intel compilers.
>
>> Essentially I feel we are best parallelising the FFT for now and
>> working our way down. There's certainly plenty of opportunity to
>> parallelise the FFT. The only problem there is we don't intend to use
>> the current FFT forever, as a much better one exists, and the other
>> one (in FLINT) can also be parallelised for many threads.
>
> Any idea on when you plan to integrate it in mpir?
It will take a year probably, as the lower level stuff needs to be
done entirely in assembly (it's currently done in C), and each of the
higher levels will need to be rewritten in a manner that is consistent
with the way the rest of MPIR is written. I also expect that we can
improve it considerably.
Bill.
For toom7 I got 318 vs the usual 246, so not even 50% faster.
Are you using gcc? I'm using gcc 4.3.1. Also, did you simply
parallelise the pointwise mults in pairs or something more
sophisticated?
Bill.
Ah, you are absolutely correct. That 246 should be with the FFT. Let
me just check that....
Bill.
Let me try again and confirm that I didn't do anything wrong. Your
method of moving everything to the end might be better.
Bill.
2009/6/15 Bill Hart <goodwi...@googlemail.com>:
So your idea of putting all the multiplications at the end seems to be
a good one. It's also obviously much easier to adjust for more threads
then.
Have you tried 3 threads with your approach?
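If I've understood your approach, the shape is roughly the sketch below
(hypothetical names, not your actual patch): the evaluations just queue
up the 13 pointwise products and a single parallel loop drains the
queue, so using more threads is only a matter of the thread count.

#include <omp.h>
#include <gmp.h>            /* mpn_mul_n(rp, ap, bp, n) */

typedef struct {
    mp_ptr    rp;           /* destination for the 2n-limb product */
    mp_srcptr ap, bp;       /* the two evaluated operands          */
    mp_size_t n;            /* limbs in each operand               */
} pending_mul_t;

/* Run all queued pointwise products; the iterations are independent,
   so scaling to more threads only means changing OMP_NUM_THREADS.    */
static void run_pointwise_muls(pending_mul_t *q, int count)
{
    int i;
#pragma omp parallel for schedule(dynamic)
    for (i = 0; i < count; i++)
        mpn_mul_n(q[i].rp, q[i].ap, q[i].bp, q[i].n);
}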
PASSED CC=gcc CXX=g++ configure=
PASSED CC=gcc CXX=g++ configure=--enable-cxx --enable-gmpcompat
PASSED CC=gcc CXX=g++ configure=--enable-cxx --enable-gmpcompat --enable-assert --enable-alloca=debug
PASSED CC=gcc CXX=g++ configure=--enable-cxx --enable-gmpcompat --enable-assert --enable-alloca=debug --build=none-unknown-linux-gnu
PASSED CC=gcc CXX=g++ configure=--enable-fat
PASSED CC=gcc CXX=g++ configure=--enable-fat --enable-cxx --enable-gmpcompat
PASSED CC=gcc CXX=g++ configure=--enable-fat --enable-cxx --enable-gmpcompat --enable-assert --enable-alloca=debug
on cicero a x86 pent4
PASSED CC=gcc CXX=g++ configure=
PASSED CC=gcc CXX=g++ configure=--enable-cxx --enable-gmpcompat
PASSED CC=gcc CXX=g++ configure=--enable-cxx --enable-gmpcompat --enable-assert --enable-alloca=debug
PASSED CC=gcc CXX=g++ configure=--enable-cxx --enable-gmpcompat --enable-assert --enable-alloca=debug --build=none-pc-linux-gnu
PASSED CC=gcc CXX=g++ configure=--enable-fat
PASSED CC=gcc CXX=g++ configure=--enable-fat --enable-cxx --enable-gmpcompat
PASSED CC=gcc CXX=g++ configure=--enable-fat --enable-cxx --enable-gmpcompat --enable-assert --enable-alloca=debug
PASSED CC=gcc-4.3.3 CXX=g++-4.3.3 configure=
PASSED CC=gcc-4.3.3 CXX=g++-4.3.3 configure=--enable-cxx --enable-gmpcompat
PASSED CC=gcc-4.3.3 CXX=g++-4.3.3 configure=--enable-cxx --enable-gmpcompat --enable-assert --enable-alloca=debug
PASSED CC=gcc-4.3.3 CXX=g++-4.3.3 configure=--enable-cxx --enable-gmpcompat --enable-assert --enable-alloca=debug --build=none-pc-linux-gnu
PASSED CC=gcc-4.3.3 CXX=g++-4.3.3 configure=--enable-fat
PASSED CC=gcc-4.3.3 CXX=g++-4.3.3 configure=--enable-fat --enable-cxx --enable-gmpcompat
PASSED CC=gcc-4.3.3 CXX=g++-4.3.3 configure=--enable-fat --enable-cxx --enable-gmpcompat --enable-assert --enable-alloca=debug
PASSED CC=cc CXX=c++ configure=
PASSED CC=cc CXX=c++ configure=--enable-cxx --enable-gmpcompat
PASSED CC=cc CXX=c++ configure=--enable-cxx --enable-gmpcompat --enable-assert --enable-alloca=debug
PASSED CC=cc CXX=c++ configure=--enable-cxx --enable-gmpcompat --enable-assert --enable-alloca=debug --build=none-pc-linux-gnu
PASSED CC=cc CXX=c++ configure=--enable-fat
PASSED CC=cc CXX=c++ configure=--enable-fat --enable-cxx --enable-gmpcompat
PASSED CC=cc CXX=c++ configure=--enable-fat --enable-cxx --enable-gmpcompat --enable-assert --enable-alloca=debug
on box3 a k7
box3
PASSED CC=gcc CXX=g++ configure=
PASSED CC=gcc CXX=g++ configure=--enable-cxx --enable-gmpcompat
PASSED CC=gcc CXX=g++ configure=--enable-cxx --enable-gmpcompat --enable-assert --enable-alloca=debug
PASSED CC=gcc CXX=g++ configure=--enable-cxx --enable-gmpcompat --enable-assert --enable-alloca=debug --build=none-pc-linux-gnu
PASSED CC=gcc CXX=g++ configure=--enable-fat
PASSED CC=gcc CXX=g++ configure=--enable-fat --enable-cxx --enable-gmpcompat
PASSED CC=gcc CXX=g++ configure=--enable-fat --enable-cxx --enable-gmpcompat --enable-assert --enable-alloca=debug
but now I get this on cygwin
/bin/sh ../libtool --tag=CC --mode=compile gcc -std=gnu99 -DHAVE_CONFIG_H -I. -I/cygdrive/c/Users/jasonadmin/mpir/mpir/branches/test_stuff/../../trunk/mpn -I.. -D__GMP_WITHIN_GMP -I/cygdrive/c/Users/jasonadmin/mpir/mpir/branches/test_stuff/../../trunk -DOPERATION_`echo divis | sed 's/_$//'` -m32 -O2 -fomit-frame-pointer -mtune=nocona -march=nocona -c -o divis.lo divis.c
gcc -std=gnu99 -DHAVE_CONFIG_H -I. -I/cygdrive/c/Users/jasonadmin/mpir/mpir/branches/test_stuff/../../trunk/mpn -I.. -D__GMP_WITHIN_GMP -I/cygdrive/c/Users/jasonadmin/mpir/mpir/branches/test_stuff/../../trunk -DOPERATION_divis -m32 -O2 -fomit-frame-pointer -mtune=nocona -march=nocona -c divis.c -o divis.o
divis.c: In function `__gmpn_divisible_p':
divis.c:122: error: `SIZEOF_MP_LIMB_T' undeclared (first use in this function)
divis.c:122: error: (Each undeclared identifier is reported only once
divis.c:122: error: for each function it appears in.)
make[2]: *** [divis.lo] Error 1
make[2]: Leaving directory `/cygdrive/c/Users/jasonadmin/mpir/mpir/branches/test_stuff/box1-win64/mpn'
make[1]: *** [all-recursive] Error 1
make[1]: Leaving directory `/cygdrive/c/Users/jasonadmin/mpir/mpir/branches/test_stuff/box1-win64'
make: *** [all] Error 2
box1-win64
FAILED CC=gcc CXX=g++ configure=
I analyzed the running time of toom7 for 10^4 limbs a bit. Looking
only at the topmost call to toom7, the evaluations account for 2% of
the running time, the interpolation for 6%, and the multiplications
for 92% (I actually measured 2, 7 and 93 but rounded it so it looked
like the total wasn't more than 100 ;-). Running the 13
multiplications in 2 threads is kind of like having only 7
multiplications, and running them in 3 threads like having only 5. The
running time would then be .08+.92*(7 or 5)/13 times the regular one,
i.e. a speedup of 73% with 2 threads and 130% with 3 threads, and indeed
with 2 threads I get almost 70%. I guess I should try and experiment
on some other machines to understand why it doesn't work as well on
yours.
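In other words the model is roughly

  T_t / T_1 \approx 0.08 + 0.92 \cdot \lceil 13/t \rceil / 13,
  \qquad \text{speedup} = T_1 / T_t - 1,

which gives the 73% and 130% figures above for t = 2 and t = 3.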
On Jun 17, 7:30 pm, Marc <marc.gli...@gmail.com> wrote:
> On Jun 17, 3:22 pm, Bill Hart <goodwillh...@googlemail.com> wrote:
>
> > I ran your code Marc and it performs quite well. Here are the benches
> > up to 3 threads (someone is currently using the fourth core on my
> > machine at present):
>
> > 1 thread: 218-228
> > 2 threads: 308-337
> > 3 threads: 385-419
>
> Strange, at 10^4 limbs, on this core 2 duo, with 1 thread the
> multiplication takes 9ms and with 2 threads it takes about 5.4ms,
> something like a 65% speedup, whereas in your measures it only gives a
> 45% speedup. Too bad the benefits are not as good on different
> machines.
> Did you try it in the toom7 range? For 10^3 limbs, I have .37ms for 1
> thread and .28ms for 2 threads, a 30% speedup, but you might have
> almost no speedup on your machine...
>
> > It's actually quite close to being efficient according to the measure
> > I was hoping to achieve, though we are well outside the toom7 range.
>
> > I also timed, at 10000 limbs, my karatsuba on top of toom7 code, which
> > is really only useful at 3 threads:
>
> > 1 thread: 179
> > 2 threads: 249-250
> > 3 threads: 451-455
>
> Makes sense that karatsuba would work well for 3 threads, there is
> almost nothing to do outside of the 3-way parallel region, whereas for
> toom it is a bit more work to include part of the evaluations and the
> interpolation in the parallel region, although it should be doable
> (for the evaluations there is an obvious way to do it, but it is
> likely not the most efficient unless you are using 13 threads :-)
2009/6/18 Bill Hart <goodwi...@googlemail.com>:
too see -> to see
Here are the timings with my code:
>
> 1 thread:
Bill.
On Jun 18, 6:00 am, Bill Hart <goodwillh...@googlemail.com> wrote:
> Another (perhaps more relevant) effect is that AMD machines are much more
> critically dependent on good cache locality. By splitting the data up in the
> way you have, we might be causing issues for the K8.
>
> I've attached my solution. It would be interesting to compare it with your
> code on your core2 machine to see if my approach still works as well on
> that machine.
Up to some scaling, the running time at 10^4 limbs is 4.55 for regular
toom7, 5.55 for your code single-threaded and 3.7 for your code with 2
threads. 3.7/5.55 is close to 2/3, the theoretical value, so there is
no relevant effect from caching on my machine.
For comparison, my code
is at 2.7 for 2 threads and I project 1.95 for 3 threads, whereas I
project your code as 1.85 for 3 threads (a winner), although it uses
22% more cpu cycles, so the best choice might depend on how much other
stuff you want to run at the same time as this multiplication.
When I get access to an AMD machine, I'll try to reorganize things in
a more cache-friendly way and see what happens.
I hope we don't end up needing different strategies on AMD and
Intel...
> Yes, I timed it over the Toom7 range (currently 590 limbs -> infinity).
So you only get a 10% speedup from a second thread at 10^3 limbs
(compared to 30% on core2), not worth it.
> The main issue is data dependence, and we've seen that slows things
> down. It may actually be faster to make copies of all the data before
> proceeding with the evaluations. But as you've noted, it is the
> interpolations which account for most of the rest of the time in toom7
> and that will be harder to parallelise effectively.
Yes, I am not so worried about the overhead of evaluation +
interpolation when even executing 2 things in parallel is so far from
being twice as fast. I am not sure exactly what data dependence you are
talking about. The 13 multiplications write to different places, and
they even take mostly different arguments, although the write locations
are adjacent so they might be on the same page.
Bill.
2009/6/18 Bill Hart <goodwi...@googlemail.com>:
Something really odd happens at 4 threads. I've checked the timings,
and something really does go wrong; it's not just a temporary timing
anomaly.
2 threads:
MPIRbench.base.multiply.32000,32000 result: 10673
MPIRbench.base.multiply.64000,64000 result: 4721
MPIRbench.base.multiply.131072,131072 result: 1875
MPIRbench.base.multiply.192000,192000 result: 1073
MPIRbench.base.multiply.640000,640000 result: 211
MPIRbench.base.multiply.2097152,2097152 result: 45.9
MPIRbench.base.multiply.6400000,6400000 result: 10.7
3 threads:
MPIRbench.base.multiply.32000,32000 result: 12942
MPIRbench.base.multiply.64000,64000 result: 5861
MPIRbench.base.multiply.131072,131072 result: 2310
MPIRbench.base.multiply.192000,192000 result: 1322
MPIRbench.base.multiply.640000,640000 result: 278
MPIRbench.base.multiply.2097152,2097152 result: 58.5
MPIRbench.base.multiply.6400000,6400000 result: 15.4
4 threads:
MPIRbench.base.multiply.32000,32000 result: 13914
MPIRbench.base.multiply.64000,64000 result: 553
MPIRbench.base.multiply.131072,131072 result: 2623
MPIRbench.base.multiply.192000,192000 result: 464
MPIRbench.base.multiply.640000,640000 result: 327
MPIRbench.base.multiply.2097152,2097152 result: 70.4
MPIRbench.base.multiply.6400000,6400000 result: 16.8
Bill.
2009/6/20 Bill Hart <goodwi...@googlemail.com>:
I would guess the timings depend a lot on which cores the threads are
scheduled, i.e. what level of cache they share. You can use taskset(1)
to force your threads to use particular cores.
AFAIK there are 4 packages with 6 cores each:
phys 0 = core 0 4 5 6 7 8
phys 1 = core 1 9 10 11 12 13
phys 2 = core 2 14 15 16 17 18
phys 3 = core 3 19 20 21 22 23
I'm not sure how the 6 cores are organized inside each package in
terms of L2, but I think the whole package shares a 16MB L3 cache, so
forcing your program to run on a single physical CPU (6 threads max)
may give you more consistent results.
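Programmatically, something like the following (an untested,
Linux-specific sketch using sched_setaffinity; the core list is just
phys 0 from the layout above) would pin each OpenMP thread to one core
of a single package:

#define _GNU_SOURCE
#include <sched.h>
#include <omp.h>

/* Pin each OpenMP thread to one core of physical package 0 (cores
   0,4,5,6,7,8 above), so that all threads share that package's L3.  */
static const int pkg0_cores[6] = {0, 4, 5, 6, 7, 8};

void pin_threads_to_package0(void)
{
#pragma omp parallel
    {
        cpu_set_t set;
        CPU_ZERO(&set);
        CPU_SET(pkg0_cores[omp_get_thread_num() % 6], &set);
        sched_setaffinity(0, sizeof(set), &set);   /* 0 = calling thread */
    }
}

Or simply wrap the whole run in taskset -c 0,4,5,6,7,8.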
Gonzalo
On Jun 22, 8:42 am, Bill Hart <goodwillh...@googlemail.com> wrote:
> I had a go at parallelising the first part of karatsuba, as there are
> two completely independent evaluations that get done there. But this
> slowed the whole thing down by at least a factor of 2.
Yes, that's just a couple of subtractions, it takes almost no time. You
could include them in the section of the first multiplication though.
> So it seems that for 3 threads at least, 200 limbs is the cutoff, and
> even then it is touch and go.
>
> It's a shame that there is such a huge gap between 8192 and 192000 in
> the benchmark, as 8192 bits is too small to show a really significant
> improvement with threads. You get 50% with three threads, but that is
> really a waste of CPU cycles.
I tried with your code, and 50% is the best I can get with 3 threads at
200 limbs (not 128), and only once every 10 runs...
By the way I changed your function to make it call mpn_mul_n so I
don't need 4 versions of it.
Yes, we need to figure something out. Maybe it's not possible.
> I will also try a different compiler. At small numbers of limbs, gcc44
> is faster than gcc42 by orders of magnitude.
I should also try gcc 4.4 on sage.math. I presume you have built it
there somewhere. When I tried to build gcc 4.3 it complained about
everything including the libc.
>
>> > By the way I changed your function to make it call mpn_mul_n so I
>> > don't need 4 versions of it.
>>
>> Yep that's the obvious thing to do. It doesn't work as well for toom3 or
>> karatsuba though I don't think because of the extra memory allocations that
>> need to be done.
>
> I don't think I understand this sentence. Anyway, in the toom3/
> karatsuba range, if you use TMP_ALLOC_LIMBS, it gets memory on the
> stack, so it should be quite fast. Although I did notice how your
> modification of karatsuba (use a separate buffer) makes it slower for
> 1 thread in a surprising proportion.
If you let mpn_mul_n do the memory allocation, as the karatsuba calls
it three times, three lots of allocation will be done. Yes it should
be fast, but it can be made slightly faster.
Bill.