Parallel Benchmarks


Steve Linton

Dec 11, 2017, 7:29:50 AM
to ffpack-devel
Hi,

I've installed fflas-ffpack in order to benchmark it as a reference point for some new software. The benchmarks directory contains very useful programs, but I can't find a benchmark for multi-threaded multiply (or echelon form).
As far as I can see fflas suppresses any multi-threading in OpenBLAS, and the -t option only seems to relate to some aspect of the setup. What do I have to do to ask fflas to perform some operation using N threads and to measure the wall-clock time taken?

Also, I can't seem to build any usable documentation. Am I doing something stupid, or is that still work in progress?

Thanks

Steve

JGD

Dec 11, 2017, 7:43:42 AM
to ffpack-devel
Dear Steve,

Multi-threaded benchmarks are often included in the generic benchmark files and enabled with the '-p' option,

so, for instance, you can run a parallel echelon form with:

./benchmark-pluq -p Y
or parallel multiply with:

./benchmark-fgemm -p 1
Now, for the doc,

./benchmark-fgemm --help
should provide some basic usage, then

make docs
in the main directory will produce a (limited) Doxygen-based doc within

doc/fflas-ffpack-dev-html/

and

doc/fflas-ffpack-html/


Otherwise, let us know!

Yours,

Steve Linton

Dec 11, 2017, 8:46:31 AM
to ffpack-devel
Thanks. 

The benchmark help seems to suggest that -p selects some kind of parallelisation strategy. A couple more questions: 1. How is the number of threads controlled? 2. Is the time printed by the benchmark wall-clock time or CPU time?

Steve

Clement Pernet

Dec 11, 2017, 9:50:25 AM
to ffpack...@googlegroups.com, Steve Linton
Dear Steve,

On 11/12/2017 at 14:46, Steve Linton wrote:
> The benchmark help seems to suggest that -p selects some kind of parallelisation strategy. A couple
> more questions: 1. How is the number of threads controlled?

The parallelization is by default based on OpenMP. Hence the number of threads used is defined by
OpenMP. You can set it with the environment variable OMP_NUM_THREADS.
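
For example, with a bash-like shell (the thread count and matrix sizes below are only placeholders, not a recommendation):

export OMP_NUM_THREADS=32                           # number of OpenMP threads for the run
./benchmark-fgemm -p 1 -m 10000 -n 10000 -k 10000   # parallel fgemm benchmark on 10000x10000 matrices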

Then, the various parallel algorithms use another parameter (called the number of virtual threads in
the benchmark-fgemm documentation), which you set with the option -t.
It is used to drive the block partitioning. By default it is set to the number of OpenMP threads, but
you can arbitrarily force it to a larger value.

For instance, a run on a 32-core machine with -t 128 will use 32 OpenMP threads, but generate a
splitting into 128 tasks, which sometimes allows more efficient work stealing.
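
In command-line form, that setup would look something like this (only a sketch; the matrix sizes are arbitrary):

OMP_NUM_THREADS=32 ./benchmark-fgemm -p 1 -t 128 -m 10000 -n 10000 -k 10000   # 32 OMP threads, work split into 128 tasks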

libgomp is actually not very good at managing many more tasks than threads, but when using XKaapi's
implementation of OpenMP through libkomp, we saw some improvement from doing so.
This is discussed in our 2016 PARCO paper:

http://dx.doi.org/10.1016/j.parco.2015.10.003
and in Ziad Sultan's PhD thesis:

http://moais.imag.fr/membres/ziad.sultan/dokuwiki/lib/tpl/PHD/these.pdf

> 2. Is the time printed by the benchmark wall clock time or CPU time?
Wall-clock time, of course. benchmark-fgemm.C line 212 calls chrono.realtime(), which is a wall-time measure
(as opposed to chrono.usertime() for CPU time).
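
If you want to double-check this from the outside, a rough sanity test (assuming a bash-like shell, whose time builtin reports real/user/sys) is to time a run and compare the figures: with N cores busy, "user" should be roughly N times "real".

time ./benchmark-fgemm -p 1 -m 10000 -n 10000 -k 10000   # compare the reported "real" (wall-clock) and "user" (CPU) times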

Let us know if you have any further question.
And apologies for the rough documentation.

Best

Clément


Stephen Linton

Dec 12, 2017, 4:02:55 AM
to ffpack...@googlegroups.com, clement...@gmail.com
Thanks, Clement,

So I ran the command below on our 64-core Piledriver machine:

babbage$ ./benchmark-fgemm -p 1 -n 10000 -k 10000 -m 10000 -q 17

Time: 74.8889 Gfops: 26.7062 -q 17 -m 10000 -k 10000 -n 10000 -w -1 -i 3 -p 1 -t 64 -b 64

While it was running I saw CPU usage for the process (from top) at 100% almost all the time, with only occasional
flashes of more.

I tried explicitly exporting OMP_NUM_THREADS=64, but it made essentially no difference.

Setting OMP_NUM_THREADS=1 speeds up the calculation (it runs in 51.5s), as does -p 0.

Is this what you would expect, or am I doing something stupid?

Steve

(FYI the latest version of meataxe64 does that benchmark in 58.8s on 1 core or 2.7s on all 64).

Clement Pernet

Dec 12, 2017, 5:08:12 AM
to Stephen Linton, ffpack...@googlegroups.com
This is very surprising to me.

A few remarks first:

1/ Which compiler are you using? GCC >= 4.9 provides a much more efficient implementation of
OpenMP than older versions.
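
A quick way to check, and to rebuild if needed, is sketched below (the configure line is only an illustration; adapt CXX to whatever recent compiler you have installed):

g++ --version                 # check which compiler (and hence which OpenMP runtime) is available
./configure CXX=g++ && make   # sketch: reconfigure and rebuild with an explicitly chosen compiler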

2/ You chose a very small finite field (GF(17)). For this one, it is recommended to use the
ModularBalanced<float> field implementation rather than ModularBalanced<double>. You can
comment/uncomment lines 86 and 87 in benchmark-fgemm.C.

By default, ModularBalanced<double> converts your double-precision matrices into single-precision
ones when the field is small enough, as it can then use faster SIMD vectorization.
This conversion may be costly on your machine (a lot of cache misses).
However, on my 32-core Intel Sandy Bridge, it does no harm at all, but instead increases the speed
w.r.t. a larger field.

Using ModularBalanced<double>:

pernet@hpac:~/soft/fflas-ffpack/benchmarks$ ./benchmark-fgemm -n 10000 -k 10000 -m 10000
Time: 88.1562 Gfops: 22.687 -q 131071 -m 10000 -k 10000 -n 10000 -w -1 -i 3 -p 0 -t 32 -b 32
pernet@hpac:~/soft/fflas-ffpack/benchmarks$ ./benchmark-fgemm -n 10000 -k 10000 -m 10000 -p 1
Time: 5.12793 Gfops: 390.021 -q 131071 -m 10000 -k 10000 -n 10000 -w -1 -i 3 -p 1 -t 32 -b 32
pernet@hpac:~/soft/fflas-ffpack/benchmarks$ ./benchmark-fgemm -n 10000 -k 10000 -m 10000 -p 1 -q 17
Time: 3.9822 Gfops: 502.235 -q 17 -m 10000 -k 10000 -n 10000 -w -1 -i 3 -p 1 -t 32 -b 32

Using ModularBalanced<float>:
pernet@hpac:~/soft/fflas-ffpack/benchmarks$ ./benchmark-fgemm -n 10000 -k 10000 -m 10000 -p 1 -q 17
Time: 2.90629 Gfops: 688.163 -q 17 -m 10000 -k 10000 -n 10000 -w -1 -i 3 -p 1 -t 32 -b 32

So it takes 1.08s out of 3.9822s to convert the input and output back and forth between double and single
precision.

I do not have access to a Piledriver multicore server, so it is hard for me to investigate the cause
of this bad performance. I'll try to find some AMD server to experiment on, and I'll keep you
updated.

Best
Clément