Performance of DistMatrix<double> vs Matrix<double> vs BLAS-3 call for SYRK

ANJU KAMBADUR

May 16, 2013, 9:55:21 PM

to <elemental-dev@googlegroups.com>

When using MKL+Elemental to compute A'*A for a (100K,10K) matrix on a single node (1 MPI process) with 8 threads, Matrix<double> and the raw BLAS call take about the same time (~3 seconds), but DistMatrix<double> consistently takes three times as long (~9 seconds). Is this to be expected?

I am omitting the system details to keep the posting short, but I can post the code and environment details if need be.

- Anju
--------------------------------------------
Prabhanjan Kambadur
Research Staff Member
Business Analytics and Mathematical Sciences
IBM TJ Watson Research Center
Room 30-229 A

Jeff Hammond

May 16, 2013, 10:01:41 PM

to elemen...@googlegroups.com

Google has a lot of storage. It's probably safe to post your test code.

Jeff

--
Jeff Hammond
Argonne Leadership Computing Facility
University of Chicago Computation Institute
jham...@alcf.anl.gov / (630) 252-5381
http://www.linkedin.com/in/jeffhammond
https://wiki.alcf.anl.gov/parts/index.php/User:Jhammond
ALCF docs: http://www.alcf.anl.gov/user-guides

Jack Poulson

May 16, 2013, 10:02:04 PM

to elemen...@googlegroups.com

What routines are you calling? Herk?

It's also good to make sure that the appropriate version of the underlying BLAS library was linked and that Elemental's local BLAS calls (e.g., MKL dgemm) are not launching several threads and resulting in oversubscription.

Jack

ANJU KAMBADUR

May 17, 2013, 9:36:53 AM

to elemen...@googlegroups.com

I am calling elem::Syrk(...). I did not keep the posting short because I thought I'd bring down Google :). Here is the (simple) code:

(See attached file: Jack.tar.gz)

Here are some sample numbers:

#./syrk 10000 1000 10
Running 10 iterations of A'*A with A(10000,1000)
DistMatrix SYRK took 0.941127 (seconds)
Raw SYRK took 0.302189 (seconds)
Matrix SYRK took 0.302338 (seconds)

Is it safe to assume that Elemental has compiled properly because using Matrix<double> does give the same performance as raw BLAS? Here are the relevant link lines:

INCLIBS=-L${SOFTWARE}/lib \
-lelemental \
-Wl,--start-group \
$(MKLROOT)/lib/intel64/libmkl_intel_lp64.a \
$(MKLROOT)/lib/intel64/libmkl_gnu_thread.a \
$(MKLROOT)/lib/intel64/libmkl_core.a \
-Wl,--end-group \
-lgomp \
-ldl \
-lpthread \
-lm

I used these same MKL libraries when I configured Elemental. I also used OMP_NUM_THREADS=8 for these experiments. Thanks for your help.

- Anju

Jack Poulson

May 17, 2013, 11:12:03 AM

to elemen...@googlegroups.com

Dear Anju,

You only launched the MPI job with one process.

If you want to achieve similar performance in Elemental to multithreaded BLAS with 8 threads, you should instead run:

export OMP_NUM_THREADS=1

mpirun -np 8 ./syrk 10000 1000 10

You are seeing a small amount of speedup because the local BLAS calls are threaded, but because the computation is distributed, it is split into many smaller local BLAS calls, and the performance is worse. At some point in the future Syrk will make use of BLIS in order to alleviate this issue, but for now it is best to use Elemental in the canonical way (one MPI process per core, as in the command above) if you want the best performance.

A well-implemented multithreaded computation over a modest number of cores should be at least slightly faster than a general purpose distributed-memory implementation, but you shouldn't see a factor of three.

Jack

Jack Poulson

May 17, 2013, 11:18:11 AM

to elemen...@googlegroups.com

Also, since your matrix is very tall and skinny, at some point you may want to experiment with the version of Syrk/Herk used within the following Cholesky-based QR factorization:
https://github.com/poulson/Elemental/blob/master/include/elemental/lapack-like/QR/Cholesky.hpp#L44

This case is not (yet) specially handled by Syrk/Herk.

Jack

ANJU KAMBADUR

May 17, 2013, 12:00:47 PM

to elemen...@googlegroups.com, Haim Avron

Hey Jack,

Thanks for your email. I purposely launched a single process because I wanted to use multithreaded BLAS on a single node. In general, I want to launch one MPI process per node and use multithreading within the node. We have an old GigE network on our cluster, so the fewer the MPI processes, the better. I also save some memory by not having to launch an MPI process per core.

Let me give you some numbers to explain my dilemma; I have a 5-node cluster (8 cores each):

(0) Running RAW BLAS and Matrix<double> on a single 8-core node:

#mpirun -np 1 -env OMP_NUM_THREADS 8 ./syrk 10000 1000 10

Running 10 iterations of A'*A with A(10000,1000)

Raw SYRK took 0.313457 (seconds)
elem::Matrix SYRK took 0.301522 (seconds)

(1) Running all 5 processes on a single 8-core node:

#mpirun -np 5 -env OMP_NUM_THREADS 1 ./syrk 10000 1000 10

Running 10 iterations of A'*A with A(10000,1000)

elem::DistMatrix SYRK took 0.644393 (seconds)

(2) Running 5 processes on 5 separate 8-core nodes (using only one core per node):

#mpirun -np 5 -env OMP_NUM_THREADS 1 ./syrk 10000 1000 10

Running 10 iterations of A'*A with A(10000,1000)

elem::DistMatrix SYRK took 1.305439 (seconds)

(3) Running 40 processes on 5 separate 8-core nodes:

#mpirun -np 40 -env OMP_NUM_THREADS 1 ./syrk 10000 1000 10

Running 10 iterations of A'*A with A(10000,1000)

elem::DistMatrix SYRK took 2.427789 (seconds)

You will see that the performance deteriorates significantly relative to the raw BLAS or Matrix<double> calls. Is the upshot of your mail that, even with a single process, DistMatrix<double> will be slower than Matrix<double>/raw BLAS because it splits the computation into multiple BLAS calls? I see that there is a HYBRID option in Elemental. Will that help me with the mixed-mode parallelism?

- Anju

Jack Poulson

May 17, 2013, 12:28:24 PM

to elemen...@googlegroups.com, Haim Avron

Dear Anju,

If you really are interested in hybrid execution (multiple MPI processes and multiple threads), then it is best to build Elemental in the HybridRelease mode. The only real difference between HybridRelease and PureRelease is that there are OpenMP directives sprinkled throughout the DistMatrix class's packing and unpacking routines surrounding MPI calls (and, of course, a multithreaded BLAS should be linked for Hybrid mode).

The distributed elem::Syrk/elem::Herk routines are written for the (harder) case A A', where A is n x k and k << n. The data movement is equivalent for A' A, where A is k x n and k << n.

Your case is the easy one, where k >> n, and I would recommend trying a variant of the three-line solution demonstrated here, from lines 55-57:
https://github.com/poulson/Elemental/blob/master/include/elemental/lapack-like/QR/Cholesky.hpp#L55

If you run this in hybrid mode, you should see solid performance, as the process consists of a local Herk call and then an MPI_Allreduce to form the result on each process. Clearly the AllReduce could be replaced with a ReduceScatter if you want the result distributed.
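
As a minimal sketch of why this works (not Elemental code): if each process owns a contiguous block of rows A_p of the tall matrix, then A'A is the sum of the local products A_p'A_p, so one local Syrk/Herk per process followed by a sum-reduction gives the result. The C++ below simulates that on a single process. `Dense`, `local_gram_update`, `gram_by_row_blocks`, and `max_abs_diff` are hypothetical names chosen for illustration; the triple loop stands in for the local BLAS call, and the accumulation loop stands in for MPI_Allreduce.

```cpp
#include <algorithm>
#include <cassert>
#include <cmath>
#include <cstddef>
#include <vector>

// Row-major dense matrix; a hypothetical stand-in for a local matrix type.
struct Dense {
    std::size_t rows, cols;
    std::vector<double> data;
    Dense(std::size_t r, std::size_t c) : rows(r), cols(c), data(r * c, 0.0) {}
    double& at(std::size_t i, std::size_t j) { return data[i * cols + j]; }
    double at(std::size_t i, std::size_t j) const { return data[i * cols + j]; }
};

// Lower triangle of C += A_p' * A_p, where A_p is rows [rowBegin, rowEnd) of A.
// This plays the role of the local Syrk/Herk call on one process.
void local_gram_update(const Dense& A, std::size_t rowBegin,
                       std::size_t rowEnd, Dense& C) {
    for (std::size_t r = rowBegin; r < rowEnd; ++r)
        for (std::size_t i = 0; i < A.cols; ++i)
            for (std::size_t j = 0; j <= i; ++j)
                C.at(i, j) += A.at(r, i) * A.at(r, j);
}

// Simulate numBlocks processes, each owning a contiguous block of rows.
Dense gram_by_row_blocks(const Dense& A, std::size_t numBlocks) {
    assert(numBlocks >= 1);
    Dense total(A.cols, A.cols);
    for (std::size_t p = 0; p < numBlocks; ++p) {
        const std::size_t rowBegin = p * A.rows / numBlocks;
        const std::size_t rowEnd = (p + 1) * A.rows / numBlocks;
        Dense partial(A.cols, A.cols);
        local_gram_update(A, rowBegin, rowEnd, partial);
        // Stand-in for the sum all-reduce: add this block's partial result.
        for (std::size_t k = 0; k < total.data.size(); ++k)
            total.data[k] += partial.data[k];
    }
    return total;
}

// Largest entrywise difference between two matrices of the same shape.
double max_abs_diff(const Dense& X, const Dense& Y) {
    assert(X.data.size() == Y.data.size());
    double worst = 0.0;
    for (std::size_t k = 0; k < X.data.size(); ++k)
        worst = std::max(worst, std::fabs(X.data[k] - Y.data[k]));
    return worst;
}
```

Calling gram_by_row_blocks(A, 1) and gram_by_row_blocks(A, 5) on the same matrix should agree up to rounding, which mirrors the single-process and five-process layouts discussed in this thread. Only the small n x n result is communicated, not the tall matrix itself.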

This is a good motivation to extend elem::Syrk/elem::Herk to switch to this smarter algorithm when the inner dimension is very large. (However, technically, the names SYmmetric Rank-K update and HErmitian Rank-K update don't make sense in that case, as the rank is at most min(k,n), which is a contradiction when k > n.)
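
To spell out the rank remark, assuming a real k x n matrix A as in the A' A case above (for complex A, replace the transpose with the conjugate transpose):

```latex
% A has k rows (the inner dimension) and n columns.
\[
  A \in \mathbb{R}^{k \times n}, \qquad
  C = A^{T} A \in \mathbb{R}^{n \times n}, \qquad
  \operatorname{rank}(C) = \operatorname{rank}(A) \le \min(k, n).
\]
% When k > n the update therefore has rank at most n, not k.
```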

Jack

ANJU KAMBADUR

May 17, 2013, 12:36:43 PM

to elemen...@googlegroups.com

Thanks Jack. I'll try these out (possibly today) and let you know.

FYI --- We actually have use for both A'A and AA' --- we just choose one or the other depending on whether M>>N or M<<N.

- Anju
