When using MKL+Elemental to compute A'*A for a (100K,10K) matrix on a single node (1 MPI process) with 8 threads, Matrix<double> and the raw BLAS call take about the same time (~3 seconds), but DistMatrix<double> consistently takes about three times as long (~9 seconds). Is this to be expected?
I am omitting the system details to keep the posting short, but I can post the code and environment details if need be.
- Anju
--------------------------------------------
Prabhanjan Kambadur
Research Staff Member
Business Analytics and Mathematical Sciences
IBM TJ Watson Research Center
Room 30-229 A
--
I am calling elem::Syrk(...). I did not keep the posting short because I thought I'd bring down google :). Here is the (simple) code:
(See attached file: Jack.tar.gz)
Here are some sample numbers:
#./syrk 10000 1000 10
Running 10 iterations of A'*A with A(10000,1000)
DistMatrix SYRK took 0.941127 (seconds)
Raw SYRK took 0.302189 (seconds)
Matrix SYRK took 0.302338 (seconds)
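For readers without the attachment, here is a minimal sketch of the kind of timing loop the numbers above come from. It is not the code in the tarball: `syrk_reference` and `time_syrk` are hypothetical names, and the naive triple loop only stands in for the tuned BLAS/Elemental SYRK call, so its absolute timings are not comparable to the ones reported here.

```cpp
#include <cassert>
#include <chrono>
#include <cstddef>
#include <vector>

// Reference kernel (hypothetical name): C := A' * A, upper triangle only,
// for a column-major m x n matrix A. Stand-in for a tuned SYRK routine.
void syrk_reference(std::size_t m, std::size_t n,
                    const std::vector<double>& A, std::vector<double>& C) {
    assert(A.size() == m * n && C.size() == n * n);
    for (std::size_t j = 0; j < n; ++j) {
        for (std::size_t i = 0; i <= j; ++i) {
            double sum = 0.0;
            for (std::size_t k = 0; k < m; ++k)
                sum += A[k + i * m] * A[k + j * m];
            C[i + j * n] = sum;  // entries below the diagonal are left untouched
        }
    }
}

// Run `iters` back-to-back SYRK calls and return the total wall-clock seconds.
double time_syrk(std::size_t m, std::size_t n, int iters) {
    std::vector<double> A(m * n, 1.0), C(n * n, 0.0);
    const auto start = std::chrono::steady_clock::now();
    for (int it = 0; it < iters; ++it)
        syrk_reference(m, n, A, C);
    const auto stop = std::chrono::steady_clock::now();
    return std::chrono::duration<double>(stop - start).count();
}
```

Calling `time_syrk(10000, 1000, 10)` would mirror the `./syrk 10000 1000 10` invocation, with the kernel swapped for whichever SYRK variant is being measured.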
Is it safe to assume that Elemental has compiled properly because using Matrix<double> does give the same performance as raw BLAS? Here are the relevant link lines:
INCLIBS=-L${SOFTWARE}/lib \
-lelemental \
-Wl,--start-group \
$(MKLROOT)/lib/intel64/libmkl_intel_lp64.a \
$(MKLROOT)/lib/intel64/libmkl_gnu_thread.a \
$(MKLROOT)/lib/intel64/libmkl_core.a \
-Wl,--end-group \
-lgomp \
-ldl \
-lpthread \
-lm
I used these same MKL libraries when I configured Elemental. I also set OMP_NUM_THREADS=8 for these experiments. Thanks for your help.
- Anju
Jeff Hammond ---05/16/2013 10:04:10 PM---Google has a lot of storage. It's probably safe to post your test code. Jeff
| Jeff Hammond <jham...@alcf.anl.gov> |
| elemen...@googlegroups.com |
| 05/16/2013 10:04 PM |
| Re: [elemental] Performance of DistMatrix<double> vs Matrix<double> vs BLAS-3 call for SYRK |
| elemen...@googlegroups.com |
Hey Jack,
Thanks for your email. I purposely launched a single process, because I wanted to use multi-threaded BLAS on a single node. In general, I want to launch one MPI process per node and use multi-threading within the node. We have an old GigE network on our cluster, so the fewer the MPI processes, the better. I also save some memory by not having to launch an MPI process per core.
Let me give you some numbers to explain my dilemma. I have a 5-node cluster (8 cores per node):
(0) Running RAW BLAS and Matrix<double> on a single 8-core node:
#mpirun -np 1 -env OMP_NUM_THREADS 8 ./syrk 10000 1000 10
Running 10 iterations of A'*A with A(10000,1000)
Raw SYRK took 0.313457 (seconds)
elem::Matrix SYRK took 0.301522 (seconds)
(1) Running all 5 processes on a single 8-core node:
#mpirun -np 5 -env OMP_NUM_THREADS 1 ./syrk 10000 1000 10
Running 10 iterations of A'*A with A(10000,1000)
elem::DistMatrix SYRK took 0.644393 (seconds)
(2) Running 5 processes on 5 separate 8-core nodes (using only one core per node):
#mpirun -np 5 -env OMP_NUM_THREADS 1 ./syrk 10000 1000 10
Running 10 iterations of A'*A with A(10000,1000)
elem::DistMatrix SYRK took 1.305439 (seconds)
(3) Running 40 processes on 5 separate 8-core nodes:
#mpirun -np 40 -env OMP_NUM_THREADS 1 ./syrk 10000 1000 10
Running 10 iterations of A'*A with A(10000,1000)
elem::DistMatrix SYRK took 2.427789 (seconds)
As you can see, performance is significantly worse than with the raw BLAS or Matrix<double> calls. Do I understand your mail correctly that, even with a single process, DistMatrix<double> will be slower than Matrix<double>/raw BLAS because it splits the operation into multiple BLAS calls? I see that there is a HYBRID option in Elemental. Will that help me with the mixed-mode parallelism?
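To make the "multiple BLAS calls" question concrete, here is a rough sketch of a blocked rank-k update: the rows of A are split into panels and one kernel call is issued per panel, accumulating into C. This is only an illustration of the general idea, not Elemental's actual DistMatrix algorithm; `syrk_panel`, `syrk_blocked`, and the `block` parameter are hypothetical names, and in real code each panel call would be a BLAS dsyrk.

```cpp
#include <algorithm>
#include <cassert>
#include <cstddef>
#include <vector>

// Accumulate C += P' * P (upper triangle only), where P is the panel made of
// rows [r0, r1) of the column-major m x n matrix A. Hypothetical helper.
void syrk_panel(std::size_t m, std::size_t n, std::size_t r0, std::size_t r1,
                const std::vector<double>& A, std::vector<double>& C) {
    assert(r0 <= r1 && r1 <= m);
    for (std::size_t j = 0; j < n; ++j) {
        for (std::size_t i = 0; i <= j; ++i) {
            double sum = 0.0;
            for (std::size_t k = r0; k < r1; ++k)
                sum += A[k + i * m] * A[k + j * m];
            C[i + j * n] += sum;
        }
    }
}

// Blocked C := A' * A: zero C, then issue one kernel call per panel of at
// most `block` rows. Returns the number of kernel calls that were made.
std::size_t syrk_blocked(std::size_t m, std::size_t n, std::size_t block,
                         const std::vector<double>& A, std::vector<double>& C) {
    assert(block > 0 && A.size() == m * n && C.size() == n * n);
    std::fill(C.begin(), C.end(), 0.0);
    std::size_t calls = 0;
    for (std::size_t r0 = 0; r0 < m; r0 += block) {
        syrk_panel(m, n, r0, std::min(r0 + block, m), A, C);
        ++calls;
    }
    return calls;
}
```

The result matches a single large call, but a small `block` means many short kernel invocations, each of which gives a multi-threaded BLAS less work to spread over the cores.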
- Anju
Jack Poulson ---05/17/2013 11:15:52 AM---Dear Anju, You only launched the MPI job with one process.
| Jack Poulson <jack.p...@gmail.com> |
| "elemen...@googlegroups.com" <elemen...@googlegroups.com> |
| 05/17/2013 11:15 AM |
Thanks Jack. I'll try these out (possibly today) and let you know.
FYI: we actually have use for both A'A and AA'; we just choose one or the other depending on whether M>>N or M<<N.
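As a small sketch of that choice, assuming the rule is simply to form whichever product is smaller (A'A is N x N, AA' is M x M). The rule and the names `GramSide`, `GramChoice`, and `choose_gram` are illustrative assumptions, not something from our code:

```cpp
#include <cstddef>

// Which Gram product to form for an m x n matrix A.
enum class GramSide { AtA, AAt };

struct GramChoice {
    GramSide side;    // product to compute
    std::size_t dim;  // order of the resulting square matrix
};

// Assumed rule: pick the product with the smaller result.
// A'A is n x n (preferred when m >= n); AA' is m x m (preferred when m < n).
GramChoice choose_gram(std::size_t m, std::size_t n) {
    if (m >= n) return {GramSide::AtA, n};
    return {GramSide::AAt, m};
}
```

For the (100K,10K) case discussed earlier in the thread, this picks A'A with a 10K x 10K result.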
- Anju
Jack Poulson ---05/17/2013 12:32:22 PM---Dear Anju, If you really are interested in hybrid execution (multiple MPI processes
| Jack Poulson <jack.p...@gmail.com> |
| "elemen...@googlegroups.com" <elemen...@googlegroups.com> |
| Haim Avron/Watson/IBM@IBMUS |
| 05/17/2013 12:32 PM |
| Re: [elemental] Performance of DistMatrix<double> vs Matrix<double> vs BLAS-3 call for SYRK |
| elemen...@googlegroups.com |