Dear SLATE developers,
My name is Vinh Dang, from Sandia Labs. To get familiar with SLATE, I built it and ran the tester for getrf (type d, target devices) on the Summit system, then compared the performance with the SummitDev results presented in Figure 5 of the paper “Linear Systems Solvers for Distributed-Memory Machines with GPU Accelerators,” Euro-Par 2019: Parallel Processing, August 2019.
I have noticed that performance on Summit is far lower than the performance on SummitDev.
For example, on SummitDev with 16 nodes x 4 devices = 64 devices, the performance for a 300,000 x 300,000 matrix is ~25,000 GFLOPS.
However, on Summit with the same 16 nodes x 4 devices = 64 devices, I can only get ~15,000 GFLOPS for the same matrix size.
Do you have any idea what might be causing this?
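To put the gap in concrete terms, here is a rough back-of-the-envelope calculation using only the approximate numbers above. It is a sketch, not a measurement: it assumes the usual leading-order LU flop count of (2/3)n^3, which may differ slightly from what the tester counts, and the variable names are purely illustrative.

```shell
#!/bin/bash
# Approximate figures quoted above; all arithmetic is integer (truncating).
n=300000                 # matrix dimension
devices=64               # 16 nodes x 4 devices
summitdev_gflops=25000   # ~GFLOPS read from Figure 5 of the paper
summit_gflops=15000      # ~GFLOPS observed on Summit

# Per-device throughput and the Summit/SummitDev ratio in percent.
summitdev_per_dev=$(( summitdev_gflops / devices ))
summit_per_dev=$(( summit_gflops / devices ))
ratio_pct=$(( 100 * summit_gflops / summitdev_gflops ))

# ASSUMPTION: leading-order LU flop count (2/3) n^3, converted to GFLOP.
total_gflop=$(( 2 * n * n * n / 3 / 1000000000 ))

# Factorization time implied by each GFLOPS figure, in seconds.
summitdev_secs=$(( total_gflop / summitdev_gflops ))
summit_secs=$(( total_gflop / summit_gflops ))

echo "per device: SummitDev ~${summitdev_per_dev} GFLOPS, Summit ~${summit_per_dev} GFLOPS"
echo "Summit reaches ~${ratio_pct}% of the SummitDev figure"
echo "implied time: SummitDev ~${summitdev_secs} s, Summit ~${summit_secs} s"
```

Under these assumptions, Summit delivers about 60% of the SummitDev throughput, or roughly 234 versus 390 GFLOPS per device.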
Here are the modules I loaded:
Currently Loaded Modules:
1) mercurial/4.4.1
2) gcc/6.4.0
3) spectrum-mpi/10.3.1.2-20200121
4) essl/6.1.0-2
5) cuda/10.1.243
6) netlib-lapack/3.8.0
7) netlib-scalapack/2.0.2
And here is the script I used:
#!/bin/bash
#BSUB -P CSC391
#BSUB -W 120
#BSUB -nnodes 16
#BSUB -o job9_640.out
#BSUB -e job9_640.err
#BSUB -J myJobName9_640
#BSUB -alloc_flags "smt1"
export OMP_PROC_BIND=close
export OMP_PLACES=cores
export OMP_NUM_THREADS=10
## 64 ranks - 64 GPUs - 16 nodes
jsrun --smpiargs="-gpu" -n 64 -a 1 -c 10 -g 1 -r 4 -d packed -b packed:10 ./tester getrf --matrix 0 --type d --target d --dim 300000 --nrhs 1 --nb 640 --panel-threads 1 --check y --ref n --repeat 2
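As a sanity check on the jsrun line above, the following sketch recomputes the layout it requests. The variable names are illustrative only, and the node capacity of 42 usable cores and 6 GPUs is an assumption about Summit's node configuration, not something taken from the job output.

```shell
#!/bin/bash
# Values copied from the jsrun line: -n 64 -a 1 -c 10 -g 1 -r 4
resource_sets=64   # -n: total resource sets
ranks_per_rs=1     # -a: MPI ranks per resource set
cores_per_rs=10    # -c: cores per resource set
gpus_per_rs=1      # -g: GPUs per resource set
rs_per_node=4      # -r: resource sets per node

# Derived layout.
nodes=$(( resource_sets / rs_per_node ))        # should match #BSUB -nnodes
total_ranks=$(( resource_sets * ranks_per_rs ))
total_gpus=$(( resource_sets * gpus_per_rs ))
cores_per_node=$(( rs_per_node * cores_per_rs ))
gpus_per_node=$(( rs_per_node * gpus_per_rs ))

# ASSUMPTION: usable capacity of one Summit node.
node_cores=42
node_gpus=6
idle_cores=$(( node_cores - cores_per_node ))
idle_gpus=$(( node_gpus - gpus_per_node ))

echo "nodes=${nodes} ranks=${total_ranks} gpus=${total_gpus}"
echo "per node: ${cores_per_node} cores, ${gpus_per_node} GPUs in use"
echo "per node: ${idle_cores} cores, ${idle_gpus} GPUs idle (assumed capacity)"
```

With these values the request works out to 16 nodes, 64 ranks, and 64 GPUs, with 4 GPUs in use per node. That matches the 16 nodes x 4 devices configuration quoted above, and OMP_NUM_THREADS=10 matches the 10 cores given to each rank.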
Thanks,
Vinh Dang
Hi Asim,
Thank you for your response. I am looking forward to it since I am trying to compare our LU solver with SLATE.
Best,
Vinh