I have a question about FFPACK::pPLUQ.
It is stated that one should "Make sure to use a single threaded version of the BLAS library".
For OpenBLAS, this means that I have to set "USE_THREAD = 0" (and "USE_OPENMP = 0") in Makefile.rule.
Intuitively, this seems to contradict the configure options
"--enable-openmp --with-openblas-num-threads=N" of fflas-ffpack-2.4.0.
I once used FFPACK::pPLUQ in fflas-ffpack-2.1.0 (the interface seems to have changed slightly since then) with GotoBLAS2-1.13
and got very satisfactory performance for matrix rank calculations.
What is a good way to calculate ranks of dense matrices over a finite field in parallel using fflas-ffpack?
Thank you in advance.
I tried benchmarks/benchmark-pluq.C
with a single-threaded OpenBLAS-0.3.6 and fflas-ffpack-2.4.0 configured with --enable-openmp --with-openblas-num-threads=12, compiling it with:
g++910 benchmark-pluq.C -I/home/tshun/gcc910/ff4_sing/include -fabi-version=6 -msse -msse2 -msse3 -mssse3 -msse4.1 -msse4.2 -mavx -mavx2 -mfma -I/home/tshun/gcc910/giv4/include -I/home/tshun/gcc910/gmp/include -fopenmp -I/home/tshun/gcc910/giv4/include -I/home/tshun/gcc910/gmp/include -fopenmp -L/home/tshun/gcc910/openblas4_sing/lib -lopenblas -L/home/tshun/gcc910/giv4/lib -lgivaro -L/home/tshun/gcc910/gmp/lib -lgmpxx -lgmp -O3
Calling FFPACK::pPLUQ seems to be faster (Time 4.28 vs. 9.82 below),
but when I monitored the system with the "top" command, ./a.out (with "-p true") did not seem to run in parallel (judging from %CPU).
[tshun@p-01g benchmarks]$ date && ./a.out -m 10000 -n 10000 -p true && date
Wed Jun 5 10:15:02 JST 2019
Time: 4.28083 Gfops: 75.9978 -s N -q 131071 -m 10000 -n 10000 -r 2000 -g Y -i 3 -v 0 -t 12 -b 12 -p Y
Wed Jun 5 10:16:02 JST 2019
[tshun@p-01g benchmarks]$ date && ./a.out -m 10000 -n 10000 && date
Wed Jun 5 10:16:07 JST 2019
Time: 9.81899 Gfops: 33.1331 -s N -q 131071 -m 10000 -n 10000 -r 2000 -g Y -i 3 -v 0 -t 1 -b 1 -p N
Wed Jun 5 10:17:29 JST 2019