Dear All,
I am observing a lot of variation in the execution times of OpenBLAS DGEMM on an Intel Haswell machine, whose specification is shown below:
Architecture: x86_64
model name : Intel(R) Core(TM) i5-4590 CPU @ 3.30GHz
CPU(s): 4
Thread(s) per core: 1
Core(s) per socket: 4
Socket(s): 1
NUMA node(s): 1
L1d cache: 32K
L1i cache: 32K
L2 cache: 256K
L3 cache: 6144K
In the experiments, the DGEMM routine multiplies two square matrices of size NxN. The matrix size N is varied from 16960 to 17088. The attached plot shows the variation in execution times. Similar variation is observed for other matrix sizes too. The environment variable OPENBLAS_NUM_THREADS is not set, so I assume that OpenBLAS uses 4 threads. Each experimental point in the plot is the average of 5 executions.
Please also note that when OpenBLAS was compiled, the following line was left commented out in Makefile.rule:
# If you want to disable CPU/Memory affinity on Linux.
#NO_AFFINITY = 1
This is to allow OpenBLAS to use CPU affinity. This flag, however, does not seem to make any difference: the variation is the same whether the line is commented out or not.
Please let me know if this is a known issue, or if there are any workarounds to minimize the deviations in execution times.
Regards
Ravi