I'm using the OpenBLAS benchmarks for a first-order analysis of some embedded ARM development boards. I'd like to gauge multicore scaling, but on the 4-core Intel workstation I use for testing I'm not seeing much speedup going from 1 to 4 cores when I set the OBLAS_NUM_THREADS environment variable. I would expect an algorithm such as DGEMM to scale better, so I'm wondering whether there is a comparison of the OpenBLAS and OpenMP implementations, or of their performance in general.
1 core:
From : 5000 To : 5000 Step=1 : Trans=N
SIZE Flops Time
5000x5000 : 42878.58 MFlops 583.041683 sec
4 cores:
./dgemm.goto 5000 5000 1
From : 5000 To : 5000 Step=1 : Trans=N
SIZE Flops Time
5000x5000 : 51108.89 MFlops 489.151666 sec
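To put a number on "not much speedup", here is a minimal Python sketch that computes the speedup and parallel efficiency from the two MFlops figures reported above. The helper names `speedup` and `efficiency` are just illustrative and are not part of the OpenBLAS benchmark. It assumes that ideal scaling would be linear in the core count.

```python
def speedup(rate_1_core, rate_n_cores):
    """Ratio of the multicore throughput to the single-core throughput."""
    return rate_n_cores / rate_1_core


def efficiency(rate_1_core, rate_n_cores, n_cores):
    """Fraction of ideal linear scaling actually achieved (1.0 = perfect)."""
    return speedup(rate_1_core, rate_n_cores) / n_cores


# MFlops values copied from the two benchmark runs above.
mflops_1_core = 42878.58
mflops_4_cores = 51108.89

s = speedup(mflops_1_core, mflops_4_cores)
e = efficiency(mflops_1_core, mflops_4_cores, 4)
print(f"speedup: {s:.2f}x, parallel efficiency: {e:.0%}")
```

That works out to roughly 1.19x on 4 cores, or about 30% of linear scaling, which is the gap I'm trying to explain.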