I was trying out OpenBLAS on my new machine to benchmark it, and I soon found that the OpenBLAS dgemm implementation is slower than a simple triple for-loop implementation.
Any help in understanding why would be greatly appreciated.
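For reference, here is a minimal sketch of the kind of comparison described above. It is not the exact benchmark code and rests on several assumptions: square row-major matrices, and the hypothetical names `naive_dgemm`, `blas_dgemm`, `time_gemm`, and `USE_OPENBLAS`, none of which come from OpenBLAS. The only library call is the standard CBLAS `cblas_dgemm`, and it is compiled only when `USE_OPENBLAS` is defined.

```c
#define _POSIX_C_SOURCE 199309L  /* for clock_gettime */
#include <stddef.h>
#include <time.h>

#ifdef USE_OPENBLAS              /* hypothetical macro, defined at compile time */
#include <cblas.h>
#endif

/* Naive triple-loop C = A * B for n x n row-major matrices. */
void naive_dgemm(size_t n, const double *a, const double *b, double *c)
{
    for (size_t i = 0; i < n; i++) {
        for (size_t j = 0; j < n; j++) {
            double sum = 0.0;
            for (size_t k = 0; k < n; k++)
                sum += a[i * n + k] * b[k * n + j];
            c[i * n + j] = sum;
        }
    }
}

#ifdef USE_OPENBLAS
/* Same product through CBLAS: C = 1.0 * A * B + 0.0 * C. */
void blas_dgemm(size_t n, const double *a, const double *b, double *c)
{
    cblas_dgemm(CblasRowMajor, CblasNoTrans, CblasNoTrans,
                (int)n, (int)n, (int)n,
                1.0, a, (int)n, b, (int)n,
                0.0, c, (int)n);
}
#endif

typedef void (*gemm_fn)(size_t, const double *, const double *, double *);

/* Returns the best wall-clock time in seconds over `reps` runs of f. */
double time_gemm(gemm_fn f, size_t n, const double *a, const double *b,
                 double *c, int reps)
{
    double best = -1.0;
    for (int r = 0; r < reps; r++) {
        struct timespec t0, t1;
        clock_gettime(CLOCK_MONOTONIC, &t0);
        f(n, a, b, c);
        clock_gettime(CLOCK_MONOTONIC, &t1);
        double s = (double)(t1.tv_sec - t0.tv_sec)
                 + (double)(t1.tv_nsec - t0.tv_nsec) * 1e-9;
        if (best < 0.0 || s < best)
            best = s;
    }
    return best;
}
```

To compare the two, call `time_gemm(naive_dgemm, n, a, b, c, reps)` and `time_gemm(blas_dgemm, n, a, b, c, reps)` on the same inputs; the OpenBLAS variant would typically be built with something like `-DUSE_OPENBLAS -lopenblas`. The sketch measures wall-clock time rather than using `clock()`, because `clock()` reports CPU time summed over all threads and would overstate the cost of a multithreaded dgemm.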
Here is my CPU specification:
Architecture: x86_64
CPU op-mode(s): 32-bit, 64-bit
Byte Order: Little Endian
CPU(s): 8
On-line CPU(s) list: 0-7
Thread(s) per core: 2
Core(s) per socket: 4
Socket(s): 1
NUMA node(s): 1
Vendor ID: AuthenticAMD
CPU family: 23
Model: 17
Model name: AMD Ryzen 5 2500U with Radeon Vega Mobile Gfx
Stepping: 0
CPU MHz: 1403.261
CPU max MHz: 2000.0000
CPU min MHz: 1600.0000
Virtualization: AMD-V
L1d cache: 32K
L1i cache: 64K
L2 cache: 512K
L3 cache: 4096K
Thank you.