OpenBLAS dgemm slower than a triple for-loop implementation of matrix multiplication

Prometheus47

Mar 29, 2020, 6:05:32 PM

to OpenBLAS-users

I was benchmarking OpenBLAS on my new machine and soon found that its dgemm implementation is slower than a simple triple for-loop implementation.

Any help would be greatly appreciated.

Here is my CPU specification:

Architecture: x86_64
CPU op-mode(s): 32-bit, 64-bit
Byte Order: Little Endian
CPU(s): 8
On-line CPU(s) list: 0-7
Thread(s) per core: 2
Core(s) per socket: 4
Socket(s): 1
NUMA node(s): 1
Vendor ID: AuthenticAMD
CPU family: 23
Model: 17
Model name: AMD Ryzen 5 2500U with Radeon Vega Mobile Gfx
Stepping: 0
CPU MHz: 1403.261
CPU max MHz: 2000.0000
CPU min MHz: 1600.0000
Virtualization: AMD-V
L1d cache: 32K
L1i cache: 64K
L2 cache: 512K
L3 cache: 4096K

Thank you.

Nima Sahraneshin

Mar 29, 2020, 8:26:57 PM

to Prometheus47, OpenBLAS-users

How did you implement this loop? Also, please provide more information about the environment.


martin-frbg

Mar 31, 2020, 11:37:40 AM

to OpenBLAS-users

Cannot really answer without knowing the OpenBLAS version and the typical problem size. (If the matrix is very small, a triple loop or the unaccelerated reference implementation from netlib is almost certain to be faster, as it does not carry the overhead of multithreading setup and memory allocation.)
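
For anyone wanting to reproduce such a comparison, here is a minimal sketch of a timing harness. It is not the original poster's code, which was never shown in this thread. The names naive_dgemm, blas_dgemm, seconds_per_call and the USE_OPENBLAS macro are made up for illustration; only cblas_dgemm is the real CBLAS entry point. The sketch assumes square, row-major matrices and reports the average wall-clock time per call after one untimed warm-up call, so that one-off setup costs (thread creation, buffer allocation) are not charged to the measured runs.

```c
#include <stddef.h>
#include <time.h>

#ifdef USE_OPENBLAS          /* hypothetical macro: define it when linking OpenBLAS */
#include <cblas.h>
#endif

/* Naive triple loop: C = A * B for n x n row-major matrices. */
void naive_dgemm(size_t n, const double *A, const double *B, double *C)
{
    for (size_t i = 0; i < n; i++) {
        for (size_t j = 0; j < n; j++) {
            double sum = 0.0;
            for (size_t k = 0; k < n; k++)
                sum += A[i * n + k] * B[k * n + j];
            C[i * n + j] = sum;
        }
    }
}

#ifdef USE_OPENBLAS
/* Same product through CBLAS: C = 1.0 * A * B + 0.0 * C. */
void blas_dgemm(size_t n, const double *A, const double *B, double *C)
{
    cblas_dgemm(CblasRowMajor, CblasNoTrans, CblasNoTrans,
                (int)n, (int)n, (int)n,
                1.0, A, (int)n, B, (int)n, 0.0, C, (int)n);
}
#endif

typedef void (*gemm_fn)(size_t, const double *, const double *, double *);

/* Average wall-clock seconds per call over `reps` timed calls,
   preceded by one untimed warm-up call. */
double seconds_per_call(gemm_fn f, size_t n, const double *A,
                        const double *B, double *C, int reps)
{
    struct timespec t0, t1;

    f(n, A, B, C);                       /* warm-up, not timed */
    timespec_get(&t0, TIME_UTC);         /* C11 wall-clock timestamp */
    for (int r = 0; r < reps; r++)
        f(n, A, B, C);
    timespec_get(&t1, TIME_UTC);

    return ((double)(t1.tv_sec - t0.tv_sec)
            + (double)(t1.tv_nsec - t0.tv_nsec) * 1e-9) / reps;
}
```

Calling seconds_per_call(naive_dgemm, n, A, B, C, reps) and, when built with the macro defined and OpenBLAS linked (typically -DUSE_OPENBLAS -lopenblas), seconds_per_call(blas_dgemm, n, A, B, C, reps) for a range of n should show where the crossover lies on a given machine. Compile with optimization enabled, and report the OpenBLAS version and the matrix sizes along with the timings.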
