OpenBLAS multithreading implementation vs OpenMP

Trokon Johnson

Sep 15, 2016, 3:34:16 PM
to OpenBLAS-users
I'm using the OpenBLAS benchmarks for a first-order analysis of some embedded ARM development boards. I'd like to gauge multicore scaling, but on the workstation I use for testing, a 4-core Intel machine, I'm not seeing much speedup for 1 vs. 4 cores using the OBLAS_NUM_THREADS environment variable. I would expect an algorithm such as DGEMM to scale better, so I'm wondering whether there's a comparison of the OpenBLAS and OpenMP threading implementations, or of their performance in general.

1 core:
From : 5000  To : 5000 Step=1 : Trans=N
  SIZE          Flops          Time
  5000x5000 :    42878.58 MFlops 583.041683 sec

4 cores:
./dgemm.goto 5000 5000 1
From : 5000  To : 5000 Step=1 : Trans=N
   SIZE          Flops          Time
   5000x5000 :    51108.89 MFlops 489.151666 sec
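
To put a number on "not much speedup", here is a minimal sketch that turns the two MFlops figures posted above into a speedup and a parallel efficiency. The function names are made up for illustration, and `implied_seconds` assumes the conventional 2*n^3 flop count for a square GEMM, which the thread does not confirm the benchmark uses.

```python
# Figures copied from the benchmark output posted above.
MFLOPS_1_THREAD = 42878.58
MFLOPS_4_THREADS = 51108.89
THREADS = 4
N = 5000  # matrix dimension used in both runs


def speedup(base_rate: float, parallel_rate: float) -> float:
    """Ratio of multi-thread throughput to single-thread throughput."""
    return parallel_rate / base_rate


def efficiency(base_rate: float, parallel_rate: float, threads: int) -> float:
    """Speedup divided by thread count; 1.0 would be ideal linear scaling."""
    return speedup(base_rate, parallel_rate) / threads


def implied_seconds(n: int, mflops: float) -> float:
    """Run time implied by a rate, ASSUMING 2*n^3 flops for an n x n GEMM."""
    return 2.0 * n**3 / (mflops * 1e6)


if __name__ == "__main__":
    s = speedup(MFLOPS_1_THREAD, MFLOPS_4_THREADS)
    e = efficiency(MFLOPS_1_THREAD, MFLOPS_4_THREADS, THREADS)
    print(f"speedup with {THREADS} threads: {s:.2f}x")
    print(f"parallel efficiency: {e:.1%}")
    print(f"implied time, 1 thread:  {implied_seconds(N, MFLOPS_1_THREAD):.2f} s")
    print(f"implied time, 4 threads: {implied_seconds(N, MFLOPS_4_THREADS):.2f} s")
```

With the posted figures this works out to roughly a 1.19x speedup, or about 30% efficiency on 4 threads. The implied times can be compared against the printed Time column as a cross-check on the flop-count assumption.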

Zhang Xianyi

Sep 15, 2016, 3:38:39 PM
to Trokon Johnson, OpenBLAS-users
Which CPU do you use?

--
You received this message because you are subscribed to the Google Groups "OpenBLAS-users" group.
To unsubscribe from this group and stop receiving emails from it, send an email to openblas-users+unsubscribe@googlegroups.com.
To post to this group, send email to openblas-users@googlegroups.com.
For more options, visit https://groups.google.com/d/optout.

Trokon Johnson

Sep 16, 2016, 9:05:40 AM
to OpenBLAS-users, tkjoh...@gmail.com
Intel(R) Core(TM) i3-4160. I do plan to run these tests on several ARM Cortex-A53 embedded boards once I've confirmed that parallelization works well.
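
For repeating the comparison on each board, a small thread-count sweep might look like the sketch below. Two assumptions to flag: it sets `OPENBLAS_NUM_THREADS`, which is the variable name OpenBLAS documents (the original post spells it `OBLAS_NUM_THREADS`, so the spelling may be worth double-checking), and it reuses the `./dgemm.goto 5000 5000 1` command from the post. The helper names are hypothetical. Also, as far as I know, the i3-4160 is a 2-core part with Hyper-Threading, so `os.cpu_count()` would report 4 logical CPUs there even though only 2 physical cores are available for DGEMM.

```python
import os
import re
import subprocess

# Benchmark command as shown in the post; adjust the path for your build tree.
BENCH_CMD = ["./dgemm.goto", "5000", "5000", "1"]

# Matches result rows such as "5000x5000 :    42878.58 MFlops 583.041683 sec".
RESULT_RE = re.compile(r"(\d+)x(\d+)\s*:\s*([\d.]+)\s*MFlops")


def thread_env(threads, base_env=None):
    """Copy the environment and set the OpenBLAS thread-count variable."""
    env = dict(os.environ if base_env is None else base_env)
    env["OPENBLAS_NUM_THREADS"] = str(threads)
    return env


def parse_mflops(output):
    """Return the MFlops value from the first result row, or None if absent."""
    match = RESULT_RE.search(output)
    return float(match.group(3)) if match else None


def run_once(threads):
    """Run the benchmark with the given thread count and return its MFlops."""
    result = subprocess.run(
        BENCH_CMD,
        env=thread_env(threads),
        stdout=subprocess.PIPE,
        stderr=subprocess.STDOUT,  # merge streams; the table may go to stderr
        text=True,
        check=True,
    )
    return parse_mflops(result.stdout)


if __name__ == "__main__":
    print(f"logical CPUs reported by the OS: {os.cpu_count()}")
    if os.path.exists(BENCH_CMD[0]):
        for threads in (1, 2, 4):
            print(f"{threads} thread(s): {run_once(threads)} MFlops")
    else:
        print(f"{BENCH_CMD[0]} not found; build the OpenBLAS benchmarks first")
```

Sweeping 1, 2, and 4 threads rather than only 1 and 4 should make it easier to see whether scaling stops at the physical core count.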
