OpenBLAS and OpenMP in a single application


Sergey Serebryakov

Apr 13, 2015, 8:18:53 PM
to openbla...@googlegroups.com
Hi! There are a number of threads and discussions on the internet about using OpenBLAS and OpenMP together, but there is still something I cannot figure out. Suppose I compiled OpenBLAS with USE_THREAD=1 and USE_OPENMP=0 on an Intel Xeon E3-1240 CPU (HT enabled, 4 physical cores). My application is essentially a neural network that calls OpenBLAS sequentially.

The question is: can I use OpenMP to parallelize the fragments of code that do not call OpenBLAS functions, set the OMP_NUM_THREADS environment variable before running the application, and be sure that the OpenMP settings will not affect OpenBLAS performance and that the OpenBLAS library will not affect my OpenMP-enabled code (OPENBLAS_NUM_THREADS is always set)?

The reason I am asking is that I am observing some strange behavior I cannot clearly explain. With particular combinations of OPENBLAS_NUM_THREADS and OMP_NUM_THREADS values (for instance, 2 and 4 respectively), I sometimes observe a significant decrease in the performance of OpenMP-parallelized loops (orders of magnitude). In other cases, for instance when those values are 2 and 2, everything works as expected. By the way, doing export OMP_PROC_BIND=TRUE before running the application solves the problem (but the code still runs slightly slower than I would expect). It seems I am missing something here. Thank you.

Sergey Serebryakov

Apr 13, 2015, 9:39:59 PM
to openbla...@googlegroups.com
I just commented out all the OpenMP pragmas, and the application started to behave predictably: it is much faster, with performance increasing as the number of cores for OpenBLAS increases. It looks like OMP_NUM_THREADS is simply ignored, which is expected. The question remains, however: what is the correct way to use OpenMP?

Zhang Xianyi

Apr 14, 2015, 10:37:51 AM
to Sergey Serebryakov, openbla...@googlegroups.com
Hi Sergey,

When OpenBLAS is built with USE_THREAD=1 and USE_OPENMP=0, it uses pthreads to parallelize its functions.
When both OPENBLAS_NUM_THREADS and OMP_NUM_THREADS are set, OpenBLAS will use OPENBLAS_NUM_THREADS. If OPENBLAS_NUM_THREADS is unset, OpenBLAS will try to use OMP_NUM_THREADS.

When OpenBLAS is built with USE_THREAD=1 and USE_OPENMP=1, it uses OpenMP to parallelize its functions. Therefore, OpenBLAS depends only on OMP_NUM_THREADS.
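
As a rough illustration of the precedence described above, here is a small sketch in C. It is not OpenBLAS source code: the function names `parse_thread_count` and `expected_blas_threads`, the `fallback` parameter, and the upper bound of 4096 are all assumptions made for the example. The thread does not say what OpenBLAS does when neither variable is set, so the caller supplies that value.

```c
#include <assert.h>
#include <stdlib.h>

/* Hypothetical helper: parse a positive thread count from the text of an
   environment variable. Returns 0 if the text is NULL (variable unset) or
   does not start with a positive integer. */
static int parse_thread_count(const char *text) {
    if (text == NULL) return 0;
    char *end = NULL;
    long n = strtol(text, &end, 10);
    if (end == text || n <= 0 || n > 4096) return 0; /* 4096 is an arbitrary cap */
    return (int)n;
}

/* Models the precedence described in this post (an assumption-based sketch).
   built_with_openmp: nonzero for a USE_OPENMP=1 build.
   openblas_env / omp_env: values of OPENBLAS_NUM_THREADS / OMP_NUM_THREADS,
   or NULL if unset.
   fallback: thread count to report when no usable variable is set. */
static int expected_blas_threads(int built_with_openmp,
                                 const char *openblas_env,
                                 const char *omp_env,
                                 int fallback) {
    assert(fallback > 0);
    if (!built_with_openmp) {
        /* pthread build: OPENBLAS_NUM_THREADS wins when it is set. */
        int n = parse_thread_count(openblas_env);
        if (n > 0) return n;
    }
    /* OpenMP build, or pthread build without OPENBLAS_NUM_THREADS. */
    int n = parse_thread_count(omp_env);
    return n > 0 ? n : fallback;
}
```

A caller would typically pass `getenv("OPENBLAS_NUM_THREADS")` and `getenv("OMP_NUM_THREADS")` as the two string arguments. Taking strings rather than reading the environment inside the function keeps the sketch easy to check by hand.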



Case 1: Use OpenBLAS outside OpenMP region.
====================
#pragma omp parallel for
for(...)
{
...
}

cblas_sgemm();
====================
In this case, I think you can build OpenBLAS with either USE_OPENMP=0 or USE_OPENMP=1.

Case 2: Use OpenBLAS inside OpenMP region.
======================
#pragma omp parallel for
for(...)
{
 cblas_sgemm();
}
======================

In this case, I suggest you build OpenBLAS with USE_OPENMP=1 or as the single-threaded version (USE_THREAD=0).

Thank you

Xianyi


Dai Zander

Apr 21, 2015, 8:47:18 PM
to openbla...@googlegroups.com, serebryak...@gmail.com
Hi Xianyi,

I have a problem similar to Sergey's, but my application contains both of the cases you mentioned. Under such circumstances, do you have any specific suggestion?

What's more, you mentioned "For  OpenBLAS with USE_THREAD=1 and USE_OPENMP=1, it uses OpenMP to parallel the function. Therefore, OpenBLAS only depends on OMP_NUM_THREADS." I'm quite confused about what it actually means for OpenBLAS to use OpenMP to parallelize a function. Can you give a detailed example?

Thanks,
Zander

Zhang Xianyi

Apr 21, 2015, 11:17:38 PM
to Dai Zander, openbla...@googlegroups.com, Sergey Serebryakov
Hi Zander,


2015-04-21 19:47 GMT-05:00 Dai Zander <zande...@gmail.com>:
> Hi Xianyi,
>
> I actually have a similar problem as Sergey's. But in my application, I actually have both cases you mentioned. Under such circumstances, do you have any specific suggestion?

For your case, I suggest USE_OPENMP=1.
 

> What's more, you mentioned "For  OpenBLAS with USE_THREAD=1 and USE_OPENMP=1, it uses OpenMP to parallel the function. Therefore, OpenBLAS only depends on OMP_NUM_THREADS." I'm quite confused what it actually means that OpenBLAS uses OpenMP to parallel the function. Can you give a detailed example about this?

The OpenBLAS library has two parallel implementations: pthreads and OpenMP. By default, OpenBLAS uses pthreads to parallelize BLAS functions, e.g. cblas_sgemm.

With USE_OPENMP=1, OpenBLAS will use OpenMP instead. OMP_NUM_THREADS is then the environment variable that controls the number of threads.

Dai Zander

Apr 21, 2015, 11:26:34 PM
to openbla...@googlegroups.com, zande...@gmail.com
Thanks, Xianyi. One more question. Since my application also uses the OMP_NUM_THREADS environment variable to control other multi-threaded parallelization, does this mean that the variable is shared by OpenBLAS and my other code?

Also, I call the run-time function openblas_set_num_threads(1) to set the number of OpenBLAS threads to 1. Will this be ignored in this case?

In my mind, the perfect situation will be, when I use 
--------------------
#pragma omp parallel for
for(...) {
    cblas_sgemv(...)
}
--------------------
OpenBLAS runs in a single-thread mode, and when I use
--------------------
#pragma omp parallel for
for(...) {
    ...
}
cblas_sgemv(...)
--------------------
OpenBLAS runs in multi-thread mode. 

I don't know whether it's possible to achieve that. 

Again, thanks so much for the reply.

Zander

Zhang Xianyi

Apr 22, 2015, 12:07:09 AM
to Dai Zander, openbla...@googlegroups.com
2015-04-21 22:26 GMT-05:00 Dai Zander <zande...@gmail.com>:
> Thanks Xianyi. One more probing. Since in my application, I also use the environment variable OMP_NUM_THREADS to control other multi-thread parallelization, does this mean this environment variable is shared by both OpenBLAS and my other code?

Yes, it is shared.
 

> Also, I used the run-time function openblas_set_num_threads(1) to set the OpenBLAS thread number to be 1. But in this case, will this be ignored?

I think you don't need openblas_set_num_threads(1), for the following reason.

For example, with the OpenMP parallel model, cblas_sgemv is implemented roughly as

void cblas_sgemv(...)
{
#pragma omp parallel for
for(...) {
}
}


 

> In my mind, the perfect situation will be, when I use
> --------------------
> #pragma omp parallel for
> for(...) {
>     cblas_sgemv(...)
> }
> --------------------
> OpenBLAS runs in a single-thread mode, and when I use

In this case, cblas_sgemv is called inside an omp parallel region, which is equivalent to

--------------------
#pragma omp parallel for
for(...) {
   //cblas_sgemv OpenMP implementation
    #pragma omp parallel for
    for(...) { 
     }
}
--------------------

By default, nested OpenMP parallelism is disabled. Thus, the inner omp parallel region (cblas_sgemv) will use only one thread.

 
> --------------------
> #pragma omp parallel for
> for(...) {
>     ...
> }
> cblas_sgemv(...)
> --------------------
> OpenBLAS runs in multi-thread mode.


In this case, cblas_sgemv will use OpenMP multi-threading. The number of threads is equal to OMP_NUM_THREADS.
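
The two situations above can be observed with a small stand-alone sketch that uses only OpenMP and no OpenBLAS calls. The function `kernel_team_size` is a hypothetical stand-in for an OpenMP-parallelized BLAS routine, not the real cblas_sgemv. The sketch calls omp_set_max_active_levels(1) explicitly so that nesting is off regardless of the runtime's defaults, and the `#ifdef _OPENMP` guards let it compile without OpenMP support (in which case every team size is 1).

```c
#include <assert.h>
#ifdef _OPENMP
#include <omp.h>
#endif

/* Hypothetical stand-in for an OpenMP-parallelized BLAS routine.
   Returns the number of threads its own parallel region received. */
static int kernel_team_size(void) {
    int team = 1;
#ifdef _OPENMP
#pragma omp parallel
    {
        /* One thread of the team records the team size. */
#pragma omp single
        team = omp_get_num_threads();
    }
#endif
    return team;
}

/* Calls the kernel from inside a parallel loop and returns the largest
   inner team size seen across all iterations. */
static int max_team_size_inside_loop(int iterations) {
    assert(iterations > 0);
    int max_team = 0;
#ifdef _OPENMP
    /* Allow only one active level of parallelism (nesting disabled). */
    omp_set_max_active_levels(1);
#endif
#pragma omp parallel for
    for (int i = 0; i < iterations; ++i) {
        int team = kernel_team_size();
#pragma omp critical
        if (team > max_team) max_team = team;
    }
    return max_team;
}
```

Built with OpenMP enabled (for example with -fopenmp on GCC) and run with OMP_NUM_THREADS greater than 1, kernel_team_size() called from the top level would be expected to report a team of several threads, while max_team_size_inside_loop(8) would be expected to report 1, because the outer loop already occupies the single active level.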

Dai Zander

Apr 22, 2015, 12:09:51 AM
to openbla...@googlegroups.com, zande...@gmail.com
Xianyi, thanks so much! The answer is super neat and helpful. 

Zihang