Hello,
I am doing high-level multithreading with the C++11 feature std::thread on an Odroid U3 and an Odroid XU3, both of which have an ARM CPU.
I am using OpenBLAS for matrix multiplications and other operations.
I have one main thread which starts a worker thread (using std::thread). Then some work gets done in parallel in the main thread (I call it foreground processing) and in the worker thread (I call it background processing), before both are synchronized again (using std::thread::join).
Both (foreground and background processing) call BLAS functions.
I measure three timings: foreground processing, background processing, and the time between "before starting the worker thread" and "after synchronizing main and worker thread"; let's call the last one timer algo.
So timer algo = max(timer foreground processing, timer background processing) + overhead
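To make the measurement concrete, here is a minimal sketch of roughly how the three timers are taken. It is not my actual code: dummy_work is a hypothetical stand-in for the real BLAS calls, and the names Timings, measure and elapsed_ms are made up for illustration.

```cpp
#include <algorithm>
#include <chrono>
#include <cstddef>
#include <thread>
#include <vector>

using Clock = std::chrono::steady_clock;

// Hypothetical stand-in for the real BLAS work: a naive n x n matrix product.
static double dummy_work(std::size_t n) {
    std::vector<double> a(n * n, 1.0), b(n * n, 2.0), c(n * n, 0.0);
    for (std::size_t i = 0; i < n; ++i)
        for (std::size_t k = 0; k < n; ++k)
            for (std::size_t j = 0; j < n; ++j)
                c[i * n + j] += a[i * n + k] * b[k * n + j];
    return c[0];
}

struct Timings {
    double foreground_ms;  // work done in the main thread
    double background_ms;  // work done in the worker thread
    double algo_ms;        // from before thread start to after join
    double overhead_ms;    // algo - max(foreground, background)
};

static double elapsed_ms(Clock::time_point start, Clock::time_point stop) {
    return std::chrono::duration<double, std::milli>(stop - start).count();
}

Timings measure(std::size_t n) {
    Timings t{};
    volatile double sink = 0.0;  // keeps the compiler from discarding the work

    const auto algo_start = Clock::now();

    // Background processing: timed inside the worker thread itself.
    std::thread worker([&t, n] {
        const auto bg_start = Clock::now();
        volatile double result = dummy_work(n);
        (void)result;
        t.background_ms = elapsed_ms(bg_start, Clock::now());
    });

    // Foreground processing: runs in the main thread in parallel.
    const auto fg_start = Clock::now();
    sink = dummy_work(n);
    t.foreground_ms = elapsed_ms(fg_start, Clock::now());

    worker.join();  // synchronize main and worker thread
    t.algo_ms = elapsed_ms(algo_start, Clock::now());

    // Whatever is not explained by the slower of the two parts.
    t.overhead_ms = t.algo_ms - std::max(t.foreground_ms, t.background_ms);
    return t;
}
```

Calling measure(n) for some matrix size n returns all three timers plus the derived overhead. In this sketch the overhead covers thread creation, the join, and any time the two threads spend waiting on each other.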
Now, if I use OpenBLAS with multithreading, "timer foreground processing" and "timer background processing" are slightly lower than with OpenBLAS without multithreading.
But the overhead becomes very high when I use OpenBLAS with multithreading. This is not the case with OpenBLAS without multithreading, where the overhead is very small.
For "OpenBLAS with multithreading", I compile the library simply with default settings.
For "OpenBLAS without multithreading", I changed the following entries in the Makefile.rule file before compiling:
USE_THREAD = 0
USE_OPENMP = 0
NUM_THREADS = 1
COMMON_OPT = -O3
Do you have any hints or explanations as to why the overhead is so high when using OpenBLAS with multithreading in conjunction with std::thread?
Thanks a lot in advance,
Johannes