Do we need to set OMP_NUM_THREADS, OMP_DYNAMIC, etc. in our application?


C B

Jan 5, 2021, 7:13:59 PM
to amgcl
Hello AMGCL users!

I ran a test to find out whether, in my application of AMGCL, the wall time changes when setting OMP_NUM_THREADS, because I noticed that a couple of tutorials set OMP_NUM_THREADS as an environment variable.

In my simple test (average size for me, details below), running on a CPU with 8 cores, I find that most of the gain is achieved with 2 threads; with 3 threads I get the minimum wall time, and with 4 threads the wall time is almost the same as with 3.
If I leave OMP_NUM_THREADS unset in the environment, AMGCL seems to use more threads, because the %CPU rises even above what I see with 4 threads, but the wall time is not better; it is actually a little worse than with 4 threads.
(BTW: I also checked setting OMP_DYNAMIC=TRUE/FALSE and could not notice any difference.)

I have a couple of questions:
Are there any guidelines for setting OMP_NUM_THREADS, or any other OpenMP settings?
Does AMGCL choose an optimal value when the user does not force an OMP_NUM_THREADS value?
(I tried to figure this out myself by searching the sources for omp_set strings, but I could not find any.)
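For what it's worth, a minimal OpenMP query like the sketch below (plain OpenMP, nothing AMGCL-specific) shows what the runtime picks up when OMP_NUM_THREADS is left unset:

#include <cstdio>
#include <omp.h>

int main() {
    // Upper bound on the team size for the next parallel region:
    // honors OMP_NUM_THREADS if set, otherwise the implementation
    // default (typically the number of logical processors).
    std::printf("omp_get_max_threads() = %d\n", omp_get_max_threads());

    // Whether dynamic adjustment of the team size is enabled
    // (reflects OMP_DYNAMIC).
    std::printf("omp_get_dynamic()     = %d\n", omp_get_dynamic());

    // Actual team size inside a parallel region.
    #pragma omp parallel
    #pragma omp single
    std::printf("omp_get_num_threads() = %d\n", omp_get_num_threads());

    return 0;
}

Compiled with g++ -fopenmp, this would explain the higher %CPU I saw: without OMP_NUM_THREADS, the runtime typically defaults to all logical processors.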

Please let me know what is best.
Cheers,


Type:             BiCGStab
Unknowns:         5550680
Memory footprint: 296.44 M

Preconditioner
==============
Number of levels:    4
Operator complexity: 1.55
Grid complexity:     1.12
Memory footprint:    1.82 G

level     unknowns       nonzeros      memory
---------------------------------------------
    0      5550680       38043526      1.40 G (64.58%)
    1       610948       18897200    390.08 M (32.08%)
    2        34282        1878536     33.93 M ( 3.19%)
    3         2028          87572      9.04 M ( 0.15%)

Denis Demidov

Jan 6, 2021, 12:57:08 AM
to amgcl
AMGCL performance, as with most iterative methods, is memory-bound. That is, it depends more on memory bandwidth than on CPU speed. Since you are getting 2x speedup, I would guess you have 2 memory channels in your system. This is also the case with my CPU (look for "memory channels" on this page).

Controlling CPU affinity and memory allocation becomes much more important on NUMA systems. There, you could use the OMP_PLACES, OMP_SCHEDULE, and OMP_WAIT_POLICY environment variables, and the numactl tool, to improve performance. On a Windows system, OMP_NUM_THREADS may be used to avoid oversubscribing your machine in case other processes are using one or more CPU cores.
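For example, a quick way to check where the threads actually land (a sketch; sched_getcpu() is Linux-specific) is:

#include <cstdio>
#include <omp.h>
#include <sched.h> // sched_getcpu(), Linux-specific

int main() {
    #pragma omp parallel
    {
        // Report the logical CPU each OpenMP thread is running on.
        std::printf("thread %d of %d on cpu %d\n",
                    omp_get_thread_num(),
                    omp_get_num_threads(),
                    sched_getcpu());
    }
    return 0;
}

Running it as, e.g., OMP_NUM_THREADS=4 OMP_PROC_BIND=close OMP_PLACES=cores ./a.out makes it easy to verify that the placement settings take effect.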

C B

Jan 6, 2021, 12:57:42 PM
to Denis Demidov, amgcl
Denis,
Thank you very much for this information. This is very instructive!
To check how memory-bound the AMGCL Poisson example code is,
I ran the same executable as 1, 2, 3, and 4 simultaneous instances (I just sent them to the background and waited for them to finish).
I checked that the wall-time differences among the "identical" jobs were very small, so all simultaneous jobs ran in pretty much the same wall time.
Processor Intel(R) Core(TM) i7-3940XM CPU @ 3.00GHz, 3190 Mhz, 4 Core(s), 8 Logical Processor(s)
32 GB of DRAM (low RAM use, not paging)
[image.png: table of wall times for 1-4 simultaneous jobs]

With OMP_NUM_THREADS=1 the wall time is about the same for up to 3 simultaneous jobs,
and with OMP_NUM_THREADS=2 it is about the same for up to 2 simultaneous jobs.
Also, for OMP_NUM_THREADS=2 the wall time roughly doubles when going from 1 to 4 simultaneous jobs.
Can we say from this that for this job size the code is 50% memory bound or something along these lines?
And I wonder whether it is possible to estimate the wall-time reduction from running this code on a workstation with 4, 6, or 8 memory channels?
Can we use Linux's perf or any other profiler to figure this out?

Thanks for your advice!
Cheers,



Denis Demidov

Jan 7, 2021, 1:26:55 AM
to amgcl
On Wednesday, January 6, 2021 at 8:57:42 PM UTC+3 cebau...@gmail.com wrote:
> With OMP_NUM_THREADS=1 the wall time is about the same for up to 3 simultaneous jobs,
> and with OMP_NUM_THREADS=2 it is about the same for up to 2 simultaneous jobs.
> Also, for OMP_NUM_THREADS=2 the wall time roughly doubles when going from 1 to 4 simultaneous jobs.
> Can we say from this that for this job size the code is 50% memory bound or something along these lines?

I am not sure how this relates to memory throughput, or what "50% memory bound" means.

I think what you are seeing here is the fact that embarrassingly parallel code (independent instances of the same program) scales better than code with some serial sections.
 
> And I wonder whether it is possible to estimate the wall-time reduction from running this code on a workstation with 4, 6, or 8 memory channels?

There is a benchmarks page in the amgcl docs with some scaling data, both for shared-memory and distributed-memory computations.

 
> Can we use Linux's perf or any other profiler to figure this out?


A good way to test the scalability of your memory subsystem would be something like the STREAM benchmark.
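If you do not want to build the full benchmark, a minimal STREAM-style triad in C++/OpenMP already shows how bandwidth scales with the number of threads (a sketch; the array size is an arbitrary assumption, just large enough to fall out of cache):

#include <chrono>
#include <cstdio>
#include <vector>

int main() {
    const long n = 50000000; // ~400 MB per array, well out of cache
    const double scalar = 3.0;

    // Note: on a NUMA system you would want parallel first-touch
    // initialization here; on a single-socket machine this is fine.
    std::vector<double> a(n, 1.0), b(n, 2.0), c(n, 0.5);

    const auto t0 = std::chrono::steady_clock::now();
    #pragma omp parallel for
    for (long i = 0; i < n; ++i)
        a[i] = b[i] + scalar * c[i]; // triad: two loads + one store
    const auto t1 = std::chrono::steady_clock::now();

    const double sec = std::chrono::duration<double>(t1 - t0).count();
    const double gbytes = 3.0 * n * sizeof(double) / 1e9; // data moved
    std::printf("triad: %.3f s, %.2f GB/s\n", sec, gbytes / sec);
    return 0;
}

Compile with g++ -O3 -fopenmp and run with OMP_NUM_THREADS=1,2,3,...; the reported bandwidth should stop improving once the memory channels are saturated, which is the same ceiling AMGCL runs into.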

C B

Jan 7, 2021, 6:00:05 PM
to Denis Demidov, amgcl
Denis,

Regarding: I am not sure how this relates to memory throughput, or what "50% memory bound" means.

My idea was that if the application is "memory bound", in the sense that the wall time depends on cache/DRAM latency, bandwidth, etc., then when I run multiple instances the wall time should increase in proportion to the number of simultaneous (independent) jobs running. But I found that with OMP_NUM_THREADS=1 I could run up to 4 simultaneous jobs with little impact on the wall time, so it looks like, for the job type/size that I used, it is not so memory-bound.
Then, with OMP_NUM_THREADS=2, the wall time changed little with up to 2 simultaneous jobs, and with 4 simultaneous jobs it increased by a factor of 66/41 = 1.6, i.e., by 60% instead of 100%, so I thought perhaps we could say that the wall-time dependence on memory bandwidth is only 60%.

Regarding: I think what you are seeing here is the fact that embarrassingly parallel code (independent instances of the same program) scales better than code with some serial sections.

I don't understand your comment. My idea was that if the computer has many cores, and we think a job's wall time depends on memory bandwidth, then we should be able to see the actual dependence by running several identical jobs simultaneously: if the wall time is proportional to the number of jobs, then we can say this type of job is 100% memory-bound (this assumes the time spent doing actual CPU work is low compared with the time spent getting data in and out of the CPU).
But then again, there may be many other factors that I am not aware of :).

Thanks again for your help.
Cheers,
