Do we need to set OMP_NUM_THREADS, OMP_DYNAMIC, etc. in our application?


C B

Jan 5, 2021, 7:13:59 PM
to amgcl
Hello AMGCL users!

I ran a test to find out whether, in my application of AMGCL, the wall time changes when setting OMP_NUM_THREADS, because I noticed that a couple of tutorials set OMP_NUM_THREADS as an environment variable.

In my simple test (average size for me, details below), running on a CPU with 8 cores, I find that most of the gain is achieved with 2 threads; with 3 threads I get the minimum wall time, and with 4 threads the wall time is almost the same as with 3.
If I leave OMP_NUM_THREADS unset in the environment, AMGCL seems to use more threads, because the %CPU rises even above what I see with 4 threads, but the wall time is not better; it is actually a little worse than with 4 threads.
(BTW: I also checked setting OMP_DYNAMIC=TRUE/FALSE and could not notice any difference.)

I have a couple of questions:
Are there any guidelines for setting OMP_NUM_THREADS, or any other OpenMP settings?
Does AMGCL choose an optimal value when the user does not force an OMP_NUM_THREADS value?
(I tried to figure this out myself by searching the sources for omp_set strings, but I could not find any.)
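For what it's worth, a minimal OpenMP query like the sketch below (plain OpenMP, nothing AMGCL-specific) shows what the runtime picks up when OMP_NUM_THREADS is left unset:

#include <cstdio>
#include <omp.h>

int main() {
    // Upper bound on the team size for the next parallel region:
    // honors OMP_NUM_THREADS if set, otherwise the implementation
    // default (typically the number of logical processors).
    std::printf("omp_get_max_threads() = %d\n", omp_get_max_threads());

    // Whether dynamic adjustment of the team size is enabled
    // (reflects OMP_DYNAMIC).
    std::printf("omp_get_dynamic()     = %d\n", omp_get_dynamic());

    // Actual team size inside a parallel region.
    #pragma omp parallel
    #pragma omp single
    std::printf("omp_get_num_threads() = %d\n", omp_get_num_threads());

    return 0;
}

Compiled with g++ -fopenmp, this would explain the higher %CPU I saw: without OMP_NUM_THREADS, the runtime typically defaults to all logical processors.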

Please let me know what is best.
Cheers,


Type:             BiCGStab
Unknowns:         5550680
Memory footprint: 296.44 M

Preconditioner
==============
Number of levels:    4
Operator complexity: 1.55
Grid complexity:     1.12
Memory footprint:    1.82 G

level     unknowns       nonzeros      memory
---------------------------------------------
    0      5550680       38043526      1.40 G (64.58%)
    1       610948       18897200    390.08 M (32.08%)
    2        34282        1878536     33.93 M ( 3.19%)
    3         2028          87572      9.04 M ( 0.15%)

Denis Demidov

Jan 6, 2021, 12:57:08 AM
to amgcl
AMGCL performance, as with most iterative methods, is memory-bound. That is, it depends more on memory bandwidth than on CPU speed. Since you are getting 2x speedup, I would guess you have 2 memory channels in your system. This is also the case with my CPU (look for "memory channels" on this page).

Controlling CPU affinity and memory allocation becomes much more important on NUMA systems. There, you could use the OMP_PLACES, OMP_SCHEDULE, and OMP_WAIT_POLICY environment variables, and the numactl tool, to improve performance. On a Windows system, OMP_NUM_THREADS may be used to avoid oversubscribing your machine in case other processes are using one or more CPU cores.
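For example, a quick way to check where the threads actually land (a sketch; sched_getcpu() is Linux-specific) is:

#include <cstdio>
#include <omp.h>
#include <sched.h> // sched_getcpu(), Linux-specific

int main() {
    #pragma omp parallel
    {
        // Report the logical CPU each OpenMP thread is running on.
        std::printf("thread %d of %d on cpu %d\n",
                    omp_get_thread_num(),
                    omp_get_num_threads(),
                    sched_getcpu());
    }
    return 0;
}

Running it as, e.g., OMP_NUM_THREADS=4 OMP_PROC_BIND=close OMP_PLACES=cores ./a.out makes it easy to verify that the placement settings take effect.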

C B

Jan 6, 2021, 12:57:42 PM
to Denis Demidov, amgcl
Denis,
Thank you very much for this information. This is very instructive!
To check how memory-bound the AMGCL Poisson example code is,
I ran the same executable as 1, 2, 3, and 4 simultaneous instances (I just sent them to the background and waited for them to finish).
I checked that the wall-time differences among the "identical" jobs were very small, so all simultaneous jobs ran in pretty much the same wall time.
Processor Intel(R) Core(TM) i7-3940XM CPU @ 3.00GHz, 3190 Mhz, 4 Core(s), 8 Logical Processor(s)
32 GB of DRAM (low RAM use, not paging)
[image.png: table of wall times for 1-4 simultaneous jobs]

With OMP_NUM_THREADS=1 the wall time is about the same for up to 3 simultaneous jobs,
and with OMP_NUM_THREADS=2 it is about the same for up to 2 simultaneous jobs.
Also, for OMP_NUM_THREADS=2 the wall time roughly doubles when going from 1 to 4 simultaneous jobs.
Can we say from this that for this job size the code is 50% memory bound or something along these lines?
And I wonder whether it is possible to estimate the wall-time reduction from running this code on a workstation with 4, 6, or 8 memory channels?
Can we use Linux's perf or any other profiler to figure this out?

Thanks for your advice!
Cheers,



Denis Demidov

Jan 7, 2021, 1:26:55 AM
to amgcl
On Wednesday, January 6, 2021 at 8:57:42 PM UTC+3 cebau...@gmail.com wrote:
> With OMP_NUM_THREADS=1 the wall time is about the same for up to 3 simultaneous jobs,
> and with OMP_NUM_THREADS=2 it is about the same for up to 2 simultaneous jobs.
> Also, for OMP_NUM_THREADS=2 the wall time roughly doubles when going from 1 to 4 simultaneous jobs.
> Can we say from this that for this job size the code is 50% memory bound or something along these lines?

I am not sure how this relates to memory throughput, or what "50% memory bound" means.

I think what you are seeing here is the fact that embarrassingly parallel code (independent instances of the same program) scales better than code with some serial sections.
 
> And I wonder whether it is possible to estimate the wall-time reduction from running this code on a workstation with 4, 6, or 8 memory channels?

There is a benchmarks page in the amgcl docs with some scaling data, both for shared-memory and distributed-memory computations.

 
> Can we use Linux's perf or any other profiler to figure this out?


A good way to test the scalability of your memory subsystem would be something like the STREAM benchmark.
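If you do not want to build the full benchmark, a minimal STREAM-style triad in C++/OpenMP already shows how bandwidth scales with the number of threads (a sketch; the array size is an arbitrary assumption, just large enough to fall out of cache):

#include <chrono>
#include <cstdio>
#include <vector>

int main() {
    const long n = 50000000; // ~400 MB per array, well out of cache
    const double scalar = 3.0;

    // Note: on a NUMA system you would want parallel first-touch
    // initialization here; on a single-socket machine this is fine.
    std::vector<double> a(n, 1.0), b(n, 2.0), c(n, 0.5);

    const auto t0 = std::chrono::steady_clock::now();
    #pragma omp parallel for
    for (long i = 0; i < n; ++i)
        a[i] = b[i] + scalar * c[i]; // triad: two loads + one store
    const auto t1 = std::chrono::steady_clock::now();

    const double sec = std::chrono::duration<double>(t1 - t0).count();
    const double gbytes = 3.0 * n * sizeof(double) / 1e9; // data moved
    std::printf("triad: %.3f s, %.2f GB/s\n", sec, gbytes / sec);
    return 0;
}

Compile with g++ -O3 -fopenmp and run with OMP_NUM_THREADS=1,2,3,...; the reported bandwidth should stop improving once the memory channels are saturated, which is the same ceiling AMGCL runs into.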

C B

Jan 7, 2021, 6:00:05 PM
to Denis Demidov, amgcl
Denis,

Regarding: I am not sure how this relates to memory throughput, or what "50% memory bound" means.

My idea was that if the application is "memory bound", in the sense that the wall time depends on cache/DRAM latency, bandwidth, etc., then when I run multiple instances the wall time should increase in proportion to the number of simultaneous (independent) jobs running. But I found that with OMP_NUM_THREADS=1 I could run up to 4 simultaneous jobs with little impact on the wall time, so it looks like, for the job type/size that I used, it is not so memory-bound.
Then, with OMP_NUM_THREADS=2, the wall time changed little with up to 2 simultaneous jobs, and with 4 simultaneous jobs it increased by a factor of 66/41 = 1.6, i.e., by 60% instead of 100%, so I thought perhaps we could say that the wall-time dependence on memory bandwidth is only 60%.

Regarding: I think what you are seeing here is the fact that embarrassingly parallel code (independent instances of the same program) scales better than code with some serial sections.

I don't understand your comment. My idea was that if the computer has many cores, and we think a job's wall time depends on memory bandwidth, then we should be able to see the actual dependence by running several identical jobs simultaneously: if the wall time is proportional to the number of jobs, then we can say this type of job is 100% memory-bound (this assumes the time spent doing actual CPU work is low compared with the time spent getting data in and out of the CPU).
But then again, there may be many other factors that I am not aware of :).

Thanks again for your help.
Cheers,
