Mixed-mode MPI/OpenMP, extra threads


craigwa...@gmail.com

Sep 15, 2015, 3:47:52 AM
to mpi4py
Hi,

I am using mpi4py in a mixed MPI/OpenMP setting. The MPI bit is a simple task farm with a master and workers. Each worker gets a job that has been parallelised using OpenMP. I'm trying to specify a single thread for the master (as it only farms out tasks) and multiple threads for the workers. I set the threads for the master using:

if rank == 0:
    os.environ['OMP_NUM_THREADS'] = '1'

I call mpirun like:

mpirun -np 2 -x OMP_NUM_THREADS=2 python3 -m gprMax

I would expect 1 master with 1 thread and 1 worker with 2 threads. However, I appear to be getting (as reported in Activity Monitor) a master with 2 threads and a worker with 3 threads! top reports 2/1 and 3/2 threads. When I print os.environ.get('OMP_NUM_THREADS') from the master and workers, they report correctly.

Am I misunderstanding the reporting?

I'm testing this on an 8-core Mac Pro with Mac OS X 10.10.5, OpenMPI 1.10.0, and mpi4py 1.3.1.

Cheers,

Craig

Lisandro Dalcin

Sep 15, 2015, 4:04:15 AM
to mpi4py
On 15 September 2015 at 01:02, <craigwa...@gmail.com> wrote:
> Hi,
>
> I am using mpi4py in a mixed MPI/OpenMP setting. The MPI bit is a simple
> task farm with a master and workers. Each workers gets a job that has been
> parallelised using OpenMP. I'm trying to specify a single thread for the
> master (as it only farms out tasks) and multiple threads for the workers. I
> set threads for the master using:
>
> if rank == 0
> os.environ['OMP_NUM_THREADS'] = '1'
>

My guess is that this hack does not work: by the time you change the
environment variable, the OpenMP runtime is already initialized. Note,
however, that I'm not sure exactly how this works, so I might be wrong.

There are a couple of additional approaches you should investigate:

1) Look at your mpirun docs for how to set specific environment
variables for the different processes mpirun spawns.
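
For instance, with Open MPI you can try the colon-separated MPMD syntax
to give each group of ranks its own -x setting. A sketch (I haven't
tested it), reusing the same ``python3 -m gprMax`` invocation as above:

```shell
# Rank 0 (master) requests 1 OpenMP thread; the remaining rank gets 2.
mpirun -np 1 -x OMP_NUM_THREADS=1 python3 -m gprMax : \
       -np 1 -x OMP_NUM_THREADS=2 python3 -m gprMax
```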

2) Use the ctypes module to dlopen the OpenMP runtime shared library,
and then call the routine ``omp_set_num_threads()`` to ask for one
thread on rank 0.
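
A minimal sketch of option 2; note that the library name is an
assumption and varies by compiler (GCC's libgomp, LLVM's libomp,
Intel's libiomp5), so you may need to adjust it for your build:

```python
import ctypes
import ctypes.util

# Locate an OpenMP runtime; the name depends on the compiler that
# built your extension (gomp for GCC, omp for LLVM/clang).
libname = ctypes.util.find_library('gomp') or ctypes.util.find_library('omp')

if libname is not None:
    openmp = ctypes.CDLL(libname)
    openmp.omp_set_num_threads(1)  # ask for a single thread on this rank
    print('Requested 1 OpenMP thread via', libname)
else:
    print('No OpenMP runtime found; nothing to do')
```

You would run this on rank 0 only, before entering any parallel region.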

--
Lisandro Dalcin
============
Research Scientist
Computer, Electrical and Mathematical Sciences & Engineering (CEMSE)
Numerical Porous Media Center (NumPor)
King Abdullah University of Science and Technology (KAUST)
http://numpor.kaust.edu.sa/

4700 King Abdullah University of Science and Technology
al-Khawarizmi Bldg (Bldg 1), Office # 4332
Thuwal 23955-6900, Kingdom of Saudi Arabia
http://www.kaust.edu.sa

Office Phone: +966 12 808-0459

craigwa...@gmail.com

Sep 15, 2015, 6:03:58 AM
to mpi4py
Lisandro, thanks for the advice. I have stopped setting OMP_NUM_THREADS by the os.environ method for now. However... 

If I use my program with only OpenMP, the environment variable OMP_NUM_THREADS is correctly picked up, and the correct number of threads generated.

With OpenMP/MPI, if I either set OMP_NUM_THREADS in the environment prior to calling mpirun or pass it in using the -x flag, then I get extra threads, e.g.

mpirun -np 2 -x OMP_NUM_THREADS=2 python3 -m gprMax

gives a master with 3 threads and a worker with 3 threads.

I have tried some of the more explicit options for mpirun from https://www.olcf.ornl.gov/kb_articles/parallel-job-execution-on-commodity-clusters/ but I always get extra threads.

Craig

Yury V. Zaytsev

Sep 15, 2015, 6:40:41 AM
to mpi...@googlegroups.com
On Tue, 2015-09-15 at 03:03 -0700, craigwa...@gmail.com wrote:
>
> gives a master with 3 threads and a worker with 3 threads.

So did you try Lisandro's suggestion to use omp_set_num_threads() ?

Also, I would check what omp_get_num_threads() returns afterwards from a
parallel block to see whether it has worked. I've provided sample code
below; you can wrap it using ctypes, or Cython, or else cffi. I'm not
sure what your tools are reporting, or whether you are interpreting what
they report correctly...

Finally, it's generally not a very good idea to make the master use
fewer threads than the rest in an OpenMP / MPI scenario. If you are
running it on one machine, it won't make a difference anyway, because
creating the small number of extra threads wouldn't take much time, and
they'll just keep idling. If you are running on a large cluster, then
breaking the symmetry might lead to unexpected interactions with the
queuing system, and it's often difficult or impossible to request such
allocations anyway.

Sample code:

#ifdef _OPENMP
#include <omp.h>
#endif

size_t omp_set_num_threads_native(const size_t n) {

    size_t result = 1;

#ifdef _OPENMP
    omp_set_num_threads((int) n);

    #pragma omp parallel
    {
        result = (size_t) omp_get_num_threads();
    }
#endif

    return result;
}

--
Sincerely yours,
Yury V. Zaytsev


craigwa...@gmail.com

Sep 15, 2015, 7:43:58 AM
to mpi4py
Hi Yury,

I have tried checking omp_get_num_threads() from a parallel block (I'm using Cython) and it correctly reports the specified number of threads. I am testing on my own machine prior to moving to our cluster. I am happy to allow the master to use the same number of threads as the workers, but even this is not happening. I have tried setting omp_set_dynamic(0) but it did not help. My feeling is that this is something to do with the MPI bit, as the code works OK with OpenMP only.

Regards,

Craig

Yury V. Zaytsev

Sep 15, 2015, 8:52:27 AM
to mpi...@googlegroups.com
Well, if omp_get_num_threads() reports the right number of threads, then
I'm not sure that you are interpreting what your other tools are showing
you correctly, or else maybe the MPI implementation adds one extra
thread or something like that? Do you get at least the right speedup?

Lisandro Dalcin

Sep 15, 2015, 10:23:29 AM
to mpi4py
Indeed, the MPI implementation may be spawning background threads to
support asynchronous progress.

craigwa...@gmail.com

Sep 15, 2015, 1:29:29 PM
to mpi4py
Yes it would seem extra threads are spawned. I tested with a very basic implementation (no threads at all):

from mpi4py import MPI
from time import sleep


comm = MPI.COMM_WORLD
nprocs = comm.Get_size()
rank = comm.Get_rank()

if rank == 0:
    data = 'Hello!'
    comm.send(data, dest=nprocs-1, tag=1)
elif rank == nprocs-1:
    data = comm.recv(source=0, tag=1)
    sleep(30)

print('Rank {}, received {}'.format(rank, data))

and when launched with 
mpirun -np 2 python3 testmpi.py

I get two Python processes with 2 threads each and the orterun process with 2 threads.
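
Incidentally, one check I found useful: top and Activity Monitor count OS-level threads, which include any helper threads the MPI runtime creates outside Python, whereas Python's threading module only sees Python-level threads. A quick cross-check from inside the process:

```python
import threading

# Count only the threads the Python interpreter knows about; OpenMP and
# MPI runtime helper threads are created outside Python, so they will
# not appear in this count even though top includes them.
print('Python-level threads:', threading.active_count())
```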