Intel MPI + OpenMP + CUDA issue

Ai Haike

Dec 2, 2015, 10:52:11 PM
to hoomd-users
Hey folks,

I'm testing the application on our HPC facility.
Pure MPI jobs run well; however, when I try MPI+OpenMP or MPI+OpenMP+CUDA, it fails.
What I mean by that is that I get only one OpenMP thread per MPI process.
We have two K20s and two CPUs per node. I usually run two multi-threaded MPI processes per node, so that each MPI process takes care of one GPU and one CPU.
With HOOMD-blue only one GPU is used (meaning that one MPI process doesn't use any GPU), and only one CPU core is used per MPI process.

So, I'm wondering whether it's possible to have HOOMD-blue use one GPU card per multi-threaded MPI process.

Thank you,

              Eric.

Joshua Anderson

Dec 3, 2015, 6:19:33 AM
to hoomd...@googlegroups.com
HOOMD >=1.0 does not use OpenMP for threading at all. Submit one MPI process per GPU (2 per node in your case) and HOOMD will use both GPUs effectively.
------
Joshua A. Anderson, Ph.D.
Research Area Specialist, Chemical Engineering, University of Michigan
Phone: 734-647-8244
http://www-personal.umich.edu/~joaander/

Ai Haike

Dec 3, 2015, 8:13:50 PM
to hoomd-users
Thanks Joshua,

That's what I'm doing actually, but only one GPU (id=0) is used, even though memory is allocated on both of them.
Any plans to implement an OpenMP version in the near future?

Jens Glaser

Dec 3, 2015, 8:17:19 PM
to hoomd...@googlegroups.com
HOOMD should run on two GPUs without problems. The issue you are describing could be a result of using
Intel MPI combined with the fact that your GPUs may not be set to compute-exclusive mode.

No plans currently to support OpenMP again … you can still use MPI on the CPU cores.

Jens

Joshua Anderson

Dec 4, 2015, 6:43:48 AM
to hoomd...@googlegroups.com
HOOMD prints diagnostic messages to aid in debugging this. In a single-node job, run `mpirun -n 2 hoomd script.py` and send us the first part of the output, including the version identification line and the "HOOMD-blue is running on" information. There may be an issue with automatic GPU assignment within a node. That has not been tested with Intel MPI.

For example, this is what I get on my development box:

```
HOOMD-blue 1.2.1-unknown CUDA (7.5) DOUBLE MPI SSE SSE2 SSE3 SSE4_1 SSE4_2 AVX
......
HOOMD-blue is running on the following GPU(s):
Rank 0: [0] Quadro M6000 24 SM_5.2 @ 1.11 GHz, 12287 MiB DRAM
Rank 1: [1] Tesla K40c 15 SM_3.5 @ 0.876 GHz, 11519 MiB DRAM, DIS
HOOMD-blue is using domain decomposition: n_x = 1 n_y = 1 n_z = 2.
1 x 1 x 2 local grid on 1 nodes
```

And no, there are no plans to bring back OpenMP for the CPU code path. MPI domain decomposition is **much** faster and scales to many nodes. Nor do we have plans for a hybrid MPI+threaded implementation of any kind. That would be a huge development effort, involving a major restructuring of all code paths, for at best slight and uncertain performance gains. Scaling is already very good with MPI alone, and GPU kernel launch latency, not communication, becomes the performance limiter in strong scaling. We do use node-aware domain assignment by default so that each node gets a relatively compact domain to maximize intra-node communication.
------
Joshua A. Anderson, Ph.D.
Research Area Specialist, Chemical Engineering, University of Michigan
Phone: 734-647-8244
http://www-personal.umich.edu/~joaander/

Ai Haike

Dec 6, 2015, 10:11:49 PM
to hoomd-users
Thank you Joshua,

Here it is:

HOOMD-blue 1.2.1-unknown CUDA (7.0) SINGLE MPI SSE SSE2 SSE3 SSE4_1 SSE4_2 AVX
Compiled: 12/03/2015
Copyright 2009-2015 The Regents of the University of Michigan.

....

*Warning*: Delayed creation of execution configuration is deprecated and will be removed.
*Warning*: Call context.initialize() after importing hoomd_script to avoid this message.

HOOMD-blue is running on the following GPU(s):
Ranks 0-1:  [0]            Tesla K20m  13 SM_3.5 @ 0.706 GHz, 4799 MiB DRAM
notice(2): Reading init.xml...
notice(2): --- hoomd_xml file read summary
notice(2): 64000 positions at timestep 200000
notice(2): 64000 velocities

...

Here is what nvidia-smi returns:

+------------------------------------------------------+                      
| NVIDIA-SMI 346.46     Driver Version: 346.46         |                      
|-------------------------------+----------------------+----------------------+
| GPU  Name        Persistence-M| Bus-Id        Disp.A | Volatile Uncorr. ECC |
| Fan  Temp  Perf  Pwr:Usage/Cap|         Memory-Usage | GPU-Util  Compute M. |
|===============================+======================+======================|
|   0  Tesla K20m          On   | 0000:08:00.0     Off |                    0 |
| N/A   29C    P0    94W / 225W |    444MiB /  4799MiB |     95%      Default |
+-------------------------------+----------------------+----------------------+
|   1  Tesla K20m          On   | 0000:82:00.0     Off |                    0 |
| N/A   24C    P8    14W / 225W |     14MiB /  4799MiB |      0%      Default |
+-------------------------------+----------------------+----------------------+
                                                                              
+-----------------------------------------------------------------------------+
| Processes:                                                       GPU Memory |
|  GPU       PID  Type  Process name                               Usage      |
|=============================================================================|
|    0    242433    C   ...glotzer-hoomd-blue-aa0c4b438a3b/bin/hoomd   218MiB |
|    0    242434    C   ...glotzer-hoomd-blue-aa0c4b438a3b/bin/hoomd   209MiB |
+-----------------------------------------------------------------------------+

I use impi/5.0.1.
Right, it looks like there is an issue with automatic GPU assignment.
That doesn't happen with other codes or my own development.

Thank you for your support.

             Éric.

Joshua Anderson

Dec 7, 2015, 9:00:42 AM
to hoomd...@googlegroups.com
Excellent, thanks for the information. That confirms that hoomd is assigning both ranks to the same GPU. You say that "this doesn't happen" with other codes or your own development. If you have a suggestion on how to assign processes to GPUs in a reasonable and robust way, I'd love to hear it.

Most MPI stacks set an environment variable that is the "local rank" on a node. (Intel's MPI does not). HOOMD uses this information to distribute GPU ids among ranks on the node (gpu_id = local_rank_id % n_gpus_on_node).
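
For illustration, a minimal sketch of that scheme in plain C/CUDA (not the actual hoomd code; the environment variable names are only examples from other launchers, and as noted, Intel MPI sets none of them):

```c
/* Sketch only: select a GPU from a node-local rank exported by the
 * MPI launcher. The environment variable names are examples
 * (Open MPI, MVAPICH2, SLURM); Intel MPI sets none of them. */
#include <stdlib.h>
#include <cuda_runtime.h>

int select_gpu_by_local_rank(void)
{
    const char *vars[] = { "OMPI_COMM_WORLD_LOCAL_RANK",
                           "MV2_COMM_WORLD_LOCAL_RANK",
                           "SLURM_LOCALID" };
    int local_rank = -1;
    for (unsigned int i = 0; i < sizeof(vars) / sizeof(vars[0]); ++i)
    {
        const char *v = getenv(vars[i]);
        if (v) { local_rank = atoi(v); break; }
    }

    int n_gpus = 0;
    cudaGetDeviceCount(&n_gpus);
    if (local_rank < 0 || n_gpus == 0)
        return -1;                      /* no usable local rank or no GPU found */

    int gpu_id = local_rank % n_gpus;   /* gpu_id = local_rank_id % n_gpus_on_node */
    cudaSetDevice(gpu_id);
    return gpu_id;
}
```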

In an alternate mode of operation, if hoomd detects that all GPUs on a node are set to compute exclusive, it will allow the CUDA runtime to auto-select a separate GPU for each process.
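
Roughly, that check looks like the following sketch (a simplification, not the exact hoomd code):

```c
/* Sketch only: decide whether every GPU on the node is in a
 * compute-exclusive mode, in which case no explicit cudaSetDevice()
 * is needed and the runtime picks a free device for each process. */
#include <cuda_runtime.h>

int all_gpus_compute_exclusive(void)
{
    int n_gpus = 0;
    cudaGetDeviceCount(&n_gpus);
    for (int i = 0; i < n_gpus; ++i)
    {
        cudaDeviceProp prop;
        cudaGetDeviceProperties(&prop, i);
        if (prop.computeMode != cudaComputeModeExclusive &&
            prop.computeMode != cudaComputeModeExclusiveProcess)
            return 0;   /* at least one GPU is in Default or Prohibited mode */
    }
    return n_gpus > 0;
}
```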

Without an easy way to determine the local rank on a node for Intel MPI, there is not much I can do to fix this quickly. There are complex but general methods to determine a node-local identifier (https://blogs.fau.de/wittmann/2013/02/mpi-node-local-rank-determination/), and we do have code similar to Method I elsewhere in hoomd. I've opened an issue and will refactor that code for GPU selection when I find time: https://bitbucket.org/glotzer/hoomd-blue/issues/108.
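
As a sketch of one such general approach, here is a launcher-independent way to get a node-local rank using MPI-3's MPI_Comm_split_type. This is not what hoomd currently does, and it differs from the hostname-hashing of the blog's Method I:

```c
/* Sketch only: compute a node-local rank by splitting the communicator
 * into shared-memory (i.e. per-node) groups. Requires an MPI-3 library. */
#include <mpi.h>

int node_local_rank(MPI_Comm comm)
{
    MPI_Comm node_comm;
    int local_rank;

    /* group together the ranks that can share memory, i.e. ranks on the same node */
    MPI_Comm_split_type(comm, MPI_COMM_TYPE_SHARED, 0, MPI_INFO_NULL, &node_comm);
    MPI_Comm_rank(node_comm, &local_rank);
    MPI_Comm_free(&node_comm);
    return local_rank;
}
```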

Again, if anyone has a better suggestion I'd love to hear it, or better yet, submit a pull request. The relevant code is guessLocalRank in ExecutionConfiguration.cc (line 649).
------
Joshua A. Anderson, Ph.D.
Research Area Specialist, Chemical Engineering, University of Michigan
Phone: 734-647-8244
http://www-personal.umich.edu/~joaander/

Michael Howard

Dec 7, 2015, 10:16:42 AM
to hoomd-users
Are you using a specific job scheduler to manage your resources? If you are using SLURM (very popular right now), there is an environment variable set to identify the local rank regardless of which flavor of MPI is being run. I patched this a few weeks ago, so I'm not sure whether the branch you're compiling has it, since your build doesn't carry a commit identifier.

Not a general solution to the problem of determining a local rank of course. For that, I don't have any better ideas than what the blog post suggested.

Regards... Mike

Ai Haike

Dec 7, 2015, 9:08:51 PM
to hoomd-users
Thanks Joshua,

Here is how I proceed:
/*  MPI rank to GPU id  */
err = (int)gpurank_cu(rank, &gpuid);
if (err != 0) (void)handel_error("gpurank_cu", "main.c", err, rank);

if (rank != 0)
{
    MPI_Send(&gpuid, 1, MPI_INT, 0, 0, MPI_COMM_WORLD);
}

if (rank == 0)
{
    err = (int)gpuinfo_cu(rank, &gpumajor, &gpuminor, gpuname);
    if (err != 0) (void)handel_error("gpuinfo_cu", "main.c", err, rank);
    printf("# MPI [%d] device [%d] : %s CUDA Version %d.%d\n", rank, gpuid, gpuname, gpumajor, gpuminor);
    for (j = 1; j < nproc; j++)
    {
        MPI_Recv(&gpuid, 1, MPI_INT, j, 0, MPI_COMM_WORLD, &status);
        printf("# MPI [%d] device [%d] : %s CUDA Version %d.%d\n", j, gpuid, gpuname, gpumajor, gpuminor);
    }
}
Here are the routines:
/* ----------------------------------------------------- */

int gpurank_cu(int rank, int *gpuid)
{

 int deviceCount = 0;
 cudaDeviceProp prop;
 
 cudaGetDeviceCount(&deviceCount);
 
 *gpuid= rank%deviceCount;
 
 cudaSetDevice(*gpuid);

 return(0) ;

}

/* ----------------------------------------------------- */

int gpuinfo_cu(int rank,int *gpumajor, int *gpuminor, char (*gpuname))
{

 int deviceCount = 0;
 cudaDeviceProp prop;
 
 cudaSetDevice(0);
 cudaGetDeviceProperties(&prop, 0);
 
 *gpumajor = prop.major ;
 *gpuminor = prop.minor ;
 strcpy(gpuname, prop.name) ;

 return(0) ;

}

Ai Haike

Dec 7, 2015, 9:09:16 PM
to hoomd-users
That's great Mike!

We use SLURM, actually.
I downloaded the glotzer-hoomd-blue-aa0c4b438a3b.tar.bz2 tarball.
In which branch is your implementation?
Thanks,

         Éric.

Joshua Anderson

Dec 8, 2015, 7:10:21 AM
to hoomd...@googlegroups.com
The version you downloaded is v1.2.1. It does not include Mike's patch for SLURM support. For that, you will want to access the latest HEAD of master (soon to be v1.3.0). It uses the env var SLURM_LOCALID to get the local rank. I'd like to hear if it works for you. In my limited testing on Comet and Stampede, I found that SLURM_LOCALID was set to 0 for every rank and thus not useful.

Also, thanks for your suggestion to use global rank % number of devices. That is a simple solution to the issue, and I can easily add it as a temporary fallback. It will work on homogeneous clusters scheduled by node, though it is not general enough to cover all the system configurations users run hoomd in, which include heterogeneous clusters with varying numbers of GPUs per node, per-GPU scheduling, and differing numbers of ranks on each node.
------
Joshua A. Anderson, Ph.D.
Research Area Specialist, Chemical Engineering, University of Michigan
Phone: 734-647-8244
http://www-personal.umich.edu/~joaander/

Joshua Anderson

Dec 8, 2015, 12:52:20 PM
to hoomd...@googlegroups.com
I implemented some logic to ignore SLURM_LOCALID if it is all 0's and fall back to the global rank if no local rank can be identified. This is in v1.3.0.
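
For the curious, the fallback is roughly of the following shape (a simplified sketch, not the literal v1.3.0 code):

```c
/* Sketch only: treat SLURM_LOCALID as unusable if every rank reports 0
 * while there is more than one rank, and fall back to the global rank.
 * Assumes SLURM_LOCALID is set either on all ranks or on none
 * (MPI_Allreduce is collective). */
#include <stdlib.h>
#include <mpi.h>

int guess_local_rank(MPI_Comm comm)
{
    int rank, nranks;
    MPI_Comm_rank(comm, &rank);
    MPI_Comm_size(comm, &nranks);

    const char *s = getenv("SLURM_LOCALID");
    if (s)
    {
        int local = atoi(s), max_local = 0;
        MPI_Allreduce(&local, &max_local, 1, MPI_INT, MPI_MAX, comm);
        if (max_local > 0 || nranks == 1)
            return local;            /* SLURM_LOCALID carries real information */
    }
    return rank;                     /* no usable local rank: use the global rank */
}
```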

------
Joshua A. Anderson, Ph.D.
Research Area Specialist, Chemical Engineering, University of Michigan
Phone: 734-647-8244
http://www-personal.umich.edu/~joaander/

Ai Haike

Dec 8, 2015, 11:32:43 PM
to hoomd-users
Hi Joshua,

I tested v1.3.0 and got this using 1 node and 2 processes:

 bmark.py:006  |  system = init.read_xml('init.xml')

*Warning*: Delayed creation of execution configuration is deprecated and will be removed.
*Warning*: Call context.initialize() after importing hoomd_script to avoid this message.
notice(2): This system is not compute exclusive, using local rank to select GPUs
Fatal error in PMPI_Comm_size: Invalid communicator, error stack:
PMPI_Comm_size(124): MPI_Comm_size(comm=0x2d70290, size=0x7fffcde7dc74) failed
PMPI_Comm_size(78).: Invalid communicator
Fatal error in PMPI_Comm_size: Invalid communicator, error stack:
PMPI_Comm_size(124): MPI_Comm_size(comm=0xfe86978, size=0x7fff491c5a74) failed
PMPI_Comm_size(78).: Invalid communicator

It's a tricky issue.

Eric.

Joshua Anderson

Dec 9, 2015, 6:42:32 AM
to hoomd...@googlegroups.com
I did test this code on Stampede and Comet and it worked there. In any case, it should be fixed on the maint branch now.

------
Joshua A. Anderson, Ph.D.
Research Area Specialist, Chemical Engineering, University of Michigan
Phone: 734-647-8244
http://www-personal.umich.edu/~joaander/
