HOOMD prints diagnostic messages to aid in debugging this. For a single-node job, run `mpirun -n 2 hoomd script.py` and send us the first part of the output, including the version identification line and the "HOOMD-blue is running on" information. There may be an issue with automatic GPU assignment within a node; that has not been tested with Intel MPI.
For example, this is what I get on my development box:
```
HOOMD-blue 1.2.1-unknown CUDA (7.5) DOUBLE MPI SSE SSE2 SSE3 SSE4_1 SSE4_2 AVX
......
HOOMD-blue is running on the following GPU(s):
Rank 0: [0] Quadro M6000 24 SM_5.2 @ 1.11 GHz, 12287 MiB DRAM
Rank 1: [1] Tesla K40c 15 SM_3.5 @ 0.876 GHz, 11519 MiB DRAM, DIS
HOOMD-blue is using domain decomposition: n_x = 1 n_y = 1 n_z = 2.
1 x 1 x 2 local grid on 1 nodes
```
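
If you don't have a script handy, any short script that initializes a system will print those startup diagnostics. Here is a minimal sketch, assuming the 1.x `hoomd_script` API; the file name, system size, and pair coefficients are arbitrary placeholders:

```python
# minimal_test.py -- a sketch for generating the startup diagnostics
# (assumes the HOOMD-blue 1.x hoomd_script API; parameters are arbitrary)
from hoomd_script import *

# initialization prints the version line, the GPU assignment per rank,
# and the domain decomposition information
init.create_random(N=8000, phi_p=0.2)

# a simple Lennard-Jones pair potential so the run has something to compute
lj = pair.lj(r_cut=2.5)
lj.pair_coeff.set('A', 'A', epsilon=1.0, sigma=1.0)

integrate.mode_standard(dt=0.005)
integrate.nvt(group=group.all(), T=1.0, tau=0.5)

run(1000)
```

Run it with `mpirun -n 2 hoomd minimal_test.py` (or your own script) and paste everything up to and including the domain decomposition line.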
And no, there are no plans to bring back OpenMP for the CPU code path. MPI domain decomposition is **much** faster and scales to many nodes. Nor do we have plans for a hybrid MPI+threaded implementation of any kind: it would be a huge development effort, requiring a major restructuring of all code paths, for slight and uncertain performance gains. Scaling is already very good with MPI alone, and in strong scaling the performance limiter is GPU kernel launch latency, not communication. We do use node-aware domain assignment by default, so each node gets a relatively compact domain and as much communication as possible stays within a node.