"Particle ... no longer in the simulation box" error in LJ example


luz...@gmail.com

unread,
Jul 14, 2015, 12:54:47 AM7/14/15
to hoomd...@googlegroups.com
Hi folks,

I am seeing the following problem. I have two machines, both running Scientific Linux 6.6: one with CUDA 6.5 and a Tesla C2075 GPU, the other with CUDA 7.0 and a GeForce GTX 750 GPU. During installation, the very same error message appeared on both that I reported earlier:

 > at " make check " stage of compilation, i get the following error message:
>     > ...
>     > [ 80%] Building CXX object test/unit/CMakeFiles/test_messenger.dir/test_messenger.cc.o
>     > /home/dmytro/Downloads/hoomd-blue/test/unit/test_messenger.cc: In member function 'void Messenger_file::test_method()':
>     > /home/dmytro/Downloads/hoomd-blue/test/unit/test_messenger.cc:172: error: 'unique_path' was not declared in this scope
>     > make[3]: *** [test/unit/CMakeFiles/test_messenger.dir/test_messenger.cc.o] Error 1
>     > make[2]: *** [test/unit/CMakeFiles/test_messenger.dir/all] Error 2
>     > make[1]: *** [CMakeFiles/check.dir/rule] Error 2
>     > make: *** [check] Error 2

However, on the CUDA 6.5/Tesla C2075 machine hoomd runs fine without any noticeable errors, while on the CUDA 7.0/GeForce GTX 750 machine the LJ example (and other randomly chosen examples, too) regularly crashes with the following error message:
...
HOOMD-blue is running on the following GPU(s):
 [0]       GeForce GTX 750   4 SM_5.0 @ 1.14 GHz, 2047 MiB DRAM, DIS
lj.py:005  |  init.create_random(N=2000, phi_p=0.01, name='A')
notice(2): Group "all" created containing 2000 particles
lj.py:007  |  lj = pair.lj(r_cut=3.0)
lj.py:008  |  lj.pair_coeff.set('A', 'A', epsilon=1.0, sigma=1.0)
lj.py:010  |  all = group.all();
lj.py:011  |  integrate.mode_standard(dt=0.005)
lj.py:012  |  integrate.nvt(group=all, T=1.2, tau=0.5)
lj.py:014  |  run(10e3)
notice(2): -- Neighborlist exclusion statistics -- :
notice(2): Particles with 0 exclusions             : 2000
notice(2): Neighbors excluded by diameter (slj)    : no
notice(2): Neighbors excluded when in the same body: no
** starting run **
**ERROR**: Particle with unique tag 1446 is no longer in the simulation box.

**ERROR**: Cartesian coordinates:
**ERROR**: x: 15436.9 y: -24372.6 z: 27529.9
**ERROR**: Fractional coordinates:
**ERROR**: f.x: 328.004 f.y: -516.581 f.z: 584.567
**ERROR**: Local box lo: (-23.5675, -23.5675, -23.5675)
**ERROR**:           hi: (23.5675, 23.5675, 23.5675)
Traceback (most recent call last):
  File "lj.py", line 14, in <module>
    run(10e3)
  File "/home/dmytro/bin/hoomd/bin/../lib/hoomd/python-module/hoomd_script/__init__.py", line 268, in run
    globals.system.run(int(tsteps), callback_period, callback, limit_hours, int(limit_multiple));
RuntimeError: std::exception
...

In most cases it crashes like this; only the particle tag differs. If I decrease the integration time step and/or the number of particles in the box, it runs without error for longer, but for a long enough run it eventually crashes anyway. What's weird here is that periodic boundary conditions are assumed by default, so there should be no meaningful way for a particle to get out of the simulation box. Any suggestions why? If I force it to run on the CPU (hoomd lj.py --mode=cpu), everything works fine every time; the problem appears only when the GPU is invoked. Various tests of the GPU have not revealed any errors, and in all other regards the GPU seems to work flawlessly.

Mike Henry

unread,
Jul 14, 2015, 7:30:46 AM7/14/15
to hoomd...@googlegroups.com

So I can't comment on exactly why you're getting these errors, but even with periodic boundary conditions, if a particle leaves the simulation volume it means that it traveled farther than it should have in an unphysical way. Decreasing the temperature and time step should help the system be more stable. However, with too small a time step you will run into problems with floating-point rounding errors. I've found that if a system isn't stable with a dt of 0.001, then something else is likely the problem.
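The floating-point caveat is easy to demonstrate in isolation. This is a standalone sketch, not HOOMD code: HOOMD's GPU path uses single precision by default, and the point is simply that a per-step displacement v*dt can fall below the resolution of a float32 position, so the update is silently lost (the numbers below are illustrative, not taken from the simulation):

```python
import numpy as np

# With a very small dt, the displacement v*dt can be smaller than the
# spacing between representable float32 values near x, so x + v*dt
# rounds back to exactly x and the particle never moves.
x = np.float32(23.5)        # position near the box edge
v = np.float32(0.01)        # a slow particle
dt = np.float32(1e-6)       # an absurdly small time step
step = np.float32(v * dt)   # displacement ~1e-8, below float32 resolution at 23.5
print(x + step == x)        # True: the move is rounded away entirely
```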

--
You received this message because you are subscribed to the Google Groups "hoomd-users" group.
To unsubscribe from this group and stop receiving emails from it, send an email to hoomd-users...@googlegroups.com.
To post to this group, send email to hoomd...@googlegroups.com.
Visit this group at http://groups.google.com/group/hoomd-users.
For more options, visit https://groups.google.com/d/optout.

Michael Howard

unread,
Jul 14, 2015, 9:19:04 AM7/14/15
to hoomd...@googlegroups.com
Although I'm not sure exactly why you're having problems, it seems to me that you have a couple of different factors here. Perhaps Josh or one of the other developers will be able to give you a better diagnosis. Just a few thoughts I had:

(1) The make check error is most likely unrelated to the particle-out-of-the-box error. What version of boost are you using? What compiler and version? If I grep for unique_path in my boost include directory, I find it in filesystem/operations.hpp. It's possible this is just a symptom of another compilation error, though.

(2) As another user noted, "particle out of the box" can and does happen if your system blows up and a particle moves farther than one box length. In your case, the particle has moved hundreds of box lengths. In my experience, this has happened either when (a) particles overlap or (b) there is an error in the pair force (or neighbor list) calculation. Since you are running the dilute Lennard-Jones example, which is randomly generated at low density, we can exclude (a).
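The "hundreds of boxes" figure can be checked directly against the numbers in the log. This is plain Python, not HOOMD code; the formula f = (x - lo) / L is inferred from the reported values, which it reproduces exactly:

```python
# Box bounds and unwrapped coordinates copied from the error message.
lo, hi = -23.5675, 23.5675
L = hi - lo  # box edge length

def fractional(x):
    """Fractional coordinate of an unwrapped position in a cubic box."""
    return (x - lo) / L

for axis, x in (("x", 15436.9), ("y", -24372.6), ("z", 27529.9)):
    # |f| >> 1 means the particle sits that many box lengths outside
    # the box, i.e. the system has blown up.
    print(f"f.{axis}: {fractional(x):.3f}")
```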

(3) Correct me if I'm wrong, but I believe that the two GPUs you are using are very different architectures. C2075 is a Fermi card (SM 2.0), but GTX 750 is a Maxwell card (SM 5.0). HOOMD has two different code paths for building the neighbor list (one for Fermi, one for more recent cards). This might explain why you are okay on the C2075 and having problems on the 750. I have only run on cards up to SM 3.7 (K80).

(4) Is the GTX 750 also your display GPU? What about the Tesla? (I haven't seen that "DIS" in my output before.) Have you tried running remotely with the card in compute-exclusive mode? I doubt this is the cause, but it may be worth a try.

Regards... Mike

Jens Glaser

unread,
Jul 14, 2015, 10:18:40 AM7/14/15
to hoomd...@googlegroups.com
What version of HOOMD are you running?

Jens

Joshua Anderson

unread,
Jul 14, 2015, 11:33:22 AM7/14/15
to hoomd...@googlegroups.com
I recently ran into some problems with SM 5.0 cards failing unit tests. It appears that certain driver versions may incorrectly translate old PTX to sm_50. Work around this problem by adding "50" to CUDA_ARCH_LIST in cmake. This is now the default on the master branch, but defaults like this only take effect on the initial cmake run in a clean build directory. If you have an existing build directory, you will need to either start over or set the value yourself.

Since you can't build and run the unit tests, I can't be sure this is the root cause of your problem, but it is possible. I have this problem with driver version 352.21.
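A minimal sketch of that workaround, assuming a fresh checkout (the paths and the other architecture values are illustrative; the point is the added "50"):

```shell
# Use a clean build directory so the setting takes effect,
# then add sm_50 to the list of real architectures HOOMD compiles for.
mkdir build-clean && cd build-clean
cmake ../hoomd-blue -DCUDA_ARCH_LIST="20;30;35;50"
make -j8
```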
------
Joshua A. Anderson, Ph.D.
Research Area Specialist, Chemical Engineering, University of Michigan
Phone: 734-647-8244
http://www-personal.umich.edu/~joaander/

luz...@gmail.com

unread,
Jul 14, 2015, 10:43:32 PM7/14/15
to hoomd...@googlegroups.com
Hi folks,

thanks for all the suggestions. The simplest solution, as it turned out, was to downgrade the NVIDIA driver from 352.21 to 346.35, and now hoomd no longer crashes. So keep in mind: the latest driver is not always the best driver :)