Idle workers in a multi-node WESTPA/ZMQ run


Miłosz Wieczór

Nov 22, 2019, 1:38:56 PM
to westpa-users
Hi,

I'm trying to use ZMQ to set up a multi-node WESTPA run on a cluster; for now, I'm just working with your NaCl example to see if I can get it running properly. Unfortunately, once the job starts, it never gets past the second iteration: within 5 minutes, the master ZMQ process reports non-responsive workers and kills them, so that eventually all client processes are shut down. At that point, 12 out of 20 workers had completed the second iteration.

One suspicious thing I observe is this line:

-- WARNING  [work_managers.zeromq.core] -- sending SIGKILL to worker process 1136

repeated 9 times (with different PIDs) very early on, between the creation of the master process and the appearance of the .json file. Since I first run w_run with --n-workers=0, I don't see why there should be any workers at all at that point.

I'm also attaching my sbatch and node.sh scripts. Any help will be greatly appreciated.

Regards,
Miłosz Wieczór

node.sh
submit_zmq.sh

JD Russo

Nov 22, 2019, 6:38:14 PM
to westpa-users
Can you attach one of your output log files? I've been running into what I think is the same, or a very similar, problem, also running the NaCl tutorial with GROMACS; in my case, though, there's always just one remaining worker on each node, which times out and never completes its trajectory.

As a somewhat unsatisfying workaround, see if you can restart the job and have it complete the iteration it crashed during. In my case, even though each iteration crashes with one worker per node incomplete, re-submitting the job to the scheduler lets WESTPA finish those segments, continue on to the next iteration, and then crash in the same way. By chaining N jobs for N iterations, I'm able to run to completion.

Of course, this hack is neither ideal nor an actual solution, and I'm spending some time debugging. I'll update with the results of that.

As an aside, I also saw (in the single test run I've completed so far) that around iteration 66 it actually stopped crashing and ran successfully to completion. I'd be interested to know if, with the above workaround, you see the same thing.

-John Russo

Miłosz Wieczór

Nov 23, 2019, 4:25:28 PM
to westpa-users
Hi John,

Thanks for your suggestion. I tried re-running, but with no success; the job didn't make it past iteration 2 anyway, so maybe it's a different kind of error. I'm attaching a bunch of logs from the initial run: job.out is the main sbatch output, west.log comes from the master server, and west[...]-p1257.log comes from node.sh (there's also another one from the second node).

It's my first time working with ZMQ, so I don't yet have a good grasp of the details under the hood, but if you can make sense of what exactly is going wrong, that'd help a lot.

Thanks,
Miłosz
job.out
west.log
west-17088132-p1257.log

Anthony Bogetti

Nov 24, 2019, 4:04:41 PM
to westpa...@googlegroups.com
Hello Miłosz,

From reading the log files you sent, I don't know off the top of my head what could be happening, but I'd like to work with you on this and get everything running properly on your end. A few questions first:

1. Which tutorial are you using? I know you mentioned NaCl, but where specifically did you obtain it? We just published new tutorials (in a LiveCoMS manuscript), and I want to be sure you aren't using one of our older tutorials, as they are no longer supported.

2. Did you change anything from the original tutorial before you encountered your errors? If so, what specifically? I don't believe any of our new LiveCoMS tutorials use the ZMQ work manager; however, I can provide some examples of mine that should work.

3. Would you be willing to send me your simulation directory? If you could tar the main WESTPA simulation folder that is giving you errors and attach it, I could then reproduce the error and try to figure out what is going on.

Thanks!
Anthony

--
You received this message because you are subscribed to the Google Groups "westpa-users" group.
To unsubscribe from this group and stop receiving emails from it, send an email to westpa-users...@googlegroups.com.
To view this discussion on the web visit https://groups.google.com/d/msgid/westpa-users/b824f82f-9c33-494c-ac70-f4b53de4e3bd%40googlegroups.com.

Aud J. Pratt

Nov 25, 2019, 3:51:01 AM
to westpa...@googlegroups.com
Hi Milosz,

In addition to what Anthony has mentioned, the ZMQ work manager has a number of heartbeat settings that may need tweaking on some clusters (depending on communication patterns, network infrastructure, etc.); they've been the culprit for other jobs that seem to run fine and then just shut down. Since your run does move on to iteration 2, propagation and communication appear to work, so it might be useful to check the ZMQ heartbeat settings (you can see them by running w_run --help). Try increasing the number of missed heartbeats allowed before the server shuts a worker down and see whether that helps.
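As a rough sketch of what that could look like on the master side (I'm only using flags I know exist from w_run --help; the value 50 is just an example, and any other options your node.sh passes would stay as they are):

```shell
# Sketch: relax the ZMQ heartbeat tolerance on the master invocation.
# --zmq-timeout-factor is one such knob; run `w_run --help` on your
# install to see the full list of heartbeat-related options.
w_run --work-manager=zmq --n-workers=0 \
      --zmq-timeout-factor=50 \
      &> west.log
```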

Alternatively, it's possible that the compute node you're running on is failing or otherwise dropping out of communication. Depending on how your simulation and run scripts are set up, that may be hard to verify until Anthony can look at your simulation setup.

Best,
Audrey



--
Audrey Pratt
Graduate Student in Chemistry
Chong Lab, Room 338, Eberly Hall
University of Pittsburgh
Pittsburgh, PA 15260

Miłosz Wieczór

Nov 25, 2019, 11:20:41 AM
to westpa-users
Hi,

Following Audrey's suggestion, I set --zmq-timeout-factor to 50, and it mostly solved the issue: it took much longer for the job to crash with the same error, and the run completed 2 extra iterations in the meantime. Extrapolating, I assume that increasing it even more will let me run to completion, which I'm currently verifying (update: I've already made it past iteration 6, so the job should indeed be fine).

To answer Anthony's questions: 
(1) I chose the 'nacl_gmx' tutorial as I found it in the git repo (westpa-2017.10/lib/examples/...), and modified my submission scripts according to what I could find in the docs and on this mailing list;
(2) here and there I added the "--work-manager=zmq" flag to w_init and w_run, but besides that everything seems untouched now (I eventually reverted all changes before reporting the problem here);
(3) now that it seems to be running fine, I'd rather spare you the extra work, but it would help people a lot if there were a complete working example with ZMQ (or mpi4py, if you have that implemented by now) in the tutorials. I understand there's some variation depending on the architecture, but then a single file with all tunable parameters would do.
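For reference, the change in (2) amounted to nothing more than this (a sketch; all other arguments from the tutorial's init and run scripts are left as they were):

```shell
# Sketch: the only edits were passing the ZMQ work manager explicitly.
w_init --work-manager=zmq ...
w_run  --work-manager=zmq --n-workers=0 ...
```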

Thanks a lot!
Miłosz


Aud J. Pratt

Nov 25, 2019, 12:00:48 PM
to westpa...@googlegroups.com
Hi Milosz,

Glad to hear that helped! Regarding a ZMQ example, I believe that's on the to-do list, if I'm not mistaken.

On the ZMQ note, it sounds like the underlying problem is still there. One thing to check is whether it's using TCP or Unix sockets. I've had some issues with Unix sockets in the past, so you may want to make sure you're using TCP, and try that if you're not already.
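One way to tell the two apart is to look at the endpoint strings in the server-info .json file that the master writes at startup (mentioned earlier in this thread); the filename below is a guess, so substitute whatever .json your run produces:

```shell
# Sketch: pull the endpoint URLs out of the ZMQ server-info file.
# TCP endpoints look like tcp://host:port; Unix sockets like ipc:///path.
grep -o '\(tcp\|ipc\)://[^"]*' west_zmq_info.json
```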

Best,
Audrey


Miłosz Wieczór

Nov 25, 2019, 6:34:07 PM
to westpa-users
On second look: I just compared the 2-node ZMQ run to a regular single-node one, and it turns out the ZMQ run takes much longer to complete. In the single-node run all trajectories complete within 13 seconds, while in the ZMQ run they take anywhere from 25 seconds (first iteration) to 200-300 seconds (8th-9th iterations). I wonder if there are any obvious tweaks for optimal performance with ZMQ, but for now it still doesn't look like a viable option compared to single-node runs.

Best,
Miłosz

John Russo

Nov 25, 2019, 6:56:23 PM
to westpa-users


When using --n-workers to specify the number of concurrent workers, SSHing into the compute node that WESTPA was running on showed that it was not actually running multiple workers in parallel. I checked this by launching top and looking at the "gmx" processes.

I was able to get multiple processes running on different cores by specifying the number of workers I wanted in srun's -n option and leaving --n-workers=1. Launching top on the compute node with the workers then showed the correct number of workers, each assigned to a different core.

This is definitely contrary to how the documentation describes launching the ZMQ workers, but it gave me the correct result.
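In other words, a sketch of what worked for me inside the sbatch script (the Slurm environment variables and the node.sh wrapper are as in my setup; adapt to yours):

```shell
# Sketch: launch one srun task per desired worker, each task running a
# single ZMQ worker, instead of one task with --n-workers=N.
srun -N "$SLURM_NNODES" -n "$SLURM_NTASKS" \
     ./node.sh --work-manager=zmq --n-workers=1
```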


Miłosz Wieczór

Nov 27, 2019, 6:34:01 AM
to westpa-users
Hi John,

Thanks! Your tip worked like a charm: I now got 100 iterations in an hour. In hindsight, it makes sense that asking for 1 CPU per node in srun didn't give me optimal performance.

Best,
Miłosz