Issues Encountered While Using WESTPA with AMBER for the 7.2 Tutorial on Pitt CRC GPU


Han, Fengyang

May 9, 2023, 8:26:15 PM
to westpa...@googlegroups.com

Dear Members of the WESTPA Community,


I apologize for the interruption.

I've been working through the tutorials on a cluster. You can find my code for the GPU and CPU jobs in the attached tarballs; I submit each job using sbatch WE.slurm. However, I've run into some issues. The CPU job runs smoothly, but the GPU job fails, even though the two setups differ only in their WE.slurm files. The GPU job's error is in gpu_run/west-536516.log:

... 

Waiting for segments to complete... -- ERROR [westpa.core.propagators.executable] -- could not read pcoord from '/scratch/slurm-536516/tmp50ce4bao': ValueError('cannot reshape array of size 0 into shape (101,2)')

 ...
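For context, that error means the propagator's pcoord file contained no data at all: parsing an empty file yields a size-0 array, which cannot be reshaped to (101, 2). A minimal reproduction, assuming python3 with numpy on the PATH:

```shell
# Create an empty stand-in for the pcoord file and try to parse it the
# way WESTPA's executable propagator would; the reshape then fails.
: > empty_pcoord.dat
python3 - <<'EOF'
import warnings
import numpy as np

warnings.simplefilter("ignore")  # loadtxt warns on an empty input file
data = np.loadtxt("empty_pcoord.dat")
try:
    data.reshape(101, 2)
except ValueError as err:
    print(err)  # cannot reshape array of size 0 into shape (101,2)
EOF
```

In practice an empty pcoord file usually means the dynamics step crashed or never launched, so the segment log for the MD engine is a good first place to look.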


Interestingly, both the CPU and GPU jobs initially encountered the same problem, as you can see in cpu_run/west-1336807.log:

... 

exception caught; shutting down -- ERROR [w_run] -- error message: [Errno 2] No such file or directory: '/ix1/jwang/feh34/4.md/1.westpa/7.2/cpu_run/seg_logs/000003-000001.log'

...


To troubleshoot, I tried replacing $SANDER with pmemd.cuda or sander in runseg.sh for both directories and resubmitted my jobs. This adjustment seemed to resolve the issue for the CPU job, but the GPU job still failed after four iterations with a similar problem. You can see this failure report in gpu_run/west-536537.log.

Searching for a solution, I found additional tutorials addressing AMBER-on-GPU problems. I attempted to implement these fixes by including node.sh, runwe_bridges.sh, and env.sh from basic_nacl_amber_GPU. Unfortunately, this resulted in a different error, westpa.work_managers.zeromq.core.ZMQWorkerMissing: no workers available. You can find it in gpu_run/west-536541.log in the additional tarball file.


I would greatly appreciate any guidance or suggestions you could provide to help me understand where I might have gone wrong.



Best regards, 

Fengyang

rus...@ohsu.edu

May 11, 2023, 12:45:51 AM
to westpa-users
Hi Fengyang,

Can you double check that the `seg_logs` directory exists?

If your west.cfg specifies outputting files in seg_logs, you have to create the directory yourself and make sure it exists, or WESTPA will run into problems. (Same goes for the traj_segs folder, if your simulation setup uses that)

It's easy to run into this situation -- for example: Say you run a simulation, find an issue with it, and manually clear out the WESTPA-generated files to prepare for a new simulation. You re-run w_init, which works fine, so it's not clear there's an issue, but then when you w_run, this error occurs, because you deleted traj_segs and seg_logs. I run into this semi-often myself.

I'm not able to view the attached files, so maybe you're already doing this, but some people include creating those directories in their WESTPA initialization or run scripts just to make sure, as in https://github.com/westpa/westpa_tutorials/blob/main/additional_tutorials/basic_nacl_amber_GPU/init.sh#L7 .
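A minimal sketch of that safeguard, with directory names following the tutorial layout (adjust them to whatever your west.cfg actually specifies):

```shell
# Recreate the directories WESTPA writes segment output and logs into;
# mkdir -p is a no-op if they already exist, so this is safe to rerun.
mkdir -p seg_logs traj_segs
ls -d seg_logs traj_segs
```

Putting these lines in init.sh means a manually cleaned-out simulation directory can never trip up w_run this way.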

-John Russo

Han, Fengyang

May 11, 2023, 12:07:35 PM
to rus...@ohsu.edu, westpa...@googlegroups.com
Hi John,

Thank you for your quick and patient response, and sorry for the inconvenience with file sharing. I have reset the permissions on the files.

I followed your instructions and used the files from basic_nacl_amber_GPU, but I've run into some trouble that seems to be related to ZMQ.

Here are the steps I followed:
  1. I ran ./init.sh to remove and remake the simulation directories.
  2. I submitted my job with sbatch runwe_bridges.sh.
  3. Upon checking the output, I noticed an "ssh_askpass" error at line 1226 of "slurm.out".
  4. Additionally, in "west-537830.log", I encountered a "ZMQWorkerMissing" error at line 32.
  5. In lines 17-19, I tried printing out "wm_ops.prep_iter", "self.n_iter", and the segments shown in line 29, which I suspect might be causing the issue.
As for the previous files:
  • CPU job: completed successfully on May 10th, as you can see in "west-1336816.log". I have removed the traj_segs folder to reduce the folder size.
  • GPU job: if I submit my job with "WE.slurm" without ZMQ, you can find the errors reported in "west-536537.log" and "slurm-536516.out".
Please let me know if you have any suggestions or if there's anything else I can provide to help troubleshoot the problem.

Thank you once again for your assistance.

Best regards,
Fengyang





Jeremy Leung

May 11, 2023, 12:29:22 PM
to westpa-users
Hi Fengyang,

We usually ask users to contact their HPC support about issues running on clusters, but since this is about H2P (and we've run tons of simulations on H2P), we'll help you here.

For the ZMQ permission error, run the following once so that you can ssh to yourself. If you already have an ssh key generated and don't want to overwrite it, feel free to skip the first line:
     ssh-keygen # hit enter a few times
     cat ~/.ssh/id_rsa.pub >> ~/.ssh/authorized_keys

For the missing workers, this is typically due to high I/O or traffic on the cluster, which calls for increasing your worker/server heartbeats. I've had success running the main server node with the IPC protocol instead and extending the heartbeats, as written in the FAQ. Just replace the first w_run command in runwe_bridges.sh with what's in the FAQ and modify some of the parameters in env.sh.
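As a sketch, the env.sh side of that change amounts to the following (300 is an illustrative value, not a recommendation; tune it for your cluster):

```shell
# Lengthen the ZMQ heartbeats so that slow or congested nodes are not
# declared missing before they manage to check in.
export WM_ZMQ_MASTER_HEARTBEAT=300
export WM_ZMQ_WORKER_HEARTBEAT=300
echo "master=$WM_ZMQ_MASTER_HEARTBEAT worker=$WM_ZMQ_WORKER_HEARTBEAT"
```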

See here for the exact commands:

-- JL

Han, Fengyang

May 11, 2023, 2:41:19 PM
to westpa...@googlegroups.com
Hi Jeremy,

Sincere thanks for your instructions; they were indeed helpful. There are now no error messages in "west.log" or "slurm.out" after modifying my ssh settings, "runwe_bridges.sh", and "env.sh".

However, upon initializing the simulation, my "west-gpu-n35-node.log" file reports the error "no contact from peer", and for 15 minutes there were no updates in my west.log or slurm.out, so I killed the job for further checking.

As part of my troubleshooting, I tried adjusting the environment settings by setting both export WM_ZMQ_MASTER_HEARTBEAT and export WM_ZMQ_WORKER_HEARTBEAT to 300, as this was a solution mentioned in previous discussions such as "Idle workers in a multi-node WESTPA/ZMQ run". However, due to the current high usage of our HPC resources, I am unable to verify whether these changes resolve the issue.



May I kindly ask for your insights on this matter, or should I contact the HPC support team? Any suggestions or advice on how to address this problem would be greatly appreciated.

Thank you for your time and assistance.

Best regards, 
Fengyang



Jeremy Leung

May 11, 2023, 2:49:16 PM
to westpa-users
Hi Fengyang,

That's from the master not starting up within the wait time, which again is quite common on H2P due to high usage. What we usually do is increase the wait time for it to start up. In this line, change the 60 to 300, which will increase the wait from 1 minute to 5 minutes.

If it passes this phase, you should see the json file being made in $WEST_SIM_ROOT.
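For illustration, the loop being edited is shaped roughly like this; the host-info file name here is an assumption (the real runwe_bridges.sh sets its own), and the touch stands in for the master writing its contact info:

```shell
# Poll for the server's host-info file, sleeping ~1 s per attempt, so
# n<300 waits up to about five minutes instead of one.
SERVER_INFO=west_zmq_info.json
touch "$SERVER_INFO"    # stand-in for the master coming up
for ((n=0; n<300; n++)); do
    if [ -e "$SERVER_INFO" ]; then
        echo "found $SERVER_INFO after $n checks"
        break
    fi
    sleep 1
done
```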

-- JL

Han, Fengyang

May 11, 2023, 4:39:45 PM
to westpa...@googlegroups.com
Hi Jeremy,

Thank you for your response. I submitted two jobs, one with a 1-minute wait time and the other with a 5-minute wait time. However, after an hour of computation, I received only one message update for each job, stating that "no workers are available." I'll try a longer waiting time next to help it get past this phase.

In env.sh:
  export WM_ZMQ_MASTER_HEARTBEAT=3000
  export WM_ZMQ_WORKER_HEARTBEAT=3000

In runwe_bridges.sh:
  # wait on host info file up to one minute
  for ((n=0; n<3000; n++)); do

If this doesn't work, I'll go back to using MPI for job submission. Thank you for your patience.

Best regards,
Fengyang



Jeremy Leung

May 11, 2023, 4:57:32 PM
to westpa-users
Hi Fengyang,

If it's stating no workers available (and not "startup phase with no contact from master" or anything equivalent regarding the server/master), then it's just the workers not starting up fast enough. The n value only dictates the startup of the master server.

Here's my submission script in case it helps.

Alternatively, for running the tutorials, you can just start an interactive session and/or job and run with the "processes" work manager. That always seems to work for me on H2P on a single node. Running on weekends also helps.
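In that case the run command reduces to a single line (the worker count here is just an example):

```shell
# Single-node run with the 'processes' work manager: no ZMQ, no ssh.
# Assumes a WESTPA environment is already activated (e.g. via env.sh).
w_run --work-manager=processes --n-workers=4 &> west.log
```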

-- JL
runwe_gpu_ipc.sh

Han, Fengyang

May 11, 2023, 9:47:29 PM
to westpa...@googlegroups.com
Hi Jeremy,

Thank you for your advice. It appears that the issue was largely related to the high system load. The "WE.slurm" in GPU.tar.gz is now able to run the simulation on a single node. This discussion has not only resolved the challenge but also greatly broadened my horizons. I hope it benefits others searching here as well.

Once again, thank you for your time and support.

Best regards,
Fengyang

