I apologize for the interruption.
I've been working on running the tutorials on a cluster; you can find my code for the GPU and CPU jobs in the tarball files. I've been submitting my jobs with sbatch WE.slurm. However, I've encountered some issues. While the CPU job runs smoothly, the GPU job failed and generated an error report, even though the two directories differ only in their WE.slurm files. The GPU job's error can be found in gpu_run/west-536516.log:
...
Waiting for segments to complete... -- ERROR [westpa.core.propagators.executable] -- could not read pcoord from '/scratch/slurm-536516/tmp50ce4bao': ValueError('cannot reshape array of size 0 into shape (101,2)')
...
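For what it's worth, I understand that reshape error to mean WESTPA read an empty pcoord file, which usually means the dynamics step died before writing any output. Below is a guard I could add to runseg.sh; check_pcoord is a hypothetical helper of my own, and the 101-row shape is just what the error message implies:

```shell
# Hypothetical guard for runseg.sh: fail loudly if the pcoord file the
# executable propagator will parse is empty or the wrong length, instead of
# letting WESTPA hit "cannot reshape array of size 0 into shape (101,2)".
check_pcoord() {
    local f="$1" expected_rows="$2"
    if [ ! -s "$f" ]; then
        echo "ERROR: pcoord file '$f' is empty; the MD engine likely crashed" >&2
        return 1
    fi
    local rows
    rows=$(wc -l < "$f")
    if [ "$rows" -ne "$expected_rows" ]; then
        echo "ERROR: pcoord file '$f' has $rows rows, expected $expected_rows" >&2
        return 1
    fi
}
```

In runseg.sh this would sit right after the dynamics and pcoord-extraction steps, e.g. check_pcoord "$WEST_PCOORD_RETURN" 101 || exit 1, since WEST_PCOORD_RETURN is the file the executable propagator reads back.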
Interestingly, both the CPU and GPU jobs initially encountered the same problem, as you can see in cpu_run/west-1336807.log:
...
exception caught; shutting down -- ERROR [w_run] -- error message: [Errno 2] No such file or directory: '/ix1/jwang/feh34/4.md/1.westpa/7.2/cpu_run/seg_logs/000003-000001.log'
...
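One thing I considered: that "No such file or directory" error looks like the seg_logs directory did not exist when w_run tried to write into it. A guard like this near the top of WE.slurm (or in init.sh) would rule that out; SIM_ROOT is a placeholder of mine, and the directory names are just the tutorials' usual ones:

```shell
# Make sure the directories WESTPA writes into exist before w_run starts.
# SIM_ROOT is a placeholder; the tutorials usually run from the simulation root.
SIM_ROOT="${SIM_ROOT:-$PWD}"
mkdir -p "$SIM_ROOT/seg_logs" "$SIM_ROOT/traj_segs" "$SIM_ROOT/istates"
```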
To troubleshoot, I tried replacing $SANDER with pmemd.cuda or sander in runseg.sh in both directories and resubmitted my jobs. This resolved the issue for the CPU job, but the GPU job still failed after four iterations with a similar problem. You can see that failure report in gpu_run/west-536537.log.
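Rather than hand-editing runseg.sh in each directory, I suppose the engine could be chosen once in env.sh, since runseg.sh just expands $SANDER. A sketch; the detection logic is my own assumption, not from the tutorial:

```shell
# Pick the MD engine once, in env.sh, so runseg.sh can keep using $SANDER.
# Assumption: use pmemd.cuda only when it is on PATH and the scheduler has
# actually assigned a CUDA device to this job; otherwise fall back to sander.
if command -v pmemd.cuda >/dev/null 2>&1 && [ -n "${CUDA_VISIBLE_DEVICES:-}" ]; then
    export SANDER=pmemd.cuda
else
    export SANDER=sander
fi
```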
While searching for a solution, I found additional tutorials addressing Amber GPU problems. I attempted to implement their fixes by including node.sh, runwe_bridges.sh, and env.sh from basic_nacl_amber_GPU. Unfortunately, this resulted in another error, westpa.work_managers.zeromq.core.ZMQWorkerMissing: no workers available, which you can find in gpu_run/west-536541.log in the additional tarball file.
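From what I understand, "no workers available" means the ZMQ master saw no clients connect before it timed out. In the basic_nacl_amber_GPU scripts the master writes its endpoint to a host-info file that the workers launched by node.sh read back; if a worker starts before that file exists, it never attaches. A small guard I'm considering adding to node.sh; the helper name and timeout are my own, and the file path must match whatever runwe_bridges.sh uses for the host-info file:

```shell
# wait_for_file: block until the ZMQ master's host-info file appears, so the
# worker side does not try to read endpoint info that has not been written yet.
# Helper name and default timeout are assumptions, not from the tutorial.
wait_for_file() {
    local f="$1" timeout="${2:-60}" waited=0
    until [ -e "$f" ]; do
        sleep 1
        waited=$((waited + 1))
        if [ "$waited" -ge "$timeout" ]; then
            echo "ERROR: timed out waiting for '$f'" >&2
            return 1
        fi
    done
}
```

node.sh would then call wait_for_file "$SERVER_INFO" before starting its worker-side w_run.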
I would greatly appreciate any guidance or suggestions you could provide to help me understand where I might have gone wrong.
Best regards,
Fengyang
P.S. For these runs my workflow is: ./init.sh to remove and remake the simulation directories, then sbatch runwe_bridges.sh to submit the job.
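Since the two directories differ only in their WE.slurm files, here is the shape of the GPU header I'm comparing against. All directive values are placeholders for my cluster, not something from the tutorial; structurally, the only intended difference from the CPU script is the GPU request:

```shell
# Write out an example GPU job header for side-by-side comparison.
# Every value below is a placeholder; adjust partition/time/GPU count locally.
cat > WE.slurm.example <<'EOF'
#!/bin/bash
#SBATCH --job-name=we_gpu
#SBATCH --nodes=1
#SBATCH --ntasks-per-node=1
#SBATCH --gres=gpu:1
#SBATCH --time=24:00:00

source env.sh
w_run --work-manager=processes &> west-$SLURM_JOB_ID.log
EOF
```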