
How can I repeat a WE iteration after removing it?


Ibrahim Mohamed

Aug 28, 2023, 9:22:21 AM
to westpa-users
I was trying to optimize the number of cores to use while running WE iterations with GROMACS. While doing this, I removed the last two iterations from seg_logs and traj_segs, but when I tried to repeat these two runs, west.log showed the iterations as already complete (without starting them again).

This is the west.cfg file:

# The master WEST configuration file for a simulation.
# vi: set filetype=yaml :
---
west:
  system:
    driver: westpa.core.systems.WESTSystem
    system_options:
      # Dimensionality of your progress coordinate
      pcoord_ndim: 1
      # Number of data points per iteration
      # Needs to be pcoord_len >= 2 (minimum of parent, last frame) to work with most analysis tools
      # Number of frames output by the MD engine
      pcoord_len: 51
      # Data type for your progress coordinate
      pcoord_dtype: !!python/name:numpy.float32
      bins:
        type: RectilinearBinMapper
        # The edges of the bins
        boundaries:        
          -  [ 0.00, 2.60, 2.80, 3.00, 3.20, 3.40, 3.60, 3.80,
               4.00, 4.50, 5.00, 5.50, 6.00, 7.00, 8.0, 9.0, 10.0,
               11.0, 12.0, 13.0, 14.0, 15.0, 16.0, 17.0, 18.0, 19.0, 20.0, 'inf']
      # bins:
      #   type: RecursiveBinMapper
      #   base:
      #     type: RectilinearBinMapper
      #     boundaries:
      #       - [-inf, 0., 15., inf]
      #   mappers:
      #     - type: MABBinMapper
      #       nbins: [20]
      #       at: [0]

      # Number of walkers per bin
      bin_target_counts: 2 # 24
  propagation:
    max_total_iterations: 16 # 50
    max_run_wallclock:    240:00:00
    propagator:           executable
    gen_istates:          true
  data:
    west_data_file: west.h5
    datasets:
      - name:        pcoord
        scaleoffset: 4
    data_refs:
      iteration:     $WEST_SIM_ROOT/traj_segs/iter_{n_iter:06d}.h5
      segment:       $WEST_SIM_ROOT/traj_segs/{segment.n_iter:06d}/{segment.seg_id:06d}
      basis_state:   $WEST_SIM_ROOT/bstates/{basis_state.auxref}
      initial_state: $WEST_SIM_ROOT/istates/{initial_state.iter_created}/{initial_state.state_id}.xml
  plugins:
  executable:
    environ:
      PROPAGATION_DEBUG: 1
    propagator:
      executable: $WEST_SIM_ROOT/westpa_scripts/runseg.sh
      stdout:     $WEST_SIM_ROOT/seg_logs/{segment.n_iter:06d}-{segment.seg_id:06d}.log
      stderr:     stdout
      stdin:      null
      cwd:        null
      environ:
        SEG_DEBUG: 1
    get_pcoord:
      executable: $WEST_SIM_ROOT/westpa_scripts/get_pcoord.sh
      stdout:     $WEST_SIM_ROOT/get_pcoord.log
      stderr:     stdout
    gen_istate:
      executable: $WEST_SIM_ROOT/westpa_scripts/gen_istate.sh
      stdout:     /dev/null
      stderr:     stdout
    post_iteration:
      enabled:    true
      executable: $WEST_SIM_ROOT/westpa_scripts/post_iter.sh
      stderr:     stdout
    pre_iteration:
      enabled:    false
      executable: $WEST_SIM_ROOT/westpa_scripts/pre_iter.sh
      stderr:     stdout
  # Settings for w_ipa, an interactive analysis program that can also automate analysis.
  analysis:
     directory: ANALYSIS                # specify the directory all analysis files should exist in.
     kinetics:                          # general options for both kinetics routines.
       step_iter: 1
       evolution: cumulative
       extra: [ 'disable-correl' ]
     analysis_schemes:                  # Analysis schemes.  Required: name (TEST), states, and bins
       TEST:
         enabled: True
         bins:
           - type: RectilinearBinMapper
             boundaries:
               - [0.0,2.6,12.0,'inf']
         states:
           - label: bound
             coords:
               - [0]
           - label: unbound
             coords:
               - [12.1]  

The last two runs should be iterations 15 and 16, which is why max_total_iterations is set to 16.

I am also continuing from iteration 14.

Leung, Jeremy

Aug 28, 2023, 10:40:23 AM
to westpa...@googlegroups.com
Hi Ibrahim,

Simply removing traj_segs and seg_logs is not sufficient. You need to run w_truncate to remove the relevant iterations from the west.h5 file.

`w_truncate -n 15` should delete everything from iteration 15 onwards (including 15). Back up your h5 file just in case!
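
For example, a minimal sketch of that restart workflow, assuming the default west.h5 data file (and note: if your init.sh calls w_init, skip it on the restart so the truncated west.h5 is not overwritten):

cp west.h5 west.h5.bak    # keep a backup before truncating
w_truncate -n 15          # drops iterations 15 and onward from west.h5
# then resubmit your run script; w_run resumes from the end of iteration 14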

-- JL
---
Jeremy M. G. Leung
PhD Candidate, Chemistry
Graduate Student Researcher, Chemistry (Chong Lab)
University of Pittsburgh | 219 Parkman Avenue, Pittsburgh, PA 15260
jml...@pitt.edu | [He, Him, His]


Ibrahim Mohamed

Aug 28, 2023, 11:06:43 AM
to westpa-users
Okay, thank you.
Using the default command from the NaCl association GROMACS tutorial, I found that each walker uses one core, so with one node of 24 cores, 24 walkers run simultaneously.
Is there a way to use two nodes (48 cores) and:
1. make each core run a walker, or
2. use two cores to run one walker?

This is the command for GROMACS:
gmx mdrun -v -deffnm seg -nt 1

This is the run.sh:

#!/bin/sh
#SBATCH --job-name=west
#SBATCH --partition=cpu
#SBATCH --ntasks=24
#SBATCH --cpus-per-task=1
#SBATCH --time=10-00:00:00

bash ./env.sh
bash ./init.sh

rm -f west.log
w_run --work-manager processes "$@" &> west-${SLURM_JOB_ID}.log

Thanks

Leung, Jeremy

Aug 28, 2023, 11:18:41 AM
to westpa...@googlegroups.com
Hi Ibrahim,

You can pass the `--n-workers` option to dictate how many segments to run at a time.

For example, setting --n-workers=12 on a single node of 24 cores would allow each segment to run on 2 cores:
   w_run --work-manager processes --n-workers=12 "$@" &> west-${SLURM_JOB_ID}.log

For multiple nodes, you'll have to switch to the MPI or ZMQ (recommended) work managers.

We have a few examples in the additional_tutorials folder (and definitely in the Google Groups mailing list). Please also read the wiki:

 

-- JL

---
Jeremy M. G. Leung
PhD Candidate, Chemistry
Graduate Student Researcher, Chemistry (Chong Lab)
University of Pittsburgh | 219 Parkman Avenue, Pittsburgh, PA 15260
jml...@pitt.edu | [He, Him, His]

Ibrahim Mohamed

Aug 29, 2023, 3:20:03 AM
to westpa-users
Thanks for your help.
I tried this command:
w_run --work-manager processes --n-workers=12 "$@" &> west-${SLURM_JOB_ID}.log

but the time reported by GROMACS for each segment was the same as with the default command:
w_run --work-manager processes "$@" &> west-${SLURM_JOB_ID}.log

The command I used for running GROMACS is:
gmx mdrun -v -deffnm seg -nt 1

However, when I changed -nt to 2, the time decreased, but GROMACS showed me this message:

Using 1 MPI thread
Using 2 OpenMP threads


NOTE: The number of threads is not equal to the number of (logical) cores
      and the -pin option is set to auto: will not pin threads to cores.
      This can lead to significant performance degradation.
      Consider using -pin on (and -pinoffset in case you run multiple jobs).

Leung, Jeremy

Aug 29, 2023, 9:52:42 AM
to westpa...@googlegroups.com
Hi Ibrahim,

Great! That message is a GROMACS matter. The -nt 2 option lets GROMACS run on two threads, and the note looks safe to ignore if you're getting the desired speedup.

(You can, of course, play around with -pin and -pinoffset to see if you get any more speedup, but you'll get better help on that from the GROMACS community.)
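
For instance, a minimal sketch of the pinned mdrun call (whether it actually helps depends on how the workers land on your cores, so benchmark it):

gmx mdrun -v -deffnm seg -nt 2 -pin on    # add -pinoffset, as GROMACS's note suggests, if several jobs share a node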

-- JL
---
Jeremy M. G. Leung
PhD Candidate, Chemistry
Graduate Student Researcher, Chemistry (Chong Lab)
University of Pittsburgh | 219 Parkman Avenue, Pittsburgh, PA 15260
jml...@pitt.edu | [He, Him, His]

Ibrahim Mohamed

Nov 6, 2023, 3:24:52 AM
to westpa-users
Hi Jeremy,

I have managed to use the ZMQ work manager to run the WE simulation on 48 cores (2 nodes). However, each segment used one core. Is there a way to make each segment use two cores?
Attached are the files I am using.
Thanks

node.sh
env.sh
run_ZMQ_two_nodes.sh
init.sh

Leung, Jeremy

Nov 6, 2023, 12:46:17 PM
to westpa...@googlegroups.com
Hi Ibrahim,

I'm working under the assumption that there are 24 threads/12 cores per node.

1) In your run script, your `srun` seemingly calls just 1 task per node with `-n 1`, but that is being overridden by the cpus-per-task setting (I think) up above, which means it'll have 24 tasks? And then you're calling `--n-workers 48`, which means the workers are going to be fighting over the CPUs? I think you need something like:

srun -N 2 --ntasks-per-node 12 $WEST_SIM_ROOT/node.sh --work-manager=zmq --zmq-mode=client --n-workers=12 --zmq-read-host-info=$SERVER_INFO --zmq-comm-mode=tcp --debug &

Or, following our examples, request the nodes up front and loop through each node:

for node in $(scontrol show hostname $SLURM_NODELIST); do
    ssh -o StrictHostKeyChecking=no $node $WEST_SIM_ROOT/node.sh $SLURM_SUBMIT_DIR $SLURM_JOBID $node --work-manager=zmq --zmq-mode=client --n-workers=12 --zmq-read-host-info=$SERVER_INFO --zmq-comm-mode=tcp &
done

2) And in your runseg.sh, make sure you pass `-nt 2` to `gmx mdrun`.

3) Also, since there are only 2 atoms (this is the NaCl tutorial, I think?), GROMACS will probably have a very hard time parallelizing this over 2 threads.

Best,

-- JL
---
Jeremy M. G. Leung
PhD Candidate, Chemistry
Graduate Student Researcher, Chemistry (Chong Lab)
University of Pittsburgh | 219 Parkman Avenue, Pittsburgh, PA 15260
jml...@pitt.edu | [He, Him, His]

hollowic...@gmail.com

Nov 7, 2023, 2:28:03 PM
to westpa...@googlegroups.com
Hi Jeremy,

I have followed your instructions and tried to use 49 walkers on two nodes (24 cores each), and I found that all of them were running at the same time.
But this is different from what I meant before: I was hoping to use two cores for each segment. So with 49 walkers on 48 cores, 24 segments would run using 2 cores each; after they finish, the next 24 would start, and then the last one.

Attached are the ZMQ, node, init, and env .sh files.

Thank you for your help
run_ZMQ_two_nodes.sh
node.sh
env.sh
init.sh

Leung, Jeremy

Nov 7, 2023, 4:13:22 PM
to westpa...@googlegroups.com
Hi Ibrahim,

There are a few typos in your scripts that completely messed up the whole ZMQ process. 

west.cfg
Line 33: you need a space after master.
Line 57: you need a space after -N.

node.sh
Line 18 references $SLURM_NODENAME, which is not set; this means every one of your nodes will write to the same file.


Here's a repo with a working two-node GROMACS example that I personally set up and tested. I checked that each traj/gmx mdrun job ran on 2 threads.


---
Jeremy M. G. Leung
PhD Candidate, Chemistry
Graduate Student Researcher, Chemistry (Chong Lab)
University of Pittsburgh | 219 Parkman Avenue, Pittsburgh, PA 15260
jml...@pitt.edu | [He, Him, His]

hollowic...@gmail.com

Nov 8, 2023, 2:16:51 PM
to westpa...@googlegroups.com
First of all, I would like to thank you for taking the time to help me. I really appreciate it.

I downloaded the GitHub repo you created, changed the number of walkers to 49, and ran it, and I found that all 49 segments run at the same time. Is this the correct behaviour?

Thanks

Jeremy Leung

Nov 10, 2023, 2:17:56 PM
to westpa-users
Ibrahim,

Can you specify which lines you modified to get 49 walkers?

I'm also a little confused about the "49" because you mentioned there are 2 nodes * 24 cores = 48 cores. Do your processors have hyper-threading such that you can have 2 nodes * 24 cores * 2 threads/core = 96 threads?  If that's the case, then having 49 segments run at the same time makes sense. Double check your seg.log files (GROMACS output for each segment) to see if you're running each mdrun with 2 threads.
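
For example, something like this would show the thread count for every segment of a given iteration (the path is an assumption based on your data_refs and -deffnm seg; adjust it if your layout differs):

grep -m1 "OpenMP threads" traj_segs/000002/*/seg.log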

-- JL

hollowic...@gmail.com

Nov 10, 2023, 5:31:02 PM
to westpa-users
Dear Jeremy,

I modified this line in the west.cfg:  bin_target_counts: 49
I searched for the word "hyper" in the GROMACS log file but could not find it. However, I found this:
OpenMP support:     enabled (GMX_OPENMP_MAX_THREADS = 64)

Attached are the GROMACS log file (seg.log) and the WESTPA segment log for the same segment of the first iteration.
000001-000000.log
seg.log

Ibrahim Mohamed

Nov 13, 2023, 3:03:31 PM
to westpa-users
I think I finally managed to run it using two cores per segment. It seems that --n-workers needs to be 1, not 12.
This is the working command:
srun -c 2 -N 2 -n 24 $WEST_SIM_ROOT/node.sh $WEST_SIM_ROOT --work-manager=zmq --zmq-mode=client --n-workers=1 --zmq-read-host-info=$SERVER_INFO --zmq-comm-mode=tcp &

with -nt 2 in the gmx mdrun command, and with these settings at the start of the main .sh file:

#!/bin/sh
#SBATCH --job-name=NaCl_ZMQ
#SBATCH --partition=cpu
#SBATCH --time=1:00:00
#SBATCH --ntasks=24
#SBATCH --cpus-per-task=2

Thank you for your help and suggestions

Attached are the working files (the ones that worked for me). I also tried them with the files from the second tutorial and they worked well.
node.sh
env.sh
run_ZMQ_two_nodes.sh
init.sh

Jeremy Leung

Nov 13, 2023, 4:23:56 PM
to westpa-users
Hi Ibrahim,

It's great that that's working for you. What your configuration does is run 24 "node.sh" processes and connect each worker they generate to the main ZMQ server (the one you started with --n-workers=0).

Something like:
srun -N 2 --ntasks-per-node=1 $WEST_SIM_ROOT/node.sh $WEST_SIM_ROOT --work-manager=zmq --zmq-mode=client --n-workers=12 --zmq-read-host-info=$SERVER_INFO --zmq-comm-mode=tcp &

should actually work better in terms of output, since each "node.sh" (now 2 of them) generates 12 workers and writes to a single text file. Note that anything after $WEST_SIM_ROOT is passed directly to w_run in node.sh via "$@"; $WEST_SIM_ROOT itself is removed from the argument list via "shift" in node.sh.
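
For reference, a hedged sketch of the server side those clients connect to, modeled on the standard WESTPA multi-node ZMQ examples (the SERVER_INFO path and flag spellings are taken from those example scripts, so double-check them against your run_ZMQ_two_nodes.sh):

SERVER_INFO=$WEST_SIM_ROOT/west_zmq_info-${SLURM_JOB_ID}.json   # clients read the server address from this file
w_run --work-manager=zmq --n-workers=0 --zmq-mode=master \
      --zmq-write-host-info=$SERVER_INFO --zmq-comm-mode=tcp &> west-${SLURM_JOB_ID}.log &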

Play around with the srun options. There are a lot of them.

-- JL

hollowic...@gmail.com

Nov 14, 2023, 2:35:23 AM
to westpa...@googlegroups.com, Jeremy Leung