Time taken by WESTPA iteration doesn't match simulation length

Razvan Marinescu

Jan 11, 2023, 10:34:22 AM
to westpa-users
Hi everyone,

The time that a single WESTPA iteration takes (~3 min) is significantly longer than the time it takes to finish a single segment (~55 s). However, given that the number of GPUs I have (204 in total, running on 51 nodes) is larger than the total number of segments (195), the whole iteration should take ~1 min instead of ~3 min. What am I missing? Why are my iterations 3x longer?

Here are some log prints. The time taken by a single segment to finish is 56 s, plus 2 s for computing the pcoord:
#"Step","Potential Energy (kJ/mole)","Temperature (K)","Speed (ns/day)"
1930000,-303022.625,302.42474136859636,0
1935000,-302605.1875,300.0807577990081,174
1940000,-303047.84375,300.0887716341635,173
1945000,-302578.65625,300.1127171038857,173
1950000,-301680.375,302.25395961573724,173
elapsed seconds: 55.99628233909607
+ date
Wed Jan 11 09:19:12 CST 2023
+ python /u/rmarinescu/work/10mer2D/common_files/dist.py
+ date
Wed Jan 11 09:19:14 CST 2023

Time taken by iterations:

grep "wallclock"  slurm.out
Iteration wallclock: 0:03:21.727373, cputime: 3:14:25.149217
Iteration wallclock: 0:03:09.230570, cputime: 3:07:03.777573
Iteration wallclock: 0:03:08.426127, cputime: 2:59:31.713428
Iteration wallclock: 0:02:55.408825, cputime: 2:44:50.820108
Iteration wallclock: 0:03:01.559169, cputime: 2:52:22.652424
Iteration wallclock: 0:03:08.952174, cputime: 2:59:49.063276
Iteration wallclock: 0:03:10.041508, cputime: 3:03:40.220779
Iteration wallclock: 0:03:07.729556, cputime: 3:00:02.322551
Iteration wallclock: 0:03:11.174920, cputime: 3:07:35.708657
Iteration wallclock: 0:03:16.932580, cputime: 3:11:16.898027
Iteration wallclock: 0:03:14.564068, cputime: 3:07:34.667934
Iteration wallclock: 0:03:09.452174, cputime: 3:03:45.795260
Iteration wallclock: 0:03:08.356321, cputime: 3:07:48.013928
Iteration wallclock: 0:03:12.781218, cputime: 3:11:36.007694
Iteration wallclock: 0:03:12.135257, cputime: 3:07:48.813627
Iteration wallclock: 0:03:14.538703, cputime: 3:07:52.203024
Iteration wallclock: 0:03:09.185754, cputime: 3:00:22.352625
Iteration wallclock: 0:03:08.892456, cputime: 3:04:15.003890
Iteration wallclock: 0:03:11.210650, cputime: 3:04:15.699921
Iteration wallclock: 0:03:07.959955, cputime: 3:00:33.297622

And I'm currently running on 51 nodes, with 4 GPUs each:

             JOBID PARTITION     NAME     USER ST       TIME  NODES NODELIST(REASON)
           1277965  gpuA40x4    10mer rmarines  R    4:06:36     51 gpub[001,003-005,007-009,014,016,018-021,025-027,029-031,035,037,039,045-048,050-057,059-062,065,069,074-075,078-079,081,083,085,087-089,092]

Any ideas?

Many thanks,
Razvan

Yang, Darian T

Jan 11, 2023, 12:05:11 PM
to westpa...@googlegroups.com
Hi Razvan,

If you're using the HDF5 framework, loading into MDTraj may be a bottleneck for large systems.
If you're using something like the MAB scheme, resampling may take longer than expected if you have a large number of segments.

If you are using these features, you could try running without them and comparing times.

Best,
Darian

Razvan Marinescu

Jan 11, 2023, 11:50:00 PM
to westpa-users
Hi Darian,

Thank you very much for your answer! 

If you're using the HDF5 framework, loading into MDTraj may be a bottleneck for large systems.

Yes, I am using HDF5. My system has 44,000 atoms; could loading become a bottleneck at this size? I am indeed also using the MAB scheme. I'll try turning both off to check and will report back.

Many thanks,
Razvan

Razvan Marinescu

Jan 12, 2023, 1:26:48 AM
to westpa-users
Hi Darian and everyone,

Switching to fixed binning does not seem to help. However, I managed to profile the code manually by printing timestamps. The bottleneck seems to be in src/westpa/core/sim_manager.py, in the propagate() method, in this code block:

        while futures:
            # TODO: add capacity for timeout or SIGINT here
            future = self.work_manager.wait_any(futures)
            futures.remove(future)

            if future in segment_futures:
                segment_futures.remove(future)
                incoming = future.get_result()
                self.n_propagated += 1

                self.segments.update({segment.seg_id: segment for segment in incoming})
                self.completed_segments.update({segment.seg_id: segment for segment in incoming})
                
                self.we_driver.assign(incoming)
                new_istate_futures = self.get_istate_futures()
                istate_gen_futures.update(new_istate_futures)
                futures.update(new_istate_futures)
                
                with self.data_manager.expiring_flushing_lock():
                    self.data_manager.update_segments(self.n_iter, incoming)

                

Going through that loop for 188 segments takes a full 2 minutes: each segment takes around 0.7 s to process, and the loop runs over all of them. What does that code do? Is there something I can optimize there?
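
For reference, the profiling itself was nothing fancy; roughly this pattern, with wall-clock timestamps printed around each pass of the loop (the helper below is mine, not WESTPA code, and the sleep just stands in for the real work):

import time

def timestamp(label, t_prev):
    # Print the wall-clock time elapsed since t_prev and return a fresh reference point.
    t_now = time.time()
    print(f'{label}: {t_now - t_prev:.3f} s', flush=True)
    return t_now

t = time.time()
for seg_id in range(5):    # stands in for 'while futures:' in propagate()
    time.sleep(0.7)        # stands in for wait_any() / assign() / update_segments()
    t = timestamp(f'segment {seg_id} processed', t)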

Thanks,
Razvan

Anthony Bogetti

Jan 12, 2023, 6:16:48 AM
to westpa...@googlegroups.com
Hi Razvan,

Are the timings you are getting from the first iteration? The first iteration usually takes a bit longer to finish than normal, and by the second iteration the timing should be closer to what you expect. If not, does the longer-than-expected iteration runtime stay constant over the course of a few iterations?

Best,
Anthony

Razvan Marinescu

Jan 13, 2023, 3:12:16 PM
to westpa-users
Hi Anthony and Darian,

I found the problem! It was indeed the HDF5 framework. Now I don't get any significant extra time at the end of the iterations. I get good performance even when running with MAB (which waits for all walkers to complete before doing the re-binning) and with MPI. 

However, I think I need to keep the HDF5 framework turned on to save space. What are my options then? Can I replace that MDTraj import with a faster alternative, and where does that import happen? Or can I compress the results at the end of a WESTPA run?
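
For the last option, something like this is what I had in mind: compressing each iteration directory once the run is done (just a sketch, assuming the standard traj_segs/NNNNNN layout):

import glob
import os
import shutil
import tarfile

# Archive every completed iteration directory under traj_segs/ and drop the originals.
for iter_dir in sorted(glob.glob('traj_segs/[0-9]*')):
    if not os.path.isdir(iter_dir):
        continue
    with tarfile.open(iter_dir + '.tar.gz', 'w:gz') as tar:
        tar.add(iter_dir)        # e.g. traj_segs/000008 -> traj_segs/000008.tar.gz
    shutil.rmtree(iter_dir)      # remove the original only after the archive is written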

Thanks,
Razvan 

Razvan Marinescu

Jan 13, 2023, 3:25:58 PM
to westpa-users
Quick follow-up: I did a timing test in Python, and it takes ~1 s to load my system:

import time
import mdtraj as md

f = 'traj_segs/000008/000186/'  # one segment directory from a finished iteration
start = time.time()
md.load(f + 'seg.dcd', top=f + 'bstate.pdb')  # load the segment trajectory with its topology
end = time.time()
print(end - start)  # time in seconds

(westpa) python timing.py
1.1193983554840088

This matches what I observed earlier: when running around 200 segments, the iteration takes roughly 200 extra seconds. Could it be that those loads are running sequentially instead of in parallel? Is there a different, faster library that I could use?
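
To check whether the loads themselves can run in parallel (separately from how WESTPA schedules them), I was thinking of a quick test along these lines (a sketch; the pool size and paths are just what I'd try first):

import glob
import time
from multiprocessing import Pool

import mdtraj as md

def load_one(seg_dir):
    # Load one segment trajectory with its topology, following the layout above.
    return md.load(seg_dir + '/seg.dcd', top=seg_dir + '/bstate.pdb').n_frames

if __name__ == '__main__':
    seg_dirs = sorted(glob.glob('traj_segs/000008/*'))

    start = time.time()
    for d in seg_dirs:
        load_one(d)
    print('sequential:', time.time() - start, 's')

    start = time.time()
    with Pool(8) as pool:
        pool.map(load_one, seg_dirs)
    print('8 workers: ', time.time() - start, 's')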

Razvan

Yang, Darian T

Jan 13, 2023, 4:52:32 PM
to westpa...@googlegroups.com
Hi Razvan,

Thanks for following up with this additional info.
The loading that you pointed out is indeed done sequentially (since only one process can write into an h5 file at a time and there is only one iteration.h5).

Currently, we are not aware of any alternatives to MDTraj and its HDF5-specific trajectory formatting (although we have discussed this issue, and improvements may be available in the future).

To cut down on size if you decide to forgo the HDF5 framework, you could try saving the coordinates as auxdata directly in the west.h5 file and removing the trajectory files, or tarring each segment after it finishes running. The latter may be easier to set up and to compare against the HDF5-style sizes.

Best,
Darian

Razvan Marinescu

Jan 13, 2023, 9:39:09 PM
to westpa-users
Thank you very much, Darian! That makes sense. I will run without HDF5 for now, and I might tar each segment after it finishes. If I have tarred archives, am I correct in assuming that the post-hoc analysis commands (mainly plothist, and trace3.py to piece together a trajectory) will not work on the tars out of the box, and that I'll have to modify them manually?

-Razvan

Razvan Marinescu

Jan 14, 2023, 3:50:33 PM
to westpa-users
Hi Darian,

Quick follow-up question: if I tar each iteration folder (000001.tar, 000002.tar, 000003.tar, ...), will I still be able to run the post-hoc analyses out of the box (mainly w_pdist, plothist, and trace3.py)? Am I correct that tarring was the usual approach in WESTPA 1.0 (before the HDF5 framework was introduced), so those commands should still work and be backwards compatible?

Thanks,
Razvan

Leung, Jeremy

Jan 14, 2023, 4:00:13 PM
to westpa...@googlegroups.com
Hi Razvan,

`w_pdist` only requires your `west.h5` file, and `plothist` only depends on the `pdist.h5` file generated by `w_pdist`, so tarring your traj_segs will have no effect on them. trace3.py (which I assume traces a trajectory out) will probably require you to untar the per-iteration folders in order to run, since it depends on the trajectory files.

Whatever code you used to tar/untar in WESTPA 1.0 should work. If not, the following snippet in `post_iter.sh` should do it:
ITER=$(printf "%06d" $WEST_CURRENT_ITER)
TAR=$(($WEST_CURRENT_ITER-1))
TAR_DIR=$(printf "%06d" $TAR)
echo $ITER
echo $TAR
echo $TAR_DIR
tar -cf seg_logs/$ITER.tar seg_logs/$ITER-*.log
rm -f seg_logs/$ITER-*.log
if [ -d traj_segs/$TAR_DIR ]; then
  tar -cf traj_segs/$TAR_DIR.tar traj_segs/$TAR_DIR
  rm -rf traj_segs/$TAR_DIR
fi
Best,

Jeremy L.
--
Jeremy M. G. Leung
PhD Candidate, Chemistry
Graduate Student Researcher, Chemistry (Chong Lab)
University of Pittsburgh | 219 Parkman Avenue, Pittsburgh, PA 15260
jml...@pitt.edu | [He, Him, His]

Razvan Marinescu

Jan 15, 2023, 1:51:44 AM
to westpa...@googlegroups.com, Anthony Bogetti
Hi Jeremy and Anthony,

That is great, thank you very much! Do you also have a script that traces out a full trajectory (i.e., from a given walker in a given iteration back to its origin)? The one from WESTPA 1.0 would work. Anthony sent me trace3.py some months ago, but it only works with the iter_000X.h5 files. Now that I've deactivated HDF5, I'd like to trace trajectories the way it was done in WESTPA 1.0.

Many thanks,
Razvan
 
