Simplifying analysis cadences


Keaton Burns

Sep 24, 2015, 10:42:16 PM
to dedal...@googlegroups.com
Hi all,

Right now, analysis tasks can be set to trigger on any and all of the following cadences: iteration intervals, simulation time intervals, and wall time intervals.

I’m considering implementing the following changes:

1) Allowing a given file handler to only trigger on one type of cadence.

2) Removing the wall time option.

With these changes, each write would have a single, well-defined “index” indicating which multiple of the handler’s (one) cadence it corresponds to. This index would be robust in the sense that it can be used to properly order writes from different restarted simulations, etc., making it substantially easier to combine the outputs of restarted simulations.
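The indexing idea can be sketched as follows. This is a hypothetical helper, not Dedalus's actual implementation: with one cadence per handler, each write's index is just the nearest multiple of that cadence, so writes from the original run and any restart land on the same numbering.

```python
def write_index(sim_time, sim_dt):
    """Index of the write triggered at sim_time under a single sim_dt
    cadence (hypothetical helper, not Dedalus API).

    Because the index depends only on sim_time and the cadence, a
    restarted simulation resumes on the same numbering as the original.
    """
    return round(sim_time / sim_dt)

# A restart beginning at sim_time = 2.5 with sim_dt = 0.5 resumes at
# index 5, matching the original run's numbering.
```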

I don’t think #1 should present any serious issues, since you can always create another handler with the same tasks and a different triggering mechanism.

As for #2, the wall-time per time-step should be roughly constant, so it shouldn’t be too hard to figure out the iteration cadence needed to approximate a given wall-time cadence through a quick test run.
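The conversion suggested above could look like this (a minimal sketch, assuming you've measured wall-seconds-per-iteration in a short test run; the function name is made up for illustration):

```python
def iter_cadence_for_wall_dt(target_wall_dt, wall_per_iter):
    """Approximate a wall-time cadence with an iteration cadence.

    target_wall_dt: desired seconds of wall time between outputs.
    wall_per_iter:  wall-seconds per iteration, measured in a test run.
    Returns an iteration interval (at least 1) giving roughly the
    desired wall-time spacing, assuming wall-time-per-step stays
    roughly constant over the run.
    """
    return max(1, round(target_wall_dt / wall_per_iter))

# e.g. a test run showing ~0.8 s/iter maps a 3600 s wall cadence to
# one output every 4500 iterations.
```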

Any thoughts or objections?

Thanks,
-Keaton

Geoff Vasil

Sep 24, 2015, 11:18:03 PM
to dedal...@googlegroups.com
I definitely agree with #1.

I mostly agree with #2. I don’t see it as particularly useful. But what’s the cost of keeping it if it’s there now? And we do have wall-clock triggers for stopping a sim entirely. I probably wouldn’t use the wall-clock trigger. But I can see a use for it on supercomputing platforms with hard wall-clock limits, where the time per iteration isn’t known in advance. It’s simple from the point of view of wanting 18 outputs after an 18-hour run.

I can also see a benefit to removing it if it makes optimising the tree easier.

-Geoff
> --
> You received this message because you are subscribed to the Google Groups "Dedalus Development" group.
> To unsubscribe from this group and stop receiving emails from it, send an email to dedalus-dev...@googlegroups.com.
> To post to this group, send email to dedal...@googlegroups.com.
> To view this discussion on the web visit https://groups.google.com/d/msgid/dedalus-dev/3DF14279-75C6-4D94-94BB-DB00B8A8CF00%40gmail.com.
> For more options, visit https://groups.google.com/d/optout.

Ben Brown

Sep 24, 2015, 11:28:01 PM
to dedal...@googlegroups.com
I also completely agree with #1.  I hadn't realized you could trigger on more than one criterion.  In practice, I'm always triggering a particular output on a single criterion, and as you were suggesting, I use a second output if I want a different trigger.

I strongly disagree with #2, deprecating wall_dt.  I do so for one very particular use case, and as Geoff was guessing, it has to do with supercomputers and hard wall times.  The wall_dt is incredibly useful when checkpointing.  With a wall_dt criterion for checkpoints, I can ensure that no more than a given tolerable amount of wall time is lost in a crash, and I can also ensure that the last output of a normal run will be sufficiently close to the end wall time.

I typically do wall_dt=3600 for 24 hour runs, and even that is a bit of overkill.  A wall_dt criterion is vastly preferable to a sim_dt or iteration criterion for checkpointing, both of which tend to make way too many checkpoints early in a run, and too few later when the dynamics are interesting and the computation is more demanding (e.g., getting through the transient), and where big runs can be particularly vulnerable to cluster failure.
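The wall_dt behavior described here can be sketched as a small trigger class. This is an illustrative sketch, not Dedalus's implementation; the class and method names are made up:

```python
import time

class WallTimeTrigger:
    """Minimal sketch of a wall-clock output trigger (hypothetical,
    not Dedalus's actual code): fires once wall_dt seconds have
    elapsed since the last write, independent of resolution, core
    count, or timestep size."""

    def __init__(self, wall_dt):
        self.wall_dt = wall_dt
        self.last_write = time.time()

    def should_write(self, now=None):
        """Return True (and reset the clock) if wall_dt has elapsed."""
        now = time.time() if now is None else now
        if now - self.last_write >= self.wall_dt:
            self.last_write = now
            return True
        return False
```

Called once per iteration inside the main loop, this caps the wall time lost to a crash at roughly wall_dt, regardless of how long each timestep takes.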

With wall_dt for checkpointing, the write index issue is ameliorated in that we're typically explicitly specifying the restart file, and we pick the index out directly during the restart; I can see that it could complicate determination of write indexing for other output files that choose to use wall_dt... but I think that edge case would fall to the user to sort out.

Those are my thoughts.  It would be a huge boon to have determinable write indexing for the restarts.

--Ben


Keaton Burns

Sep 24, 2015, 11:43:07 PM
to dedal...@googlegroups.com
Geoff — the cost is that it just doesn’t allow a straightforward indexing for aligning restarted outputs with original outputs.

Ben — I agree it’s useful for checkpointing, that’s why I originally implemented it, but I guess I’m not sure how a wall_dt cadence could be significantly different from an iteration cadence in its behavior. If it is, that would imply that the wall-time-per-iteration is substantially changing early vs late in your simulation, right?

Geoff Vasil

Sep 24, 2015, 11:55:04 PM
to dedal...@googlegroups.com
Iterations per wall time do remain quite constant. The issue is that it’s not known very well in advance. This comes up the first time one starts a simulation, and also every time one changes the number of cores. When doing months-long production runs, it's really common to pick a weird number of cores to fit in free windows that open at odd intervals. It’s usually very difficult to guess how many iterations you’ll get in these situations, but you want to make sure the sim outputs something before hitting the wall.

I see the advantage of wanting to keep things straight from run to run. But I’m starting to recall it’s a very useful tool to squeeze data out of limited resources.

Ben Brown

Sep 24, 2015, 11:57:05 PM
to dedal...@googlegroups.com
> Ben — I agree it’s useful for checkpointing, that’s why I originally implemented it, but I guess I’m not sure how a wall_dt cadence could be significantly different from an iteration cadence in its behavior. If it is, that would imply that the wall-time-per-iteration is substantially changing early vs late in your simulation, right?

Yeah, especially with LU recycling, the early part of the sim has a lot of changing timesteps that make long-term timing difficult to predict, while the later parts tend to have more stable timestepping.

But I think the bigger issue is that with a wall_dt I can set the same cadence for every sim (say 1/hour) based on cluster & allocation properties, and this is independent of both the resolution and the number of cores I'm running on.  That use case can't be covered by an iteration or sim_dt based cadence, since the iter/sec and sim_sec/sec both change substantially with resolution and core count.  In a parameter space sweep, it can be impractical to adjust each sim's checkpointing cadence.  Conversely, in those same studies, the actual science analysis outputs can be done on a well-defined sim_dt.

They're addressing different timing concerns: "how much does node failure cost?" vs. "how do I diagnose interesting dynamics?"

Ben Brown

Sep 24, 2015, 11:58:16 PM
to dedal...@googlegroups.com
On Thu, Sep 24, 2015 at 9:56 PM, Geoff Vasil <geoffrey...@gmail.com> wrote:
> Iterations per wall time do remain quite constant. The issue is that it’s not known very well in advance. This comes up the first time one starts a simulation, and also every time one changes the number of cores. When doing months-long production runs, it's really common to pick a weird number of cores to fit in free windows that open at odd intervals. It’s usually very difficult to guess how many iterations you’ll get in these situations, but you want to make sure the sim outputs something before hitting the wall.
>
> I see the advantage of wanting to keep things straight from run to run. But I’m starting to recall it’s a very useful tool to squeeze data out of limited resources.

Yep.  Exactly.  Oh the joys of queue backfill...


Keaton Burns

Sep 25, 2015, 12:04:04 AM
to dedal...@googlegroups.com
Yes, these are all the original reasons I implemented the wall cadence -- I posted this basically to find out whether anyone else was actually using it, and it seems so, so I’ll keep it. It just means a little more effort making sure other post-processing tools do something reasonable for wall_dt-based triggers w/o an index, but that’s not a huge deal.

Thanks,
-Keaton

Ben Brown

Sep 25, 2015, 12:19:49 AM
to dedal...@googlegroups.com
Keaton,
     Awesome.

And the wall_dt trigger has been transformative for checkpointing.  Seriously awesome.  Previously we all just did what Geoff and I have been whining about and dealt with it :)  Hard to go back when you've seen the future.
--Ben

Jeffrey S. Oishi

Sep 25, 2015, 11:36:36 AM
to dedal...@googlegroups.com
I'm late to the party, but I agree with Ben & Geoff. Yes, definitely, to removing multiple triggers for a single handler; no, definitely, to removing wall_dt. I've actually (accidentally) set multiple triggers both here and in the Pencil Code (where it is also possible)... it leads to confusion, and I don't see any need for it.

j
