restart/checkpoint strategies in user applications/libraries


Denis Davydov

Dec 21, 2016, 8:07:47 AM
to deal.II User Group
Hi all,

This is a bit off-topic, but I would like to ask how you prefer to implement restarts/checkpoints in your user applications.
I see two options:

1. Make restarting usable with the same input parameter file so that calculations continue exactly from the point of restart.
For example, if you wanted to do 10 adaptive refinements and one node failed on the 6th step, you would use exactly the same input parameter
file to automatically continue from the 5th step if the application sees that restart information is available.

2. Another approach is to treat a restart as an arbitrary input (similar to the input mesh, initial conditions, and the like).
With the same example in mind, if one uses the same input parameter file with 10 refinement steps
but specifies `Restart = true` and thereby uses the restart generated from the 5th step,
then the program would actually end up doing 10 refinement steps starting from the input data taken from the restart.
In other words, 15 steps in total.
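
The difference between the two options comes down to how the step count in the parameter file is interpreted once restart data is loaded. A minimal sketch, assuming hypothetical names (`RestartPolicy`, `steps_to_run`) that are not taken from deal.II or any of the libraries mentioned:

```cpp
#include <cassert>

// Hypothetical names for illustration only; not part of any library API.
enum class RestartPolicy
{
  continue_run, // option 1: the parameter file describes the whole run
  new_input     // option 2: the restart is just another initial condition
};

// Number of refinement steps still to execute after loading a restart.
// `requested` is the step count from the parameter file, `completed` is
// the number of steps already contained in the restart data.
unsigned int steps_to_run(const RestartPolicy policy,
                          const unsigned int  requested,
                          const unsigned int  completed)
{
  if (policy == RestartPolicy::new_input)
    return requested; // run every requested step on top of the restart

  // Option 1: only finish what is still missing.
  return completed < requested ? requested - completed : 0;
}
```

With the example above (10 requested steps, restart data from the 5th step), `continue_run` leaves 5 steps to do, while `new_input` runs 10 more, for 15 in total.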

Maybe the ASPECT (Timo, Wolfgang?), pi-DoMUS (Luca?) or DOpElib developers could comment on which restart strategy you have chosen
for your libraries and why?

Regards,
Denis.

Timo Heister

Dec 24, 2016, 4:00:52 PM
to dea...@googlegroups.com
Denis,

we do a mixture in ASPECT:
We have a "resume computation = true" setting in the prm and ASPECT
will first load all settings from the ParameterHandler and then resume
from the last snapshot by deserializing necessary information
(including current time, mesh, etc.). Note that we do not serialize
things that come directly from the ParameterHandler (like a solver
tolerance). This means you can resume a computation and change some of
your settings if you want to. We also allow changing the number of
processors for parallel computations.
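
A rough sketch of that split between settings and serialized state, assuming hypothetical `Parameters` and `State` types and a plain-text snapshot. ASPECT's real classes and snapshot format are not shown in this thread and will differ:

```cpp
#include <cassert>
#include <cstddef>
#include <iomanip>
#include <istream>
#include <limits>
#include <ostream>
#include <sstream>
#include <vector>

// Hypothetical types for illustration only; not ASPECT's actual classes.

// Settings that are always read from the parameter file, even on resume.
struct Parameters
{
  double solver_tolerance = 1e-8;
  double end_time         = 1.0;
};

// Evolving state that only the snapshot can provide.
struct State
{
  double              time            = 0.0;
  unsigned int        timestep_number = 0;
  std::vector<double> solution;
};

// Write the state with enough digits for doubles to round-trip exactly.
void save_snapshot(const State &state, std::ostream &out)
{
  out << std::setprecision(std::numeric_limits<double>::max_digits10)
      << state.time << ' ' << state.timestep_number << ' '
      << state.solution.size();
  for (const double value : state.solution)
    out << ' ' << value;
  out << '\n';
}

// Read the state back. Parameters are deliberately not in the snapshot.
State load_snapshot(std::istream &in)
{
  State       state;
  std::size_t n = 0;
  in >> state.time >> state.timestep_number >> n;
  state.solution.resize(n);
  for (double &value : state.solution)
    in >> value;
  assert(!in.fail() && "snapshot is truncated or malformed");
  return state;
}
```

Because `Parameters` never enters the snapshot, a resumed run in this sketch would pick up an edited solver tolerance from the parameter file, while the time, step number, and solution come from the snapshot. A `std::stringstream` can stand in for the snapshot file when trying this out.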

--
Timo Heister
http://www.math.clemson.edu/~heister/

Denis Davydov

Dec 24, 2016, 4:10:28 PM
to dea...@googlegroups.com
Hi Timo,

> On 24 Dec 2016, at 22:00, Timo Heister <hei...@clemson.edu> wrote:
>
> Denis,
>
> we do a mixture in ASPECT:
> We have a "resume computation = true" setting in the prm and ASPECT
> will first load all settings from the ParameterHandler and then resume
> from the last snapshot by deserializing necessary information
> (including current time, mesh, etc.). Note that we do not serialize

So it looks like this is closer to strategy (1), in that if the user’s
parameter file has “calculate on time interval [0, T)” together with “do restart”,
then you would restart from some snapshot time t1 \in [0,T) and
continue the calculation according to the user’s settings (tolerances, timestep, etc.)?

> things that come directly from the ParameterHandler (like a solver
> tolerance). This means you can resume a computation and change some of
> your settings if you want to. We also allow changing the number of
> processors for parallel computations.

I thought serialization with solution transfer and p::d::Tria was bound to using the same number of
MPI processes. Good to know that it is not.

Regards,
Denis.

Timo Heister

Dec 26, 2016, 3:39:42 PM
to dea...@googlegroups.com
> So it looks like this is closer to strategy (1), in that if the user’s
> parameter file has “calculate on time interval [0, T)” together with “do restart”,
> then you would restart from some snapshot time t1 \in [0,T) and
> continue the calculation according to the user’s settings (tolerances, timestep, etc.)?

Yes, if you don't change anything else in your .prm, the computation is
bitwise identical to one that was never snapshotted and resumed. But you
can also change some settings if you want to (timestep size would be an
example).
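
The bitwise-identical claim can be illustrated on a toy problem. The sketch below is an assumption-laden stand-in, not ASPECT code: it uses explicit Euler for du/dt = -u, a hypothetical `RunState` type, and a text snapshot held in a `std::stringstream`.

```cpp
#include <cassert>
#include <iomanip>
#include <limits>
#include <sstream>

// Hypothetical toy model for illustration only.
struct RunState
{
  double       u    = 1.0;
  unsigned int step = 0;
};

// Explicit Euler for du/dt = -u: u_{n+1} = u_n * (1 - dt).
// Advances until `last_step` steps have been taken in total.
void advance(RunState &state, const unsigned int last_step, const double dt)
{
  const double factor = 1.0 - dt;
  for (; state.step < last_step; ++state.step)
    state.u *= factor;
}

// Run to `snapshot_step`, write a snapshot, read it back, and continue
// to step 10 with a step size that may differ from the one used before.
RunState run_with_resume(const unsigned int snapshot_step,
                         const double       dt_before,
                         const double       dt_after)
{
  RunState first;
  advance(first, snapshot_step, dt_before);

  // max_digits10 significant digits let a double survive the text round trip.
  std::stringstream snapshot;
  snapshot << std::setprecision(std::numeric_limits<double>::max_digits10)
           << first.u << ' ' << first.step;

  RunState resumed;
  snapshot >> resumed.u >> resumed.step;
  assert(!snapshot.fail());

  advance(resumed, 10, dt_after);
  return resumed;
}
```

Calling `run_with_resume(5, 0.1, 0.1)` should compare equal with `==` to an uninterrupted 10-step run with `dt = 0.1`, while passing a different `dt_after` mimics changing the timestep size in the .prm before resuming and gives a different result.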