Checkpoint/Restart


Jalil Khan

Aug 4, 2025, 8:22:21 AM
to deal.II User Group
Hi,

I am trying to use the restart/checkpointing feature for a problem and am facing the following issue.

Without using restart
1. The program runs fine on both a single core and multiple cores.

When I enable restart
1. The program runs fine on a single core: it picks up exactly where the checkpoint was written and produces the correct output.
2. With multiple cores it stalls indefinitely: it picks up exactly where the checkpoint was written, but never writes the next expected output.

Wolfgang Bangerth

Aug 4, 2025, 5:00:49 PM
to dea...@googlegroups.com
On 7/31/25 03:53, Jalil Khan wrote:
>
> *Without using restart*
> 1. The program runs fine on both a single core and multiple cores.
>
> *When I enable restart*
> 1. The program *runs fine* on a *single core*: it picks up exactly where
> the checkpoint was written and produces the correct output.
> 2. With multiple cores it stalls indefinitely: it picks up exactly where
> the checkpoint was written, but never writes the next expected output.

Jalil,
there's not much we can tell you. We don't know what your program does, nor
how the checkpoint was written. All we know is that your program does not work
as you expect.

I think you need to figure out more precisely where the program hangs and why.
This is probably best done using a debugger, or by adding print statements to
figure out in which line of the program things hang. Separately, I'm not clear
what single core/multiple cores actually means -- are you using MPI? Or in
which specific way do you select how many cores are running?
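For example, a small helper along these lines (purely illustrative, not from any particular program) prints the MPI rank plus a location marker and flushes immediately, so the last line each rank printed tells you where it got stuck:

    #include <mpi.h>
    #include <iostream>
    #include <string>

    // Print the calling rank and a location marker. std::endl flushes the
    // stream, so the message is not lost in a buffer when the program hangs.
    void trace(const MPI_Comm comm, const std::string &where)
    {
      int rank = 0;
      MPI_Comm_rank(comm, &rank);
      std::cerr << "[rank " << rank << "] " << where << std::endl;
    }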

Best
W.

Jalil Khan

Aug 5, 2025, 8:21:15 AM
to deal.II User Group
Dear W. Bangerth,

I apologize for describing the issue so briefly. Here is a detailed description.

I have a compressible Euler/NS solver built on deal.II that:
1. Runs cleanly in single-core or multi-core mode (no restart).
2. Can write both VTU and HDF5+XDMF outputs via
  • output_results_vtu() (per-rank VTU)
  • output_results_xdmf() (collective HDF5 + XDMF; a sketch of this path follows below)
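For context, output_results_xdmf() follows the usual deal.II DataOutFilter pattern, roughly like this sketch (file names are illustrative, not my exact code):

    // Sketch of the collective HDF5+XDMF output path. All ranks must call
    // the write/create functions below, since they are collective over
    // mpi_comm.
    DataOut<dim> data_out;
    data_out.attach_dof_handler(dof_handler);
    data_out.add_data_vector(solution, "solution");
    data_out.build_patches();

    DataOutBase::DataOutFilter data_filter(
      DataOutBase::DataOutFilterFlags(/*filter_duplicate_vertices=*/true,
                                      /*xdmf_hdf5_output=*/true));
    data_out.write_filtered_data(data_filter);
    data_out.write_hdf5_parallel(data_filter, "solution.h5", mpi_comm);

    XDMFEntry entry =
      data_out.create_xdmf_entry(data_filter, "solution.h5", time, mpi_comm);
    xdmf_entries.push_back(entry); // these entries are what serialize() archives
    data_out.write_xdmf_file(xdmf_entries, "solution.xdmf", mpi_comm);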
Checkpoint / serialization
  • Serialize metadata (time, step counters, XDMF entries):

        template <int dim>
        template <class Archive>
        void NS<dim>::serialize(Archive &ar, const unsigned int /*version*/)
        {
          // solution is handled via SolutionTransfer
          ar & time;
          ar & next_out_time;
          ar & output_file_number;
          ar & iter_restart;
          ar & xdmf_entries;
        }

  • Checkpoint routine (all ranks save the mesh; rank 0 writes "checkpoint_cgsem"):

        template <int dim>
        void NS<dim>::checkpoint()
        {
          // 1) apply constraints, fill ghost values
          PVector sol_copy = solution;
          constraints.distribute(sol_copy);
          sol_copy.update_ghost_values();

          // 2) tell SolutionTransfer to pack sol_copy
          parallel::distributed::SolutionTransfer<dim, PVector> st(dof_handler);
          st.prepare_for_serialization(sol_copy);

          // 3) all ranks write mesh pieces
          triangulation.save("tmp.checkpoint");

          // 4) rank 0 writes metadata
          if (Utilities::MPI::this_mpi_process(mpi_comm) == 0)
            {
              std::ofstream f("tmp.checkpoint_cgsem", std::ios::binary);
              boost::archive::binary_oarchive ar(f);
              serialize(ar, 0);
            }
          MPI_Barrier(mpi_comm);

          // 5) rename files to "checkpoint" and "checkpoint_cgsem"
          …
        }

  • Restart routine (all ranks load the mesh; repartition; reinit vectors; rank 0 reads metadata; broadcast; SolutionTransfer::deserialize):

        template <int dim>
        void NS<dim>::restart()
        {
          // 1) load triangulation + redistribute DoFs
          triangulation.load("checkpoint");
          dof_handler.distribute_dofs(fe);

          // 2) reinit all vectors, including ghosted_solution

          // 3) rebuild constraints

          // 4) rank 0 reads metadata
          if (Utilities::MPI::this_mpi_process(mpi_comm) == 0)
            {
              std::ifstream f("checkpoint_cgsem", std::ios::binary);
              boost::archive::binary_iarchive ar(f);
              serialize(ar, 0);
            }
          MPI_Barrier(mpi_comm);

          // 5) broadcast metadata to all ranks (see the sketch below)

          // 6) deserialize solution via SolutionTransfer
          parallel::distributed::SolutionTransfer<dim, PVector> st(dof_handler);
          st.deserialize(solution);
          constraints.distribute(solution);
          solution.update_ghost_values();

          assemble_mass_matrix();
        }
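For step 5, the broadcast looks roughly like this (a sketch assuming a recent deal.II that provides Utilities::MPI::broadcast, which returns the root's value on every rank; the member names match serialize() above):

    // Sketch: distribute the metadata that rank 0 read from disk to all
    // ranks, so every rank agrees on the time, counters, and XDMF entries.
    time               = Utilities::MPI::broadcast(mpi_comm, time, /*root_process=*/0);
    next_out_time      = Utilities::MPI::broadcast(mpi_comm, next_out_time, 0);
    output_file_number = Utilities::MPI::broadcast(mpi_comm, output_file_number, 0);
    iter_restart       = Utilities::MPI::broadcast(mpi_comm, iter_restart, 0);
    xdmf_entries       = Utilities::MPI::broadcast(mpi_comm, xdmf_entries, 0);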

Where exactly it hangs:
  • Fresh run: always works, on a single core (even with restart) or on multiple cores.
  • output_results_vtu() always works.
  • output_results_xdmf() works on 1–n cores, no deadlock.
Restart run on multiple ranks:
  • Everything up to and including restart() completes fine.
  • At the first call to output_results_xdmf(), the program hangs (never returns) inside data_out.add_data_vector(dof_handler, /* distributed vector */, postprocessor).
  • Restart + single rank: works.
  • Restart + multiple ranks: always hangs.
I have instrumented output_results_xdmf() with rank-ordered MPI_Barrier and cout statements: all ranks reach the call to data_out.add_data_vector(dof_handler, solution, postprocessor) in lockstep, but then never proceed.
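That check looked roughly like this (a sketch, not verbatim from my code):

    // Sketch of the rank-ordered progress check: each rank prints in turn,
    // separated by barriers, immediately before the hanging call.
    const unsigned int my_rank = Utilities::MPI::this_mpi_process(mpi_comm);
    const unsigned int n_ranks = Utilities::MPI::n_mpi_processes(mpi_comm);
    for (unsigned int r = 0; r < n_ranks; ++r)
      {
        if (my_rank == r)
          std::cout << "rank " << r << ": about to call add_data_vector"
                    << std::endl;
        MPI_Barrier(mpi_comm);
      }
    data_out.add_data_vector(dof_handler, solution, postprocessor); // hangs on >1 rank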

What I'm suspecting:
  • Ghost-index mismatch: after restart, the LinearAlgebra::distributed::Vector's internal Partitioner may not have the correct ghost indices, so the collective add_data_vector() deadlocks when trying to exchange data.
  • SolutionTransfer misuse: perhaps the sequence

        st.deserialize(solution);
        constraints.distribute(solution);
        solution.update_ghost_values();

    does not recreate exactly the same ghost layout that DataOut expects.
  • Mesh file name mismatch: when saving/loading mesh pieces, the restart load may pick up only one file (e.g. checkpoint.info), leaving some ranks with an empty mesh, so add_data_vector() stalls trying to gather geometric data.
I have the following queries:

1. Proper ghost-vector reconstitution after restart:
   What is the canonical deal.II way to restore a distributed vector and its ghost entries after loading a mesh and DoF handler from disk? Is it

        VectorType tmp(locally_owned, locally_relevant, mpi_comm);
        solution_transfer.deserialize(tmp);
        tmp.update_ghost_values();
        solution = tmp;
        solution.update_ghost_values();
        constraints.distribute(solution);

   or import(solution, VectorOperation::insert)? Which is correct?

2. DataOut::add_data_vector() deadlock:
   Has anyone encountered DataOut::add_data_vector() hanging on a restarted run? What synchronization or ghost-setup steps are required before calling it?

3. Mesh save/load conventions:
   I use triangulation.save("checkpoint") and triangulation.load("checkpoint"). Are there pitfalls in using the "tmp.checkpoint" naming that might leave some ranks without a file to load?

4. Boost serialization of distributed vectors:
   I rely on SolutionTransfer::prepare_for_serialization() and deserialize() to shuttle the solution through the mesh checkpoint. Are there examples of restart working robustly with HDF5+XDMF output that I could compare against? I have checked steps 83 and 69.

Any pointers or minimal reproductions would be hugely appreciated!

Thank you in advance for any insight.

Timo Heister

Aug 5, 2025, 9:37:48 AM
to deal.II User Group
Are you using a parallel::distributed::Triangulation? You might need the "mesh_reconstruction_after_repartitioning" flag enabled. Otherwise, the ordering of DoFs can be slightly different after restarting.
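Something along these lines at construction time (a sketch; check the exact signature against your deal.II version):

    // Sketch: enable the flag when constructing the triangulation, so the
    // cell ordering (and hence the DoF ordering) is reproducible after
    // save()/load().
    parallel::distributed::Triangulation<dim> triangulation(
      mpi_communicator,
      Triangulation<dim>::none,
      parallel::distributed::Triangulation<dim>::
        mesh_reconstruction_after_repartitioning);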

Jalil Khan

Aug 5, 2025, 11:33:29 PM
to dea...@googlegroups.com
Yes, I am using parallel::distributed::Triangulation, but currently with default settings; this particular flag is not enabled.


Jalil Khan

Aug 14, 2025, 7:12:07 AM
to dea...@googlegroups.com
Thanks, this solves the problem.
