Dear W. Bangerth,
I apologize for describing the issue so briefly earlier. Here is the detailed description.
I have a compressible‐Euler/NS solver built on deal.II that:
1. Runs cleanly in single‐core or multi‐core mode (no restart).
2. Can write both VTU and HDF5+XDMF outputs via
- output_results_vtu() (per‐rank VTU)
- output_results_xdmf() (collective HDF5 + XDMF)
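In condensed form, output_results_xdmf() follows the usual DataOut + DataOutFilter pattern (sketch only; file-name handling and the postprocessor details are simplified here):

template <int dim>
void NS<dim>::output_results_xdmf()
{
  DataOut<dim> data_out;
  data_out.add_data_vector(dof_handler, solution, postprocessor);
  data_out.build_patches();

  // filter duplicate vertices and prepare for HDF5/XDMF output
  DataOutBase::DataOutFilter data_filter(
    DataOutBase::DataOutFilterFlags(/*filter_duplicate_vertices=*/true,
                                    /*xdmf_hdf5_output=*/true));
  data_out.write_filtered_data(data_filter);

  // collective HDF5 write + XDMF bookkeeping
  const std::string h5_name =
    "solution_" + Utilities::int_to_string(output_file_number, 4) + ".h5";
  data_out.write_hdf5_parallel(data_filter, h5_name, mpi_comm);
  xdmf_entries.push_back(
    data_out.create_xdmf_entry(data_filter, h5_name, time, mpi_comm));
  data_out.write_xdmf_file(xdmf_entries, "solution.xdmf", mpi_comm);
}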
Checkpoint / serialization:
- Serialize metadata (time, step counters, XDMF entries):
template <int dim>
template <class Archive>
void NS<dim>::serialize(Archive &ar, const unsigned int /*version*/)
{
  // solution is handled via SolutionTransfer
  ar & time;
  ar & next_out_time;
  ar & output_file_number;
  ar & iter_restart;
  ar & xdmf_entries;
}
- Checkpoint routine (all ranks save mesh; rank 0 writes “checkpoint_cgsem”):
template <int dim>
void NS<dim>::checkpoint()
{
  // 1) apply constraints, ghost-value fill
  PVector sol_copy = solution;
  constraints.distribute(sol_copy);
  sol_copy.update_ghost_values();

  // 2) tell SolutionTransfer to pack sol_copy
  parallel::distributed::SolutionTransfer<dim, PVector> st(dof_handler);
  st.prepare_for_serialization(sol_copy);

  // 3) all ranks write mesh pieces
  triangulation.save("tmp.checkpoint");

  // 4) rank 0 writes metadata
  if (Utilities::MPI::this_mpi_process(mpi_comm) == 0)
    {
      std::ofstream f("tmp.checkpoint_cgsem", std::ios::binary);
      boost::archive::binary_oarchive ar(f);
      serialize(ar, 0);
    }
  MPI_Barrier(mpi_comm);

  // 5) rename files to “checkpoint” and “checkpoint_cgsem”
  …
}
- Restart routine (all ranks load mesh; repartition; reinit vectors; rank 0 reads metadata; broadcast; SolutionTransfer::deserialize):
template <int dim>
void NS<dim>::restart()
{
  // 1) load triangulation + redistribute DoFs
  triangulation.load("checkpoint");
  dof_handler.distribute_dofs(fe);

  // 2) reinit all vectors, including ghosted_solution
  // 3) rebuild constraints

  // 4) rank 0 reads metadata
  if (Utilities::MPI::this_mpi_process(mpi_comm) == 0)
    {
      std::ifstream f("checkpoint_cgsem", std::ios::binary);
      boost::archive::binary_iarchive ar(f);
      serialize(ar, 0);
    }
  MPI_Barrier(mpi_comm);

  // 5) broadcast metadata to all ranks

  // 6) deserialize solution via SolutionTransfer
  parallel::distributed::SolutionTransfer<dim, PVector> st(dof_handler);
  st.deserialize(solution);
  constraints.distribute(solution);
  solution.update_ghost_values();

  assemble_mass_matrix();
}
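For reference, steps 2) and 3) amount to the standard pattern below (a simplified sketch, not the literal code; the member names locally_owned_dofs / locally_relevant_dofs are shown for brevity, and the remaining vectors and boundary constraints are handled analogously):

// step 2): reinit vectors with owned + relevant index sets (sketch)
locally_owned_dofs = dof_handler.locally_owned_dofs();
DoFTools::extract_locally_relevant_dofs(dof_handler, locally_relevant_dofs);
solution.reinit(locally_owned_dofs, locally_relevant_dofs, mpi_comm);
ghosted_solution.reinit(locally_owned_dofs, locally_relevant_dofs, mpi_comm);

// step 3): rebuild constraints on the new partition (sketch)
constraints.clear();
constraints.reinit(locally_relevant_dofs);
DoFTools::make_hanging_node_constraints(dof_handler, constraints);
constraints.close();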
Where exactly it hangs:
- Fresh run: always works, single or multi core (a restart on a single core also works).
- output_results_vtu() always works.
- output_results_xdmf() works on 1–n cores, no deadlock.
Restart run on multiple ranks:
- Everything up to and including restart() completes fine.
- At the first call to output_results_xdmf(), the program hangs (never returns) inside data_out.add_data_vector(dof_handler, /* distributed vector */, postprocessor);
- Restart + single‐rank: works.
- Restart + multi‐rank: always hangs.
I have instrumented output_results_xdmf() with rank-ordered MPI_Barrier and cout statements. All ranks reach the call to
data_out.add_data_vector(dof_handler, solution, postprocessor) in lockstep, but then never proceed.
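The instrumentation is nothing more elaborate than this (sketch):

// rank-ordered progress print just before the suspect call (sketch)
const unsigned int my_rank = Utilities::MPI::this_mpi_process(mpi_comm);
const unsigned int n_ranks = Utilities::MPI::n_mpi_processes(mpi_comm);
for (unsigned int r = 0; r < n_ranks; ++r)
  {
    if (r == my_rank)
      std::cout << "rank " << my_rank << ": before add_data_vector" << std::endl;
    MPI_Barrier(mpi_comm);
  }
data_out.add_data_vector(dof_handler, solution, postprocessor); // hangs here on restart with >1 rank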
What I’m suspecting:
- Ghost-index mismatch: after restart, the LinearAlgebra::distributed::Vector’s internal Partitioner may not have the correct ghost indices, so the collective add_data_vector() deadlocks when trying to exchange data (a sanity check for this is sketched after this list).
- SolutionTransfer misuse: Perhaps the sequence of
st.deserialize(solution);
constraints.distribute(solution);
solution.update_ghost_values();
does not recreate exactly the same ghost layout that DataOut expects.
- Mesh file name mismatch: When saving/loading mesh pieces, the restart load may pick up only one file (e.g. checkpoint.info), leaving some ranks with an empty mesh, so add_data_vector() stalls trying to gather geometric data.
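To test the first suspicion, the check I have in mind is to compare the vector’s partitioner against the DoF index sets right before the output call, along these lines (sketch):

// sanity check of the ghost layout after restart (sketch)
const IndexSet owned = dof_handler.locally_owned_dofs();
IndexSet relevant;
DoFTools::extract_locally_relevant_dofs(dof_handler, relevant);
const auto &partitioner = *solution.get_partitioner();
AssertThrow(partitioner.locally_owned_range() == owned,
            ExcMessage("owned range does not match the DoFHandler"));
// in my setup the ghost set is expected to be the locally relevant DoFs
AssertThrow(partitioner.ghost_indices() == relevant,
            ExcMessage("ghost indices do not match the locally relevant DoFs"));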
I have the following queries:
1. Proper ghost-vector reconstitution after restart:
What is the canonical deal.II way to restore a distributed vector and its ghost entries after loading the mesh and DoFHandler from disk? Is it
VectorType tmp(locally_owned, locally_relevant, mpi_comm);
solution_transfer.deserialize(tmp);
tmp.update_ghost_values();
solution = tmp;
solution.update_ghost_values();
constraints.distribute(solution);
or should I instead go through import(solution, VectorOperation::insert) (see the sketch below)? Which is correct?
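By the import() variant I mean going through a ReadWriteVector, roughly like this (sketch; rwv is just an illustrative name, and in newer deal.II versions the call may be spelled import_elements()):

LinearAlgebra::ReadWriteVector<double> rwv(locally_owned);
rwv.import(tmp, VectorOperation::insert);      // tmp: the vector filled by deserialize()
solution.import(rwv, VectorOperation::insert); // solution: the ghosted vector
solution.update_ghost_values();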
2. DataOut::add_data_vector deadlock:
Has anyone encountered DataOut::add_data_vector() hanging on a
restarted run? What synchronization or ghost‐setup steps are required
before calling it?
3. Mesh save/load conventions:
I call triangulation.save("tmp.checkpoint"), rename the resulting files to “checkpoint”, and later call triangulation.load("checkpoint"). Are there pitfalls in this “tmp.checkpoint” naming scheme that might leave some ranks without a file to load?
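To make this concrete, the rename I have in mind looks roughly like this (sketch; the suffix list is my assumption and is exactly what I am unsure about, i.e. whether it covers everything Triangulation::save() actually writes):

// rank-0 rename of the temporary checkpoint files (sketch; suffix list assumed)
// (needs <cstdio> for std::rename)
if (Utilities::MPI::this_mpi_process(mpi_comm) == 0)
  for (const std::string suffix : {"", ".info", "_fixed.data", "_cgsem"})
    std::rename(("tmp.checkpoint" + suffix).c_str(),
                ("checkpoint" + suffix).c_str());
MPI_Barrier(mpi_comm);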
4. Boost serialization of distributed vectors:
I rely on SolutionTransfer::prepare_for_serialization() and
deserialize() to shuttle the solution through the mesh checkpoint. Are
there examples of restart working robustly with HDF5+XDMF output that I
could compare against? I have already looked at step-83 and step-69.
Any pointers or minimal reproductions would be hugely appreciated!
Thank you in advance for any insight.