Hi,
I have built deal.II on the Nvidia Jetson Nano cluster from PicoCluster.
It has 5 nodes (pc[0-4]); I am using pc0 as the head/login node to launch the application.
The executable is on an NFS filesystem that is shared between all the nodes.
The following works:
mpirun --host pc1 --mca btl_tcp_if_include 192.168.0.0/24 --mca btl tcp,self /nfs/systems/dealii/head-bost_1_70_0/examples/step-69/step-69.release
However, attempting to run it on more than one host fails:
mpirun --host pc1,pc2 --mca btl_tcp_if_include 192.168.0.0/24 --mca btl tcp,self /nfs/systems/dealii/head-bost_1_70_0/examples/step-69/step-69.release
It fails consistently when writing the checkpoint file(s).
Are there special flags I need to set for some form of parallel I/O that may be happening?
%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%
####################################################
######### #########
######### Cycle 000040 (0.5%) #########
######### at time t = 0.01975410 #########
######### #########
####################################################
####################################################
######### #########
######### checkpoint computation #########
######### #########
######### #########
####################################################
[pc1:07535] mca_sharedfp_individual_file_open: Error during datafile file open
[pc2:07602] mca_sharedfp_individual_file_open: Error during datafile file open
----------------------------------------------------
Exception on processing:
---------------------------------------------------------
TimerOutput objects finalize timed values printed to the
screen by communicating over MPI in their destructors.
Since an exception is currently uncaught, this
synchronization (and subsequent output) will be skipped
to avoid a possible deadlock.
---------------------------------------------------------
----------------------------------------------------
Exception on processing:
--------------------------------------------------------
An error occurred in line <1412> of file </home/picocluster/projects/dealii/dealii/dealii_git/source/distributed/tria_base.cc> in function
void dealii::parallel::DistributedTriangulationBase<dim, spacedim>::DataTransfer::save(unsigned int, unsigned int, const string&) const [with int dim = 2; int spacedim = 2; std::__cxx11::string = std::__cxx11::basic_string<char>]
The violated condition was:
ierr == MPI_SUCCESS
Additional information:
deal.II encountered an error while calling an MPI function.
The description of the error provided by MPI is "MPI_ERR_FILE: invalid
file".
The numerical value of the original error code is 30.
Stacktrace:
-----------
#0 /nfs/systems/dealii/head-bost_1_70_0/lib/libdeal_II.so.10.0.0-pre: dealii::parallel::DistributedTriangulationBase<2, 2>::DataTransfer::save(unsigned int, unsigned int, std::__cxx11::basic_string<char, std::char_traits<char>, std::allocator<char> > const&) const
#1 /nfs/systems/dealii/head-bost_1_70_0/lib/libdeal_II.so.10.0.0-pre: dealii::parallel::DistributedTriangulationBase<2, 2>::save_attached_data(unsigned int, unsigned int, std::__cxx11::basic_string<char, std::char_traits<char>, std::allocator<char> > const&) const
#2 /nfs/systems/dealii/head-bost_1_70_0/lib/libdeal_II.so.10.0.0-pre: dealii::parallel::distributed::Triangulation<2, 2>::save(std::__cxx11::basic_string<char, std::char_traits<char>, std::allocator<char> > const&) const
#3 /nfs/systems/dealii/head-bost_1_70_0/examples/step-69/step-69.release: Step69::MainLoop<2>::checkpoint(std::array<dealii::LinearAlgebra::distributed::Vector<double, dealii::MemorySpace::Host>, 4ul> const&, std::__cxx11::basic_string<char, std::char_traits<char>, std::allocator<char> > const&, double, unsigned int)
#4 /nfs/systems/dealii/head-bost_1_70_0/examples/step-69/step-69.release: Step69::MainLoop<2>::run()
#5 /nfs/systems/dealii/head-bost_1_70_0/examples/step-69/step-69.release: main
--------------------------------------------------------
Aborting!
----------------------------------------------------
--------------------------------------------------------
An error occurred in line <1412> of file </home/picocluster/projects/dealii/dealii/dealii_git/source/distributed/tria_base.cc> in function
void dealii::parallel::DistributedTriangulationBase<dim, spacedim>::DataTransfer::save(unsigned int, unsigned int, const string&) const [with int dim = 2; int spacedim = 2; std::__cxx11::string = std::__cxx11::basic_string<char>]
The violated condition was:
ierr == MPI_SUCCESS
Additional information:
deal.II encountered an error while calling an MPI function.
The description of the error provided by MPI is "MPI_ERR_FILE: invalid
file".
The numerical value of the original error code is 30.
Stacktrace:
-----------
#0 /nfs/systems/dealii/head-bost_1_70_0/lib/libdeal_II.so.10.0.0-pre: dealii::parallel::DistributedTriangulationBase<2, 2>::DataTransfer::save(unsigned int, unsigned int, std::__cxx11::basic_string<char, std::char_traits<char>, std::allocator<char> > const&) const
#1 /nfs/systems/dealii/head-bost_1_70_0/lib/libdeal_II.so.10.0.0-pre: dealii::parallel::DistributedTriangulationBase<2, 2>::save_attached_data(unsigned int, unsigned int, std::__cxx11::basic_string<char, std::char_traits<char>, std::allocator<char> > const&) const
#2 /nfs/systems/dealii/head-bost_1_70_0/lib/libdeal_II.so.10.0.0-pre: dealii::parallel::distributed::Triangulation<2, 2>::save(std::__cxx11::basic_string<char, std::char_traits<char>, std::allocator<char> > const&) const
#3 /nfs/systems/dealii/head-bost_1_70_0/examples/step-69/step-69.release: Step69::MainLoop<2>::checkpoint(std::array<dealii::LinearAlgebra::distributed::Vector<double, dealii::MemorySpace::Host>, 4ul> const&, std::__cxx11::basic_string<char, std::char_traits<char>, std::allocator<char> > const&, double, unsigned int)
#4 /nfs/systems/dealii/head-bost_1_70_0/examples/step-69/step-69.release: Step69::MainLoop<2>::run()
#5 /nfs/systems/dealii/head-bost_1_70_0/examples/step-69/step-69.release: main
--------------------------------------------------------
Aborting!
----------------------------------------------------
-------------------------------------------------------
Primary job terminated normally, but 1 process returned
a non-zero exit code.. Per user-direction, the job has been aborted.
-------------------------------------------------------
--------------------------------------------------------------------------
mpirun detected that one or more processes exited with non-zero status, thus causing
the job to be terminated. The first process to do so was:
Process name: [[8910,1],1]
Exit code: 1
--------------------------------------------------------------------------