Hi everyone,
We installed dea...@9.2.0 on our HPC cluster (CentOS 7) using Spack and the Intel compilers (dea...@9.2.0%in...@19.0.4~assimp~petsc~slepc~ginkgo~adol-c+mpi^intel-mpi^intel-mkl^boost).
Our code, which uses HDF5 for output, runs fine both on the front node and when submitted via the batch script, as long as it stays on a single node (up to 40 cores).
As soon as we use more than one node (e.g. 41 cores), the code fails.
We were able to reproduce the problem with an adapted version of step-40 of the deal.II tutorials that writes its output using HDF5 (see attached step-40).
We bypassed the restriction of the output to 32 MPI processes by raising the limit to 42.
The code can overwrite existing files (created during a previous run with 40 processes or fewer on 1 node), but it crashes when new files have to be created, with the following HDF5-related error message:
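For reference, the change amounts to raising the threshold in the guard around the output call. The following is only a minimal standalone sketch of that idea: the names `output_enabled` and `max_output_processes` are illustrative and not taken from step-40, where the process count would come from `Utilities::MPI::n_mpi_processes`.

```cpp
// Hypothetical stand-in for an "only write output up to N processes" guard.
// The value 42 is the raised limit described above; the names are assumptions.
constexpr unsigned int max_output_processes = 42;

// Returns true if output should be written for the given number of MPI processes.
constexpr bool output_enabled(const unsigned int n_mpi_processes,
                              const unsigned int limit = max_output_processes)
{
  return n_mpi_processes <= limit;
}
```

With this guard, a run on 41 processes writes output with the raised limit, whereas it would be skipped with a limit of 32.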
...
HDF5-DIAG: Error detected in HDF5 (1.8.21) MPI-process 41:
#000: H5F.c line 520 in H5Fcreate(): unable to create file
major: File accessibilty
minor: Unable to open file
#001: H5Fint.c line 990 in H5F_open(): unable to open file: time = Mon Apr 12 11:25:30 2021
, name = 'Solution_0.h5', tent_flags = 13
major: File accessibilty
minor: Unable to open file
#002: H5FD.c line 991 in H5FD_open(): open failed
major: Virtual File Layer
minor: Unable to initialize object
#003: H5FDmpio.c line 1057 in H5FD_mpio_open(): MPI_File_open failed
major: Internal error (too specific to document in detail)
minor: Some MPI function failed
#004: H5FDmpio.c line 1057 in H5FD_mpio_open(): File does not exist, error stack:
ADIOI_UFS_OPEN(39): File Solution_0.h5 does not exist
major: Internal error (too specific to document in detail)
minor: MPI Error String
...
The file mentioned in the error message is still created, but remains empty (file size 0).
See testRunMPI.e1448739 for the full error message.
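One check that might help narrow this down is whether the output directory is writable from every node involved. The sketch below uses only C++17 `std::filesystem` and would have to be run once on each node, for example through the batch system. The function name `directory_is_writable` and the probe file name are assumptions for illustration. Note that it only tests plain POSIX-level file access, not MPI-IO, so it may well pass even though `MPI_File_open` fails.

```cpp
#include <filesystem>
#include <fstream>
#include <string>
#include <system_error>

// Returns true if a small probe file can be created, seen, and removed in `dir`.
// `tag` (hypothetical) should differ per node or process to avoid name clashes.
bool directory_is_writable(const std::filesystem::path &dir,
                           const std::string           &tag = "0")
{
  namespace fs = std::filesystem;
  std::error_code ec;

  if (!fs::is_directory(dir, ec))
    return false;

  const fs::path probe = dir / ("write_probe_" + tag + ".tmp");
  {
    std::ofstream out(probe);
    if (!out)
      return false;
    out << "probe\n";
  } // the stream is closed here, so the contents are flushed

  // The probe must exist and be non-empty, unlike the 0-byte file we observe.
  const bool visible = fs::exists(probe, ec) && fs::file_size(probe, ec) > 0;
  fs::remove(probe, ec);
  return visible;
}
```

If this returns `false` on the second node only, the problem would point towards the file system setup rather than towards HDF5 or deal.II.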
We tried two different HDF5 versions (1.10.7 and 1.8.21).
On a related note: the tutorial description mentions that a limit of 16 processors was chosen because such large examples are difficult to visualise.
Is there a general rule of thumb for the number of DoFs above which graphical output becomes infeasible?
Any help is much appreciated.
Christian