HDF5 output on multiple nodes of an HPC cluster


Christian Burkhardt

Apr 12, 2021, 12:15:09 PM4/12/21
to dea...@googlegroups.com
Hi everyone,
 
We installed dea...@9.2.0 on our HPC cluster (CentOS 7) using Spack and the Intel compilers (dea...@9.2.0%in...@19.0.4~assimp~petsc~slepc~ginkgo~adol-c+mpi^intel-mpi^intel-mkl^boost).
When running our code, which uses HDF5 for output, on the front node, as well as when submitting it via the batch script, everything works fine as long as we run on a single node (up to 40 cores).
As soon as we increase the node count above 1 (e.g. 41 cores), the code fails.
We were able to reproduce the problem with an adapted version of step-40 of the deal.II tutorials that writes its output via HDF5 (see the attached step-40.cc).
The restriction to 32 MPI processes for the output was bypassed by raising the limit to 42.
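For reference, the HDF5 output in the adapted step-40 follows the usual deal.II DataOutFilter / write_hdf5_parallel pattern, roughly as sketched below (variable names such as data_out and mpi_communicator are the ones used in step-40; this is a sketch, not a verbatim excerpt of the attached file):

  // Sketch of the parallel HDF5/XDMF output path used in the adapted step-40.
  // Assumes `data_out` is a DataOut<dim> already holding the solution and
  // `mpi_communicator` is the communicator of the parallel run.
  DataOutBase::DataOutFilterFlags flags(/*filter_duplicate_vertices=*/true,
                                        /*xdmf_hdf5_output=*/true);
  DataOutBase::DataOutFilter      data_filter(flags);

  // Filter the output data, then let all processes write one HDF5 file via MPI I/O.
  data_out.write_filtered_data(data_filter);
  data_out.write_hdf5_parallel(data_filter, "Solution_0.h5", mpi_communicator);

  // Optional XDMF wrapper so the HDF5 file can be opened in VisIt/ParaView.
  std::vector<XDMFEntry> xdmf_entries;
  xdmf_entries.push_back(data_out.create_xdmf_entry(
    data_filter, "Solution_0.h5", /*time=*/0.0, mpi_communicator));
  data_out.write_xdmf_file(xdmf_entries, "Solution_0.xdmf", mpi_communicator);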
The code can overwrite existing files (created during a previous run with 40 processes or fewer on a single node), but it crashes when new files have to be created, with the following error message related to the HDF5 output:
 
...
HDF5-DIAG: Error detected in HDF5 (1.8.21) MPI-process 41:
  #000: H5F.c line 520 in H5Fcreate(): unable to create file
    major: File accessibilty
    minor: Unable to open file
  #001: H5Fint.c line 990 in H5F_open(): unable to open file: time = Mon Apr 12 11:25:30 2021
, name = 'Solution_0.h5', tent_flags = 13
    major: File accessibilty
    minor: Unable to open file
  #002: H5FD.c line 991 in H5FD_open(): open failed
    major: Virtual File Layer
    minor: Unable to initialize object
  #003: H5FDmpio.c line 1057 in H5FD_mpio_open(): MPI_File_open failed
    major: Internal error (too specific to document in detail)
    minor: Some MPI function failed
  #004: H5FDmpio.c line 1057 in H5FD_mpio_open(): File does not exist, error stack:
ADIOI_UFS_OPEN(39): File Solution_0.h5 does not exist
    major: Internal error (too specific to document in detail)
    minor: MPI Error String
...
 
The file mentioned in the error message is still created, but it remains empty (file size 0); see the attached testRunMPI.e1448739 for the full error message.
We tried different HDF5 versions (1.10.7, 1.8.21).
We also looked into https://github.com/choderalab/yank/issues/1165 and tried setting the environment variable HDF5_USE_FILE_LOCKING=FALSE, which did not alter the outcome.
Since our configuration includes MPI (WITH_MPI=ON), the issue https://github.com/dealii/dealii/issues/605 is about something different, right?
 
The tutorial description mentions that the limit of 16 processors was chosen because such large examples are difficult to visualize.
Is there a general rule of thumb for the number of DoFs above which graphical output becomes infeasible?
 
Any help is much appreciated.
 
Christian
step-40.cc
CMakeLists.txt
testRunMPI.e1448739

Timo Heister

Apr 12, 2021, 12:37:32 PM4/12/21
to dea...@googlegroups.com
Christian,

What kind of file system is this file written to? Does our parallel VTU output, which also uses MPI I/O, work correctly (step-40 with grouping set to a single file, as in the sketch below)?
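Roughly something like this in step-40's output_results() (a sketch, using the step-40 variable names data_out, cycle, and mpi_communicator; with n_groups set to 1, all ranks write one shared .vtu file through MPI I/O):

  // Same call as in step-40, but with the last argument (n_groups) set to 1,
  // so that every MPI rank writes into a single shared .vtu file via MPI I/O.
  data_out.write_vtu_with_pvtu_record(
    "./", "solution", cycle, mpi_communicator,
    /*n_digits_for_counter=*/2, /*n_groups=*/1);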



--
Timo Heister
http://www.math.clemson.edu/~heister/

Christian Burkhardt

Apr 13, 2021, 4:03:23 AM4/13/21
to deal.II User Group
Thanks for your quick answer.
The problem persisted with the parallel VTU output.
Changing the file system to which the output is written to an XFS file system solved the problem. Sorry for the inconvenience, but we are quite new to cluster systems.

Thanks,
Christian

Timo Heister

Apr 13, 2021, 11:50:42 AM4/13/21
to dea...@googlegroups.com
Great to hear. Was it a standard NFS filesystem that produced the failures?


Christian Burkhardt

Apr 14, 2021, 9:17:26 AM4/14/21
to deal.II User Group
Yes, according to 'df -Th', the file system on which the error occurred is of type 'nfs'.

Thanks again.

