Hello all,
Since the PDC/Dardel update I have been getting MPICH errors when sending jobs to the Dardel cluster.
The Slurm output file contains several errors similar to the following, each for a different rank:
MPICH ERROR [Rank 199] [job id 4137700.0] [Thu Jun 13 18:30:54 2024] [nid001905] - Abort(201924615) (rank 199 in comm 0): Fatal error in PMPI_Bcast: Invalid root, error stack:
PMPI_Bcast(446): MPI_Bcast(buf=0x7ffe85e28e40, count=7, MPI_DOUBLE_PRECISION, root=721, comm=comm=0x84000003) failed
PMPI_Bcast(406): Invalid root (value given was 721)
The last message states:
'env MODULE_PREFIX=__ MODULE_INFIX=_MOD_ MODULE_SUFFIX= /usr/bin/time -p
srun -n 512 ./src/run.x' failed:
Aborting.
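One thing that can be read off the log itself: MPI requires the root of a broadcast to be a valid rank in the communicator (between 0 and its size minus one), and the reported root=721 is larger than the 512 ranks started with "srun -n 512". Below is a minimal Python sketch for checking a Slurm output file for such out-of-range roots. The function name find_invalid_roots is hypothetical and not part of Pencil Code or MPICH, and the check assumes that no communicator in the job is larger than the number of launched ranks.

```python
import re


def find_invalid_roots(log_text):
    """Return (nranks, sorted MPI_Bcast root values that are >= nranks).

    Hypothetical helper for illustration only. It assumes the log contains
    an 'srun -n <N>' line and MPICH-style 'MPI_Bcast(... root=<R> ...)' lines.
    """
    match = re.search(r"srun\s+-n\s+(\d+)", log_text)
    if match is None:
        raise ValueError("no 'srun -n <N>' line found in the log")
    nranks = int(match.group(1))

    # Collect every distinct root reported on an MPI_Bcast error line.
    roots = {int(r) for r in re.findall(r"MPI_Bcast\(.*?root=(\d+)", log_text)}

    # Assumption: a valid root must be smaller than the launched rank count.
    return nranks, sorted(r for r in roots if r >= nranks)


# Excerpt of the error output quoted in this message.
sample = """\
PMPI_Bcast(446): MPI_Bcast(buf=0x7ffe85e28e40, count=7, MPI_DOUBLE_PRECISION, root=721, comm=comm=0x84000003) failed
PMPI_Bcast(406): Invalid root (value given was 721)
srun -n 512 ./src/run.x' failed:
"""

nranks, bad_roots = find_invalid_roots(sample)
print(f"ranks launched: {nranks}, out-of-range roots: {bad_roots}")
```

For a real run, the same function could be applied to the full contents of the Slurm output file instead of the excerpt.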
I talked with PDC support, who told me the following:
Pencil users got similar error previously due to installation/settings. Did you
recompile it after the upgrade of Dardel ?
https://www.pdc.kth.se/about/pdc-news/dardel-status-after-the-upgrade-to-a-new-software-stack-1.1343510
So I created a new run directory, /cfs/klemming/projects/snic/snic2020-4-12/ptengner/pencil-code/runs/simulation_run,
based on my previous one, "runs/initial". I compiled the code in the new directory and submitted a "pc_run start" job, which ran correctly.
However, when I submit a job with "pc_run run", it crashes with the errors shown above.
Please see the attached screenshots of the error messages.
Have you seen this problem before, and if so, how do we fix it?
Regards
/Patrik Tengnér