Pencil Code MPICH Error

Patrik Tengnér

Jun 26, 2024, 11:44:14 AM
to pencil-co...@googlegroups.com
Hello all,

Since the PDC/Dardel update, I have been getting MPICH errors when submitting jobs to the Dardel cluster.
The Slurm output file contains several error lines similar to the following:

MPICH ERROR [Rank 199] [job id 4137700.0] [Thu Jun 13 18:30:54 2024] [nid001905] - Abort(201924615) (rank 199 in comm 0): Fatal error in PMPI_Bcast: Invalid root, error stack: PMPI_Bcast(446): MPI_Bcast(buf=0x7ffe85e28e40, count=7, MPI_DOUBLE_PRECISION, root=721, comm=comm=0x84000003) failed
PMPI_Bcast(406): Invalid root (value given was 721)
The same error is reported for different ranks.

The last message states:
'env MODULE_PREFIX=__ MODULE_INFIX=_MOD_ MODULE_SUFFIX= /usr/bin/time -p srun -n 512 ./src/run.x' failed:                    
Aborting.
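
As a sanity check on the numbers above: MPI requires the root of a broadcast to be a rank inside the communicator, i.e. between 0 and size-1. Below is a minimal Python sketch that pulls the root from the error line and the rank count from the srun command and compares them. The helper names (parse_root, parse_nprocs, root_is_valid) are made up for illustration and are not part of Pencil Code or MPICH. It also assumes the communicator cannot hold more ranks than srun launched; the communicator in the error may in fact be a smaller sub-communicator, so 512 is only an upper bound.

```python
import re

# Error line copied from the Slurm output quoted above.
ERROR_LINE = (
    "MPICH ERROR [Rank 199] [job id 4137700.0] [Thu Jun 13 18:30:54 2024] "
    "[nid001905] - Abort(201924615) (rank 199 in comm 0): Fatal error in "
    "PMPI_Bcast: Invalid root, error stack: PMPI_Bcast(446): "
    "MPI_Bcast(buf=0x7ffe85e28e40, count=7, MPI_DOUBLE_PRECISION, root=721, "
    "comm=comm=0x84000003) failed"
)

# Launch command copied from the last message quoted above.
LAUNCH_LINE = "/usr/bin/time -p srun -n 512 ./src/run.x"


def parse_root(line):
    """Return the root rank reported in an MPI_Bcast error line (hypothetical helper)."""
    match = re.search(r"\broot=(\d+)", line)
    if match is None:
        raise ValueError("no 'root=' field found")
    return int(match.group(1))


def parse_nprocs(line):
    """Return the task count given to 'srun -n' (hypothetical helper)."""
    match = re.search(r"\bsrun\b.*?-n\s+(\d+)", line)
    if match is None:
        raise ValueError("no 'srun -n <N>' found")
    return int(match.group(1))


def root_is_valid(root, nprocs):
    """A broadcast root must satisfy 0 <= root < communicator size.

    Assumption: nprocs (ranks launched) is used as an upper bound on the
    communicator size.
    """
    return 0 <= root < nprocs


root = parse_root(ERROR_LINE)
nprocs = parse_nprocs(LAUNCH_LINE)
print(f"root={root}, ranks launched={nprocs}, valid={root_is_valid(root, nprocs)}")
# prints: root=721, ranks launched=512, valid=False
```

So the root of 721 is out of range for a 512-rank job, which matches the "Invalid root" message. The sketch only confirms the mismatch; it does not say where the value 721 comes from.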

I talked with PDC support, who told me the following:
Pencil users got similar error previously due to installation/settings. Did you
recompile it after the upgrade of Dardel ?

https://www.pdc.kth.se/about/pdc-news/dardel-status-after-the-upgrade-to-a-new-software-stack-1.1343510

So I created a new run directory, /cfs/klemming/projects/snic/snic2020-4-12/ptengner/pencil-code/runs/simulation_run,
based on my previous one, "runs/initial", compiled in the new directory, and submitted a "pc_run start" job, which ran correctly.
However, when I submit a job with "pc_run run", it crashes with the errors shown above.
Please see the attached screenshots of the error messages.

Have you seen this problem before, and if so, how can it be fixed?

Regards
/Patrik Tengnér
   

MPICH errors.png
MPICH errors last lines.png