Code execution stuck during computation across nodes

Vincent

Sep 30, 2025, 1:15:42 AM
to Dedalus Users
Hi all,

I am trying to run Dedalus on a cluster. The code directory is mounted on all compute nodes from a control node via NFS, and my idea was that all the results would be written into the same directory.

At first it didn't work even when running on a single compute node: the job hung after printing the log output for the first time step. After reading this post, I added

export HDF5_USE_FILE_LOCKING='FALSE'

to the .slurm script, and it now runs fine on a single compute node. But the job still hangs after printing the log output for the first time step when running across multiple nodes.
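
In case the shell environment is not reaching every rank through srun, here is a sketch of the equivalent setting inside the script itself (my assumption being that setting os.environ before any HDF5 import behaves the same as exporting the variable in the shell):

# Hypothetical alternative: disable HDF5 file locking from inside rbc3d.py,
# before h5py/Dedalus are imported, so every MPI rank is guaranteed to see it.
import os
os.environ['HDF5_USE_FILE_LOCKING'] = 'FALSE'

import dedalus.public as d3  # import only after the variable is set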

This is my test.slurm:

#!/bin/bash
#SBATCH -J dedalus_test_128_128
#SBATCH -N 2
#SBATCH -n 256
#SBATCH --ntasks-per-socket=64
#SBATCH --cpus-per-task=1
#SBATCH -o dedalus_test_128_128.o
#SBATCH -e dedalus_test_128_128.e

# activate env conda
eval "$(/home/user/miniforge3/bin/conda shell.bash hook)"

# activate env dedalus3
conda activate dedalus3

export HDF5_USE_FILE_LOCKING='FALSE'

srun python3 rbc3d.py
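
In case it helps with debugging, a minimal mpi4py check (hypothetical script name mpi_check.py; Dedalus already depends on mpi4py, so it is in the dedalus3 environment) could isolate whether plain MPI communication across the two nodes works at all:

# mpi_check.py -- minimal cross-node MPI sanity check
from mpi4py import MPI

comm = MPI.COMM_WORLD
rank = comm.Get_rank()
size = comm.Get_size()
name = MPI.Get_processor_name()

# A collective call: if the MPI launch or interconnect is broken across
# nodes, the job will hang here rather than completing.
total = comm.allreduce(rank, op=MPI.SUM)

print(f"rank {rank} of {size} on {name}", flush=True)
if rank == 0:
    print(f"allreduce sum = {total} (expected {size*(size-1)//2})")

Launched the same way (srun python3 mpi_check.py): if this also hangs across two nodes, the problem would be in the MPI/srun setup rather than in Dedalus or HDF5.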

————————————————————————
Any help would be appreciated. 

Thanks in advance
Vincent