Simulation hanging up when using file handler in parallel

Ryan Kelly

Jan 31, 2025, 1:24:46 PM
to Dedalus Users
Hi all,

I'm trying to run some large-scale fluids simulations (a turbulent channel with polymer), and I'm attaching a sample Dedalus v3 script for reference.

While the code seems to be running fine and calculating the correct (expected) values, I've been running into a strange issue where the code hangs up (freezes with no new iterations, data writing, or error messages) when I include file-writing via the built-in file handlers.

For some hardware context, I'm running on TACC Sapphire Rapids (SPR) nodes (112 cores/node, 128 GB RAM/node).

Here are some observations from my testing:
  • When I run the attached code without the analysis file handler, I can run up to 512 MPI tasks on 8 SPR nodes.
    • This works at every mesh size I've tried.
  • When I turn on the analysis file handler with 512 MPI tasks, the code hangs on the first cadence iteration.
    • This happens even when I don't add any tasks to the file handler.
    • To get it to run successfully, I have to lower the number of MPI tasks to 64, which obviously slows the code down substantially.
  • At a lower resolution with, say, 128 MPI tasks, the code runs fine with no mesh specified, but it hangs when I specify a mesh of shape (16, 8). The sketch below shows roughly how the mesh enters my setup.
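
For reference, the mesh and the file handler enter the script roughly like this (a minimal sketch, not the full attached test.py; the resolutions, bounds, and the 'snapshots' name here are placeholders):

    import numpy as np
    import dedalus.public as d3

    # 2D process mesh: 16 x 8 = 128 MPI tasks
    coords = d3.CartesianCoordinates('x', 'y', 'z')
    dist = d3.Distributor(coords, dtype=np.float64, mesh=(16, 8))
    xbasis = d3.RealFourier(coords['x'], size=256, bounds=(0, 4*np.pi), dealias=3/2)
    ybasis = d3.RealFourier(coords['y'], size=128, bounds=(0, 2*np.pi), dealias=3/2)
    zbasis = d3.ChebyshevT(coords['z'], size=128, bounds=(-1, 1), dealias=3/2)

    # ... problem and timestepping solver setup elided ...

    # The analysis handler whose presence triggers the hang:
    snapshots = solver.evaluator.add_file_handler('snapshots', sim_dt=0.1, max_writes=50)
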
I'm not sure if this is a bug or known issue, but I would like to know if there is a way around this.

Thanks,
Ryan
test.py

Keaton Burns

Feb 1, 2025, 1:08:34 PM
to dedalu...@googlegroups.com
Hi Ryan,

The fact that it hangs without any tasks is interesting — I think that means it's hanging on the parallel folder/file creation. Two things you can try:

  • Try passing the “parallel” keyword to the FileHandler when you create it with “add_file_handler”. There are several options: “virtual” for a virtual HDF5 merge (the default), “mpio” for parallel HDF5 writing to a single file (in theory the best, but it requires parallel HDF5 built against the right MPI, etc.), or “gather”, which gathers all the data and writes from a single core. Gather could be a little slow at extreme scale, but it is probably the simplest option (see the sketch after this list).

  • Try setting the environment variable below in your slurm or setup script to avoid some possible cache/file-locking issues with the parallel file system:

        export HDF5_USE_FILE_LOCKING='FALSE'
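
Concretely, something like the sketch below (the handler name and cadence are placeholders, and “solver” stands in for your IVP solver from test.py):

    import os
    # Set this before HDF5/h5py is first loaded, or export it in your slurm script:
    os.environ['HDF5_USE_FILE_LOCKING'] = 'FALSE'

    # "gather" mode: collect the data to one core and write serially.
    # "virtual" (the default) and "mpio" are the alternatives.
    snapshots = solver.evaluator.add_file_handler(
        'snapshots', sim_dt=0.1, max_writes=50, parallel='gather')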

Best,
-Keaton

