Hi all,
I'm trying to run some large-scale fluids simulations (turbulent channel flow with polymer). I'm attaching a sample Dedalus v3 script for reference.
While the code seems to run fine and produce the expected values, I've run into a strange issue: the code hangs (no new iterations, no data written, no error messages) when I include file writing via the built-in file handlers.
For architecture context, I'm running on TACC Sapphire Rapids (SPR) nodes (112 cores/node, 128 GB RAM/node).
Here are some observations from my testing:
- When I run the attached code without the analysis file handler, I can run up to 512 MPI tasks on 8 SPR nodes
- This works on all mesh sizes I've tried
- When I turn on the analysis file handler with 512 MPI tasks, the code hangs on the first iteration at which the handler's cadence triggers a write
- This happens even when I'm not actually adding any tasks to the file handler
- To get this to run successfully, I have to lower the number of MPI tasks to 64, which obviously slows the code down substantially.
- I have also noticed that when I run at a lower resolution with, say, 128 MPI tasks, the code runs fine with no mesh specified, but it hangs when I try to use a mesh of size (16,8)
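
As a quick sanity check on the layouts listed above, here is a minimal sketch in plain Python (no Dedalus or MPI needed). The helper names mesh_matches_tasks and per_node_load are hypothetical and are not part of Dedalus. It assumes that the product of the process mesh must equal the number of MPI tasks, and that tasks are spread evenly across nodes.

```python
from math import prod


def mesh_matches_tasks(mesh, mpi_tasks):
    """True if the process mesh has exactly one slot per MPI task.

    Assumption: the product of the mesh dimensions must equal the
    number of MPI tasks in the job.
    """
    return prod(mesh) == mpi_tasks


def per_node_load(mpi_tasks, nodes, ram_gb_per_node):
    """Return (tasks per node, GB of RAM per task).

    Assumption: tasks are distributed evenly across nodes.
    """
    tasks_per_node = -(-mpi_tasks // nodes)  # ceiling division
    return tasks_per_node, ram_gb_per_node / tasks_per_node


# The (16, 8) mesh from the 128-task run
print(mesh_matches_tasks((16, 8), 128))  # True

# The 512-task run on 8 SPR nodes with 128 GB RAM each
print(per_node_load(512, 8, 128))  # (64, 2.0)
```

So the (16,8) mesh is consistent with 128 tasks, and the 512-task run places 64 tasks on each 112-core node with roughly 2 GB of RAM per task. Neither number points to an obviously invalid job layout.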
I'm not sure whether this is a bug or a known issue, but I would like to know if there is a workaround.
Thanks,
Ryan