File writing issue in multi-node job


subhajit kar

Jul 13, 2020, 6:18:57 AM
to dedalu...@googlegroups.com
Hi, 

I have installed Dedalus on our local cluster using a conda environment.

The code works fine on one node, but when I switch to more than one node I get errors.
I also checked by running a simple MPI hello-world program across multiple nodes without any errors,
so I guess the installation itself is correct.
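(For reference, the hello test was along these lines; a minimal mpi4py sketch, written from memory:)

import socket
from mpi4py import MPI

comm = MPI.COMM_WORLD
# Each rank reports itself and its host, so you can confirm both nodes joined.
print(f"Rank {comm.rank} of {comm.size} on host {socket.gethostname()}")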

I searched the group and found a thread where a similar issue had been solved -

So I set FILEHANDLER_TOUCH_TMPFILE = True in dedalus.cfg.
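(If I remember right, the setting lives under the [analysis] section of dedalus.cfg; the exact section name here is from memory:)

[analysis]
    FILEHANDLER_TOUCH_TMPFILE = True

With that change, I got an error like this: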

2020-07-13 12:47:43,027 __main__ 0/2 INFO :: Solver built
2020-07-13 12:47:43,259 __main__ 0/2 INFO :: Starting loop
2020-07-13 12:47:44,475 __main__ 1/2 ERROR :: Exception raised, triggering end of main loop.

I checked the error file, and it shows:
FileNotFoundError: [Errno 2] No such file or directory: '/scratch/subhajitkar/wave_flow/tmpfile_p1'

On our cluster, each node has its own /scratch/subhajitkar space (the scratch spaces are node-local and independent).

In this case I used 2 processors in total, 1 processor from each node, and
checked that the host node creates the folder but the second node is not able to see it.
Here I have a question: does each node independently create temporary files?

Can you please suggest how to fix this?
Please let me know if you need any other information.

Thanks for the help!

Subhajit Kar


sch...@ntu.edu.tw

Oct 28, 2021, 4:33:34 AM
to Dedalus Users
Hi Subhajit,
I encountered the same errors when I tried to use more than one computing node. Copying dedalus.cfg to the local directory and setting FILEHANDLER_TOUCH_TMPFILE = True did not solve the problem. I wonder if you have found a solution to this (or whether other people have suggestions)? Thanks.

---- Shih-Nan

Louis-Alexandre Couston

Jan 25, 2022, 7:30:22 AM
to Dedalus Users
Hello,

Just to add to this thread: we also have an I/O issue on the ENS de Lyon supercomputer.
Everything is fine on one node, but the run crashes on multiple nodes with the following error:

FileNotFoundError: [Errno 2] Unable to create file (unable to open file: name = '/home/lcouston/test_dedalus/ggachon/snapshots/snapshots_s1/snapshots_s1_p8.h5', errno = 2, error message = 'No such file or directory', flags = 15, o_flags = c2)

Basically, the error occurs because every node that is not the master node attempts to reach the analysis folders (here, snapshots) before they are created.
If the analysis folders already exist, we don't get that error. One solution is to check that the folder exists before calling mpirun, i.e., adding something like this to the job script:

### ensure snapshots/snapshots_s1 exists
DIRS_SNAP="$PWD/snapshots/snapshots_s1/"
if [[ ! -d "$DIRS_SNAP" ]]
then
  /bin/mkdir -p "$DIRS_SNAP"
fi

It's not ideal but a start... if anyone has other ideas, happy to hear them!
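Another option that comes to mind (an untested sketch on my side, not something from the Dedalus docs): create the output directory from rank 0 inside the run script itself, before the file handler is added, and barrier so no rank races ahead. Note this only helps if the path sits on a filesystem shared by all nodes:

import os
from mpi4py import MPI

comm = MPI.COMM_WORLD
output_dir = "snapshots"  # hypothetical path; must be visible to every node

# Only rank 0 creates the directory; everyone waits before any file I/O.
if comm.rank == 0:
    os.makedirs(output_dir, exist_ok=True)
comm.Barrier()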
For reference, the supercomputer runs Debian 9.

Cheers