File writing issue in multi-node job


subhajit kar

Jul 13, 2020, 6:18:57 AM
to dedalu...@googlegroups.com
Hi, 

I have installed Dedalus on our local cluster using a conda environment.

The code works fine on one node, but when I switch to more than one node I get errors.
I also checked by running a simple MPI hello-world program across multiple nodes without any errors,
so I guess the installation itself is correct.
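(For reference, the hello test was along these lines; a minimal mpi4py sketch, written from memory:)

import socket
from mpi4py import MPI

comm = MPI.COMM_WORLD
# Each rank reports itself and its host, so you can confirm both nodes joined.
print(f"Rank {comm.rank} of {comm.size} on host {socket.gethostname()}")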

I searched the group and found a thread where a similar issue had been solved -

So I set FILEHANDLER_TOUCH_TMPFILE = True in dedalus.cfg.
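(If I remember right, the setting lives under the [analysis] section of dedalus.cfg; the exact section name here is from memory:)

[analysis]
    FILEHANDLER_TOUCH_TMPFILE = True

With that change, I got an error like this: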

2020-07-13 12:47:43,027 __main__ 0/2 INFO :: Solver built
2020-07-13 12:47:43,259 __main__ 0/2 INFO :: Starting loop
2020-07-13 12:47:44,475 __main__ 1/2 ERROR :: Exception raised, triggering end of main loop.

I checked the error file, and it shows:
FileNotFoundError: [Errno 2] No such file or directory: '/scratch/subhajitkar/wave_flow/tmpfile_p1'

On our cluster, each node has its own /scratch/subhajitkar space (the scratch spaces are node-local and independent).

In this case I used 2 processors in total, 1 processor from each node, and
checked that the host node creates the folder but the second node is not able to see it.
Here I have a question: does each node independently create temporary files?

Can you please suggest how to fix this?
Please let me know if you need any other information.

Thanks for the help!

Subhajit Kar


sch...@ntu.edu.tw

Oct 28, 2021, 4:33:34 AM
to Dedalus Users
Hi Subhajit,
I encountered the same errors when I tried to use more than one computing node. Copying dedalus.cfg to the local directory and setting FILEHANDLER_TOUCH_TMPFILE = True did not solve the problem. I wonder if you have found a solution to this (or whether other people have suggestions)? Thanks.

---- Shih-Nan

Louis-Alexandre Couston

Jan 25, 2022, 7:30:22 AM
to Dedalus Users
Hello,

Just to add to this thread: we also have an I/O issue on the ENS de Lyon supercomputer.
Everything is fine on one node, but the run crashes on multiple nodes with the following error:

FileNotFoundError: [Errno 2] Unable to create file (unable to open file: name = '/home/lcouston/test_dedalus/ggachon/snapshots/snapshots_s1/snapshots_s1_p8.h5', errno = 2, error message = 'No such file or directory', flags = 15, o_flags = c2)

Basically, the error occurs because every node that is not the master node attempts to reach the analysis folders (here, snapshots) before they are created.
If the analysis folders already exist, we don't get that error. One solution is to check that the folder exists before calling mpirun, i.e., adding something like this to the job script:

### ensure snapshots/snapshots_s1 exists
DIRS_SNAP="$PWD/snapshots/snapshots_s1/"
if [[ ! -d "$DIRS_SNAP" ]]
then
  /bin/mkdir -p "$DIRS_SNAP"
fi

It's not ideal but a start... if anyone has other ideas, happy to hear them!
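Another option that comes to mind (an untested sketch on my side, not something from the Dedalus docs): create the output directory from rank 0 inside the run script itself, before the file handler is added, and barrier so no rank races ahead. Note this only helps if the path sits on a filesystem shared by all nodes:

import os
from mpi4py import MPI

comm = MPI.COMM_WORLD
output_dir = "snapshots"  # hypothetical path; must be visible to every node

# Only rank 0 creates the directory; everyone waits before any file I/O.
if comm.rank == 0:
    os.makedirs(output_dir, exist_ok=True)
comm.Barrier()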
For reference, the supercomputer runs Debian 9.

Cheers