Hi,
I am encountering a problem while running a job. I am unsure if this is to do with COSMA or Dedalus. Could you please look into this?
I have successfully installed Dedalus on COSMA and submitted a job on a single node, and it runs without any errors. However, when I run the job on multiple nodes, it throws an error and no output is written to the log file, no matter how long I let it run. The error with multiple nodes is as follows:
[m7352][[59694,1],16][btl_tcp_endpoint.c:625:mca_btl_tcp_endpoint_recv_connect_ack] received unexpected process identifier [[59694,1],24]
Is this something to do with Dedalus?
I searched for this error and found a post (https://github.com/horovod/horovod/issues/1516) where adding '-mca btl_tcp_if_include ib0' to the mpirun command line solved the problem. Do you have an idea what this option does? I tried the same with the following command line:
mpiexec -np 32 -mca btl_tcp_if_include ib0 python3 ./run_script_3D_1.py > log
This leads to an exception being raised in Dedalus:
2023-12-07 14:49:31,288 __main__ 22/32 ERROR :: Exception raised, triggering end of main loop.
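The log line on its own does not say which exception was raised. Below is a minimal sketch of how I could log the full traceback next to that message. It assumes my script wraps the main loop in a try/except block that logs this message and re-raises. The names run_main_loop and failing_step are hypothetical stand-ins, not part of Dedalus or of my script.

```python
import logging
import sys

# Plain logging setup for this sketch; a real script may configure logging differently.
logging.basicConfig(
    stream=sys.stdout,
    level=logging.INFO,
    format="%(asctime)s %(name)s %(levelname)s :: %(message)s",
)
logger = logging.getLogger(__name__)


def run_main_loop(step, n_steps):
    """Call step(i) for i in range(n_steps); log the traceback if a step raises."""
    try:
        for i in range(n_steps):
            step(i)
    except Exception:
        # logger.exception logs at ERROR level and appends the traceback,
        # so the cause appears in the same log as the message.
        logger.exception("Exception raised, triggering end of main loop.")
        raise


def failing_step(i):
    """Hypothetical stand-in for one solver timestep."""
    if i == 2:
        raise ValueError("example failure at step 2")


if __name__ == "__main__":
    run_main_loop(failing_step, 5)
```

With this in place, the log should show the exception type and the line that raised it, rather than only the generic message.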
What exception is this exactly? Any clue?
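Regarding the '-mca btl_tcp_if_include ib0' option above: I have not confirmed that an interface called ib0 exists on the COSMA compute nodes. A small sketch like the following, run on each node, could check that. The helper names list_interfaces and has_interface are my own, and ib0 is only the name taken from the linked post.

```python
import socket


def list_interfaces():
    """Return the names of the network interfaces visible on this host."""
    return [name for _index, name in socket.if_nameindex()]


def has_interface(name):
    """Return True if an interface with the given name exists on this host."""
    return name in list_interfaces()


if __name__ == "__main__":
    # Print the hostname so output from different nodes can be told apart.
    print(socket.gethostname(), list_interfaces())
    print("ib0 present:", has_interface("ib0"))
```

If ib0 is not listed on a node, the interface name passed to mpiexec would presumably need to be one of the names that is listed.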