Dear George,
The job does not seem to run to the end. Here is the command I ran:
$ mpirun --with-ft=ulfm -np 4 --host queue1-dy-t3nano-1,queue1-dy-t3nano-1,queue1-dy-t3nano-2,queue1-dy-t3nano-2 ./kill_node
And here is the result:
Warning: Permanently added 'queue1-dy-t3nano-2,172.31.68.234' (ECDSA) to the list of known hosts.
Warning: Permanently added 'queue1-dy-t3nano-1,172.31.65.214' (ECDSA) to the list of known hosts.
--------------------------------------------------------------------------
PRTE has lost communication with a remote daemon.
HNP daemon : [prterun-ip-172-31-48-121-5100@0,0] on node ip-172-31-48-121
Remote daemon: [prterun-ip-172-31-48-121-5100@0,2] on node queue1-dy-t3nano-2
This is usually due to either a failure of the TCP network
connection to the node, or possibly an internal failure of
the daemon itself. We cannot recover from this failure, and
therefore will terminate the job.
--------------------------------------------------------------------------
I am not sure whether I am missing any parameters for mpirun. Could you please advise?
Regards,
Kawin