Hello,
I'm running some machine learning models on Tesla GPUs. While an experiment is in progress, the node becomes unresponsive and I get kicked out of my SSH session. The only way to recover is to restart the node from the CloudLab portal. FYI, the portal still shows the node in the "Ready/Green" state. The node is currently in this unresponsive state, in case you want to take a look at the logs.
The same program runs fine on another node (ibm8335), so I assume the issue is not related to the program itself.
Is there any issue with this node or its GPUs?
Thanks,
Umakant