GPU server goes down once in a while


Xie Minhui

Mar 2, 2022, 9:01:34 AM
to cloudlab-users
Hi,

I have a machine (Wisconsin Lab), c4130-110133.wisc.cloudlab.us, that goes down once in a while.
The icon shows that the machine is ready and ISUP, but I am unable to connect to it through the console or ssh.
After I reboot it manually, it returns to normal for a while, but then the same problem happens again.

I can't find any useful messages in the console output. The OS is standard Ubuntu 20.04 and I have not changed the running kernel.

Can someone help? Thanks.


Minhui

Leigh Stoller

Mar 2, 2022, 9:24:42 AM
to cloudla...@googlegroups.com

> I have a machine (Wisconsin Lab), c4130-110133.wisc.cloudlab.us. It goes down once in a while.
> The icon shows that the machine is ready and ISUP, but I am unable to connect to the console or ssh.
> After I reboot it manually, it returns to normal for a while, but then the same problem happens again.

Hi. Next time this happens, please do not reboot the machine.
Instead, send us email so that we can diagnose the problem.

Thanks
Leigh


Yueying Li

Mar 2, 2022, 1:24:06 PM
to cloudlab-users
Hi Leigh,

Thank you in advance for taking a look. I recently tried to install CUDA on c240g5-110225.wisc.cloudlab.us. The installation required a reboot, and since the reboot we cannot connect to the machines. The console shows:
Trying 127.0.0.1... Connected to localhost. Escape character is 'off'.

Could you help take a look?

Thank you very much!

Regards

David M Johnson

Mar 2, 2022, 1:38:32 PM
to cloudla...@googlegroups.com
On 3/2/22 11:24 AM, Yueying Li wrote:
> Hi Leigh,
>
> Thank you in advance for taking a look. I recently tried to install
> CUDA on c240g5-110225.wisc.cloudlab.us. The installation required a
> reboot, and since the reboot we cannot connect to the machines. The console shows:
> Trying 127.0.0.1... Connected to localhost. Escape character is 'off'.
>
> Could you help take a look?

If you look at the console log, you will see that NetworkManager has
taken over the network init path, instead of our systemd-networkd path.
This is a known problem with various nvidia toolkits' package
dependencies (e.g. see search results on this list,
https://groups.google.com/g/cloudlab-users/search?q=networkmanager).
The fix is either to `systemctl disable NetworkManager`, or do something
like `sudo ln -s /dev/null /etc/systemd/system/NetworkManager.service`
before installing.
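
As a rough sketch of the second option, the masking step could be wrapped in a small shell function and run as root before the CUDA packages are installed. `mask_units` is a hypothetical helper name, not part of any CloudLab tooling, and the ROOT argument is an assumption added only so the function can be tried on a scratch directory first.

```shell
#!/bin/sh
# Sketch: mask systemd units by pointing their unit files at /dev/null,
# so that a later package install cannot start them.
# mask_units is a hypothetical helper, not a CloudLab or systemd command.
# Usage: mask_units ROOT UNIT...   (ROOT is "/" on the running node)
mask_units() {
    if [ "$#" -lt 2 ]; then
        echo "usage: mask_units ROOT UNIT..." >&2
        return 2
    fi
    root="$1"
    shift
    # Strip a trailing slash so that "/" becomes "" and paths stay clean.
    unit_dir="${root%/}/etc/systemd/system"
    mkdir -p "$unit_dir" || return 1
    for unit in "$@"; do
        # A unit file that is a symlink to /dev/null is treated as masked.
        ln -sfn /dev/null "$unit_dir/$unit" || return 1
    done
}
```

On the node itself this would be called from a root shell as `mask_units / NetworkManager.service`, followed by the toolkit installation.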

If you want to rescue these particular machines, you will need to boot
them into the Recovery MFS via the node popup menus in the Topology
View, then mount the disks and chroot into the on-disk root, and disable
NetworkManager that way. See
https://gitlab.flux.utah.edu/emulab/emulab-devel/-/wikis/faq/Using-the-Testbed/Using-the-Recovery-MFS
for instructions.
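
Once the on-disk root is mounted in the recovery environment, the disable step could look roughly like the sketch below. It writes the /dev/null symlink directly into the mounted root instead of running systemctl inside a chroot, so it is an alternative to the steps described above, not the documented procedure. The mount point `/mnt`, the placeholder device name, and the helper name `mask_in_mounted_root` are all assumptions; the wiki page remains the authoritative reference.

```shell
#!/bin/sh
# Sketch for a rescue environment, after the node's root filesystem has been
# mounted, e.g. with: mount /dev/sdXN /mnt
# (/dev/sdXN is a placeholder; look up the real device on your node).
# mask_in_mounted_root is a hypothetical helper name.
# Usage: mask_in_mounted_root MOUNTPOINT [UNIT]
mask_in_mounted_root() {
    if [ -z "$1" ]; then
        echo "usage: mask_in_mounted_root MOUNTPOINT [UNIT]" >&2
        return 2
    fi
    mnt="${1%/}"
    unit="${2:-NetworkManager.service}"
    # Refuse to touch a directory that does not look like a systemd-based root.
    if [ ! -d "$mnt/etc/systemd/system" ]; then
        echo "no etc/systemd/system under '$1'; is the disk mounted?" >&2
        return 1
    fi
    # The link target is absolute, so it resolves to the node's own /dev/null
    # once the on-disk root is booted again.
    ln -sfn /dev/null "$mnt/etc/systemd/system/$unit"
}
```

With the disk mounted at `/mnt`, the call would be `mask_in_mounted_root /mnt`, after which the disk can be unmounted and the node rebooted normally.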

David

> --
> You received this message because you are subscribed to the Google
> Groups "cloudlab-users" group.
> To unsubscribe from this group and stop receiving emails from it, send
> an email to cloudlab-user...@googlegroups.com
> <mailto:cloudlab-user...@googlegroups.com>.
> To view this discussion on the web visit
> https://groups.google.com/d/msgid/cloudlab-users/af5ee7e6-22e6-49de-a7f0-fca5cf9a3262n%40googlegroups.com
> <https://groups.google.com/d/msgid/cloudlab-users/af5ee7e6-22e6-49de-a7f0-fca5cf9a3262n%40googlegroups.com?utm_medium=email&utm_source=footer>.

Yueying Li

Mar 2, 2022, 3:06:57 PM
to cloudlab-users
This is very helpful. Thank you so much! I was wondering whether there is a way to run a CUDA installation script before the machines are initialized. Thanks!

Xie Minhui

Mar 2, 2022, 7:52:07 PM
to cloudla...@googlegroups.com
Hi Leigh,

The server c4130-110133.wisc.cloudlab.us is down again now, and I have not rebooted it.
Please help me take a look.

Thanks in advance for your time.

Minhui 


--
Minhui Xie
Department of Computer Science & Technology
Tsinghua University
China

David M Johnson

Mar 2, 2022, 9:20:06 PM
to cloudla...@googlegroups.com
On 3/2/22 5:51 PM, Xie Minhui wrote:
> Hi Leigh,
>
> The server c4130-110133.wisc.cloudlab.us is down again now, and I have
> not rebooted it.
> Please help me take a look. 

Maybe you missed the reply in this thread earlier today? Looks like the
same problem... see
https://groups.google.com/g/cloudlab-users/c/B6rNj7Vhltk/m/rwkHf_kwAgAJ .

> Minhui 

David

Xie Minhui

Mar 3, 2022, 6:09:54 AM
to cloudlab-users
Thank you. However, the server still goes down even after I disabled NetworkManager via
sudo ln -s /dev/null /etc/systemd/system/NetworkManager.service

My symptoms are different from Yueying Li's. I can log in to the server via ssh after rebooting it.
But after a while it goes down again, and the topology view still shows that it is ready and ISUP.
What should I do next?

David M Johnson

Mar 3, 2022, 12:36:32 PM
to cloudla...@googlegroups.com
On 3/3/22 4:09 AM, Xie Minhui wrote:
> Thank you. However, the server still goes down even after I disabled
> NetworkManager via
> sudo ln -s /dev/null /etc/systemd/system/NetworkManager.service
>
> My symptoms are different from Yueying Li's. I can log in to the server
> via ssh after rebooting it.
> But after a while it goes down again, and the topology view still
> shows that it is ready and ISUP.
> What should I do next?

Well, I saw that NetworkManager-wait-online.service was still running in
the console log from your node, so I'm wondering if it's really
disabled. You probably want to disable the wait-online helper service
too, although you are right -- I see that your node did successfully
come up on the control network. (This is not the behavior I would
expect to see -- NetworkManager.service should be the only thing that
pulls in NetworkManager-wait-online.service, in the same dependency style
that systemd-networkd uses.)
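
One way to check whether both units are really masked is to look at the symlinks directly. The sketch below assumes the masking was done with the `ln -s` approach in /etc/systemd/system; `unit_is_masked` and `report_units` are hypothetical helper names, and masks placed elsewhere (for example under /run) would not be detected by it.

```shell
#!/bin/sh
# Sketch: report whether units are masked via a /dev/null symlink under
# ROOT/etc/systemd/system. Both function names are hypothetical helpers.
# On a running node, `systemctl is-enabled NetworkManager.service` should
# print "masked" for a unit that has been masked this way.
unit_is_masked() {
    link="${1%/}/etc/systemd/system/$2"
    [ -L "$link" ] && [ "$(readlink "$link")" = "/dev/null" ]
}

# Usage: report_units ROOT UNIT...
report_units() {
    root="$1"
    shift
    for unit in "$@"; do
        if unit_is_masked "$root" "$unit"; then
            echo "$unit: masked"
        else
            echo "$unit: NOT masked"
        fi
    done
}
```

On the node this could be run as `report_units / NetworkManager.service NetworkManager-wait-online.service`; any unit reported as not masked is still free to start at boot.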

Going back to your original message, though, I see that you did not get
any response from the serial console when you tried to interact with it?
What was in the console log at the time? If the serial console is not
responding, the node may be completely hung.

David
