Node Becomes Unreachable After Reboot

aerber zhou (Aerber)

May 30, 2025, 7:55:20 AM
to emulab-users
Dear Emulab Team,

I'm encountering an issue where a node becomes unreachable a few minutes after reboot. Initially, it's accessible (e.g., via SSH), but soon after, it's no longer reachable—even from other nodes on the same LAN.

Do you know what might cause this? Is there a system-level process or config that disables networking post-boot?

Happy to provide node or experiment details if needed.

Thanks,
Aerber

Eric Eide

May 30, 2025, 8:54:45 AM
to Emulab Users Mailing List
Hi!

Please provide more details so that we can investigate: your experiment that is
affected, which of the nodes is affected, what OS/disk image that node is
running, etc.

Thanks ---

Eric.

--
-------------------------------------------------------------------------------
Eric Eide <ee...@cs.utah.edu> . University of Utah Kahlert School of Computing
https://www.cs.utah.edu/~eeide/ . +1 801-585-5512 . Salt Lake City, Utah, USA

Mengying Zhou

May 30, 2025, 10:25:06 AM
to emulab...@googlegroups.com
Thanks for the response.

Problem description:
If I reboot node A with the "sudo reboot" command, only node A becomes inaccessible. The other nodes are not affected.
I can still reboot the machine with the interactive button on the web interface. After that I can access the node for a while, but I lose the connection later.

More details:
1. configuration
The experiment contains 4 nodes. The node type is c240g5, which has the P100 GPU. 
Each node is configured with the standard Ubuntu 20.04 OS without any modification. 
But I set a 30 GB Temporary Filesystem Mount Point on /usr/local.

2. modification
I added the line "ops.wisc.cloudlab.us:/proj/quic-PG0  /proj/quic-PG0  nfs  defaults  0  0" to /etc/fstab because I found that /proj was not mounted automatically after reboot.

3. packages
- NVIDIA driver 535
- CUDA 11.6
- CuDNN 8.4

4. check the log
Below are some entries from /var/log/syslog. Do they suggest that the NVIDIA-related packages are causing the node to become inaccessible?

May 30 07:02:50 node1 systemd[1]: Failed to start NVIDIA Persistence Daemon.
May 30 07:02:50 node1 systemd-udevd[69173]: nvidia: Process '/sbin/modprobe nvidia-modeset' failed with exit code 1.
May 30 07:02:50 node1 systemd-udevd[69173]: nvidia: Process '/sbin/modprobe nvidia-drm' failed with exit code 1.
May 30 07:02:50 node1 systemd-udevd[69173]: nvidia: Process '/sbin/modprobe nvidia-uvm' failed with exit code 1.
May 30 07:02:50 node1 systemd[1]: nvidia-persistenced.service: Failed with result 'exit-code'.
May 30 07:02:50 node1 systemd[1]: Failed to start NVIDIA Persistence Daemon.
May 30 07:02:50 node1 systemd-udevd[69173]: nvidia: Process '/sbin/modprobe nvidia-modeset' failed with exit code 1.
May 30 07:02:51 node1 systemd-udevd[69173]: nvidia: Process '/sbin/modprobe nvidia-drm' failed with exit code 1.
May 30 07:02:51 node1 systemd[1]: Stopping LSB: automatic crash report generation...
# ... some repeat logs for "Failed to connect system bus: No such file or directory#033[0m"
May 30 07:05:30 localhost sh[1670]: #033[0;1;31mFailed to connect system bus: No such file or directory#033[0m
May 30 07:05:30 localhost kernel: [   12.046830] mpt3sas_cm0: sending message unit reset !!
May 30 07:05:30 localhost kernel: [   12.054076] mpt3sas_cm0: message unit reset: SUCCESS
May 30 07:05:30 localhost sh[1675]: #033[0;1;31mFailed to connect system bus: No such file or directory#033[0m
# ... some repeat logs for "pubsubd.service: Failed to execute command: No such file or directory"
May 30 07:06:01 localhost systemd[4017]: pubsubd.service: Failed to execute command: No such file or directory
May 30 07:06:01 localhost systemd[4017]: pubsubd.service: Failed at step EXEC spawning /usr/local/libexec/pubsubd: No such file or directory
May 30 07:06:01 localhost systemd[1]: pubsubd.service: Failed with result 'exit-code'.
May 30 07:06:01 localhost systemd[1]: Failed to start The Emulab publish/subscribe daemon.
May 30 07:06:01 localhost /usr/lib/gdm3/gdm-x-session[4022]: #011(WW) warning, (EE) error, (NI) not implemented, (??) unknown.
May 30 07:06:01 localhost systemd[1]: pubsubd.service: Failed with result 'exit-code'.
May 30 07:06:01 localhost systemd[1]: Failed to start The Emulab publish/subscribe daemon.
May 30 07:06:01 localhost sh[3883]: Job for pubsubd.service failed because the control process exited with error code.
May 30 07:06:01 localhost systemd[1]: pubsubd.service: Failed with result 'exit-code'.
May 30 07:06:01 localhost systemd[1]: Failed to start The Emulab publish/subscribe daemon.
May 30 07:06:01 localhost pulseaudio[4018]: Failed to open cookie file '/var/lib/gdm3/.config/pulse/cookie': No such file or directory
May 30 07:06:01 localhost pulseaudio[4018]: Failed to load authentication key '/var/lib/gdm3/.config/pulse/cookie': No such file or directory
May 30 07:06:01 localhost pulseaudio[4018]: Failed to open cookie file '/var/lib/gdm3/.pulse-cookie': No such file or directory
May 30 07:06:01 localhost pulseaudio[4018]: Failed to load authentication key '/var/lib/gdm3/.pulse-cookie': No such file or directory
May 30 07:06:01 localhost /usr/lib/gdm3/gdm-x-session[4022]: (EE) Failed to load module "mga" (module does not exist, 0)
May 30 07:06:01 localhost /usr/lib/gdm3/gdm-x-session[4022]: (EE) Failed to load module "mga" (module does not exist, 0)
May 30 07:06:01 localhost /usr/lib/gdm3/gdm-x-session[4022]: (II) Failed to load module "nvidia" (already loaded, 0)
May 30 07:06:01 localhost /usr/lib/gdm3/gdm-x-session[4022]: (II) Failed to load module "nouveau" (already loaded, 0)
May 30 07:06:01 localhost /usr/lib/gdm3/gdm-x-session[4022]: (II) Failed to load module "modesetting" (already loaded, 0)
May 30 07:06:01 localhost /usr/lib/gdm3/gdm-x-session[4022]: (II) Failed to load module "fbdev" (already loaded, 0)
May 30 07:06:01 localhost /usr/lib/gdm3/gdm-x-session[4022]: (II) Failed to load module "vesa" (already loaded, 0)
May 30 07:06:01 localhost /usr/lib/gdm3/gdm-x-session[4022]: MESA-LOADER: failed to open mgag200: /usr/lib/dri/mgag200_dri.so: cannot open shared object file: No such file or directory (search paths /usr/lib/x86_64-linux-gnu/dri:\$${ORIGIN}/dri:/usr/lib/dri, suffix _dri)
May 30 07:06:01 localhost /usr/lib/gdm3/gdm-x-session[4022]: failed to load driver: mgag200
May 30 07:06:02 localhost /usr/lib/gdm3/gdm-x-session[4022]: (EE) modeset(0): glamor initialization failed
May 30 07:06:02 localhost /usr/lib/gdm3/gdm-x-session[4022]: (II) NVIDIA(G0): ACPI: failed to connect to the ACPI event daemon; the daemon
May 30 07:06:02 localhost gnome-session[4224]: libEGL warning: DRI2: failed to authenticate
May 30 07:06:03 localhost /usr/lib/gdm3/gdm-x-session[4158]: dbus-daemon[4158]: [session uid=122 pid=4158] Activated service 'org.freedesktop.systemd1' failed: Process org.freedesktop.systemd1 exited with status 1
May 30 07:06:03 localhost /usr/lib/gdm3/gdm-x-session[4158]: dbus-daemon[4158]: [session uid=122 pid=4158] Activated service 'org.freedesktop.systemd1' failed: Process org.freedesktop.systemd1 exited with status 1
May 30 07:06:03 localhost gnome-session[4162]: gnome-session-binary[4162]: WARNING: Falling back to non-systemd startup procedure due to error: GDBus.Error:org.freedesktop.DBus.Error.Spawn.ChildExited: Process org.freedesktop.systemd1 exited with status 1
May 30 07:06:03 localhost gnome-session-binary[4162]: WARNING: Falling back to non-systemd startup procedure due to error: GDBus.Error:org.freedesktop.DBus.Error.Spawn.ChildExited: Process org.freedesktop.systemd1 exited with status 1
May 30 07:06:03 localhost /usr/lib/gdm3/gdm-x-session[4158]: dbus-daemon[4158]: [session uid=122 pid=4158] Activated service 'org.freedesktop.systemd1' failed: Process org.freedesktop.systemd1 exited with status 1
May 30 07:06:03 localhost /usr/lib/gdm3/gdm-x-session[4158]: dbus-daemon[4158]: [session uid=122 pid=4158] Activated service 'org.freedesktop.systemd1' failed: Process org.freedesktop.systemd1 exited with status 1
May 30 07:06:03 localhost /usr/lib/gdm3/gdm-x-session[4158]: dbus-daemon[4158]: [session uid=122 pid=4158] Activated service 'org.freedesktop.systemd1' failed: Process org.freedesktop.systemd1 exited with status 1
May 30 07:06:03 localhost /usr/lib/gdm3/gdm-x-session[4158]: dbus-daemon[4158]: [session uid=122 pid=4158] Activated service 'org.freedesktop.systemd1' failed: Process org.freedesktop.systemd1 exited with status 1
May 30 07:06:03 localhost /usr/lib/gdm3/gdm-x-session[4158]: dbus-daemon[4158]: [session uid=122 pid=4158] Activated service 'org.freedesktop.systemd1' failed: Process org.freedesktop.systemd1 exited with status 1
May 30 07:06:05 localhost ntpd[3023]: bind(26) AF_INET6 fe80::ba3f:d2ff:fe77:c643%4#123 flags 0x11 failed: Cannot assign requested address
May 30 07:06:05 localhost ntpd[3023]: failed to init interface for address fe80::ba3f:d2ff:fe77:c643%4
May 30 07:06:05 localhost colord[4510]: failed to get edid data: EDID length is too small
May 30 07:06:05 localhost /usr/lib/gdm3/gdm-x-session[4158]: dbus-daemon[4158]: [session uid=122 pid=4158] Activated service 'org.freedesktop.systemd1' failed: Process org.freedesktop.systemd1 exited with status 1
May 30 07:06:05 localhost gsd-sharing[4486]: Failed to StopUnit service: GDBus.Error:org.freedesktop.DBus.Error.Spawn.ChildExited: Process org.freedesktop.systemd1 exited with status 1
May 30 07:06:05 localhost gsd-sharing[4486]: message repeated 3 times: [ Failed to StopUnit service: GDBus.Error:org.freedesktop.DBus.Error.Spawn.ChildExited: Process org.freedesktop.systemd1 exited with status 1]
May 30 07:06:05 localhost gnome-shell[4286]: cr_parser_new_from_buf: assertion 'a_buf && a_len' failed
May 30 07:06:05 localhost gnome-shell[4286]: cr_declaration_parse_list_from_buf: assertion 'parser' failed
May 30 07:06:05 localhost gnome-shell[4286]: cr_parser_new_from_buf: assertion 'a_buf && a_len' failed
May 30 07:06:05 localhost gnome-shell[4286]: cr_declaration_parse_list_from_buf: assertion 'parser' failed
May 30 07:06:06 localhost gsd-color[4453]: failed to get edid: unable to get EDID for output
May 30 07:06:06 localhost gsd-media-keys[4462]: Failed to grab accelerator for keybinding settings:playback-repeat
May 30 07:06:06 localhost gsd-media-keys[4462]: Failed to grab accelerator for keybinding settings:hibernate
May 30 07:06:06 localhost gsd-media-keys[4462]: Failed to grab accelerator for keybinding settings:rfkill
May 30 07:06:06 localhost gsd-media-keys[4462]: Failed to grab accelerator for keybinding settings:playback-random
May 30 07:06:06 localhost gnome-shell[4286]: cr_parser_new_from_buf: assertion 'a_buf && a_len' failed
May 30 07:06:06 localhost gnome-shell[4286]: cr_declaration_parse_list_from_buf: assertion 'parser' failed
May 30 07:08:07 localhost sh[3883]: *** ERROR: linktest daemon failed to start. Timed out.
May 30 07:10:08 localhost pulseaudio[4728]: Failed to open cookie file '/users/myzhou/.config/pulse/cookie': No such file or directory
May 30 07:10:08 localhost pulseaudio[4728]: Failed to load authentication key '/users/myzhou/.config/pulse/cookie': No such file or directory
May 30 07:10:08 localhost pulseaudio[4728]: Failed to open cookie file '/users/myzhou/.pulse-cookie': No such file or directory
May 30 07:10:08 localhost pulseaudio[4728]: Failed to load authentication key '/users/myzhou/.pulse-cookie': No such file or directory
May 30 07:10:33 localhost pulseaudio[4728]: GetManagedObjects() failed: org.freedesktop.DBus.Error.NoReply: Did not receive a reply. Possible causes include: the remote application did not send a reply, the message bus security policy blocked the reply, the reply timeout expired, or the network connection was broken.
May 30 07:11:15 localhost gnome-shell[4286]: cr_parser_new_from_buf: assertion 'a_buf && a_len' failed
May 30 07:11:15 localhost gnome-shell[4286]: cr_declaration_parse_list_from_buf: assertion 'parser' failed
May 30 07:26:05 localhost systemd[1]: Starting GRUB failed boot detection...
May 30 07:26:05 localhost systemd[1]: Finished GRUB failed boot detection.

Thanks again.
Best, 
Aerber


--
Best Regards.

Mengying Zhou
PhD student
School of Computer Science
Fudan University

Mike Hibler

May 30, 2025, 11:00:34 AM
to emulab...@googlegroups.com
You should not mount your temporary filesystem on /usr/local. That causes
it to cover up the default /usr/local and all the system tools under that.

I assume you are trying to make enough space to load all the Nvidia packages?
If possible, you should use the Ubuntu22 or 24 image as they have a larger
root filesystem.
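
One way to check for this kind of shadowing is sketched below. It is a minimal sketch, not an Emulab tool: the function names is_mount_point and missing_under_usr_local are hypothetical, and the only path it relies on is the /usr/local/libexec/pubsubd failure that appears in the syslog excerpt earlier in this thread.

```shell
#!/bin/sh
# Hypothetical helpers (not part of Emulab/CloudLab).

# Return 0 if DIR is listed as a mount point in a mounts table.
# The table defaults to /proc/mounts; the second argument exists
# only so the function can be tried against a sample file.
is_mount_point() {
    dir=$1
    table=${2:-/proc/mounts}
    awk -v d="$dir" '$2 == d { found = 1 } END { exit !found }' "$table"
}

# Print log lines that report a missing file under /usr/local,
# e.g. "... spawning /usr/local/libexec/pubsubd: No such file or directory".
missing_under_usr_local() {
    grep -E '/usr/local/[^: ]+: No such file or directory' "$1"
}

# Example use on a node:
#   is_mount_point /usr/local && echo "/usr/local is a separate mount"
#   missing_under_usr_local /var/log/syslog
```

If /usr/local is a separate mount and the log also reports missing files under it, the mount is probably hiding tools that the image expects to find there.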

Mengying Zhou

Jun 3, 2025, 9:47:21 AM
to emulab...@googlegroups.com
Thank you for the suggestion.

However, I think that may not be the root cause of the issue.
Over the past couple of days, I’ve tested several configurations. I found that even without mounting a temporary filesystem—and installing the NVIDIA driver, CUDA, and cuDNN directly on the root filesystem—the problem still persists. Specifically, after a reboot, the node becomes unreachable via SSH after some time, and I can only recover it using the reboot button on the web interface.

I currently have a node running in this state: c240g5-110121.wisc.cloudlab.us.


Below is my initialization script. I hope it helps:

#!/bin/bash

# System Update
sudo apt-get update -y
sudo apt-get upgrade -y

# Install NVIDIA driver
sudo apt install ubuntu-drivers-common -y
sudo apt install nvidia-driver-535-server -y

# Install CUDA Toolkit (11.6)
wget https://developer.download.nvidia.com/compute/cuda/repos/ubuntu2004/x86_64/cuda-ubuntu2004.pin
sudo mv cuda-ubuntu2004.pin /etc/apt/preferences.d/cuda-repository-pin-600

wget https://developer.download.nvidia.com/compute/cuda/11.6.0/local_installers/cuda-repo-ubuntu2004-11-6-local_11.6.0-510.39.01-1_amd64.deb
sudo dpkg -i cuda-repo-ubuntu2004-11-6-local_11.6.0-510.39.01-1_amd64.deb
rm cuda-repo-ubuntu2004-11-6-local_11.6.0-510.39.01-1_amd64.deb

sudo apt-key add /var/cuda-repo-ubuntu2004-11-6-local/7fa2af80.pub
sudo apt-get update
sudo apt-get -y install cuda
sudo rm -rf /var/cuda-repo-ubuntu2004-11-6-local

# Install cuDNN (v8.4 for CUDA 11.6)
wget https://developer.download.nvidia.com/compute/redist/cudnn/v8.4.0/local_installers/11.6/cudnn-local-repo-ubuntu2004-8.4.0.27_1.0-1_amd64.deb
sudo dpkg -i cudnn-local-repo-ubuntu2004-8.4.0.27_1.0-1_amd64.deb
rm cudnn-local-repo-ubuntu2004-8.4.0.27_1.0-1_amd64.deb

sudo cp /var/cudnn-local-repo-ubuntu2004-8.4.0.27/*.gpg /usr/share/keyrings/
sudo apt-get update
sudo apt-get -y install libcudnn8 libcudnn8-dev
sudo rm -rf /var/cudnn-local-repo-ubuntu2004-8.4.0.27

# System Cleanup
sudo apt clean
sudo rm -rf /var/tmp/*
sudo rm -rf /var/cache/*
sudo rm -f /var/crash/*.crash
sudo journalctl --vacuum-time=7d

# Reboot to apply changes
sudo reboot


Thanks a lot.
Best,
Mengying

Mike Hibler

Jun 3, 2025, 3:34:25 PM
to emulab...@googlegroups.com
For one thing, you are installing an X display server which you definitely
do not want. I suspect the whole X windows environment that gets loaded
includes something that is putting the machine to sleep. So I think you are
installing the wrong nvidia packages for a headless machine.

Second, you are (indirectly) loading the network manager which conflicts with
our setup of the control network interface. See:
https://groups.google.com/g/cloudlab-users/c/B6rNj7Vhltk/m/rwkHf_kwAgAJ

Hope this helps.
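
A rough way to see whether a package set drags in a display stack or NetworkManager is sketched below. The function name flag_desktop_packages is hypothetical, and the package-name pattern is an assumption based on common Ubuntu package names (gdm3 and gnome-shell both show up in the syslog excerpt earlier in this thread); it is not an exhaustive list.

```shell
#!/bin/sh
# Hypothetical helper: read package names on stdin (one per line) and
# print those that belong to a desktop/display stack or NetworkManager.
# The pattern is an assumed, non-exhaustive list of Ubuntu package names.
flag_desktop_packages() {
    grep -E '^(network-manager|gdm3|gnome-shell|xserver-xorg.*)$'
}

# Example use on a node:
#   # packages already installed
#   dpkg-query -W -f='${Package}\n' | flag_desktop_packages
#   # packages a simulated install (-s changes nothing) would add
#   apt-get -s install nvidia-driver-535-server \
#       | awk '/^Inst / { print $2 }' | flag_desktop_packages
```

Any output from the simulated install suggests that the chosen driver package is not suited to a headless node.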

Mengying Zhou

Jun 7, 2025, 9:54:40 AM
to emulab...@googlegroups.com
Hi Mike and Eric,

Thank you very much for your guidance and for sharing the experiences and solutions from others.

After some investigation, I identified three main causes that prevented SSH access to the node:
1. The NetworkManager installed with the NVIDIA driver overrode the CloudLab platform’s default systemd-networkd.
2. The NVIDIA driver package included X/display support.
3. The machine was configured with a suspend timeout, likely also triggered by the NVIDIA driver installation.


Here are the steps I took to resolve these issues:
1. Disable the NetworkManager that was installed along with the NVIDIA driver:
sudo systemctl disable NetworkManager
sudo systemctl disable NetworkManager-wait-online
sudo ln -s /dev/null /etc/systemd/system/NetworkManager.service

2. Install the headless version of the NVIDIA driver (no X/display):
sudo apt install nvidia-headless-535-server -y

3. Disable suspend/hibernate settings and set timeout to none:
sudo systemctl mask sleep.target suspend.target hibernate.target hybrid-sleep.target
sudo gsettings set org.gnome.settings-daemon.plugins.power sleep-inactive-ac-timeout 0
sudo gsettings set org.gnome.settings-daemon.plugins.power sleep-inactive-battery-timeout 0
sudo gsettings set org.gnome.settings-daemon.plugins.power sleep-inactive-ac-type 'nothing'
sudo gsettings set org.gnome.settings-daemon.plugins.power sleep-inactive-battery-type 'nothing'
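
To confirm that steps 1 and 3 took effect, something like the following sketch may help. The names not_masked, unit_states and UNITS are hypothetical. It assumes that `systemctl is-enabled` prints "masked" for a masked unit, which also covers a unit file symlinked to /dev/null as in step 1.

```shell
#!/bin/sh
# Hypothetical helpers for checking the result of the steps above.

# Units that steps 1 and 3 are meant to mask.
UNITS="NetworkManager.service sleep.target suspend.target hibernate.target hybrid-sleep.target"

# Read "unit state" pairs on stdin and print every unit whose
# state is anything other than "masked".
not_masked() {
    awk '$2 != "masked" { print $1 }'
}

# Emit one "unit state" pair per unit, using systemctl (needs systemd).
unit_states() {
    for u in $UNITS; do
        printf '%s %s\n' "$u" "$(systemctl is-enabled "$u" 2>/dev/null)"
    done
}

# Example use on a node (no output means every unit is masked):
#   unit_states | not_masked
```

Running the check again after the next reboot shows whether the settings persisted.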


In addition, as I mentioned, I set a 30 GB temporary filesystem mount point on /usr/local. You’re right that this risks shadowing the default contents. However, it did not cause the access issues. Moreover, when Python packages are installed with sudo, their site-packages content also goes under /usr/local. Mounting the extra space there therefore significantly alleviates the disk space limitations in Ubuntu 20.04, especially when installing large packages like the NVIDIA driver, CUDA, cuDNN, and Python dependencies. Of course, I agree that using Ubuntu 22.04 or later is preferable. But for those who need to stick with older versions like 20.04, this method can be a practical workaround.


Thanks again for your support!

Best regards,
Mengying
