Thanks for the response.
Problem description:
If I reboot node A with the "sudo reboot" command, node A becomes inaccessible. The other nodes are not affected.
I can still reboot the machine using the interactive button on the web interface. After that I can access the node temporarily, but I lose the connection again later.
More details:
1. configuration
The experiment contains 4 nodes of type c240g5, each with a P100 GPU.
Each node runs the standard Ubuntu 20.04 OS without any modification,
except that I set a 30 GB Temporary Filesystem Mount Point on /usr/local.
2. modification
I added the line "ops.wisc.cloudlab.us:/proj/quic-PG0 /proj/quic-PG0 nfs defaults 0 0" to /etc/fstab because I found that the /proj disk was not mounted automatically after a reboot.
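To make the entry above explicit, here is a minimal Python sketch that splits it into the six fields described in fstab(5). The names `FstabEntry` and `parse_fstab_line` are hypothetical helpers for illustration only, and the sketch assumes all six fields are present on the line.

```python
from typing import List, NamedTuple


class FstabEntry(NamedTuple):
    spec: str            # fs_spec: device or remote export
    mountpoint: str      # fs_file: where the filesystem is mounted
    fstype: str          # fs_vfstype: filesystem type
    options: List[str]   # fs_mntops: mount options, split on commas
    dump: int            # fs_freq: used by dump(8)
    passno: int          # fs_passno: fsck order at boot


def parse_fstab_line(line: str) -> FstabEntry:
    """Split one whitespace-separated fstab line into its six fields."""
    fields = line.split()
    if len(fields) != 6:
        raise ValueError(f"expected 6 fields, got {len(fields)}: {line!r}")
    spec, mountpoint, fstype, opts, dump, passno = fields
    return FstabEntry(spec, mountpoint, fstype, opts.split(","), int(dump), int(passno))


entry = parse_fstab_line(
    "ops.wisc.cloudlab.us:/proj/quic-PG0 /proj/quic-PG0 nfs defaults 0 0"
)
print(entry.fstype, entry.options)
```

So the line mounts the NFS export ops.wisc.cloudlab.us:/proj/quic-PG0 at /proj/quic-PG0 with only the "defaults" option set.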
3. packages
- NVIDIA driver 535
- CUDA 11.6
- CuDNN 8.4
4. check the log
Below are some entries from /var/log/syslog. Do they suggest that the NVIDIA-related packages are what makes the node inaccessible?
May 30 07:02:50 node1 systemd[1]: Failed to start NVIDIA Persistence Daemon.
May 30 07:02:50 node1 systemd-udevd[69173]: nvidia: Process '/sbin/modprobe nvidia-modeset' failed with exit code 1.
May 30 07:02:50 node1 systemd-udevd[69173]: nvidia: Process '/sbin/modprobe nvidia-drm' failed with exit code 1.
May 30 07:02:50 node1 systemd-udevd[69173]: nvidia: Process '/sbin/modprobe nvidia-uvm' failed with exit code 1.
May 30 07:02:50 node1 systemd[1]: nvidia-persistenced.service: Failed with result 'exit-code'.
May 30 07:02:50 node1 systemd[1]: Failed to start NVIDIA Persistence Daemon.
May 30 07:02:50 node1 systemd-udevd[69173]: nvidia: Process '/sbin/modprobe nvidia-modeset' failed with exit code 1.
May 30 07:02:51 node1 systemd-udevd[69173]: nvidia: Process '/sbin/modprobe nvidia-drm' failed with exit code 1.
May 30 07:02:51 node1 systemd[1]: Stopping LSB: automatic crash report generation...
# ... some repeated log lines for "Failed to connect system bus: No such file or directory#033[0m"
May 30 07:05:30 localhost sh[1670]: #033[0;1;31mFailed to connect system bus: No such file or directory#033[0m
May 30 07:05:30 localhost kernel: [ 12.046830] mpt3sas_cm0: sending message unit reset !!
May 30 07:05:30 localhost kernel: [ 12.054076] mpt3sas_cm0: message unit reset: SUCCESS
May 30 07:05:30 localhost sh[1675]: #033[0;1;31mFailed to connect system bus: No such file or directory#033[0m
# ... some repeated log lines for "pubsubd.service: Failed to execute command: No such file or directory"
May 30 07:06:01 localhost systemd[4017]: pubsubd.service: Failed to execute command: No such file or directory
May 30 07:06:01 localhost systemd[4017]: pubsubd.service: Failed at step EXEC spawning /usr/local/libexec/pubsubd: No such file or directory
May 30 07:06:01 localhost systemd[1]: pubsubd.service: Failed with result 'exit-code'.
May 30 07:06:01 localhost systemd[1]: Failed to start The Emulab publish/subscribe daemon.
May 30 07:06:01 localhost /usr/lib/gdm3/gdm-x-session[4022]: #011(WW) warning, (EE) error, (NI) not implemented, (??) unknown.
May 30 07:06:01 localhost systemd[1]: pubsubd.service: Failed with result 'exit-code'.
May 30 07:06:01 localhost systemd[1]: Failed to start The Emulab publish/subscribe daemon.
May 30 07:06:01 localhost sh[3883]: Job for pubsubd.service failed because the control process exited with error code.
May 30 07:06:01 localhost systemd[1]: pubsubd.service: Failed with result 'exit-code'.
May 30 07:06:01 localhost systemd[1]: Failed to start The Emulab publish/subscribe daemon.
May 30 07:06:01 localhost pulseaudio[4018]: Failed to open cookie file '/var/lib/gdm3/.config/pulse/cookie': No such file or directory
May 30 07:06:01 localhost pulseaudio[4018]: Failed to load authentication key '/var/lib/gdm3/.config/pulse/cookie': No such file or directory
May 30 07:06:01 localhost pulseaudio[4018]: Failed to open cookie file '/var/lib/gdm3/.pulse-cookie': No such file or directory
May 30 07:06:01 localhost pulseaudio[4018]: Failed to load authentication key '/var/lib/gdm3/.pulse-cookie': No such file or directory
May 30 07:06:01 localhost /usr/lib/gdm3/gdm-x-session[4022]: (EE) Failed to load module "mga" (module does not exist, 0)
May 30 07:06:01 localhost /usr/lib/gdm3/gdm-x-session[4022]: (EE) Failed to load module "mga" (module does not exist, 0)
May 30 07:06:01 localhost /usr/lib/gdm3/gdm-x-session[4022]: (II) Failed to load module "nvidia" (already loaded, 0)
May 30 07:06:01 localhost /usr/lib/gdm3/gdm-x-session[4022]: (II) Failed to load module "nouveau" (already loaded, 0)
May 30 07:06:01 localhost /usr/lib/gdm3/gdm-x-session[4022]: (II) Failed to load module "modesetting" (already loaded, 0)
May 30 07:06:01 localhost /usr/lib/gdm3/gdm-x-session[4022]: (II) Failed to load module "fbdev" (already loaded, 0)
May 30 07:06:01 localhost /usr/lib/gdm3/gdm-x-session[4022]: (II) Failed to load module "vesa" (already loaded, 0)
May 30 07:06:01 localhost /usr/lib/gdm3/gdm-x-session[4022]: MESA-LOADER: failed to open mgag200: /usr/lib/dri/mgag200_dri.so: cannot open shared object file: No such file or directory (search paths /usr/lib/x86_64-linux-gnu/dri:\$${ORIGIN}/dri:/usr/lib/dri, suffix _dri)
May 30 07:06:01 localhost /usr/lib/gdm3/gdm-x-session[4022]: failed to load driver: mgag200
May 30 07:06:02 localhost /usr/lib/gdm3/gdm-x-session[4022]: (EE) modeset(0): glamor initialization failed
May 30 07:06:02 localhost /usr/lib/gdm3/gdm-x-session[4022]: (II) NVIDIA(G0): ACPI: failed to connect to the ACPI event daemon; the daemon
May 30 07:06:02 localhost gnome-session[4224]: libEGL warning: DRI2: failed to authenticate
May 30 07:06:03 localhost /usr/lib/gdm3/gdm-x-session[4158]: dbus-daemon[4158]: [session uid=122 pid=4158] Activated service 'org.freedesktop.systemd1' failed: Process org.freedesktop.systemd1 exited with status 1
May 30 07:06:03 localhost /usr/lib/gdm3/gdm-x-session[4158]: dbus-daemon[4158]: [session uid=122 pid=4158] Activated service 'org.freedesktop.systemd1' failed: Process org.freedesktop.systemd1 exited with status 1
May 30 07:06:03 localhost gnome-session[4162]: gnome-session-binary[4162]: WARNING: Falling back to non-systemd startup procedure due to error: GDBus.Error:org.freedesktop.DBus.Error.Spawn.ChildExited: Process org.freedesktop.systemd1 exited with status 1
May 30 07:06:03 localhost gnome-session-binary[4162]: WARNING: Falling back to non-systemd startup procedure due to error: GDBus.Error:org.freedesktop.DBus.Error.Spawn.ChildExited: Process org.freedesktop.systemd1 exited with status 1
May 30 07:06:03 localhost /usr/lib/gdm3/gdm-x-session[4158]: dbus-daemon[4158]: [session uid=122 pid=4158] Activated service 'org.freedesktop.systemd1' failed: Process org.freedesktop.systemd1 exited with status 1
May 30 07:06:03 localhost /usr/lib/gdm3/gdm-x-session[4158]: dbus-daemon[4158]: [session uid=122 pid=4158] Activated service 'org.freedesktop.systemd1' failed: Process org.freedesktop.systemd1 exited with status 1
May 30 07:06:03 localhost /usr/lib/gdm3/gdm-x-session[4158]: dbus-daemon[4158]: [session uid=122 pid=4158] Activated service 'org.freedesktop.systemd1' failed: Process org.freedesktop.systemd1 exited with status 1
May 30 07:06:03 localhost /usr/lib/gdm3/gdm-x-session[4158]: dbus-daemon[4158]: [session uid=122 pid=4158] Activated service 'org.freedesktop.systemd1' failed: Process org.freedesktop.systemd1 exited with status 1
May 30 07:06:03 localhost /usr/lib/gdm3/gdm-x-session[4158]: dbus-daemon[4158]: [session uid=122 pid=4158] Activated service 'org.freedesktop.systemd1' failed: Process org.freedesktop.systemd1 exited with status 1
May 30 07:06:05 localhost ntpd[3023]: bind(26) AF_INET6 fe80::ba3f:d2ff:fe77:c643%4#123 flags 0x11 failed: Cannot assign requested address
May 30 07:06:05 localhost ntpd[3023]: failed to init interface for address fe80::ba3f:d2ff:fe77:c643%4
May 30 07:06:05 localhost colord[4510]: failed to get edid data: EDID length is too small
May 30 07:06:05 localhost /usr/lib/gdm3/gdm-x-session[4158]: dbus-daemon[4158]: [session uid=122 pid=4158] Activated service 'org.freedesktop.systemd1' failed: Process org.freedesktop.systemd1 exited with status 1
May 30 07:06:05 localhost gsd-sharing[4486]: Failed to StopUnit service: GDBus.Error:org.freedesktop.DBus.Error.Spawn.ChildExited: Process org.freedesktop.systemd1 exited with status 1
May 30 07:06:05 localhost gsd-sharing[4486]: message repeated 3 times: [ Failed to StopUnit service: GDBus.Error:org.freedesktop.DBus.Error.Spawn.ChildExited: Process org.freedesktop.systemd1 exited with status 1]
May 30 07:06:05 localhost gnome-shell[4286]: cr_parser_new_from_buf: assertion 'a_buf && a_len' failed
May 30 07:06:05 localhost gnome-shell[4286]: cr_declaration_parse_list_from_buf: assertion 'parser' failed
May 30 07:06:05 localhost gnome-shell[4286]: cr_parser_new_from_buf: assertion 'a_buf && a_len' failed
May 30 07:06:05 localhost gnome-shell[4286]: cr_declaration_parse_list_from_buf: assertion 'parser' failed
May 30 07:06:06 localhost gsd-color[4453]: failed to get edid: unable to get EDID for output
May 30 07:06:06 localhost gsd-media-keys[4462]: Failed to grab accelerator for keybinding settings:playback-repeat
May 30 07:06:06 localhost gsd-media-keys[4462]: Failed to grab accelerator for keybinding settings:hibernate
May 30 07:06:06 localhost gsd-media-keys[4462]: Failed to grab accelerator for keybinding settings:rfkill
May 30 07:06:06 localhost gsd-media-keys[4462]: Failed to grab accelerator for keybinding settings:playback-random
May 30 07:06:06 localhost gnome-shell[4286]: cr_parser_new_from_buf: assertion 'a_buf && a_len' failed
May 30 07:06:06 localhost gnome-shell[4286]: cr_declaration_parse_list_from_buf: assertion 'parser' failed
May 30 07:08:07 localhost sh[3883]: *** ERROR: linktest daemon failed to start. Timed out.
May 30 07:10:08 localhost pulseaudio[4728]: Failed to open cookie file '/users/myzhou/.config/pulse/cookie': No such file or directory
May 30 07:10:08 localhost pulseaudio[4728]: Failed to load authentication key '/users/myzhou/.config/pulse/cookie': No such file or directory
May 30 07:10:08 localhost pulseaudio[4728]: Failed to open cookie file '/users/myzhou/.pulse-cookie': No such file or directory
May 30 07:10:08 localhost pulseaudio[4728]: Failed to load authentication key '/users/myzhou/.pulse-cookie': No such file or directory
May 30 07:10:33 localhost pulseaudio[4728]: GetManagedObjects() failed: org.freedesktop.DBus.Error.NoReply: Did not receive a reply. Possible causes include: the remote application did not send a reply, the message bus security policy blocked the reply, the reply timeout expired, or the network connection was broken.
May 30 07:11:15 localhost gnome-shell[4286]: cr_parser_new_from_buf: assertion 'a_buf && a_len' failed
May 30 07:11:15 localhost gnome-shell[4286]: cr_declaration_parse_list_from_buf: assertion 'parser' failed
May 30 07:26:05 localhost systemd[1]: Starting GRUB failed boot detection...
May 30 07:26:05 localhost systemd[1]: Finished GRUB failed boot detection.
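Since the excerpt above is long, a rough way to see which components report failures is to count the "fail" messages per program. This is only a minimal Python sketch: the regular expression assumes the traditional syslog layout seen above ("Mon DD HH:MM:SS host program[pid]: message"), and `count_failures` is a hypothetical helper name. The sample lines are copied from the excerpt.

```python
import re
from collections import Counter

# Traditional syslog line: "Mon DD HH:MM:SS host program[pid]: message"
LINE_RE = re.compile(
    r"^(?P<ts>\w{3}\s+\d+\s[\d:]{8})\s"
    r"(?P<host>\S+)\s"
    r"(?P<prog>[^\s\[:]+)(?:\[\d+\])?:\s"
    r"(?P<msg>.*)$"
)


def count_failures(lines):
    """Count lines whose message contains 'fail', grouped by program name."""
    counts = Counter()
    for line in lines:
        m = LINE_RE.match(line)
        if m and "fail" in m.group("msg").lower():
            counts[m.group("prog")] += 1
    return counts


sample = [
    "May 30 07:02:50 node1 systemd[1]: Failed to start NVIDIA Persistence Daemon.",
    "May 30 07:02:50 node1 systemd-udevd[69173]: nvidia: Process '/sbin/modprobe nvidia-modeset' failed with exit code 1.",
    "May 30 07:06:01 localhost systemd[4017]: pubsubd.service: Failed to execute command: No such file or directory",
    "May 30 07:05:30 localhost kernel: [ 12.046830] mpt3sas_cm0: sending message unit reset !!",
]

counts = count_failures(sample)
for prog, n in counts.most_common():
    print(prog, n)
```

To run it on the real file, pass `open("/var/log/syslog", errors="replace")` instead of `sample`. Grouping by program is coarse, because systemd reports failures on behalf of other units (for example nvidia-persistenced and pubsubd), so the message text still needs to be read.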
Thanks again.