Node going offline after installing CUDA because of NVIDIA power management

ff...@nyu.edu

Dec 4, 2022, 12:32:16 PM
to cloudlab-users
Hi, I read a few recent threads about people experiencing nodes going offline after installing CUDA, e.g. "GPU server goes down once in a while", "Wisconsin node c4130-110133 becomes unresponsive frequently", "Clemson Node clgpu001 (c4130) becomes unresponsive", and "node frequent goes offline".

I was experiencing something similar: on Ubuntu 20.04, after installing the CUDA libraries and rebooting, the node would become unresponsive roughly every 15 minutes (and NetworkManager was not running).

In my case, it turned out to be NVIDIA power management: the syslog showed the system going into suspend shortly before it became unresponsive. After running

sudo systemctl mask sleep.target suspend.target hibernate.target hybrid-sleep.target

it stopped going offline.
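
For anyone who wants to check whether they are hitting the same thing, here is a rough sketch of two helper functions. The function names, the grep pattern, and the default log path (/var/log/syslog, the usual location on Ubuntu) are my own assumptions for illustration. The exact wording of the suspend messages may differ on your system, so treat the pattern as a starting point only.

```shell
#!/bin/sh
# Sketch only: helper names and the grep pattern are illustrative assumptions.

# Print log lines that mention suspend, hibernate, or sleep.
# Usage: find_suspend_events [log_file]   (defaults to /var/log/syslog)
find_suspend_events() {
    log_file="${1:-/var/log/syslog}"
    grep -iE 'suspend|hibernate|sleep' "$log_file"
}

# Print the state of each sleep-related target.
# After the mask command above, each one should report "masked".
check_sleep_targets() {
    for unit in sleep.target suspend.target hibernate.target hybrid-sleep.target; do
        printf '%s: %s\n' "$unit" "$(systemctl is-enabled "$unit" 2>/dev/null)"
    done
}
```

Running find_suspend_events before applying the fix should show whether the node is suspending at all; running check_sleep_targets afterwards confirms the mask took effect. If you ever need suspend back, "sudo systemctl unmask" with the same four targets reverses the change.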

Just sharing this in case it helps someone else...

Leigh Stoller

Dec 4, 2022, 12:44:30 PM
to cloudla...@googlegroups.com
Thank you very much!!!

Leigh