Fix for https://issuetracker.google.com/issues/191551132


Yuxuan Chen

Jun 30, 2021, 1:47:39 PM
to google-dl...@googlegroups.com, cloudml-devrel, Data Science Experience Team

Some customers have been experiencing GPU driver issues after an instance reboot. The symptom is that GPUs become inaccessible from their code (e.g. TensorFlow or PyTorch). The issue has been seen on Deep Learning VMs and Notebooks with environment versions as old as M66, and it is fixed in our M74 release.

If you have not experienced this issue, no action is needed. If your GPU is disconnected from your instance after reboot, here are some ways to fix the issue:
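One quick way to tell whether an instance is affected is to ask the NVIDIA driver directly from an SSH session. The sketch below assumes `nvidia-smi` is installed alongside the driver on the instance. The function name `gpu_status` and the `GPU_CHECK_CMD` override are illustrative only and are not part of the patch described in this post.

```shell
#!/bin/sh
# Hypothetical helper: report whether the GPU driver responds.
# GPU_CHECK_CMD is an illustrative override; it defaults to nvidia-smi,
# which is assumed to be installed with the NVIDIA driver.
gpu_status() {
    check_cmd="${GPU_CHECK_CMD:-nvidia-smi}"
    # nvidia-smi exits non-zero when it cannot talk to the driver.
    if $check_cmd >/dev/null 2>&1; then
        echo "gpu-ok"
    else
        echo "gpu-unavailable"
    fi
}

gpu_status
```

If this prints `gpu-unavailable` after a reboot on an instance that has a GPU attached, the instance is likely showing the symptom described above.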

Deep Learning VM users:

Users can use one of the two fixes below to solve the issue:

Fix #1: Use the latest DLVM image (M74 or later) in a new VM instance: We have released the fix in the M74 DLVM image, so instances created from M74 or later are not affected by this issue.

Fix #2: Patch your existing instance if it runs an image older than M74:

Run the following via an SSH session on the affected instance:

gsutil cp gs://dl-platform-public-nvidia/b191551132/restart_patch.sh /tmp/restart_patch.sh

chmod +x /tmp/restart_patch.sh

sudo /tmp/restart_patch.sh

sudo service jupyter restart

This only needs to be done once, and does not need to be rerun each time the instance is rebooted.
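Whether the patch applies depends on the image release being older than M74. As a rough sketch, the comparison can be written as a small shell function. It assumes you already know the release label of your image (for example `M66`) and pass it in by hand; the function name `patch_advice` is hypothetical, and this post does not describe how to look the label up on the instance.

```shell
#!/bin/sh
# Hypothetical helper: given a release label such as "M66",
# say whether it predates the M74 release that contains the fix.
patch_advice() {
    release="${1#[Mm]}"              # drop the leading "M" or "m"
    case "$release" in
        ''|*[!0-9]*)                 # empty or non-numeric remainder
            echo "unknown"
            return 1
            ;;
    esac
    if [ "$release" -lt 74 ]; then
        echo "patch-needed"
    else
        echo "already-fixed"
    fi
}

patch_advice M66
```

A result of `patch-needed` only says the release is older than M74. If the GPU is still accessible after a reboot, no action is needed.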

Notebooks users:

Users can use one of the two fixes below to solve the issue:

Fix #1: Use the instance upgrade feature to schedule an environment upgrade that picks up the latest fix.


Fix #2: Patch your existing Notebook instance if it runs an image older than M74:

Run the following via an SSH session on the affected instance:

gsutil cp gs://dl-platform-public-nvidia/b191551132/restart_patch.sh /tmp/restart_patch.sh

chmod +x /tmp/restart_patch.sh

sudo /tmp/restart_patch.sh

sudo service jupyter restart

This only needs to be done once, and does not need to be rerun each time the instance is rebooted.
