Some customers have been experiencing GPU driver issues after an instance reboot: the GPU becomes inaccessible from their code (e.g. TensorFlow or PyTorch). The issue affects Deep Learning VM and Notebooks instances with environment versions as old as M66, and is fixed in the M74 release.
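If you are not sure whether your instance is affected, a quick check from an SSH session is sketched below; the Python one-liners assume PyTorch and/or TensorFlow are installed, as is typical on these images.

# Driver-level check: an error here means the driver did not come back after the reboot.
nvidia-smi

# Framework-level checks (run whichever matches your installed framework):
python3 -c "import torch; print(torch.cuda.is_available())"
python3 -c "import tensorflow as tf; print(tf.config.list_physical_devices('GPU'))"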
If you have not experienced this issue, no action is needed. If your GPU is disconnected from your instance after reboot, here are some ways to fix the issue:
Deep Learning VM users:
Users can apply either of the two fixes below to resolve the issue:
Fix #1: Use the latest DLVM image (M74 or later) in a new VM instance. The fix is included in the M74 release, so instances created from it are no longer affected by this issue.
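As a rough sketch, a new instance on an M74-or-later image can be created along the lines below; the instance name, zone, image family, and accelerator type are placeholders to replace with your own values.

gcloud compute instances create example-dlvm \
    --zone=us-central1-a \
    --image-family=common-cu110 \
    --image-project=deeplearning-platform-release \
    --accelerator=type=nvidia-tesla-t4,count=1 \
    --maintenance-policy=TERMINATE \
    --metadata=install-nvidia-driver=True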
Fix #2: Patch your existing instance running an image older than M74:
Run the following via an SSH session on the affected instance:
gsutil cp gs://dl-platform-public-nvidia/b191551132/restart_patch.sh /tmp/restart_patch.sh
chmod +x /tmp/restart_patch.sh
sudo /tmp/restart_patch.sh
sudo service jupyter restart
This only needs to be done once, and does not need to be rerun each time the instance is rebooted.
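To confirm the patch took effect, you can reboot the instance once and check that the GPU is visible again after it comes back up, for example:

sudo reboot
# After the instance is back up, reconnect over SSH and run:
nvidia-smi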
Notebooks users:
Users can apply either of the two fixes below to resolve the issue:
Fix #1: Use the instance upgrade feature to schedule an environment upgrade to M74 or later, which picks up the fix (see the screenshot below).
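If you prefer the command line to the Cloud Console, the same upgrade can be driven from the Cloud SDK; the commands below are a sketch (the instance name and location are placeholders, and availability of the notebooks command group depends on your gcloud version).

# Check whether an upgrade is available, then apply it:
gcloud notebooks instances is-upgradeable example-notebook --location=us-central1-a
gcloud notebooks instances upgrade example-notebook --location=us-central1-a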
Fix #2: Patch your existing Notebooks instance running an image older than M74:
Run the following via an SSH session on the affected instance:
gsutil cp gs://dl-platform-public-nvidia/b191551132/restart_patch.sh /tmp/restart_patch.sh
chmod +x /tmp/restart_patch.sh
sudo /tmp/restart_patch.sh
sudo service jupyter restart
This only needs to be done once, and does not need to be rerun each time the instance is rebooted.
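After the restart, you can confirm the Jupyter service came back up before reopening the notebook from the Notebooks page, for example:

# "jupyter" is the standard service name on these images.
sudo service jupyter status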