Getting Error while installing Nvidia Driver on GCP VM with image c0-deeplearning-common-gpu-v20240128-debian-11-py310


Krunal Doshi

May 9, 2024, 2:20:14 AM
to google-dl-platform
We created a VM in GCP using the image c0-deeplearning-common-gpu-v20240128-debian-11-py310 (Debian 11, Python 3.10, with CUDA 11.8 preinstalled).

We ran into an issue where the NVIDIA driver was not able to detect the GPU. We were told to run the following script:
sudo /opt/deeplearning/install-driver.sh

While running this script, we got the errors below:

ERROR: An error occurred while performing the step: “Building kernel modules”. See
/var/log/nvidia-installer.log for details.

ERROR: An error occurred while performing the step: “Checking to see whether the nvidia kernel module was
successfully built”. See /var/log/nvidia-installer.log for details.

ERROR: The nvidia kernel module was not created.

ERROR: Installation has failed. Please see the file ‘/var/log/nvidia-installer.log’ for details. You may find
suggestions on fixing installation problems in the README available on the Linux driver download page at
www.nvidia.com.


nvidia-installer.log
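
Since the installer output only points at the log, a quick way to find the underlying build failure is to pull the lines that mention an error. This is a minimal sketch, not part of the image: `show_installer_errors` is a hypothetical helper name, and the default path is the one printed in the messages above. Reading the log may need sudo depending on its permissions.

```shell
#!/bin/sh
# Hypothetical helper: print every line of an installer log that mentions
# "error" in any letter case, prefixed with its line number.
# The default path is the one named in the installer output above.
show_installer_errors() {
    log_file="${1:-/var/log/nvidia-installer.log}"
    if [ ! -r "$log_file" ]; then
        echo "cannot read $log_file" >&2
        return 1
    fi
    # -n: show line numbers; -i: match ERROR, Error and error alike,
    # because compiler messages are usually lowercase.
    grep -n -i 'error' "$log_file"
}
```

Calling `show_installer_errors` with no argument reads the default log; pass a path to inspect a copy of the log instead.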

William Grisaitis

May 22, 2024, 4:34:37 PM
to google-dl-platform
i experienced the same issue and error. 

this was fixed in the M121 image (release notes).

i think it was caused by an incompatibility between the linux kernel version (5.10.0-29-cloud-amd64) and installing nvidia dkms... but the new environment still uses linux kernel 5.10.0-29. 
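
To see whether an instance is on the kernel release mentioned here, the running release can be compared against it. This is only a diagnostic sketch: `kernel_is_version` is a hypothetical helper name, and a match does not confirm the kernel is the cause, which is only a guess in this thread.

```shell
#!/bin/sh
# Hypothetical helper: return 0 if a kernel release string belongs to the
# given version, e.g. "5.10.0-29-cloud-amd64" belongs to "5.10.0-29".
kernel_is_version() {
    release="$1"
    version="$2"
    case "$release" in
        # Exact match, or the version followed by a "-flavour" suffix.
        "$version"|"$version"-*) return 0 ;;
        *) return 1 ;;
    esac
}

# uname -r prints the release string of the running kernel.
if kernel_is_version "$(uname -r)" "5.10.0-29"; then
    echo "running the kernel release mentioned in this thread"
else
    echo "running a different kernel: $(uname -r)"
fi
```

The trailing `-` in the pattern keeps a release such as 5.10.0-290 from being mistaken for 5.10.0-29.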

someone on reddit said downgrading their kernel to 5.10.0-28 fixed things, but i couldn't figure out how to do that on the Vertex AI image. i tried modifying /etc/default/grub but that didn't work.

anyway: try manually upgrading your instance environment. docs for that here. 