Error using M121 CUDA DLC release: NVIDIA Driver too old


Jason Brancazio

May 21, 2024, 11:58:28 AM
to google-dl-platform
DL Platform Team,

I'm trying to run Vertex AI Training jobs for a PyTorch model using Lightning. My base container is gcr.io/deeplearning-platform-release/pytorch-cu121.py310:m121

I see the following error raised by the Lightning Trainer, ultimately traceable to torch._C._cuda_init(): "The NVIDIA driver on your system is too old (found version 11040). Please update your GPU driver by downloading and installing a new version..."
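For reference, here's a quick check that can be run in the same environment to see the mismatch (a sketch; 11040 is how the CUDA driver API reports a driver that only supports up to CUDA 11.4):

import torch

print("torch:", torch.__version__)
print("built for CUDA:", torch.version.cuda)         # "12.1" for the cu121 image
print("cuda available:", torch.cuda.is_available())  # False when the host driver is too old
# torch._C._cuda_init() raises the "too old" error when the host driver's
# supported CUDA runtime is older than the one PyTorch was compiled against.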

I've successfully run the training routine on a Vertex AI Workbench instance using a similar base image.

Looking at /usr/local in the DLC image, I see the following:

lrwxrwxrwx 1 root root   22 Nov 10  2023 cuda -> /etc/alternatives/cuda
lrwxrwxrwx 1 root root   25 Nov 10  2023 cuda-12 -> /etc/alternatives/cuda-12
drwxr-xr-x 1 root root 4096 Nov 10  2023 cuda-12.1


Whereas on the Vertex AI Workbench instance (image pytorch-2-2-cu121-v20240417-debian-11-py310) it's simply

lrwxrwxrwx  1 root root   21 Apr 21 14:58 cuda -> /usr/local/cuda-12.1/
drwxr-xr-x 17 root root 4096 Apr 21 14:59 cuda-12.1


I assume this is the source of the error.

Why isn't the cuda 12.1 directory symlinked to "cuda" in gcr.io/deeplearning-platform-release/pytorch-cu121.py310:m121?

The image is released by Google Cloud as a "cu121" image, so I'm surprised that CUDA 12.1 isn't the default and that other CUDA paths even exist in the container.
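For what it's worth, a quick way to check what the link ultimately resolves to inside the image (a sketch):

import os

print(os.path.realpath("/usr/local/cuda"))     # follows /etc/alternatives/cuda
print(os.path.realpath("/usr/local/cuda-12"))  # follows /etc/alternatives/cuda-12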

Jason Brancazio

May 21, 2024, 12:03:52 PM
to google-dl-platform
Actually, if I look at /etc/alternatives in that image, I see

lrwxrwxrwx 1 root root   20 Nov 10  2023 cuda -> /usr/local/cuda-12.1
lrwxrwxrwx 1 root root   20 Nov 10  2023 cuda-12 -> /usr/local/cuda-12.1


So now I don't really have a theory for why torch._C._cuda_init() reports that the driver is too old. 
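The missing piece is probably the host driver itself. Inside the running container, something like this (a sketch; it assumes nvidia-smi is exposed in the container by the NVIDIA runtime) shows what the driver actually supports:

import subprocess

# Driver version reported by the host's NVIDIA driver:
result = subprocess.run(
    ["nvidia-smi", "--query-gpu=driver_version", "--format=csv,noheader"],
    capture_output=True, text=True,
)
print("driver version:", result.stdout.strip())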

Jason Brancazio

May 21, 2024, 12:12:06 PM
to google-dl-platform
Sorry for the continued messages - I now suspect the error is traceable to the version of the NVIDIA driver on the machines that run Vertex AI Training jobs.

On the Workbench instance I described in my earlier message, I ran the following:

docker run --gpus all -it --entrypoint /bin/bash gcr.io/deeplearning-platform-release/pytorch-cu121.py310:m121

then once inside the container, I ran
python

and in the python REPL I ran
import torch
torch._C._cuda_init()

I didn't receive the "too old" error.
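One takeaway from this test: a small preflight check at the top of the training entrypoint (a sketch, not anything in the DLC image itself) would surface the problem more clearly than the traceback from torch._C._cuda_init():

import torch

# Fail fast with an explicit message if the host driver can't support the CUDA
# runtime this wheel was built against (12.1 in the cu121 image).
if not torch.cuda.is_available():
    raise RuntimeError(
        f"CUDA is unavailable: this PyTorch build targets CUDA {torch.version.cuda}; "
        "the host NVIDIA driver is likely too old for it."
    )
print("Using GPU:", torch.cuda.get_device_name(0))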



Jason Brancazio

May 21, 2024, 1:32:28 PM
to google-dl-platform
For anyone reading this thread in the future: try a G2 machine on Vertex AI Training. As of May 2024, the drivers on those machines support CUDA 12.4, which is new enough for the cu121 image.
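In case it helps, this is roughly how a G2 job can be requested with the Vertex AI Python SDK (a sketch; the project, bucket, and train.py entrypoint are placeholders, not something from this thread):

from google.cloud import aiplatform

aiplatform.init(
    project="my-project",            # placeholder
    location="us-central1",
    staging_bucket="gs://my-bucket", # placeholder
)

job = aiplatform.CustomContainerTrainingJob(
    display_name="lightning-training",
    container_uri="gcr.io/deeplearning-platform-release/pytorch-cu121.py310:m121",
    command=["python", "train.py"],  # hypothetical entrypoint
)

job.run(
    replica_count=1,
    machine_type="g2-standard-12",   # G2 machines come with NVIDIA L4 GPUs
    accelerator_type="NVIDIA_L4",
    accelerator_count=1,
)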