DL Platform Team,
I'm trying to run Vertex AI Training jobs for a PyTorch model using Lightning. My base container is
gcr.io/deeplearning-platform-release/pytorch-cu121.py310:m121

I see the following error raised by the Lightning Trainer, ultimately traceable to
torch._C._cuda_init(): "The NVIDIA driver on your system is too old (found version 11040). Please update your GPU driver by downloading and installing a new version..."
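For reference, the integer in that message is CUDA's packed version encoding (major * 1000 + minor * 10), so 11040 corresponds to a driver that only supports CUDA 11.4, while the image ships a 12.1 runtime. A quick sketch to decode it (the helper name is my own, not part of any API):

```python
# Hypothetical helper: decode the packed integer version that CUDA reports
# (major * 1000 + minor * 10), e.g. 11040 -> "11.4", 12010 -> "12.1".
def decode_cuda_version(v: int) -> str:
    major = v // 1000
    minor = (v % 1000) // 10
    return f"{major}.{minor}"

print(decode_cuda_version(11040))  # 11.4
print(decode_cuda_version(12010))  # 12.1
```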
I've successfully run the training routine on a Vertex AI Workbench instance using a similar base image.
Looking at /usr/local in the DLC image, I see the following:
lrwxrwxrwx 1 root root 22 Nov 10 2023 cuda -> /etc/alternatives/cuda
lrwxrwxrwx 1 root root 25 Nov 10 2023 cuda-12 -> /etc/alternatives/cuda-12
drwxr-xr-x 1 root root 4096 Nov 10 2023 cuda-12.1

Whereas in the Vertex AI Workbench instance
pytorch-2-2-cu121-v20240417-debian-11-py310 it's simply
lrwxrwxrwx 1 root root 21 Apr 21 14:58 cuda -> /usr/local/cuda-12.1/
drwxr-xr-x 17 root root 4096 Apr 21 14:59 cuda-12.1

I assume this is the source of the error.
Why isn't the cuda-12.1 directory symlinked directly to "cuda" in gcr.io/deeplearning-platform-release/pytorch-cu121.py310:m121? The image is released by Google Cloud as a "cu121" image, so I'm surprised that CUDA 12.1 isn't the default, and that other CUDA paths exist in the container at all.
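In the meantime, one workaround I'm considering (untested; assuming the /etc/alternatives indirection is the culprit) is repointing the symlink in a derived image, e.g. `FROM gcr.io/deeplearning-platform-release/pytorch-cu121.py310:m121` followed by `RUN ln -sfn /usr/local/cuda-12.1 /usr/local/cuda`. Demonstrated here on a scratch directory rather than the real paths:

```shell
# Sketch of the symlink repointing, run against a throwaway directory
# instead of /usr/local (the real fix would go in a derived Dockerfile).
mkdir -p /tmp/cuda-demo/cuda-12.1
# -sfn: symbolic, force-replace any existing link, don't dereference it
ln -sfn /tmp/cuda-demo/cuda-12.1 /tmp/cuda-demo/cuda
readlink /tmp/cuda-demo/cuda
```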