DL Platform Team,
I'm trying to run Vertex AI Training jobs for a PyTorch model using Lightning. My base container is
gcr.io/deeplearning-platform-release/pytorch-cu121.py310:m121

I see the following error raised by the Lightning Trainer, ultimately traceable to
torch._C._cuda_init(): "The NVIDIA driver on your system is too old (found version 11040). Please update your GPU driver by downloading and installing a new version..."
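For reference, the integer in that message is CUDA's packed version encoding (major * 1000 + minor * 10), so 11040 corresponds to a driver that only supports CUDA 11.4, while the image ships a 12.1 runtime. A quick sketch to decode it (the helper name is my own, not part of any API):

```python
# Hypothetical helper: decode the packed integer version that CUDA reports
# (major * 1000 + minor * 10), e.g. 11040 -> "11.4", 12010 -> "12.1".
def decode_cuda_version(v: int) -> str:
    major = v // 1000
    minor = (v % 1000) // 10
    return f"{major}.{minor}"

print(decode_cuda_version(11040))  # 11.4
print(decode_cuda_version(12010))  # 12.1
```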
I've successfully run the training routine on a Vertex AI Workbench instance using a similar base image.
Looking at /usr/local in the DLC image, I see the following:
lrwxrwxrwx 1 root root 22 Nov 10 2023 cuda -> /etc/alternatives/cuda
lrwxrwxrwx 1 root root 25 Nov 10 2023 cuda-12 -> /etc/alternatives/cuda-12
drwxr-xr-x 1 root root 4096 Nov 10 2023 cuda-12.1

Whereas in the Vertex AI Workbench instance
pytorch-2-2-cu121-v20240417-debian-11-py310 it's simply
lrwxrwxrwx 1 root root 21 Apr 21 14:58 cuda -> /usr/local/cuda-12.1/
drwxr-xr-x 17 root root 4096 Apr 21 14:59 cuda-12.1

I assume this is the source of the error.
Why isn't the cuda-12.1 directory symlinked directly to "cuda" in gcr.io/deeplearning-platform-release/pytorch-cu121.py310:m121? The image is released by Google Cloud as a "cu121" image, so I'm surprised that CUDA 12.1 isn't the default, and that other CUDA paths exist in the container at all.
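In the meantime, one workaround I'm considering (untested; assuming the /etc/alternatives indirection is the culprit) is repointing the symlink in a derived image, e.g. `FROM gcr.io/deeplearning-platform-release/pytorch-cu121.py310:m121` followed by `RUN ln -sfn /usr/local/cuda-12.1 /usr/local/cuda`. Demonstrated here on a scratch directory rather than the real paths:

```shell
# Sketch of the symlink repointing, run against a throwaway directory
# instead of /usr/local (the real fix would go in a derived Dockerfile).
mkdir -p /tmp/cuda-demo/cuda-12.1
# -sfn: symbolic, force-replace any existing link, don't dereference it
ln -sfn /tmp/cuda-demo/cuda-12.1 /tmp/cuda-demo/cuda
readlink /tmp/cuda-demo/cuda
```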