Unable to detect CUDA via TensorFlow/PyTorch after restarting DLVM


palpitation

Jun 20, 2021, 1:19:37 PM
to google-dl-platform
This issue appeared when I restarted my cloud notebook server today. It can be reproduced with the steps below:
  1. Create a Google Cloud Notebook server with TensorFlow or PyTorch and a GPU.

  2. After starting the server, open the Python console. The CUDA device is available at this point. (Attachment: Screen Shot 2021-06-21 at 1.14.02 AM.png)

  3. Restart the server and open the notebook again. PyTorch cannot detect any GPU devices. (Attachment: Screen Shot 2021-06-21 at 1.17.06 AM.png)

  4. The nvidia-smi command still works fine. (Attachment: Screen Shot 2021-06-21 at 1.18.37 AM.png)


This issue can also be reproduced on the TensorFlow DLVM.

palpitation

Jun 20, 2021, 1:20:41 PM
to google-dl-platform

The full error message is:

>>> import torch
>>> torch.cuda.is_available()
/opt/conda/lib/python3.7/site-packages/torch/cuda/__init__.py:52: UserWarning: CUDA initialization: CUDA unknown error - this may be due to an incorrectly set up environment, e.g. changing env variable CUDA_VISIBLE_DEVICES after program start. Setting the available devices to be zero. (Triggered internally at  /opt/conda/conda-bld/pytorch_1614378098133/work/c10/cuda/CUDAFunctions.cpp:109.)
  return torch._C._cuda_getDeviceCount() > 0
False
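
The symptom above (nvidia-smi works, but the framework reports no CUDA device) can be checked in one pass with a small script. This is only a minimal sketch, not an official diagnostic: the helper names and the result labels are made up for illustration, and it assumes `nvidia-smi` is on the PATH and that PyTorch or TensorFlow 2.x may or may not be installed.

```python
import shutil
import subprocess


def driver_sees_gpu():
    """Return True if `nvidia-smi -L` succeeds and lists at least one GPU."""
    if shutil.which("nvidia-smi") is None:
        return False
    try:
        result = subprocess.run(
            ["nvidia-smi", "-L"], capture_output=True, text=True, timeout=30
        )
    except (OSError, subprocess.TimeoutExpired):
        return False
    return result.returncode == 0 and "GPU" in result.stdout


def torch_sees_gpu():
    """Return torch.cuda.is_available(), or None if PyTorch is not installed."""
    try:
        import torch
    except ImportError:
        return None
    return torch.cuda.is_available()


def tensorflow_sees_gpu():
    """Return True if TensorFlow lists a GPU, or None if TensorFlow is not installed."""
    try:
        import tensorflow as tf
    except ImportError:
        return None
    return len(tf.config.list_physical_devices("GPU")) > 0


def diagnose(driver_ok, framework_ok):
    """Map the two observations to a short label (labels are hypothetical)."""
    if framework_ok is None:
        return "framework-missing"
    if framework_ok:
        return "ok"
    if driver_ok:
        # The case reported in this thread: driver fine, CUDA init fails.
        return "driver-ok-framework-blind"
    return "driver-blind"
```

Calling `diagnose(driver_sees_gpu(), torch_sees_gpu())` on an affected instance would be expected to give `"driver-ok-framework-blind"`, which separates this problem from a missing or broken driver.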

Harveen Chadha

Jun 28, 2021, 5:24:01 PM
to google-dl-platform
Is there a fix for this problem? All 15 instances in our account have stopped working. We may need to switch to AWS if this issue remains unresolved.

Hoa Mai

Jun 28, 2021, 5:30:01 PM
to google-dl-platform
Hi, we released a hotfix on Thursday afternoon. Please update to the latest images, and let us know if the issue persists.

Harveen Chadha

Jun 28, 2021, 5:41:22 PM
to google-dl-platform
Hi,

Can you please list the steps to update to the latest image?

This is how I tried to do it and it did not work:

1. sudo apt-get update
2. sudo apt upgrade
3. sudo ldconfig
4. restart

I am attaching a screenshot.

Screenshot 2021-06-29 at 3.10.19 AM.png

Hoa Mai

Jun 28, 2021, 5:59:02 PM
to google-dl-platform
Sure,

Step 1: Go to the Notebooks UI.
Step 2: Click on your instance's name.
Step 3: Press the "upgrade" button at the top right corner.

Attached are some screenshots to help guide you. Keep us updated on your status.

Cheers,

Hoa

Screen Shot 2021-06-28 at 2.53.44 PM.png
Screen Shot 2021-06-28 at 2.56.48 PM.png

Harveen Chadha

Jun 28, 2021, 6:16:45 PM
to google-dl-platform
Hi,

Unfortunately, I am not using Notebooks; I am using virtual machines (VMs) with Deep Learning images (Google Deep Learning VM).

Last Tuesday, all of a sudden, the GPUs on all 15 instances stopped working with the same error: the GPU device count is zero. On some instances the GPU was not detected at all. After reinstalling the driver (using the script at /opt/deeplearning/install_driver.sh), the GPU was visible again but still not usable, as the device count in PyTorch kept returning 0. (We also tested TensorFlow; no GPU was detected.)

This happened only after we attached V100 GPUs to all 15 instances. Even after switching back to A100 or T4, the same error occurs.

We now have 128 GPUs allocated, but we cannot use even one because of this error. Inference for end users is running on CPUs, which is incredibly slow. We would really appreciate it if you could suggest something for VMs; we can't afford to manually update CUDA and drivers on every single VM.
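
For a fleet like the 15 instances described above, the GPU check could be scripted instead of run by hand on each VM. The following is only a rough sketch under several assumptions: the gcloud CLI is installed and authenticated, all instances are in one zone, `python3` on the remote side resolves to an interpreter with PyTorch installed, and the instance names and zone used in the usage note are hypothetical placeholders.

```python
import subprocess

# Remote one-liner; assumes PyTorch is importable by the remote python3.
REMOTE_CHECK = 'python3 -c "import torch; print(torch.cuda.is_available())"'


def build_check_command(instance, zone):
    """Build a `gcloud compute ssh` command that runs the check on one VM."""
    return [
        "gcloud", "compute", "ssh", instance,
        f"--zone={zone}",
        f"--command={REMOTE_CHECK}",
    ]


def check_fleet(instances, zone, run=subprocess.run):
    """Run the check on every instance and map instance name to True/False.

    `run` is injectable so the logic can be exercised without real VMs.
    """
    results = {}
    for name in instances:
        proc = run(build_check_command(name, zone), capture_output=True, text=True)
        results[name] = proc.returncode == 0 and proc.stdout.strip().endswith("True")
    return results
```

For example, `check_fleet(["gpu-vm-01", "gpu-vm-02"], "us-central1-a")` would return a dictionary showing which of those (hypothetical) instances can still see CUDA from PyTorch.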

Hanchao Liu

Jun 28, 2021, 6:25:07 PM
to google-dl-platform
Hi,

In this case, unfortunately, you would need to create new VMs using the latest released VM images. If you create them from the UI or use image family names from the CLI, the latest images are used automatically.
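
As an illustration of creating a VM from an image family on the CLI, here is a minimal sketch that assembles a `gcloud compute instances create` command. The defaults are assumptions chosen for the example, not values taken from this thread: the `pytorch-latest-gpu` family, a single V100 accelerator, and the `install-nvidia-driver=True` metadata key. The instance name and zone are placeholders to replace with your own.

```python
import shlex


def build_create_command(
    name,
    zone,
    image_family="pytorch-latest-gpu",            # assumed family name
    accelerator="type=nvidia-tesla-v100,count=1",  # assumed GPU type and count
):
    """Build a gcloud command that creates a Deep Learning VM from an image family.

    Referencing a family rather than a specific image means gcloud resolves it
    to the newest image in that family at creation time.
    """
    return [
        "gcloud", "compute", "instances", "create", name,
        f"--zone={zone}",
        f"--image-family={image_family}",
        "--image-project=deeplearning-platform-release",
        "--maintenance-policy=TERMINATE",  # GPU VMs cannot live-migrate
        f"--accelerator={accelerator}",
        "--metadata=install-nvidia-driver=True",
    ]


def as_shell(command):
    """Render the argument list as a copy-pasteable shell line."""
    return " ".join(shlex.quote(part) for part in command)
```

Printing `as_shell(build_create_command("my-dlvm", "us-central1-a"))` gives a command line that can be reviewed before running it.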

Harveen Chadha

Jun 28, 2021, 6:32:42 PM
to google-dl-platform
Unfortunately, this is not an enterprise-level solution.

You want me to set up thousands of libraries again on all 15 instances? Can you estimate how much manpower that would take? You are one of the top cloud providers in the world. We were happily using your services, and suddenly one day you decided to break the instances by pushing updates without my permission, and now you are telling me that I need to set up all the instances again.

This is totally not acceptable. 

Hoa Mai

Jun 28, 2021, 7:04:20 PM
to Harveen Chadha, google-dl-platform
Hi Harveen,

I can assure you that you will not have to reconfigure all 15 instances again. 

We are working on a script that you can run to address this issue and should have it ready soon. In the meantime, I can help you directly. Let me know a good time for you, and I can set up a short video call between the two of us to resolve your issue immediately.

Thanks for understanding,

Hoa


Hoa Mai

Jun 30, 2021, 1:49:34 PM
to google-dl-platform
Hi Harveen,

We released a public solution for your use case. Let me know if you continue to encounter any issues:


Thanks!

Harveen Chadha

Jun 30, 2021, 2:04:41 PM
to google-dl-platform
I can confirm this is working. Unfortunately, we lost a lot of traffic because of this!


Thanks!