Caffe training error on second GPU, and the GPU is lost


Steven Liu

Dec 30, 2016, 3:22:40 AM
to Caffe Users

Hi,

I have a server with two GPUs (2× GTX 1070).
When I train a network using Caffe on GPU 0 only, it is OK.
But when I train on both GPUs (0 & 1), a CUDA error is encountered.
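
(For context: Caffe's data-parallel multi-GPU training is enabled by passing a device list to the caffe tool, along the lines of the command below; solver.prototxt here is just a placeholder, not my actual file.)

caffe train -solver solver.prototxt -gpu 0,1    # -gpu accepts a comma-separated list or "all"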


F1230 15:49:14.493897 34786 math_functions.cu:26] Check failed: status == CUBLAS_STATUS_SUCCESS (13 vs. 0)  CUBLAS_STATUS_EXECUTION_FAILED
*** Check failure stack trace: ***
F1230 15:49:14.493898 34804 math_functions.cu:26] Check failed: status == CUBLAS_STATUS_SUCCESS (14 vs. 0)  CUBLAS_STATUS_INTERNAL_ERROR
*** Check failure stack trace: ***


nvidia-smi reports an error for GPU 1:

+-----------------------------------------------------------------------------+
| NVIDIA-SMI 367.57                 Driver Version: 367.57                    |
|-------------------------------+----------------------+----------------------+
| GPU  Name        Persistence-M| Bus-Id        Disp.A | Volatile Uncorr. ECC |
| Fan  Temp  Perf  Pwr:Usage/Cap|         Memory-Usage | GPU-Util  Compute M. |
|===============================+======================+======================|
|   0  GeForce GTX 1070    Off  | 0000:83:00.0     Off |                  N/A |
| 46%   60C    P8    10W / 151W |    121MiB /  8113MiB |      0%      Default |
+-------------------------------+----------------------+----------------------+
|   1  GeForce GTX 1070    Off  | 0000:84:00.0     Off |                  N/A |
|ERR!   46C    P0   ERR! / 151W |      2MiB /  8113MiB |    100%      Default |
+-------------------------------+----------------------+----------------------+
                                                                              
+-----------------------------------------------------------------------------+
| Processes:                                                       GPU Memory |
|  GPU       PID  Type  Process name                               Usage      |
|=============================================================================|
|    0     14936    C   /usr/bin/python                                119MiB |
+-----------------------------------------------------------------------------+


After a few minutes, nvidia-smi reports that the GPU is lost, and the system must be rebooted:

Unable to determine the device handle for GPU 0000:84:00.0: GPU is lost.  Reboot the system to recover this GPU
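
(When a card drops off the bus like this, the kernel log usually records an NVIDIA Xid error around the same time; checking it can help tell a driver problem apart from failing hardware or power delivery. This is a generic check, not output from my machine:)

dmesg | grep -i xid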

I guess this problem is caused by the GPU itself.

Can anyone help me?

Thanks in advance!


Steven

Jonathan R. Williford

Jan 2, 2017, 3:01:38 PM
to Caffe Users
What is your make configuration? Perhaps you could try a different BLAS library or (if applicable) disable cuDNN.
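
Both switches live in Makefile.config; as a rough sketch (the exact lines vary a little between Caffe versions):

# Makefile.config (excerpt)
# USE_CUDNN := 1        # leave commented out to build without cuDNN
BLAS := atlas           # alternatives: open (OpenBLAS) or mkl

Rebuild Caffe (make clean && make all) after changing either of them.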

Steven Liu

Jan 5, 2017, 2:32:24 AM
to Caffe Users
Thanks for your advice. Why do you think the BLAS library would cause this?

I already disabled cuDNN in Makefile.config. I use the ATLAS BLAS library.
I guess CUDA GPUDirect (peer-to-peer access between the GPUs) may be necessary.
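
(Two stock tools, nothing Caffe-specific, can show whether the two cards can actually reach each other over peer-to-peer: the topology matrix from nvidia-smi and the P2P sample shipped with the CUDA toolkit.)

nvidia-smi topo -m          # shows how GPU0 and GPU1 are connected (PIX / PHB / SOC ...)
./p2pBandwidthLatencyTest   # from the CUDA samples, if installed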

I still have not solved this problem. I will keep trying.

Steven

Hojjat seyed mousavi

Jan 19, 2017, 5:21:46 PM
to Caffe Users
I have the same issue.

I have two TITAN X cards on Ubuntu 16.04. When I run experiments on one of the two GPUs, I get the same error:
"Unable to determine the device handle for GPU 0000:02:00.0: GPU is lost.  Reboot the system to recover this GPU"

My experiment on the other GPU is still running.

Can anyone help please?




pan tan

Jan 7, 2018, 9:15:37 AM
to Caffe Users
Any solution for this? I'm seeing the same phenomenon.
