Caffe training error on second GPU, and the GPU is lost


Steven Liu

Dec 30, 2016, 3:22:40 AM
to Caffe Users

Hi,

I have a server with two GPUs (2× GTX 1070).
When I train a network using Caffe on GPU 0 only, it is OK.
But when I train on both GPUs (0 & 1), a CUDA error is encountered.
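
(For context: Caffe's data-parallel multi-GPU training is enabled by passing a device list to the caffe tool, along the lines of the command below; solver.prototxt here is just a placeholder, not my actual file.)

caffe train -solver solver.prototxt -gpu 0,1    # -gpu accepts a comma-separated list or "all"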


F1230 15:49:14.493897 34786 math_functions.cu:26] Check failed: status == CUBLAS_STATUS_SUCCESS (13 vs. 0)  CUBLAS_STATUS_EXECUTION_FAILED
*** Check failure stack trace: ***
F1230 15:49:14.493898 34804 math_functions.cu:26] Check failed: status == CUBLAS_STATUS_SUCCESS (14 vs. 0)  CUBLAS_STATUS_INTERNAL_ERROR
*** Check failure stack trace: ***


nvidia-smi reports an error for GPU 1:

+-----------------------------------------------------------------------------+
| NVIDIA-SMI 367.57                 Driver Version: 367.57                    |
|-------------------------------+----------------------+----------------------+
| GPU  Name        Persistence-M| Bus-Id        Disp.A | Volatile Uncorr. ECC |
| Fan  Temp  Perf  Pwr:Usage/Cap|         Memory-Usage | GPU-Util  Compute M. |
|===============================+======================+======================|
|   0  GeForce GTX 1070    Off  | 0000:83:00.0     Off |                  N/A |
| 46%   60C    P8    10W / 151W |    121MiB /  8113MiB |      0%      Default |
+-------------------------------+----------------------+----------------------+
|   1  GeForce GTX 1070    Off  | 0000:84:00.0     Off |                  N/A |
|ERR!   46C    P0   ERR! / 151W |      2MiB /  8113MiB |    100%      Default |
+-------------------------------+----------------------+----------------------+
                                                                              
+-----------------------------------------------------------------------------+
| Processes:                                                       GPU Memory |
|  GPU       PID  Type  Process name                               Usage      |
|=============================================================================|
|    0     14936    C   /usr/bin/python                                119MiB |
+-----------------------------------------------------------------------------+


After a few minutes, nvidia-smi reports that the GPU is lost, and the system must be rebooted:

Unable to determine the device handle for GPU 0000:84:00.0: GPU is lost.  Reboot the system to recover this GPU
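
(When a card drops off the bus like this, the kernel log usually records an NVIDIA Xid error around the same time; checking it can help tell a driver problem apart from failing hardware or power delivery. This is a generic check, not output from my machine:)

dmesg | grep -i xid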

I guess this problem is caused by the GPU itself.

Can anyone help me?

Thanks in advance!


Steven

Jonathan R. Williford

Jan 2, 2017, 3:01:38 PM
to Caffe Users
What is your make configuration? Perhaps you could try a different BLAS library or (if applicable) disable cuDNN.
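
Both switches live in Makefile.config; as a rough sketch (the exact lines vary a little between Caffe versions):

# Makefile.config (excerpt)
# USE_CUDNN := 1        # leave commented out to build without cuDNN
BLAS := atlas           # alternatives: open (OpenBLAS) or mkl

Rebuild Caffe (make clean && make all) after changing either of them.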

Steven Liu

Jan 5, 2017, 2:32:24 AM
to Caffe Users
Thanks for your advice. Why do you think the BLAS library would cause this?

I already disabled cuDNN in Makefile.config. I use the ATLAS BLAS library.
I guess CUDA GPUDirect (peer-to-peer access between the GPUs) may be necessary.
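
(Two stock tools, nothing Caffe-specific, can show whether the two cards can actually reach each other over peer-to-peer: the topology matrix from nvidia-smi and the P2P sample shipped with the CUDA toolkit.)

nvidia-smi topo -m          # shows how GPU0 and GPU1 are connected (PIX / PHB / SOC ...)
./p2pBandwidthLatencyTest   # from the CUDA samples, if installed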

I still have not solved this problem. I will keep trying.

Steven

Hojjat seyed mousavi

Jan 19, 2017, 5:21:46 PM
to Caffe Users
I have the same issue.

I have two TITAN X cards on Ubuntu 16.04. When I run experiments on one of the two GPUs, I get the same error:
"Unable to determine the device handle for GPU 0000:02:00.0: GPU is lost.  Reboot the system to recover this GPU"

My experiment on the other GPU is still running.

Can anyone help please?




pan tan

Jan 7, 2018, 9:15:37 AM
to Caffe Users
Any solution for this? I'm seeing the same phenomenon.
