Hi,
I have a server with two GPUs (2× GTX 1070).
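For reference, I launch multi-GPU training with Caffe's standard -gpu flag, something like this (the solver path is just a placeholder):

./build/tools/caffe train --solver=solver.prototxt --gpu 0,1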
When I train a network using Caffe on GPU 0 alone, it is OK.
But when I train on both GPUs (0 and 1), a CUDA error is encountered:
F1230 15:49:14.493897 34786 math_functions.cu:26] Check failed: status == CUBLAS_STATUS_SUCCESS (13 vs. 0) CUBLAS_STATUS_EXECUTION_FAILED
*** Check failure stack trace: ***
F1230 15:49:14.493898 34804 math_functions.cu:26] Check failed: status == CUBLAS_STATUS_SUCCESS (14 vs. 0) CUBLAS_STATUS_INTERNAL_ERROR
*** Check failure stack trace: ***
nvidia-smi reports an error for GPU 1:
+-----------------------------------------------------------------------------+
| NVIDIA-SMI 367.57 Driver Version: 367.57 |
|-------------------------------+----------------------+----------------------+
| GPU Name Persistence-M| Bus-Id Disp.A | Volatile Uncorr. ECC |
| Fan Temp Perf Pwr:Usage/Cap| Memory-Usage | GPU-Util Compute M. |
|===============================+======================+======================|
| 0 GeForce GTX 1070 Off | 0000:83:00.0 Off | N/A |
| 46% 60C P8 10W / 151W | 121MiB / 8113MiB | 0% Default |
+-------------------------------+----------------------+----------------------+
| 1 GeForce GTX 1070 Off | 0000:84:00.0 Off | N/A |
|ERR! 46C P0 ERR! / 151W | 2MiB / 8113MiB | 100% Default |
+-------------------------------+----------------------+----------------------+
+-----------------------------------------------------------------------------+
| Processes: GPU Memory |
| GPU PID Type Process name Usage |
|=============================================================================|
| 0 14936 C /usr/bin/python 119MiB |
+-----------------------------------------------------------------------------+
After a few minutes, nvidia-smi reports the GPU as lost, and the system has to be rebooted to recover it:
Unable to determine the device handle for GPU 0000:84:00.0: GPU is lost. Reboot the system to recover this GPU
I suspect this problem is caused by GPU 1 itself, since the errors only appear when it is used and it then drops off the bus.
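To separate a hardware fault from a Caffe problem, I plan to run a minimal standalone cuBLAS test against GPU 1 alone. This is only a sketch (the file name, matrix size, and messages are my own); it issues the same kind of SGEMM call that Caffe's math_functions.cu wraps:

// gpu1_test.cu -- minimal cuBLAS sanity check for the suspect GPU.
// Build: nvcc gpu1_test.cu -lcublas -o gpu1_test
#include <cstdio>
#include <cuda_runtime.h>
#include <cublas_v2.h>

int main() {
    const int n = 1024;                      // arbitrary matrix size
    if (cudaSetDevice(1) != cudaSuccess) {   // target GPU 1 only
        printf("cudaSetDevice(1) failed\n");
        return 1;
    }
    float *a = nullptr, *b = nullptr, *c = nullptr;
    cudaMalloc(&a, n * n * sizeof(float));
    cudaMalloc(&b, n * n * sizeof(float));
    cudaMalloc(&c, n * n * sizeof(float));
    cudaMemset(a, 0, n * n * sizeof(float));
    cudaMemset(b, 0, n * n * sizeof(float));

    cublasHandle_t handle;
    if (cublasCreate(&handle) != CUBLAS_STATUS_SUCCESS) {
        printf("cublasCreate failed on GPU 1\n");
        return 1;
    }
    const float alpha = 1.0f, beta = 0.0f;
    // The same kind of GEMM call that fails inside Caffe.
    cublasStatus_t s = cublasSgemm(handle, CUBLAS_OP_N, CUBLAS_OP_N,
                                   n, n, n, &alpha, a, n, b, n, &beta, c, n);
    cudaError_t e = cudaDeviceSynchronize(); // surface async execution errors
    printf("SGEMM status: %d, sync status: %s\n", (int)s, cudaGetErrorString(e));

    cublasDestroy(handle);
    cudaFree(a); cudaFree(b); cudaFree(c);
    return 0;
}

If this small program also triggers CUBLAS_STATUS_EXECUTION_FAILED or makes the GPU fall off the bus, that would point at the card, its power cabling, or the PCIe slot rather than at Caffe.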
Can anyone help me?
Thanks in advance!
Steven