CUBLAS_STATUS_EXECUTION_FAILED in caffe_gpu_gemm


Daffe

Oct 15, 2014, 4:13:59 PM10/15/14
to caffe...@googlegroups.com
Hello,

I randomly run into this error when training on ImageNet:

...
F1015 20:56:12.179136 39375 math_functions.cpp:59] Check failed: status == CUBLAS_STATUS_SUCCESS (13 vs. 0)  CUBLAS_STATUS_EXECUTION_FAILED
*** Check failure stack trace: ***
    @     0x2ba1d161bb7d  google::LogMessage::Fail()
    @     0x2ba1d161dc7f  google::LogMessage::SendToLog()
    @     0x2ba1d161b76c  google::LogMessage::Flush()
    @     0x2ba1d161e51d  google::LogMessageFatal::~LogMessageFatal()
    @           0x461320  caffe::caffe_gpu_gemm<>()
...
Aborted (core dumped)

and line 59 of math_functions.cpp is:

  CUBLAS_CHECK(cublasSgemm(Caffe::cublas_handle(), cuTransB, cuTransA,
      N, M, K, &alpha, B, ldb, A, lda, &beta, C, N));

Is this likely to be related to an NVIDIA driver or hardware problem, e.g. fan speed? Thanks.



Mender

Oct 20, 2014, 9:17:00 AM10/20/14
to caffe...@googlegroups.com
I get this error too, when training my network on my dataset.

Cliff Woolley

Oct 20, 2014, 9:29:21 AM10/20/14
to caffe...@googlegroups.com

It would be helpful to know the arguments passed to cublasSgemm at the time of the failure.

Can you attach a debugger and find out, or is it not easily reproducible?

Thanks,
Cliff

--
You received this message because you are subscribed to the Google Groups "Caffe Users" group.
To unsubscribe from this group and stop receiving emails from it, send an email to caffe-users...@googlegroups.com.
To post to this group, send email to caffe...@googlegroups.com.
To view this discussion on the web visit https://groups.google.com/d/msgid/caffe-users/2c3aa2fb-271e-47f2-9609-4f4820ee4442%40googlegroups.com.
For more options, visit https://groups.google.com/d/optout.

Mender

Oct 21, 2014, 5:18:14 AM10/21/14
to caffe...@googlegroups.com
I trained the net again, and the bug reproduced.

[17938.298821] NVRM: Xid(PCI:0000:01:00): 31, Ch 00000002, engmask 00000101, intr 10000000
F1021 14:40:44.983147 12312 math_functions.cu:32] Check failed: status == CUBLAS_STATUS_SUCCESS (13 vs. 0) CUBLAS_STATUS_EXECUTION_FAILED
cuTransB:1, cuTransA:0, N:363, M:96, K:2809, alpha:1, B:0x4049c0000, ldb:2809, A:0x42c7e7580, lda:2809, beta:1, C:0x404e12200, N:363



To print some logs, I changed caffe_gpu_gemm<float> in math_functions.cu, but I don't know the meaning of these variables:

template <>
void caffe_gpu_gemm<float>(const CBLAS_TRANSPOSE TransA,
    const CBLAS_TRANSPOSE TransB, const int M, const int N, const int K,
    const float alpha, const float* A, const float* B, const float beta,
    float* C) {
  // Note that cublas follows fortran order.
  int lda = (TransA == CblasNoTrans) ? K : M;
  int ldb = (TransB == CblasNoTrans) ? N : K;
  cublasOperation_t cuTransA =
      (TransA == CblasNoTrans) ? CUBLAS_OP_N : CUBLAS_OP_T;
  cublasOperation_t cuTransB =
      (TransB == CblasNoTrans) ? CUBLAS_OP_N : CUBLAS_OP_T;
  // CUBLAS_CHECK(cublasSgemm(Caffe::cublas_handle(), cuTransB, cuTransA,
  //     N, M, K, &alpha, B, ldb, A, lda, &beta, C, N));
  cublasStatus_t status = cublasSgemm(Caffe::cublas_handle(), cuTransB,
      cuTransA, N, M, K, &alpha, B, ldb, A, lda, &beta, C, N);
  CHECK_EQ(status, CUBLAS_STATUS_SUCCESS)
      << caffe::cublasGetErrorString(status)
      << "\ncuTransB:" << cuTransB << ", cuTransA:" << cuTransA
      << ", N:" << N << ", M:" << M << ", K:" << K << ", alpha:" << alpha
      << ", B:" << B << ", ldb:" << ldb << ", A:" << A << ", lda:" << lda
      << ", beta:" << beta << ", C:" << C << ", N:" << N;
}

Cliff Woolley

Oct 21, 2014, 11:34:52 AM10/21/14
to caffe...@googlegroups.com
Ah, okay, interesting.  I hadn't noticed the Xid 31 message in your earlier snapshot (it was there, but I didn't see it).  This is an important hint.  Xid errors are the NVIDIA Linux driver's way of reporting fatal exceptions; http://docs.nvidia.com/deploy/xid-errors/index.html documents them, and its description of Xid 31 specifically is pasted below.  In this particular instance, note that after the exception the Xid is reporting, *later* CUDA API calls (including those from within cuBLAS) will fail, so the CUBLAS_STATUS_EXECUTION_FAILED that you're seeing is an after-the-fact symptom rather than the actual issue itself.
 
While the below doesn't mention it specifically, it's conceivable that this could be a side effect of power system instability.  How certain are you that your power supply is sufficient?
 
--Cliff
 
 

XID 31: Fifo: MMU Error

This event is logged when a fault is reported by the MMU, such as when an illegal address access is made by an applicable unit on the chip. Typically these are application-level bugs, but can also be driver bugs or hardware bugs.

When this event is logged, NVIDIA recommends the following:

  1. Run the application in cuda-gdb or cuda-memcheck, or
  2. Run the application with CUDA_DEVICE_WAITS_ON_EXCEPTION=1 and then attach later with cuda-gdb, or
  3. File a bug if the previous two come back inconclusive to eliminate potential NVIDIA driver or hardware bug.

Note: The cuda-memcheck tool instruments the running application and reports which line of code performed the illegal read.


Mender

Oct 21, 2014, 9:59:02 PM10/21/14
to caffe...@googlegroups.com
Thank you Cliff! You are right. It's an error caused by insufficient power supply.

deepcnn

Oct 22, 2014, 12:03:09 AM10/22/14
to caffe...@googlegroups.com
I faced a similar problem, but when running the runtest. I found that if sudo is lacking, some CUDA errors seem to be returned. So add sudo before caffe on the command line to see whether it helps.

On Thursday, October 16, 2014 at 4:13:59 AM UTC+8, Daffe wrote:

Cliff Woolley

Oct 22, 2014, 1:14:21 AM10/22/14
to caffe...@googlegroups.com

You shouldn't have to run Caffe (or any other regular CUDA app, for that matter) as root to get it to work.  If you do, that sounds like a bug of some kind.

--Cliff


zhen zhou

Mar 26, 2015, 9:11:26 AM3/26/15
to caffe...@googlegroups.com
I met the same problem, but at math_functions.cu:28, when training on my own data. The other information is exactly the same. It is still not solved. I'll try the solution Cliff provided. Let's see.