"error == cudaSuccess (30 vs. 0)" — What causes this and how to solve it?

Jonathan

Sep 30, 2016, 2:18:59 PM
to Caffe Users
Hi all,

I am trying to get Caffe working on my machine. Everything seems to be installed correctly, but when running Caffe on an AlexNet model I keep getting an error.

I am running macOS Sierra (10.12) on a MacBook Pro (2.7GHz quad-core Intel Core i7; 16GB DDR3 RAM; NVIDIA GeForce GT 650M with 1GB of GDDR5 memory). I have installed CUDA Toolkit 7.5.27, CUDA Driver 7.5.30 (for macOS Sierra support), and MKL from Intel Parallel Studio XE 2017.

Caffe is configured without cuDNN, using MKL as the BLAS library. Furthermore, I've disabled all CUDA architectures except 3.0, since that is the compute capability my GPU supports. I compiled Caffe with Clang 7.3.0 (from the Xcode Command Line Tools 7.3.1). Caffe built without errors, and all tests pass (make test && make runtest).

Now I'm trying to run Caffe on a model (a slight modification of the included AlexNet). When I set the batch_size of the training/validation data layers to 3/4, Caffe runs fine. However, when I increase the batch sizes (to 4/5), Caffe crashes with the following error:
I0930 19:56:37.181311 2936206272 net.cpp:283] Network initialization done.
I0930 19:56:37.181493 2936206272 solver.cpp:60] Solver scaffolding done.
I0930 19:56:37.182329 2936206272 caffe.cpp:251] Starting Optimization
I0930 19:56:37.182346 2936206272 solver.cpp:279] Solving AlexNet
I0930 19:56:37.182355 2936206272 solver.cpp:280] Learning Rate Policy: step
F0930 19:56:37.183706 2936206272 syncedmem.cpp:56] Check failed: error == cudaSuccess (30 vs. 0)  unknown error
*** Check failure stack trace: ***
    @        0x101baef28  google::LogMessage::Fail()
    @        0x101bae39c  google::LogMessage::SendToLog()
    @        0x101bae8fa  google::LogMessage::Flush()
    @        0x101bb1caf  google::LogMessageFatal::~LogMessageFatal()
    @        0x101baf20f  google::LogMessageFatal::~LogMessageFatal()
    @        0x102132474  caffe::SyncedMemory::to_gpu()
    @        0x102131f2e  caffe::SyncedMemory::mutable_gpu_data()
    @        0x10210288d  caffe::Net<>::ClearParamDiffs()
    @        0x10211e9ad  caffe::Solver<>::Step()
    @        0x10211e41d  caffe::Solver<>::Solve()
    @        0x101b4be28  train()
    @        0x101b4e4ea  main
    @     0x7fffa632c255  start
Abort trap: 6

When digging into the Caffe source code, I found that the following line is causing the error (syncedmem.cpp:56):
CUDA_CHECK(cudaMalloc(&gpu_ptr_, size_));

I am not familiar with CUDA or Caffe, but from this line my impression is that Caffe is trying to allocate more memory than my GPU has, which causes the crash (the network runs fine with a lower batch size, and the failure occurs in a malloc).
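To sanity-check that impression, here is a back-of-the-envelope estimate (a sketch in Python, using the layer output shapes and the well-known ~61M parameter count of the *reference* AlexNet; my modified model may differ, and this ignores allocator and driver overhead):

```python
# Back-of-the-envelope GPU memory estimate for the reference AlexNet.
# My model is a slight modification, so treat these numbers as
# indicative only.
BYTES_PER_FLOAT = 4
MB = 2 ** 20

def mb(n_floats):
    return n_floats * BYTES_PER_FLOAT / MB

# The reference AlexNet has roughly 61 million parameters; Caffe keeps
# both the weights and their gradients (diffs) on the GPU.
param_mb = 2 * mb(61_000_000)

# Output sizes (in floats) of the main layers, per image.
acts_per_image = {
    "conv1": 96 * 55 * 55, "pool1": 96 * 27 * 27,
    "conv2": 256 * 27 * 27, "pool2": 256 * 13 * 13,
    "conv3": 384 * 13 * 13, "conv4": 384 * 13 * 13,
    "conv5": 256 * 13 * 13, "pool5": 256 * 6 * 6,
    "fc6": 4096, "fc7": 4096, "fc8": 1000,
}
# Each activation blob also carries a diff of the same size.
per_image_mb = 2 * mb(sum(acts_per_image.values()))

for batch in (3, 4):
    print("batch_size %d: ~%.0f MB" % (batch, param_mb + batch * per_image_mb))
```

On these assumptions, going from batch size 3 to 4 adds only about 6 MB on top of roughly 465 MB of parameters and gradients, which is nowhere near the card's 1 GB limit by itself (though the GPU also drives my display).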

However, I am not completely convinced. Why do I get cudaErrorUnknown (error 30) instead of cudaErrorMemoryAllocation (error 2)? Furthermore, when I watch my GPU's processor and memory usage during a successful run (with the lower batch size), memory usage stays quite low (10–20% throughout), so why would increasing the batch size by only 1 make my GPU run out of memory completely? (See the attached screenshot of my GPU's stats.)
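To make the distinction concrete, a small sketch of the two error codes in question (the numeric value 30 is taken straight from my log above, and 2 for cudaErrorMemoryAllocation from the CUDA documentation):

```python
# The cudaError_t values involved here. A failed allocation should
# normally surface as code 2, not the code 30 that I am seeing.
CUDA_ERRORS = {
    0: "cudaSuccess",
    2: "cudaErrorMemoryAllocation",
    30: "cudaErrorUnknown",
}

def explain(code):
    # Map the numeric code that glog prints ("30 vs. 0") back to a name.
    return "(%d vs. 0) %s" % (code, CUDA_ERRORS.get(code, "unrecognized"))

print(explain(30))  # the failure in my log
print(explain(2))   # what a plain out-of-memory failure would report
```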

My questions are: what is causing this error — is the memory of my GPU insufficient, or is it something else? And, more importantly: (how) can I solve this error?

Thanks in advance!

Kind regards,
Jonathan
Attachment: gpu-stats.png