Hi all,
I am trying to get Caffe working on my machine. Everything seems to be installed correctly, but when running Caffe on an AlexNet model I keep getting an error.
I am running macOS Sierra (10.12) on a MacBook Pro (2.7GHz quad-core Intel Core i7; 16GB DDR3 RAM; NVIDIA GeForce GT 650M with 1GB of GDDR5 memory). I have installed CUDA Toolkit 7.5.27, CUDA Driver 7.5.30 (for macOS Sierra support), and MKL from the Intel Parallel Studio XE 2017.
Caffe is configured without cuDNN, using MKL as the BLAS backend. Furthermore, I've disabled all CUDA architectures except 3.0, since that is the compute capability my GPU supports. I compiled Caffe with Clang 7.3.0 (from the Xcode Command Line Tools 7.3.1), and it built without errors. All tests pass (make test && make runtest).
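For reference, the relevant lines of my Makefile.config look roughly like this (excerpted from memory, so treat it as a sketch; your paths and the exact comments may differ):

```makefile
# cuDNN is disabled (the switch stays commented out):
# USE_CUDNN := 1

# Use MKL as the BLAS backend:
BLAS := mkl

# Build only for compute capability 3.0 (GeForce GT 650M):
CUDA_ARCH := -gencode arch=compute_30,code=sm_30 \
             -gencode arch=compute_30,code=compute_30
```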
Now I'm trying to run Caffe on a model (a slight modification of the included AlexNet). When I set the batch_size of the training/validation data to 3/4, Caffe runs fine. However, when I increase the batch sizes (to 4/5), Caffe crashes with the following error:
I0930 19:56:37.181311 2936206272 net.cpp:283] Network initialization done.
I0930 19:56:37.181493 2936206272 solver.cpp:60] Solver scaffolding done.
I0930 19:56:37.182329 2936206272 caffe.cpp:251] Starting Optimization
I0930 19:56:37.182346 2936206272 solver.cpp:279] Solving AlexNet
I0930 19:56:37.182355 2936206272 solver.cpp:280] Learning Rate Policy: step
F0930 19:56:37.183706 2936206272 syncedmem.cpp:56] Check failed: error == cudaSuccess (30 vs. 0) unknown error
*** Check failure stack trace: ***
@ 0x101baef28 google::LogMessage::Fail()
@ 0x101bae39c google::LogMessage::SendToLog()
@ 0x101bae8fa google::LogMessage::Flush()
@ 0x101bb1caf google::LogMessageFatal::~LogMessageFatal()
@ 0x101baf20f google::LogMessageFatal::~LogMessageFatal()
@ 0x102132474 caffe::SyncedMemory::to_gpu()
@ 0x102131f2e caffe::SyncedMemory::mutable_gpu_data()
@ 0x10210288d caffe::Net<>::ClearParamDiffs()
@ 0x10211e9ad caffe::Solver<>::Step()
@ 0x10211e41d caffe::Solver<>::Solve()
@ 0x101b4be28 train()
@ 0x101b4e4ea main
@ 0x7fffa632c255 start
Abort trap: 6
When digging into the Caffe source code, I found that the following line is causing the error (syncedmem.cpp:56):
CUDA_CHECK(cudaMalloc(&gpu_ptr_, size_));
I am not familiar with CUDA or the Caffe internals, but from this code my impression is that Caffe tries to allocate more memory than my GPU has available, which causes the crash (the network runs fine with a lower batch size, and the error occurs on a malloc).
However, I am not completely convinced of this. Why do I get cudaErrorUnknown (error 30) instead of cudaErrorMemoryAllocation (error 2)? Furthermore, when I watch my GPU's processor and memory usage during a successful run (with the lower batch size), memory usage stays fairly low (10–20% throughout), so why would increasing the batch size by only 1 make my GPU run out of memory completely? (See the attached screenshot of my GPU's stats.)
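To check whether the allocation itself is the problem, I sketched a minimal standalone test from the CUDA runtime docs (the file name and the 64 MB size are my own choices, not anything Caffe uses). It prints the free/total memory reported by the driver and then attempts a single cudaMalloc, so it should show which error code a plain allocation returns outside of Caffe:

```cuda
// alloc_test.cu -- hypothetical minimal repro; compile with:
//   nvcc -arch=sm_30 alloc_test.cu -o alloc_test
#include <cstdio>
#include <cuda_runtime.h>

int main() {
  size_t free_b = 0, total_b = 0;
  cudaError_t err = cudaMemGetInfo(&free_b, &total_b);
  if (err != cudaSuccess) {
    printf("cudaMemGetInfo failed: %s\n", cudaGetErrorString(err));
    return 1;
  }
  printf("free: %zu MB, total: %zu MB\n", free_b >> 20, total_b >> 20);

  // Attempt one modest allocation, analogous to the cudaMalloc
  // that fails in syncedmem.cpp.
  void* p = NULL;
  err = cudaMalloc(&p, (size_t)64 << 20);  // 64 MB
  printf("cudaMalloc: %s\n", cudaGetErrorString(err));
  if (err == cudaSuccess) cudaFree(p);
  return 0;
}
```

My thinking: if this tiny allocation also fails with cudaErrorUnknown, the problem is presumably not the amount of memory but something in my driver/toolkit setup; if it succeeds, the problem is more likely specific to what Caffe allocates.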
My questions are: what is causing this error? Is the memory of my GPU insufficient, or is it something else? And, more importantly, how can I fix it?
Thanks in advance!
Kind regards,
Jonathan