Out of memory when training with GPU but works fine with CPU.

Kathy Weiyi Li

Sep 8, 2016, 4:49:19 PM
to Caffe Users
I was trying to learn SegNet, but when I train ANY model on the GPU I always get an "out of memory" error:
I0908 06:45:47.229077  3659 net.cpp:247] Network initialization done.
I0908 06:45:47.229081  3659 net.cpp:248] Memory required for data: 1043082268
I0908 06:45:47.229259  3659 solver.cpp:42] Solver scaffolding done.
I0908 06:45:47.229348  3659 solver.cpp:250] Solving VGG_ILSVRC_16_layer
I0908 06:45:47.229353  3659 solver.cpp:251] Learning Rate Policy: step
F0908 06:45:47.533745  3659 syncedmem.cpp:51] Check failed: error == cudaSuccess (2 vs. 0)  out of memory
*** Check failure stack trace: ***
    @     0x7f8ff8ceadaa  (unknown)
    @     0x7f8ff8ceace4  (unknown)
    @     0x7f8ff8cea6e6  (unknown)
    @     0x7f8ff8ced687  (unknown)
    @     0x7f8ff903073a  caffe::SyncedMemory::mutable_gpu_data()
    @     0x7f8ff8ff2393  caffe::Blob<>::mutable_gpu_diff()
    @     0x7f8ff912eadd  caffe::ConvolutionLayer<>::Backward_gpu()
    @     0x7f8ff9111efc  caffe::Net<>::BackwardFromTo()
    @     0x7f8ff9112141  caffe::Net<>::Backward()
    @     0x7f8ff9106f4d  caffe::Solver<>::Step()
    @     0x7f8ff910786f  caffe::Solver<>::Solve()
    @           0x4086c8  train()
    @           0x406c61  main
    @     0x7f8ff81fcf45  (unknown)
    @           0x40720d  (unknown)
    @              (nil)  (unknown)
Aborted (core dumped)

I tried reducing the batch size in the prototxt files to 1, but the error still appears.
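
For reference, the data layer in my training prototxt now looks roughly like this. I'm using the DenseImageData layer from the SegNet caffe fork, as in the tutorial; the layer name and source path below are just placeholders, not my exact values:

layer {
  name: "data"
  type: "DenseImageData"
  top: "data"
  top: "label"
  dense_image_data_param {
    source: "/path/to/train.txt"   # placeholder for my training list
    batch_size: 1                  # reduced from the tutorial default
    shuffle: true
  }
}
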
I ran nvidia-smi to check the status of the GPU; the output is:

Thu Sep  8 06:39:13 2016      
+------------------------------------------------------+                      
| NVIDIA-SMI 352.63     Driver Version: 352.63         |                      
|-------------------------------+----------------------+----------------------+
| GPU  Name        Persistence-M| Bus-Id        Disp.A | Volatile Uncorr. ECC |
| Fan  Temp  Perf  Pwr:Usage/Cap|         Memory-Usage | GPU-Util  Compute M. |
|===============================+======================+======================|
|   0  Quadro M4000M       Off  | 0000:01:00.0      On |                  N/A |
| N/A   42C    P0    25W / 100W |    229MiB /  4087MiB |      0%      Default |
+-------------------------------+----------------------+----------------------+
                                                                               
+-----------------------------------------------------------------------------+
| Processes:                                                       GPU Memory |
|  GPU       PID  Type  Process name                               Usage      |
|=============================================================================|
|    0      1252    G   /usr/bin/X                                     154MiB |
|    0      2208    G   compiz                                          59MiB |
+-----------------------------------------------------------------------------+

But if I train with the CPU instead, this problem doesn't occur.
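
The only change I make to switch between CPU and GPU is the solver_mode line in my solver.prototxt (excerpt below; the net path is a placeholder and I've left out the other fields):

net: "segnet_train.prototxt"   # placeholder path to my training prototxt
solver_mode: GPU               # training runs without errors when this is set to CPU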

Can anyone help me solve this problem?

Thank you!


