'caffe train' hangs at reading mean proto

Wei Liu

unread,
Nov 1, 2015, 11:18:39 PM11/1/15
to Caffe Users
I'm playing with the dataset from the Kaggle 'plankton' challenge using Caffe. When I run 'caffe train', the program hangs at 'load mean file'. Here is what I did: 

- Resized the training images to 256x256. Instead of using Caffe's built-in resize routine, I manually resized each image without changing its aspect ratio, then padded it so the output image is square even when the input is not. 
- Following the ImageNet example, created a txt file containing the training file paths and the target labels. 
- Created the LMDB. 
- Ran compute_image_mean as in the ImageNet example. 
- Ran 'caffe train' with solver.prototxt. 
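The resize-then-pad step above can be sketched like this (a minimal numpy sketch; it assumes the image is already resized so its longer side equals the target, and the pad value of 0 is an assumption, not what I actually used):

```python
import numpy as np

def pad_to_square(img, size=256, pad_value=0):
    """Pad an HxWxC image (already resized so max(H, W) == size)
    to a size x size square, keeping the content centered."""
    h, w = img.shape[:2]
    out = np.full((size, size) + img.shape[2:], pad_value, dtype=img.dtype)
    top = (size - h) // 2
    left = (size - w) // 2
    out[top:top + h, left:left + w] = img
    return out

# e.g. a 256x192 RGB image becomes 256x256 with equal padding left and right
img = np.full((256, 192, 3), 255, dtype=np.uint8)
padded = pad_to_square(img)
```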

Then I found the program hanging when loading the mean file: 
========== begin log  ================
....
I1101 21:36:03.284312 12459 layer_factory.hpp:76] Creating layer data
I1101 21:36:03.284350 12459 net.cpp:106] Creating Layer data
I1101 21:36:03.284356 12459 net.cpp:411] data -> data
I1101 21:36:03.284365 12459 net.cpp:411] data -> label
I1101 21:36:03.284374 12459 data_transformer.cpp:25] Loading mean file from: /home/weiliu/largedisk/projects/plankton/mean.binaryproto
======= end of log ====================

The program just hangs there with no further output and no CPU usage. I can see the GPU memory change by running 'nvidia-smi':
======== nvidia-smi ================
+-----------------------------------------------------------------------------+
| Processes:                                                       GPU Memory |
|  GPU       PID  Type  Process name                               Usage      |
|=============================================================================|
|    0      1168    G   /usr/bin/X                                      99MiB |
|    0      2621    C   ...argedisk/packages/caffe/build/tools/caffe   572MiB |
|    0      2748    G   ...s-passed-by-fd --v8-snapshot-passed-by-fd    56MiB |
+-----------------------------------------------------------------------------+

======== end of nvidia-smi ==============

Even ctrl-C doesn't work; I have to force-kill the terminal. 

I followed this link to convert the protobuf to a numpy array, and got a 1x3x256x256 array which looks fine. 
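For reference, a quick sanity check on the decoded mean array (the zero array below is a stand-in; in practice it comes from parsing mean.binaryproto as in the linked post, yielding the 1x3x256x256 NxCxHxW shape mentioned above):

```python
import numpy as np

# Stand-in for the decoded mean blob; in practice this comes from
# parsing mean.binaryproto and has shape 1x3x256x256 (NxCxHxW).
mean = np.zeros((1, 3, 256, 256), dtype=np.float32)

# Drop the singleton batch dimension and move channels last (HxWxC)
# so the mean image can be inspected or plotted directly.
mean_img = mean.squeeze(0).transpose(1, 2, 0)

# Per-channel means (Caffe's tools typically store channels in BGR order).
per_channel = mean_img.mean(axis=(0, 1))
```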

Occasionally (once or twice), I saw an error happen at reading the mean file. I don't remember the exact error since I cannot reproduce it, but it points to line 98 of data_reader.cpp, which contains the following code:
------------- code ---------------
// Check no additional readers have been created. This can happen if
// more than one net is trained at a time per process, whether single
// or multi solver. It might also happen if two data layers have same
// name and same source.
CHECK_EQ(new_queue_pairs_.size(), 0);
----------- end of code ---------------------

Can anyone point me in a direction for debugging? I can provide more information. I appreciate your help. 

Wei Liu

unread,
Nov 2, 2015, 8:15:21 AM11/2/15
to Caffe Users
I found that during initialization, the training net can successfully read the mean proto file. It's the test net that freezes while reading the same file. 

I even made a second copy of the same file with a different filename and used it in the test net's data layer, but the hang persists. 

Any thoughts on where to start looking? Thank you very much. 

Wei Liu

unread,
Nov 2, 2015, 8:23:53 AM11/2/15
to Caffe Users
I see what the problem is. I didn't bother creating a 'test_lmdb' and just used 'train_lmdb' in the test data layer. Once I made a 'test_lmdb' (even identical to train_lmdb, just a separate copy with a different name), the program was able to proceed. 
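For anyone hitting the same symptom: the hang comes from pointing both phases' data layers at the same LMDB (matching the "two data layers have same name and same source" comment in data_reader.cpp). A sketch of the relevant net.prototxt data layers with distinct sources per phase (paths and batch sizes are placeholders):

------------- prototxt ---------------
layer {
  name: "data"
  type: "Data"
  top: "data"
  top: "label"
  include { phase: TRAIN }
  transform_param { mean_file: "mean.binaryproto" }
  data_param {
    source: "train_lmdb"
    batch_size: 64
    backend: LMDB
  }
}
layer {
  name: "data"
  type: "Data"
  top: "data"
  top: "label"
  include { phase: TEST }
  transform_param { mean_file: "mean.binaryproto" }
  data_param {
    source: "test_lmdb"   # must not be the same LMDB directory as TRAIN
    batch_size: 64
    backend: LMDB
  }
}
----------- end of prototxt ---------------------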

It is not about the mean.binaryproto file, even though it appears that way. 

Thanks,

patric zhao

unread,
Nov 3, 2015, 2:52:16 AM11/3/15
to Caffe Users
Thanks a lot!  This trick works for me!

Hosna Sattar

unread,
Nov 1, 2016, 10:52:01 AM11/1/16
to Caffe Users
Thanks. Works for me too.