I'm playing with the dataset from the Kaggle 'plankton' challenge using Caffe. When I run 'caffe train', the program hangs at 'load mean file'. Here is what I did:
- Resized the training images to 256x256. Instead of using Caffe's built-in resize routine, I manually resized each image without changing its aspect ratio, then padded it so the output is square even when the input is not.
- Following the ImageNet example, created a txt file containing each training file's path, name, and target label.
- Created the LMDB.
- Ran compute_image_mean, similar to the ImageNet example.
- Ran 'caffe train' with solver.prototxt.
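For reference, the resize-and-pad step in the first bullet is roughly the following (a sketch using Pillow; `resize_and_pad`, the centering, and the constant fill color are my own choices, not part of any Caffe tool):

```python
from PIL import Image

def resize_and_pad(img, size=256, fill=0):
    """Resize an image to fit within size x size while keeping its aspect
    ratio, then pad with a constant color so the result is exactly square."""
    w, h = img.size
    scale = size / max(w, h)
    new_w = max(1, round(w * scale))
    new_h = max(1, round(h * scale))
    resized = img.resize((new_w, new_h), Image.BILINEAR)
    # Paste the resized image centered on a square canvas.
    canvas = Image.new(img.mode, (size, size), fill)
    canvas.paste(resized, ((size - new_w) // 2, (size - new_h) // 2))
    return canvas
```

The padding keeps thin, elongated plankton from being stretched the way a plain 256x256 resize would stretch them.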
Then I found the program hanging while loading the mean file:
========== begin log ================
....
I1101 21:36:03.284312 12459 layer_factory.hpp:76] Creating layer data
I1101 21:36:03.284350 12459 net.cpp:106] Creating Layer data
I1101 21:36:03.284356 12459 net.cpp:411] data -> data
I1101 21:36:03.284365 12459 net.cpp:411] data -> label
I1101 21:36:03.284374 12459 data_transformer.cpp:25] Loading mean file from: /home/weiliu/largedisk/projects/plankton/mean.binaryproto
======= end of log ====================
The program just hangs there with no further output and no CPU usage. I can see the GPU memory change by running the 'nvidia-smi' command:
======== nvidia-smi ================
+-----------------------------------------------------------------------------+
| Processes: GPU Memory |
| GPU PID Type Process name Usage |
|=============================================================================|
| 0 1168 G /usr/bin/X 99MiB |
| 0 2621 C ...argedisk/packages/caffe/build/tools/caffe 572MiB |
| 0 2748 G ...s-passed-by-fd --v8-snapshot-passed-by-fd 56MiB |
+-----------------------------------------------------------------------------+
======== end of nvidia-smi ==============
Even Ctrl-C doesn't work; I have to force-kill the terminal.
I followed this link to convert the protobuf mean file to a numpy array, and got a 1x3x256x256 array which looks fine.
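The check I ran on the decoded array was roughly this (a sketch; `mean_looks_sane` is my own helper name, and the pycaffe decode is shown only in comments since it needs a Caffe install):

```python
import numpy as np

# Decoding the .binaryproto was done with pycaffe, along these lines:
#   blob = caffe.proto.caffe_pb2.BlobProto()
#   blob.ParseFromString(open("mean.binaryproto", "rb").read())
#   arr = caffe.io.blobproto_to_array(blob)   # -> shape (1, C, H, W)

def mean_looks_sane(arr, size=256):
    """Basic sanity checks on a decoded mean blob: a 4-D (1, C, H, W)
    array with the expected spatial size and pixel values in 0..255."""
    return (arr.ndim == 4
            and arr.shape[0] == 1
            and arr.shape[2:] == (size, size)
            and 0.0 <= float(arr.min())
            and float(arr.max()) <= 255.0)
```

The array I got passes these checks, which is why I don't think the mean file itself is corrupt.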
Occasionally (once or twice), I saw an error while reading the mean file. I don't remember the exact error since I cannot reproduce it, but it pointed to line 98 of data_reader.cpp, which has the following code:
------------- code ---------------
// Check no additional readers have been created. This can happen if
// more than one net is trained at a time per process, whether single
// or multi solver. It might also happen if two data layers have same
// name and same source.
CHECK_EQ(new_queue_pairs_.size(), 0);
----------- end of code ---------------------
Can anyone point me in a direction for debugging this? I can provide more information. I appreciate your help.