caffe-ssd (weiliu89) and mobilenet-ssd (chuanqi305) training problems

crete

Dec 2, 2018, 10:02:41 PM

to Caffe Users

I have been stuck on this problem for days. I have a GPU build of Caffe (cloned from weiliu89's caffe-ssd) that compiled successfully on Ubuntu 18.04, with NVIDIA driver 410.48 (GeForce GTX 1060, laptop version), CUDA 9.2, and the matching cuDNN installed. I followed the repo's instructions: I downloaded the VGGNet model, downloaded the VOC07 and VOC12 datasets, and created the LMDB files. When I started training with "python examples/ssd/ssd_pascal.py", I got the following error:

math_functions.cpp:250] Check failed: a <= b (0 vs. -1.19209e-07)

*** Check failure stack trace: ***
@ 0x7f191a3480cd google::LogMessage::Fail()
@ 0x7f191a349f33 google::LogMessage::SendToLog()
@ 0x7f191a347c28 google::LogMessage::Flush()
@ 0x7f191a34a999 google::LogMessageFatal::~LogMessageFatal()
@ 0x7f191aa4d1c7 caffe::caffe_rng_uniform<>()
@ 0x7f191aa893c8 caffe::SampleBBox()
@ 0x7f191aa89720 caffe::GenerateSamples()
@ 0x7f191aa89970 caffe::GenerateBatchSamples()
@ 0x7f191abb3892 caffe::AnnotatedDataLayer<>::load_batch()
@ 0x7f191ab60a7a caffe::BasePrefetchingDataLayer<>::InternalThreadEntry()
@ 0x7f191aa1d205 caffe::InternalThread::entry()
@ 0x7f190e933bcd (unknown)
@ 0x7f18fbccc6db start_thread
@ 0x7f1918a8b88f clone

Aborted (core dumped)
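The number in the failed check, -1.19209e-07, matches -2^-23 (about -1.1920929e-07), which is exactly the gap between 1.0 and the next larger 32-bit float. A minimal sketch using only the standard library reproduces it: if a sampled box width rounds to one float32 step above 1.0, then "1 - width" becomes slightly negative, and a check of the form a <= b with a = 0 fails. That SampleBBox() computes an upper bound as 1 minus a sampled width is an assumption suggested by the stack trace, not something confirmed in this thread.

```python
import struct

def f32_from_bits(bits):
    """Interpret a 32-bit integer as an IEEE-754 float32 value."""
    return struct.unpack("<f", struct.pack("<I", bits))[0]

def f32_bits(x):
    """Return the 32-bit pattern of x after rounding it to float32."""
    return struct.unpack("<I", struct.pack("<f", x))[0]

one = 1.0
# The smallest float32 strictly greater than 1.0 (one step above it).
width = f32_from_bits(f32_bits(one) + 1)

# Assumed upper bound for a random box offset, rounded to float32: 1 - width.
upper = f32_from_bits(f32_bits(one - width))

print(width)                 # just above 1.0
print(upper)                 # about -1.19209e-07, the value in the log
print(upper == -2.0 ** -23)  # True
print(0.0 <= upper)          # False: the condition CHECK_LE(a, b) rejects
```

In other words, a box that is mathematically at most the full image width can, after float32 rounding, come out one step too wide.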

I searched for solutions, commented out line 250 of math_functions.cpp, which is "CHECK_LE(a, b)", and recompiled caffe-ssd. That made the error go away, but training got stuck at the following step and never progressed:

I1203 10:54:52.244547 3753 solver.cpp:294] Solving VGG_VOC0712_SSD_300x300_train
I1203 10:54:52.244570 3753 solver.cpp:295] Learning Rate Policy: multistep
I1203 10:54:52.247480 3753 blocking_queue.cpp:50] Data layer prefetch queue empty
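Commenting out the check removes the guard but does not make the bounds valid. A plausible explanation for the silent hang, offered as an assumption rather than something verified in this thread, is that Boost's uniform_real draws result = u * (b - a) + a and retries until result < b. When b < a that test can never pass, so the prefetch thread would spin forever and the queue would stay empty. The sketch below mimics such a retry loop; the function name rejection_uniform and its max_tries cap are hypothetical, added only so the example terminates.

```python
import random

def rejection_uniform(a, b, rng, max_tries=1000):
    """Mimic a sampler that draws, then retries until result < b.

    Returns (value, tries) on success, or (None, max_tries) if no draw was
    accepted. max_tries is a hypothetical cap added so this sketch stops.
    """
    for tries in range(1, max_tries + 1):
        u = rng.random()              # u in [0, 1)
        result = u * (b - a) + a      # scale u onto the range between a and b
        if result < b:                # acceptance test
            return result, tries
    return None, max_tries

rng = random.Random(0)
print(rejection_uniform(0.0, 0.5, rng))             # accepted on the first try
print(rejection_uniform(0.0, -1.1920929e-07, rng))  # (None, 1000): never accepted
```

With a = 0 and a negative b, every draw lands between b and 0, so "result < b" is never true; without a cap the loop simply never returns.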

I searched for solutions again but did not find much useful information. I also prepared a custom dataset for mobilenet-ssd and followed the instructions in chuanqi305's repo, and when I started training I got the same problem. I tested the MNIST and CIFAR training examples from the official Caffe tutorials, and they ran smoothly. Has anyone run into the same problem and solved it? Any hint toward a solution would be welcome, and I will try it. Many thanks.

Mc Neill Ivan

Feb 3, 2019, 8:32:16 AM

to Caffe Users

Hi @crete,

Were you able to solve the issue? I am also facing the same issue. Any ideas would be helpful.

Thanks,

Ivan

Tom Deblauwe

Feb 13, 2019, 4:24:09 PM

to Caffe Users

Hello,

I have the EXACT same problem. I followed this guide:

https://tolotra.com/2018/09/15/how-to-retrain-ssd-mobilenet-for-real-time-object-detection-using-a-raspberry-pi-and-movidius-neural-compute-stick

I have already tried both CPU and GPU, I'm also using an NVIDIA GTX 1060, and I applied the same fix you mention to the CHECK_LE macro. I really have the same problem and can't find a solution...

I have already written a script to check that my LMDB is OK, and it seems to be:

from __future__ import print_function  # same print behavior on Python 2 and 3

import lmdb
import caffe  # must be the weiliu89 (SSD) build; stock Caffe has no AnnotatedDatum

# Open the LMDB read-only so the check cannot modify or lock the training data
lmdb_env = lmdb.open("trainval_lmdb", readonly=True, lock=False)

datum = caffe.proto.caffe_pb2.AnnotatedDatum()
with lmdb_env.begin() as lmdb_txn:
    for key, value in lmdb_txn.cursor():
        print("Key: {}".format(key))
        datum.ParseFromString(value)
        for ann in datum.annotation_group:
            print("Annotation ", ann.group_label)
            for a in ann.annotation:
                print("  instance_id:", a.instance_id)
                # Printed order: xmin, xmax, ymin, ymax, label (normalized coordinates)
                print("  bbox:", a.bbox.xmin, a.bbox.xmax, a.bbox.ymin, a.bbox.ymax, a.bbox.label)

I get bounding boxes in my annotations, so the data looks good.
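Printing boxes shows that annotations exist, but not that they are well-formed. A stricter check, sketched here under the assumption of SSD-style normalized coordinates in [0, 1], flags out-of-range values and boxes whose minimum is not below their maximum. The helper name bbox_problems is hypothetical; note that it takes its arguments as xmin, ymin, xmax, ymax, which differs from the print order used in the script above.

```python
def bbox_problems(xmin, ymin, xmax, ymax):
    """Return a list of human-readable problems with one normalized bbox.

    Assumes every coordinate should lie in [0, 1]. An empty list means
    no problem was found.
    """
    problems = []
    for name, v in (("xmin", xmin), ("ymin", ymin), ("xmax", xmax), ("ymax", ymax)):
        if not 0.0 <= v <= 1.0:
            problems.append("{} out of [0, 1]: {}".format(name, v))
    if xmin >= xmax:
        problems.append("xmin >= xmax ({} >= {})".format(xmin, xmax))
    if ymin >= ymax:
        problems.append("ymin >= ymax ({} >= {})".format(ymin, ymax))
    return problems

# Possible use inside the LMDB loop (requires the SSD Caffe build):
#     issues = bbox_problems(a.bbox.xmin, a.bbox.ymin, a.bbox.xmax, a.bbox.ymax)
#     if issues:
#         print(key, a.instance_id, issues)

print(bbox_problems(0.1, 0.2, 0.5, 0.6))  # []
print(bbox_problems(0.7, 0.2, 0.5, 1.2))  # two problems: ymax range, xmin >= xmax
```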

So if anyone has a solution, please post it :)

Best regards,

Tom,

Tom Deblauwe

Feb 14, 2019, 5:33:12 AM

to Caffe Users

Hi!

If I apply the fix mentioned here, training starts!

https://github.com/weiliu89/caffe/issues/669#issuecomment-339542120

Best regards,

Tom,
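The exact change in the linked comment is not reproduced in this thread, so the following is not that fix. It only illustrates one possible kind of guard, as an assumption: tolerate a range that is reversed by floating-point noise (such as 0 vs. -1.19209e-07) by collapsing it to a single point, while still failing loudly on a genuinely reversed range. The name safe_uniform and the tolerance tol are hypothetical.

```python
import random

def safe_uniform(a, b, rng, tol=1e-6):
    """Draw uniformly between a and b, tolerating tiny rounding errors.

    If b is below a by no more than tol, treat the range as the single
    point a. Larger inversions still raise, so real bugs are not hidden.
    tol is a hypothetical choice.
    """
    if b < a:
        if a - b > tol:
            raise ValueError("invalid range: a={} > b={}".format(a, b))
        b = a  # collapse the degenerate range instead of looping forever
    return a + rng.random() * (b - a)

rng = random.Random(42)
print(safe_uniform(0.0, -1.1920929e-07, rng))    # 0.0
print(0.0 <= safe_uniform(0.0, 0.3, rng) < 0.3)  # True
```

Compared with simply deleting the check, this keeps the sampler from accepting an impossible range while still letting a one-ulp rounding error through.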

Mc Neill Ivan

Feb 14, 2019, 6:59:50 AM

to Caffe Users

Hi Tom,

Just to confirm: do the changes in $CAFFE_ROOT/src/caffe/util/math_functions.cpp also solve the "Data layer prefetch queue empty" issue?

Thanks,

Ivan

Mc Neill Ivan

Feb 14, 2019, 2:10:52 PM

to Caffe Users

Hi Tom,

Thanks a lot! The changes you suggested did fix the "Data layer prefetch queue empty" issue.

Regards,

Ivan

Tom Deblauwe

Feb 14, 2019, 5:18:28 PM

to Caffe Users

Yes indeed! And with the fix, the training results are good.
Glad to help,
Best regards
Tom

Mark

Apr 25, 2020, 2:18:40 AM

to Caffe Users

Hey Tom,

I am also following the tutorial you mentioned above, and I am also getting "Data layer prefetch queue empty... Killed". I have already edited $caffe/src/caffe/util/math_functions.cpp, but I still cannot create any snapshot/mobilenet_iter_2035.caffemodel, so I could not continue to the next step of the tutorial. Could you help me solve this issue?

What checks did you perform? I ran your Python script to check my LMDB, and the output looked something like this:

(' instance_id:', 0)
(' bbox:', 0.5465949773788452, 0.6953405141830444, 0.17891374230384827, 0.28753992915153503, 0)
('Key: {}', '00000006_Images/1A9R00.jpg')
('Annotation ', 3)
(' instance_id:', 0)
(' bbox:', 0.6666666865348816, 0.8333333134651184, 0.0702875405550003, 0.3865814805030823, 0)
(' instance_id:', 1)
(' bbox:', 0.6146953701972961, 0.740143358707428, 0.00319488812237978, 0.17571884393692017, 0)
(' instance_id:', 2)
(' bbox:', 0.5035842061042786, 0.616487443447113, 0.00319488812237978, 0.17891374230384827, 0)
(' instance_id:', 3)
(' bbox:', 0.3333333432674408, 0.45519712567329407, 0.00319488812237978, 0.23961661756038666, 0)
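Because the script prints Python 2 tuples, its output can be parsed back and checked without Caffe. A minimal sketch, assuming the script's print order (xmin, xmax, ymin, ymax, label) and normalized coordinates; the helper name bbox_line_ok is hypothetical.

```python
import ast

def bbox_line_ok(line):
    """Parse one "(' bbox:', xmin, xmax, ymin, ymax, label)" line and
    return True if the box lies in [0, 1] with min < max on both axes."""
    _tag, xmin, xmax, ymin, ymax, _label = ast.literal_eval(line.strip())
    in_range = all(0.0 <= v <= 1.0 for v in (xmin, xmax, ymin, ymax))
    return in_range and xmin < xmax and ymin < ymax

# The five bbox lines from the output above.
lines = [
    "(' bbox:', 0.5465949773788452, 0.6953405141830444, 0.17891374230384827, 0.28753992915153503, 0)",
    "(' bbox:', 0.6666666865348816, 0.8333333134651184, 0.0702875405550003, 0.3865814805030823, 0)",
    "(' bbox:', 0.6146953701972961, 0.740143358707428, 0.00319488812237978, 0.17571884393692017, 0)",
    "(' bbox:', 0.5035842061042786, 0.616487443447113, 0.00319488812237978, 0.17891374230384827, 0)",
    "(' bbox:', 0.3333333432674408, 0.45519712567329407, 0.00319488812237978, 0.23961661756038666, 0)",
]
print([bbox_line_ok(l) for l in lines])  # [True, True, True, True, True]
```

All five boxes shown pass, so these particular records look well-formed; that alone does not rule out problems elsewhere in the dataset.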