caffe-ssd (weiliu89) and mobilenet-ssd (chuanqi305): Undefined problems at training


Tamas Nemes

May 4, 2020, 9:31:06 AM
to Caffe Users
I want to re-train the MobileNet-SSD provided by chuanqi305 (GitHub) on my own dataset. I have CPU-only caffe (GitHub) installed and have already created my LMDBs. But when I start training (by running 'caffe' from 'tools'), the program aborts without printing any error message that would tell me what the problem is. Any idea what could cause it to fail?

  • The output of the training process (it looks slightly different on every run, but always ends the same way):
...
I0504 06:17:07.864032 49845 net.cpp:100] Creating Layer conv11/dw/relu
I0504 06:17:07.864058 49845 net.cpp:434] conv11/dw/relu <- conv11/dw
I0504 06:17:07.864082 49845 net.cpp:395] conv11/dw/relu -> conv11/dw (in-place)
I0504 06:17:07.864109 49845 net.cpp:150] Setting up conv11/dw/relu
I0504 06:17:07.864133 49845 net.cpp:157] Top shape: 24 512 19 19 (4435968)
I0504 06:17:07.864158 49845 net.cpp:165] Memory required for data: 3523031072
I0504 06:17:07.864182 49845 layer_factory.hpp:77] Creating layer conv11
I0504 06:17:07.864212 49845 net.cpp:100] Creating Layer conv11
I0504 06:17:07.864236 49845 net.cpp:434] conv11 <- conv11/dw
I0504 06:17:07.864284 49845 net.cpp:408] conv11 -> conv11
I0504 06:17:07.866336 49845 net.cpp:150] Setting up conv11
@     0x7fc23ce246db  start_thread
@     0x7fc241add88f  clone
Aborted (core dumped)
  • I checked my LMDB using the Python script from this thread. Could an error in my database be the cause? The output reads:
...
Key: {} b'00013697_Images/2011_002492.jpg'
Annotation  19
  instance_id: 0
  bbox: 0.0020000000949949026 0.9279999732971191 0.26726725697517395 0.9699699878692627 0
Key: {} b'00013698_Images/2011_004465.jpg'
Annotation  15
  instance_id: 0
  bbox: 0.20108695328235626 0.7119565010070801 0.15600000321865082 0.9539999961853027 0
  instance_id: 1
  bbox: 0.5380434989929199 1.0 0.515999972820282 0.8019999861717224 0
Key: {} b'00013699_Images/2011_004709.jpg'
Annotation  15
  instance_id: 0
  bbox: 0.42800000309944153 0.7099999785423279 0.19733333587646484 1.0 0
This looks pretty normal to me; please tell me if you spot anything suspicious.
  • My solver.prototxt is attached as well, in case something in there causes this.
  • My best guess is a lack of memory that aborts the process while the layers are being created. Whenever it stops, it is in the middle of building the convolution layers, so I suppose something in that step fails unexpectedly, which would explain why there is no error message.
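The memory guess can be put into numbers using the log above. The following is only a back-of-the-envelope sketch, not part of Caffe: it assumes 4-byte single-precision blobs and that the "Memory required for data" figure is a byte count, and the helper names (blob_bytes, to_gib) are made up for illustration.

```python
import math

BYTES_PER_FLOAT = 4  # assumption: blobs are stored as single-precision floats


def blob_bytes(shape):
    """Bytes needed for one blob of the given shape (hypothetical helper)."""
    return math.prod(shape) * BYTES_PER_FLOAT


def to_gib(num_bytes):
    """Convert a byte count to GiB."""
    return num_bytes / 2**30


# Values copied from the log lines above.
top_shape = (24, 512, 19, 19)   # "Top shape: 24 512 19 19 (4435968)"
reported_total = 3523031072     # "Memory required for data: 3523031072"

print("elements in the conv11/dw top blob:", math.prod(top_shape))
print("bytes for that blob:", blob_bytes(top_shape))
print("running total in GiB: %.2f" % to_gib(reported_total))
```

The element count matches the number in parentheses in the log, and the running total comes to a little over 3 GiB before the net is fully set up. That only quantifies the guess; it does not confirm that memory is the cause.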
If you know what could cause this problem and how I can fix it, please share it! Thanks for your help!
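For anyone who wants to check the printed annotations mechanically rather than by eye, here is a minimal sketch. It is not the inspection script from the linked thread. The output above does not say in which order the four bbox floats are printed, so the sketch tries two orders, both of which are assumptions; check_box is a hypothetical helper, and the trailing value on each bbox line is simply dropped.

```python
def check_box(xmin, ymin, xmax, ymax):
    """Return a list of problems for one normalized box (empty if none)."""
    problems = []
    for name, value in (("xmin", xmin), ("ymin", ymin),
                        ("xmax", xmax), ("ymax", ymax)):
        if not 0.0 <= value <= 1.0:
            problems.append(f"{name}={value} outside [0, 1]")
    if xmin >= xmax:
        problems.append("xmin >= xmax")
    if ymin >= ymax:
        problems.append("ymin >= ymax")
    return problems


# The four floats printed after "bbox:" above (trailing value dropped).
printed = [
    (0.0020000000949949026, 0.9279999732971191, 0.26726725697517395, 0.9699699878692627),
    (0.20108695328235626, 0.7119565010070801, 0.15600000321865082, 0.9539999961853027),
    (0.5380434989929199, 1.0, 0.515999972820282, 0.8019999861717224),
    (0.42800000309944153, 0.7099999785423279, 0.19733333587646484, 1.0),
]

# Assumption A: the printed order is xmin, xmax, ymin, ymax.
as_a = [check_box(x0, y0, x1, y1) for (x0, x1, y0, y1) in printed]
# Assumption B: the printed order is xmin, ymin, xmax, ymax.
as_b = [check_box(x0, y0, x1, y1) for (x0, y0, x1, y1) in printed]

for label, results in (("A", as_a), ("B", as_b)):
    for box, problems in zip(printed, results):
        print(label, box, problems or "ok")
```

Under assumption A every box passes, while under assumption B three of the four come out reversed, so it is worth confirming the print order of the inspection script before drawing conclusions about the database.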
solver_train.prototxt

Tamas Nemes

Jul 10, 2020, 11:46:42 AM
to Caffe Users
After spending way too much time on this problem and trying countless solutions, I finally found what causes the issue. This error is particularly treacherous because in most cases it simply produces no error message.

See the original thread here: https://github.com/weiliu89/caffe/issues/669#issuecomment-339542120

Before compiling, you must edit the source code slightly. Open caffe/src/caffe/util/math_functions.cpp; at line 247 you will find this function, which you should edit to look like this:

template <typename Dtype>
void caffe_rng_uniform(const int n, Dtype a, Dtype b, Dtype* r) {
  CHECK_GE(n, 0);
  CHECK(r);
  // Added: if the bounds arrive reversed, swap them so that the
  // CHECK_LE below no longer aborts the program.
  if (a > b) {
    Dtype c = a;
    a = b;
    b = c;
  }
  CHECK_LE(a, b);
  // Sample n values uniformly from [a, b] into r.
  boost::uniform_real<Dtype> random_distribution(a, caffe_nextafter<Dtype>(b));
  boost::variate_generator<caffe::rng_t*, boost::uniform_real<Dtype> >
      variate_generator(caffe_rng(), random_distribution);
  for (int i = 0; i < n; ++i) {
    r[i] = variate_generator();
  }
}


Note that I only added an if statement (which swaps a and b if a is larger than b) and removed the const qualifier from the parameters Dtype a and Dtype b so that they can be reassigned.
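The effect of the added guard can be modelled outside Caffe. The sketch below is a plain-Python stand-in, not Caffe code: rng_uniform_strict and rng_uniform_patched are hypothetical names, the ValueError stands in for the abort triggered by CHECK_LE, and the caffe_nextafter adjustment of the upper bound is not modelled.

```python
import random


def rng_uniform_strict(n, a, b, rng):
    """Model of the unpatched behaviour: reversed bounds are fatal."""
    if a > b:
        # Stand-in for CHECK_LE(a, b) aborting the program.
        raise ValueError(f"Check failed: a <= b ({a} vs. {b})")
    return [rng.uniform(a, b) for _ in range(n)]


def rng_uniform_patched(n, a, b, rng):
    """Model of the patched behaviour: reversed bounds are swapped first."""
    if a > b:
        a, b = b, a
    return [rng.uniform(a, b) for _ in range(n)]


rng = random.Random(0)  # fixed seed so runs are repeatable

try:
    rng_uniform_strict(3, 1.0, 0.25, rng)
except ValueError as err:
    print("strict version stops:", err)

print("patched version samples:", rng_uniform_patched(3, 1.0, 0.25, rng))
```

One thing to keep in mind with this design: the swap makes the sampler tolerant of reversed bounds, but it also hides whatever produced them, so it may still be worth finding out where a > b comes from.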
Then rebuild Caffe:

$ make clean
$ make -j$(nproc)
$ make py -j$(nproc)
$ make test -j$(nproc)
$ make runtest -j$(nproc) # You should run the tests after compiling to make sure you don't run into any other unexpected error.


For me, this worked very well!