Difficulty replicating performance on ILSVRC with CaffeNet


dea...@uw.edu

Jun 24, 2016, 6:06:09 PM
to Caffe Users


Caffe-nators
Using an AWS G2.2xlarge instance (GRID K520, 4 GB of GPU memory), below is my performance as a function of training iterations, and below that is my solver prototxt. I use the same solver values as the ones shipped with Caffe for training CaffeNet. In the figure I've also plotted Jeff Donahue's reference training performance (teal); my own training (purple, then green) matches his at the beginning but then drops below it and begins oscillating, and the learning-rate drop at 100,000 iterations does not boost performance the way it does in his run.

There are only two differences I know of between our training setups (does anyone have the train_val and solver prototxts for his network?):
1. I began training with a batch size of 227 because I initially got an out-of-memory error. I trained with that smaller batch for a while, and once I realized performance was diverging I reduced the test batch size (128 to 64) to save memory and switched the training batch back up to 256. This is annotated in the figure.
2. Following https://github.com/BVLC/caffe/issues/430, I set all of my bias initializations to 0.1 (one way I've been double-checking that is sketched right after this list).
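
In case it helps, here is a rough pycaffe sketch of that check. It assumes pycaffe is built, and the LMDB and mean-file paths referenced by train_val.prototxt have to exist, since the net must be constructed before its initial weights can be inspected:

import caffe
import numpy as np

caffe.set_mode_cpu()  # CPU is enough just to look at the freshly initialized parameters

# Build the net with random initialization straight from the training prototxt.
net = caffe.Net('/data/imagenet/caffe_dat/train_val.prototxt', caffe.TRAIN)

for name, blobs in net.params.items():
    if len(blobs) > 1:  # blobs[0] holds the weights, blobs[1] the biases
        print(name, np.unique(blobs[1].data))  # should show a single value, 0.1, per layer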



solver prototxt:

net: '/data/imagenet/caffe_dat/train_val.prototxt'
test_iter: 1000
test_interval: 1000
base_lr: 0.01
lr_policy: "step"
gamma: 0.1
stepsize: 100000
display: 50
max_iter: 450000
momentum: 0.9
weight_decay: 0.0005
snapshot: 4000
snapshot_prefix: "/data/imagenet/caffe_dat/net_stages_trainin$
solver_mode: GPU


Does anyone have ideas about what might be causing this divergence in performance between such similar networks? I'll probably restart with everything exactly at the current Caffe defaults to make sure those two differences aren't what's causing the problem; they just seemed unlikely, and experimentation is costly on an AWS instance :)


Thanks in advance,

Dean

Vijay Kumar

Jun 25, 2016, 4:54:46 AM
to Caffe Users
I'm not an expert, but I recently noticed a similar issue when trying to reproduce the results of a paper. Caffe expects the labels to be numbered 0..C-1 (for C classes). So, do you have the labels as 0-999 for ImageNet? If you have already ensured this, ignore this message.

dea...@uw.edu

Jul 6, 2016, 10:15:04 PM
to Caffe Users
Hi Vijay,
Thanks for the advice. As a quick fix I tried subtracting one from the labels:
layer {
  name: "fixlabel"
  type: "Power"
  bottom: "label"
  top: "label"
  # with the defaults scale = 1 and power = 1, this computes label - 1
  power_param {
    shift: -1
  }
}

Here is the training log:



It diverged for a little while and then dropped back down, so an off-by-one error in the labels doesn't seem to be the problem.

How did you go about checking whether the labels were correct?

Thanks,
Dean

dea...@uw.edu

Jul 13, 2016, 8:55:26 PM
to Caffe Users
I checked my LMDB database (roughly with the check sketched below) and its labels do in fact run from 0-999, so label indexing is not the problem. I am surprised that shifting the labels by one did not trigger an indexing error (maybe Caffe uses modulo indexing?).
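
For the record, the check was along these lines (the LMDB path here is a placeholder for my actual training LMDB):

import lmdb
from caffe.proto import caffe_pb2

# Walk every record in the training LMDB and track the smallest and largest label seen.
env = lmdb.open('ilsvrc12_train_lmdb', readonly=True)  # placeholder path
lo, hi = None, None
with env.begin() as txn:
    for _, value in txn.cursor():
        datum = caffe_pb2.Datum()
        datum.ParseFromString(value)
        lo = datum.label if lo is None else min(lo, datum.label)
        hi = datum.label if hi is None else max(hi, datum.label)
print('labels run from %d to %d' % (lo, hi))  # mine prints 0 to 999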
I've since run with exactly the solver and training prototxt as described in:
In teal you see Jeff Donahue's training, and in red my own. You can see there is a large qualitative difference at the start, in that it takes mine a while to ramp up. This hang-up is what initially led me to take the advice in https://github.com/BVLC/caffe/issues/430 and set all the biases to 0.1.
I'm very curious why Donahue, with the exact same settings, wouldn't hit this hang-up...
Then of course, after enough iterations, it is clear that my own net's performance is decreasing while Donahue's continues to improve.
At this point (assuming computers are deterministic :) ) there must be a difference in the training data, though I followed http://caffe.berkeleyvision.org/gathered/examples/imagenet.html as best I could.

The hang-up at the beginning makes me suspect I messed up calculating the image mean. Would anyone be willing to share a couple of values from their own mean file for a run where training CaffeNet worked? Below is roughly how I'm reading mine back out.
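
(The filename is a placeholder for whatever make_imagenet_mean.sh wrote out; I believe the BVLC-provided mean comes out somewhere around 104/117/123 in BGR order, but I'd love confirmation.)

import caffe
from caffe.proto import caffe_pb2

# Read back the mean image written by compute_image_mean (via make_imagenet_mean.sh).
blob = caffe_pb2.BlobProto()
with open('imagenet_mean.binaryproto', 'rb') as f:  # placeholder filename
    blob.ParseFromString(f.read())

mean = caffe.io.blobproto_to_array(blob)[0]  # shape (3, 256, 256), BGR channel order
print(mean.mean(axis=(1, 2)))                # one average value per channel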

Any other ideas of what the problem might be?




Thanks in advance for any advice.

Best,

Dean
