Thanks, I appreciate your thoughts. Do you have a link for the 7.9% - 7.0% numbers? I believe you, just haven't seen them myself.
Does the size of VGG mean that 4GB+ GPUs are required? I've been able to get by with smaller batch sizes / GPUs for GoogLeNet.
It's true that the multiple losses (1 primary classifier, 2 auxiliary classifiers) threw me for a loop when I first attempted to fine-tune GoogLeNet. I tried fine-tuning from the ILSVRC weights in 2 ways: removing the 2 auxiliary classifiers entirely, and leaving them in but decreasing the learning rate of everything except the final classifier's FC layer. Both methods converged nicely and gave good classification results.
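For reference, the "lower the learning rate everywhere but the final classifier" variant is usually done in Caffe by renaming the last InnerProduct layer (so the ILSVRC weights aren't copied into it) and giving it larger lr_mult values. A sketch assuming the BVLC GoogLeNet layer names; the `num_output` and the multipliers here are illustrative, not what I actually used:

```
# Hypothetical fragment of train_val.prototxt
layer {
  name: "loss3/classifier_ft"            # renamed so pretrained weights are not copied in
  type: "InnerProduct"
  bottom: "pool5/7x7_s1"
  top: "loss3/classifier_ft"
  param { lr_mult: 10  decay_mult: 1 }   # new weights learn faster than the frozen-ish base
  param { lr_mult: 20  decay_mult: 0 }   # biases faster still (common Caffe convention)
  inner_product_param { num_output: 2 }  # your number of classes
}
```

The earlier layers keep their pretrained names (so weights are copied) but get small lr_mult values, or lr_mult: 0 to freeze them outright.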
My application needs to do detection across very large (gigapixel) images, using a sliding window approach. So, number of computations / execution speed is pretty important for my use case.
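In case it helps frame the cost discussion, a minimal sketch of what the window enumeration looks like (pure Python; `win`, `stride`, and the batching scheme are placeholder choices, not anything from this thread):

```python
# Sketch: enumerate sliding-window crops over a large image so each crop
# can be scored by the trained classifier. Window size and stride below
# are made-up parameters -- adapt them to your model's input size.

def sliding_windows(width, height, win, stride):
    """Yield (x, y, w, h) boxes covering the image left-to-right, top-to-bottom."""
    for y in range(0, height - win + 1, stride):
        for x in range(0, width - win + 1, stride):
            yield (x, y, win, win)

def batched(windows, batch_size):
    """Group windows into fixed-size batches, one forward pass per batch,
    so the GPU stays busy on gigapixel inputs."""
    batch = []
    for w in windows:
        batch.append(w)
        if len(batch) == batch_size:
            yield batch
            batch = []
    if batch:
        yield batch
```

The window count grows quadratically with image side length at fixed stride, which is why per-window cost dominates here; a fully-convolutional reformulation of the classifier can score many overlapping windows in a single forward pass and helps a lot at this scale.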
Cheers