Referencing both Stanford's CS231n class notes and Caffe's layer documentation, the size of a feature map resulting from a convolution is:
(W - F + 2P)/S + 1
where W = incoming feature map spatial dimension, F = convolution filter dimension, P = padding, and S = stride. However, when training a standard AlexNet (where conv1 has F = 11, S = 4, and no padding) on 64x64 image patches, this formula gives (64 - 11)/4 + 1 = 14.25, which is not an integer. Yet Caffe does not throw an error, and training succeeds. The same calculation applies to feature maps produced by pooling layers.
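For concreteness, here is a minimal sketch of that arithmetic (`conv_output_size` is just an illustrative name, and P = 0 is assumed since AlexNet's conv1 uses no padding). Presumably a framework that rounds the fractional result down would report a 14x14 feature map rather than erroring:

```python
import math

def conv_output_size(W, F, S, P=0):
    """Exact (unfloored) result of the formula (W - F + 2P)/S + 1."""
    return (W - F + 2 * P) / S + 1

exact = conv_output_size(64, 11, 4)  # AlexNet conv1 applied to a 64x64 patch
print(exact)                         # 14.25 -- not an integer
print(math.floor(exact))             # 14 -- what flooring the result would give
```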
Please explain this discrepancy...