Hi,
Thanks for the reply.
I've seen this, and understand that this gives us a dense feature map as output. But the paper I originally linked to trains its network by sliding a window over the entire image and feeding the patch inside the window at each pixel to the convnet for classification.
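To make sure we're talking about the same thing, here is a toy sketch of the procedure I mean, in plain NumPy, with a hypothetical classify_patch standing in for the convnet's forward pass:

```python
import numpy as np

def classify_patch(patch):
    # Hypothetical stand-in for a forward pass through the convnet;
    # returns a class label for the given patch.
    return int(patch.mean() > 0.5)

def sliding_window_predict(image, patch_size):
    # Pad so that a patch can be centered on every pixel.
    half = patch_size // 2
    padded = np.pad(image, half, mode="reflect")
    labels = np.zeros(image.shape, dtype=int)
    for y in range(image.shape[0]):
        for x in range(image.shape[1]):
            # Extract the window around (y, x) and classify it.
            patch = padded[y:y + patch_size, x:x + patch_size]
            labels[y, x] = classify_patch(patch)
    return labels

image = np.random.rand(32, 32)
label_map = sliding_window_predict(image, patch_size=9)
print(label_map.shape)  # (32, 32): one prediction per pixel
```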
In fact, the net surgery tutorial mentions this at the end:
"Note that this model isn't totally appropriate for sliding-window detection since it was trained for whole-image classification. Nevertheless it can work just fine. Sliding-window training and finetuning can be done by defining a sliding-window ground truth and loss such that a loss map is made for every location and solving as usual."
The above sentence is what has stumped me. The sliding-window ground truth would be as large as the image. How do I get a convnet, albeit a fully convolutional one, to give an output that is the same size as the input? And how do I define a loss function that takes the loss at each pixel and backpropagates it accordingly?
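To make my question concrete, here is a toy sketch of what I imagine the "loss map" part might look like, in plain NumPy with made-up shapes; I don't know if this is actually what the tutorial means:

```python
import numpy as np

def pixelwise_softmax_loss(scores, gt):
    # scores: (C, H, W) class scores at every output location.
    # gt:     (H, W) integer ground-truth label per pixel.
    # Softmax over the class axis, independently at each pixel.
    e = np.exp(scores - scores.max(axis=0, keepdims=True))
    probs = e / e.sum(axis=0, keepdims=True)
    H, W = gt.shape
    # Per-pixel negative log-likelihood: a full "loss map".
    loss_map = -np.log(probs[gt, np.arange(H)[:, None], np.arange(W)])
    # The scalar training loss is the mean over all pixels, so each
    # location would contribute its own gradient on the backward pass.
    return loss_map.mean(), loss_map

scores = np.random.randn(3, 8, 8)   # 3 classes, 8x8 output map
gt = np.random.randint(0, 3, (8, 8))
loss, loss_map = pixelwise_softmax_loss(scores, gt)
print(loss, loss_map.shape)  # scalar loss, (8, 8) loss map
```

Is something along these lines what "defining a sliding-window ground truth and loss" refers to, and if so, how would I express it in Caffe?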
Apologies in advance if my questions are too naive; I'm just starting out in this area.