Hello,
I am trying to understand SGD.
I took a large network (like AlexNet) and used just 20 images with 2 classes. (I know it will overfit.) Batch size is kept at 20, so that each iteration is one epoch.
My understanding was that SGD always tries to optimize the parameters so that the final loss is minimized. So I was expecting the loss to decrease monotonically in every iteration.
But when I run it, I see a lot of oscillation. Overall the loss is decreasing, but I can see occasional jumps to a higher loss value here and there. I know there will be oscillations if we take a mini-batch, but here the weights are updated over the entire dataset. So shouldn't the loss be decreasing monotonically?
Any idea what I am missing here?
(I tried different learning rates from 0.01 to 0.0001. I also checked momentum = 0, 0.9, and 0.99. The oscillation is observed in all cases.)
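For reference, here is a minimal pure-Python sketch of what I mean by full-batch gradient descent: one update per epoch, computed over all 20 samples. This is a toy logistic-regression stand-in, not my actual AlexNet code, and the data and learning rate here are made up for illustration. In this simple convex case the loss does decrease monotonically when the learning rate is small enough.

```python
import math
import random

# Toy stand-in for the real setup: 20 samples, 2 classes, full-batch
# gradient descent on a single logistic unit (NOT the actual AlexNet code).
random.seed(0)
X = [(random.gauss(-1, 0.5), 1.0) for _ in range(10)] + \
    [(random.gauss(+1, 0.5), 1.0) for _ in range(10)]  # (feature, bias term)
y = [0] * 10 + [1] * 10

def loss_and_grad(w):
    """Average cross-entropy loss and its gradient over the full batch."""
    total, g = 0.0, [0.0, 0.0]
    for (x0, x1), t in zip(X, y):
        z = w[0] * x0 + w[1] * x1
        p = 1.0 / (1.0 + math.exp(-z))
        total += -(t * math.log(p) + (1 - t) * math.log(1 - p))
        for i, xi in enumerate((x0, x1)):
            g[i] += (p - t) * xi
    n = len(X)
    return total / n, [gi / n for gi in g]

w, lr = [0.0, 0.0], 0.1   # illustrative learning rate
losses = []
for epoch in range(50):   # one update per epoch = full-batch GD
    L, g = loss_and_grad(w)
    losses.append(L)
    w = [wi - lr * gi for wi, gi in zip(w, g)]
```

In my real run the structure is the same (batch size = dataset size, one step per epoch), yet the loss still jumps around, which is what I don't understand.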
Thanks
Iaz