OK, I found the culprit.
I was actually using a data generator to produce my data batch-by-batch. After some debugging, I found that I was shuffling the data at the end of every epoch rather than at the beginning. Since my image file
paths are saved in a text file in class order, and my data are class-imbalanced (about 1:10), the loss was going down for a couple of
batches (as long as images from the same class were being processed, I guess),
then jumping up to something very abnormal (when it started to
process images from the next class, I guess), and never coming back down to normal.
Once I modified my code to shuffle the
data at the beginning of each epoch, everything started to work as usual: the loss keeps going down and no longer jumps to a very high value.
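
For reference, this is roughly what my generator does now (a minimal sketch; load_image, image_paths and labels stand in for my actual loading code):

    import numpy as np

    def batch_generator(image_paths, labels, batch_size, load_image):
        """Yield batches indefinitely, reshuffling at the start of every epoch."""
        n = len(image_paths)
        while True:
            # Shuffle at the *beginning* of each epoch, before any batch is drawn,
            # so class-ordered file lists don't produce long same-class runs.
            order = np.random.permutation(n)
            for start in range(0, n, batch_size):
                idx = order[start:start + batch_size]
                x = np.stack([load_image(image_paths[i]) for i in idx])
                y = np.array([labels[i] for i in idx])
                yield x, y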
Considering this behaviour of the optimizer (SGD) on un-shuffled data, I am concerned that when the training data are heavily
imbalanced (in my case, it could go up to 1:20) and the training set is large (it could be 100,000 images in my case), then even after shuffling the data at the beginning of each epoch, a fairly long run of images from the majority class might still appear consecutively, causing the same breakdown of
training to recur.
Therefore, can anyone please advise me on how to tackle such numerical issues during training?
Thanks.
Atique