I guess your learning rate setup is not smooth across the transition from pretraining to finetuning. Here by "learning rate" I mean the combined effect of batch size, lrate, num_jobs, etc.
Imagine this were non-parallel training: the curve looks as if you suddenly increased the learning rate and then started decaying it.
Just a guess, not quite sure about this.
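
To check this, one option is to compute a rough "effective learning rate" per iteration and look for a jump at the pretraining/finetuning boundary. Below is a minimal sketch. The proxy formula (lrate * batch_size / num_jobs) is my own assumption, based on gradients being summed over the minibatch and models being averaged across jobs; it is not taken from any toolkit, so swap in whatever matches your trainer. The function names and all the numbers are made up for illustration.

```python
def effective_lr(lrate, batch_size, num_jobs):
    # Hypothetical proxy (an assumption, not a toolkit formula):
    # step size grows with lrate and batch size, and is diluted
    # when models from num_jobs parallel jobs are averaged.
    return lrate * batch_size / num_jobs


def find_jumps(schedule, max_ratio=2.0):
    """Return indices i where the proxy changes by more than a factor
    of max_ratio (up or down) between iteration i-1 and iteration i."""
    values = [effective_lr(*step) for step in schedule]
    jumps = []
    for i in range(1, len(values)):
        ratio = values[i] / values[i - 1]
        if ratio > max_ratio or ratio < 1.0 / max_ratio:
            jumps.append(i)
    return jumps


# Made-up settings: (lrate, batch_size, num_jobs) per iteration.
pretrain = [(0.001, 128, 4), (0.0008, 128, 4)]
finetune = [(0.002, 256, 2), (0.0015, 256, 2)]
schedule = pretrain + finetune

print(find_jumps(schedule))  # -> [2], the first finetuning iteration
```

If the only flagged index is the first finetuning iteration, that would be consistent with the sudden increase followed by decay described above.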