i.e. we break the single deep network above into the two constituent ones below. The first one tries to minimize a loss defined as the difference between the conv3 and conv3' layers, while the second one minimizes the usual loss between the image labels and the predictions.
The algorithm looks like:
So I first forward prop the first net, then forward and back prop the second one, and then back prop the first one once the data it needs (its conv3' target) has been produced.
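To make the question concrete, here is a minimal PyTorch sketch of the training step I have in mind. The architectures, the layer sizes, and in particular the rule used to form conv3' (one gradient step on the detached activation, using the gradient from the second net's backward pass) are assumptions for illustration only; substitute your own modules and target rule.

    import torch
    import torch.nn as nn
    import torch.nn.functional as F

    class Net1(nn.Module):
        """Front half: image -> conv3 activations (architecture is illustrative)."""
        def __init__(self):
            super().__init__()
            self.features = nn.Sequential(
                nn.Conv2d(3, 32, 3, padding=1), nn.ReLU(),
                nn.Conv2d(32, 64, 3, padding=1), nn.ReLU(),
                nn.Conv2d(64, 64, 3, padding=1), nn.ReLU(),  # "conv3"
            )
        def forward(self, x):
            return self.features(x)

    class Net2(nn.Module):
        """Back half: conv3 activations -> class predictions."""
        def __init__(self, num_classes=10):
            super().__init__()
            self.head = nn.Sequential(
                nn.AdaptiveAvgPool2d(1), nn.Flatten(), nn.Linear(64, num_classes)
            )
        def forward(self, h):
            return self.head(h)

    net1, net2 = Net1(), Net2()
    opt1 = torch.optim.SGD(net1.parameters(), lr=1e-2)
    opt2 = torch.optim.SGD(net2.parameters(), lr=1e-2)
    target_step = 0.1  # step size used to form conv3' from the activation gradient (assumed)

    def train_step(images, labels):
        # 1) Forward prop the first net to get the conv3 activations.
        conv3 = net1(images)

        # 2) Forward and back prop the second net on the usual label loss.
        h = conv3.detach().requires_grad_(True)  # cut the graph between the two nets
        logits = net2(h)
        loss2 = F.cross_entropy(logits, labels)
        opt2.zero_grad()
        loss2.backward()
        opt2.step()

        # 3) Form conv3' from the gradient w.r.t. the activations (assumed rule),
        #    then back prop the first net on the conv3-vs-conv3' matching loss.
        conv3_target = (h - target_step * h.grad).detach()
        loss1 = F.mse_loss(conv3, conv3_target)
        opt1.zero_grad()
        loss1.backward()
        opt1.step()
        return loss1.item(), loss2.item()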
The thing is, I get very odd convergence behavior for the second net. The first one converges very fast, even exponentially fast, with little noise. The second one, though, converges very slowly and with a lot of noise. The learning curves look like this:
Net12 is a reference: it is the complete, unsplit net equivalent to the dual network. My question, then, is whether there is any obvious reason for such erratic behavior. I would be happy to provide more architectural details if needed.
Thanks!