I'm wondering if there are any tips/tricks for using spatial transformers that I should be aware of. I've spent a while now trying them out, but they are very hit-or-miss from my experience.
I tried it out on the distorted MNIST dataset, like the one in the examples, and it worked well. However, when I attempt to use it on colored images from real photographs, I start to run into issues.
Typically what happens is one of two things:
1. In the first few iterations the spatial transformer layer zooms *way too much*, and the convolutional network below it is then stuck trying to learn from incomplete images (e.g. with a 200px x 200px input image, the network below is doomed to learn from the same 10px x 10px corner of every image). When this happens, no learning occurs even if I let it run for hundreds of epochs to see if it gets unstuck (it never does).
2. The spatial transformer doesn't zoom/rotate/skew at all, and you're left with a fancy downsampling layer in your network.
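For context on outcome 2, here's a minimal numpy sketch of what the transformer's grid generator and bilinear sampler compute (assuming a single-channel 2D image and the usual 2x3 affine parameterization; function names are mine, not from any library). When the predicted affine parameters `theta` are the identity, the layer just resamples the input unchanged, which is exactly the "fancy downsampling layer" behavior when the output resolution is lower than the input:

```python
import numpy as np

def affine_grid(theta, H, W):
    # Target-pixel coordinates, normalized to [-1, 1].
    ys, xs = np.meshgrid(np.linspace(-1, 1, H), np.linspace(-1, 1, W), indexing="ij")
    coords = np.stack([xs.ravel(), ys.ravel(), np.ones(H * W)])  # 3 x (H*W)
    src = theta @ coords  # 2 x (H*W): source (x, y) for each target pixel
    return src[0].reshape(H, W), src[1].reshape(H, W)

def bilinear_sample(img, sx, sy):
    H, W = img.shape
    # Map normalized source coordinates back to pixel indices.
    x = (sx + 1) * (W - 1) / 2
    y = (sy + 1) * (H - 1) / 2
    x0 = np.clip(np.floor(x).astype(int), 0, W - 2)
    y0 = np.clip(np.floor(y).astype(int), 0, H - 2)
    wx, wy = x - x0, y - y0
    # Weighted average of the four surrounding pixels.
    return (img[y0, x0] * (1 - wx) * (1 - wy)
          + img[y0, x0 + 1] * wx * (1 - wy)
          + img[y0 + 1, x0] * (1 - wx) * wy
          + img[y0 + 1, x0 + 1] * wx * wy)

# Identity theta: the transformer is a no-op (apart from any resolution change).
theta = np.array([[1.0, 0.0, 0.0],
                  [0.0, 1.0, 0.0]])
img = np.arange(16, dtype=float).reshape(4, 4)
sx, sy = affine_grid(theta, 4, 4)
print(np.allclose(bilinear_sample(img, sx, sy), img))  # → True
```

Scaling the diagonal of `theta` below 1 samples only the central region (a zoom-in), which is how a runaway localization network ends up cropping everything down to one tiny corner, as in outcome 1.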
I assumed that maybe my learning rate was too high or too low, but no matter what my initial rate is (0.01 down to 0.00001), the same two outcomes occur. I've also noticed that the Adam update rule seems to be the only way to even reach outcome 2; otherwise it always ends up stuck in outcome 1.
I've proven to myself that it *can* work, since I've gotten it to zoom/rotate/skew well on the distorted MNIST set, but outside of black-and-white images I'm stuck. I'm willing to try whatever tricks/diagnostics others may have, as spatial transformers could be terrifically useful for a few datasets I have.