This is quite complex to explain, but I'll try. Imagine a 100x100 input and a 7x7 conv filter with a stride of 4 and no padding. The naive formula gives (100-7)/4+1 = 24.25, but this is impossible (a shape must be an integer). Instead, the layer fits only as many applications of the filter as it can, so the fraction is floored and the output is 24x24. Now, if we convolve that with a 3x3 filter with stride 2, the formula gives 11.5, again impossible, so the output is rounded down to the largest smaller integer: 11x11.
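If it helps, that floor arithmetic is easy to play with in a few lines of Python (conv_out is just an illustrative helper I made up, not a framework function):

```python
def conv_out(size, kernel, stride):
    """Spatial output size of a valid (unpadded) convolution."""
    return (size - kernel) // stride + 1  # floor division drops the remainder

o1 = conv_out(100, kernel=7, stride=4)  # -> 24, not 24.25
o2 = conv_out(o1, kernel=3, stride=2)   # -> 11, not 11.5
print(o1, o2)                           # 24 11
```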
Look what happens if we resize the input to 102x102, leaving the convolutions as they were. The first layer would "want" to output 24.75, which is floored to 24x24 as in the previous case, so in the end you get 11x11 as before. Resizing down is also interesting: consider a 95x95 input. It is convolved cleanly into a 23x23 blob, which after the second convolution is 11x11 again! So 95x95 and 102x102 are effectively the same to this network: both produce an 11x11 blob, which can be supplied to the same FC layer (with 121 inputs per channel).

The deeper your network and the larger the strides you use, the more noticeable this effect becomes: the more you can vary the input with no influence on the final shape. I can imagine your net allowed you to go as far as from 368 to 552 (though that's 50%, so quite a lot), but after crossing some threshold (656 must've been above it) the last conv layer reshapes as well, which is what causes your FC input mismatch (the failing K_ == new_K check).
In our toy case, if we fed in 103x103 the last conv would output 12x12, which is 144 elements; an FC layer after it that was trained on 95x95 images, and hence expects 121 inputs, would throw exactly this error.
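Scanning a range of input sizes with the same toy arithmetic makes both the plateau and the threshold visible (again just an illustrative sketch, not framework code):

```python
def conv_out(size, kernel, stride):  # same hypothetical helper as above
    return (size - kernel) // stride + 1

# Every size from 95 to 102 lands on 11x11 (121 elements); 103 is the
# first to tip the last conv over to 12x12 (144 elements) and would
# break an FC layer expecting 121 inputs.
for size in range(94, 105):
    final = conv_out(conv_out(size, kernel=7, stride=4), kernel=3, stride=2)
    print(f"{size}x{size} -> {final}x{final} ({final * final} elements)")
```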