Using non-square inputs is possible in Caffe. You just have twice as many numbers to watch out for, namely the correct blob sizes (Andrej Karpathy's notes for CS231n explain this very well; check the summary of the Convolutional Layer section, particularly the equations describing the conv output size). You can even use non-square convolution kernels: instead of specifying kernel_size, use kernel_h and kernel_w in your conv layer definition (the same applies to the pooling layer and to stride sizes; see the Caffe layer catalogue for details).
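As a rough sketch, a conv layer with a non-square kernel and non-square strides might look like this in a prototxt definition (the layer/blob names and the specific sizes here are made up for illustration; kernel_h, kernel_w, stride_h, and stride_w are the actual ConvolutionParameter fields):

```
layer {
  name: "conv1"
  type: "Convolution"
  bottom: "data"
  top: "conv1"
  convolution_param {
    num_output: 64   # number of output feature maps
    kernel_h: 3      # kernel height
    kernel_w: 5      # kernel width (may differ from height)
    stride_h: 1      # strides can be non-square too
    stride_w: 2
  }
}
```

Just make sure you either set kernel_size or the kernel_h/kernel_w pair, not both; Caffe will complain otherwise.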
By 16 and 10 pixels they refer to the convolution stride, that is, by how many pixels you shift the kernel between two neighbouring locations on the input (the links I posted above explain the idea better). The 16-pixel stride applies to a 600 px wide image that was created by upscaling a 375 px image by a factor of 1.6; before resizing, this corresponds to a stride of 16 / 1.6 = 10 pixels on the original image.
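The arithmetic above can be sketched in a few lines of Python (conv_output_size is just the standard output-size formula from the CS231n notes, not Caffe code; the numbers are the ones from this answer):

```python
def conv_output_size(input_size, kernel_size, stride, pad=0):
    # Standard conv output-size formula from the CS231n notes:
    # out = (W - F + 2P) / S + 1
    return (input_size - kernel_size + 2 * pad) // stride + 1

# Sanity check with a well-known case (AlexNet conv1: 227 px input,
# 11 px kernel, stride 4 -> 55 px output):
print(conv_output_size(227, 11, 4))  # 55

# Stride scaling under image resizing, with the numbers from above:
upscale = 600 / 375                  # resize factor, = 1.6
stride_on_resized = 16               # stride measured on the 600 px image
stride_on_original = stride_on_resized / upscale  # ~10 px on the 375 px image
```

The point is simply that a stride measured on the resized image divides by the resize factor to give the effective stride on the original image.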