Image mean = the "average image" of the training dataset. It is calculated with the compute_image_mean C++ binary. Think of it as a DC offset present in every image, which you can strip away so that only the meaningful "AC" variations around that offset remain.
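To illustrate the idea (a plain NumPy sketch, not the Caffe binary itself — the toy 2x2 images are made up):

```python
import numpy as np

# Hypothetical stack of four grayscale 2x2 "training images".
images = np.array([
    [[10, 20], [30, 40]],
    [[12, 22], [32, 42]],
    [[ 8, 18], [28, 38]],
    [[10, 20], [30, 40]],
], dtype=np.float32)

# The "average image": the per-pixel mean over the dataset (the DC offset).
mean_image = images.mean(axis=0)

# Mean subtraction keeps only each image's "AC" variation from that offset.
centered = images - mean_image

print(mean_image)
# After centering, every pixel position averages to zero across the dataset.
print(centered.mean(axis=0))
```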
Image scale = the input size of an image for a network. If a network has been trained on MxN images, you should provide images of that size for testing.
To my understanding, that is a user-selected parameter; you can try setting a different size.
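As a toy illustration of bringing an arbitrary image down to a network's fixed input size, here is a nearest-neighbour resize in plain NumPy (the 64x48 input and 28x28 target are just example numbers; real pipelines would use a proper image library):

```python
import numpy as np

def resize_nearest(img, out_h, out_w):
    """Nearest-neighbour resize of a 2-D array to (out_h, out_w)."""
    in_h, in_w = img.shape
    # Map each output pixel back to the nearest source pixel.
    rows = np.arange(out_h) * in_h // out_h
    cols = np.arange(out_w) * in_w // out_w
    return img[rows[:, None], cols]

# A dummy 64x48 "image" rescaled to a hypothetical 28x28 network input.
img = np.arange(64 * 48, dtype=np.float32).reshape(64, 48)
small = resize_nearest(img, 28, 28)
print(small.shape)  # (28, 28)
```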
Since the ILSVRC challenge contains a variety of object classes (dog, cat, chair, bike), many nets for that task use a "medium" image size, 200-something x 200-something.
I have no experience with MNIST, but I am guessing that since it's a digit-recognition challenge, you can afford to use small image sizes. Hence the 28x28 scale.
Of course, it would be interesting to see the performance-accuracy trade-off of smaller scales on ImageNet.