I haven't re-read the SqueezeNet paper in a long time, but I remember they stated the model can only reach <0.5 MB with specific compression techniques. In that compressed form you can transfer the model, but it is not directly usable for inference.
Anyway, I believe that at these size ranges (a few MB), what matters most is not the size of the model itself, but rather the amount of (GPU) memory it needs during inference. And this value can change a lot depending on which layers you use. For example, a model with many fully connected layers uses a lot of disk space, but will not necessarily use much more memory at inference time. Conversely, batch norm layers are lightweight to store as part of the model, but once applied to a specific image, their outputs take up real space in memory.
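To make the contrast concrete, here is a minimal PyTorch sketch (the layer shapes are made up purely for illustration) comparing parameter storage against activation memory for a fully connected layer and a convolutional layer:

```python
import torch
import torch.nn as nn

def param_mb(module):
    # Storage cost of the weights -- roughly what the layer takes on disk.
    return sum(p.numel() * p.element_size() for p in module.parameters()) / 1e6

def act_mb(module, x):
    # Size of the layer's output for one input -- its inference-time memory cost.
    with torch.no_grad():
        y = module(x)
    return y.numel() * y.element_size() / 1e6

fc = nn.Linear(4096, 4096)                          # parameter-heavy, activation-light
conv = nn.Conv2d(64, 64, kernel_size=3, padding=1)  # the opposite

print(f"fc:   {param_mb(fc):6.2f} MB params | {act_mb(fc, torch.randn(1, 4096)):6.2f} MB activations")
print(f"conv: {param_mb(conv):6.2f} MB params | {act_mb(conv, torch.randn(1, 64, 600, 600)):6.2f} MB activations")
```

With these (arbitrary) shapes, the fully connected layer dominates on disk (~67 MB of weights for ~0.02 MB of output), while the convolution is the reverse (~0.15 MB of weights for ~92 MB of output). It's the activations, not the weights, that tend to fill your GPU.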
I can't really tell you which architecture uses the least memory; you should try it yourself, as it may also depend on the input image size.
For example, I have a 3.6 MB squeezenetv2 model (modified a bit for my specific needs), which results in 3 GB of GPU memory usage during inference on 600x600 images.
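If you want to measure this for your own setup, something like the following should work (torchvision's squeezenet1_1 is just a stand-in here; swap in your own model and input size):

```python
import torch
import torchvision.models as models

# squeezenet1_1 stands in for whatever architecture you are testing.
model = models.squeezenet1_1(weights=None).cuda().eval()
x = torch.randn(1, 3, 600, 600, device="cuda")

torch.cuda.reset_peak_memory_stats()
with torch.no_grad():
    model(x)
print(f"peak GPU memory during inference: {torch.cuda.max_memory_allocated() / 1e9:.2f} GB")
```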
Hope it helps.
Victor