That 1 ms number is for heavily parallel inference; I think it's only achievable as an amortized per-image figure with a batch of 128 or so. In an industrial setting, where you care about latency and so can't wait to acquire another image (your batch size is 1), it will be hard to get that low. For example, in my own work (coincidentally an industrial application), a Titan Z (single core) needs about 10 ms to run a slightly optimized AlexNet on a single image.
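To make the throughput-versus-latency point concrete, here is a rough back-of-the-envelope sketch. It assumes the 1 ms figure is an amortized per-image time at batch size 128 (my guess, not something I've confirmed), and the helper names `batch_latency_ms` and `throughput_per_s` are made up for illustration. It also ignores the time spent waiting to acquire the images, which would only make the batched case worse.

```python
def batch_latency_ms(per_image_ms: float, batch_size: int) -> float:
    """Time until the whole batch has been processed, given an
    amortized per-image time (assumes cost scales linearly with batch size)."""
    return per_image_ms * batch_size


def throughput_per_s(per_image_ms: float) -> float:
    """Images processed per second at a given amortized per-image time."""
    return 1000.0 / per_image_ms


# Assumed: 1 ms per image amortized over a batch of 128.
batched_latency = batch_latency_ms(1.0, 128)
# From the comment above: about 10 ms for a single image (batch size 1).
single_latency = batch_latency_ms(10.0, 1)

print(f"batch=128: {throughput_per_s(1.0):.0f} img/s, "
      f"result after {batched_latency:.0f} ms")
print(f"batch=1:   {throughput_per_s(10.0):.0f} img/s, "
      f"result after {single_latency:.0f} ms")
```

Under those assumptions the batched setup has ten times the throughput, but each result arrives only after the whole batch is done, so the single-image setup still wins on latency.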