Currently, the MLIR-based TFLite Converter offers four quantization modes (Float32, Float16, INT8, ACTIVATIONS_INT16_WEIGHTS_INT8). I am experimenting with them for image classification and object detection tasks.
Image classification was fairly straightforward and the results matched expectations; however, I found object detection inference more complicated.
I followed these steps to run TFLite for object detection:
1. Exported the frozen graph for inference (export_tflite_ssd_graph.py inside the object_detection directory).
2. Converted the model to TFLite using the TFLite Converter Python API, once for each of the four quantization modes (Float32, Float16, INT8, ACTIVATIONS_INT16_WEIGHTS_INT8); see the conversion sketch after this list.
3. Ran inference with each of the four resulting models (latency measured as in the timing sketch further below).
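
For step 2, the conversion looked roughly like the following minimal sketch. It assumes the TF1-style frozen-graph converter path, the standard SSD input/output tensor names, a 300×300 input, and a random placeholder calibration set (use real preprocessed images for calibration in practice):

```python
import numpy as np
import tensorflow as tf

# Frozen graph produced by export_tflite_ssd_graph.py; tensor names and input
# shape below are the usual SSD defaults, adjust if your export differs.
converter = tf.compat.v1.lite.TFLiteConverter.from_frozen_graph(
    graph_def_file='tflite_graph.pb',
    input_arrays=['normalized_input_image_tensor'],
    output_arrays=['TFLite_Detection_PostProcess',
                   'TFLite_Detection_PostProcess:1',
                   'TFLite_Detection_PostProcess:2',
                   'TFLite_Detection_PostProcess:3'],
    input_shapes={'normalized_input_image_tensor': [1, 300, 300, 3]})
converter.allow_custom_ops = True          # needed for the detection post-process op
converter.experimental_new_converter = True  # MLIR-based converter

def representative_dataset():
    # Placeholder calibration data; feed real preprocessed images here.
    for _ in range(100):
        yield [np.random.uniform(-1, 1, (1, 300, 300, 3)).astype(np.float32)]

mode = 'INT8'  # one of: FLOAT32, FLOAT16, INT8, ACT_INT16_WEIGHTS_INT8
if mode == 'FLOAT16':
    converter.optimizations = [tf.lite.Optimize.DEFAULT]
    converter.target_spec.supported_types = [tf.float16]
elif mode == 'INT8':
    converter.optimizations = [tf.lite.Optimize.DEFAULT]
    converter.representative_dataset = representative_dataset
elif mode == 'ACT_INT16_WEIGHTS_INT8':
    converter.optimizations = [tf.lite.Optimize.DEFAULT]
    converter.representative_dataset = representative_dataset
    converter.target_spec.supported_ops = [
        tf.lite.OpsSet.EXPERIMENTAL_TFLITE_BUILTINS_ACTIVATIONS_INT16_WEIGHTS_INT8]
# FLOAT32: no optimization flags set.

with open(f'ssd_mobilenet_v2_{mode.lower()}.tflite', 'wb') as f:
    f.write(converter.convert())
```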
- Model: ssd_mobilenet_v2 trained on COCO
- Processor: 2 × Xeon(R) Gold 5115 CPU @ 2.40 GHz (40 cores)
- GPU: 2 × RTX 2080 with Turing ML cores
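
Per-image latency was measured with the TFLite interpreter along the lines of the sketch below (model path, run count, and the random input are illustrative; real test images were used for the actual numbers):

```python
import time
import numpy as np
import tensorflow as tf

interpreter = tf.lite.Interpreter(model_path='ssd_mobilenet_v2_float32.tflite')
interpreter.allocate_tensors()
input_details = interpreter.get_input_details()
output_details = interpreter.get_output_details()

# Dummy input matching the model's expected shape and dtype.
image = np.random.uniform(-1, 1, input_details[0]['shape']).astype(
    input_details[0]['dtype'])

# Warm-up run, then average over repeated invocations.
interpreter.set_tensor(input_details[0]['index'], image)
interpreter.invoke()

runs = 50
start = time.perf_counter()
for _ in range(runs):
    interpreter.set_tensor(input_details[0]['index'], image)
    interpreter.invoke()
elapsed_ms = (time.perf_counter() - start) / runs * 1000

boxes = interpreter.get_tensor(output_details[0]['index'])
print(f'average latency: {elapsed_ms:.1f} ms, boxes shape: {boxes.shape}')
```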
Experiment results (average inference time per image, model size):

| Model | Avg. inference time (ms) | Size (MB) |
| --- | --- | --- |
| TensorFlow (original) | 33 | 67 |
| TensorFlow Lite Float32 | 90 | 65 |
| TensorFlow Lite Float16 | 60 | 37 |
| TensorFlow Lite INT8 | 4710 | 17 |
| TensorFlow Lite ACTIVATIONS_INT16_WEIGHTS_INT8 | 4709 | 17 |
- Why is the inference time of the TFLite model with no optimization (Float32) about 3 times slower than the original TensorFlow model?
- Why are INT8 and ACTIVATIONS_INT16_WEIGHTS_INT8 extremely slow for object detection, when they improved latency for image classification?
Any insights, suggestions, or improvements regarding these observations would be really helpful.