Quantization flags for Object Detection in TFLite [ssd_mobilenet_v2 for COCO]


Yashi Gupta

Feb 10, 2021, 4:30:09 AM
to TensorFlow Lite
Hello Team,

Currently, the MLIR-based TFLite Converter offers four quantization settings (Float32, Float16, INT8, ACTIVATIONS_INT16_WEIGHTS_INT8). I am experimenting with them for image classification and object detection tasks.
Image classification was fairly straightforward and the observed results were in line with expectations; however, I found object detection inference a bit more complicated.

I followed these steps to run TFLite for object detection:
1. Exported a frozen graph for inference (export_tflite_ssd_graph.py inside the object_detection directory).
2. Converted the model to TFLite using the TFLite Converter Python API with the four quantization settings (Float32, Float16, INT8, ACTIVATIONS_INT16_WEIGHTS_INT8); see the sketch below.
3. Ran inference with the above 4 models.
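
For reference, the conversion step looked roughly like the following (a minimal sketch; it assumes the frozen graph exported by export_tflite_ssd_graph.py with the default 300x300 input, and representative_dataset_gen is a placeholder generator yielding calibration images):

import tensorflow as tf

converter = tf.compat.v1.lite.TFLiteConverter.from_frozen_graph(
    "tflite_graph.pb",
    input_arrays=["normalized_input_image_tensor"],
    output_arrays=[
        "TFLite_Detection_PostProcess",
        "TFLite_Detection_PostProcess:1",
        "TFLite_Detection_PostProcess:2",
        "TFLite_Detection_PostProcess:3",
    ],
    input_shapes={"normalized_input_image_tensor": [1, 300, 300, 3]},
)
converter.allow_custom_ops = True  # the detection post-processing op is a custom op

# Float32: convert with no optimization flags (nothing else to set).
# Float16:
#   converter.optimizations = [tf.lite.Optimize.DEFAULT]
#   converter.target_spec.supported_types = [tf.float16]
# INT8 (needs a representative dataset for calibration):
#   converter.optimizations = [tf.lite.Optimize.DEFAULT]
#   converter.representative_dataset = representative_dataset_gen
# ACTIVATIONS_INT16_WEIGHTS_INT8:
#   converter.optimizations = [tf.lite.Optimize.DEFAULT]
#   converter.representative_dataset = representative_dataset_gen
#   converter.target_spec.supported_ops = [
#       tf.lite.OpsSet.EXPERIMENTAL_TFLITE_BUILTINS_ACTIVATIONS_INT16_WEIGHTS_INT8]

tflite_model = converter.convert()
open("detect.tflite", "wb").write(tflite_model)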
Experiment Setup:
  • Model: ssd_mobilenet_v2 trained on COCO
  • Processor: 2 × Xeon(R) Gold 5115 CPU @ 2.40 GHz (40 cores)
  • GPU: 2 × RTX 2080 with Turing ML cores
Experiment Results: each model's [average inference time for a single image (ms), size (MB)]
TensorFlow Model Original: [33 ms, 67 MB]
TensorFlow Lite Float32: [90 ms, 65 MB]
TensorFlow Lite Float16: [60 ms, 37 MB]
TensorFlow Lite INT8: [4710 ms, 17 MB]
TensorFlow Lite ACTIVATIONS_INT16_WEIGHTS_INT8: [4709 ms, 17 MB]
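
For context, a rough sketch of how the per-image latency can be measured with the TFLite Python interpreter (the model path and iteration count are placeholders):

import time
import numpy as np
import tensorflow as tf

interpreter = tf.lite.Interpreter(model_path="detect.tflite")
interpreter.allocate_tensors()
inp = interpreter.get_input_details()[0]

# Dummy input with the model's expected shape/dtype; real measurements use COCO images.
dummy = np.zeros(inp["shape"], dtype=inp["dtype"])

times = []
for _ in range(100):
    interpreter.set_tensor(inp["index"], dummy)
    start = time.perf_counter()
    interpreter.invoke()
    times.append((time.perf_counter() - start) * 1000.0)

print("average inference time: %.1f ms" % (sum(times) / len(times)))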


Observations/Questions:
  1. Why is the inference time of the TFLite model (no optimization, i.e. Float32) about 3× slower than the original TensorFlow model?
  2. Why are INT8 and ACTIVATIONS_INT16_WEIGHTS_INT8 so dramatically slow for object detection, while for image classification they improved latency?
Any insights or suggestions on these observations would be really helpful.

Thanks,
Yashi Gupta



Jared Duke

Feb 10, 2021, 6:26:27 PM
to Yashi Gupta, TensorFlow Lite
Hi Yashi, 
  1. Why is the inference time of the TFLite model (no optimization, i.e. Float32) about 3× slower than the original TensorFlow model?
TFLite is primarily optimized for mobile deployment, which generally means most kernel optimizations target ARM CPUs. We are working to improve this for x86 and will be enabling some substantial optimizations for float models on x86 in the near future, but some discrepancy is expected. Another thing to note is that TFLite is single-threaded by default, whereas TF is multi-threaded by default. That alone could account for the difference, and you can tweak the number of threads when using TFLite if that is critical for your deployment, including from the Python Interpreter (see the sketch below).
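
A minimal sketch of setting the thread count from the Python API (the model path here is a placeholder):

import tensorflow as tf

# TFLite runs single-threaded unless num_threads is set explicitly.
interpreter = tf.lite.Interpreter(model_path="detect_float32.tflite", num_threads=4)
interpreter.allocate_tensors()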
 
  2. Why are INT8 and ACTIVATIONS_INT16_WEIGHTS_INT8 so dramatically slow for object detection, while for image classification they improved latency?
It's hard to say without more details of the model. Are you able to provide a link to the float32 and int8 variants of your model? Again, quantized execution has primarily been optimized for ARM CPU deployment, and we have observed slowdowns using int8 execution vs float for x86 deployment. However, the difference should not be anywhere near as significant as what was observed, so that looks quite suspect and we can help investigate.

It would also be useful to know which version of TensorFlow you used to author/convert the models. Thanks.

Jared




Chen-Kim Who

Aug 16, 2022, 2:25:01 PM
to TensorFlow Lite, jdd...@google.com, TensorFlow Lite
hi,
Recently I have been benchmarking the inference speed of quantized vs. non-quantized TFLite models, both converted from the same pre-trained TensorFlow model (a *.pb file).
The problem is that when I compare their inference speed on Android phones, the (int) quantized version is always slower than the non-quantized one. I tried the benchmark on both TensorFlow 2.8.1 and 2.9.1 and got the same results. Do you have any suggestions to help me identify the root cause? Any feedback is appreciated!

Thanks,
Zhenqing Hu

Chen-Kim Who

Aug 16, 2022, 2:57:22 PM
to TensorFlow Lite, Chen-Kim Who, jdd...@google.com, TensorFlow Lite
hi, all,
Sorry to bother you again so soon.
I just figured out something that may partly answer my own question. I passed the command-line option --use_xnnpack=false when running inference on both the quantized and non-quantized TFLite models, and this time the result looks normal: the quantized version takes less time than the non-quantized one. As far as I know, XNNPACK helps speed up inference for floating-point models, but does it support integer models as well? According to this post https://github.com/google/XNNPACK/issues/999#issuecomment-848314091, quantized support is disabled by default. Is there any more recent information about integer support?

Any feedback is appreciated!

Thanks,
Zhenqing Hu



Marat Dukhan

Aug 16, 2022, 8:16:23 PM
to TensorFlow Lite, huz...@gmail.com, Jared Duke, TensorFlow Lite
Hi Zhenqing,

You need to build TFLite with --define tflite_with_xnnpack_qs8=true Bazel option (tflite_with_xnnpack_qu8=true if you use unsigned quantization) to enable quantized inference by default.
See the XNNPACK delegate README for the latest information on supported operators.

Regards,
Marat

Chen-Kim Who

Aug 16, 2022, 8:26:15 PM
to TensorFlow Lite, mar...@google.com, Chen-Kim Who, jdd...@google.com, TensorFlow Lite
hi, Marat,
Thank you for your quick response! That approach makes sense. I just want to quickly double-check how to enable the "tflite_with_xnnpack_qs8" flag: per my understanding, I need to pass it when building TensorFlow Lite from the TF source code, right?

Thanks,
Zhenqing Hu

Marat Dukhan

Aug 16, 2022, 8:28:08 PM
to Chen-Kim Who, TensorFlow Lite, jdd...@google.com
Right, you need to pass --define tflite_with_xnnpack_qs8=true when building TFLite with Bazel.

Chen-Kim Who

Aug 16, 2022, 8:35:55 PM
to TensorFlow Lite, mar...@google.com, TensorFlow Lite, jdd...@google.com, Chen-Kim Who
Thank you so much for the clarification! I will give it a try based on your suggestion and report back here once I have some results. Many thanks once again!

Chen-Kim Who

Aug 17, 2022, 2:34:05 PM
to TensorFlow Lite, Chen-Kim Who, mar...@google.com, TensorFlow Lite, jdd...@google.com
hi, Marat,
Yesterday I rebuilt the TF source (r2.9.0) with --define tflite_with_xnnpack_qs8=true and --define tflite_with_xnnpack_qu8=true on the Bazel build command line, but the quantized version still looks slower than the non-quantized one (I pass --use_xnnpack=true during inference). I even put the two flags in the .bazelrc file and rebuilt the source, with the same result.

Rethinking what I have done so far: my training framework is based on TF 1.x code, which I converted to TF 2.x with the tf_upgrade_v2 tool. I then trained the model, saved it as a *.pb file, and converted it to TFLite in both quantized and non-quantized versions. I am wondering if there is any problem with this procedure.

As always, any feedback is welcomed!

Thanks,
Zhenqing Hu

Marat Dukhan

Aug 17, 2022, 2:37:27 PM
to Chen-Kim Who, TensorFlow Lite, jdd...@google.com
You may find XNNPACK profiling useful to detect which operators are slow.

Regards,
Marat

Chen-Kim Who

Aug 17, 2022, 2:42:01 PM
to TensorFlow Lite, mar...@google.com, TensorFlow Lite, jdd...@google.com, Chen-Kim Who
Thank you! I will do this profiling and take a look at what's going on. Many thanks!