Legitimate method to run a quantized model on a server?


t kevin

May 24, 2021, 9:35:31 PM
to TensorFlow Lite

hi guys,

I’m trying to optimize my model with 8-bit integer quantization for performance.
From what I learned from Post-training quantization  |  TensorFlow Model Optimization,
the only way for TF to run an integer-quantized model is through the TFLite runtime.
I’m trying to deploy the service in the cloud on a powerful CPU server with a bunch of HW accelerators.
Right now we are running with the native TF runtime and TF Serving, and it’s working well.
It sounds like TFLite is not designed for this scenario, and some articles say the TFLite CPU kernel implementations are not the best fit for servers.
Please let me know the legitimate method to run a quantized model in the cloud.
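
For reference, this is roughly the post-training full-integer quantization flow I'm following, sketched from the docs (the saved-model path, input shape, and representative dataset below are placeholders for my actual setup):

import numpy as np
import tensorflow as tf

# Placeholder generator: yields calibration samples shaped like the model input.
def representative_dataset():
    for _ in range(100):
        yield [np.random.rand(1, 224, 224, 3).astype(np.float32)]

converter = tf.lite.TFLiteConverter.from_saved_model("./saved_model")  # placeholder path
converter.optimizations = [tf.lite.Optimize.DEFAULT]
converter.representative_dataset = representative_dataset
# Restrict to int8 ops so weights and activations are fully integer-quantized.
converter.target_spec.supported_ops = [tf.lite.OpsSet.TFLITE_BUILTINS_INT8]
converter.inference_input_type = tf.int8
converter.inference_output_type = tf.int8

with open("model_int8.tflite", "wb") as f:
    f.write(converter.convert())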

Thank you very much.

Kevin

Hyeonjong Ryu

May 25, 2021, 7:43:22 PM
to TensorFlow Lite, kevi...@gmail.com
Hi Kevin,

We've recently been working on x86 acceleration, and I believe many of the CPU kernel issues are now fixed. If you still see performance issues with your model, please share more details about your model and running environment so we can investigate.

Thanks,
Hyeonjong

T.J. Alumbaugh

May 25, 2021, 8:19:55 PM
to TensorFlow Lite, Hyeonjong Ryu, kevi...@gmail.com
Hi Kevin,

You are correct that the mobile platform continues to be our primary use case. However, as Hyeonjong mentions, we have improved performance here recently. One important note about getting good performance on x86: not every compiler has good support for the necessary AVX intrinsics. I would advise you to use one of the following:

Clang: version 8 or higher
GCC: version 9 or higher
MSVC: _MSC_VER 1920 or higher (Visual Studio 2019, version 16.0)

If you compile TF Lite with any of the above, you will likely get the best performance on x86. Good luck!
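
In case it helps, here is a rough sketch of serving a quantized model from Python once it's built (the model path, thread count, and input are placeholders, and int8 input/output is assumed from a full-integer conversion); num_threads lets the interpreter use multiple server cores:

import numpy as np
import tensorflow as tf

# Placeholder path and thread count; num_threads spreads work across server cores.
interpreter = tf.lite.Interpreter(model_path="model_int8.tflite", num_threads=8)
interpreter.allocate_tensors()

input_details = interpreter.get_input_details()[0]
output_details = interpreter.get_output_details()[0]

# Quantize a float input into int8 using the input tensor's scale/zero-point.
scale, zero_point = input_details["quantization"]
x = np.random.rand(*input_details["shape"]).astype(np.float32)  # placeholder input
x_int8 = np.clip(np.round(x / scale + zero_point), -128, 127).astype(np.int8)

interpreter.set_tensor(input_details["index"], x_int8)
interpreter.invoke()
y_int8 = interpreter.get_tensor(output_details["index"])

# Dequantize the output back to float for downstream use.
out_scale, out_zero_point = output_details["quantization"]
y = (y_int8.astype(np.float32) - out_zero_point) * out_scale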

-T.J.

t kevin

May 25, 2021, 11:29:03 PM
to T.J. Alumbaugh, TensorFlow Lite, Hyeonjong Ryu
Hi Hyeonjong and T.J.

Thanks for your reply.
I have no problem with the fact that the mobile platform continues to
be TFLite's primary use case.

My question is probably more about the design perspective:
cloud inference services could also benefit from quantization (both
performance and power consumption), so why did TF choose to bind
quantized inference to TFLite rather than supporting it in the native
runtime?

I'll see what I can do with TFLite.
Thanks again for your clarification.

Kevin

T.J. Alumbaugh <talu...@google.com> wrote on Wednesday, May 26, 2021 at 8:19 AM: