Rounding differences between TF and gemmlowp

Parviz Palangpour

Nov 7, 2017, 12:16:59 PM
to gemmlowp

I am interested in building a library for new hardware that is compatible with quantized TensorFlow/gemmlowp.


Some of the gemmlowp quantization code differs from the TensorFlow quantization code. Should we always assume gemmlowp has the newest/best practices for deep learning quantization?

For instance, during requantization of accumulator values, TensorFlow appears to round half-up: https://github.com/tensorflow/tensorflow/blob/r1.4/tensorflow/core/kernels/quantization_utils.h#L263

While gemmlowp appears to round away from zero: https://github.com/google/gemmlowp/blob/master/fixedpoint/fixedpoint.h#L263


Also, can you comment more generally on success with 8-bit quantization in practice?

In a July GitHub thread (https://github.com/tensorflow/models/issues/1879#issuecomment-314152798), Pete Warden said 

“We haven't tried quantization with this model, and I suspect that ResNet-style architectures may not tolerate quantization well from an accuracy standpoint (since they're so deep).”

Can you comment on any success with gemmlowp and large high-accuracy models? Is the above concern about accuracy just because it wasn’t trained using FakeQuant operations? I know in a thread from August (https://groups.google.com/forum/#!topic/gemmlowp/LS4Q-mwxoqw) you said 

“In practice, retraining for quantization has been key to achieving low enough accuracy degradation in order for quantization to be shippable in applications that I know about.”

So is gemmlowp only targeted at smaller “mobile-friendly” models like MobileNet or is it intended to be used for larger, more expensive/accurate models as well?


Thanks!

Benoit Jacob

Nov 7, 2017, 12:32:40 PM
to Parviz Palangpour, gemmlowp
On Tue, Nov 7, 2017 at 12:16 PM, Parviz Palangpour <parvizpa...@gmail.com> wrote:

I am interested in building a library for new hardware that is compatible with quantized TensorFlow/gemmlowp.


Some of the gemmlowp quantization code differs from the TensorFlow quantization code. Should we always assume gemmlowp has the newest/best practices for deep learning quantization?

Current best practices are the ones explained in this doc:


See the associated example code.

The gist of it is that the only entry point people should use is GemmWithOutputPipeline, and this doc/example explains how to build an output pipeline that implements, in a principled way, the quantization scheme that we recommend.

Older entry points such as Gemm and EightBitIntGemm should be considered legacy/deprecated.
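
For concreteness, here is a rough sketch of that usage, modeled on the quantization example that ships in the gemmlowp repo (doc/quantization_example.cc). Treat it as a sketch only: the output-stage name may differ between gemmlowp versions, include paths depend on your setup, and the offset/multiplier/shift values below are placeholders rather than real quantization parameters.

#include <cstdint>
#include <tuple>

#include "public/gemmlowp.h"       // adjust include paths to your checkout
#include "public/output_stages.h"

void QuantizedGemm(const std::uint8_t* lhs_data, const std::uint8_t* rhs_data,
                   std::uint8_t* result_data, int rows, int depth, int cols) {
  // Offsets added to each uint8 operand entry; typically the negated
  // zero points of the quantized LHS/RHS (placeholder values here).
  const std::int32_t lhs_offset = -128;
  const std::int32_t rhs_offset = -128;

  gemmlowp::MatrixMap<const std::uint8_t, gemmlowp::MapOrder::RowMajor>
      lhs(lhs_data, rows, depth);
  gemmlowp::MatrixMap<const std::uint8_t, gemmlowp::MapOrder::ColMajor>
      rhs(rhs_data, depth, cols);
  gemmlowp::MatrixMap<std::uint8_t, gemmlowp::MapOrder::ColMajor>
      result(result_data, rows, cols);

  // Requantize the int32 accumulators back to uint8: fixed-point multiply,
  // rounding right shift, add the output offset, then saturate-cast to uint8.
  gemmlowp::OutputStageQuantizeDownInt32ToUint8ScaleByFixedPoint quantize_down;
  quantize_down.result_fixedpoint_multiplier = 1 << 30;  // placeholder
  quantize_down.result_shift = 8;                        // placeholder
  quantize_down.result_offset_after_shift = 128;         // placeholder
  gemmlowp::OutputStageSaturatingCastToUint8 saturating_cast;
  const auto output_pipeline = std::make_tuple(quantize_down, saturating_cast);

  gemmlowp::GemmContext context;
  gemmlowp::GemmWithOutputPipeline<std::uint8_t, std::uint8_t,
                                   gemmlowp::DefaultL8R8BitDepthParams>(
      &context, lhs, rhs, &result, lhs_offset, rhs_offset, output_pipeline);
}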

 

For instance, during requantization of accumulator values, TensorFlow appears to round half-up: https://github.com/tensorflow/tensorflow/blob/r1.4/tensorflow/core/kernels/quantization_utils.h#L263

Indeed, looks like it. Rounding half-up, instead of fully implementing round-to-nearest (i.e. rounding negative values half-down), has been observed to result in significant loss of end-to-end accuracy. Recently I noticed that the ARM NEON "rounding shift right" instructions (RSHL with a negative shift amount; RSHR), which we had been relying on, were also rounding half-up, and that was found to be a cause of end-to-end accuracy loss. We got help from ARM on that: they confirmed this and contributed an optimized fix-up, so we now have correct round-to-nearest behavior by still applying RSHL but combining it with "fixup" arithmetic around it:
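
To make the difference concrete, here is a plain scalar sketch of a right shift with correct round-to-nearest, breaking ties away from zero. This is only an illustration under my own naming (it is not the NEON code, and not necessarily line-for-line what gemmlowp ships); the naive (x + (1 << (exponent - 1))) >> exponent idiom is exactly the half-up behavior described above:

#include <cassert>
#include <cstdint>

// Scalar sketch: divide x by 2^exponent, rounding to nearest and breaking
// ties away from zero. The naive (x + (1 << (exponent - 1))) >> exponent
// rounds half-up, which mis-rounds negative ties: -3/2 should give -2, not -1.
// Assumes the usual arithmetic right shift for negative signed values.
std::int32_t RoundingRightShift(std::int32_t x, int exponent) {
  assert(exponent >= 1 && exponent <= 30);
  const std::int32_t mask = (1 << exponent) - 1;
  const std::int32_t remainder = x & mask;  // non-negative remainder
  // Bump the threshold by one for negative x so that an exact tie
  // (remainder == 2^(exponent-1)) rounds toward -infinity, i.e. away from zero.
  const std::int32_t threshold = (mask >> 1) + (x < 0 ? 1 : 0);
  return (x >> exponent) + (remainder > threshold ? 1 : 0);
}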

 
While gemmlowp appears to round away from zero: https://github.com/google/gemmlowp/blob/master/fixedpoint/fixedpoint.h#L263

Right, this is another place where correct round-to-nearest is very important to have. This means always breaking ties away from zero (this is still "round to nearest", not "round away from zero" in general). This is why we use SQRDMULH and not SQDMULH in the ARM NEON implementation:
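
For reference, here is a scalar sketch of what that instruction computes, essentially the reference implementation of SaturatingRoundingDoublingHighMul in gemmlowp's fixedpoint.h (reproduced from memory, so treat it as an approximation):

#include <cstdint>
#include <limits>

// Saturating rounding doubling multiply returning the high 32 bits,
// i.e. the SQRDMULH behavior: round(2*a*b / 2^32), saturating the single
// overflow case a == b == INT32_MIN. SQDMULH would truncate instead of
// rounding, which is the accuracy loss being avoided here.
std::int32_t SaturatingRoundingDoublingHighMul(std::int32_t a, std::int32_t b) {
  const bool overflow = a == b && a == std::numeric_limits<std::int32_t>::min();
  const std::int64_t ab_64 = static_cast<std::int64_t>(a) * b;
  // The nudge of +/- 2^30 before taking the high half is what makes this
  // rounded-to-nearest rather than truncated.
  const std::int32_t nudge = ab_64 >= 0 ? (1 << 30) : (1 - (1 << 30));
  const std::int32_t ab_x2_high32 =
      static_cast<std::int32_t>((ab_64 + nudge) / (1ll << 31));
  return overflow ? std::numeric_limits<std::int32_t>::max() : ab_x2_high32;
}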
 


Also, can you comment more generally on success with 8-bit quantization in practice?

We hope to be able to share more about that in the future.

 

In a July GitHub thread (https://github.com/tensorflow/models/issues/1879#issuecomment-314152798), Pete Warden said 

“We haven't tried quantization with this model, and I suspect that ResNet-style architectures may not tolerate quantization well from an accuracy standpoint (since they're so deep).”

Can you comment on any success with gemmlowp and large high-accuracy models? Is the above concern about accuracy just because it wasn’t trained using FakeQuant operations? I know in a thread from August (https://groups.google.com/forum/#!topic/gemmlowp/LS4Q-mwxoqw) you said 

“In practice, retraining for quantization has been key to achieving low enough accuracy degradation in order for quantization to be shippable in applications that I know about.”

So is gemmlowp only targeted at smaller “mobile-friendly” models like MobileNet or is it intended to be used for larger, more expensive/accurate models as well?


Thanks!

