GEMM kernel benchmark results, and new fast 8bit kernels


Benoit Jacob

May 15, 2017, 4:37:12 PM
to gemmlowp
Hi list,

This is to share some benchmark results, obtained with the stand-alone kernel benchmark that we have had for a while in standalone/neon-gemm-kernel-benchmark.cc.


I would also like to take this opportunity to give a heads-up about new 8bit kernels that offer much improved performance (see that spreadsheet).

The old kernel, now known as the 'wide' kernel, is what you still get with DefaultL8R8BitDepthParams. Its contract remains that you can pass arbitrary uint8 values for both LHS and RHS operands.

However, we recently found a way to design a much faster kernel if you can guarantee that the LHS uint8 values are never 0, i.e. are all in the range [1, 255]. In that case, we can reinterpret these values as int8 values in the range [-127, 127] by subtracting 128 (internally, this is just an adjustment of lhs_offset). That in turn allows the kernel to compute pairwise products, int8*int8 + int8*int8, in int16 accumulators instead of int32 accumulators without any risk of overflow: the only overflowing case is -128*-128 + -128*-128 = 32768, and it is averted by guaranteeing that the LHS never takes the uint8 value 0 (which is what would map to -128).
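To make the bound concrete, here is a minimal stand-alone sketch of the arithmetic (plain C++, not the actual NEON kernel):

// Sketch of the overflow argument only, not gemmlowp code. After subtracting
// 128, LHS values lie in [-127, 127] and RHS values in [-128, 127]; the fast
// kernel sums two such products into one int16 accumulator.
#include <cstdint>
#include <cstdio>

int main() {
  // With the LHS restriction, |product| <= 127 * 128 = 16256, so a pair of
  // products is at most 32512, which fits in int16 (INT16_MAX = 32767).
  const int worst_with_restriction = 2 * (127 * 128);
  // Without the restriction, both factors of both products can be -128:
  // 16384 + 16384 = 32768, which no longer fits in int16.
  const int worst_without_restriction = 2 * (128 * 128);
  std::printf("with LHS restriction:    %d (INT16_MAX = %d)\n",
              worst_with_restriction, INT16_MAX);
  std::printf("without LHS restriction: %d -> overflow\n",
              worst_without_restriction);
  return 0;
}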

You enable the faster kernel (and the contract that LHS uint8 values are nonzero) by using L8R8WithLhsNonzeroBitDepthParams.
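For reference, here is a hedged sketch of what the switch looks like at the call site, loosely following the pattern of doc/quantization_example.cc (include path, names and shapes are illustrative; an empty output pipeline is used to obtain raw int32 accumulators):

// Hedged sketch, not a drop-in example: the only change versus the 'wide'
// kernel is the BitDepthParams template argument on the Gemm entry point.
#include <cstdint>
#include <tuple>
#include <vector>

#include "public/gemmlowp.h"  // adjust the include path to your setup

void ExampleGemm(const std::vector<std::uint8_t>& lhs_data,  // must contain no zeros
                 const std::vector<std::uint8_t>& rhs_data,
                 std::vector<std::int32_t>* result_data, int rows, int depth,
                 int cols, int lhs_offset, int rhs_offset) {
  gemmlowp::GemmContext context;
  gemmlowp::MatrixMap<const std::uint8_t, gemmlowp::MapOrder::RowMajor> lhs(
      lhs_data.data(), rows, depth);
  gemmlowp::MatrixMap<const std::uint8_t, gemmlowp::MapOrder::ColMajor> rhs(
      rhs_data.data(), depth, cols);
  gemmlowp::MatrixMap<std::int32_t, gemmlowp::MapOrder::ColMajor> result(
      result_data->data(), rows, cols);
  // Swapping gemmlowp::DefaultL8R8BitDepthParams for
  // gemmlowp::L8R8WithLhsNonzeroBitDepthParams selects the faster kernel,
  // under the contract that lhs_data never contains the value 0.
  gemmlowp::GemmWithOutputPipeline<std::uint8_t, std::int32_t,
                                   gemmlowp::L8R8WithLhsNonzeroBitDepthParams>(
      &context, lhs, rhs, &result, lhs_offset, rhs_offset, std::make_tuple());
}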

The spreadsheet shows the performance difference: the new kernel now achieves roughly 23 Gop/s on a single CPU core on various shipping phones. (And yes, perhaps surprisingly, the original Nexus 5 is the king of the hill.)

Cheers,
Benoit

beeblebrox

Oct 9, 2017, 3:30:12 AM
to gemmlowp
Hi Benoit,

Can you clarify the behavior if we use L8R8WithLhsNonzeroBitDepthParams but the LHS still contains some zero values?

Thanks

Benoit Jacob

Oct 10, 2017, 11:01:44 AM
to beeblebrox, gemmlowp
Hello,

The answer to this is going to be different depending on whether you're asking about the API contract, or the actual behavior.

The API contract is that it is undefined behavior, so we make no guarantees at all. At some point we might want to add assertions guarding against this, but at the moment there are no assertions (they would have significant overhead).

The actual behavior when the LHS still contains zero values is as follows. If two LHS values on the same LHS row, exactly 8 columns apart, are both zero, and the corresponding RHS values they are multiplied against in the GEMM (in the same RHS column, exactly 8 RHS rows apart) are also both zero, then an incorrect result is produced, because integer overflow occurs. (This is an incorrect result, not undefined behavior in the sense of the C++ semantics for signed integer overflow, because this arithmetic is performed in assembly, not C++.)

In every other case the result is correct. In particular, LHS and RHS actually play symmetric roles, so if the RHS contains only nonzero values, it does not matter whether there are zero values in the LHS. Likewise, if the LHS contains only a single zero value, nothing bad can happen. But these are implementation details and we make no guarantees about them: the current API contract is intentionally much stricter than we need, so as to avoid entrenching implementation details in the API.

The idea of requiring specifically the LHS values to be nonzero is that we typically use gemmlowp in NN inference applications, where the LHS is the weights and the RHS is the activations. The weights matrix, being a constant, is easy to control so as to avoid zero values there.
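For instance, a simple caller-side check on the quantized weights can verify that contract before selecting the fast path (a hypothetical helper, not part of the gemmlowp API):

// Hypothetical helper, not part of gemmlowp: returns true if the quantized
// LHS (weights) satisfies the nonzero contract required by
// L8R8WithLhsNonzeroBitDepthParams. gemmlowp itself does not assert this.
#include <algorithm>
#include <cstdint>
#include <vector>

bool LhsSatisfiesNonzeroContract(const std::vector<std::uint8_t>& lhs_data) {
  return std::find(lhs_data.begin(), lhs_data.end(), 0) == lhs_data.end();
}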

Let me know if a change to that would help.

Cheers,
Benoit


beeblebrox

Oct 10, 2017, 12:14:30 PM
to gemmlowp
Thanks for the detailed reply, Benoit. 

The idea of requiring specifically the LHS values to be nonzero is that we typically use gemmlowp in NN inference applications, where the LHS is the weights and the RHS is the activations. The weights matrix, being a constant, is easy to control so as to avoid zero values there.

In practice, what does this mean for the float-to-uint8 conversion of the LHS? If, for example, I follow the method in quantization_example.cc, will it suffice to use qmin=1 (instead of 0) while quantizing the LHS?

Thanks

Benoit Jacob

Oct 10, 2017, 12:17:39 PM
to beeblebrox, gemmlowp
No, that would not suffice: nothing would clamp quantized values at 1, so a sufficiently low real input value could still be quantized to 0.

In your float-to-uint8 quantization code, as in this line, you need to clamp at 1 instead of clamping at 0.
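Concretely, here is a hedged sketch loosely modeled on the Quantize() helper in quantization_example.cc (exact names there may differ); the only change is the lower clamp bound:

// Sketch of an LHS-only quantization routine with the lower clamp bound
// raised from 0 to 1, so no real input value can map to the uint8 value 0.
// QuantizationParams mirrors the struct used in quantization_example.cc.
#include <algorithm>
#include <cmath>
#include <cstdint>
#include <vector>

struct QuantizationParams {
  float scale;
  std::uint8_t zero_point;
};

void QuantizeLhs(const QuantizationParams& qparams,
                 const std::vector<float>& src,
                 std::vector<std::uint8_t>* dst) {
  dst->resize(src.size());
  for (std::size_t i = 0; i < src.size(); i++) {
    const float transformed_val = qparams.zero_point + src[i] / qparams.scale;
    // Clamp to [1, 255] instead of [0, 255]: this is what establishes the
    // LHS-nonzero contract of L8R8WithLhsNonzeroBitDepthParams.
    const float clamped_val = std::max(1.f, std::min(255.f, transformed_val));
    (*dst)[i] = static_cast<std::uint8_t>(std::round(clamped_val));
  }
}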

Benoit
 


Lili LI

Jun 11, 2020, 12:16:12 AM
to gemmlowp
When I use L8R8WithLhsNonzeroBitDepthParams in place of DefaultL8R8BitDepthParams, the running time does not get any shorter.

On Tuesday, May 16, 2017 at 4:37:12 AM UTC+8, Benoit Jacob wrote: