TODO: Port the ARMv7 (32-bit) less-than-8-bit GEMM kernel to ARMv8 (64-bit)


Kang

Oct 9, 2016, 2:42:17 AM
to gemmlowp
Hi,

I want to pick this to-do item and contribute to this project.

As suggested by the to-do doc, I want to split this job into two stages:

1st Stage: Make a trivial port of the existing ARMv7 assembly code.

The output of this stage will be a new kernel in internal/kernel_neon.h named NEON_64_Kernel12x4Depth2Assuming12BitProducts.
After finishing this stage, including thoroughly testing the new kernel, I will open a pull request.

2nd Stage: Make use of the larger register space available on ARMv8.

The output of this stage will be a new kernel in internal/kernel_neon.h named NEON_64_Kernel12x8Depth2Assuming12BitProducts.


I want to ask: apart from adding new code to internal/kernel_neon.h, is there anything else that should be done, like adding new test code in the test folder?

Any other advice is also welcome :)


Best

Kang

Benoit Jacob

Oct 10, 2016, 3:33:44 PM
to Kang, gemmlowp
Hi Kang,

Nice to hear that you'll take a stab at this. Looking forward to seeing the results!

You should only need to make changes to internal/kernel_neon.h, except for one thing: you will have to teach gemmlowp to use your new kernel, by editing kernel_default.h to pick it, similarly to how the existing 7bit/5bit ARM32 kernel is registered there:
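(The snippet below is only an illustrative sketch of that registration; the GEMMLOWP_SET_DEFAULT_KERNEL macro and its parameters are placeholders here, so check kernel_default.h for the exact spelling.)

    // Illustrative sketch -- not the literal contents of kernel_default.h.
    #if defined GEMMLOWP_NEON_32
    // Existing registration: 7-bit LHS x 5-bit RHS, products fit in 12 bits.
    GEMMLOWP_SET_DEFAULT_KERNEL(/*MaxProductIsLessThan4096=*/true,
                                NEON_32_Kernel12x4Depth2Assuming12BitProducts)
    #endif

    #if defined GEMMLOWP_NEON_64
    // New registration for the aarch64 port (stage 1 kernel).
    GEMMLOWP_SET_DEFAULT_KERNEL(/*MaxProductIsLessThan4096=*/true,
                                NEON_64_Kernel12x4Depth2Assuming12BitProducts)
    #endif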


Note that only the sum of LHS + RHS bits matters. Here, 7+5=12.

No need for new tests: the current tests already exercise the 7bit/5bit paths. Just make sure that your kernel is actually being used, and check that if you introduce a bug in it, the test actually fails.

Good luck!
Benoit


Kang

Oct 10, 2016, 11:21:20 PM
to gemmlowp, zk5...@gmail.com
Hi Benoit, 

Thanks for your advice. I have noticed the macro in kernel_default.h for specifying the default kernel. And I am glad to hear that there is no need to write new unit tests :)

Once I have finished the kernel, or if I get stuck on some hard bug, I will post in this thread.

Best

Kang

Kang

Oct 19, 2016, 7:01:07 AM
to gemmlowp, zk5...@gmail.com
Hi Benoit,

I have finished the first stage and started testing the new kernel. After running the test program, I got the following log:

PASS: 32x32x32 RowMajor x ColMajor -> ColMajor, SingleThreadGemm, Kernel: NEON aarch64, 12x4, depth 2, assuming 12-bit products, offsets -75/-91/74980, mult 123, shift 18
FAIL: 33x33x33 RowMajor x ColMajor -> ColMajor, SingleThreadGemm, Kernel: NEON aarch64, 12x4, depth 2, assuming 12-bit products, offsets 0/0/0, mult 1, shift 8
    number of matrix entries: 1089
    median value: 59
    median unsigned diff: 196 (tolerating 7)
    max unsigned diff: 255 (tolerating 29)
    median signed diff: -196 (tolerating 2)
    mean signed diff: -180 (tolerating 2)
No error: 0.00 % of entries
Error in 1..1 range: 0.00 % of entries
Error in 2..3 range: 0.00 % of entries
Error in 4..7 range: 0.00 % of entries
Error in 8..15 range: 0.00 % of entries
Error in 16..31 range: 0.55 % of entries
Error in 32..63 range: 4.13 % of entries
Error in 64..127 range: 16.71 % of entries
Error in 128..255 range: 78.60 % of entries
bad coeff: at (0, 0), expected 255, got 45
bad coeff: at (1, 0), expected 255, got 72
bad coeff: at (2, 0), expected 255, got 109
bad coeff: at (3, 0), expected 255, got 32

The failing 33x33x33 test case was added by me, since I found that the new kernel fails for operands larger than 32x32x32. I have been stuck on this for a day: I carefully reread the code but did not find any bugs. I tried using gdb on this test case, but it was hard to dig into the assembly code.

Can you give me some idea of which part of the code could lead to this problem, so that I can focus on that point?

Best
Kang



Benoit Jacob

Oct 19, 2016, 11:11:16 AM
to Kang, gemmlowp
Hi Kang,

There is a specific way that the size 32 is special in this kernel, so we might have a lead here.

The way this kernel takes advantage of the lower bit depths of the LHS and RHS, adding up to 12 bits instead of 16 bits, is that since each product fits in 12 bits, one may sum as many as 16 = 2^4 such products and still fit in 16 bits. That allows using uint16 local accumulators for 16 consecutive iterations along the 'depth' dimension; see the comments in the existing ARM32 kernel.


This kernel goes one bit further: it has two sets of accumulators, one for even depth indices and one for odd depth indices. Since each can accumulate 16 iterations, together they can accumulate 32 iterations along the depth dimension; again, see the comments in the kernel code.


Thus the size 32 is the size of the local accumulation loop with 16bit local accumulators. It is only after 32 depth values that one has to accumulate those back into the global 32bit accumulators (see the corresponding code in the kernel).
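In scalar form, the overall structure is roughly the following (an illustrative sketch with made-up names; the real kernel does all of this in NEON assembly on 12x4 blocks, not on a single dot product):

    #include <cstdint>

    // Illustrative scalar sketch of the accumulation structure. A 7-bit LHS
    // value times a 5-bit RHS value is < 4096 (12 bits), so 16 such products
    // sum to at most 16 * 4095 = 65520, which still fits in a uint16_t. With
    // two sets of local accumulators (even/odd depth), 32 depth iterations can
    // run before widening into the 32-bit global accumulator.
    uint32_t Dot12BitProducts(const uint8_t* lhs, const uint8_t* rhs, int depth) {
      uint32_t global_acc = 0;                  // 32-bit global accumulator
      for (int d0 = 0; d0 < depth; d0 += 32) {  // depth assumed a multiple of 32
        uint16_t local_acc_even = 0;            // 16 products at even depth indices
        uint16_t local_acc_odd = 0;             // 16 products at odd depth indices
        for (int d = d0; d < d0 + 32; d += 2) {
          local_acc_even += lhs[d] * rhs[d];
          local_acc_odd += lhs[d + 1] * rhs[d + 1];
        }
        // Only once per 32 depth levels do we pay for widening into 32 bits.
        global_acc += local_acc_even;
        global_acc += local_acc_odd;
      }
      return global_acc;
    }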


I hope that helps.

Also, I would like to share some thoughts about how this code will likely be used going forward, which will probably differ from how it was initially positioned (this does not affect your current work directly).

Initially we had low accuracy requirements with quantized NNs, so we were happy to quantize float neural networks to 8 bits, then have gemmlowp requantize that down to less than 8 bits if that could help performance. As time passes and quantized NN technology becomes more 'productionized', accuracy requirements are going up. I expect that in the future, the norm will be to train specifically for one specific bit depth and never quantize/requantize anything after the fact.

On the one hand, that means that the requantization logic in pack.h is probably becoming obsolete.

On the other hand, the less-than-16-bit-products kernels (which you're working on) are independent of these considerations. I just expect that in the future, one will train directly for less than 8 bits and pass less-than-8-bit-depth buffers to gemmlowp directly.

Cheers,
Benoit

Kang

Oct 20, 2016, 2:21:57 AM
to gemmlowp, zk5...@gmail.com
Hi Benoit,

Thanks for your reply.

I understand the reason for using depth 32, the macros, and the 32, 24, 16, 8 labels. I will take your advice and pay special attention to that code.

As for the less-than-8-bit case: indeed, our need is to quantize the NN directly to 7 bits, so I think the next step is to combine the new L7R5 kernel with the original 8-bit packing, if my data is already quantized to 7+5 bits.


Best
Kang

Kang

Oct 20, 2016, 3:20:31 AM
to gemmlowp, zk5...@gmail.com
The bug is solved. I misused the vpaddl and vpadal instructions (uaddlp and uadalp on 64-bit), so after the main loop the register values were not accumulated into the global_accumulators.
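In case it helps anyone else, here is a minimal intrinsics illustration of the difference (made-up function names, for illustration only; the kernel itself is inline assembly):

    #include <arm_neon.h>

    // uaddlp (vpaddl on ARMv7): pairwise add long, OVERWRITES the destination.
    // uadalp (vpadal on ARMv7): pairwise add long, ACCUMULATES onto the destination.
    // The end-of-main-loop step needs the accumulating form, so that the 16-bit
    // local sums are added onto the existing 32-bit global accumulators.
    uint32x4_t AccumulateLocalIntoGlobal(uint32x4_t global_acc,
                                         uint16x8_t local_acc) {
      return vpadalq_u16(global_acc, local_acc);  // global_acc += pairwise sums
    }

    uint32x4_t WidenLocalOnly(uint16x8_t local_acc) {
      return vpaddlq_u16(local_acc);  // pairwise sums only; previous value is lost
    }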


Best
Kang


Benoit Jacob

Oct 20, 2016, 9:26:25 AM
to Kang, gemmlowp
On Thu, Oct 20, 2016 at 2:21 AM, Kang <zk5...@gmail.com> wrote:
Hi Benoit,

Thanks for your reply.

I understand the reason for using depth 32, the macros, and the 32, 24, 16, 8 labels. I will take your advice and pay special attention to that code.

As for the less-than-8-bit case: indeed, our need is to quantize the NN directly to 7 bits, so I think the next step is to combine the new L7R5 kernel with the original 8-bit packing, if my data is already quantized to 7+5 bits.

Excellent: that is also how I envision using this in the future.

Should be very easy to hook up in gemmlowp - it's what you'd get if you edited kernel_default.h and made it use the 7+5 kernel instead of the current one for the 8+8 case.
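Sketching that edit in the same illustrative style as before (placeholder macro and parameter names; the real dispatch is in kernel_default.h), the registration that currently handles the plain 8+8 case would simply point at the 12-bit-products kernel instead:

    // Illustrative sketch only. If the caller guarantees its data is already
    // quantized to 7+5 bits, the ordinary 8+8 registration can be pointed at
    // the 12-bit kernel:
    #if defined GEMMLOWP_NEON_64
    GEMMLOWP_SET_DEFAULT_KERNEL(/*MaxProductIsLessThan4096=*/false,
                                NEON_64_Kernel12x4Depth2Assuming12BitProducts)
    #endif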
 

Benoit Jacob

Oct 20, 2016, 9:26:53 AM
to Kang, gemmlowp
Great to hear you solved the issue!

Kang Zhang

Oct 28, 2016, 6:21:11 AM
to gemmlowp, zk5...@gmail.com
Hi Benoit,

After finishing the stage one work, I am wondering: should I make a pull request for this first stage?

Besides this, I have two other issues to discuss with you.

The first issue is about what to do next. The current NEON_64_12x4_12bits kernel uses local accumulators stored in memory, and I think transferring data between memory and registers in the inner computation loop is a waste of time; I understand that there are not that many registers on the 32-bit NEON platform. In the todo doc, you mentioned that the next stage is to implement a NEON_64_12x8_12bits kernel, but from my point of view the NEON_64_12x8_12bits kernel would also need to use local accumulators in memory. So I have two directions: one is NEON_64_12x4_12bits without memory accumulators, the other is NEON_64_12x8_12bits with memory accumulators. Which one is better?

The second issue is about one specific function for convolutional NNs in computer vision, namely im2col(). In our tests of small CNN models on ARM, im2col() takes about 1/3 of the time of a convolution layer, so I think we could use NEON load/store instructions to accelerate this function. Though this is not part of gemmlowp, I want to ask: do you have any experience with this?

Best
Kang

Benoit Jacob

Oct 28, 2016, 8:06:07 AM
to Kang Zhang, gemmlowp
On Fri, Oct 28, 2016 at 6:21 AM, Kang Zhang <zk5...@gmail.com> wrote:
Hi Benoit,

After finishing the stage one work, I am wondering: should I make a pull request for this first stage?

That would be great!
 

Besides this, I have two other issues to discuss with you.

The first issue is about what to do next. The current NEON_64_12x4_12bits kernel uses local accumulators stored in memory, and I think transferring data between memory and registers in the inner computation loop is a waste of time; I understand that there are not that many registers on the 32-bit NEON platform. In the todo doc, you mentioned that the next stage is to implement a NEON_64_12x8_12bits kernel, but from my point of view the NEON_64_12x8_12bits kernel would also need to use local accumulators in memory. So I have two directions: one is NEON_64_12x4_12bits without memory accumulators, the other is NEON_64_12x8_12bits with memory accumulators. Which one is better?

That's a good question! I hadn't thought of it, but going memory-accumulators-free is another interesting take on what to do with the extra register space.
Only measurements will tell which goes faster. I'll happily take whichever kernel turns out faster in your experiments.
 
The second issue is about one specific function for convolutional NNs in computer vision, namely im2col(). In our tests of small CNN models on ARM, im2col() takes about 1/3 of the time of a convolution layer, so I think we could use NEON load/store instructions to accelerate this function. Though this is not part of gemmlowp, I want to ask: do you have any experience with this?

I guess that will depend on the layer sizes and shapes. With what I have over here (lots of 1x1 layers not requiring any im2col, the rest mostly 3x3), im2col isn't a big concern. Also, our im2col implementation uses memcpy, so it is already making as good use of SIMD as we are going to get.
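For concreteness, here is a generic sketch of that memcpy-based pattern (a hypothetical standalone helper, not gemmlowp code or the exact im2col implementation referred to above; it assumes channels-last uint8 data, stride 1, no padding, dilation 1):

    #include <cstdint>
    #include <cstring>

    // Generic memcpy-based im2col sketch (hypothetical helper, illustration only).
    // With channels-last layout, each kernel row of a patch is kw * channels
    // contiguous bytes, so the inner copy is a single memcpy, which the C library
    // already vectorizes about as well as hand-written NEON would.
    void Im2ColU8(const uint8_t* in, int height, int width, int channels,
                  int kh, int kw, uint8_t* out) {
      const int out_h = height - kh + 1;
      const int out_w = width - kw + 1;
      const int span = kw * channels;
      for (int y = 0; y < out_h; ++y) {
        for (int x = 0; x < out_w; ++x) {
          for (int ky = 0; ky < kh; ++ky) {
            const uint8_t* src = in + ((y + ky) * width + x) * channels;
            std::memcpy(out, src, span);
            out += span;
          }
        }
      }
    }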

Cheers,
Benoit

Benoit Jacob

Dec 16, 2016, 10:22:08 AM
to Kang Zhang, gemmlowp