TODO: Port the ARMv7 (32-bit) less-than-8-bit GEMM kernel to ARMv8 (64-bit)


Kang

Oct 9, 2016, 2:42:17 AM
to gemmlowp
Hi,

I want to pick this to-do item and contribute to this project.

As suggested by the to-do doc, I want to split this job into two stages:

1st Stage: Make a trivial port of the existing ARMv7 assembly code.

The output of this stage will be a new kernel in internal/kernel_neon.h named NEON_64_Kernel12x4Depth2Assuming12BitProducts.
After finishing this stage, including thoroughly testing the new kernel, I will open a pull request.

2nd Stage: Make use of the larger register space available on ARMv8.

The output of this stage will be a new kernel in internal/kernel_neon.h named NEON_64_Kernel12x8Depth2Assuming12BitProducts.


I want to ask: apart from adding new code to internal/kernel_neon.h, is there anything else that should be done, like adding new test code in the test folder?

Any other advice is also welcome :)


Best

Kang

Benoit Jacob

Oct 10, 2016, 3:33:44 PM
to Kang, gemmlowp
Hi Kang,

Nice to hear that you'll take a stab at this. Looking forward to seeing the results!

You should only need to make changes to internal/kernel_neon.h, except for one thing: you will have to teach gemmlowp to use your new kernel, by editing kernel_default.h to pick it, similarly to how the existing 7bit/5bit ARM32 kernel is registered there:
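(The snippet below is only an illustrative sketch of that registration; the GEMMLOWP_SET_DEFAULT_KERNEL macro and its parameters are placeholders here, so check kernel_default.h for the exact spelling.)

    // Illustrative sketch -- not the literal contents of kernel_default.h.
    #if defined GEMMLOWP_NEON_32
    // Existing registration: 7-bit LHS x 5-bit RHS, products fit in 12 bits.
    GEMMLOWP_SET_DEFAULT_KERNEL(/*MaxProductIsLessThan4096=*/true,
                                NEON_32_Kernel12x4Depth2Assuming12BitProducts)
    #endif

    #if defined GEMMLOWP_NEON_64
    // New registration for the aarch64 port (stage 1 kernel).
    GEMMLOWP_SET_DEFAULT_KERNEL(/*MaxProductIsLessThan4096=*/true,
                                NEON_64_Kernel12x4Depth2Assuming12BitProducts)
    #endif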


Note that only the sum of LHS + RHS bits matters. Here, 7+5=12.

No need for new tests: the current tests already exercise the 7bit/5bit paths. Just make sure that your kernel is actually being used, and check that if you introduce a bug in it, the test actually fails.

Good luck!
Benoit


Kang

Oct 10, 2016, 11:21:20 PM
to gemmlowp, zk5...@gmail.com
Hi Benoit, 

Thanks for your advice. I have noticed the macro in kernel_default.h for specifying the default kernel. And I am glad to hear that there is no need to write new unit tests :)

Once I have finished the kernel, or if I get stuck on some hard bug, I will post in this thread.

Best

Kang

Kang

Oct 19, 2016, 7:01:07 AM
to gemmlowp, zk5...@gmail.com
Hi Benoit,

I have finished the first stage and started testing the new kernel. After running the test program, I got the following log:

PASS: 32x32x32 RowMajor x ColMajor -> ColMajor, SingleThreadGemm, Kernel: NEON aarch64, 12x4, depth 2, assuming 12-bit products, offsets -75/-91/74980, mult 123, shift 18
FAIL: 33x33x33 RowMajor x ColMajor -> ColMajor, SingleThreadGemm, Kernel: NEON aarch64, 12x4, depth 2, assuming 12-bit products, offsets 0/0/0, mult 1, shift 8
    number of matrix entries: 1089
    median value: 59
    median unsigned diff: 196 (tolerating 7)
    max unsigned diff: 255 (tolerating 29)
    median signed diff: -196 (tolerating 2)
    mean signed diff: -180 (tolerating 2)
No error: 0.00 % of entries
Error in 1..1 range: 0.00 % of entries
Error in 2..3 range: 0.00 % of entries
Error in 4..7 range: 0.00 % of entries
Error in 8..15 range: 0.00 % of entries
Error in 16..31 range: 0.55 % of entries
Error in 32..63 range: 4.13 % of entries
Error in 64..127 range: 16.71 % of entries
Error in 128..255 range: 78.60 % of entries
bad coeff: at (0, 0), expected 255, got 45
bad coeff: at (1, 0), expected 255, got 72
bad coeff: at (2, 0), expected 255, got 109
bad coeff: at (3, 0), expected 255, got 32

The failing 33x33x33 test case was added by me, since I found that the new kernel fails for operands larger than 32x32x32. I have been stuck on this for a day: I carefully reread the code but did not find any bugs. I tried using gdb on this test case, but it was hard to dig into the assembly code.

Can you give me some idea of which part of the code could lead to this problem, so that I can focus on that point?

Best
Kang



Benoit Jacob

Oct 19, 2016, 11:11:16 AM
to Kang, gemmlowp
Hi Kang,

There is a specific way that the size 32 is special in this kernel, so we might have a lead here.

The way this kernel takes advantage of the lower bit depths of the LHS and RHS, adding up to 12 bits instead of 16 bits, is that since each product fits in 12 bits, one may sum as many as 16 = 2^4 such products and still fit in 16 bits. That allows using uint16 local accumulators for 16 consecutive iterations along the 'depth' dimension; see the comments in the existing ARM32 kernel.


This kernel goes one bit further: it has two sets of accumulators, one for even depth indices and one for odd depth indices. Since each can accumulate 16 iterations, together they can accumulate 32 iterations along the depth dimension; again, see the comments in the kernel code.


Thus the size 32 is the size of the local accumulation loop with 16bit local accumulators. It is only after 32 depth values that one has to accumulate those back into the global 32bit accumulators (see the corresponding code in the kernel).
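In scalar form, the overall structure is roughly the following (an illustrative sketch with made-up names; the real kernel does all of this in NEON assembly on 12x4 blocks, not on a single dot product):

    #include <cstdint>

    // Illustrative scalar sketch of the accumulation structure. A 7-bit LHS
    // value times a 5-bit RHS value is < 4096 (12 bits), so 16 such products
    // sum to at most 16 * 4095 = 65520, which still fits in a uint16_t. With
    // two sets of local accumulators (even/odd depth), 32 depth iterations can
    // run before widening into the 32-bit global accumulator.
    uint32_t Dot12BitProducts(const uint8_t* lhs, const uint8_t* rhs, int depth) {
      uint32_t global_acc = 0;                  // 32-bit global accumulator
      for (int d0 = 0; d0 < depth; d0 += 32) {  // depth assumed a multiple of 32
        uint16_t local_acc_even = 0;            // 16 products at even depth indices
        uint16_t local_acc_odd = 0;             // 16 products at odd depth indices
        for (int d = d0; d < d0 + 32; d += 2) {
          local_acc_even += lhs[d] * rhs[d];
          local_acc_odd += lhs[d + 1] * rhs[d + 1];
        }
        // Only once per 32 depth levels do we pay for widening into 32 bits.
        global_acc += local_acc_even;
        global_acc += local_acc_odd;
      }
      return global_acc;
    }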


I hope that helps.

Also, I would like to share some thoughts about how this code will likely be used going forward, which will probably differ from how it was initially positioned (this does not affect your current work directly).

Initially we had low accuracy requirements with quantized NNs, so we were happy to quantize float neural networks to 8 bits, then have gemmlowp requantize that down to less than 8 bits if that could help performance. As time passes and quantized NN technology becomes more 'productionized', accuracy requirements are going up. I expect that in the future, the norm will be to train specifically for one specific bit depth and never quantize/requantize anything after the fact.

On the one hand, that means that the requantization logic in pack.h is probably becoming obsolete.

On the other hand, the less-than-16-bit-products kernels (which you're working on) are independent of these considerations. I just expect that in the future, one will train directly for less than 8 bits and pass less-than-8-bit-depth buffers to gemmlowp directly.

Cheers,
Benoit

Kang

Oct 20, 2016, 2:21:57 AM
to gemmlowp, zk5...@gmail.com
Hi Benoit,

Thanks for your reply.

I understand the reason for using depth 32, the macros, and the 32, 24, 16, 8 labels. I will take your advice and pay special attention to that code.

As for the less-than-8-bit case: indeed, our need is to quantize the NN directly to 7 bits, so I think the next step is to combine the new L7R5 kernel with the original 8-bit packing, if my data is already quantized to 7+5 bits.


Best
Kang

Kang

Oct 20, 2016, 3:20:31 AM
to gemmlowp, zk5...@gmail.com
The bug is solved. I misused the vpaddl and vpadal instructions (uaddlp and uadalp on 64-bit), so after the main loop the register values were not accumulated into the global_accumulators.
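In case it helps anyone else, here is a minimal intrinsics illustration of the difference (made-up function names, for illustration only; the kernel itself is inline assembly):

    #include <arm_neon.h>

    // uaddlp (vpaddl on ARMv7): pairwise add long, OVERWRITES the destination.
    // uadalp (vpadal on ARMv7): pairwise add long, ACCUMULATES onto the destination.
    // The end-of-main-loop step needs the accumulating form, so that the 16-bit
    // local sums are added onto the existing 32-bit global accumulators.
    uint32x4_t AccumulateLocalIntoGlobal(uint32x4_t global_acc,
                                         uint16x8_t local_acc) {
      return vpadalq_u16(global_acc, local_acc);  // global_acc += pairwise sums
    }

    uint32x4_t WidenLocalOnly(uint16x8_t local_acc) {
      return vpaddlq_u16(local_acc);  // pairwise sums only; previous value is lost
    }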


Best
Kang


Benoit Jacob

Oct 20, 2016, 9:26:25 AM
to Kang, gemmlowp
On Thu, Oct 20, 2016 at 2:21 AM, Kang <zk5...@gmail.com> wrote:
Hi Benoit,

Thanks for your reply.

I understand the reason for using depth 32, the macros, and the 32, 24, 16, 8 labels. I will take your advice and pay special attention to that code.

As for the less-than-8-bit case: indeed, our need is to quantize the NN directly to 7 bits, so I think the next step is to combine the new L7R5 kernel with the original 8-bit packing, if my data is already quantized to 7+5 bits.

Excellent: that is also how I envision using this in the future.

Should be very easy to hook up in gemmlowp - it's what you'd get if you edited kernel_default.h and made it use the 7+5 kernel instead of the current one for the 8+8 case.
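Sketching that edit in the same illustrative style as before (placeholder macro and parameter names; the real dispatch is in kernel_default.h), the registration that currently handles the plain 8+8 case would simply point at the 12-bit-products kernel instead:

    // Illustrative sketch only. If the caller guarantees its data is already
    // quantized to 7+5 bits, the ordinary 8+8 registration can be pointed at
    // the 12-bit kernel:
    #if defined GEMMLOWP_NEON_64
    GEMMLOWP_SET_DEFAULT_KERNEL(/*MaxProductIsLessThan4096=*/false,
                                NEON_64_Kernel12x4Depth2Assuming12BitProducts)
    #endif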
 

Benoit Jacob

Oct 20, 2016, 9:26:53 AM
to Kang, gemmlowp
Great to hear you solved the issue!

Kang Zhang

Oct 28, 2016, 6:21:11 AM
to gemmlowp, zk5...@gmail.com
Hi Benoit,

After finishing the stage one work, I am wondering: should I make a pull request for this first stage?

Besides this, I have two other issues to discuss with you.

The first issue is about what to do next. The current NEON_64_12x4_12bits kernel uses local accumulators stored in memory, and I think transferring data between memory and registers in the inner computation loop is a waste of time; I understand that there are not that many registers on the 32-bit NEON platform. In the todo doc, you mentioned that the next stage is to implement a NEON_64_12x8_12bits kernel, but from my point of view the NEON_64_12x8_12bits kernel would also need to use local accumulators in memory. So I have two directions: one is NEON_64_12x4_12bits without memory accumulators, the other is NEON_64_12x8_12bits with memory accumulators. Which one is better?

The second issue is about one specific function for convolutional NNs in computer vision, namely im2col(). In our tests of small CNN models on ARM, im2col() takes about 1/3 of the time of a convolution layer, so I think we could use NEON load/store instructions to accelerate this function. Though this is not part of gemmlowp, I want to ask: do you have any experience with this?

Best
Kang

Benoit Jacob

Oct 28, 2016, 8:06:07 AM
to Kang Zhang, gemmlowp
On Fri, Oct 28, 2016 at 6:21 AM, Kang Zhang <zk5...@gmail.com> wrote:
Hi Benoit,

After finishing the stage one work, I am wondering: should I make a pull request for this first stage?

That would be great!
 

Besides this, I have two other issues to discuss with you.

The first issue is about what to do next. The current NEON_64_12x4_12bits kernel uses local accumulators stored in memory, and I think transferring data between memory and registers in the inner computation loop is a waste of time; I understand that there are not that many registers on the 32-bit NEON platform. In the todo doc, you mentioned that the next stage is to implement a NEON_64_12x8_12bits kernel, but from my point of view the NEON_64_12x8_12bits kernel would also need to use local accumulators in memory. So I have two directions: one is NEON_64_12x4_12bits without memory accumulators, the other is NEON_64_12x8_12bits with memory accumulators. Which one is better?

That's a good question! I hadn't thought of it, but going memory-accumulators-free is another interesting take on what to do with the extra register space.
Only measurements will tell which goes faster. I'll happily take whichever kernel turns out faster in your experiments.
 
The second issue is about one specific function for convolutional NNs in computer vision, namely im2col(). In our tests of small CNN models on ARM, im2col() takes about 1/3 of the time of a convolution layer, so I think we could use NEON load/store instructions to accelerate this function. Though this is not part of gemmlowp, I want to ask: do you have any experience with this?

I guess that will depend on the layer sizes and shapes. With what I have over here (lots of 1x1 layers not requiring any im2col, the rest mostly 3x3), im2col isn't a big concern. Also, our im2col implementation uses memcpy, so it is already making as good use of SIMD as we are going to get.
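For concreteness, here is a generic sketch of that memcpy-based pattern (a hypothetical standalone helper, not gemmlowp code or the exact im2col implementation referred to above; it assumes channels-last uint8 data, stride 1, no padding, dilation 1):

    #include <cstdint>
    #include <cstring>

    // Generic memcpy-based im2col sketch (hypothetical helper, illustration only).
    // With channels-last layout, each kernel row of a patch is kw * channels
    // contiguous bytes, so the inner copy is a single memcpy, which the C library
    // already vectorizes about as well as hand-written NEON would.
    void Im2ColU8(const uint8_t* in, int height, int width, int channels,
                  int kh, int kw, uint8_t* out) {
      const int out_h = height - kh + 1;
      const int out_w = width - kw + 1;
      const int span = kw * channels;
      for (int y = 0; y < out_h; ++y) {
        for (int x = 0; x < out_w; ++x) {
          for (int ky = 0; ky < kh; ++ky) {
            const uint8_t* src = in + ((y + ky) * width + x) * channels;
            std::memcpy(out, src, span);
            out += span;
          }
        }
      }
    }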

Cheers,
Benoit

Benoit Jacob

Dec 16, 2016, 10:22:08 AM
to Kang Zhang, gemmlowp