how about sparse or binary matrix multiplication?


Mou Li

Jun 7, 2017, 3:31:03 AM
to gemmlowp
Hi all,
  I've run the new gemmlowp non-zero kernel on ARMv8 devices and it looks pretty fast.
  Recently I ran into a problem with a matrix multiplication A*B=C where A is sparse (contains many zeros) and its nonzero values are only {1, -1}, but I have little idea how to optimize for this case. I tried a new kernel that traverses the elements of A with a switch() on each value, but it does not run any faster. I wonder if anyone can give me some ideas.
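For reference, what I tried is roughly the following (a simplified scalar sketch rather than my actual NEON kernel, with made-up names; since the nonzero values of A are only 1 and -1, each multiply becomes a conditional add or subtract):

#include <cstdint>

// Simplified scalar sketch. A is MxK with entries in {-1, 0, 1} (many zeros),
// B is KxN uint8, C is MxN int32 accumulators.
void GemmSignOnlyLhs(const std::int8_t* A, const std::uint8_t* B,
                     std::int32_t* C, int M, int K, int N) {
  for (int m = 0; m < M; ++m) {
    for (int n = 0; n < N; ++n) {
      std::int32_t acc = 0;
      for (int k = 0; k < K; ++k) {
        switch (A[m * K + k]) {
          case 1:  acc += B[k * N + n]; break;
          case -1: acc -= B[k * N + n]; break;
          default: break;  // zero entry: skip
        }
      }
      C[m * N + n] = acc;
    }
  }
}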


Thanks!

Benoit Jacob

Jun 7, 2017, 7:59:21 AM
to Mou Li, gemmlowp
Sparse is out of the current scope of gemmlowp. As you point out, it is not trivial to achieve higher performance with sparse, and that is typically only achieved when the sparsity is very high (certainly over 90% zeros) due to the inherent inefficiency of branching and of sparse memory accesses.
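To make the branching/memory-access point concrete, here is a minimal sketch of a CSR-format sparse-times-dense product (not gemmlowp code, just an illustration with made-up names); the per-nonzero indirection, and the data-dependent choice of which rows of the dense matrix get touched, are what make it hard to beat a well-vectorized dense kernel unless the vast majority of the work is actually skipped:

#include <cstdint>
#include <vector>

// Minimal CSR (MxK sparse) times dense (KxN) sketch with int32 accumulation.
// values/col_index hold the nonzeros of A row by row; row_start has M+1
// entries. C is assumed to be zero-initialized.
void CsrTimesDense(const std::vector<std::int8_t>& values,
                   const std::vector<int>& col_index,
                   const std::vector<int>& row_start,
                   const std::uint8_t* B, std::int32_t* C, int M, int N) {
  for (int m = 0; m < M; ++m) {
    for (int idx = row_start[m]; idx < row_start[m + 1]; ++idx) {
      const int k = col_index[idx];        // data-dependent index
      const std::int32_t a = values[idx];
      for (int n = 0; n < N; ++n) {
        C[m * N + n] += a * B[k * N + n];  // which rows of B get read is unpredictable
      }
    }
  }
}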

Binary matrices are also out of the current scope of gemmlowp. While we would like to generalize gemmlowp to lower-than-8-bit depths, going down to the extreme case of 1 bit may require deeper changes to fully take advantage of it. I haven't really thought about it much.

Cheers,
Benoit


Yaman Umuroglu

Aug 1, 2017, 5:14:43 PM
to gemmlowp, lxg...@gmail.com
Hi Benoit, Mou,

Inspired by gemmlowp, I started working on a library for few-bit integer matrix multiplication to support deep neural networks with highly quantized (typically <3 bits) weights and activations:


The core idea is vectorized bit-serial computation, using AND-popcount operations to process one bit position at a time, but across multiple columns of the matrix, as described here:


I have the NEON and "generic" (using gcc's __builtin_popcount) paths somewhat optimized so far, and the performance is currently around 150 GOPS / (bits per lhs element * bits per rhs element) on a single Cortex-A57 core, so this is faster than gemmlowp when the product of the bitwidths is less than ~7. There is currently some conversion overhead for packing/transposing the bits, though, so the practical speedups are somewhat smaller.
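As a rough illustration of the inner loop those paths implement (a generic scalar sketch with made-up names, not the actual library code): each operand is stored as one packed bit-plane per bit position, and the dot product of an m-bit row with an n-bit column becomes m*n AND+popcount passes weighted by the corresponding powers of two (unsigned operands, no sign handling, for simplicity):

#include <cstdint>

// Generic scalar sketch of a bit-serial dot product.
// lhs_bits[i] / rhs_bits[j] point to the packed bit-plane for bit position
// i / j of the two operands, each 'words' uint64_t long.
std::int64_t BitSerialDot(const std::uint64_t* const* lhs_bits, int lhs_nbits,
                          const std::uint64_t* const* rhs_bits, int rhs_nbits,
                          int words) {
  std::int64_t acc = 0;
  for (int i = 0; i < lhs_nbits; ++i) {
    for (int j = 0; j < rhs_nbits; ++j) {
      std::int64_t popcount_sum = 0;
      for (int w = 0; w < words; ++w) {
        popcount_sum += __builtin_popcountll(lhs_bits[i][w] & rhs_bits[j][w]);
      }
      acc += popcount_sum << (i + j);  // weight by 2^(i+j)
    }
  }
  return acc;
}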

If this sounds interesting, let me know! I understand that gemmlowp focuses on 8-bit operations, but it would be very cool to support <8 bit ops efficiently under the same roof as well.

Regards,
Yaman

Benoit Jacob

Aug 1, 2017, 5:36:30 PM
to Yaman Umuroglu, gemmlowp, Mou Li
Sounds intriguing; here is a spreadsheet listing Gop/s for gemmlowp 8-bit kernels on various ARM cores:

I'm primarily interested in row 17 (L8R8WithLhsNonzeroBitDepthParams on ARM 64bit); as you can see in cell K17, that gives 22 Gop/s on a Nexus 5X Cortex-A57 core (1.82 GHz), or > 12 ops/cycle (cell K50).

For what bit depth are you getting higher performance than that?

(Note - I am counting a multiply-add as 2 ops. Multiplication of two NxN matrices is counted as 2N^3 ops).
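Spelling out that arithmetic: 22 Gop/s divided by 1.82 GHz is about 12.1 ops/cycle, i.e. roughly 6 multiply-adds per cycle on that core.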

From investigations around here, it seems that at least 7-bit, and possibly to some extent even just 5-bit, may be enough precision for the key vision CNNs that we care about. I have not yet seen an actual practical application of 4-bit-and-less bit depths on actual production mobile models (there are many papers on arxiv about NN inference with 4 bits and less, but they tend either to evaluate only large academic/server models, which are inherently less sensitive to bit depth, or to accept unreasonably large accuracy degradation; keep in mind that in many application domains, one can only tolerate accuracy degradations on the order of 1% before lower bit depths cease to be worthwhile).

Benoit

Benoit Jacob

Aug 1, 2017, 5:55:46 PM
to Yaman Umuroglu, gemmlowp, Mou Li
I see that your benchmark results in the preprint are on actual GEMMs, not just GEMM kernel micro-benchmarks like my spreadsheet.

Please still confirm whether you've benchmarked gemmlowp with L8R8WithLhsNonzeroBitDepthParams.

Benchmarking actual GEMMs, as you know, introduces 3 new parameters (the GEMM shape, which you call M, K, N in Table 2). That table refers to a very large neural network, "AlexNet", whose GEMM shapes are much larger than what we typically see in the kind of mobile NNs that gemmlowp is being optimized for. I would find it very interesting if you could benchmark some actual production mobile NNs, such as MobileNets (that still leaves quite a range of sizes, from ~0.01 to ~1 billion ops per inference; maybe you could benchmark a mid-size MobileNet at ~0.1 billion ops).
A typical matrix size there is between 50 and 500, so it would already be very interesting if you could benchmark just (M,K,N) == (100,100,100).
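If it helps, a minimal harness for that (100,100,100) case against gemmlowp could look roughly like the following; this is written from memory of the public headers, using the legacy Gemm entry point with arbitrary quantization parameters, so please treat it as a sketch rather than a reference:

#include <cstdint>
#include <vector>

#include "public/gemmlowp.h"

int main() {
  const int M = 100, K = 100, N = 100;
  // Fill the lhs with a nonzero value, as required by the
  // L8R8WithLhsNonzeroBitDepthParams contract.
  std::vector<std::uint8_t> lhs_data(M * K, 1), rhs_data(K * N, 1),
      result_data(M * N, 0);

  using gemmlowp::MapOrder;
  gemmlowp::MatrixMap<const std::uint8_t, MapOrder::RowMajor> lhs(
      lhs_data.data(), M, K);
  gemmlowp::MatrixMap<const std::uint8_t, MapOrder::ColMajor> rhs(
      rhs_data.data(), K, N);
  gemmlowp::MatrixMap<std::uint8_t, MapOrder::ColMajor> result(
      result_data.data(), M, N);

  gemmlowp::GemmContext context;
  // Arbitrary quantization parameters, just to exercise the kernel; wrap the
  // call below in a timing loop and divide 2*M*K*N by the mean latency.
  gemmlowp::Gemm<std::uint8_t, gemmlowp::L8R8WithLhsNonzeroBitDepthParams>(
      &context, lhs, rhs, &result, /*lhs_offset=*/-128, /*rhs_offset=*/-128,
      /*result_offset=*/0, /*result_mult_int=*/1, /*result_shift=*/8);
  return 0;
}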

Benoit

Yaman Umuroglu

Aug 1, 2017, 8:18:17 PM
to Benoit Jacob, gemmlowp, Mou Li
The 150 GOPS figure for a single core was for binary bitwidths on both matrices (and I also counted ops as 2*m*n*k). The previous version of the spreadsheet you linked to is where I got my baseline for the paper on the Nexus 5X Cortex-A57. The bit-serial approach slows down proportionally to the product of the bitwidths of the two matrices, so if binary is 7x faster than 8-bit ops, anything beyond e.g. 2-bit weights and 3-bit activations is better done with gemmlowp anyway.
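To make the crossover concrete with these numbers: 2-bit weights and 3-bit activations give 150 / (2*3) = 25 Gop/s, still slightly above the 22 Gop/s of the 8-bit kernel, while 2-bit weights and 4-bit activations give 150 / 8 ≈ 19 Gop/s, already below it; hence the crossover at a bitwidth product of about 7.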

I've worked with highly quantized networks (say, 1-bit weights and 2-bit activations) for a while, and for e.g. ImageNet the accuracy drop is still significant (at least 5-6%), as you say. Though I'm not sure why 1% should be the limit of tolerance for accuracy drop; do you have any good reading/references on why 1%?

Yaman

Benoit Jacob

Aug 2, 2017, 10:02:54 AM
to Yaman Umuroglu, gemmlowp, Mou Li, Dmitry Kalenichenko, Andrew Howard
CC'd a couple of team members & coauthors of the MobileNet paper, just FYI / for the record.
Dmitry, Andrew -- this is a thread on the gemmlowp mailing list, which is public. I'm trying to summarize some things we know from the MobileNet paper about accuracy/cost tradeoffs for mobile models. I'm also sharing some additional thoughts that may or may not have been publicly shared before; I hope that's fine and that I'm not too far off (I'm not an NN specialist). Correct me as needed.

On Tue, Aug 1, 2017 at 8:18 PM, Yaman Umuroglu <malt...@gmail.com> wrote:
> Though I'm not sure why 1% should be the limit of tolerance for accuracy drop; do you have any good reading/references on why 1%?

Section 4 in the MobileNet paper https://arxiv.org/pdf/1704.04861.pdf, especially Figure 4 "ImageNet accuracy vs Mult-Adds", is the clearest data I know of on this subject. There may be a lot more data out there that I wouldn't know about; I'm not a researcher and I don't follow the literature closely.

(Since then another group outside of Google has pushed further in a similar direction -- ShuffleNet from Face++, https://arxiv.org/abs/1707.01083)

Either way, the general idea is that once you care about a production application, you want to pick the best compromise between accuracy and cost (especially computational cost, expressed as a count of ops). So you have these production mobile NN architectures such as MobileNet, which have a couple of hyperparameters that you can adjust for each application to select the best such compromise. That means that MobileNet is not one NN model, but a whole collection of NN models, which one can put on an xy-plot where the x-axis is cost and the y-axis is accuracy. That is what Figure 4 in the MobileNet paper does. Once you have that, you can think of a degradation of accuracy by a certain amount as being equivalent to an increase in NN cost, just by moving along that xy-plot.

It so happens (note how Figure 4 uses a log scale, showing a log-dependence of accuracy on cost) that once you have models optimized to offer a competitive accuracy/cost compromise, a small change in accuracy typically corresponds to a large change in cost on that xy-plot. For example, on Figure 4 you can see that a change in ImageNet accuracy of 5 percentage points corresponds to a change in cost by a factor of roughly 2x to 5x, depending on where you are in the graph. That means that a quantization approach which loses 5 percentage points of ImageNet accuracy must offer a performance boost of at least that factor to make up for it. Of course, it can still be worthwhile if the performance boost is sufficiently large. On a higher level: accuracy is performance; to be less accurate is to be slower (once you look at it from an application's perspective).

Thus, there is no hard "1%" rule. But looking at data such as that Figure 4 in the MobileNet paper, a rough rule-of-thumb over here is this (and keep in mind this is very application-domain-dependent so anything I say here can only be very rough):
 - A 1% drop in accuracy is typically OK;
 - A 2% drop in accuracy is a concern, possibly still OK-ish;
 - A 5% drop in accuracy is a big problem; the associated change in model cost is typically 2x or higher, so must be factored into any performance analysis.

However, when a given *inference-only* quantization approach turns out to result in too large a degradation of accuracy, not all is lost: the next question is how much accuracy can be reclaimed by retraining the model specifically for that quantization scheme, typically by using (or simulating) the same quantization scheme during the forward pass of training. (Training is not my area, so I can't say more.) In practice, retraining for quantization has been key to keeping the accuracy degradation low enough for quantization to be shippable in the applications that I know about.

Benoit

Yaman Umuroglu

Aug 2, 2017, 10:54:51 AM
to Benoit Jacob, gemmlowp, Mou Li, Dmitry Kalenichenko, Andrew Howard
Thanks for the clarifications regarding accuracy drop vs model complexity -- this was how I thought about it as well. But when comparing networks that use w-bit weights and a-bit activations (possibly with different bitwidths in different layers), expressing the operation cost in MACs becomes a bit harder: from a hardware perspective, 1 million multiply-adds on 8-bit numbers and 1 million multiply-adds on 2-bit numbers don't have the same cost. I guess one could use the product of activation and weight bitwidths per layer as a "scaling factor" when summing the MACs for a fairer comparison (at least this is how it works out in bit serial).
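One way to write that down (just a rough heuristic on my part, not an established metric) would be something like

    effective_cost = sum over layers l of (MACs_l * w_l * a_l)

with w_l and a_l the weight and activation bitwidths of layer l, so that an 8x8-bit MAC counts 64x as much as a binary one, which matches how the bit-serial runtime scales.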

I've read about the MobileNets family of networks, and I remember the efficiency (accuracy per MAC or per parameter) was quite impressive -- are there any benchmarks on using gemmlowp to deploy those on ARM cores? This would be a strong baseline for comparison.

Regarding matrix sizes vs achievable performance, I'm planning to run a sweep across many different sizes with both gemmlowp and my bit-serial approach, and I can share the results here when they are ready. One advantage of aggressive quantization is the higher arithmetic intensity, so it should be possible to reach a higher percentage of peak performance at smaller matrix sizes when working with highly-quantized networks.


Benoit Jacob

Aug 2, 2017, 11:32:18 AM
to Yaman Umuroglu, gemmlowp, Mou Li, Dmitry Kalenichenko, Andrew Howard
Of course, we agree that the cost of multiplication depends on the bit-depth of numbers. Accounting for that is one of the things that is implicitly achieved by looking at actual benchmark numbers for specific implementations on specific hardware.

We, too, would very much like to be able to share benchmark results for MobileNets running on gemmlowp. Unfortunately, the neural network inference stack that we use, which gives compelling performance characteristics, is not currently public, so at this point I wouldn't be able to share more than latency numbers that would not be reproducible outside of Google, which is probably not worthwhile. What I can say is that it is based on gemmlowp's L8R8WithLhsNonzeroBitDepthParams, and that, for 100x100 matrix multiplication on a single big core of a Nexus 5X (Cortex-A57), gemmlowp achieves roughly 9.5 Gop/s (only about 40% of the 22.5 Gop/s kernel efficiency in the aforementioned spreadsheet, which underlines the difficulty of being efficient on small GEMMs).