Faster matrix multiplication in Go

2,453 views
Skip to first unread message

Harry de Boer

unread,
Mar 29, 2012, 6:25:18 PM3/29/12
to golan...@googlegroups.com
I have been toying a bit with matrix multiplication in Go.
Two things helped to speed things up:
  1. For large matrices use the Strassen algorithm, or better, Winograd's variant with Douglas memory placement.
  2. For small matrices, use the level 1 BLAS Daxpy function from https://github.com/ziutek/blas

Below is a selection of my benchmarks (on amd64). More and code at https://github.com/harrydb/go/tree/master/matrix

Note it is not my intention to write a full matrix package, gomatrix is great. I just wanted to teach myself how to implement stuff like this and see if I could get things faster.

Cheers,
Harry

	BenchmarkMulDouglas__1024	       1	1876168000 ns/op
BenchmarkMulStrassPar1024-2 1 1296087000 ns/op

BenchmarkMulDouglas__512 5 264489000 ns/op
BenchmarkMulStrassPar512-2 10 199333100 ns/op
BenchmarkMulStrassen_512 5 298305000 ns/op
BenchmarkMulBLAS_____512 5 562627800 ns/op
BenchmarkMulSimple___512 5 636601200 ns/op
BenchmarkMulGomatrix_512-2 5 783658800 ns/op
BenchmarkMulGomatrix_512 5 659225400 ns/op

BenchmarkMulDouglas__256 50 36059340 ns/op
BenchmarkMulStrassPar256-2 100 34300640 ns/op
BenchmarkMulStrassen_256 50 41638480 ns/op
BenchmarkMulBLAS_____256 50 54319880 ns/op
BenchmarkMulSimple___256 20 75649650 ns/op
BenchmarkMulGomatrix_256-2 20 93713050 ns/op
BenchmarkMulGomatrix_256 20 78700000 ns/op

BenchmarkMulBLAS______32 20000 95448 ns/op
BenchmarkMulSimple____32 10000 147572 ns/op
BenchmarkMulGomatrix__32 10000 146323 ns/op
BenchmarkMulGomatrix__32-2 10000 311446 ns/op


John Asmuth

unread,
Mar 29, 2012, 8:00:27 PM3/29/12
to golan...@googlegroups.com
On Thursday, March 29, 2012 6:25:18 PM UTC-4, Harry de Boer wrote:

Note it is not my intention to write a full matrix package, gomatrix is great.

I have it on good authority that the gomatrix author welcomes contributions. :) 

Harry de Boer

unread,
Mar 29, 2012, 8:47:13 PM3/29/12
to golan...@googlegroups.com

Hehe, I suspected that :-). I have lots of other stuff to do but 'll see if I can create a patch for you. 

By the way, there seems to be something wrong with the parallel support in gomatrix, it actually runs (a lot) slower than with a single thread.

Harry

Harry de Boer

unread,
Apr 1, 2012, 7:52:27 PM4/1/12
to golan...@googlegroups.com
Hi John,

I made a patch for gomatrix here:
https://code.google.com/r/harrydb1984-gomatrix/source/detail?r=84769ea496a863439878b1cb6fac5bb6c6fd3ea2
I am not really sure how google code works, but I thought this would be easiest.

A couple of things to consider:
1) Are you willing to depend on another package (the blas package)? It does speed up things a lot on 64 bit and only one function is used.
2) The Strassen algorithms use some additional memory. The variant with smart memory placement (timesDouglas) not so much (< 2/3 N^2), the parallel strassen one does need quite a lot more so it will run out of memory earlier. I don't know of a good way to make that better. The current parallel things in gomatrix seem to perform badly anyway (see first post)...

Some benchmarks:
BenchmarkMulGomatrix_1024           1    2081065000 ns/op
BenchmarkMulGomatrix_1024-2         5    1168112600 ns/op
BenchmarkMulGomatrix__512          10     289222700 ns/op
BenchmarkMulGomatrOld_512           5     659225400 ns/op
BenchmarkMulGomatrix__512-2        20     172936750 ns/op
BenchmarkMulGomatrOld_512-2         5     783658800 ns/op
BenchmarkMulGomatrix__256         100      39155540 ns/op
BenchmarkMulGomatrOld_256          20      78700000 ns/op
BenchmarkMulGomatrix__256-2       100      28732680 ns/op
BenchmarkMulGomatrOld_256-2        20      93713050 ns/op
BenchmarkMulGomatrix__128        1000       4757531 ns/op
BenchmarkMulGomatrix__128-2      1000       4721023 ns/op
BenchmarkMulGomatrix___64       10000        551217 ns/op
BenchmarkMulGomatrix___64-2     10000        547711 ns/op
BenchmarkMulGomatrix___32       50000         87238 ns/op
BenchmarkMulGomatrOld__32       10000        146323 ns/op
BenchmarkMulGomatrix___32-2     50000         85021 ns/op
BenchmarkMulGomatrOld__32-2     10000        311446 ns/op

I noticed this in the tests:
https://code.google.com/p/gomatrix/source/browse/matrix/dense_test.go#163
It only sets a local variable, you probably meant to call runtime.GOMAXPROCS() there right?

Cheers,
Harry


On Friday, March 30, 2012 2:00:27 AM UTC+2, John Asmuth wrote:

John Asmuth

unread,
Apr 2, 2012, 12:57:00 PM4/2/12
to golan...@googlegroups.com
Probably best would be to make a separate package with a type that satisfies the matrix.Matrix interface. For now that'd be best in a distinct repository.

I'm taking a look at https://github.com/ziutek/blas - looks really useful. The fact that it doesn't use cgo is a huge plus, too.

If you create a separate package with a type BLASMatrix or something, we can talk about moving it into the gomatrix repository. It will have to stay a separate package, though - gomatrix has no 3rd party deps on purpose. Maybe ziutek wants to add his code in though, who knows?

Could be as simple as type BLASMatrix struct { *DenseMatrix }, with a new .Times() method overwriting the old one.
Reply all
Reply to author
Forward
0 new messages