matrix multiplication in go.matrix (dense_arithmatic.go)

1,353 views
Skip to first unread message

bernhard...@gmail.com

unread,
Jul 11, 2014, 1:52:08 PM7/11/14
to golan...@googlegroups.com
Hello,

I am trying to multiply matrices with the go.matrix package from skelterjohn (John Asmuth?) on github https://github.com/skelterjohn/go.matrix, which works nicely.
BUT I need to multiply lots of big matrices and I find this:

Multiplying two 1000x9 float64 matrices about
14 seconds

Multiplying two 2000x9 float64 matrices about
2 minutes

It seems to me that the main culprit for this non-linear behavior is the multiplication function 'parTimes2(A, B, C *DenseMatrix)' in dense_arithmetic.go. Here is the non-functional code on the playground: http://play.golang.org/p/gziWHDEo7m

I have read the discussion between Dmitry Vyukov and John Asmuth here: https://groups.google.com/forum/#!msg/golang-nuts/RVX-BhHcugM/cEAYKvN4AFYJ.
It does not provide any clues as to what 'parTimes2(A, B, C *DenseMatrix)' does on a conceptual/abstract level. I do understand that it is somehow scheduler and cache optimized. But it does not help to understand the sharp performance degradation above.

Help is appreciated, for example cache diagnostics (with go profiling) etc.. If anybody can explain the idea/algorithm behind the parTimes2 function, that would be great.

Bernhard

Dan Kortschak

unread,
Jul 11, 2014, 5:19:26 PM7/11/14
to bernhard...@gmail.com, golan...@googlegroups.com
This isn't an answer to you question, but have you tried one of the other matrix packages either with or without C backend?
Message has been deleted

bernhard...@gmail.com

unread,
Jul 14, 2014, 7:59:13 AM7/14/14
to golan...@googlegroups.com, bernhard...@gmail.com
Never mind, I appreciate inspirational tangential (well meaning) thinking :-). And inspirational you were, believe it or not, I did not think of this option. Here is the preliminary result (adding value for other people reading this post). 
1) I searched and found that a guy named Harry de Boer had written a patch for dense_arithmetic.go https://groups.google.com/forum/#!topic/golang-nuts/Cl_D7PIiCns. Here: https://code.google.com/r/harrydb1984-gomatrix/source/browse/matrix#matrix. It works well, doubling the size of matrices I can factorize. QR-factoring two 2000x9 float64 now takes about 9 secs. Also Harry introduced me to Strassen's and Douglas' methods for multiplying matrices. Which is useful and gives me words to hunt for on google for better matrix-multiplication. Although be aware that Harry's patch was probably written in a hurry, so he does not check for the second multiplicand to be square (Strassen and Douglas only work for square matrices -> same number of rows and columns). For rectangular matrices, harry's code has to fall through to the TimesBlas method, which he also provides, using code from https://github.com/ziutek/blas.

2) Unfortunately Harry's patch performance falls off beyond 2000 rows, so as a second option, I am trying a C-backed option, using your bindings here: https://code.google.com/p/biogo/source/browse/blas.go?repo=blas.

I did not think that efficient matrix multiplication is so hard. But it really is a show-stopper, when you try to work with large matrices.
Bernhard

Dan Kortschak

unread,
Jul 14, 2014, 8:08:03 AM7/14/14
to bernhard...@gmail.com, golan...@googlegroups.com
I was thinking more of gonum/matrix than my bindings in biogo which I no
longer maintain (I care for the gonum package now). There is a QR
decomposition in that package.

On Mon, 2014-07-14 at 04:52 -0700, bernhard...@gmail.com wrote:
> Never mind, I appreciate inspirational tangential (well meaning)
> thinking :-). And inspirational you were, believe it or not, I did not
> think of this option. Here is the preliminary result (adding value for
> other people reading this post).
> 1) I searched and found that a guy named Harry de Boer had written a
> patch for dense_arithmetic.go https://groups.google.com/forum/#!
> topic/golang-nuts/Cl_D7PIiCns.
> Here: https://code.google.com/r/harrydb1984-gomatrix/source/browse/matrix#matrix. It works well, doubling the size of matrices I can factorize. QR-factorizing two 2000x9 float64 now takes about 9 secs. Also Harry introduced me to Strassen's and Douglas' methods for multiplying matrices. Which is useful and gives me words to hunt for on google for better matrix-multiplication. Although be aware that Harry's patch was probably written in a hurry, so he does not check for the second multiplicand to be square (Strassen and Douglas only work for square matrices -> same number of rows and columns). For rectangular matrices, harry's code has to fall through to the TimesBlas method, which he also provides, using code from https://github.com/ziutek/blas.
>
>
> 2) Unfortunately Harry's patch performance falls off beyond 2000 rows,
> so as a second option, I am trying a C-backed option, using your
> bindings
> here: https://code.google.com/p/biogo/source/browse/blas.go?repo=blas.
>
>
> I did not think that efficient matrix multiplication is so hard. But
> it really is a show-stopper, when you try to work with large matrices.
> Bernhard

--
Omnes mundum facimus.

Dan Kortschak <dan.ko...@adelaide.edu.au>
F9B3 3810 C4DD E214 347C B8DA D879 B7A7 EECC 5A40
10C7 EEF4 A467 89C9 CA00 70DF C18F 3421 A744 607C

bernhard...@gmail.com

unread,
Jul 14, 2014, 8:30:13 AM7/14/14
to golan...@googlegroups.com, bernhard...@gmail.com
Ok, thanks, did not come across this one. Can you provide some top-of-your-head explanations please. Like:
1) What are the dependencies when I use gonum/matrix? I prefer to have a minimal version, and there are lots of packages under gonum.
2) Your code is binding to C libs in the blas package. I guess your - only - interface towards the c-libs is here cblas.h and blas.go?
3) Does your Blas package provide all three levels?
4) I hope, guess, assume that you are using the concurrency tools (goroutines) that golang is providing, i.e. your package is NOT single-threaded, right?

Bernhard

Dan Kortschak

unread,
Jul 14, 2014, 8:54:55 AM7/14/14
to bernhard...@gmail.com, golan...@googlegroups.com
On Mon, 2014-07-14 at 05:30 -0700, bernhard...@gmail.com wrote:
> 1) What are the dependencies when I use gonum/matrix? I prefer to have
> a minimal version, and there are lots of packages under gonum.

github.com/gonum/matrix/mat64

then choose a driver package (there are a few within that repo).

> 2) Your code is binding to C libs in the blas package. I guess your -
> only - interface towards the c-libs is here cblas.h and blas.go?

the cgo driver package is implemented via is blas.go and cblas.h to
whatever C BLAS implementation you want.

> 3) Does your Blas package provide all three levels?

depends on the driver used. cblas covers all three since it just passes
work onto the underlying lib. others are not so complete, but are being
worked on by others.

> 4) I hope, guess, assume that you are using the concurrency tools
> (goroutines) that golang is providing, i.e. your package is NOT
> single-threaded, right?

like many components in the Go ecosystem, they are designed to
facilitate use in concurrent systems. we do not do parallel
multiplication, but it is trivial to implement a parallel mul using the
mat64 primitives that we provide (views allow this easily). measure
before you try; the current cblas implementation is competitive.


Harry de Boer

unread,
Jul 16, 2014, 11:43:49 AM7/16/14
to golan...@googlegroups.com, bernhard...@gmail.com
Hey Bernard!

I didn't do much go since I started working last year but I plan to pick it up again. You finding my old stuff finally sparked me to post a message again :)

I'm not quite sure about the current state of gomatrix but I remember that when I played with it for big matrices the parallel version was a lot slower than the singlethread version. If I remember correctly it had a subtle bug that made it actually really cache unfriendly. I did spend some time to fix this, not sure if I ever finished it but I'll try to dig it up.

Until then: try the non-parallel version of times on your big matrix, my guess is that it is a lot faster ;-)

Regards,
Harry

bernhard...@gmail.com

unread,
Jul 16, 2014, 5:11:33 PM7/16/14
to golan...@googlegroups.com, bernhard...@gmail.com
Hello Kortschak,

I am trying to use your wrapper for the sample code below (simply multiplying two small matrices) but I always get this error message when building './gonum_tst02.go:56: not enough arguments in call to blas.Float64.Dgemm'. I am using cblas from atlas with the compilation of cblas.go like so, '#cgo linux LDFLAGS: -L/usr/lib64/atlas-sse3 -lcblas'.  'go install ./...' runs without problems.

I am pulling my hair out. All definitions of Dgemm I can find anywhere in your package have exactly thirteen parameters, like my call to Dgemm. Any idea whats wrong.
Bernhard

package main

import (
    "fmt"
)

type DgemmCase struct {
m, n, k     int
alpha, beta float64
a           [][]float64
b           [][]float64
c           [][]float64
aflat       []float64
bflat       []float64
cflat       []float64
}


func main() {
var Agc DgemmCase = DgemmCase{
m:        4, 
n:        3, 
k:        2, 
alpha:    1, 
beta:     1, 
aflat: []float64{1, 2, 4, 5, 7, 8, 10, 11},
bflat: []float64{1, 5, 6, 5, -8, 8},
cflat: []float64{0, 0, 0,0, 0, 0,0, 0, 0},
}
lda := len(Agc.a[0])
ldb := len(Agc.b[0])
ldc := len(Agc.c[0])

//Dgemm(tA, tB Transpose, m, n, k int, alpha float64, a []float64, lda int, b []float64, ldb int, beta float64, c []float64, ldc int)
blas.Float64.Dgemm(blas.NoTrans, blas.NoTrans, Agc.m, Agc.n, Agc.k, 1.0, Agc.aflat, lda, Agc.bflat, ldb, 1.0, Agc.cflat, ldc)
fmt.Print(cFlat)
}

Brendan Tracey

unread,
Jul 16, 2014, 5:16:11 PM7/16/14
to bernhard...@gmail.com, golan...@googlegroups.com
The blas.Float64 type is an interface definition, not an actual method.

you probably want something like


import(
)

func main(){
     ….
cblas.Blas{}.Dgemm(arguments)
}

Let me know if that doesn’t work.

--
You received this message because you are subscribed to a topic in the Google Groups "golang-nuts" group.
To unsubscribe from this topic, visit https://groups.google.com/d/topic/golang-nuts/krg4ES2BHIE/unsubscribe.
To unsubscribe from this group and all its topics, send an email to golang-nuts...@googlegroups.com.
For more options, visit https://groups.google.com/d/optout.

Dan Kortschak

unread,
Jul 16, 2014, 5:54:25 PM7/16/14
to bernhard...@gmail.com, golan...@googlegroups.com, bernhard...@gmail.com
As Brendan says, you are calling the method on an interface definition. I'm surprised at the error message, it could be more helpful, but call cblas.Blas{}.Dgemm(... instead.

What you are doing is this: http://play.golang.org/p/EfGS36LrLn

What you need to be doing is one of these: http://play.golang.org/p/EfGS36LrLn (we normally do the second form)

The first one explains why the error message is what it is; it is possible to call a method as a function (what you are trying to do erroneously) if you provide a value of the type in the first position. You are providing this, so the compiler is complaining of a missing parameter.

Dan Kortschak

unread,
Jul 16, 2014, 6:08:40 PM7/16/14
to Dan Kortschak, bernhard...@gmail.com, golan...@googlegroups.com
As as aside, why interact directly with BLAS, the interface is horrible and much less safe.

bernhard...@gmail.com

unread,
Jul 17, 2014, 12:06:54 PM7/17/14
to golan...@googlegroups.com, bernhard...@gmail.com
Thank you guys. It is working now. @kortschak The reason why I am using dgemm sort of directly is that I wanted to learn about go and calling c-libs in go. You have done great work, and for people where productivity is the most important aspect, your package is great. I also think that more explanations about your package are good too. You can use what I have written below on your github project (no attribution necessary). I think it is great when people get a deeper understanding of your package and can use it with more ease. And maybe you can explain what the triple points in 'go install ./...' mean. At first I was puzzled, then I remembered xpath, where it is used as well, I think, as a way of recursively descending down a tree structure.  

Here is what I did (again value for other readers).
1) Install 'ATLAS libraries for SSE3 extensions' on my fedora 20 linux system: atlas-sse3.x86_64 and the devel package atlas-sse3-devel.x86_64. Contains the cblas.h file.
2) In github.com/gonum/blas/cblas/blas.go, comment out everything starting with #cgo and insert/adapt these lines:
#cgo linux LDFLAGS: -L/usr/lib64/atlas-sse3 -L/usr/include -lcblas
#include <cblas.h>
That way we are exactly matching libs and .h files of the blas implementation we are using.
3) Here is a simple go-file simply multiplying two small matrices.
<start of file>
package main

import (
    "fmt"
)

type DgemmCase struct {
m, n, k     int
alpha, beta float64
aflat       []float64
bflat       []float64
cflat       []float64
}


func main() {
var Agc DgemmCase = DgemmCase{
m:        4, // rows, first multiplicand
n:        3, // cols, second multiplicand
k:        2, // joint dimension
alpha:    1, // scalar multiplicand for matrix A. Here set to 1 as we do not need it. 
beta:     1, // scalar multiplicand for result matrix C. Here set to 1 as we do not need it.
aflat: []float64{1, 2, 4, 5, 7, 8, 10, 11}, // represents a 4 rows x 2 cols matrix
bflat: []float64{1, 5, 6, 5, -8, 8}, // represents a 2 rows x 3 cols matrix
cflat: make([]float64, 4*3), // represents the 4 rows x 3 cols result matrix
}
lda := Agc.k
ldb := Agc.n
ldc := Agc.n

//Dgemm(tA, tB Transpose, m, n, k int, alpha float64, a []float64, lda int, b []float64, ldb int, beta float64, c []float64, ldc int)
// 111 means 'the matrix is not transposed'
cblas.Blas{}.Dgemm(111, 111, Agc.m, Agc.n, Agc.k, Agc.alpha, Agc.aflat, lda, Agc.bflat, ldb, Agc.beta, Agc.cflat, ldc)
fmt.Printf("%#v\n", Agc.cflat)
}
<end of file>

Dan Kortschak

unread,
Jul 17, 2014, 5:22:57 PM7/17/14
to bernhard...@gmail.com, golan...@googlegroups.com, bernhard...@gmail.com
I'm glad you have it working now and thanks for the offer. I won't include examples in cblas (although we have put some work into making the build instructions easier to understand) - the intention is that people either use a front end API like mat64 or already have experience with the BLAS interface (or be able to muddle trough on their own) since it's too big a job to provide support for that interface; I would not be able to do it justice.

Comment on style, if you import gonum/blas, you get named constants for NoTrans and others. This makes the code more maintainable/readable.

bernhard...@gmail.com

unread,
Jul 18, 2014, 6:28:05 AM7/18/14
to golan...@googlegroups.com, bernhard...@gmail.com
Hello Harry, 

thanks for your answer. As you can see above, I managed to get the atlas libs working with go. Still, I am curious about the issue and it is a great opportunity to learn more about golang.

What I would like to ask you is maybe you can give me sort of a sketchy overview of how to diagnose 'cache-unfriendlyness' in golang. I noticed that as the matrices linearly get larger, the amount of memory used disproportionately gets larger (maybe garbage collection cannot keep up). I would think that the pprov package provides great data, but there are lots of metrics to look at. Maybe you could give me a top-of-your-head guideline where and what to look for (memory profile or cpu profile etc.)?

Thanks and regards
Bernard   

Harry de Boer

unread,
Jul 23, 2014, 5:15:41 PM7/23/14
to golan...@googlegroups.com, bernhard...@gmail.com
Hi Bernard,

efficient use of cache is not really language dependent, the memory access patterns of the algorithm are more important. You can find a lot of information on this online, for example the video below explains most quite nicely.

http://vimeo.com/97337258

I've been playing a bit with gonum too, it looks pretty nice. I'll definitely play a bit more with it.

Regards,
Harry

Brendan Tracey

unread,
Jul 23, 2014, 5:31:13 PM7/23/14
to Harry de Boer, golan...@googlegroups.com, bernhard...@gmail.com
On Jul 23, 2014, at 2:15 PM, Harry de Boer <ha...@ijscoboer.nl> wrote:

Hi Bernard,

efficient use of cache is not really language dependent, the memory access patterns of the algorithm are more important. You can find a lot of information on this online, for example the video below explains most quite nicely.

http://vimeo.com/97337258

I've been playing a bit with gonum too, it looks pretty nice. I'll definitely play a bit more with it.

I’m glad your first impressions are positive. Clearly it’s still very much a work in progress, but the more people we have using and contributing to the project the faster it’ll get better.


Regards,
Harry

On Friday, July 18, 2014 12:28:05 PM UTC+2, bernhard...@gmail.com wrote:
Hello Harry, 

thanks for your answer. As you can see above, I managed to get the atlas libs working with go. Still, I am curious about the issue and it is a great opportunity to learn more about golang.

What I would like to ask you is maybe you can give me sort of a sketchy overview of how to diagnose 'cache-unfriendlyness' in golang. I noticed that as the matrices linearly get larger, the amount of memory used disproportionately gets larger (maybe garbage collection cannot keep up). I would think that the pprov package provides great data, but there are lots of metrics to look at. Maybe you could give me a top-of-your-head guideline where and what to look for (memory profile or cpu profile etc.)?

Thanks and regards
Bernard   

simon place

unread,
Jul 23, 2014, 10:56:12 PM7/23/14
to golan...@googlegroups.com, bernhard...@gmail.com
isn't this just, twice the number of points each requiring twice the multiplications, ( x4 slower), then at this size you've roughly the cpu cache size, so should expect a rapid decrease in cache hits.

Brendan Tracey

unread,
Oct 27, 2015, 11:12:52 AM10/27/15
to golang-nuts, bernhard...@gmail.com
I stumbled across this old thread, and just noticed the following.


func main() {
var Agc DgemmCase = DgemmCase{
m:        4, // rows, first multiplicand
n:        3, // cols, second multiplicand
k:        2, // joint dimension
alpha:    1, // scalar multiplicand for matrix A. Here set to 1 as we do not need it. 
beta:     1, // scalar multiplicand for result matrix C. Here set to 1 as we do not need it.

Note that this is probably not what you want. Dgemm computes C = beta * C + alpha * A * B. If beta != 0 then you are updating the values in C, not overwriting them, and so you probably want beta = 0.

Also note that since the initial time of this posting, the native blas implementation of Dgemm is parallel. Code at https://github.com/gonum/blas/blob/master/native/dgemm.go
Reply all
Reply to author
Forward
0 new messages