Parallel Matrix Multiplication

2,149 views
Skip to first unread message

Dmitriy Vyukov

unread,
Apr 14, 2011, 8:55:09 AM4/14/11
to golan...@googlegroups.com
Hi,

I want to create a parallel matrix multiplication benchmark for Go. I used to fork-join style parallelism, so I implemented it as follows:

type Matrix [][]float64

func matrix_make(n int) Matrix {
M := make([][]float64, n);
for i := 0; i != n; i += 1 {
M[i] = make([]float64, n)
}
return M
}

const (Threshold = 8)

func matmult_pimpl (sync chan int, A Matrix, B Matrix, C Matrix, i0 int, i1 int, j0 int, j1 int, k0 int, k1 int) {
di := i1 - i0
dj := j1 - j0
dk := k1 - k0
if (di >= dj && di >= dk && di >= Threshold) {
mi := i0 + di / 2
sync0 := make(chan int, 1)
go matmult_pimpl(sync0, A, B, C, i0, mi, j0, j1, k0, k1)
matmult_pimpl(nil, A, B, C, mi, i1, j0, j1, k0, k1)
<- sync0
} else if (dj >= dk && dj >= Threshold) {
mj := j0 + dj / 2
sync0 := make(chan int, 1)
go matmult_pimpl(sync0, A, B, C, i0, i1, j0, mj, k0, k1)
matmult_pimpl(nil, A, B, C, i0, i1, mj, j1, k0, k1)
<- sync0
} else if (dk >= Threshold) {
mk := k0 + dk / 2
matmult_pimpl(nil, A, B, C, i0, i1, j0, j1, k0, mk)
matmult_pimpl(nil, A, B, C, i0, i1, j0, j1, mk, k1)
} else {
for i := i0; i < i1; i += 1 {
for j := j0; j < j1; j += 1 {
for k := k0; k < k1; k += 1 {
C[i][j] += A[i][k] * B[k][j];
}
}
}
}
if sync != nil {
sync <- 0
}
}

func matmult_parallel(A Matrix, B Matrix) Matrix {
n := len(A)
C := matrix_make(n)
matmult_pimpl(nil, A, B, C, 0, n, 0, n, 0, n)
return C
}

Is it a good implementation for Go? Is it enough "Go-ish"? What would be a "canonical" way?
TIA

John Asmuth

unread,
Apr 14, 2011, 9:09:12 AM4/14/11
to golan...@googlegroups.com
Here's what I did.



One thing to keep in mind is that unless the parallel components have a lot of work to do, the overhead will be too big. So don't bother doing this for small matrices.

Dmitriy Vyukov

unread,
Apr 14, 2011, 9:32:06 AM4/14/11
to golan...@googlegroups.com
четверг, 14 апреля 2011 г. 17:09:12 UTC+4 пользователь John Asmuth написал:
Hi, 
Thank you for the pointer.
Actually I want to create a benchmark which exactly highlights problems in scheduler/channels/etc. For that purpose I can't use your code because (1) it rolls out own centralized scheduling via own channel, (2) parallel for will cause bottleneck on memory throughput so it can't potentially scale well.
My goal is to create an implementation which ultimately can be transparently made scale well if problems in runtime eliminated.



John Asmuth

unread,
Apr 14, 2011, 9:40:38 AM4/14/11
to golan...@googlegroups.com


On Thursday, April 14, 2011 9:32:06 AM UTC-4, Dmitriy Vyukov wrote:
Actually I want to create a benchmark which exactly highlights problems in scheduler/channels/etc. For that purpose I can't use your code

Not my intention. Just wanted to show you what someone else (me) did :)
 
because (1) it rolls out own centralized scheduling via own channel, (2) parallel for will cause bottleneck on memory throughput so it can't potentially scale well.

I'm not sure I follow. You can put however much you want in the "i" parameter - it's just an interface{}. It so happens that I send only one row/col pair through at a time, but that's not parallel for's fault - that's the fault of the code that invokes it.

- John

Dmitriy Vyukov

unread,
Apr 14, 2011, 9:48:24 AM4/14/11
to golan...@googlegroups.com
четверг, 14 апреля 2011 г. 17:40:38 UTC+4 пользователь John Asmuth написал:
But it uses a centralized queue -> non scalable.

Ryanne Dolan

unread,
Apr 14, 2011, 9:53:42 AM4/14/11
to golan...@googlegroups.com, Dmitriy Vyukov
I did some analysis as part of a homework assignment recently.  Here's my implementation:


I think it's exactly the same as yours.

My benchmarks results were interesting.  The Go code was 2X faster than equivalent C code using pthreads, and 2X faster than a naive Go implementation like John's.

I think your implementation is great.  Who cares if it is "Go-ish" when it is fast and cache-optimized?

Thanks.
Ryanne

--
www.ryannedolan.info

John Asmuth

unread,
Apr 14, 2011, 10:39:25 AM4/14/11
to golan...@googlegroups.com
Ah, you meant scale with processors, while I was thinking scale with data.

Steven

unread,
Apr 14, 2011, 10:57:35 AM4/14/11
to golan...@googlegroups.com, Dmitriy Vyukov
This looks good to me. I just have a few style notes.

1) Functions you want to be visible outside your package need to begin them with a capital letter. You seem to be following the "capitalize types, don't capitalize functions" model from other languages. This doesn't work for Go.
2) While you're changing models to accommodate this, I'll mention that most Go code I've seen follows CamelCaps.
3) Factory functions are normally called MakeMatrix (or NewMatrix if you're returning a pointer) rather than the other way around. If your package is called "matrix", you can just call it Make, since it'll be used as matrix.Make.
4) You have more brackets than necessary. You don't need the brackets declaring Threshhold. You don't need the brackets in the if statements.
5) It is idiomatic, when you have > 2 cases to choose between, to use a switch statement rather than if...else if...else. This helps make your logic for each case stand out above the syntax of the control structure.
6) You could make the signature of mathmult_pimpl:

func matmult_pimpl (sync chan int, A, B, C Matrix, i0, i1, j0, j1, k0, k1 int)

A lot of this is of course subjective, but I thought I'd point them out :)

John Asmuth

unread,
Apr 14, 2011, 11:14:55 AM4/14/11
to golan...@googlegroups.com, Dmitriy Vyukov


On Thursday, April 14, 2011 9:53:42 AM UTC-4, Ryanne Dolan wrote:

My benchmarks results were interesting.  The Go code was 2X faster than equivalent C code using pthreads, and 2X faster than a naive Go implementation like John's.


Then why didn't you push it to gomatrix! :) You do have commit access, after all. 

peterGo

unread,
Apr 14, 2011, 11:27:05 AM4/14/11
to golang-nuts
On Apr 14, 10:57 am, Steven <steven...@gmail.com> wrote:
> 4) You have more brackets than necessary. You don't need the brackets
> declaring Threshhold. You don't need the brackets in the if statements.

Running gofmt -w on the program would cleanup the program.

Peter

On Apr 14, 10:57 am, Steven <steven...@gmail.com> wrote:

Dmitriy Vyukov

unread,
Apr 15, 2011, 1:46:38 AM4/15/11
to golan...@googlegroups.com
четверг, 14 апреля 2011 г. 19:27:05 UTC+4 пользователь peterGo написал:
On Apr 14, 10:57 am, Steven <stev...@gmail.com> wrote:
> 4) You have more brackets than necessary. You don't need the brackets
> declaring Threshhold. You don't need the brackets in the if statements.

Running gofmt -w on the program would cleanup the program.

 
Thank you, I will do that.

Dmitriy Vyukov

unread,
Apr 15, 2011, 2:00:24 AM4/15/11
to golan...@googlegroups.com, Dmitriy Vyukov
четверг, 14 апреля 2011 г. 18:57:35 UTC+4 пользователь Steven написал:

This looks good to me. I just have a few style notes.

Thank you!
It's my first Go program, and I guess it uses C/C++ style with some random inclusions of Go style :)
 

1) Functions you want to be visible outside your package need to begin them with a capital letter. You seem to be following the "capitalize types, don't capitalize functions" model from other languages. This doesn't work for Go.

Do I get you right that functions starting with lowercase won't be visible from outside?
I only have a single package which contains all helper functions and the main(), do I need to rename functions then?

 
2) While you're changing models to accommodate this, I'll mention that most Go code I've seen follows CamelCaps.

Hmmm.... I am looking at /test/bench sources and basically all code uses lowercase functions and variables...

 
3) Factory functions are normally called MakeMatrix (or NewMatrix if you're returning a pointer) rather than the other way around. If your package is called "matrix", you can just call it Make, since it'll be used as matrix.Make.

It seems to contradict with "make" functions that create arrays and channels...
Btw, the package is "main", it's single source benchmark.


 
4) You have more brackets than necessary. You don't need the brackets declaring Threshhold. You don't need the brackets in the if statements.

Is it a bad taste in Go? You know I still feel more comfortable with if (...) :)

 
5) It is idiomatic, when you have > 2 cases to choose between, to use a switch statement rather than if...else if...else. This helps make your logic for each case stand out above the syntax of the control structure.

Yeah, but it makes code inconsistent... and that "I've added third case so I need to rewrite it to switch... oh, no, I do not need that third case, I need to rewrite it back... ah, I need another third case..." :)

 
6) You could make the signature of mathmult_pimpl:

func matmult_pimpl (sync chan int, A, B, C Matrix, i0, i1, j0, j1, k0, k1 int)


Aha!
Actually after adoption from C code I had (... i int, j int, k, int ...), and I was stumbled by how it compiles and why I get errors like "incorrect number of arguments", now I understand it was (... i int, j int, k int, <unnamed> int ...) :)

 
A lot of this is of course subjective, but I thought I'd point them out :)

 Thank you, I appreciate your comments, it's difficult to write code in a new language.

Dmitriy Vyukov

unread,
Apr 15, 2011, 2:06:00 AM4/15/11
to golan...@googlegroups.com, Dmitriy Vyukov
четверг, 14 апреля 2011 г. 17:53:42 UTC+4 пользователь Ryanne Dolan написал:
I did some analysis as part of a homework assignment recently.  Here's my implementation:


I think it's exactly the same as yours.

I guess we've adapted the same C source code (cilk) :)
 

My benchmarks results were interesting.  The Go code was 2X faster than equivalent C code using pthreads, and 2X faster than a naive Go implementation like John's.

Yeah, cache-oblivious algorithms is the way to go (or Go :) )! Here I tried to show that on a simpler example:

Humm... Does the C code really equivalent? I don't believe...
 

I think your implementation is great.  Who cares if it is "Go-ish" when it is fast and cache-optimized?


Well, I want to create a representative benchmark that stresses scheduler (channels and garbage collector), and then optimize the scheduler using the benchmark. If the benchmark is not representative than I am optimizing the wrong case...

 

Steven

unread,
Apr 15, 2011, 3:05:13 AM4/15/11
to golan...@googlegroups.com, Dmitriy Vyukov
On Friday, April 15, 2011, Dmitriy Vyukov <dvy...@google.com> wrote:
> четверг, 14 апреля 2011 г. 18:57:35 UTC+4 пользователь Steven написал:
> This looks good to me. I just have a few style notes.
> Thank you!It's my first Go program, and I guess it uses C/C++ style with some random inclusions of Go style :)

>
>
> 1) Functions you want to be visible outside your package need to begin them with a capital letter. You seem to be following the "capitalize types, don't capitalize functions" model from other languages. This doesn't work for Go.
> Do I get you right that functions starting with lowercase won't be visible from outside?I only have a single package which contains all helper functions and the main(), do I need to rename functions then?

Yes. And, No :). I brought it up more because you were capitalizing
some things and not others based on a pattern that doesn't make sense
in Go. If you wanted to put your Matrix in a library, you would have
to make the change. In your main package it's unnecessary. I generally
follow the visible/non-visible approach in main anyways, because it
means less work if I decide to reorganize my source.

> 2) While you're changing models to accommodate this, I'll mention that most Go code I've seen follows CamelCaps.
> Hmmm.... I am looking at /test/bench sources and basically all code uses lowercase functions and variables...

That's because testing code isn't providing an API, and follows a
different naming convention for use by gotest.

>  3) Factory functions are normally called MakeMatrix (or NewMatrix if you're returning a pointer) rather than the other way around. If your package is called "matrix", you can just call it Make, since it'll be used as matrix.Make.

> It seems to contradict with "make" functions that create arrays and channels...Btw, the package is "main", it's single source benchmark.

NewT is definitely preferred and more common for most types. You see
MakeT sometimes, but rarely. Sometimes, if I'm doing extra work for a
type I'd normally make, I use MakeT. And I was just pointing out the
possibility of using the package name as part of the name for future
reference.

> 4) You have more brackets than necessary. You don't need the brackets declaring Threshhold. You don't need the brackets in the if statements.
> Is it a bad taste in Go? You know I still feel more comfortable with if (...) :)

To each their own. However, gofmt, I'm pretty sure, will remove the
brackets, and it's the definitive style for the Go language. I think
it's just a matter of getting used to it. I generally appreciate not
having the bracket clutter.

>  5) It is idiomatic, when you have > 2 cases to choose between, to use a switch statement rather than if...else if...else. This helps make your logic for each case stand out above the syntax of the control structure.
> Yeah, but it makes code inconsistent... and that "I've added third case so I need to rewrite it to switch... oh, no, I do not need that third case, I need to rewrite it back... ah, I need another third case..." :)

I must say that I've never run into this problem. I generally know
whether it's going to be a simple if or if...else upfront, and if it's
a bunch of options, I go with the switch.

> 6) You could make the signature of mathmult_pimpl:
> func matmult_pimpl (sync chan int, A, B, C Matrix, i0, i1, j0, j1, k0, k1 int)
>
>
>

> Aha!Actually after adoption from C code I had (... i int, j int, k, int ...), and I was stumbled by how it compiles and why I get errors like "incorrect number of arguments", now I understand it was (... i int, j int, k int, <unnamed> int ...) :)

Dmitriy Vyukov

unread,
Apr 15, 2011, 4:32:18 AM4/15/11
to golan...@googlegroups.com, Dmitriy Vyukov
Thanks, Steven! I think I've got more sense of Go style now. I've applied your suggestions and passed it via gofmt.

kem

unread,
Apr 15, 2011, 2:40:13 PM4/15/11
to golang-nuts
I feel obliged to reinforce this sentiment.

John Asmuth

unread,
Apr 15, 2011, 3:39:45 PM4/15/11
to golan...@googlegroups.com
I adapted this code and added it to gomatrix, with some changes.


The non-trivial change I made was to cap the number of goroutines in use at any one time to GOMAXPROCS+2. Since there is no file I/O or any other kind of blocking behavior, having O(log N) goroutines for an NxN matrix wasted overhead.

I did some experiments to verify that this was a good idea. With the product of two 100x100 randomly initialized matrices, my old parallel times implementation took ~.75-.8 times as long as the non-parallel version. And the algorithm Dmitriy posted also took ~.75 times as long. When I limit the number of concurrent goroutines to GOMAXPROCS+2, Dmitriy's version takes ~.5-.55 times as long as the non parallel version. There are two processors on my computer, so I consider that a success.


Dmitriy Vyukov

unread,
Apr 16, 2011, 1:42:52 PM4/16/11
to golan...@googlegroups.com
I don't think it's a good idea, because it will produce very few very coarse grained and uneven tasks.
Btw, here are my performance results:
And here are my current sources:

//why don't we split here, too?
//one answer - at this point we've already got way more goroutines than procs

These 2 partitions can't run in parallel, otherwise they will produce data races.

John Asmuth

unread,
Apr 16, 2011, 1:54:31 PM4/16/11
to golan...@googlegroups.com


On Saturday, April 16, 2011 1:42:52 PM UTC-4, Dmitriy Vyukov wrote:
I don't think it's a good idea, because it will produce very few very coarse grained and uneven tasks.

It simply doesn't run as fast if I don't do this. I added it because it gave no speedup beyond what previous implementation achieved. The code I used to test is demo/demo3, included in the gomatrix repository. Now, I only tested it on my dual-core machine, so maybe as cores go up this is less of an issue.
 
Btw, here are my performance results:
And here are my current sources:

//why don't we split here, too?
//one answer - at this point we've already got way more goroutines than procs

These 2 partitions can't run in parallel, otherwise they will produce data races.

Of course, thanks.

Why not remove that case entirely, then? It breaks up the continuity of memory access without doing anything in parallel.

- Jhon
 

Eric I.

unread,
Apr 16, 2011, 7:54:59 PM4/16/11
to golang-nuts
By reducing the size of the k "dimension", the recursive calls have a
reduced dk, and they might then split on the i or j "dimensions",
which will be done in parallel.

Eric

Ryanne Dolan

unread,
Apr 16, 2011, 7:58:53 PM4/16/11
to John Asmuth, golang-nuts

John,

There are two forces at work in the parallel implemention Dmitriy and I shared. The first is parallelization to multiple cores. The second is partitioning the matrix to optimize cache hits. If you limit the number of branches in the algorithm based on the number of procs, you end up with big partitions that don't fit in cache well. If you use too many goroutines, the overhead of creating each will be substantial.

The solution is to recurse as much as before but only launch goroutines the first few levels down. Pass in an extra depth parameter and switch from the parallel version to the serial version after some max depth.

Ryanne

John Asmuth

unread,
Apr 16, 2011, 11:04:01 PM4/16/11
to golan...@googlegroups.com
That's all well and good, but if you ever reach the case for splitting on k, i and j are already below threshold. The only change that happens is the for loops are separated.

Dmitriy Vyukov

unread,
Apr 18, 2011, 8:19:13 AM4/18/11
to golan...@googlegroups.com
воскресенье, 17 апреля 2011 г. 7:04:01 UTC+4 пользователь John Asmuth написал:
That's all well and good, but if you ever reach the case for splitting on k, i and j are already below threshold. The only change that happens is the for loops are separated.

Nope. The code always splits the largest  dimension.

John Asmuth

unread,
Apr 18, 2011, 10:15:55 AM4/18/11
to golan...@googlegroups.com
I see, I missed that in my adaptation.

Gabriel Pcklub

unread,
Mar 25, 2021, 1:17:37 PM3/25/21
to golang-nuts
Hello, what is Threshold constant in code for? I wanted to try your code, but with Threshold 1 it take all of 16GB RAM I have and crash on sigkill. When I use Threshold 8 it seems ok, but when I use 64, it again take a lot of RAM... I thought it's count of Threads that program will use, while running on CPU. But it does give unexpected or strange results.

Dátum: štvrtok 14. apríla 2011, čas: 14:55:09 UTC+2, odosielateľ: Dmitry Vyukov

jas...@gmail.com

unread,
Mar 25, 2021, 2:21:02 PM3/25/21
to Gabriel Pcklub, golang-nuts
Blast from the past so it's hard to be sure, but I think that was how many rows or columns to pick for parallel sub matrices to multiply.

--
You received this message because you are subscribed to a topic in the Google Groups "golang-nuts" group.
To unsubscribe from this topic, visit https://groups.google.com/d/topic/golang-nuts/RVX-BhHcugM/unsubscribe.
To unsubscribe from this group and all its topics, send an email to golang-nuts...@googlegroups.com.
To view this discussion on the web visit https://groups.google.com/d/msgid/golang-nuts/32e55537-d2d0-4e9a-8425-c85ad6a790a1n%40googlegroups.com.

Dan Kortschak

unread,
Mar 25, 2021, 2:36:29 PM3/25/21
to golan...@googlegroups.com
On Thu, 2021-03-25 at 14:20 -0400, jas...@gmail.com wrote:
> Blast from the past so it's hard to be sure, but I think that was how
> many rows or columns to pick for parallel sub matrices to multiply.
>
> On Thu, Mar 25, 2021, 1:17 PM Gabriel Pcklub <
> gabriel...@gmail.com> wrote:
> > Hello, what is Threshold constant in code for? I wanted to try your
> > code, but with Threshold 1 it take all of 16GB RAM I have and crash
> > on sigkill. When I use Threshold 8 it seems ok, but when I use 64,
> > it again take a lot of RAM... I thought it's count of Threads that
> > program will use, while running on CPU. But it does give unexpected
> > or strange results.


Yeah, it's essentially the minimum block size for parallel execution.
At 1, it's attempting to work parallel the whole way down.

This thread is almost as old as the language though.

Dan


Gabriel Pcklub

unread,
Mar 27, 2021, 3:23:20 PM3/27/21
to golang-nuts
thanks for explanation, well, how can I change that code, so it will take count of Threads (or gorutines) instead size of matrix for one gorutine?

Dátum: štvrtok 25. marca 2021, čas: 19:21:02 UTC+1, odosielateľ: jas...@gmail.com

Gabriel Pcklub

unread,
Mar 27, 2021, 3:26:30 PM3/27/21
to golang-nuts

So Threshold is Size / Threads?
Dátum: štvrtok 25. marca 2021, čas: 19:21:02 UTC+1, odosielateľ: jas...@gmail.com
Blast from the past so it's hard to be sure, but I think that was how many rows or columns to pick for parallel sub matrices to multiply.

jas...@gmail.com

unread,
Mar 27, 2021, 3:26:55 PM3/27/21
to Gabriel Pcklub, golang-nuts
Don't use that code - instead pick up something modern and well-supported like https://www.gonum.org/

Reply all
Reply to author
Forward
0 new messages