Just a tiny success story about super simple parallelization


Niklas Schnelle

Jan 21, 2013, 3:11:28 PM
to golan...@googlegroups.com
Hi fellow Gophers,

just wanted to share how easily I got some of my image transformation code parallelized with the
help of Goroutines. The code is still far from optimal, and it's just a fun project: redoing some university exercises,
originally in C, in Go. A few minutes ago I decided to try parallelizing my code in the simplest way possible,
by just launching one Goroutine per image line.
Some 10 minutes later my program runs in 0.8 seconds of wall-clock time instead of more than 3.5.
Of course this is easy because the problem is pleasantly parallel, but surely launching hundreds of Goroutines must have some overhead,
and so does launching 16 threads (I'm on a 16-core system).

Here is the code:
https://github.com/niklas88/imgtest

and the diff of parallelization:
https://github.com/niklas88/imgtest/compare/parallel

You can see that, apart from indentation, it's only a couple of changed lines.

Job van der Zwan

Jan 21, 2013, 4:13:51 PM
to golan...@googlegroups.com
I don't see you setting GOMAXPROCS anywhere, so Go should be running on a single core. What happens if you set it to 16?

Also, you could tile the image into a number of segments equal to the number of cores, then run a goroutine looping over each segment. That might play nicer with the cache, and then you don't need to start so many goroutines.

Job van der Zwan

Jan 21, 2013, 4:16:27 PM
to golan...@googlegroups.com
Also: yeah, it's really nice that concurrency is so easy that even the naive solutions can be effective, huh? :)

Niklas Schnelle

Jan 21, 2013, 4:36:03 PM
to golan...@googlegroups.com
I set GOMAXPROCS as an environment variable on the command line, but yeah, I probably should add that to the code.

Niklas Schnelle

Jan 21, 2013, 4:37:24 PM
to golan...@googlegroups.com
Ah, and yeah, I'll try the segments in the future; I just thought I'd start with the simplest implementation first.

Dmitry Vyukov

Jan 22, 2013, 1:46:40 AM
to golan...@googlegroups.com
This can have very bad performance in some cases due to absence of load balancing. Goroutine-per-line is much better in this respect.
 

Job van der Zwan

Jan 22, 2013, 4:53:04 AM
to golan...@googlegroups.com
On Tuesday, 22 January 2013 07:46:40 UTC+1, Dmitry Vyukov wrote:
This can have very bad performance in some cases due to absence of load balancing. Goroutine-per-line is much better in this respect.
 
Could you elaborate, maybe with an example? I'm not about to question you of all people on how to use multiple cores, but I would have expected memory access to be the first rule-of-thumb bottleneck with straightforward image algorithms. So my reasoning was to segment the array and have a straightforward loop.

Dmitry Vyukov

Jan 22, 2013, 6:23:46 AM
to Job van der Zwan, golang-nuts
If you split the image into, say, 8 equal parts, and one of the
goroutines/threads/cores accidentally takes 2 times longer to
complete, then the whole processing is slowed down 2x. The slowdown can be
due to OS scheduling, other processes/interrupts, unfortunate NUMA
memory layout, a different amount of processing per part (e.g. ray
tracing) and other reasons.
What you propose is good, but the size of a work item must never
depend on input data size (in an ideal world); it must depend
on the overheads of the parallelization technology. Currently a reference
number is ~100us-1ms per work item.
So you can split the image into blocks of fixed size (say 64x64) and
then distribute them among threads/goroutines. This has the advantages of
both locality and good load balancing.

John Nagle

Jan 23, 2013, 3:14:51 AM
to golan...@googlegroups.com

Look at
https://github.com/niklas88/imgtest/blob/0d3c5de54e56ef9b1b5e7c07b5a946d46241a950/algorithms/opticflowhornschunk.go

lines 20-40.

Are you getting valid results? The loop in deriveMixed
starts a goroutine for each row, and it sets up a WaitGroup and calls
Add for the number of goroutines started. But the goroutine in
deriveMixed never calls wg.Done, and there's no call to wg.Wait in
deriveMixed, so deriveMixed will return while the rows are still
being computed. This is a race condition; you may get valid
results some of the time, depending on CPU load.

The "flow" function has proper waiting logic, so you know
what to do. But deriveMixed is missing the checks for goroutine
completion.

John Nagle

Niklas Schnelle

Jan 23, 2013, 6:41:12 AM
to golan...@googlegroups.com, na...@animats.com
Good catch, thanks. I guess I got lucky, or I made the git commit before I found it; I can't really remember.
I guess I shouldn't do something like parallelization in the middle of waiting for a video to load late at night ;-)

Job van der Zwan

Jan 23, 2013, 8:46:07 AM
to golan...@googlegroups.com, Job van der Zwan
Thank you, that was very clear and enlightening! One thing:

On Tuesday, 22 January 2013 12:23:46 UTC+1, Dmitry Vyukov wrote:
... size of a work item must never be
dependent on input data size (in an ideal world), it must be dependent
on overheads of parallelization technology. Currently a reference
number is ~100us-1ms per work item.

I think I understand, but I don't see how you translate time to memory size...

Also, is there a "cheat sheet" for reference numbers like this somewhere? Sounds like the kind of thing that changes every few years with new hardware.

Dmitry Vyukov

Jan 23, 2013, 9:42:00 AM
to Job van der Zwan, golang-nuts
On Wed, Jan 23, 2013 at 5:46 PM, Job van der Zwan
<j.l.van...@gmail.com> wrote:
> Thank you, that was very clear and enlightening! One thing:
>
> On Tuesday, 22 January 2013 12:23:46 UTC+1, Dmitry Vyukov wrote:
>>
>> ... size of a work item must never be
>> dependent on input data size (in an ideal world), it must be dependent
>> on overheads of parallelization technology. Currently a reference
>> number is ~100us-1ms per work item.
>
>
> I think I understand, but I don't see how you translate time to memory
> size...

I don't.

> Also, is there a "cheat sheet" for reference numbers like this somewhere?
> Sounds like the kind of thing that changes every few years with new
> hardware.

I am not aware of any.
But 100us is a pretty safe number, because you most likely don't care
whether a computation takes +/-100us in the context of heavy computations.

Job van der Zwan

Jan 23, 2013, 10:01:56 AM
to golan...@googlegroups.com, Job van der Zwan
On Wednesday, 23 January 2013 15:42:00 UTC+1, Dmitry Vyukov wrote:
On Wed, Jan 23, 2013 at 5:46 PM, Job van der Zwan
> I think I understand, but I don't see how you translate time to memory
> size...

I don't.

Then it's clear I misunderstood you :). There's the size of a work item in terms of *time* (~100us-1ms), and blocks to be distributed of a fixed size in terms of *data* (64x64). I feel like I'm missing an intermediate step, or an implicit split between two entirely different concepts.

Stefan Nilsson

Jan 23, 2013, 3:05:33 PM
to golan...@googlegroups.com, Job van der Zwan
The section about parallelization in Effective Go (http://golang.org/doc/effective_go.html#parallel) could use an update that took this into account. Someone?

Dmitry Vyukov

Jan 24, 2013, 12:24:46 AM
to Job van der Zwan, golang-nuts
I said 64x64 only as an example. I meant that the block size must
be fixed rather than N/GOMAXPROCS. But one needs to calculate the
block size for one's own problem; for some problems it will be 1,
for some 256x256, for some 8x8x8. It depends on the amount of
computation per element.

Niklas Schnelle

Jan 24, 2013, 4:25:16 AM
to golan...@googlegroups.com, Job van der Zwan
Actually, I've now tried a special command-line switch that selects the number of image rows per Goroutine.
Rows, because the image data is stored in row-major order, so a row is contiguous memory while a block wouldn't be.
However, it seems 1 row per Goroutine is still best; on my 16-core system, 2 was better in some tries, but the difference is smaller than
the between-run deviation, so not worth the change.

Job van der Zwan

Jan 24, 2013, 1:46:51 PM
to golan...@googlegroups.com, Job van der Zwan
On Thursday, January 24, 2013 6:24:46 AM UTC+1, Dmitry Vyukov wrote:
I've said 64x64 only as an example. I meant that the block size must 
be fixed-size rather than N/GOMAXPROCS. But one needs to calculate the 
block size for his problem himself, for some problems it will be 1, 
for some - 256x256, for some - 8x8x8. It depends on amount of 
computations per element. 

Ah, thanks for clarifying.


On Thursday, 24 January 2013 10:25:16 UTC+1, Niklas Schnelle wrote:
Actually I've now tried with a special command line switch that selects the number of image rows per Goroutine.
Rows because the image data is stored in row major order and thus a row is continuous memory while a block wouldn't be.
However it seems like 1 row per Goroutine is still best, on my 16 core system 2 was better in some tries but it's less than the
between run deviations so not worth the change.

So, by that logic, if you have a much wider image the behaviour would change? 

Michael Jones

Jan 24, 2013, 1:46:49 PM
to Dmitry Vyukov, Job van der Zwan, golang-nuts
Comment: 

I just finished a particular case of computation where the execution cost of task n grew as 2**n, with problems that could not be subdivided naturally. My default "biggest sub-problem first" task distributor, using GOMAXPROCS == NumCPU, worked perfectly. The first goroutine took time Total/2, and the other (Log2 N)-1 tasks took the same time so closely that after an hour or so of computation, the two halves of the computation finished within a few seconds of each other.

I agree with Dmitry, but wanted to point out that when the tasks are not close to uniform size it can be VERY important to manage the order of task distribution. In the case above, any style of distribution to more than NCPU threads/tasks, or in any other order than launching the big one first, requires more total processing time.

On Wed, Jan 23, 2013 at 9:24 PM, Dmitry Vyukov <dvy...@google.com> wrote:
I've said 64x64 only as an example. I meant that the block size must
be fixed-size rather than N/GOMAXPROCS. But one needs to calculate the
block size for his problem himself, for some problems it will be 1,
for some - 256x256, for some - 8x8x8. It depends on amount of
computations per element. 
Michael T. Jones | Chief Technology Advocate  | m...@google.com |  +1 650-335-5765

Dmitry Vyukov

Jan 24, 2013, 11:11:11 PM
to Michael Jones, Job van der Zwan, golang-nuts

Agree. That's what's frequently missed by build tools like go, make, etc.
Sometimes it's possible to split tasks into subtasks further. This
becomes more important with an increasing number of cores. Consider that
you have costs of the form 2**n with N=32, but the number of cores is
128.