How to improve the parallelism of goroutines?


颜文泽

Feb 1, 2021, 12:33:54 PM
to golang-nuts

$ go version
go version go1.13 linux/amd64


I'm writing database (OLAP) code and have found that parallel performance with goroutines is not very good, and I'm not sure how to deal with it. I wrote the following two examples to verify this, first a serial version and then a parallel one:
```
package main

import (
    "fmt"
    "hash/crc32"
    "time"
)

const (
    Loop = 10000 // total number of checksum passes
)

// Serial baseline: a single goroutine performs all Loop passes.
func main() {
    data := make([]byte, 4<<20) // 4 MiB input buffer
    t := time.Now()
    for i := 0; i < Loop; i++ {
        crc32.ChecksumIEEE(data)
    }
    fmt.Printf("process: %v\n", time.Since(t))
}
```
```
package main

import (
    "fmt"
    "hash/crc32"
    "sync"
    "time"
)

const (
    Mcpu = 8            // number of CPUs on the test machine
    Loop = 10000 / Mcpu // checksum passes per goroutine
)

func main() {
    data := make([]byte, 4<<20) // 4 MiB buffer, shared read-only by all goroutines

    var wg sync.WaitGroup

    t := time.Now()
    // Start Mcpu-1 workers; the main goroutine does the remaining share below.
    for i := 1; i < Mcpu; i++ {
        wg.Add(1)
        go func(idx int) {
            defer wg.Done()
            tt := time.Now()
            for j := 0; j < Loop; j++ {
                crc32.ChecksumIEEE(data)
            }
            fmt.Printf("%v's process: %v\n", idx, time.Since(tt))
        }(i)
    }
    {
        // Worker 0: the main goroutine takes part in the computation too.
        tt := time.Now()
        for j := 0; j < Loop; j++ {
            crc32.ChecksumIEEE(data)
        }
        fmt.Printf("0's process: %v\n", time.Since(tt))
    }
    wg.Wait()
    fmt.Printf("process: %v\n", time.Since(t))
}
```
My machine has exactly 8 CPUs, and I found that the run time does not decrease linearly as the number of goroutines increases.

Ian Lance Taylor

Feb 1, 2021, 2:13:18 PM
to 颜文泽, golang-nuts
On Mon, Feb 1, 2021 at 9:33 AM 颜文泽 <nnsm...@gmail.com> wrote:
>
>
> $ go version
> go version go1.13 linux/amd64
>
>
> I'm not sure how to deal with this phenomenon when I find that the parallel performance using go routine is not very good when writing database(olap) code.

What does runtime.GOMAXPROCS(0) return on your system?
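
For readers who want to answer this question on their own machine, here is a minimal sketch. The helper name `schedulerInfo` is made up for illustration; `runtime.NumCPU` and `runtime.GOMAXPROCS` are the standard library calls, and passing 0 to `runtime.GOMAXPROCS` only queries the current value.

```go
package main

import (
    "fmt"
    "runtime"
)

// schedulerInfo (hypothetical helper) reports the number of logical CPUs
// visible to the process and the current GOMAXPROCS setting.
// Passing 0 to runtime.GOMAXPROCS queries the value without changing it.
func schedulerInfo() (numCPU, maxProcs int) {
    return runtime.NumCPU(), runtime.GOMAXPROCS(0)
}

func main() {
    numCPU, maxProcs := schedulerInfo()
    fmt.Printf("NumCPU=%d GOMAXPROCS=%d\n", numCPU, maxProcs)
}
```

If the two numbers differ, something (for example the `GOMAXPROCS` environment variable) has changed the default.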

In general the Go runtime is optimized for the case where there are
more goroutines than there are processors. Goroutines that run for a
long time without yielding the processor are preempted. I don't think
that preemption process considers the possibility that there is
nothing else to do.

Do you expect your real program to have long-running CPU-bound
goroutines, and to not have any other work to do (i.e., no network
connections and no file I/O)? If so, the goroutine scheduler may not
be well tuned for your code.

Also note that the scheduler changed significantly in Go 1.14, so it's
worth testing that. I have no particular reason to think that it will
be better, but it may well be different.

Ian

Wojciech S. Czarnecki

Feb 1, 2021, 7:07:18 PM
to golan...@googlegroups.com
On 2021-02-01, at 11:12:22,
Ian Lance Taylor <ia...@golang.org> wrote:

> On Mon, Feb 1, 2021 at 9:33 AM 颜文泽 <nnsm...@gmail.com> wrote:

> > go version go1.13 linux/amd64

> Goroutines that run for a long time without yielding the processor are preempted.

Since go1.14 TMK. OP is using 1.13.

> > I'm not sure how to deal with this phenomenon when I find that the parallel performance
> > using go routine is not very good when writing database(olap) code.

First, use a recent Go compiler version; the current one is 1.15, and 1.16 is coming soon.

Hope this helps,

--
Wojciech S. Czarnecki
<< ^oo^ >> OHIR-RIPE

Ian Lance Taylor

Feb 1, 2021, 7:10:12 PM
to Wojciech S. Czarnecki, golang-nuts
On Mon, Feb 1, 2021 at 4:07 PM Wojciech S. Czarnecki <oh...@fairbe.org> wrote:
>
> On 2021-02-01, at 11:12:22,
> Ian Lance Taylor <ia...@golang.org> wrote:
>
> > On Mon, Feb 1, 2021 at 9:33 AM 颜文泽 <nnsm...@gmail.com> wrote:
>
> > > go version go1.13 linux/amd64
>
> > Goroutines that run for a long time without yielding the processor are preempted.
>
> Since go1.14 TMK. OP is using 1.13.

In Go 1.14 and later they are preempted by signals. Before Go 1.14
they were still preempted, it just happened when making a function
call (which meant that the preemption could be arbitrarily delayed).

Ian

颜文泽

Feb 1, 2021, 9:18:44 PM
to golang-nuts
runtime.GOMAXPROCS(0) returns 8. I write CPU-intensive OLAP databases, and I keep to the basic rule that the number of goroutines is smaller than the number of CPUs, which I do control.
However, I found that it is very hard to get a linear performance increase with goroutines, whereas the same implementation in C and C++ did not run into such difficulties. This is very frustrating for me.

颜文泽

Feb 1, 2021, 9:21:25 PM
to golang-nuts
I'll try 1.14. When writing CPU-intensive programs (I mainly work on a database), I found that cache misses from routines switching is also a headache, and I don't know how to deal with it.

Robert Engels

Feb 1, 2021, 9:56:41 PM
to 颜文泽, golang-nuts
Having more CPU-bound goroutines than you have physical CPUs is not a good idea.

On Feb 1, 2021, at 8:21 PM, 颜文泽 <nnsm...@gmail.com> wrote:

I'll try 1.14, when writing cpu-intensive programs (I'm mainly a database), I found that cache misses from routines switching is also a headache and I don't know how to deal with it.
--
You received this message because you are subscribed to the Google Groups "golang-nuts" group.
To unsubscribe from this group and stop receiving emails from it, send an email to golang-nuts...@googlegroups.com.
To view this discussion on the web visit https://groups.google.com/d/msgid/golang-nuts/26178bb4-9f60-4fd4-bf3a-9c170e18152bn%40googlegroups.com.

颜文泽

Feb 1, 2021, 10:02:25 PM
to golang-nuts

I don't understand what you mean.

Robert Engels

Feb 1, 2021, 10:17:55 PM
to 颜文泽, golang-nuts
If you look at the “Disruptor pattern” you’ll see what I mean. If the tasks are CPU-bound, having more threads/goroutines than CPUs causes inefficiencies (scheduling overhead, loss of cache locality / invalidation, lock contention).
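
One common way to keep CPU-bound work from outnumbering the CPUs is a fixed-size worker pool. The sketch below is only an illustration of that idea, not the Disruptor pattern itself; the function name `checksumAll` and the block sizes are assumptions made for the example.

```go
package main

import (
    "fmt"
    "hash/crc32"
    "runtime"
    "sync"
)

// checksumAll (hypothetical helper) computes a CRC-32 for every block using
// exactly `workers` goroutines, so CPU-bound work never outnumbers the CPUs.
// workers must be at least 1, otherwise the send on jobs would block forever.
func checksumAll(blocks [][]byte, workers int) []uint32 {
    sums := make([]uint32, len(blocks))
    jobs := make(chan int)
    var wg sync.WaitGroup
    for w := 0; w < workers; w++ {
        wg.Add(1)
        go func() {
            defer wg.Done()
            for i := range jobs {
                // Each index is handed to exactly one worker, so no lock is needed.
                sums[i] = crc32.ChecksumIEEE(blocks[i])
            }
        }()
    }
    for i := range blocks {
        jobs <- i
    }
    close(jobs)
    wg.Wait()
    return sums
}

func main() {
    blocks := make([][]byte, 64)
    for i := range blocks {
        blocks[i] = make([]byte, 1<<20) // 1 MiB per block, an arbitrary size
    }
    sums := checksumAll(blocks, runtime.GOMAXPROCS(0))
    fmt.Println(len(sums), "blocks checksummed")
}
```

Sizing the pool with `runtime.GOMAXPROCS(0)` is a starting point only; leaving one or two CPUs free for the runtime may work better for some workloads.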

On Feb 1, 2021, at 9:02 PM, 颜文泽 <nnsm...@gmail.com> wrote:



颜文泽

Feb 1, 2021, 11:04:22 PM
to golang-nuts
But the number of my goroutines is smaller than the number of CPUs.

Kurtis Rader

Feb 1, 2021, 11:18:08 PM
to 颜文泽, golang-nuts
That is not what you told us, but perhaps there is a misunderstanding. In your first message you said:

> My machine has exactly 8 cpu's and I found that the runtime does not decrease linearly when the number of go routines increases.

Are you saying that the runtime does not decrease linearly when increasing the number of goroutines from 1 to 8 (the number of cores on your system)? Or when increasing the number of goroutines beyond 8?

Also, your original code did `for i := 1; i < Mcpu; i++ {` which only creates 7, not 8, goroutines. Is that intentional? It seems wrong given the discussion to this point. Which raises additional questions about your premise.
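
To make the "1 to 8 goroutines" question measurable, a sketch like the following could be used. It keeps the total amount of work constant and varies only the number of goroutines. The function name `run` and the loop count of 200 are assumptions chosen to keep the run short; they are not taken from the original program.

```go
package main

import (
    "fmt"
    "hash/crc32"
    "runtime"
    "sync"
    "time"
)

// run (hypothetical helper) splits totalLoops checksum passes across
// `workers` goroutines and returns the wall-clock time until all finish.
func run(data []byte, totalLoops, workers int) time.Duration {
    var wg sync.WaitGroup
    start := time.Now()
    for w := 0; w < workers; w++ {
        // Spread the remainder so the total work stays exactly totalLoops.
        n := totalLoops / workers
        if w < totalLoops%workers {
            n++
        }
        wg.Add(1)
        go func(n int) {
            defer wg.Done()
            for i := 0; i < n; i++ {
                crc32.ChecksumIEEE(data)
            }
        }(n)
    }
    wg.Wait()
    return time.Since(start)
}

func main() {
    data := make([]byte, 4<<20) // same 4 MiB buffer size as in the thread
    const totalLoops = 200      // assumed value, much smaller than the original 10000
    base := run(data, totalLoops, 1)
    for workers := 1; workers <= runtime.GOMAXPROCS(0); workers++ {
        d := run(data, totalLoops, workers)
        fmt.Printf("workers=%d time=%v speedup=%.2fx\n",
            workers, d, float64(base)/float64(d))
    }
}
```

Printing the speedup for each worker count shows where scaling starts to flatten, which is more informative than comparing only the serial and the 8-way run.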

On Mon, Feb 1, 2021 at 8:04 PM 颜文泽 <nnsm...@gmail.com> wrote:
But the number of my routines is smaller than the number of cpu

On Tuesday, February 2, 2021 at 11:17:55 AM UTC+8, <ren...@ix.netcom.com> wrote:
If you look at the “disrupter pattern” you’ll see what I mean. If the tasks are cpu bound - by having more threads/routines than cpus you cause inefficiencies (scheduling overhead, cache locality / invalidation, lock contention). 

--
Kurtis Rader
Caretaker of the exceptional canines Junior and Hank

Robert Engels

Feb 1, 2021, 11:21:44 PM
to 颜文泽, golang-nuts
You wrote “I found that cache misses from routines switching is also a headache”.

They would not be switching if they are CPU-bound and there are fewer of them than the number of CPUs. Remember too that you need some percentage of the CPUs to execute the runtime GC code and other housekeeping.

颜文泽

Feb 2, 2021, 1:05:56 AM
to golang-nuts
Sorry, my machine has 8 CPU cores. I wrote the code this way because the main goroutine also takes part in the calculation, hence the loop `for i := 1; i < Mcpu; i++`.
I counted the main goroutine when making sure that the number of goroutines does not exceed the number of CPUs. This may have caused a misunderstanding; sorry about that.

颜文泽

Feb 2, 2021, 1:07:54 AM
to golang-nuts
I don't know much about the internal implementation of Go, sorry. I was a C programmer, and I tried to implement the original logic (an OLAP database) using goroutines as a replacement for threads. But I ran into bottlenecks, and I don't know how to solve them. Maybe I should study how goroutines are implemented before I can write the right code.

颜文泽

Feb 2, 2021, 1:23:51 AM
to golang-nuts
It seems that I don't know enough about Go's implementation. When I observed with VTune, I found that even with only one goroutine the number of threads was 11 (CPI Rate = 1.210), and when I increased the number of goroutines to 8 the number of threads was 14 (CPI Rate = 2.270). I used my office computer for this test, so the configuration changed from 8 cores to 12 cores:
[Screenshot 2021-02-02 14-22-05.png]
[Screenshot 2021-02-02 14-22-34.png]

robert engels

Feb 2, 2021, 1:27:39 AM
to 颜文泽, golang-nuts
Unless it is an in-memory database, I would expect the I/O costs to dwarf the CPU costs, but I guess a lot depends on how you define ‘analytical processing’.

In my experience, the “out of the box” performance of goroutines in I/O processing is outstanding.

For the CPU-bound case, I think with threads, CPU assignments (cpuset), etc. you can probably create a higher-performing system in some cases, but it’s a lot of work.

Even without that, I think the scheduler in most Linux systems is more mature than the Go scheduler, and makes better choices for cache affinity, etc. It’s very hard to design a high performance cpu bound system that runs on a general purpose OS or language/platform. Without knowledge of the olap db design it is very hard to make a recommendation.

This is some suggested reading to help you in your journey https://dave.cheney.net/high-performance-go-workshop/dotgo-paris.html


颜文泽

Feb 2, 2021, 1:32:40 AM
to golang-nuts
Thanks. It's not an in-memory database, but my current test does not involve I/O. I'll take time to look at your information, thanks a lot. I also found that many of the functions with a high CPI rate are runtime functions; is the overhead of these functions unavoidable? The following screenshot is for a single goroutine:
[Screenshot 2021-02-02 14-25-33.png]
The following screenshot is for 8 goroutines:
[Screenshot 2021-02-02 14-25-56.png]

颜文泽

Feb 2, 2021, 1:37:37 AM
to golang-nuts
One more question: is it effective to use VTune to tune Go programs? I am afraid that VTune is not suitable, although Intel claims it is effective.

Amnon

Feb 2, 2021, 2:27:45 AM
to golang-nuts
VTune is very useful for squeezing the ultimate performance out of Go programs, once you have done
the usual optimisation: minimized allocations, I/O, etc.

pprof is more than adequate for the average programmer. But when you need to super-optimise
functions which implement math kernels, crypto functions, video codecs etc., then without a HW performance
counter based profiler such as VTune or Linux perf (https://perf.wiki.kernel.org/index.php/Main_Page) you are shooting in the dark.
VTune not only tells you which functions are taking the most time, but WHY they are taking a long time:
how long the code spends waiting for cache misses, and the different kinds of stall cycles which
kill performance on a modern CPU.
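
As a starting point before reaching for a hardware-counter profiler, a CPU profile can be collected with the standard `runtime/pprof` package. This is a minimal sketch; the helper name `profileCPU`, the output file name `cpu.prof`, and the loop count are assumptions made for the example.

```go
package main

import (
    "hash/crc32"
    "log"
    "os"
    "runtime/pprof"
)

// profileCPU (hypothetical helper) runs work while recording a CPU profile
// to the file at path.
func profileCPU(path string, work func()) error {
    f, err := os.Create(path)
    if err != nil {
        return err
    }
    defer f.Close() // runs after StopCPUProfile, since defers run in reverse order
    if err := pprof.StartCPUProfile(f); err != nil {
        return err
    }
    defer pprof.StopCPUProfile()
    work()
    return nil
}

func main() {
    data := make([]byte, 4<<20)
    err := profileCPU("cpu.prof", func() {
        for i := 0; i < 200; i++ { // assumed loop count, kept small for a quick run
            crc32.ChecksumIEEE(data)
        }
    })
    if err != nil {
        log.Fatal(err)
    }
}
```

The resulting file can then be inspected with `go tool pprof cpu.prof`. Note that this shows where time is spent, not why; the cache-miss and stall analysis described above still needs a tool such as VTune or perf.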

Vtune or perf is also a great tool for teaching us about processors, and helping us understand what influences
the rate at which instructions are executed by them.

The problem with vtune is that it is quite unfriendly and expensive (> $3000 for a single floating license)!
It also does not work on ARM processors (such as Apple M1).

There has been a proposal to add performance counters to pprof.
https://go.googlesource.com/proposal/+/refs/changes/08/219508/2/design/36821-perf-counter-pprof.md
If accepted, this would give the power of VTune to the masses for free.

颜文泽

Feb 2, 2021, 2:48:26 AM
to golang-nuts
If it works, that's fine; I'll just keep using VTune. I only work on x86 anyway. That said, I found another strange thing: my program has 13 goroutines as soon as it starts. It's so peculiar; I simply can't understand why.

This is my code:

[Screenshot 2021-02-02 15-45-01.png]
And this is the result; it's surprising. I think I know why my program is slow: the number of goroutines is too high. But I found that the GOMAXPROCS function doesn't seem to work, which is really confusing to me.
My example does not do anything, so my understanding is that the number of goroutines should be only 1.
[Screenshot 2021-02-02 15-45-49.png]

颜文泽

Feb 2, 2021, 2:50:32 AM
to golang-nuts
Note: I don't use the init function

颜文泽

Feb 2, 2021, 3:06:18 AM
to golang-nuts
Probably introduced by a third-party package. I'll troubleshoot.
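
One way to troubleshoot where unexpected goroutines come from is to dump their stacks with the built-in goroutine profile. A minimal sketch follows; the helper name `goroutineDump` is made up for illustration, while `pprof.Lookup("goroutine")` and `runtime.NumGoroutine` are standard library calls.

```go
package main

import (
    "bytes"
    "fmt"
    "os"
    "runtime"
    "runtime/pprof"
)

// goroutineDump (hypothetical helper) returns a text dump of the stacks of
// the goroutines that currently exist. With debug=1, goroutines that share
// the same stack are grouped together with a count.
func goroutineDump() (string, error) {
    var buf bytes.Buffer
    if err := pprof.Lookup("goroutine").WriteTo(&buf, 1); err != nil {
        return "", err
    }
    return buf.String(), nil
}

func main() {
    fmt.Println("goroutines:", runtime.NumGoroutine())
    dump, err := goroutineDump()
    if err != nil {
        fmt.Fprintln(os.Stderr, err)
        os.Exit(1)
    }
    fmt.Print(dump)
}
```

Calling this at the start of `main` in the real program should show which packages started the extra goroutines, since each stack includes the function that created it.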

Axel Wagner

Feb 2, 2021, 3:17:46 AM
to 颜文泽, golang-nuts
On Tue, Feb 2, 2021 at 8:48 AM 颜文泽 <nnsm...@gmail.com> wrote:
And then this is the result, it's amazing.I think I know why my program is slow, the number of routines is too high

13 goroutines is certainly not "too high".
 
but I found that the GOMAXPROCS function doesn't work, it's a really confusing phenomenon for me.
My example did not do anything, my understanding of the number of runtines should be 1 only Ah.

GOMAXPROCS specifies the maximum number of goroutines executing go code at the same time - it does not limit the number of threads or the number of goroutines you can start. The extra goroutines are probably both from third-party dependencies and from the runtime itself (e.g. for the GC).
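
A small sketch can make this distinction concrete: with GOMAXPROCS set to 1, many goroutines can still exist at once. The helper name `parkedGoroutines` and the count of 100 are assumptions made for the example.

```go
package main

import (
    "fmt"
    "runtime"
    "sync"
)

// parkedGoroutines (hypothetical helper) starts n goroutines that block
// until released, and returns runtime.NumGoroutine() observed while all
// of them are alive.
func parkedGoroutines(n int) int {
    release := make(chan struct{})
    var started, done sync.WaitGroup
    for i := 0; i < n; i++ {
        started.Add(1)
        done.Add(1)
        go func() {
            defer done.Done()
            started.Done() // signal that this goroutine is running
            <-release      // then wait until main lets it go
        }()
    }
    started.Wait()
    count := runtime.NumGoroutine()
    close(release)
    done.Wait()
    return count
}

func main() {
    prev := runtime.GOMAXPROCS(1) // at most one goroutine executes Go code at a time
    defer runtime.GOMAXPROCS(prev)
    fmt.Println("GOMAXPROCS:", runtime.GOMAXPROCS(0))
    fmt.Println("live goroutines:", parkedGoroutines(100))
}
```

The reported goroutine count is at least 100 even though only one of them can run at any moment, which is why a goroutine count above the CPU count is not by itself a sign of a problem.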
 

颜文泽

Feb 2, 2021, 3:21:56 AM
to golang-nuts
Thank you.

Dan Kortschak

Feb 2, 2021, 3:24:12 AM
to 颜文泽, golang-nuts
On Mon, 2021-02-01 at 23:48 -0800, 颜文泽 wrote:
> This is my code:
> [Screenshot 2021-02-02 15-45-01.png]

Please don't post code as images.


颜文泽

Feb 2, 2021, 3:48:27 AM
to golang-nuts

ok