sync.Mutex encounters a large performance drop when goroutine contention exceeds roughly 3400


changkun

Aug 19, 2019, 1:50:21 PM
to golang-nuts
I am comparing the performance of sync.Mutex and Go channels. Here is my benchmark: https://play.golang.org/p/zLjVtsSx9gd
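
Roughly, the benchmark has the following shape (an illustrative sketch, not the exact playground code; the helper names and the way b.N is split across goroutines are my own assumptions). The chan variant replaces the mutex with a channel of capacity 1 used as a lock (ch <- struct{}{}; v++; <-ch).

package bench // sketch only; lives in a _test.go file

import (
	"fmt"
	"sync"
	"sync/atomic"
	"testing"
)

// b.N locked increments are shared among `procs` goroutines via an atomic countdown.
func benchmarkMutexWrite(b *testing.B, procs int) {
	var mu sync.Mutex
	var v int64
	remaining := int64(b.N)

	var wg sync.WaitGroup
	wg.Add(procs)
	b.ResetTimer()
	for i := 0; i < procs; i++ {
		go func() {
			defer wg.Done()
			for atomic.AddInt64(&remaining, -1) >= 0 {
				mu.Lock()
				v++
				mu.Unlock()
			}
		}()
	}
	wg.Wait()
	_ = v // keep the counter alive
}

func BenchmarkMutexWrite(b *testing.B) {
	for procs := 2400; procs <= 4800; procs += 80 {
		b.Run(fmt.Sprintf("goroutines-%d", procs), func(b *testing.B) {
			benchmarkMutexWrite(b, procs)
		})
	}
}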

The performance comparison visualization is as follows:

sync.Mutex performance (1).png

What are the reasons that:

1. sync.Mutex encounters a large performance drop when the number of goroutines exceeds roughly 3400?
2. Go channels are fairly stable but slower than sync.Mutex before that point?



Raw bench data by benchstat (go test -bench=. -count=5):

MutexWrite/goroutines-2400-8  48.6ns ± 1%
MutexWrite/goroutines-2480-8  49.1ns ± 0%
MutexWrite/goroutines-2560-8  49.7ns ± 1%
MutexWrite/goroutines-2640-8  50.5ns ± 3%
MutexWrite/goroutines-2720-8  50.9ns ± 2%
MutexWrite/goroutines-2800-8  51.8ns ± 3%
MutexWrite/goroutines-2880-8  52.5ns ± 2%
MutexWrite/goroutines-2960-8  54.1ns ± 4%
MutexWrite/goroutines-3040-8  54.5ns ± 2%
MutexWrite/goroutines-3120-8  56.1ns ± 3%
MutexWrite/goroutines-3200-8  63.2ns ± 5%
MutexWrite/goroutines-3280-8  77.5ns ± 6%
MutexWrite/goroutines-3360-8   141ns ± 6%
MutexWrite/goroutines-3440-8   239ns ± 8%
MutexWrite/goroutines-3520-8   248ns ± 3%
MutexWrite/goroutines-3600-8   254ns ± 2%
MutexWrite/goroutines-3680-8   256ns ± 1%
MutexWrite/goroutines-3760-8   261ns ± 2%
MutexWrite/goroutines-3840-8   266ns ± 3%
MutexWrite/goroutines-3920-8   276ns ± 3%
MutexWrite/goroutines-4000-8   278ns ± 3%
MutexWrite/goroutines-4080-8   286ns ± 5%
MutexWrite/goroutines-4160-8   293ns ± 4%
MutexWrite/goroutines-4240-8   295ns ± 2%
MutexWrite/goroutines-4320-8   280ns ± 8%
MutexWrite/goroutines-4400-8   294ns ± 9%
MutexWrite/goroutines-4480-8   285ns ±10%
MutexWrite/goroutines-4560-8   290ns ± 8%
MutexWrite/goroutines-4640-8   271ns ± 3%
MutexWrite/goroutines-4720-8   271ns ± 4%

ChanWrite/goroutines-2400-8  158ns ± 3%
ChanWrite/goroutines-2480-8  159ns ± 2%
ChanWrite/goroutines-2560-8  161ns ± 2%
ChanWrite/goroutines-2640-8  161ns ± 1%
ChanWrite/goroutines-2720-8  163ns ± 1%
ChanWrite/goroutines-2800-8  166ns ± 3%
ChanWrite/goroutines-2880-8  168ns ± 1%
ChanWrite/goroutines-2960-8  176ns ± 4%
ChanWrite/goroutines-3040-8  176ns ± 2%
ChanWrite/goroutines-3120-8  180ns ± 1%
ChanWrite/goroutines-3200-8  180ns ± 1%
ChanWrite/goroutines-3280-8  181ns ± 2%
ChanWrite/goroutines-3360-8  183ns ± 2%
ChanWrite/goroutines-3440-8  188ns ± 3%
ChanWrite/goroutines-3520-8  190ns ± 2%
ChanWrite/goroutines-3600-8  193ns ± 2%
ChanWrite/goroutines-3680-8  196ns ± 3%
ChanWrite/goroutines-3760-8  199ns ± 2%
ChanWrite/goroutines-3840-8  206ns ± 2%
ChanWrite/goroutines-3920-8  209ns ± 2%
ChanWrite/goroutines-4000-8  206ns ± 2%
ChanWrite/goroutines-4080-8  209ns ± 2%
ChanWrite/goroutines-4160-8  208ns ± 2%
ChanWrite/goroutines-4240-8  209ns ± 3%
ChanWrite/goroutines-4320-8  213ns ± 2%
ChanWrite/goroutines-4400-8  209ns ± 2%
ChanWrite/goroutines-4480-8  211ns ± 1%
ChanWrite/goroutines-4560-8  213ns ± 2%
ChanWrite/goroutines-4640-8  215ns ± 1%
ChanWrite/goroutines-4720-8  218ns ± 3%


Ian Lance Taylor

Aug 19, 2019, 3:15:50 PM
to changkun, golang-nuts
On Mon, Aug 19, 2019 at 10:50 AM changkun <euryu...@gmail.com> wrote:
>
> I am comparing the performance regarding sync.Mutex and Go channels. Here is my benchmark: https://play.golang.org/p/zLjVtsSx9gd

Might be interesting to try running your benchmark on a machine with
different hardware.

Ian

changkun

Aug 19, 2019, 4:18:18 PM
to golang-nuts
Hi Ian,

Thanks for the hint. I tried on a Mac mini with an i5-8500B, and the unexpected performance drop still exists (GOMAXPROCS is kept at 8, so the CPU is the controlled variable):

sync.Mutex performance (GOMAXPROCS == 8).png

Robert Engels

Aug 19, 2019, 4:27:19 PM
to changkun, golang-nuts
I think you'll find the reason is that the Mutex uses the Go scheduler. The chan is controlled by a 'mutex' which eventually defers to the OS futex - and the OS futex is probably more efficient at scheduling in the face of large contention - although you would think it should be the other way around.

I am guessing that if you logged the number of operations done per Go routine, you would see that the Mutex version is very fair, and the chan/futex version is unfair - meaning many are starved.
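
A minimal sketch of what such per-routine accounting could look like (illustrative only; `numG`, `total`, and the counts slice are assumptions, not something from the playground code). A fair lock gives a narrow spread of counts; a starving one gives a very wide spread. Swapping the mutex for a 1-buffered channel gives the corresponding chan comparison.

package main

import (
	"fmt"
	"sort"
	"sync"
	"sync/atomic"
)

// Each goroutine owns exactly one slot in counts, so the bookkeeping itself
// needs no extra synchronization; wg.Wait() makes the final reads safe.
func fairness(numG int, total int64) []int64 {
	var mu sync.Mutex
	var v int64
	remaining := total
	counts := make([]int64, numG)

	var wg sync.WaitGroup
	wg.Add(numG)
	for i := 0; i < numG; i++ {
		go func(id int) {
			defer wg.Done()
			for atomic.AddInt64(&remaining, -1) >= 0 {
				mu.Lock()
				v++
				mu.Unlock()
				counts[id]++ // only goroutine id writes this slot
			}
		}(i)
	}
	wg.Wait()
	_ = v
	return counts
}

func main() {
	counts := fairness(3600, 5000000)
	sort.Slice(counts, func(i, j int) bool { return counts[i] < counts[j] })
	fmt.Println("min ops per goroutine:", counts[0], "max:", counts[len(counts)-1])
}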




changkun

Aug 20, 2019, 4:33:31 AM
to golang-nuts
Hi Robert,

Thanks for your explanation. But how could I log the number of operations done per goroutine? Which particular debug settings are you referring to?
It is reasonable that sync.Mutex relies on the runtime scheduler while channels do not. However, it is still unclear why such a significant performance drop appears. Is it possible to determine when the drop will occur?

Best,
Changkun

Ian Davis

Aug 20, 2019, 4:55:13 AM
to golan...@googlegroups.com


This comment by Dmitry Vyukov on a very old issue might help (I have no idea if it is still valid after 6 years though)

If you are interested why chan does not have the same issue, runtime handles chan's in a
special way (active/passive spinning + thread parking instead of goroutine parking),
because it knows that the critical section is bounded and small. For sync.Mutex it does
not have any knowledge as to critical section size.



changkun

Aug 20, 2019, 5:27:40 AM
to golang-nuts
Hi Ian Davis, I read that issue before I posted this thread.
I think the old issue is quite different from this one. Here the problem is a sudden performance drop and an unexpected regression,
whereas issue #5183 only experimented with a very limited number of goroutines, and Dmitry's answer is fair for that case.

Best, Changkun

robert engels

Aug 20, 2019, 9:27:46 AM
to changkun, golang-nuts
If you just print the final value of V in both cases you will see: On my iMac:

channels: 215606060
mutex: 651606060

That is, the mutex has 3x greater throughput. How the Go routines are scheduled is OS dependent - the OSX code uses the same mechanism in both cases.

You have to remember that the benchmark harness times the cost of an operation when it runs, but it doesn't always run the same number of invocations - that is based on the scheduling of the Go routines and the desired length of the test time.

goos: darwin
goarch: amd64
BenchmarkChanWrite/goroutines-4800-8         3000000       638 ns/op
BenchmarkChanWrite/goroutines-4880-8         2000000       636 ns/op
BenchmarkChanWrite/goroutines-4960-8         3000000       602 ns/op
BenchmarkChanWrite/goroutines-5040-8         2000000       632 ns/op
BenchmarkChanWrite/goroutines-5120-8         2000000       591 ns/op
BenchmarkChanWrite/goroutines-5200-8         3000000       634 ns/op
BenchmarkChanWrite/goroutines-5280-8         3000000       632 ns/op
BenchmarkChanWrite/goroutines-5360-8         3000000       581 ns/op
BenchmarkChanWrite/goroutines-5440-8         3000000       584 ns/op
BenchmarkChanWrite/goroutines-5520-8         3000000       582 ns/op
BenchmarkChanWrite/goroutines-5600-8         3000000       601 ns/op
BenchmarkChanWrite/goroutines-5680-8         3000000       609 ns/op
BenchmarkChanWrite/goroutines-5760-8         3000000       597 ns/op
BenchmarkChanWrite/goroutines-5840-8         3000000       674 ns/op
BenchmarkChanWrite/goroutines-5920-8         3000000       597 ns/op
BenchmarkChanWrite/goroutines-6000-8         3000000       601 ns/op
BenchmarkChanWrite/goroutines-6080-8         2000000       616 ns/op
BenchmarkChanWrite/goroutines-6160-8         3000000       603 ns/op
BenchmarkChanWrite/goroutines-6240-8         2000000       624 ns/op
BenchmarkChanWrite/goroutines-6320-8         3000000       603 ns/op
BenchmarkChanWrite/goroutines-6400-8         3000000       602 ns/op
BenchmarkChanWrite/goroutines-6480-8         2000000       616 ns/op
BenchmarkChanWrite/goroutines-6560-8         2000000       611 ns/op
BenchmarkChanWrite/goroutines-6640-8         3000000       603 ns/op
BenchmarkChanWrite/goroutines-6720-8         2000000       621 ns/op
BenchmarkChanWrite/goroutines-6800-8         3000000       636 ns/op
BenchmarkChanWrite/goroutines-6880-8         2000000       655 ns/op
BenchmarkChanWrite/goroutines-6960-8         3000000       605 ns/op
BenchmarkChanWrite/goroutines-7040-8         3000000       606 ns/op
BenchmarkChanWrite/goroutines-7120-8         2000000       639 ns/op
BenchmarkChanWrite/goroutines-7200-8         3000000       608 ns/op
BenchmarkChanWrite/goroutines-7280-8         3000000       617 ns/op
BenchmarkChanWrite/goroutines-7360-8         3000000       594 ns/op
BenchmarkChanWrite/goroutines-7440-8         3000000       596 ns/op
BenchmarkChanWrite/goroutines-7520-8         2000000       599 ns/op
BenchmarkChanWrite/goroutines-7600-8         2000000       610 ns/op
BenchmarkChanWrite/goroutines-7680-8         3000000       597 ns/op
BenchmarkChanWrite/goroutines-7760-8         2000000       608 ns/op
BenchmarkChanWrite/goroutines-7840-8         3000000       592 ns/op
BenchmarkChanWrite/goroutines-7920-8         3000000       611 ns/op
BenchmarkChanWrite/goroutines-8000-8         2000000       618 ns/op
BenchmarkChanWrite/goroutines-8080-8         3000000       612 ns/op
BenchmarkChanWrite/goroutines-8160-8         2000000       609 ns/op
BenchmarkChanWrite/goroutines-8240-8         3000000       594 ns/op
BenchmarkChanWrite/goroutines-8320-8         3000000       629 ns/op
BenchmarkChanWrite/goroutines-8400-8         2000000       624 ns/op
BenchmarkChanWrite/goroutines-8480-8         2000000       606 ns/op
BenchmarkChanWrite/goroutines-8560-8         2000000       635 ns/op
BenchmarkChanWrite/goroutines-8640-8         2000000       603 ns/op
BenchmarkChanWrite/goroutines-8720-8         3000000       625 ns/op
BenchmarkChanWrite/goroutines-8800-8         2000000       632 ns/op
BenchmarkChanWrite/goroutines-8880-8         2000000       602 ns/op
BenchmarkChanWrite/goroutines-8960-8         3000000       632 ns/op
BenchmarkChanWrite/goroutines-9040-8         2000000       602 ns/op
BenchmarkChanWrite/goroutines-9120-8         3000000       625 ns/op
BenchmarkChanWrite/goroutines-9200-8         3000000       657 ns/op
BenchmarkChanWrite/goroutines-9280-8         3000000       681 ns/op
BenchmarkChanWrite/goroutines-9360-8         2000000       630 ns/op
BenchmarkChanWrite/goroutines-9440-8         2000000       623 ns/op
BenchmarkChanWrite/goroutines-9520-8         2000000       613 ns/op
215606060
BenchmarkMutexWrite/goroutines-4800-8         10000000       634 ns/op
BenchmarkMutexWrite/goroutines-4880-8         20000000       635 ns/op
BenchmarkMutexWrite/goroutines-4960-8         20000000       645 ns/op
BenchmarkMutexWrite/goroutines-5040-8         20000000       665 ns/op
BenchmarkMutexWrite/goroutines-5120-8         20000000       627 ns/op
BenchmarkMutexWrite/goroutines-5200-8         20000000       721 ns/op
BenchmarkMutexWrite/goroutines-5280-8         20000000       697 ns/op
BenchmarkMutexWrite/goroutines-5360-8         20000000       697 ns/op
BenchmarkMutexWrite/goroutines-5440-8         5000000       628 ns/op
BenchmarkMutexWrite/goroutines-5520-8         10000000       647 ns/op
BenchmarkMutexWrite/goroutines-5600-8         3000000       528 ns/op
BenchmarkMutexWrite/goroutines-5680-8         10000000       656 ns/op
BenchmarkMutexWrite/goroutines-5760-8         10000000       649 ns/op
BenchmarkMutexWrite/goroutines-5840-8         20000000       721 ns/op
BenchmarkMutexWrite/goroutines-5920-8         10000000       666 ns/op
BenchmarkMutexWrite/goroutines-6000-8         10000000       655 ns/op
BenchmarkMutexWrite/goroutines-6080-8         20000000       719 ns/op
BenchmarkMutexWrite/goroutines-6160-8         10000000       730 ns/op
BenchmarkMutexWrite/goroutines-6240-8         5000000       524 ns/op
BenchmarkMutexWrite/goroutines-6320-8         5000000       645 ns/op
BenchmarkMutexWrite/goroutines-6400-8         5000000       753 ns/op
BenchmarkMutexWrite/goroutines-6480-8         10000000       755 ns/op
BenchmarkMutexWrite/goroutines-6560-8         10000000       697 ns/op
BenchmarkMutexWrite/goroutines-6640-8         10000000       682 ns/op
BenchmarkMutexWrite/goroutines-6720-8         10000000       698 ns/op
BenchmarkMutexWrite/goroutines-6800-8         10000000       661 ns/op
BenchmarkMutexWrite/goroutines-6880-8         5000000       609 ns/op
BenchmarkMutexWrite/goroutines-6960-8         10000000       689 ns/op
BenchmarkMutexWrite/goroutines-7040-8         5000000       656 ns/op
BenchmarkMutexWrite/goroutines-7120-8         10000000       682 ns/op
BenchmarkMutexWrite/goroutines-7200-8         5000000       578 ns/op
BenchmarkMutexWrite/goroutines-7280-8         3000000       607 ns/op
BenchmarkMutexWrite/goroutines-7360-8         5000000       544 ns/op
BenchmarkMutexWrite/goroutines-7440-8         5000000       589 ns/op
BenchmarkMutexWrite/goroutines-7520-8         20000000       692 ns/op
BenchmarkMutexWrite/goroutines-7600-8         5000000       673 ns/op
BenchmarkMutexWrite/goroutines-7680-8         5000000       618 ns/op
BenchmarkMutexWrite/goroutines-7760-8         5000000       667 ns/op
BenchmarkMutexWrite/goroutines-7840-8         10000000       730 ns/op
BenchmarkMutexWrite/goroutines-7920-8         5000000       628 ns/op
BenchmarkMutexWrite/goroutines-8000-8         10000000       758 ns/op
BenchmarkMutexWrite/goroutines-8080-8         10000000       634 ns/op
BenchmarkMutexWrite/goroutines-8160-8         5000000       640 ns/op
BenchmarkMutexWrite/goroutines-8240-8         5000000       622 ns/op
BenchmarkMutexWrite/goroutines-8320-8         10000000       680 ns/op
BenchmarkMutexWrite/goroutines-8400-8         10000000       687 ns/op
BenchmarkMutexWrite/goroutines-8480-8         5000000       575 ns/op
BenchmarkMutexWrite/goroutines-8560-8         10000000       693 ns/op
BenchmarkMutexWrite/goroutines-8640-8         10000000       657 ns/op
BenchmarkMutexWrite/goroutines-8720-8         10000000       716 ns/op
BenchmarkMutexWrite/goroutines-8800-8         5000000       513 ns/op
BenchmarkMutexWrite/goroutines-8880-8         5000000       543 ns/op
BenchmarkMutexWrite/goroutines-8960-8         5000000       811 ns/op
BenchmarkMutexWrite/goroutines-9040-8         10000000       741 ns/op
BenchmarkMutexWrite/goroutines-9120-8         5000000       683 ns/op
BenchmarkMutexWrite/goroutines-9200-8         5000000       666 ns/op
BenchmarkMutexWrite/goroutines-9280-8         30000000       746 ns/op
BenchmarkMutexWrite/goroutines-9360-8         5000000       638 ns/op
BenchmarkMutexWrite/goroutines-9440-8         5000000       658 ns/op
BenchmarkMutexWrite/goroutines-9520-8         10000000       688 ns/op
651606060
PASS


Changkun Ou

Aug 20, 2019, 10:00:33 AM
to robert engels, golang-nuts
Very interesting observation: higher throughput comes with lower per-op performance. However, I am still interested in the particular reason for the sudden performance drop, since throughput alone doesn't explain why the scheduler makes the runtime semaphore slower than the futex.
- Is it because of the randomization in the work-stealing strategy? The comparison on two machines doesn't support this argument (the sudden drop appears on two different pieces of hardware).
- Is it because of contention on the global goroutine queue lock? The reasoning is quite unclear for such a deterministic observation.
- and so on…

---

P.S. My logs:

goos: linux
goarch: amd64
BenchmarkMutexWrite/goroutines-2400-8           50000000                48.7 ns/op
BenchmarkMutexWrite/goroutines-2400-8           50000000                48.2 ns/op
BenchmarkMutexWrite/goroutines-2400-8           50000000                48.9 ns/op
BenchmarkMutexWrite/goroutines-2400-8           50000000                48.8 ns/op
BenchmarkMutexWrite/goroutines-2400-8           50000000                48.4 ns/op
BenchmarkMutexWrite/goroutines-2480-8           50000000                49.1 ns/op
BenchmarkMutexWrite/goroutines-2480-8           50000000                48.9 ns/op
BenchmarkMutexWrite/goroutines-2480-8           50000000                49.3 ns/op
BenchmarkMutexWrite/goroutines-2480-8           50000000                49.2 ns/op
BenchmarkMutexWrite/goroutines-2480-8           50000000                48.9 ns/op
BenchmarkMutexWrite/goroutines-2560-8           50000000                49.8 ns/op
BenchmarkMutexWrite/goroutines-2560-8           50000000                50.0 ns/op
BenchmarkMutexWrite/goroutines-2560-8           50000000                49.9 ns/op
BenchmarkMutexWrite/goroutines-2560-8           50000000                49.6 ns/op
BenchmarkMutexWrite/goroutines-2560-8           50000000                49.4 ns/op
BenchmarkMutexWrite/goroutines-2640-8           50000000                51.0 ns/op
BenchmarkMutexWrite/goroutines-2640-8           50000000                50.1 ns/op
BenchmarkMutexWrite/goroutines-2640-8           50000000                51.4 ns/op
BenchmarkMutexWrite/goroutines-2640-8           50000000                49.2 ns/op
BenchmarkMutexWrite/goroutines-2640-8           50000000                50.8 ns/op
BenchmarkMutexWrite/goroutines-2720-8           50000000                50.9 ns/op
BenchmarkMutexWrite/goroutines-2720-8           50000000                50.9 ns/op
BenchmarkMutexWrite/goroutines-2720-8           50000000                50.5 ns/op
BenchmarkMutexWrite/goroutines-2720-8           50000000                51.8 ns/op
BenchmarkMutexWrite/goroutines-2720-8           50000000                50.4 ns/op
BenchmarkMutexWrite/goroutines-2800-8           50000000                51.2 ns/op
BenchmarkMutexWrite/goroutines-2800-8           50000000                51.8 ns/op
BenchmarkMutexWrite/goroutines-2800-8           50000000                53.1 ns/op
BenchmarkMutexWrite/goroutines-2800-8           50000000                51.9 ns/op
BenchmarkMutexWrite/goroutines-2800-8           50000000                51.0 ns/op
BenchmarkMutexWrite/goroutines-2880-8           50000000                52.5 ns/op
BenchmarkMutexWrite/goroutines-2880-8           50000000                53.7 ns/op
BenchmarkMutexWrite/goroutines-2880-8           50000000                52.0 ns/op
BenchmarkMutexWrite/goroutines-2880-8           50000000                52.1 ns/op
BenchmarkMutexWrite/goroutines-2880-8           50000000                52.2 ns/op
BenchmarkMutexWrite/goroutines-2960-8           50000000                54.3 ns/op
BenchmarkMutexWrite/goroutines-2960-8           50000000                52.3 ns/op
BenchmarkMutexWrite/goroutines-2960-8           50000000                54.2 ns/op
BenchmarkMutexWrite/goroutines-2960-8           50000000                56.1 ns/op
BenchmarkMutexWrite/goroutines-2960-8           50000000                53.7 ns/op
BenchmarkMutexWrite/goroutines-3040-8           50000000                55.5 ns/op
BenchmarkMutexWrite/goroutines-3040-8           50000000                53.8 ns/op
BenchmarkMutexWrite/goroutines-3040-8           50000000                55.6 ns/op
BenchmarkMutexWrite/goroutines-3040-8           50000000                53.5 ns/op
BenchmarkMutexWrite/goroutines-3040-8           50000000                53.9 ns/op
BenchmarkMutexWrite/goroutines-3120-8           50000000                57.1 ns/op
BenchmarkMutexWrite/goroutines-3120-8           50000000                55.3 ns/op
BenchmarkMutexWrite/goroutines-3120-8           50000000                55.1 ns/op
BenchmarkMutexWrite/goroutines-3120-8           50000000                55.6 ns/op
BenchmarkMutexWrite/goroutines-3120-8           50000000                57.6 ns/op
BenchmarkMutexWrite/goroutines-3200-8           50000000                60.0 ns/op
BenchmarkMutexWrite/goroutines-3200-8           50000000                60.7 ns/op
BenchmarkMutexWrite/goroutines-3200-8           50000000                65.6 ns/op
BenchmarkMutexWrite/goroutines-3200-8           50000000                63.6 ns/op
BenchmarkMutexWrite/goroutines-3200-8           50000000                66.2 ns/op
BenchmarkMutexWrite/goroutines-3280-8           50000000                79.3 ns/op
BenchmarkMutexWrite/goroutines-3280-8           50000000                75.1 ns/op
BenchmarkMutexWrite/goroutines-3280-8           50000000                76.5 ns/op
BenchmarkMutexWrite/goroutines-3280-8           50000000                82.4 ns/op
BenchmarkMutexWrite/goroutines-3280-8           50000000                74.2 ns/op
BenchmarkMutexWrite/goroutines-3360-8           50000000               133 ns/op
BenchmarkMutexWrite/goroutines-3360-8           50000000               140 ns/op
BenchmarkMutexWrite/goroutines-3360-8           50000000               145 ns/op
BenchmarkMutexWrite/goroutines-3360-8           50000000               137 ns/op
BenchmarkMutexWrite/goroutines-3360-8           50000000               150 ns/op
BenchmarkMutexWrite/goroutines-3440-8           50000000               230 ns/op
BenchmarkMutexWrite/goroutines-3440-8           50000000               231 ns/op
BenchmarkMutexWrite/goroutines-3440-8           50000000               237 ns/op
BenchmarkMutexWrite/goroutines-3440-8           50000000               259 ns/op
BenchmarkMutexWrite/goroutines-3440-8           50000000               237 ns/op
BenchmarkMutexWrite/goroutines-3520-8           50000000               241 ns/op
BenchmarkMutexWrite/goroutines-3520-8           50000000               246 ns/op
BenchmarkMutexWrite/goroutines-3520-8           50000000               251 ns/op
BenchmarkMutexWrite/goroutines-3520-8           50000000               246 ns/op
BenchmarkMutexWrite/goroutines-3520-8           50000000               254 ns/op
BenchmarkMutexWrite/goroutines-3600-8           50000000               257 ns/op
BenchmarkMutexWrite/goroutines-3600-8           50000000               254 ns/op
BenchmarkMutexWrite/goroutines-3600-8           50000000               249 ns/op
BenchmarkMutexWrite/goroutines-3600-8           50000000               254 ns/op
BenchmarkMutexWrite/goroutines-3600-8           50000000               257 ns/op
BenchmarkMutexWrite/goroutines-3680-8           50000000               259 ns/op
BenchmarkMutexWrite/goroutines-3680-8           50000000               258 ns/op
BenchmarkMutexWrite/goroutines-3680-8           50000000               252 ns/op
BenchmarkMutexWrite/goroutines-3680-8           50000000               256 ns/op
BenchmarkMutexWrite/goroutines-3680-8           50000000               253 ns/op
BenchmarkMutexWrite/goroutines-3760-8           50000000               260 ns/op
BenchmarkMutexWrite/goroutines-3760-8           50000000               266 ns/op
BenchmarkMutexWrite/goroutines-3760-8           50000000               261 ns/op
BenchmarkMutexWrite/goroutines-3760-8           50000000               260 ns/op
BenchmarkMutexWrite/goroutines-3760-8           50000000               259 ns/op
BenchmarkMutexWrite/goroutines-3840-8           50000000               267 ns/op
BenchmarkMutexWrite/goroutines-3840-8           50000000               270 ns/op
BenchmarkMutexWrite/goroutines-3840-8           50000000               258 ns/op
BenchmarkMutexWrite/goroutines-3840-8           50000000               267 ns/op
BenchmarkMutexWrite/goroutines-3840-8           50000000               267 ns/op
BenchmarkMutexWrite/goroutines-3920-8           50000000               273 ns/op
BenchmarkMutexWrite/goroutines-3920-8           50000000               269 ns/op
BenchmarkMutexWrite/goroutines-3920-8           50000000               275 ns/op
BenchmarkMutexWrite/goroutines-3920-8           50000000               283 ns/op
BenchmarkMutexWrite/goroutines-3920-8           50000000               278 ns/op
BenchmarkMutexWrite/goroutines-4000-8           50000000               272 ns/op
BenchmarkMutexWrite/goroutines-4000-8           50000000               273 ns/op
BenchmarkMutexWrite/goroutines-4000-8           50000000               280 ns/op
BenchmarkMutexWrite/goroutines-4000-8           50000000               287 ns/op
BenchmarkMutexWrite/goroutines-4000-8           50000000               280 ns/op
BenchmarkMutexWrite/goroutines-4080-8           50000000               271 ns/op
BenchmarkMutexWrite/goroutines-4080-8           50000000               293 ns/op
BenchmarkMutexWrite/goroutines-4080-8           50000000               291 ns/op
BenchmarkMutexWrite/goroutines-4080-8           50000000               288 ns/op
BenchmarkMutexWrite/goroutines-4080-8           50000000               289 ns/op
BenchmarkMutexWrite/goroutines-4160-8           50000000               281 ns/op
BenchmarkMutexWrite/goroutines-4160-8           50000000               297 ns/op
BenchmarkMutexWrite/goroutines-4160-8           50000000               293 ns/op
BenchmarkMutexWrite/goroutines-4160-8           50000000               300 ns/op
BenchmarkMutexWrite/goroutines-4160-8           50000000               293 ns/op
BenchmarkMutexWrite/goroutines-4240-8           50000000               296 ns/op
BenchmarkMutexWrite/goroutines-4240-8           50000000               299 ns/op
BenchmarkMutexWrite/goroutines-4240-8           50000000               288 ns/op
BenchmarkMutexWrite/goroutines-4240-8           50000000               294 ns/op
BenchmarkMutexWrite/goroutines-4240-8           50000000               296 ns/op
BenchmarkMutexWrite/goroutines-4320-8           50000000               259 ns/op
BenchmarkMutexWrite/goroutines-4320-8           50000000               294 ns/op
BenchmarkMutexWrite/goroutines-4320-8           50000000               257 ns/op
BenchmarkMutexWrite/goroutines-4320-8           50000000               298 ns/op
BenchmarkMutexWrite/goroutines-4320-8           50000000               293 ns/op
BenchmarkMutexWrite/goroutines-4400-8           50000000               303 ns/op
BenchmarkMutexWrite/goroutines-4400-8           50000000               299 ns/op
BenchmarkMutexWrite/goroutines-4400-8           50000000               269 ns/op
BenchmarkMutexWrite/goroutines-4400-8           50000000               302 ns/op
BenchmarkMutexWrite/goroutines-4400-8           50000000               299 ns/op
BenchmarkMutexWrite/goroutines-4480-8           50000000               293 ns/op
BenchmarkMutexWrite/goroutines-4480-8           50000000               299 ns/op
BenchmarkMutexWrite/goroutines-4480-8           50000000               297 ns/op
BenchmarkMutexWrite/goroutines-4480-8           50000000               256 ns/op
BenchmarkMutexWrite/goroutines-4480-8           50000000               280 ns/op
BenchmarkMutexWrite/goroutines-4560-8           50000000               294 ns/op
BenchmarkMutexWrite/goroutines-4560-8           50000000               280 ns/op
BenchmarkMutexWrite/goroutines-4560-8           50000000               308 ns/op
BenchmarkMutexWrite/goroutines-4560-8           50000000               267 ns/op
BenchmarkMutexWrite/goroutines-4560-8           50000000               301 ns/op
BenchmarkMutexWrite/goroutines-4640-8           50000000               279 ns/op
BenchmarkMutexWrite/goroutines-4640-8           50000000               276 ns/op
BenchmarkMutexWrite/goroutines-4640-8           50000000               266 ns/op
BenchmarkMutexWrite/goroutines-4640-8           50000000               267 ns/op
BenchmarkMutexWrite/goroutines-4640-8           50000000               265 ns/op
BenchmarkMutexWrite/goroutines-4720-8           50000000               271 ns/op
BenchmarkMutexWrite/goroutines-4720-8           50000000               281 ns/op
BenchmarkMutexWrite/goroutines-4720-8           50000000               269 ns/op
BenchmarkMutexWrite/goroutines-4720-8           50000000               260 ns/op
BenchmarkMutexWrite/goroutines-4720-8           50000000               273 ns/op


BenchmarkChanWrite/goroutines-2400-8            10000000               154 ns/op
BenchmarkChanWrite/goroutines-2400-8            10000000               161 ns/op
BenchmarkChanWrite/goroutines-2400-8            10000000               161 ns/op
BenchmarkChanWrite/goroutines-2400-8            10000000               156 ns/op
BenchmarkChanWrite/goroutines-2400-8            10000000               158 ns/op
BenchmarkChanWrite/goroutines-2480-8            10000000               160 ns/op
BenchmarkChanWrite/goroutines-2480-8            10000000               157 ns/op
BenchmarkChanWrite/goroutines-2480-8            10000000               159 ns/op
BenchmarkChanWrite/goroutines-2480-8            10000000               162 ns/op
BenchmarkChanWrite/goroutines-2480-8            10000000               157 ns/op
BenchmarkChanWrite/goroutines-2560-8            10000000               159 ns/op
BenchmarkChanWrite/goroutines-2560-8            10000000               160 ns/op
BenchmarkChanWrite/goroutines-2560-8            10000000               160 ns/op
BenchmarkChanWrite/goroutines-2560-8            10000000               162 ns/op
BenchmarkChanWrite/goroutines-2560-8            10000000               165 ns/op
BenchmarkChanWrite/goroutines-2640-8            10000000               162 ns/op
BenchmarkChanWrite/goroutines-2640-8            10000000               159 ns/op
BenchmarkChanWrite/goroutines-2640-8            10000000               161 ns/op
BenchmarkChanWrite/goroutines-2640-8            10000000               162 ns/op
BenchmarkChanWrite/goroutines-2640-8            10000000               163 ns/op
BenchmarkChanWrite/goroutines-2720-8            10000000               164 ns/op
BenchmarkChanWrite/goroutines-2720-8            10000000               164 ns/op
BenchmarkChanWrite/goroutines-2720-8            10000000               162 ns/op
BenchmarkChanWrite/goroutines-2720-8            10000000               161 ns/op
BenchmarkChanWrite/goroutines-2720-8            10000000               162 ns/op
BenchmarkChanWrite/goroutines-2800-8            10000000               170 ns/op
BenchmarkChanWrite/goroutines-2800-8            10000000               165 ns/op
BenchmarkChanWrite/goroutines-2800-8            10000000               163 ns/op
BenchmarkChanWrite/goroutines-2800-8            10000000               165 ns/op
BenchmarkChanWrite/goroutines-2800-8            10000000               165 ns/op
BenchmarkChanWrite/goroutines-2880-8            10000000               170 ns/op
BenchmarkChanWrite/goroutines-2880-8            10000000               167 ns/op
BenchmarkChanWrite/goroutines-2880-8            10000000               167 ns/op
BenchmarkChanWrite/goroutines-2880-8            10000000               168 ns/op
BenchmarkChanWrite/goroutines-2880-8            10000000               169 ns/op
BenchmarkChanWrite/goroutines-2960-8            10000000               181 ns/op
BenchmarkChanWrite/goroutines-2960-8            10000000               182 ns/op
BenchmarkChanWrite/goroutines-2960-8            10000000               172 ns/op
BenchmarkChanWrite/goroutines-2960-8            10000000               170 ns/op
BenchmarkChanWrite/goroutines-2960-8            10000000               174 ns/op
BenchmarkChanWrite/goroutines-3040-8            10000000               175 ns/op
BenchmarkChanWrite/goroutines-3040-8            10000000               179 ns/op
BenchmarkChanWrite/goroutines-3040-8            10000000               174 ns/op
BenchmarkChanWrite/goroutines-3040-8            10000000               175 ns/op
BenchmarkChanWrite/goroutines-3040-8            10000000               176 ns/op
BenchmarkChanWrite/goroutines-3120-8            10000000               179 ns/op
BenchmarkChanWrite/goroutines-3120-8            10000000               182 ns/op
BenchmarkChanWrite/goroutines-3120-8            10000000               182 ns/op
BenchmarkChanWrite/goroutines-3120-8            10000000               179 ns/op
BenchmarkChanWrite/goroutines-3120-8            10000000               180 ns/op
BenchmarkChanWrite/goroutines-3200-8            10000000               178 ns/op
BenchmarkChanWrite/goroutines-3200-8            10000000               182 ns/op
BenchmarkChanWrite/goroutines-3200-8            10000000               179 ns/op
BenchmarkChanWrite/goroutines-3200-8            10000000               180 ns/op
BenchmarkChanWrite/goroutines-3200-8            10000000               180 ns/op
BenchmarkChanWrite/goroutines-3280-8            10000000               179 ns/op
BenchmarkChanWrite/goroutines-3280-8            10000000               185 ns/op
BenchmarkChanWrite/goroutines-3280-8            10000000               180 ns/op
BenchmarkChanWrite/goroutines-3280-8            10000000               180 ns/op
BenchmarkChanWrite/goroutines-3280-8            10000000               179 ns/op
BenchmarkChanWrite/goroutines-3360-8            10000000               183 ns/op
BenchmarkChanWrite/goroutines-3360-8            10000000               187 ns/op
BenchmarkChanWrite/goroutines-3360-8            10000000               180 ns/op
BenchmarkChanWrite/goroutines-3360-8            10000000               182 ns/op
BenchmarkChanWrite/goroutines-3360-8            10000000               184 ns/op
BenchmarkChanWrite/goroutines-3440-8            10000000               192 ns/op
BenchmarkChanWrite/goroutines-3440-8            10000000               192 ns/op
BenchmarkChanWrite/goroutines-3440-8            10000000               186 ns/op
BenchmarkChanWrite/goroutines-3440-8            10000000               183 ns/op
BenchmarkChanWrite/goroutines-3440-8            10000000               188 ns/op
BenchmarkChanWrite/goroutines-3520-8            10000000               190 ns/op
BenchmarkChanWrite/goroutines-3520-8            10000000               193 ns/op
BenchmarkChanWrite/goroutines-3520-8            10000000               188 ns/op
BenchmarkChanWrite/goroutines-3520-8            10000000               188 ns/op
BenchmarkChanWrite/goroutines-3520-8            10000000               190 ns/op
BenchmarkChanWrite/goroutines-3600-8            10000000               192 ns/op
BenchmarkChanWrite/goroutines-3600-8            10000000               196 ns/op
BenchmarkChanWrite/goroutines-3600-8            10000000               192 ns/op
BenchmarkChanWrite/goroutines-3600-8            10000000               194 ns/op
BenchmarkChanWrite/goroutines-3600-8            10000000               189 ns/op
BenchmarkChanWrite/goroutines-3680-8            10000000               196 ns/op
BenchmarkChanWrite/goroutines-3680-8            10000000               201 ns/op
BenchmarkChanWrite/goroutines-3680-8            10000000               196 ns/op
BenchmarkChanWrite/goroutines-3680-8            10000000               194 ns/op
BenchmarkChanWrite/goroutines-3680-8            10000000               192 ns/op
BenchmarkChanWrite/goroutines-3760-8            10000000               195 ns/op
BenchmarkChanWrite/goroutines-3760-8            10000000               203 ns/op
BenchmarkChanWrite/goroutines-3760-8            10000000               196 ns/op
BenchmarkChanWrite/goroutines-3760-8            10000000               200 ns/op
BenchmarkChanWrite/goroutines-3760-8            10000000               199 ns/op
BenchmarkChanWrite/goroutines-3840-8            10000000               206 ns/op
BenchmarkChanWrite/goroutines-3840-8            10000000               203 ns/op
BenchmarkChanWrite/goroutines-3840-8            10000000               202 ns/op
BenchmarkChanWrite/goroutines-3840-8            10000000               209 ns/op
BenchmarkChanWrite/goroutines-3840-8            10000000               210 ns/op
BenchmarkChanWrite/goroutines-3920-8            10000000               210 ns/op
BenchmarkChanWrite/goroutines-3920-8            10000000               206 ns/op
BenchmarkChanWrite/goroutines-3920-8            10000000               211 ns/op
BenchmarkChanWrite/goroutines-3920-8            10000000               211 ns/op
BenchmarkChanWrite/goroutines-3920-8            10000000               208 ns/op
BenchmarkChanWrite/goroutines-4000-8            10000000               205 ns/op
BenchmarkChanWrite/goroutines-4000-8            10000000               209 ns/op
BenchmarkChanWrite/goroutines-4000-8            10000000               205 ns/op
BenchmarkChanWrite/goroutines-4000-8            10000000               209 ns/op
BenchmarkChanWrite/goroutines-4000-8            10000000               202 ns/op
BenchmarkChanWrite/goroutines-4080-8            10000000               207 ns/op
BenchmarkChanWrite/goroutines-4080-8            10000000               209 ns/op
BenchmarkChanWrite/goroutines-4080-8            10000000               210 ns/op
BenchmarkChanWrite/goroutines-4080-8            10000000               214 ns/op
BenchmarkChanWrite/goroutines-4080-8            10000000               205 ns/op
BenchmarkChanWrite/goroutines-4160-8            10000000               211 ns/op
BenchmarkChanWrite/goroutines-4160-8            10000000               209 ns/op
BenchmarkChanWrite/goroutines-4160-8            10000000               209 ns/op
BenchmarkChanWrite/goroutines-4160-8            10000000               205 ns/op
BenchmarkChanWrite/goroutines-4160-8            10000000               208 ns/op
BenchmarkChanWrite/goroutines-4240-8            10000000               207 ns/op
BenchmarkChanWrite/goroutines-4240-8            10000000               208 ns/op
BenchmarkChanWrite/goroutines-4240-8            10000000               214 ns/op
BenchmarkChanWrite/goroutines-4240-8            10000000               206 ns/op
BenchmarkChanWrite/goroutines-4240-8            10000000               208 ns/op
BenchmarkChanWrite/goroutines-4320-8            10000000               216 ns/op
BenchmarkChanWrite/goroutines-4320-8            10000000               218 ns/op
BenchmarkChanWrite/goroutines-4320-8            10000000               212 ns/op
BenchmarkChanWrite/goroutines-4320-8            10000000               210 ns/op
BenchmarkChanWrite/goroutines-4320-8            10000000               208 ns/op
BenchmarkChanWrite/goroutines-4400-8            10000000               210 ns/op
BenchmarkChanWrite/goroutines-4400-8            10000000               213 ns/op
BenchmarkChanWrite/goroutines-4400-8            10000000               208 ns/op
BenchmarkChanWrite/goroutines-4400-8            10000000               206 ns/op
BenchmarkChanWrite/goroutines-4400-8            10000000               207 ns/op
BenchmarkChanWrite/goroutines-4480-8            10000000               214 ns/op
BenchmarkChanWrite/goroutines-4480-8            10000000               209 ns/op
BenchmarkChanWrite/goroutines-4480-8            10000000               210 ns/op
BenchmarkChanWrite/goroutines-4480-8            10000000               208 ns/op
BenchmarkChanWrite/goroutines-4480-8            10000000               214 ns/op
BenchmarkChanWrite/goroutines-4560-8            10000000               212 ns/op
BenchmarkChanWrite/goroutines-4560-8            10000000               211 ns/op
BenchmarkChanWrite/goroutines-4560-8            10000000               216 ns/op
BenchmarkChanWrite/goroutines-4560-8            10000000               209 ns/op
BenchmarkChanWrite/goroutines-4560-8            10000000               215 ns/op
BenchmarkChanWrite/goroutines-4640-8            10000000               212 ns/op
BenchmarkChanWrite/goroutines-4640-8            10000000               215 ns/op
BenchmarkChanWrite/goroutines-4640-8            10000000               216 ns/op
BenchmarkChanWrite/goroutines-4640-8            10000000               217 ns/op
BenchmarkChanWrite/goroutines-4640-8            10000000               215 ns/op
BenchmarkChanWrite/goroutines-4720-8            10000000               214 ns/op
BenchmarkChanWrite/goroutines-4720-8            10000000               220 ns/op
BenchmarkChanWrite/goroutines-4720-8            10000000               224 ns/op
BenchmarkChanWrite/goroutines-4720-8            10000000               215 ns/op
BenchmarkChanWrite/goroutines-4720-8            10000000               215 ns/op

Robert Engels

Aug 20, 2019, 12:10:32 PM
to changkun, golang-nuts
I am assuming that there is an internal Go structure/process that fits in the L1 CPU cache when there are fewer than N Go routines, and that beyond a certain point it spills to L2 or higher - thus the nearly order-of-magnitude performance decrease, yet consistent times within a range.

Since the worker code is so trivial, you are seeing this. Most worker code is not as trivial, so the overhead of the locking/scheduler constructs has far less effect (or the worker is causing L1 evictions anyway - so you never see the optimum performance possible from the scheduler).

changkun

Aug 21, 2019, 5:51:04 PM
to golang-nuts
"less than N Go routines it fits in the L1 CPU cache," I am guessing that you are thinking of local queues on each M, the scheduler's local queue size is strict to 256 goroutines. However, in our case, all blocking goroutines don't go the run queue but blocked and stored on semtable, which is a forest and each tree is an unlimited balanced tree. When a lock is released, only a single goroutine will be detached and put into the local queue (so scheduler only schedules runq with a single goroutine without content to globalq). 
How could an L1/L2 problem appear here? Do you think this is still some kind of "limited L1 cache to store large mount of goroutines" ?

What also interests me is a newly created issue; I am not sure whether this question is relevant to https://github.com/golang/go/issues/33747
That issue talks about small contention with a large number of Ps, but the full range of my benchmark is shown as follows:

performance_ channel v.s. sync.Mutex v.s. atomic.png



Robert Engels

Aug 21, 2019, 7:31:54 PM
to changkun, golang-nuts
I don't think you've posted code for the atomic version...

Each Go routine has its own stack. So when you cycle through many Go routines you will be destroying the cache as each touches N memory addresses (that are obviously not shared).

That's my guess anyway - the performance profile certainly looks like a cache issue to me. Once the cache is exhausted - the kernel based scheduler is more efficient - so it does suggest to me that there are some optimizations that can be done in the Go scheduler.

I will look at a few things this evening.




robert engels

Aug 22, 2019, 11:44:46 PM
to Robert Engels, changkun, golang-nuts
As I expected, the test is not testing what you think it is. Many of the Go routines created do not perform the same number of iterations. The benchmark harness is only designed to try and perform enough iterations to get a time per op while “running N routines”.

You need to rework the test so that all Go routines run the same number of iterations - and time the entire process - then you can see how the concurrency/scheduling latency affects things (or the time of the operation when fully contended). The time per op is then total time / (iterations * nRoutines).

Here is the code that shows the number of iterations per routine: https://play.golang.org/p/LkAvB39X3_Z
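
(The linked playground code is authoritative; the sketch below only illustrates the methodology described above - a fixed number of iterations per goroutine, wall-clock timing of the whole run - with names and numbers of my own choosing.)

package main

import (
	"fmt"
	"sync"
	"time"
)

// Every goroutine performs exactly `iters` locked increments; the whole run is
// timed and the per-op cost is total / (nRoutines * iters).
func perOp(nRoutines, iters int) time.Duration {
	var mu sync.Mutex
	var v int64

	var wg sync.WaitGroup
	wg.Add(nRoutines)
	start := time.Now()
	for i := 0; i < nRoutines; i++ {
		go func() {
			defer wg.Done()
			for j := 0; j < iters; j++ {
				mu.Lock()
				v++
				mu.Unlock()
			}
		}()
	}
	wg.Wait()
	total := time.Since(start)
	_ = v
	return total / time.Duration(nRoutines*iters)
}

func main() {
	for _, n := range []int{2400, 3600, 4800} {
		fmt.Printf("goroutines=%d  %v per op\n", n, perOp(n, 10000))
	}
}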

changkun

Aug 26, 2019, 12:27:14 PM
to golang-nuts
Sorry for the late response. Do you mean that the total number of executions was not the same? If so, that is not true: as you can see below, both benchmarks executed 50000000 iterations:

goos: linux
goarch: amd64
BenchmarkMutexWrite/goroutines-2400-8           50000000                46.5 ns/op

Type: cpu
Time: Aug 26, 2019 at 6:19pm (CEST)
Duration: 2.50s, Total samples = 5.47s (218.62%)
Entering interactive mode (type "help" for commands, "o" for options)
(pprof) top20
Showing nodes accounting for 5.14s, 93.97% of 5.47s total
Dropped 15 nodes (cum <= 0.03s)
Showing top 20 nodes out of 48
      flat  flat%   sum%        cum   cum%
     1.77s 32.36% 32.36%      1.77s 32.36%  runtime.procyield
     1.08s 19.74% 52.10%      2.20s 40.22%  sync.(*Mutex).Lock
     0.40s  7.31% 59.41%      1.34s 24.50%  sync.(*Mutex).Unlock
     0.32s  5.85% 65.27%      2.06s 37.66%  runtime.lock
     0.23s  4.20% 69.47%      1.53s 27.97%  runtime.findrunnable
     0.20s  3.66% 73.13%      0.25s  4.57%  runtime.unlock
     0.18s  3.29% 76.42%      0.19s  3.47%  runtime.pidleput
     0.13s  2.38% 78.79%      0.13s  2.38%  runtime.cansemacquire
     0.12s  2.19% 80.99%      0.12s  2.19%  runtime.futex
     0.10s  1.83% 82.82%      0.11s  2.01%  sync.runtime_nanotime
     0.09s  1.65% 84.46%      3.64s 66.54%  _/home/changkun/dev/tests_test.BenchmarkMutexWrite.func1.1
     0.09s  1.65% 86.11%      0.09s  1.65%  runtime.casgstatus
     0.08s  1.46% 87.57%      0.08s  1.46%  runtime.(*semaRoot).dequeue
     0.08s  1.46% 89.03%      0.94s 17.18%  runtime.semrelease1
     0.07s  1.28% 90.31%      0.07s  1.28%  runtime.gopark
     0.06s  1.10% 91.41%      0.97s 17.73%  runtime.semacquire1
     0.04s  0.73% 92.14%      0.04s  0.73%  runtime.runqempty
     0.04s  0.73% 92.87%      0.04s  0.73%  sync.runtime_canSpin
     0.03s  0.55% 93.42%      0.03s  0.55%  runtime.(*guintptr).cas
     0.03s  0.55% 93.97%      0.03s  0.55%  runtime.gogo


--------------------

goos: linux
goarch: amd64
BenchmarkMutexWrite/goroutines-4800-8           50000000               317 ns/op
PASS
ok      _/home/changkun/dev/tests       16.020s

Type: cpu
Time: Aug 26, 2019 at 6:18pm (CEST)
Duration: 16.01s, Total samples = 17.03s (106.35%)
Entering interactive mode (type "help" for commands, "o" for options)
(pprof) top20
Showing nodes accounting for 14640ms, 85.97% of 17030ms total
Dropped 19 nodes (cum <= 85.15ms)
Showing top 20 nodes out of 51
      flat  flat%   sum%        cum   cum%
    2130ms 12.51% 12.51%     2160ms 12.68%  runtime.gopark
    1940ms 11.39% 23.90%     7210ms 42.34%  sync.(*Mutex).Lock
    1640ms  9.63% 33.53%     1810ms 10.63%  sync.runtime_nanotime
    1490ms  8.75% 42.28%     4560ms 26.78%  runtime.semacquire1
    1130ms  6.64% 48.91%     1130ms  6.64%  runtime.casgstatus
     800ms  4.70% 53.61%     3740ms 21.96%  runtime.semrelease1
     590ms  3.46% 57.08%      590ms  3.46%  runtime.(*guintptr).cas
     560ms  3.29% 60.36%      560ms  3.29%  runtime.futex
     530ms  3.11% 63.48%      610ms  3.58%  runtime.lock
     520ms  3.05% 66.53%      520ms  3.05%  runtime.unlock
     440ms  2.58% 69.11%      440ms  2.58%  runtime.semroot
     430ms  2.52% 71.64%      430ms  2.52%  runtime.usleep
     430ms  2.52% 74.16%     4210ms 24.72%  sync.(*Mutex).Unlock
     370ms  2.17% 76.34%     1320ms  7.75%  runtime.ready
     370ms  2.17% 78.51%     4930ms 28.95%  sync.runtime_SemacquireMutex
     340ms  2.00% 80.50%      340ms  2.00%  runtime.(*semaRoot).dequeue
     270ms  1.59% 82.09%    11820ms 69.41%  _/home/changkun/dev/tests_test.BenchmarkMutexWrite.func1.1
     220ms  1.29% 83.38%      220ms  1.29%  runtime.cansemacquire
     220ms  1.29% 84.67%      290ms  1.70%  runtime.releaseSudog
     220ms  1.29% 85.97%     2490ms 14.62%  runtime.schedule


changkun

Aug 26, 2019, 12:37:00 PM
to golang-nuts
And it looks like `semacquire1` executed `gopark` too many times, which indicates that `cansemacquire` failed a lot when there is too much contention.

changkun

Aug 26, 2019, 12:49:32 PM
to golang-nuts
According to your formula let's sample three points:

2400 goroutines: 2.508s/(50000000*2400) = 2.09 × 10^-11 s
3600 goroutines: 12.219s/(50000000*3600) = 6.78833333 × 10^-11 s
4800 goroutines: 16.020s/(50000000*4800) = 6.67500 × 10^-11 s

One can observe that the 3600 and 4800 cases are roughly equal to each other, but both are about three times slower than the 2400 case.

goos: linux
goarch: amd64
BenchmarkMutexWrite/goroutines-2400-8           50000000                46.5 ns/op
PASS
ok      _/home/changkun/dev/tests       2.508s

goos: linux
goarch: amd64
BenchmarkMutexWrite/goroutines-3600-8           50000000               240 ns/op
PASS
ok      _/home/changkun/dev/tests       12.219s

Robert Engels

Aug 26, 2019, 2:08:27 PM
to changkun, golang-nuts
Which is what I would expect - once the number of routines exhausts the cache, it will take the next level (or never, since it's main memory) to see a massive increase in time. 4800 is 30% slower than 3600 - so it is increasing linearly with the number of Go routines.





changkun

Aug 26, 2019, 3:23:21 PM
to golang-nuts
Your cache theory is very reasonable, but this was clear from the initial post: "before and after the massive increase, performance degrades linearly".
Your hypothesis is plausible, but how can you prove it? By monitoring cache usage on the host machine?
Merely matching a concept is still not persuasive.

Robert Engels

Aug 26, 2019, 3:26:05 PM
to changkun, golang-nuts
You can run the process under 'perf' and monitor the cpu cache hit/miss ratio.




Robert Engels

Aug 26, 2019, 3:28:22 PM
to changkun, golang-nuts
Another method would be to change the operation under test to touch and read a large memory block - turning off the timer while this happens - so that only the concurrency ops are timed. I think you would then no longer see the 'super high performance' at the low routine counts. That is, do 'more work', constant per op, that destroys the cache - and back out the concurrency times.
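
A rough sketch of that idea, under assumptions of my own (a 64 KiB per-goroutine scratch buffer - larger than a typical 32 KiB L1d - and a 64-byte stride; per-op timer exclusion isn't practical across thousands of goroutines, so the extra work is simply kept constant rather than subtracted):

package bench // sketch only; lives in a _test.go file

import (
	"sync"
	"sync/atomic"
	"testing"
)

// Same shape as the earlier benchmark sketch, but each goroutine walks a
// private scratch buffer between locked increments so the caches are trashed
// regardless of how many goroutines exist. If the low-count numbers were only
// "super high" because everything fit in cache, they should now converge
// towards the high-count numbers.
func benchmarkMutexWriteCacheHostile(b *testing.B, procs int) {
	var mu sync.Mutex
	var v int64
	remaining := int64(b.N)

	var wg sync.WaitGroup
	wg.Add(procs)
	b.ResetTimer()
	for i := 0; i < procs; i++ {
		go func() {
			defer wg.Done()
			scratch := make([]byte, 64<<10) // 64 KiB private working set
			for atomic.AddInt64(&remaining, -1) >= 0 {
				for j := 0; j < len(scratch); j += 64 {
					scratch[j]++ // touch one byte per cache line
				}
				mu.Lock()
				v++
				mu.Unlock()
			}
		}()
	}
	wg.Wait()
	_ = v
}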

changkun

Aug 26, 2019, 4:15:59 PM
to golang-nuts
Did I do anything wrong? The cache hit ratio decreases linearly - is that an expected result? I thought the cache hit ratio would show a significant drop:

chart.png

Raw data:

#goroutines    cache-references    cache-misses    hit/(hit+miss)
2400           697103572           17641686        0.9753175194
3200           798160789           54169784        0.9364451004
3360           1387972473          148415678       0.9033996208
3600           1824541062          272166355       0.8701934506
4000           2053779401          393586501       0.8391795437
4800           1885622275          461872899       0.8032486268

Robert Engels

Aug 26, 2019, 4:41:42 PM
to changkun, golang-nuts
You might want to try 'perf mem' to report the access delays - it may be contention on the memory controller as well.

Thinking about it again, I wouldn't expect a large jump if things were fair - for example, if at 100 routines they all fit in the cache, then at 110 some are still in the cache but some operations are slower, and so on - so I would expect a jump, but not as large as the one you see.

Still, most Linux context switches are 3-4 us, and you are talking about 300 ns, so you're still doing pretty well; and at approximately 40 ns there are so many aspects in play that I'm not sure you or anyone has the time to figure it out - maybe the HFT guys are interested...

Like I said, on my OSX machine the times are very similar with both approaches, so it is OS dependent, and probably OS and hardware configuration dependent - so I think I've probably reached the end of being able to help.

And finally, it probably doesn't matter at all - if the Go routine is doing anything of value, 300 ns is probably an insignificant cost.





changkun

Aug 26, 2019, 4:56:21 PM
to golang-nuts
I also tested many times with `go tool pprof`, and it reproducibly reports the following difference:

Here is for 2400 goroutines:

2400.png


Here is for 4800 goroutines:

4800.png


The difference here is: with 4800 goroutines, `gopark` is called heavily, while with 2400 goroutines `runtime.procyield` is called heavily. Have you noticed this difference? Is it normal?
In attachment, you find the SVG graphs.
2400.svg
4800.svg

changkun

Aug 26, 2019, 5:08:31 PM
to golang-nuts
Based on the pprof graphs, I would rather believe that the massive performance drop happens because of the `semacquire1` implementation.
When the number of goroutines is small, most calls to `semacquire1` succeed on the `cansemacquire` fast path, or on a middle path where a lock is taken but `cansemacquire` then succeeds again.
The drop happens when goroutines fail on both the fast path and the middle path and therefore need to be parked, which involves runtime scheduling costs.
How would you refute this argument?
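
For reference, the fast path in question looks roughly like this (paraphrased from runtime/sema.go; the real code uses runtime/internal/atomic, so this standalone version is only a sketch - check the actual source for the authoritative version):

package sema

import "sync/atomic"

// Paraphrase of runtime.cansemacquire: try to grab a semaphore token without
// parking. semacquire1 calls this before and after queueing the goroutine on
// the semtable; only when it keeps returning false does the goroutine end up
// in gopark, which is where the scheduler cost shows up in the profiles above.
func cansemacquire(addr *uint32) bool {
	for {
		v := atomic.LoadUint32(addr)
		if v == 0 {
			return false // no token: caller will have to queue and park
		}
		if atomic.CompareAndSwapUint32(addr, v, v-1) {
			return true // got a token on the lock-free path
		}
	}
}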

Robert Engels

Aug 26, 2019, 6:34:25 PM
to changkun, golang-nuts
I said in my very first response to you that the mechanisms of the implementations are different, with the in-kernel futex of the channel implementation faster than the Go one. Much of this is probably because the thread is dedicated at that point. All that means is that up to a certain point the CAS works, but then, due to contention, that path no longer works.

So you get far better performance up to N routines and slightly worse performance after. That seems like a decent design decision for a lot of workloads.

Still, you keep ignoring this aspect - in the context of actual workloads the difference is negligible.



Michael Jones

Aug 26, 2019, 7:52:25 PM
to Robert Engels, changkun, golang-nuts
This is a perennial issue. 

We all want to know and understand; it is the sign of true intellect to feel a need to own the details. It is also true that measuring “out of context” is literally to measure something different and often that difference is bigger than the measurement. It can be very difficult to map results from the measured situation back to the real world. 

This has grown to be especially true over the last decade. As the size of caches increase, the levels increase, CPU speeds outpace RAM, with NUMA architectures at many levels, with on chip connection inconsistencies, multiple thread instruction dispatch, wide execution units, and hardware hazard stalls — all of these are in play and any one can make micro-benchmarks moot. 

“Back in the old days”, so to speak, a cycle was a cycle and timing was simple. We’re 100x to 1000x faster now, but not by making the old model 1000x faster, but by doing 100 things at once, each 10x faster. That’s the problem with micro-benchmarks, they do one thing, back to back. 

This is obvious in CPU benchmarks. Integer divide is very slow compared to everything else. It is also rare in the instruction stream compared to other things. In real code, it is rarely the bottleneck; in back to back divide micro-benchmarks, it always is. Both are true, but only one case is meaningful to understand the design choices or estimate performance. 



--
Michael T. Jones
michae...@gmail.com

changkun

Aug 27, 2019, 1:11:25 AM
to golang-nuts
Hi Robert, you misunderstand my point. Your first response was about the difference between the chan and mutex implementations; here I am comparing the mutex against itself with different numbers of goroutines.
Basically, what you suspected doesn't match what was observed in the statistics.

Robert Engels

Aug 27, 2019, 7:36:45 AM
to changkun, golang-nuts
OK, maybe it wasn't the cache issue. Then try this: below a certain number of goroutines, given the workload, the spin-and-CAS path works; beyond a certain point it is forced into the semaphore mode, which is much slower - it is also slower than the futex used with channels.

I'll repeat that it is OS and hardware dependent, so beyond this you'll need to do your own more detailed investigation. I referred you to some tools that may help. Good luck.

Changkun Ou

Aug 27, 2019, 8:52:35 AM
to Robert Engels, golang-nuts
Many thanks for the perf tool - it is pretty awesome.