sync.Mutex encounters a large performance drop when goroutine contention exceeds roughly 3400


changkun

Aug 19, 2019, 1:50:21 PM
to golang-nuts
I am comparing the performance of sync.Mutex and Go channels. Here is my benchmark: https://play.golang.org/p/zLjVtsSx9gd
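
Roughly, the benchmark has the following shape (an illustrative sketch, not the exact playground code; the helper names and the way b.N is split across goroutines are my own assumptions). The chan variant replaces the mutex with a channel of capacity 1 used as a lock (ch <- struct{}{}; v++; <-ch).

package bench // sketch only; lives in a _test.go file

import (
	"fmt"
	"sync"
	"sync/atomic"
	"testing"
)

// b.N locked increments are shared among `procs` goroutines via an atomic countdown.
func benchmarkMutexWrite(b *testing.B, procs int) {
	var mu sync.Mutex
	var v int64
	remaining := int64(b.N)

	var wg sync.WaitGroup
	wg.Add(procs)
	b.ResetTimer()
	for i := 0; i < procs; i++ {
		go func() {
			defer wg.Done()
			for atomic.AddInt64(&remaining, -1) >= 0 {
				mu.Lock()
				v++
				mu.Unlock()
			}
		}()
	}
	wg.Wait()
	_ = v // keep the counter alive
}

func BenchmarkMutexWrite(b *testing.B) {
	for procs := 2400; procs <= 4800; procs += 80 {
		b.Run(fmt.Sprintf("goroutines-%d", procs), func(b *testing.B) {
			benchmarkMutexWrite(b, procs)
		})
	}
}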

The performance comparison visualization is as follows:

sync.Mutex performance (1).png

What are the reasons that:

1. sync.Mutex encounters a large performance drop when the number of goroutines exceeds roughly 3400?
2. Go channels are fairly stable but slower than sync.Mutex before that point?



Raw bench data by benchstat (go test -bench=. -count=5):

MutexWrite/goroutines-2400-8  48.6ns ± 1%
MutexWrite/goroutines-2480-8  49.1ns ± 0%
MutexWrite/goroutines-2560-8  49.7ns ± 1%
MutexWrite/goroutines-2640-8  50.5ns ± 3%
MutexWrite/goroutines-2720-8  50.9ns ± 2%
MutexWrite/goroutines-2800-8  51.8ns ± 3%
MutexWrite/goroutines-2880-8  52.5ns ± 2%
MutexWrite/goroutines-2960-8  54.1ns ± 4%
MutexWrite/goroutines-3040-8  54.5ns ± 2%
MutexWrite/goroutines-3120-8  56.1ns ± 3%
MutexWrite/goroutines-3200-8  63.2ns ± 5%
MutexWrite/goroutines-3280-8  77.5ns ± 6%
MutexWrite/goroutines-3360-8   141ns ± 6%
MutexWrite/goroutines-3440-8   239ns ± 8%
MutexWrite/goroutines-3520-8   248ns ± 3%
MutexWrite/goroutines-3600-8   254ns ± 2%
MutexWrite/goroutines-3680-8   256ns ± 1%
MutexWrite/goroutines-3760-8   261ns ± 2%
MutexWrite/goroutines-3840-8   266ns ± 3%
MutexWrite/goroutines-3920-8   276ns ± 3%
MutexWrite/goroutines-4000-8   278ns ± 3%
MutexWrite/goroutines-4080-8   286ns ± 5%
MutexWrite/goroutines-4160-8   293ns ± 4%
MutexWrite/goroutines-4240-8   295ns ± 2%
MutexWrite/goroutines-4320-8   280ns ± 8%
MutexWrite/goroutines-4400-8   294ns ± 9%
MutexWrite/goroutines-4480-8   285ns ±10%
MutexWrite/goroutines-4560-8   290ns ± 8%
MutexWrite/goroutines-4640-8   271ns ± 3%
MutexWrite/goroutines-4720-8   271ns ± 4%

ChanWrite/goroutines-2400-8  158ns ± 3%
ChanWrite/goroutines-2480-8  159ns ± 2%
ChanWrite/goroutines-2560-8  161ns ± 2%
ChanWrite/goroutines-2640-8  161ns ± 1%
ChanWrite/goroutines-2720-8  163ns ± 1%
ChanWrite/goroutines-2800-8  166ns ± 3%
ChanWrite/goroutines-2880-8  168ns ± 1%
ChanWrite/goroutines-2960-8  176ns ± 4%
ChanWrite/goroutines-3040-8  176ns ± 2%
ChanWrite/goroutines-3120-8  180ns ± 1%
ChanWrite/goroutines-3200-8  180ns ± 1%
ChanWrite/goroutines-3280-8  181ns ± 2%
ChanWrite/goroutines-3360-8  183ns ± 2%
ChanWrite/goroutines-3440-8  188ns ± 3%
ChanWrite/goroutines-3520-8  190ns ± 2%
ChanWrite/goroutines-3600-8  193ns ± 2%
ChanWrite/goroutines-3680-8  196ns ± 3%
ChanWrite/goroutines-3760-8  199ns ± 2%
ChanWrite/goroutines-3840-8  206ns ± 2%
ChanWrite/goroutines-3920-8  209ns ± 2%
ChanWrite/goroutines-4000-8  206ns ± 2%
ChanWrite/goroutines-4080-8  209ns ± 2%
ChanWrite/goroutines-4160-8  208ns ± 2%
ChanWrite/goroutines-4240-8  209ns ± 3%
ChanWrite/goroutines-4320-8  213ns ± 2%
ChanWrite/goroutines-4400-8  209ns ± 2%
ChanWrite/goroutines-4480-8  211ns ± 1%
ChanWrite/goroutines-4560-8  213ns ± 2%
ChanWrite/goroutines-4640-8  215ns ± 1%
ChanWrite/goroutines-4720-8  218ns ± 3%


Ian Lance Taylor

Aug 19, 2019, 3:15:50 PM
to changkun, golang-nuts
On Mon, Aug 19, 2019 at 10:50 AM changkun <euryu...@gmail.com> wrote:
>
> I am comparing the performance regarding sync.Mutex and Go channels. Here is my benchmark: https://play.golang.org/p/zLjVtsSx9gd

Might be interesting to try running your benchmark on a machine with
different hardware.

Ian

changkun

Aug 19, 2019, 4:18:18 PM
to golang-nuts
Hi Ian,

Thanks for the hint. I tried on a Mac mini with an i5-8500B, and the unexpected performance drop still exists (GOMAXPROCS is kept at 8, so the CPU is the controlled variable):

sync.Mutex performance (GOMAXPROCS == 8).png

Robert Engels

Aug 19, 2019, 4:27:19 PM
to changkun, golang-nuts
I think you'll find the reason is that the Mutex uses the Go scheduler. The chan is controlled by a 'mutex' which eventually defers to the OS futex - and the OS futex is probably more efficient at scheduling in the face of large contention - although you would think it should be the other way around.

I am guessing that if you logged the number of operations done per Go routine, you would see that the Mutex version is very fair, and the chan/futex version is unfair - meaning many are starved.
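
A minimal sketch of what such per-routine accounting could look like (illustrative only; `numG`, `total`, and the counts slice are assumptions, not something from the playground code). A fair lock gives a narrow spread of counts; a starving one gives a very wide spread. Swapping the mutex for a 1-buffered channel gives the corresponding chan comparison.

package main

import (
	"fmt"
	"sort"
	"sync"
	"sync/atomic"
)

// Each goroutine owns exactly one slot in counts, so the bookkeeping itself
// needs no extra synchronization; wg.Wait() makes the final reads safe.
func fairness(numG int, total int64) []int64 {
	var mu sync.Mutex
	var v int64
	remaining := total
	counts := make([]int64, numG)

	var wg sync.WaitGroup
	wg.Add(numG)
	for i := 0; i < numG; i++ {
		go func(id int) {
			defer wg.Done()
			for atomic.AddInt64(&remaining, -1) >= 0 {
				mu.Lock()
				v++
				mu.Unlock()
				counts[id]++ // only goroutine id writes this slot
			}
		}(i)
	}
	wg.Wait()
	_ = v
	return counts
}

func main() {
	counts := fairness(3600, 5000000)
	sort.Slice(counts, func(i, j int) bool { return counts[i] < counts[j] })
	fmt.Println("min ops per goroutine:", counts[0], "max:", counts[len(counts)-1])
}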




changkun

Aug 20, 2019, 4:33:31 AM
to golang-nuts
Hi Robert,

Thanks for your explanation. But how could I log the number of operations done per goroutine? Which particular debug settings are you referring to?
It is reasonable that sync.Mutex relies on the runtime scheduler while channels do not. However, it is still unclear why such a significant performance drop appears. Is it possible to determine when the drop will occur?

Best,
Changkun

Ian Davis

Aug 20, 2019, 4:55:13 AM
to golan...@googlegroups.com


This comment by Dmitry Vyukov on a very old issue might help (I have no idea if it is still valid after 6 years though)

If you are interested why chan does not have the same issue, runtime handles chan's in a
special way (active/passive spinning + thread parking instead of goroutine parking),
because it knows that the critical section is bounded and small. For sync.Mutex it does
not have any knowledge as to critical section size.



changkun

Aug 20, 2019, 5:27:40 AM
to golang-nuts
Hi Ian Davis, I read that issue before I posted this thread.
I think the old issue is quite different from this one. Here the problem is a sudden performance drop and an unexpected regression,
whereas issue #5183 only experimented with a very limited number of goroutines, and Dmitry's answer is fair for that case.

Best, Changkun

robert engels

Aug 20, 2019, 9:27:46 AM
to changkun, golang-nuts
If you just print the final value of V in both cases you will see: On my iMac:

channels: 215606060
mutex: 651606060

That is, the mutex has 3x greater throughput. How the Go routines are scheduled is OS dependent - the OSX code uses the same mechanism in both cases.

You have to remember that the benchmark harness times the cost of an operation when it runs, but it doesn't always run the same number of invocations - that is based on the scheduling of the Go routines and the desired length of the test time.

goos: darwin
goarch: amd64
BenchmarkChanWrite/goroutines-4800-8         3000000       638 ns/op
BenchmarkChanWrite/goroutines-4880-8         2000000       636 ns/op
BenchmarkChanWrite/goroutines-4960-8         3000000       602 ns/op
BenchmarkChanWrite/goroutines-5040-8         2000000       632 ns/op
BenchmarkChanWrite/goroutines-5120-8         2000000       591 ns/op
BenchmarkChanWrite/goroutines-5200-8         3000000       634 ns/op
BenchmarkChanWrite/goroutines-5280-8         3000000       632 ns/op
BenchmarkChanWrite/goroutines-5360-8         3000000       581 ns/op
BenchmarkChanWrite/goroutines-5440-8         3000000       584 ns/op
BenchmarkChanWrite/goroutines-5520-8         3000000       582 ns/op
BenchmarkChanWrite/goroutines-5600-8         3000000       601 ns/op
BenchmarkChanWrite/goroutines-5680-8         3000000       609 ns/op
BenchmarkChanWrite/goroutines-5760-8         3000000       597 ns/op
BenchmarkChanWrite/goroutines-5840-8         3000000       674 ns/op
BenchmarkChanWrite/goroutines-5920-8         3000000       597 ns/op
BenchmarkChanWrite/goroutines-6000-8         3000000       601 ns/op
BenchmarkChanWrite/goroutines-6080-8         2000000       616 ns/op
BenchmarkChanWrite/goroutines-6160-8         3000000       603 ns/op
BenchmarkChanWrite/goroutines-6240-8         2000000       624 ns/op
BenchmarkChanWrite/goroutines-6320-8         3000000       603 ns/op
BenchmarkChanWrite/goroutines-6400-8         3000000       602 ns/op
BenchmarkChanWrite/goroutines-6480-8         2000000       616 ns/op
BenchmarkChanWrite/goroutines-6560-8         2000000       611 ns/op
BenchmarkChanWrite/goroutines-6640-8         3000000       603 ns/op
BenchmarkChanWrite/goroutines-6720-8         2000000       621 ns/op
BenchmarkChanWrite/goroutines-6800-8         3000000       636 ns/op
BenchmarkChanWrite/goroutines-6880-8         2000000       655 ns/op
BenchmarkChanWrite/goroutines-6960-8         3000000       605 ns/op
BenchmarkChanWrite/goroutines-7040-8         3000000       606 ns/op
BenchmarkChanWrite/goroutines-7120-8         2000000       639 ns/op
BenchmarkChanWrite/goroutines-7200-8         3000000       608 ns/op
BenchmarkChanWrite/goroutines-7280-8         3000000       617 ns/op
BenchmarkChanWrite/goroutines-7360-8         3000000       594 ns/op
BenchmarkChanWrite/goroutines-7440-8         3000000       596 ns/op
BenchmarkChanWrite/goroutines-7520-8         2000000       599 ns/op
BenchmarkChanWrite/goroutines-7600-8         2000000       610 ns/op
BenchmarkChanWrite/goroutines-7680-8         3000000       597 ns/op
BenchmarkChanWrite/goroutines-7760-8         2000000       608 ns/op
BenchmarkChanWrite/goroutines-7840-8         3000000       592 ns/op
BenchmarkChanWrite/goroutines-7920-8         3000000       611 ns/op
BenchmarkChanWrite/goroutines-8000-8         2000000       618 ns/op
BenchmarkChanWrite/goroutines-8080-8         3000000       612 ns/op
BenchmarkChanWrite/goroutines-8160-8         2000000       609 ns/op
BenchmarkChanWrite/goroutines-8240-8         3000000       594 ns/op
BenchmarkChanWrite/goroutines-8320-8         3000000       629 ns/op
BenchmarkChanWrite/goroutines-8400-8         2000000       624 ns/op
BenchmarkChanWrite/goroutines-8480-8         2000000       606 ns/op
BenchmarkChanWrite/goroutines-8560-8         2000000       635 ns/op
BenchmarkChanWrite/goroutines-8640-8         2000000       603 ns/op
BenchmarkChanWrite/goroutines-8720-8         3000000       625 ns/op
BenchmarkChanWrite/goroutines-8800-8         2000000       632 ns/op
BenchmarkChanWrite/goroutines-8880-8         2000000       602 ns/op
BenchmarkChanWrite/goroutines-8960-8         3000000       632 ns/op
BenchmarkChanWrite/goroutines-9040-8         2000000       602 ns/op
BenchmarkChanWrite/goroutines-9120-8         3000000       625 ns/op
BenchmarkChanWrite/goroutines-9200-8         3000000       657 ns/op
BenchmarkChanWrite/goroutines-9280-8         3000000       681 ns/op
BenchmarkChanWrite/goroutines-9360-8         2000000       630 ns/op
BenchmarkChanWrite/goroutines-9440-8         2000000       623 ns/op
BenchmarkChanWrite/goroutines-9520-8         2000000       613 ns/op
215606060
BenchmarkMutexWrite/goroutines-4800-8         10000000       634 ns/op
BenchmarkMutexWrite/goroutines-4880-8         20000000       635 ns/op
BenchmarkMutexWrite/goroutines-4960-8         20000000       645 ns/op
BenchmarkMutexWrite/goroutines-5040-8         20000000       665 ns/op
BenchmarkMutexWrite/goroutines-5120-8         20000000       627 ns/op
BenchmarkMutexWrite/goroutines-5200-8         20000000       721 ns/op
BenchmarkMutexWrite/goroutines-5280-8         20000000       697 ns/op
BenchmarkMutexWrite/goroutines-5360-8         20000000       697 ns/op
BenchmarkMutexWrite/goroutines-5440-8         5000000       628 ns/op
BenchmarkMutexWrite/goroutines-5520-8         10000000       647 ns/op
BenchmarkMutexWrite/goroutines-5600-8         3000000       528 ns/op
BenchmarkMutexWrite/goroutines-5680-8         10000000       656 ns/op
BenchmarkMutexWrite/goroutines-5760-8         10000000       649 ns/op
BenchmarkMutexWrite/goroutines-5840-8         20000000       721 ns/op
BenchmarkMutexWrite/goroutines-5920-8         10000000       666 ns/op
BenchmarkMutexWrite/goroutines-6000-8         10000000       655 ns/op
BenchmarkMutexWrite/goroutines-6080-8         20000000       719 ns/op
BenchmarkMutexWrite/goroutines-6160-8         10000000       730 ns/op
BenchmarkMutexWrite/goroutines-6240-8         5000000       524 ns/op
BenchmarkMutexWrite/goroutines-6320-8         5000000       645 ns/op
BenchmarkMutexWrite/goroutines-6400-8         5000000       753 ns/op
BenchmarkMutexWrite/goroutines-6480-8         10000000       755 ns/op
BenchmarkMutexWrite/goroutines-6560-8         10000000       697 ns/op
BenchmarkMutexWrite/goroutines-6640-8         10000000       682 ns/op
BenchmarkMutexWrite/goroutines-6720-8         10000000       698 ns/op
BenchmarkMutexWrite/goroutines-6800-8         10000000       661 ns/op
BenchmarkMutexWrite/goroutines-6880-8         5000000       609 ns/op
BenchmarkMutexWrite/goroutines-6960-8         10000000       689 ns/op
BenchmarkMutexWrite/goroutines-7040-8         5000000       656 ns/op
BenchmarkMutexWrite/goroutines-7120-8         10000000       682 ns/op
BenchmarkMutexWrite/goroutines-7200-8         5000000       578 ns/op
BenchmarkMutexWrite/goroutines-7280-8         3000000       607 ns/op
BenchmarkMutexWrite/goroutines-7360-8         5000000       544 ns/op
BenchmarkMutexWrite/goroutines-7440-8         5000000       589 ns/op
BenchmarkMutexWrite/goroutines-7520-8         20000000       692 ns/op
BenchmarkMutexWrite/goroutines-7600-8         5000000       673 ns/op
BenchmarkMutexWrite/goroutines-7680-8         5000000       618 ns/op
BenchmarkMutexWrite/goroutines-7760-8         5000000       667 ns/op
BenchmarkMutexWrite/goroutines-7840-8         10000000       730 ns/op
BenchmarkMutexWrite/goroutines-7920-8         5000000       628 ns/op
BenchmarkMutexWrite/goroutines-8000-8         10000000       758 ns/op
BenchmarkMutexWrite/goroutines-8080-8         10000000       634 ns/op
BenchmarkMutexWrite/goroutines-8160-8         5000000       640 ns/op
BenchmarkMutexWrite/goroutines-8240-8         5000000       622 ns/op
BenchmarkMutexWrite/goroutines-8320-8         10000000       680 ns/op
BenchmarkMutexWrite/goroutines-8400-8         10000000       687 ns/op
BenchmarkMutexWrite/goroutines-8480-8         5000000       575 ns/op
BenchmarkMutexWrite/goroutines-8560-8         10000000       693 ns/op
BenchmarkMutexWrite/goroutines-8640-8         10000000       657 ns/op
BenchmarkMutexWrite/goroutines-8720-8         10000000       716 ns/op
BenchmarkMutexWrite/goroutines-8800-8         5000000       513 ns/op
BenchmarkMutexWrite/goroutines-8880-8         5000000       543 ns/op
BenchmarkMutexWrite/goroutines-8960-8         5000000       811 ns/op
BenchmarkMutexWrite/goroutines-9040-8         10000000       741 ns/op
BenchmarkMutexWrite/goroutines-9120-8         5000000       683 ns/op
BenchmarkMutexWrite/goroutines-9200-8         5000000       666 ns/op
BenchmarkMutexWrite/goroutines-9280-8         30000000       746 ns/op
BenchmarkMutexWrite/goroutines-9360-8         5000000       638 ns/op
BenchmarkMutexWrite/goroutines-9440-8         5000000       658 ns/op
BenchmarkMutexWrite/goroutines-9520-8         10000000       688 ns/op
651606060
PASS


Changkun Ou

Aug 20, 2019, 10:00:33 AM
to robert engels, golang-nuts
Very interesting observation: higher throughput comes with lower per-op performance. However, I am still interested in the particular reason for the sudden performance drop, since throughput alone doesn't explain why the scheduler makes the runtime semaphore slower than the futex.
- Is it because of the randomization in the work-stealing strategy? The comparison on two machines doesn't support this argument (the sudden drop appears on two different pieces of hardware).
- Is it because of contention on the global goroutine queue lock? The reasoning is quite unclear for such a deterministic observation.
- and so on…

---

P.S. My logs:

goos: linux
goarch: amd64
BenchmarkMutexWrite/goroutines-2400-8           50000000                48.7 ns/op
BenchmarkMutexWrite/goroutines-2400-8           50000000                48.2 ns/op
BenchmarkMutexWrite/goroutines-2400-8           50000000                48.9 ns/op
BenchmarkMutexWrite/goroutines-2400-8           50000000                48.8 ns/op
BenchmarkMutexWrite/goroutines-2400-8           50000000                48.4 ns/op
BenchmarkMutexWrite/goroutines-2480-8           50000000                49.1 ns/op
BenchmarkMutexWrite/goroutines-2480-8           50000000                48.9 ns/op
BenchmarkMutexWrite/goroutines-2480-8           50000000                49.3 ns/op
BenchmarkMutexWrite/goroutines-2480-8           50000000                49.2 ns/op
BenchmarkMutexWrite/goroutines-2480-8           50000000                48.9 ns/op
BenchmarkMutexWrite/goroutines-2560-8           50000000                49.8 ns/op
BenchmarkMutexWrite/goroutines-2560-8           50000000                50.0 ns/op
BenchmarkMutexWrite/goroutines-2560-8           50000000                49.9 ns/op
BenchmarkMutexWrite/goroutines-2560-8           50000000                49.6 ns/op
BenchmarkMutexWrite/goroutines-2560-8           50000000                49.4 ns/op
BenchmarkMutexWrite/goroutines-2640-8           50000000                51.0 ns/op
BenchmarkMutexWrite/goroutines-2640-8           50000000                50.1 ns/op
BenchmarkMutexWrite/goroutines-2640-8           50000000                51.4 ns/op
BenchmarkMutexWrite/goroutines-2640-8           50000000                49.2 ns/op
BenchmarkMutexWrite/goroutines-2640-8           50000000                50.8 ns/op
BenchmarkMutexWrite/goroutines-2720-8           50000000                50.9 ns/op
BenchmarkMutexWrite/goroutines-2720-8           50000000                50.9 ns/op
BenchmarkMutexWrite/goroutines-2720-8           50000000                50.5 ns/op
BenchmarkMutexWrite/goroutines-2720-8           50000000                51.8 ns/op
BenchmarkMutexWrite/goroutines-2720-8           50000000                50.4 ns/op
BenchmarkMutexWrite/goroutines-2800-8           50000000                51.2 ns/op
BenchmarkMutexWrite/goroutines-2800-8           50000000                51.8 ns/op
BenchmarkMutexWrite/goroutines-2800-8           50000000                53.1 ns/op
BenchmarkMutexWrite/goroutines-2800-8           50000000                51.9 ns/op
BenchmarkMutexWrite/goroutines-2800-8           50000000                51.0 ns/op
BenchmarkMutexWrite/goroutines-2880-8           50000000                52.5 ns/op
BenchmarkMutexWrite/goroutines-2880-8           50000000                53.7 ns/op
BenchmarkMutexWrite/goroutines-2880-8           50000000                52.0 ns/op
BenchmarkMutexWrite/goroutines-2880-8           50000000                52.1 ns/op
BenchmarkMutexWrite/goroutines-2880-8           50000000                52.2 ns/op
BenchmarkMutexWrite/goroutines-2960-8           50000000                54.3 ns/op
BenchmarkMutexWrite/goroutines-2960-8           50000000                52.3 ns/op
BenchmarkMutexWrite/goroutines-2960-8           50000000                54.2 ns/op
BenchmarkMutexWrite/goroutines-2960-8           50000000                56.1 ns/op
BenchmarkMutexWrite/goroutines-2960-8           50000000                53.7 ns/op
BenchmarkMutexWrite/goroutines-3040-8           50000000                55.5 ns/op
BenchmarkMutexWrite/goroutines-3040-8           50000000                53.8 ns/op
BenchmarkMutexWrite/goroutines-3040-8           50000000                55.6 ns/op
BenchmarkMutexWrite/goroutines-3040-8           50000000                53.5 ns/op
BenchmarkMutexWrite/goroutines-3040-8           50000000                53.9 ns/op
BenchmarkMutexWrite/goroutines-3120-8           50000000                57.1 ns/op
BenchmarkMutexWrite/goroutines-3120-8           50000000                55.3 ns/op
BenchmarkMutexWrite/goroutines-3120-8           50000000                55.1 ns/op
BenchmarkMutexWrite/goroutines-3120-8           50000000                55.6 ns/op
BenchmarkMutexWrite/goroutines-3120-8           50000000                57.6 ns/op
BenchmarkMutexWrite/goroutines-3200-8           50000000                60.0 ns/op
BenchmarkMutexWrite/goroutines-3200-8           50000000                60.7 ns/op
BenchmarkMutexWrite/goroutines-3200-8           50000000                65.6 ns/op
BenchmarkMutexWrite/goroutines-3200-8           50000000                63.6 ns/op
BenchmarkMutexWrite/goroutines-3200-8           50000000                66.2 ns/op
BenchmarkMutexWrite/goroutines-3280-8           50000000                79.3 ns/op
BenchmarkMutexWrite/goroutines-3280-8           50000000                75.1 ns/op
BenchmarkMutexWrite/goroutines-3280-8           50000000                76.5 ns/op
BenchmarkMutexWrite/goroutines-3280-8           50000000                82.4 ns/op
BenchmarkMutexWrite/goroutines-3280-8           50000000                74.2 ns/op
BenchmarkMutexWrite/goroutines-3360-8           50000000               133 ns/op
BenchmarkMutexWrite/goroutines-3360-8           50000000               140 ns/op
BenchmarkMutexWrite/goroutines-3360-8           50000000               145 ns/op
BenchmarkMutexWrite/goroutines-3360-8           50000000               137 ns/op
BenchmarkMutexWrite/goroutines-3360-8           50000000               150 ns/op
BenchmarkMutexWrite/goroutines-3440-8           50000000               230 ns/op
BenchmarkMutexWrite/goroutines-3440-8           50000000               231 ns/op
BenchmarkMutexWrite/goroutines-3440-8           50000000               237 ns/op
BenchmarkMutexWrite/goroutines-3440-8           50000000               259 ns/op
BenchmarkMutexWrite/goroutines-3440-8           50000000               237 ns/op
BenchmarkMutexWrite/goroutines-3520-8           50000000               241 ns/op
BenchmarkMutexWrite/goroutines-3520-8           50000000               246 ns/op
BenchmarkMutexWrite/goroutines-3520-8           50000000               251 ns/op
BenchmarkMutexWrite/goroutines-3520-8           50000000               246 ns/op
BenchmarkMutexWrite/goroutines-3520-8           50000000               254 ns/op
BenchmarkMutexWrite/goroutines-3600-8           50000000               257 ns/op
BenchmarkMutexWrite/goroutines-3600-8           50000000               254 ns/op
BenchmarkMutexWrite/goroutines-3600-8           50000000               249 ns/op
BenchmarkMutexWrite/goroutines-3600-8           50000000               254 ns/op
BenchmarkMutexWrite/goroutines-3600-8           50000000               257 ns/op
BenchmarkMutexWrite/goroutines-3680-8           50000000               259 ns/op
BenchmarkMutexWrite/goroutines-3680-8           50000000               258 ns/op
BenchmarkMutexWrite/goroutines-3680-8           50000000               252 ns/op
BenchmarkMutexWrite/goroutines-3680-8           50000000               256 ns/op
BenchmarkMutexWrite/goroutines-3680-8           50000000               253 ns/op
BenchmarkMutexWrite/goroutines-3760-8           50000000               260 ns/op
BenchmarkMutexWrite/goroutines-3760-8           50000000               266 ns/op
BenchmarkMutexWrite/goroutines-3760-8           50000000               261 ns/op
BenchmarkMutexWrite/goroutines-3760-8           50000000               260 ns/op
BenchmarkMutexWrite/goroutines-3760-8           50000000               259 ns/op
BenchmarkMutexWrite/goroutines-3840-8           50000000               267 ns/op
BenchmarkMutexWrite/goroutines-3840-8           50000000               270 ns/op
BenchmarkMutexWrite/goroutines-3840-8           50000000               258 ns/op
BenchmarkMutexWrite/goroutines-3840-8           50000000               267 ns/op
BenchmarkMutexWrite/goroutines-3840-8           50000000               267 ns/op
BenchmarkMutexWrite/goroutines-3920-8           50000000               273 ns/op
BenchmarkMutexWrite/goroutines-3920-8           50000000               269 ns/op
BenchmarkMutexWrite/goroutines-3920-8           50000000               275 ns/op
BenchmarkMutexWrite/goroutines-3920-8           50000000               283 ns/op
BenchmarkMutexWrite/goroutines-3920-8           50000000               278 ns/op
BenchmarkMutexWrite/goroutines-4000-8           50000000               272 ns/op
BenchmarkMutexWrite/goroutines-4000-8           50000000               273 ns/op
BenchmarkMutexWrite/goroutines-4000-8           50000000               280 ns/op
BenchmarkMutexWrite/goroutines-4000-8           50000000               287 ns/op
BenchmarkMutexWrite/goroutines-4000-8           50000000               280 ns/op
BenchmarkMutexWrite/goroutines-4080-8           50000000               271 ns/op
BenchmarkMutexWrite/goroutines-4080-8           50000000               293 ns/op
BenchmarkMutexWrite/goroutines-4080-8           50000000               291 ns/op
BenchmarkMutexWrite/goroutines-4080-8           50000000               288 ns/op
BenchmarkMutexWrite/goroutines-4080-8           50000000               289 ns/op
BenchmarkMutexWrite/goroutines-4160-8           50000000               281 ns/op
BenchmarkMutexWrite/goroutines-4160-8           50000000               297 ns/op
BenchmarkMutexWrite/goroutines-4160-8           50000000               293 ns/op
BenchmarkMutexWrite/goroutines-4160-8           50000000               300 ns/op
BenchmarkMutexWrite/goroutines-4160-8           50000000               293 ns/op
BenchmarkMutexWrite/goroutines-4240-8           50000000               296 ns/op
BenchmarkMutexWrite/goroutines-4240-8           50000000               299 ns/op
BenchmarkMutexWrite/goroutines-4240-8           50000000               288 ns/op
BenchmarkMutexWrite/goroutines-4240-8           50000000               294 ns/op
BenchmarkMutexWrite/goroutines-4240-8           50000000               296 ns/op
BenchmarkMutexWrite/goroutines-4320-8           50000000               259 ns/op
BenchmarkMutexWrite/goroutines-4320-8           50000000               294 ns/op
BenchmarkMutexWrite/goroutines-4320-8           50000000               257 ns/op
BenchmarkMutexWrite/goroutines-4320-8           50000000               298 ns/op
BenchmarkMutexWrite/goroutines-4320-8           50000000               293 ns/op
BenchmarkMutexWrite/goroutines-4400-8           50000000               303 ns/op
BenchmarkMutexWrite/goroutines-4400-8           50000000               299 ns/op
BenchmarkMutexWrite/goroutines-4400-8           50000000               269 ns/op
BenchmarkMutexWrite/goroutines-4400-8           50000000               302 ns/op
BenchmarkMutexWrite/goroutines-4400-8           50000000               299 ns/op
BenchmarkMutexWrite/goroutines-4480-8           50000000               293 ns/op
BenchmarkMutexWrite/goroutines-4480-8           50000000               299 ns/op
BenchmarkMutexWrite/goroutines-4480-8           50000000               297 ns/op
BenchmarkMutexWrite/goroutines-4480-8           50000000               256 ns/op
BenchmarkMutexWrite/goroutines-4480-8           50000000               280 ns/op
BenchmarkMutexWrite/goroutines-4560-8           50000000               294 ns/op
BenchmarkMutexWrite/goroutines-4560-8           50000000               280 ns/op
BenchmarkMutexWrite/goroutines-4560-8           50000000               308 ns/op
BenchmarkMutexWrite/goroutines-4560-8           50000000               267 ns/op
BenchmarkMutexWrite/goroutines-4560-8           50000000               301 ns/op
BenchmarkMutexWrite/goroutines-4640-8           50000000               279 ns/op
BenchmarkMutexWrite/goroutines-4640-8           50000000               276 ns/op
BenchmarkMutexWrite/goroutines-4640-8           50000000               266 ns/op
BenchmarkMutexWrite/goroutines-4640-8           50000000               267 ns/op
BenchmarkMutexWrite/goroutines-4640-8           50000000               265 ns/op
BenchmarkMutexWrite/goroutines-4720-8           50000000               271 ns/op
BenchmarkMutexWrite/goroutines-4720-8           50000000               281 ns/op
BenchmarkMutexWrite/goroutines-4720-8           50000000               269 ns/op
BenchmarkMutexWrite/goroutines-4720-8           50000000               260 ns/op
BenchmarkMutexWrite/goroutines-4720-8           50000000               273 ns/op


BenchmarkChanWrite/goroutines-2400-8            10000000               154 ns/op
BenchmarkChanWrite/goroutines-2400-8            10000000               161 ns/op
BenchmarkChanWrite/goroutines-2400-8            10000000               161 ns/op
BenchmarkChanWrite/goroutines-2400-8            10000000               156 ns/op
BenchmarkChanWrite/goroutines-2400-8            10000000               158 ns/op
BenchmarkChanWrite/goroutines-2480-8            10000000               160 ns/op
BenchmarkChanWrite/goroutines-2480-8            10000000               157 ns/op
BenchmarkChanWrite/goroutines-2480-8            10000000               159 ns/op
BenchmarkChanWrite/goroutines-2480-8            10000000               162 ns/op
BenchmarkChanWrite/goroutines-2480-8            10000000               157 ns/op
BenchmarkChanWrite/goroutines-2560-8            10000000               159 ns/op
BenchmarkChanWrite/goroutines-2560-8            10000000               160 ns/op
BenchmarkChanWrite/goroutines-2560-8            10000000               160 ns/op
BenchmarkChanWrite/goroutines-2560-8            10000000               162 ns/op
BenchmarkChanWrite/goroutines-2560-8            10000000               165 ns/op
BenchmarkChanWrite/goroutines-2640-8            10000000               162 ns/op
BenchmarkChanWrite/goroutines-2640-8            10000000               159 ns/op
BenchmarkChanWrite/goroutines-2640-8            10000000               161 ns/op
BenchmarkChanWrite/goroutines-2640-8            10000000               162 ns/op
BenchmarkChanWrite/goroutines-2640-8            10000000               163 ns/op
BenchmarkChanWrite/goroutines-2720-8            10000000               164 ns/op
BenchmarkChanWrite/goroutines-2720-8            10000000               164 ns/op
BenchmarkChanWrite/goroutines-2720-8            10000000               162 ns/op
BenchmarkChanWrite/goroutines-2720-8            10000000               161 ns/op
BenchmarkChanWrite/goroutines-2720-8            10000000               162 ns/op
BenchmarkChanWrite/goroutines-2800-8            10000000               170 ns/op
BenchmarkChanWrite/goroutines-2800-8            10000000               165 ns/op
BenchmarkChanWrite/goroutines-2800-8            10000000               163 ns/op
BenchmarkChanWrite/goroutines-2800-8            10000000               165 ns/op
BenchmarkChanWrite/goroutines-2800-8            10000000               165 ns/op
BenchmarkChanWrite/goroutines-2880-8            10000000               170 ns/op
BenchmarkChanWrite/goroutines-2880-8            10000000               167 ns/op
BenchmarkChanWrite/goroutines-2880-8            10000000               167 ns/op
BenchmarkChanWrite/goroutines-2880-8            10000000               168 ns/op
BenchmarkChanWrite/goroutines-2880-8            10000000               169 ns/op
BenchmarkChanWrite/goroutines-2960-8            10000000               181 ns/op
BenchmarkChanWrite/goroutines-2960-8            10000000               182 ns/op
BenchmarkChanWrite/goroutines-2960-8            10000000               172 ns/op
BenchmarkChanWrite/goroutines-2960-8            10000000               170 ns/op
BenchmarkChanWrite/goroutines-2960-8            10000000               174 ns/op
BenchmarkChanWrite/goroutines-3040-8            10000000               175 ns/op
BenchmarkChanWrite/goroutines-3040-8            10000000               179 ns/op
BenchmarkChanWrite/goroutines-3040-8            10000000               174 ns/op
BenchmarkChanWrite/goroutines-3040-8            10000000               175 ns/op
BenchmarkChanWrite/goroutines-3040-8            10000000               176 ns/op
BenchmarkChanWrite/goroutines-3120-8            10000000               179 ns/op
BenchmarkChanWrite/goroutines-3120-8            10000000               182 ns/op
BenchmarkChanWrite/goroutines-3120-8            10000000               182 ns/op
BenchmarkChanWrite/goroutines-3120-8            10000000               179 ns/op
BenchmarkChanWrite/goroutines-3120-8            10000000               180 ns/op
BenchmarkChanWrite/goroutines-3200-8            10000000               178 ns/op
BenchmarkChanWrite/goroutines-3200-8            10000000               182 ns/op
BenchmarkChanWrite/goroutines-3200-8            10000000               179 ns/op
BenchmarkChanWrite/goroutines-3200-8            10000000               180 ns/op
BenchmarkChanWrite/goroutines-3200-8            10000000               180 ns/op
BenchmarkChanWrite/goroutines-3280-8            10000000               179 ns/op
BenchmarkChanWrite/goroutines-3280-8            10000000               185 ns/op
BenchmarkChanWrite/goroutines-3280-8            10000000               180 ns/op
BenchmarkChanWrite/goroutines-3280-8            10000000               180 ns/op
BenchmarkChanWrite/goroutines-3280-8            10000000               179 ns/op
BenchmarkChanWrite/goroutines-3360-8            10000000               183 ns/op
BenchmarkChanWrite/goroutines-3360-8            10000000               187 ns/op
BenchmarkChanWrite/goroutines-3360-8            10000000               180 ns/op
BenchmarkChanWrite/goroutines-3360-8            10000000               182 ns/op
BenchmarkChanWrite/goroutines-3360-8            10000000               184 ns/op
BenchmarkChanWrite/goroutines-3440-8            10000000               192 ns/op
BenchmarkChanWrite/goroutines-3440-8            10000000               192 ns/op
BenchmarkChanWrite/goroutines-3440-8            10000000               186 ns/op
BenchmarkChanWrite/goroutines-3440-8            10000000               183 ns/op
BenchmarkChanWrite/goroutines-3440-8            10000000               188 ns/op
BenchmarkChanWrite/goroutines-3520-8            10000000               190 ns/op
BenchmarkChanWrite/goroutines-3520-8            10000000               193 ns/op
BenchmarkChanWrite/goroutines-3520-8            10000000               188 ns/op
BenchmarkChanWrite/goroutines-3520-8            10000000               188 ns/op
BenchmarkChanWrite/goroutines-3520-8            10000000               190 ns/op
BenchmarkChanWrite/goroutines-3600-8            10000000               192 ns/op
BenchmarkChanWrite/goroutines-3600-8            10000000               196 ns/op
BenchmarkChanWrite/goroutines-3600-8            10000000               192 ns/op
BenchmarkChanWrite/goroutines-3600-8            10000000               194 ns/op
BenchmarkChanWrite/goroutines-3600-8            10000000               189 ns/op
BenchmarkChanWrite/goroutines-3680-8            10000000               196 ns/op
BenchmarkChanWrite/goroutines-3680-8            10000000               201 ns/op
BenchmarkChanWrite/goroutines-3680-8            10000000               196 ns/op
BenchmarkChanWrite/goroutines-3680-8            10000000               194 ns/op
BenchmarkChanWrite/goroutines-3680-8            10000000               192 ns/op
BenchmarkChanWrite/goroutines-3760-8            10000000               195 ns/op
BenchmarkChanWrite/goroutines-3760-8            10000000               203 ns/op
BenchmarkChanWrite/goroutines-3760-8            10000000               196 ns/op
BenchmarkChanWrite/goroutines-3760-8            10000000               200 ns/op
BenchmarkChanWrite/goroutines-3760-8            10000000               199 ns/op
BenchmarkChanWrite/goroutines-3840-8            10000000               206 ns/op
BenchmarkChanWrite/goroutines-3840-8            10000000               203 ns/op
BenchmarkChanWrite/goroutines-3840-8            10000000               202 ns/op
BenchmarkChanWrite/goroutines-3840-8            10000000               209 ns/op
BenchmarkChanWrite/goroutines-3840-8            10000000               210 ns/op
BenchmarkChanWrite/goroutines-3920-8            10000000               210 ns/op
BenchmarkChanWrite/goroutines-3920-8            10000000               206 ns/op
BenchmarkChanWrite/goroutines-3920-8            10000000               211 ns/op
BenchmarkChanWrite/goroutines-3920-8            10000000               211 ns/op
BenchmarkChanWrite/goroutines-3920-8            10000000               208 ns/op
BenchmarkChanWrite/goroutines-4000-8            10000000               205 ns/op
BenchmarkChanWrite/goroutines-4000-8            10000000               209 ns/op
BenchmarkChanWrite/goroutines-4000-8            10000000               205 ns/op
BenchmarkChanWrite/goroutines-4000-8            10000000               209 ns/op
BenchmarkChanWrite/goroutines-4000-8            10000000               202 ns/op
BenchmarkChanWrite/goroutines-4080-8            10000000               207 ns/op
BenchmarkChanWrite/goroutines-4080-8            10000000               209 ns/op
BenchmarkChanWrite/goroutines-4080-8            10000000               210 ns/op
BenchmarkChanWrite/goroutines-4080-8            10000000               214 ns/op
BenchmarkChanWrite/goroutines-4080-8            10000000               205 ns/op
BenchmarkChanWrite/goroutines-4160-8            10000000               211 ns/op
BenchmarkChanWrite/goroutines-4160-8            10000000               209 ns/op
BenchmarkChanWrite/goroutines-4160-8            10000000               209 ns/op
BenchmarkChanWrite/goroutines-4160-8            10000000               205 ns/op
BenchmarkChanWrite/goroutines-4160-8            10000000               208 ns/op
BenchmarkChanWrite/goroutines-4240-8            10000000               207 ns/op
BenchmarkChanWrite/goroutines-4240-8            10000000               208 ns/op
BenchmarkChanWrite/goroutines-4240-8            10000000               214 ns/op
BenchmarkChanWrite/goroutines-4240-8            10000000               206 ns/op
BenchmarkChanWrite/goroutines-4240-8            10000000               208 ns/op
BenchmarkChanWrite/goroutines-4320-8            10000000               216 ns/op
BenchmarkChanWrite/goroutines-4320-8            10000000               218 ns/op
BenchmarkChanWrite/goroutines-4320-8            10000000               212 ns/op
BenchmarkChanWrite/goroutines-4320-8            10000000               210 ns/op
BenchmarkChanWrite/goroutines-4320-8            10000000               208 ns/op
BenchmarkChanWrite/goroutines-4400-8            10000000               210 ns/op
BenchmarkChanWrite/goroutines-4400-8            10000000               213 ns/op
BenchmarkChanWrite/goroutines-4400-8            10000000               208 ns/op
BenchmarkChanWrite/goroutines-4400-8            10000000               206 ns/op
BenchmarkChanWrite/goroutines-4400-8            10000000               207 ns/op
BenchmarkChanWrite/goroutines-4480-8            10000000               214 ns/op
BenchmarkChanWrite/goroutines-4480-8            10000000               209 ns/op
BenchmarkChanWrite/goroutines-4480-8            10000000               210 ns/op
BenchmarkChanWrite/goroutines-4480-8            10000000               208 ns/op
BenchmarkChanWrite/goroutines-4480-8            10000000               214 ns/op
BenchmarkChanWrite/goroutines-4560-8            10000000               212 ns/op
BenchmarkChanWrite/goroutines-4560-8            10000000               211 ns/op
BenchmarkChanWrite/goroutines-4560-8            10000000               216 ns/op
BenchmarkChanWrite/goroutines-4560-8            10000000               209 ns/op
BenchmarkChanWrite/goroutines-4560-8            10000000               215 ns/op
BenchmarkChanWrite/goroutines-4640-8            10000000               212 ns/op
BenchmarkChanWrite/goroutines-4640-8            10000000               215 ns/op
BenchmarkChanWrite/goroutines-4640-8            10000000               216 ns/op
BenchmarkChanWrite/goroutines-4640-8            10000000               217 ns/op
BenchmarkChanWrite/goroutines-4640-8            10000000               215 ns/op
BenchmarkChanWrite/goroutines-4720-8            10000000               214 ns/op
BenchmarkChanWrite/goroutines-4720-8            10000000               220 ns/op
BenchmarkChanWrite/goroutines-4720-8            10000000               224 ns/op
BenchmarkChanWrite/goroutines-4720-8            10000000               215 ns/op
BenchmarkChanWrite/goroutines-4720-8            10000000               215 ns/op

Robert Engels

Aug 20, 2019, 12:10:32 PM
to changkun, golang-nuts
I am assuming that there is an internal Go structure/process that fits in the L1 CPU cache when there are fewer than N Go routines, and that beyond a certain point it spills to L2 or higher - thus the nearly order-of-magnitude performance decrease, yet consistent times within a range.

Since the worker code is so trivial, you are seeing this. Most worker code is not as trivial, so the overhead of the locking/scheduler constructs has far less effect (or the worker is causing L1 evictions anyway - so you never see the optimum performance possible from the scheduler).

changkun

Aug 21, 2019, 5:51:04 PM
to golang-nuts
"less than N Go routines it fits in the L1 CPU cache," I am guessing that you are thinking of local queues on each M, the scheduler's local queue size is strict to 256 goroutines. However, in our case, all blocking goroutines don't go the run queue but blocked and stored on semtable, which is a forest and each tree is an unlimited balanced tree. When a lock is released, only a single goroutine will be detached and put into the local queue (so scheduler only schedules runq with a single goroutine without content to globalq). 
How could an L1/L2 problem appear here? Do you think this is still some kind of "limited L1 cache to store large mount of goroutines" ?

What also interests me is a newly created issue; I am not sure whether this question is relevant to https://github.com/golang/go/issues/33747
That issue talks about small contention with a large number of Ps, but the full range of my benchmark is shown as follows:

performance_ channel v.s. sync.Mutex v.s. atomic.png



Robert Engels

Aug 21, 2019, 7:31:54 PM
to changkun, golang-nuts
I don't think you've posted code for the atomic version...

Each Go routine has its own stack. So when you cycle through many Go routines you will be destroying the cache as each touches N memory addresses (that are obviously not shared).

That's my guess anyway - the performance profile certainly looks like a cache issue to me. Once the cache is exhausted - the kernel based scheduler is more efficient - so it does suggest to me that there are some optimizations that can be done in the Go scheduler.

I will look at a few things this evening.




robert engels

Aug 22, 2019, 11:44:46 PM
to Robert Engels, changkun, golang-nuts
As I expected, the test is not testing what you think it is. Many of the Go routines created do not perform the same number of iterations. The benchmark harness is only designed to try and perform enough iterations to get a time per op while “running N routines”.

You need to rework the test so that all Go routines run the same number of iterations - and time the entire process - then you can see how the concurrency/scheduling latency affects things (or the time of the operation when fully contended). The time per op is then total time / (iterations * nRoutines).

Here is the code that shows the number of iterations per routine: https://play.golang.org/p/LkAvB39X3_Z
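
(The linked playground code is authoritative; the sketch below only illustrates the methodology described above - a fixed number of iterations per goroutine, wall-clock timing of the whole run - with names and numbers of my own choosing.)

package main

import (
	"fmt"
	"sync"
	"time"
)

// Every goroutine performs exactly `iters` locked increments; the whole run is
// timed and the per-op cost is total / (nRoutines * iters).
func perOp(nRoutines, iters int) time.Duration {
	var mu sync.Mutex
	var v int64

	var wg sync.WaitGroup
	wg.Add(nRoutines)
	start := time.Now()
	for i := 0; i < nRoutines; i++ {
		go func() {
			defer wg.Done()
			for j := 0; j < iters; j++ {
				mu.Lock()
				v++
				mu.Unlock()
			}
		}()
	}
	wg.Wait()
	total := time.Since(start)
	_ = v
	return total / time.Duration(nRoutines*iters)
}

func main() {
	for _, n := range []int{2400, 3600, 4800} {
		fmt.Printf("goroutines=%d  %v per op\n", n, perOp(n, 10000))
	}
}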

changkun

Aug 26, 2019, 12:27:14 PM
to golang-nuts
Sorry for the late response. Do you mean that the total number of executions was not the same? If so, that is not true: as you can see below, both benchmarks executed 50000000 iterations:

goos: linux
goarch: amd64
BenchmarkMutexWrite/goroutines-2400-8           50000000                46.5 ns/op

Type: cpu
Time: Aug 26, 2019 at 6:19pm (CEST)
Duration: 2.50s, Total samples = 5.47s (218.62%)
Entering interactive mode (type "help" for commands, "o" for options)
(pprof) top20
Showing nodes accounting for 5.14s, 93.97% of 5.47s total
Dropped 15 nodes (cum <= 0.03s)
Showing top 20 nodes out of 48
      flat  flat%   sum%        cum   cum%
     1.77s 32.36% 32.36%      1.77s 32.36%  runtime.procyield
     1.08s 19.74% 52.10%      2.20s 40.22%  sync.(*Mutex).Lock
     0.40s  7.31% 59.41%      1.34s 24.50%  sync.(*Mutex).Unlock
     0.32s  5.85% 65.27%      2.06s 37.66%  runtime.lock
     0.23s  4.20% 69.47%      1.53s 27.97%  runtime.findrunnable
     0.20s  3.66% 73.13%      0.25s  4.57%  runtime.unlock
     0.18s  3.29% 76.42%      0.19s  3.47%  runtime.pidleput
     0.13s  2.38% 78.79%      0.13s  2.38%  runtime.cansemacquire
     0.12s  2.19% 80.99%      0.12s  2.19%  runtime.futex
     0.10s  1.83% 82.82%      0.11s  2.01%  sync.runtime_nanotime
     0.09s  1.65% 84.46%      3.64s 66.54%  _/home/changkun/dev/tests_test.BenchmarkMutexWrite.func1.1
     0.09s  1.65% 86.11%      0.09s  1.65%  runtime.casgstatus
     0.08s  1.46% 87.57%      0.08s  1.46%  runtime.(*semaRoot).dequeue
     0.08s  1.46% 89.03%      0.94s 17.18%  runtime.semrelease1
     0.07s  1.28% 90.31%      0.07s  1.28%  runtime.gopark
     0.06s  1.10% 91.41%      0.97s 17.73%  runtime.semacquire1
     0.04s  0.73% 92.14%      0.04s  0.73%  runtime.runqempty
     0.04s  0.73% 92.87%      0.04s  0.73%  sync.runtime_canSpin
     0.03s  0.55% 93.42%      0.03s  0.55%  runtime.(*guintptr).cas
     0.03s  0.55% 93.97%      0.03s  0.55%  runtime.gogo


--------------------

goos: linux
goarch: amd64
BenchmarkMutexWrite/goroutines-4800-8           50000000               317 ns/op
PASS
ok      _/home/changkun/dev/tests       16.020s

Type: cpu
Time: Aug 26, 2019 at 6:18pm (CEST)
Duration: 16.01s, Total samples = 17.03s (106.35%)
Entering interactive mode (type "help" for commands, "o" for options)
(pprof) top20
Showing nodes accounting for 14640ms, 85.97% of 17030ms total
Dropped 19 nodes (cum <= 85.15ms)
Showing top 20 nodes out of 51
      flat  flat%   sum%        cum   cum%
    2130ms 12.51% 12.51%     2160ms 12.68%  runtime.gopark
    1940ms 11.39% 23.90%     7210ms 42.34%  sync.(*Mutex).Lock
    1640ms  9.63% 33.53%     1810ms 10.63%  sync.runtime_nanotime
    1490ms  8.75% 42.28%     4560ms 26.78%  runtime.semacquire1
    1130ms  6.64% 48.91%     1130ms  6.64%  runtime.casgstatus
     800ms  4.70% 53.61%     3740ms 21.96%  runtime.semrelease1
     590ms  3.46% 57.08%      590ms  3.46%  runtime.(*guintptr).cas
     560ms  3.29% 60.36%      560ms  3.29%  runtime.futex
     530ms  3.11% 63.48%      610ms  3.58%  runtime.lock
     520ms  3.05% 66.53%      520ms  3.05%  runtime.unlock
     440ms  2.58% 69.11%      440ms  2.58%  runtime.semroot
     430ms  2.52% 71.64%      430ms  2.52%  runtime.usleep
     430ms  2.52% 74.16%     4210ms 24.72%  sync.(*Mutex).Unlock
     370ms  2.17% 76.34%     1320ms  7.75%  runtime.ready
     370ms  2.17% 78.51%     4930ms 28.95%  sync.runtime_SemacquireMutex
     340ms  2.00% 80.50%      340ms  2.00%  runtime.(*semaRoot).dequeue
     270ms  1.59% 82.09%    11820ms 69.41%  _/home/changkun/dev/tests_test.BenchmarkMutexWrite.func1.1
     220ms  1.29% 83.38%      220ms  1.29%  runtime.cansemacquire
     220ms  1.29% 84.67%      290ms  1.70%  runtime.releaseSudog
     220ms  1.29% 85.97%     2490ms 14.62%  runtime.schedule


changkun

Aug 26, 2019, 12:37:00 PM
to golang-nuts
And it looks like `semacquire1` executed `gopark` too many times, which indicates that `cansemacquire` failed a lot when there is too much contention.

changkun

Aug 26, 2019, 12:49:32 PM
to golang-nuts
According to your formula let's sample three points:

2400 goroutines: 2.508s/(50000000*2400) = 2.09 × 10^-11 s
3600 goroutines: 12.219s/(50000000*3600) = 6.78833333 × 10^-11 s
4800 goroutines: 16.020s/(50000000*4800) = 6.67500 × 10^-11 s

One can observe that the 3600 and 4800 cases are roughly equal to each other, but both are about three times slower than the 2400 case.

goos: linux
goarch: amd64
BenchmarkMutexWrite/goroutines-2400-8           50000000                46.5 ns/op
PASS
ok      _/home/changkun/dev/tests       2.508s

goos: linux
goarch: amd64
BenchmarkMutexWrite/goroutines-3600-8           50000000               240 ns/op
PASS
ok      _/home/changkun/dev/tests       12.219s

Robert Engels

Aug 26, 2019, 2:08:27 PM
to changkun, golang-nuts
Which is what I would expect - once the number of routines exhausts the cache, it will take the next level (or never, since it's main memory) to see a massive increase in time. 4800 is 30% slower than 3600 - so it is increasing linearly with the number of Go routines.





changkun

Aug 26, 2019, 3:23:21 PM
to golang-nuts
Your cache theory is very reasonable, but this was clear from the initial post: "before and after the massive increase, performance degrades linearly".
Your hypothesis is plausible, but how can you prove it? By monitoring cache usage on the host machine?
Merely matching a concept is still not persuasive.

Robert Engels

Aug 26, 2019, 3:26:05 PM
to changkun, golang-nuts
You can run the process under 'perf' and monitor the cpu cache hit/miss ratio.




Robert Engels

Aug 26, 2019, 3:28:22 PM
to changkun, golang-nuts
Another method would be to change the operation under test to touch and read a large memory block - turning off the timer while this happens - so that only the concurrency ops are timed. I think you would then no longer see the 'super high performance' at the low routine counts. That is, do 'more work', constant per op, that destroys the cache - and back out the concurrency times.
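
A rough sketch of that idea, under assumptions of my own (a 64 KiB per-goroutine scratch buffer - larger than a typical 32 KiB L1d - and a 64-byte stride; per-op timer exclusion isn't practical across thousands of goroutines, so the extra work is simply kept constant rather than subtracted):

package bench // sketch only; lives in a _test.go file

import (
	"sync"
	"sync/atomic"
	"testing"
)

// Same shape as the earlier benchmark sketch, but each goroutine walks a
// private scratch buffer between locked increments so the caches are trashed
// regardless of how many goroutines exist. If the low-count numbers were only
// "super high" because everything fit in cache, they should now converge
// towards the high-count numbers.
func benchmarkMutexWriteCacheHostile(b *testing.B, procs int) {
	var mu sync.Mutex
	var v int64
	remaining := int64(b.N)

	var wg sync.WaitGroup
	wg.Add(procs)
	b.ResetTimer()
	for i := 0; i < procs; i++ {
		go func() {
			defer wg.Done()
			scratch := make([]byte, 64<<10) // 64 KiB private working set
			for atomic.AddInt64(&remaining, -1) >= 0 {
				for j := 0; j < len(scratch); j += 64 {
					scratch[j]++ // touch one byte per cache line
				}
				mu.Lock()
				v++
				mu.Unlock()
			}
		}()
	}
	wg.Wait()
	_ = v
}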

changkun

Aug 26, 2019, 4:15:59 PM
to golang-nuts
Did I do anything wrong? The cache hit ratio decreases linearly - is that an expected result? I thought the cache hit ratio would show a significant drop:

chart.png

Raw data:

#goroutines    cache-references    cache-misses    hit/(hit+miss)
2400           697103572           17641686        0.9753175194
3200           798160789           54169784        0.9364451004
3360           1387972473          148415678       0.9033996208
3600           1824541062          272166355       0.8701934506
4000           2053779401          393586501       0.8391795437
4800           1885622275          461872899       0.8032486268

Robert Engels

Aug 26, 2019, 4:41:42 PM
to changkun, golang-nuts
You might want to try 'perf mem' to report the access delays - it may be contention on the memory controller as well.

Thinking about it again, I wouldn't expect a large jump if things were fair - for example, if at 100 routines they all fit in the cache, then at 110 some are still in the cache but some operations are slower, and so on - so I would expect a jump, but not as large as the one you see.

Still, most Linux context switches are 3-4 us, and you are talking about 300 ns, so you're still doing pretty well; and at approximately 40 ns there are so many aspects in play that I'm not sure you or anyone has the time to figure it out - maybe the HFT guys are interested...

Like I said, on my OSX machine the times are very similar with both approaches, so it is OS dependent, and probably OS and hardware configuration dependent - so I think I've probably reached the end of being able to help.

And finally, it probably doesn't matter at all - if the Go routine is doing anything of value, 300 ns is probably an insignificant cost.





changkun

Aug 26, 2019, 4:56:21 PM
to golang-nuts
I also tested many times with `go tool pprof`, and it reproducibly reports the following difference:

Here is for 2400 goroutines:

2400.png


Here is for 4800 goroutines:

4800.png


The difference here is: with 4800 goroutines, `gopark` is called heavily, while with 2400 goroutines `runtime.procyield` is called heavily. Have you noticed this difference? Is it normal?
In attachment, you find the SVG graphs.
2400.svg
4800.svg

changkun

Aug 26, 2019, 5:08:31 PM
to golang-nuts
Based on the pprof graphs, I would rather believe that the massive performance drop happens because of the `semacquire1` implementation.
When the number of goroutines is small, most calls to `semacquire1` succeed on the `cansemacquire` fast path, or on a middle path where a lock is taken but `cansemacquire` then succeeds again.
The drop happens when goroutines fail on both the fast path and the middle path and therefore need to be parked, which involves runtime scheduling costs.
How would you refute this argument?
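
For reference, the fast path in question looks roughly like this (paraphrased from runtime/sema.go; the real code uses runtime/internal/atomic, so this standalone version is only a sketch - check the actual source for the authoritative version):

package sema

import "sync/atomic"

// Paraphrase of runtime.cansemacquire: try to grab a semaphore token without
// parking. semacquire1 calls this before and after queueing the goroutine on
// the semtable; only when it keeps returning false does the goroutine end up
// in gopark, which is where the scheduler cost shows up in the profiles above.
func cansemacquire(addr *uint32) bool {
	for {
		v := atomic.LoadUint32(addr)
		if v == 0 {
			return false // no token: caller will have to queue and park
		}
		if atomic.CompareAndSwapUint32(addr, v, v-1) {
			return true // got a token on the lock-free path
		}
	}
}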

Robert Engels

Aug 26, 2019, 6:34:25 PM
to changkun, golang-nuts
I said in my very first response to you that the mechanisms of the implementations are different, with the in-kernel futex of the channel implementation faster than the Go one. Much of this is probably because the thread is dedicated at that point. All that means is that up to a certain point the CAS works, but then, due to contention, that path no longer works.

So you get far better performance up to N routines and slightly worse performance after. That seems like a decent design decision for a lot of workloads.

Still, you keep ignoring this aspect - in the context of actual workloads the difference is negligible.



Michael Jones

Aug 26, 2019, 7:52:25 PM
to Robert Engels, changkun, golang-nuts
This is a perennial issue. 

We all want to know and understand; it is the sign of true intellect to feel a need to own the details. It is also true that measuring “out of context” is literally to measure something different and often that difference is bigger than the measurement. It can be very difficult to map results from the measured situation back to the real world. 

This has grown to be especially true over the last decade. As the size of caches increase, the levels increase, CPU speeds outpace RAM, with NUMA architectures at many levels, with on chip connection inconsistencies, multiple thread instruction dispatch, wide execution units, and hardware hazard stalls — all of these are in play and any one can make micro-benchmarks moot. 

“Back in the old days”, so to speak, a cycle was a cycle and timing was simple. We’re 100x to 1000x faster now, but not by making the old model 1000x faster, but by doing 100 things at once, each 10x faster. That’s the problem with micro-benchmarks, they do one thing, back to back. 

This is obvious in CPU benchmarks. Integer divide is very slow compared to everything else. It is also rare in the instruction stream compared to other things. In real code, it is rarely the bottleneck; in back to back divide micro-benchmarks, it always is. Both are true, but only one case is meaningful to understand the design choices or estimate performance. 



--
Michael T. Jones
michae...@gmail.com

changkun

Aug 27, 2019, 1:11:25 AM
to golang-nuts
Hi Robert, you misunderstand my point. Your first response was about the difference between the chan and mutex implementations; here I am comparing the mutex against itself with different numbers of goroutines.
Basically, what you suspected doesn't match what was observed in the statistics.

Robert Engels

Aug 27, 2019, 7:36:45 AM
to changkun, golang-nuts
OK, maybe it wasn't the cache issue. Then try this: below a certain number of goroutines, given the workload, the spin-and-CAS path works; beyond a certain point it is forced into the semaphore mode, which is much slower - it is also slower than the futex used with channels.

I'll repeat that it is OS and hardware dependent, so beyond this you'll need to do your own more detailed investigation. I referred you to some tools that may help. Good luck.

Changkun Ou

Aug 27, 2019, 8:52:35 AM
to Robert Engels, golang-nuts
Many thanks for the perf tool - it is pretty awesome.