Why does a "concurrent" Go GC phase appear to be stop-the-world?

897 views
Skip to first unread message

Will Sewell

unread,
Oct 19, 2016, 8:32:18 AM10/19/16
to golang-nuts
Hey, I previously posted this on StackOverflow, but I was told this mailing list would be a better forum for discussion.

I am attempting to benchmark the maximum STW GC pause time for different numbers of heap objects. To do this I have written a simple benchmark that pushes and pops messages from a map:

package main

type message
[]byte

type channel map
[int]message

const (
    windowSize
= 200000
    msgCount
= 1000000
)

func mkMessage
(n int) message {
    m
:= make(message, 1024)
    for i := range m {
        m
[i] = byte(n)
   
}
   
return m
}


func pushMsg
(c *channel, highID int) {
    lowID
:= highID - windowSize
    m
:= mkMessage(highID)
   
(*c)[highID] = m
   
if lowID >= 0 {
       
delete(*c, lowID)
   
}
}


func main
() {
    c
:= make(channel)
   
for i := 0; i < msgCount; i++ {
        pushMsg
(&c, i)
   
}
}

I ran this with GODEBUG=gctrace=1, and on my machine the output is:

gc 1 @0.004s 2%: 0.007+0.44+0.032 ms clock, 0.029+0.22/0.20/0.28+0.12 ms cpu, 4->4->3 MB, 5 MB goal, 4 P
gc
2 @0.009s 3%: 0.007+0.64+0.042 ms clock, 0.030+0/0.53/0.18+0.17 ms cpu, 7->7->7 MB, 8 MB goal, 4 P
gc
3 @0.019s 1%: 0.007+0.99+0.037 ms clock, 0.031+0/0.13/1.0+0.14 ms cpu, 13->13->13 MB, 14 MB goal, 4 P
gc
4 @0.044s 2%: 0.009+2.3+0.032 ms clock, 0.039+0/2.3/0.30+0.13 ms cpu, 25->25->25 MB, 26 MB goal, 4 P
gc
5 @0.081s 1%: 0.009+9.2+0.082 ms clock, 0.039+0/0.32/9.7+0.32 ms cpu, 49->49->48 MB, 50 MB goal, 4 P
gc
6 @0.162s 0%: 0.020+10+0.078 ms clock, 0.082+0/0.28/11+0.31 ms cpu, 93->93->91 MB, 96 MB goal, 4 P
gc
7 @0.289s 0%: 0.020+27+0.092 ms clock, 0.080+0/0.95/28+0.37 ms cpu, 178->178->173 MB, 182 MB goal, 4 P
gc
8 @0.557s 1%: 0.023+38+0.086 ms clock, 0.092+0/38/10+0.34 ms cpu, 337->339->209 MB, 346 MB goal, 4 P
gc
9 @0.844s 1%: 0.008+40+0.077 ms clock, 0.032+0/5.6/46+0.30 ms cpu, 407->409->211 MB, 418 MB goal, 4 P
gc
10 @1.100s 1%: 0.009+43+0.047 ms clock, 0.036+0/6.6/50+0.19 ms cpu, 411->414->212 MB, 422 MB goal, 4 P
gc
11 @1.378s 1%: 0.008+45+0.093 ms clock, 0.033+0/6.5/52+0.37 ms cpu, 414->417->213 MB, 425 MB goal, 4 P

My version of Go is:

$ go version
go version go1
.7.1 darwin/amd64

From the above results, the longest wall clock STW pause time is 0.093ms. Great!

However as a sanity check I also manually timed how long it took to create a new message by wrapping mkMessage with

start := time.Now()
m
:= mkMessage(highID)
elapsed
:= time.Since(start)

and printed the slowest `elapsed` time. The time I get for this was 38.573036ms!

I was instantly suspicious because this correlated strongly with the wall clock times in the supposedly concurrent mark/scan phase, and in particular with "idle GC time".

My question is: why does this supposedly concurrent phase of the GC appear to block the mutator?

If I force the GC to run at regular intervals, my manually calculated pause times go way down to <1ms, so it appears to be hitting some kind of limit of non-live heap objects. If so, I'm not sure what that limit is, and why it would cause a concurrent phase of the GC to appear to block the mutator.

Thanks!

Konstantin Khomoutov

unread,
Oct 19, 2016, 8:54:55 AM10/19/16
to Will Sewell, golang-nuts
On Wed, 19 Oct 2016 04:26:09 -0700 (PDT)
Will Sewell <m...@willsewell.com> wrote:

> Hey, I previously posted this on StackOverflow, but I was told this
> mailing list would be a better forum for discussion.

JFTR here's the discussion on SO:
<http://stackoverflow.com/q/40113915/720999>

[...]

r...@golang.org

unread,
Oct 19, 2016, 9:10:23 AM10/19/16
to golang-nuts
This is likely 23540.  

rhys.h...@gmail.com

unread,
Oct 19, 2016, 12:29:23 PM10/19/16
to golang-nuts
Yes, this sounds a lot like https://golang.org/issue/16293, where goroutines that allocate memory while the garbage collector is running can end up stalled for nearly the entire GC cycle, in programs where a large amount of the memory is in a single allocation. For the program you've shared, that would be the "channel" map. The bug is present in Go 1.5–1.7, and is fixed in tip (via CL 23540).

Do you still see the problem if you run the program with the current development version of Go?

Will Sewell

unread,
Oct 20, 2016, 9:36:27 AM10/20/16
to golang-nuts, rhys.h...@gmail.com
Interesting, that does sound like it could be the cause.

I just tried running the same benchmark on master (692df217ca21b6df8e4dc65538fcc90733e8900e), and I get the following results:

gc 1 @0.004s 3%: 0.009+0.41+0.049 ms clock, 0.036+0.11/0.36/0.12+0.19 ms cpu, 4->4->3 MB, 5 MB goal, 4 P
gc 2 @0.008s 4%: 0.008+0.80+0.035 ms clock, 0.034+0.097/0.67/0.16+0.14 ms cpu, 7->7->7 MB, 8 MB goal, 4 P
gc 3 @0.016s 3%: 0.010+0.91+0.044 ms clock, 0.041+0/0.31/0.79+0.17 ms cpu, 13->15->14 MB, 15 MB goal, 4 P
gc 4 @0.032s 3%: 0.009+2.3+0.10 ms clock, 0.037+0.60/2.0/0.12+0.40 ms cpu, 27->28->27 MB, 29 MB goal, 4 P
gc 5 @0.070s 3%: 0.010+7.6+0.068 ms clock, 0.043+0.79/5.4/8.5+0.27 ms cpu, 51->53->51 MB, 54 MB goal, 4 P
gc 6 @0.149s 3%: 0.020+8.2+0.12 ms clock, 0.081+0.56/7.2/9.7+0.48 ms cpu, 98->102->99 MB, 103 MB goal, 4 P
gc 7 @0.282s 4%: 0.028+21+0.082 ms clock, 0.11+10/20/1.9+0.32 ms cpu, 190->195->190 MB, 198 MB goal, 4 P
gc 8 @0.568s 3%: 0.024+24+0.080 ms clock, 0.098+0/23/41+0.32 ms cpu, 364->376->214 MB, 381 MB goal, 4 P
gc 9 @0.816s 3%: 0.008+27+0.072 ms clock, 0.035+0/25/34+0.29 ms cpu, 412->420->213 MB, 428 MB goal, 4 P
gc 10 @1.064s 3%: 0.009+31+0.10 ms clock, 0.039+6.1/26/33+0.41 ms cpu, 415->427->216 MB, 427 MB goal, 4 P

My manually calculated worst time for a call to mkMessage is 7.73812ms, which is much better than before. It's significantly faster than the worst wall clock time for the concurrent mark/scan phase, but it's also much slower than the worst STW phase. Do you know why this might be?

Best,
Will

rhys.h...@gmail.com

unread,
Oct 24, 2016, 12:06:59 PM10/24/16
to golang-nuts, rhys.h...@gmail.com
Yes, this sounds like https://golang.org/issue/16528. During the concurrent mark phase (the "27 [ms]" of "0.008+27+0.072 ms clock"), both your code and the garbage collector are running. The program is allowed to use four OS threads ("4 P"), which might be executing your code in your goroutines, or might be running GC code in dedicated GC goroutines.

There's plenty of work for the GC to do, so when a GC helper goroutine is allowed to have some processing time it'll keep running until it has used up all of its allowed time—ten milliseconds. If all four threads end up running GC goroutines at the same time, your goroutine will need to wait until one of them has run for about 10ms before it can be scheduled again. This can lead to individual goroutines being paused for up to 10ms.

You might be able to see this with the execution tracer, but it's not an easy tool to use. See package "runtime/trace" and the command "go tool trace" for some hints. Issue 16528 includes some screenshots of the tool.

Will Sewell

unread,
Oct 26, 2016, 10:46:47 AM10/26/16
to golang-nuts, rhys.h...@gmail.com
Thanks for the information. I think it could well be caused by that. Below is the screenshot of one of the periods of time where the mutator is blocked.



Note: pause times were as high as 15ms with the tracer enabled.

Similar sizes occur every ~100ms.

Let's hope this gets resolved in Go1.8 :)

Dave Cheney

unread,
Oct 26, 2016, 4:45:59 PM10/26/16
to golang-nuts, rhys.h...@gmail.com
Will, the changes has been in master for a few months now, are you able to build master from source and see if this has addressed the issue? I'm sure Rick and Austin would appreciate the feedback.

Will Sewell

unread,
Oct 27, 2016, 12:52:29 PM10/27/16
to golang-nuts, rhys.h...@gmail.com
Are you referring to https://go-review.googlesource.com/#/c/23540/ or https://github.com/golang/go/issues/16528? If it's the former, then yes I have tried the benchmark with the HEAD version on master of the compiler, and it did bring the pause down to ~7.7ms. I was under the impression the latter issue has not been fixed yet, and is the reason the pause time was not even lower.

Dave Cheney

unread,
Oct 27, 2016, 12:58:30 PM10/27/16
to golang-nuts, rhys.h...@gmail.com
Thanks for confirming that CL 23540 has reduced STW delays.

Will Sewell

unread,
Dec 1, 2016, 2:27:34 PM12/1/16
to golang-nuts
I just wrote up our investigation in this blog post: https://blog.pusher.com/golangs-real-time-gc-in-theory-and-practice/

It includes an animation of the GC algorithm that my colleague James Fisher made. I think this could be valuable as a standalone resource.

If any of you have any feedback, please let me know.
Reply all
Reply to author
Forward
0 new messages