Latency spike during GC


Xun Liu

May 31, 2017, 9:26:15 AM
to golang-nuts
Hi, we see a clear correlation between GC and latency spikes in our Go server. The server uses a fairly large amount of memory (20 GB) and does mostly CPU work. It runs on a beefy box with 32 cores and the load is pretty light (average CPU 20-30%). GC kicks in once every 10-20 seconds, and whenever it runs we observe a pretty big latency spike, ranging from 30% to 100% across the p50-p90 percentiles (e.g. p90 can jump from 100-120ms to 160-250ms).

I captured a trace of a GC cycle and noticed the following:

1. User goroutines seem to run longer during GC. This is based on ad-hoc checks; I don't really know how to get stats to confirm it.
The GC log is as follows (tiny pauses, but very aggressive assist and idle time):
gc 347 @6564.164s 0%: 0.89+518+1.0 ms clock, 28+3839/4091/3959+33 ms cpu, 23813->23979->12265 MB, 24423 MB goal, 32 P
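
(For reference, this line is produced by running with GODEBUG=gctrace=1; per the Go 1.8 runtime package documentation the fields break down roughly as follows.)

gc 347                       347th GC cycle since the program started
@6564.164s                   6564 seconds after program start
0%                           percentage of time spent in GC since program start
0.89+518+1.0 ms clock        wall clock: STW sweep termination + concurrent mark/scan + STW mark termination
28+3839/4091/3959+33 ms cpu  CPU: sweep termination + (assist / background / idle) mark time + mark termination
23813->23979->12265 MB       heap size at GC start -> at GC end -> live heap
24423 MB goal                heap goal for this cycle
32 P                         number of processors used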

2. During GC, goroutines can queue up. In this particular case there is a stretch of time (~20ms) where many goroutines are GCWaiting. See below -- the second row shows goroutine counts, with light grey indicating GCWaiting and light green Runnable.



Any idea what's going on here? What can I do to reduce the spikes?
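
(For anyone following along: the exact instrumentation used above isn't shown, but a minimal sketch of capturing a comparable trace is below; the file name and capture window are arbitrary.)

package main

import (
    "log"
    "os"
    "runtime/trace"
    "time"
)

func main() {
    // Run the binary with GODEBUG=gctrace=1 to also get the "gc N @..." lines above.
    f, err := os.Create("trace.out")
    if err != nil {
        log.Fatal(err)
    }
    defer f.Close()

    // Record an execution trace long enough to span at least one GC cycle
    // (GC runs every 10-20 s in this workload).
    if err := trace.Start(f); err != nil {
        log.Fatal(err)
    }
    time.Sleep(30 * time.Second) // in a real server, serve traffic here instead
    trace.Stop()

    // Inspect with: go tool trace trace.out
}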

 

Ian Lance Taylor

May 31, 2017, 10:13:38 AM
to Xun Liu, golang-nuts, Rick Hudson, Austin Clements
[ +rlh, austin ]

Which version of Go are you running?

Ian


Xun Liu

May 31, 2017, 10:25:25 AM
to golang-nuts, pas...@gmail.com, r...@golang.org, aus...@google.com

$ go version

go version go1.8 linux/amd64


Rick Hudson

May 31, 2017, 2:36:12 PM
to Xun Liu, golang-nuts, Austin Clements
gc 347 @6564.164s 0%: 0.89+518+1.0 ms clock, 28+3839/4091/3959+33 ms cpu, 23813->23979->12265 MB, 24423 MB goal, 32 P

What I'm seeing here is that you have 32 HW threads and you spend 0.89+518+1.0, or about 520 ms, of wall clock in the GC. You also spend 28+3839+4091+3959+33, or 11950 ms, of CPU time out of a total of 520*32, or 16640 ms, of CPU available while the GC is running. The GC will reserve 25% of the CPU to do its work. That's 16640*.25, or 4160 ms. If the GC finds any of the remaining 24 threads idle, it will aggressively enlist them to do GC work.

The graph only shows 8 of the HW threads, not the other 24, so it is hard to tell what is going on with them.

This may well be related to issue 12812, so you might want to read that thread for more insight.

Xun Liu

May 31, 2017, 3:02:33 PM
to Rick Hudson, golang-nuts, Austin Clements
On Wed, May 31, 2017 at 11:35 AM, Rick Hudson <r...@golang.org> wrote:
gc 347 @6564.164s 0%: 0.89+518+1.0 ms clock, 28+3839/4091/3959+33 ms cpu, 23813->23979->12265 MB, 24423 MB goal, 32 P

What I'm seeing here is that you have 32 HW threads and you spend 0.89+518+1.0, or about 520 ms, of wall clock in the GC. You also spend 28+3839+4091+3959+33, or 11950 ms, of CPU time out of a total of 520*32, or 16640 ms, of CPU available while the GC is running. The GC will reserve 25% of the CPU to do its work. That's 16640*.25, or 4160 ms. If the GC finds any of the remaining 24 threads idle, it will aggressively enlist them to do GC work.

The graph only shows 8 of the HW threads, not the other 24, so it is hard to tell what is going on with them.

The other CPUs are doing user-level work. I pasted the full trace pictures at the end.
 

This may well be related to issue 12812, so you might want to read that thread for more insight.


I read the issue before and it does seem related. Something special about my trace is that many goroutines are GCWaiting -- what triggers GC waiting?
Also, is there an easy way to turn off idle GC or assist GC? Not knowing much detail about the Go GC, I wonder if it's worth testing how much of a role they play in the latency increase -- it feels to me they are too aggressive, e.g. the GC finishes after the heap has grown only ~27% of the way to its goal: 27% = (23979 - 23813) / (24423 - 23813)

gc 347 @6564.164s 0%: 0.89+518+1.0 ms clock, 28+3839/4091/3959+33 ms cpu, 23813->23979->12265 MB, 24423 MB goal, 32 P





​ 

Francis

Jun 1, 2017, 9:02:27 AM
to golang-nuts, r...@golang.org, aus...@google.com
Xun Liu,

Are you able to post the actual trace here? I appreciate that it contains information about your source code, so I can understand if you don't want to share it. But I would be very interested to look at the full trace.

Thanks

Francis

Rick Hudson

Jun 1, 2017, 9:07:03 AM
to Xun Liu, golang-nuts, Austin Clements
Yes, the GC seems to finish early, but that is the result of having to be very conservative about the amount of work left to do. Add to this that a lot of Ps with nothing scheduled are enlisted to do GC work. It seems counterintuitive to leave Ps idle, so the approach will need numbers to support it. There is no easy way to change the assist and idle settings.

In an attempt to get some sort of upper bound on the potential latency improvements that might result from making the GC less aggressive, could you run the following experiments?

GOMAXPROCS=24 yourApp
and report what, if any, change you see in your P50, P90, and P99 numbers when GC is not running.
This should give us an upper bound on what can be expected if the GC runs with only 25% of the CPU. The experiment is not interested in the latency when the GC is running; it is trying to simulate what happens with a minimally aggressive GC.

GOMAXPROCS=16 yourApp
This should give us an idea of what to expect if the GC never schedules half the processors, regardless of whether they are idle.

GOMAXPROCS=8 yourApp
This should give us an idea of what to expect if only 25% of the processors are ever scheduled, regardless of whether they are idle.

Thanks, looking forward to seeing the percentile numbers and graphs. 
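
(Not part of the original reply: a minimal, self-contained sketch of recording per-request latencies and reducing them to the p50/p90/p99 numbers asked for above. The handler and sample count are made up.)

package main

import (
    "fmt"
    "math/rand"
    "sort"
    "time"
)

// percentile returns the p-th percentile (0-100) of an already-sorted slice.
func percentile(sorted []time.Duration, p int) time.Duration {
    if len(sorted) == 0 {
        return 0
    }
    return sorted[(len(sorted)-1)*p/100]
}

// fakeHandler stands in for a real request handler.
func fakeHandler() {
    time.Sleep(time.Duration(rand.Intn(20)+1) * time.Millisecond)
}

func main() {
    var lat []time.Duration
    for i := 0; i < 200; i++ {
        start := time.Now()
        fakeHandler()
        lat = append(lat, time.Since(start))
    }
    sort.Slice(lat, func(i, j int) bool { return lat[i] < lat[j] })
    fmt.Println("p50:", percentile(lat, 50))
    fmt.Println("p90:", percentile(lat, 90))
    fmt.Println("p99:", percentile(lat, 99))
}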




Xun Liu

Jun 1, 2017, 12:27:58 PM
to Rick Hudson, golang-nuts, Austin Clements
Now that you mention it, I remember a test I did before, setting GOMAXPROCS=16 (to gauge the latency impact if we put more load on the server, e.g. 2x), and I didn't see much difference across the latency percentiles, GC or not. I could do more such tests as you suggested, but it's going to take a while to get results since I have to go on a last-minute trip very soon.


Xun Liu

Jun 1, 2017, 12:48:54 PM
to Rick Hudson, golang-nuts, Austin Clements
On Thu, Jun 1, 2017 at 6:06 AM, Rick Hudson <r...@golang.org> wrote:
Yes, the GC seems to finish early, but that is the result of having to be very conservative about the amount of work left to do. Add to this that a lot of Ps with nothing scheduled are enlisted to do GC work. It seems counterintuitive to leave Ps idle, so the approach will need numbers to support it. There is no easy way to change the assist and idle settings.

I would argue that the fact that we saw runnable goroutines piling up from time to time during GC (which almost never happens outside GC) suggests the Go GC is too aggressive in enlisting help, to the point that it can starve user goroutines -- unless you have another explanation for the goroutine queue-up.

Rick Hudson

Jun 1, 2017, 2:46:27 PM
to Xun Liu, golang-nuts, Austin Clements
As a baseline, the GC's SLO allows it to use 25% of the CPU plus some assist work from high-allocating goroutines that is at most proportional to each goroutine's allocations. Beyond that, the intent is that the GC only enlists otherwise idle Ps. If it is the first 25% that is backing up the goroutines, then things are working as expected. If, on the other hand, it is the enlistment of idle Ps that is backing up the goroutines, then we need to understand what is happening and adjust the scheduling based on what we learn. I am hoping that the experiments I proposed above will shed some light on what is causing the backup.

Xun Liu

Jun 5, 2017, 10:05:31 PM
to Rick Hudson, golang-nuts, Austin Clements
Here are some updates on the tests I did:

1. GOMAXPROCS=24: no latency change (they are the same as GOMAXPROCS=32, both when gc is running and not running)
2. GOMAXPROCS=16: no latency change
3. GOMAXPROCS=8: latency increases by 20-30% across p50-p99 -- the process starts to starve for CPU.

I also did the following to directly measure the impact of assist and idle scanning (by hacking the Go runtime package):
1. disable idle scan: no latency change
2. disable assist scan: no latency change
3. disable both: no latency change

So something else in the GC is at play here -- if you'd like to take a guess, here is one more data point:
The latency spikes occur in only 80-90% of GC cycles; in the other 10-20%, latency is not affected by the GC.

Rick Hudson

Jun 6, 2017, 8:38:44 AM
to Xun Liu, golang-nuts, Austin Clements
>> 1. GOMAXPROCS=24: no latency change (they are the same as GOMAXPROCS=32, both when GC is running and not running)
This is an interesting clue. 

My guess:
The application is memory bandwidth limited. If the GC's concurrent mark phase is memory bandwidth bound, then the application would also be memory bandwidth bound. If one reduces the amount of bandwidth the GC is using, then the application would get the bandwidth it needs. One easy way to reduce the GC's bandwidth requirement is to reduce the number of Ps the GC is given from 8 to 6, which is exactly what reducing GOMAXPROCS from 32 to 24 does. The remaining 18 Ps would no longer be memory bandwidth bound and so could meet the latency SLO. The graph of latency vs. GOMAXPROCS might have an interesting shape.

Alternatively if the GC flushes the cache at a high rate it would cause the application to experience capacity misses and this would also slow the application down. It could be both since capacity misses tend to increase memory bandwidth requirements.

I'll discuss this at lunch and see if we can come up with a simple experiment to confirm/refute the guess.
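
(Purely as an illustration of the kind of experiment that could probe the bandwidth theory, and not the experiment that came out of that discussion: run a synthetic memory-bandwidth hog next to the normal request load, with GC otherwise quiet, and see whether latency degrades in a similar way. Buffer sizes and goroutine counts below are arbitrary.)

package main

import "time"

var sink byte // package-level so the scan loop isn't optimized away

// hogBandwidth streams over a large buffer, loosely imitating the load the
// concurrent mark phase puts on the memory system.
func hogBandwidth(buf []byte, stop <-chan struct{}) {
    for {
        select {
        case <-stop:
            return
        default:
        }
        var s byte
        for i := 0; i < len(buf); i += 64 { // touch one byte per cache line
            s += buf[i]
        }
        sink = s
    }
}

func main() {
    stop := make(chan struct{})
    for i := 0; i < 6; i++ { // roughly the number of dedicated GC workers at GOMAXPROCS=24
        go hogBandwidth(make([]byte, 1<<30), stop)
    }
    time.Sleep(60 * time.Second) // in practice, run the real request load here
    close(stop)
}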

The good news is that the application has a workaround: just reduce GOMAXPROCS to 24.

>> The latency spikes occur in only 80-90% of GC cycles; in the other 10-20%, latency is not affected by the GC.
Don't know what to make of this clue. Is the application the only thing running on the machine?

Xun Liu

Jun 6, 2017, 4:12:03 PM
to Rick Hudson, Austin Clements, golang-nuts


On Tue, Jun 6, 2017, 20:38 Rick Hudson <r...@golang.org> wrote:
>> 1. GOMAXPROCS=24: no latency change (they are the same as GOMAXPROCS=32, both when GC is running and not running)
This is an interesting clue. 

My guess:
The application is memory bandwidth limited. If the GC's concurrent mark phase is memory bandwidth bound, then the application would also be memory bandwidth bound. If one reduces the amount of bandwidth the GC is using, then the application would get the bandwidth it needs. One easy way to reduce the GC's bandwidth requirement is to reduce the number of Ps the GC is given from 8 to 6, which is exactly what reducing GOMAXPROCS from 32 to 24 does. The remaining 18 Ps would no longer be memory bandwidth bound and so could meet the latency SLO. The graph of latency vs. GOMAXPROCS might have an interesting shape.

Alternatively if the GC flushes the cache at a high rate it would cause the application to experience capacity misses and this would also slow the application down. It could be both since capacity misses tend to increase memory bandwidth requirements.

I'll discuss this at lunch and see if we can come up with a simple experiment to confirm/refute the guess.

The good news is that the application has a workaround: just reduce GOMAXPROCS to 24.
Sorry, maybe I didn't make myself clear. By "no latency change" I meant the application latency was the same as when GOMAXPROCS is set to 32: e.g. p90 was the same 100-120ms outside GC and jumped to 160-250ms during GC.

In fact, among all the things I tried, nothing seemed to change the application latency (compared to GOMAXPROCS=32) except when GOMAXPROCS was set to 8, in which case the application starved for CPU and p90 jumped to 120-150ms outside GC and 200-300ms during GC.


 

Rick Hudson

Jun 7, 2017, 10:15:28 AM
to Xun Liu, Austin Clements, golang-nuts
Could you send us the trace file that you used? From the image it is hard to distinguish goroutines doing GC assists from application goroutines.

Xun Liu

Jun 7, 2017, 11:37:28 PM
to Rick Hudson, Austin Clements, golang-nuts

r...@google.com

Jun 12, 2017, 9:31:38 AM
to golang-nuts, r...@golang.org, aus...@google.com
Nothing is obvious, but there is something going on that we don't understand. If you have the time to create a simple reproducer, the next step would be to open an issue that contains it. Once the problem can be reproduced locally, it can be diagnosed.
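
(A rough sketch of what such a reproducer might look like; all sizes, rates, and the workload itself are invented rather than taken from the real server. The idea is to hold a large live heap, serve CPU-bound "requests" that also allocate, and print the worst request latency each second so spikes can be lined up against GODEBUG=gctrace=1 output.)

package main

import (
    "fmt"
    "time"
)

var (
    liveHeap [][]byte // keeps a few GB reachable so the GC has real work to do
    scratch  []byte
)

// handleRequest stands in for a CPU-bound handler that also allocates.
func handleRequest() {
    buf := make([]byte, 128<<10)
    for i := range buf {
        buf[i] = byte(i)
    }
    scratch = buf // keep a reference so the allocation is not optimized away
}

func main() {
    for i := 0; i < 4096; i++ { // ~4 GiB of live heap in 1 MiB chunks
        liveHeap = append(liveHeap, make([]byte, 1<<20))
    }

    worst := time.Duration(0)
    tick := time.NewTicker(time.Second)
    defer tick.Stop()
    for {
        select {
        case <-tick.C:
            fmt.Println("worst request in the last second:", worst)
            worst = 0
        default:
            start := time.Now()
            handleRequest()
            if d := time.Since(start); d > worst {
                worst = d
            }
        }
    }
}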

Austin Clements

Jun 12, 2017, 10:22:36 AM
to Xun Liu, Rick Hudson, golang-nuts
Hi Xun. Thanks for the trace. Nothing jumps out at me, either. I would suggest trying with a reduced idleCheckThreshold in the runtime, but you said that disabling idle workers entirely didn't make a difference, so I wouldn't expect that to, either.

Would it be possible to run your workload with Go 1.9 (either current master or the beta which should be out any day now)? 1.9 adds some extra information in the tracer that could be useful.

How are you correlating your latency spikes with GC? I'm wondering if, for example, you know exactly which phase it spikes during (early in GC or late?) or if it's possible that it actually spikes just after a GC cycle (which would implicate sweeping rather than marking).

Xun Liu

Jun 15, 2017, 12:03:09 AM
to Austin Clements, Rick Hudson, golang-nuts


On Mon, Jun 12, 2017, 22:22 Austin Clements <aus...@google.com> wrote:
Hi Xun. Thanks for the trace. Nothing jumps out at me, either. I would suggest trying with a reduced idleCheckThreshold in the runtime, but you said that disabling idle workers entirely didn't make a difference, so I wouldn't expect that to, either.

Would it be possible to run your workload with Go 1.9 (either current master or the beta which should be out any day now)? 1.9 adds some extra information in the tracer that could be useful.

How are you correlating your latency spikes with GC? I'm wondering if, for example, you know exactly which phase it spikes during (early in GC or late?) or if it's possible that it actually spikes just after a GC cycle (which would implicate sweeping rather than marking).
I logged requests and calculated latencies once every N seconds. If I also saw a GC log line during the interval, I knew a GC had happened. I tried to be more fine-grained, as you suggested, to get stats for requests overlapping GC, but gave up because the GC log reports time since program start while our requests are logged with wall-clock time. It looks like I would have to hack the GC log format to get such stats.
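
(For what it's worth, runtime/debug.ReadGCStats exposes GC pause end times as wall-clock time.Time values, which can give a coarse request-to-GC correlation without changing the log format; note it only covers pause ends, not the full concurrent mark span. A minimal sketch, with the request stubbed out:)

package main

import (
    "fmt"
    "runtime/debug"
    "time"
)

// gcEndedBetween reports whether any recent GC cycle ended inside [start, end].
func gcEndedBetween(start, end time.Time) bool {
    var stats debug.GCStats
    debug.ReadGCStats(&stats)
    for _, pe := range stats.PauseEnd {
        if pe.After(start) && pe.Before(end) {
            return true
        }
    }
    return false
}

func main() {
    start := time.Now()
    time.Sleep(50 * time.Millisecond) // stand-in for handling one request
    end := time.Now()
    fmt.Println("request overlapped a GC pause end:", gcEndedBetween(start, end))
}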

Xun Liu

Jul 6, 2017, 6:29:30 PM
to Austin Clements, Rick Hudson, golang-nuts
I've been away and finally got back to work. Here is the trace obtained with Go 1.9 (built from the tip of master):
$ bin/go version
go version devel +9d2de77 Thu Jul 6 18:05:22 2017 +0000 linux/amd64

I also instrumented my code and the runtime so I can precisely correlate timestamps between my requests and the Go GC. The findings are:
1. The latency increase comes from the requests that overlap GC. Requests starting right after GC also show a slight increase (e.g. 5-10%), but not nearly to the degree of the overlapping requests (30-100% increase).
2. I divided the requests overlapping GC into two groups: ones that start in the first half of GC and ones that start in the second half. There is no difference in latency between the two groups.

Please let me know if you could spot anything from the data. Thanks.

Francis

Jul 7, 2017, 7:00:38 AM
to golang-nuts
When you describe the first and second half of GC, are you referring to the mark and sweep phases?