Large GC pauses with large map


Lee Armstrong

Apr 20, 2017, 9:49:49 AM
to golang-nuts
See attached graph which shows the GC pauses of an application we have.

I am frequently seeing pauses of 1-1.5 seconds. This is using Go 1.8.1, and we have a large map that is frequently accessed, with items regularly added and removed. These items can be of some size.

Is there a way to get these pauses down? Would forcing a GC() after removing a batch of elements help at all?

Alongside the pauses I see some substantial CPU usage showing up in traces for the GC scan.

Thanks in advance!
2017-04-20_07-44-27.png
2017-04-20_07-47-17.png

Egon

Apr 20, 2017, 9:56:50 AM
to golang-nuts
Use a custom map implementation.

Jesper Louis Andersen

Apr 20, 2017, 10:01:03 AM
to Lee Armstrong, golang-nuts
A somewhat common culprit seems to be the following case:

1. In order for the GC to switch from marking to sweeping, it needs all cores to agree. This requires a "barrier" in the system and thus we have to wait on all CPUs.
2. The barrier check happens on a function call.
3. A CPU core is currently executing a long-running operation inside a function, but is never calling another function in doing so. Usually there is a loop involved.
4. The system is now practically stalled.

There may be other reasons, but the above scheme is somewhat common. A workaround is to make sure you break long-running loops every N iterations and coerce the system to consider a garbage collection, say by returning from a function. Work is being done to automate this check so low latency can be achieved without you having to intervene.[0]
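
A minimal sketch of that workaround, with made-up names; the yield gives the scheduler, and therefore the GC, a chance to reach its barrier:

package main

import "runtime"

// processAll stands in for a long-running, call-free loop. On Go 1.8
// (no asynchronous preemption yet) such a loop can hold up the GC's
// stop-the-world phases; yielding every N iterations bounds that delay.
func processAll(items []int) int {
	sum := 0
	for i, v := range items {
		sum += v // stand-in for the real work
		if i%100000 == 0 {
			runtime.Gosched() // explicit preemption point
		}
	}
	return sum
}

func main() {
	_ = processAll(make([]int, 1<<22))
}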

To check this assumption, set GODEBUG=gctrace=1 and look at the pauses of the different phases. It should tell you whether this is the culprit or not. It could easily be something else, for instance pressure on the GC itself due to heavy allocation.

[0] This will come at a slight cost to throughput, but it is probably a valid trade-off to make in a modern world of highly parallelized systems.
 


刘桂祥

Apr 20, 2017, 10:39:07 PM
to golang-nuts
Try using value types.
Example:
map[string]*struct => map[[20]byte]struct

On Thursday, April 20, 2017 at 9:49:49 PM UTC+8, Lee Armstrong wrote:

les...@gmail.com

Apr 21, 2017, 4:44:45 AM
to golang-nuts
Thanks all.

Jesper: I will try those things and may look to force a GC when I do my reducing of the map. Your explanation was very useful!
The application is processing a fast stream from RabbitMQ so yeah there is a lot of throughput onto this map.

Egon: Can you explain what to use and why this would help?

刘桂祥: I do currently have it as map[string]*struct; how will changing it (in what I can see is two places) make a difference? The key to a byte array, and then a value type in the map.

Thanks again,

Lee

Egon

Apr 21, 2017, 6:23:52 AM
to golang-nuts
On Friday, 21 April 2017 11:44:45 UTC+3, Lee Armstrong wrote:
Thanks all.

Jesper: I will try those things and may look to force a GC when I do my reducing of the map. Your explanation was very useful!
The application is processing a fast stream from RabbitMQ so yeah there is a lot of throughput onto this map.

Egon: Can you explain what to use and why this would help?

There are several ways of implementing a map and there are also offheap implementations (https://godoc.org/github.com/glycerine/offheap) that avoid GC completely -- at the cost of a more complicated API.

For example, ideas to try:
1. sharding (see the sketch below)
2. using map[string]struct
3. using map[string]int and [][1<<10]struct (and managing collection/reuse yourself)
4. if the keys don't change often, you could use an atom table, e.g. map[string]Atom and map[Atom]struct

If you can show the minimized real-world case (or at least similar in struct sizes/types and read-write distribution) then it would be possible to suggest something more concrete.
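
To make idea 1 concrete, here is a minimal sharded-map sketch; the key/value types, shard count, and names are illustrative, not a recommendation:

package main

import (
	"fmt"
	"hash/fnv"
	"sync"
)

const shardCount = 32 // illustrative; tune for your workload

type shard struct {
	mu sync.Mutex
	m  map[string]int
}

// shardedMap splits one big map into many small ones: each lock covers
// less data, and no single map becomes one huge unit of GC scan work.
type shardedMap [shardCount]shard

func newShardedMap() *shardedMap {
	var s shardedMap
	for i := range s {
		s[i].m = make(map[string]int)
	}
	return &s
}

func (s *shardedMap) pick(key string) *shard {
	h := fnv.New32a()
	h.Write([]byte(key))
	return &s[h.Sum32()%shardCount]
}

func (s *shardedMap) Put(key string, v int) {
	sh := s.pick(key)
	sh.mu.Lock()
	sh.m[key] = v
	sh.mu.Unlock()
}

func (s *shardedMap) Get(key string) (int, bool) {
	sh := s.pick(key)
	sh.mu.Lock()
	defer sh.mu.Unlock()
	v, ok := sh.m[key]
	return v, ok
}

func main() {
	s := newShardedMap()
	s.Put("flight-123", 42)
	fmt.Println(s.Get("flight-123"))
}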


刘桂祥: I do currently have it as map[string]*struct; how will changing it (in what I can see is two places) make a difference? The key to a byte array, and then a value type in the map.

[20]byte is a value
[]byte contains a pointer to a value
string contains a pointer to a value

The grossly oversimplified version:

  "Pointers make GC slower, the less pointers you have the less work GC has to do."

r...@golang.org

Apr 21, 2017, 10:16:31 AM
to golang-nuts
How did you generate the GC pause graphs? Could you also provide the output from "GODEBUG=gctrace=1 yourApp"? It would help confirm that this is a GC pause problem. Also, some insight into the number of cores / HW threads and the value of GOMAXPROCS could eliminate some possibilities.
A reproducer would be great.
Thanks in advance. 

les...@gmail.com

Apr 21, 2017, 10:51:45 AM
to golang-nuts
It will be tough to reproduce in an example as it's quite deeply embedded in a large project.

It does seem that moving away from pointers may help improve the situation from what Egon has said.

I will try and capture the gctrace next week. The graphs were obtained from stackimpact.com, whose agent is embedded inside the binary.

The machine is a 2-core machine and GOMAXPROCS has not been changed manually, so it is the default.

GOARCH="amd64"

GOBIN=""

GOEXE=""

GOHOSTARCH="amd64"

GOHOSTOS="linux"

GOOS="linux"

GOPATH="/root/work"

GORACE=""

GOROOT="/root/.gvm/gos/go1.8.1"

GOTOOLDIR="/root/.gvm/gos/go1.8.1/pkg/tool/linux_amd64"

GCCGO="gccgo"

CC="gcc"

GOGCCFLAGS="-fPIC -m64 -pthread -fmessage-length=0 -fdebug-prefix-map=/tmp/go-build994225408=/tmp/go-build -gno-record-gcc-switches"

CXX="g++"

CGO_ENABLED="1"

PKG_CONFIG="pkg-config"

CGO_CFLAGS="-g -O2"

CGO_CPPFLAGS=""

CGO_CXXFLAGS="-g -O2"

CGO_FFLAGS="-g -O2"

CGO_LDFLAGS="-g -O2"

les...@gmail.com

Apr 21, 2017, 11:23:11 AM
to golang-nuts
Managed to find time to run it quickly before the end of UK time...

gc 1 @0.114s 0%: 0.088+2.1+0.068 ms clock, 0.17+0.14/0/2.1+0.13 ms cpu, 5->5->3 MB, 6 MB goal, 2 P
gc 2 @0.117s 1%: 0.035+3.0+0.051 ms clock, 0.071+0.076/2.9/0+0.10 ms cpu, 7->7->6 MB, 8 MB goal, 2 P
gc 3 @0.123s 1%: 0.020+2.6+0.073 ms clock, 0.041+0.52/0.089/2.5+0.14 ms cpu, 9->9->7 MB, 13 MB goal, 2 P
gc 4 @0.174s 3%: 0.047+12+0.051 ms clock, 0.095+0.58/6.7/5.7+0.10 ms cpu, 12->12->11 MB, 15 MB goal, 2 P
gc 5 @0.295s 5%: 0.022+27+0.020 ms clock, 0.045+0.80/19/7.7+0.041 ms cpu, 20->22->20 MB, 23 MB goal, 2 P
gc 6 @0.360s 8%: 0.040+55+0.15 ms clock, 0.081+4.8/29/25+0.31 ms cpu, 36->39->24 MB, 40 MB goal, 2 P
gc 7 @0.436s 9%: 0.027+45+0.13 ms clock, 0.055+1.0/20/24+0.27 ms cpu, 44->47->35 MB, 48 MB goal, 2 P
gc 8 @0.519s 11%: 0.026+66+0.15 ms clock, 0.052+14/31/35+0.31 ms cpu, 66->70->53 MB, 71 MB goal, 2 P
gc 9 @0.640s 12%: 0.051+81+0.080 ms clock, 0.10+1.6/40/41+0.16 ms cpu, 99->103->78 MB, 106 MB goal, 2 P
gc 10 @0.799s 14%: 0.020+98+0.058 ms clock, 0.040+31/50/48+0.11 ms cpu, 147->154->116 MB, 156 MB goal, 2 P
gc 11 @0.974s 14%: 0.022+133+0.12 ms clock, 0.045+1.6/66/67+0.24 ms cpu, 219->226->171 MB, 233 MB goal, 2 P
gc 12 @10.293s 2%: 4.0+185+0.074 ms clock, 8.1+166/90/7.2+0.14 ms cpu, 326->341->206 MB, 342 MB goal, 2 P
gc 13 @13.960s 2%: 0.079+207+0.097 ms clock, 0.15+3.3/107/163+0.19 ms cpu, 380->384->187 MB, 413 MB goal, 2 P
gc 14 @17.521s 2%: 0.031+161+0.087 ms clock, 0.062+168/84/10+0.17 ms cpu, 364->377->209 MB, 374 MB goal, 2 P
gc 15 @22.344s 2%: 0.16+327+0.15 ms clock, 0.32+1.9/164/197+0.31 ms cpu, 375->382->193 MB, 419 MB goal, 2 P
gc 16 @28.699s 2%: 0.16+299+0.077 ms clock, 0.33+0.55/149/205+0.15 ms cpu, 363->370->194 MB, 387 MB goal, 2 P
gc 17 @30.111s 2%: 0.029+0+190 ms clock, 0.059+0.55/149/205+380 ms cpu, 222->222->187 MB, 222 MB goal, 2 P (forced)
gc 18 @35.607s 2%: 0.056+323+0.12 ms clock, 0.11+107/163/139+0.25 ms cpu, 361->367->195 MB, 375 MB goal, 2 P
gc 19 @41.534s 2%: 0.18+197+0.094 ms clock, 0.36+259/102/0.29+0.18 ms cpu, 389->390->215 MB, 391 MB goal, 2 P
gc 20 @50.514s 2%: 0.063+397+0.067 ms clock, 0.12+17/205/219+0.13 ms cpu, 399->411->204 MB, 431 MB goal, 2 P
gc 21 @56.530s 2%: 4.0+373+0.12 ms clock, 8.1+15/193/118+0.25 ms cpu, 389->398->226 MB, 409 MB goal, 2 P
gc 22 @63.571s 2%: 0.092+316+0.085 ms clock, 0.18+5.8/159/282+0.17 ms cpu, 434->439->199 MB, 452 MB goal, 2 P
gc 23 @69.543s 2%: 0.059+329+0.13 ms clock, 0.11+195/165/69+0.26 ms cpu, 389->396->203 MB, 399 MB goal, 2 P
gc 24 @75.534s 2%: 4.0+460+0.097 ms clock, 8.1+402/240/5.1+0.19 ms cpu, 394->404->210 MB, 407 MB goal, 2 P
gc 25 @76.042s 3%: 0.035+0+368 ms clock, 0.070+402/240/5.1+736 ms cpu, 213->213->202 MB, 213 MB goal, 2 P (forced)
gc 26 @79.000s 3%: 4.0+397+0.10 ms clock, 8.0+442/205/4.8+0.21 ms cpu, 388->392->231 MB, 405 MB goal, 2 P
gc 27 @83.890s 3%: 0.030+475+0.12 ms clock, 0.060+123/241/262+0.25 ms cpu, 442->458->227 MB, 462 MB goal, 2 P
gc 28 @88.984s 3%: 4.0+331+0.072 ms clock, 8.0+389/173/17+0.14 ms cpu, 437->452->239 MB, 455 MB goal, 2 P
gc 29 @93.960s 4%: 0.041+413+0.059 ms clock, 0.083+324/207/7.0+0.11 ms cpu, 419->465->256 MB, 479 MB goal, 2 P
gc 30 @98.790s 4%: 0.37+687+0.088 ms clock, 0.75+90/352/0+0.17 ms cpu, 418->494->279 MB, 512 MB goal, 2 P
gc 31 @104.682s 3%: 2.1+482+0.089 ms clock, 4.2+103/247/49+0.17 ms cpu, 459->507->275 MB, 559 MB goal, 2 P
gc 32 @113.348s 3%: 2.6+509+0.062 ms clock, 5.2+3.4/248/368+0.12 ms cpu, 462->475->218 MB, 550 MB goal, 2 P
gc 33 @119.196s 3%: 0.047+391+0.077 ms clock, 0.094+0.67/195/300+0.15 ms cpu, 396->405->216 MB, 436 MB goal, 2 P
gc 34 @125.229s 3%: 1.7+460+0.038 ms clock, 3.5+2.9/232/221+0.077 ms cpu, 408->415->238 MB, 433 MB goal, 2 P
gc 35 @135.701s 3%: 4.0+468+0.039 ms clock, 8.0+269/237/94+0.078 ms cpu, 459->470->220 MB, 476 MB goal, 2 P
gc 36 @142.355s 3%: 0.037+568+0.069 ms clock, 0.075+126/284/26+0.13 ms cpu, 422->435->248 MB, 440 MB goal, 2 P
scvg0: inuse: 495, idle: 101, sys: 596, released: 0, consumed: 596 (MB)
gc 37 @151.456s 3%: 0.091+402+0.057 ms clock, 0.18+33/202/362+0.11 ms cpu, 474->480->217 MB, 496 MB goal, 2 P
gc 38 @160.788s 3%: 0.078+444+0.077 ms clock, 0.15+393/225/42+0.15 ms cpu, 422->430->222 MB, 434 MB goal, 2 P
gc 39 @170.674s 3%: 4.0+501+0.11 ms clock, 8.1+276/255/134+0.22 ms cpu, 426->438->227 MB, 444 MB goal, 2 P
gc 40 @178.975s 3%: 0.037+264+0.085 ms clock, 0.074+304/136/5.0+0.17 ms cpu, 436->450->243 MB, 454 MB goal, 2 P
gc 41 @187.694s 3%: 0.14+655+0.056 ms clock, 0.28+23/331/395+0.11 ms cpu, 444->462->227 MB, 486 MB goal, 2 P
gc 42 @197.907s 3%: 0.079+575+0.11 ms clock, 0.15+145/289/219+0.23 ms cpu, 426->444->229 MB, 454 MB goal, 2 P
gc 43 @204.584s 3%: 4.1+529+0.11 ms clock, 8.2+70/276/345+0.22 ms cpu, 430->441->223 MB, 459 MB goal, 2 P
gc 44 @214.030s 3%: 0.067+538+0.055 ms clock, 0.13+256/277/198+0.11 ms cpu, 426->439->225 MB, 446 MB goal, 2 P
gc 45 @224.505s 3%: 0.21+521+0.15 ms clock, 0.42+315/263/114+0.30 ms cpu, 428->442->229 MB, 451 MB goal, 2 P
gc 46 @234.125s 3%: 0.28+443+0.059 ms clock, 0.56+3.1/222/421+0.11 ms cpu, 431->437->222 MB, 458 MB goal, 2 P
gc 47 @240.497s 3%: 0.057+561+0.24 ms clock, 0.11+177/284/184+0.48 ms cpu, 429->440->250 MB, 445 MB goal, 2 P
gc 48 @246.455s 3%: 0.098+552+0.12 ms clock, 0.19+47/282/330+0.25 ms cpu, 482->493->251 MB, 501 MB goal, 2 P
gc 49 @254.984s 2%: 0.065+357+0.087 ms clock, 0.13+17/183/373+0.17 ms cpu, 487->492->223 MB, 503 MB goal, 2 P
gc 50 @261.595s 3%: 0.037+452+0.10 ms clock, 0.075+321/230/142+0.21 ms cpu, 435->443->228 MB, 446 MB goal, 2 P
gc 51 @267.769s 3%: 0.029+510+0.051 ms clock, 0.058+244/261/203+0.10 ms cpu, 440->452->232 MB, 456 MB goal, 2 P
gc 52 @275.186s 3%: 0.067+830+0.22 ms clock, 0.13+217/415/82+0.45 ms cpu, 446->459->258 MB, 465 MB goal, 2 P
gc 53 @286.604s 3%: 0.085+512+0.055 ms clock, 0.17+122/256/379+0.11 ms cpu, 494->502->230 MB, 517 MB goal, 2 P
gc 54 @294.859s 3%: 3.7+370+0.085 ms clock, 7.5+344/190/118+0.17 ms cpu, 447->450->250 MB, 461 MB goal, 2 P
scvg1: inuse: 461, idle: 146, sys: 607, released: 0, consumed: 607 (MB)
gc 55 @301.574s 3%: 0.039+654+0.15 ms clock, 0.079+320/331/60+0.31 ms cpu, 486->497->259 MB, 501 MB goal, 2 P
gc 56 @313.050s 3%: 0.055+480+0.10 ms clock, 0.11+452/244/46+0.20 ms cpu, 499->511->238 MB, 518 MB goal, 2 P
gc 57 @323.153s 3%: 4.0+548+0.075 ms clock, 8.1+184/275/296+0.15 ms cpu, 452->463->239 MB, 476 MB goal, 2 P
gc 58 @330.111s 3%: 0.010+0+341 ms clock, 0.021+184/275/296+682 ms cpu, 365->365->227 MB, 365 MB goal, 2 P (forced)
gc 59 @336.184s 3%: 2.8+638+0.067 ms clock, 5.6+380/321/11+0.13 ms cpu, 434->443->260 MB, 454 MB goal, 2 P
gc 60 @347.341s 3%: 4.0+664+0.098 ms clock, 8.0+324/338/228+0.19 ms cpu, 498->512->244 MB, 520 MB goal, 2 P
gc 61 @358.119s 3%: 0.060+612+0.14 ms clock, 0.12+358/306/177+0.29 ms cpu, 466->481->247 MB, 489 MB goal, 2 P
gc 62 @368.812s 3%: 0.027+770+0.070 ms clock, 0.054+188/386/412+0.14 ms cpu, 467->485->250 MB, 495 MB goal, 2 P
gc 63 @375.551s 3%: 0.041+0+360 ms clock, 0.082+188/386/412+720 ms cpu, 393->393->238 MB, 393 MB goal, 2 P (forced)
gc 64 @384.918s 3%: 0.052+504+0.095 ms clock, 0.10+556/252/0+0.19 ms cpu, 451->461->254 MB, 477 MB goal, 2 P
gc 65 @393.651s 3%: 0.052+781+0.080 ms clock, 0.10+613/392/8.2+0.16 ms cpu, 478->498->256 MB, 509 MB goal, 2 P
gc 66 @401.559s 3%: 0.095+566+0.098 ms clock, 0.19+10/291/546+0.19 ms cpu, 470->481->248 MB, 512 MB goal, 2 P
gc 67 @412.234s 3%: 0.065+502+0.066 ms clock, 0.13+14/254/509+0.13 ms cpu, 470->478->248 MB, 496 MB goal, 2 P
gc 68 @423.118s 3%: 4.1+564+0.11 ms clock, 8.2+200/283/391+0.22 ms cpu, 478->489->249 MB, 496 MB goal, 2 P
gc 69 @434.298s 3%: 0.053+472+0.087 ms clock, 0.10+1.5/240/550+0.17 ms cpu, 481->485->246 MB, 498 MB goal, 2 P

st ov

Apr 21, 2017, 11:58:06 AM
to golang-nuts, les...@gmail.com
@Jesper, curious how you determined that? 
Is it in the spec? or the source?
Or is this a common GC pattern that you presume is being used?

Jesper Louis Andersen

Apr 21, 2017, 1:02:12 PM
to les...@gmail.com, golang-nuts
On Fri, Apr 21, 2017 at 5:22 PM <les...@gmail.com> wrote:
Managed to find time to run it quickly before the end of UK time...

gc 62 @368.812s 3%: 0.027+770+0.070 ms clock, 0.054+188/386/412+0.14 ms cpu, 467->485->250 MB, 495 MB goal, 2 P
gc 63 @375.551s 3%: 0.041+0+360 ms clock, 0.082+188/386/412+720 ms cpu, 393->393->238 MB, 393 MB goal, 2 P (forced)


rlh@ is the official authoritative source here, but I have a hunch about the above:

Somewhere in the program, you are calling runtime.GC(). Its documentation says it may block the entire program. If you look at gc 63, it is forced, which means it is a call to runtime.GC(). It also spends 360ms in mark termination compared to 0.070 for a normally invoked GC.
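
For readers following along: per the Go 1.8 'package runtime' documentation, a gctrace line has the shape

gc # @#s #%: #+#+# ms clock, #+#/#/#+# ms cpu, #->#-># MB, # MB goal, # P

where the three clock numbers are wall-clock times for the stop-the-world sweep termination, concurrent mark and scan, and stop-the-world mark termination phases. So 0.041+0+360 in gc 63 above means essentially all 360 ms were spent with the world stopped in mark termination.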

One possible avenue is to remove your own forced GC calls and let the runtime itself manage them, then see if that helps the program's behavior w.r.t. stalls. The 3% spent in the collector suggests you have good productivity in the mutator. And every forced GC has a long pause, which coincides with your original graph.

Note that the decision to make the runtime.GC() call may be a couple of releases old, when the behavior of that call was different. So you may have hit a situation where a good decision then is a bad decision now.

Jesper Louis Andersen

Apr 21, 2017, 1:08:02 PM
to st ov, golang-nuts, les...@gmail.com
On Fri, Apr 21, 2017 at 5:58 PM st ov <so.q...@gmail.com> wrote:
@Jesper, curious how you determined that? 
Is it in the spec? or the source?
Or is this a common GC pattern that you presume is being used?


There are a couple of design documents by Clements and Hudson which are worth reading. They are well written, and they do a good job at acknowledging both what problems are solved and what problems are left to solve.

Another source is this very mailing list. You can read what problems people have had and then try to form a hypothesis around it.

Finally, I have an acute interest in systems with low-latency soft-realtime properties. Especially when those systems use garbage collectors rather than manual memory management. This can sometimes inform you about typical "pain points" in the system designs.

It turns out my initial hypothesis must be rejected, however. The gctrace=1 output tells an entirely different story, because normal collections are not getting delayed in the mark termination phase, which is what my hypothesis predicted. I cannot stress enough how important it is to measure. As for understanding the gctrace=1 output, go with the documentation of 'package runtime' rather than older blog posts: the format is release-specific and thus likely to change a little bit. Keep this in mind when you read older posts. That said, some of the posts by Dave Cheney seem good because they also tell you a bit of background about what to look out for.

les...@gmail.com

Apr 21, 2017, 1:11:25 PM
to golang-nuts, so.q...@gmail.com, les...@gmail.com
Thanks again all,

Interestingly, I noticed the same, and my code does not call runtime.GC(), so that is a good spot.

I wonder if something I have imported is doing so, I will audit all of the imports and take a look!

les...@gmail.com

Apr 21, 2017, 1:13:55 PM
to golang-nuts, so.q...@gmail.com, les...@gmail.com
Anyone think of an easy way to search for its usage? 😆

les...@gmail.com

Apr 21, 2017, 1:22:48 PM
to golang-nuts, so.q...@gmail.com, les...@gmail.com
I cheated: I used godep to put everything into a /vendor folder and then did a quick search.

Turns out the only calls are in the stackimpact.com code!  https://github.com/stackimpact/stackimpact-go/search?utf8=✓&q=runtime.GC%28%29&type=

I've reached out to them but any insight there would be handy! :D

les...@gmail.com

Apr 21, 2017, 2:51:17 PM
to golang-nuts
Interestingly stackimpact.com just updated their code to remove the runtime.GC() calls.

It has made a HUGE difference to the GC pauses.

The code was updated just before 19:30.

Interesting that the manual call had such an impact!

Keith Randall

Apr 21, 2017, 3:52:55 PM
to golang-nuts
It is almost never a good idea to call runtime.GC explicitly.
It does block until a garbage collection completes. This behavior is sometimes useful in tests, but almost never otherwise. If it weren't for Go 1 compatibility, we'd rename this function to something that more clearly spells out its blocking behavior.
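
A tiny sketch that makes the blocking visible; the heap size is arbitrary:

package main

import (
	"fmt"
	"runtime"
	"time"
)

// Demonstrates that runtime.GC() blocks its caller until a collection
// finishes. The more live heap there is, the longer the call can block.
func main() {
	live := make([][]byte, 4096)
	for i := range live {
		live[i] = make([]byte, 64<<10) // keep roughly 256 MB live
	}
	start := time.Now()
	runtime.GC() // blocks until the forced collection completes
	fmt.Printf("runtime.GC() blocked for %v; %d buffers still live\n",
		time.Since(start), len(live))
}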

r...@golang.org

Apr 21, 2017, 5:11:53 PM
to golang-nuts
Lee,  
As far as I can tell this is resolved. Thanks for the discussion and for working with stackimpact to fix the root cause.

Lee Armstrong

Apr 21, 2017, 5:22:30 PM
to golang-nuts, r...@golang.org
Yes, it is sorted, thank you, and it has been a super useful discussion.

I had no idea of the gravity of calling runtime.GC(), and it was a bit of a rabbit hole to get to the bottom of too, as it's not in my code.

Appreciate all the replies!

Lee




st ov

Apr 21, 2017, 6:51:36 PM
to golang-nuts, so.q...@gmail.com, les...@gmail.com
Amazing, thank you for that detailed breakdown!

andrey mirtchovski

Apr 21, 2017, 6:55:22 PM
to st ov, golang-nuts, les...@gmail.com
> Amazing, thank you for that detailed breakdown!

When one does something for a long time and understands the system well, they develop an intuition for where the problems may lie. This is true in every profession...