Garbage collection stops the world for about 10 seconds


Jingcheng Zhang

Nov 17, 2012, 2:29:02 AM
to golan...@googlegroups.com
Hello everyone,

Our business has suffered from an annoying problem. We are developing an
iMessage-like service in Go. The server can serve hundreds of thousands
of concurrent TCP connections per process, and it's robust (it has been
running for about a month), which is awesome. However, the process
quickly consumes 16GB of memory: since there are so many connections,
there are also a lot of goroutines and buffered memory in use. I
extended the memory limit to 64GB by changing runtime/malloc.h and
runtime/malloc.goc. It works, but it brings a big problem too: the
garbage collection process is then extremely slow. It stops the world
for about 10 seconds every 2 minutes, which causes problems that are
very hard to trace; for example, while the world is stopped, messages
being delivered may be lost. This is a disaster, since ours is a
real-time service that requires delivering messages as fast as
possible, with no stops and no message loss at all.

I'm planning to split the "big server process" into many "small
processes" to avoid this problem (a smaller memory footprint results in
a shorter stop), and to wait for Go's new GC implementation.

Or do you have any suggestions for improving our service now? I don't
know when Go's new low-latency garbage collector will arrive.

Thanks.

--
Best regards,
Jingcheng Zhang
Beijing, P.R.China

Christoph Hack

Nov 17, 2012, 3:00:29 AM
to golan...@googlegroups.com
Avoid the garbage in the first place. For example, instead of allocating and returning a new string in the String() method of your objects, you might want to implement the WriteTo method (or a similar interface). The standard library doesn't produce much garbage, so it's probably your program that allocates all those objects. Use the memory profiler to find those parts.

-christoph

Dmitry Vyukov

Nov 17, 2012, 3:13:58 AM
to golan...@googlegroups.com
On Saturday, November 17, 2012 12:00:29 PM UTC+4, Christoph Hack wrote:
Avoid the garbage in the first place. So, for example instead of allocating and returning a new strings in the String() methods of your object, you might want to implement the WriteTo method (or a similar interface). The standard library doesn't produce much garbage, so it's probably your program that allocates all those objects. Use the memory profiler to find those parts.


This won't help. GC duration is not a function of garbage generation speed.
GC frequency is a function of garbage generation speed. So instead of 10sec every 2min, you can get 10sec every 4min. But I guess it won't solve the problem.

Rémy Oudompheng

Nov 17, 2012, 3:19:25 AM
to Dmitry Vyukov, golan...@googlegroups.com
2 minutes is the interval of the scavenger's forced GC. Maybe your
application doesn't need it, and you can increase the hardcoded
interval to something larger.

But you should definitely profile and reduce the number of objects you
need (go tool pprof --inuse_objects).

Rémy.

⚛

Nov 17, 2012, 6:42:09 AM
to golan...@googlegroups.com
On Saturday, November 17, 2012 8:29:12 AM UTC+1, Jingcheng Zhang wrote:
Hello everyone,

Our business suffered from an annoying problem. We are developing an
iMessage-like service in Go, the server can serves hundreds of
thousands of concurrent TCP connection per process, and it's robust
(be running for about a month), which is awesome. However, the process
consumes 16GB memory quickly, since there are so many connections,
there are also a lot of goroutines and buffered memories used. I
extend the memory limit to 64GB by changing runtime/malloc.h and
runtime/malloc.goc. It works, but brings a big problem too - The
garbage collecting process is then extremely slow, it stops the world
for about 10 seconds every 2 minutes, and brings me some problems
which are very hard to trace, for example, when stoping the world,
messages delivered may be lost. This is a disaster, since our service
is a real-time service which requires delivering messages as fast as
possible and there should be no stops and message lost at all.

I'm planning to split the "big server process" to many "small
processes" to avoid this problem (smaller memory footprint results to
smaller time stop), and waiting for Go's new GC implementation.

1. There haven't been reports about major memory leaks on 64-bit CPUs. Are you suspecting that the application may be generating memory leaks on a 64-bit CPU?

2. The performance of the newer GC implementation (CL 6114046, in respect to the current GC) depends on the structure of data existing in the heap. For example, if all structs in the heap contain pointer fields and no integer fields, then the new GC is slower than the current one. This slowdown is unavoidable, because in this case the new GC is processing more bits of information per machine word. On the other hand, if the structs contain a mix of field types (pointers, integers, etc), the new GC may be faster.   So, the new GC may be faster or slower than the current one. In the cases I have seen so far, the performance in regular Go programs is approximately the same.

Dmitry Vyukov

Nov 17, 2012, 6:49:21 AM
to ⚛, golang-nuts
I am curious: have you considered reordering fields in structs so that pointers are packed together? I understand that it's impossible in the general case (when there are sub-structs), but otherwise I think it's OK to reorder fields arbitrarily. Then the metainfo could say: in this object of size 128, scan only the first 4 words. This, of course, complicates things.


⚛

Nov 17, 2012, 7:50:23 AM
to golan...@googlegroups.com, ⚛
On Saturday, November 17, 2012 12:49:40 PM UTC+1, Dmitry Vyukov wrote:
I am curious have you considered reordering fields in structs so that pointers are packed together? I understand that it's impossible in general case (when there are sub-structs), but in otherwise I think it's OK to arbitrary reorder fields. Then you can say in the metainfo -- in this object of size 128 scan only 4 words. This is, of course, complicates things.

There is cgo, so reordering fields is in general forbidden. The runtime also shares certain types between Go and C.

In my opinion, if GC already knows where the pointer fields are in the struct (knows their byte-offsets in the struct) then reordering the fields will not improve performance.

One reason for this is that in any case the GC needs to check the value of every pointer in the struct. The performance gain obtained by reordering fields seems negligible in comparison to checking the values.

A second reason is that while walking the heap each step ("instruction") in the GC implementation should generate a finite number of pointers. Typically, 1 step generates at most 1 pointer that needs to be checked. There should be an upper bound on the number of pointers generated in one step, so considering N consecutive words in one step would be problematic if N can be an arbitrary number.

Each pointer needs to be considered separately and there is absolutely no relation between any two pointers P and Q. This uncertainty appears to be the main reason why reordering fields wouldn't make the GC much faster.

One area where reordering fields would help is when the structure is big, the non-pointer fields in the structure are forming a gap, and this gap is putting pointer fields on different cachelines.

Dmitry Vyukov

Nov 17, 2012, 8:06:49 AM
to ⚛, golang-nuts
I was thinking about an ideal case where you have only one instruction that says "the first N fields are pointers", and you just need a single for loop over 1..N. It should be not much slower than the current non-precise GC when structs contain only pointers.
But I think your analysis is correct.

Do I get it right that with type info there is no need to check whether a pointer points to an allocated chunk of memory? It can be either 0 or a valid pointer to a valid allocated memory chunk; no other cases are possible, right?
And additionally, you know the size of the pointed-to object: it's either determined by the type, or, if it's a slice, the size is in the subsequent word. Right?

⚛

Nov 17, 2012, 9:22:55 AM
to golan...@googlegroups.com, ⚛
It can be a pointer from C, or if the GC does not know the actual type of the object then it may be an integer (not a pointer).

And additionally you know size of the pointed-to object -- it's either determined by type, or if it's a slice then size is in the subsequent word. Right?

In general, no. The actual size of an object is unknown to GC, because in some cases it cannot be inferred from the type of the pointer. There are however cases when the GC knows the actual type of the object and thus knows the actual size.

The GC code is robust in the sense that it will work even if there is no type information available about any object. If some type information is available, GC will use it and may be able to free more objects.

It is impossible to decide whether to prefer the partial typeinfo about an object or the full typeinfo. Getting the full typeinfo is more costly than getting the partial one. In some cases the full typeinfo isn't available at all.   The GC implementation is primarily using the partial typeinfo and tries to retrieve the full typeinfo only if something goes wrong.

The length and capacity of a slice are insufficient to determine where the underlying array starts or ends. However, the knowledge that the object is a slice can be used to determine the full typeinfo of the slice's element (and thus the element's actual size). If the GC sees a pointer (Go type: *T) to any part of the underlying array before seeing that the block is actually of type [N]U, it will retrieve the full typeinfo U, provided U can be determined and T is insufficient. Typeinfo T may be sufficient to process the whole array.

The easiest values to GC are for example Go's maps because they are completely self-contained and it is impossible to get a pointer to their interior.

Dmitry Vyukov

Nov 17, 2012, 9:44:40 AM
to ⚛, golang-nuts
I see. Thanks! 

Dmitry Vyukov

Nov 17, 2012, 11:25:59 AM
to ⚛, golang-nuts
Hmm... can't type info help with the following issue?


Basically, given an address in the heap or in .data/.bss, I need to output the variable name; it can be best-effort.

Sugu Sougoumarane

Nov 17, 2012, 1:37:50 PM
to golan...@googlegroups.com
For vtocc (vitess), we measured an overhead of about 40K per connection, so 16G sounds a little high even for 100k connections. You may want to profile your memory to get a better picture of what's going on. We typically run anywhere between 5-20k connections, and rarely exceed 1G.
Are you using Go 1? If so, you should try out a newer build with parallel GC. It should give you a speedup proportional to the number of CPUs you have.
If most of your memory is due to large buffer sizes, you should lower GOGC (try 50?). This will cause the garbage collector to run more often, but with shorter pauses; the pauses stay short because the GC does not scan inside byte slices.

Jingcheng Zhang

Nov 18, 2012, 8:44:54 PM
to Dmitry Vyukov, golan...@googlegroups.com

Thanks for your explanation. So what determines the GC duration? The memory arena size? I changed the arena limit to 64GB; before the change, the GC was fast to complete. It is fine for our business to stop for about 2 seconds, but bad to stop for 10 seconds.


Jingcheng Zhang

Nov 18, 2012, 8:59:01 PM
to Rémy Oudompheng, golan...@googlegroups.com
On Sat, Nov 17, 2012 at 4:19 PM, Rémy Oudompheng
<remyoud...@gmail.com> wrote:
> On 2012/11/17 Dmitry Vyukov <dvy...@google.com> wrote:
>> On Saturday, November 17, 2012 12:00:29 PM UTC+4, Christoph Hack wrote:
>>>
>>> Avoid the garbage in the first place. So, for example instead of
>>> allocating and returning a new strings in the String() methods of your
>>> object, you might want to implement the WriteTo method (or a similar
>>> interface). The standard library doesn't produce much garbage, so it's
>>> probably your program that allocates all those objects. Use the memory
>>> profiler to find those parts.
>>>
>>
>> This won't help. GC duration is not a function of garbage generation speed.
>> GC frequency is a function of garbage generation speed. So instead of 10sec
>> every 2min, you can get 10sec every 4min. But I guess it won't solve the
>> problem.
>
> 2minutes is the frequency of the scavenger froced GC. Maybe your
> application doesn't need it and you can increase the hardcoded
> frequency for something larger.

I set GOGC=200 to reduce the GC frequency, but it seems the scavenger
triggered the GC before GOGC=200 could take effect.
I will give this a try.

>
> But you should definitely profile and reduce the number of objects you
> are needing (go tool pprof --inuse_objects).

Thanks, we are working on this too, but haven't tried the pprof.

>
> Rémy.
>
> --

Jingcheng Zhang

Nov 18, 2012, 9:31:44 PM
to ⚛, golan...@googlegroups.com
On Sat, Nov 17, 2012 at 7:42 PM, ⚛ <0xe2.0x...@gmail.com> wrote:
> On Saturday, November 17, 2012 8:29:12 AM UTC+1, Jingcheng Zhang wrote:
>>
>> Hello everyone,
>>
>> Our business suffered from an annoying problem. We are developing an
>> iMessage-like service in Go, the server can serves hundreds of
>> thousands of concurrent TCP connection per process, and it's robust
>> (be running for about a month), which is awesome. However, the process
>> consumes 16GB memory quickly, since there are so many connections,
>> there are also a lot of goroutines and buffered memories used. I
>> extend the memory limit to 64GB by changing runtime/malloc.h and
>> runtime/malloc.goc. It works, but brings a big problem too - The
>> garbage collecting process is then extremely slow, it stops the world
>> for about 10 seconds every 2 minutes, and brings me some problems
>> which are very hard to trace, for example, when stoping the world,
>> messages delivered may be lost. This is a disaster, since our service
>> is a real-time service which requires delivering messages as fast as
>> possible and there should be no stops and message lost at all.
>>
>> I'm planning to split the "big server process" to many "small
>> processes" to avoid this problem (smaller memory footprint results to
>> smaller time stop), and waiting for Go's new GC implementation.
>
>
> 1. There haven't been reports about major memory leaks on 64-bit CPUs. Are
> you suspecting that the application may be generating memory leaks on a
> 64-bit CPU?
>

Runtime is stable, no memory leaks, our server processes all have
uptime > 1 month.

> 2. The performance of the newer GC implementation (CL 6114046, in respect to
> the current GC) depends on the structure of data existing in the heap. For
> example, if all structs in the heap contain pointer fields and no integer
> fields, then the new GC is slower than the current one. This slowdown is
> unavoidable, because in this case the new GC is processing more bits of
> information per machine word. On the other hand, if the structs contain a
> mix of field types (pointers, integers, etc), the new GC may be faster.
> So, the new GC may be faster or slower than the current one. In the cases I
> have seen so far, the performance in regular Go programs is approximately
> the same.
>

The CL looks big. Does "precise GC" mean "latency-free GC"? Or is there
still room for improvement between precise GC and latency-free GC (one
of Go's goals)?

>>
>> Or any suggestions for me to improve our service currently? I don't
>> know when Go's new latency-free garbage collection will occur.
>>
>> Thanks.
>>
>> --
>> Best regards,
>> Jingcheng Zhang
>> Beijing, P.R.China
>
> --

Jingcheng Zhang

Nov 18, 2012, 9:44:33 PM
to Sugu Sougoumarane, golan...@googlegroups.com
On Sun, Nov 18, 2012 at 2:37 AM, Sugu Sougoumarane <sou...@google.com> wrote:
> For vtocc (vitess), we measured an overhead of about 40K per connection. So,
> 16G sounds a little high, even for 100k connections. You may want to profile
> your memory to get a better picture of what's going on. We typically run
> anywhere betwen 5-20k connections, and rarely exceed 1G.
> Are you using Go 1? If so, you should try out a newer build with parallel
> GC. It should give you a speed up proportional to the number CPUs you have.
> If most of your memory is due to large buffer sizes, you should tone down
> GOGC lower (try 50?). This will cause the garbage collector to run more
> often, with shorter pauses. This is because the GC does not scan inside byte
> slices.

Currently we serve 600,000 concurrent, keep-alive TCP connections per
process. The process consumes 16GB of resident memory, so about 28KB
per connection.
The Go version is 1.0.3, amd64, with GOGC set to 200.

I'll tune GOGC and the scavenger's GC frequency to see if there is any
room for improvement besides code optimization.
Thanks for your help.

Ian Lance Taylor

Nov 18, 2012, 10:23:52 PM
to Jingcheng Zhang, Dmitry Vyukov, golan...@googlegroups.com
On Sun, Nov 18, 2012 at 5:44 PM, Jingcheng Zhang <dio...@gmail.com> wrote:
>
> Thanks for your explanation, so what determines the GC duration? The memory
> arena size? I changed the arena limit to 64GB, before the change it is fast
> to complete the GC. It is fine for our business to stop for about 2 seconds
> but bad to stop for 10 seconds.

The time it takes to run a GC is approximately proportional to the
size of live memory that may contain pointers. The total size of the
memory arena has a relatively small effect on the time it takes to run
a GC.

Ian

Ian Lance Taylor

Nov 18, 2012, 10:25:59 PM
to Jingcheng Zhang, ⚛, golan...@googlegroups.com
On Sun, Nov 18, 2012 at 6:31 PM, Jingcheng Zhang <dio...@gmail.com> wrote:
>
> The CL looks big. Does "precise GC" means "latency-free GC"? Or there
> is still improvement space between precise GC and latency-free GC (One
> of Go's goal)?

Precise GC does not mean latency-free GC. It means a GC where only
genuine pointers are considered. The opposite of precise GC is
conservative GC, which is approximately what the Go runtime has now: a
value that looks like a valid pointer value is treated as a valid
pointer, even though in reality it may actually be, for example, a
floating point number or a string. With precise GC, floating point or
string values are never treated as pointers; only pointers are treated
as pointers.

Ian

Sugu Sougoumarane

Nov 18, 2012, 11:37:16 PM
to golan...@googlegroups.com, Sugu Sougoumarane


On Sunday, November 18, 2012 6:44:45 PM UTC-8, Jingcheng Zhang wrote:
On Sun, Nov 18, 2012 at 2:37 AM, Sugu Sougoumarane <sou...@google.com> wrote:
> For vtocc (vitess), we measured an overhead of about 40K per connection. So,
> 16G sounds a little high, even for 100k connections. You may want to profile
> your memory to get a better picture of what's going on. We typically run
> anywhere betwen 5-20k connections, and rarely exceed 1G.
> Are you using Go 1? If so, you should try out a newer build with parallel
> GC. It should give you a speed up proportional to the number CPUs you have.
> If most of your memory is due to large buffer sizes, you should tone down
> GOGC lower (try 50?). This will cause the garbage collector to run more
> often, with shorter pauses. This is because the GC does not scan inside byte
> slices.

Currently we serve 600,000 concurrent, keep-alive TCP connections, per
process. The process consumes 16GB res memory, so each connection
28KB.
Go version is 1.0.3, amd64, with GOGC set to 200.

I'll tune GOGC and Scavenger's GC frequency to see if there are any
space to improve beside of code optimization.
Thanks for your help.

600k is a lot of connections :). However, a pause time of 10 seconds seems suspicious for 16G. It should be in the ballpark of 1-2 seconds for an 8-core box. This makes me think that 1.0.3 doesn't have the parallel GC improvements. I assume you have GOMAXPROCS set correctly.

Dmitry Vyukov

Nov 18, 2012, 11:55:15 PM
to Sugu Sougoumarane, golang-nuts
On Mon, Nov 19, 2012 at 8:37 AM, Sugu Sougoumarane <sou...@google.com> wrote:
On Sun, Nov 18, 2012 at 2:37 AM, Sugu Sougoumarane <sou...@google.com> wrote: 
> For vtocc (vitess), we measured an overhead of about 40K per connection. So,
> 16G sounds a little high, even for 100k connections. You may want to profile
> your memory to get a better picture of what's going on. We typically run
> anywhere betwen 5-20k connections, and rarely exceed 1G.
> Are you using Go 1? If so, you should try out a newer build with parallel
> GC. It should give you a speed up proportional to the number CPUs you have.
> If most of your memory is due to large buffer sizes, you should tone down
> GOGC lower (try 50?). This will cause the garbage collector to run more
> often, with shorter pauses. This is because the GC does not scan inside byte
> slices.

Currently we serve 600,000 concurrent, keep-alive TCP connections, per
process. The process consumes 16GB res memory, so each connection
28KB.
Go version is 1.0.3, amd64, with GOGC set to 200.

I'll tune GOGC and Scavenger's GC frequency to see if there are any
space to improve beside of code optimization.
Thanks for your help.

600k is a lot of connections :). However, a pause time of 10 seconds seems suspicious for 16G. It should be in the ballpark of 1-2 seconds for an 8-core box. This makes me think that 1.0.3 doesn't have the parallel GC improvements. I assume you have GOMAXPROCS set correctly.


Yes, Go 1.0.3 does not have the parallel GC improvements. With the improved GC and GOMAXPROCS=8, the pause can drop to 2 seconds.


Jingcheng Zhang

Nov 19, 2012, 2:07:26 AM
to Ian Lance Taylor, golan...@googlegroups.com
Thanks, Ian, for your explanation. So after precise GC, there should be
another improvement to make it latency-free (ultimately, a precise,
parallel, latency-free GC), right?

Rémy Oudompheng

Nov 19, 2012, 2:12:54 AM
to Jingcheng Zhang, Sugu Sougoumarane, golan...@googlegroups.com
On 2012/11/19 Jingcheng Zhang <dio...@gmail.com> wrote:
> Currently we serve 600,000 concurrent, keep-alive TCP connections, per
> process. The process consumes 16GB res memory, so each connection
> 28KB.
> Go version is 1.0.3, amd64, with GOGC set to 200.
>
> I'll tune GOGC and Scavenger's GC frequency to see if there are any
> space to improve beside of code optimization.
> Thanks for your help.
>

Resident memory is not an accurate way of measuring your process's
memory usage. Can you run the server with GOGCTRACE=1 and post the GC
statistics that come out?

Please also run "go tool pprof --inuse_objects
http://myserver:myport/debug/pprof/heap" on your server with
net/http/pprof enabled (or get memory profiling by other means). It is
really essential to obtain improvements.

Rémy Oudompheng

Nov 19, 2012, 2:15:11 AM
to Jingcheng Zhang, Ian Lance Taylor, golan...@googlegroups.com
On 2012/11/19 Jingcheng Zhang <dio...@gmail.com> wrote:
> Thanks Ian for your explanation, so after precise GC, there should be
> another improvement exist to make it latency-free (ultimately, a
> precise, parallel, latency-free GC), right?

It has been discussed, but there is no plan as far as I know. It has
been estimated that it would take months (probably a year) before a
usable version came out (if someone works on the subject, of course).

Rémy.

Jingcheng Zhang

Nov 19, 2012, 2:16:07 AM
to Sugu Sougoumarane, golan...@googlegroups.com
Yes, I set GOMAXPROCS with:

runtime.GOMAXPROCS(runtime.NumCPU())

But runtime.NumCPU() is 24 on our box, while Go's MaxGcproc is 8 by default.

Jingcheng Zhang

Nov 19, 2012, 2:20:35 AM
to Dmitry Vyukov, golang-nuts, brad...@golang.org
I heard that dl.google.com is using the tip of Go, but I am worried about its stability.
Could Brad tell which revision is dl.google.com currently using?

Thanks very much!

Dave Cheney

Nov 19, 2012, 2:21:54 AM
to Jingcheng Zhang, golang-nuts, Brad Fitzpatrick, Dmitry Vyukov

Tip.

David Symonds

Nov 19, 2012, 2:26:43 AM
to Jingcheng Zhang, Dmitry Vyukov, golang-nuts, brad...@golang.org
On Mon, Nov 19, 2012 at 6:20 PM, Jingcheng Zhang <dio...@gmail.com> wrote:

> I heard that dl.google.com is using tip of Go, but I am afraid of the stability.
> Could Brad tell which revision is dl.google.com currently using?

It's not exactly tip, but it's pretty close, and it's got almost all
the changes that would cause stability concerns (GC work, etc.).


Dave.

Jingcheng Zhang

Nov 19, 2012, 2:35:47 AM
to Rémy Oudompheng, golan...@googlegroups.com
Would enabling this kill the server's performance?
If not, I'll try it on one of the servers.

Thanks!

Jingcheng Zhang

Nov 19, 2012, 2:45:04 AM
to David Symonds, golang-nuts, brad...@golang.org
Does this mean that there is an internal branch of the tip?
Or is it only updated to the current tip when there are changes that
improve stability?

>
> Dave.

Rémy Oudompheng

Nov 19, 2012, 2:50:53 AM
to Jingcheng Zhang, golan...@googlegroups.com
On 2012/11/19 Jingcheng Zhang <dio...@gmail.com> wrote:
> On Mon, Nov 19, 2012 at 3:12 PM, Rémy Oudompheng
> <remyoud...@gmail.com> wrote:
>> On 2012/11/19 Jingcheng Zhang <dio...@gmail.com> wrote:
>> Resident memory is not an accurate way of measuring your process used
>> memory. Can you run the server with GOGCTRACE=1 and give the GC
>> statistics that come out?
>>
>> Please also run "go tool pprof --inuse_objects
>> http://myserver:myport/debug/pprof/heap" on your server with
>> net/http/pprof enabled (or get memory profiling by other means). It is
>> really essential to obtain improvements.
>
> Would enable this behavior kill the server's performance?
> If not, I'll try it on one of the servers.

Memory profiling is enabled by default even if you don't ask for it.
Requesting the memory profile through HTTP can be a costly operation,
but only at the moment you use it, and it's only extra CPU
consumption.

GOGCTRACE=1 enables debugging printing after each GC: it's only a
print of 2 lines at each GC, which is probably much cheaper than the
10 second pause.

Rémy.

Sugu Sougoumarane

Nov 19, 2012, 3:24:49 AM
to golan...@googlegroups.com, David Symonds, brad...@golang.org


Does this mean that there is an internal branch of the tip?
Or only update to current tip when there are some changes improving
the stability?


For the longest time, we've run vtocc on this version of go: 4fdf6aa4f602 from a June snapshot.
We also have other servers that use a more recent snapshot: 024dde07c08d from October.

Both those versions have the parallel GC work. If you're skeptical, you can use the older one. But the newer snapshot may contain other improvements.

⚛

Nov 19, 2012, 3:32:42 AM
to golan...@googlegroups.com, Ian Lance Taylor
On Monday, November 19, 2012 8:07:36 AM UTC+1, Jingcheng Zhang wrote:
Thanks Ian for your explanation, so after precise GC, there should be
another improvement exist to make it latency-free (ultimately, a
precise, parallel, latency-free GC), right?

Maybe. It depends on the data structures the program is using.

Implementing perfect latency-free concurrent GC would cause a slowdown in the overall throughput of many Go programs. So it cannot be said that concurrent latency-free GC is the ultimate goal.

The current GC implementation allows Go programs to avoid GC pauses if the Go code is managing memory allocations and deallocations on its own. That is: objects which are known to be no longer in use can be put into buffers and the buffers will serve forthcoming allocations. However, this may increase the total memory consumption so this approach isn't applicable to all Go programs.
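The buffer-recycling approach described above can be sketched with a buffered channel used as a free list (sync.Pool did not exist yet in 2012; the type name and buffer size here are made up):

```go
package main

import "fmt"

// bufPool is a simple free list: a buffered channel of byte slices.
// get prefers a recycled buffer; put drops the buffer when the pool is
// full, letting the GC reclaim it.
type bufPool chan []byte

func (p bufPool) get() []byte {
	select {
	case b := <-p:
		return b[:0] // reuse the backing array, reset the length
	default:
		return make([]byte, 0, 4096) // pool empty: allocate
	}
}

func (p bufPool) put(b []byte) {
	select {
	case p <- b:
	default: // pool full: let the buffer become garbage
	}
}

func main() {
	pool := make(bufPool, 128)

	b := pool.get()
	b = append(b, "hello"...)
	pool.put(b)

	// The next get returns the same backing array: no new allocation.
	c := pool.get()
	fmt.Println(len(c), cap(c) >= 4096) // prints "0 true"
}
```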

⚛

Nov 19, 2012, 3:51:37 AM
to golan...@googlegroups.com, Sugu Sougoumarane
On Monday, November 19, 2012 3:44:45 AM UTC+1, Jingcheng Zhang wrote:
On Sun, Nov 18, 2012 at 2:37 AM, Sugu Sougoumarane <sou...@google.com> wrote:
> For vtocc (vitess), we measured an overhead of about 40K per connection. So,
> 16G sounds a little high, even for 100k connections. You may want to profile
> your memory to get a better picture of what's going on. We typically run
> anywhere betwen 5-20k connections, and rarely exceed 1G.
> Are you using Go 1? If so, you should try out a newer build with parallel
> GC. It should give you a speed up proportional to the number CPUs you have.
> If most of your memory is due to large buffer sizes, you should tone down
> GOGC lower (try 50?). This will cause the garbage collector to run more
> often, with shorter pauses. This is because the GC does not scan inside byte
> slices.

Currently we serve 600,000 concurrent, keep-alive TCP connections, per
process. The process consumes 16GB res memory, so each connection
28KB.

Within a 5 minute time window, how many of those 600,000 connections are receiving or sending data?

bryanturley

Nov 19, 2012, 10:29:13 AM
to golan...@googlegroups.com
Could manually running the GC more often help in this case? Fewer dead objects to scan, perhaps.

Han-Wen Nienhuys

Nov 19, 2012, 10:38:12 AM
to bryanturley, golan...@googlegroups.com
On Mon, Nov 19, 2012 at 4:29 PM, bryanturley <bryan...@gmail.com> wrote:
> Could manually running the gc more often help in this case? Less dead
> objects to scan perhaps.

Dead objects are not scanned. They are only swept.

--
Han-Wen Nienhuys
Google Munich
han...@google.com

⚛

Nov 19, 2012, 10:55:58 AM
to golan...@googlegroups.com, bryanturley
On Monday, November 19, 2012 4:38:46 PM UTC+1, Han-Wen Nienhuys wrote:
On Mon, Nov 19, 2012 at 4:29 PM, bryanturley <bryan...@gmail.com> wrote:
> Could manually running the gc more often help in this case?  Less dead
> objects to scan perhaps.

Dead objects are not scanned. They are only sweeped.

The sweep phase can be a fairly large part of a GC. It is hard to tell what the exact numbers are without running the server in question as "GOGCTRACE=1 ./server-binary".

bryanturley

Nov 19, 2012, 12:13:04 PM
to golan...@googlegroups.com


On Monday, November 19, 2012 9:38:46 AM UTC-6, Han-Wen Nienhuys wrote:
On Mon, Nov 19, 2012 at 4:29 PM, bryanturley <bryan...@gmail.com> wrote:
> Could manually running the gc more often help in this case?  Less dead
> objects to scan perhaps.

Dead objects are not scanned. They are only sweeped.


Yeah, I meant fewer dead objects to scan *for*.
If a program is making a lot of short-lived allocations, wouldn't scanning/reaping more often lead to less scan time per scan?

Every 2 minutes his code stopped and GC'ed for 10 seconds. What if he GC'ed every 5 seconds? Or every 20 seconds?
A 10-second pause every 2 minutes with tons of short-lived objects could lead to (optimistically) ~5 sec every minute, ~2.5 sec every 30 sec, ~1.25 sec every 15 sec.
It is more likely to be a curve, though, and it would slow the program down overall, but you *might* be able to make it smoother.

Just a guess, though; he would have to try, measure, and see.

Jingcheng Zhang

Nov 21, 2012, 5:49:55 AM
to bryanturley, golan...@googlegroups.com
Hello everyone,

Thanks for all your help. I updated our Go version to:

go version devel +852ee39cc8c4 Mon Nov 19 06:53:58 2012 +1100

and rebuilt our servers. The GC duration is now reduced to 1~2
seconds, which is a big improvement!
Thanks to the contributors to the new GC!

Dave Cheney

Nov 21, 2012, 5:54:37 AM
to Jingcheng Zhang, bryanturley, golan...@googlegroups.com
Fantastic news, Dmitry will be proud.

Anoop K

unread,
Nov 21, 2012, 6:14:06 AM11/21/12
to golan...@googlegroups.com, bryanturley
How much total memory is consumed with the new Go version?

steve wang

unread,
Nov 21, 2012, 10:00:39 AM11/21/12
to golan...@googlegroups.com


On Wednesday, November 21, 2012 6:50:03 PM UTC+8, Jingcheng Zhang wrote:
Hello everyone,

Thanks for all your help, I updated our Go version to:

go version devel +852ee39cc8c4 Mon Nov 19 06:53:58 2012 +1100

and rebuilt our servers, now GC duration reduced to 1~2 seconds, it's
a big improvement!
Is it possible that GC does even better?
One second is still a noticeable interruption when serving game players. 

Ian Lance Taylor

unread,
Nov 21, 2012, 10:08:33 AM11/21/12
to steve wang, golan...@googlegroups.com
On Wed, Nov 21, 2012 at 7:00 AM, steve wang <steve....@gmail.com> wrote:
>
>
> On Wednesday, November 21, 2012 6:50:03 PM UTC+8, Jingcheng Zhang wrote:
>>
>> and rebuilt our servers, now GC duration reduced to 1~2 seconds, it's
>> a big improvement!
>
> Is it possible that GC does even better?
> One second is still a noticeable interruption when serving game players.

Yes, it is possible.

In fact 1-2 seconds is still surprisingly high.

Ian

bryanturley

unread,
Nov 21, 2012, 1:11:12 PM11/21/12
to golan...@googlegroups.com
On Wednesday, November 21, 2012 9:00:40 AM UTC-6, steve wang wrote:


On Wednesday, November 21, 2012 6:50:03 PM UTC+8, Jingcheng Zhang wrote:
Hello everyone,

Thanks for all your help, I updated our Go version to:

go version devel +852ee39cc8c4 Mon Nov 19 06:53:58 2012 +1100

and rebuilt our servers, now GC duration reduced to 1~2 seconds, it's
a big improvement!
Is it possible that GC does even better?
One second is still a noticeable interruption when serving game players. 

Those are GC times on his workload; you would have to measure other workloads yourself.

Dave Cheney

unread,
Nov 21, 2012, 2:52:43 PM11/21/12
to steve wang, golang-nuts

Possibly; the OP has not yet provided the debugging information that was requested.


Dmitry Vyukov

unread,
Nov 23, 2012, 2:50:05 AM11/23/12
to Jingcheng Zhang, bryanturley, golang-nuts
On Wed, Nov 21, 2012 at 2:49 PM, Jingcheng Zhang <dio...@gmail.com> wrote:
Hello everyone,

Thanks for all your help, I updated our Go version to:

go version devel +852ee39cc8c4 Mon Nov 19 06:53:58 2012 +1100

and rebuilt our servers, now GC duration reduced to 1~2 seconds, it's
a big improvement!
Thank contributors on the new GC!


Hi,

How many hardware threads do you have? If you have a huge heap and more than 8 hardware threads, can you try bumping the maximum number of GC worker threads and check whether it improves the pause further?

To do this you need to edit src/pkg/runtime/malloc.h, changing
MaxGcproc = 8,
to
MaxGcproc = 16/32/64,
and then rebuild everything.

I've limited the maximum number of GC threads to 8 because I was testing on a machine with only 8 real cores (16 hardware threads total with HT) and on tests that consume ~300MB. If the heap is, e.g., > 2GB, it may make sense to increase the number of threads further.

Jingcheng Zhang

unread,
Nov 28, 2012, 3:21:00 AM11/28/12
to Dmitry Vyukov, golang-nuts
Hello Dmitry,

Sorry for replying to your mail so late. I noticed this variable before, but I am not sure what will happen if I increase it to 12 or 24
(our server has 24 hardware threads: 2 CPUs, 6 cores per CPU, with HT
support), as it's not exactly 2^N.

Does "proc" in "MaxGcproc" mean the 24 logical cores (with HT support) or
the 12 real cores in our server?
In other words, does "MaxGcproc" distinguish between logical cores with HT
support and real cores?

Thanks,
Jingcheng Zhang

Dmitry Vyukov

unread,
Nov 28, 2012, 3:34:49 AM11/28/12
to Jingcheng Zhang, golang-nuts
On Wed, Nov 28, 2012 at 12:21 PM, Jingcheng Zhang <dio...@gmail.com> wrote:
> Hello Dmitry,
>
> Sorry to reply your mail so late. I noticed this variable before but
> am not sure what will happen if I increase it to 12 or 24
> (our server has 24 hardware threads: 2 CPUs, 6 core per CPU, with HT
> support), as it's not exactly 2^N.
>
> Does "proc" in "MaxGcproc" mean "24 logic cores with HT support" or
> "12 real cores" in our server?
> Or any difference for "MaxGcproc" between logic core with HT support
> and real core?

The Go runtime does not know about hyper-threading; it just requests N
threads from the OS and relies on the OS for thread scheduling and balancing.
Anyway, I think you just need to try different values, e.g. 12, 16,
20, 24, and see what works best for you.

Jingcheng Zhang

unread,
Nov 28, 2012, 3:45:20 AM11/28/12
to Dmitry Vyukov, golang-nuts
I'll try it later, thanks very much!

⚛

unread,
Nov 28, 2012, 3:02:18 PM11/28/12
to shka...@gmail.com, golang-nuts

GOGCTRACE=1 ./executable

On Nov 28, 2012 8:55 PM, <shka...@gmail.com> wrote:
Guys,
What is the best way to measure garbage collection times in GO?
Thanks

On Saturday, November 17, 2012 1:29:12 AM UTC-6, Jingcheng Zhang wrote:
Hello everyone,

Our business suffers from an annoying problem. We are developing an
iMessage-like service in Go; the server can serve hundreds of
thousands of concurrent TCP connections per process, and it's robust
(it has been running for about a month), which is awesome. However, the
process quickly consumes 16GB of memory: since there are so many
connections, there are also a lot of goroutines and buffered memory in
use. I extended the memory limit to 64GB by changing runtime/malloc.h
and runtime/malloc.goc. That works, but it brings a big problem too: the
garbage collector becomes extremely slow, stopping the world for about
10 seconds every 2 minutes, and it causes problems that are very hard to
trace; for example, while the world is stopped, messages being delivered
may be lost. This is a disaster, since ours is a real-time service that
must deliver messages as fast as possible, with no stops and no message
loss at all.

I'm planning to split the "big server process" into many "small
processes" to avoid this problem (a smaller memory footprint results in
a shorter stop), while waiting for Go's new GC implementation.

Or do you have any suggestions for improving our service in the meantime?
I don't know when Go's new latency-free garbage collection will arrive.

Thanks.

--
Best regards,
Jingcheng Zhang
Beijing, P.R.China


bryanturley

unread,
Nov 28, 2012, 3:59:19 PM11/28/12
to golan...@googlegroups.com
On Wednesday, November 28, 2012 2:02:18 PM UTC-6, ⚛ wrote:

GOGCTRACE=1 ./executable


It might help if you tell him what the fields mean exactly (this is from Go 1.0.3; maybe it's less cryptic in tip):

"gc63(4): 0+0+0 ms 1 -> 0 MB 8257 -> 1073 (92277-91204) objects 127 handoff"

and from pkg/runtime/mgc0.c

runtime·printf("gc%d(%d): %D+%D+%D ms %D -> %D MB %D -> %D (%D-%D) objects %D handoff\n",
                    mstats.numgc, work.nproc, (t1-t0)/1000000, (t2-t1)/1000000, (t3-t2)/1000000,
                    heap0>>20, heap1>>20, obj0, obj1,
                    mstats.nmalloc, mstats.nfree,
                    nhandoff);

Without reading much of this code, I am assuming obj0/heap0 are the before and obj1/heap1 are the after?
nmalloc and nfree seem obvious enough.
Not even a guess as to what handoff is, though ;)
The code I am working on didn't trigger an scvg line.

⚛

unread,
Nov 28, 2012, 4:24:01 PM11/28/12
to bryanturley, golang-nuts


On Nov 28, 2012 9:59 PM, "bryanturley" <bryan...@gmail.com> wrote:
>
> On Wednesday, November 28, 2012 2:02:18 PM UTC-6, ⚛ wrote:
>>
>> GOGCTRACE=1 ./executable
>
>
> Might help if you tell him what the fields mean exactly (from go 1.0.3, maybe less cryptic in tip)
>
> "gc63(4): 0+0+0 ms 1 -> 0 MB 8257 -> 1073 (92277-91204) objects 127 handoff"
>
> and from pkg/runtime/mgc0.c
>
> runtime·printf("gc%d(%d): %D+%D+%D ms %D -> %D MB %D -> %D (%D-%D) objects %D handoff\n",
>                     mstats.numgc, work.nproc, (t1-t0)/1000000, (t2-t1)/1000000, (t3-t2)/1000000,
>                     heap0>>20, heap1>>20, obj0, obj1,
>                     mstats.nmalloc, mstats.nfree,
>                     nhandoff);
>
> Without reading much of this code i am assuming obj0/heap0 are the before and obj1/heap1 are the after?

That is correct.

The sum of the 3 numbers before "ms" is the GC pause time.

There is also: godoc runtime MemStats

> nmalloc and nfree seem obvious enough.

nmalloc and nfree are totals since the start of the program.

> not even a guess as to what handoff is though ;)

handoff is communication between GC threads.
