RES vs MemStats.Sys memory reporting on Linux

1,119 views
Skip to first unread message

Blake Caldwell

unread,
Nov 7, 2014, 5:53:00 PM11/7/14
to golan...@googlegroups.com
I'm seeing a *huge* discrepancy between RES and MemStats.Sys on Linux. My MemStats.Sys looks perfect - exactly what I'd expect. With 50 concurrent sessions on my SSH service, I'm right around 20MB.

However, RES gets up around 1GB, which is the total amount on this VM, then crashes. I'm fine with moving this to a VM with more ram, but doesn't this seem unreasonably high? 

Most of my allocations are due to the crypto package's eventual use of math/big.nat.divLarge. Is there anything I can do here, or is this just a silly thing the GC does in 1.3?

Dave Cheney

unread,
Nov 7, 2014, 5:57:19 PM11/7/14
to golan...@googlegroups.com
Please run your program with GODEBUG=gctrace=1 until you see RSS and SYS diverge.

At a guess, your os is ignoring the scavenger's suggestion to release memory, probably because your machine is not under sufficient memory pressure.

That, or heap fragmentstion. Gctrace will make that clear.

Blake Caldwell

unread,
Nov 7, 2014, 8:13:41 PM11/7/14
to golan...@googlegroups.com
Cool. I'll try that. Is there any chance of circular reference leaks in Golang? I know I'm not leaking goroutines.

Dave Cheney

unread,
Nov 7, 2014, 8:15:38 PM11/7/14
to Blake Caldwell, golan...@googlegroups.com
It depends, can you show what one would look like?



> On 8 Nov 2014, at 12:13, Blake Caldwell <bl...@fogcreek.com> wrote:
>
> Cool. I'll try that. Is there any chance of circular reference leaks in Golang? I know I'm not leaking goroutines.
>
> --
> You received this message because you are subscribed to a topic in the Google Groups "golang-nuts" group.
> To unsubscribe from this topic, visit https://groups.google.com/d/topic/golang-nuts/kuS4kLCwkbE/unsubscribe.
> To unsubscribe from this group and all its topics, send an email to golang-nuts...@googlegroups.com.
> For more options, visit https://groups.google.com/d/optout.

Blake Caldwell

unread,
Nov 7, 2014, 8:22:48 PM11/7/14
to golan...@googlegroups.com
Oh you know... A and B reference each other, so reference count on each is 1, so they stick around even though nothing else references them. I know in Python, there's weird situations where if one of the objects had a "del" method, the GC can't decide which to delete first, so neither do, and you get a leak. Didn't know if there are situations like this to worry about in Golang.

Dave Cheney

unread,
Nov 7, 2014, 8:30:51 PM11/7/14
to Blake Caldwell, golang-nuts
Go doesn't use reference counting, if the reference is not reachable
from a goroutine (or a time), then it is not retained.

On Sat, Nov 8, 2014 at 12:22 PM, Blake Caldwell <bl...@fogcreek.com> wrote:
> Oh you know... A and B reference each other, so reference count on each is 1, so they stick around even though nothing else references them. I know in Python, there's weird situations where if one of the objects had a "del" method, the GC can't decide which to delete first, so neither do, and you get a leak. Didn't know if there are situations like this to worry about in Golang.
>

Dave Cheney

unread,
Nov 7, 2014, 8:31:12 PM11/7/14
to Blake Caldwell, golang-nuts
s/time/timer

Blake Caldwell

unread,
Nov 7, 2014, 8:36:48 PM11/7/14
to golan...@googlegroups.com
Awesome, thanks :)

Dave Cheney

unread,
Nov 7, 2014, 8:39:37 PM11/7/14
to Blake Caldwell, golang-nuts
Don't forget to get that GODEBUG data, it should be pretty simple to
figure out what is going on from the trace.

On Sat, Nov 8, 2014 at 12:36 PM, Blake Caldwell <bl...@fogcreek.com> wrote:
> Awesome, thanks :)

Blake Caldwell

unread,
Nov 7, 2014, 9:49:14 PM11/7/14
to golan...@googlegroups.com, bl...@fogcreek.com
Weird... I'm running now on a different Linux VM with the debug setting you suggested, and my RES just about exactly matches what I'm seeing from MemStats.Sys.

I gotta find out what's going on on that other machine. I'll post back when I find out - thanks again!

Dave Cheney

unread,
Nov 8, 2014, 12:23:17 AM11/8/14
to Blake Caldwell, golan...@googlegroups.com
Please try to get en GODEBUG data, it will be able to explain what you are seeing.

On the server which is showing high RSS, how much swap does it have? Similarly for the machine that shows low RSS. Are they running the same kernel and/or os?


Blake Caldwell

unread,
Nov 8, 2014, 12:27:52 AM11/8/14
to golan...@googlegroups.com, bl...@fogcreek.com
Oh, I will as soon as I can. That machine is out of my hands, but we're logging out that data.

Blake Caldwell

unread,
Nov 11, 2014, 12:54:43 PM11/11/14
to golan...@googlegroups.com, bl...@fogcreek.com
Here's some of the trace output while MemStats.Sys is showing about 17MB, RES is 326MB and VIRT 560MB.

I'm not really sure what I'm looking for here - do you see anything obvious?'

As for kernel/OS - they're both Linux. This box that's using way too much RES is 3.2.0-4-amd64, while the one where I saw no problems was 3.13.0-32-generic.
 
gc2810(1): 1+0+1837+0 us, 2 -> 4 MB, 16286 (31303956-31287670) objects, 811/3/0 sweeps, 0(0) handoff, 0(0) steal, 0/0/0 yields
gc2811(1): 1+0+1910+0 us, 2 -> 5 MB, 15953 (31314082-31298129) objects, 811/1/0 sweeps, 0(0) handoff, 0(0) steal, 0/0/0 yields
gc2812(1): 1+0+1860+0 us, 2 -> 5 MB, 19029 (31327130-31308101) objects, 811/0/0 sweeps, 0(0) handoff, 0(0) steal, 0/0/0 yields
gc2813(1): 1+0+1958+0 us, 2 -> 5 MB, 23497 (31344501-31321004) objects, 811/0/0 sweeps, 0(0) handoff, 0(0) steal, 0/0/0 yields
gc2814(1): 2+1+1886+0 us, 2 -> 5 MB, 23976 (31362280-31338304) objects, 811/2/0 sweeps, 0(0) handoff, 0(0) steal, 0/0/0 yields
gc2815(1): 1+0+1717+0 us, 2 -> 5 MB, 25018 (31381285-31356267) objects, 811/1/0 sweeps, 0(0) handoff, 0(0) steal, 0/0/0 yields
gc2816(1): 1+0+1722+0 us, 2 -> 5 MB, 18677 (31395434-31376757) objects, 811/3/0 sweeps, 0(0) handoff, 0(0) steal, 0/0/0 yields
gc2817(1): 2+1+3238+1 us, 2 -> 5 MB, 23086 (31414160-31391074) objects, 811/575/0 sweeps, 0(0) handoff, 0(0) steal, 0/0/0 yields
gc2818(1): 1+0+1215+0 us, 2 -> 4 MB, 15903 (31425171-31409268) objects, 811/5/0 sweeps, 0(0) handoff, 0(0) steal, 0/0/0 yields
gc2819(1): 1+0+980+0 us, 2 -> 4 MB, 15360 (31437323-31421963) objects, 811/431/0 sweeps, 0(0) handoff, 0(0) steal, 0/0/0 yields
gc2820(1): 3+1+1464+1 us, 1 -> 3 MB, 10860 (31445642-31434782) objects, 811/388/0 sweeps, 0(0) handoff, 0(0) steal, 0/0/0 yields


Thanks!


On Saturday, November 8, 2014 12:23:17 AM UTC-5, Dave Cheney wrote:

Dave Cheney

unread,
Nov 11, 2014, 2:58:49 PM11/11/14
to Blake Caldwell, Dmitry Vyukov, golang-nuts
Thanks Blake.

There is some missing info,

* the scavenger lines, scvg lines, are the crucial part, there should
be one every 5 minutes
* the swap settings on both machines, free -m will do

Dmitry,

gc2820(1): 3+1+1464+1 us, 1 -> 3 MB, 10860 (31445642-31434782)
objects, 811/388/0 sweeps, 0(0) handoff, 0(0) steal, 0/0/0 yields

1.4 seconds to collect a tiny heap sounds very odd, any ideas ?

James Bardin

unread,
Nov 11, 2014, 3:18:38 PM11/11/14
to golan...@googlegroups.com, bl...@fogcreek.com, dvy...@google.com


On Tuesday, November 11, 2014 2:58:49 PM UTC-5, Dave Cheney wrote:

gc2820(1): 3+1+1464+1 us, 1 -> 3 MB, 10860 (31445642-31434782)
objects, 811/388/0 sweeps, 0(0) handoff, 0(0) steal, 0/0/0 yields

1.4 seconds to collect a tiny heap sounds very odd, any ideas ?


that's 1464us; or 0.001464 seconds

Dave Cheney

unread,
Nov 11, 2014, 3:22:31 PM11/11/14
to golan...@googlegroups.com, bl...@fogcreek.com, dvy...@google.com
Ignore me, carry on.

Blake, this is either a reporting error from the older 3.2 kernel, or one machine is configured without swap, which causes the first problem. If you can find those scvg: lines, they'll show you the real memory usage.

Blake Caldwell

unread,
Nov 11, 2014, 3:46:02 PM11/11/14
to golan...@googlegroups.com, bl...@fogcreek.com, dvy...@google.com
Here's the full output: https://gist.githubusercontent.com/wblakecaldwell/70165b8eded10d3b101d/raw/aa1c5ca34de6af2547d658e69e0a123df448d496/gistfile1.txt

We spend one minute ramping up the number of concurrent connections to 25, then hold it for 10 minutes or so. We then resume it again for a bit. There's a certain point where the service has no connections, which is where most of this output is reporting from. I apologize that I don't have the timings perfect. Let me know if that's necessary.

free -m on the problem machine (3.2.0-4-amd64) yields:
             total       used       free     shared    buffers     cached
Mem:          1002        894        108          0         50        242
-/+ buffers/cache:        602        400
Swap:          713         20        693

free -m on the machine where the service RES stays low (3.13.0-32-generic) yields:
             total       used       free     shared    buffers     cached
Mem:           990        923         67          0         10         58
-/+ buffers/cache:        854        135
Swap:         1021        782        239

I appreciate everyone's help. I need to learn more about trace and GC so I can troubleshoot this on my own next time.

Much appreciated!

Dave Cheney

unread,
Nov 11, 2014, 3:53:58 PM11/11/14
to Blake Caldwell, golang-nuts, Dmitry Vyukov
Thanks Blake,

scvg259: inuse: 4, idle: 5, sys: 10, released: 3, consumed: 6 (MB)

This is the key bit of info. This is what the runtime believes it has
asked the operating system for, reported RSS should be closer to that.

For both the machines it's basically impossible to buy a machine with
only 1Gb of ram these days, so unless these machines are very old, I
suspect some virtualisation is involved, which can also throw off
memory usage measurements.

Blake Caldwell

unread,
Nov 11, 2014, 4:13:02 PM11/11/14
to golan...@googlegroups.com, bl...@fogcreek.com, dvy...@google.com
Oh, right. I'm sorry - these are VMs. So, all bets are off, then?

And yeah, those numbers are what I'm seeing in Memstats.Sys, and are close to RES on the machine that's behaving.

Dave Cheney

unread,
Nov 11, 2014, 4:21:41 PM11/11/14
to Blake Caldwell, golang-nuts, Dmitry Vyukov
On Wed, Nov 12, 2014 at 8:13 AM, Blake Caldwell <bl...@fogcreek.com> wrote:
> Oh, right. I'm sorry - these are VMs. So, all bets are off, then?

Stick with the numbers the Go runtime reports, those are accurate.

Blake Caldwell

unread,
Nov 11, 2014, 4:24:42 PM11/11/14
to golan...@googlegroups.com, bl...@fogcreek.com, dvy...@google.com
Yeah, the numbers definitely show that I'm not doing something stupid, but... the service's RES use is acting like you'd expect - once it gets to 100% of the system's available memory, it swaps, and then the service crashes. This isn't a theoretical or reporting problem - the service is crashing from this. :(

Blake Caldwell

unread,
Dec 8, 2014, 8:28:06 PM12/8/14
to golan...@googlegroups.com
I wanted to share some closure on this one. 

To recap: the runtime for my service was reporting memory usage around 20-40MB while under load, but the RES would approach 1GB, and then crash, because that's all we gave this VM. That VM's kernel was 3.2.0-4-amd64, which seems to be almost 3 years old.

We just updated to 3.16-0.bpo.3-amd64, and we're seeing RES matching the runtime's bookkeeping, staying around 20MB!

Thanks again for everyone's help!

- Blake
Reply all
Reply to author
Forward
0 new messages