Overcommitting Memory Leak in Go Runtime

Aleksa Sarai

Apr 4, 2016, 11:01:24 AM
to golan...@googlegroups.com, golan...@googlegroups.com
Hi all,

I've been recently debugging an interesting memory "leak" inside Docker,
and have tracked it down to the Go runtime. The "leak" involves
overcommitted memory that is never freed. While I am aware that the Go
GC doesn't free memory to the operating system (meaning `munmap`,
because MADV_DONTNEED is not a substitute), I believe there is a
legitimate resource leak issue inside the Go runtime.

Unfortunately I don't seem to be able to provide a better "small test
case" than running Docker (sorry about that). I'm working on a better
test case, but this is really causing us some grief, so hopefully
someone can give us a hand. I'm going to be referring to the Docker
daemon < 1.11 (this problem still affects the currently in-development
version of Docker, but the process interactions in that version are
much more complicated, even though the base problem is the same).

If you run the following script:

% echo 1 | sudo tee /proc/sys/vm/overcommit_memory
% for i in {1..1000}; do docker run --name shell_$i -dit busybox sh; done
% for i in {1..1000}; do docker rm -f shell_$i; done

You have started and then destroyed 1000 containers. You can verify
that there are no resource leaks inside the Docker daemon if you start
it like so:

# DEBUG=1 docker daemon -H tcp://:8080

and then navigate to http://127.0.0.1:8080/pprof/heap. But if you look
at the amount of overcommitted memory in /proc/meminfo, it'll look
something like this:

% grep Commit /proc/meminfo
CommitLimit: 1472088 kB
Committed_AS: 3535880 kB

One might assume that this is just heap memory (sysUnused) that will be
reused if we start that many containers again, but that doesn't appear
to be the case. If you run the same "start 1000 containers" command
again (bearing in mind the old containers were all completely purged
from existence), the memory **will still climb**. If you can't
reproduce it with 1000 containers, try 500 or so.

Now, the weird thing is that this growth in overcommitted memory
doesn't appear to go on forever. For me (on a machine with 8GB of
physical memory), it stops growing once it reaches ~8GB. Does anyone
know if this is some "feature" of the Go runtime, where it decides to
reuse heap memory only after it has exhausted as much overallocation as
possible?

I played around with the Go runtime a little bit and ended up modifying
sysUnused in src/runtime/mem_linux.go to call `munmap` rather than
`madvise(v, n, _MADV_DONTNEED)`. One would assume this would cause my
program to segfault. *It didn't*. Nor did it seem to reduce the
overcommitting problem. So I'm pretty much out of my depth here; is
there any help anyone can give me? Thanks so much.

Is there some MemStats trickery I can try, or some `runtime` method that
can help me debug this issue or provide you with information?

--
Aleksa Sarai
Software Engineer (Containers)
SUSE Linux GmbH
https://www.cyphar.com/

Austin Clements

Apr 4, 2016, 2:22:12 PM
to Aleksa Sarai, Go Mailing List, golang-dev
Hi Aleksa. Could you clarify why you're concerned about address space overcommit?

On Mon, Apr 4, 2016 at 10:29 AM, Aleksa Sarai <asa...@suse.de> wrote:
> Hi all,
>
> I've been recently debugging an interesting memory "leak" inside Docker, and have tracked it down to the Go runtime. The "leak" involves overcommitted memory that is never freed. While I am aware that the Go GC doesn't free memory to the operating system (meaning `munmap`, because MADV_DONTNEED is not a substitute), I believe there is a legitimate resource leak issue inside the Go runtime.

MADV_DONTNEED does return memory to the operating system. What it does not return is virtual address space. There's really no such thing as "overcommitted memory", just "overcommitted address space".

Address space is, to a first order approximation, free. In fact, using *less* address space can be *more* expensive in terms of kernel resources because each contiguous range of address space uses a kernel VMA. If we unmap some memory in the middle of that range (or change its protections), that splits a VMA into multiple VMAs. MADV_DONTNEED doesn't have this effect, and one reason we don't use munmap for this is that there's a default limit of only 64k VMAs per process.
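
As a rough illustration of that accounting: each line of /proc/<pid>/maps corresponds to one VMA, and the per-process limit lives in vm.max_map_count (65530 by default on most kernels), so a small program can compare the two, something like this sketch:

package main

import (
	"bufio"
	"fmt"
	"io/ioutil"
	"os"
	"strings"
)

func main() {
	// Each line of /proc/self/maps describes one VMA of this process.
	f, err := os.Open("/proc/self/maps")
	if err != nil {
		panic(err)
	}
	defer f.Close()

	vmas := 0
	sc := bufio.NewScanner(f)
	for sc.Scan() {
		vmas++
	}

	// vm.max_map_count is the kernel's per-process VMA limit (~64k by default).
	limit, err := ioutil.ReadFile("/proc/sys/vm/max_map_count")
	if err != nil {
		panic(err)
	}

	fmt.Printf("VMAs in use: %d, vm.max_map_count: %s\n",
		vmas, strings.TrimSpace(string(limit)))
}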

> Unfortunately I can't appear to provide you any better "small test case" than running Docker (sorry about that). I'm working on a better test case, but this is really causing us some grief so hopefully someone can give us a hand. I'm going to be referring to the Docker daemon < 1.11 (this problem still affects the currently in-development version of Docker but the process interactions are much more complicated in that version even though the base problem is the same).

> If you run the following script:
>
> % echo 1 | sudo tee /proc/sys/vm/overcommit_memory
> % for i in {1..1000}; do docker run --name shell_$i -dit busybox sh; done
> % for i in {1..1000}; do docker rm -f shell_$i; done

> You have started and then destroyed 1000 containers. You can verify that there's no resource leaks inside the Docker daemon if you start the Docker daemon like so:
>
> # DEBUG=1 docker daemon -H tcp://:8080

> And then navigating to http://127.0.0.1:8080/pprof/heap. But if you look at the overcommitted amount of memory in /proc/meminfo, it'll look something like this:
>
> % grep Commit /proc/meminfo
> CommitLimit:     1472088 kB
> Committed_AS:    3535880 kB

> However, if one assumes that this is just heap memory (sysUnused) that will be reused if we start that many containers again, that doesn't appear to be the case. If you run the same "start 1000 containers" command again (bearing in mind the old containers were all completely purged from existence), the memory **will still climb**. If you can't reproduce it with 1000 containers, do it with 500 or something.

This is somewhat surprising. I would expect the address space to be reused fairly efficiently, though not necessarily perfectly. Does it climb by the same amount when you start each batch of 1,000, or does it climb by less on the subsequent batches?

> Now, the weird thing is that this growing of overcommitted memory doesn't appear to go on forever. For me (on a machine with 8GB of physical memory), it stops growing once it reaches ~8GB. Does anyone know if this is some "feature" of the Go runtime, that it decides to reuse heap memory after it's exhausted as much overallocation as possible?

The Go runtime does not look at either the physical memory available or any kernel overcommit settings.

> I played around with the Go runtime a little bit (and I ended up modifying sysUnused in src/runtime/mem_linux.go to call `munmap` rather than `madvise(v, n, _MADV_DONTNEED)`). One would assume this would cause my program to segfault. *It didn't*. Nor did it seem to reduce the overcommitting problem. So I'm pretty much out of my depth here, is there any help anyone can give me? Thanks so much.

This is quite surprising, and suggests that it may not be returning memory to the system at all. How long did you wait? The runtime is currently fairly conservative about returning memory, so it may take between 5 and 7.5 minutes to return memory to the OS. (I would like to change this, but amortizing the cost of more frequently scavenging will require a different scavenging algorithm, so it's non-trivial.) You could try reducing this time for testing by modifying scavengelimit in proc.go.
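
If waiting is the issue, one thing worth trying is runtime/debug.FreeOSMemory, which forces a GC and then immediately returns as much memory to the OS as it can, taking the scavenger's timing out of the picture. A minimal sketch of that kind of experiment:

package main

import (
	"fmt"
	"runtime"
	"runtime/debug"
)

func main() {
	// Grow the heap with a large allocation, then drop the reference so the
	// memory becomes idle.
	garbage := make([]byte, 512<<20)
	for i := range garbage {
		garbage[i] = 1
	}
	garbage = nil

	// Force a GC plus an immediate scavenge instead of waiting several
	// minutes for the background scavenger.
	debug.FreeOSMemory()

	var ms runtime.MemStats
	runtime.ReadMemStats(&ms)
	fmt.Printf("HeapSys: %d MB, HeapIdle: %d MB, HeapReleased: %d MB\n",
		ms.HeapSys>>20, ms.HeapIdle>>20, ms.HeapReleased>>20)
}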

Also, what architecture are you on? There's currently an issue with systems with large physical pages (ARM64, PPC64, and MIPS64) that causes the runtime to not return memory (https://golang.org/issue/9993).

> Is there some MemStats trickery I can try, or some `runtime` method that can help me debug this issue or provide you with information?

Please run with GODEBUG=gctrace=1. In addition to the garbage collection trace, this turns on scavenger tracing. You should see a line beginning with "scvg" every 2.5 minutes that reports its activity.
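
The trace lines are roughly of this shape (sizes in MB; the numbers here are made up, and the exact fields may vary a little between Go versions):

scvg1: 38 MB released
scvg1: inuse: 13, idle: 50, sys: 63, released: 38, consumed: 25 (MB)

"consumed" is sys minus released, i.e. what the process is still holding from the OS.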

Lars Tørnes Hansen

Apr 4, 2016, 10:29:50 PM
to golang-nuts, golan...@googlegroups.com, asa...@suse.de
 
> Now, the weird thing is that this growing of overcommitted memory doesn't appear to go on forever.

In vm.overcommit_memory=0 mode (the default), the kernel will let your software get away with overcommitting.

Memory is actually allocated lazily: when a process writes to a new page, the kernel finds a free physical page to back it.
The problem is that when the kernel cannot find any free pages, it starts its OOM killer.
The OOM killer uses heuristics to guesstimate which process is guilty of using too much memory. Processes are killed until enough pages are freed.

You could let allocation of virtual memory fail early by tuning vm.overcommit_* sysctl variables:

vm.overcommit_memory
vm.overcommit_kbytes
vm.overcommit_ratio

That probably makes it easier to debug your software.

Set:
vm.overcommit_memory=2
... or set an absolute max. value with:
vm.overcommit_kbytes

Read more: Documentation/vm/overcommit-accounting

With overcommit mode set to 2 (vm.overcommit_memory=2), the kernel enforces a commit limit of:
CommitLimit = (vm.overcommit_ratio / 100) * Physical RAM + Swap

vm.overcommit_ratio can be set to a value > 100.
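
For example (with made-up numbers): on a box with 8 GB of RAM and 2 GB of swap, the stock vm.overcommit_ratio of 50 gives a commit limit of 0.5 * 8 GB + 2 GB = 6 GB, while vm.overcommit_ratio=150 would allow 1.5 * 8 GB + 2 GB = 14 GB of committed address space.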

This should make malloc fail early, rather than late.

/Lars

Keith Randall

Apr 6, 2016, 11:11:05 AM
to Austin Clements, Aleksa Sarai, Go Mailing List, golang-dev
Shot in the dark here, but try turning off huge pages and see if that makes a difference.  You might be running into something like https://github.com/golang/go/issues/8832

Aleksa Sarai

May 26, 2016, 5:11:08 AM
to Austin Clements, Go Mailing List, golang-dev
Sorry for not responding in so long.

>> I've been recently debugging an interesting memory "leak" inside Docker,
>> and have tracked it down to the Go runtime. The "leak" involves
>> overcommitted memory that is never freed. While I am aware that the Go GC
>> doesn't free memory to the operating system (meaning `munmap`, because
>> MADV_DONTNEED is not a substitute), I believe there is a legitimate
>> resource leak issue inside the Go runtime.
>
> MADV_DONTNEED does return memory to the operating system. What it does not
> return is virtual address space. There's really no such thing as
> "overcommitted memory", just "overcommitted address space".

The issue is that the kernel will start killing processes that take up
too much "overcommitted address space" -- which in the case of Docker is
when you start about 500 containers on a single box. It's possible there
are other issues here, but the memory problems in Docker can't be
profiled using pprof -- implying this is an issue in the runtime.

Sure, you can disable this "kill if too much is overcommitted" feature,
but that's simply not going to fly in enterprises.

> Address space is, to a first order approximation, free. In fact, using
> *less* address space can be *more* expensive in terms of kernel resources
> because each contiguous range of address space uses a kernel VMA. If we
> unmap some memory in the middle of that range (or change its protections),
> that splits a VMA into multiple VMAs. MADV_DONTNEED doesn't have this
> effect, and one reason we don't use munmap for this is that there's a
> default limit of only 64k VMAs per process.

Interesting, I didn't know that.

> Does it climb by the same amount when you start each batch of 1,000,
> or does it climb by less on the subsequent batches?

I believe it's smaller in the subsequent batches (I don't have the data
with me right now; I'll take another look at this issue soon).

>> I played around with the Go runtime a little bit (and I ended up modifying
>> sysUnused in src/runtime/mem_linux.go to call `munmap` rather than
>> `madvise(v, n, _MADV_DONTNEED)`. One would assume this would cause my
>> program to segfault. *It didn't*. Nor did it seem to reduce the
>> overcommiting problem. So I'm pretty much out of my depth here, is there
>> any help anyone can give me? Thanks so much.
>>
>
> This is quite surprising, and suggests that it may not be returning memory
> to the system at all. How long did you wait? The runtime is currently
> fairly conservative about returning memory, so it may take between 5 and
> 7.5 minutes to return memory to the OS. (I would like to change this, but
> amortizing the cost of more frequently scavenging will require a different
> scavenging algorithm, so it's non-trivial.) You could try reducing this
> time for testing by modifying scavengelimit in proc.go.

I waited quite a while. I'll try it again, this time paying attention
to how long I wait and whether bad things happen. I'll also use
GODEBUG=gctrace=1.

> Also, what architecture are you on? There's currently an issue with systems
> with large physical pages (ARM64, PPC64, and MIPS64) that causes the
> runtime to not return memory (https://golang.org/issue/9993).

We saw this issue on both s390 and amd64. I'm not sure whether or not
this happens on other architectures (but it probably does).

Aleksa Sarai

May 26, 2016, 5:13:11 AM
to Lars Tørnes Hansen, golang-nuts, golan...@googlegroups.com
>> Now, the weird thing is that this growing of overcommited memory doesn't
>> appear to go on forever.
>>
>
> The kernel will in vm.overcommit_memory=0 mode (default) let your software
> get away with overcomiiting.

If you don't set that option, the process will be killed (which
obviously doesn't fly as a way to fix this issue on a production box).

> Memory is actually allocated in a lazy manner: When a process writes to a
> new page the kernel will find a free page.
> The problem is: When the kernel cannot find any free pages, it starts its
> OOM killer.
> The OOM killer uses heuristics to guesstimate which process is guilty of
> using too much memory. Processes are killed until enough pages are freed.
>
> You could let allocation of virtual memory fail early by tuning
> vm.overcommit_* sysctl variables:
>
> vm.overcommit_memory
> vm.overcommit_kbytes
> vm.overcommit_ratio
>
> That probably makes it easier to debug your software.
>
> Set:
> vm.overcommit_memory=2
> ... or set an absolute max. value with:
> vm.overcommit_kbytes

I'll give this a shot, thanks.

Austin Clements

May 26, 2016, 10:34:36 AM
to Aleksa Sarai, Go Mailing List, golang-dev
On Thu, May 26, 2016 at 5:10 AM, Aleksa Sarai <asa...@suse.de> wrote:
> Sorry for not responding in so long.

>>> I've been recently debugging an interesting memory "leak" inside Docker,
>>> and have tracked it down to the Go runtime. The "leak" involves
>>> overcommitted memory that is never freed. While I am aware that the Go GC
>>> doesn't free memory to the operating system (meaning `munmap`, because
>>> MADV_DONTNEED is not a substitute), I believe there is a legitimate
>>> resource leak issue inside the Go runtime.

>> MADV_DONTNEED does return memory to the operating system. What it does not
>> return is virtual address space. There's really no such thing as
>> "overcommitted memory", just "overcommitted address space".

> The issue is that the kernel will start killing processes that take up too much "overcommitted address space" -- which in the case of Docker is when you start about 500 containers on a single box. It's possible there are other issues here, but the memory problems in Docker can't be profiled using pprof -- implying this is an issue in the runtime.

> Sure, you can disable this "kill if too much is overcommitted" feature, but that's simply not going to fly in enterprises.

On 64-bit Linux, we obtain address space in 64 KB chunks as it's needed, precisely because of this overcommitted address space issue. What we don't currently do is return address space to the system, even though we do return memory to the system (partly for the reason I mentioned about the surprising costs of address space fragmentation). Does Docker have a temporarily high peak memory usage that would drive up the mapped address space?

Can you get the output of runtime.ReadMemStats? That accounts for several sources of memory allocation that are outside of the heap and hence not reported by pprof (we should probably make pprof report these somehow). For example, I just fixed a bug with over-allocation of GC-internal memory (https://github.com/golang/go/issues/15319) that was only visible in ReadMemStats. It's possible Docker was affected by the bug, though I'd be kind of surprised.
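
Something like the following would do (a rough sketch; in the daemon you would presumably hook this up to a debug endpoint or signal handler rather than main). It prints the accounting for memory obtained from the OS outside the ordinary heap, which a pprof heap profile won't show:

package main

import (
	"fmt"
	"runtime"
)

// dumpMemStats prints the MemStats fields covering memory the runtime has
// obtained from the OS, including the non-heap sources that pprof's heap
// profile does not report.
func dumpMemStats() {
	var ms runtime.MemStats
	runtime.ReadMemStats(&ms)
	mb := func(b uint64) uint64 { return b >> 20 }
	fmt.Printf("Sys: %d MB (total obtained from the OS)\n", mb(ms.Sys))
	fmt.Printf("HeapSys: %d MB, HeapInuse: %d MB, HeapIdle: %d MB, HeapReleased: %d MB\n",
		mb(ms.HeapSys), mb(ms.HeapInuse), mb(ms.HeapIdle), mb(ms.HeapReleased))
	fmt.Printf("StackSys: %d MB, MSpanSys: %d MB, MCacheSys: %d MB\n",
		mb(ms.StackSys), mb(ms.MSpanSys), mb(ms.MCacheSys))
	fmt.Printf("BuckHashSys: %d MB, GCSys: %d MB, OtherSys: %d MB\n",
		mb(ms.BuckHashSys), mb(ms.GCSys), mb(ms.OtherSys))
}

func main() {
	dumpMemStats()
}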

> We saw this issue on both s390 and amd64. I'm not sure whether or not this happens on other architectures (but it probably does).

Okay. Both of those should be returning memory to the system. (And for completeness, the bug with ARM64 etc. not returning memory has been fixed since I wrote the previous reply.)