mmap fiasco

rsc Feb 4, 2011 2:49 PM
Posted in group: golang-dev
This email is about:
 - what some recent CLs did to memory allocation
 - why it broke for some people (issue 1464)
 - what we should do to fix it.

* What happened

The memory allocator tracks runs of memory pages called
Spans and must maintain a mapping from page address to Span.

On 32-bit machines, tcmalloc maps page address to Span using
a two-level array: the top level has 1024 entries indexed by
the first 10 bits, each pointing at a second level page of
1024 entries indexed by the second 10 bits.  (Pages are 4096
bytes, so the final 12 bits are irrelevant.) The Go
allocator followed tcmalloc's lead and did the same thing.
During the recent changes to the allocator, I decided that
this was trading time for space in a fairly ineffective way:
the collector refers to this structure often, and the extra
pointer indirection is noticeable.  The only benefit of the
two-level tree compared to a (faster) single 4 MB array is
that it uses less space.  However, if you don't touch the
pages of the 4 MB array that you aren't using, the OS won't
map them into memory anyway.  We only do lookups in that map
for pointers in the range [min(allocated address),
max(allocated address)), so if the allocated memory is
contiguous, the single 4 MB array uses no more memory than
the two-level array and avoids the time for the extra
indirection.

On 64-bit machines, tcmalloc and the old Go allocator used a
three-level (or four-level, I forget) tree to map 64-bit
addresses (52-bit page numbers) to Spans.  On further
reflection this seemed insane: if we assume a maximum arena
size of 16 GB, that's only 22 page bits to map but we're
paying for the generality of 52.  So I made that one an
array too (4 million entries * 8 bytes = 32 MB).  Again
under the assumption that the allocated memory is
contiguous, there's no real cost on a demand-paged system.

There are some other data structures associated with the
heap now too, to support the new garbage collector.  It
dedicates 4 bits per heap word to tracking various metadata.
It's the same lookup problem as with Spans, just with more
data.  Like in the Span case, you can't do better than a big
array indexed directly by heap address (translated and
scaled), assuming that the allocated memory is contiguous.

The heap bitmap would be 512 MB for a 4 GB 32-bit address
space (overkill since you'd only have 3.5 GB left on a
32-bit machine), 1 GB for a 16 GB 64-bit address space.

The easy way to make sure the allocated memory is contiguous
is to make a big reservation by using mmap with prot==0 (no
permission) and then change the permissions to rwx during
the first "allocation".  The same applies to the bitmap:
reserve it but grow as needed.

So that's what I did for 64-bit systems.  It works very well
on my machines.

(On 32-bit systems you pretend the whole address space is
the reservation and take what you're given from it; there's
no room for a real reservation, but the effect is similar.)

* What broke

It is common (I knew this and forgot) for people to use
ulimit -v to limit the amount of memory that a running
program can request.  At least that's what people wish it
did.  What they're trying to do is limit the amount of
swapping that a program might cause, because swapping makes
the system completely unresponsive.  I figured that my
mappings wouldn't get charged to the limit until I actually
asked for memory there (prot == rwx), but in fact ulimit -v
means "virtual address space" and they're serious.  Even
though there's no possible adverse consequence, you get
charged for prot == 0.  (Except on OS X, where ulimit -v
appears to be ignored always.)

On 32-bit systems, only the initial bitmap gets reserved;
that's 256 MB which is small enough for most people's
ulimit.  I don't think people have seen problems on 32-bit
systems.

On 64-bit systems, the 1 GB bitmap and the 16 GB allocation
arena get reserved at startup.  Most people who have ulimit
-v set don't have it set high enough to allow 17 GB of
address space, so all Go binaries stopped working for those
people.  Making people type
    ulimit -v unlimited
is a workaround, but not a very good one.  Another, simpler
workaround is for the Go binary to try to do that itself at
startup.  That has some charm to it but is not going to be
received well.

* How to fix it

I don't want to give up the many benefits of the simplifying
assumption that the heap is contiguous.  It works too well.
The only question is how to make the heap contiguous.  It's
okay for the answer to be operating system-dependent: most
details about low-level memory allocation already are.

The 32-bit ports are fine.  They look like they are using
256 MB minimum, but if people look at RSS instead of VSIZE
in top, they'll see the true story.  (Windows broke too, but
not because of anything relevant to this discussion.)

On darwin/amd64 I think we can keep doing what we're doing,
since the kernel doesn't enforce ulimit -v and what we're
doing is not actually expensive.

On linux/amd64 and freebsd/amd64, we have to do something
different.  I think the answer is to pretend that we have a
17 GB reservation at 0xf800000000 even though we don't, and
then when we want to take from it, we die if someone else
has gotten there first.  I wrote a few test programs, and
Linux will not give out that address if you don't ask for
it.  For allocations where it can pick the address, it
prefers to allocate memory downward from the maximum user
address 0x7fffffffffff, even after some memory is mapped at
our reservation address.  In a Go program linked against C
code via cgo, the C allocator would not end up at our
reservation address without hitting it intentionally, which
it doesn't.  I expect FreeBSD to be the same but do not have
an available system on which to try it.

So that's my current plan: make SysReserve a no-op on Linux
and FreeBSD and make SysMap abort the program if the address
we want to use is taken.  It's admittedly dodgy, but it will
work just fine for pure Go programs and empirically looks
like it will work just fine for Go programs linked with C
libraries too.  In an ideal world the kernel would be more
forgiving about charging prot==0 reservations against ulimit -v.

When windows/amd64 comes along, there will be no problem,
because its VirtualAlloc API is better designed than mmap.

Comments, warnings, protests, counter-proposals welcome.