This email is about:
- what some recent CLs did to memory allocation
- why it broke for some people (issue 1464)
- what we should do to fix it
* What happened
The memory allocator tracks runs of memory pages called Spans and must maintain a mapping from page address to Span.
On 32-bit machines, tcmalloc maps page address to Span using a two-level array: the top level has 1024 entries indexed by the first 10 bits, each pointing at a second-level array of 1024 entries indexed by the next 10 bits. (Pages are 4096 bytes, so the final 12 bits are irrelevant.) The Go allocator followed tcmalloc's lead and did the same thing. During the recent changes to the allocator, I decided that this was trading time for space in a fairly ineffective way: the collector refers to this structure often, and the extra pointer indirection is noticeable. The only benefit of the two-level tree compared to a (faster) single 4 MB array is that it uses less space. However, if you don't touch the pages of the 4 MB array that you aren't using, the OS won't map them into memory anyway. We only do lookups in that map for pointers in the range [min(allocated address), max(allocated address)), so if the allocated memory is contiguous, the single 4 MB array uses no more memory than the two-level array and avoids the extra indirection.
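To make the trade-off concrete, here is a sketch of the two lookup schemes for a 32-bit address space. The names and layout are mine, not the runtime's: 20 page-number bits are either split 10+10 for the two-level tree or used whole to index a flat 1M-entry array. Note that the flat array's untouched pages cost nothing on a demand-paged system, which is exactly the argument above.

```c
#include <assert.h>
#include <stdint.h>
#include <stdlib.h>

typedef struct Span Span;
struct Span { uintptr_t start; };

enum { PageShift = 12, LevelBits = 10, LevelSize = 1 << LevelBits };

/* Two-level tree: top[first 10 bits] -> second-level array[next 10 bits]. */
static Span **top[LevelSize];

static void tree_insert(uintptr_t addr, Span *s) {
    uintptr_t pn = addr >> PageShift;
    uintptr_t i1 = pn >> LevelBits, i2 = pn & (LevelSize - 1);
    if (top[i1] == NULL)
        top[i1] = calloc(LevelSize, sizeof(Span *));
    top[i1][i2] = s;
}

static Span *tree_lookup(uintptr_t addr) {
    uintptr_t pn = addr >> PageShift;
    Span **l2 = top[pn >> LevelBits];      /* the extra indirection */
    return l2 ? l2[pn & (LevelSize - 1)] : NULL;
}

/* Flat array: one indexed load. 1M entries is 4 MB with 32-bit
 * pointers; pages of it that are never touched stay unmapped. */
static Span *flat[1 << (32 - PageShift)];

static void flat_insert(uintptr_t addr, Span *s) { flat[addr >> PageShift] = s; }
static Span *flat_lookup(uintptr_t addr)         { return flat[addr >> PageShift]; }
```

Both give the same answers; the flat version simply replaces a dependent load with an index computation.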
On 64-bit machines, tcmalloc and the old Go allocator used a three-level (or four-level, I forget) tree to map 64-bit addresses (52-bit page numbers) to Spans. On further reflection this seemed insane: if we assume a maximum arena size of 16 GB, that's only 22 page bits to map but we're paying for the generality of 52. So I made that one an array too (4 million entries * 8 bytes = 32 MB). Again under the assumption that the allocated memory is contiguous, there's no real cost on a demand-paged system.
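The 32 MB figure falls out of the arithmetic directly; a helper (mine, for illustration) spells it out:

```c
#include <assert.h>
#include <stdint.h>

enum { PageShift = 12 };   /* 4096-byte pages */

/* One 8-byte Span pointer per page of arena: a 16 GB arena has
 * 16 GB >> 12 = 4M pages (22 page bits), so the map is 32 MB. */
static uint64_t span_map_bytes(uint64_t arena_bytes) {
    uint64_t pages = arena_bytes >> PageShift;
    return pages * 8;
}
```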
There are some other data structures associated with the heap now too, to support the new garbage collector. It dedicates 4 bits per heap word to tracking various metadata. It's the same lookup problem as with Spans, just with more data. Like in the Span case, you can't do better than a big array indexed directly by heap address (translated and scaled), assuming that the allocated memory is contiguous.
The heap bitmap would be 512 MB for a 4 GB 32-bit address space (overkill since you'd only have 3.5 GB left on a 32-bit machine), 1 GB for a 16 GB 64-bit address space.
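Those two numbers come from the same formula, with the word size as the only difference (4-byte words on 32-bit, 8-byte words on 64-bit); a quick check, in a hypothetical helper of mine:

```c
#include <assert.h>
#include <stdint.h>

/* 4 bits of GC metadata per heap word, 8 bits per byte. */
static uint64_t bitmap_bytes(uint64_t heap_bytes, int word_size) {
    uint64_t words = heap_bytes / word_size;
    return words * 4 / 8;
}
```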
The easy way to make sure the allocated memory is contiguous is to make a big reservation by using mmap with prot==0 (no permission) and then change the permissions to rwx during the first "allocation". The same applies to the bitmap: reserve it but grow as needed.
So that's what I did for 64-bit systems. It works very well on my machines.
(On 32-bit systems you pretend the whole address space is the reservation and take what you're given from it; there's no room for a real reservation, but the effect is similar.)
* What broke
It is common (I knew this and forgot) for people to use ulimit -v to limit the amount of memory that a running program can request. At least that's what people wish it did. What they're trying to do is limit the amount of swapping that a program might cause, because swapping makes the system completely unresponsive. I figured that my mappings wouldn't get charged to the limit until I actually asked for memory there (prot == rwx), but in fact ulimit -v means "virtual address space" and they're serious. Even though there's no possible adverse consequence, you get charged for prot == 0. (Except on OS X, where ulimit -v appears to be ignored always.)
On 32-bit systems, only the initial bitmap gets reserved; that's 256 MB which is small enough for most people's ulimit. I don't think people have seen problems on 32-bit systems.
On 64-bit systems, the 1 GB bitmap and the 16 GB allocation arena get reserved at startup. Most people who have ulimit -v set don't have it set high enough to allow 17 GB of address space, so all Go binaries stopped working for those people. Making people type ulimit -v unlimited is a workaround, but not a very good one. Another, simpler workaround is for the Go binary to try to do that itself at startup. That has some charm to it but is not going to be received well.
* How to fix it
I don't want to give up the many benefits of the simplifying assumption that the heap is contiguous. It works too well. The only question is how to make the heap contiguous. It's okay for the answer to be operating system-dependent: most details about low-level memory allocation already are.
The 32-bit ports are fine. They look like they are using 256 MB minimum, but if people look at RSS instead of VSIZE in top, they'll see the true story. (Windows broke too, but not because of anything relevant to this discussion.)
On darwin/amd64 I think we can keep doing what we're doing, since the kernel doesn't enforce ulimit -v and what we're doing is not actually expensive.
On linux/amd64 and freebsd/amd64, we have to do something different. I think the answer is to pretend that we have a 17 GB reservation at 0xf800000000 even though we don't, and then when we want to take from it, we die if someone else has gotten there first. I wrote a few test programs, and Linux will not give out that address if you don't ask for it. For allocations where it can pick the address, it prefers to allocate memory downward from the maximum user address 0x7fffffffffff, even after some memory is mapped at our reservation address. In a Go program linked against C code via cgo, the C allocator would not end up at our reservation address unless it asked for it intentionally, which it doesn't. I expect FreeBSD to be the same but do not have an available system on which to try it.
So that's my current plan: make SysReserve a no-op on Linux and FreeBSD and make SysMap abort the program if the address we want to use is taken. It's admittedly dodgy, but it will work just fine for pure Go programs and empirically looks like it will work just fine for Go programs linked with C libraries too. In an ideal world the kernel would be more useful.
When windows/amd64 comes along, there will be no problem, because its VirtualAlloc API is better designed than mmap.