mmap fiasco


Russ Cox

Feb 4, 2011, 5:49:57 PM
to golang-dev
This email is about:
- what some recent CLs did to memory allocation
- why it broke for some people (issue 1464)
- what we should do to fix it.

* What happened

The memory allocator tracks runs of memory pages called
Spans and must maintain a mapping from page address to Span.

On 32-bit machines, tcmalloc maps page address to Span using
a two-level array: the top level has 1024 entries indexed by
the first 10 bits, each pointing at a second level page of
1024 entries indexed by the second 10 bits. (Pages are 4096
bytes, so the final 12 bits are irrelevant.) The Go
allocator followed tcmalloc's lead and did the same thing.
During the recent changes to the allocator, I decided that
this was trading time for space in a fairly ineffective way:
the collector refers to this structure often, and the extra
pointer indirection is noticeable. The only benefit of the
two-level tree compared to a (faster) single 4 MB array is
that it uses less space. However, if you don't touch the
pages of the 4 MB array that you aren't using, the OS won't
map them into memory anyway. We only do lookups in that map
for pointers in the range [min(allocated address),
max(allocated address)), so if the allocated memory is
contiguous, the single 4 MB array uses no more memory than
the two-level array and avoids the time for the extra
indirect.

On 64-bit machines, tcmalloc and the old Go allocator used a
three-level (or four-level, I forget) tree to map 64-bit
addresses (52-bit page numbers) to Spans. On further
reflection this seemed insane: if we assume a maximum arena
size of 16 GB, that's only 22 page bits to map but we're
paying for the generality of 52. So I made that one an
array too (4 million entries * 8 bytes = 32 MB). Again
under the assumption that the allocated memory is
contiguous, there's no real cost on a demand-paged system.

There are some other data structures associated with the
heap now too, to support the new garbage collector. It
dedicates 4 bits per heap word to tracking various metadata.
It's the same lookup problem as with Spans, just with more
data. Like in the Span case, you can't do better than a big
array indexed directly by heap address (translated and
scaled), assuming that the allocated memory is contiguous.

The heap bitmap would be 512 MB for a 4 GB 32-bit address
space (overkill since you'd only have 3.5 GB left on a
32-bit machine), 1 GB for a 16 GB 64-bit address space.
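As a sanity check on those figures (4 bits of metadata per heap word; 4-byte words on 32-bit, 8-byte words on 64-bit):

```go
package main

import "fmt"

// bitmapBytes computes the size of a GC bitmap that dedicates 4 bits
// of metadata to every heap word of an arena of the given size.
func bitmapBytes(arenaBytes, wordSize int64) int64 {
	const bitsPerWord = 4
	return arenaBytes / wordSize * bitsPerWord / 8
}

func main() {
	fmt.Println(bitmapBytes(4<<30, 4) >> 20)  // 4 GB 32-bit arena: 512 (MB)
	fmt.Println(bitmapBytes(16<<30, 8) >> 30) // 16 GB 64-bit arena: 1 (GB)
}
```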

The easy way to make sure the allocated memory is contiguous
is to make a big reservation by using mmap with prot==0 (no
permission) and then change the permissions to rwx during
the first "allocation". The same applies to the bitmap:
reserve it but grow as needed.
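In Unix terms that is a PROT_NONE mmap followed by mprotect on first use. A minimal Go sketch for Linux (the runtime does this in C as SysReserve/SysMap; these functions are stand-ins, not the real code):

```go
package main

import (
	"fmt"
	"syscall"
)

// sysReserve grabs a contiguous stretch of address space without
// committing memory: with prot==0 no page can ever be touched, so a
// demand-paged OS backs the region with nothing.
func sysReserve(n int) ([]byte, error) {
	return syscall.Mmap(-1, 0, n,
		syscall.PROT_NONE,
		syscall.MAP_ANON|syscall.MAP_PRIVATE)
}

// sysMap commits part of a reservation during the first "allocation"
// by changing the page protections. (The runtime uses rwx; rw is
// enough for a demo.)
func sysMap(b []byte) error {
	return syscall.Mprotect(b, syscall.PROT_READ|syscall.PROT_WRITE)
}

func main() {
	r, err := sysReserve(1 << 28) // reserve 256 MB of address space
	if err != nil {
		panic(err)
	}
	if err := sysMap(r[:4096]); err != nil { // commit the first page
		panic(err)
	}
	r[0] = 1 // safe to touch now
	fmt.Println("first page committed:", r[0])
}
```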

So that's what I did for 64-bit systems. It works very well
on my machines.

(On 32-bit systems you pretend the whole address space is
the reservation and take what you're given from it; there's
no room for a real reservation, but the effect is similar.)

* What broke

It is common (I knew this and forgot) for people to use
ulimit -v to limit the amount of memory that a running
program can request. At least that's what people wish it
did. What they're trying to do is limit the amount of
swapping that a program might cause, because swapping makes
the system completely unresponsive. I figured that my
mappings wouldn't get charged to the limit until I actually
asked for memory there (prot == rwx), but in fact ulimit -v
means "virtual address space" and they're serious. Even
though there's no possible adverse consequence, you get
charged for prot == 0. (Except on OS X, where ulimit -v
appears to be ignored always.)

On 32-bit systems, only the initial bitmap gets reserved;
that's 256 MB which is small enough for most people's
ulimit. I don't think people have seen problems on 32-bit
systems.

On 64-bit systems, the 1 GB bitmap and the 16 GB allocation
arena get reserved at startup. Most people who have ulimit
-v set don't have it set high enough to allow 17 GB of
address space, so all Go binaries stopped working for those
people. Making people type
ulimit -v unlimited
is a workaround, but not a very good one. Another, simpler
workaround is for the Go binary to try to do that itself at
startup. That has some charm to it but is not going to be
received well.

* How to fix it

I don't want to give up the many benefits of the simplifying
assumption that the heap is contiguous. It works too well.
The only question is how to make the heap contiguous. It's
okay for the answer to be operating system-dependent: most
details about low-level memory allocation already are.

The 32-bit ports are fine. They look like they are using
256 MB minimum, but if people look at RSS instead of VSIZE
in top, they'll see the true story. (Windows broke too, but
not because of anything relevant to this discussion.)

On darwin/amd64 I think we can keep doing what we're doing,
since the kernel doesn't enforce ulimit -v and what we're
doing is not actually expensive.

On linux/amd64 and freebsd/amd64, we have to do something
different. I think the answer is to pretend that we have a
17 GB reservation at 0xf800000000 even though we don't, and
then when we want to take from it, we die if someone else
has gotten there first. I wrote a few test programs, and
Linux will not give out that address if you don't ask for
it. For allocations where it can pick the address, it
prefers to allocate memory downward from the maximum user
address 0x7fffffffffff, even after some memory is mapped at
our reservation address. In a Go program linked against C
code via cgo, the C allocator would not end up at our
reservation address without hitting it intentionally, which
it doesn't. I expect FreeBSD to be the same but do not have
an available system on which to try it.

So that's my current plan: make SysReserve a no-op on Linux
and FreeBSD and make SysMap abort the program if the address
we want to use is taken. It's admittedly dodgy, but it will
work just fine for pure Go programs and empirically looks
like it will work just fine for Go programs linked with C
libraries too. In an ideal world the kernel would be more
useful.
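The check itself is cheap, because mmap tells you what it did. A sketch in Go syscall terms (illustrative only; the real SysMap is in the runtime's C code): the desired address goes in as a hint without MAP_FIXED, so the kernel never clobbers an existing mapping, and we die if we did not get what we asked for.

```go
package main

import (
	"fmt"
	"syscall"
	"unsafe"
)

// mapAt asks for memory at a fixed address. The address is only a
// hint (no MAP_FIXED), so the kernel picks somewhere else rather than
// clobber an existing mapping; in that case the program aborts,
// because a non-contiguous heap would be worse than an error.
func mapAt(addr uintptr, n int) []byte {
	p, _, errno := syscall.Syscall6(syscall.SYS_MMAP,
		addr, uintptr(n),
		uintptr(syscall.PROT_READ|syscall.PROT_WRITE),
		uintptr(syscall.MAP_ANON|syscall.MAP_PRIVATE),
		^uintptr(0), // fd = -1
		0)
	if errno != 0 {
		panic(errno)
	}
	if p != addr {
		panic("runtime: address space taken by someone else; cannot grow heap")
	}
	return unsafe.Slice((*byte)(unsafe.Pointer(p)), n)
}

func main() {
	const hint = 0xf800000000 // the arena address discussed above
	b := mapAt(hint, 1<<20)
	b[0] = 1
	fmt.Println("got the address we asked for:", b[0] == 1)
}
```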

When windows/amd64 comes along, there will be no problem,
because its VirtualAlloc API is better designed than mmap.

Comments, warnings, protests, counter-proposals welcome.

Russ

wern...@googlemail.com

Feb 7, 2011, 1:16:13 PM
to golan...@googlegroups.com
Hi Russ,

for linux/amd64 the method looks somewhat like "optimistic locking" :-) which works most of the time but fails
sometimes. But with real "optimistic locking", processes usually redo/recover when an optimistic lock fails; they don't die.

IMHO (without looking at the code of the implementation) this method could lead to unexpected behavior if
a Go program links to some library, e.g. C++ wrapped with SWIG, that also uses mmap in ways the Go program
cannot influence.

Would it be possible for the memory management to ask the system for the actual ulimit settings and compute
a sensible size to allocate (reserve) as a contiguous block of memory that fits? On Linux,
getrlimit(RLIMIT_AS, ...) reports the ulimit -v setting. Or is it an absolute requirement to have a 16 GB
contiguous virtual memory array?

Werner

Russ Cox

Feb 7, 2011, 1:28:12 PM
to golan...@googlegroups.com
Thanks for the feedback.

> for Linux/AMD64 the method looks somewhat like "optimistic locking" :-)
> which works most of the time but fails
> sometime. Only real "optimistic locking" processes usually redo/recover if
> an optimistic lock fails, they don't die.

Yes, but optimistic locking is for the case where you still
have a good chance, in aggregate, of ever having a lock conflict.
I believe that the chance of having a conflict at the chosen
address in a 64-bit address space, even in aggregate, is
very nearly zero.

> IMHO (without looking at the code of the implementation) this method
> could lead to unexpected behavior if a Go program links to some
> library, e.g. C++ wrapped with SWIG, that also uses mmap in ways the
> Go program cannot influence.

If by unexpected behavior you mean an error message and an exit,
then yes. It cannot lead to memory corruption, because it is easy
to see if you have succeeded in getting the address you wanted.
And that would only be if the C++ program asks for a fixed mapping
address too, which basically no programs do, and then happens to
ask for the same one we use. It seems rare enough to try and see.

> Would it be possible for the memory management to ask the system for
> the actual ulimit settings and compute a sensible size to allocate
> (reserve) as a contiguous block of memory that fits? On Linux,
> getrlimit(RLIMIT_AS, ...) reports the ulimit -v setting. Or is it an
> absolute requirement to have a 16 GB contiguous virtual memory array?

If anyone ever hits the error message and exit, then this might
be a good fallback. For now I don't see much benefit to it:
reserving up to the limit would mean that linked-in C libraries
would not be able to allocate anything.

Russ

wern...@googlemail.com

Feb 7, 2011, 1:58:27 PM
to golan...@googlegroups.com
Russ,

thanks for the fast reply.

On Monday, February 7, 2011 7:28:12 PM UTC+1, rsc wrote:
> Thanks for the feedback.
>
>> Would it be possible for the memory management to ask the system for
>> the actual ulimit settings and compute a sensible size to allocate
>> (reserve) as a contiguous block of memory that fits? On Linux,
>> getrlimit(RLIMIT_AS, ...) reports the ulimit -v setting. Or is it an
>> absolute requirement to have a 16 GB contiguous virtual memory array?
>
> If anyone ever hits the error message and exit, then this might
> be a good fallback.  For now I don't see much benefit to it:
> reserving up to the limit would mean that linked-in C libraries
> would not be able to allocate anything.


I didn't mean that Go should allocate the full size but, as said, some sensible size. Maybe half of
the available virtual size, or the memory management could evaluate an environment variable, a specific switch
on the command line (similar to the Java memory options), etc. You are right that the chance of hitting the exact
address is near zero - but not zero. And Murphy is just around the corner :-)

Werner


Gustavo Niemeyer

Feb 7, 2011, 2:54:29 PM
to r...@golang.org, golang-dev
> The easy way to make sure the allocated memory is contiguous
> is to make a big reservation by using mmap with prot==0 (no
> permission) and then change the permissions to rwx during
> the first "allocation".  The same applies to the bitmap:
> reserve it but grow as needed.
(...)

> It is common (I knew this and forgot) for people to use
> ulimit -v to limit the amount of memory that a running
> program can request.  At least that's what people wish it
> did.  What they're trying to do is limit the amount of

FWIW, I've been talking to kernel people that know better than I do
since you posted the issue, but I have no great feedback yet.

The consensus so far seems to be that ulimit -v should not be used,
since it arbitrarily limits operations which the user has no reason to
limit (e.g. mapping a large file, even if touching just a portion of
it). I tend to agree with that feeling, and wonder about the impact
it'd have if we simply got the error message to show "drop ulimit -v"
and sustained the implementation you have now.

During the weekend I also asked an old coworker who's been maintaining
Linux VMMs forever in the hope of getting some insightful feedback,
but he was just unnecessarily harsh about the whole problem and
focused instead on completely unrelated topics about Go, which is a
shame.

--
Gustavo Niemeyer
http://niemeyer.net
http://niemeyer.net/blog
http://niemeyer.net/twitter

Russ Cox

Feb 7, 2011, 3:05:56 PM
to Gustavo Niemeyer, golang-dev
> FWIW, I've been talking to kernel people that know better than I do
> since you posted the issue, but I have no great feedback yet.
>
> The consensus so far seems to be that ulimit -v should not be used,
> since it arbitrarily limits operations which the user has no reason to
> limit (e.g. mapping a large file, even if touching just a portion of
> it).  I tend to agree with that feeling, and wonder about the impact
> it'd have if we simply got the error message to show "drop ulimit -v"
> and sustained the implementation you have now.

It's all well and good for kernel developers to point out that
ulimit -v is being abused, but the solution to that is to implement
the functionality that users actually want: limit the number of
physical memory pages that a given program can occupy.

I myself have used ulimit -v to make sure that programs cannot
cause everything else on my machine to swap to disk, so
I have some sympathy here. The alternative is to disable
swap entirely, which may or may not be tenable depending
on the system.

Russ

Gustavo Niemeyer

Feb 7, 2011, 3:32:22 PM
to r...@golang.org, golang-dev
> It's all well and good for kernel developers to point out that
> ulimit -v is being abused, but the solution to that is to implement
> the functionality that users actually want: limit the number of
> physical memory pages that a given program can occupy.
>
> I myself have used ulimit -v to make sure that programs cannot
> cause everything else on my machine to swap to disk, so
> I have some sympathy here.  The alternative is to disable
> swap entirely, which may or may not be tenable depending
> on the system.

I know what you mean. This isn't ideal indeed, but even worse IMO is
the alternative of having non-deterministic behavior with programs
breaking based only on where the kernel happens to put mmap() results.

If there's no better alternative, I'd prefer Werner's approach of
reserving an established percentage of the available virtual address
space and breaking down deterministically once that's reached. People
using ulimit -v would be getting what they asked for, and neither
people with ulimit -v nor people with the full address space available
would see issues with mmap()s getting in the way of the arena.

Russ Cox

Feb 7, 2011, 3:51:55 PM
to Gustavo Niemeyer, golang-dev
> I know what you mean.  This isn't ideal indeed, but even worse IMO is
> the alternative of having non-deterministic behavior with programs
> breaking based only on where the kernel happens to put mmap() results.

It's only non-deterministic if the kernel behaves non-deterministically.
It doesn't: I've checked.

Russ

Gustavo Niemeyer

Feb 7, 2011, 4:14:14 PM
to r...@golang.org, golang-dev

Understood. It still feels brittle, as there are no guarantees of
such behavior.

goo...@capitanio.org

Feb 8, 2011, 12:46:09 PM
to golan...@googlegroups.com, Gustavo Niemeyer
FYI, here is Linus's answer I got on the LKML:

On Tue, Feb 8, 2011 at 4:37 AM, martin capitanio <m...@capitanio.org> wrote:
>
> There popped up a serious problem by implementing a fast memory
> management for the language go. Maybe some experienced kernel hacker
> could join the discussion and help to find the best linux solution for
> the "mmap fiasco" problem.
>

So, quite realistically, we can't change how "ulimit -v" works. It has
well-defined semantics, and they very much are about the mappings, not
about how many pages people use.

There's in theory a RLIMIT_RSS for tracking actual resident pages, but
in practice it doesn't _do_ anything on Linux, because it's not
something we've even bothered to count. It's much simpler and more
unambiguous to just count "how big are the mappings" than counting
individual pages. And as far as I can remember, this is literally the
first time that somebody has cared all that deeply (not to say that
people haven't asked for RSS before, but it's not been a fundamental
part of some design decision of theirs, just a wish-list).

So in theory we could change the kernel and start counting RSS, and
make RLIMIT_RSS do something useful, but in practice that would still
mean that it would take many _years_ before a project like 'go' could
rely on it, since most people don't change the kernel very often
anyway, and even if they did it's not the kernel that actually sets up
the offending RLIMIT_AS (the kernel defaults to "infinity"), but the
distribution or users random .bash_login files or whatever.

So even if the kernel _did_ change, you'd still have this problem in
'go', and you'd still need to do something else.

And quite frankly, I think your "use a big array" in go is a mistake.
You may think it's clever and simple, and that "hey, the OS won't
allocate pages we don't touch", but it's still a serious mistake. And
it's not a mistake because of RLIMIT_AS - that's just a secondary or
tertiary symptom of you being lazy and not doing the right thing.

Think about things like mlockall() (ever imagine mixing 'go' code with
C code that does security-sensitive stuff?).

Or think about things like the kernel trying to be really clever,
noticing that you have a 16GB allocation that is well-aligned, and
deciding to help you (since the system had tons of memory) by using
large pages for it to avoid excessive TLB overhead. Yes, people are
actually working on things like that. Suddenly the page allocation
granularity might be 2MB, not 4kB.

I bet there are other issues like that. On 32-bit, for example, we've
often had problems with people running out of virtual memory size,
since with shared libraries etc, there really isn't all that much free
address space. You only had a 256MB mapping on 32-bit, but quite
frankly, that's about 1/8th of the whole user address space (the 2G/2G
split tends to be normal), and you are basically requiring that there
is that much contiguous virtual address space that you can just waste.
Maybe that's true of all 'go' programs now, but I can tell you that in
the C world, people have done things like "let's link this binary
statically just so that we get maximal virtual address space size,
because we need a contiguous 1GB array for our actual _problem_").
Using some random 256MB virtual allocation just because your tracking
algorithm is lazy sounds like a _bad_ idea.

Finally, I actually think you may well often be better off keeping
your data denser (by using the indirection), and then having a smaller
virtual memory (and thus TLB) lookup footprint. Of course, it sounds
like your previous indexing scheme was very close to what the page
table lookup does anyway, but many problem sets have been better off
using fancy span-based lookup in order to _avoid_ having large arrays,
and the complexity of the scheme can be very much worth it.

In other words, the much deeper fundamental problem of the "one big
array" approach is that you're making tons of assumptions about what
is going on, and then when one of those assumptions isn't correct
("virtual memory size doesn't matter" in this case), you end up
blaming something else than your assumptions. And I think you need to
take another look at the assumption itself.

                        Linus

Russ Cox

Feb 9, 2011, 9:57:42 AM
to golan...@googlegroups.com, Gustavo Niemeyer
Thanks for posting the LKML response.
Most of what Linus says is true but probably not
crucial enough to avoid laziness for now. We can
always change the strategy later if it becomes a
problem.

The comment about large pages would be the most
important reason not to do what we're doing but sounds
more like a kernel bug than our fault. We're being
very up front with the kernel about which memory we
are and are not using: what we're not using has prot==0.
If Linux sees a 16 GB prot==0 mapping and decides to
dedicate >0 bytes of memory to backing it, then that's
not our problem.

Other tools like Native Client use enormous prot==0
mappings. I doubt Linux would ever make the mistake
of giving them real amounts of physical memory.

Russ

Albert Strasheim

Feb 9, 2011, 10:26:27 AM
to golang-dev
Hello

On Feb 9, 4:57 pm, Russ Cox <r...@golang.org> wrote:
> Thanks for posting the LKML response.
> Most of what Linus says is true but probably not
> crucial enough to avoid laziness for now.  We can
> always change the strategy later if it becomes a
> problem.
> The comment about large pages would be the most
> important reason not to do what we're doing but sounds
> more like a kernel bug than our fault.  We're being
> very up front with the kernel about which memory we
> are and are not using: what we're not using has prot==0.
> If Linux sees a 16 GB prot==0 mapping and decides to
> dedicate >0 bytes of memory to backing it, then that's
> not our problem.

I'm a bit concerned about Alan Cox's comment:

"You'll hit problems if the kernel is running with vm overcommit
disabled (as well configured servers do)."

We are planning to do exactly that, on a server that will be running
many, many Go processes.

But maybe virtual memory with prot==0 doesn't factor into the
overcommit accounting?

http://www.mjmwired.net/kernel/Documentation/vm/overcommit-accounting

Another thing that causes a bit of concern is Linus's comment:

"Think about things like mlockall() (ever imaging mixing 'go' code
with C code that does security-sensitive stuff?)."

Is there going to be problem if we want to mlockall a Go slice's
memory as would happen when one uses the InfiniBand libibverbs
library's ibv_reg_mr function?

Any thoughts?

Regards

Albert

Albert Strasheim

Feb 9, 2011, 10:31:16 AM
to golang-dev
Hello again

On Feb 9, 5:26 pm, Albert Strasheim <full...@gmail.com> wrote:
> Another thing that causes a bit of concern is Linus's comment:
> "Think about things like mlockall() (ever imaging mixing 'go' code
> with C code that does security-sensitive stuff?)."
> Is there going to be problem if we want to mlockall a Go slice's
> memory as would happen when one uses the InfiniBand libibverbs
> library's ibv_reg_mr function?

Or alternatively, if we need to mlockall a buffer allocated by calling
mmap(2) directly from a Go process?

mlockall probably expects its input to be aligned on a page boundary,
so mlockalling an arbitrary Go slice might not work.

Regards

Albert

chris dollin

Feb 9, 2011, 10:55:58 AM
to Albert Strasheim, golang-dev

Nitpick: according to the man page I just read, mlockall applies
to all the /current mapped space/, not some part of it, which you'd
use mlock for.

I don't know whether Linus's comment was meant to apply only
to mlockall, or to primarily mlockall but also mlock, or vice-versa,
or really to mlock, and it might make a difference ...

Chris

--
Chris "allusive" Dollin

Russ Cox

Feb 9, 2011, 11:13:59 AM
to Albert Strasheim, golang-dev
Both Linus's and Alan's comments apply to actually
taking the 16 GB with prot==0. We're not going to do
that, because it is incompatible with ulimit -v.

Russ

Florian Weimer

Feb 12, 2011, 4:42:03 PM
to r...@golang.org, golan...@googlegroups.com
* Russ Cox:

> I believe that the chance of having a conflict at the chosen
> address in a 64-bit address space, even in aggregate, is
> very nearly zero.

I believe there are only 48 bits, leaving about 19 bits for avoiding
collisions (I'm too tired to be precise, sorry).  This seems to be
far into the realm of possibility.

This assumes uniformly random placement of mappings, which is probably
too optimistic. Hardware-wise, it might be beneficial to put large
mappings close together, in order to leverage gigantic page sizes.

(Disclaimer: I've been bitten by Exim bug 927, so I'm biased. 8-)

Russ Cox

Feb 12, 2011, 9:04:46 PM
to Florian Weimer, golan...@googlegroups.com
> This assumes uniformly random placement of mappings

It does not. If the kernel were making uniform random
choices then we would indeed be in trouble. But all the
kernels on which Go runs have deterministic strategies
for picking the addresses of mappings, and those strategies
are compatible with Go's strategy (use 0xf800000000).
Go's strategy can be tuned on a per-OS basis, so we're fine.

Russ

Florian Weimer

Feb 13, 2011, 4:38:39 AM
to r...@golang.org, golan...@googlegroups.com
* Russ Cox:

>> This assumes uniformly random placement of mappings
>
> It does not. If the kernel were making uniform random
> choices then we would indeed be in trouble.

What about grsecurity?

> But all the kernels on which Go runs have deterministic strategies
> for picking the addresses of mappings, and those strategies are
> compatible with Go's strategy (use 0xf800000000).

Doesn't mmap() under grsecurity go up to 2**47 on amd64?

Russ Cox

Feb 13, 2011, 11:32:09 AM
to Florian Weimer, golan...@googlegroups.com
>> It does not.  If the kernel were making uniform random
>> choices then we would indeed be in trouble.
>
> What about grsecurity?
>
>> But all the kernels on which Go runs have deterministic strategies
>> for picking the addresses of mappings, and those strategies are
>> compatible with Go's strategy (use 0xf800000000).
>
> Doesn't mmap() under grsecurity go up to 2**47 on amd64?

I don't know what grsecurity is but what a great name (grr.... security!).

As we encounter problems, we'll fix them.

Russ
