Fragmentation


Martin Bligh

Mar 22, 2016, 4:26:50 PM
to gperf...@googlegroups.com
I've been looking at fragmentation issues with tcmalloc in relation to MongoDB, and have a few questions and observations.
We can end up with tcmalloc using about 2x the size of actual useful allocated memory, which is problematic at scale.
I should preface this by saying I'm unfamiliar with tcmalloc, but have worked on the Linux kernel allocator and similar before.

We're seeing memory pinned in two main places - the pageheap and the central freelist. We don't see much stuck in thread caches,
at least with workloads with small numbers of threads. That might be because we have tweaks to shrink the thread caches on thread idle.

The pageheap is resolved by your more recent change to use aggressive decommit.
I presume the more aggressively we free back to the system here, the better for fragmentation, as the kernel can consolidate virtually
contiguous space by changing the phys-to-virt mapping; but we run the risk of calling mmap more and ending up with a high
count of virtual regions in Linux, which may not scale well.

Re the central freelist:

1. The most obvious solution seems to be to use smaller span sizes - 1 object can pin a whole span in memory, so it seems the worst case
is proportional to the number of objects per span. Changing the page size from 8k to 4k and the divisor on blocks_to_move in SizeMap::Init()
from 4 to 8 seems to help a lot. I can't see any measurable perf impact, but it would seem we're bound to hit the system allocator harder.

2. We could use fewer span sizes, but that seems to just tradeoff internal vs external fragmentation. One possible idea is to change
the code in SizeMap::Init 
      if (my_objects == prev_objects) {
        // Adjust last class to include this size
        class_to_size_[sc-1] = size;
        printf("Adjusting class %d for %d %lu %d %lu\n", sc-1, sc, size, blocks_to_move, my_pages);
        continue;
      }
to have a less stringent check than my_objects == prev_objects, i.e. check whether the waste would be less than the 12.5% threshold for
both classes. Some care is needed when merging multiple size classes in a row. Curious about any comments people might have on this?

3. The size classes that result in only 1 object per span make me question how useful they are. With aggressive decommit, I'm assuming
each one allocates and deallocates directly to the system allocator? The transfer cache idea seems to break down with only 1 object
per span (as far as I understand it), though it does enable the thread caches to hold on to larger allocations. Whether that's good or not
seems questionable ... performance vs. potentially large wasted memory, especially at high thread count? I'm experimenting with a 16k
limit, and again I'm curious about the experience/opinions of others.
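On point 2, the relaxed merge check could be sketched like this (a toy illustration of the idea, not tcmalloc's actual C++; the function name and parameters are made up, and 12.5% is the waste threshold mentioned above):

```python
def can_merge(smallest_size_in_class, merged_class_size, max_waste=0.125):
    """True if rounding smallest_size_in_class up to merged_class_size
    wastes at most max_waste of each allocation (internal fragmentation)."""
    waste = (merged_class_size - smallest_size_in_class) / merged_class_size
    return waste <= max_waste

# 896-byte requests rounded up to a 1024-byte class waste exactly 12.5%,
# so those classes could merge; 768-byte requests would waste 25% and cannot.
ok = can_merge(896, 1024)
too_wasteful = can_merge(768, 1024)
```

Re-checking against the smallest size in the combined class after each successive merge would address the "multiple size classes in a row" concern.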

Thanks,

M.

Aliaksey Kandratsenka

Mar 23, 2016, 1:43:40 AM
to Martin Bligh, gperf...@googlegroups.com
On Tue, Mar 22, 2016 at 1:26 PM, Martin Bligh <martin...@mongodb.com> wrote:
> I've been looking at fragmentation issues with tcmalloc in relation to
> MongoDB, and have a few questions and observations.
> We can end up with tcmalloc using about 2x the size of actual useful
> allocated memory, which is problematic at scale.
> I should preface this by saying I'm unfamiliar with tcmalloc, but have
> worked on Linux kernel allocator etc before.

Hi Martin. All of your ideas below look interesting.

I'm very curious to hear more about your case. All cases of 2x
fragmentation that I'm aware of involved some kind of attempt to
do longer-term allocation of variable-size blobs. The most common use
case is caching. For example, at couchbase they're still doing per-key
caching of values in RAM, and they still use malloc to allocate memory
for those values, as opposed to having page-cache-like block-level
caching, which is arguably less efficient. Changes in average value
size trigger fragmentation challenges for them.

My impression of what MongoDB was doing is that it didn't have any
caches with variable-size items. I.e. the original mongo (what is now
the mmap engine, I believe) was just doing block-level caching. If so,
then all malloc usage is essentially temporary per-request allocation.

Has any of this changed recently?

I'm asking because if you indeed try to use tcmalloc (or in fact any
malloc) for a variable-size item cache, then my experience is that you
might have to consider special memory allocation and/or compaction
measures, at least for a reasonably ambitious definition of "work". If
you need an example from kernel space, then compressed swap is exactly
this use case, where "after compression" values, whose size
distribution may change over time, are hard to manage in a way that is
both fast and has little fragmentation overhead. It looks like they
went beyond just doing a custom low-fragmentation malloc and towards
compaction in order to get decent fragmentation levels.

So if you can add more details on the exact use case that is triggering
this condition, I'd be very interested.

>
> We're seeing memory pinned in two main places - the pageheap and the central
> freelist. We don't see much stuck in thread caches,
> at least with workloads with small numbers of threads. That might be because
> we have tweaks to shrink the thread caches on thread idle.
>
> The pageheap is resolved by your more recent change to use aggressive
> decommit.
> I presume the more aggressively we free back to the system here, the better
> for fragmentation as the kernel can consolidate virtually
> contiguous space by changing the phys to virt mapping, but we stand the
> danger of calling mmap more and ending up with a high
> count of virtual regions in Linux, which may not scale well.

Let me also comment on page heap matters a bit more.

I believe that the bigger, longer-term work in this area will involve
better support for dealing with transparent huge pages. I've seen a
number of services at Google that could afford to disable releasing
memory back to the kernel, and for which transparent huge pages were a
performance win on the order of 10%. Such a significant win on CPU isn't
something we can ignore. And therefore I believe malloc
implementations will get progressively better at working with
transparent huge pages.

So it means that tcmalloc will need to get better at avoiding
fragmenting its pageheap while being less aggressive w.r.t. releasing
memory back to the kernel.

Martin Bligh

Mar 23, 2016, 10:40:56 AM
to Aliaksey Kandratsenka, gperf...@googlegroups.com
On Wed, Mar 23, 2016 at 1:43 AM, Aliaksey Kandratsenka <alkond...@gmail.com> wrote:
> On Tue, Mar 22, 2016 at 1:26 PM, Martin Bligh <martin...@mongodb.com> wrote:
>> I've been looking at fragmentation issues with tcmalloc in relation to
>> MongoDB, and have a few questions and observations.
>> We can end up with tcmalloc using about 2x the size of actual useful
>> allocated memory, which is problematic at scale.
>> I should preface this by saying I'm unfamiliar with tcmalloc, but have
>> worked on the Linux kernel allocator and similar before.
>
> Hi Martin. All of your ideas below look interesting.
>
> I'm very curious to hear more about your case. All cases of 2x
> fragmentation that I'm aware of involved some kind of attempt to
> do longer-term allocation of variable-size blobs. The most common use
> case is caching. For example, at couchbase they're still doing per-key
> caching of values in RAM, and they still use malloc to allocate memory
> for those values, as opposed to having page-cache-like block-level
> caching, which is arguably less efficient. Changes in average value
> size trigger fragmentation challenges for them.

Yup, I think ours is a similar situation. We're taking a substantial part of system RAM
(e.g. 60% by default) and effectively calling malloc for each new "row" that comes in,
among other allocations. Later we are likely to compact those down into larger "pages" (~32K).
Any "update" of a row causes a free of the old size, and an allocation of the new one.

Thus if you create a large table, and update (normally growing) rows at random,
you will end up with spans of the "old" sizes with lots of holes in them.
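That hole-punching effect can be shown with a toy simulation (all parameters are made up for illustration; this models spans abstractly rather than using tcmalloc): rows start out packed into spans of the old size class, each update frees one slot, and a span can only be returned to the pageheap once every slot in it is free.

```python
import random

def pinned_span_fraction(rows=12800, objects_per_span=128,
                         update_fraction=0.9, seed=1):
    """Fraction of old-size spans still pinned after random row updates.

    Each update frees one object from its span; a span is reclaimable
    only when its live count drops to zero."""
    random.seed(seed)
    spans = rows // objects_per_span
    live = [objects_per_span] * spans          # live objects per span
    for idx in random.sample(range(rows), int(rows * update_fraction)):
        live[idx // objects_per_span] -= 1     # update frees the old slot
    return sum(1 for n in live if n > 0) / spans
```

Even after 90% of the rows have been updated, essentially every span still holds a few survivors, so almost none of the memory can be returned - which matches the roughly 2x overhead described at the top of the thread.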

> My impression of what MongoDB was doing is that it didn't have any
> caches with variable-size items. I.e. the original mongo (what is now
> the mmap engine, I believe) was just doing block-level caching. If so,
> then all malloc usage is essentially temporary per-request allocation.
>
> Has any of this changed recently?

I'm looking specifically at our new "WiredTiger" storage engine, which makes much
heavier use of tcmalloc than mmap does. That's the default engine if you're using
the current version of MongoDB (3.2.x).

> I'm asking because if you indeed try to use tcmalloc (or in fact any
> malloc) for a variable-size item cache, then my experience is that you
> might have to consider special memory allocation and/or compaction
> measures, at least for a reasonably ambitious definition of "work". If
> you need an example from kernel space, then compressed swap is exactly
> this use case, where "after compression" values, whose size
> distribution may change over time, are hard to manage in a way that is
> both fast and has little fragmentation overhead. It looks like they
> went beyond just doing a custom low-fragmentation malloc and towards
> compaction in order to get decent fragmentation levels.
>
> So if you can add more details on the exact use case that is triggering
> this condition, I'd be very interested.
 
One issue is that the lifespan of an allocated object is unpredictable at allocation time.
There are certainly some things we could separate out (and intend to), e.g.:

1. Allocations that are known to be local to only this operation. My feeling is we should
try to take those from the stack where possible.

2. Allocations that are likely to last longer (e.g. internal nodes of the "data tree" vs. external).
Those could maybe come from a specialized pool.

Neither of these will eliminate fragmentation, but both should help. I think it's a
general-case problem though; from what I can see, it does affect other projects, so I thought
using smaller spans might be of general interest.
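For allocations in category 1, a per-operation arena is one common shape for this kind of separation; here's a toy bump-pointer sketch (illustrative only, not MongoDB or tcmalloc code - the class and its size are invented):

```python
class OperationArena:
    """Toy bump-pointer arena for allocations scoped to one operation.

    Everything is freed at once by reset(), so nothing from a short-lived
    operation can pin a span in the general-purpose allocator."""

    def __init__(self, size=1 << 16):
        self.buf = bytearray(size)
        self.top = 0

    def alloc(self, nbytes, align=8):
        # Round the bump pointer up to the requested alignment.
        start = (self.top + align - 1) & ~(align - 1)
        if start + nbytes > len(self.buf):
            raise MemoryError("arena exhausted")
        self.top = start + nbytes
        return start  # offset into self.buf, standing in for a pointer

    def reset(self):
        self.top = 0  # end of operation: drop everything in O(1)
```

The point is that lifetime becomes a property of the arena, not of each object, sidestepping the "unpredictable lifespan" problem for this category.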

>> We're seeing memory pinned in two main places - the pageheap and the central
>> freelist. We don't see much stuck in thread caches,
>> at least with workloads with small numbers of threads. That might be because
>> we have tweaks to shrink the thread caches on thread idle.
>>
>> The pageheap is resolved by your more recent change to use aggressive
>> decommit.
>> I presume the more aggressively we free back to the system here, the better
>> for fragmentation as the kernel can consolidate virtually
>> contiguous space by changing the phys-to-virt mapping, but we run the
>> risk of calling mmap more and ending up with a high
>> count of virtual regions in Linux, which may not scale well.
>
> Let me also comment on page heap matters a bit more.
>
> I believe that the bigger, longer-term work in this area will involve
> better support for dealing with transparent huge pages. I've seen a
> number of services at Google that could afford to disable releasing
> memory back to the kernel, and for which transparent huge pages were a
> performance win on the order of 10%. Such a significant win on CPU isn't
> something we can ignore. And therefore I believe malloc
> implementations will get progressively better at working with
> transparent huge pages.

The issue that concerns me here is that I am extremely skeptical that
transparent huge pages work well in practice. I've discussed this somewhat
with Andy and Mel (who implemented it), but it's difficult to reproduce
the failure cases outside of a long-running production environment. Huge
pages are fine when you allocate them at boot time and hold onto them
forever, but the way Linux tries to do decompaction is problematic,
and seems to cause long system stalls in the reclaim code. We advise
people to disable transparent huge pages for exactly this reason.

It seems that entropy will slowly cause worst-case fragmentation over time, but it's
difficult to reproduce in artificial tests, and thus to work on. The more layers of caching
we have, the harder it gets .... imagine a span with 1000 objects/span (e.g. 8-byte allocs).
The odds that the last remaining object needed to free the whole span is
in use, in the transfer cache, or in the thread cache of one of maybe thousands of
threads seem high ... I've not yet found any part of tcmalloc that does "targeted"
reclaim, and if the application still has it allocated, I don't see what you can do?
There's no callback in the malloc interface to ask an app to remap an object
(short of remapping with the OS at 4K granularity, which seems a little crazy).
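Under the simplifying assumption that each object in a span is independently live with some small residual probability, the chance of ever reclaiming a many-object span collapses quickly:

```python
def span_free_probability(objects_per_span, live_fraction):
    """P(an entire span is free) when each object is independently live
    with probability live_fraction (an idealized model, not tcmalloc)."""
    return (1.0 - live_fraction) ** objects_per_span

# With 1000 objects per span, even 1% residual liveness makes a fully
# free span vanishingly rare: (0.99)**1000 is on the order of 1e-5.
p = span_free_probability(1000, 0.01)
```

Real liveness isn't independent, of course, but the exponential shape is why many-objects-per-span size classes are so hard to reclaim without targeted compaction.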

> So it means that tcmalloc will need to get better at avoiding
> fragmenting its pageheap while being less aggressive w.r.t. releasing
> memory back to the kernel.

Right, I think that'd be good in general - worst case, the page heap does reduce
calls down to mmap. I also think we could more safely use smaller span sizes
(which is good for fragmentation) if we had the pageheap underneath us, without
going back to mmap etc. so often.

Maybe aggressive decommit is too aggressive, and rather than an immediate free,
the pageheap could hold on to 5% of currently allocated memory or something?
Having said that, I can't actually see any performance problems from using aggressive decommit.
Might be tricky to blend that with trying to keep whole 2M sections allocated for
hugepages.
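The "hold on to 5%" idea could be sketched as a decommit policy like this (the fraction, function name, and shape are all made up for illustration; this is not tcmalloc's pageheap code):

```python
def pages_to_release(free_pages, allocated_pages, reserve_fraction=0.05):
    """Decommit free pages back to the OS, but keep a reserve sized as a
    fraction of currently allocated memory to absorb near-term spikes."""
    reserve = int(allocated_pages * reserve_fraction)
    return max(0, free_pages - reserve)

# With 1000 pages allocated, a 50-page reserve is kept back:
# pages_to_release(100, 1000) -> 50; pages_to_release(30, 1000) -> 0.
```

Keying the reserve to allocated (rather than free) memory makes it scale with the workload, which is closer in spirit to the 5%-of-allocated suggestion above.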

Apologies for the length of this email ;-)

M.


Aliaksey Kandratsenka

Apr 9, 2016, 3:43:37 PM
to Martin Bligh, gperftools
Thanks a lot for the detailed explanation. And apologies for the delay.

This is indeed the case that I was expecting. Whatever improvements
you can contribute to gperftools for this use case, I'll be happy
to consider.

But I think the slab-ful approach, where big chunks of memory can only be
used exclusively for a specific size class, is IMHO simply hopeless for
such a use case, unless some form of compaction is used (this is what
AFAIK couchbase does now and what zram does as well, albeit in both
cases quite naively).

Back at couchbase I played with a least-address best-fit allocator
just for variable-sized blobs (the code is here:
https://github.com/alk/minimalloc). It was an experiment to see whether
fragmentation goes away if variable-sized blobs use an inherently better
allocator. And indeed it was working great when I changed the average
size of cached values.
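The core policy of a least-address best-fit allocator can be sketched in a few lines (a generic illustration of the policy, not minimalloc's actual implementation):

```python
def best_fit_lowest_address(free_blocks, size):
    """Choose the smallest free block that fits; break ties by lowest
    address. Packing toward low addresses tends to leave large contiguous
    free space at the top, reducing external fragmentation.

    free_blocks: list of (address, length) pairs; returns the chosen
    pair, or None if nothing fits."""
    candidates = [b for b in free_blocks if b[1] >= size]
    if not candidates:
        return None
    return min(candidates, key=lambda b: (b[1], b[0]))
```

A real allocator would of course need efficient indexing of the free list (e.g. a size-ordered tree) rather than a linear scan, but the selection rule is the essence of the approach.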

One issue that stopped me from pushing it further was that couchbase
may need lots of memory for tiny bits of metadata (for replication),
and sometimes it needs to push cached values out of RAM to fit more of
those entries. So using one allocator for values and tcmalloc for
everything else (including hash table entries) simply doesn't allow
flexibly changing the memory given to values and to queue entries at
run time. If mongo's use case doesn't have such an unfortunate trait, and
if you can simply statically give, say, 60% of RAM to variable-sized
allocations, then I think you should consider a special low-fragmentation
allocator for the variable-sized part.

One particularly interesting piece of research in this area is ramcloud's
log-structured memory allocator:
http://www.scs.stanford.edu/~rumble/papers/fast14-paper_rumble.pdf.
They have some tests that also show how slab-based allocators suck
when the size of allocated items changes, and how dlmalloc-based
allocators do way better. IMHO their approach is maybe too aggressive
in practice, but still worth keeping in mind.