On Tue, Mar 22, 2016 at 1:26 PM, Martin Bligh <martin...@mongodb.com> wrote:
> I've been looking at fragmentation issues with tcmalloc in relation to
> MongoDB, and have a few questions and observations.
> We can end up with tcmalloc using about 2x the size of actual useful
> allocated memory, which is problematic at scale.
> I should preface this by saying I'm unfamiliar with tcmalloc, but have
> worked on Linux kernel allocator etc before.
Hi Martin. All of your ideas below look interesting.

I'm very curious to hear more about your case. All cases of 2x
fragmentation that I'm aware of involved some kind of attempt to do
longer-term allocation of variable-size blobs. The most common use case
is caching. For example, at Couchbase they still do per-key caching of
values in RAM, and they still use malloc to allocate memory for those
values (as opposed to page-cache-style block-level caching, which is
arguably less efficient). A change in average value size is what
triggers fragmentation problems for them.
My impression of what MongoDB was doing is that it didn't have any
caches with variable-size items. I.e. the original mongo engine (what
is now the mmap engine, I believe) just did block-level caching. If so,
then all malloc usage is essentially temporary per-request allocation.
Has any of this changed recently?
I'm asking because if you do try to use tcmalloc (or in fact any
malloc) for a variable-size item cache, then my experience is that you
may have to consider special memory allocation and/or compaction
measures, at least for a reasonably ambitious definition of "working".
If you want an example from kernel space, compressed swap is exactly
this use case: "after compression" values, whose size distribution may
change over time, are hard to manage in a way that is both fast and has
little fragmentation overhead. It looks like the kernel developers went
beyond just a custom low-fragmentation allocator and towards compaction
in order to get decent fragmentation levels.

So if you can add more details on the exact use case that is triggering
this condition, I'd be very interested.
>
> We're seeing memory pinned in two main places - the pageheap and the central
> freelist. We don't see much stuck in thread caches,
> at least with workloads with small numbers of threads. That might be because
> we have tweaks to shrink the thread caches on thread idle.
>
> The pageheap is resolved by your more recent change to use aggressive
> decommit.
> I presume the more aggressively we free back to the system here, the better
> for fragmentation, as the kernel can consolidate virtually
> contiguous space by changing the virt-to-phys mapping; but we stand the
> danger of calling mmap more often and ending up with a high
> count of virtual regions in Linux, which may not scale well.
Let me also comment on pageheap matters a bit more.

I believe the bigger, longer-term work in this area will involve better
support for dealing with transparent huge pages. I've seen a number of
services at Google that could afford to disable releasing memory back
to the kernel, and for which transparent huge pages were a performance
win on the order of 10%. A CPU win that significant isn't something we
can ignore, and therefore I believe malloc implementations will get
progressively better at working with transparent huge pages.

So it means that tcmalloc will need to get better at avoiding
fragmenting its pageheap while being less aggressive about releasing
memory back to the kernel.
> To view this discussion on the web visit
> https://groups.google.com/d/msgid/gperftools/CAKZJQnnts_OCa1kgSOd6Fk1egB%2BP%2BTtmpLWEwnFqsF4to92Tpw%40mail.gmail.com.