Enabling large-page allocations


Mike Lambert

Jul 10, 2009, 2:35:35 PM
to memcached
Currently the -L flag is only enabled if
HAVE_GETPAGESIZES && HAVE_MEMCNTL are defined. I'm curious what the
motivation for that is. In our experience, some of our memcache pools
end up fragmenting memory due to the repeated allocation of 1MB slabs
amid all the other hashtable and free-list allocations going on. We
know we want to allocate all memory up front, but can't seem to do
that on a Linux system.

To put it more concretely, here is a proposed change to make -L do a
contiguous preallocation even on machines without getpagesizes tuning.
My memcached server doesn't seem to crash, but I'm not sure if that's
a proper litmus test. What are the pros/cons of doing something like
this?

Thanks,
Mike

--- memcached.c 2009-07-10 11:22:58.408580000 -0700
+++ ../memcached-1.4.0-orig/memcached.c 2009-07-10 11:22:09.715629000 -0700
@@ -3761,11 +3761,13 @@
            "-f <factor>   chunk size growth factor (default: 1.25)\n"
            "-n <bytes>    minimum space allocated for key+value+flags (default: 48)\n"
+#if defined(HAVE_GETPAGESIZES) && defined(HAVE_MEMCNTL)
            "-L            Try to use large memory pages (if available). Increasing\n"
            "              the memory page size could reduce the number of TLB misses\n"
            "              and improve the performance. In order to get large pages\n"
            "              from the OS, memcached will allocate the total item-cache\n"
            "              in one large chunk.\n"
+#endif
            );
 
     printf("-D <char>     Use <char> as the delimiter between key prefixes and IDs.\n"
@@ -4080,9 +4082,10 @@
             break;
         case 'L' :
 #if defined(HAVE_GETPAGESIZES) && defined(HAVE_MEMCNTL)
-            enable_large_pages();
+            if (enable_large_pages() == 0) {
+                preallocate = true;
+            }
 #endif
-            preallocate = true;
             break;
         case 'C' :
             settings.use_cas = false;

Matt Ingenthron

Jul 10, 2009, 4:37:34 PM
to memc...@googlegroups.com
Mike Lambert wrote:
> Currently the -L flag is only enabled if
> HAVE_GETPAGESIZES && HAVE_MEMCNTL are defined. I'm curious what the
> motivation for that is. In our experience, some of our memcache pools
> end up fragmenting memory due to the repeated allocation of 1MB slabs
> amid all the other hashtable and free-list allocations going on. We
> know we want to allocate all memory up front, but can't seem to do
> that on a Linux system.
>

The primary motivation was more about not beating up the TLB cache on
the CPU when running with large heaps. There are users with large heaps
already, so this should help if the underlying OS supports large pages.
TLB cache sizes are getting bigger in CPUs, but virtualization is more
common and memory heaps are growing faster.

I'd like to have some empirical data on how big a difference the -L flag
makes, but that assumes a workload profile. I should be able to hack
one up and do this with memcachetest, but I've just not done it yet. :)

> To put it more concretely, here is a proposed change to make -L do a
> contiguous preallocation even on machines without getpagesizes tuning.
> My memcached server doesn't seem to crash, but I'm not sure if that's
> a proper litmus test. What are the pros/cons of doing something like
> this?
>

This feels more closely related to the -k flag, and it should probably
be using madvise() in there somewhere too. It wouldn't necessarily be a
bad idea to separate these. I don't know that the day after 1.4.0 is
the day to redefine -L, but it's not necessarily bad either. We should
wait for Trond's response to see what he thinks about this, since he
implemented it. :)

Also, I did some testing with this (-L) some time back (admittedly on
OpenSolaris), and the actual behavior will vary based on the memory
allocation library you're using and what it does with the OS
underneath. I didn't try Linux variations, but that may be worthwhile
for you. IIRC, the default malloc waits for a page fault to do the
actual memory allocation, so there'd still be a risk of fragmentation.
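
To make the "waits for a page fault" point concrete, here's a minimal
standalone sketch (illustrative only, not memcached code;
prealloc_and_touch is a made-up name) of forcing the backing pages in
up front instead of lazily under load:

#include <stdlib.h>
#include <unistd.h>
#include <sys/mman.h>

static void *prealloc_and_touch(size_t limit) {
    char *base = malloc(limit);
    if (base == NULL)
        return NULL;
    /* Touch one byte per page so every page is faulted in now,
       rather than at some random point later under load. */
    size_t page = (size_t)sysconf(_SC_PAGESIZE);
    for (size_t off = 0; off < limit; off += page)
        ((volatile char *)base)[off] = 0;
    /* Optionally pin it, similar in spirit to what -k does via mlockall(). */
    (void)mlock(base, limit);
    return base;
}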

- Matt

Mike Lambert

Jul 13, 2009, 11:38:55 PM
to memcached
Haha, yeah, the release of 1.4.0 reminded me I wanted to send this
email. Sorry for the bad timing.

-k keeps the memory from getting paged out to disk (which is a very
good thing in our case).
-L appears to me (not knowing what getpagesizes does) to be related to
preallocation with big allocations, which I thought was what I wanted.

If you want, I'd be just as happy with a -A flag that turns on
preallocation but without any of the getpagesizes() tuning. It'd force
one big slab allocation and that's it.
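
To be concrete about what I mean by "one big slab allocation", here's a
rough sketch (illustrative only; names like mem_base and grab_slab_page
are made up, not the actual memcached internals): grab the whole
item-cache budget in one chunk up front and carve the slab pages out of
it, instead of malloc'ing each page interleaved with everything else.

#include <stdlib.h>

static char  *mem_base    = NULL;   /* one big contiguous chunk           */
static char  *mem_current = NULL;   /* next unused byte inside the chunk  */
static size_t mem_avail   = 0;      /* bytes left inside the chunk        */

static int prealloc_item_cache(size_t limit) {
    mem_base = malloc(limit);
    if (mem_base == NULL)
        return -1;
    mem_current = mem_base;
    mem_avail   = limit;
    return 0;
}

/* Each 1MB slab page is carved out of the chunk instead of being a
   fresh malloc() mixed in with hashtable/connection/buffer allocations. */
static void *grab_slab_page(size_t size) {
    if (size > mem_avail)
        return NULL;
    void *ret = mem_current;
    mem_current += size;
    mem_avail   -= size;
    return ret;
}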

> Also, I did some testing with this (-L) some time back (admittedly on
> OpenSolaris), and the actual behavior will vary based on the memory
> allocation library you're using and what it does with the OS
> underneath. I didn't try Linux variations, but that may be worthwhile
> for you. IIRC, the default malloc waits for a page fault to do the
> actual memory allocation, so there'd still be a risk of fragmentation.

We do use Linux, but haven't tested in production with my modified -L
patch. What I *have* noticed is that when we allocate a 512MB
hashtable, it shows up in Linux as an mmap-ed contiguous block of
memory. From http://m.linuxjournal.com/article/6390: "For very
large requests, malloc() uses the mmap() system call to find
addressable memory space. This process helps reduce the negative
effects of memory fragmentation when large blocks of memory are freed
but locked by smaller, more recently allocated blocks lying between
them and the end of the allocated space."

I was hoping to get the same kind of large mmap for all our slabs, out
of the way in a different part of the address space so it doesn't
interfere with the actual memory allocator; the Linux allocator could
then focus on balancing just the small allocations without any page
waste.
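
In code terms, a minimal sketch of the idea (not a patch;
reserve_item_cache is a made-up name) would be to reserve the whole
item cache with a single anonymous mmap() so it stays out of the
brk()-managed heap entirely:

#include <stddef.h>
#include <sys/mman.h>

static void *reserve_item_cache(size_t limit) {
    /* One anonymous mapping for the whole item cache; glibc's malloc
       then only has to juggle the small allocations. */
    void *base = mmap(NULL, limit, PROT_READ | PROT_WRITE,
                      MAP_PRIVATE | MAP_ANONYMOUS, -1, 0);
    return (base == MAP_FAILED) ? NULL : base;
}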

Thanks,
Mike

Mike Lambert

Jul 15, 2009, 8:47:32 PM
to memcached
Trond, any thoughts?

I'd like to double-check that there isn't a reason we can't support
preallocation without getpagesizes() before attempting to manually
patch memcache and play with our production system here.

Thanks,
Mike

Matt Ingenthron

Jul 15, 2009, 10:40:03 PM
to memc...@googlegroups.com
Hi Mike,

Mike Lambert wrote:
> Trond, any thoughts?
>

Trond is actually on vacation, but I did steal a few cycles of his time
and asked about this.


> I'd like to double-check that there isn't a reason we can't support
> preallocation without getpagesizes() before attempting to manually
> patch memcache and play with our production system here.
>

There's no reason you can't do that. There may be a slightly cleaner
integration approach that Trond and I talked through. I'll try to code
that up here in the next few days... but for now you might try your
approach and see if it helps alleviate the issue you were seeing.

Incidentally, how did the memory fragmentation manifest itself on your
system? I mean, could you see any effect on apps running on the system?

Mike Lambert

Jul 16, 2009, 6:13:36 AM
to memc...@googlegroups.com
Basically, process memory was growing very slowly over time, eventually
causing the machine to swap. It was leveling out (not a leak), but at a
level higher than we expected, even with the hashtable and maxbytes
accounted for. So I was poking around at memory usage, and decided that
fragmentation was to blame.


Looking again right now at a machine configured with -m 6000 (so
~6GB), I see "stats maps" showing a 512MB hashtable and a 7.5GB heap.

"stats malloc" (which isn't 64-bit aware) gives:
STAT mmapped_space 564604928 # this has the 512MB hashtable
STAT arena_size -1058820096
STAT total_alloc -2040194320
STAT total_free 981374224
where arena_size = total_alloc + total_free.

Knowing that the total size of the heap is 7.5gb, I can derive that
real_arena_size = -1058820096 + 2**32 * 3 = 7531114496. Doing
total_free/real_arena_size gives 13%, which is my estimate for
free-but-unallocated ram. (Free due to fragmentation or
not-yet-allocation is hard to tell, but that number is still very
high.)
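
(In case anyone wants to check that arithmetic, here's a throwaway
snippet doing the 32-bit unwrapping; the constants are just the numbers
above.)

#include <stdint.h>
#include <stdio.h>

int main(void) {
    int32_t arena_size = -1058820096;   /* as reported by "stats malloc" */
    int32_t total_free = 981374224;
    /* The counter wrapped twice, so add back 2 * 2^32. */
    int64_t real_arena = (int64_t)arena_size + 2 * (1LL << 32);

    printf("real arena size: %lld\n", (long long)real_arena); /* 7531114496 */
    printf("free fraction:   %.1f%%\n",
           100.0 * (double)total_free / (double)real_arena);  /* ~13.0% */
    return 0;
}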

Alternatively, one could ask why we have a 7.5GB heap for a 6GB
memcache... why so much RAM? I calculated 100MB-200MB for 7600
connections plus some various free lists, but I was running into the
problem that total_free indicates there are still 981MB of unused RAM
in the heap. So I think at the time I concluded this was due to
fragmentation.


We solved our problem by reducing the amount of RAM we gave to
memcache so we didn't swap, but in theory getting an extra 10-13% of
RAM out of our memcaches sounds like a great idea. So, given my
fragmentation conclusion, I was looking for ways to reduce it.


Thoughts? Is there perhaps another explanation for the data above?

Thanks,
Mike

Mike Lambert

Jul 16, 2009, 6:15:26 AM
to memc...@googlegroups.com
Incidentally, why was "mallinfo" removed from memcache 1.4.0? Even
without it being 64-bit aware, it still provided some useful data that
I wasn't able to get via other means in our 1.2.6 binaries.

Mike

Trond Norbye

Jul 16, 2009, 7:05:14 AM
to memc...@googlegroups.com

On 16. juli. 2009, at 12.15, Mike Lambert wrote:

>
> Incidentally, why was "mallinfo" removed from memcache 1.4.0? Even
> without it being 64-bit aware, it still provided some useful data that
> I wasn't able to get via other means in our 1.2.6 binaries.
>

I wanted to remove the mallinfo call from memcached because I don't
think it belongs in the memcached protocol, but this is something you
can get from other tools (like pmap).

Another problem with using mallinfo is that not all memory allocators
implement mallinfo. Checking for mallinfo in configure would (at least
on Solaris) cause memcached to link with libmalloc, and if you look at
the manual page for libmalloc you will find:

DESCRIPTION
     Functions in this library provide routines for memory allocation.
     These routines are space-efficient but have lower performance.
     Their usage can result in serious performance degradation.

You may think that this wouldn't be a problem because we use our own
memory allocator inside memcached, but that's not true. The slab
allocator is _only_ used to store the items; all other memory
allocations are done through malloc (hash tables, connection structs
and buffers, the suffix pool, etc.).

Cheers,

Trond
