1.4.9-beta1 and mc-crusher

dormando

Oct 5, 2011, 9:54:44 PM
to memc...@googlegroups.com
Hey,

Enjoying 1.4.8? Thought I'd share some rough things you folks might
find interesting:

https://github.com/dormando/mc-crusher
^ I've thrown my hat into the ring of benchmark utilities. This is
probably on par with some work Dustin's been doing, but I went in a
slightly different direction with features.

Now, 1.4.9-beta1:

http://memcached.googlecode.com/files/memcached-1.4.9_beta1.tar.gz

Which is the result of the 14perf tree, now up:

https://github.com/memcached/memcached/commits/14perf

This beta will be up for at least two weeks before going final. The
changes need more tuning, and some normal bugfixes/feature fixes need to
go in as well. I'm giving it to you folks early so it has a good long
soak.

Major changes:

- The "Big cache lock" is much shorter. Partly influenced by the patches
Ripduman Sohan sent, as well as me trying 3 different approaches and
failing back to this one.

- a "per-item" hash table of mutex locks is used to widen the amount of
locks available. There are many instances where we don't want two threads
to progress on the same item in parallel, but many fewer places where it's
paramount for the hash table and LRU to be accessed by a single thread.

- cache_lock uses a pseudo spinlock (second sketch below). In my bench
testing, preventing threads from going to sleep when hitting the short
cache_lock helped thread scalability quite a bit.

- item_alloc no longer does a depth search for items to expire or evict
(third sketch below). I gave it a lot of thought and am dubious it ever
helped: if you don't have any expired items at the tail, it always
iterates over 50 items, which is slow. Removing it was one of the
larger performance improvements among the changes I made.

- Hash calculations are now mostly done outside of the big lock. This
change was already made in the 1.6 tree.
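
To illustrate the lock striping idea (a sketch only; the names and the
lock count are mine, not the actual 1.4.9 code): the key's hash value
selects one mutex from a fixed pool, so two threads only serialize when
their keys happen to land on the same lock.

#include <pthread.h>
#include <stdint.h>

/* Illustrative lock striping: a fixed pool of mutexes selected by
 * the key's hash value. The pool size here is made up. */
#define ITEM_LOCK_COUNT 1024

static pthread_mutex_t item_locks[ITEM_LOCK_COUNT];

void item_locks_init(void) {
    for (int i = 0; i < ITEM_LOCK_COUNT; i++)
        pthread_mutex_init(&item_locks[i], NULL);
}

/* hv is the key's hash value, computed outside the big lock. */
void item_lock(uint32_t hv) {
    pthread_mutex_lock(&item_locks[hv % ITEM_LOCK_COUNT]);
}

void item_unlock(uint32_t hv) {
    pthread_mutex_unlock(&item_locks[hv % ITEM_LOCK_COUNT]);
}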
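
A pseudo spinlock can be built a few different ways; here's one common
shape, using the GCC atomic builtins (again a sketch under my own
naming, not necessarily what the beta does internally):

#include <sched.h>

typedef volatile int spinlock_t;

/* Spin on an atomic flag instead of sleeping in the kernel. With
 * the critical sections behind cache_lock now very short, spinning
 * usually beats a mutex sleep/wakeup cycle. */
static inline void spin_lock(spinlock_t *lock) {
    while (__sync_lock_test_and_set(lock, 1)) {
        /* Yield while contended so we don't starve the lock holder
         * when there are more threads than cores. */
        while (*lock)
            sched_yield();
    }
}

static inline void spin_unlock(spinlock_t *lock) {
    __sync_lock_release(lock);
}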
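
And roughly what the removed depth search looked like (illustrative,
with made-up struct fields): walk up to 50 items from the LRU tail
looking for something expired to reuse. When nothing at the tail has
expired, all 50 iterations are wasted work done under the lock on
every allocation.

#include <stddef.h>
#include <time.h>

typedef struct _item {
    struct _item *prev;   /* toward the LRU head */
    time_t exptime;       /* 0 == never expires */
    /* ... key, value, flags, etc. ... */
} item;

#define SEARCH_DEPTH 50

item *search_for_expired(item *tail, time_t now) {
    item *it = tail;
    for (int tries = 0; tries < SEARCH_DEPTH && it != NULL;
         tries++, it = it->prev) {
        if (it->exptime != 0 && it->exptime <= now)
            return it;  /* expired; safe to reuse */
    }
    return NULL;  /* common case: 50 items touched for nothing */
}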

Reasoning:

- Most of the reasoning is above. I looked through Ripduman's patches
and decided to go a slightly different route. I studied all of the
locks carefully to audit what changes to make. In addition, I made no
change that significantly increases memory usage. While we could still
release a specialized engine that inflates data structures as a
tradeoff for speed, I have a strong feeling it's not even necessary.

- I only kept patches that had a measurable benefit. I threw away a lot of
code!

Results:

- On my desktop, I was able to increase the number of set commands per
second from 300,000 to 930,000.

- With one get per request, I saw 500k to 600k per second. This was
largely limited by the localhost driver; it may be faster with real
hardware.

- With multigets, I was able to drive up to 4.5 million keys per second
(4.5 million get_hits per second). Real-world numbers will be a bit
lower than this.

- Saturated 10gbps of localhost traffic with 256-512 byte objects
(rough arithmetic after this list).

- Saturated 35gbps of localhost traffic with 4k objects.

- Saturated 45gbps of localhost traffic with 8k objects.

- The patches improve thread scalability. Under high load, performance
dropoff now starts around 5 or 6 threads, whereas previously as few as
4 (the default!) could cause a slowdown.
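
(Rough arithmetic on the first number, as a sanity check: 10gbps is
about 1.25GB/s, so at ~512 bytes per value plus key and protocol
overhead, that works out to on the order of 2 million responses per
second over loopback.)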

Future work:

I have some ideas to play with; some might go into 1.4.9, some later.
I don't believe any further performance enhancement is really
necessary, as it's trivial to saturate 10gbps of traffic now.

- Need to hammer out more of the bench tool and make a formal blog post
with pretty pictures. That's more interesting.

- The item lock hash needs tuning. It's using a modulo instead of a
hash mask (see the sketch after this list), and it needs a way to
initialize the size of the table, etc.

- I played with using the Intel hardware crc32c instruction (sketched
below), but that lowered performance, as it got threads slamming into
the locks too early. This needs more work before I push the branch up,
as well as verification of the hash distribution.

- It may be safe to split the cache_lock into a cache_lock and
lru_locks, but I haven't personally verified the safety of that yet,
and performance is already too high for my box to measure the change.
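
To make the modulo-vs-hashmask point concrete (a sketch; the mask
version assumes the table size is a power of two):

#include <stdint.h>

/* Selecting a bucket by modulo vs. by mask. The mask version is a
 * single AND; modulo on an arbitrary table size costs an integer
 * division on every lookup. */
uint32_t bucket_mod(uint32_t hv, uint32_t n_buckets) {
    return hv % n_buckets;                     /* any size, slower */
}

uint32_t bucket_mask(uint32_t hv, uint32_t power) {
    return hv & (((uint32_t)1 << power) - 1);  /* size == 2^power */
}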
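
For reference, hashing a key with the hardware crc32c instruction
looks roughly like this (a sketch using the SSE4.2 intrinsics, built
with -msse4.2; this is not the unpushed branch, and as noted above the
resulting distribution still needs verification):

#include <nmmintrin.h>  /* SSE4.2 intrinsics */
#include <stdint.h>
#include <string.h>

uint32_t hw_crc32c(const char *key, size_t len) {
    uint32_t crc = 0xFFFFFFFF;
    /* Consume 4 bytes per instruction, then mop up the remainder. */
    while (len >= 4) {
        uint32_t chunk;
        memcpy(&chunk, key, 4);
        crc = _mm_crc32_u32(crc, chunk);
        key += 4;
        len -= 4;
    }
    while (len--)
        crc = _mm_crc32_u8(crc, (uint8_t)*key++);
    return crc ^ 0xFFFFFFFF;
}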

Notes:

- NUMA is kind of a bitch. If you want to reproduce my results on a big
box, you'll need to bind memcached to a single NUMA node:

numactl --cpunodebind=0 ./memcached -m 4000 -t 4

You can also try twiddling --interleave (e.g. numactl --interleave=all)
and seeing how the performance changes. There isn't a hell of a lot we
can do here, but we can move many connection buffers to be "NUMA-local"
and get what we can out of it.

The performance, even with memcached interleaved, isn't too bad at all,
and the patches do improve things (for me).

- I have not done any latency verification yet. Given the low number of
connections I've been using in testing, it's not really possible for
requests to have taken longer than 0.1ms. Still, over the coming weeks
I will build the necessary functionality into mc-crusher and more
formally test how latency is affected by a mix of set/get commands.

Have fun, and everyone who makes presentations about "memcached scales
to a limit" can bite me. If you honestly need it to run faster than
this, just send us a fucking e-mail.

If you like what I do and would like to see projects I work on deal
better with NUMA or 10gbps ethernet, see here: http://memcached.org/feedme
-Dormando
