Alternate hashing algorithm for higher performance?


eamonn...@demilletech.net

Mar 16, 2019, 9:29:11 PM
to memcached
Hi there,

I started using memcached in prod a week or two ago, and am loving it. I wanted to give back, and took a look through the issues board, but most of them looked solved. So, in my usual "it's never fast enough" style, I went and profiled its performance, and had some fun.

After seeing that MurmurHash3 was taking a good amount of the execution time, I ran a test integrating one of my old favorite hash functions, xxhash. My guess is that memcached could benefit from switching to it, since it is faster than MMH3 and has several native variants. In my own tests I found roughly equal performance, but with no tuning done on xxhash; for example, using an assembly (x86/ARM/etc.) version and integrating it into memcached properly would likely speed up hashing further. I was also running this on a much older Nehalem CPU, so either algorithm could behave differently on a newer one. I'm in the process of fighting with my newer systems to get libevent installed properly, so I'll report back with more up-to-date tests later.

I did a cursory search but didn't find any mention of the algo on the mailing list; if this has already been discussed, apologies for bringing it up again. Otherwise, I'd be happy to write a PR to add it as an option for the `hash_algorithm` CLI arg.
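
To make the idea concrete, here's a rough sketch of the kind of wiring I have in mind, modeled loosely on memcached's hash dispatch. The enum/function names below are my guesses, not a drop-in patch:

/* Hypothetical sketch only: wiring xxHash's 32-bit variant into a
 * function-pointer dispatch like memcached's hash.c. Names are guesses. */
#include <stdint.h>
#include <stddef.h>
#include "xxhash.h"   /* from the xxHash library */

typedef uint32_t (*hash_func)(const void *key, size_t length);
hash_func hash;

/* memcached buckets on a 32-bit value, so XXH32 matches the signature. */
static uint32_t XXH32_hash(const void *key, size_t length) {
    return XXH32(key, length, 0);   /* fixed zero seed for now */
}

enum hashfunc_type { JENKINS_HASH, MURMUR3_HASH, XXH32_HASH };

int hash_init(enum hashfunc_type type) {
    switch (type) {
    case XXH32_HASH:
        hash = XXH32_hash;
        return 0;
    /* ... existing jenkins/murmur3 cases would stay as they are ... */
    default:
        return -1;
    }
}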

Thanks,
Eamonn

dormando

Mar 17, 2019, 2:46:08 PM
to eamonn.nugent via memcached
Hey,

What exact test did you do?

Well, to be honest, I've been wanting to swap in xxhash for a long time, but
in my own profiling other things show up higher than murmur, so I keep
deprioritizing it :)

One big problem with the hash algo is that mc keys can be short and are
hashed one at a time, while xxhash is more optimized for longer data
(kilobytes to megabytes). The original author tries to address this with
an updated algorithm:
https://fastcompression.blogspot.com/2019/03/presenting-xxh3.html

xxhash makes significant use of instruction parallelism, such that if a
key is 8 bytes or less you could end up waiting on the pipeline more than
with murmur. Other algos like cityhash/farmhash are better at short keys
IIRC. Also, xx's 32-bit algo is a bit slower on 64-bit machines... so if I
wanted to use it I was going to test both the 32-bit and 64-bit hashes and
then have to do compile-time testing to figure out which to use. It's also
heavily x86-optimized, so we might have to default to something else for ARM.
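
If you want to sanity-check the short-key case yourself, a dumb standalone loop like this is enough to see the effect. Just a sketch, not how I actually profile memcached; swap XXH64 for XXH32 or murmur3 to compare:

/* Rough micro-benchmark sketch for hashing 8-byte keys in a tight loop.
 * Build with something like: cc -O2 bench.c xxhash.c -o bench
 * Pin the CPU, disable turbo, and interleave runs for real numbers. */
#include <stdio.h>
#include <stdint.h>
#include <string.h>
#include <time.h>
#include "xxhash.h"

int main(void) {
    char key[8];
    volatile uint64_t sink = 0;          /* keep the compiler from eliding the loop */
    const uint64_t iters = 50000000ULL;

    struct timespec t0, t1;
    clock_gettime(CLOCK_MONOTONIC, &t0);
    for (uint64_t i = 0; i < iters; i++) {
        memcpy(key, &i, sizeof(key));    /* vary the key every iteration */
        sink += XXH64(key, sizeof(key), 0);
    }
    clock_gettime(CLOCK_MONOTONIC, &t1);

    double secs = (t1.tv_sec - t0.tv_sec) + (t1.tv_nsec - t0.tv_nsec) / 1e9;
    printf("%.1fM hashes/sec (sink=%llu)\n", iters / secs / 1e6,
           (unsigned long long)sink);
    return 0;
}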

Sorry, not debated on the list, just in my own head :) It's not quite as
straightforward as just dropping it in. If you're willing to get all the
conditions tested, go nuts! :)

-Dormando

Eamonn Nugent

Mar 17, 2019, 2:59:33 PM
to memc...@googlegroups.com
Hiya,

Last night, I was running memtier_benchmark on my laptop (mid-2015 15" MBP, 2.5GHz 4c i7) and got about a 10-15% throughput improvement with the 64-bit variant, on both modern and non-modern settings. The 32-bit variant was about equal in performance (the results put them within about 3% of each other, but most of that difference was probably just noise). I was able to solve the 32/64-bit compile-time problem by adding a wrapper and some compile-time declarations, so I'd say that's about 50% solved for x86-based systems. But yeah, with ARM, it could turn interesting.
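
Roughly what I mean by the wrapper, in sketch form (the naming and seed handling in my actual branch may differ):

/* Sketch of the compile-time 32/64-bit selection described above;
 * names and seed handling here are placeholders, not the exact patch. */
#include <stdint.h>
#include <stddef.h>
#include "xxhash.h"

static uint32_t xxh_hash(const void *key, size_t length) {
#if UINTPTR_MAX > 0xffffffffUL
    /* 64-bit build: XXH64 is the faster variant here; memcached only
     * needs 32 bits for bucket selection, so truncate the result. */
    return (uint32_t)XXH64(key, length, 0);
#else
    /* 32-bit build: XXH32 avoids the slow 64-bit arithmetic. */
    return XXH32(key, length, 0);
#endif
}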

As a next-ish step, I'm going to attempt to drop in xxh3, but since it's still in active development, it's probably not good for anything more than a tech demo. I'm happy, if it would help, to go nuts adding a dozen different algos into hash.c (cityhash/farmhash, as you mentioned). In xxhash's implementation, I played with some compile-time flags to make it a bit faster, and I've been toying with the idea of modifying it so no seed logic ever occurs, to maybe gain a couple of cycles. I'm also looking into finding a pure assembly version to squeeze a bit more speed out of the x86 and ARM builds. I should probably get one of my ARM systems running and test the difference...

But hey, thanks for humoring me. Maybe next I'll take a look at the command reading & processing steps and see if there's anything I can do there. Or maybe parallelizing rotl... Hm. I'll keep trying things out :)

Thanks,

Eamonn



Eamonn Nugent

Mar 17, 2019, 3:34:06 PM
to memc...@googlegroups.com
Reporting back with very preliminary benchmarks. Somehow, xxh64 is actually faster than xxh3 on my machine. One thing I forgot to mention before - I also looked at latencies with xxh32/64, and saw the 99th percentile latency drop by about half compared to mmh3, so it could be beneficial in that sense too. 99th percentile latencies with xxh3 are in the 3.6ms range, xxh64 goes down to about 3.0ms (I saw 2.5 yesterday; maybe testing on a laptop with about a billion Chrome tabs open isn't a brilliant idea), and mmh3 was in the 4.x ms range. All of this is with modern options; with non-modern options, xxh64 shone quite a bit more - that's where I was doing my testing yesterday.

I used the following memtier_benchmark command to stress test:

./memtier_benchmark -P memcache_binary -p 11211 --key-prefix="" --key-maximum=9999999999999999999

A lot of this seems to be very architecture-dependent. Maybe it would make sense to include a bunch of hash algos long-term and let power users figure out which they feel like using? Not sure, though, and you're the expert here :P

Thanks,

Eamonn

dormando

Mar 17, 2019, 4:16:35 PM
to 'Eamonn Nugent' via memcached
Yo,

Fwiw, I use mc-crusher as the "official" benchmark:
https://github.com/memcached/mc-crusher - though I need to update the README
slightly. Will do that in a bit.

I also test on hardware uninterrupted by other things, with turbo disabled
:) Testing on laptops can be really tough since you'll get different
frequencies minute to minute. You have to interleave test runs A/B/A/B/A/B
and then average to get through the noise.

Also, make sure to test both binprot and ascii. With ascii multiget you can
deeply pipeline requests and get more hashing vs syscalls.

Also, test short vs long keys. A for loop and some scripting should get
you there. :)

I don't really want to add a ton of algos. More or less a "best average"
is fine. Given the syscall overhead, hashing is barely measurable for a
lot of workloads. When I switched from jenkins to murmur I did a
somewhat comprehensive set of tests, then swapped the default and left the
old one in just in case. I highly value good defaults! Libmemcached ended
up kitchen-sinking hash algos and I think that didn't work out well in the
long run.

I also tested hash chain bucket depth a bit. Finally, try loading up
different counts of keys (1m, 10m, 100m, etc) and re-running uniform
random benchmarks, since distribution fairness will affect the bucket
depth and thus latency.
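
For the bucket depth part, something standalone like this is good enough to eyeball the distribution. A sketch only, with synthetic keys and a fixed table size; memcached's real table also grows and rehashes:

/* Sketch: hash N synthetic "key<N>" keys into a 2^20-bucket table and
 * print the chain-depth histogram, to compare how evenly different
 * hash functions spread short keys. Not memcached code. */
#include <stdio.h>
#include <stdint.h>
#include <stdlib.h>
#include "xxhash.h"

int main(int argc, char **argv) {
    long nkeys = argc > 1 ? atol(argv[1]) : 1000000;   /* try 1m, 10m, 100m */
    size_t nbuckets = (size_t)1 << 20;
    uint32_t *depth = calloc(nbuckets, sizeof(*depth));
    if (depth == NULL) return 1;

    char key[32];
    for (long i = 0; i < nkeys; i++) {
        int len = snprintf(key, sizeof(key), "key%ld", i);
        uint32_t h = (uint32_t)XXH64(key, (size_t)len, 0);  /* swap hashes here */
        depth[h & (nbuckets - 1)]++;
    }

    long hist[10] = {0};   /* depths 0..8, then 9+ lumped together */
    for (size_t b = 0; b < nbuckets; b++)
        hist[depth[b] > 9 ? 9 : depth[b]]++;
    for (int d = 0; d < 10; d++)
        printf("depth %d%s: %ld buckets\n", d, d == 9 ? "+" : " ", hist[d]);

    free(depth);
    return 0;
}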

Sorry if that's a pain in the ass, but this thing is quite widely used and
there aren't really beta testers :) Have to be thorough these days.

-Dormando

Eamonn Nugent

Mar 17, 2019, 4:40:33 PM
to memc...@googlegroups.com
Hi,

I've been meaning to set up dedicated test environments at some point, for my own stuff as well. Guess now is as good a time as any to try it out.

I'll start using mc-crusher for benchmarks, and I'm going to throw it on at least a VM for now, run some long-term benchmarks, and likely try out a couple algos to figure out which one runs best. Good to know that ascii is a different workload - I wasn't sure whether the two protocols were essentially aliases or genuinely distinct. Maybe if I get annoyed enough at syscalls, I'll find a way of writing an OS that lets user applications run safely without the same overhead. Tempting, tempting...

No worries about it being a pain in the ass - this is the process behind good and reliable software, and I personally wouldn't want it any other way. I just wanted to get back with early results to get your feedback.

Thanks,

Eamonn

dormando

Mar 17, 2019, 5:27:27 PM
to 'Eamonn Nugent' via memcached
Sounds good!

Fwiw, binprot simply lacks a method for batching responses. I'll be fixing
that this year, but it's going to be a big refactor of the old frontend
code. There's some use to it as-is, but there should be an option for
batching anyway.

VM's fine. I do most of my quick tests on a NUC, which is essentially a
laptop with better cooling. Have some donated hardware from packet.net for
bigger tests, thankfully :)

mc-crusher is nice since it uses very little CPU compared to something
like memtier. A lot of "memcached" benchmarks end up benchmarking the
benchmark, and are only useful for measuring the relative latency of
changes in the daemon, not scalability/throughput.

Just uploaded some fixes and an updated README file. Pretty sure I'm the
only regular user of the thing, so please ask away if anything is
weird/confusing.

have fun,