My two cents on tooling: I use a tool called mutilate, which is very nice to use and gives accurate results: it reports high-percentile latencies (e.g., 95th/99th percentile) and supports using multiple client machines to generate load.
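To give a flavor, here is a minimal sketch of a single-machine run. The flags are as I remember them from mutilate's README (-s server, -T threads, -c connections per thread, -t seconds to run), and the address and duration here are made up, so double-check against --help:

    # hit a memcached at 10.0.0.1 with 2 threads x 4 connections each (8 total)
    ./mutilate -s 10.0.0.1:11211 -T 2 -c 4 -t 30

When it finishes it prints the achieved QPS and a latency distribution (average plus percentiles).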
So, in my experience benchmarking a lot of different memcached configurations, the answer isn't simple: binary vs. ASCII depends on your setup. With a low number of clients/connections to the server, the binary protocol outperforms ASCII significantly. With more clients/connections and higher load, they become equal. Binary wins in the sense that it is never worse than ASCII and is sometimes significantly better, just not always.
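If it helps to see what "ASCII" means concretely: the text protocol is just newline-terminated commands, so you can poke at a server by hand (this is stock memcached behavior, nothing specific to my setup):

    # store "bar" (3 bytes) under key "foo", then read it back, over the text protocol
    printf 'set foo 0 0 3\r\nbar\r\nget foo\r\nquit\r\n' | nc localhost 11211

The binary protocol instead frames every request and response with a packed 24-byte header, which is cheaper to parse but not something you can type. Which protocol gets used is normally the client's choice; the memcached server's -B flag can pin it to ascii, binary, or auto.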
Here are some numbers I just ran again. This is with memcached running on a 12-core machine with HT (so 24 vCPUs), Xeon @ 2.7 GHz. I had about 30 client machines hitting it over a 10G network:
* Generating load from one machine using 8 connections:
Binary: 933k req/s with 578us 99th percentile latency
Text: 767k req/s with 631us 99th percentile latency
* Generating load from 30 machines using 4 connections per machine (agent-mode sketch below):
Binary: 2.7M req/s with 700us 99th percentile latency
Text: 2.7M req/s with 692us 99th percentile latency
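For that second run, mutilate's agent mode handles the fan-out across client machines. Roughly (hostnames are hypothetical, and I'd verify the exact per-agent connection accounting against --help):

    # on each of the ~30 client machines:
    ./mutilate -A

    # on the coordinator: target the server and enlist each agent with -a
    ./mutilate -s memcached-host:11211 -c 4 -t 30 \
        -a client01 -a client02   # ...one -a per client machine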
I haven't investigated this further, but I imagine that under load, other bottlenecks in event handling and the kernel become the limiting factor rather than the protocol.
Cheers,
David