Memcached performance numbers


Pradeep Fernando

Oct 7, 2019, 11:40:50 AM
to memcached
Hi Devs,

I ran memaslap to understand the performance characteristics of memcached.

My setup: both memcached and memaslap running on a single NUMA machine. memcached is bound to NUMA node 1 and given 3GB of memory.
workload: get/set 0.5/0.5
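The binding was done roughly like this (port and thread count here are illustrative, not the exact values from my run):

```
# Pin memcached's CPUs and memory allocations to NUMA node 1,
# with a 3GB item cache.
numactl --cpunodebind=1 --membind=1 \
    memcached -m 3072 -p 11211 -t 4
```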

I increase the number of memaslap threads and observe the throughput/latency numbers.

I see throughput increase (expected), but latency drops as I increase the load.
The initial average latency is 83 us and it drops to 30 us with number of threads = 8. This is an unexpected number -- I expected the latency to go up.
Am I reading the output wrong?

Apologies if this question does not qualify for this mailing list. If so, please direct me to the correct list where I can get help. :)

--Pradeep



Thread count = 1


Total Statistics (11447336 events)
   Min:        11
   Max:      1663
   Avg:        83
   Geo:     79.83
   Std:     36.39
   Log2 Dist:
       4:       42      594   351733   9527982
       8:   1551101     7451     7103     1330

cmd_get: 5723671
cmd_set: 5723681
get_misses: 0
written_bytes: 394933167
read_bytes: 343419948
object_bytes: 183157792

Run time: 60.0s Ops: 11447352 TPS: 190765 Net_rate: 11.7M/s

Thread count = 2

Total Statistics (30888799 events)
   Min:        12
   Max:      2011
   Avg:        30
   Geo:     29.68
   Std:     15.32
   Log2 Dist:
       4:   170225   25862674   4766668    66398
       8:      154     5017    17493      170

cmd_get: 15444404
cmd_set: 15444411
get_misses: 0
written_bytes: 1065663678
read_bytes: 926663772
object_bytes: 494221152

Run time: 60.0s Ops: 30888815 TPS: 514751 Net_rate: 31.7M/s

dormando

Oct 7, 2019, 1:08:36 PM
to memcached
Hi,

First, as an aside: a 1/1 get/set ratio is unusual for mc. Gets scale a
lot better than sets. If you get into testing more "realistic" perf
numbers, make sure to increase the get rate.

You're probably just running into CPU frequency scaling. OSes come with a
"battery saver" or "ondemand" performance governor by default. They also
have turbo. Once you start loading it up more, the CPU will stay in the
higher frequency states or begin to turbo, which will lower the latency.

/usr/bin/echo 1 > /sys/devices/system/cpu/intel_pstate/no_turbo
cpupower frequency-set -g performance

... or whatever works for your platform.
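To double-check that the settings took, something like this works on intel_pstate systems (sysfs paths may differ on other platforms):

```
# 1 here means turbo is disabled:
cat /sys/devices/system/cpu/intel_pstate/no_turbo
# Every core should report "performance":
cat /sys/devices/system/cpu/cpu*/cpufreq/scaling_governor
# Watch the live frequency while the benchmark runs:
grep MHz /proc/cpuinfo
```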

Pradeep Fernando

Oct 7, 2019, 2:30:05 PM
to memc...@googlegroups.com
Hi Dormando,

That is a great insight!
However, it did not solve the problem. I disabled turbo, as per your instructions.
I even set the CPU to operate at maximum performance with
> cpupower frequency-set --governor performance (I verified this by monitoring the CPU frequency)

Still the same unexplained behavior. :( Do you have any other suggestions?

thanks
--Pradeep



--
Pradeep Fernando.

dormando

Oct 7, 2019, 2:42:12 PM
to memc...@googlegroups.com
Hey,

Sorry; I'm not going to have any other major insights :) I'd have to sit
here playing 20 questions to figure out your test setup. If you're running
memaslap from another box, that one needs to be cpu pinned as well. If
it's a VM, the governor/etc might not even matter.
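If the load generator shares the box, pinning means keeping it off the server's cores, e.g. (core range and flags are illustrative):

```
# Keep memaslap on cores 8-15 so it doesn't contend with the
# memcached worker threads pinned elsewhere.
taskset -c 8-15 memaslap -s 127.0.0.1:11211 -T 8 -c 128 -t 60s
```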

Also, I don't use memaslap at all, so I can't attest to it. I use
https://github.com/memcached/mc-crusher with the external latency sampling
util it comes with. It's not as easy to use, though.

Pradeep Fernando

Oct 7, 2019, 4:40:40 PM
to memc...@googlegroups.com
Hi,

Thanks for the help!
After a couple of trial-and-error configs, I figured out that the 'concurrency' parameter used in memaslap was the culprit.
In my configs I was using a constant 16 as the concurrency input. Scaling the value along with the thread count gave me sane numbers.

The average latency is 120 us for the get/set workload (please ignore my 50/50 ratio) and throughput maxes out around 500K ops/second.
Graph attached.

I know that benchmarking numbers are heavily dependent on the setup and other things. But are my numbers faithful enough to be quoted as
memcached single-server numbers? In other words, are these numbers way off from typical memcached performance numbers?
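Roughly, the change looked like this (server address is illustrative):

```
# Before: concurrency fixed at 16 regardless of thread count,
# so the per-thread connection load changed as threads grew.
memaslap -s 127.0.0.1:11211 -T 8 -c 16 -t 60s

# After: scale concurrency with the thread count,
# e.g. 16 concurrent connections per thread.
memaslap -s 127.0.0.1:11211 -T 8 -c 128 -t 60s
```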

thanks
--Pradeep




--
Pradeep Fernando.
kvbench.pdf

dormando

Oct 7, 2019, 4:46:30 PM
to memc...@googlegroups.com
It'll depend on your hardware/test/etc.

https://memcached.org/blog/persistent-memory/ - a thorough performance
test with some higher end numbers on both throughput and latency along
with 50th/90th/95/99/etc percentiles and latency point clouds for each sub
test. That was a big machine though.

...and no, I'm not going to ignore your 50/50 ratio :) The ratio changes
the results too much. People will have to test with what they expect to
see. If you do 100% gets, it scales linearly with the number of worker
threads/cores. Anything below 100% gets will slowly scale down to 500K-1M
ops/s depending on the hardware and size of the objects.

Pradeep Fernando

Oct 7, 2019, 5:24:18 PM
to memc...@googlegroups.com
Thanks for the article link. That is some comprehensive benchmarking.

Compared to the article's numbers, my latency numbers are sane enough. I hit ~120 us while you get similar/closer numbers at the 99th percentile.

However, my throughput numbers seem to be off. I hit a throughput knee point at 500K ops/sec while yours is around ~6000K.
An order of magnitude difference. Can you please comment on it? :)

thanks
--Pradeep



--
Pradeep Fernando.

dormando

Oct 7, 2019, 5:31:03 PM
to memc...@googlegroups.com
The high-end numbers are due to pipelining responses, i.e. ascii multiget,
which reduces the syscalls. You can see how the tests were run via the
links to the source code in the blog.
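For reference, an ascii multiget is just several keys on one "get" request line, so the client pays one syscall round trip instead of one per key; a sketch of the request bytes (the server address in the comment is illustrative):

```shell
# One request line fetches three keys at once.
keys="k1 k2 k3"
printf 'get %s\r\n' "$keys"
# Against a live server you would pipe it in, e.g.:
#   printf 'get %s\r\n' "$keys" | nc 127.0.0.1 11211
```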

I was running some pure-get tests on a dual 8-core machine yesterday with
memcached pinned to one NUMA node. Without any pipelining it was doing
~1.8M ops/sec. With heavy pipelining it should be much more than that.

In extreme and contrived cases I've gotten pure-get throughput above
50 million keys/sec. So I know that part scales... sets would as well, but
nobody really asks for it so I've not focused on it. Latency is probably
not great at that throughput, though :)