Memcached performance numbers


Pradeep Fernando

Oct 7, 2019, 11:40:50 AM
to memcached
Hi Devs,

I ran memaslap to understand the performance characteristics of memcached.

My setup: both memcached and memaslap running on a single NUMA machine. memcached is bound to NUMA node 1 and given 3GB of memory.
workload: get/set 0.5/0.5
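The binding was done roughly like this (port and thread count here are illustrative, not the exact values from my run):

```
# Pin memcached's CPUs and memory allocations to NUMA node 1,
# with a 3GB item cache.
numactl --cpunodebind=1 --membind=1 \
    memcached -m 3072 -p 11211 -t 4
```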

I increase the number of memaslap threads and observe the throughput/latency numbers.

I see throughput increase (expected), but latency drops as I increase the load.
The initial average latency is 83 us and it drops to 30 us with number of threads = 8. This is an unexpected number -- I expected the latency to go up.
Am I reading the output wrong?

Apologies if this question does not qualify for this mailing list. If so, please direct me to the correct list where I can get help. :)

--Pradeep



Thread count = 1


Total Statistics (11447336 events)
   Min:        11
   Max:      1663
   Avg:        83
   Geo:     79.83
   Std:     36.39
   Log2 Dist:
       4:       42      594   351733   9527982
       8:   1551101     7451     7103     1330

cmd_get: 5723671
cmd_set: 5723681
get_misses: 0
written_bytes: 394933167
read_bytes: 343419948
object_bytes: 183157792

Run time: 60.0s Ops: 11447352 TPS: 190765 Net_rate: 11.7M/s

Thread count = 2

Total Statistics (30888799 events)
   Min:        12
   Max:      2011
   Avg:        30
   Geo:     29.68
   Std:     15.32
   Log2 Dist:
       4:   170225   25862674   4766668    66398
       8:      154     5017    17493      170

cmd_get: 15444404
cmd_set: 15444411
get_misses: 0
written_bytes: 1065663678
read_bytes: 926663772
object_bytes: 494221152

Run time: 60.0s Ops: 30888815 TPS: 514751 Net_rate: 31.7M/s

dormando

Oct 7, 2019, 1:08:36 PM
to memcached
Hi,

First, as an aside: a 1/1 get/set ratio is unusual for mc. Gets scale a
lot better than sets. If you get into testing more "realistic" perf
numbers, make sure to increase the get rate.

You're probably just running into CPU frequency scaling. OSes come with a
"battery saver" or "ondemand" performance governor by default. They also
have turbo. Once you start loading it up more, the CPU will stay in the
higher frequency states or begin to turbo, which will lower the latency.

/usr/bin/echo 1 > /sys/devices/system/cpu/intel_pstate/no_turbo
cpupower frequency-set -g performance

... or whatever works for your platform.
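To double-check that the settings took, something like this works on intel_pstate systems (sysfs paths may differ on other platforms):

```
# 1 here means turbo is disabled:
cat /sys/devices/system/cpu/intel_pstate/no_turbo
# Every core should report "performance":
cat /sys/devices/system/cpu/cpu*/cpufreq/scaling_governor
# Watch the live frequency while the benchmark runs:
grep MHz /proc/cpuinfo
```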

Pradeep Fernando

Oct 7, 2019, 2:30:05 PM
to memc...@googlegroups.com
Hi Dormando,

That is a great insight!
However, it did not solve the problem. I disabled turbo, as per your instructions.
I even set the CPU to operate at maximum performance with
> cpupower frequency-set --governor performance (I verified this by monitoring the CPU frequency)

Still the same unexplained behavior. :( Do you have any other suggestions?

thanks
--Pradeep



--
Pradeep Fernando.

dormando

Oct 7, 2019, 2:42:12 PM
to memc...@googlegroups.com
Hey,

Sorry; I'm not going to have any other major insights :) I'd have to sit
here playing 20 questions to figure out your test setup. If you're running
memaslap from another box, that one needs to be cpu pinned as well. If
it's a VM, the governor/etc might not even matter.
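If the load generator shares the box, pinning means keeping it off the server's cores, e.g. (core range and flags are illustrative):

```
# Keep memaslap on cores 8-15 so it doesn't contend with the
# memcached worker threads pinned elsewhere.
taskset -c 8-15 memaslap -s 127.0.0.1:11211 -T 8 -c 128 -t 60s
```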

Also, I don't use memaslap at all, so I can't attest to it. I use
https://github.com/memcached/mc-crusher with the external latency sampling
util it comes with. It's not as easy to use, though.

Pradeep Fernando

Oct 7, 2019, 4:40:40 PM
to memc...@googlegroups.com
Hi,

Thanks for the help!
After a couple of trial-and-error configs, I figured out that the 'concurrency' parameter used in memaslap was the culprit.
In my configs I was using a constant 16 as the concurrency input. Scaling the value along with the thread count gave me sane numbers.

The average latency is 120 us for the get/set workload (please ignore my 50/50 ratio) and throughput maxes out around 500K ops/second.
Graph attached.

I know that benchmarking numbers are heavily dependent on the setup and other things. But are my numbers faithful enough to be quoted as
memcached single-server numbers? In other words, are these numbers way off from typical memcached performance numbers?
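Roughly, the change looked like this (server address is illustrative):

```
# Before: concurrency fixed at 16 regardless of thread count,
# so the per-thread connection load changed as threads grew.
memaslap -s 127.0.0.1:11211 -T 8 -c 16 -t 60s

# After: scale concurrency with the thread count,
# e.g. 16 concurrent connections per thread.
memaslap -s 127.0.0.1:11211 -T 8 -c 128 -t 60s
```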

thanks
--Pradeep




--
Pradeep Fernando.
kvbench.pdf

dormando

Oct 7, 2019, 4:46:30 PM
to memc...@googlegroups.com
It'll depend on your hardware/test/etc.

https://memcached.org/blog/persistent-memory/ - a thorough performance
test with some higher end numbers on both throughput and latency along
with 50th/90th/95/99/etc percentiles and latency point clouds for each sub
test. That was a big machine though.

...and no, I'm not going to ignore your 50/50 ratio :) The ratio changes
the results too much. People will have to test with what they expect to
see. If you do 100% gets, it scales linearly with the number of worker
threads/cores. Anything below 100% gets will slowly scale down to 500K-1M
ops/s depending on the hardware and size of the objects.

Pradeep Fernando

Oct 7, 2019, 5:24:18 PM
to memc...@googlegroups.com
Thanks for the article link. That is some comprehensive benchmarking.

Compared to the article's numbers, my latency numbers are sane enough. I hit ~120 us while you get similar/closer numbers at the 99th percentile.

However, my throughput numbers seem to be off. I hit a throughput knee point at 500K ops/sec while yours is around ~6000K.
An order of magnitude difference. Can you please comment on it? :)

thanks
--Pradeep



--
Pradeep Fernando.

dormando

Oct 7, 2019, 5:31:03 PM
to memc...@googlegroups.com
The high-end numbers are due to pipelining responses, i.e. ascii multiget,
which reduces the syscalls. You can see how the tests were run via the
links to the source code in the blog.
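For reference, an ascii multiget is just several keys on one "get" request line, so the client pays one syscall round trip instead of one per key; a sketch of the request bytes (the server address in the comment is illustrative):

```shell
# One request line fetches three keys at once.
keys="k1 k2 k3"
printf 'get %s\r\n' "$keys"
# Against a live server you would pipe it in, e.g.:
#   printf 'get %s\r\n' "$keys" | nc 127.0.0.1 11211
```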

I was running some pure-get tests on a dual 8-core machine yesterday with
memcached pinned to one NUMA node. Without any pipelining it was doing
~1.8M ops/sec. With heavy pipelining it should be much more than that.

In extreme and contrived cases I've gotten pure-get throughput above
50 million keys/sec. So I know that part scales... sets would as well, but
nobody really asks for it so I've not focused on it. Latency is probably
not great at that throughput, though :)