benchmarking issues

kmr

Mar 26, 2021, 2:11:26 PM
to memcached
We are trying to experiment with using UDP vs. TCP for gets to see what kind of speedup we can achieve. I wrote a very simple benchmark that uses a single thread to set a key once and then issue gets for that key over and over. We didn't notice any speedup using UDP; if anything, we saw a slight slowdown, which seemed strange.

When checking the stats delta, I noticed a really high value for lrutail_reflocked. For a test doing 100K gets, this value increased by 76K. In our production system, memcached processes that have been running for weeks have a very low value for this stat, less than 100. Also, the latency measured by the benchmark seems to correlate with the rate at which that value increases.

I tried to reproduce this with the spy Java client and I see the same behavior, so I think it must be something wrong with my benchmark design rather than a protocol issue. We are using 1.6.9. Here is a list of all the stats values that changed during a recent run using TCP:

stats diff:
  * bytes_read: 10,706,007
  * bytes_written: 426,323,216
  * cmd_get: 101,000
  * get_hits: 101,000
  * lru_maintainer_juggles: 8,826
  * lrutail_reflocked: 76,685
  * moves_to_cold: 76,877
  * moves_to_warm: 76,917
  * moves_within_lru: 450
  * rusage_system: 0.95
  * rusage_user: 0.37
  * time: 6
  * total_connections: 2
  * uptime: 6

dormando

Mar 26, 2021, 5:03:10 PM
to memcached
Hey,

Usually it's good to include the benchmark code, but I think I can answer
this off the top of my head:

1) set at least 1,000 keys and fetch them randomly. all of memcached's
internal scale-up is based around... not just fetching a single key. I
typically test with a million or more. There are internal threads which
poke at the LRU, and since you're always accessing the one key, that key
is in use, and those internal threads report on that (lrutail_reflocked).

2) UDP mode has not had any love in a long time. It's not very popular and
has caused some strife on the internet as it doesn't have any
authentication. The UDP protocol wrapper is also not scalable. :( I wish
it were done like DNS with a redirect for too-large values.

3) Since UDP mode isn't using SO_REUSEPORT, recvmmsg, sendmmsg, or any
other modern linux API it's going to be a lot slower than the TCP mode.

4) TCP mode actually scales pretty well. Linearly for reads vs the number
of worker threads at tens of millions of requests per second on large
machines. What problems are you running into?

-Dormando

kmr

Mar 26, 2021, 7:07:15 PM
to memcached
Thanks for the reply! Responses inline.

On Friday, March 26, 2021 at 2:03:10 PM UTC-7 Dormando wrote:
> Hey,
>
> Usually it's good to include the benchmark code, but I think I can answer
> this off the top of my head:
>
> 1) set at least 1,000 keys and fetch them randomly. all of memcached's
> internal scale-up is based around... not just fetching a single key. I
> typically test with a million or more. There are internal threads which
> poke at the LRU, and since you're always accessing the one key, that key
> is in use, and those internal threads report on that (lrutail_reflocked).

This worked! However it seems like TCP and UDP latency now is about the same with my code as well as with a real benchmarking tool (memaslap).
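
For anyone who finds this thread later, here is a minimal sketch of the "many keys, fetched randomly" approach using the spy Java client mentioned above (the key prefix, counts, and value size are arbitrary illustrations, not our real benchmark):

    import java.net.InetSocketAddress;
    import java.util.Random;
    import net.spy.memcached.MemcachedClient;

    public class GetBench {
        public static void main(String[] args) throws Exception {
            MemcachedClient client =
                new MemcachedClient(new InetSocketAddress("localhost", 11211));
            int keyCount = 100_000;   // spread accesses over many keys
            byte[] value = new byte[100];

            // Preload the keyspace once so every get is a hit.
            for (int i = 0; i < keyCount; i++) {
                client.set("bench:" + i, 0, value).get();
            }

            // Fetch random keys instead of hammering a single one.
            Random rnd = new Random();
            int gets = 100_000;
            long start = System.nanoTime();
            for (int i = 0; i < gets; i++) {
                client.get("bench:" + rnd.nextInt(keyCount));
            }
            long elapsed = System.nanoTime() - start;
            System.out.printf("avg get latency: %.1f us%n", elapsed / 1000.0 / gets);

            client.shutdown();
        }
    }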

> 2) UDP mode has not had any love in a long time. It's not very popular and
> has caused some strife on the internet as it doesn't have any
> authentication. The UDP protocol wrapper is also not scalable. :( I wish
> it were done like DNS with a redirect for too-large values.

Not sure I understand the scalability point. From my observations, if I do a multiget, I get separate packet sequences for each response. So each get value could be about 2^16 * 1400 bytes big and still be ok via UDP (assuming everything arrives)? One thing that seemed hard is each separate sequence has the same requestId, which makes deciding what to do difficult in out-of-order arrival scenarios. 
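
For context on the framing being described: every memcached UDP datagram starts with an 8-byte frame header (request ID, sequence number, total number of datagrams, reserved), per the protocol documentation. A minimal decoder sketch, assuming network byte order (the class name is just illustrative):

    import java.nio.ByteBuffer;

    // Decodes memcached's 8-byte UDP frame header from the start of a datagram.
    final class UdpFrameHeader {
        final int requestId;      // bytes 0-1: id echoed back from the request
        final int sequenceNumber; // bytes 2-3: 0-based index of this datagram
        final int totalDatagrams; // bytes 4-5: datagrams making up this response
        final int reserved;       // bytes 6-7: must be 0

        UdpFrameHeader(ByteBuffer datagram) {
            // ByteBuffer defaults to big-endian, matching network byte order.
            requestId      = datagram.getShort() & 0xFFFF;
            sequenceNumber = datagram.getShort() & 0xFFFF;
            totalDatagrams = datagram.getShort() & 0xFFFF;
            reserved       = datagram.getShort() & 0xFFFF;
            // The remaining bytes are the (partial) ASCII response payload.
        }
    }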

> 3) Since UDP mode isn't using SO_REUSEPORT, recvmmsg, sendmmsg, or any
> other modern linux API it's going to be a lot slower than the TCP mode.

SO_REUSEPORT seems to be supported in the Linux kernel since 3.9. But I definitely understand the decision not to spend much time optimizing the UDP protocol. I did see higher rusage_user and much higher rusage_system when using UDP, which maybe corresponds to what you are saying. I tried with memaslap and observed the same thing.
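
To make the kernel feature concrete (purely as an illustration of what SO_REUSEPORT buys for UDP scaling, not anything memcached itself does today): several sockets can bind the same UDP port, and the kernel spreads incoming datagrams across them, one socket per worker thread. A rough Java 9+/Linux sketch, with an arbitrary port and buffer size:

    import java.net.InetSocketAddress;
    import java.net.StandardSocketOptions;
    import java.nio.ByteBuffer;
    import java.nio.channels.DatagramChannel;

    public class ReusePortWorkers {
        public static void main(String[] args) throws Exception {
            int workers = Runtime.getRuntime().availableProcessors();
            for (int i = 0; i < workers; i++) {
                // Each worker thread owns a socket bound to the same port;
                // the kernel load-balances datagrams between the sockets.
                DatagramChannel ch = DatagramChannel.open();
                ch.setOption(StandardSocketOptions.SO_REUSEPORT, true);
                ch.bind(new InetSocketAddress(11311));
                new Thread(() -> {
                    ByteBuffer buf = ByteBuffer.allocate(1500);
                    while (true) {
                        try {
                            buf.clear();
                            ch.receive(buf); // a real server would parse and reply here
                        } catch (Exception e) {
                            return;
                        }
                    }
                }).start();
            }
        }
    }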

> 4) TCP mode actually scales pretty well. Linearly for reads vs the number
> of worker threads at tens of millions of requests per second on large
> machines. What problems are you running into?

No pressing issue really.  We saw this (admittedly old) paper discussing how Facebook was able to reduce get latency by 20% by switching to UDP. Memcached get latency is a key factor in our overall system latency so we thought it would be worth a try, and it would ease some pressure on our network infrastructure as well. Do you know if Facebook's changes ever made it back into the main memcached distribution?

Thanks
Kireet

dormando

Mar 26, 2021, 8:44:29 PM
to memcached
Hey,

> This worked! However it seems like TCP and UDP latency now is about the same with my code as well as with a real
> benchmarking tool (memaslap).

I don't use memaslap so I can't speak to it. I use mc-crusher for the
"official" testing, though admittedly it's harder to configure.

> Not sure I understand the scalability point. From my observations, if I do a multiget, I get separate packet
> sequences for each response. So each get value could be about 2^16 * 1400 bytes big and still be ok via UDP
> (assuming everything arrives)? One thing that seemed hard is each separate sequence has the same requestId, which
> makes deciding what to do difficult in out-of-order arrival scenarios. 

mostly RE: kernel/syscall stuff. Especially after the TCP optimizations in
1.6, UDP mode will just be slower at high request rates. It will end up
running a lot more syscalls.

> SO_REUSEPORT seems to be supported in the Linux kernel since 3.9. But I definitely understand the decision not to
> spend much time optimizing the UDP protocol. I did see higher rusage_user and much higher rusage_system when
> using UDP, which maybe corresponds to what you are saying. I tried with memaslap and observed the same thing.

Yeah, see above.

> No pressing issue really.  We saw this (admittedly old) paper discussing how Facebook was able to reduce get
> latency by 20% by switching to UDP. Memcached get latency is a key factor in our overall system latency so we
> thought it would be worth a try, and it would ease some pressure on our network infrastructure as well. Do you
> know if Facebook's changes ever made it back into the main memcached distribution?

I wish there was some way I could make that paper stop existing. Those
changes went into memcached 1.2, 13+ years ago. I'm reasonably certain
Facebook doesn't use UDP for memcached and hasn't in a long time. None of
their more recent papers (which also stop around 2014) mention UDP at all.

The best performance you can get is by ensuring multiple requests are
pipelined at once, and that there are a reasonable number of worker threads
(not more than one per CPU). If you see anything odd or have questions
please bring up specifics, share server settings, etc.
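
To make the pipelining point concrete, here is a rough sketch over the ASCII protocol with a plain socket: write a batch of gets before reading any responses, so one round trip carries many requests (key names and batch depth are arbitrary, and the response parsing assumes small text values without embedded newlines):

    import java.io.BufferedReader;
    import java.io.InputStreamReader;
    import java.io.OutputStream;
    import java.net.Socket;
    import java.nio.charset.StandardCharsets;

    public class PipelinedGets {
        public static void main(String[] args) throws Exception {
            try (Socket s = new Socket("localhost", 11211)) {
                OutputStream out = s.getOutputStream();
                BufferedReader in = new BufferedReader(
                        new InputStreamReader(s.getInputStream(), StandardCharsets.US_ASCII));

                // Queue several requests on the wire before reading anything back.
                int depth = 32;
                StringBuilder batch = new StringBuilder();
                for (int i = 0; i < depth; i++) {
                    batch.append("get bench:").append(i).append("\r\n");
                }
                out.write(batch.toString().getBytes(StandardCharsets.US_ASCII));
                out.flush();

                // Responses come back in order: zero or more VALUE blocks, then END.
                for (int i = 0; i < depth; i++) {
                    String line;
                    while ((line = in.readLine()) != null && !line.equals("END")) {
                        if (line.startsWith("VALUE")) {
                            in.readLine(); // data line (text values only in this sketch)
                        }
                    }
                }
            }
        }
    }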

kmr

Mar 26, 2021, 9:46:56 PM
to memcached
Thanks for the prompt replies. If it makes you feel better, the paper has stopped existing in my mind. :)

Have a good weekend!
