Re: [go-nuts] Yet another memcache client, server and benchmark tool in Go :)


Vladimir Mihailenco

Nov 15, 2012, 6:36:19 AM
to val...@gmail.com, golang-nuts
Impressive numbers, and thanks for sharing. Did you measure how big a performance boost you get by separating the goroutines that read from and write to the net.Conn? It would be great if you explained which optimisations you applied to get such performance.



On Thu, Nov 15, 2012 at 2:05 AM, <val...@gmail.com> wrote:
Hi all,

Take a look at yet another memcache server written in Go: https://github.com/valyala/ybc/tree/master/apps/go/memcached .
It has the following features missing in the original memcached:
  * Cached data survives server restarts.
  * Cache size may exceed available RAM by multiple orders of magnitude.
  * Value size isn't limited to 1MB.
  * Key size isn't limited to 250 bytes.

All these features are available thanks to the Go wrapper around the YBC library - see https://github.com/valyala/ybc/tree/master/bindings/go/ybc .

There is also a benchmark tool for memcache servers located at https://github.com/valyala/ybc/tree/master/apps/go/memcached-bench . It is also written in Go. Give it a try for performance comparison of different memcache servers. The tool uses a memcache client library optimized for speed, which is located at https://github.com/valyala/ybc/tree/master/libs/go/memcache . This library uses fewer resources on both the client and server side compared to https://github.com/bradfitz/gomemcache under load from hundreds of goroutines. For instance, it achieves 200K qps for mixed get/set requests sent from 512 goroutines when both the server and the client are hosted on the same laptop.

Currently both client and server libraries support only the following subset of memcache commands: get, gets, set, delete, flush_all.

It would be great if these apps and libraries turn out to be useful to somebody besides me :)

--
Best Regards,
Aliaksandr Valialkin


Vladimir Mihailenco

Nov 15, 2012, 8:26:27 AM
to val...@gmail.com, golang-nuts
I applied a similar trick to my redis client benchmark and lifted qps from 26k to 90k. It is all about auto-batching via buffered reading/writing, because the client writes and reads N items at once (I think in your case it is a few thousand items). And I am not sure whether that is acceptable behaviour for a general-purpose client...

Aliaksandr Valialkin

Nov 16, 2012, 6:20:13 PM
to Vladimir Mihailenco, golang-nuts
On Thu, Nov 15, 2012 at 1:36 PM, Vladimir Mihailenco <vladimi...@gmail.com> wrote:
Impressive numbers, and thanks for sharing. Did you measure how big a performance boost you get by separating the goroutines that read from and write to the net.Conn? It would be great if you explained which optimisations you applied to get such performance.


I used the following optimizations in the Client ( https://github.com/valyala/ybc/blob/master/libs/go/memcache/client.go ):

- Automatic pipelining of multiple requests over a small set of connections to the memcache server instead of using more traditional connection pools. This reduces the number of TCP packets travelling between the client and the server in both directions. A smaller number of TCP packets means less CPU time spent creating and parsing outgoing/incoming packets.
- Buffering of outgoing requests and incoming responses with bufio buffers. Send buffers are flushed only when there are no pending requests to send to the server. This significantly reduces the number of send() and recv() syscalls on a highly loaded system.
- Avoiding explicit and implicit dynamic memory allocations in hot paths.

These optimizations become visible with an increasing number of concurrent workers calling Client methods.
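
Roughly, the sender side of the pipelining looks like this (a simplified sketch with hypothetical type names, not the actual client.go code):

package client

import (
	"bufio"
	"net"
)

// request is a hypothetical representation of one serialized memcache command.
type request struct {
	data []byte     // serialized command bytes
	done chan error // notifies the caller once the command has been written
}

// sender drains a shared request channel into a single connection.
// Flush() is called only when no more requests are immediately pending,
// so under load many small requests are coalesced into few send()
// syscalls and few TCP packets.
func sender(conn net.Conn, requests <-chan request) {
	w := bufio.NewWriter(conn)
	for req := range requests {
		_, err := w.Write(req.data)
		req.done <- err

		// Keep writing for as long as more requests are already queued.
	drain:
		for {
			select {
			case next, ok := <-requests:
				if !ok {
					break drain
				}
				_, err := w.Write(next.data)
				next.done <- err
			default:
				break drain
			}
		}
		// The queue is empty right now - flush the whole batch at once.
		// (Flush error handling is omitted in this sketch.)
		w.Flush()
	}
}

With hundreds of goroutines pushing requests, each Flush() typically carries many requests, which is where the syscall and packet savings come from.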

--
Best Regards,

Aliaksandr

Aliaksandr Valialkin

Nov 16, 2012, 6:29:27 PM
to Vladimir Mihailenco, golang-nuts

On Thu, Nov 15, 2012 at 3:26 PM, Vladimir Mihailenco <vladimi...@gmail.com> wrote:
I applied a similar trick to my redis client benchmark and lifted qps from 26k to 90k. It is all about auto-batching via buffered reading/writing, because the client writes and reads N items at once (I think in your case it is a few thousand items). And I am not sure whether that is acceptable behaviour for a general-purpose client...

Why might it be unacceptable? Requests are flushed to the underlying connection as soon as the pending requests' queue becomes empty. This prevents requests from getting stuck in send buffers.
A similar trick is also used in the Server ( https://github.com/valyala/ybc/blob/master/libs/go/memcache/server.go ) - it flushes send buffers with outgoing responses only when the pending requests' queue is empty. This prevents responses from getting stuck in send buffers.

--
Best Regards,

Aliaksandr

Vladimir Mihailenco

Nov 17, 2012, 6:10:23 AM
to Aliaksandr Valialkin, golang-nuts
I missed how you flush requests. Indeed it is very cool and I am going to use it :) Thanks again.

Aliaksandr Valialkin

Nov 18, 2012, 5:53:23 PM
to Vladimir Mihailenco, golang-nuts
FYI,

The package ( https://github.com/valyala/ybc/tree/master/libs/go/memcache ) now contains two additional Client implementations:
  * DistributedClient - routes requests to multiple memcache servers using consistent hashing. Servers may be dynamically added to or removed from the client at any time (a rough sketch of the routing idea follows this list).
  * CachingClient - saves network bandwidth between memcache servers and the client by storing retrieved items in a local (in-process) cache and using 'conditional get' requests for item re-validation. CachingClient doesn't work with the original memcached server, because it relies on 'condition-aware' commands, which aren't supported by the original memcached.
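
For reference, the routing idea behind a distributed client is roughly the following (an illustrative sketch, not the actual DistributedClient code):

package hashring

import (
	"fmt"
	"hash/fnv"
	"sort"
)

// ring maps keys to servers via consistent hashing: every server owns
// several pseudo-random points on a hash ring, and a key is routed to
// the owner of the first point clockwise from the key's own hash.
type ring struct {
	points []uint32          // sorted hash points
	owner  map[uint32]string // point -> server address
}

func hash32(s string) uint32 {
	h := fnv.New32a()
	h.Write([]byte(s))
	return h.Sum32()
}

// addServer places `replicas` points for the server on the ring.
func (r *ring) addServer(addr string, replicas int) {
	if r.owner == nil {
		r.owner = make(map[uint32]string)
	}
	for i := 0; i < replicas; i++ {
		p := hash32(fmt.Sprintf("%s#%d", addr, i))
		r.owner[p] = addr
		r.points = append(r.points, p)
	}
	sort.Slice(r.points, func(i, j int) bool { return r.points[i] < r.points[j] })
}

// serverForKey returns the server responsible for the key.
func (r *ring) serverForKey(key string) string {
	h := hash32(key)
	i := sort.Search(len(r.points), func(i int) bool { return r.points[i] >= h })
	if i == len(r.points) { // wrap around the ring
		i = 0
	}
	return r.owner[r.points[i]]
}

Because only the points owned by an added or removed server change hands, most keys keep mapping to the same server when the server list changes.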

--
Best Regards,

Aliaksandr

Dustin Sallings

Nov 18, 2012, 6:54:50 PM
to golan...@googlegroups.com
Aliaksandr Valialkin <val...@gmail.com>
writes:
What are these commands (and what supports them)? Is this like a
binary GET with the CAS specified?

It's not supported because it seems like a generally bad idea: the main
concern with cache access is latency. Requests are typically small
enough that the difference between asking "what's the value" and "has
the value changed" is too small to justify the extra complexity.

Do you have any measurements showing where this has a benefit worth the
additional complexity?

--
dustin

Dustin Sallings

Nov 20, 2012, 2:33:27 PM
to golan...@googlegroups.com
Aliaksandr Valialkin <val...@gmail.com>
writes:

> Currently CGET and CSET commands are supported only by the Server
> class
> - https://github.com/valyala/ybc/blob/master/libs/go/memcache/server.go 
> and memcached app built on top of this class
> - https://github.com/valyala/ybc/tree/master/apps/go/memcached .

So, this is a CAS-based GET, it seems.

(note, I wrote the CAS support in memcached and countless other client
and server implementations)

> CachingClient can significantly reduce latency by avoiding CGET
> round-trip to the server and returning locally cached item if the item
> has non-zero ValidateTtl value set via
> CachingClient.SetWithValidateTtl() call.

Well, sure, you're trading latency for coherence. If that works for
your application, then it's a tradeoff you're making that's really
independent of memcached.

> I have no any measurements at the moment. But simple math suggests
> than 'conditional get' requests will start saving network bandwidth
> when the average size of frequently requested items exceeds 10-100
> bytes depending on the number of responses sent per each TCP packet.
> Bandwidth savings may be significant for big items such as images or
> heavy web pages' parts.

This is why I'm thinking measurements would be good before suggesting
usage of this sort of thing in general (perhaps discussion over on the
memcached list?). I would posit that in practice, the latency wouldn't
be worth the bandwidth savings. For a second-level cache, a tap-based
evented invalidation protocol would provide the lowest-latency coherent
protocol, I'd think.

I do something similar for larger item caches served with a more
stream-friendly protocol (http). The larger an item gets, the less
suitable memcached is for it as the things we do for keeping the latency
low on object retrieval start to be a liability for first-byte latency
on larger objects. I don't have good figures for where that starts to
take effect on various networks, though. Intuitively, retrieving an
object on the order of a few bytes, which completes in a single packet,
and pulling a 10GB object into an HTTP client are vastly different
problems.

You still have to figure out *what* to cache. If you have needs in
this area, I'd be interested in hearing more about your workloads and
requirements (probably more so than the rest of the list).

--
dustin

Aliaksandr Valialkin

Nov 20, 2012, 4:43:17 PM
to golan...@googlegroups.com


On Tuesday, November 20, 2012 9:33:53 PM UTC+2, Dustin wrote:
Aliaksandr Valialkin <val...@gmail.com>
writes:

> Currently CGET and CSET commands are supported only by the Server
> class
> - https://github.com/valyala/ybc/blob/master/libs/go/memcache/server.go 
> and memcached app built on top of this class
> - https://github.com/valyala/ybc/tree/master/apps/go/memcached .

  So, this is a CAS-based GET, it seems.

  (note, I wrote the CAS support in memcached and countless other client
  and server implementations)

Yes, I just came to the same conclusion :) I'll merge the CAS value with the ETAG value after finishing the 'cas' command implementation - see https://github.com/valyala/ybc/commit/5ca2af49267c397de88e0c4c764486cc026ce4ae .

  This is why I'm thinking measurements would be good before suggesting
usage of this sort of thing in general (perhaps discussion over on the
memcached list?).  I would posit that in practice, the latency wouldn't
be worth the bandwidth savings.  For a second-level cache, a tap-based
evented invalidation protocol would provide the lowest-latency coherent
protocol, I'd think.

Agreed - without measurements it's all just hand-waving.
But all sorts of HTTP browsers and HTTP proxy caches successfully use conditional get, aka 'If-None-Match' ( http://www.w3.org/Protocols/rfc2616/rfc2616-sec13.html#sec13.3 ), together with validateTtl, aka 'Expires' and 'Cache-Control: max-age' ( http://www.w3.org/Protocols/rfc2616/rfc2616-sec13.html#sec13.4 ), to save network bandwidth.

Conditional get probably hasn't been implemented in memcached-like services yet because of the lack of good libraries for in-process local caching. Without such libraries conditional get is useless. I hope the YBC library ( https://github.com/valyala/ybc ) will fill this gap.
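
In memcache terms the flow could look roughly like this (hypothetical names and signatures, not the actual CachingClient API):

package caching

import (
	"errors"
	"time"
)

// localItem is what a caching client keeps in its in-process cache.
type localItem struct {
	value      []byte
	etag       uint64    // server-assigned version of the value (CAS-like)
	validUntil time.Time // until this moment the item is served without any round-trip
}

// errNotModified is returned by CGet when the server's etag matches ours.
var errNotModified = errors.New("not modified")

// conditionalGetter is a hypothetical interface for the cget command:
// the server sends the value only if its etag differs from the one we
// already hold, otherwise it answers "not modified" plus a fresh ttl.
type conditionalGetter interface {
	CGet(key string, etag uint64) (value []byte, newEtag uint64, validateTtl time.Duration, err error)
}

func getCached(c conditionalGetter, cache map[string]*localItem, key string) ([]byte, error) {
	it := cache[key]
	if it != nil && time.Now().Before(it.validUntil) {
		return it.value, nil // still fresh: no network traffic at all
	}
	var etag uint64
	if it != nil {
		etag = it.etag
	}
	value, newEtag, ttl, err := c.CGet(key, etag)
	switch {
	case err == nil: // value changed (or we had none): store the new version
		cache[key] = &localItem{value: value, etag: newEtag, validUntil: time.Now().Add(ttl)}
		return value, nil
	case errors.Is(err, errNotModified) && it != nil: // unchanged: just extend freshness
		it.validUntil = time.Now().Add(ttl)
		return it.value, nil
	default:
		return nil, err
	}
}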


  I do something similar for larger item caches served with a more
stream-friendly protocol (http).  The larger an item gets, the less
suitable memcached is for it as the things we do for keeping the latency
low on object retrieval start to be a liability for first-byte latency
on larger objects.  I don't have good figures for where that starts to
take effect on various networks, though.  Intuitively, you can see an
object retrieval on the order of a few bytes completing in a single
packet vs. a 10GB object being pulled into an HTTP client is a vastly
different problem.

Do you mean the memcache protocol isn't well suited for transferring large objects? I fully agree with this. For efficient transfer of large objects the protocol must support the following features, which are missing in the memcache protocol:
- Ability to interleave multiple responses over a single stream. Without this feature the distribution of response times will have a thick, long tail.
- Ability to control each response's flow independently. Without this feature a single slow consumer, which doesn't bother reading its response quickly, may stall other consumers reading responses from the same stream.
- Ability to send responses in arbitrary order, independent of the requests' order. Without this feature a single slow request may stall the execution of other requests and/or the sending of their responses over the same stream.
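
For illustration, a frame header along these lines would be enough to support all three features (purely a sketch, not an actual wire format):

package wire

import (
	"encoding/binary"
	"io"
)

// frameHeader tags every chunk of data with the request it belongs to,
// so responses can be interleaved, sent out of order, and flow-controlled
// per request without blocking the whole connection.
type frameHeader struct {
	RequestID uint32 // which request/response this frame belongs to
	Flags     uint8  // e.g. a FIN bit marking the last frame of a response
	Length    uint16 // payload size of this frame
}

func writeFrame(w io.Writer, requestID uint32, flags uint8, payload []byte) error {
	var hdr [7]byte
	binary.BigEndian.PutUint32(hdr[0:4], requestID)
	hdr[4] = flags
	binary.BigEndian.PutUint16(hdr[5:7], uint16(len(payload)))
	if _, err := w.Write(hdr[:]); err != nil {
		return err
	}
	_, err := w.Write(payload)
	return err
}

func readFrame(r io.Reader) (frameHeader, []byte, error) {
	var hdr [7]byte
	if _, err := io.ReadFull(r, hdr[:]); err != nil {
		return frameHeader{}, nil, err
	}
	h := frameHeader{
		RequestID: binary.BigEndian.Uint32(hdr[0:4]),
		Flags:     hdr[4],
		Length:    binary.BigEndian.Uint16(hdr[5:7]),
	}
	payload := make([]byte, h.Length)
	_, err := io.ReadFull(r, payload)
	return h, payload, err
}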

I plan to implement such a protocol in the future, building on the foundations of my ancient work http://code.google.com/p/multiplexing-rpc/ .

--
Best Regards,

Aliaksandr

Dustin Sallings

Nov 20, 2012, 8:03:32 PM
to golan...@googlegroups.com
Aliaksandr Valialkin <val...@gmail.com>
writes:

> Agreed with you - it's all just hand-waving without measurements.
> But all sorts of HTTP browsers and HTTP proxy caches successfully use
> conditional get aka 'If-None-Match' (
> http://www.w3.org/Protocols/rfc2616/rfc2616-sec13.html#sec13.3 )
> together with validateTtl aka 'Expires' and 'Cache-Control: max-age'
> ( http://www.w3.org/Protocols/rfc2616/rfc2616-sec13.html#sec13.4 ) for
> saving network bandwidth.
>
> Probably conditional get wasn't implemented yet in memcached-like
> services, because of the lack of good libraries for in-process local
> caching. Without such libraries conditional get is useless. I hope YBC
> library ( https://github.com/valyala/ybc ) will fill this gap.

It's got more to do with the most effective uses of memcached having
results that can be delivered before most HTTP requests complete
processing.

> Do you mean memcache protocol isn't well suited for transferring large
> objects? I fully agree with this. For efficient transferring of large
> objects protocol must support the following features missing in
> memcache protocol:
> - Ability to interleave multiple responses over a single stream.
> Without this feature response times' distribution will have thick long
> tail.
> - Ability to control each response's flow independently. Without this
> feature a single slow consumer, which don't bother reading the
> response quickly, may stall other consumers reading responses from the
> same stream.
> - Ability to send responses in arbitrary order, which is independent
> of requests' order. Without this feature a single slow request may
> stall other requests' execution and/or responses' sending, which are
> transferred over the same stream.

You can do the last one already (and I take advantage of it in at
least some of my go applications), but I don't think it makes sense to
do the rest.

If the response is so large you benefit from these things, it doesn't
seem that memcached is really worth it. Once you get past the complete
response latency requirement, you're just streaming and io.Copy on an
HTTP response is going to be increasingly hard to beat in proportion to
your response size.
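
For example, serving a large cached object that way is essentially just
the following (hypothetical on-disk layout, sanity checks trimmed):

package main

import (
	"io"
	"net/http"
	"os"
	"path/filepath"
)

// serveBlob streams a cached object straight from disk to the client.
// io.Copy moves it chunk by chunk, so memory use and first-byte latency
// stay flat regardless of object size.
func serveBlob(w http.ResponseWriter, r *http.Request) {
	name := filepath.Join("/ssd1/cache", filepath.Clean(r.URL.Path))
	f, err := os.Open(name)
	if err != nil {
		http.NotFound(w, r)
		return
	}
	defer f.Close()
	io.Copy(w, f) // no full-object buffering, no size limit
}

func main() {
	http.HandleFunc("/", serveBlob)
	http.ListenAndServe(":8080", nil)
}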

> I plan implementing such a protocol in the future using foundations of
> my ancient work http://code.google.com/p/multiplexing-rpc/ .

I wrote this and use it for HTTP between cbfs nodes (keeps me from
burning up all the file descriptors on internode traffic):

https://github.com/dustin/frames

It'd also work OK for memcached, but it wouldn't be as fast. It's
sort of a simplified "TCP over TCP", which is kind of what you're
describing.

It's not as fast as a raw socket, but it's got all the properties you
list above plus sane timeout handling where I can detect a lull in
traffic and inject manual pings or just measure production and
consumption passively and drop the transport connections when data stops
flowing sufficiently. Before, I tried doing this at the HTTP layer and
had a bad mix of cutting off connections too early and leaving dead ones
around too long. Now it's just a few seconds between positively active
and sufficiently non-responsive.

--
dustin

Aliaksandr Valialkin

Nov 21, 2012, 3:32:34 AM
to Dustin Sallings, golang-nuts
  I wrote this and use it for HTTP between cbfs nodes (keeps me from
burning up all the file descriptors on internode traffic):

    https://github.com/dustin/frames

  It'd also work OK for memcached, but it wouldn't be as fast.  It's
sort of a simplified "TCP over TCP", which is kind of what you're
describing.

  It's not as fast as a raw socket, but it's got all the properties you
list above plus sane timeout handling where I can detect a lull in
traffic and inject manual pings or just measure production and
consumption passively and drop the transport connections when data stops
flowing sufficiently.  Before, I tried doing this at the HTTP layer and
had a bad mix of cutting off connections too early and leaving dead ones
around too long.  Now it's just a few seconds between positively active
and sufficiently non-responsive.


Thanks - I'll definitely take a closer look at https://github.com/dustin/frames . At first sight its protocol is similar to multiplexing-rpc's protocol. But I don't see any sign of per-channel flow control in the protocol. How does it handle situations when a receiver lags behind reading incoming data from its channel? Does it buffer incoming packets indefinitely until it hits an out-of-memory error? Or drop the channel when a given threshold of buffered unread packets is reached? Or stop reading packets from the underlying TCP connection, which effectively blocks all receivers reading from channels on that connection?

-- 
Best Regards,

Aliaksandr

Dustin Sallings

Nov 21, 2012, 4:22:59 AM
to golan...@googlegroups.com
Aliaksandr Valialkin <val...@gmail.com>
writes:

> Thanks - I'll definitely look closer
> to https://github.com/dustin/frames . From the first sight its'
> protocol is similar to multiplexing-rpc's protocol. But I don't see a
> sign of per-channel flow control in the protocol. How it handles
> situations when receiver lags reading all incoming data from the
> channel? Does it buffer indefinitely incoming packets until out of
> memory error? Or drops the channel when the given threshold of
> buffered unread packets reached? Or stops reading packets from the
> given TCP connection, which effectively blocks all receivers reading
> from channels on the given TCP connection?

I don't have a need for flow control in my current application. If
the buffers are exceeded and not drained, traffic would stop flowing
over the connection and it would drop, and the clients would get errors
and retry if it were important (most things are assumed to be in a
semi-broken state, so if something doesn't get through, it'll just try
again later).

Adding flow control wouldn't be particularly hard, though. It knows
when nothing's reading and can send a stop message to the other side to
have the writer just block. Maybe I'll need that someday. :)
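
A sketch of what that could look like on the writer side (hypothetical,
not part of the frames package today):

package flowctl

import "sync"

// flowGate blocks a per-channel writer while the peer has signalled
// "stop" (its read buffer is full) and releases it on "resume".
type flowGate struct {
	mu      sync.Mutex
	cond    *sync.Cond
	stopped bool
}

func newFlowGate() *flowGate {
	g := &flowGate{}
	g.cond = sync.NewCond(&g.mu)
	return g
}

// set is called by the connection reader when a STOP or RESUME frame
// arrives for this channel.
func (g *flowGate) set(stopped bool) {
	g.mu.Lock()
	g.stopped = stopped
	g.mu.Unlock()
	g.cond.Broadcast()
}

// wait is called by the writer before sending each data frame; it
// blocks for as long as the peer asks us to stop.
func (g *flowGate) wait() {
	g.mu.Lock()
	for g.stopped {
		g.cond.Wait()
	}
	g.mu.Unlock()
}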

I hadn't seen multiplexing-rpc. I got most of frames done and tested
over a weekend to solve the timeout problem I was having.

--
dustin

Aliaksandr Valialkin

Nov 29, 2012, 11:55:48 AM
to golan...@googlegroups.com
Hi all,

Here are some updates since the initial posting:

The Server type and the memcache server now support the following memcache commands:
  * set
  * cas
  * add
  * get
  * gets
  * flush_all
And the following memcache extensions:
  * getde - dogpile effect-aware get.
  * cget - conditional get, which may be used for re-validating locally cached items.
  * cgetde - dogpile effect-aware conditional get.

The go-memcached server now supports multiple cache files. This may speed up the server if frequently requested objects don't fit in RAM and the cache files are located on distinct physical drives. For example, the following command will start the memcache server with distinct cache files located on three SSD drives (if these drives are mounted to the corresponding folders :) ):
$ ./go-memcached -cacheFilesPath="/ssd1/cache,/ssd2/cache,/ssd3/cache"

The go-memcached-bench tool now supports simultaneous benchmarking of multiple memcache servers - just pass the servers' addresses, delimited by commas, via the -serverAddrs flag:
$ ./go-memcached-bench -serverAddrs="host1:11211,host2:11211,host3:11211"

According to memcached-bench, the go-memcached server is now comparable to or even faster than the original memcached:
workerMode    qps for original memcached    qps for go-memcached
GetMiss                           486049                  619761
GetHit                            187641                  198050
Set                               204167                  202428
GetSetRand                        179893                  174983

The memcache package now contains the following client implementations:
  * Client - talks to a single memcache server.
  * DistributedClient - routes requests to multiple servers using consistent hashing. Supports addition/removal of servers on the fly.
  * CachingClient - saves network bandwidth between the client and servers by storing responses in a local cache. It can talk only to servers supporting the 'conditional get' (cget) memcache extension.

Memcache package documentation is available at http://go.pkgdoc.org/github.com/valyala/ybc/libs/go/memcache .