memcached performance

schatzb...@gmail.com

Mar 22, 2015, 5:52:18 PM
to osv...@googlegroups.com
Hi All,

I've been measuring the performance of OSv memcached compared to Linux. I built and ran OSv as follows:

osv git tree @0df9862
apps tree @5e6b9ae

$ make image=memcached mode=release

$ ./scripts/run.py --novnc -m2G -c1 -nv -e "memcached -u root -t1 -m1024"

$ qemu-system-x86_64 -m 2G -cpu host -enable-kvm --nographic -smp cpus=1 \
-netdev tap,ifname=tap0,id=vlan1,script=no,downscript=no,vhost=on \
-device virtio-net-pci,netdev=vlan1,mac=00:11:22:33:44:55 ...(irrelevant devices)...

(I run the qemu line by hand rather than through the run script)
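
(For completeness, a rough sketch of the host-side tap setup the hand-run qemu line assumes; the interface and bridge names are illustrative, not necessarily exactly what I use:)

$ sudo ip tuntap add dev tap0 mode tap
$ sudo ip link set dev tap0 up
$ sudo ip link set dev tap0 master br0   # assumes an existing bridge; giving tap0 an address directly also works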

I run a Linux guest similarly. I use the mutilate benchmark running on a separate machine connected to the host via a direct 10GbE cable. The benchmark measures latency as a function of throughput. It's been configured to use the Facebook ETC workload. Here are my results:

http://i.imgur.com/eQXZbRa.png

I wonder why the performance is worse on OSv as compared to Linux. Your USENIX paper shows better memcached throughput (albeit, with UDP). Did I misconfigure the OSv system or is this a known phenomenon? I also ran the same test using multicore, and found OSv performance degrades whereas Linux's scales. I suspect this is due to the lack of multiqueue support in your virtio-net driver (which Linux supports).
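
(For reference, roughly how multiqueue is enabled on the Linux guest side; the queue count and guest interface name here are illustrative only:)

$ qemu-system-x86_64 ... \
  -netdev tap,ifname=tap0,id=vlan1,script=no,downscript=no,vhost=on,queues=4 \
  -device virtio-net-pci,netdev=vlan1,mac=00:11:22:33:44:55,mq=on,vectors=10

# then, inside the Linux guest:
$ ethtool -L eth0 combined 4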

Dor Laor

Mar 22, 2015, 6:32:51 PM
to Dan Schatzberg, Osv Dev
On Sun, Mar 22, 2015 at 11:52 PM, <schatzb...@gmail.com> wrote:

I wonder why the performance is worse on OSv as compared to Linux. Your USENIX paper

We indeed expect the opposite result. Your configuration looks OK. It would be worth sharing the client-side configuration and getting some host-side 'perf top' output. My guess is that it could be different default buffer sizes.
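
For example, something along these lines would be a start (the sysctls are the standard Linux-guest ones; what OSv uses as defaults would need to be checked separately):

On the host, while the benchmark runs:
$ perf top

On the Linux guest, to see the default socket buffer sizes being compared against:
$ sysctl net.core.rmem_max net.core.wmem_max
$ sysctl net.ipv4.tcp_rmem net.ipv4.tcp_wmem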
 
shows better memcached throughput (albeit, with UDP). Did I misconfigure the OSv system or

The UDP memcached referred to there is a simple implementation that we wrote from scratch, while the TCP memcached is the same unmodified app, which doesn't take advantage of OSv's strengths (mainly zero copy).

is this a known phenomenon? I also ran the same test using multicore, and found OSv performance degrades whereas Linux's scales. I suspect this is due to the lack of multiqueue support in your virtio-net driver (which Linux supports).

It could be, although there are cases where multiqueue virtio can hurt performance too (less coalescing).

If you're interested in TCP memcached, the best option would be to test the SeaStar one:


Dan Schatzberg

Mar 22, 2015, 10:17:17 PM
to Dor Laor, Osv Dev
Hi Dor, 
Thank you for your quick response.

The client is https://github.com/leverich/mutilate. After initially populating the server cache, on my 20-core client machine I pin 16 threads which open 8 connections each. Each connection is allowed to have up to 4 outstanding GET requests at any one time. The clients issue requests at a rate sufficient to match the target aggregate request rate. A 17th thread opens a single connection, attempts one request per millisecond, and measures its latency. All requests are GETs of 20-70 byte keys, with most values between 1 and 1024 bytes. As I understand it, this is a standard run of the mutilate benchmark.

Because memcached messages are small, and only four can be outstanding per connection, I don't see how this can be a TCP windowing or buffering issue. Likewise, less coalescing shouldn't have an impact. I will try to gather perf results and report back.

Dor Laor

Mar 23, 2015, 9:03:36 AM
to Dan Schatzberg, Osv Dev
I gave Mutilate a try using SeaStar [1] and got the results below.
A few commands didn't work as expected (the --loadonly option didn't finish, so I had
to ignore it; also, the agent requires different parameters than the ones documented).

The client runs on a 28-core/56-thread machine.
The server runs on an identical machine but uses only 8 cores (DPDK poll mode).


-sh-4.3$ ./mutilate -s 1.1.1.1 --noload   -c 4  -B -T 56 -Q 1000 -D 4 -C 4  -t 20 -u 1.0
#type       avg     std     min     5th    10th    90th    95th    99th
read        0.0     0.0     0.0     0.0     0.0     0.0     0.0     0.0
update    121.0    71.7    23.2    75.4    84.0   161.6   184.1   250.3
op_q        1.0     0.0     1.0     1.0     1.0     1.1     1.1     1.1

Total QPS = 1844931.2 (36904696 / 20.0s)

Misses = 0 (-nan%)
Skipped TXs = 0 (0.0%)

RX  221428176 bytes :   10.6 MB/s
TX 9078596544 bytes :  432.8 MB/s

Dan Schatzberg

Mar 23, 2015, 12:58:29 PM
to Dor Laor, Osv Dev
​Hi Dor,

Here is how I run mutilate:

Start an agent on 16 threads:
taskset -a -c 0-15 mutilate --binary -T 16 -A &

Load the server:
mutilate --binary -s $SERVER_IP --loadonly -K fb_key -V fb_value

Run the test:
taskset -a -c 16 mutilate --binary -s $SERVER_IP --noload -K fb_key -V fb_value -i fb_ia -B -t 30 -T 1 -C 1 -Q 1000 -D 1 -q 0 -c 8 -d 4 -a 127.0.0.1

My understanding is that the capital-letter options configure the master probe thread (1 thread, 1 connection, 1 outstanding request at a time, 1000 requests per second), the lower-case options configure the agent (8 connections per thread, 4 outstanding requests per connection), and -q sets the target throughput (I vary this to generate the graph).
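
For example, the sweep that produces the graph looks roughly like this (the QPS values here are placeholders, not the exact points I measured):

$ for qps in 10000 50000 100000 200000 400000; do
    taskset -a -c 16 mutilate --binary -s $SERVER_IP --noload -K fb_key -V fb_value -i fb_ia \
      -B -t 30 -T 1 -C 1 -Q 1000 -D 1 -q $qps -c 8 -d 4 -a 127.0.0.1
  done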

One of the reasons I have been testing single-core is that a single client machine may be insufficient to bottleneck the server when it is using multiple cores. Your results show, at peak throughput, a mean latency of only 121 microseconds, which is still quite low, indicating that no queueing has occurred on the server. You also ran it with 100% SET operations, the opposite of how I had done it.

I looked a bit at SeaStar - this is using a custom TCP/IP stack with DPDK poll mode drivers and a modified memcached? And this runs on Linux as well as OSv with virtio-net device?

Dan Schatzberg

Mar 23, 2015, 1:03:45 PM
to Dor Laor, Osv Dev
Looking a bit closer, I'm guessing SeaStar's memcached doesn't properly handle multiple memcached requests within the same TCP segment. This is likely what you are hitting when you run --loadonly, and why you don't pass '-d 4' on the mutilate line. Are you able to run an unmodified version of memcached?
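
A quick way to see what I mean (illustrative only; this uses the ASCII protocol and nc rather than mutilate's binary protocol): two requests sent in one write will typically land in the same TCP segment, and a server that only parses the first request per segment will answer only the first get.

$ printf 'get key1\r\nget key2\r\n' | nc $SERVER_IP 11211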

Dor Laor

Mar 23, 2015, 3:17:35 PM
to Dan Schatzberg, Osv Dev
SeaStar is our new flagship performance platform. It has its own TCP/IP stack (though it's still possible to use POSIX too). It can run either on top of DPDK or virtio-net (or even the combination), and it runs on both Linux and OSv. The results above were achieved using a physical Linux server. Since we bypass most of the OS services, it matters less which OS we're using.

I'll go and check your settings tomorrow. Thanks. From checking 'top' and 'perf top', it looked
like the client is already pretty busy with a standard threaded model. 60 client threads couldn't
overload 8 SeaStar cores.

Cheers,
Dor