C++/Java Performance numbers

Rob Adams

Jun 16, 2009, 12:55:57 PM

to project-...@googlegroups.com

I've done some tweaking on the C++ client. On my workstation using 10
client threads hitting a single node cluster I can do 34940.8 simple GET
operations per second. With 20 I get 41463.8 ops/sec.

Writing the same code in the Java client, I wasn't able to get to the
same level of performance, though I suspect part of this has to do with
tuning the client thread pool. With 10 stress threads and 5 threadpool
threads I get 20543 gets/second. With 10/10 I get 17465 gets/second, and
with 5/10, 17532.
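
For readers who want to reproduce this kind of measurement, here is a minimal sketch of a multi-threaded stress harness. It is written in Python for brevity rather than in the C++ or Java used for the numbers above. `run_stress`, `op`, and the parameter names are hypothetical; `op` stands in for one client GET and nothing here comes from the Voldemort clients.

```python
import threading
import time


def run_stress(op, num_threads, duration_s):
    """Call op() in a loop from num_threads threads; return total ops/sec."""
    counts = [0] * num_threads
    stop = threading.Event()

    def worker(index):
        done = 0
        while not stop.is_set():
            op()          # one request/response round trip in a real test
            done += 1
        counts[index] = done

    threads = [threading.Thread(target=worker, args=(i,))
               for i in range(num_threads)]
    start = time.perf_counter()
    for t in threads:
        t.start()
    time.sleep(duration_s)    # let the workers run for the test duration
    stop.set()
    for t in threads:
        t.join()
    elapsed = time.perf_counter() - start
    return sum(counts) / elapsed


if __name__ == "__main__":
    # A no-op stands in for the GET; replace it with a real client call.
    print(run_stress(lambda: None, num_threads=10, duration_s=1.0))
```

Because of CPython's global interpreter lock, threads in this sketch only help when `op` spends its time waiting on the network, which is the case for a blocking GET.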

I wrote a sort of theoretically optimal client that just starts threads
and a connection for each and does nothing but send "\x00\x00\x00\x13
\x08\x00\x10\x01\x1a\x04\x74\x65\x73\x74\x22\x07\x0a\x05\x68\x65\x6c\x6c
\x6f" (this is a GET request in protocol buffers for the store "test" and
key "hello") and then receives the 28-byte reply (value, world,
version(0:1)). This client can do 42715.2 ops/sec with 20 threads. That
number is effectively the raw performance of the server, so no client is
going to do better.
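
The byte string above can be rebuilt by hand, which makes its structure easier to see: a 4-byte big-endian length prefix (0x13 = 19) followed by a 19-byte protocol buffers message. The sketch below is illustrative only. The field numbers are read straight off the bytes; the meanings given in the comments for fields 1 and 2 are assumptions, and the helper names are hypothetical.

```python
import struct


def varint(n):
    """Encode a non-negative integer as a protocol buffers varint."""
    out = bytearray()
    while True:
        low = n & 0x7F
        n >>= 7
        if n:
            out.append(low | 0x80)   # more bytes follow
        else:
            out.append(low)
            return bytes(out)


def field_varint(number, value):
    # Wire type 0 (varint): tag = (field number << 3) | 0
    return varint(number << 3) + varint(value)


def field_bytes(number, payload):
    # Wire type 2 (length-delimited): tag = (field number << 3) | 2
    return varint((number << 3) | 2) + varint(len(payload)) + payload


def build_get_frame(store, key):
    get_message = field_bytes(1, key)        # nested message holding the key
    body = (field_varint(1, 0)               # assumed: request type (GET = 0)
            + field_varint(2, 1)             # assumed: a boolean flag set to true
            + field_bytes(3, store)          # store name
            + field_bytes(4, get_message))   # nested GET message
    return struct.pack(">I", len(body)) + body


# The bytes quoted in the post, written as hex.
POSTED = bytes.fromhex(
    "00000013" "0800" "1001" "1a04" "74657374" "2207" "0a05" "68656c6c6f")

assert build_get_frame(b"test", b"hello") == POSTED
```

A raw client in this style would send `POSTED` on an open socket and then read the fixed-size reply, with no parsing on either side.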

When running one of these "raw client" tests, CPU utilization on the
server node reaches 197-198% (two-core box). With my C++ client, server
CPU utilization is more like 190%, which would seem to indicate there
is room for improvement. Client CPU utilization is about zero on the
"raw" version, while it's about 140% (different two-core box) on the
real version. This seems to indicate that there's a good amount of room
to squeeze more performance out of this.
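
To put the headroom in numbers, the throughput figures quoted so far can be expressed as a share of the raw-client ceiling. This is plain arithmetic on the posted numbers; the function name is hypothetical, and the 10-thread figures are compared against a 20-thread ceiling, so treat them as rough.

```python
RAW_CEILING = 42715.2   # ops/sec, raw client with 20 threads

MEASURED = {
    "C++ client, 10 threads": 34940.8,
    "C++ client, 20 threads": 41463.8,
    "Java client, 10 stress / 5 pool threads": 20543.0,
}


def percent_of_ceiling(ops_per_sec, ceiling=RAW_CEILING):
    """Return throughput as a percentage of the raw ceiling, to one decimal."""
    return round(100.0 * ops_per_sec / ceiling, 1)


if __name__ == "__main__":
    for label, rate in MEASURED.items():
        print(f"{label}: {percent_of_ceiling(rate)}% of raw")
```

By this measure the C++ client at 20 threads is within about 3% of what the server can deliver, so most of the remaining gain would come from lowering client CPU cost rather than raising peak throughput.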

CPU on the Java client is a bit odd, in that it's high to start with
then trails off after 5 seconds or so.

Voldemort native protocol performance doesn't seem to be different from
the protocol buffers performance.

-Rob

Dan Diephouse

Jun 16, 2009, 1:07:50 PM

to project-...@googlegroups.com

What are your JVM parameters?

--
Dan Diephouse
http://mulesource.com | http://netzooid.com/blog

Rob Adams

Jun 16, 2009, 1:12:55 PM

to project-...@googlegroups.com

nothing special; using run-class.sh from master, so
"-Xmx2G -server -Dcom.sun.management.jmxremote"

-Rob

ijuma

Jun 16, 2009, 1:22:00 PM

to project-voldemort

On Jun 16, 5:55 pm, Rob Adams <read...@readams.net> wrote:
> CPU on the Java client is a bit odd, in that it's high to start with
> then trails off after 5 seconds or so.

Some of this CPU usage could be the JIT compilation. Did you give the
JIT enough time to warm up before timing the measurements (both on the
server and Java client)?

Ismael

Rob Adams

Jun 16, 2009, 1:50:58 PM

to project-...@googlegroups.com

That's probably why the CPU use changes; good point. I ran the test for
a bit longer and it sort of asymptotically approaches 24k/sec.
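
One way to keep warm-up out of the reported figure is to sample throughput in fixed windows and discard the first few. The sketch below shows the measurement technique only; it is in Python, which has no JIT in its standard implementation, and the function names and window sizes are hypothetical.

```python
import time


def windowed_rates(op, window_s, num_windows):
    """Run op() back to back and return ops/sec for each time window."""
    rates = []
    for _ in range(num_windows):
        count = 0
        start = time.perf_counter()
        deadline = start + window_s
        while time.perf_counter() < deadline:
            op()
            count += 1
        elapsed = time.perf_counter() - start
        rates.append(count / elapsed)
    return rates


def steady_state_rate(rates, warmup_windows):
    """Average the per-window rates after dropping the warm-up windows."""
    kept = rates[warmup_windows:]
    if not kept:
        raise ValueError("no windows left after discarding warm-up")
    return sum(kept) / len(kept)


if __name__ == "__main__":
    rates = windowed_rates(lambda: None, window_s=1.0, num_windows=10)
    print(rates)                          # per-window view shows the ramp
    print(steady_state_rate(rates, 5))    # ignore the first five windows
```

Printing the per-window rates makes a ramp like the one described here visible directly, instead of folding it into a single average.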

Geir Magnusson Jr.

Jun 16, 2009, 2:06:43 PM

to project-...@googlegroups.com

Can you describe the config? Client on different machine than server,
etc?

Jay Kreps

Jun 16, 2009, 3:38:06 PM

to project-...@googlegroups.com

This is a fantastic result. Are the results over localhost or are they
between two separate machines? (It won't actually affect throughput
much, but it changes latency a lot.)

One thing I would mention is that the goal of parallelism in a single
request is not higher throughput but reduced latency. I think there is
still more work to do there, but definitely having a threadpool will
result in reduced throughput.

I added the C++ client to my personal repo and would like to merge it
into the main repo. I got it built and it appeared to work on my
Linux box at home, but I haven't succeeded in building it on my Mac
laptop. It would be good to get build instructions on the wiki for
different platforms to help people get started. I can add instructions
for Linux; if you have instructions for whatever platform you are on,
it would be good to add those.

http://wiki.github.com/voldemort/voldemort/c-client-build-instructions

I think we also have some work to do on the server for supporting
clients before we can say we really have it done. The first thing is I
think the server and client should negotiate the protocol at socket
connection time. This will allow supporting different protocols on the
same server, and give a clean error message for unknown protocols.
Plus if we need to break the compatibility between client and server
this provides a no-downtime way to do the upgrade (e.g. support both
temporarily, upgrade clients, then upgrade server, then change client
protocol, then remove old protocol code).
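
A minimal sketch of what connection-time negotiation could look like: the client sends a short protocol code as the first bytes on the socket, and the server answers "ok" or "no" before any request is exchanged. Everything here is hypothetical, including the 3-byte codes, the reply strings, and the function names; it illustrates the idea and is not an existing Voldemort wire format.

```python
import socket
import threading

# Hypothetical protocol codes the server is willing to speak.
SUPPORTED_PROTOCOLS = {b"pb0", b"vp0"}


def recv_exact(sock, n):
    """Read exactly n bytes or raise if the peer closes early."""
    buf = b""
    while len(buf) < n:
        chunk = sock.recv(n - len(buf))
        if not chunk:
            raise ConnectionError("peer closed during negotiation")
        buf += chunk
    return buf


def server_negotiate(sock):
    """Read the client's proposal; reply ok/no; return the agreed code."""
    proposal = recv_exact(sock, 3)
    if proposal in SUPPORTED_PROTOCOLS:
        sock.sendall(b"ok")
        return proposal
    sock.sendall(b"no")       # clean, parseable rejection
    return None


def client_negotiate(sock, code):
    """Propose a protocol code; return True if the server accepted it."""
    if len(code) != 3:
        raise ValueError("protocol code must be 3 bytes")
    sock.sendall(code)
    return recv_exact(sock, 2) == b"ok"


def demo(code):
    """Run one negotiation over a local socket pair."""
    client_sock, server_sock = socket.socketpair()
    result = {}

    def serve():
        result["server"] = server_negotiate(server_sock)

    t = threading.Thread(target=serve)
    t.start()
    try:
        result["client"] = client_negotiate(client_sock, code)
    finally:
        t.join()
        client_sock.close()
        server_sock.close()
    return result["client"], result["server"]
```

With a set of supported codes on the server, the upgrade sequence described above amounts to adding a new code to the set, moving clients over, and then removing the old code.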

Would it be possible for anyone who has some knowledge of C++ to take
a look at Rob's code? I am very weak in C++, so it would be good to
get some more knowledgeable review. Rob if we add you as a
collaborator on the main repository would you be willing to help
maintain that code as bugs show up?

Fwiw, here are the similar numbers from the python client:

Single process:
879 get requests per second
643 put requests per second
312 get_all requests per second
733 delete requests per second

With only one process it is basically latency bound.

5 parallel processes:
1309 get requests per second
910 put requests per second
503 get_all requests per second
1506 delete requests per second

This gives about 40% CPU per python process, 10% for server, and the rest idle.
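
A quick back-of-the-envelope check on these figures: a client that issues one request at a time has throughput equal to the inverse of its per-request latency, so the single-process numbers translate directly into round-trip times. The helper names below are hypothetical; the inputs are the numbers listed above.

```python
def serial_latency_ms(requests_per_second):
    """Per-request time implied by a strictly serial client, in ms."""
    return 1000.0 / requests_per_second


def scaling_factor(parallel_rate, single_rate):
    """How much throughput grew when going from 1 to N processes."""
    return parallel_rate / single_rate


if __name__ == "__main__":
    print(round(serial_latency_ms(879), 2))      # get, single process
    print(round(scaling_factor(1309, 879), 2))   # get, 5 processes vs 1
    print(round(scaling_factor(910, 643), 2))    # put, 5 processes vs 1
```

The single-process get rate works out to a little over a millisecond per request, and five processes give roughly 1.5 times the throughput rather than 5 times.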

This makes the idea of wrapping the c++ client very attractive for
python and other interpreted languages.

-Jay

Rob Adams

Jun 16, 2009, 4:27:53 PM

to project-...@googlegroups.com

Two somewhat older machines running 32-bit Ubuntu 9.04, each with 4GB of
RAM (so 3.3ish available).

One used as client, one as the server.

processor : 0
vendor_id : GenuineIntel
cpu family : 6
model : 15
model name : Intel(R) Xeon(R) CPU 5140 @ 2.33GHz
stepping : 6
cpu MHz : 2327.391
cache size : 4096 KB
physical id : 0
siblings : 2
core id : 0
cpu cores : 2
apicid : 0
initial apicid : 0
fdiv_bug : no
hlt_bug : no
f00f_bug : no
coma_bug : no
fpu : yes
fpu_exception : yes
cpuid level : 10
wp : yes
flags : fpu vme de pse tsc msr pae mce cx8 apic sep mtrr pge mca cmov
pat pse36 clflush dts acpi mmx fxsr sse sse2 ss ht tm pbe nx lm
constant_tsc arch_perfmon pebs bts pni dtes64 monitor ds_cpl vmx est tm2
ssse3 cx16 xtpr pdcm dca lahf_lm tpr_shadow
bogomips : 4654.78
clflush size : 64
power management:

processor : 1
vendor_id : GenuineIntel
cpu family : 6
model : 15
model name : Intel(R) Xeon(R) CPU 5140 @ 2.33GHz
stepping : 6
cpu MHz : 2327.391
cache size : 4096 KB
physical id : 0
siblings : 2
core id : 1
cpu cores : 2
apicid : 1
initial apicid : 1
fdiv_bug : no
hlt_bug : no
f00f_bug : no
coma_bug : no
fpu : yes
fpu_exception : yes
cpuid level : 10
wp : yes
flags : fpu vme de pse tsc msr pae mce cx8 apic sep mtrr pge mca cmov
pat pse36 clflush dts acpi mmx fxsr sse sse2 ss ht tm pbe nx lm
constant_tsc arch_perfmon pebs bts pni dtes64 monitor ds_cpl vmx est tm2
ssse3 cx16 xtpr pdcm dca lahf_lm tpr_shadow
bogomips : 4655.01
clflush size : 64
power management:

Rob Adams

Jun 16, 2009, 4:38:12 PM

to project-...@googlegroups.com

On Tue, 2009-06-16 at 12:38 -0700, Jay Kreps wrote:
> This is a fantastic result. Are the results over localhost or are they
> between two separate machines? (It won't actually affect throughput
> much, but it changes latency a lot.)

Two separate machines.

>
> One thing I would mention is that the goal of parallelism in a single
> request is not higher throughput but reduced latency. I think there is
> still more work to do there, but definitely having a threadpool will
> result in reduced throughput.

This is true; the C++ client doesn't do a thread pool right now, as the
model is one StoreClient per user thread. There is a connection pool
which you could use to limit client parallelism by setting a maximum per
host or total connection count.
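
To illustrate how per-host and total connection caps bound client parallelism, here is a small sketch of a limiter with both limits. It is written in Python, not C++, and it is not the C++ client's connection pool API; the class and method names are hypothetical, and it only counts connections rather than managing sockets.

```python
import threading
from collections import defaultdict


class ConnectionLimiter:
    """Caps concurrent connections per host and across all hosts."""

    def __init__(self, max_per_host, max_total):
        self._max_per_host = max_per_host
        self._max_total = max_total
        self._per_host = defaultdict(int)
        self._total = 0
        self._cond = threading.Condition()

    def _has_room(self, host):
        return (self._total < self._max_total
                and self._per_host[host] < self._max_per_host)

    def acquire(self, host, timeout=None):
        """Wait for a free slot for host; return False on timeout."""
        with self._cond:
            if not self._cond.wait_for(lambda: self._has_room(host),
                                       timeout=timeout):
                return False
            self._per_host[host] += 1
            self._total += 1
            return True

    def release(self, host):
        """Give a slot back and wake any threads waiting for one."""
        with self._cond:
            self._per_host[host] -= 1
            self._total -= 1
            self._cond.notify_all()
```

A user thread would call `acquire(host)` before opening or borrowing a connection and `release(host)` when done, so the number of requests in flight never exceeds either cap regardless of how many user threads exist.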

>
> I added the C++ client to my personal repo and would like to merge it
> into the main repo. I got it built and it appeared to work on my
> Linux box at home, but I haven't succeeded in building it on my Mac
> laptop. It would be good to get build instructions on the wiki for
> different platforms to help people get started. I can add instructions
> for Linux; if you have instructions for whatever platform you are on,
> it would be good to add those.
>
> http://wiki.github.com/voldemort/voldemort/c-client-build-instructions
>

I'm using Linux; I don't have access to a Mac OS box. There's really
nothing that shouldn't be portable though, so I'd be surprised if
there's much challenge in getting it working (possibly some headers need
to be checked for in configure).

I added some very rudimentary instructions for the Ubuntu build. Right
now the only real docs are the generated docs, which also include a
sort of intro with some example code.

> I think we also have some work to do on the server for supporting
> clients before we can say we really have it done. The first thing is I
> think the server and client should negotiate the protocol at socket
> connection time. This will allow supporting different protocols on the
> same server, and give a clean error message for unknown protocols.
> Plus if we need to break the compatibility between client and server
> this provides a no-downtime way to do the upgrade (e.g. support both
> temporarily, upgrade clients, then upgrade server, then change client
> protocol, then remove old protocol code).
>

This is a good idea. Right now if you try to connect to a server with
the wrong protocol, it'll generate an error but try to read the error
message from the socket and get gibberish, plus the connection will
desync after the first request. Not very helpful :-)

> Would it be possible for anyone who has some knowledge of C++ to take
> a look at Rob's code? I am very weak in C++, so it would be good to
> get some more knowledgeable review. Rob if we add you as a
> collaborator on the main repository would you be willing to help
> maintain that code as bugs show up?
>

I'm certainly willing to help here. I won't be able to guarantee
instant responsiveness but I'd like to see this grow to be a core part
of the project.

> This makes the idea of wrapping the c++ client very attractive for
> python and other interpreted languages.