Benchmarking Results for 18 Node Cluster with 200M records of 2.5KB blobs


Erich Nachbar

Feb 13, 2009, 1:32:35 PM
to project-...@googlegroups.com
Guys,

Here are my results from testing Voldemort with 200M 14KB records (compressed down to 2.5KB on the client). The basic idea was to see whether it could replace the DB as persistent storage for HTML-ish content. The current read vs. write ratio is hugely biased towards reads.

Cluster Setup:
- 19 Voldemort Nodes (4 CPUs, 1 SATA Drive, 8GB)
- Voldemort JVM: 4GB total, with 3GB dedicated to the BDB cache (see the config sketch below)
- Java 1.6 64Bit
- Max of 500 allowed threads
- Replication factor 2
- BDB file size on each node: 55GB when storing the 200M 2.5KB blobs
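
For reference, the node-side settings above roughly correspond to server.properties entries like the ones below. This is only an illustrative sketch, not a copy of my actual config; the property names (max.threads, bdb.cache.size) and exact values are approximate.

  # server.properties (illustrative sketch, not the exact config used)
  # cap on server-side threads, matching the 500 above
  max.threads=500
  # roughly 3GB of the 4GB heap reserved for the BDB cache (value in bytes)
  bdb.cache.size=3221225472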

Load Driver:
- 2 Nodes (16 CPUs, 1 SATA Drive, 32GB)
- JVM 1GB
- Java 1.6 64Bit

Writing Test Setup
The 2 load drivers performed write requests against the cluster. Each load driver created keys prefixed with its own namespace (i.e. a[0-99999999] and b[0-99999999]) to avoid generating multiple versions of the same key. Each driver used 50 threads to write into the cluster as fast as possible. While the HTML content never changed, it was always compressed on the fly before being persisted. Compression was quite effective, getting the 14KB file down to 2.5KB before sending it off.
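
To make that concrete, each writer thread was essentially doing the equivalent of the loop below. This is only a sketch against the Voldemort client API, not my actual driver code: the store name, bootstrap URL and key prefix are placeholders, java.util.zip.Deflater stands in for the QuickLZ compressor I actually used, and constructor details may differ between Voldemort versions.

  import java.util.zip.Deflater;
  import voldemort.client.ClientConfig;
  import voldemort.client.SocketStoreClientFactory;
  import voldemort.client.StoreClient;
  import voldemort.client.StoreClientFactory;

  public class WriteDriver {
      public static void main(String[] args) {
          // Bootstrap URL and store name are placeholders.
          StoreClientFactory factory = new SocketStoreClientFactory(
                  new ClientConfig().setBootstrapUrls("tcp://voldemort-node1:6666"));
          StoreClient<String, byte[]> store = factory.getStoreClient("html-snippets");

          byte[] html = new byte[14 * 1024];       // stands in for the ~14KB HTML snippet
          // The real driver ran 50 of these loops in parallel, one per thread.
          for (int i = 0; i < 100000000; i++) {
              // The "a" prefix keeps this driver's keys separate from driver "b".
              store.put("a" + i, compress(html));  // ~2.5KB after compression
          }
      }

      // Deflater is used here only for illustration; the test used QuickLZ.
      static byte[] compress(byte[] input) {
          Deflater deflater = new Deflater(Deflater.BEST_SPEED);
          deflater.setInput(input);
          deflater.finish();
          byte[] buf = new byte[input.length + 64];
          int len = deflater.deflate(buf);
          deflater.end();
          byte[] out = new byte[len];
          System.arraycopy(buf, 0, out, 0, len);
          return out;
      }
  }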

Write Results
Each node was able to persist around 7-8K 2.5KB records per second while performing the compression on the fly. I didn't have enough load drivers to find out whether this is the maximum capacity; given the numbers, I don't think I was able to max it out. The Voldemort CPUs were not busy at all, but I couldn't figure out the IO waits, as they don't seem to be correctly reported through Proxmox (top showed 0%, which is unlikely).

Latency was stable during the test with these numbers towards the end (in ms):
[100% 315 ,99.5% 67 ,99% 34 ,98% 28 ,95% 15 ,90% 10 ,80% 6 ,75% 5 ,66% 5 ,50% 4 ]
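
(To read the bracketed figures: each pair is a percentile cut-off over the per-request latencies, e.g. 95% of writes finished in 15ms or less and the slowest single request took 315ms. A small helper along the lines of the sketch below would produce the same style of summary from raw timings; it is not the actual harness code.)

  import java.util.Arrays;

  public class Percentiles {
      // Prints percentile cut-offs over recorded per-request latencies,
      // in the same "[100% .. 50% ..]" style as the figures above.
      static void print(long[] latenciesMs) {
          long[] sorted = latenciesMs.clone();
          Arrays.sort(sorted);
          double[] pcts = { 100, 99.5, 99, 98, 95, 90, 80, 75, 66, 50 };
          StringBuilder sb = new StringBuilder("[");
          for (double p : pcts) {
              int idx = (int) Math.ceil(p / 100.0 * sorted.length) - 1;
              sb.append(p).append("% ").append(sorted[Math.max(idx, 0)]).append(" ,");
          }
          System.out.println(sb.append("]"));
      }
  }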

Read Results
This is a little trickier to get right, which is why we opted for two scenarios: a worst case and a "realistic" case.
All test cases were run against a base load of 200M records already persisted in Voldemort.

Worst Case Scenario (i.e. completely random access across 200M records)
This scenario is fairly unlikely, as it ignores the existence of "hot" profiles and thus defeats every caching attempt.
With the cache able to hold less than 5% of the data, and given the random nature of the requests, we are basically measuring Berkeley DB's optimizations and our servers' disk subsystem (which is only a single SATA drive per server).

As indicated, this is typically not what one would see in production, where people log on, click around for a while and then log off/stop using the system.

Anyhow, here are the numbers for completely random access across 200M records from 2 nodes:
Read throughput per load gen node with 100 threads: 1009 records/sec (avg)
Read throughput per load gen node with 200 threads: 2200 records/sec (avg)

with the following read latency in ms (100 Threads):
[100% 945 ,99.5% 217 ,99% 185 ,98% 162 ,95% 140 ,90% 126 ,80% 113 ,75% 109 ,66% 103 ,50% 94 ]

with the following read latency in ms (200 Threads):
[100% 2357 ,99.5% 535 ,99% 429 ,98% 335 ,95% 228 ,90% 161 ,80% 112 ,75% 100 ,66% 85 ,50% 66 ]

More Realistic Read Scenario
A more realistic scenario takes "hot" HTML-ish snippets into consideration. I tried to simulate this by restricting random access to 1M snippets (i.e. a[0-500,000] and b[0-500,000]).

The load driver machines are very beefy 16-core servers, so loading them with 400 threads each was no problem. Although they had to decompress the HTML content, they were hardly busy (CPU-wise). I used the QuickLZ library and it seems to be fast & easy to use for my tests. I need to check whether going the gzip route would yield better results.
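
The reader threads were essentially the inverse of the write loop: pick a random key from the hot range, get it from Voldemort, decompress, throw it away. Roughly like the sketch below, with the store set up as in the write sketch and Inflater again standing in for QuickLZ:

  import java.util.Random;
  import java.util.zip.Inflater;
  import voldemort.client.StoreClient;

  public class ReadDriver {
      // One reader thread of the "hot snippet" test: random keys restricted to
      // a[0-500,000], fetched from Voldemort and decompressed on the client.
      static void readLoop(StoreClient<String, byte[]> store) throws Exception {
          Random random = new Random();
          Inflater inflater = new Inflater();  // illustration only; the test used QuickLZ
          byte[] out = new byte[16 * 1024];    // uncompressed snippets are ~14KB
          while (true) {
              byte[] compressed = store.getValue("a" + random.nextInt(500000));
              inflater.reset();
              inflater.setInput(compressed);
              inflater.inflate(out);           // decompressed content is discarded here
          }
      }
  }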

Throughput per load driver node: 21,000 records/sec (!)
Read Latency in ms [100% 1740 ,99.5% 275 ,99% 262 ,98% 220 ,95% 47 ,90% 34 ,80% 22 ,75% 9 ,66% 5 ,50% 4 ]

Increasing or decreasing the thread count for reads & writes yielded very linear behavior. So far I have found the system very predictable and stable when it comes to real-time online content storage and retrieval.

Voldemort is currently running on retired webservers, which have powerful CPUs and enough RAM, but the disk subsystem with one SATA drive is somewhat underpowered. This is an area where improvements can be made.

-Erich

Hui

Feb 17, 2009, 6:08:41 PM
to project-voldemort
Erich,

Those are pretty impressive numbers. One question: what are your
settings for W and R (you mentioned N = 2), i.e., how many synchronous
replica reads/writes?
And what are your network settings?

Would really appreciate it if you shared this information.

Thx!

jay....@gmail.com

Feb 17, 2009, 7:39:19 PM
to project-voldemort
Hey Erich,

This is a very thorough benchmark. Thanks for sharing it. I would like
to add this as one data point in the performance section on the
website, would that be okay? I think it is nice to see results from
real problems not just results contrived by the authors of the
software to make themselves look good... :-)

I notice that the average performance is pretty good, but the high
percentiles are not great. If the readers are overloaded then it is
normal for the throughput to stay flat but the response time will
degrade (since more readers are trying to do work, many will have to
wait). But you mention that the readers are not fully loaded during
the test. This makes me concerned about the variance in
performance--4ms read performance is pretty good, but 262ms at the
99th percentile is not great. I wonder if we might be seeing GC
pauses? If the high percentiles are a concern for your application,
you might want to check the GC logs...

Cheers,

-Jay


Erich Nachbar

Feb 18, 2009, 2:13:37 AM
to project-...@googlegroups.com
I run a replication factor of 2 with 1 required read & write. I have
not specified the desired reads/writes.
The network is a standard Gigabit switched network. No adapter bonding
as far as I know.
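
In stores.xml terms that boils down to something like the snippet
below. The store name and serializers are just placeholders for
illustration; the relevant knobs here are replication-factor,
required-reads and required-writes:

  <store>
    <name>html-snippets</name>
    <persistence>bdb</persistence>
    <routing>client</routing>
    <replication-factor>2</replication-factor>
    <required-reads>1</required-reads>
    <required-writes>1</required-writes>
    <key-serializer>
      <type>string</type>
    </key-serializer>
    <value-serializer>
      <type>identity</type>
    </value-serializer>
  </store>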

One thing that definitely bumped the numbers up was the on-the-fly
compression.
The numbers were definitely lower when it had to push around 14KB vs.
my 2.5KB compressed.

But you are right, the numbers are really impressive, and we can't
wait to relieve our DB of that I/O load.

Erich Nachbar

Feb 18, 2009, 2:24:27 AM
to project-...@googlegroups.com
>
> Hey Erich,
>
> This is a very thorough benchmark. Thanks for sharing it. I would like
> to add this as one data point in the performance section on the
> website, would that be okay? I think it is nice to see results from

Sure. More than happy to contribute!

> real problems not just results contrived by the authors of the
> software to make themselves look good... :-)
>
;) I am really impressed with your and your team's work. I have tried
other, more visible projects (like Cassandra), and while Voldemort
doesn't have the name recognition yet, it certainly works flawlessly
from what I can tell.
I'd rather have the core be rock solid and add advanced features later
than be buzzword compliant and core dump left and right.

> I notice that the average performance is pretty good, but the high
> percentiles are not great. If the readers are overloaded then it is
> normal for the throughput to stay flat but the response time will
> degrade (since more readers are trying to do work, many will have to
> wait). But you mention that the readers are not fully loaded during
> the test. This makes me concerned about the variance in
> performance--4ms read performance is pretty good, but 262ms at the
> 99th percentile is not great. I wonder if we might be seeing GC
> pauses? If the high percentiles are a concern for your application,
> you might want to check the GC logs...
>
I have not dug into the GC behavior. I set the server GC to a
scaled-down version of your prod settings, but wasn't really tweaking
the client VM except for a 1GB heap. Let me rerun the test with
parallel GC enabled.
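
For the rerun, the change on the client side is just a couple of JVM
flags, something along these lines (load-driver.jar is a placeholder
and the exact flag set I end up with may differ):

  java -Xms1g -Xmx1g \
       -XX:+UseParallelGC -XX:+UseParallelOldGC \
       -verbose:gc -XX:+PrintGCDetails -XX:+PrintGCTimeStamps \
       -jar load-driver.jar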

However, compared to what we see out of Oracle (which is also doing
1M other things), even the higher Voldemort numbers are great.
Thanks for your suggestions. I'll report back in a while with more
findings on this.

Erich Nachbar

Feb 18, 2009, 2:06:47 PM
to project-voldemort
Jay,

I added the parallel GC option to my load test client and it actually
improved latency by a *lot*:
Read Latency in ms for 1M hot snippets [100% 845 ,99.5% 22 ,99% 19 ,
98% 16 ,95% 13 ,90% 12 ,80% 12 ,75% 12 ,66% 12 ,50% 10 ]

Transactions went up to 23k records/sec/node (with 2 load test
nodes).

Thanks for your suggestions! This is very, very cool.

Jay Kreps

Feb 18, 2009, 3:32:34 PM
to project-...@googlegroups.com
One thing you may be running into is the socket buffer size. We had
arbitrarily set it to around 3k (I think), so your 14KB blob may be
doing lots of blocking and waiting. But obviously less data is always
faster and better, and for text, compression is a big win. There is a
bug for this here, in case others hit the same issue:
http://code.google.com/p/project-voldemort/issues/detail?id=27
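
The effect is plain TCP/stream behavior rather than anything
Voldemort-specific: if the buffers are only around 3k, a 14KB value
can't be handed off in one go, so each write blocks until the other
side drains the buffer. A few lines of plain Java (not Voldemort code;
host and port are placeholders) will show the buffer sizes you are
actually getting:

  import java.net.Socket;

  public class BufferCheck {
      public static void main(String[] args) throws Exception {
          Socket socket = new Socket("voldemort-node1", 6666);
          System.out.println("send buffer:    " + socket.getSendBufferSize());
          System.out.println("receive buffer: " + socket.getReceiveBufferSize());
          socket.close();
      }
  }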

Anyhow, glad you have had good results so far!

-Jay

Erich Nachbar

Feb 18, 2009, 5:01:11 PM
to project-voldemort
I'm compressing it before I call the put method, but in real life our
uncompressed blocks can be up to 64KB, which could trigger the
behavior. I'll look into it.