Guys,
Here are my results from testing Voldemort with 200M 14KB records (compressed down to 2.5KB on the client). The basic idea was to see if it could replace the DB as persistent storage for HTML-ish content. The current read vs. write ratio is hugely biased towards reads.
Cluster Setup:
- 19 Voldemort nodes (4 CPUs, 1 SATA drive, 8GB RAM)
- Voldemort JVM: 4GB heap, with 3GB dedicated to the BDB cache
- Java 1.6, 64-bit
- Max of 500 allowed threads
- Replication factor 2
- BDB file size per node when storing 200M 2.5KB blobs: 55GB
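For reference, here is a minimal sketch of how these settings map onto the node configuration. The store name and the required-reads/required-writes values below are placeholders, and exact property names may differ slightly between Voldemort versions.

server.properties (per node), relevant entries only:

  max.threads=500
  bdb.cache.size=3G

stores.xml, the store definition carrying the replication factor:

  <store>
    <name>html-content</name>
    <persistence>bdb</persistence>
    <routing>client</routing>
    <replication-factor>2</replication-factor>
    <required-reads>1</required-reads>
    <required-writes>1</required-writes>
    <key-serializer><type>string</type></key-serializer>
    <value-serializer><type>identity</type></value-serializer>
  </store>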
Load Driver:
- 2 nodes (16 CPUs, 1 SATA drive, 32GB RAM)
- JVM: 1GB heap
- Java 1.6, 64-bit
Write Test Setup
The 2 load drivers performed write requests into the cluster. Each load driver created keys prefixed with its own namespace (i.e. a[0-99999999] and b[0-99999999]) to avoid generating multiple versions of the same key. Each driver used 50 threads to write into the cluster as fast as possible. While the HTML content never changed, it was always compressed on the fly before being persisted. Compression was quite effective, getting the 14KB file down to 2.5KB before sending it off.
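For the curious, here is a minimal sketch of what each write driver boils down to. The store name, bootstrap URL and the zero-filled payload are placeholders, and the gzip helper only stands in for the QuickLZ compression used in the actual run:

import java.io.ByteArrayOutputStream;
import java.io.IOException;
import java.util.concurrent.ExecutorService;
import java.util.concurrent.Executors;
import java.util.concurrent.TimeUnit;
import java.util.zip.GZIPOutputStream;

import voldemort.client.ClientConfig;
import voldemort.client.SocketStoreClientFactory;
import voldemort.client.StoreClient;
import voldemort.client.StoreClientFactory;

public class WriteDriver {

    public static void main(String[] args) throws Exception {
        // placeholder store name and bootstrap URL
        StoreClientFactory factory = new SocketStoreClientFactory(
                new ClientConfig().setBootstrapUrls("tcp://voldemort-node1:6666"));
        final StoreClient<String, byte[]> store = factory.getStoreClient("html-content");

        final byte[] html = new byte[14 * 1024]; // stand-in for the ~14KB HTML snippet
        final String namespace = "a";            // "b" on the second load driver

        ExecutorService pool = Executors.newFixedThreadPool(50);
        for (int t = 0; t < 50; t++) {
            final int offset = t;
            pool.submit(new Runnable() {
                public void run() {
                    // each thread writes a disjoint slice of the a[0-99999999] key space
                    for (long i = offset; i < 100000000L; i += 50) {
                        store.put(namespace + i, compress(html)); // compress on the fly
                    }
                }
            });
        }
        pool.shutdown();
        pool.awaitTermination(7, TimeUnit.DAYS);
    }

    // gzip used here only as a stand-in for the QuickLZ compression in the real test
    static byte[] compress(byte[] data) {
        try {
            ByteArrayOutputStream bos = new ByteArrayOutputStream();
            GZIPOutputStream gz = new GZIPOutputStream(bos);
            gz.write(data);
            gz.close();
            return bos.toByteArray();
        } catch (IOException e) {
            throw new RuntimeException(e);
        }
    }
}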
Write Results
Each node was able to persist around 7-8K 2.5KB records per second, while the compression was performed on the fly. I didn't have enough load drivers to find out whether this is the maximum capacity; given the numbers, I don't think I was able to max it out. The Voldemort CPUs were not busy at all, but I couldn't get reliable I/O wait figures, as they don't seem to be reported correctly through Proxmox (top showed 0%, which is unlikely).
Latency was stable during the test with these numbers towards the end (in ms):
[100%: 315, 99.5%: 67, 99%: 34, 98%: 28, 95%: 15, 90%: 10, 80%: 6, 75%: 5, 66%: 5, 50%: 4]
Read Results
This is a little trickier to get right, which is why we opted for two scenarios: a worst case and a "realistic" case.
All test cases were run against a base load of 200M records persisted in Voldemort.
Worst Case Scenario (i.e. completely random access over 200M records)
This scenario is fairly unlikely, as it ignores the "hot" profiles you would see in reality; completely random access defeats every caching attempt.
With the cache able to hold less than 5% of the data and the random nature of the requests, we are basically measuring Berkeley DB's optimizations and our server disk subsystem (which is only a single SATA drive per server).
As indicated, this is typically not what one would see in production, where people log on, click around for a while, and then log off/stop using the system.
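The read loop on the load drivers is equally unspectacular. A rough sketch of what each reader thread runs; the names are placeholders and the decompression stub stands in for QuickLZ:

import java.util.Random;

import voldemort.client.StoreClient;
import voldemort.versioning.Versioned;

public class ReadLoop implements Runnable {

    private final StoreClient<String, byte[]> store;
    private final Random rnd = new Random();

    public ReadLoop(StoreClient<String, byte[]> store) {
        this.store = store;
    }

    public void run() {
        while (true) {
            // pick a completely random key across the full 200M key space (both namespaces)
            long id = (long) (rnd.nextDouble() * 100000000L);
            String key = (rnd.nextBoolean() ? "a" : "b") + id;
            Versioned<byte[]> value = store.get(key);
            if (value != null) {
                decompress(value.getValue()); // inflate the ~2.5KB blob back to ~14KB
            }
        }
    }

    // placeholder: the real test used QuickLZ; see the gzip sketch further down
    private static byte[] decompress(byte[] data) {
        return data;
    }
}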
Anyhow, here are the numbers for completely random access across 200M records, driven from the 2 load-gen nodes:
Read throughput per load gen node with 100 threads: 1009 records/sec (avg)
Read throughput per load gen node with 200 threads: 2200 records/sec (avg)
with the following read latency in ms (100 Threads):
[100%: 945, 99.5%: 217, 99%: 185, 98%: 162, 95%: 140, 90%: 126, 80%: 113, 75%: 109, 66%: 103, 50%: 94]
with the following read latency in ms (200 Threads):
[100%: 2357, 99.5%: 535, 99%: 429, 98%: 335, 95%: 228, 90%: 161, 80%: 112, 75%: 100, 66%: 85, 50%: 66]
More Realistic Read Scenario
A more realistic scenario takes "hot" HTML-ish snippets into account. I tried to simulate this by restricting random access to 1M snippets (i.e. a[0-500,000] and b[0-500,000]).
The load driver machines are very beefy 16-core servers, so loading them with 400 threads each was no problem. Although they had to decompress the HTML content, they were hardly busy (CPU-wise). I used the QuickLZ library, and it seems to be fast and easy to use for my tests. I still need to check whether going the gzip route would yield better results.
Throughput per load driver node: 21,000 records/sec (!)
Read latency in ms: [100%: 1740, 99.5%: 275, 99%: 262, 98%: 220, 95%: 47, 90%: 34, 80%: 22, 75%: 9, 66%: 5, 50%: 4]
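As for the gzip question from above, the stock java.util.zip classes are all that is needed for a first check. A quick sketch of the kind of micro-test I have in mind; the payload here is just a zero-filled buffer rather than a real snippet, so the numbers it prints won't mean much on their own:

import java.io.ByteArrayInputStream;
import java.io.ByteArrayOutputStream;
import java.io.IOException;
import java.util.zip.GZIPInputStream;
import java.util.zip.GZIPOutputStream;

public class GzipCheck {

    public static void main(String[] args) throws IOException {
        byte[] html = new byte[14 * 1024];      // stand-in for the 14KB snippet
        byte[] packed = compress(html);

        long start = System.nanoTime();
        for (int i = 0; i < 100000; i++) {
            decompress(packed);                 // the hot path on the read side
        }
        long avgMicros = (System.nanoTime() - start) / 100000 / 1000;
        System.out.println("avg decompress (us): " + avgMicros);
        System.out.println("compressed size (bytes): " + packed.length);
    }

    static byte[] compress(byte[] data) throws IOException {
        ByteArrayOutputStream bos = new ByteArrayOutputStream();
        GZIPOutputStream gz = new GZIPOutputStream(bos);
        gz.write(data);
        gz.close();
        return bos.toByteArray();
    }

    static byte[] decompress(byte[] data) throws IOException {
        GZIPInputStream gz = new GZIPInputStream(new ByteArrayInputStream(data));
        ByteArrayOutputStream bos = new ByteArrayOutputStream();
        byte[] buf = new byte[4096];
        for (int n = gz.read(buf); n != -1; n = gz.read(buf)) {
            bos.write(buf, 0, n);
        }
        return bos.toByteArray();
    }
}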
Increasing or decreasing the thread count for reads and writes yielded very linear behavior. So far I've found the system very predictable and stable when it comes to real-time online content storage and retrieval.
Voldemort is currently running on retired web servers, which have powerful CPUs and enough RAM, but the disk subsystem with its single SATA drive is somewhat underpowered. This is an area where improvements can be made.
-Erich