Problems with voldemort


kay kay

Jul 29, 2013, 6:42:17 AM
to project-...@googlegroups.com
Hello

I've tried to configure 4 servers with 96GB RAM each.

Here is their configuration, tuned for maximum performance:

./bin/generate_cluster_xml.py -f hosts -N cluster -p 1024 -S 937567216 -z 0


# Min, max, total JVM size
OPT_JVM_SIZE="-server -Xms70g -Xmx70g -Dcom.sun.management.jmxremote"
# New Generation Sizes
OPT_JVM_SIZE_NEW="-XX:NewSize=4096m -XX:MaxNewSize=4096m"
# Type of Garbage Collector to use
OPT_JVM_GC_TYPE="-XX:+UseConcMarkSweepGC -XX:+UseParNewGC"
# Tuning options for the above garbage collector
OPT_JVM_GC_OPTS="-XX:CMSInitiatingOccupancyFraction=70 -XX:SurvivorRatio=2"
# JVM GC activity logging settings
OPT_JVM_GC_LOG="-XX:+PrintTenuringDistribution -XX:+PrintGCDetails -XX:+PrintGCDateStamps -Xloggc:/var/log/gc.log"

java -Dlog4j.configuration=src/java/log4j.properties $OPT_JVM_SIZE $OPT_JVM_SIZE_NEW $OPT_JVM_GC_TYPE $OPT_JVM_GC_OPTS $OPT_JVM_GC_LOG -cp $CLASSPATH voldemort.server.VoldemortServer $@


<stores>
<!-- Note that "test" store requires 2 reads and writes,
so to use this store you must have both nodes started and running -->
<store>
<name>UserTable</name>
<persistence>bdb</persistence>
<routing>client</routing>
<replication-factor>1</replication-factor>
<required-reads>1</required-reads>
<required-writes>1</required-writes>
<preferred-reads>1</preferred-reads>
<preferred-writes>1</preferred-writes>
<key-serializer>
<type>string</type>
</key-serializer>
<value-serializer>
<type>string</type>
</value-serializer>
<retention-days>1</retention-days>
</store>
</stores>

# The ID of *this* particular cluster node
node.id=0

max.threads=100
enable.repair=true

data.directory=/data/voldemort

############### DB options ######################

http.enable=true
socket.enable=true
jmx.enable=true
admin.enable=true

# BDB
bdb.write.transactions=false
bdb.flush.transactions=false
bdb.cache.size=41200MB
bdb.btree.fanout=35536


When I tried to benchmark them with the following command:
./bin/voldemort-performance-tool.sh --record-count 177000000 --value-size 10240 --ops-count 177000000 --url tcp://hm-4:6666 --store-name UserTable -w 100 --threads 20 -v --interval 1

I got lots of Java exceptions. For example, these from the benchmark tool:
 voldemort.store.PersistenceFailureException: com.sleepycat.je.EnvironmentFailureException: (JE 4.1.17) /data/voldemort/bdb fetchTarget of 0x8c/0x1e35d5c parent IN=321698 IN class=com.sleepycat.je.tree.BIN lastFullVersion=0x12e/0xfbf1e4 parent.getDirty()=true state=0 java.lang.ArrayIndexOutOfBoundsException LOG_INTEGRITY: Log information is incorrect, problem is likely persistent.
at sun.reflect.GeneratedConstructorAccessor2.newInstance(Unknown Source)
at sun.reflect.DelegatingConstructorAccessorImpl.newInstance(DelegatingConstructorAccessorImpl.java:27)
at java.lang.reflect.Constructor.newInstance(Constructor.java:513)
at voldemort.utils.ReflectUtils.callConstructor(ReflectUtils.java:116)
at voldemort.utils.ReflectUtils.callConstructor(ReflectUtils.java:103)
at voldemort.store.ErrorCodeMapper.getError(ErrorCodeMapper.java:72)
at voldemort.client.protocol.vold.VoldemortNativeClientRequestFormat.checkException(VoldemortNativeClientRequestFormat.java:238)
at voldemort.client.protocol.vold.VoldemortNativeClientRequestFormat.readGetVersionResponse(VoldemortNativeClientRequestFormat.java:247)
at voldemort.store.socket.clientrequest.GetVersionsClientRequest.parseResponseInternal(GetVersionsClientRequest.java:53)
at voldemort.store.socket.clientrequest.GetVersionsClientRequest.parseResponseInternal(GetVersionsClientRequest.java:30)
at voldemort.store.socket.clientrequest.AbstractClientRequest.parseResponse(AbstractClientRequest.java:66)
at voldemort.store.socket.clientrequest.ClientRequestExecutorPool$NonblockingStoreCallbackClientRequest.parseResponse(ClientRequestExecutorPool.java:445)
at voldemort.store.socket.clientrequest.ClientRequestExecutor.read(ClientRequestExecutor.java:213)
at voldemort.common.nio.SelectorManagerWorker.run(SelectorManagerWorker.java:103)
at voldemort.common.nio.SelectorManager.run(SelectorManager.java:215)
at java.util.concurrent.ThreadPoolExecutor$Worker.runTask(ThreadPoolExecutor.java:886)
at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:908)
at java.lang.Thread.run(Thread.java:662)

at sun.reflect.GeneratedConstructorAccessor2.newInstance(Unknown Source)
at sun.reflect.DelegatingConstructorAccessorImpl.newInstance(DelegatingConstructorAccessorImpl.java:27)
at java.lang.reflect.Constructor.newInstance(Constructor.java:513)
at voldemort.utils.ReflectUtils.callConstructor(ReflectUtils.java:116)
at voldemort.utils.ReflectUtils.callConstructor(ReflectUtils.java:103)
at voldemort.store.ErrorCodeMapper.getError(ErrorCodeMapper.java:72)
at voldemort.client.protocol.vold.VoldemortNativeClientRequestFormat.checkException(VoldemortNativeClientRequestFormat.java:238)
at voldemort.client.protocol.vold.VoldemortNativeClientRequestFormat.readGetVersionResponse(VoldemortNativeClientRequestFormat.java:247)
at voldemort.store.socket.clientrequest.GetVersionsClientRequest.parseResponseInternal(GetVersionsClientRequest.java:53)
at voldemort.store.socket.clientrequest.GetVersionsClientRequest.parseResponseInternal(GetVersionsClientRequest.java:30)
at voldemort.store.socket.clientrequest.AbstractClientRequest.parseResponse(AbstractClientRequest.java:66)
at voldemort.store.socket.clientrequest.ClientRequestExecutorPool$NonblockingStoreCallbackClientRequest.parseResponse(ClientRequestExecutorPool.java:445)
at voldemort.store.socket.clientrequest.ClientRequestExecutor.read(ClientRequestExecutor.java:213)
at voldemort.common.nio.SelectorManagerWorker.run(SelectorManagerWorker.java:103)
at voldemort.common.nio.SelectorManager.run(SelectorManager.java:215)
at java.util.concurrent.ThreadPoolExecutor$Worker.runTask(ThreadPoolExecutor.java:886)
at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:908)
at java.lang.Thread.run(Thread.java:662)
voldemort.store.InsufficientOperationalNodesException: 1 get versionss required, but only 0 succeeded Original replication set :[3] Known failed nodes before operation :[] Estimated live nodes in preference list :[3] New failed nodes during operation :[]
at voldemort.store.routed.action.PerformSerialRequests.execute(PerformSerialRequests.java:132)
at voldemort.store.routed.Pipeline.execute(Pipeline.java:214)
at voldemort.store.routed.PipelineRoutedStore.getVersions(PipelineRoutedStore.java:530)
at voldemort.store.routed.PipelineRoutedStore.getVersions(PipelineRoutedStore.java:75)
at voldemort.store.DelegatingStore.getVersions(DelegatingStore.java:86)
at voldemort.store.DelegatingStore.getVersions(DelegatingStore.java:86)
at voldemort.store.serialized.SerializingStore.getVersions(SerializingStore.java:144)
at voldemort.store.DelegatingStore.getVersions(DelegatingStore.java:86)
at voldemort.client.DefaultStoreClient.getVersions(DefaultStoreClient.java:163)
at voldemort.client.DefaultStoreClient.put(DefaultStoreClient.java:344)
at voldemort.performance.benchmark.VoldemortWrapper$2.update(VoldemortWrapper.java:112)
at voldemort.client.DefaultStoreClient.applyUpdate(DefaultStoreClient.java:279)
at voldemort.client.DefaultStoreClient.applyUpdate(DefaultStoreClient.java:271)
at voldemort.client.LazyStoreClient.applyUpdate(LazyStoreClient.java:133)
at voldemort.performance.benchmark.VoldemortWrapper.write(VoldemortWrapper.java:107)
at voldemort.performance.benchmark.Workload.doWrite(Workload.java:399)

And these from the Voldemort logs:


  [14:36:36,029 voldemort.store.bdb.BdbStorageEngine] ERROR com.sleepycat.je.EnvironmentFailureException: (JE 4.1.17) /data/voldemort/bdb fetchTarget of 0x12d/0x182ae0f parent IN=17 IN class=com.sleepycat.je.tree.BIN lastFullVersion=0x12e/0x9ab9a2 parent.getDirty()=true state=0 java.lang.ArrayIndexOutOfBoundsException LOG_INTEGRITY: Log information is incorrect, problem is likely persistent. [voldemort-niosocket-server15]
 [14:36:36,029 voldemort.server.protocol.vold.VoldemortNativeRequestHandler] ERROR com.sleepycat.je.EnvironmentFailureException: (JE 4.1.17) /data/voldemort/bdb fetchTarget of 0x12d/0x182ae0f parent IN=17 IN class=com.sleepycat.je.tree.BIN lastFullVersion=0x12e/0x9ab9a2 parent.getDirty()=true state=0 java.lang.ArrayIndexOutOfBoundsException LOG_INTEGRITY: Log information is incorrect, problem is likely persistent. [voldemort-niosocket-server15]




When I clear the bdb database, the benchmark works fine, but only for the first 10 minutes. What is wrong?

Brendan Harris (a.k.a. stotch on irc.oftc.net)

Jul 29, 2013, 9:49:32 AM
to project-...@googlegroups.com
Hello,

Well, I see a number of possible factors here. One is that you're running with an extraordinarily large JVM heap size and bdb.cache.size. GC stall time alone could be causing issues because the collector has so much memory to walk through. We have stores with tens of billions of keys and tens of thousands of QPS, and we get by easily with a 32gb heap (2g newgen, ~30g oldgen) and a 20gb bdb.cache.size. How many keys and what kind of QPS are you expecting to have?

And, out of curiosity, why are you overriding the bdb.btree.fanout with such a high number? You might be inducing failure with that high of a btree fanout. I have never tested with anything higher than 1024 and I would generally recommend nothing higher than 512.

What kind of device and storage configuration is /data/voldemort?

Lastly, you should set the replication-factor of your stores to at least 2. You're obviously stress testing voldemort to see what kind of throughput it can do. So, you should make it more realistic, like you're going to run it in a production/live environment. You will probably want redundancy when you go live with it.

I recommend running the test again, but watching your disk I/O stats, system run queue averages, memory usage, JVM heap usage and GC stalls during the process to see if those give you more information. Also, you should watch for disk write failures or overly long write operation times. That should give you more to work with.

Brendan

kay kay

Jul 30, 2013, 2:20:45 AM
to project-...@googlegroups.com
Hi Brendan,

Thank you for your reply.

On Monday, July 29, 2013 5:49:32 PM UTC+4, Brendan Harris (a.k.a. stotch on irc.oftc.net) wrote:
Hello,

Well, I see a number of possible factors here. One is that you're running with an extraordinarily large JVM heap size and bdb.cache.size. GC stall time alone could be causing issues because the collector has so much memory to walk through. We have stores with tens of billions of keys and tens of thousands of QPS, and we get by easily with a 32gb heap (2g newgen, ~30g oldgen) and a 20gb bdb.cache.size. How many keys and what kind of QPS are you expecting to have?


I thought that I should use all the RAM for maximum performance. That is why I set 70GB for the Java heap and 40GB for the BDB cache. I need to handle approximately 200,000,000 keys (~10kb each) with a minimum throughput of 1000 ops/sec.
 
And, out of curiosity, why are you overriding the bdb.btree.fanout with such a high number? You might be inducing failure with that high of a btree fanout. I have never tested with anything higher than 1024 and I would generally recommend nothing higher than 512.


The Voldemort manual says that a bigger value is better (http://www.project-voldemort.com/voldemort/configuration.html). I have a large amount of RAM, which is why I set these options so high.
 
What kind of device and storage configuration is /data/voldemort?


2xhdd in RAID0
 
Lastly, you should set the replication-factor of your stores to at least 2. You're obviously stress testing voldemort to see what kind of throughput it can do. So, you should make it more realistic, like you're going to run it in a production/live environment. You will probably want redundancy when you go live with it.

I recommend running the test again, but watching your disk I/O stats, system run queue averages, memory usage, JVM heap usage and GC stalls during the process to see if those give you more information. Also, you should watch for disk write failures or overly long write operation times. That should give you more to work with.

Thank you for your comments. I've configured Voldemort with your suggestions. Write tests look good now, but write speed is lower: 2000 ops/sec instead of 4000 ops/sec, since I set replication-factor to 2.

It should take 24 hours to load all 170000000 keys into the database. I/O wait on each node is approximately 1-2%.

avg-cpu: %user %nice %system %iowait %steal %idle
1,39 0,00 0,62 1,26 0,00 96,74

Device: tps Blk_read/s Blk_wrtn/s Blk_read Blk_wrtn
sda 0,94 25,66 34,53 45829128 61683570
sdb 24,94 3992,46 5943,97 7131320730 10617094764
sdc 24,33 3976,38 5908,17 7102587945 10553139853
md0 1396,80 7968,83 11852,14 14233882899 21170234395

Brendan Harris (a.k.a. stotch on irc.oftc.net)

Jul 30, 2013, 7:27:32 AM
to project-...@googlegroups.com
Hi Kay,


On Monday, July 29, 2013 11:20:45 PM UTC-7, kay kay wrote:
I thought that I should use all the RAM for maximum performance. That is why I set 70GB for the Java heap and 40GB for the BDB cache. I need to handle approximately 200,000,000 keys (~10kb each) with a minimum throughput of 1000 ops/sec.

The problem with that is that your GC stall times will be very long because of having such a large space in memory to clean. A general rule in using the java virtual machine is to not use more memory than you need to. Having too large a space can be far less efficient.
  
And, out of curiosity, why are you overriding the bdb.btree.fanout with such a high number? You might be inducing failure with that high of a btree fanout. I have never tested with anything higher than 1024 and I would generally recommend nothing higher than 512.


The Voldemort manual says that a bigger value is better (http://www.project-voldemort.com/voldemort/configuration.html). I have a large amount of RAM, which is why I set these options so high.

Unfortunately, that particular entry in the document is misworded. Having a larger btree fanout is not necessarily more efficient. Try unsetting it so that it goes back to the default (512). You'll need to shut down Voldemort, delete the data after making this change, and restart your tests.
 
 
What kind of device and storage configuration is /data/voldemort?


2xhdd in RAID0

You should make this a RAID 1 for redundancy. Have you verified that the disks are not failing and that you can successfully write across the disks?

Also, what kind of disks are they (SAS, SSD, SATA)?

Lastly, you should set the replication-factor of your stores to at least 2. You're obviously stress testing voldemort to see what kind of throughput it can do. So, you should make it more realistic, like you're going to run it in a production/live environment. You will probably want redundancy when you go live with it.

I recommend running the test again, but watching your disk I/O stats, system run queue averages, memory usage, JVM heap usage and GC stalls during the process to see if those give you more information. Also, you should watch for disk write failures or overly long write operation times. That should give you more to work with.

Thank you for your comments. I've configured Voldemort with your suggestions. Write tests look good now, but write speed is lower: 2000 ops/sec instead of 4000 ops/sec, since I set replication-factor to 2.

It should take 24 hours to load all 170000000 keys into the database. I/O wait on each node is approximately 1-2%.

avg-cpu: %user %nice %system %iowait %steal %idle
1,39 0,00 0,62 1,26 0,00 96,74

Device: tps Blk_read/s Blk_wrtn/s Blk_read Blk_wrtn
sda 0,94 25,66 34,53 45829128 61683570
sdb 24,94 3992,46 5943,97 7131320730 10617094764
sdc 24,33 3976,38 5908,17 7102587945 10553139853
md0 1396,80 7968,83 11852,14 14233882899 21170234395

So these are my recommendations to get you back on track with testing:
1. Make the /data/voldemort volume redundant (RAID 1 or add more disks and do a RAID 5 or RAID 1+0)
2. Reduce your JVM initial/max heap size to 12G and NewGen to 2G, and keep the rest of the JVM settings the same
3. Set bdb.cache.size to "6G"
4. Unset bdb.btree.fanout so that it picks up the default value of "512"
5. Change your store's replication-factor to "2"
6. Set nio.connector.selectors to "50"
7. Set bdb.one.env.per.store to "true"

And ...

If your /data/voldemort volume is *not* SSD:
8. Set bdb.cleaner.lazy.migration to "true"
9. Set bdb.cleaner.threads to "3"
10. Set bdb.cache.evictln to "false"

Otherwise, if it *is* SSD:
8. Leave bdb.cleaner.lazy.migration undefined
9. Set bdb.cleaner.threads to "1"
10. Leave bdb.cache.evictln undefined
11. Set bdb.evict.by.level to "true"
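
Putting the non-SSD variant together (which matches your SATA setup), a server.properties sketch would look like the following; these are just the values listed above restated in one place, so adjust them as you test:

# server.properties sketch for the non-SSD case (illustrative, not a tested config)
bdb.cache.size=6G
# leave bdb.btree.fanout unset so it falls back to the default of 512
nio.connector.selectors=50
bdb.one.env.per.store=true
bdb.cleaner.lazy.migration=true
bdb.cleaner.threads=3
bdb.cache.evictln=false
# replication-factor=2 goes in stores.xml, and the 12G/2G heap settings go in the JVM options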

Shutdown voldemort, delete the old store data, start voldemort and re-run your tests. See if that works better.

You can always increase your heap and cache sizes later, if necessary. But start lower for now. Also, keep in mind that the bdb cache objects (because they live in the heap for longer than most objects) will inevitably promote to OldGen. So, you can pretty much safely assume that your bdb.cache.size will always consume OldGen memory.

Let me know how your tests run after making these changes.

Thanks,

Brendan

kay kay

Jul 30, 2013, 10:34:12 AM
to project-...@googlegroups.com
Thank you once again for your detailed answers. They are very informative.

I made RAID0 for maximum speed as I have simple SATA HDDs. I will try your configuration tomorrow, as I'm already uploading data to the Voldemort cluster.

Actually I have a small question about the database size. At the moment each node has an approximately 570Gb database for 64'000'000 keys (each key size is 10kb). That means 616Gb of data (64'000'000*10kb) becomes 2280Gb (570Gb*4 nodes). I think 2280Gb is too large for 64'000'000 keys even with replication-factor=2. And it turns out that each 10kb key ends up taking 37kb. What do you think about that?

<stores>
<!-- Note that "test" store requires 2 reads and writes,
so to use this store you must have both nodes started and running -->
<store>
<name>UserTable</name>
<persistence>bdb</persistence>
<routing>client</routing>
<replication-factor>2</replication-factor>

Brendan Harris (a.k.a. stotch on irc.oftc.net)

Jul 30, 2013, 11:29:18 AM
to project-...@googlegroups.com
Hi Kay,


On Tuesday, July 30, 2013 7:34:12 AM UTC-7, kay kay wrote:
Thank you once again for your detailed answers. They are very informative.

I made RAID0 for maximum speed as I have simple SATA HDDs. I will try your configuration tomorrow, as I'm already uploading data to Voldemort cluster.

Actually I have a small question about the database size. At the moment each node has an approximately 570Gb database for 64'000'000 keys (each key size is 10kb). That means 616Gb of data (64'000'000*10kb) becomes 2280Gb (570Gb*4 nodes). I think 2280Gb is too large for 64'000'000 keys even with replication-factor=2. And it turns out that each 10kb key ends up taking 37kb. What do you think about that?

Is the sum total of key size and value size 10kb? Or are you only considering the value size?

Something to keep in mind is that the bdb storage engine is log-structured, so all writes to existing keys get appended to the end of the current jdb file. That means the previous instance of the key still exists but is no longer the active one. The bdb cleaners come through and analyze the older jdb files; if a file falls below a certain utilization threshold, its live keys are appended to the end of the current jdb file and the old file is scheduled for deletion. That's how compaction works.

The first time you write data, you'll have 100% utilization, but as you re-write keys your utilization will vary. You can try setting bdb.cleaner.min.file.utilization to "0", which allows the global minUtilization to be used instead; that tends to be less finicky. The global minUtilization defaults to "50".
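
That is a single line in server.properties, shown here just to spell the setting out:

# fall back to the global minUtilization (default 50) instead of per-file checks
bdb.cleaner.min.file.utilization=0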

You could also add ...
<compression>
  <type>gzip</type>
</compression>

to your value-serializer field in your stores.xml to compress the values. If you set this, you'll need to delete all your data again and start over; otherwise your client will throw exceptions trying to read the existing values.
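
So the value-serializer block in your stores.xml would end up looking something like this (assuming you keep the string type you are using now):

<value-serializer>
  <type>string</type>
  <compression>
    <type>gzip</type>
  </compression>
</value-serializer>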

One last thing is that you might want to consider moving to faster disks or build a much larger array with a RAID 1+0 configuration to give yourself the advantage of parallelism and still be redundant. The kind of throughput you're testing is very heavy for SATA disks. You might need to play around with a slightly larger heap and cache sizes, plus bdb.checkpoint.interval.bytes and bdb.cleaner.interval.bytes to defer checkpointing and cleaning a little. The two of those events are going to be very hard on your SATA disks.

Brendan

tpar...@gmail.com

Jul 30, 2013, 3:43:27 PM
to project-...@googlegroups.com
Hello Brendan, I'm working with Kay.

The warmup phase has just finished, and there is 895GB of data per server.

But the performance is extraordinary :) and it's very strange:
./voldemort-1.3.0/bin/voldemort-performance-tool.sh --value-size 10240 --ops-count 110000000 --url tcp://hm-4:6666 --store-name UserTable  --threads 100 -v --interval 1 -r 100 --record-selection uniform --ignore-nulls --verify

[status]        Throughput(ops/sec): 5488.1416504223525 Operations: 33785

There is RAID0 over 2 SATA HDDs per server. The maximum RAID0 performance is about 250-300 random reads per second, so with a 0% cache hit rate that is about 1200 random reads per second for the cluster.

We see 5488 reads per second.

Does that mean a lot of keys are fetched from cache?
Is there monitoring of the key cache size? Cache hit rate?

Thank you very much !

Brendan Harris (a.k.a. stotch on irc.oftc.net)

Jul 30, 2013, 4:04:16 PM
to project-...@googlegroups.com
Hello,


On Tuesday, July 30, 2013 12:43:27 PM UTC-7, tpar...@gmail.com wrote:
We see 5488 reads per second.

Does that mean a lot of keys are fetched from cache?

That is very likely at that read rate.
 
Is there monitoring of the key cache size? Cache hit rate?

Yes. We have the "voldemort.store.bdb.stats" mbean that you can poll for that information. From that mbean you can see various bdb cache stats.

Brendan

tpar...@gmail.com

Jul 31, 2013, 3:31:08 PM
to project-...@googlegroups.com
Hello, I've set --metric-type histogram and the output is:
 [status]       Throughput(ops/sec): 2665.6716417910447 Operations: 2679
[reads] Operations: 2679
[reads] Average(ms): 3,1217
[reads] Min(ms): 0
[reads] Max(ms): 524
[reads] Median(ms): 0
[reads] 95th(ms): 14
[reads] 99th(ms): 75
[reads] Return: 0       2679

Does it mean all reads returned zero?

thank you

Brendan Harris (a.k.a. stotch on irc.oftc.net)

Jul 31, 2013, 4:08:43 PM
to project-...@googlegroups.com
On Wednesday, July 31, 2013 12:31:08 PM UTC-7, tpar...@gmail.com wrote:
Hello, I've set --metric-type histogram and the output is:
 [status]       Throughput(ops/sec): 2665.6716417910447 Operations: 2679
[reads] Operations: 2679
[reads] Average(ms): 3,1217
[reads] Min(ms): 0
[reads] Max(ms): 524
[reads] Median(ms): 0
[reads] 95th(ms): 14
[reads] 99th(ms): 75
[reads] Return: 0       2679

Does it mean all reads returned zero?

I think that's just the return code from the execution of the benchmark. In other words, no reads failed.

Brendan

tim

Jul 31, 2013, 4:15:53 PM
to project-...@googlegroups.com
The problem is that the network throughput is not correlated with the reported Throughput(ops/sec).

For example: the benchmark reports 4500 ops/sec, but network throughput is only 6 megabytes/sec, even though the row size is 10240 bytes.




Brendan


Brendan Harris (a.k.a. stotch on irc.oftc.net)

Jul 31, 2013, 10:05:53 PM
to project-...@googlegroups.com
Hi Tim,

On Wednesday, July 31, 2013 1:15:53 PM UTC-7, tim wrote:
The problem is that the network throughput is not correlated with the reported Throughput(ops/sec).

For example: the benchmark reports 4500 ops/sec, but network throughput is only 6 megabytes/sec, even though the row size is 10240 bytes.

Your new store value serializer is compressed, right? The compression is done on the client-side, so the data on the wire is smaller.

Brendan

tim

Aug 1, 2013, 11:23:18 AM
to project-voldemort
It's not compressed, and all of the 4500 ops/s return zero; that's why the traffic is so low.





Brendan


Brendan Harris (a.k.a. stotch on irc.oftc.net)

Aug 1, 2013, 5:40:04 PM
to project-...@googlegroups.com
Hi Tim,

OK. Can you give me the full command you're running and the output? I don't know the benchmark tool well myself, but I can probably figure out what is going on or someone on the team can if they can look at the whole picture. And can you also give me the current stores.xml and cluster.xml?

Thanks,

Brendan

tim

Aug 2, 2013, 12:52:43 PM
to project-voldemort
Hello,
Configs are attached.
The command is bin/voldemort-performance-tool.sh --value-size 10240 --ops-count 110000000 --url tcp://hm-4:6666 --store-name UserTable  --threads 100 -v --interval 1 -r 100 --record-selection uniform --ignore-nulls --verify

PS: we are trying to find a solution for storing a billion (10 billion in the future) pictures of 10KB each. It must be low-latency storage (less than 1 sec user access). There is no way to buy SSDs, but we have a few tens of servers with 96GB RAM. In spite of the big RAM capacity, the dataset can't fit in memory ;)

So we are testing a lot of NoSQL solutions (Voldemort, HBase, Couchbase, Riak, Cassandra) to find out which one works best with SATA disks.

There are 3 SATA disks per server, and 2 of them are in RAID0. The performance of a single SATA drive is 130-150 random reads per second, so RAID0 gives about 250 random reads per second.

All metadata is kept in RAM, so when we get one image there is 1 disk seek. So there might be 250 image reads per second per server. We hope Voldemort can give us such performance :)


thank you very much!


cluster.xml
stores.xml

Brendan Harris (a.k.a. stotch on irc.oftc.net)

Aug 3, 2013, 10:51:38 AM
to project-...@googlegroups.com
Hi Tim,

On Friday, August 2, 2013 9:52:43 AM UTC-7, tim wrote:
the command is bin/voldemort-performance-tool.sh --value-size 10240 --ops-count 110000000 --url tcp://hm-4:6666 --store-name UserTable  --threads 100 -v --interval 1 -r 100 --record-selection uniform --ignore-nulls --verify
 
Could you please also paste the output (or attach it as a file) from the benchmark run?

You might want to re-run the performance-tool without "--ignore-nulls" to see if you're getting any null replies. You could also setup something to poll the voldemort.store.stats mbean for your store to see your average value sizes and throughput, as well as the percentage of empty responses the server is returning. And you can poll the same mbean in the client to get client-side stats on these data points.

One more recommendation I have is that you reduce your total number of partitions to somewhere around 2000. Since you'll have 10s of billions of keys, don't go lower than 500 partitions. 2000 is probably good for you. One thing we have learned is that if you have too many partitions, running rebalances of the partitions to expand the cluster can take a _very_ long time. You want to have as few partitions as possible, but you still want to be able to avoid having overly hot partitions and be able to more evenly distribute load across the servers as partitions get hot.
 
PS: we are trying to find a solution for storing a billion (10 billion in the future) pictures of 10KB each. It must be low-latency storage (less than 1 sec user access). There is no way to buy SSDs, but we have a few tens of servers with 96GB RAM. In spite of the big RAM capacity, the dataset can't fit in memory ;)

If the reason you don't want to enable compression is because you want to reduce the CPU overhead for the client, you could at least serialize it into avro bytes to reduce the size of the records. I think that will shrink it down some. Otherwise, I recommend enabling compression. I don't think the CPU hit will be too noticeable. You might want to try running the benchmark with the values compressed and without and compare.

Unfortunately, with Java, you do have to be careful about how much RAM you allocate to the heap. It's a trial process, where you must start small and little-by-little work your way up to see what kind of GC stall times and CPU hit you get.

There are 3 SATA disks per server, and 2 of them are in RAID0. The performance of a single SATA drive is 130-150 random reads per second, so RAID0 gives about 250 random reads per second.

If you can put 4 disks in one server, you can run two instances of Voldemort with semi-large heaps and give them each two disks to work with.
 
All metadata is kept in RAM, so when we get one image there is 1 disk seek. So there might be 250 image reads per second per server. We hope Voldemort can give us such performance :)

I think it can, but you have to keep in mind that you're working with a log-structured storage engine when using bdb (it's bdb Java Edition). I don't know how frequently you'll be writing over existing keys or deleting keys, but every modification of an existing record is appended to the end of the live jdb file and this creates sparseness in the old jdb files. This results in compaction, eventually. And the cleaner threads are woken up by default every 15mb written. Log-structured storage mechanisms do not often play well with SATA disks, especially with as many records as you plan to have and the size that they are (10kb is a little large for billions of records on SATA). You may need to play around with bdb.max.logfile.size, bdb.cleaner.interval.bytes, bdb.checkpoint.interval.bytes and your filesystem block size and readahead size to optimize throughput capability.
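
Just to make those knobs concrete, they all live in server.properties; the values below are purely illustrative placeholders, not recommendations:

# defer cleaning and checkpointing a little (illustrative values only)
# keep bdb.max.logfile.size a multiple of your filesystem readahead size
bdb.max.logfile.size=62914560
# the cleaners wake up every 15mb written by default; raising this defers cleaning
bdb.cleaner.interval.bytes=31457280
bdb.checkpoint.interval.bytes=209715200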

Also, try not to consume all the RAM with JVMs. Since you're on SATA, you might need to depend more heavily upon the OS's I/O scheduler/page cache.

Brendan

kay kay

Aug 5, 2013, 11:19:44 AM
to project-...@googlegroups.com
Hi Brendan,

On Saturday, August 3, 2013 6:51:38 PM UTC+4, Brendan Harris (a.k.a. stotch on irc.oftc.net) wrote:

One more recommendation I have is that you reduce your total number of partitions to somewhere around 2000. Since you'll have 10s of billions of keys, don't go lower than 500 partitions. 2000 is probably good for you. One thing we have learned is that if you have too many partitions, running rebalances of the partitions to expand the cluster can take a _very_ long time. You want to have as few partitions as possible, but you still want to be able to avoid having overly hot partitions and be able to more evenly distribute load across the servers as partitions get hot.


In my first message I copy-pasted the cluster config command:
 
./bin/generate_cluster_xml.py -f hosts -N cluster -p 1024 -S 937567216 -z 0 

Do you mean we need to set 2000 partitions for the whole cluster? In other words, 500 per node?

So we have 1024 partitions. Is it enough? Or should I increase it to 2000?

Brendan Harris (a.k.a. stotch on irc.oftc.net)

Aug 5, 2013, 1:13:37 PM
to project-...@googlegroups.com
Hi kay kay,


On Monday, August 5, 2013 8:19:44 AM UTC-7, kay kay wrote:
In my first message I copy-pasted the cluster config command:
 
./bin/generate_cluster_xml.py -f hosts -N cluster -p 1024 -S 937567216 -z 0 

Do you mean we need to set 2000 partitions for the whole cluster? In other words, 500 per node?

Yes, I mean you should probably have 500 per node. So, if you re-run the command with -p 500, you will have 2000 total.
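
For example, reusing the command from your first message, that would be:

./bin/generate_cluster_xml.py -f hosts -N cluster -p 500 -S 937567216 -z 0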

Brendan

kay kay

Aug 6, 2013, 10:33:35 AM
to project-...@googlegroups.com
Dear Brendan,

Now we use 2000 partitions for the cluster. We've also created a sequence file (seq 1 100000000), shuffled it, and then ran the benchmark with the following command:
./voldemort-performance-tool.sh --value-size 10240 --ops-count 114359252 --url tcp://hm-4:6666 --store-name UserTable -r 100 --threads 100 -v --interval 1 --ignore-nulls --request-file shuf --num-connections-per-node=100
And it gives 300 rps.

When we used a sorted sequence, we got 2000-3000 rps.

Do we need to increase the metadata cache or something else to get 2000-3000 rps on random reads?


kay kay

Aug 6, 2013, 10:41:50 AM
to project-...@googlegroups.com
Or what should we do to store the whole metadata set in cache? Is that even possible?

Brendan Harris (a.k.a. stotch on irc.oftc.net)

Aug 6, 2013, 11:20:30 AM
to project-...@googlegroups.com
Hi kay kay,

SATA average seek time is from 10 - 15ms, depending on the make and model. Right? Let's say it's a Seagate Terascale at 12ms. That's about 83 seeks per second. And if you're on linux your default read-ahead is 128kb and your records are about 10kb a piece. That's about 12 records read into page cache every time you hit a new file after a seek. So, I think you are probably just hitting the limits of random reads on SATA. You should be able to increase the throughput by tuning the kernel/ioscheduler and perhaps tuning the voldemort server to cater to how the OS caches.

Also, if you set your bdb.max.logfile.size to a multiple of OS readahead size, that can also help.
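
If it helps, here is a sketch of checking and adjusting the readahead on Linux; the values are examples only, and the unit is 512-byte sectors:

# 256 sectors = 128KB, the usual default
blockdev --getra /dev/sdb
blockdev --getra /dev/sdc
# example: raise it to 1024 sectors (512KB) on the data disks and re-test
blockdev --setra 1024 /dev/sdb
blockdev --setra 1024 /dev/sdc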

Brendan

Brendan Harris (a.k.a. stotch on irc.oftc.net)

Aug 6, 2013, 12:36:54 PM
to project-...@googlegroups.com
Sorry, I forgot to answer one of your questions.

If you set the cache size large enough to fit your entire index, as requests come in eventually most (if not all) of your index will end up in cache. But you also want to make sure that you leave a lot of memory free for the IO scheduler to compensate for the slower SATA disks.

Brendan

kay kay

Aug 8, 2013, 3:32:37 AM
to project-...@googlegroups.com
Dear Brendan,

Which option should I tune to increase cache size? Is it depend on java heap?

On Tuesday, August 6, 2013 8:36:54 PM UTC+4, Brendan Harris (a.k.a. stotch on irc.oftc.net) wrote:

kay kay

Aug 8, 2013, 3:33:23 AM
to project-...@googlegroups.com
DOES it depend on java heap?

fixed

Brendan Harris (a.k.a. stotch on irc.oftc.net)

Aug 8, 2013, 10:33:19 AM
to project-...@googlegroups.com
Hi kay kay,


On Thursday, August 8, 2013 12:32:37 AM UTC-7, kay kay wrote:
Dear Brendan,

Which option should I tune to increase cache size? Is it depend on java heap?

bdb.cache.size is the option, and whatever you set it to, you should ensure that you have at least 30% more space in OldGen than your cache size.
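
As a rough worked example (illustrative numbers, not a tested recommendation): with a 10g bdb cache you would want at least 10g * 1.3 = 13g of OldGen, so with a 2g NewGen that works out to roughly a 15-16g total heap:

# server.properties
bdb.cache.size=10g
# JVM options: OldGen >= 13g, NewGen 2g => total heap around 15-16g
# -Xms16g -Xmx16g -XX:NewSize=2048m -XX:MaxNewSize=2048m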

Brendan

kay kay

Aug 9, 2013, 8:19:16 AM
to project-...@googlegroups.com
How should I calculate the cache size for 100'000'000 keys? Will I get problems with GC if I set -Xms70g -Xmx70g?

On Thursday, August 8, 2013 6:33:19 PM UTC+4, Brendan Harris (a.k.a. stotch on irc.oftc.net) wrote:

Brendan Harris (a.k.a. stotch on irc.oftc.net)

Aug 9, 2013, 10:05:11 AM
to project-...@googlegroups.com
Hello,


On Friday, August 9, 2013 5:19:16 AM UTC-7, kay kay wrote:
How should I calculate the cache size for 100'000'000 keys? Will I get problems with GC if I set -Xms70g -Xmx70g?

It's not easy to calculate. Since you're on slow disks and thus cannot enable bdb.cache.evictln, you'll never be able to get an ideal cache size. We'll be open-sourcing the tool that we made to do this calculation, but it makes the assumption that you have bdb.cache.evictln=true.

As for the 70g heap, you will definitely have long GC times. I do not know how long, but you could assume that your 99th percentile latencies will be higher than 300ms. If you do that, you might want to enable kernel huge pages and reduce swap activity (on Linux, reduce the vm.swappiness value) or turn swap off.
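
A minimal sketch of those OS-side tweaks, assuming a typical Linux box (pick values that suit your hosts):

# discourage the kernel from swapping
sysctl -w vm.swappiness=1
# make it persistent across reboots
echo "vm.swappiness = 1" >> /etc/sysctl.conf
# or turn swap off entirely
swapoff -a
# on the JVM side, huge pages are enabled with -XX:+UseLargePages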

On our old clusters that ran on SAS disks, for the clusters that had > 100M keys, we ran an initial/max heap of 22g, NewGen=2g, InitiatingOccupancyFraction=70, SurvivorRatio=2 and a bdb.cache.size=10g. You could start at that and then try to increase the sizes and see where performance starts to degrade.

Something to keep in mind is that the more memory you give to java, the more you take away from the OS for page cache. And on slow disks you're going to rely heavily on the page cache.

Brendan

tim

Aug 11, 2013, 1:54:21 PM
to project-...@googlegroups.com

Hello Brendan, our first goal is to fill the metadata cache (to avoid 2 disk seeks per read request), so we don't need that much heap. 16GB per node, I think, is enough.
The question is: when does the bdb cache store metadata? Does it cache metadata during write operations or only during reads?
If metadata is only cached on reads, we need some way to warm up the metadata cache. Is there a way to do that?
I'd like to thank you once more for your answers!


tim

Aug 11, 2013, 1:58:52 PM
to project-...@googlegroups.com

I was using the mongoperf utility to test our SATA drives.
It shows 120-150 random read requests per second against a 500GB file with a 10KB block size,
so I'm expecting to get 200-250 random reads per second per Voldemort server.
Am I right, or is there some mistake?

Thank you


Brendan Harris (a.k.a. stotch on irc.oftc.net)

Aug 12, 2013, 10:36:16 AM
to project-...@googlegroups.com
Hello Tim,

On Sunday, August 11, 2013 10:54:21 AM UTC-7, tim wrote:

Hello Brendan, our first goal is to fill the metadata cache (to avoid 2 disk seeks per read request), so we don't need that much heap. 16GB per node, I think, is enough.
The question is: when does the bdb cache store metadata? Does it cache metadata during write operations or only during reads?
If metadata is only cached on reads, we need some way to warm up the metadata cache. Is there a way to do that?


It is both a read-through and write-through cache, so both kinds of operations will warm it. As for pre-warming the cache, you can use the AdminClient.fetchKeys method, which is also wrapped in a command-line script (see bin/voldemort-admin-tool.sh), but if you enable bdb.minimize.scan.impact (enabled by default) then this will not warm the cache for you. Other than that, you would need a secondary index to know all of the keys ahead of time so you can pre-get them, which is something you would have to implement on your own with a second Voldemort store and logic in the client to keep the secondary index up to date.
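
One rough way to pre-warm through the normal read path, if you already have a file listing your keys, is simply to replay them as plain reads with the performance tool; the flags below are taken from the runs earlier in this thread, and "all_keys" plus NUM_KEYS are hypothetical placeholders for your key file and its line count:

# replay known keys as reads to pull their index entries into the bdb cache
./bin/voldemort-performance-tool.sh --value-size 10240 --ops-count NUM_KEYS --url tcp://hm-4:6666 --store-name UserTable -r 100 --threads 100 --interval 1 --ignore-nulls --request-file all_keys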

Brendan

Brendan Harris (a.k.a. stotch on irc.oftc.net)

Aug 12, 2013, 10:37:52 AM
to project-...@googlegroups.com
Hi Tim,


On Sunday, August 11, 2013 10:58:52 AM UTC-7, tim wrote: 

I was using the mongoperf utility to test our SATA drives.
It shows 120-150 random read requests per second against a 500GB file with a 10KB block size,
so I'm expecting to get 200-250 random reads per second per Voldemort server.
Am I right, or is there some mistake?

I am not certain I understand your logic. Can you explain a little more?

Thanks,

Brendan

kay kay

Aug 13, 2013, 7:17:52 AM
to project-...@googlegroups.com
He's trying to say that a single SATA drive gives 120-150 random reads per second, so when we use RAID0 we should get roughly a 1.5x speed increase.
That means one node should give ~200-250 requests per second and the whole cluster (4 nodes) should give ~800-1000 requests per second.

On Monday, August 12, 2013 6:37:52 PM UTC+4, Brendan Harris (a.k.a. stotch on irc.oftc.net) wrote:

Brendan Harris (a.k.a. stotch on irc.oftc.net)

Aug 13, 2013, 10:23:26 AM
to project-...@googlegroups.com
Hi kay kay,


On Tuesday, August 13, 2013 4:17:52 AM UTC-7, kay kay wrote:
He's trying to say that a single SATA drive gives 120-150 random reads per second, so when we use RAID0 we should get roughly a 1.5x speed increase.
That means one node should give ~200-250 requests per second and the whole cluster (4 nodes) should give ~800-1000 requests per second.

Well, it depends on how you tested. From Tim's statement, he's saying that he had a 500GB file and his filesystem block size was 10KB. Right? I am not sure that same test would work properly for Voldemort. You could set bdb.max.logfile.size to 500GB and run the Voldemort perf tool, but bdb does not operate well with large files like that.

Perhaps I am misunderstanding what Tim wrote. But if you configure Voldemort properly, you should be able to get up to the maximum random reads per second per disk. It may just take some playing around with the configuration to get there. If you can set the cache size large enough to fit all of the index into memory, that should get you maximum performance.

Brendan

kay kay

Aug 20, 2013, 9:47:09 AM
to project-...@googlegroups.com
Hi Brendan,

I've disassembled the RAID0 and created two default ext4 filesystems for two Voldemort instances on one hardware node. I used your configuration suggestions, but disabled replication for test purposes:

<stores>
<!-- Note that "test" store requires 2 reads and writes,
so to use this store you must have both nodes started and running -->
<store>
<name>UserTable</name>
<persistence>bdb</persistence>
<routing>client</routing>
<replication-factor>1</replication-factor>
<required-reads>1</required-reads>
<required-writes>1</required-writes>
<preferred-reads>1</preferred-reads>
<preferred-writes>1</preferred-writes>
<key-serializer>
<type>string</type>
</key-serializer>
<value-serializer>
<type>string</type>
</value-serializer>
<retention-days>1</retention-days>
</store>
</stores>

<!-- Partition distribution generated using seed [32819401880] -->
<cluster>
<name>voldemort</name>
<server>
<id>0</id>
<host>hm-4</host>
<http-port>6665</http-port>
<socket-port>6666</socket-port>
<admin-port>6667</admin-port>
<partitions>0, 1, 3, 5, 7, 9, 12, 13, 15, 16, 17, 18, 26, 27, 28, 29, 30, 32, 33, 34, 35, 36, 43, 47, 48, 49, 50, 52, 55, 57, 58, 66, 71, 72, 77, 78, 80, 81, 85, 86, 87, 89, 98, 101, 102, 104, 105, 109, 113, 114, 115, 116, 117, 118, 120, 121, 123, 124, 125, 130, 131, 132, 134, 137, 139, 140, 142, 143, 144, 147, 153, 157, 160, 161, 165, 167, 168, 170, 171, 174, 175, 178, 184, 185, 186, 187, 188, 189, 195, 196, 198, 201, 204, 207, 208, 209, 212, 214, 216, 217, 219, 222, 226, 229, 230, 231, 232, 235, 236, 239, 240, 242, 245, 246, 247, 248, 256, 257, 258, 259, 262, 264, 265, 266, 268, 271, 273, 276, 278, 279, 280, 285, 286, 287, 289, 290, 291, 292, 294, 295, 296, 297, 306, 309, 312, 314, 315, 317, 321, 322, 323, 324, 326, 327, 328, 329, 330, 332, 334, 336, 337, 338, 339, 342, 343, 345, 346, 347, 349, 350, 352, 353, 354, 356, 362, 364, 365, 366, 367, 370, 372, 373, 374, 375, 381, 383, 386, 387, 388, 389, 395, 398, 399, 403, 407, 410, 412, 414, 416, 418, 419, 423, 424, 426, 428, 429, 432, 433, 434, 435, 436, 437, 438, 440, 445, 446, 448, 449, 452, 453, 454, 456, 457, 459, 461, 462, 463, 465, 466, 467, 469, 472, 473, 476, 477, 478, 479, 480, 485, 487, 489, 492, 493, 494, 495, 496, 497, 498, 503, 505, 507, 509, 510, 514, 516, 517, 518, 519, 521, 522, 523, 527, 528, 532, 533, 534, 535, 539, 542, 545, 546, 547, 549, 550, 554, 555, 556, 557, 558, 559, 560, 561, 564, 565, 566, 568, 570, 571, 572, 574, 575, 577, 578, 580, 581, 588, 591, 593, 594, 596</partitions>
</server>
<server>
<id>1</id>
<host>hm-4</host>
<http-port>6668</http-port>
<socket-port>6669</socket-port>
<admin-port>6670</admin-port>
<partitions>2, 4, 6, 8, 10, 11, 14, 19, 20, 21, 22, 23, 24, 25, 31, 37, 38, 39, 40, 41, 42, 44, 45, 46, 51, 53, 54, 56, 59, 60, 61, 62, 63, 64, 65, 67, 68, 69, 70, 73, 74, 75, 76, 79, 82, 83, 84, 88, 90, 91, 92, 93, 94, 95, 96, 97, 99, 100, 103, 106, 107, 108, 110, 111, 112, 119, 122, 126, 127, 128, 129, 133, 135, 136, 138, 141, 145, 146, 148, 149, 150, 151, 152, 154, 155, 156, 158, 159, 162, 163, 164, 166, 169, 172, 173, 176, 177, 179, 180, 181, 182, 183, 190, 191, 192, 193, 194, 197, 199, 200, 202, 203, 205, 206, 210, 211, 213, 215, 218, 220, 221, 223, 224, 225, 227, 228, 233, 234, 237, 238, 241, 243, 244, 249, 250, 251, 252, 253, 254, 255, 260, 261, 263, 267, 269, 270, 272, 274, 275, 277, 281, 282, 283, 284, 288, 293, 298, 299, 300, 301, 302, 303, 304, 305, 307, 308, 310, 311, 313, 316, 318, 319, 320, 325, 331, 333, 335, 340, 341, 344, 348, 351, 355, 357, 358, 359, 360, 361, 363, 368, 369, 371, 376, 377, 378, 379, 380, 382, 384, 385, 390, 391, 392, 393, 394, 396, 397, 400, 401, 402, 404, 405, 406, 408, 409, 411, 413, 415, 417, 420, 421, 422, 425, 427, 430, 431, 439, 441, 442, 443, 444, 447, 450, 451, 455, 458, 460, 464, 468, 470, 471, 474, 475, 481, 482, 483, 484, 486, 488, 490, 491, 499, 500, 501, 502, 504, 506, 508, 511, 512, 513, 515, 520, 524, 525, 526, 529, 530, 531, 536, 537, 538, 540, 541, 543, 544, 548, 551, 552, 553, 562, 563, 567, 569, 573, 576, 579, 582, 583, 584, 585, 586, 587, 589, 590, 592, 595, 597, 598, 599</partitions>
</server>
</cluster>

# The ID of *this* particular cluster node
node.id=0

max.threads=100
enable.repair=true

data.directory=/data1/voldemort

############### DB options ######################

http.enable=true
socket.enable=true
jmx.enable=true
admin.enable=true

# BDB
bdb.write.transactions=false
bdb.flush.transactions=false
bdb.cache.size=10g
bdb.minimize.scan.impact=false

# The ID of *this* particular cluster node
node.id=1

max.threads=100
enable.repair=true

data.directory=/data2/voldemort

############### DB options ######################

http.enable=true
socket.enable=true
jmx.enable=true
admin.enable=true

# BDB
bdb.write.transactions=false
bdb.flush.transactions=false
bdb.cache.size=10g
bdb.minimize.scan.impact=false


I've uploaded 4328791 keys, created a shuffled sequence file from 1 to 4328791, and started the performance test:

./voldemort-performance-tool.sh --value-size 10240 --ops-count 4328791 --url tcp://hm-4:6666 --store-name UserTable -r 100 --threads 100 -v --interval 1 --ignore-nulls --request-file seq_shuf

The first run gave me 120 tps and iostat showed 120-130 tps per disk; then after a few hours I stopped the test and started it again, and now the results are:

234 rps / iostat shows 160-170 tps per disk.

It seems that this is the FS cache at work.

I've cleared the FS cache:
sync
echo 3 > /proc/sys/vm/drop_caches
 
then started the benchmark once again and got the same results as in the first test.

kay kay

Aug 20, 2013, 12:18:36 PM
to project-...@googlegroups.com
Also, I've noticed that even while idle, Voldemort keeps reading from the HDDs and filling the file cache. Here is the "idle" iostat:
avg-cpu: %user %nice %system %iowait %steal %idle
0,16 0,00 0,31 15,39 0,00 84,14

Device: tps Blk_read/s Blk_wrtn/s Blk_read Blk_wrtn
sda 0,00 0,00 0,00 0 0
sdb 240,00 13120,00 0,00 13120 0
sdc 293,00 12272,00 0,00 12272 0