How to avoid the database growing after overwrites of the same keys


Paolo Forte

Mar 30, 2015, 10:02:26 AM
to project-...@googlegroups.com
Hi,
I noticed that each time I run the performance tool to overwrite the keys' values (i.e. -w 100 -r 0) in the database, the database size increases instead of remaining the same.

The keys on which I am performing the operations are always the same. I am writing 200 KB per key over a range of 200,000 keys (40 GB in total). I am running the single_node_cluster configuration, and the performance tool's command is:
./voldemort-performance-tool.sh --interval 1 --iterations 1 --metric-type summary --num-connections-per-node 100000 --percent-cached 0 -r 0 -w 100 --record-selection uniform --store-name test --threads 16 --url tcp://192.168.0.9:6666 --value-size 200000 --record-count 200000 --ops-count 2000

NOTE: I actually modified the code to skip the warmup phase and start directly with the writes. I checked that the range of keys it operates on is really the desired one (i.e. 1 - 200,000).

I guess this is due to the "versioning" of the keys. Am I doing something wrong? If not, how can I avoid or reduce this phenomenon? Otherwise all the available disk space is quickly consumed.

Thanks,
Paolo. 

Paolo Forte

Mar 30, 2015, 1:09:38 PM
to project-...@googlegroups.com
Hi,
Thanks for your reply.

There is an error in my first post. I am not using the single_node_cluster configuration but the prod_single_node_cluster, whose parameters are the following:
# configs
admin.enable=true
admin.max.threads=40
bdb.cache.evictln=true
bdb.cache.size=20GB
bdb.checkpoint.interval.bytes=2147483648
bdb.checkpointer.off.batch.writes=true
bdb.cleaner.interval.bytes=15728640
bdb.cleaner.lazy.migration=false
bdb.cleaner.min.file.utilization=0
bdb.cleaner.threads=1
bdb.enable=true
bdb.evict.by.level=true
bdb.expose.space.utilization=true
bdb.lock.nLockTables=47
bdb.minimize.scan.impact=true
bdb.one.env.per.store=true
enable.server.routing=false
enable.verbose.logging=false
http.enable=true
nio.connector.selectors=50
num.scan.permits=2
request.format=vp3
restore.data.timeout.sec=1314000
scheduler.threads=24
socket.enable=true
storage.configs=voldemort.store.bdb.BdbStorageConfiguration, voldemort.store.readonly.ReadOnlyStorageConfiguration
stream.read.byte.per.sec=209715200
stream.write.byte.per.sec=78643200

This is pretty much the default; I only increased the bdb cache size. I'll try to increase the number of cleaners as you suggest, but how is it possible that the space has not been freed after I turn off the servers? If the problem is the one you pointed out, shouldn't everything be freed when the nodes are switched off?
That is why I thought it was something due to the versioned values, which perhaps are not erased after being "updated" by subsequent writes.
Moreover, would you mind helping me set these parameters? I lack the knowledge to tune them. I don't care if I lose the data, since the values are random data inserted by the performance tool's warmup.

Just some info about the nodes' hardware: the machines have 64 GB RAM, two 12-core AMD Opteron processors, a 500 GB hard disk and a 1 Gb network controller. They run Ubuntu 12.04 LTS.

I really appreciate your help. Thanks.

Brendan Harris (a.k.a. stotch on irc.oftc.net)

Mar 30, 2015, 1:13:24 PM
to project-...@googlegroups.com
Hi Paolo,

The BdbStorageEngine is log-structured, meaning that all writes, including writes to existing keys, are appended to the end of the latest bdb log (jdb file). Then, over time as more writes come through, the log structure is compacted by cleaner threads: they scan the older log files, find the live records in them, append those records to the end of the current log file and mark the older log files for deletion. Then all of the files marked for deletion are deleted.
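
To make that concrete, here is a toy sketch of the idea (illustrative only; this is not Voldemort's or BDB JE's actual code, and all names in it are made up). Puts always append, overwrites just leave a dead record behind in an older file, and a cleaner pass later copies live records forward so the old file can be dropped:

import java.util.*;

// Toy sketch of log-structured storage (NOT the real BDB JE code).
class ToyLogStore {
    static final int RECORDS_PER_FILE = 4;             // stand-in for bdb.max.logfile.size
    List<List<String[]>> logFiles = new ArrayList<>(); // each "file" is a list of [key, value]
    Map<String, int[]> index = new HashMap<>();        // key -> {file, slot} of the live record

    void put(String key, String value) {
        if (logFiles.isEmpty()
                || logFiles.get(logFiles.size() - 1).size() >= RECORDS_PER_FILE)
            logFiles.add(new ArrayList<>());           // cap the old file, open a new one
        List<String[]> tail = logFiles.get(logFiles.size() - 1);
        tail.add(new String[] { key, value });         // append; never update in place
        index.put(key, new int[] { logFiles.size() - 1, tail.size() - 1 });
        // Any older record for this key is now dead but still occupies disk.
    }

    // Cleaner pass: migrate live records out of the oldest file, then empty it.
    void cleanOldestFile() {
        if (logFiles.size() < 2) return;               // nothing older than the tail
        List<String[]> oldest = logFiles.get(0);
        for (int slot = 0; slot < oldest.size(); slot++) {
            int[] live = index.get(oldest.get(slot)[0]);
            if (live != null && live[0] == 0 && live[1] == slot)
                put(oldest.get(slot)[0], oldest.get(slot)[1]); // copy forward to the tail
        }
        oldest.clear(); // now 0% utilized; a real engine would delete the file
    }
}

A 100%-write workload over the same 200,000 keys grows the list of log files in exactly this way until the cleaner catches up, which is why the on-disk size increases even though the number of live keys stays constant.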

There are a handful of tunable parameters for controlling the cleaner threads and the "efficiency" of the structure:
bdb.cache.size (209715200)
bdb.max.logfile.size (62914560)
bdb.cleaner.interval.bytes (31457280)
bdb.cleaner.threads (1)
bdb.cleaner.minUtilization (50)
bdb.cleaner.min.file.utilization (0)
bdb.one.env.per.store (false)

And there are some others as well, but the above are the most critical ones.

If you have bdb.one.env.per.store at its default of false and you have more than one bdb store on the cluster, then your bdb cleaning is going to be very inefficient. Changing this to true, however, will cause you to lose all of your stores' existing data (effectively starting your cluster over from scratch), so it is a setting best made at the start of a cluster's life. With it set to true, you'll have separate cleaner threads for each store, which helps a lot. If you can afford to lose all of your data and change this setting, then I recommend you set bdb.cleaner.threads to 2 or 3 (see the fragment just below). If you cannot lose your data and need to stick with bdb.one.env.per.store=false, then you're going to need a much higher number of cleaner threads: at least 2 and probably fewer than 20.
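
As a starting point, that would look like the following in server.properties (the exact values here are just an example to experiment with, not a prescription):

bdb.one.env.per.store=true
bdb.cleaner.threads=3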

I recommend keeping bdb.cleaner.min.file.utilization and bdb.cleaner.minUtilization at their defaults, but if you want to change them, make sure that one of them is always set to 0 and only tune/adjust the other. In bdb, "utilization" is the percentage of records that are live. Any time you delete or overwrite a key, you create a "dead" record in the bdb structure. As the number of dead records increases, the utilization goes down. The first parameter controls the utilization target on a per-log-file basis. The latter controls the utilization target across all log files in the bdb environment (again, this becomes more efficient with bdb.one.env.per.store=true). I recommend keeping bdb.cleaner.min.file.utilization at 0 and depending on the environment-wide bdb.cleaner.minUtilization setting, which defaults to a 50% utilization target across the environment. So, if the overall utilization is >= 50% across all bdb files in the environment, the cleaner threads will immediately exit when they wake up, without doing any work. But if it drops below 50%, when the cleaner threads wake up they will start finding live records in the least-utilized files and appending them to the latest bdb file, and then mark all of the 0%-utilized files for deletion.
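
To make that concrete with the numbers from this thread (roughly, ignoring per-record metadata): 200,000 keys x 200 KB = ~40 GB of live data. One full overwrite pass appends another ~40 GB and turns all of the original records dead, so environment-wide utilization falls to about 40/80 = 50%, which is right at the default threshold where the cleaners begin migrating live records and deleting old files.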

bdb.cleaner.interval.bytes controls when the bdb cleaner threads wake up. Its value is a number of bytes written to the bdb engine. So, by default, every time you write ~30 MB to bdb, the cleaner threads wake up. The write size is the record size (a combination of the key and value sizes), plus the internal voldemort vector clock, a timestamp and some additional bytes for the schema version number, compression type and perhaps a couple of other pieces of metadata.
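
For example, with the bdb.cleaner.interval.bytes=15728640 (~15 MB) value from your config and 200 KB values, the cleaner threads would be woken roughly every 15 MB / 200 KB = ~75 overwrites. Whether they then do any work still depends on the utilization check described above.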

bdb.max.logfile.size controls the maximum size of a bdb log file. Once this size is reached, the file is capped off with the latest records and a footer, then a new file is created and all records are appended to that file. I recommend keeping the default, though increasing or decreasing it slightly might be necessary for you. Note that the more records you have per log file, the more bdb cache you will need for the cleaner threads to complete compaction.

Lastly the bdb.cache.size needs to be large enough to fit your store's index structure so that compaction can complete. The larger your log files are and the more records you have per log file, the more memory you're going to need in bdb cache in order to be able to allow the cleaners to fully migrate live records to the latest log file to facilitate compaction.

One thing you can look at is the Bdb "CleanerBacklog" metric in the bdb-store-stats mbean via JMX (or whatever metrics monitoring you do) to see how effective the cleaning is. There is also a "NumCleanerRuns" metric in the same mbean. With the two of those, you can see general bdb compaction efficiency. So, if you plot out those two data points and watch their trend as you change these settings, you can hone in on a more effective configuration for your use case.
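
If you don't have metrics monitoring wired up yet, a small JMX poller can dump those two attributes. This is a minimal sketch only; the host/port, the "bdb" name filter and the attribute names are assumptions on my part, so browse the server's mbeans with jconsole first to confirm the real names:

import javax.management.*;
import javax.management.remote.*;

// Minimal JMX polling sketch. The JMX host/port, the "bdb" name filter and
// the attribute names below are ASSUMPTIONS -- confirm the real mbean and
// attribute names with jconsole against a running voldemort server first.
public class CleanerStatsPoller {
    public static void main(String[] args) throws Exception {
        JMXServiceURL url = new JMXServiceURL(
                "service:jmx:rmi:///jndi/rmi://192.168.0.9:9999/jmxrmi"); // hypothetical JMX port
        try (JMXConnector jmxc = JMXConnectorFactory.connect(url)) {
            MBeanServerConnection conn = jmxc.getMBeanServerConnection();
            for (ObjectName name : conn.queryNames(new ObjectName("*:*"), null)) {
                if (!name.toString().toLowerCase().contains("bdb"))
                    continue; // only look at the bdb stats mbeans
                for (String attr : new String[] { "CleanerBacklog", "NumCleanerRuns" }) {
                    try {
                        System.out.println(name + " " + attr + " = "
                                + conn.getAttribute(name, attr));
                    } catch (Exception ignored) {
                        // this mbean doesn't expose that attribute; skip it
                    }
                }
            }
        }
    }
}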

Brendan

Brendan Harris (a.k.a. stotch on irc.oftc.net)

Mar 30, 2015, 1:47:22 PM
to project-...@googlegroups.com
Hi Paolo,

On Monday, March 30, 2015 at 10:09:38 AM UTC-7, Paolo Forte wrote:
bdb.cache.size=20GB

With such a large cache size, you're probably going to need a JVM heap size of 25+ GB, and 90% of it will be needed for old gen (so make sure the NewSize is not using too much of the heap).

bdb.cleaner.threads=1

Maybe set this to 3 and see what happens.
 
storage.configs=voldemort.store.bdb.BdbStorageConfiguration, voldemort.store.readonly.ReadOnlyStorageConfiguration

Remove  ", voldemort.store.readonly.ReadOnlyStorageConfiguration" from the storage.configs. The ReadOnly and Bdb storage engines cannot be reliably run within the same JVM because they are vastly different engines with extremely different behavior and heap usage.

This is pretty much the default; I only increased the bdb cache size. I'll try to increase the number of cleaners as you suggest, but how is it possible that the space has not been freed after I turn off the servers? If the problem is the one you pointed out, shouldn't everything be freed when the nodes are switched off?

Not necessarily. The cleaners wake and run after a certain number of bytes have been written, but whether they do anything also depends on your overall utilization. One of the other metrics you can look at in the bdb store stats mbean is the utilization. If that is at or above 50% (shown as a float in hundredths of a percent), then there is nothing to compact and the cleaners go back to sleep even after being woken up. Also, if you do something like SIGKILL the voldemort server or power down the host, the cleaner threads will not be able to complete their work. I don't know how you are shutting down the nodes when you say "switched off".

That is why I thought it was something due to the versioned values, which perhaps are not erased after being "updated" by subsequent writes.

We don't support historical records in voldemort. So, no, this is not the case. Every time you overwrite a record, the old record becomes garbage in an earlier bdb file and the new record is appended to the latest jdb file. Eventually the cleaners will compact the old files and discard them.
 
Moreover, would you mind helping me set these parameters? I lack the knowledge to tune them.

This is something you'll have to learn with your own use case. There is no predetermined formula that will work for everyone. Read through my earlier post and consider the details. If you think about them a bit, aligned with your use case, you'll figure out what you need.

One more thing to keep in mind is that your JVM needs to be well-tuned for the work that voldemort is doing. You will probably need to analyze your JVM's activity, including rate of object creation, rate of object promotion from NewGen to OldGen, GC frequency, GC time and amount of time that object creation and promotion are stopped during GC events.

Brendan

Paolo Forte

Mar 31, 2015, 9:52:44 AM
to project-...@googlegroups.com
Thanks Brendan,

Currently my JVM's parameters are these (from bin/voldemort-prod-server.sh):

export VOLD_OPTS=" -server -Xms32684m -Xmx32684m -XX:NewSize=2048m -XX:MaxNewSize=2048m -XX:+UseConcMarkSweepGC -XX:+UseParNewGC -XX:CMSInitiatingOccupancyFraction=70 -XX:SurvivorRatio=2 -XX:+AlwaysPreTouch -XX:+UseCompressedOops -XX:+PrintTenuringDistribution -XX:+PrintGCDetails -XX:+PrintGCDateStamps -Xloggc:gc.log -XX:+PrintGCApplicationStoppedTime -XX:+PrintGCApplicationConcurrentTime "

To shut down Voldemort I simply press CTRL + C (which should send SIGINT); Voldemort receives the signal and, after some operations, shuts itself down.

I'll try to follow your suggestions. Thanks again.

Paolo.

Paolo Forte

Mar 31, 2015, 10:13:39 AM
to project-...@googlegroups.com
P.S. Currently, with the config I wrote before and just the cleaner threads increased to 4, "CleanerBacklog" and "NumCleanerRuns" are stuck at 0. getPercentageUtilization() returns 0.8600.

The size of the config/data folder keeps increasing. I really can't understand what I am doing wrong.

Brendan Harris (a.k.a. stotch on irc.oftc.net)

Mar 31, 2015, 10:32:44 AM
to project-...@googlegroups.com
Hi Paolo,

On Tuesday, March 31, 2015 at 7:13:39 AM UTC-7, Paolo Forte wrote:
P.S. Currently, with the config I wrote before and just the cleaner threads increased to 4, "CleanerBacklog" and "NumCleanerRuns" are stuck at 0. getPercentageUtilization() returns 0.8600.

86% is very good utilization. The cleaners won't do any work until that number drops below 0.50 or so.

The size of the config/data folder keeps increasing. I really can't understand what I am doing wrong.

You're adding more data to the structure, even when overwriting existing records. This is normal log-structured behavior.

It does not look like you're doing anything wrong. Your voldemort and JVM configs look very sane, and it looks like everything is fine. It is doing what it is supposed to. Let it grow and, over time, as the utilization drops below 50%, it will start cleaning itself up. You could alternatively set minUtilization up to 90% or so if you are really space-restricted, but you'll see a marked decrease in performance, potentially crippling normal get and put operations if you put a lot of bytes over a short period of time.
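
If you do decide to trade performance for space despite that caveat, it is a one-line change in server.properties (90 here is only an example value):

bdb.cleaner.minUtilization=90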

If log-structured behavior is not what you want, then you need a different kind of DB that supports true fixed-pointer inserts into the structure and does not have a transaction/write-ahead log. Otherwise, you need to expect that a log-structured database is going to consume extra space for a while and free up space later. This is the cost that is exacted for having a write-optimized, on-disk data structure.

Brendan

Paolo Forte

Apr 4, 2015, 8:00:23 AM
to project-...@googlegroups.com
Hi,
Suddenly my disk space was completely full. I couldn't even modify a text file.

I had to manually delete the folder config/data to free some space.

I guess that could be my fault, since I simply deleted the folder config/data to get rid of the contents of a store.
I guess there are other files somewhere that are now "zombies" occupying my disk space. Could they be the bdb logs?

Brendan Harris

Apr 4, 2015, 11:10:49 AM
to project-...@googlegroups.com
Yeah, don't do this. The index (maintained as a footer record written with every record flush to the bdb structure) will be incomplete, and thus the bdb structure cannot be maintained/compacted by the engine. At this point, you should shut down all of the servers, delete all of the jdb files and start the servers up again from scratch. After that, the structure will be able to maintain itself on its own.

Brendan



Paolo Forte

Apr 4, 2015, 11:15:42 AM
to project-...@googlegroups.com
Please, could you tell me where they are located and, if applicable, which of them should be deleted?

Paolo Forte

Apr 4, 2015, 11:17:39 AM
to project-...@googlegroups.com
P.S. What is the correct way to empty a store?
Thanks a lot for your help.

Brendan Harris (a.k.a. stotch on irc.oftc.net)

Apr 4, 2015, 11:54:56 AM
to project-...@googlegroups.com


On Saturday, April 4, 2015 at 8:17:39 AM UTC-7, Paolo Forte wrote:
P.S. What is the correct way to empty a store?
Thanks a lot for your help.

You can use the vadmin.sh tool (or the voldemort-admin-tool.sh if you're running an older version) to do a "truncate" of your store. This will mark all records as deleted and the cleaners will come through and compact the entire structure the next time you write ~20mb of data (default config). It won't delete everything right away, though. It will use the bdb compaction method to clean up the structure and free up space after you write more data to the structure. There is no immediate delete and reclamation of space. This is the case in pretty much any database system without "violating the laws" of that system.
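
For example, something like this should truncate the "test" store from your performance-tool run (the exact subcommand and flags vary between releases and are an assumption on my part, so check the tool's --help output):

./voldemort-admin-tool.sh --truncate test --url tcp://192.168.0.9:6666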

As for where the data files are, if you're not defining data.dir in your server.properties, it will be under config/<cluster_type>/config/voldemort/data/<storename>/*.jdb. To make it easier on yourself, just do the following after shutting down the voldemort server:
find <path_to_config_dir> -type f -name "*.jdb" -print0 | xargs -0 rm -f

This won't work on Windows, though; there I guess you would use Windows search. But on unix-like operating systems, the above command will find all of the jdb files and remove them.

Brendan

Paolo Forte

Apr 4, 2015, 12:39:58 PM
to project-...@googlegroups.com
Note: when I wrote config/data I actually meant config/<cluster_name>/data. 

I looked for the .jdb files, but apparently they are not located in the folder you wrote.
I then looked in the other clusters' folders and found that they are usually located in /voldemort/config/<cluster_name>/data/bdb/<store_name>/*.jdb.

So, when I deleted the folder config/<cluster_name>/data, I deleted those files as well.
I guess therefore that if I want to delete a store completely, I can just delete the folder config/<cluster_name>/data as I actually did, even though that is not an elegant way to get rid of it. Am I right?

Brendan Harris (a.k.a. stotch on irc.oftc.net)

Apr 4, 2015, 12:51:03 PM
to project-...@googlegroups.com


On Saturday, April 4, 2015 at 9:39:58 AM UTC-7, Paolo Forte wrote:
Note: when I wrote config/data I actually meant config/<cluster_name>/data. 

I looked for the .jdb files, but apparently they are not located in the folder you wrote.
I then looked in the other clusters' folders and found that they are usually located in /voldemort/config/<cluster_name>/data/bdb/<store_name>/*.jdb.

Ah, yes. Sorry for misleading you.

So, when I deleted the folder config/<cluster_name>/data, I deleted those files as well.
I guess therefore that if I want to delete a store completely, I can just delete the folder config/<cluster_name>/data as I actually did, even though that is not an elegant way to get rid of it. Am I right?

Yes, that would delete all of the data files. The store itself would still exist in the cluster, though, because it still has a store definition. If you want to delete that as well, the prescribed way is to do a "vadmin.sh store delete -s <storename>" or "voldemort-admin-tool.sh --delete-store <storename>".

Brendan