Hi Paolo,
The BdbStorageEngine is log-structured, meaning that all writes are appended to the end of the latest BDB log file (.jdb file); this includes writes to existing keys. Over time, as more writes come through, the log structure is compacted by cleaner threads that scan the older log files, find the live records in them, append those records to the end of the current log file, and mark the older log files for deletion. The files marked for deletion are then deleted.
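To make that picture concrete, here is a very rough toy sketch of the idea (this is not Voldemort or BDB-JE code, just an illustration): writes always append to the newest file, and a cleaner pass copies live records forward and then drops files that contain only dead data.

    import java.util.ArrayList;
    import java.util.HashMap;
    import java.util.List;
    import java.util.Map;

    // Toy illustration of a log-structured store with a cleaner pass.
    // NOT Voldemort/BDB-JE code; it only mirrors the idea described above.
    public class ToyLogStructuredStore {

        static class LogFile {
            final Map<String, String> records = new HashMap<>(); // key -> value as appended
        }

        private final List<LogFile> logFiles = new ArrayList<>();
        private final Map<String, LogFile> liveLocation = new HashMap<>(); // key -> file holding the live copy

        public ToyLogStructuredStore() {
            logFiles.add(new LogFile()); // start with one open file
        }

        // Every write (including an overwrite of an existing key) appends to the newest
        // file, leaving a "dead" copy of the old value behind in an older file.
        public void put(String key, String value) {
            LogFile current = logFiles.get(logFiles.size() - 1);
            current.records.put(key, value);
            liveLocation.put(key, current);
        }

        // Analogous to capping off a .jdb file and opening a new one.
        public void rollNewFile() {
            logFiles.add(new LogFile());
        }

        // A cleaner pass: copy live records out of the older files into the newest file,
        // then drop the older files, which now contain only dead data.
        public void clean() {
            LogFile current = logFiles.get(logFiles.size() - 1);
            for (LogFile old : new ArrayList<>(logFiles)) {
                if (old == current) continue;
                for (Map.Entry<String, String> e : old.records.entrySet()) {
                    if (liveLocation.get(e.getKey()) == old) { // still live -> move it forward
                        current.records.put(e.getKey(), e.getValue());
                        liveLocation.put(e.getKey(), current);
                    }
                }
                logFiles.remove(old); // analogous to marking the old .jdb file for deletion
            }
        }
    }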
There are a handful of tunable parameters for controlling the cleaner threads and the "efficiency" of the log structure (defaults shown in parentheses):
bdb.cache.size (209715200)
bdb.max.logfile.size (62914560)
bdb.cleaner.interval.bytes (31457280)
bdb.cleaner.threads (1)
bdb.cleaner.minUtilization (50)
bdb.cleaner.min.file.utilization (0)
bdb.one.env.per.store (false)
And there are some others as well, but the above are the most critical ones.
If you leave bdb.one.env.per.store at its default of false and you have more than one bdb store on the cluster, then your bdb cleaning is going to be very inefficient. Changing it to true, however, will cause you to lose all of your stores' existing data (effectively starting your cluster over from scratch), so this is a setting best chosen when the cluster is first set up. With bdb.one.env.per.store=true, each store gets its own environment and its own cleaner threads, which helps a lot. If you can afford to lose all of your data and make that change, then I recommend setting bdb.cleaner.threads to 2 or 3. If you cannot lose your data and need to stick with bdb.one.env.per.store=false, then you're going to need a much higher number of cleaner threads; I'd make sure you have at least 2 and probably fewer than 20.
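For example, the two scenarios would look roughly like this in server.properties (the thread counts are just illustrative picks from the ranges above, not magic numbers):

    # Option A: fresh cluster, existing data can be thrown away:
    # one bdb environment (and its own cleaners) per store
    bdb.one.env.per.store=true
    bdb.cleaner.threads=3

    # Option B: existing cluster that has to keep its data:
    # shared environment, so compensate with more cleaner threads
    # bdb.one.env.per.store=false
    # bdb.cleaner.threads=8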
I recommend keeping bdb.cleaner.min.file.utilization and bdb.cleaner.minUtilization at their defaults, but if you do want to change them, make sure one of them is always set to 0 and only tune the other. In bdb, "utilization" is the percentage of records that are live. Any time you delete or overwrite a key, you create a "dead" record in the bdb structure, and as the number of dead records grows, utilization goes down. bdb.cleaner.min.file.utilization sets the utilization target on a per-log-file basis, while bdb.cleaner.minUtilization sets the utilization target across all log files in the bdb environment (again, this becomes more efficient with bdb.one.env.per.store=true). I recommend keeping bdb.cleaner.min.file.utilization at 0 and relying on the environment-wide bdb.cleaner.minUtilization setting, which defaults to a 50% utilization target across the environment. So, if overall utilization is >= 50% across all bdb files in the environment, the cleaner threads will wake up and immediately exit without doing any work. But if it drops below 50%, the cleaner threads will start finding live records in the least-utilized files, appending them to the latest bdb file, and then marking all the 0%-utilized files for deletion.
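As a worked example (numbers are purely illustrative): suppose the environment holds 10 GB of .jdb files and 6 GB of that is live records. Utilization is 60%, so the cleaners wake up and go straight back to sleep. Once enough overwrites and deletes push the live portion down to, say, 4.5 GB out of 10 GB (45%), the next wake-up will start migrating live records out of the least-utilized files.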
bdb.cleaner.interval.bytes controls when the bdb cleaner threads wake up. Its value is the number of bytes that must be written to the bdb engine between wake-ups. So, by default, every time you write ~30 MB to bdb, the cleaner threads wake up. The write size is the record size (a combination of the key and value sizes), plus the internal Voldemort vector clock, a timestamp, and some additional bytes for the schema version number, compression type, and perhaps a couple of other pieces of metadata.
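To put that in concrete terms: with the default 31457280 bytes (~30 MB) and an assumed average record of roughly 1 KB (key + value + metadata, purely for illustration), the cleaners would wake up about every 30,000 writes. A larger interval means less frequent cleaning checks; a smaller one means the cleaners re-evaluate utilization more often.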
bdb.max.logfile.size controls the maximum size of a bdb log file. Once this size is reached, the file is capped off with the latest records and a footer, a new file is created, and all subsequent records are appended to that file. I recommend keeping the default, though a slight increase or decrease might make sense for your workload. Keep in mind that the more records you have per log file, the more bdb cache you will need for the cleaner threads to complete compaction.
Lastly, bdb.cache.size needs to be large enough to fit your store's index structure so that compaction can complete. The larger your log files are and the more records you have per log file, the more memory you're going to need in the bdb cache for the cleaners to fully migrate live records to the latest log file and finish compaction.
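As a rough illustration (again assuming ~1 KB average records): a 62914560-byte (~60 MB) log file holds on the order of 60,000 records. For the cleaner to decide which of those are still live it has to consult the index for each key, so the btree internal nodes covering those keys need to fit in the bdb cache; if they don't, the cleaners spend their time faulting index pages and fall behind.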
One thing you can look at is the bdb "CleanerBacklog" metric in the bdb-store-stats mbean via JMX (or whatever metrics monitoring you use) to see how effective the cleaning is. There is also a "NumCleanerRuns" metric in the same mbean. Between the two of those you can gauge general bdb compaction efficiency. If you plot out those two data points and watch their trend as you change these settings, you can home in on a more effective configuration for your use case.
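If you'd rather pull those two values programmatically than watch them in jconsole, a plain JMX client is enough. A minimal sketch follows; the JMX URL/port and the exact ObjectName of the bdb-store-stats mbean are placeholders here, so check the mbean tree in jconsole on your Voldemort server for the real name:

    import javax.management.MBeanServerConnection;
    import javax.management.ObjectName;
    import javax.management.remote.JMXConnector;
    import javax.management.remote.JMXConnectorFactory;
    import javax.management.remote.JMXServiceURL;

    // Minimal JMX poller for the cleaner metrics mentioned above.
    // The URL/port and ObjectName are placeholders; look up the real
    // bdb-store-stats mbean name on your server before using this.
    public class CleanerStatsPoller {
        public static void main(String[] args) throws Exception {
            JMXServiceURL url =
                new JMXServiceURL("service:jmx:rmi:///jndi/rmi://localhost:9999/jmxrmi");
            JMXConnector connector = JMXConnectorFactory.connect(url);
            try {
                MBeanServerConnection conn = connector.getMBeanServerConnection();
                // Placeholder ObjectName; substitute the actual bdb-store-stats mbean name.
                ObjectName bdbStats = new ObjectName("voldemort.store.bdb:type=bdb-store-stats");
                Object backlog = conn.getAttribute(bdbStats, "CleanerBacklog");
                Object cleanerRuns = conn.getAttribute(bdbStats, "NumCleanerRuns");
                System.out.println("CleanerBacklog=" + backlog + " NumCleanerRuns=" + cleanerRuns);
            } finally {
                connector.close();
            }
        }
    }

Polling that on a schedule and graphing both values alongside your write rate makes it fairly easy to see whether a settings change is helping or hurting.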
Brendan