> have you done any performance tests on BabuDB?
We did a couple of test runs in the past, in which we measured the
average duration of insertions and lookups. BabuDB showed a performance
of up to several hundreds of thousands of such operations per second.
However, it is necessary to enable asynchronous log appends to attain
good insertion rates.
> I am planning to build a database with several million entries, how
> much memory are the indexes using?
It mainly depends on the size of your entries. If you know the approx.
number of entries and the average key and value size, you can estimate
the size of your database. The overhead added by the index structures is
rather small.
> Are the store and access times logarithmic or constant (ish)?
Since the in-memory data structure consists of red-black trees and the
index file is essentially a sorted list of entries that allows binary
search, all insertions and lookups have logarithmic access times.
> Should i split y data into several databases, or can i keep just one
> for simplicity?
The number of databases doesn't matter. However, the size of a single
index is currently limited to approx. 2 GB, as the underlying index file
is mapped into a memory buffer as a whole, and offsets in the buffer are
internally represented by a signed 32bit integer. We might remove this
limitation in the future, by splitting up the index file into multiple ones.
Hope this helps and best regards,
Jan
Are the values also kept in memory, or did i misunderstand.
In my mental model BabuDB keeps an index in memory which maps the keys
to a position in a file on disk, am i correct?
-atle
> Are the values also kept in memory, or did i misunderstand.
>
> In my mental model BabuDB keeps an index in memory which maps the keys
> to a position in a file on disk, am i correct?
Not quite... ;)
BabuDB follows the concept of Log-structured Merge Trees (LSM-trees).
This means that each BabuDB index consists of a rather small 'in-memory
tree', i.e. a tree-like data structure that resides in the main memory,
and a potentially huge read-only file called 'on-disk index', which is
persistently stored on disk, as the name suggests.
The on-disk index contains the latest checkpoint of the (BabuDB) index,
i.e. all entries that were contained in the index when the checkpoint
was created. The in-memory tree contains all changes (i.e. insertions
and deletions of entries) to the index that have been made since the
last checkpoint was created.
In addition, a global log file records all changes to all indices, to
allow for crash recovery. As soon as the log file becomes too large, a
new checkpoint is created for each index; this means that the current
on-disk index is merged with the in-memory tree and replaced by the merger.
To retrieve an entry from an index, a lookup is first performed on the
in-memory tree; if the key is not found, another lookup is performed on
the on-disk index. Since the on-disk index size may exceed the capacity
of the main memory, we simply mmap the file and leave the memory
management to the operating system.
That's essentially how BabuDB works.
Thanks, that helped a lot :) I expected that my view was a bit too
naive.
I wonder why you guys don't push the java portion of this framework
more. BabuDB seems to be solid, and a google search shows that there is
a huge gap between supply and demand of in-process, simple key-value
stores.