Cabinet HashDB growing exponentially

Jonas Öberg

Nov 2, 2014, 2:17:03 AM
to tokyocabi...@googlegroups.com
Hi,

I'm working on a project called Elog.io, where we're storing 256-bit image hashes in a Kyoto Cabinet HashDB. The hashes are perceptual hashes generated from the images according to the algorithm described in https://github.com/commonsmachinery/blockhash-rfc/blob/master/draft-commonsmachinery-urn-blockhash-00.txt. Here's a sample of 10 hashes:

ff3fffffffffffff7fff0fff03ff006400000000000007ff3ccbb80080000000
57dffffedff81ff8fff898787b780f7018103c1ee403c203c003c003b0033701
ffffff7fffffffffff1fff1ffe4fce2780118305f00184c780018001ffffffff
c00040007e0039000f001f004fc01ff93ff07ff1fff9fff1ffc0ff81ff80ff80
000007f80ffc1ffe3ffe3ffffffeaffe2ffe00380bf801f8017801c800f00000
8000800001f80607f003f807f767f7e9f5c1b821d80bc261fa81fe01ffffffff
fff8ffe77ecdff85e003c00ca00f901fe01fc067c03fa11931f1f1e0eca04411
fffcfbfcfb88fc00fdc0bfa0bf807f80fff043b045902394200420043c621f7e
e000ef803ff07ef058e049c0f1c0e070e0fcf0c24042806780f7ef7feb3bff7b
fffff93fa03ff81ff81ff20fe01fa01f121f101f043f041f001f400f001f801f

A problem we've encountered is that, for our data, the HashDB size on disk grows exponentially with the number of records. With a mere 40,000 records, we're already close to 400 MB.
The problem seems to be specific to our data: if we use purely random data, the growth is linear in the number of records.

What we've done so far to optimise is to switch from HashDB to TreeDB; we load 100k works at a time, with a defrag after each load. This seems to work more or less, but we're still seeing very large database sizes.
For tuning, we set TLINEAR and call tune_buckets with 200% of the maximum number of works we expect.
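In C++ terms, the setup is roughly the following (a simplified sketch; the constant and file name are placeholders, our real code differs):

    #include <kctreedb.h>
    using namespace kyotocabinet;

    // Placeholder for the maximum number of works we expect to store.
    const int64_t EXPECTED_WORKS = 10000000;

    TreeDB db;
    db.tune_options(TreeDB::TLINEAR);
    db.tune_buckets(EXPECTED_WORKS * 2);  // 200% of the expected maximum
    db.open("hashes.kct", TreeDB::OWRITER | TreeDB::OCREATE);

    // ... db.set(hash, value) for a batch of ~100k works ...

    db.defrag(0);  // defragment after each batch; 0 processes the whole region
    db.close();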

Any thoughts on this would be greatly appreciated.

Sincerely,
Jonas

Sven Hartrumpf

Nov 2, 2014, 5:34:27 AM
to tokyocabi...@googlegroups.com
Hi Jonas.

Sun, 2 Nov 2014 00:17:03 -0700 (PDT), joberg wrote:
> A problem we've encountered is that, for our data, the HashDB size on disk
> grows exponentially with the number of records. With a mere 40,000 records,
> we're already close to 400 MB.
> The problem seems to be specific to our data: if we use purely random data,
> the growth is linear in the number of records.

So you probably need your own hash function if you want to work with HashDB
efficiently, but that does not seem to be possible without patching the source.
An alternative experiment would be to encode the keys in base-64, base-62,
or something similar instead of hexadecimal.
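Going one step further in the same direction, you could pack each 64-character hex hash into 32 raw bytes before using it as a key, which halves the key size. An untested sketch in C++:

    #include <cstdlib>
    #include <string>

    // Pack a hexadecimal hash string into raw bytes (two hex digits per byte),
    // so a 64-character hash becomes a 32-byte key.
    std::string hex_to_raw(const std::string& hex) {
      std::string raw;
      raw.reserve(hex.size() / 2);
      for (size_t i = 0; i + 1 < hex.size(); i += 2) {
        raw.push_back(static_cast<char>(
            std::strtol(hex.substr(i, 2).c_str(), NULL, 16)));
      }
      return raw;
    }

    // Usage: db.set(hex_to_raw(hash), value);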

> What we've done so far to optimise is to switch from HashDB to TreeDB; we
> load 100k works at a time, with a defrag after each load. This seems to
> work more or less, but we're still seeing very large database sizes.

You only talk about keys.
What about your values? What are typical value sizes?
Depending on your answers, tune_alignment might be relevant for your case.
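Just to illustrate what I mean (an untested sketch; the alignment power here is arbitrary and the right value depends on your record sizes), the alignment is tuned before opening the database:

    #include <kchashdb.h>
    using namespace kyotocabinet;

    HashDB db;
    // Records are aligned to 2^apow bytes; a smaller power wastes less
    // padding per record when the records themselves are small.
    db.tune_alignment(2);  // example value only
    db.open("hashes.kch", HashDB::OWRITER | HashDB::OCREATE);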

Ciao
Sven