Can you batch things up so that you test, say, 1000 keys using Get
operations, and then commit a WriteBatch with 1000 keys at a time? I
suspect that will be somewhat faster (comparing the 'fillseq' vs.
'fillbatch' performance in the ./db_bench benchmark with one-byte
values shows it to be about 20% faster on my machine).
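For reference, here's a rough (untested) sketch of that pattern; the
InsertInBatches() helper and the parallel keys/values vectors are just
made up for illustration:

#include <string>
#include <vector>
#include "leveldb/db.h"
#include "leveldb/write_batch.h"

// Untested sketch: probe each key with Get(), queue the missing ones in
// a WriteBatch, and commit the batch every 1000 keys.
void InsertInBatches(leveldb::DB* db,
                     const std::vector<std::string>& keys,
                     const std::vector<std::string>& values) {
  const size_t kBatchSize = 1000;
  leveldb::WriteBatch batch;
  size_t pending = 0;
  for (size_t i = 0; i < keys.size(); i++) {
    std::string unused;
    leveldb::Status s = db->Get(leveldb::ReadOptions(), keys[i], &unused);
    if (s.IsNotFound()) {
      batch.Put(keys[i], values[i]);
      if (++pending == kBatchSize) {
        db->Write(leveldb::WriteOptions(), &batch);  // one commit per 1000 keys
        batch.Clear();
        pending = 0;
      }
    }
  }
  if (pending > 0) {
    db->Write(leveldb::WriteOptions(), &batch);  // commit the remainder
  }
}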
It might also be interesting to compare the performance you're getting
with the performance of just doing the 10M inserts without the
verifying Get operations.
Up to fairly large database sizes, it's probably going to be faster to
do a sequential scan over ranges of the database rather than 10M
individual Get operations. For example, using db_bench, I simulated
this approach via this command:
% ./db_bench --value_size=1 --cache_size=256000000 --num=10000000
--benchmarks=fillseq,readseq,readrandom
LevelDB: version 1.2
Date: Thu Jul 7 14:44:19 2011
CPU: 4 * Intel(R) Core(TM)2 Quad CPU Q6600 @ 2.40GHz
CPUCache: 4096 KB
Keys: 16 bytes each
Values: 1 bytes each (1 bytes after compression)
Entries: 10000000
RawSize: 162.1 MB (estimated)
FileSize: 157.4 MB (estimated)
------------------------------------------------
fillseq : 1.248 micros/op; 13.0 MB/s
readseq : 0.277 micros/op; 58.5 MB/s
readrandom : 4.083 micros/op;
From this, you can see that the readrandom operations (each one
essentially a Get operation) on a database of 10M 16-byte keys with
1-byte values take about 4.083 microseconds/key, while iterating
sequentially over the whole database costs only 0.277
microseconds/key. Since 4.083 / 0.277 is roughly 15, if your database
is less than ~15x the size of the new set of keys you're inserting,
you're better off doing a sequential scan rather than individual Get
operations. Note that you don't have to do a full scan over the
database: you should be able to create an Iterator on the database and
then step through its keys in parallel with the (sorted) new keys you
want to add (rough sketch below).
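Something like this (untested sketch; FindExisting() is a made-up
name, and it assumes the default bytewise comparator, with the new
keys kept sorted by the std::set):

#include <set>
#include <string>
#include "leveldb/db.h"

// Untested sketch: walk the database iterator and the sorted candidate
// keys in lockstep, collecting the candidates that already exist.
std::set<std::string> FindExisting(leveldb::DB* db,
                                   const std::set<std::string>& candidates) {
  std::set<std::string> existing;
  leveldb::Iterator* it = db->NewIterator(leveldb::ReadOptions());
  it->SeekToFirst();
  for (const std::string& key : candidates) {
    // Advance until the iterator is at or past the candidate key.
    while (it->Valid() && it->key().compare(key) < 0) {
      it->Next();
    }
    if (!it->Valid()) break;           // ran off the end of the database
    if (it->key().compare(key) == 0) {
      existing.insert(key);            // candidate is already present
    }
  }
  delete it;
  return existing;
}

The iterator only ever moves forward, so the whole check costs one
sequential pass over the database no matter how many candidate keys
there are.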
Please report back here if any of these suggestions make a difference.
-Jeff
There's also an optimization to check if your key exists by using an
iterator instead of the call to Get(). Here's an example of it in use:
https://github.com/basho/eleveldb/blob/master/c_src/eleveldb.cc#L272
Not sure how much of an impact that overhead has in your case with
such small values, but it might be an acceptable middle ground if you
do need the existence check.
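For anyone who doesn't want to dig through eleveldb, the trick amounts
to something like this (untested sketch; KeyExists() is a made-up
name): Seek() positions the iterator at the first key >= the target,
so comparing keys answers the existence question without Get() copying
the value into a std::string.

#include "leveldb/db.h"

// Untested sketch of the iterator-based existence check: Seek() lands
// on the first key >= the target, so a key comparison is enough.
bool KeyExists(leveldb::DB* db, const leveldb::Slice& key) {
  leveldb::Iterator* it = db->NewIterator(leveldb::ReadOptions());
  it->Seek(key);
  bool found = it->Valid() && it->key().compare(key) == 0;
  delete it;
  return found;
}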
Thank Dave Smith, that's who I stole it from. :D
Just wanted to let everyone know that I've completed the first set of
benchmarks and LevelDB solidly outperforms Berkeley DB for larger
databases. I've enabled Snappy and TCMalloc, switched to batched
writes, and added the iterator existence check mentioned above.
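In case it helps anyone else, the Snappy part is just an option at
open time (untested sketch; OpenWithSnappy() is a made-up name, and as
I understand it Snappy is already the default when LevelDB is built
with the library available):

#include <string>
#include "leveldb/db.h"
#include "leveldb/options.h"

// Untested sketch: open (or create) a database with Snappy compression.
leveldb::DB* OpenWithSnappy(const std::string& path) {
  leveldb::Options options;
  options.create_if_missing = true;
  options.compression = leveldb::kSnappyCompression;  // needs Snappy at build time
  leveldb::DB* db = nullptr;
  leveldb::Status s = leveldb::DB::Open(options, path, &db);
  return s.ok() ? db : nullptr;
}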
Thanks for your help!