Creating and maintaining a set of 20-100 billion strings with limited storage space, optimized for fast add/contains operations

Karl Trummer

Aug 29, 2020, 12:27:30 PM
to MapDB
Hi,

I often get lists of strings and have to filter out previously seen/already known strings. I started with a file-based approach, but it does not scale well, so I'm trying to migrate to MapDB. But I think I'm doing something wrong.

File based approach:
I basically maintain a sorted text file of already-seen strings, and whenever I get a new list I sort it first, then diff it using "comm". The diff file is processed and afterwards added to the sorted list (insertion is naive at the moment, so I have to re-sort the complete file afterwards). This works, but I need about three times the storage space of the sorted file just to sort it (old sorted file, temporary data from sorting, and the newly created sorted file). This process also seems costly when the new list is small, since "comm" has to scan the complete big sorted file in the worst case.

MapDB based approach:
I wrote a Java program that supports two operations, "add" and "diff". Basically, one thread reads the strings from the input file, hashes them, and based on the hash prefix puts each string into one of 15 LinkedBlockingQueues. Each LinkedBlockingQueue is assigned to a separate thread that, depending on the operation, adds the string to its MapDB or checks whether it is already contained. The approach basically works and speed is OK (about 500k strings per second), but the DB files (each LinkedBlockingQueue thread creates a separate MapDB) need about 4 times the storage of the text file.

Each thread creates a file-based DB and a treeSet:
db = DBMaker
        .fileDB("testDB_" + name)
        .fileMmapEnable()
        .allocateIncrement(512 * 1024 * 1024)
        .make();
treeSet = (NavigableSet<String>) db
        .treeSet("treeSet_" + name)
        .maxNodeSize(128) // takes an int; larger nodes give the compression wrapper bigger blocks to compress
        .serializer(new SerializerCompressionWrapper(Serializer.STRING))
        .createOrOpen();
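
For reference, the producer side looks roughly like this (a simplified, untested sketch; class and variable names are illustrative, I route by hash modulo here rather than by an explicit prefix, and the worker threads draining the queues are omitted):

import java.io.BufferedReader;
import java.nio.file.Files;
import java.nio.file.Paths;
import java.util.ArrayList;
import java.util.List;
import java.util.concurrent.LinkedBlockingQueue;

public class Producer {
    static final int PARTITIONS = 15;

    public static void main(String[] args) throws Exception {
        List<LinkedBlockingQueue<String>> queues = new ArrayList<>();
        for (int i = 0; i < PARTITIONS; i++) {
            queues.add(new LinkedBlockingQueue<>(10_000)); // bounded, so the reader backs off
        }
        // ... start one worker thread per queue, each owning its own MapDB ...
        try (BufferedReader in = Files.newBufferedReader(Paths.get(args[0]))) {
            for (String line; (line = in.readLine()) != null; ) {
                int bucket = (line.hashCode() & 0x7fffffff) % PARTITIONS; // route by hash
                queues.get(bucket).put(line); // blocks when the worker falls behind
            }
        }
    }
}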

Any tips on how I can improve speed and minimize the required storage space? The strings are usually quite short (10-25 chars, ASCII).


Thank you!

Best regards
Karl Trummer

alla...@UToronto.CA

Aug 30, 2020, 4:24:25 PM
to MapDB
(Minor) Try using a FileChannel instead of mmap (fileChannelEnable() instead of fileMmapEnable()).

(Major) For the diff part, instead of storing the strings, store the CRC64 of each string. To see whether a string already exists, just look up its CRC64; if it is not found, the string definitely has not been seen. CAVEAT: CRC64 collisions cause false positives (a new string whose hash matches an already-stored one is wrongly reported as seen); at around 20 billion 64-bit hashes you can expect roughly a dozen such collisions. If that's acceptable, then lookups are very quick. Also, you're only concerned with the keys and not the values, so simply store 1 byte for every key as a placeholder, or better yet use a set and avoid the byte entirely.
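
A rough sketch of what I mean (untested; crc64() here is a minimal bit-at-a-time CRC-64, but any well-distributed 64-bit hash would do, since the JDK itself only ships CRC32):

import java.nio.charset.StandardCharsets;
import java.util.NavigableSet;
import org.mapdb.DB;
import org.mapdb.DBMaker;
import org.mapdb.Serializer;

public class HashSetSketch {
    public static void main(String[] args) {
        // file-backed set of 8-byte hashes instead of full strings
        DB db = DBMaker.fileDB("seen-hashes.db")
                .fileChannelEnable() // the FileChannel variant mentioned above
                .make();
        NavigableSet<Long> seen = (NavigableSet<Long>) db
                .treeSet("seen")
                .serializer(Serializer.LONG)
                .createOrOpen();

        long h = crc64("some string".getBytes(StandardCharsets.US_ASCII));
        boolean unseen = seen.add(h); // false => hash already present ("probably seen")
        db.close();
    }

    // bit-at-a-time CRC-64 (reflected ECMA polynomial); a table-driven version would be faster
    static long crc64(byte[] data) {
        long crc = ~0L;
        for (byte b : data) {
            crc ^= (b & 0xffL);
            for (int i = 0; i < 8; i++) {
                crc = (crc >>> 1) ^ ((crc & 1L) != 0 ? 0xC96C5795D7870F42L : 0L);
            }
        }
        return ~crc;
    }
}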

Karl Trummer

Aug 31, 2020, 2:51:21 AM
to MapDB
Thank you, I will try FileChannels.

Storing only a hash value or a checksum does not work for my use case, as I cannot accept false positives (the DB claiming a string is already processed when it isn't). I think I'm already using a set, or is a treeSet not optimal?

Jan Kotek

Aug 31, 2020, 4:25:37 AM
to Karl Trummer, ma...@googlegroups.com

Hi,

The file is probably too big because of fragmentation.

Try calling DB.compact() or db.getStore().compact().

Another option is to import the data using a data sink (Data Pump) in V3:

DB.TreeMapSink sink = db
        .treeMap("map")
        .createFromSink();
sink.put(key, val); // for each entry, keys fed in ascending order

Map map = sink.create();
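
The sink consumes presorted entries and writes the tree sequentially, so the resulting file should be compact and avoid the fragmentation that incremental inserts cause.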

Regards,
Jan

Karl Trummer

Sep 3, 2020, 3:10:57 AM
to MapDB
Hi,

Thank you! Compacting the DB helped a lot. After compacting, the database needs less space than the corresponding (uncompressed) text file.

Best regards,
Karl
