Hi,
I often get lists of strings and have to filter out previously seen / already known strings. I started with a file-based approach, but it does not scale well, so I'm trying to migrate to MapDB. But I think I'm doing something wrong...
File based approach:
I basically maintain a sorted text file of already seen strings, and whenever I get a new list I sort it first, then diff it against the big file using "comm". The diff file is processed and afterwards added to the sorted list (at the moment insertion is naive, so I have to re-sort the complete file afterwards). This works, but I need about three times the storage space of the sorted file to sort it (old sorted file, temporary data from sorting, and finally the newly created sorted file). The process also seems costly for small new lists, as "comm" has to read the complete big sorted file in the worst case.
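For reference, the diff that "comm" computes can be done in a single merge pass over the two sorted lists. This is just a sketch of that step in Java (the class and method names are my own, not part of the current setup):

```java
import java.util.ArrayList;
import java.util.List;

// "comm"-style diff: given the sorted list of known strings and a sorted
// batch of new input, return only the strings not seen before.
// One merge pass, O(n + m), no temporary files.
public class SortedDiff {
    public static List<String> unseen(List<String> known, List<String> batch) {
        List<String> result = new ArrayList<>();
        int i = 0, j = 0;
        while (j < batch.size()) {
            if (i >= known.size()) {                 // known list exhausted
                result.add(batch.get(j++));
            } else {
                int cmp = known.get(i).compareTo(batch.get(j));
                if (cmp < 0) i++;                    // known-only entry, skip
                else if (cmp == 0) { i++; j++; }     // already seen, drop
                else result.add(batch.get(j++));     // genuinely new
            }
        }
        return result;
    }

    public static void main(String[] args) {
        List<String> known = List.of("apple", "cherry", "pear");
        List<String> batch = List.of("apple", "banana", "pear", "plum");
        System.out.println(unseen(known, batch)); // [banana, plum]
    }
}
```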
MapDB based approach:
I wrote a Java program that supports two operations, "add" and "diff". Basically, one thread reads the strings from the input file, hashes them, and based on the hash prefix puts each string into one of 15 LinkedBlockingQueues. Each LinkedBlockingQueue is assigned to a separate thread that, depending on the operation, adds the string to the MapDB or checks whether it is already contained. The approach basically works and the speed is OK (about 500k strings per second), but the DB files (each LinkedBlockingQueue thread creates a separate MapDB) need about 4 times the storage of the text file.
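The routing scheme described above could be sketched like this (a minimal, self-contained version: ConcurrentSkipListSet stands in for the per-partition MapDB treeSet, and the partition count and poison-pill shutdown are my own assumptions):

```java
import java.util.ArrayList;
import java.util.List;
import java.util.Set;
import java.util.concurrent.ConcurrentSkipListSet;
import java.util.concurrent.LinkedBlockingQueue;

// Sketch of the partitioning scheme: the reader thread hashes each string and
// routes it to one of N queues; one worker per queue drains its queue into
// that partition's set. In-memory sets stand in for the per-partition MapDB
// treeSets to keep the example runnable.
public class PartitionedAdd {
    static final int PARTITIONS = 15;            // one queue/DB per partition
    static final String POISON = "\u0000STOP";   // shutdown marker

    static int partitionOf(String s) {
        return Math.floorMod(s.hashCode(), PARTITIONS);
    }

    public static List<Set<String>> addAll(List<String> input) throws InterruptedException {
        List<LinkedBlockingQueue<String>> queues = new ArrayList<>();
        List<Set<String>> sets = new ArrayList<>();
        List<Thread> workers = new ArrayList<>();
        for (int p = 0; p < PARTITIONS; p++) {
            LinkedBlockingQueue<String> q = new LinkedBlockingQueue<>();
            Set<String> set = new ConcurrentSkipListSet<>();
            queues.add(q);
            sets.add(set);
            Thread w = new Thread(() -> {
                try {
                    for (String s; !(s = q.take()).equals(POISON); ) {
                        set.add(s);              // "add"; a "diff" would call contains()
                    }
                } catch (InterruptedException ignored) { }
            });
            w.start();
            workers.add(w);
        }
        for (String s : input) {                 // reader: route by hash
            queues.get(partitionOf(s)).put(s);
        }
        for (LinkedBlockingQueue<String> q : queues) q.put(POISON);
        for (Thread w : workers) w.join();
        return sets;
    }

    public static void main(String[] args) throws InterruptedException {
        List<Set<String>> sets = addAll(List.of("foo", "bar", "baz", "foo"));
        System.out.println(sets.stream().mapToInt(Set::size).sum()); // 3 distinct strings
    }
}
```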
Each thread creates a file-based DB and a treeSet:
db = DBMaker
.fileDB("testDB_" + name)
.fileMmapEnable()
.allocateIncrement(512 * 1024 * 1024)
.make();
treeSet = (NavigableSet<String>) db
    .treeSet("treeSet_" + name)
    .maxNodeSize(32) // maxNodeSize requires an int argument; 32 here
    .serializer(new SerializerCompressionWrapper<>(Serializer.STRING))
    .createOrOpen();
Any tips on how I can improve speed and minimize the required storage space? The strings are usually quite short (10-25 chars, ASCII).
Thank you!
Best regards
Karl Trummer