Hi Phuong,
I see.
Normally I would recommend using DefaultCharSequenceNodeFactory to reduce the memory usage of suffix trees. However, it is better at reducing memory overhead for suffix trees containing a small number of large keys than for suffix trees containing a large number of small, unrelated keys.
It works by storing pointers to different offsets within the same input string, instead of copying characters into each node. But in your case, because there are many (4 million) short and possibly unrelated input strings, there will be fewer opportunities for it to reuse common substrings.
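To illustrate the idea, here is a simplified sketch (this is my own illustration, not the library's actual code): a view over a shared backing string costs only two ints per edge, regardless of the edge's length.

    // Simplified sketch of the offset-pointer idea behind
    // DefaultCharSequenceNodeFactory (illustrative, not the library's code):
    // each node references a range of a shared backing string, so storing
    // an edge costs two ints instead of a copy of the characters.
    final class CharSequenceView implements CharSequence {
        private final CharSequence backing; // shared by many nodes
        private final int start, end;       // offsets into the backing sequence

        CharSequenceView(CharSequence backing, int start, int end) {
            this.backing = backing;
            this.start = start;
            this.end = end;
        }

        @Override public int length() { return end - start; }
        @Override public char charAt(int index) { return backing.charAt(start + index); }
        @Override public CharSequence subSequence(int from, int to) {
            return new CharSequenceView(backing, start + from, start + to);
        }
        @Override public String toString() {
            return backing.subSequence(start, end).toString();
        }
    }

With millions of short, unrelated keys there are few shared backing strings to point into, so this trick saves less.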
So indeed, the best solution for your data might be to implement your own node factory, especially one which applies compression.
If your strings are ASCII text, then DefaultByteArrayNodeFactory should reduce memory overhead, because it stores one byte per character instead of two.
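For example, a minimal sketch (assuming the usual concurrent-trees packages; adjust the imports to match your version):

    import com.googlecode.concurrenttrees.radix.node.concrete.DefaultByteArrayNodeFactory;
    import com.googlecode.concurrenttrees.suffix.ConcurrentSuffixTree;

    public class ByteArrayFactoryExample {
        public static void main(String[] args) {
            // DefaultByteArrayNodeFactory stores edge characters as byte[]
            // (one byte per character) instead of char[], roughly halving
            // edge storage for ASCII keys.
            ConcurrentSuffixTree<Integer> tree =
                    new ConcurrentSuffixTree<Integer>(new DefaultByteArrayNodeFactory());

            tree.put("banana", 1);
            tree.put("bandana", 2);

            // Prints the keys ending with "ana", e.g. [banana, bandana]
            System.out.println(tree.getKeysEndingWith("ana"));
        }
    }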
If your strings are not ASCII (say, 2 bytes per character), then a node factory implementation which can compress UTF-16 strings might give better results.
Since your strings are short, it might be difficult to achieve much compression by compressing strings individually within each node.
So one idea would be to use a compression algorithm which leverages a technique like Huffman coding, where the same Huffman table is used to compress all keys in the tree. This way a single dictionary is shared by all keys, and each node contains only the encoded symbols for the characters in that node.
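Here is a rough sketch of the shared-table idea (illustrative only: the class and method names are mine, not part of concurrent-trees, and wiring it into the tree would still require a custom Node implementation that stores the encoded bytes):

    import java.util.*;

    // Builds one Huffman table from a sample of the keys, then reuses that
    // single table to encode every key. A custom node factory would store
    // the byte[] produced by encode() inside each node.
    public class SharedHuffmanTable {

        // A node in the Huffman tree: leaves hold characters.
        private static final class HNode implements Comparable<HNode> {
            final int freq;
            final char ch;
            final HNode left, right;
            HNode(int freq, char ch, HNode left, HNode right) {
                this.freq = freq; this.ch = ch; this.left = left; this.right = right;
            }
            boolean isLeaf() { return left == null && right == null; }
            public int compareTo(HNode other) { return Integer.compare(freq, other.freq); }
        }

        private final Map<Character, String> codes = new HashMap<>();

        // Build the table once, from a representative sample of the keys.
        public SharedHuffmanTable(Iterable<String> sampleKeys) {
            Map<Character, Integer> freq = new HashMap<>();
            for (String key : sampleKeys)
                for (char c : key.toCharArray())
                    freq.merge(c, 1, Integer::sum);
            PriorityQueue<HNode> queue = new PriorityQueue<>();
            for (Map.Entry<Character, Integer> e : freq.entrySet())
                queue.add(new HNode(e.getValue(), e.getKey(), null, null));
            while (queue.size() > 1) {
                HNode a = queue.poll(), b = queue.poll();
                queue.add(new HNode(a.freq + b.freq, '\0', a, b));
            }
            assignCodes(queue.poll(), "");
        }

        private void assignCodes(HNode node, String prefix) {
            if (node == null) return;
            if (node.isLeaf()) {
                codes.put(node.ch, prefix.isEmpty() ? "0" : prefix);
            } else {
                assignCodes(node.left, prefix + '0');
                assignCodes(node.right, prefix + '1');
            }
        }

        // Encode one key against the shared table, packing the bits into bytes.
        public byte[] encode(String key) {
            StringBuilder bits = new StringBuilder();
            for (char c : key.toCharArray()) {
                String code = codes.get(c);
                if (code == null) throw new IllegalArgumentException("Character not in table: " + c);
                bits.append(code);
            }
            byte[] out = new byte[(bits.length() + 7) / 8];
            for (int i = 0; i < bits.length(); i++)
                if (bits.charAt(i) == '1')
                    out[i / 8] |= (byte) (1 << (7 - (i % 8)));
            return out;
        }
    }

One caveat: the table has to be built from a representative sample up front, because characters missing from the table cannot be encoded later.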
There might be ready-made compression algorithms out there which are tailored to the natural language of your strings. If so, then an off-the-shelf library might allow you to compress the strings easily, because that library could have a Huffman table for the target language built in, and you would not need to code it yourself.
Hope that helps!
Niall