Hi Phuong,
I see.
Normally I would recommend using DefaultCharSequenceNodeFactory to reduce the memory usage of suffix trees. However, it is better at reducing memory overhead for suffix trees containing a small number of large keys than for suffix trees containing a large number of small, unrelated keys.
It works by storing pointers to different offsets within the same input string, instead of copying characters into each node. But in your case, because there are many (4 million) short and possibly unrelated input strings, there will be fewer opportunities for it to reuse common substrings.
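To illustrate the idea, here is a simplified sketch (this is my own illustration, not the library's actual code): a view over a shared backing string costs only two ints per edge, regardless of the edge's length.

    // Simplified sketch of the offset-pointer idea behind
    // DefaultCharSequenceNodeFactory (illustrative, not the library's code):
    // each node references a range of a shared backing string, so storing
    // an edge costs two ints instead of a copy of the characters.
    final class CharSequenceView implements CharSequence {
        private final CharSequence backing; // shared by many nodes
        private final int start, end;       // offsets into the backing sequence

        CharSequenceView(CharSequence backing, int start, int end) {
            this.backing = backing;
            this.start = start;
            this.end = end;
        }

        @Override public int length() { return end - start; }
        @Override public char charAt(int index) { return backing.charAt(start + index); }
        @Override public CharSequence subSequence(int from, int to) {
            return new CharSequenceView(backing, start + from, start + to);
        }
        @Override public String toString() {
            return backing.subSequence(start, end).toString();
        }
    }

With millions of short, unrelated keys there are few shared backing strings to point into, so this trick saves less.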
So indeed, the best solution for your data might be to implement your own node factory, especially one which applies compression.
If your strings are ASCII text, then DefaultByteArrayNodeFactory should reduce memory overhead, because it stores one byte per character instead of two.
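For example, a minimal sketch (assuming the usual concurrent-trees packages; adjust the imports to match your version):

    import com.googlecode.concurrenttrees.radix.node.concrete.DefaultByteArrayNodeFactory;
    import com.googlecode.concurrenttrees.suffix.ConcurrentSuffixTree;

    public class ByteArrayFactoryExample {
        public static void main(String[] args) {
            // DefaultByteArrayNodeFactory stores edge characters as byte[]
            // (one byte per character) instead of char[], roughly halving
            // edge storage for ASCII keys.
            ConcurrentSuffixTree<Integer> tree =
                    new ConcurrentSuffixTree<Integer>(new DefaultByteArrayNodeFactory());

            tree.put("banana", 1);
            tree.put("bandana", 2);

            // Prints the keys ending with "ana", e.g. [banana, bandana]
            System.out.println(tree.getKeysEndingWith("ana"));
        }
    }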
If your strings are not ASCII (say, 2 bytes per character), then a node factory implementation which can compress UTF-16 strings might give better results.
Since your strings are short, it might be difficult to achieve much compression by compressing strings individually within each node.
So one idea would be to use a compression algorithm which leverages a technique like Huffman coding, where the same Huffman table is used to compress all keys in the tree. This way a single dictionary is shared by all keys, and each node contains only the encoded symbols for the characters in that node.
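Here is a rough sketch of the shared-table idea (illustrative only: the class and method names are mine, not part of concurrent-trees, and wiring it into the tree would still require a custom Node implementation that stores the encoded bytes):

    import java.util.*;

    // Builds one Huffman table from a sample of the keys, then reuses that
    // single table to encode every key. A custom node factory would store
    // the byte[] produced by encode() inside each node.
    public class SharedHuffmanTable {

        // A node in the Huffman tree: leaves hold characters.
        private static final class HNode implements Comparable<HNode> {
            final int freq;
            final char ch;
            final HNode left, right;
            HNode(int freq, char ch, HNode left, HNode right) {
                this.freq = freq; this.ch = ch; this.left = left; this.right = right;
            }
            boolean isLeaf() { return left == null && right == null; }
            public int compareTo(HNode other) { return Integer.compare(freq, other.freq); }
        }

        private final Map<Character, String> codes = new HashMap<>();

        // Build the table once, from a representative sample of the keys.
        public SharedHuffmanTable(Iterable<String> sampleKeys) {
            Map<Character, Integer> freq = new HashMap<>();
            for (String key : sampleKeys)
                for (char c : key.toCharArray())
                    freq.merge(c, 1, Integer::sum);
            PriorityQueue<HNode> queue = new PriorityQueue<>();
            for (Map.Entry<Character, Integer> e : freq.entrySet())
                queue.add(new HNode(e.getValue(), e.getKey(), null, null));
            while (queue.size() > 1) {
                HNode a = queue.poll(), b = queue.poll();
                queue.add(new HNode(a.freq + b.freq, '\0', a, b));
            }
            assignCodes(queue.poll(), "");
        }

        private void assignCodes(HNode node, String prefix) {
            if (node == null) return;
            if (node.isLeaf()) {
                codes.put(node.ch, prefix.isEmpty() ? "0" : prefix);
            } else {
                assignCodes(node.left, prefix + '0');
                assignCodes(node.right, prefix + '1');
            }
        }

        // Encode one key against the shared table, packing the bits into bytes.
        public byte[] encode(String key) {
            StringBuilder bits = new StringBuilder();
            for (char c : key.toCharArray()) {
                String code = codes.get(c);
                if (code == null) throw new IllegalArgumentException("Character not in table: " + c);
                bits.append(code);
            }
            byte[] out = new byte[(bits.length() + 7) / 8];
            for (int i = 0; i < bits.length(); i++)
                if (bits.charAt(i) == '1')
                    out[i / 8] |= (byte) (1 << (7 - (i % 8)));
            return out;
        }
    }

One caveat: the table has to be built from a representative sample up front, because characters missing from the table cannot be encoded later.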
There might be ready-made compression algorithms out there which are tailored to the natural language of your strings. If so, then an off-the-shelf library might allow you to compress the strings easily, because that library could have a Huffman table for the target language built in, and you would not need to code it yourself.
Hope that helps!
Niall