Hi all,
this is the first time I write in this newsgroup and I am a novice in data compression. I would like to share some crucial points I have encountered during the development of a second stage after the LZW method, as Deflate does with LZSS. Though this direction is in fact outdated by modern compressors, it is also poor covered for historical reason. The executable of my work and all the sources are available on my site
http://digilander.libero.it/borsa77
in the 'Most bastard dbm' package under the name Bolt, they are embodied with another program but the total size remains acceptable.
The natural choice for an entropic encoder falls on Huffman tree, but the Deflate algorithm limits the range of back pointers to a manageable value for the sake of speed, and also it uses a static model for the same reason (Okumura & Yoshizaki). It was clear that cover an entire LZW dictionary in such a manner becomes impossible, and a better solution must be found. Miller & Wegman proposed something called phasing: reserving the symbols under the size of 10 bit and (for what I understand) reducing others via Pascal triangle. Moffat instead makes use of a fast heap structure on all the dictionary, it sounds attractive to me, but he lacks the source.
Finally I found a tool that is supposed to work in a similar way: the Splay tree (Jones). At the time it was applied on the ASCII set directly on the input stream, so I have extended it for the LZW output hoping that is been able to retain its good speed performance. Some comparison with Info-Zip shows that in the Dos environment Bolt in compression time is 3/2x worse, where in Windows is 2x. I consider these results not so bad, mainly because I think most of the difference could be due to optimization.
In respect to compression ratio the situation does not change, Bolt always pays some (too) percent to Zip but for the file Kennedy.xls in the Calgary Corpus it achieves a minor size, the reason might be that Splay method is dynamic and it takes advantage on data with strong local correlation. As a final note I have to say that, like stated in the literature, on text files LZW with the standard variable code size works well than with Splay tree.
If someone wish to experiment with Bolt, I summarize here the various methods:
1 - LZRW1/KH with a more conservative search table and fixed for a misunderstated 32-bit operation (this method is not actual related to the post but it is here for confrontation);
2 - LZW with a dictionary of 4 Kb and variable code size;
3 - LZW with a dictionary of 16 Kb and variable code size;
4 - method 2 plus Splay tree;
5 - method 3 plus Splay tree.
My implementation of the pure LZW includes a couple of unusual feature:
a) methods 2 and 3 reset the dictionary with a MRU deletion criterion;
b) the phrases length is limited but instead to rely on a prefixed value known both to the compressor and decompressor it is signalled with an escape symbol that, in spite of a degradation of the compression size, allows a more open approach and in case of one lookahead selection (Storer, or in another term flexible parsing) avoids the proliferation of already seen phrases and the preservation of the decompressor speed.
I will appreciate any comments in the spirit of improvement.
Regards, Marco Borsari.