bwtzip: Linear Time Sorting, Now With Chunking Support

Stephan T. Lavavej
Jan 29, 2003, 8:06:12 AM

bwtzip, my portable linear-time BWT compressor distributed under the
GNU GPL, has been updated: http://stl.caltech.edu/bwtzip.html

I have new comparisons of its compression efficiency against
traditional compressors, using my Large English Text Corpus (80.8 MB),
Large Source Code Corpus (273.5 MB), and Human Genome Corpus (3.13
GB); as before, bwtzip handily beats everything else.

I also have a FAQ answering some common questions.

I have received a report that, with minor Makefile changes, bwtzip
successfully compiles and runs under Linux. I develop it on Win32
using MinGW. (N.B. DJGPP should not be used to compile bwtzip.)

What's new in bwtzip since last I posted to comp.compression about it?

My implementation of the Ukkonen linear-time suffix sorting algorithm
has been greatly improved. This algorithm, if you recall, is the
shining jewel of bwtzip: it guarantees linear-time performance on all
input data and never behaves pathologically.
* The Depth First Search is now iterative rather than recursive. This
prevents the stack from being blown when you try to compress a
megabyte of zeros. (A sketch of the idea appears after this list.)
* Dummy SuffixNodes for leaves are no longer created. Also, I no
longer use std::list in TransitionList; instead, I use std::vector,
which uses space much more efficiently. bwtzip's memory usage is now
down to about 53 bytes per input byte; previously, it was 100. The
theoretical best for this algorithm is 20.
* The treatment of the "sanitized text" in the Ukkonen algorithm is
now done directly, using something I call the "disgusting pointer
trick".
* The sanitized text is now initialized in a separate function.
Indeed, the suffix tree constructor is now much simpler. It used to do
the entire Burrows-Wheeler Transformation; now it simply constructs
the suffix tree. The Depth First Search and the conversion from
lengths of sorted suffixes to the Burrows-Wheeler Transformed text are
now done in separate functions. (That conversion is also sketched
after this list.)
* Suffix tree construction is now faster thanks to std::vector's
reserve() ability.
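
To make the iterative Depth First Search concrete, here is a minimal
sketch. The node layout and the names (Node, sorted_suffix_lengths)
are made up for illustration rather than taken from the bwtzip
sources, but the technique is the one described above: an explicit
std::stack on the heap replaces the call stack, so a path-shaped tree
(which is what a megabyte of zeros produces) cannot overflow anything,
and the transitions live in a std::vector rather than a std::list.

    // Sketch only: a simplified suffix tree node with its transitions
    // held in a std::vector.
    #include <cstddef>
    #include <stack>
    #include <vector>

    struct Node {
        std::vector<Node *> children;  // lexicographically ordered transitions
        std::size_t suffix_length;     // meaningful only at leaves
        bool is_leaf() const { return children.empty(); }
    };

    // Collect the lengths of all suffixes in lexicographic order without
    // recursion; the explicit stack lives on the heap, so a very deep tree
    // cannot blow the call stack.
    std::vector<std::size_t> sorted_suffix_lengths(const Node *root) {
        std::vector<std::size_t> lengths;
        std::stack<const Node *> todo;
        todo.push(root);
        while (!todo.empty()) {
            const Node *n = todo.top();
            todo.pop();
            if (n->is_leaf()) {
                lengths.push_back(n->suffix_length);
            } else {
                // Push children in reverse so the lexicographically
                // smallest child is visited first.
                for (std::size_t i = n->children.size(); i-- > 0; ) {
                    todo.push(n->children[i]);
                }
            }
        }
        return lengths;
    }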
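
And here is what the separate lengths-to-BWT step might look like,
again as a sketch with a made-up name (lengths_to_bwt) and glossing
over how bwtzip really records its sentinel: a suffix of length len in
a sentinel-terminated text of length n starts at position n - len, and
its Burrows-Wheeler character is the character immediately before that
position, wrapping around for the full-length suffix.

    // Sketch only: turn the lengths of the sorted suffixes (as produced by
    // the DFS above, or by the suffix array code) into the BWT output.
    #include <cstddef>
    #include <vector>

    std::vector<unsigned char> lengths_to_bwt(
        const std::vector<unsigned char> &text,           // sentinel-terminated
        const std::vector<std::size_t> &sorted_lengths) { // one per suffix
        const std::size_t n = text.size();
        std::vector<unsigned char> bwt;
        bwt.reserve(n);  // the same reserve() ability mentioned above
        for (std::size_t i = 0; i < n; ++i) {
            const std::size_t start = n - sorted_lengths[i];
            // The BWT character precedes the suffix; the full-length suffix
            // wraps around to the sentinel at the end of the text.
            bwt.push_back(start == 0 ? text[n - 1] : text[start - 1]);
        }
        return bwt;
    }

Both the suffix tree route and the suffix array route end at exactly
this step, which is why their output is identical.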

As for the other algorithms:
* Time Stamp(0) is now in the codebase. It results in terrible
efficiency, so it is not part of bwtzip itself.
* I have finally implemented MTF-1 (sketched after this list).
* My original "Bcell" codebase is not used in bwtzip; the code is much
cleaner now.
* The Manber-Myers suffix array code now works with sentinels (instead
of the "disgusting doubling trick") and produces the exact same output
as the suffix tree code. A single decompressor (bwtunzip) works no
matter which algorithm was used for the forward transformation. (The
sentinel idea is illustrated after this list.)
* The suffix array code now closely parallels the suffix tree code. In
particular, they both seek to generate the lengths of sorted suffixes,
and the code from that point on is identical.
* The Huffman code has been simplified and now engages in more
bit-twiddling to be faster.
* Huffman overhead is now a constant 256 bytes; formerly, it was 256 +
8 bytes.
* Similarly, the arithmetic encoder no longer encodes the size of the
original text. It is now fully streaming and indicates the end of the
text by encoding a sentinel.
* The Makefile for bwtzip no longer sucks. Well, not quite as much as
before.
* I have added a naive STL sort and a Bentley-Sedgewick Tripartition
Quicksort to the codebase (both sketched after this list).
* make fast_bwtzips is available, which uses -fprofile-arcs and
-fbranch-probabilities to further optimize the executables.
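
A few of these are easy to sketch. MTF-1, as usually described (the
Balkenhol/Kurtz variant), differs from plain move-to-front in that
only a symbol found in the second position is promoted to the front;
anything deeper merely moves up to second place. Here is a
self-contained illustration of that rule (a sketch with a made-up
name, mtf1_encode, not code lifted from the bwtzip sources):

    // Sketch of MTF-1 encoding over the byte alphabet.
    #include <algorithm>
    #include <cstddef>
    #include <numeric>
    #include <vector>

    std::vector<unsigned char> mtf1_encode(const std::vector<unsigned char> &in) {
        std::vector<unsigned char> alphabet(256);
        std::iota(alphabet.begin(), alphabet.end(), 0);  // 0, 1, ..., 255
        std::vector<unsigned char> out;
        out.reserve(in.size());
        for (std::size_t k = 0; k < in.size(); ++k) {
            const unsigned char c = in[k];
            const std::size_t pos = static_cast<std::size_t>(
                std::find(alphabet.begin(), alphabet.end(), c) - alphabet.begin());
            out.push_back(static_cast<unsigned char>(pos));
            if (pos == 1) {
                // Second place gets promoted all the way to the front.
                alphabet.erase(alphabet.begin() + pos);
                alphabet.insert(alphabet.begin(), c);
            } else if (pos > 1) {
                // Deeper symbols only move up to second place, which keeps a
                // run-dominating front symbol from being displaced by one-off
                // intruders.
                alphabet.erase(alphabet.begin() + pos);
                alphabet.insert(alphabet.begin() + 1, c);
            }
            // pos == 0: already at the front, nothing moves.
        }
        return out;
    }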
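
The sentinel point deserves a tiny illustration of its own: once a
unique, smallest sentinel is appended, sorting suffixes gives the same
ordering as sorting rotations, so the suffix tree, the suffix array,
and even a naive STL sort all produce one and the same Burrows-Wheeler
output for bwtunzip to undo. The sketch below (made-up name naive_bwt)
pretends byte 0 never occurs in the input so it can double as the
sentinel; that is a simplification for illustration only, not how the
real code handles it.

    // Sketch of a naive STL-sort BWT over a sentinel-terminated text.
    #include <algorithm>
    #include <cstddef>
    #include <numeric>
    #include <vector>

    std::vector<unsigned char> naive_bwt(std::vector<unsigned char> text) {
        text.push_back(0);  // unique, smallest sentinel (assumed absent from input)
        const std::size_t n = text.size();
        std::vector<std::size_t> sa(n);
        std::iota(sa.begin(), sa.end(), std::size_t(0));
        // Sort suffix start positions lexicographically; the sentinel ensures
        // no suffix is a prefix of another, so this matches sorting rotations.
        std::sort(sa.begin(), sa.end(), [&text](std::size_t a, std::size_t b) {
            return std::lexicographical_compare(text.begin() + a, text.end(),
                                                text.begin() + b, text.end());
        });
        std::vector<unsigned char> bwt(n);
        for (std::size_t i = 0; i < n; ++i) {
            // Character preceding each sorted suffix, wrapping for the first.
            bwt[i] = (sa[i] == 0) ? text[n - 1] : text[sa[i] - 1];
        }
        return bwt;
    }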
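
And a sketch of the Bentley-Sedgewick idea applied to the same suffix
positions: tripartition on a single character, and only the bucket
that matched the pivot advances to the next character. It could stand
in for the std::sort call in the previous sketch; unlike the Ukkonen
tree, neither of these sorts carries a linear-time guarantee. The name
multikey_qsort and the layout are illustrative, not bwtzip's own.

    // Sketch of Bentley-Sedgewick ternary-partition (multikey) quicksort on
    // suffix start positions of a sentinel-terminated text.
    #include <cstddef>
    #include <utility>
    #include <vector>

    // Sort sa[lo, hi) by the characters of each suffix at 'depth' and beyond.
    // The unique sentinel guarantees distinct suffixes differ before either
    // runs off the end of the text.
    void multikey_qsort(const std::vector<unsigned char> &text,
                        std::vector<std::size_t> &sa,
                        std::size_t lo, std::size_t hi, std::size_t depth) {
        if (hi - lo < 2) {
            return;
        }
        const unsigned char pivot = text[sa[lo + (hi - lo) / 2] + depth];
        std::size_t lt = lo, i = lo, gt = hi;  // [lo,lt) <, [lt,i) ==, [gt,hi) >
        while (i < gt) {
            const unsigned char c = text[sa[i] + depth];
            if (c < pivot) {
                std::swap(sa[lt++], sa[i++]);
            } else if (c > pivot) {
                std::swap(sa[i], sa[--gt]);
            } else {
                ++i;
            }
        }
        multikey_qsort(text, sa, lo, lt, depth);      // strictly smaller bucket
        multikey_qsort(text, sa, lt, gt, depth + 1);  // equal bucket: next char
        multikey_qsort(text, sa, gt, hi, depth);      // strictly larger bucket
    }

It would be called as multikey_qsort(text, sa, 0, sa.size(), 0) in
place of the std::sort call above.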

The largest user-visible change is CHUNKING SUPPORT. bwtzip can now
work on arbitrarily large (dozens of gigabytes, what have you) files.
This is done carefully; it works even on systems where the C I/O
functions that deal with file sizes use 32-bit integers (like my
system). The file format (such as it is) for bwtzip is still not
robust, but at least I'm not hoovering the entire file into memory
anymore.

Chunk size may be controlled from the command line. It defaults to 5
megabytes. If you use a chunk size too large for your system, bwtzip
will begin swapping to disk and the performance will simply die. Don't
use a chunk size too large for your system.
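
For a concrete picture of what chunking buys, here is a rough sketch
of the driver loop with made-up names (compress_file, compress_chunk);
bwtzip's real driver and on-disk format differ. Read one chunk at a
time with fread() until end of file, compress it, write it out,
repeat. Because the total file size is never queried, 32-bit file-size
functions never enter the picture, and only one chunk is in memory at
a time.

    // Sketch of chunked compression plumbing. Names are illustrative.
    #include <cstddef>
    #include <cstdio>
    #include <cstdlib>
    #include <vector>

    // Stand-in for the real compressor: here it just passes the chunk
    // through so the plumbing can be exercised on its own.
    std::vector<unsigned char> compress_chunk(const std::vector<unsigned char> &chunk) {
        return chunk;
    }

    void compress_file(const char *src, const char *dst, std::size_t chunk_size) {
        std::FILE *in = std::fopen(src, "rb");
        std::FILE *out = std::fopen(dst, "wb");
        if (!in || !out) {
            std::exit(EXIT_FAILURE);
        }
        std::vector<unsigned char> chunk(chunk_size);
        for (;;) {
            const std::size_t got = std::fread(&chunk[0], 1, chunk_size, in);
            if (got == 0) {
                break;  // end of file (or a read error); stop either way
            }
            chunk.resize(got);  // the final chunk is usually short
            const std::vector<unsigned char> packed = compress_chunk(chunk);
            std::fwrite(&packed[0], 1, packed.size(), out);
            chunk.resize(chunk_size);
        }
        std::fclose(in);
        std::fclose(out);
    }

The 5-megabyte default is then just the initial value of chunk_size,
overridden from the command line. A real format also has to mark chunk
boundaries somehow, which this sketch glosses over.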

How's bwtzip's performance? The last time I checked, I was about 7.5x
slower than bzip2. There is no inherent reason why bwtzip should be
any more than 2x slower than bzip2. (The arithmetic coder may end up
being the bottleneck.)

Comments are welcome.

Stephan T. Lavavej
