Building a better block sorter

3rd...@willets.org

unread,

Jul 26, 2005, 12:26:50 AM7/26/05

to

I'm wondering if anyone wants to work on a better block sorter, using
some of the more recently published algorithms which work in memory
comparable to the compressed size. The main problem with block or
suffix sorting has been the heavy space requirement, often far more
than the compressed output will occupy, and the inefficiency of the
sorting process itself.

The basic output of the sorter would be the BWT'ed input file, in
either the L vector or other suitable format. It would be fairly easy
to graft on a post-transform encoder of any type, or use the output for
other purposes.

I have a rough design, and a bit of code written in C++ to implement
part of the data structure. If I can find some more people to work on
it, we can start an open source project to implement the different
parts. It wouldn't be excessively difficult, just a bit of twiddling
with skip lists and packed bit structures.

There are some references which describe the newer algorithms on my
site (see the last two, by Hon and Sadakane):
http://willets.org/fti.html.

Of course if someone has already done something like this I'd be
grateful for any pointers. There is some older code available for less
efficient algorithms, but I haven't seen any code based on the new
stuff.

Regards,

Kendall Willets
firs...@lastname.org

michael

unread,

Jul 26, 2005, 10:13:33 AM7/26/05

to

Hi Kendall,
There are quite a few good implementations with source code
around now.

A short list:

MSufSort http://www.michael-maniscalco.com/msufsort.html
DivSufSort http://homepage3.nifty.com/wpage/
DeepShallow www.mfn.unipmn.it/~manzini/lightweight/
GRZipII http://magicssoft.ru/?folder=projects&page=GRZipII

All of these are very fast and efficient and all have source
code available. However, you might want to decide on a strict
set of qualifiers since various algorithms have their own
individual short comings. Be it memory usage, catastrophic
worst case performance, etc.

Yuta Mori has a recent comparison table on a wide variety of
test sets at
http://translate.google.com/translate?hl=en&sl=ja&u=http://homepage3.nifty.com/wpage/&prev=/search%3Fq%3Dhttp://homepage3.nifty.com/wpage/%26hl%3Den%26lr%3D%26sa%3DG

Also, I would suggest that not just the BWT be produced but also
an option for the suffix array itself since there are now compression
algorithms that require more than the BWT to compress the result
better.
For example, my M03 requires the match length between each suffix
and the next highest sorted suffix to encode correctly. So, the
BWT itself would not be enough in this case.

What would be the performance requirements in terms of
max memory usage, worst case performance etc.?

- Michael A Maniscalco

michael

unread,

Jul 26, 2005, 10:20:54 AM7/26/05

to

Sorry, the link for the results table is
http://homepage3.nifty.com/wpage/junk/_table/20050610_time.html

- Michael

3rd...@willets.org

unread,

Jul 26, 2005, 2:48:38 PM7/26/05

to

I've used DS and I just tried divsufsort; it seems they're both 5n for
memory. I'm surprised no one has implemented one of the 1n algorithms
from recent years. The main problem I'm having is memory, and CPU
would be second.

It seems like a decent sorter could be built with n log n performance
in 1n or less space. In real world terms I'd estimate around 1k
instructions per character, give or take an order of magnitude. Most
of the work is packing and unpacking small integers.

michael

unread,

Jul 26, 2005, 8:46:04 PM7/26/05

to

I see two problems with this. First, I very much doubt that
a 1N solution is possible or, even if it is, would be effective
it terms of speed. But more than that, memory is becomming a
more expendable resource by the minute. Time, however, is not. (^:

- Michael

3rd...@willets.org

unread,

Jul 27, 2005, 8:39:20 PM7/27/05

to

I did some estimating last night, and it looks like I'm closer to 10k
instructions per character, which seems a bit slow. I'm going to look
at some papers to see if things can be improved.

michael

unread,

Jul 27, 2005, 11:33:44 PM7/27/05

to

A word to the wise. ...

Don't let yourself get tangled strictly in the
mathematics in this case. There are a lot of
this that affect the performance of a suffix sorter
on the modern machine. Cache performance is pivotal
in performance. You should not worry about instructions
per symbol but how the algorithm in question works
on the modern machine.

A good example is QSufSort which is academically sound
but suffers huge performance issues due to its cache
unfriendliness.

More specifically, I might trade an algorithm that
has 4 times as many ops per symbol for one that
has 1 per symbol if the 4 per symbol is significantly
more cache friendly because it might well work faster.

You need to decide what you want here. It is speed,
memory usage, simplicity, cache friendly or academically
best.

And I don't think you will find your answers in a paper
or someone would have already come up with them. ... (^:

- Michael A Maniscalco