The basic output of the sorter would be the BWT'ed input file, in
either the L vector or other suitable format. It would be fairly easy
to graft on a post-transform encoder of any type, or use the output for
other purposes.
I have a rough design, and a bit of code written in C++ to implement
part of the data structure. If I can find some more people to work on
it, we can start an open source project to implement the different
parts. It wouldn't be excessively difficult, just a bit of twiddling
with skip lists and packed bit structures.
There are some references which describe the newer algorithms on my
site (see the last two, by Hon and Sadakane):
http://willets.org/fti.html.
Of course if someone has already done something like this I'd be
grateful for any pointers. There is some older code available for less
efficient algorithms, but I haven't seen any code based on the new
stuff.
Regards,
Kendall Willets
firs...@lastname.org
A short list:
MSufSort http://www.michael-maniscalco.com/msufsort.html
DivSufSort http://homepage3.nifty.com/wpage/
DeepShallow www.mfn.unipmn.it/~manzini/lightweight/
GRZipII http://magicssoft.ru/?folder=projects&page=GRZipII
All of these are very fast and efficient and all have source
code available. However, you might want to decide on a strict
set of qualifiers since various algorithms have their own
individual short comings. Be it memory usage, catastrophic
worst case performance, etc.
Yuta Mori has a recent comparison table on a wide variety of
test sets at
http://translate.google.com/translate?hl=en&sl=ja&u=http://homepage3.nifty.com/wpage/&prev=/search%3Fq%3Dhttp://homepage3.nifty.com/wpage/%26hl%3Den%26lr%3D%26sa%3DG
Also, I would suggest that not just the BWT be produced but also
an option for the suffix array itself since there are now compression
algorithms that require more than the BWT to compress the result
better.
For example, my M03 requires the match length between each suffix
and the next highest sorted suffix to encode correctly. So, the
BWT itself would not be enough in this case.
What would be the performance requirements in terms of
max memory usage, worst case performance etc.?
- Michael A Maniscalco
- Michael
It seems like a decent sorter could be built with n log n performance
in 1n or less space. In real world terms I'd estimate around 1k
instructions per character, give or take an order of magnitude. Most
of the work is packing and unpacking small integers.
- Michael
Don't let yourself get tangled strictly in the
mathematics in this case. There are a lot of
this that affect the performance of a suffix sorter
on the modern machine. Cache performance is pivotal
in performance. You should not worry about instructions
per symbol but how the algorithm in question works
on the modern machine.
A good example is QSufSort which is academically sound
but suffers huge performance issues due to its cache
unfriendliness.
More specifically, I might trade an algorithm that
has 4 times as many ops per symbol for one that
has 1 per symbol if the 4 per symbol is significantly
more cache friendly because it might well work faster.
You need to decide what you want here. It is speed,
memory usage, simplicity, cache friendly or academically
best.
And I don't think you will find your answers in a paper
or someone would have already come up with them. ... (^:
- Michael A Maniscalco