He probably assumes a virtual EOF at the end of the buffer.
I assume the memcmp won't access memory beyond the end of the buffer ?
SaSW, Willem
--
Disclaimer: I am in no way responsible for any of the statements
made in the above text. For all I know I might be
drugged or something..
No I'm not paranoid. You all think I'm paranoid, don't you !
#EOT
That's equivalent to using a terminating metacharacter. If both
strings match up to the end of the shorter one, then pick the shorter
one as the smallest. That's equivalent to saying there's a magic
terminator $ on the end of the shorter string, where $ < (char) 0-255.
This metacharacter can't be represented as a byte, so it's a real pain
to program. L vectors often have a separate integer saying where the
$ character is.
This makes the BWT equivalent to suffix sorting; the rotated portion
doesn't matter, since there are never any ties past $, and $ never
lines up with itself when comparing different suffixes/rotations.
Remember, $ is a character smaller than all other characters.
So ab$fananan is lexically smaller than ababab$fab.
More generally, ab$<anything> is lexically smaller than ab<anything>.
So, with the addition of the EOF character, you can always stop comparing
before there is a wraparound.
Of course they will be different, there is a (virtual) $ in there after
all. The first L will not be XXY, it will be XXY$
For example, in the case of XYX$, you get:
$XYX
X$XY
XYX$
YX$X
So L = XY$X
This works by outputting XYX and then storing the position of the $.
Note that in the original BWT you have to store the position of the
first character, but in this case, it follows the $, so you can get
away with just storing the position of the $ and work from there.
This is, incidentally, how bwc works. Take a look at the unbwt routine.
If you use wraparound, then the beginning of the string will be used as
part of the context of the end of the string. If you don't, then it won't.
So it depends on if the beginning of the string, seen as context, fits the
end of the string. I would guess offhand it won't, so using a virtual EOF
would tend to improve compression. I'd guess by half a byte or maybe a
whole byte per block. Maybe less. Dunno.
Nope, he's (probably) assuming that there's a virtual character
beyond the last real character which compares greater than any
character in the string. Therefore all comparisons can terminate
thereupon.
Phil
--
If a religion is defined to be a system of ideas that contains unprovable
statements, then Godel taught us that mathematics is not only a religion, it
is the only religion that can prove itself to be one. -- John Barrow