Here's a link to the usenet discussion where edgar discribes his
algorithm.
BTW, it is no longer the best method for compressing BWT data.
Try inversion frequencies, WFC and for the current best (wink)
look into M03. mij4x has implemented M03 to achieve 795,000 on
the 18 files of calgary (which is, at the moment, even better
than I can achieve and I invented the damn thing! (^;
- Michael A Maniscalco
Of course I have seen this. That's what I referred to by "vague and obscure".
> BTW, it is no longer the best method for compressing BWT data.
> Try inversion frequencies, WFC and for the current best (wink)
> look into M03. mij4x has implemented M03 to achieve 795,000 on
> the 18 files of calgary (which is, at the moment, even better
> than I can achieve and I invented the damn thing! (^;
No, I am really only interested in the distance coding algorithm.
As an example:
AABAB
First, the position of the first occurance of each unique symbol must
be
noted. In this case, the first A is at index 0 and the first B is at
index 2.
With this information (and the knowledge that there are only two
symbols) we can infer that index 1 must be symbol 'A'. So we start
with ...
AAB??
Now we encode the first occurance of A beyond what is known. This
would be
1 (zero would indicate no more occurances of current symbol.) The 1
indicates
that the first occurance of ? (unknown) is the next occurance of the
symbol 'A'.
This gives us ...
AABA?
Then move on to the symbol at 2 ('B'). Encode 1 to indicate that the
next occurance of B is the first ?. Then encode 0 to represent that
'A' at position 3 has no more occurances, and likewise encode 0 for no
more occurances of 'B' at position 4.
The real gains for this method occur when large spans of a run of one
unique
symbol can be inferred. This is the reason why the transformed BWT ->
DC is often far smaller than the BWT itself. For instance, in the
example above, the 'A' at index 1 was inferred and never included in
the encoding. This is not uncommon (infact it is expected) with BWT
streams on data which is atleast some what predictable.
The real challange with distance coding (it would seem to me) is to
create models for effective encoding of the distance coding transform.
I've never
bothered with it myself, but I would assume that adaptive models used
for logarithmic encoding of the distances is the way to go. These
models probably should strongly consider the recent trends for each
given symbol as well as the general local trend overall.
That is, if 'B' suddenly starts having short distances, it should be
expected that (for the time being) it will continue to have short
distances. But, when a long distance is encountered, we should expect
that this current behavior would change very soon.
At any rate, it would be a mistake to consider distance coding as a one
step
encoding process. Really, it should be viewed as a distance transform
plus
and encoding phase which encodes the transform. So I would start with
a
fast forward and reverse transform from BWT to DC(BWT) and then examine
the transform for its properties.
Hope this helps.
- Michael A Maniscalco
P.S. Distance coding has some things in common with inversion
frequencies
so you might want to read up on IF since there is not much written on
DC.
In a nut shell, I believe it is somewhat accurate to say that IF
considers
distances of symbols in BWT one at a time whereas, DC considers them
all
at the same time. Otherwise, they are somewhat the same idea.
----
encoding of the distances
For ABCAAABACCCA ...
Start with known first occurance of 'A' = 0, 'B' = 1, 'C' = 2.
You have ABC?????????
Leftmost considered symbol is 'A'.
Distance to next ? which is 'A' is 1 (zero is no more 'A').
This gives us (using lower case for symbols no longer in
consideration).
aBCA????????
Leftmost considered symbol is 'B'.
Distance to next ? which is 'B' is 3.
This gives us
abCA??B?????
Leftmost considered symbol is 'C'.
Distance to next ? which is 'C' is 4.
This gives us
abcA??B?C???
We can now infer that the first two ?? must be 'A'.
so ..
abcaaAB?C???
Leftmost considered symbol is 'A'
Distance to next ? which is 'A' is 1.
This gives us
abcaaaBAC???
Leftmost considered symbol is 'B'
No more 'B' so distance is 0.
This gives us
abcaaabAC???
Leftmost considered symbol is 'A'
Distance to next ? which is 'A' is 3
This gives us
abcaaabaC??A
We can now infer that the next two ? must be 'C' so ...
abcaaabaccCA
Leftmost considered symbol is 'C'
No more 'C' so distance is 0.
giving us abcaaabacccA
Leftmost considered symbol is 'A'
No more 'A' so distance is 0.
No more leftmost considered symbols indicates transform complete.
ABCAAABACCCA -> 1,3,4,1,0,3,0,0 (plus info for first occurance of
each unique symbol) ['A' = 0, 'B' = 1, 'C' = 2]
Did this in my head so if there is an error, sorry.
- Michael A Maniscalco
function [code, sideinfo] = distancecoding(symbolvec);
%calculate distance coding transformation
%for vector of symbols symbolvec
symbols = unique(symbolvec);
nbsymbols = length(symbols);
code = zeros(length(symbolvec)+2*nbsymbols+2, 1);
len = length(symbolvec);
%side information
sideinfo(1) = len;
sideinfo(2) = nbsymbols;
sideinfo(3:nbsymbols+2) = symbols;
%index of first occurence of each symbol in symbolvec
firstoccurence = zeros(nbsymbols, 1);
for s=1:nbsymbols
for i=1:len
if symbolvec(i)==symbols(s)
firstoccurence(s)=i;
break;
end;
end;
end;
sideinfo(nbsymbols+3:2*nbsymbols+2) = firstoccurence;
%most advanced known position in symbolvec for each symbol
%is kept ascendingly sorted
knownpositions = sort(firstoccurence);
cind = 1; %index into code vector
pos = 1; %index into symbol vector
while true
currsymbol = symbolvec(pos);
%skip over sequence of currsymbol
while true
pos=pos+1;
if pos>=len break; end;
if symbolvec(pos)~=currsymbol break; end;
end;
if pos>=len break; end;
%sequence ended, now count symbols up to next
%occurence of currsymbol
startpos = pos-1;
nextpos = pos+1;
while symbolvec(nextpos)~=currsymbol
nextpos = nextpos+1;
if nextpos>len break; end;
end;
if nextpos>len
%no more instance of currsymbol in symbolvec found: output 0
code(cind)=0;
cind=cind+1;
%set knownposition of currsymbol to pos after the end of vec
knownpositions(1)=len+1;
knownpositions=sort(knownpositions);
else
%next instance of currsymbol found at position nextpos
distance = nextpos-startpos;
%from the distance, substract the known symbols
%we have met on the way from startpos to nextpos
for i=2:nbsymbols
if nextpos<knownpositions(i)
break;
else
distance=distance-1;
end;
end;
%output distance
code(cind)=distance;
cind = cind+1;
%set knownposition of currsymbol
knownpositions(1)=nextpos;
knownpositions=sort(knownpositions);
end;
end;
code = code(1:cind-1);
function symbolvec = distancedecoding(code, sideinfo);
%calculate distance decoding
len = sideinfo(1);
nbsymbols = sideinfo(2);
symbols = sideinfo(3:nbsymbols+2);
symbolvec = zeros(len, 1);
firstoccurence = sideinfo(nbsymbols+3:2*nbsymbols+2);
%handle special case when there is only one symbol
if nbsymbols==1
symbolvec = symbolvec + symbols(1);
return;
end;
%most advanced known position in symbolvec for each symbol
knownpositions = firstoccurence;
%set the knownpositions of the symbolvec
%to the appropriate symbols
symbolvec(knownpositions) = symbols;
%ascending sort vector of knownpositions,
%will be kept sorted from now on
knownpositions = sort(knownpositions);
cind = 1; %index into code vector
pos = 1; %index into symbol vector
while true
currsymbol = symbolvec(pos);
%fill sequence of currsymbol up to next known symbol
while true
pos=pos+1;
if pos==knownpositions(2) break; end;
symbolvec(pos) = currsymbol;
end;
if pos>=len break; end;
%sequence ended, now skip over distance-1 unknown symbols up to
%next occurence of currsymbol
distance = code(cind);
cind = cind+1;
if distance==0
%no more occurences of currsymbol in the vector
%set knownposition to position after end of vector
knownpositions(1) = len+1;
knownpositions = sort(knownpositions);
continue;
end;
%one after next known symbol
nextpos = pos+1;
distance = distance-1;
%skip unknown symbols between known symbols
for knownind=3:length(knownpositions)
nextknownpos = knownpositions(knownind);
skipdistance = (nextknownpos-nextpos);
%next position is behind the knownsymbol at nextknownpos
if distance>=skipdistance
distance = distance - skipdistance;
nextpos = nextknownpos+1;
else
%next position is before the knownsymbol at nextknownpos
break;
end;
end;
nextpos = nextpos+distance;
symbolvec(nextpos) = currsymbol;
knownpositions(1) = nextpos;
knownpositions = sort(knownpositions);
end;