Wanted: Distance coding code

Max Wolf

unread,

Mar 18, 2005, 1:56:56 PM3/18/05

to

Hi,
could someone please post code for the "distance coding" algorithm? I
mean the algorithm that can be used in the second stage of
Burrows-Wheeler and that has been attributed to Edgar Binder.
Is there a reason why this algorithm is kept secret? Google finds
nothing except vague and obscure hints. This question has been asked
before but never answered. So I hope I will be more lucky. If you can,
please help!
Thank you,
Max

michael

unread,

Mar 18, 2005, 4:33:12 PM3/18/05

to

It's just not well documented.

Here's a link to the usenet discussion where edgar discribes his
algorithm.

http://groups-beta.google.com/group/comp.compression/browse_thread/thread/f7a0c734981fd0d3/4c98164a5eb83f44?q=binderedgar+dc#4c98164a5eb83f44

BTW, it is no longer the best method for compressing BWT data.
Try inversion frequencies, WFC and for the current best (wink)
look into M03. mij4x has implemented M03 to achieve 795,000 on
the 18 files of calgary (which is, at the moment, even better
than I can achieve and I invented the damn thing! (^;

- Michael A Maniscalco

Max Wolf

unread,

Mar 19, 2005, 5:15:15 AM3/19/05

to

> Here's a link to the usenet discussion where edgar discribes his
> algorithm.
>

Of course I have seen this. That's what I referred to by "vague and obscure".

> BTW, it is no longer the best method for compressing BWT data.
> Try inversion frequencies, WFC and for the current best (wink)
> look into M03. mij4x has implemented M03 to achieve 795,000 on
> the 18 files of calgary (which is, at the moment, even better
> than I can achieve and I invented the damn thing! (^;

No, I am really only interested in the distance coding algorithm.

michael

unread,

Mar 20, 2005, 2:20:41 PM3/20/05

to

Well, the idea of distance coding is fairly simple. Basically, you
transform the BWT stream such that the transform represents the
'distance' from the current position in the BWT stream to the next
occurance of the symbol which is at the current location. This
distance is calculated excluding any known symbols which are beyond the
current position.

As an example:
AABAB

First, the position of the first occurance of each unique symbol must
be
noted. In this case, the first A is at index 0 and the first B is at
index 2.
With this information (and the knowledge that there are only two
symbols) we can infer that index 1 must be symbol 'A'. So we start
with ...

AAB??

Now we encode the first occurance of A beyond what is known. This
would be
1 (zero would indicate no more occurances of current symbol.) The 1
indicates
that the first occurance of ? (unknown) is the next occurance of the
symbol 'A'.

This gives us ...
AABA?

Then move on to the symbol at 2 ('B'). Encode 1 to indicate that the
next occurance of B is the first ?. Then encode 0 to represent that
'A' at position 3 has no more occurances, and likewise encode 0 for no
more occurances of 'B' at position 4.

The real gains for this method occur when large spans of a run of one
unique
symbol can be inferred. This is the reason why the transformed BWT ->
DC is often far smaller than the BWT itself. For instance, in the
example above, the 'A' at index 1 was inferred and never included in
the encoding. This is not uncommon (infact it is expected) with BWT
streams on data which is atleast some what predictable.

The real challange with distance coding (it would seem to me) is to
create models for effective encoding of the distance coding transform.
I've never
bothered with it myself, but I would assume that adaptive models used
for logarithmic encoding of the distances is the way to go. These
models probably should strongly consider the recent trends for each
given symbol as well as the general local trend overall.

That is, if 'B' suddenly starts having short distances, it should be
expected that (for the time being) it will continue to have short
distances. But, when a long distance is encountered, we should expect
that this current behavior would change very soon.

At any rate, it would be a mistake to consider distance coding as a one
step
encoding process. Really, it should be viewed as a distance transform
plus
and encoding phase which encodes the transform. So I would start with
a
fast forward and reverse transform from BWT to DC(BWT) and then examine
the transform for its properties.

Hope this helps.

- Michael A Maniscalco

P.S. Distance coding has some things in common with inversion
frequencies
so you might want to read up on IF since there is not much written on
DC.
In a nut shell, I believe it is somewhat accurate to say that IF
considers
distances of symbols in BWT one at a time whereas, DC considers them
all
at the same time. Otherwise, they are somewhat the same idea.

----
encoding of the distances

Max Wolf

unread,

Mar 21, 2005, 4:56:33 AM3/21/05

to

Thanks for holding the line, but I still can't infer the algorithm.
There are still too many "?"s in it for me. So the clou is to look at
each symbol in turn? What about encoding "abcaaabaccca" for example?
I am looking for a compression algorithm that uses little memory (most
important) and little cpu time as well and gives a reasonable
compression. Looking at the special kind of data that I have, I was
hoping to use that distance coding algorithm + maybe a cheap entropy
coder without doing a BWT first. I already tried that with inversion
frequencies and it works relatively good.
Thanks, Max

michael

unread,

Mar 21, 2005, 9:15:01 AM3/21/05

to

For ABCAAABACCCA ...
Start with known first occurance of 'A' = 0, 'B' = 1, 'C' = 2.
You have ABC?????????

Leftmost considered symbol is 'A'.
Distance to next ? which is 'A' is 1 (zero is no more 'A').
This gives us (using lower case for symbols no longer in
consideration).
aBCA????????

Leftmost considered symbol is 'B'.
Distance to next ? which is 'B' is 3.
This gives us
abCA??B?????

Leftmost considered symbol is 'C'.
Distance to next ? which is 'C' is 4.
This gives us
abcA??B?C???

We can now infer that the first two ?? must be 'A'.
so ..
abcaaAB?C???

Leftmost considered symbol is 'A'
Distance to next ? which is 'A' is 1.
This gives us
abcaaaBAC???

Leftmost considered symbol is 'B'
No more 'B' so distance is 0.
This gives us
abcaaabAC???

Leftmost considered symbol is 'A'
Distance to next ? which is 'A' is 3
This gives us
abcaaabaC??A

We can now infer that the next two ? must be 'C' so ...
abcaaabaccCA

Leftmost considered symbol is 'C'
No more 'C' so distance is 0.
giving us abcaaabacccA

Leftmost considered symbol is 'A'
No more 'A' so distance is 0.

No more leftmost considered symbols indicates transform complete.

ABCAAABACCCA -> 1,3,4,1,0,3,0,0 (plus info for first occurance of
each unique symbol) ['A' = 0, 'B' = 1, 'C' = 2]

Did this in my head so if there is an error, sorry.

- Michael A Maniscalco

Max Wolf

unread,

Mar 24, 2005, 8:32:56 AM3/24/05

to

Thank you very much, that was clear enough. Below some working Matlab
code for encoding and decoding.

function [code, sideinfo] = distancecoding(symbolvec);
%calculate distance coding transformation
%for vector of symbols symbolvec
symbols = unique(symbolvec);
nbsymbols = length(symbols);
code = zeros(length(symbolvec)+2*nbsymbols+2, 1);
len = length(symbolvec);

%side information
sideinfo(1) = len;
sideinfo(2) = nbsymbols;
sideinfo(3:nbsymbols+2) = symbols;

%index of first occurence of each symbol in symbolvec
firstoccurence = zeros(nbsymbols, 1);
for s=1:nbsymbols
for i=1:len
if symbolvec(i)==symbols(s)
firstoccurence(s)=i;
break;
end;
end;
end;
sideinfo(nbsymbols+3:2*nbsymbols+2) = firstoccurence;

%most advanced known position in symbolvec for each symbol
%is kept ascendingly sorted
knownpositions = sort(firstoccurence);
cind = 1; %index into code vector
pos = 1; %index into symbol vector
while true
currsymbol = symbolvec(pos);
%skip over sequence of currsymbol
while true
pos=pos+1;
if pos>=len break; end;
if symbolvec(pos)~=currsymbol break; end;
end;
if pos>=len break; end;
%sequence ended, now count symbols up to next
%occurence of currsymbol
startpos = pos-1;
nextpos = pos+1;
while symbolvec(nextpos)~=currsymbol
nextpos = nextpos+1;
if nextpos>len break; end;
end;
if nextpos>len
%no more instance of currsymbol in symbolvec found: output 0
code(cind)=0;
cind=cind+1;
%set knownposition of currsymbol to pos after the end of vec
knownpositions(1)=len+1;
knownpositions=sort(knownpositions);
else
%next instance of currsymbol found at position nextpos
distance = nextpos-startpos;
%from the distance, substract the known symbols
%we have met on the way from startpos to nextpos
for i=2:nbsymbols
if nextpos<knownpositions(i)
break;
else
distance=distance-1;
end;
end;
%output distance
code(cind)=distance;
cind = cind+1;
%set knownposition of currsymbol
knownpositions(1)=nextpos;
knownpositions=sort(knownpositions);
end;
end;
code = code(1:cind-1);

function symbolvec = distancedecoding(code, sideinfo);
%calculate distance decoding
len = sideinfo(1);
nbsymbols = sideinfo(2);
symbols = sideinfo(3:nbsymbols+2);
symbolvec = zeros(len, 1);
firstoccurence = sideinfo(nbsymbols+3:2*nbsymbols+2);

%handle special case when there is only one symbol
if nbsymbols==1
symbolvec = symbolvec + symbols(1);
return;
end;

%most advanced known position in symbolvec for each symbol
knownpositions = firstoccurence;
%set the knownpositions of the symbolvec
%to the appropriate symbols
symbolvec(knownpositions) = symbols;
%ascending sort vector of knownpositions,
%will be kept sorted from now on
knownpositions = sort(knownpositions);
cind = 1; %index into code vector
pos = 1; %index into symbol vector
while true
currsymbol = symbolvec(pos);
%fill sequence of currsymbol up to next known symbol
while true
pos=pos+1;
if pos==knownpositions(2) break; end;
symbolvec(pos) = currsymbol;
end;
if pos>=len break; end;
%sequence ended, now skip over distance-1 unknown symbols up to
%next occurence of currsymbol
distance = code(cind);
cind = cind+1;
if distance==0
%no more occurences of currsymbol in the vector
%set knownposition to position after end of vector
knownpositions(1) = len+1;
knownpositions = sort(knownpositions);
continue;
end;
%one after next known symbol
nextpos = pos+1;
distance = distance-1;
%skip unknown symbols between known symbols
for knownind=3:length(knownpositions)
nextknownpos = knownpositions(knownind);
skipdistance = (nextknownpos-nextpos);
%next position is behind the knownsymbol at nextknownpos
if distance>=skipdistance
distance = distance - skipdistance;
nextpos = nextknownpos+1;
else
%next position is before the knownsymbol at nextknownpos
break;
end;
end;
nextpos = nextpos+distance;

symbolvec(nextpos) = currsymbol;
knownpositions(1) = nextpos;
knownpositions = sort(knownpositions);
end;