
A truly BIJECTIVE BWT is here!

439 views

biject

Dec 12, 2007, 11:00:12 PM
I just created what I think is the first truly bijective
BWT, through my Scottification process.
My code is a mod of Mark's BWT and should
function in most cases as a drop-in replacement.
I call it BWTS.

My bijective BWT is at:

http://bijective.dogma.net/bwts.zip

It does what BWT should have done: the transform without
an index. N bytes to N bytes, with only the order changed.
I don't think anyone else has a practical working bijective
BWT transform. It took a lot of trial and error to make it
so that, on average, it leads to bijective BWT compression
without the stupid INDEX, and it leads to better compression
on average.

Thank You
Merry Christmas
David A. Scott
--
My Crypto code
http://bijective.dogma.net/crypto/scott19u.zip
http://www.jim.com/jamesd/Kong/scott19u.zip old version
My Compression code http://bijective.dogma.net/
**TO EMAIL ME drop the roman "five" **
Disclaimer: I am in no way responsible for any of the statements
made in the above text. For all I know I might be drugged.
As a famous person once said, "any cryptographic
system is only as strong as its weakest link"

Industrial One

Dec 13, 2007, 4:24:59 AM
On Dec 12, 9:00 pm, biject <davvid_a_sc...@email.com> wrote:
> I just created what I think is the first truly bijective
> BWT, through my Scottification process.
> My code is a mod of Mark's BWT and should
> function in most cases as a drop-in replacement.
> I call it BWTS.
>
> My bijective BWT is at:
>
> http://bijective.dogma.net/bwts.zip
>
> It does what BWT should have done: the transform without
> an index. N bytes to N bytes, with only the order changed.
> I don't think anyone else has a practical working bijective
> BWT transform. It took a lot of trial and error to make it
> so that, on average, it leads to bijective BWT compression
> without the stupid INDEX, and it leads to better compression
> on average.
>
> Thank You
> Merry Christmas
> David A. Scott
> --
> My Crypto code
> http://bijective.dogma.net/crypto/scott19u.zip
> http://www.jim.com/jamesd/Kong/scott19u.zip old version
> My Compression code http://bijective.dogma.net/

> **TO EMAIL ME drop the roman "five" **
> Disclaimer: I am in no way responsible for any of the statements
> made in the above text. For all I know I might be drugged.
> As a famous person once said, "any cryptographic
> system is only as strong as its weakest link"

Will it make random data more compressible?

Mark Nelson

Dec 13, 2007, 7:35:41 AM
On Dec 12, 10:00 pm, biject <davvid_a_sc...@email.com> wrote:

> It does what BWT should have done: the transform without
> an index. N bytes to N bytes, with only the order changed.

David,

The reversible transform permutes all the data in a message. The
standard procedure for inverting the transform requires that you know
in advance the position of the final character.

It sounds like you are saying that you can permute a message of N
symbols into another message of N symbols, and can then perform the
inverse without knowing the position of the final character.

Do I misunderstand your claim?

|
| Mark Nelson - http://marknelson.us
|

Phil Carmody

Dec 13, 2007, 9:43:17 AM
biject <davvid_...@email.com> writes:
> I just created what I think is the first truly bijective
> BWT, through my Scottification process.
> My code is a mod of Mark's BWT and should
> function in most cases as a drop-in replacement.
> I call it BWTS.
>
> My bijective BWT is at:
>
> http://bijective.dogma.net/bwts.zip
>
> It does what BWT should have done: the transform without
> an index. N bytes to N bytes, with only the order changed.
> I don't think anyone else has a practical working bijective
> BWT transform. It took a lot of trial and error to make it
> so that, on average, it leads to bijective BWT compression
> without the stupid INDEX, and it leads to better compression
> on average.


I suspect this is impossible. In order to achieve an
improvement in compression, BWT needs to map less-
compressible vectors onto more-compressible vectors.
However, there are fewer of the latter. It gets away
with this by mapping multiple source vectors onto the
same output vector, and then adding an auxiliary index
such that the process is reversible. BWT as such cannot
work without that index.
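This counting argument is easy to sanity-check by brute force on a toy alphabet. The sketch below (plain naive Python, nothing from David's zip) enumerates every binary string of length 4 and shows that the (last column, index) pairs are all distinct, while the last columns alone collapse onto far fewer values:

```python
from itertools import product

def bwt(s):
    """Naive BWT: sort all rotations, emit the last column plus the
    index of the original string among the sorted rotations."""
    rots = sorted(s[i:] + s[:i] for i in range(len(s)))
    return ''.join(r[-1] for r in rots), rots.index(s)

# All 16 binary strings of length 4.
inputs = [''.join(bits) for bits in product('ab', repeat=4)]
pairs = {bwt(s) for s in inputs}       # (last column, index) pairs
columns = {col for col, _ in pairs}    # last columns alone

# With the index the mapping is injective: 16 distinct pairs.
# Without it only 6 distinct last columns remain for 16 inputs,
# so the bare last column cannot be inverted on its own.
print(len(pairs), len(columns))  # prints: 16 6
```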

Now it's entirely possible that you've created a new
transform that encourages "better compression on the
average" from the back end, but the fact that it is
better is almost certainly unrelated to a lack of an
index, and is probably harmed by that lack.

Care to outline the algorithm here?

Phil
--
Dear aunt, let's set so double the killer delete select all.
-- Microsoft voice recognition live demonstration

biject

Dec 13, 2007, 10:34:57 AM

Yes, it's what I call a Scottification of the BWT: it gets rid of the
index. You have the code, test it. N symbols to N symbols; it's what
the BWT should have been in the first place to allow for bijective
compression. But like any compressor, sometimes you end up with a
smaller file, sometimes not. It's like bijective arithmetic versus
bijective Huffman: sometimes one is better, sometimes the other.
Though for most uses the bijective arithmetic is better.

For an example, take BANANAS.
The old way, it BWT transforms to BNNSAAA.
One way to get the reverse is to list the transform, then sort it as
single-character strings; this gets AAABNNS. Then create an array of
two-element strings by pairing each transform character with the
sorted character: you get BA NA NA SB AN AN AS. Sort these, then
repeat; eventually you get a table whose rows are the word BANANAS in
each possible rotation, and you need an index to pick which rotation.
That's the old way.
In my way you end up using the top row for the inverse BWTS, no index:
each single-cycle string maps to one unique file.

In my case, when I BWTS BANANAS, I get SNNBAAA. It looks as nice
as the old way, but no index. There is no reverse of this string,
regardless of index, with the old BWT method, since it is not a
single-cycle string.
If you apply the method above you end up with 2 different cycles.
Using the above method, the first row is
ANANAS (you can repeat it if needed)
ANASAN is the second row
ASANAN is the third row
BBBBBB is the fourth row
NANASA is next
NASANA is next
SANANA is last
Now the first cycle is ANANAS. If it were a single cycle it would have
all 7 letters and the reverse would be done. It does not, so look for
the next cycle: it's B. Now combine the 2 in reverse order and
you get BANANAS. So the new way,
BANANAS transforms to SNNBAAA, while the old way,
BANANAS transforms to BNNSAAA with an index.
Sadly there is no reverse transform the old way for SNNBAAA,
but in the new way,
ANANASB transforms to BNNSAAA.

BWTS is the reverse of the above. Look at it this way: when you sort,
let's say the word THE appears many times; in the single-cycle method
they would sort together. If there are many cycles in the new way that
contain the word THE, the BWTS still sorts them together.
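The BANANAS example can be checked mechanically. The sketch below uses a naive textbook BWT and its inverse (hypothetical helper names, not the BWTS code from the zip) to confirm both claims: the classic transform of BANANAS is BNNSAAA with index 3, and SNNBAAA is not reachable by the classic BWT for any index:

```python
def bwt(s):
    # Naive forward BWT: last column of the sorted rotations plus
    # the index of the original string among them.
    rots = sorted(s[i:] + s[:i] for i in range(len(s)))
    return ''.join(r[-1] for r in rots), rots.index(s)

def ibwt(last, idx):
    # Standard inverse: follow the forward links n times from `idx`,
    # reading characters out of the sorted (first) column.
    links = sorted(range(len(last)), key=lambda i: last[i])
    first = sorted(last)
    out, j = [], idx
    for _ in range(len(last)):
        out.append(first[j])
        j = links[j]
    return ''.join(out)

# The classic transform of BANANAS carries an index:
col, idx = bwt('BANANAS')
assert (col, idx) == ('BNNSAAA', 3)
assert ibwt(col, idx) == 'BANANAS'

# SNNBAAA (the BWTS output) is not a classic BWT image: no index
# yields a string whose classic BWT gives it back.
assert all(bwt(ibwt('SNNBAAA', i)) != ('SNNBAAA', i) for i in range(7))
```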

biject

Dec 13, 2007, 10:38:37 AM
On Dec 13, 7:43 am, Phil Carmody <thefatphil_demun...@yahoo.co.uk>
wrote:

I have spent years trying to get people to realize that Huffman
or LZW or arithmetic coding can be done in a bijective way. You
have the code, test it. There may be some errors, who knows,
but the idea is correct. The very fact of adding an index ruins
most chances of using the standard BWT for bijective compression.
This will, on average, lead to better compression than the old way.

Phil Carmody

Dec 13, 2007, 11:09:59 AM
biject <davvid_...@email.com> writes:
> On Dec 13, 7:43 am, Phil Carmody <thefatphil_demun...@yahoo.co.uk>
> wrote:
> > I suspect this is impossible.
...

> > Care to outline the algorithm here?
>
> I have spent years trying to get people to realize that Huffman
> or LZW or arithmetic coding can be done in a bijective way. You
> have the code, test it. There may be some errors, who knows,
> but the idea is correct.

The numbers in your READ.ME file are certainly pulled out
of your arse; I'm disinclined to read any more after that.

biject

Dec 13, 2007, 11:35:11 AM
On Dec 13, 9:09 am, Phil Carmody <thefatphil_demun...@yahoo.co.uk>
wrote:

Well, my writing sucks, but my code tends to do what I say it
does. You can use it or leave it; it's your choice. I know the
solution seems to work on the files I tested, with the side
benefit that it can lead to truly bijective compression,
something which did not seem possible for BWT types of
programs. If you judge code by the comments then don't
test it. Most of the comments are in the code for historical
reasons from when Mark wrote it and have nothing to do with
the program. The code is standard C++; it should be readable
on its own.

Mark Nelson

Dec 13, 2007, 3:56:36 PM
On Dec 13, 8:43 am, Phil Carmody <thefatphil_demun...@yahoo.co.uk>
wrote:

> I suspect this is impossible. In order to achieve an
> improvement in compression, BWT needs to map less-
> compressible vectors onto more-compressible vectors.
> However, there are fewer of the latter.

I don't think this is quite correct, Phil. If you look at the BWT
alone, it is truly a bijective function. It is a one-to-one mapping of
the set of input messages to the set of output messages, for all input
and output.

The output vectors that we get are more compressible ONLY because the
input data has a specific type of correlation - the chances that
symbol X appears after a given sequence of symbols, {a, b, c, ... } is
larger than would be suggested by mere chance. And of course, most
interesting files have this property. Files that aren't interesting
won't have this property, but they will still be transformed by BWT -
into equally uncompressible output vectors.

The reason that I am suspicious of David's claim is just that I don't
see how the transform can be reversible without additional
information.

Of course, I'm reluctant to flag it as impossible. If you were naively
reading the original paper without having ever heard of it before, I
would be very surprised if you were to get through the description of
the transform and then say "Aha! This would be totally reversible if I
just had a pointer to the new position of the last character in the
input sequence!" Well, maybe you would, but I sure didn't, and it
still seems pretty remarkable to me that it works at all.

I haven't even looked at David's code yet but I was hoping he could
give a words-only description of what he did to achieve his claim. But
I'll give David a break, he says his attempts at exposition are rarely
successful, and that's not likely to change overnight. What worries me
is my own personal conviction that there is a bijection between clear
writing in spoken language and clear writing in code.

biject

Dec 13, 2007, 5:46:13 PM

I took your original BWT compressor and ran the Calgary test files;
the results are below. The BWTS bijective compressor beat it in every
case, which was a surprise to me. It should be worse on average than
pure BWT, but better on average when you count the indexing, which I
feel is not needed.

the original test files

07/15/1991 12:55 PM 111,261 BIB
07/15/1991 12:56 PM 768,771 BOOK1
07/15/1991 12:56 PM 610,856 BOOK2
07/15/1991 12:56 PM 102,400 GEO
07/15/1991 12:56 PM 377,109 NEWS
07/15/1991 12:56 PM 21,504 OBJ1
07/15/1991 12:56 PM 246,814 OBJ2
07/15/1991 12:56 PM 53,161 PAPER1
07/15/1991 12:56 PM 82,199 PAPER2
07/15/1991 12:56 PM 46,526 PAPER3
07/15/1991 12:56 PM 13,286 PAPER4
07/15/1991 12:56 PM 11,954 PAPER5
07/15/1991 12:56 PM 38,105 PAPER6
07/15/1991 12:57 PM 513,216 PIC
07/15/1991 12:57 PM 39,611 PROGC
07/15/1991 12:57 PM 71,646 PROGL
07/15/1991 12:57 PM 49,379 PROGP
07/15/1991 12:57 PM 93,695 TRANS
18 File(s) 3,251,493 bytes

Mark's compressed sizes

12/13/2007 03:05 PM 29,567 bib.m
12/13/2007 03:08 PM 275,831 book1.m
12/13/2007 03:11 PM 186,592 book2.m
12/13/2007 03:12 PM 62,120 geo.m
12/13/2007 03:12 PM 134,174 news.m
12/13/2007 03:12 PM 10,857 obj1.m
12/13/2007 03:12 PM 81,948 obj2.m
12/13/2007 03:13 PM 17,724 paper1.m
12/13/2007 03:13 PM 26,956 paper2.m
12/13/2007 03:13 PM 16,995 paper3.m
12/13/2007 03:13 PM 5,529 paper4.m
12/13/2007 03:13 PM 5,136 paper5.m
12/13/2007 03:13 PM 13,159 paper6.m
12/13/2007 03:15 PM 50,829 pic.m
12/13/2007 03:16 PM 13,312 progc.m
12/13/2007 03:16 PM 16,688 progl.m
12/13/2007 03:16 PM 11,404 progp.m
12/13/2007 03:16 PM 19,301 trans.m
18 File(s) 978,122 bytes


Scott's compressed output: replaced BWT with BWTS
and UNBWT with UNBWTS, so no index needed

12/13/2007 03:05 PM 29,563 bib.s
12/13/2007 03:08 PM 275,779 book1.s
12/13/2007 03:11 PM 186,555 book2.s
12/13/2007 03:12 PM 62,113 geo.s
12/13/2007 03:12 PM 134,166 news.s
12/13/2007 03:12 PM 10,845 obj1.s
12/13/2007 03:12 PM 81,930 obj2.s
12/13/2007 03:13 PM 17,710 paper1.s
12/13/2007 03:13 PM 26,938 paper2.s
12/13/2007 03:13 PM 16,979 paper3.s
12/13/2007 03:13 PM 5,515 paper4.s
12/13/2007 03:13 PM 5,122 paper5.s
12/13/2007 03:13 PM 13,147 paper6.s
12/13/2007 03:15 PM 50,815 pic.s
12/13/2007 03:16 PM 13,302 progc.s
12/13/2007 03:16 PM 16,682 progl.s
12/13/2007 03:16 PM 11,395 progp.s
12/13/2007 03:16 PM 19,298 trans.s
18 File(s) 977,854 bytes

Batch file to create the above:

..\rle %1. %1.1
..\bwt %1.1 %1.2
..\mtf %1.2 %1.3
..\rle %1.3 %1.4
..\ari %1.4 %1.m
..\unari %1.m %1.4m
..\unrle %1.4m %1.3m
..\unmtf %1.3m %1.2m
..\unbwt %1.2m %1.1m
..\unrle %1.1m %1.mm
..\bwts -b200000 %1.1 %1.5
..\mtf %1.5 %1.6
..\rle %1.6 %1.7
..\ari %1.7 %1.s
..\unari %1.s %1.4s
..\unrle %1.4s %1.3s
..\unmtf %1.3s %1.2s
..\unbwts -b200000 %1.2s %1.1s
..\unrle %1.1s %1.ss


fc /b %1. %1.mm
fc /b %1. %1.ss

There were no errors, and in each case the BWTS as a drop-in
replacement did better than Mark's BWT; this even surprised
me.

Actually the total difference was 268 bytes (978,122 - 977,854). That
is very close to the space wasted by using the indexes.

Willem

Dec 13, 2007, 11:09:45 AM
Phil wrote:
) I suspect this is impossible. In order to achieve an
) improvement in compression, BWT needs to map less-
) compressible vectors onto more-compressible vectors.

I disagree. It maps higher-order redundancy onto lower-order redundancy.
The input and output of BWT are both equally compressible, it's just that
the BWT output can be compressed with a much simpler algorithm.


SaSW, Willem
--

Disclaimer: I am in no way responsible for any of the statements
made in the above text. For all I know I might be drugged or something..
No I'm not paranoid. You all think I'm paranoid, don't you!
#EOT

Phil Carmody

Dec 14, 2007, 8:48:56 AM
Mark Nelson <snork...@gmail.com> writes:
> On Dec 13, 8:43 am, Phil Carmody <thefatphil_demun...@yahoo.co.uk>
> wrote:
> > I suspect this is impossible. In order to achieve an
> > improvement in compression, BWT needs to map less-
> > compressible vectors onto more-compressible vectors.
> > However, there are fewer of the latter.
>
> I don't think this is quite correct Phil. If you look at the BWT
> alone, it is truly a bijective function. It is a one-to-one mapping of
> the set of input messages to the set of output messages, for all input
> and output.

That surprises me. Let's assume your output is <3,bnnsaaa>, then are
you saying that there is no possible inverse image for <0,bnnsaaa>,
<1,bnnsaaa>, <2,bnnsaaa>, <4,bnnsaaa>, <5,bnnsaaa>, and <6,bnnsaaa>?

> The output vectors that we get are more compressible ONLY because the
> input data has a specific type of correlation - the chances that
> symbol X appears after a given sequence of symbols, {a, b, c, ... } is
> larger than would be suggested by mere chance. And of course, most
> interesting files have this property. Files that aren't interesting
> won't have this property, but they will still be transformed by BWT -
> into equally uncompressible output vectors.
>
> The reason that I am suspicious of David's claim is just that I don't
> see how the transform can be reversible without additional
> information.

But what you've said has just made it reversible:

Given vector V
For i in [ 0 .. length(V) )
    If < i, V > has an inverse image I
        Return I
    Fi
Rof


> Of course, I'm reluctant to flag it as impossible. If you were naively
> reading the original paper without having ever heard of it before, I
> would be very surprised if you were to get through the description of
> the transform and then say "Aha! This would be totally reversible if I
> just had a pointer to the new position of the last character in the
> input sequence!" Well, maybe you would, but I sure didn't, and it
> still seems pretty remarkable to me that it works at all.

It's been over a decade, but I'm pretty sure that's exactly how
I reacted.



> I haven't even looked at David's code yet but I was hoping he could
> give a words-only description of what he did to achieve his claim. But
> I'll give David a break, he says his attempts at exposition are rarely
> successful, and that's not likely to change overnight. What worries me
> is my own personal conviction that there is a bijection between clear
> writing in spoken language and clear writing in code.

One I share.

Do you agree with all the '3's in the READ.ME file?

Klaus Stengel

Dec 15, 2007, 3:46:19 AM
Phil Carmody wrote:
> Mark Nelson <snork...@gmail.com> writes:
>> On Dec 13, 8:43 am, Phil Carmody <thefatphil_demun...@yahoo.co.uk>
>> wrote:
>>> I suspect this is impossible. In order to achieve an
>>> improvement in compression, BWT needs to map less-
>>> compressible vectors onto more-compressible vectors.
>>> However, there are fewer of the latter.
>> I don't think this is quite correct Phil. If you look at the BWT
>> alone, it is truly a bijective function. It is a one-to-one mapping of
>> the set of input messages to the set of output messages, for all input
>> and output.
>
> That surprises me. Let's assume your output is <3,bnnsaaa>, then are
> you saying that there is no possible inverse image for <0,bnnsaaa>,
> <1,bnnsaaa>, <2,bnnsaaa>, <4,bnnsaaa>, <5,bnnsaaa>, and <6,bnnsaaa>?

For your particular examples inverse images exist, and they are
character-wise rotations of the word "bananas" (e.g. "sbanana",
"asbanan", ...). The exact meaning depends on how you define the index
number in your implementation.

But there are indeed values that look valid but don't have a
corresponding input value for BWT. For example <3,snnbaaa> has no
possible inverse image. This stems from the fact that BWT maps each
rotation of a given input to the same output string. So there have to be
some strings that BWT can't produce and that's also why you need to
store that index/rotation number along with the output string if you
want to get your original input back.
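Both observations are quick to confirm with a naive BWT (a sketch; the helper name is made up):

```python
def bwt(s):
    # Naive BWT: last column of sorted rotations plus the index of
    # the original string among them.
    rots = sorted(s[i:] + s[:i] for i in range(len(s)))
    return ''.join(r[-1] for r in rots), rots.index(s)

word = 'bananas'
rotations = [word[i:] + word[:i] for i in range(len(word))]
results = [bwt(r) for r in rotations]

# Every rotation yields the same last column ...
assert len({col for col, _ in results}) == 1
# ... distinguished only by the index, which takes all 7 values.
assert sorted(i for _, i in results) == list(range(7))
```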

>> The output vectors that we get are more compressible ONLY because the
>> input data has a specific type of correlation - the chances that
>> symbol X appears after a given sequence of symbols, {a, b, c, ... } is
>> larger than would be suggested by mere chance. And of course, most
>> interesting files have this property. Files that aren't interesting
>> won't have this property, but they will still be transformed by BWT -
>> into equally uncompressible output vectors.
>>
>> The reason that I am suspicious of David's claim is just that I don't
>> see how the transform can be reversible without additional
>> information.

The index information is stored by extending the number of possible
output strings. Instead of mapping each possible rotation of an input to
the same output (like BWT without index number would do), David's
algorithm maps different rotations to different output strings.

>> I haven't even looked at David's code yet but I was hoping he could
>> give a words-only description of what he did to achieve his claim. But
>> I'll give David a break, he says his attempts at exposition are rarely
>> successful, and that's not likely to change overnight. What worries me
>> is my own personal conviction that there is a bijection between clear
>> writing in spoken language and clear writing in code.
>
> One I share.

There are no useful comments in it, but at least the variable and
function names make sense, sometimes... Well, here's a textual
description of David's algorithm with some ASCII art for visualization
;-) It took me some time to get the details right, but I think it's
accurate. I'll start with a short illustration of the regular BWT
algorithm and then try to explain what David's code does (as I
understand it).

Forward BWT

0123456789 (character position in string, counting from 0)

FOOBAR2000 <-- original input string, 10 characters
0FOOBAR200 rotate by 1 char repeatedly to get the array of 10 strings
00FOOBAR20
000FOOBAR2
2000FOOBAR
R2000FOOBA
AR2000FOOB
BAR2000FOO
OBAR2000FO
OOBAR2000F

.. sort ...

,------- BWT output column
.L.
000FOOBAR|2|
00FOOBAR2|0|
0FOOBAR20|0|
2000FOOBA|R|
AR2000FOO|B|
BAR2000FO|O|
FOOBAR200|0|<-- line no. 6 containing original string
OBAR2000F|O| (counting from zero)
OOBAR2000|F|
R2000FOOB|A|
`-´

Last column and index with original string is output of BWT
--> 200RBO0OFA, 6

All rotations of the input (in this example "FOOBAR2000", "0FOOBAR200",
"00FOOBAR20", ...) actually map to the same BWT'ed string. Because of
this property it's necessary to save the original index or some other
kind of "offset" in order to distinguish the possible rotations.
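The table above can be reproduced with a few lines of naive Python (explicitly sorting all rotations, so O(n² log n); real implementations use suffix sorting instead):

```python
def bwt(s):
    # Sort all rotations; the output is the last column plus the
    # line number of the original string in the sorted list.
    rots = sorted(s[i:] + s[:i] for i in range(len(s)))
    return ''.join(r[-1] for r in rots), rots.index(s)

# Matches the worked example: last column "200RBO0OFA", line 6.
assert bwt('FOOBAR2000') == ('200RBO0OFA', 6)
```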

----------------------------------------

Reverse BWT:

0 2 A B F O R <-- list of different characters in BWT string
------------- (sorted alphabetically)
3 1 1 1 1 2 1 Count[]: how often each character appears in the string
0 3 4 5 6 7 9 RunningTotal[]: position where given character starts
in the sorted list

,---- Each character Count[] times in alphabetical order
| ,---- Column with BWT string
| | ,---- RunningTotal[] of character in BWT string
| | | ,---- Number of times the character from the BWT
| | | | string already occurred
| | | | ,---- Sum gives line no. containing previous
| | | | | character in original string
| | | | |
Line V________V V V V Reverse mapping of sum (=forward links)
0 0 2 3 + 0 = 3 1
1 0 0 0 + 0 = 0 2
2 0 0 0 + 1 = 1 6
3 2 R 9 + 0 = 9 0
4 A B 5 + 0 = 5 9
5 B O 7 + 0 = 7 4
6 FOOBAR2000 0 + 2 = 2 8 <-- Start decoding here by
7 O O 7 + 1 = 8 5 following line numbers and
8 O F 6 + 0 = 6 7 noting the characters from
9 R A 4 + 0 = 4 3 the 2nd column

This results in the following sequence with the letter in the first
column noted below:

6 -> 8 -> 7 -> 5 -> 4 -> 9 -> 3 -> 0 -> 1 -> 2
F O O B A R 2 0 0 0

Note that following the links in the last column forms one cyclic
sequence, spanning all characters from the input. If we continued past
the end (line 2 in this example) we'd get the same output again and
again. This is always the case if the input came from a BWT. Applying
the reverse BWT to some random or corrupted input usually leads to some
shorter cycle that doesn't include all characters.
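A sketch of this inverse in Python; stably sorting the character positions produces exactly the "forward links" column of the table above:

```python
def ibwt(last, idx):
    n = len(last)
    # sorted() is stable, so positions of equal characters keep
    # their order: this is the "forward links" column.
    links = sorted(range(n), key=lambda i: last[i])
    first = sorted(last)   # the alphabetically sorted column
    out, j = [], idx
    for _ in range(n):
        out.append(first[j])   # note the char, follow the link
        j = links[j]
    return ''.join(out)

# Start at line 6, exactly as in the walkthrough above.
assert ibwt('200RBO0OFA', 6) == 'FOOBAR2000'
```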

David's basic idea is now to modify the algorithm to create an output
containing multiple cycles by design. With multiple cycles available it
is possible to split the input in a way that decoding from index zero
for each part always yields the correct result. The information formerly
stored in the extra index value is now encoded in the size and order of
the cycles.

To make this work we have to determine where to split up the input
first. In David's program the function part_cycle() is responsible for
this. When applying the BWT, line/index zero always contains the
rotation with the lowest value. In case of "FOOBAR2000" this is
"000FOOBAR2" (see forward BWT example from the beginning). As our
original input string starts with "FOOBAR..." we need to split this into
two cycles: "FOOBAR2" and "000". Now if we apply the BWT to the
remaining "FOOBAR2" again, we notice that the index zero contains
"2FOOBAR". This is still not what we want, so we create another part
containing only "2". Now we try again with "FOOBAR" and get "ARFOOB"...
When we continue to chop off the unwanted prefixes and note them as we
go along, we'll finally get the remainder "FOO" which is still looking
the same in index zero of the BWT. The complete partitioning sequence is
as follows:

FOOBAR2000 -> 000 FOOBAR2
FOOBAR2 -> 2 FOOBAR
FOOBAR -> AR FOOB
FOOB -> B FOO
FOO -> FOO <-- equal, we're done

FOO|B|AR|2|000
`-´ U `´ U `-´
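This repeated chopping computes what is otherwise known as the Lyndon factorization of the input. A sketch using Duval's linear-time algorithm (my reading of what part_cycle() achieves, not David's actual code; note that Duval splits the trailing "000" into three unit cycles "0","0","0", which sort to the same rows and give the same transform):

```python
def cycle_split(s):
    # Duval's algorithm: factor s into a non-increasing sequence of
    # Lyndon words. Each factor is one of the cycles described above.
    factors, i, n = [], 0, len(s)
    while i < n:
        j, k = i + 1, i
        while j < n and s[k] <= s[j]:
            k = i if s[k] < s[j] else k + 1
            j += 1
        while i <= k:
            factors.append(s[i:i + j - k])
            i += j - k
    return factors

assert cycle_split('FOOBAR2000') == ['FOO', 'B', 'AR', '2', '0', '0', '0']
assert cycle_split('BANANAS') == ['B', 'ANANAS']
```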

So we end up with five cycles. Now we need to create a transformed
string containing these cycles in the given order. This works
essentially the same way as the ordinary BWT, but instead of sorting all
possible rotations of one input string, we use each cycle in each
possible rotated position as the input for the sorting function:

,-- gap indicates cycle len ,--- result column
V .L.
FOO FOOFOO... 00|0|
OFO OFOOFO... 00|0|
OOF OOFOOF... 00|0|
B BBBBBBBB... |2|
AR ARARARA... A|R|
RA RARARAR... --> sort --> |B|
2 22222222... FO|O|
000 000000... OF|O|
000 000000... OO|F|
000 000000... R|A|
`-´

In case we have to compare strings of different lengths while sorting,
we need to keep in mind that we're actually comparing infinite cyclic
character sequences. That's why we need to extend the shorter one by
repeating it until a difference occurs or both were repeated at least
once. For example if we want to compare "AXYA" with "AXY" we have to
expand "AXY" by repeating it: "AXYAXY". Now "AXYA" is again shorter
and equals "AXYAXY" in all available characters, so we have to repeat
it too: "AXYAAXYA". Now we see that "AXYAAXYA" is smaller than
"AXYAXY", as the fifth character "A" < "X". In the sorted list "AXYA"
must be positioned before "AXY".

The resulting transform is the last character of each cycle in the
sorted list, almost like the ordinary BWT. Saving some index is no
longer required.

--> 0002RBOOFA
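Putting the two phases together, here is a compact sketch of the whole forward transform (not David's code; the 2n-character prefix trick implements the infinite-periodic comparison described above, since two periodic words that agree on their first 2n characters are equal):

```python
def lyndon_factors(s):
    # Duval's algorithm: split s into non-increasing Lyndon words
    # (the cycles of the description above).
    factors, i, n = [], 0, len(s)
    while i < n:
        j, k = i + 1, i
        while j < n and s[k] <= s[j]:
            k = i if s[k] < s[j] else k + 1
            j += 1
        while i <= k:
            factors.append(s[i:i + j - k])
            i += j - k
    return factors

def bwts(s):
    n = len(s)
    rots = [w[i:] + w[:i] for w in lyndon_factors(s)
            for i in range(len(w))]
    # Sort rotations as infinite periodic words: a prefix of length
    # 2n always suffices to decide any comparison.
    rots.sort(key=lambda r: (r * (2 * n // len(r) + 1))[:2 * n])
    return ''.join(r[-1] for r in rots)

assert bwts('FOOBAR2000') == '0002RBOOFA'
assert bwts('BANANAS') == 'SNNBAAA'
```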

----------------------------------------

Reverse transformation

The reverse transformation works almost like the reverse BWT. We only
need to introduce an additional column later to track which characters
were already visited so we don't enter a cycle twice.

0 2 A B F O R <-- list of different characters in BWT string
------------- (sorted alphabetically)
3 1 1 1 1 2 1 Count[]: how often each character appears in the string
0 3 4 5 6 7 9 RunningTotal[]: position where given character starts
in the sorted list

,---- Each character Count[] times in alphabetical order
| ,---- Column with transformed string
| | ,---- RunningTotal[] of character in BWT string
| | | ,---- Number of times the character from the BWT
| | | | string already occured
| | | | ,---- Sum gives line no. containing previous
| | | | | character in original string
| | | | |
Line V V V V V Reverse mapping of sum (=forward links)
0 0 0 0 + 0 = 0 0
1 0 0 0 + 1 = 1 1
2 0 0 0 + 2 = 2 2
3 2 2 3 + 0 = 3 3
4 A R 9 + 0 = 9 9
5 B B 5 + 0 = 5 5
6 F O 7 + 0 = 7 8
7 O O 7 + 1 = 8 6
8 O F 6 + 0 = 6 7
9 R A 4 + 0 = 4 4

We start decoding in line zero. Output is generated backwards, as the
string resulting from the forward transformation always starts with the
last character of the original input string. Output is taken from the
column containing the alphabetically sorted characters (2nd column in
the table above). If a cycle consists of more than one character the
first character is skipped and will be sent to output at the end of the
cycle to undo a rotation introduced in the forward transformation. Any
lines processed are subsequently marked as "visited".

Line Character list Forward link Visited
0 0 0 1 <--
1 0 1 0
2 0 2 0
3 2 3 0
4 A 9 0
... ... ... ...

Output so far: _________0

The forward link points to line 0 again, but this is already marked as
"visited". So for the next step we need to find the next line not visited
yet. This is Line 1 in our case, where we have the same situation again.

Line Character list Forward link Visited
0 0 0 1
1 0 1 1 <--
2 0 2 0
... ... ... ...

Output so far: ________00

This continues always the same way until we reach line 4:

Line Character list Forward link Visited
... ... ... ...
3 2 3 1
4 A 9 1 <--
5 B 5 0
... ... ... ...
9 R 4 0

Output so far: ______2000

The forward link now points to line 9, which is not visited yet. As this
indicates a cycle with more than one character we delay the output of
the "A" in line 4 until we jump back. The next step in line 9:

Line Character list Forward link Visited
... ... ... ...
4 A 9 1
5 B 5 0
... ... ... ...
9 R 4 1 <--

Output so far: ____AR2000

First we output the "R" on line 9 and mark the entry as visited.
The forward link now points back to line 4, which has the visited flag
already set. This indicates we're at the end of the cycle and need to
send the delayed start character "A" to the output, too. As line 4 was
already visited we continue in the next line where visited is zero,
in our case line 5:

Line Character list Forward link Visited
... ... ... ...
5 B 5 1 <--
6 F 8 0
... ... ... ...

Output so far: ___BAR2000

Line 5 links to itself again, so we continue with line 6...

Line Character list Forward link Visited
... ... ... ...
6 F 8 1 <--
7 O 6 0
8 O 7 0

Output so far: ___BAR2000

In line 6 we enter a cycle with more than one character again, so we
delay the output of the "F" again and follow the link to line 8.
There we output an "O" and follow the link to line 7. Line 7 links back
to the start of the cycle, so in addition to the "O" on that line we
have to output the delayed "F" from the beginning of the cycle.

Line Character list Forward link Visited
... ... ... ...
6 F 8 1
7 O 6 1 <--
8 O 7 1

Output so far: FOOBAR2000

As there are no more lines left we could "visit", we're done.
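The whole reverse transformation fits in a few lines once you notice that the forward links are just the stable sort of positions by character, and that walking each cycle from the smallest unvisited line reads off one cycle's word from the sorted column, with the cycles concatenated in reverse order of discovery (a sketch, not David's code):

```python
def unbwts(last):
    n = len(last)
    # Stable sort of positions by character = the forward links.
    links = sorted(range(n), key=lambda i: last[i])
    first = sorted(last)        # alphabetically sorted column
    visited = [False] * n
    words = []
    for start in range(n):
        if visited[start]:
            continue
        # Walk one cycle, reading the sorted column as we go.
        j, word = start, []
        while not visited[j]:
            visited[j] = True
            word.append(first[j])
            j = links[j]
        words.append(''.join(word))
    # Output is built backwards: the cycle found first ends up last.
    return ''.join(reversed(words))

assert unbwts('0002RBOOFA') == 'FOOBAR2000'
assert unbwts('SNNBAAA') == 'BANANAS'
```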

-----------------------------------

In my opinion the practical merits of David's approach are questionable.
While I like the general idea of having a variation of the BWT without
an extra number in the output, the additional effort required just isn't
worth it. The algorithm is twice as complicated as the plain BWT and it
might (if at all) save only about 4 bytes for each block of input
(usually some megabytes).

Compression speed is also likely to be worse. In order to determine what
prefixes have to be cut off in the splitting phase, one needs to
repeatedly search for the smallest string in the remaining part of the
input buffer. I haven't made any benchmarks but judging from the structure of
the algorithm, this "preprocessing" can become quite time intensive,
especially with large block sizes and some decreasing sequence in the
beginning of the input blocks (e.g. "987654321").

Bye,
Klaus

biject

Dec 15, 2007, 9:21:52 AM
On Dec 15, 1:46 am, Klaus Stengel <Klaus.Sten...@asamnet.de> wrote:
> Phil Carmody wrote:

I think you have the basic picture. But I thought I would add a few
comments.
1) The extra work is, from a human point of view, already done. The
code exists.
2) It may not be much to you, but at least a few of us like pure
compression, so that any output file could be thought of as either a
compressed or an uncompressed file at the same time. We could not do
that with the standard BWT. I like pure bijective compression. It
would be sad not to have a bijective BWT, so that real bijective
compression could be done.
3) As it stands now the BWT and PPM compressors are close, but PPM
wins. One of the reasons it wins is that it's not so hard to convert
almost any PPM compressor to a bijective compressor. This could not be
done till now with BWT-style compressors. I predict that BWT and PPM
compressors will basically end in a dead heat now that truly bijective
BWTs are proved to exist.
4) I also like encryption. It's common practice to compress then
encrypt. One of the current problems in my mind is Shannon's unicity
distance, where an attacker has enough information to attack an
encrypted message by looking at some very small set of encrypted
blocks from the front of the file. Encrypted blocks are very small
compared to the BWT block size. With a bijective BWT this problem goes
away, especially with a binary sorted BWTS, and yes it will be slow.
5) I have encryption schemes where you can test a key and get a
failure because the compression function is not bijective. I like
every key tested to work, in the sense that on decryption you get
output that goes back to the same file with the key you tested.

******
But thanks for giving another example besides my bananas; your
explanation seems correct.
*****

But I would like to add a few direct comments. I will add more to the
-d since in the code I posted the arrays printed from the wrong spot.
People are under the illusion that a BWT transforms to a single cycle;
this is not true for all strings.
Example: ABCABCABC
BWT transforms to CCCAAABBB
In this case I get 3 cycles.
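A few lines make the point concrete (an illustrative Python sketch of the plain rotation-sort BWT, not David's code; the last column comes out as CCCAAABBB, and the standard last-to-first permutation splits into 3 cycles of length 3):

```python
def bwt(s):
    # Sort all rotations; the transform is the last column.
    rots = sorted(s[i:] + s[:i] for i in range(len(s)))
    return "".join(r[-1] for r in rots)

def t_vector_cycles(last):
    # Stable-sort the last column to get the first column, then count
    # the cycles of the resulting last-to-first permutation T.
    t = [j for j, _ in sorted(enumerate(last), key=lambda p: p[1])]
    seen, cycles = set(), 0
    for start in range(len(t)):
        if start not in seen:
            cycles += 1
            j = start
            while j not in seen:
                seen.add(j)
                j = t[j]
    return cycles

print(bwt("ABCABCABC"))             # CCCAAABBB
print(t_vector_cycles("CCCAAABBB")) # 3
```

A periodic input like this is exactly the case where following the T vector from one start position does not visit the whole string, which is why the plain "BWT plus one index" picture is incomplete.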

Last of all, a bijective BWT compressor is already there from these
stages:
BWTS MTFQ RLEQ BIACODE
Compared to the old one, it differs on the standard test files by 20
bytes total; even though more files compressed the same or smaller,
the overall result was 20 bytes worse.
In the old case I used BWTQ DSC.
DSC was written to combine a single pointer bijectively into a file.
It often shortens a BWTed file in the process, and it is better than
appending a universal number, which uses more space.
The later stages suck. I have better stages, but I still have to do
some work for a really kick-ass compressor. It's coming.
David A. Scott
Again, thanks for looking at it with a mind at least partly open,
which is better than most.

biject

Dec 15, 2007, 11:07:56 AM
On Dec 15, 1:46 am, Klaus Stengel <Klaus.Sten...@asamnet.de> wrote:
....

>
> Compression speed is also likely to be worse. In order to determine what
> prefixes have to be cut off in the splitting phase, one needs to search
> the smallest string repeatedly in the remaining part of the input
> buffer. I haven't made any benchmarks but judging from the structure of
> the algorithm, this "preprocessing" can become quite time intensive,
> especially with large block sizes and some decreasing sequence in the
> beginning of the input blocks (e.g. "987654321").
>
> Bye,
> Klaus

My first thought is that compression speed would be worse. My code is
not built for speed. But here are two thoughts.

Each cycle in my method is unique. The case abcababc is treated as 1
cycle in BWTS but 3 in UNBWTS, and that's OK. Since strings from
separate cycles are always different, the compare function need only
compare until the difference is found. It's only when two strings come
from the same cycle that they can be identical, and those are easier
to deal with since they would be shorter than the buffer in most
cases. My current compare functions don't use this info, so they are
slower than they could be.

You could sort each cycle independently and then merge the results,
since the order of the rows within a cycle is the same whether it
comes from one big sort or from separate cycle sorts. I chose the way
I did for ease of understanding. Note that sorting is at best an
n log n time function: a string of 8 characters is 8*3, or time 24; a
string of 16 would be 16*4, or 64; so 2 strings of 8 versus 1 string
of 16 would be like 48 versus 64. I hope you get the point. The speed
problem is not settled.
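The sort-then-merge idea can be sketched like this (illustrative Python, not the posted C; the cycles of "FOOBAR2000" serve as sample input, and the key repeats each rotation far enough to decide the infinite cyclic order):

```python
import heapq

def rotations(w):
    return [w[i:] + w[:i] for i in range(len(w))]

def omega_key(total_len):
    # Repeating a rotation out to twice the total input length is
    # always enough to decide the order of two cyclic repetitions.
    def key(r):
        return (r * (2 * total_len // len(r) + 1))[:2 * total_len]
    return key

cycles = ["FOO", "B", "AR", "2", "000"]   # cycles of "FOOBAR2000"
n = sum(len(c) for c in cycles)
key = omega_key(n)

per_cycle = [sorted(rotations(c), key=key) for c in cycles]  # sort each cycle
merged = list(heapq.merge(*per_cycle, key=key))              # then merge
global_sort = sorted((r for c in cycles for r in rotations(c)), key=key)
print(merged == global_sort)   # merging the sorted cycles == one big sort
```

Merging pre-sorted runs is cheap, so the expensive comparisons stay confined to rotations of the same (usually short) cycle.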

biject

Dec 15, 2007, 12:08:09 PM

Mark Nelson

Dec 15, 2007, 12:27:32 PM
On Dec 14, 7:48 am, Phil Carmody <thefatphil_demun...@yahoo.co.uk>
wrote:
>

> > I don't think this is quite correct Phil. If you look at the BWT
> > alone, it is truly a bijective function. It is a one-to-one mapping of
> > the set of input messages to the set of output messages, for all input
> > and output.
>
> That surprises me. Let's assume your output is <3,bnnsaaa>, then are
> you saying that there is no possible inverse image for <0,bnnsaaa>,
> <1,bnnsaaa>, <2,bnnsaaa>, <4,bnnsaaa>, <5,bnnsaaa>, and <6,bnnsaaa>?

Sorry, I introduced some confusion here. When I said the transform was
a one-to-one mapping, I was talking about just the output of the
transform, not the tuple <transform,index>. The latter is what is
normally needed to be able to invert the transform, but I was
considering the index to be extra information, not part of the BWT
output. David claims to be able to invert the transform using just the
former.


> Do you agree with all the '3's in the READ.ME file?
>

I don't exactly get what he's driving at there. It seems a little
extraneous though. I don't really need any proof that removing the
index shortens the data stream. What I need is to understand the
modification to the algorithm.

Mark Nelson

Dec 15, 2007, 12:53:19 PM
On Dec 15, 2:46 am, Klaus Stengel <Klaus.Sten...@asamnet.de> wrote:

Thanks very much for taking the time to provide this explanation,
Klaus.

My gut feeling is that modifying the transform in this fashion is
going to reduce the compressibility of the transformed string. The
whole point of BWT is exploiting correlation in the last column values
- for every occurrence of 'foo' in the text, we know there will be a
consecutive run of characters in the last column that immediately
precede 'foo'. Compressibility arises because all the characters that
precede 'foo' are collected together.

Using David's modified rotation, it seems that we are giving up some
of this adjacency clustering, which ought to have a bad effect on
compression. Testing needed.

biject

Dec 15, 2007, 1:55:38 PM

Mark, the fact is BWT does not make good use of the output
space as it is. BWTS gets to use these missed output files.
My gut feeling is that it will be better on average than
BWT + index. It's obvious the index adds waste in at least two
different ways.
1) You have to put the bits for the index somewhere in the
output file.
2) Even though to the eyeball two strings may look almost identical,
the fact is one might be UNBWTable and the other not, regardless of
index. The decompressor should be able to detect such a problem
before the standard UNBWT with the T vector is applied. Since it
can't, it will never be as good as PPM. Somehow one would have to
guarantee that as you decompress the later stages you automatically
end up with a string that is UNBWTable. That could happen, but I
don't currently see such a way.

As far as testing lots of files, I am not sure I am your guy.
It may end up like when, at China Lake, the Univac qsort
had the standard mods so sorted files or strings of constants
sorted quickly. But I found the N^2 pattern. It looked like garbage,
but it made the Univac grind to a halt. Some suspected me of
cheating, but I stated it was just luck that I happened on a sequence
that made qsort bite the dust. Many systems programmers today still
think that sorted files or long runs make for the worst case. They do
not; it just takes a little research to hit them.

If I pick "random" files to test, it will shine. If Phil picks them,
the old way will look best. But I suspect one time one will win,
another time the other. But at least mine gives people a chance
to make a BWT type of compressor that matches PPM.
Just my thoughts.

By the way, both versions, the old and the new, transform to the same
file; it's just that they can vary a lot in speed. The older version
is better for some test files. Just thought I would let you know.

It would take me some time to speed it up. The point is I want
people to understand it, after all.

How can I state it? You might be giving up some of the
advantages, but looking from the other side, giving up the index
is also a good thing. Please, someone test it!!

biject

Dec 15, 2007, 5:39:14 PM
On Dec 15, 10:53 am, Mark Nelson <snorkel...@gmail.com> wrote:

In a previous post I compared your way with the full gamut, and
mine beat it in every case; you can see the post above if you want.
But it wasn't fair, since you didn't use the full wrap-around BWT,
which cost you some space, and the indexes were all 4 bytes, again
costing space. See here the fair test with the full wrap-around BWT
and the indexing dropped. It should and does compress better
than mine. But when you add the indexing you'll see it's not
black and white.

It's hard to come up with a fair test. To set up an example, I took
your code for the later stages and my BWTS:
BWTS
MTF
RLE
ARI

Then I modified things to drop the index, so you're left
with the pure wrap-around BWT, no EOF stuff.
That is lossy compression, since the index is needed
to get back the file. Here are the results:

I call this the big fair test, and one that anyone
can duplicate. I used Mark's old code and dropped
in BWTS followed by MTF, RLE, and ARI. These are
not bijective, but it follows the compression using
only Mark's test code with the Calgary files.
I then reversed, and the uncompressed output matched.

12/15/2007 01:38 PM 29,376 bib.ddd
12/15/2007 01:42 PM 250,983 book1.ddd
12/15/2007 01:44 PM 169,725 book2.ddd
12/15/2007 01:46 PM 61,946 geo.ddd
12/15/2007 01:48 PM 125,194 news.ddd
12/15/2007 01:50 PM 10,793 obj1.ddd
12/15/2007 01:52 PM 79,609 obj2.ddd
12/15/2007 01:54 PM 17,606 paper1.ddd
12/15/2007 01:55 PM 26,829 paper2.ddd
12/15/2007 01:56 PM 16,899 paper3.ddd
12/15/2007 01:57 PM 5,469 paper4.ddd
12/15/2007 01:58 PM 5,062 paper5.ddd
12/15/2007 01:59 PM 13,020 paper6.ddd
12/15/2007 02:00 PM 54,283 pic.ddd
12/15/2007 02:02 PM 13,218 progc.ddd
12/15/2007 02:03 PM 16,836 progl.ddd
12/15/2007 02:05 PM 11,508 progp.ddd
12/15/2007 02:06 PM 19,144 trans.ddd
18 File(s) 927,500 bytes

The following is the lossy compression. I used BWTQ
without any indexing, so the result is not reversible,
but the output is a pure wrap-around BWT compression
using Mark's old code with no indexing, and the same
later stages.


12/15/2007 01:38 PM 29,371 bib.lll
12/15/2007 01:42 PM 250,979 book1.lll
12/15/2007 01:44 PM 169,715 book2.lll
12/15/2007 01:46 PM 61,939 geo.lll
12/15/2007 01:48 PM 125,185 news.lll
12/15/2007 01:50 PM 10,797 obj1.lll
12/15/2007 01:52 PM 79,610 obj2.lll
12/15/2007 01:54 PM 17,600 paper1.lll
12/15/2007 01:55 PM 26,827 paper2.lll
12/15/2007 01:56 PM 16,896 paper3.lll
12/15/2007 01:57 PM 5,467 paper4.lll
12/15/2007 01:58 PM 5,060 paper5.lll
12/15/2007 01:59 PM 13,021 paper6.lll
12/15/2007 02:00 PM 54,282 pic.lll
12/15/2007 02:02 PM 13,215 progc.lll
12/15/2007 02:03 PM 16,832 progl.lll
12/15/2007 02:05 PM 11,507 progp.lll
12/15/2007 02:06 PM 19,144 trans.lll
18 File(s) 927,447 bytes

If you notice, it saves 53 bytes over the 18 files.
1 file compressed to the same size, and mine even beat
the PURE BWT in 3 cases, even with the handicap
that mine has a built-in index and the plain BWT with no index
does not.


The following is the original files. Note
that if you carried the index you would have to use
less than 3 bytes on average or you lose the
savings.

At work, when dealing with byte-type files,
we did the following (you can use something else
if you compare):

Cxxxxxxx: if C is zero, it is not the last byte; if C equals one,
it is the last byte. A one-byte tag would cover
a pointer from 0 to 127 (2^7);
a two-byte pointer covers an additional 2^14,
up to 16,511 values;
you could use 3 bytes for the rest.

2 * 2 = 4
16 * 3 = 48

That means you would need 52 more bytes to carry the
index, so your way with this indexing method would
compress to 927,447 + 52 = 927,499.

The BWTS way, with no fancy indexing, is 927,500.

I don't think 1 byte is very much difference,
do you? If I replaced book2, say, with one similar
to book1, your way would be longer by 5 bytes.
And again, with the later stages written to compare and a
little playing around, the difference could be even more.

It looks like it wasn't in the cards. (the 52) :>
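The tagged-pointer scheme can be sketched like this (my reconstruction of the format described above, not code from the thread; the top bit of each byte marks the last byte, and shorter codes get an offset so that two bytes reach exactly 16,511):

```python
def encode_index(n):
    # k-byte codes cover [offset_k, offset_k + 128**k), where
    # offset_1 = 0, offset_2 = 128, offset_3 = 16512, ...
    k, offset = 1, 0
    while n - offset >= 128 ** k:
        offset += 128 ** k
        k += 1
    payload = n - offset
    out = bytearray()
    for i in reversed(range(k)):
        out.append((payload >> (7 * i)) & 0x7F)
    out[-1] |= 0x80          # high bit set tags the last byte
    return bytes(out)

def decode_index(data):
    n, offset, k = 0, 0, 0
    for b in data:
        n = (n << 7) | (b & 0x7F)
        if k:                 # add back the range used by shorter codes
            offset += 128 ** k
        k += 1
        if b & 0x80:
            break
    return n + offset

print([len(encode_index(v)) for v in (0, 127, 128, 16511, 16512)])
# [1, 1, 2, 2, 3]
```

With the Calgary file sizes above, PAPER4 and PAPER5 fit in 2 bytes and the other 16 need 3, which is where the 52-byte figure comes from.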

..


07/15/1991 12:55 PM 111,261 BIB
07/15/1991 12:56 PM 768,771 BOOK1
07/15/1991 12:56 PM 610,856 BOOK2
07/15/1991 12:56 PM 102,400 GEO
07/15/1991 12:56 PM 377,109 NEWS
07/15/1991 12:56 PM 21,504 OBJ1
07/15/1991 12:56 PM 246,814 OBJ2
07/15/1991 12:56 PM 53,161 PAPER1
07/15/1991 12:56 PM 82,199 PAPER2
07/15/1991 12:56 PM 46,526 PAPER3
07/15/1991 12:56 PM 13,286 PAPER4
07/15/1991 12:56 PM 11,954 PAPER5
07/15/1991 12:56 PM 38,105 PAPER6
07/15/1991 12:57 PM 513,216 PIC
07/15/1991 12:57 PM 39,611 PROGC
07/15/1991 12:57 PM 71,646 PROGL
07/15/1991 12:57 PM 49,379 PROGP
07/15/1991 12:57 PM 93,695 TRANS
18 File(s) 3,251,493 bytes

Take Care
David Scott
P.S. The point is this is the first pure bijective BWT,
the SCOTT way. Someone else could come up with
a different way of parsing or other means to make
a better version than mine. It's just a first cut.

/* pretend standard disclaimer here I may be drugged or
something like that */

Phil Carmody

Dec 16, 2007, 1:58:16 AM
Klaus Stengel <Klaus....@asamnet.de> writes:

Thanks for that wonderfully insightful post, Klaus.
More 'bits' in that post than most of the rest of the
year's posts put together!

biject

Dec 16, 2007, 10:40:17 AM
On Dec 15, 1:46 am, Klaus Stengel <Klaus.Sten...@asamnet.de> wrote:

Since most like Klaus's explanation, I thought I would fix the errors
or add a few points to clarify or obscure; it's sort of how you look
at it. I have spent over 3 hours trying to get this right. I don't
see how people document; it's a very hard job for me. When I worked
at China Lake I was lucky in that many people hated coming up with
code. I used to do it for many; they just had to do the pretty
paperwork BS. How you wrote this so fast is beyond me, so again
THANK YOU.

I left the old BWT stuff in, since it's the baseline you compare BWTS
against.

Believe it or not, after I left China Lake I did proofreading. What a
joke: I see what I want in text and seldom write what I want.
English sucks.

***** my comments start here, where you talk about BWTS ****

> David's basic idea is now to modify the algorithm to create an output
> containing multiple cycles by design. With multiple cycles available it
> is possible to split the input in a way that decoding from index zero
> for each part always yields the correct result. The information formerly
> stored in the extra index value is now encoded in the size and order of
> the cycles.
>
> To make this work we have to determine where to split up the input
> first. In David's program the function part_cycle() is responsible for
> this. When applying the BWT, line/index zero always contains the
> rotation with the lowest value. In case of "FOOBAR2000" this is
> "000FOOBAR2" (see forward BWT example from the beginning). As our

Actually, the reason for the two separate compare functions is here,
and I think he missed it. If you use the same compare you don't
get a good BWTS, and that may be why people missed this
bijective form of BWT.

I compare for the first partition with this version of the code:
000|000|000|000...
2000|2000|2000|...
AR2000|AR2000|AR2000|....
BAR2000|BAR2000|....
...

> original input string starts with "FOOBAR..." we need to split this into
> two cycles: "FOOBAR2" and "000". Now if we apply the BWT to the
> remaining "FOOBAR2" again, we notice that the index zero contains

Actually, we don't do a full BWT here, which implies a sort; we just
search for the lightest string for the part left, like this:

2|2|2|2|2|2|2|...
AR2|AR2|AR2|AR2|AR2|...
FOOBARS|FOOBAR2|...


> "2FOOBAR". This is still not what we want, so we create another part

Again, note that "2FOOBAR" is not even a string tested during the
partitioning of the string.

> containing only "2". Now we try again with "FOOBAR" and get "ARFOOB"...
> When we continue to chop off the unwanted prefixes and note them as we
> go along, we'll finally get the remainder "FOO" which is still looking
> the same in index zero of the BWT. The complete partitioning sequence is
> as follows:
>
> FOOBAR2000 -> 000 FOOBAR2
> FOOBAR2 -> 2 FOOBAR
> FOOBAR -> AR FOOB
> FOOB -> B FOO
> FOO -> FOO <-- equal, we're done
>
> FOO|B|AR|2|000
> `-ยด U `ยด U `-ยด
>

Notice the string of zeros. Actually, version one of the code treats
each zero as a separate cycle, and it actually sorts faster than the
current version, which matches the comments. I thought I changed the
code to match the comments, but it seems to slow the execution for
certain files. The only change in the code is to "return 0;" instead
of "return i<j?-1:1;" in the last line of bounded_compareS(). But let
me stress that the output of the sort does not change.

The line would be FOO|B|AR|2|0|0|0

but in either case they sort to the same output.

For this particular form of the code, the "return 0;" maps repeated
cycles like |0|0|0| to one cycle |000|.
EVERY CYCLE IS UNIQUE, so if you write the compare function
(and I have not yet) to take advantage of this fact, you will only
compare until a DIFFERENCE; they can never be the same.

Now, if the compare is between strings from the same cycle, they can
be identical, just as in BWT, so that's exactly the same problem that
the regular wrap-around BWT solves.

The reverse is pretty much the same as the regular BWT: you do each
cycle, and that decodes each partition. I feel the reverse is more
obvious, and I think he has it.

So thank you, Klaus. Everybody likes your write-up; I only added
a few points, and I hope others can see some of my added points.

I like
0002RBOOFA, with its repeat of 000 and OO and no index, better than
the BWT's
200RBO0OFA,6 with its extra "index" and only the 00 repeat. Great
example.

Take Care
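David's "no index" claim for this example can be checked directly (an illustrative Python sketch, not his C code; the cycles FOO|B|AR|2|000 are taken from the quoted walkthrough, and rotations are compared as infinite cyclic repetitions):

```python
# Cycles of "FOOBAR2000" as derived in the quoted walkthrough.
cycles = ["FOO", "B", "AR", "2", "000"]
n = sum(map(len, cycles))

# All rotations of every cycle, sorted under the infinite-repetition
# order; repeating out to 2*n characters is enough to decide it.
rots = [w[i:] + w[:i] for w in cycles for i in range(len(w))]
rots.sort(key=lambda r: (r * (2 * n // len(r) + 1))[:2 * n])

print("".join(r[-1] for r in rots))  # 0002RBOOFA, N bytes to N bytes
```

The last column of the sorted list is exactly the 0002RBOOFA David prefers, with no index byte anywhere.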

Klaus Stengel

Dec 17, 2007, 8:03:39 PM
biject wrote:
> On Dec 15, 1:46 am, Klaus Stengel <Klaus.Sten...@asamnet.de> wrote:
>> To make this work we have to determine where to split up the input
>> first. In David's program the function part_cycle() is responsible for
>> this. When applying the BWT, line/index zero always contains the
>> rotation with the lowest value. In case of "FOOBAR2000" this is
>> "000FOOBAR2" (see forward BWT example from the beginning). As our
>
> actually the reason for two seperate compare functions is here.
> And I think he missed it. If you use the same compare you don't
> get a good BTWS and that may be way people missed this
> bijective form of BWT

Thanks for pointing this out. I did notice the two different compare
functions, but I wasn't fully aware of all the effects caused by that
difference, so I made a mistake in the description of the splitting
process. Other parts of my description are not affected. The input
"FOOBAR2000" unfortunately doesn't exhibit any problems and thus is a
somewhat awkwardly chosen example. Let's take "BAACA" instead. After
transformation the first line of the BWT table contains "AACAB".

>> original input string starts with "FOOBAR..." we need to split this into
>> two cycles: "FOOBAR2" and "000". Now if we apply the BWT to the
>> remaining "FOOBAR2" again, we notice that the index zero contains

Now according to my original description we remove "AACA" and get "B" as
the remainder. With "B" as the only character left (which of course
always maps to itself) we'd be done. The problem is the cycle "AACA" we
just split off. If we perform the BWT on that part, we'll get "AAAC" for
the first index instead of the required "AACA". So my original
description obviously doesn't work in all cases.

One solution is to check _both_ parts after each split operation and to
subdivide those parts further until each single part satisfies the
requirement. For "BAACA" we get the following tree, where the leaf nodes
contain the cycles we're looking for:

     BAACA
    /     \
  AACA     B
  /  \
 A    AAC

--> B|AAC|A

An entirely different way of tackling this problem is to just start
with an empty buffer and add characters from the input until it breaks,
i.e. the cycle would no longer end up on the first line of the BWT
table. Then we know we have to start another cycle before this very
character. Now we just need some way to know when this happens and
fortunately it exists.

The following text is a somewhat lengthy description of how I came up
with the details. If you're just interested in the results, you can
safely skip down to the line where it says "So the rules are as follows".

If we look at strictly increasing character sequences, e.g. something like
"DEF", we can easily show that these will always end up in the first
line. If you look at the BWT table it should be obvious:

DEF
EFD
FDE

There is just no other character in the string that could occupy the
first position except the "D", because the character with the lowest
value is always in the beginning. So as long as our characters are in
strictly increasing order we can continue without worrying. But what if
our sequence isn't strict, i.e. has runs of the same character in it?
Let's try with "DDDEEF":

DDDEEF
DDEEFD
DEEFDD
....

Still no problem because the longest run of the smallest character must
end up in the first position. Now we'll append some characters that have
a lower value than the preceding ones and look what happens. We'll start
carefully with the previous example and try adding an "E":

DDDEEFE
DDEEFED
....

We can easily see that as long as the added character is bigger than the
first one in the sequence we don't have to worry. So let's try adding a
"D" next:

DDDDEEFE
DDDEEFED
...

Now this no longer works. But if we look closer, we notice that it
breaks only because adding the same character from the beginning to the
end just extends the character run in the beginning. If we add multiple
different characters, e.g. "DE", we won't enlarge the run in the
beginning and the original string indeed stays on first position:

DDDEEFDE
DDEEFDED
...

Now if we continue this game we'll eventually come to the conclusion
that as long as we append strings that are larger, i.e. alphabetically
sort later, than the string that is already in the cycle buffer we can
continue adding characters without trouble. In order to find out if
we're adding a such a string we need to know how many characters we
added recently match the beginning of the string, thus we need a
position marker. It advances each time a character equal to the marked
one is added. Each time we encounter a bigger character than the marked
one we can be sure the string is larger so we reset it to the first
character again. If the new character is smaller or we reach the end of
the input we need to close the current cycle and open a new one.
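These "stays on the first line" arguments are easy to check mechanically (a small sketch; a string occupies line zero of its own BWT table exactly when it is the smallest of its rotations):

```python
def stays_on_first_line(w):
    # True if w itself is the lexicographically smallest rotation,
    # i.e. it would sit at index zero of its own BWT table.
    return w == min(w[i:] + w[:i] for i in range(len(w)))

print(stays_on_first_line("DEF"))       # True: strictly increasing
print(stays_on_first_line("DDDEEF"))    # True: runs are fine
print(stays_on_first_line("DDDEEFE"))   # True: E > first character
print(stays_on_first_line("DDDEEFED"))  # False: appending D breaks it
print(stays_on_first_line("DDDEEFDE"))  # True: appending DE is safe
```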

Closing a cycle is actually somewhat more complicated than one might
think at first glance. To illustrate this we look at the example
"FOOFOE". We already added "FOOFO" to our cycle and want to add the
final "E" now. The marker is currently at the second "O" because of the
repeating "FO" sequence.

FOOFO
  ^

If we simply closed the cycle at the end we won't get what we probably
expected:

FOFOO
FOOFO
...

This happens because the character where the marker is ("O") is bigger
than the first character in the cycle ("F"). When we closed the cycle we
implicitly created a string "FOF", which alphabetically sorts before
"FOO". Put in another way, we might say that closing a cycle is somewhat
like appending the first character to the end of the string again, which
is smaller in our case than the marked character and thus makes trouble.

But now our marker comes to the rescue. To avoid this we obviously have
to close the cycle early, more specifically we need to cut off the
number of characters the marker is away from its starting position.
That number just shows us how long the repetitive part is that
displaced our desired string from the first line. In this example this
is two characters:

FOO|FO
  ^

FOO   FO
OFO   OF
OOF

I want to stress that it's sheer coincidence in this case that the cut
happens directly after the marker. With more complex strings (e.g. like
the "1212AB120" shown later) this is not always the case. Don't jump to
the conclusion we need to cut after the marker by just looking at the image.

Now we advance the pointer to the new cycle and try to add the "E" again:

FOO|FO
    ^

As "F" > "E", this is not a good idea and we have to close the current
cycle again. But this time the marked character is the same as the first
one. In that case we don't have to worry about destroying our cycle by
closing it at the end:

--> FOO|FO|E

In order to clarify some things I'd like to use the more complex string
"1212AB120" as another example. When we want to append the final "0" we
have the following situation:

1212AB12
  ^

121212AB
1212AB12
...

This time the "1" of the marker matches the "1" in front of the cycle.
We notice however that the marker does not point to the beginning. This
indicates again a repetitive pattern that matches the first
characters, so it would destroy the sorting again and a look at the BWT
table of the whole cycle confirms this. We can learn from this example
that it doesn't matter at all if the character values of the first and
marked one match. The value difference may also occur somewhat later in
the cycle. Yet one can always be sure that the string starting at the
marked position alphabetically sorts after the whole cycle string.
In fact this is the very reason why we established those rules in the
first place:

1212AB12   <-- complete cycle buffer
  12AB12   <-- characters from marker position onwards

As the pointer is two positions away from the beginning in our example,
we know that we need to split away the last two characters and put them
in the next cycle.

1212AB|12
  ^
Here we can also see that the new cycle doesn't necessarily start after
the marker position.

So the rules are as follows:

- The current cycle buffer is initialized with the first character
from the input string and the marker initially sits on that first
character.

- If the next input character is ...

* bigger than the marked character we reset the marker to the first
character again and add that character to the current cycle.

* the same as the marked character we move the marker one position
to the right and add the character to the current cycle.

* smaller than the marked character or no more input characters are
available we create a new cycle. This new cycle has to take the
last N characters from the current cycle, where N is the current
marker position counted from zero from the start of the
current cycle buffer. If N = 0 (i.e. the marker sits on the first
position) and there are no more input characters we stop here,
else we put the input character in the new (otherwise empty)
cycle. If N > 0 we do _not_ add the input character yet but
start the process of adding the character once again to make sure
we don't need to create any additional partitions first. No
matter what the value of N was, the marker should finally point
to the first character of the new cycle and we make the new cycle
current.

If we look at those rules it's obvious that this algorithm is in O(n)
speed wise and also O(n) memory wise as it eats up exactly one character
in almost every step and can never create more partitions than the
length of the input string. It can be easily proven that the only case
where it doesn't eat a character (third condition) can be entered at most
twice in a row.

So if done right, David's idea can be implemented as an inexpensive
preprocessing step to a slightly modified BWT (irregular cycles instead
of a big one as input). The number of cycles created depends very much
on the structure of the input data. My general estimate is that strings
containing more decreasing sequences tend to trigger the third rule more
often and thus create more partitions.

For the sake of completeness, the "FOOBAR2000" example looks like this:

Input char | Cycle Buffer
F          | F
           | ^
O          | FO
           | ^
O          | FOO
           | ^
B          | FOO|B
           |     ^
A          | FOO|B|A
           |       ^
R          | FOO|B|AR
           |       ^
2          | FOO|B|AR|2
           |          ^
0          | FOO|B|AR|2|0
           |            ^
0          | FOO|B|AR|2|00
           |             ^
0          | FOO|B|AR|2|000
           |              ^

And "BAACA" like this:

Input char | Cycle Buffer
B          | B
           | ^
A          | B|A
           |   ^
A          | B|AA
           |    ^
C          | B|AAC
           |   ^
A          | B|AAC|A
           |       ^


>> In case we have to compare strings of different lengths while sorting,
>> we need to keep in mind that we're actually comparing infinite cyclic
>> character sequences. That's why we need to extend the shorter one by
>> repeating it until a difference occurs or both were repeated at least
>> once. For example if we want to compare "AXYA" with "AXY" we have to
>> expand "AXY" by repeating it: "AXYAXY". Now "AXYA" is again shorter
>> and equals "AXYAXY" in all available characters, so we have to repeat
>> it too: "AXYAAXYA". Now we see that "AXYAAXYA" is smaller than
>> "AXYAXY", as the fifth character "A" < "X". In the sorted list "AXYA"
>> must be positioned before "AXY".
>>
>
> For this particular form of the code, the "return 0;" maps repeated
> cycles like |0|0|0| to one cycle |000|.
> EVERY CYCLE IS UNIQUE, so if you write the compare function
> (and I have not yet) to take advantage of this fact, you will only
> compare until a DIFFERENCE; they can never be the same.

I guess it's just a matter of taste if you want to split up repetitive
cycles. If you want to merge longer repetitive cycles, e.g. |FOO|FOO|
you'll need to do a full cycle comparison somewhere to rule out any
non-repetitive elements, which can be expensive. However I've still no
idea why this should guarantee that every cycle will be unique then. Or
did you mean something entirely different? And what would be the
advantage of doing that?
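The quoted infinite-repetition comparison can be written as a key function (a sketch; repeating each string out to the combined length of the two is always enough for the order to settle):

```python
def omega_cmp_key(r, limit):
    # Expand r as if repeated forever, truncated at `limit` characters.
    return (r * (limit // len(r) + 1))[:limit]

a, b = "AXYA", "AXY"
limit = len(a) + len(b)          # enough to expose any difference
print(omega_cmp_key(a, limit))   # AXYAAXY
print(omega_cmp_key(b, limit))   # AXYAXYA
print(omega_cmp_key(a, limit) < omega_cmp_key(b, limit))  # True
```

So "AXYA" sorts before "AXY", exactly as in the quoted example.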

>> The resulting transform is the last character of each cycle in the
>> sorted list, almost like the ordinary BWT. Saving some index is no
>> longer required.
>>
>> --> 0002RBOOFA
>>

> I like
> 0002RBOOFA, with its repeat of 000 and OO and no index, better than
> the BWT's
> 200RBO0OFA,6 with its extra "index" and only the 00 repeat. Great
> example.

Yes, in this particular case. Try it with "00FOOBAR20" instead of
"FOOBAR2000" as the input and you'll get "020RBO0OFA" as a result. I
guess any improvements in compressibility happen mostly by chance and
it's easy to generate counterexamples by trying different rotations of
the same input.

Bye,
Klaus.

biject

Dec 18, 2007, 1:07:17 AM
On Dec 17, 6:03 pm, Klaus Stengel <Klaus.Sten...@asamnet.de> wrote:

....

>


> >> In case we have to compare strings of different lengths while sorting,
> >> we need to keep in mind that we're actually comparing infinite cyclic
> >> character sequences. That's why we need to extend the shorter one by
> >> repeating it until a difference occurs or both were repeated at least
> >> once. For example if we want to compare "AXYA" with "AXY" we have to
> >> expand "AXY" by repeating it: "AXYAXY". Now "AXYA" is again shorter
> >> and equals "AXYAXY" in all available characters, so we have to repeat
> >> it too: "AXYAAXYA". Now we see that "AXYAAXYA" is smaller than
> >> "AXYAXY", as the fifth character "A" < "X". In the sorted list "AXYA"
> >> must be positioned before "AXY".
>
> > For this particular form of the code, the "return 0;" maps repeated
> > cycles like |0|0|0| to one cycle |000|.
> > EVERY CYCLE IS UNIQUE, so if you write the compare function
> > (and I have not yet) to take advantage of this fact, you will only
> > compare until a DIFFERENCE; they can never be the same.
>
> I guess it's just a matter of taste if you want to split up repetitive
> cycles. If you want to merge longer repetitive cycles, e.g. |FOO|FOO|
> you'll need to do a full cycle comparison somewhere to rule out any
> non-repetitive elements, which can be expensive. However I've still no
> idea why this should guarantee that every cycle will be unique then. Or
> did you mean something entirely different? And what would be the
> advantage of doing that?
>

depending on how you parse the "return " statement differece in
the two versions of code. You could break a single cycle that
repedative
intro several cycels.

But other than this one type of case you never get something like

if partituion ...xyz.....xyz....
a set of cycles like ...|xyz|....|xyz|... not vaiid if different
intervening string
like cycles only come from continuous strecthes in
the string. Other than that all cycle are unique. Which means
the compare for things from diffent cycles are unique.

example say you have n cycles you could sort them seperately just
like old bwt and then merge them since they would be in the order of
how they are used. This could make for fast coding.

example to cycles say sort to {ABCD} not I mean anything not literal
and {abcd}

they might merge like {AabBCcdD] or whatever you get the picture

No partition has the first and last character the same:
|x.....x| is not valid unless all the characters are the same.
No partition splits a run of the same character:

.....xxxx|xxxx... is not valid, except in the case where you
wish to treat a run of all the same character as separate partitions.

One last fact: if |xyz| is a cycle, then there is no cycle |yzx|.

As to the advantages of observing this: it is how one would code the
compares for speed. I did not do this. There are many ways to speed
it up.

> >> The resulting transform is the last character of each cycle in the
> >> sorted list, almost like the ordinary BWT. Saving some index is no
> >> longer required.
>
> >> --> 0002RBOOFA
>
> > I like
> > 0002RBOOFA, with its repeats of 000 and OO and no index, better than
> > the BWT's
> > 200RBO0OFA,6 with its extra "index" and only the 00 repeat. Great
> > example.
>
> Yes, in this particular case. Try it with "00FOOBAR20" instead of
> "FOOBAR2000" as the input and you'll get "020RBO0OFA" as a result. I
> guess any improvements in compressibility happen mostly by chance and
> it's easy to generate counterexamples by trying different rotations of
> the same input.
>

Yes, when one picks the test, one gets the results one wants.
The real question in many minds may be how it works with real-life
large files. I suspect on the average it will beat BWT+INDEX, but
only time will tell. One thing is for sure: the old BWT+INDEX is not
good enough to get to codes that match good PPM code. And this saves
a lot of coding for the indexing, which everyone does differently,
and if one is doing lots of blocks those repeated indexes can be
a pain to code and handle, especially when this is functionally
so easy to use. How many programmers use qsort or heapsort?
They don't need to know how to code it, only how to call it.
This will be much easier to use.


> Bye,
> Klaus.

Klaus Stengel

unread,
Dec 19, 2007, 1:42:23 PM
to
biject wrote:
> On Dec 17, 6:03 pm, Klaus Stengel <Klaus.Sten...@asamnet.de> wrote:
>> I guess it's just a matter of taste if you want to split up repetitive
>> cycles. If you want to merge longer repetitive cycles, e.g. |FOO|FOO|
>> you'll need to do a full cycle comparison somewhere to rule out any
>> non-repetitive elements, which can be expensive. However I've still no
>> idea why this should guarantee that every cycle will be unique then. Or
>> did you mean something entirely different? And what would be the
>> advantage of doing that?
>>
>
> if the partition is ...xyz.....xyz....
> a set of cycles like ...|xyz|....|xyz|... is not valid if there is a
> different intervening string.
> Like cycles only come from contiguous stretches in
> the string. Other than that all cycles are unique. Which means
> the compares between elements of different cycles always find a difference.

You're right, I didn't see the wood for the trees. From my description
of the algorithm one can see that you just split up the input string into
smaller parts in descending order. When one removes repetitions from a
descending sequence, this results in a strictly descending sequence. And a
strictly descending sequence is always built from unique, distinguishable
elements.

As we know that we already have a descending sequence, the
transformation to a set of unique elements isn't even expensive. We just
have to compare all adjacent cycles in order and join them if they match.
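That join step is cheap to sketch in code. The following is my own
illustration in Python (the thread's actual code is C, and the function
name is mine): assuming the partitions arrive as a list of cycles in
non-increasing order, one linear pass merges each cycle into the run
before it whenever that run is a repetition of it.

```python
def merge_repeats(factors):
    """Join adjacent cycles when the run built so far is a power of the
    next cycle, e.g. ["0", "0", "0"] -> ["000"], ["FOO", "FOO"] -> ["FOOFOO"].
    """
    merged = []
    for f in factors:
        last = merged[-1] if merged else None
        # the run so far repeats f iff it equals f written out
        # len(last) // len(f) times
        if last and len(last) % len(f) == 0 and last == f * (len(last) // len(f)):
            merged[-1] += f
        else:
            merged.append(f)
    return merged

print(merge_repeats(["FOO", "B", "AR", "2", "0", "0", "0"]))
# ['FOO', 'B', 'AR', '2', '000']
```

After this pass every cycle is distinct from its neighbours, which is
exactly the strictly descending sequence described above.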

> Example: say you have n cycles. You could sort them separately, just
> like the old BWT, and then merge them, since they would be in the order
> in which they are used. This could make for fast coding.
>
> Example: two cycles, say they sort to {ABCD} and {abcd} (note: I mean
> anything, not these literal characters).
>
> They might merge like {AabBCcdD} or whatever; you get the picture.

I don't think so, Tim. Just look at the "FOOBAR2000" example. I wrote
the cycle number and the position within the cycle in parentheses after
each line so that it's easier to track where each output character has
its origin.
_
FOO FOOFOO... (0,0) 00|0| (4,0)
OFO OFOOFO... (0,1) 00|0| (4,1)
OOF OOFOOF... (0,2) 00|0| (4,2)
B BBBBBBBB... (1,0) |2| (3,0)
AR ARARARA... (2,0) A|R| (2,0)
RA RARARAR... (2,1) --> "merge" --> |B| (1,0)
2 22222222... (3,0) FO|O| (0,0)
000 000000... (4,0) OF|O| (0,1)
000 000000... (4,1) OO|F| (0,2)
000 000000... (4,2) R|A| (2,1)
`-´
Since the cycles are already in descending order, you can only guarantee
that all lines marked (x,0) are spit out in reverse order. Where all the
other entries will end up isn't that clear, as you can see from the "RA"
in the last line. Of course there is some tendency for the individual
cycles to stay together, but this is not generally the case. I guess
it's even possible to craft input strings where the lines from different
cycles are distributed almost completely at random between the (x,0) ones
after performing the BWT.
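The whole transform in the worked example above is short enough to
sketch. This is my own Python illustration (all function names are
mine, not David's C code): factor the input into cycles, sort every
rotation of every cycle as an infinite periodic string, and emit the
character that cyclically precedes each rotation within its own cycle.

```python
from functools import cmp_to_key

def factor(s):
    """Split s into the cycles discussed in this thread (Duval's
    algorithm); the cycles come out in non-increasing order."""
    factors, i, n = [], 0, len(s)
    while i < n:
        j, k = i + 1, i
        while j < n and s[k] <= s[j]:
            k = i if s[k] < s[j] else k + 1
            j += 1
        while i <= k:
            factors.append(s[i:i + j - k])
            i += j - k
    return factors

def bwts(s):
    rots = [(w, i) for w in factor(s) for i in range(len(w))]
    def cmp(a, b):
        (wa, ia), (wb, ib) = a, b
        # two periodic strings agree everywhere iff they agree on the
        # first len(wa) + len(wb) characters
        for k in range(len(wa) + len(wb)):
            ca, cb = wa[(ia + k) % len(wa)], wb[(ib + k) % len(wb)]
            if ca != cb:
                return -1 if ca < cb else 1
        return 0
    rots.sort(key=cmp_to_key(cmp))
    # output the character cyclically preceding each rotation's start
    return "".join(w[(i - 1) % len(w)] for w, i in rots)

print(bwts("FOOBAR2000"))   # 0002RBOOFA, as in the table above
print(bwts("00FOOBAR20"))   # 020RBO0OFA, as in the earlier post
```

This reproduces both ten-character examples from the thread; note the
periodic comparison can touch up to len(wa) + len(wb) characters, which
is where the speed discussion below comes from.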

> No partition has the first and last character the same:
> |x.....x| is not valid unless all the characters are the same.
>
> No partition splits a run of the same character:
> .....xxxx|xxxx... is not valid, except in the case where you
> wish to treat a run of all the same character as separate partitions.

I agree. I think I proved those two statements somewhere in my previous
postings already.

> one last fact if |xyz| is a cycle then there is no cycle |yzx|

Getting a cycle "yzx" is entirely impossible, regardless of whether there
was an "xyz" somewhere else or not. The "x" at the end comes before "y"
(i.e. the first character of the cycle) in the alphabet, thus you need to
split after "yz" to obtain valid results in any case.

> As to the advantages of observing this: it is how one would code the
> compares for speed. I did not do this. There are many ways to speed
> it up.

That property of descending order can speed up the comparison only if you
compare two cycles each from the beginning, and the cycles don't even need
to be unique then. If the comparison starts somewhere in the middle of a
cycle you can't take any shortcuts, and that makes up the majority of
cases.

Bye,
Klaus

Klaus Stengel

unread,
Dec 19, 2007, 1:46:12 PM
to
Klaus Stengel wrote:
> So the rules are as follows:
>
> - The current cycle buffer is initialized with the first character
> from the input string and the marker initially sits on that first
> character.
>
> - If the next input character is ...
>
> * bigger than the marked character we reset the marker to the first
> character again and add that character to the current cycle.
>
> * the same as the marked character we move the marker one position
> to the right and add the character to the current cycle.
>
> * smaller than the marked character or no more input characters are
> available we create a new cycle. This new cycle has to take the
> last N characters from the current cycle, where N is the current
> marker position counted from zero from the start of the
> current cycle buffer. If N = 0 (i.e. the marker sits on the first
> position) and there are no more input characters we stop here,
> else we put the input character in the new (otherwise empty)
> cycle. If N > 0 we do _not_ add the input character yet but
> start the process of adding the character once again to make sure
> we don't need to create any additional partitions first. No
> matter what the value of N was, the marker should finally point
> to the first character of the new cycle and we make the new cycle
> current.

There is a mistake in that last sentence. If we always reset the marker
to the start of the new cycle, we'll miss some subdivisions. For
example, "FOOFOOFO1" would be partitioned into "FOO|FOOFO|1" instead of
the correct "FOO|FOO|FO|1".

If the marker is already in the new cycle, it has to stay in that
position. Only if it stayed in the old cycle do we need to advance it
to the start of the new one.

This has no impact on the fact that this algorithm is O(n), as the
maximum number of partitioning steps is still limited to the number of
input characters.
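As far as I can tell, the corrected rules above are equivalent to
Duval's linear-time Lyndon factorization. A compact Python sketch (my
own; variable names are mine):

```python
def partition(s):
    """Split s into cycles in O(n): each cycle is strictly smaller than
    all of its rotations, and the cycles come out in non-increasing order."""
    cycles, i, n = [], 0, len(s)
    while i < n:
        j = i + 1   # the next input character
        k = i       # the marked character
        while j < n and s[k] <= s[j]:
            # bigger: reset the marker to the start; equal: advance it
            k = i if s[k] < s[j] else k + 1
            j += 1
        while i <= k:
            cycles.append(s[i:i + j - k])  # emit one cycle of length j - k
            i += j - k
    return cycles

print(partition("FOOFOOFO1"))   # ['FOO', 'FOO', 'FO', '1'], as corrected above
print(partition("FOOBAR2000"))  # ['FOO', 'B', 'AR', '2', '0', '0', '0']
```

The inner `while i <= k` loop is what emits the extra subdivisions the
correction is about: "FOOFOOFO1" yields FOO|FOO|FO|1 rather than
FOO|FOOFO|1.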

Bye,
Klaus.

biject

unread,
Dec 19, 2007, 8:02:22 PM
to
> Klaus.

Let me give another example of what I meant by breaking into cycles
and then merging:
|b...iththj...v|a....eththd...x|
The above is an example of the two splits, where |a....eththd...x| is
less than |b...iththj...v|.

when sorting the partitions you get two lists

a....eththd...x 1 (the partition itself is first for its cycle)
d...xa....ethth 2
eththd...xa.... 3
hd...xa....etht 4
hthd...xa....et 5
thd...xa....eth 6
ththd...xa....e 7
xa....eththd... 8
Of course many other rows are missing; this just shows the relative
order of these.

next partition

b...iththj...v A (again many rows missing, just to show the trend)
iththj...vb... B
j...vb...ithth C
hj...vb...itht D
hthj...vb...it E
thj...vb...ith F
ththj...vb...i G
vb...iththj... H

To do BWTS you now merge these lists:

1 (of course the master cycle is top dog)
A (though some others missing from the master cycle could be above this)
2
3
B
C
4 (goes to t; the merging can increase runs)
D (goes to t)
5 (goes to t)
E (goes to t)
6
F
7
G
H
8

Note I mentioned two compare functions: the one for separate cycles,
( ) ( ),
and the one for embedded cycles, ( ( )).
Well, many people like having a $ super-heavy end symbol, ( $)
versus ( $).
I think that compare code would be the same as ( ))
( $)) for the compare when
making partitions. There are many ways to slice and dice it. I like
the one that looks
more like the original BWT method, but that's me; tomorrow I may change
my mind. In fact,
once you see one, more jump out at you.
