A break down of the different portions of the invention, known as
Harrington's Compression Method, or HCM for the layman.
~~~~~~~~
The First Fundamental is the Huffman.
A Huffman is a way for a computer to change a data line, and be able
to go back and change it back with out any errors, loss, or the likes.
It does this through identifying the following bit in a sequence of
bits until it recognizes a pattern.
This can be a typical Huffman
11 = 0
10 = 10
01 = 110
00 = 111
A Huffman can be patterned in many ways, and many lengths. The above
example works on switching 1's for 0's, and if 11 results outnumber 01
and 00 results it will also result in compression. A Huffman is a very
adaptable format, with many advantages to compression.
In the example above if the Huffman sees a 0 as the first bit it is
currently examining, it know it is actually a result and can end
identifying this group there. If the next bit is a 1, then it knows to
seek the following bit out and identify it. If it is a 0, then the
final result is a 10, and it ends that group there. Each is
identifiable in turn. Reversal of a Huffman is both easy and exacting.
There is no errors so long as no corruption occurred in the code.
~~~~~~~~~~
The Second Fundamental is Pascals triangle.
Pascals triangle has applications in many many different fields. For
our purposes the Triangle is a perfect tool for identifying the odds
of certain outcomes in certain sized files.
A triangle is made rather easily. The first two results are 11, and
are applied to the top of the triangle.
1-1
The left most, and right most result are always 1, however for those
used to a table, we can say the left is always a 1.
Therefore our Triangle starts to look like this:
1-1
1-
1-
1-
1-
now the second row down, 2nd column and on, operate in this manner:
Add the bit directly above the desired result, and the bit to the left
of that one.
1-1
1-2-1
1-3-3-1
1-4-6-4-1
1-5-10-10-5-1
This will always result in a SUM count over a row equal to the total
possible outcomes of that many bits, as represented in column 2. This
is always true. The row can be looked at like this. Assuming 0 is the
left, and 1 is the right, the first number represents the number of
outcomes with all 0's, the next is the number of bits, -1, that are
0's with +1 of the 1's. The 2nd column is all zero's minus two, and
plus two ones. This goes on until there is no more 0's. Therefore at
100 bits the area with 30 total 0's and 70 total 1's will have around
29,372,339,821,610,900,000,000,000 (with some rounding occurring
(darned maximums of OpenOffice and Excel) outcomes. As we progress in
total number of bits we end up with a odds of occurrence of most
outcomes (to a degree which rounds to 100% just by the decimal
displacement generated) to lay in/close to, the 50% range, with a
deviation of less than 1% in either direction.
This is a basic law of large software, so long as it is entropic, or
random. If the software is text based, largely non-random, or is
highly imbalanced in a low bit ratio, then it will fall outside this
norm.
~~~~~~~
The Third Fundamental is Separate Filing
This is a new concept to some extent. In typical compression the usage
of multiple files, of multiple sizes, of multiple possible names,
results in a potential compression method, for extremely imbalanced in
ratio levels information. It is extremely ineffective, and unlikely to
compress. The problem ultimately is an issue of the command section,
or real size of a factual file, to hold the data separated.
The numbers feel impressive, for this can be an example:
Assumption, up to 3 files, 1 to 3 bits in each. Each file therefore
has 14 outcomes. This is 14 * 14 * 14 + 14 * 14 + 14 * 14 + 14 * 14 +
14 + 14 + 14, or a total of 3374. This is 11.7202 bits in ratio. Where
we had 1 to 3 bits, with an average of 2.57 bits per file, of 1 to 3
files, and a predictable number of around 2.5 files averaged. This is
7.19 bits averaged! It seems wonderful... until you need to count how
many bits per which file. The command section requires 1 bit per file
to say if it is active, 1 bit per an active file to identify if there
is 1 bit or 2 bits, and this adds up fast. So fast that it out does
any gains statistically, and renders itself moot.
However if a means existed to sort information to different files,
without an overhead cost to keep track of this information then a huge
increase would occur in the possible data being stored per bit. As
would the ability to have different numbers of bits in a file tracked
without cost. Either way would be an increase, together would be a
dramatic increase in the capacity for our purposes.
~~~~~~~~~
The Fourth Fundamental is perfect tracking
If Step A tells you to do 00 = 11 then step B tells you 11 = 10, and
your command section is AB, then if your code was 001110 you get
instead 100011 as your final results. Reversing it is as simple as
doing the reverse, BA.
~~~~~~~~~
The Fifth Fundamental is Lossy versus Lossless
A lossy code system can mistake two results for each other. An example
of a lossy encoder would be one which takes a word, or words, it knows
in context, and tries to repair them. An example is “If then” type
statements. If the If then is in a programming context then the
software may be able to just predict a lot of the system, but if it
happens to be in a text portion it may lose some of the text to
errors.
A lossless system never makes a mistake in the code. There is no means
to create a mistake as steps are clearly written to produce step B
after step A, and with perfect accountability. A Huffman is a lossless
algorithm due to the fact it can be flawlessly reversed.
~~~~~~~~~~~~
The Sixth Fundamental is Ratio's
Ratio's can play an important part in compression. It is possible to
have anywhere from 100% ones, to 0% ones, versus a 0% zero's to 100%
zero's. This can lead to wildly different variables when talking two
bit outcomes, even if both are 50%. An example. You might have 50/50
1's to 0's, but 20% - 11, 30% - 10, 30% - 01, 20% 11. This would be a
ratio imbalance.
It is possible to create an imbalance from a pure statistical evenness
via any number of means. For instance a Huffman with 11 = 111, 10 =
110, 01 = 10, 00 = 0 will result in a statistical imbalance of 66.7%
total 1's versus 33.3% total 0's
~~~~~~~~~~~~
I will not go into Arithmetic encoding, bijective encoding, and other
compression techniques because frankly I do not need to. These are
better than Huffman is, by a good measure, but they are not needed to
validate the method.
Some people have exclusively seen it in advance, or have a chance to
do so.
>
> The Second Fundamental is Pascals triangle.
> Pascals triangle has applications in many many different fields. For
> our purposes the Triangle is a perfect tool for identifying the odds
> of certain outcomes in certain sized files.
About two years ago we had a guy here that presented a nice method for
encoding a permutation of k bits in a total of n bits optimally (i.e. to
encode one of the (k over n) permutations). The algorithm was neat, but
required a lot of buffering, and the major flaw of it was that there it
was rather unclear how to apply it - i.e. there was no easy way to
design a model for a data source that fits to that. I did a couple of
tests with it, and found that even for primitive audio coding an MQ
coder could easily outperform it. By that I mean that you should
probably check for prior art.
>
> The Third Fundamental is Separate Filing
>
> This is a new concept to some extent. In typical compression the usage
> of multiple files, of multiple sizes, of multiple possible names,
> results in a potential compression method, for extremely imbalanced in
> ratio levels information. It is extremely ineffective, and unlikely to
> compress. The problem ultimately is an issue of the command section,
> or real size of a factual file, to hold the data separated.
>
> The numbers feel impressive, for this can be an example:
>
> Assumption, up to 3 files, 1 to 3 bits in each. Each file therefore
> has 14 outcomes. This is 14 * 14 * 14 + 14 * 14 + 14 * 14 + 14 * 14 +
> 14 + 14 + 14, or a total of 3374. This is 11.7202 bits in ratio. Where
> we had 1 to 3 bits, with an average of 2.57 bits per file, of 1 to 3
> files, and a predictable number of around 2.5 files averaged. This is
> 7.19 bits averaged! It seems wonderful... until you need to count how
> many bits per which file. The command section requires 1 bit per file
> to say if it is active, 1 bit per an active file to identify if there
> is 1 bit or 2 bits, and this adds up fast. So fast that it out does
> any gains statistically, and renders itself moot.
>
> However if a means existed to sort information to different files,
> without an overhead cost to keep track of this information then a huge
> increase would occur in the possible data being stored per bit. As
> would the ability to have different numbers of bits in a file tracked
> without cost. Either way would be an increase, together would be a
> dramatic increase in the capacity for our purposes.
However, if 1 = 0, I would fly to the moon. The point is that the
counting argument ensures that there is no such means, so this point is
moot in first place. If you start with an assumption that is false, you
can prove everything (ex falso quodlibet). You need to keep track of
side-information somehow, and that's the matter of the operating system,
and if you do not count this side-information, you get wrong results,
clearly.
You should probably look into things like multi-channel coding, maybe
that's related. (Yes, first get in touch with existing technology).
> The Sixth Fundamental is Ratio's
>
> Ratio's can play an important part in compression. It is possible to
> have anywhere from 100% ones, to 0% ones, versus a 0% zero's to 100%
> zero's. This can lead to wildly different variables when talking two
> bit outcomes, even if both are 50%. An example. You might have 50/50
> 1's to 0's, but 20% - 11, 30% - 10, 30% - 01, 20% 11. This would be a
> ratio imbalance.
>
> It is possible to create an imbalance from a pure statistical evenness
> via any number of means. For instance a Huffman with 11 = 111, 10 =
> 110, 01 = 10, 00 = 0 will result in a statistical imbalance of 66.7%
> total 1's versus 33.3% total 0's
You forget here that the input of the huffman must be "out of balance"
in first place to make the huffman worth applying. If the input is IID
with a flat distribution, then the huffman code above would indeed cause
an out-of-balance of one's and zero's, but you wouldn't apply huffman to
that data source in first place. If you would apply huffman to the
source it was modelled for, you will find that the result is in balance.
However, huffman is not ideal for all sources (arithmetic is), so you
can find for "most" sources an out-of-balance output, which is just
another way of expressing the non-optimality of the code. For an optimal
code, no such out-of-balance situation arises (proof: If it would be out
of balance, a simple binary arithmetic compressor would shorten the
output file size, making the combined algorithm performing better, which
is against the assumed optimality of the initial code. q.e.d.)
> I will not go into Arithmetic encoding, bijective encoding, and other
> compression techniques because frankly I do not need to. These are
> better than Huffman is, by a good measure, but they are not needed to
> validate the method.
They are needed from a theoretical standpoint to debunk your method.
Now, please back to the drawing board.
Thanks,
Thomas
That would be the QI coder, right ? I thought it wasn't so bad with
the buffering, in fact it was quite a good and novel idea.
SaSW, Willem
--
Disclaimer: I am in no way responsible for any of the statements
made in the above text. For all I know I might be
drugged or something..
No I'm not paranoid. You all think I'm paranoid, don't you !
#EOT
Sorry, wasn't clear about it (still too tired with a jet-lack) - it
required rather large *tables* to work with. As said, the algorithm was
neat, I just had no clue how to apply it in realistic problems (i.e. you
rarely have the problem of having to encode a specific permutation, and
it is unclear how to turn that into a model).
So long,
Thomas
> I will not go into Arithmetic encoding, bijective encoding, and other
> compression techniques because frankly I do not need to. These are
> better than Huffman is, by a good measure, but they are not needed to
> validate the method.
Bijective encoding is not "better than Huffman". As David Scott so
often shows, a given encoding scheme can be modified to have the
bijective property. Thus he has created bijective versions of both
Huffman and arithmetic coders. You are trying to say that the color
red is faster than a startled wombat.
|
| Mark Nelson - http://marknelson.us
|