At any rate, since the amusement factor is wearing off and the belly
laughs for this day are fading, here is a simpler question.
If you are merely storing the Nth Prime... (making a list, checking it
twice, seeing which can only be divided by itself & 1, very nice)
Prime(499) = 3559
At what data points do the costs of using your method to store numbers
begin paying off as a regular deal? I am not denying your ability to
compress numbers with this method (it does work in certain circumstances),
but I am strongly recommending that you limit your outputs to only the
ranges that this method works best with. Don't forget that your ESCAPE
CHARACTERS are going to cost you unless you have positional or relative
notation storage tricks.
Suggestion: store your (Odd/Even) bit marker at the beginning of your
"compressed" value.
Ergo:
(Odd Bit) + Prime(499) + Prime(6) =
1 + 3559 + 13 = 3573
1 + 111110011 + 110 vs. 1101 1111 0101
(1 bit) [ESC Char] (9 bits) [ESC Char] (3 bits) vs. (12 bits)
(13 bits) + [2 ESC Chars] vs. (12 bits)
NO PAYOFF IN COMPRESSION
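For anyone who wants to build that list of ranges rather than do it by hand, here is a rough Python sketch of the bookkeeping in the example above. It assumes 1-based indexing with Prime(1) = 2 (under which 13 is the 6th prime), and the names `prime`, `raw_bits`, `encoded_bits` and the `esc_bits` parameter are made up for illustration; how many bits a separator really costs depends on the storage scheme.

```python
def first_primes(count):
    """Return the first `count` primes by trial division (Prime(1) = 2)."""
    primes = []
    candidate = 2
    while len(primes) < count:
        # candidate is prime if no smaller prime up to its square root divides it
        if all(candidate % p for p in primes if p * p <= candidate):
            primes.append(candidate)
        candidate += 1
    return primes

PRIMES = first_primes(500)

def prime(n):
    """1-based Nth prime, so prime(1) == 2."""
    return PRIMES[n - 1]

def raw_bits(value):
    """Bits needed to store the number directly in binary."""
    return value.bit_length()

def encoded_bits(indices, esc_bits):
    """One odd/even marker bit, each prime index in binary, and one
    separator of `esc_bits` bits (an assumed cost) in front of each index."""
    return 1 + sum(i.bit_length() for i in indices) + esc_bits * len(indices)

value = 1 + prime(499) + prime(6)
print(value, raw_bits(value), encoded_bits([499, 6], esc_bits=0))
```

Even with free separators (`esc_bits=0`) the encoded form here needs 13 bits against 12 for the plain number, so any real separator cost only widens the gap.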
Just make a list of what number pair ranges WORK for Data Compression
(you use fewer bits + Escape Chars than what would be required to just store
the original number bitwise) and then we can get past my guffaws and I will
help you a bit if it continues to amuse me. I will suggest your Nth Prime
be considerably larger than in my simple (for simple minds) examples
here. If I've made any silly mistakes, well, it is not like this
group really matters enough to seriously double check my work for the usual
giggle-posts. I've dropped some good hints on research directions to these
jokers in the past and none have even lifted the lightest finger to research
or follow through, so you don't owe them anything and you will only cheat
yourself if you simply give up when the mockery begins.
========================
Hacking Data Compression Lesson 2 Basics
By Andy McFadden
GEnie A2Pro Apple II University - Copyright (C) 1992 GEnie
http://www.fadden.com/techmisc/hdc/lesson02.htm
The usual way to form codes is with a distinguished "escape" character.
Escape characters don't have anything to do with the Esc key (ASCII 27);
that's just what they're called. Think about how hitting the Esc key in
most programs makes the program stop its current activities and do something
else. The usage here is similar.
When a run of characters is encountered, say:
    lda #$1234        ;load the secret number
an RLE encoder would compress the "run" of spaces between the "1234" and the
";" by outputting the escape character, then a space, and then the number of
spaces. If your escape character were ']', your output would look like:
    lda #$1234] 8;load the secret number
              ^^^
              |||--- count == 8 spaces
              ||--- character is a space
              |--- escape character
Decoding is just the opposite of encoding. The input data is copied to the
output until an escape character is hit. At that point the decoder reads
the character and the count, outputs that many characters, and then resumes
copying data.
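The decoding loop just described can be sketched in a few lines of Python. This is only an illustration, not the lesson's own code: the function name `rle_decode` is made up, the escape character defaults to ']' as in the example, and the count is assumed to be stored as a raw byte value (so 8 is the byte 0x08, not the printed digit).

```python
def rle_decode(data: bytes, esc: int = ord("]")) -> bytes:
    """Expand escape/character/count triples; copy everything else as-is."""
    out = bytearray()
    i = 0
    while i < len(data):
        if data[i] == esc:
            # Escape seen: the next two bytes are the character and the count.
            char, count = data[i + 1], data[i + 2]
            out.extend(bytes([char]) * count)
            i += 3
        else:
            # Ordinary byte: copy it straight to the output.
            out.append(data[i])
            i += 1
    return bytes(out)

packed = b"lda #$1234] \x08;load the secret number"
print(rle_decode(packed))
```

A truncated input (an escape with fewer than two bytes after it) raises IndexError in this sketch; a real decoder would want to handle that more gracefully.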
That's basically it. However, there is one slight complication: what
happens if the escape character appears in the input? If we just copied it
to the output, the decoder would see it as the start of a run of characters,
and would output garbage. The solution is to output it as a one-byte run of
characters, so you'd output an escape
character (indicating start of run), another escape character (indicating
the character to use), and then a one (number of characters).
Note that we can still encode runs of escape characters in the usual
way. Since we use 3 bytes to output a run, we only lose ground if we
encounter a solo escape character or two in a row. If we encounter
three escape characters in a row, we would output "]]3", which is
exactly the same size. A solo escape character is the worst case, since it
grows from one byte to three: a file which looks like "]a]a]a]a]a]a]a"
doubles in size (each "]a" becomes "]]1a"), an expansion of +100%.
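Here is a matching sketch of the encoder side in Python, with the escape-character rule built in, so the sizes can be checked directly. The assumptions are mine: the name `rle_encode`, a raw count byte (runs capped at 255), and a hypothetical `min_run` parameter defaulting to 4, following the efficiency notes later in the lesson. Escape characters are always written as runs, whatever their length.

```python
def rle_encode(data: bytes, esc: int = ord("]"), min_run: int = 4) -> bytes:
    """Encode runs as escape/character/count; escapes are always encoded."""
    out = bytearray()
    i = 0
    while i < len(data):
        char = data[i]
        run = 1
        # Measure the run, capped so the count fits in a single byte.
        while i + run < len(data) and data[i + run] == char and run < 255:
            run += 1
        if char == esc or run >= min_run:
            out += bytes([esc, char, run])
        else:
            # Short run of an ordinary character: cheaper to copy it.
            out += bytes([char]) * run
        i += run
    return bytes(out)

for sample in (b"]", b"]]", b"]]]", b"]a" * 7):
    print(len(sample), len(rle_encode(sample)))
```

Each printed pair is the input size followed by the encoded size, which shows where solo and doubled escape characters lose ground and where three in a row break even.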
From this we can see that the choice of escape character is fairly
important (the value $db - high-ASCII '[' - is a popular choice). It
should be the least frequently seen character in the input. We can
make it less important by CHANGING it every time. For example, we
could add some number relatively prime to 256 (say 51) to it every
time. That way we only repeat escape characters once every 256 runs,
so a burst of '$db's won't screw us up. So long as both the encoder
and the decoder adjust the escape value after each run, there's no
chance of confusion.
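The "relatively prime to 256" requirement is easy to check with a small Python sketch. The function name `escape_sequence` and its parameters are made up for illustration; the starting value $db and the step of 51 are the ones mentioned above.

```python
from math import gcd

def escape_sequence(start: int = 0xDB, step: int = 51, runs: int = 256) -> list[int]:
    """Escape values used for successive runs, adding `step` modulo 256."""
    values = []
    esc = start
    for _ in range(runs):
        values.append(esc)
        esc = (esc + step) % 256
    return values

# A step sharing no factor with 256 visits every byte value before repeating.
print(gcd(51, 256), len(set(escape_sequence(step=51))))
# A step that shares a factor with 256 cycles through far fewer values.
print(gcd(64, 256), len(set(escape_sequence(step=64))))
```

Any odd step is relatively prime to 256, so 51 qualifies on that count; the encoder and decoder just have to apply the same update after every run.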
Note that this does make the data more susceptible to transmission
errors. Text compressed with RLE is still more or less readable. If
some error (say, modem line noise) caused the decoder to become out of
sync with the encoder, it would look for the wrong escape character
and would start spitting out garbage periodically.
(NOTE: I don't know if 51 is a good choice or not. Seems nice
enough.)
-=*=-
Note that it isn't necessary to do the encoding exactly as described.
Some routines reverse the order of character and escape code; it may
make the encoder or decoder simpler in certain situations. It's also
possible to avoid using an escape character altogether by assuming
that two identical characters will be followed by a length byte. For
example, if the input were:
    abcDDDDefGGhi
the output would be:
    abcDD2efGG0hi
         ^    ^
         |    |--- zero Gs follow
         |--- two more Ds follow
indicating that there are two more 'D's, for a total of four. If there are
only two characters, as with the 'G's, an extra 0 must be added. So the
worst case for this scheme is a series of double characters (like
"aabbccdd"), each pair of which must be encoded in three bytes
("aa0bb0cc0dd0"). This gives a maximum increase of +50%. Each doubled
character costs one extra byte here, while a solo escape character costs two
extra bytes in the escape scheme, so if the input text contains more than
twice as many double characters as solo escape characters, we will do worse.
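The escape-free variant can be sketched in Python as well. Again this is an illustration under my own assumptions: the names `pair_encode` and `pair_decode` are made up, and the "how many more follow" count is stored as a raw byte (0 to 255) rather than the printed digit shown in the example.

```python
def pair_encode(data: bytes) -> bytes:
    """Write two identical bytes followed by a count of further repeats."""
    out = bytearray()
    i = 0
    while i < len(data):
        char = data[i]
        run = 1
        # Cap the run at 2 + 255 so the extra count fits in one byte.
        while i + run < len(data) and data[i + run] == char and run < 257:
            run += 1
        if run == 1:
            out.append(char)
        else:
            out += bytes([char, char, run - 2])
        i += run
    return bytes(out)

def pair_decode(data: bytes) -> bytes:
    """Inverse of pair_encode: a doubled byte is followed by a count byte."""
    out = bytearray()
    i = 0
    while i < len(data):
        char = data[i]
        if i + 1 < len(data) and data[i + 1] == char:
            out += bytes([char]) * (2 + data[i + 2])
            i += 3
        else:
            out.append(char)
            i += 1
    return bytes(out)

print(pair_encode(b"abcDDDDefGGhi"))
print(len(b"aabbccdd"), len(pair_encode(b"aabbccdd")))
```

The first line printed corresponds to "abcDD2efGG0hi" with the counts as raw bytes, and the second shows the eight-byte worst-case input growing to twelve.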
-=*=-
A few efficiency notes...
Since we represent runs with three bytes, there is no reason to represent
runs of two or three bytes specially (in fact, specifying a run of two bytes
would be a loss). In terms of decompression speed, the overhead involved in
outputting a run of three characters is higher than that for just copying
three characters to the output, so RLE encoders should only encode runs of
four or more.
Whenever we decode a run, we can assume that there will be at least one copy
of that character in the output (why else would we be doing it?). Thus the
sequence "escape|char|0" has no meaning; we just told the decompressor that
there was a run of 0 bytes, so please get on the ball and output all zero of
those bytes right this very minute.
We ought to change the meaning slightly, so that the length byte is one LESS
than the length of the run. Now a length of zero represents one byte, and
255 represents a run of 256. In fact, if it weren't for the special
handling needed for the escape character, we could treat the length byte as
length-4.
(And if you want to get really confusing, you could handle the length byte
for runs of escape characters specially. But the extra overhead involved is
probably not worth the effort.)
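The biased length byte amounts to a one-line mapping in each direction. A small Python sketch, with made-up names `pack_length` and `unpack_length` and a hypothetical `bias` parameter (1 for the length-minus-one convention, 4 for the length-4 idea):

```python
def pack_length(run_length: int, bias: int = 1) -> int:
    """Map a run length in [bias, bias + 255] onto a single byte value."""
    if not bias <= run_length <= bias + 255:
        raise ValueError("run length does not fit in one biased byte")
    return run_length - bias

def unpack_length(byte: int, bias: int = 1) -> int:
    """Recover the run length from the stored byte."""
    return byte + bias

print(pack_length(1), pack_length(256))   # shortest and longest run with bias 1
print(unpack_length(255, bias=4))         # longest run if the byte meant length-4
```

With a bias of 1 no byte value is wasted on a zero-length run, and the longest run per triple grows from 255 to 256; a bias of 4 would stretch that to 259, at the price of treating escape characters separately.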
[More at webpage]