Google Groups no longer supports new Usenet posts or subscriptions. Historical content remains viewable.
Dismiss

What do you call this?

11 views
Skip to first unread message

Industrial One

unread,
Feb 11, 2012, 12:10:45 PM2/11/12
to
With a PRNG I can generate a stream of bytes of any size from any
range. If I select 0-127 then it will generate a random distribution
of hex bytes 00 to 7F, and WinRAR will compress it by 12.5%, as
expected. If I select the full 0-255 then compression will be 0%,
obviously. If I select 00-01 the compression ratio will be 8:1.

What do you call the technique of "degrouping" the unique set of
symbols to a set short enough to represent the combos and not waste
space?

Is there an app out there that can do this with say English text? How
many bits to represent 26 letters, 5 right?

Willem

unread,
Feb 11, 2012, 1:07:37 PM2/11/12
to
Industrial One wrote:
) With a PRNG I can generate a stream of bytes of any size from any
) range. If I select 0-127 then it will generate a random distribution
) of hex bytes 00 to 7F, and WinRAR will compress it by 12.5%, as
) expected. If I select the full 0-255 then compression will be 0%,
) obviously. If I select 00-01 the compression ratio will be 8:1.
)
) What do you call the technique of "degrouping" the unique set of
) symbols to a set short enough to represent the combos and not waste
) space?
)
) Is there an app out there that can do this with say English text? How
) many bits to represent 26 letters, 5 right?

Encoding?

Well, actually encoding means more than that but it does include that.


SaSW, Willem
--
Disclaimer: I am in no way responsible for any of the statements
made in the above text. For all I know I might be
drugged or something..
No I'm not paranoid. You all think I'm paranoid, don't you !
#EOT

Industrial One

unread,
Feb 11, 2012, 1:22:40 PM2/11/12
to
Uh, no.

Robert Wessel

unread,
Feb 11, 2012, 5:10:46 PM2/11/12
to
Assuming I understand your question, If you want your output to
consist of symbols a whole number of bits long, Huffman encoding will
do the trick for you. You'll get more efficiency with arithmetic
encoding, which is in many way similar to Huffman, but allows the
output symbols to be a fractional number of bits. Note that those
generate variable length output symbols based on the probability of
the input symbol. A simpler approach will suffice if you want fixed
length output symbols.

And to represent N distinct states, you need lg(N) bits. (lg being
the base-2 logarithm).

biject

unread,
Feb 11, 2012, 10:03:30 PM2/11/12
to
On Feb 11, 3:10 pm, Robert Wessel <robertwess...@yahoo.com> wrote:
> On Sat, 11 Feb 2012 09:10:45 -0800 (PST), Industrial One
>
actually if you have N objects and N = 2**X you need exactly X bits
if its not a power of 2 find 2**X that is the next power of 2.
get Y = 2(N - 2**(X-1) )
assign Y symbols to X bits and (N-Y) to X-1 bits then you mantain
the integer number of bit for each symbol
examples symbols A B C D since 2**2 = 4 each two bits
A = 00
B = 01
C = 10
D = 11

next symbols 5 of them A B C D E X=3 X-1 = 2
Y=2(5 -4)=2 so 2 3bits 3 2 bits since 5-2 = 3
A = 00
B = 01
C = 10
D = 110
E = 111

next symbols 6 of them A B C D E F X=3 X-1 = 2
Y=2(6 -4)=4 so 4 3bits 2 2 bits since 6-4 = 2
A = 00
B = 01
C = 100
D = 101
E = 110
F = 111

next symbols 7 of them A B C D E F G X=3 X-1 = 2
Y=2(7 -4)=6 so 6 3bits 1 2 bits since 7-6 = 1
A = 00
B = 010
C = 011
D = 100
E = 101
F = 110
G = 111

Take this last case and assume A through F random
then if each symbol appeared once you need 20 bits for
7 characters that is 2.85714286 per character

if you used arithmetic ln(7)/ln(2) 2.80735492 per character
so in this case arithmetic would be better.



David A. Scott
--
My Crypto code
http://bijective.dogma.net/crypto/scott19u.zip
http://www.jim.com/jamesd/Kong/scott19u.zip old version
My Compression code http://bijective.dogma.net/
**TO EMAIL ME drop the roman "five" **
Disclaimer:I am in no way responsible for any of the statements
made in the above text. For all I know I might be drugged.
As a famous person once said "any cryptograhic
system is only as strong as its weakest link"

Industrial One

unread,
Feb 12, 2012, 9:13:25 AM2/12/12
to
On Feb 11, 10:10 pm, Robert Wessel <robertwess...@yahoo.com> wrote:
> On Sat, 11 Feb 2012 09:10:45 -0800 (PST), Industrial One
>
> <industrial_...@hotmail.com> wrote:
> >With a PRNG I can generate a stream of bytes of any size from any
> >range. If I select 0-127 then it will generate a random distribution
> >of hex bytes 00 to 7F, and WinRAR will compress it by 12.5%, as
> >expected. If I select the full 0-255 then compression will be 0%,
> >obviously. If I select 00-01 the compression ratio will be 8:1.
>
> >What do you call the technique of "degrouping" the unique set of
> >symbols to a set short enough to represent the combos and not waste
> >space?
>
> >Is there an app out there that can do this with say English text? How
> >many bits to represent 26 letters, 5 right?
>
> Assuming I understand your question,

To convert a bunch of 00000000s and 00000001s to 0s and 1s would
basically be removing the first 7 zeroes, correct? Does this technique
have a name?

> If you want your output to
> consist of symbols a whole number of bits long, Huffman encoding will
> do the trick for you.  You'll get more efficiency with arithmetic
> encoding, which is in many way similar to Huffman, but allows the
> output symbols to be a fractional number of bits.  Note that those
> generate variable length output symbols based on the probability of
> the input symbol.  A simpler approach will suffice if you want fixed
> length output symbols.
>
> And to represent N distinct states, you need lg(N) bits.  (lg being
> the base-2 logarithm).

And WinRAR makes use of Huffman? (I know what Huffman is.) I thought
WinRAR uses LZW techniques, and I don't see how LZW would possibly
compress a 1 MB sequence of random hex bytes 00-7F perfectly since it
has no patterns/repeating phrases for it to encode.

File: http://www.sendspace.com/file/px82nz

glen herrmannsfeldt

unread,
Feb 12, 2012, 10:02:12 AM2/12/12
to
Industrial One <industr...@hotmail.com> wrote:

(snip)
> And WinRAR makes use of Huffman? (I know what Huffman is.) I thought
> WinRAR uses LZW techniques, and I don't see how LZW would possibly
> compress a 1 MB sequence of random hex bytes 00-7F perfectly since it
> has no patterns/repeating phrases for it to encode.

Not perfectly, but it won't be far off for a long enough file.

Sequences will repeat just often enough that it takes about 7/8
as much space as it otherwise would.

-- glen

Jim Leonard

unread,
Feb 13, 2012, 11:19:25 AM2/13/12
to
On Feb 12, 8:13 am, Industrial One <industrial_...@hotmail.com> wrote:
> To convert a bunch of 00000000s and 00000001s to 0s and 1s would
> basically be removing the first 7 zeroes, correct? Does this technique
> have a name?

Choosing a symbol set based on your input set is called "encoding".

> And WinRAR makes use of Huffman? (I know what Huffman is.) I thought

WinRAR uses lots of techniques, but shares the same basic technique as
7-zip and pkzip, which is to look for repeating patterns, then encode
the output of the pattern matcher. winrar is not finding any patterns
in the random output, but it is correctly encoding the symbols output
from that stage using less bits.

Industrial One

unread,
Feb 13, 2012, 3:04:03 PM2/13/12
to
Encoding by itself is vague, it means lots of things. What is the
concisest way to explain such a method?

glen herrmannsfeldt

unread,
Feb 13, 2012, 3:28:57 PM2/13/12
to
Jim Leonard <moby...@gmail.com> wrote:

(snip)
> WinRAR uses lots of techniques, but shares the same basic technique as
> 7-zip and pkzip, which is to look for repeating patterns, then encode
> the output of the pattern matcher. winrar is not finding any patterns
> in the random output, but it is correctly encoding the symbols output
> from that stage using less bits.

Depending on your definition of pattern.

If you take a file full of ASCII 0's and 1's, (and so well
compressible) the pattern logic will find sequences of 0's and 1's
that appear, and so look like repetitive sequences. To you, they
look like random bit patterns, but to LZW they look like
repeats in '0' and '1'.

-- glen

Jim Leonard

unread,
Feb 13, 2012, 3:54:13 PM2/13/12
to
On Feb 13, 2:04 pm, Industrial One <industrial_...@hotmail.com> wrote:
> Encoding by itself is vague, it means lots of things. What is the
> concisest way to explain such a method?

Encoding means many things, and this is one of those things. The
actual process you're describing is "representing a set of data using
the least amount of bits per symbol". The actual process is a generic
operation, so the generic term fits.

There are many different *methods* to achieve this (Huffman, a
predefined dictionary, Range Coding, etc.), each of which are concise
and specific and have their own term.
0 new messages