
Compression of random binary data


jonas.t...@gmail.com

Jul 11, 2016, 1:52:27 PM
What kind of statistical law, mathematical conjecture, or even physical law is violated by compression of random binary data?

I only know that Shannon theorised it could not be done, but was there any proof?

What is to say that you cannot do it if the symbolic representation is richer than the symbolic representation of the dataset?

Isn't it a fact that square roots actually depict numbers in a shorter way than their actual representation?

Now the interpreter or program must know the rules. And I have very good rules to make it happen.

Joonas Liik

Jul 11, 2016, 2:09:39 PM
On 11 July 2016 at 20:52, <jonas.t...@gmail.com> wrote:
> What kind of statistic law or mathematical conjecture or is it even a physical law is violated by compression of random binary data?
>
> I only know that Shanon theorised it could not be done, but were there any proof?

Compression relies on some items in the dataset being more frequent
than others. If you have a dataset that is completely random, it would
be hard to compress, as most items occur with very similar frequency.

> What is to say that you can not do it if the symbolic representation is richer than the symbolic represenatation of the dataset.
>
> Isn't it a fact that the set of squareroots actually depict numbers in a shorter way than their actual representation.

A square root may be smaller numerically than a number but it
definitely is not smaller in terms of entropy.

Let's try to compress the number 2, for instance, using square roots.
sqrt(2) = 1.4142
The square root actually takes more space in this case even though it
is a smaller number, so storing the square root would give negative
compression here.
With some rounding back and forth we can probably get around the fact
that sqrt(2) would take an infinite amount of memory to represent
accurately, but that necessarily means restricting the values we are
capable of encoding.

For sqrt(2) not to have worse space consumption than the number 2
itself we basically have to throw away precision, so sqrt(2) ~= 1.
Now I challenge you to get that 2 back out of that 1.
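
A two-line check in Python shows the problem (math.sqrt and round are from the standard library; the particular rounding chosen is just for illustration):

import math

approx = round(math.sqrt(2))          # keep it as short as the original "2": just 1
print(approx * approx)                # 1, not 2 -- the information is gone
print(round(math.sqrt(2), 4) ** 2)    # roughly 1.99996 -- still not exactly 2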

jonas.t...@gmail.com

Jul 11, 2016, 2:24:37 PM
Well, who is to say different kinds of numbers aren't treated differently? I mean, not all numbers are squares. Not all numbers are naturals.

MRAB

Jul 11, 2016, 2:31:00 PM
If you want a challenge:

The Enduring Challenge of Compressing Random Data
http://www.drdobbs.com/architecture-and-design/the-enduring-challenge-of-compressing-ra/240049914

jonas.t...@gmail.com

Jul 11, 2016, 2:32:26 PM
But it could be that all numbers are foldable, both the integer parts and the real parts, and can be expressed by the folding differences.

Steven D'Aprano

Jul 11, 2016, 2:38:51 PM
On Tue, 12 Jul 2016 03:52 am, jonas.t...@gmail.com wrote:

> What kind of statistic law or mathematical conjecture or is it even a
> physical law is violated by compression of random binary data?

The pigeon hole principle. If you have 100 pigeon holes, and 101 pigeons,
then clearly at least one pigeon hole must have two pigeons in it.

To keep the numbers small and manageable, let's say we are going to compress
one byte at a time. Now a byte has eight bits, so there are exactly 256
possible bytes:

0000 0000
0000 0001
0000 0010
...
1111 1110
1111 1111

Now, suppose I claim that I can LOSSLESSLY (that is, reversibly) compress
any random byte to just two bits. The lossless part is important: it's not
hard to compress random data by irreversibly throwing some of it away, and
there's no violation there.

So I claim that you can give me any random byte, I will compress it to just
two bits:

00
01
10
11

and then be able to decompress it back again to give you the original byte
once more.

Obviously I'm trying to pull a fast one! There's no way I can do this. I can
squeeze 256 pigeons into just four pigeon holes, but only by doubling them
up. Suppose I compress these three bytes to 00:

0000 0000
0110 1001
1100 0110

Now when I go to uncompress 00, what should I return? There is no way for me
to know which of the three was the original value.

(If I'm cunning, I'll have sneakily stored some data *elsewhere*, say, in
the file name, or in a database, so that you need this extra hidden data to
uncompress the 00 back to a full byte. But then I'm not compressing eight
bits down to two. I'm compressing eight bits down to two bits plus
who-knows-how-many-bits of hidden data.)

So the pigeon hole principle tells us one of two things:

(1) If you compress random data, then it must be lossy; I can compress eight
bits to two, but then I can't uncompress it back again, at least not
without throwing away some data.

(2) Or, if the compression is lossless, then some data must be expanded
rather than compressed. If you pick data at random, some of it will be
expanded.

Suppose I have a compression algorithm that infallibly and reversibly
compresses as follows:

0000 0000 <--> 00
0000 0001 <--> 01
0000 0010 <--> 10
0000 0011 <--> 11

That part is fine. But what will my algorithm do with the other 252 bytes?
At *best* it will leave them untouched:

0000 0100 <--> 0000 0100
...
1111 1111 <--> 1111 1111

which is no compression at all, but at worst it will actually expand them
and make them bigger. (After all, it's likely that my compression format
has at least a bit of overhead.)

In practice, compression algorithms are designed to look for particular
kinds of order or structure in the data, and compress that. That's fine for
the sorts of non-random data we care about: pictures are rarely pictures of
static, text files are rarely random collections of bits. But if you do
throw a random set of bits at a lossless compression algorithm, it will at
best not compress it at all, and at worst actually make the file bigger.
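
You can watch this happen with Python's zlib module (zlib is just a stand-in here for any general-purpose lossless compressor, and the exact sizes will vary from run to run):

import os
import zlib

structured = b"ABAB" * 2500          # 10,000 bytes with obvious structure
noise = os.urandom(10000)            # 10,000 bytes from the OS's random source

print(len(zlib.compress(structured)))   # a few dozen bytes
print(len(zlib.compress(noise)))        # slightly MORE than 10,000 bytes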


> What is to say that you can not do it if the symbolic representation is
> richer than the symbolic represenatation of the dataset.
>
> Isn't it a fact that the set of squareroots actually depict numbers in a
> shorter way than their actual representation.

Sure. But you need to know what √2 means. It *represents* the number
1.41421356237... but doesn't compress it. There's nothing you can do to the
symbol √2 that will uncompress back to the infinite series of digits. All
you can do is look it up somewhere to see what the digits are.

> Now the inpretator or program must know the rules. And i have very good
> rules to make it happen.

Right. How much information is in the rules? More than you save with
the "compression". Consider:

1.41421356237 compressed down to √2, that's 13 characters down to 2. Great!
But to *uncompress*, you need to store a rule:

√2=1.41421356237

and that's *sixteen* characters. So your "compression" is:

original: 13
compressed to: 2
plus rule: 16

means you have "compressed" 13 characters to 18.

Now, this is still worth doing if you need to repeat the √2 many times, so
long as you don't have to repeat the rule. That's useful. But it's not
compression. It's more like keeping an index to a database, or a scrap of
paper with the title of a book written on it:

"See Lord Of The Rings, by J.R.R. Tolkien"

That's a lot smaller than the actual book: eight words, instead of who knows
how many tens of thousands. But you can't call it compression: you can't
sit down with the scrap of paper *and nothing else* and uncompress it back
to the entire LOTR trilogy.




--
Steven
“Cheer up,” they said, “things could be worse.” So I cheered up, and sure
enough, things got worse.

Nobody

Jul 11, 2016, 2:55:47 PM
On Mon, 11 Jul 2016 10:52:08 -0700, jonas.thornvall wrote:

> What kind of statistic law or mathematical conjecture or is it even a
> physical law is violated by compression of random binary data?

You can't create an invertible mapping between a set with 2^N elements
(e.g. the set of all N-bit binary sequences) and any set with fewer than
2^N elements (e.g. the set of all M-bit binary sequences for M < N).

Lossless compression requires an invertible mapping.

For any lossless compression algorithm, there will always be inputs where
the output is larger than the input, even if only by a single bit.

Practical lossless compression schemes operate by mapping likely inputs to
short outputs and unlikely inputs to longer outputs, resulting in outputs
which are /on average/ shorter than the inputs.

Lossy compression can achieve as much compression as you want, providing
that you're willing to tolerate the resulting loss of information.

MRAB

Jul 11, 2016, 2:57:59 PM
On 2016-07-11 19:30, MRAB wrote:
> On 2016-07-11 18:52, jonas.t...@gmail.com wrote:
> If you want a challenge:
>
> The Enduring Challenge of Compressing Random Data
> http://www.drdobbs.com/architecture-and-design/the-enduring-challenge-of-compressing-ra/240049914
>
It turns out there's a newsgroup for compression, and they're used to
getting claims about compressing random data. In fact, so much so that
it's in their FAQ:

comp.compression Frequently Asked Questions (part 1/3)
Section - [9] Compression of random data (WEB, Gilbert and others)
http://www.faqs.org/faqs/compression-faq/part1/section-8.html

Terry Reedy

Jul 11, 2016, 3:32:09 PM
On 7/11/2016 2:09 PM, Joonas Liik wrote:
> On 11 July 2016 at 20:52, <jonas.t...@gmail.com> wrote:
>> What kind of statistic law or mathematical conjecture or is it
>> even a physical law is violated by compression of random binary
>> data?

Off-topic, but... It is unclear whether you mean 'random' in the
technical sense of 'unpredictable' or the common sense that adds 'of
equal probability'.

Bell engineers discovered that physical communication channels have a
finite information transmission capacity that could be formalized as
bits per second. You should be able to find good articles on the web,
and I suggest you read some.

If every message could be compressed, then every message could be
compressed to 0 or 1, which is absurd.

>> I only know that Shanon [Shannon] theorised it could not be done, but were
>> there any proof?

Shannon meant random in the technical sense and explicitly considered
unequal probabilities. Random bit streams with unequal probabilities
*can* be compressed by recoding.

> Compression relies on some items in the dataset being more frequent
> than others,

Perhaps better to say that compression relies on removing redundancy,
*if there is any*. The two ideas are related.

> if you have some dataset that is completely random it
> would be hard to compress as most items have very similar number of
> occurrances.

Assuming equal bit probabilities. Uncorrelated bits of unequal
probability result in blocks of whatever size having unequal
probabilities, and hence redundancy that can be removed by replacing
blocks with coded blocks. Huffman encoding does this by replacing
blocks of equal size with code blocks of unequal size, with each code's
size related to the probability of the block it replaces.
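
A rough illustration of that point (zlib used as a convenient stand-in for a proper Huffman recoder; sizes vary from run to run):

import random
import zlib

def random_bits_as_bytes(p_one, nbits=80_000):
    # Pack nbits independent random bits, with P(bit == 1) == p_one, into bytes.
    out = bytearray()
    for _ in range(nbits // 8):
        byte = 0
        for _ in range(8):
            byte = (byte << 1) | (random.random() < p_one)
        out.append(byte)
    return bytes(out)

fair = random_bits_as_bytes(0.5)    # evenly distributed bits: about 10,000 incompressible bytes
skewed = random_bits_as_bytes(0.1)  # same length, but 1-bits are rare: real redundancy
print(len(zlib.compress(fair)), len(zlib.compress(skewed)))

The skewed stream comes out much smaller; the fair one comes out slightly larger than it went in.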

--
Terry Jan Reedy

Michael Selik

Jul 11, 2016, 8:36:58 PM
On Mon, Jul 11, 2016, 10:56 AM <jonas.t...@gmail.com> wrote:

> What kind of statistic law or mathematical conjecture or is it even a
> physical law is violated by compression of random binary data?
>

If you get lucky, you might be able to achieve very good compression.

> http://dilbert.com/strip/2001-10-25

Or are you asking about an algorithm that reliably compresses random data
by a fairly constant percent?

>

Lawrence D’Oliveiro

Jul 11, 2016, 11:01:20 PM
On Tuesday, July 12, 2016 at 5:52:27 AM UTC+12, jonas.t...@gmail.com wrote:

> What kind of statistic law or mathematical conjecture or is it even a
> physical law is violated by compression of random binary data?

Try compressing already-compressed data.

Does that answer your question?

jonas.t...@gmail.com

Jul 12, 2016, 10:24:24 AM
But it seems your reasoning is based upon interpretation of the actual digit, bit and byte values. There could be different interpretation worlds; of course you would have to choose one using digits. An interpretation world here could be reading out different word lengths of the dataset, and maybe a lookup table.


But it could also be arithmetic rules that magically recreate a number from a number of folds or difference of folds.

jonas.t...@gmail.com

Jul 12, 2016, 10:29:22 AM
Yes, that is my question, and also a claim that I can do it.

Steven D'Aprano

Jul 12, 2016, 11:12:01 AM
On Wed, 13 Jul 2016 12:24 am, jonas.t...@gmail.com wrote:

> On Monday, 11 July 2016 at 20:38:51 UTC+2, Steven D'Aprano wrote:
>> On Tue, 12 Jul 2016 03:52 am, jonas.t...@gmail.com wrote:
>>
>> > What kind of statistic law or mathematical conjecture or is it even a
>> > physical law is violated by compression of random binary data?
>>
>> The pigeon hole principle. If you have 100 pigeon holes, and 101 pigeons,
>> then clearly at least one pigeon hole must have two pigeons in it.
[...]
> But it seem your reasoning is based upon interpretation of the actual
> digits, bits and bytes value.

Not at all. If you think that, you've misread my example. There's no
interpretation of the bytes: they are just 8-bit numbers from 0 to 255. You
cannot losslessly compress all 256 of them to just four 2-bit numbers.


> There could be different interpretation
> worlds of course you would have to chose one using digits, An
> interpretationworld here could be reading out different word lengths of
> the dataset and maybe a lookup table.

Any lookup table you have counts as part of the compressed data.


> But it could also be arithmetic rules that magically recreate a number
> from a number of folds or difference of folds.

Oh, sure, if you believe in magic, anything is possible. Just close your
eyes, click your heels together, and wish really, really hard.

Suppose I could compress ANY random data, no matter what, down to 10% of the
original size. Okay, let's start with a million bits of data. Compress it
down to 100,000 bits.

But I believe that I can compress *anything*, any random collection of data.
Okay, let me compress it again. Now I have 10,000 bits.

Compress it again. Now I have 1,000 bits.

Compress it again. Now I have 100 bits.

Compress it again. Now I have 10 bits.

Compress it again. Now I have 1 bit, either a 0 or a 1.


Can you not see how absurd this is? I have claimed that I can take *any*
random set of data, and by compressing it again and again and again,
compress it down to ONE BIT, either a 0 or a 1, WITHOUT LOSS. Somehow I
have to take that 0 bit and uncompress it back to the Complete Works Of
William Shakespeare, and *also* uncompress it back to the recent Deadpool
movie, AND uncompress it back to last year's Ant Man movie, AND uncompress
it back to some funny picture of a cat.

How can I possibly know which of the billions and billions of different
files this 0 bit represents?

If you pass me a 0 bit, and say "uncompress this", and I get The Lord Of The
Rings novels, and then you pass me another 0 bit, and I uncompress it and
get The Hobbit, well, how did I tell the two bits apart? They're both zero.



The alternative is to say, it doesn't matter how clever you are, you can't
compress *everything*. There are some things that simply won't compress.
Eventually you get something that no longer compresses. If you could compress
EVERYTHING, then you could compress the compressed data, and compress the
compressed-compressed data, and so on, until you've got only a single bit.
And that is ridiculous.
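
You can watch a real compressor hit that wall; a short sketch with zlib (the exact byte counts don't matter, only the trend):

import zlib

data = b"the quick brown fox jumps over the lazy dog " * 1000
for i in range(6):
    data = zlib.compress(data)
    print(i + 1, len(data))   # shrinks hugely once, then creeps UP a few bytes per round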

Steven D'Aprano

Jul 12, 2016, 11:19:54 AM
Can you also make a perpetual motion machine, square the circle, and find an
exact rational fraction equal to pi?


What gets me is the people who *say* that they can compress already
compressed data. We know they can't, because if they could, they could
compress it again and again and again and again until there was only a
single bit, AND STILL REVERSE IT, using no external storage. Your lookup
tables are part of the compressed data. If the "compressed file" plus the
lookup table is bigger than the original file, then you haven't really
compressed anything. You've just moved some of it from the file into a
lookup table.

So why do people claim that they can compress already compressed data? Who
are they fooling? Themselves?



--
Steve

mm0fmf

Jul 12, 2016, 12:33:03 PM
*plonk*

jonas.t...@gmail.com

Jul 12, 2016, 1:35:48 PM
No it is only compressible down to a limit given by the algorithm.

jonas.t...@gmail.com

Jul 12, 2016, 1:46:47 PM
Well, the algorithm starts by looking up a suitable folding structure "close enough to the number"; then it works down the folding structure, finding the fold where the difference or sum between the folds is closest to zero.

You apply the same principle to the remainder until zero is achieved.

So our first fold can be either bigger or smaller, and it seeks a configuration for the fold that closes in as far as possible on the actual random number. The second fold, depending on whether our first fold was bigger or smaller than the number, will either add or subtract lower layers of the fold.

Out of this comes a difference that needs to be folded, and the process is repeated until there is nothing left to fold.

It is basically a search algorithm looking for suitable folding structures.

Michael Torrie

Jul 12, 2016, 2:20:52 PM
On 07/12/2016 11:46 AM, jonas.t...@gmail.com wrote:
> Well the algorithm start with looking up a suitable folding structure
> "close enough to the number", then it works down the folding
> structure finding the fold where the difference or sum between the
> folds closest to zero.
>
> You do the same prinicple with the remainder until zero is achieved.
>
> So our first fold can either be bigger or smaller, and it seek a
> configuration for the fold that close in max on the actual random
> number. The second fold could be a fold that depending upon our first
> fold was bigger or smaller than number either will add or subtract
> lower layers of the fold.
>
> There will come out a difference that need to be folded, the process
> is repeated until there is nothing to fold.
>
> It is basicly a search algorithm looking for suitable folding
> structures.

Better patent it quickly then. And you will win a Nobel Prize for math
if you can do what you say you can.

jonas.t...@gmail.com

Jul 12, 2016, 3:32:01 PM
I doubt it; I never got anyone interested before in my ideas.

jonas.t...@gmail.com

Jul 12, 2016, 3:40:36 PM
On Tuesday, 12 July 2016 at 20:20:52 UTC+2, Michael Torrie wrote:
I must stress that when I say number here, I really mean 100,000+ decimal digits. So I basically home in on big numbers that I compress, and I divide the dataset into chunks of suitable size for compression.

jonas.t...@gmail.com

Jul 12, 2016, 3:42:57 PM
And the dataset chunks that come out of the process can also be treated like a new data file, so the compression is iterative down to a limit.

Ian Kelly

Jul 12, 2016, 6:24:23 PM
On Tue, Jul 12, 2016 at 11:35 AM, <jonas.t...@gmail.com> wrote:
>
> No it is only compressible down to a limit given by the algorithm.

Then your algorithm does not compress random data as you claimed.

For some input, determine the limiting output that it ultimately
compresses down to. Take that output and feed it through your
algorithm as if it were the original input. If the data are to be
considered random, then this input is just as probable as the
original. What output does the algorithm now create? If it just
returns the input unchanged, then how do you discern the original
input from this input when decompressing? If it returns a different
output of the same size, then repeat the process with the new output.
Now there are *two* outputs of that size that can't be repeated. There
are only finitely many possible outputs of that size, so eventually
you're going to have to get to one that either repeats an output -- in
which case your algorithm produces the same output for two different
inputs and is therefore incorrect -- or you will get to an input that
produces an output *larger* in size than the original.

jonas.t...@gmail.com

Jul 12, 2016, 8:43:33 PM
The dataset must have a certain size; that is the only requirement. Of course you cannot compress something into nothing; at the very least the arithmetic ruleset needs to be encoded, but what would be the point of compressing something smaller than a couple of bytes anyway? And of course the number of rounds you applied the algorithm must be stored. But that is no problem for small datasets.

jonas.t...@gmail.com

Jul 12, 2016, 8:47:40 PM
On Wednesday, 13 July 2016 at 00:24:23 UTC+2, Ian wrote:
The latter sounds reasonable, that is, it starts to toggle between states.

Steven D'Aprano

Jul 12, 2016, 10:29:48 PM
On Wed, 13 Jul 2016 03:35 am, jonas.t...@gmail.com wrote:

> No it is only compressible down to a limit given by the algorithm.

Right! Then there is data that you can't compress.

Suppose you have some data:

data = "ABABABABABAB...ABAB"

And you compress it "down to a limit":

x = compress(compress(compress(data)))
print(x)
=> prints "@nx7%k!b"

Now let's try again with something else:

data = "AABBBCCCCDDDDEEEE...ZZZZ"

And you compress it "down to a limit":

x = compress(compress(compress(compress(data))))
print(x)
=> prints "wu*$cS#k-pv32zx[&+r"


One more time:

data = "AABBAABBAABBAABBAABB"
x = compress(data)
print(x)
=> prints "g^x3@"


We agree on this. Now you say, "Give me some random data, anything at all,
and I'll compress it!", and I run a random number generator and out pops:

data = "@nx7%k!b"

or possibly:

data = "wu*$cS#k-pv32zx[&+r"

or:

data = "g^x3@"


and I say "Compress that!"

But we've already agreed that this is as compressed as you can possibly make
it. You can't compress it any more.

So there's *at least some* random data that you can't compress. Surely you
have to accept that. You don't get to say "Oh, I don't mean *that* data, I
mean only data that I can compress". Random data means its random, you
don't get to pick and choose between data you can compress and data that
you can't.

Now the tricky part is to realise that it's not just short sequences of
random data that can't be compressed. The same applies to LONG sequences
too. If I give you a gigabyte of raw video, you can probably compress that a
fair bit. That's what things like x264 encoders do. The x265 encoder is
even better. But they're lossy, so you can't reverse them.

But if I give you a gigabyte of random data, you'll be lucky to find *any*
patterns or redundancies that allow compression. You might be able to
shrink the file by a few KB. And if you take that already compressed file,
and try to compress it again, well, you've already hit the limit of
compression. There's no more redundancy left to remove.

It doesn't matter how clever you are, or what a "folding structure" is, or
how many times you iterate over the data. It's a matter of absolute
simplicity: the pigeonhole principle. You can't argue with the numbers.

If you start with a 100 digit decimal number, there are 10**100 different
pigeons. If you can compress down to a 6 digit decimal number, there are
10**6 pigeon holes. You cannot put 10**100 pigeons into 10**6 pigeon holes
without doubling up (which makes your compression lossy).

So either some numbers cannot be compressed, or some numbers are compressed
to the same result, and you can't tell which was the original. That's your
choice: a lossless encoder means some numbers can't be compressed, a lossy
encoder means you can't reverse the process exactly.
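
Here is a small-scale sketch of the same count (hash() is just an arbitrary stand-in for whatever "compression" function you prefer; any function at all crowds the holes the same way):

from collections import Counter

# Send every 4-digit number (10**4 pigeons) into a 2-digit code (10**2 holes).
holes = Counter(hash(str(n)) % 100 for n in range(10_000))
print(len(holes), max(holes.values()))   # 100 holes; the fullest holds at least 100 pigeons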

jonas.t...@gmail.com

Jul 13, 2016, 5:47:02 AM
It is not that the data is not compressible, I just need more chunks of random data; it is the footprint of the algorithm that has a certain size, it is a structure after all, albeit richer in interpretation than the numerical field.

You exchange size for computational complexity, but that may be true for any compression algorithm.

jonas.t...@gmail.com

Jul 13, 2016, 6:05:03 AM
OK, try to see it this way: ****very big**** numbers can be described as the sum or difference of a sequence of a few polynomials. Unfortunately we lack the computational skill/computing power to find them.

That is not the case using foldings/geometric series.

jonas.t...@gmail.com

Jul 13, 2016, 6:14:57 AM
The sum or difference of a few small polynomials would still of course be a polynomial. But as I say, it is a case of enriching the interpretation of the symbolic set that you look at: you replace digits, bits/integers with arithmetic describing them.

Steven D'Aprano

Jul 13, 2016, 8:20:10 AM
On Wed, 13 Jul 2016 08:04 pm, jonas.t...@gmail.com wrote:

> Ok, try to see it this way ****very big**** numbers can be described as
> the sum or difference between a sequense of a few polynomials.

*Any* number, big or small, can be given as the sum or difference of a few
polynomials:

15 = (25*x**2 - 2*x + 40) - (25*x**2 - 2*x + 25)

But... why am I wasting my time with the x**2 and x terms? They must
*always* cancel, because I'm trying to simplify to a constant. So I should
just write:

15 = 40 - 25

but that's a waste of time too. I should just write:

15

and be done. The same applies for any number, no matter how big.


> Unfortunately we lack the computational skill/computing power to find
> them.
>
> That is not the case using foldings/geometric series.

You still haven't explained how you are supposed to compress 10**100
possible inputs to just 10**6 outputs without any loss of information.

Steven D'Aprano

Jul 13, 2016, 9:43:32 AM
On Wed, 13 Jul 2016 07:46 pm, jonas.t...@gmail.com wrote:

> It is not that the data is not compressible

Yes it is.

Until you explain how you can *reversibly* pack 10**100 inputs into 10**6
outputs without loss of information, all your explanations about "folding"
and polynomials and structure are just cheap talk.

Michael Torrie

Jul 13, 2016, 3:04:01 PM
On 07/13/2016 03:46 AM, jonas.t...@gmail.com wrote:
> It is not that the data is not compressible i just need more chunks
> or random data, it is the footprint of the algorithm that has a
> certain it is a structure afterall albeit richer in interpretation
> than the numerical field.

Err, no, that does not apply to "random." If the data is truly random
then it does not matter whether you have 5 bytes or 5 GB. There is no
pattern to discern, and having more chunks of random data won't make it
possible to compress.

> You exchange size for computational complexity, but that may be true
> for any compression algorithm.

People are taking issue with your claim that you can compress random
data. Clearly you cannot, by your own admission about requiring more
chunks of data (see above), and also by the irrefutable logical
principles Steven D'Aprano laid out.

Probably time to mute this thread. It's not related to Python.

Grant Edwards

Jul 13, 2016, 3:34:45 PM
On 2016-07-13, Michael Torrie <tor...@gmail.com> wrote:
> On 07/13/2016 03:46 AM, jonas.t...@gmail.com wrote:

>> It is not that the data is not compressible i just need more chunks
>> or random data, it is the footprint of the algorithm that has a
>> certain it is a structure afterall albeit richer in interpretation
>> than the numerical field.

Well, thanks for that. Not only was it authentic frontier
gibberish...

> Err, no, that does not apply to "random." If the data is truly
> random then it does not matter whether you have 5 bytes or 5 GB.
> There is no pattern to discern, and having more chunks of random
> data won't make it possible to compress.

You might as well be arguing with Ludwig Plutonium.

--
Grant Edwards grant.b.edwards Yow! I wonder if I should
at put myself in ESCROW!!
gmail.com

Marko Rauhamaa

Jul 13, 2016, 3:36:01 PM
Michael Torrie <tor...@gmail.com>:
> If the data is truly random then it does not matter whether you have 5
> bytes or 5 GB. There is no pattern to discern, and having more chunks
> of random data won't make it possible to compress.

That's true if "truly random" means "evenly distributed". You might have
genuine random numbers with some other distribution, for example
Gaussian: <URL: https://www.random.org/gaussian-distributions/>. Such
sequences of random numbers may well be compressible.


Marko

Tim Delaney

Jul 13, 2016, 6:40:04 PM
No one is saying that *all* random data is incompressible - in fact, some
random samples are *very* compressible. A single sample of random data
might look very much like the text of "A Midsummer Night's Dream"
(especially if your method of choosing the random sample was to pick a book
off a library shelf).

But unless otherwise qualified, a claim of being able to compress random
data is taken to mean any and all sets of random data.

Anyway, that's going to be my only contribution to this thread.

Tim Delaney

danceswi...@gmail.com

Oct 22, 2017, 5:23:01 PM
In short, I cannot find a single mathematical proof that says you cannot compress random numbers. Pigeonhole and other conjectures are just that. In fact, the biggest fallacy when people start talking about compression is to say that all compression algorithms rely on redundancies, or repetitive sequences. The million random digits file is compressed down to 415,241 bytes. I have only been able to compress that down to 352,954 bytes. A very small amount. It has taken six years to come up with that small amount.

danceswi...@gmail.com

Oct 22, 2017, 5:25:15 PM
On Monday, July 11, 2016 at 11:52:27 AM UTC-6, jonas.t...@gmail.com wrote:
> What kind of statistic law or mathematical conjecture or is it even a physical law is violated by compression of random binary data?
>
> I only know that Shanon theorised it could not be done, but were there any proof?
>
> What is to say that you can not do it if the symbolic representation is richer than the symbolic represenatation of the dataset.
>
> Isn't it a fact that the set of squareroots actually depict numbers in a shorter way than their actual representation.
>
> Now the inpretator or program must know the rules. And i have very good rules to make it happen.

Steve D'Aprano

Oct 23, 2017, 12:21:57 AM
On Mon, 23 Oct 2017 02:29 pm, Stefan Ram wrote:

> r...@zedat.fu-berlin.de (Stefan Ram) writes:
>>When we have a source of 2 random binary digits,
>>there are 2^2 possible outcomes:
>>( 0, 0 ),
>>( 0, 1 ),
>>( 1, 0 ), and
>>( 1, 1 ).
>>. One cannot aggree upon a code to represent each
>>of those four possibilities with just one bit.
>
> If the probabilities of each outcome is 0.25.

When people talk about "random data", they are normally referring to equal
probabilities. Just as when they talk about compression in this context, they
mean lossless, reversible compression.

If the probability of certain codes (either single codes, or sequences of
codes) are non-equal, then you can take advantage of that by encoding the
common cases into a short representation, and the uncommon and rare cases
into a longer representation. As you say:


> Otherwise, if ( 0, 0 ) is much more frequent,
> we can encode ( 0, 0 ) by "0" and
>
> ( 0, 1 ) by "101",
> ( 1, 0 ) by "110", and
> ( 1, 1 ) by "111".
>
> And we could then use /less/ than two bits on the
> average.

That's incorrect. On average you use 2.5 bits.

(1*1 bit + 3*3 bits, divided by four possible outcomes, makes 2.5 bits.)

Where you can actually get compression is if you limit yourself to only
encoding specially selected input data where the code pairs are non-random
and you have many more (0,0) than other combinations. But that is highly
non-random.

For example, if we are compressing a black and white bitmap, say a fax, the
most common data is a page which is mostly white with only a bit of black
text. That gives us a bitmap with lots and lots of (0,0) pairs and relatively
few of the other combinations, which means for the common case of "a page of
black text on a white background" the compression works well.

(I'm treating 0 as white and 1 as black.)

But for purely random input, there will be roughly equal numbers of each pair
of bits, and on average the encoding will expand the data by 25% (two bits of
input map to 2.5 bits of output, on average).
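
The arithmetic, spelled out as a sketch:

code = {"00": "0", "01": "101", "10": "110", "11": "111"}
avg_out = sum(len(v) for v in code.values()) / len(code)   # each input pair equally likely
print(avg_out)           # 2.5 output bits for every 2 input bits
print(avg_out / 2 - 1)   # 0.25, i.e. 25% expansion on uniformly random input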



> But two bits are (maximally )random if
> the probability of every combination /is/ 0.25.

In the example you give, we have four possible pairs of bits:


Input: (0,0) (0,1) (1,0) (1,1) # 2 bits on average
Output: "0" "101" "110" "111" # 2.5 bits on average


So if the incoming data is random (uniformly-distributed), this will expand
the size of the data by 25%.

Only if the data is non-random and biased heavily towards (0,0) will you see
compression savings.

Of course in the real world most data we want to compress is non-random, and
we can get really impressive compressions. But the cost of that is that you
can rarely compress data twice (if you do, you either get very little savings
the second time, or it expands) and you cannot compress random (uniformly
distributed) data.

> (If other combinations than ( 0, 0 ) were extremely
> rare, one could even improve the compression more.)

What you say is true, but it has nothing to do with Danceswithnumbers'
ignorant and ludicrous claim that he can compress random data and that there
is no mathematical proof that you cannot.



--
Steve

Gregory Ewing

Oct 23, 2017, 1:08:55 AM
danceswi...@gmail.com wrote:
> On Monday, July 11, 2016 at 11:52:27 AM UTC-6, jonas.t...@gmail.com wrote:

>> What is to say that you can not do it if the symbolic representation is
>> richer than the symbolic represenatation of the dataset.

The more symbols you have in your alphabet, the more bits
are needed to encode each symbol.

>> Isn't it a fact that the set of squareroots actually depict numbers in a
>> shorter way than their actual representation.

No.

> In Short, I cannot find a single mathematical proof that says you cannot
> compress random numbers. Pigeon hole and other conjectures are just that.

No, they are not conjectures, they are proofs. If you don't
recognise them as such, then you haven't fully understood them.

> In
> fact, the biggest fallacy when people start talking about compression is to
> say that all compression alg rely on redundancies, or repetitive sequences.

I think it's more correct to say they rely on the fact that
the input data is encoded *inefficiently* in some way -- it uses
more bits than are necessary to represent the information.

> The million random digits is compressed down to 415,241 kb. I have been only
> able to compress that down to 352,954 kb. A very small amount. It has taken
> six years to come up with that small amount.

You may well have an algorithm that encodes *some* million
digit sequences in less than 415,241 bytes. But I can guarantee
that there will be other million-digit sequences that it
encodes to *more* than 415,241 bytes. The average over all
possible million-digit sequences, assuming they are all
equally likely, *cannot* be less than 415,241 bytes. Anything
else would be a logical impossibility.

Note that "equally likely" is important here. That's what
"random data" means in this context -- that all possible
input sequences have the same probability.

This provides another way of thinking about how compression
works: it relies on all input sequences *not* having the
same probability. If some are more likely than others, then
you can encode the more frequent ones with less bits and
the less frequent ones with more bits, and gain overall.
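
Shannon's entropy makes that precise; a small sketch (the skewed probabilities below are made up purely for illustration):

import math

def entropy(probs):
    # Ideal average bits per symbol for a source with these symbol probabilities.
    return -sum(p * math.log2(p) for p in probs if p)

print(entropy([0.25, 0.25, 0.25, 0.25]))   # 2.0 bits/symbol: uniform, nothing to gain
print(entropy([0.7, 0.1, 0.1, 0.1]))       # about 1.36 bits/symbol: skewed, compressible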

--
Greg

Steve D'Aprano

Oct 23, 2017, 1:39:50 AM
On Mon, 23 Oct 2017 08:22 am, danceswi...@gmail.com wrote:

> In Short, I cannot find a single mathematical proof that says you cannot
> compress random numbers.

Three seconds of googling finds this:

http://www.faqs.org/faqs/compression-faq/part1/section-8.html

which demonstrates a simple, trivial proof.

It took me a little longer to find this:

https://en.wikipedia.org/wiki/Kolmogorov_complexity#Compression

but only because I couldn't remember how to spell Kolmogorov, and that led me
to this:

https://en.wikipedia.org/wiki/Incompressible_string

for a concrete example.

Of course *occasionally* one will, by pure luck, compress a stream of random
bits. I run my random number generator and by a fluke it generates a stream
of 50 zeroes in a row:

00000000000000000000000000000000000000000000000000

which even a simple run-length encoding scheme can compress:

(50*0)


But flukes are not what people mean when they talk about compressing random
data. They mean some algorithm which can:


- transparently (no hiding data in the decompressor, or the file name,
or in a secret file hidden on the disk, or anywhere else)

- reversibly (there must be a decompressor which, when given ONLY the
compressed file, and NOTHING else, can reconstruct the original)

- losslessly (no throwing away data: we must reconstruct the EXACT
original, not merely something "close enough")

- and *consistently* (flukes don't count: every input must be compressed)


compress random data of size N bits to at most N-1 bits.

We can prove that this is impossible by simply *counting* how many random
strings there are.

Let us pick a concrete example, N=8. Then there are 2**8 = 256 possible
strings. We need to compress down to at most 7 bits. If we allow the output
to vary in length, there are:

2**7 = 128 seven bit strings;
2**6 = 64 six bit strings;
2**5 = 32 five bit strings;
2**4 = 16 four bit strings;
2**3 = 8 three bit strings;
2**2 = 4 two bit strings;
2**1 = 2 one bit strings;
2**0 = 1 zero bit strings

which add up to 255 compressed strings in total. But there are 256 possible
inputs, and only 255 possible outputs: so at least one input cannot be
compressed, or the compression must be lossy (two inputs compress to the same
output).
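
The same count, done explicitly:

inputs = 2 ** 8                           # 256 possible 8-bit strings
outputs = sum(2 ** k for k in range(8))   # all strings of 0 to 7 bits
print(inputs, outputs)                    # 256 versus 255: one pigeon too many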




By the way: here is a very clever trick for hiding information in the file
system:

http://www.patrickcraig.co.uk/other/compression.php


but as people point out, the information in the file, plus the information in
the file system, ends up being *larger* than the original. Well done to
Patrick Craig for finding a clever loophole to "win" the challenge, but he
did so without actually compressing the original data.

danceswi...@gmail.com

Oct 23, 2017, 5:07:12 AM
Ridiculous? Ludicrous??

Harsh words! First let me clarify, before you lump this in with perpetual motion or cold fusion: it is a mapping solution to compress ANY, I repeat ANY, random file of numbers using only 0-9, such as those on the million random numbers page. Entirely possible. Since I did it, and not by hiding data anywhere, it must be conjecture, not law.

Jon J Hutton

Chris Angelico

Oct 23, 2017, 5:22:30 AM
On Mon, Oct 23, 2017 at 8:06 PM, <danceswi...@gmail.com> wrote:
> Ridiculous? Ludicrous??
>
> Harsh words! First let me clarify before you lump this in with perpetual motion, or cold fusion. It is a mapping solution to compress ANY i repeat ANY random file with numbers of only 0 - 9 such as are in the million rand numbers page. Entirely possible. Since i did it, not by hiding data anywhere then it must be conjecture not law.
>

Where did you get the idea that you start with a representation that
uses ten symbols (0-9) and uses an entire byte for each one? Where is
that stated?

ChrisA

danceswi...@gmail.com

Oct 23, 2017, 5:32:30 AM
According to this website, this is an incompressible stream.

https://en.m.wikipedia.org/wiki/Incompressible_string

1234999988884321

It only takes seven 8-bit bytes to represent this.

Paul Moore

Oct 23, 2017, 5:42:22 AM
Would you care to provide the seven 8-bit bytes you propose to use?
Paul

Chris Angelico

Oct 23, 2017, 5:46:54 AM
That page says nothing about using a byte to represent each digit. So
if you're going to compress the data into bytes, you then have to
represent your compressed data in some comparable way. The simplest
representation is to treat this as a single number; in hexadecimal, it
would be 46339d7a29361, and if you want a byte representation of a
number, the most trivial way is to take each pair of hex digits and
store the corresponding byte: b"\x04\x63\x39\xd7\xa2\x93\x61" is seven
bytes (six and a half, rounded up). This is not data compression; this
is the more general question of how you store information at the
concrete level.

ChrisA

danceswi...@gmail.com

Oct 23, 2017, 5:54:37 AM
I really do not think this has any value besides being a trinket or cute toy. Like I said, I cannot see how it can be adapted to work as a random binary compression algorithm... it only works with the digits 0-9 in any sequence. It's taken me six years to solve, but so what.

Jon Hutton
danceswi...@gmail.com

Thomas Jollans

Oct 23, 2017, 5:54:55 AM
On 2017-10-23 11:32, danceswi...@gmail.com wrote:
> According to this website. This is an uncompressable stream.
>
> https://en.m.wikipedia.org/wiki/Incompressible_string
>
> 1234999988884321

No, it's not. According to that article, that string is incompressible
by a particular algorithm. I can see no more general claims.

>
> It only takes seven 8 bit bytes to represent this
>

You amaze me.

>>> bin(1234999988884321)
'0b100011000110011100111010111101000101001001101100001'
>>> len(bin(1234999988884321)[2:])
51
>>> 7 * 8
56


Marko Rauhamaa

Oct 23, 2017, 6:14:15 AM
Thomas Jollans <tj...@tjol.eu>:

> On 2017-10-23 11:32, danceswi...@gmail.com wrote:
>> According to this website. This is an uncompressable stream.
>>
>> https://en.m.wikipedia.org/wiki/Incompressible_string
>>
>> 1234999988884321
>
> No, it's not. According to that article, that string is incompressible
> by a particular algorithm. I can see no more general claims.

Here's a compression algorithm that manages to compress that string into
a 0-bit string:

* If the original string is 1234999988884321 (whatever that means),
return the empty bit string.

* Otherwise, prepend a don't-care bit to the original string and return
the result of the concatenation.

>> It only takes seven 8 bit bytes to represent this
>
> You amaze me.
>
>>>> bin(1234999988884321)
> '0b100011000110011100111010111101000101001001101100001'
>>>> len(bin(1234999988884321)[2:])
> 51
>>>> 7 * 8
> 56

So danceswithnumbers made a correct statement!

Anyway, a fun troll.


Marko

Ben Bacarisse

Oct 23, 2017, 7:15:44 AM
danceswi...@gmail.com writes:
<snip>
> ... First let me clarify before you lump this in with
> perpetual motion, or cold fusion. It is a mapping solution to compress
> ANY i repeat ANY random file with numbers of only 0 - 9 such as are in
> the million rand numbers page. Entirely possible.

Of course it is. I have half a dozen programs on this machine that do
it. I don't need yours.

It's possible you want to claim something beyond what you stated, but no
one is going to get excited until you actually claim such a thing.

<snip>
--
Ben.

alister

Oct 23, 2017, 8:19:12 AM
I would suspect he is using BCD & storing 2 values in each byte.
That is not what is meant by "you can't compress random data";
his compression is simply removing redundant space from an inefficient
coding.

Can you compress that sequence on paper when you only have the values 0-9
to work with (forget computer representation & removing unused bits)?


--
Old age and treachery will overcome youth and skill.

Chris Angelico

Oct 23, 2017, 8:36:30 AM
On Mon, Oct 23, 2017 at 11:18 PM, alister via Python-list
<pytho...@python.org> wrote:
> On Mon, 23 Oct 2017 10:41:55 +0100, Paul Moore wrote:
>
>> On 23 October 2017 at 10:32, <danceswi...@gmail.com> wrote:
>>> According to this website. This is an uncompressable stream.
>>>
>>> https://en.m.wikipedia.org/wiki/Incompressible_string
>>>
>>> 1234999988884321
>>>
>>> It only takes seven 8 bit bytes to represent this
>>
>> Would you care to provide the seven 8-bit bytes you propose to use?
>> Paul
>
> I would suspect he is using BCD & storing 2 values in reach byte
> that is not what is meant by you cant compress random data.
> his compression is simply removing redundant space from an inefficient
> coding

I suspect he is using ASCII and storing one value in each byte.

ChrisA

Neil Cerutti

Oct 23, 2017, 9:41:49 AM
On 2017-10-23, Chris Angelico <ros...@gmail.com> wrote:
> On Mon, Oct 23, 2017 at 11:18 PM, alister via Python-list
><pytho...@python.org> wrote:
>> On Mon, 23 Oct 2017 10:41:55 +0100, Paul Moore wrote:
>>
>>> On 23 October 2017 at 10:32, <danceswi...@gmail.com>
>>> wrote:
>>>> According to this website. This is an uncompressable stream.
>>>>
>>>> https://en.m.wikipedia.org/wiki/Incompressible_string
>>>>
>>>> 1234999988884321
>>>>
>>>> It only takes seven 8 bit bytes to represent this
>>>
>>> Would you care to provide the seven 8-bit bytes you propose
>>> to use?
>>> Paul
>>
>> I would suspect he is using BCD & storing 2 values in reach
>> byte that is not what is meant by you cant compress random
>> data. his compression is simply removing redundant space from
>> an inefficient coding
>
> I suspect he is using ASCII and storing one value in each byte.

There's also ZSCII, which stores roughly 3 characters every 2
bytes. Since all the digits are in A2, this sequence would take
up 7 bytes in ZSCII as well.

http://inform-fiction.org/zmachine/standards/z1point0/sect03.html

--
Neil Cerutti

alister

Oct 23, 2017, 10:11:57 AM
not sure how 16 characters can be represented by either ascii or zscii in
only 8 bytes



--
I fear explanations explanatory of things explained.

danceswi...@gmail.com

Oct 23, 2017, 10:29:36 AM
I'm really not trolling, and even though some are sarcastic I am learning from your comments. Dec to bin is not bad at removing wasted space, but there is a better way. Here is an example. How would you compress these numbers? If you look for redundancy and then code to a bulky dictionary, or change it to binary, the one would inflate and the other would only get it down to

Compress this:

4135124325

Dec to bin... still very large
11110110
01111000
11111101
01100101

New compression method:

11000101
11000111
01001111

A full byte less than bin.

I know many are skeptical.... that's okay. This has taken 6 years; I'm not going to insult your intelligence with something so juvenile as dec to bin. I'm really trying to find an application for this, since it only deals with digits 0-9, or 0-20, or other strange combinations. Wait, did you just give it away in that small example?... No, because it also encrypts in a way that has never been done. The compression is better than log(256) ÷ log(10).... wait, isn't that impossible, since binary is minimalistic? I agree that binary is minimalistic, but the architecture is not, making all arguments conjecture... not laws.

Paul Moore

Oct 23, 2017, 11:29:31 AM
On 23 October 2017 at 15:29, <danceswi...@gmail.com> wrote:
> I'm really not trolling, and even though some are sarcastic i sm learning from your comments.

I'm willing to believe that, but if you're trying to claim you have
"compressed" data (in a way that satisfies the technical,
information-theoretic meaning of the word), you need to be *really*
careful not to include hidden information in your assumptions.

For example, if I have a string made up of only the numbers 0-7, then
I can trivially (octal) store that data in 3 bits per digit. But
that's not compression, as there's "hidden information" in the
knowledge that you're using a restricted character set. Factor in that
information and your string only contains 3 bits of information per
digit.

Using bytes (characters, if you assume a 1-byte encoding) to hold just
the digits 0-9 is inefficient (there's 256 bytes and you're only using
10 of them), and "of course" you can hold that data more efficiently.
But that's not "compression", that's simply using a better encoding.
In the technical sense, "compression" is about looking at redundancies
that go beyond the case of how effectively you pack data into the
bytes available.

> Dec to bin is not bad at removing wasted space

Yep, you're not talking about what people usually refer to as
compression, but rather about optimising an encoding.

>, but there is a better way. Here is an example. How would you compress these numbers.

10 digits = log2(10) bits of information. So getting down to 4 bits is
about encoding. You can go further by using a variable length encoding
and "extra knowledge" about which digits come up most commonly to give
the common digits a shorter representation. That's called Huffman coding.
You can use the fact that repeated copies of the same digit come up
together a lot to replace them by digit + count. That's run-length
encoding. There are other more complex approaches. But what *all* of
these have in common is that if you provide random input (within the
limits of what you support - digit strings here) then you'll *always*
get at least one input that encodes to a longer output than your
input.
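
For instance, here is a sketch of packing the ten-digit example quoted below at roughly log2(10) bits per digit; it is a change of encoding, not compression, and the 5-byte size (leading zero byte included) is what it takes to cover every 10-digit number:

import math

digits = "4135124325"
bits_needed = math.ceil(math.log2(10) * len(digits))          # 34 bits for any 10-digit number
packed = int(digits).to_bytes((bits_needed + 7) // 8, "big")  # 5 bytes: 00 f6 78 fd 65
print(bits_needed, packed.hex())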

> Compress this:
>
> 4135124325

10 digits = 10 x log2(10) bits of information = a bit under 34
bits of information

>
> Bin to dec...still very large
> 11110110
> 01111000
> 11111101
> 01100101

4x8 = 32 bits, but there's probably a couple of leading zeros needed
if you want to encode all 10-digit numbers.


> New compression method:
>
> 11000101
> 11000111
> 01001111
>
> A full byte less than bin.

You'll need to explain how to decode this, in a way that can be used
to decode the encodings of arbitrary 10-digit numbers, and with any
leading zeroes that are needed to cover the whole space of 10-digit
numbers, before you can claim you've compressed 10-digit numbers using
only 24 bits. And if you do that, you've basically overturned most of
information theory, so I'd start by assuming there's a flaw in your
argument - sorry about that... ;-)

Hope this helps put the subject into context. Compression is a very
technical subject, to "do it right". Special cases can be worked out,
sure, but the "hidden assumptions" in a method are what make the
difference between a "compression algorithm" and a "way of storing my
particular data more efficiently".

> I know many are skeptical....thats okay.this has taken 6 years, im not going to insult your intelligence with something so juvenile as dec to bin. I'm really trying to find an application for this since it only deals with digits 0-9 or 0-20 or other strange combinations. Wait did you just give it away in that small example....no, because it also encrypts in a way that has never been done. The compression is better than log(256)÷log (10)....wait isn't that impossible, binary is minimalistic. I agree that binary is minimalistic, but the architecture is not, making all arguments conjecture...not laws.

No-one is going to accept a claim that an algorithm you're not willing
to publish is valid. This is about maths/science, not "proprietary
algorithms" or anything like that. If you don't publish your methods,
people will simply point at information theoretic proofs and say
"either you're missing something, or your approach doesn't work in
cases that I care about, so thanks but no thanks".

Paul

danceswi...@gmail.com

Oct 23, 2017, 11:32:39 AM
Just trying to find a practical application for this algorithm. Not really useful as it stands now.

Jon Hutton

danceswi...@gmail.com

Oct 23, 2017, 11:39:51 AM
Thanks Paul...blunt to the point.

My 8-year-old can decode this back into base 10; I still have to help him a bit going from base 10 to 8-bit bytes.... It's incredibly simple to decode. No dictionary, it can easily be done with pencil and paper, and it does not rely on redundancies.

Jon Hutton

Thomas Jollans

Oct 23, 2017, 12:13:43 PM
On 2017-10-23 07:39, Steve D'Aprano wrote:
> By the way: here is a very clever trick for hiding information in the file
> system:
>
> http://www.patrickcraig.co.uk/other/compression.php
>
>
> but as people point out, the information in the file, plus the information in
> the file system, ends up being *larger* than the original. Well done to
> Patrick Craig for finding a clever loophole to "win" the challenge, but he
> did so without actually compressing the original data.

Thanks Steve that was very entertaining.


--
Thomas Jollans

Thomas Jollans

Oct 23, 2017, 12:34:42 PM
On 2017-10-23 17:39, danceswi...@gmail.com wrote:
> Thanks Paul...blunt to the point.
>
> My 8 year old can decode this back into base 10, i still have to help him a bit going from base 10 to 8 bit bytes....it's incredibly simple to decode. No dictionary, can easily be done with pencil and paper, does not rely on redundancies.

It is, is it? Well, you know the rules:

"Code or it didn't happen."


danceswi...@gmail.com

Oct 23, 2017, 1:13:28 PM
Good point....

I hope it has a use, other than a cute toy.... I don't see it yet.

Neil Cerutti

Oct 23, 2017, 1:42:27 PM
On 2017-10-23, alister via Python-list <pytho...@python.org> wrote:
>>>>>> 1234999988884321
>>>>>>
>>>>>> It only takes seven 8 bit bytes to represent this
>>>>>
>>>>> Would you care to provide the seven 8-bit bytes you propose to use?
>>>>> Paul
>>>>
>>>> I would suspect he is using BCD & storing 2 values in reach
>>>> byte that is not what is meant by you cant compress random
>>>> data. his compression is simply removing redundant space
>>>> from an inefficient coding
>>>
>>> I suspect he is using ASCII and storing one value in each byte.
>>
>> There's also ZSCII, which stores roughly 3 characters every 2
>> bytes. Since all the digits are in A2, this sequence would
>> take up 7 bytes in ZSCII as well.
>>
>> http://inform-fiction.org/zmachine/standards/z1point0/sect03.html
>
> not sure how 16 characters can be represented by either ascii
> or zscii in only 8 bytes

Oops! I hastily counted completely wrong. It's 10 bytes in ZSCII
version 2, using a shift-lock.

----* 16 bits* ----
12 -> 0 00101 01001 01010
349 -> 0 01011 01100 10001
999 -> 0 10001 10001 10001
888 -> 0 10000 10000 10000
843 -> 0 10000 01100 01011
21 -> 1 01010 01001 00101

--
Neil Cerutti

Message has been deleted

Ian Kelly

Oct 23, 2017, 4:27:43 PM
On Mon, Oct 23, 2017 at 1:42 PM, <danceswi...@gmail.com> wrote:
> Wow, do programmers actually use zscii. That is huge. So much wated space.

Not really. ZSCII is only relevant if you're writing Z-code or a
Z-code interpreter. Those in turn are only relevant if you're writing
Infocom games.

danceswi...@gmail.com

Oct 23, 2017, 10:28:05 PM
Finally figured out how to turn this into a random binary compression program, since my transform can compress more than dec to binary. I took a random binary stream, changed it to a decimal stream of digits 0-9, and transformed that into a compressed/encrypted binary stream 23.7% smaller. Yes! Decoding in reverse is easy...... sorry, so excited I could shout.

Jon Hutton

Steve D'Aprano

Oct 23, 2017, 11:56:27 PM
On Tue, 24 Oct 2017 01:27 pm, danceswi...@gmail.com wrote:

> Finally figured out how to turn this into a random binary compression
> program. Since my transform can compress more than dec to binary. Then i
> took a random binary stream, changed it to a decimal stream 0-9 tranformed
> it into a compressed/encrypted binary stream 23.7% smaller.

Smaller than the original binary stream? I don't think so.

There's nothing clever about "compressing" a random binary stream if first you
expand it, then remove the extra space you added and claim victory because
the result is smaller than the expanded version.


> Yes! Decode reverse is easy......sorry so excited i could shout.

Then this should be easy for you:

http://marknelson.us/2012/10/09/the-random-compression-challenge-turns-ten/

All you need to do is compress this file:

http://marknelson.us/attachments/million-digit-challenge/AMillionRandomDigits.bin

to less than 415241 bytes, and you can win $100.

danceswi...@gmail.com

Oct 24, 2017, 12:13:11 AM
I did that quite a while ago. 352,954 bytes.

Steve D'Aprano

Oct 24, 2017, 12:22:13 AM
On Tue, 24 Oct 2017 03:13 pm, danceswi...@gmail.com wrote:

> I did that quite a while ago. 352,954 kb.

Sure you did. Let's see the code you used.

Chris Angelico

unread,
Oct 24, 2017, 1:36:37 AM10/24/17
to
On Tue, Oct 24, 2017 at 2:28 AM, Paul Moore <p.f....@gmail.com> wrote:
> Hope this helps put the subject into context. Compression is a very
> technical subject, to "do it right". Special cases can be worked out,
> sure, but the "hidden assumptions" in a method are what make the
> difference between a "compression algorithm" and a "way of storing my
> particular data more efficiently".

And to put *this* into context and perspective, both of the above are
extremely useful tools/algorithms. Generalized compression algos are
used all the time, special-purpose lossy compression algos too, and
"way[s] of storing my particular data more efficiently" are utterly
crucial to many applications. But they're not universal compression
schemes.

ChrisA

Gregory Ewing

unread,
Oct 24, 2017, 2:18:47 AM10/24/17
to
danceswi...@gmail.com wrote:
> 1234999988884321
>
> It only takes seven 8 bit bytes to represent this

This is not surprising. The theoretical minimum size
for 16 arbitrary decimal digits is:

log2(10) * 16 = 53.15 bits = 6.64 bytes

I think you misunderstand what is meant by the phrase
"random data cannot be compressed".

A string of decimal digits represented in ASCII is
*not* completely random. There are 256 possible values
for each byte, and you're only using 10 of them.

What you *can't* do is compress 16 random decimal
digits to less than 6.64 bytes.

If you have something that you think can do better
than that, we would like to see it.

--
Greg
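
A quick check of that arithmetic in Python, purely as an illustration (the 7-byte packing works because sixteen decimal digits always fit in 54 bits):

from math import log2

digits = "1234999988884321"
print(log2(10) * len(digits))          # 53.15... bits, i.e. about 6.64 bytes

# Packing the digits as one integer into a fixed 7-byte field is the best
# you can do on a whole-byte boundary; the fixed width preserves leading zeros.
packed = int(digits).to_bytes(7, 'big')
print(len(packed), format(int.from_bytes(packed, 'big'), '016d'))
# 7 1234999988884321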

Gregory Ewing

unread,
Oct 24, 2017, 2:20:28 AM10/24/17
to
danceswi...@gmail.com wrote:
> I did that quite a while ago. 352,954 kb.

Are you sure? Does that include the size of all the
code, lookup tables, etc. needed to decompress it?

But even if you have, you haven't disproved the theorem about
compressing random data. All you have is a program that
compresses *that particular* sequence of a million digits.

To disprove the theorem, you would need to exhibit an
algorithm that can compress *any* sequence of a million
digits to less than 415,241 bytes.

--
Greg

Marko Rauhamaa

unread,
Oct 24, 2017, 2:36:30 AM10/24/17
to
Gregory Ewing <greg....@canterbury.ac.nz>:

> What you *can't* do is compress 16 random decimal digits to less than
> 6.64 bytes.

More precisely:

Regardless of the compression scheme, the probability of shortening
the next bit sequence is less than 0.5 if the bits are distributed
evenly, randomly and independently.

Often the distribution is not even, or there is interdependence between
the bits. Then, a compression scheme can do miracles.


Marko
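
An informal way to see that claim in practice, using zlib as the compression scheme and pseudo-random, evenly distributed bytes (a sketch, not a proof):

import random
import zlib

random.seed(0)
trials = 1000
shortened = 0
for _ in range(trials):
    data = bytes(random.getrandbits(8) for _ in range(1000))
    if len(zlib.compress(data, 9)) < len(data):
        shortened += 1
print(shortened / trials)    # comes out far below 0.5 (typically 0.0)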

Gregory Ewing

unread,
Oct 24, 2017, 3:24:58 AM10/24/17
to
danceswi...@gmail.com wrote:

> Compress this:
>
> 4135124325
>
> Bin to dec...still very large
> 11110110
> 01111000
> 11111101
> 01100101

Wait right there! You're cheating by dropping off leading
0 bits. The maximum value of a 10 digit decimal number is
9999999999, which in hex is 2540be3ff. That's 34 bits.
That's in line with the theoretical number of bits needed:

log2(10) * 10 = 33.219

So the binary version of your number above is really

00
11110110
01111000
11111101
01100101

You may think you can get away without storing or
transmitting those leading 0 bits, because the decoder
can always pad out the data as needed.

But to do that, the decoder needs to know *how many*
bits to pad out to. That information somehow needs to
be part of the encoded data.

You need to imagine you're sending the data to someone
over a wire. The only thing you can send along the wire
are ones and zeroes. You can't convey any information
by timing, or turning off the power, or anything like
that. How is the receiver going to know when he's got
the whole message?

There are only two possibilities. Either you decide
in advance that all messages will be exactly the same
length, which in this case means always sending
exactly 34 bits. Or you add some extra bits to the
message -- prepend the length in binary, or add an
end-of-message code at the end, or something like
that.

Whatever you do, you'll find that *on average* you
will need *at least* 34 bits to be able to represent
all possible 10-digit decimal numbers. Some might
be shorter, but then others will be longer, and
the average won't be less than 34.

> New compression method:
>
> 11000101
> 11000111
> 01001111
>
> A full byte less than bin.

You need to be *very* careful about what you're claiming here.
Are you saying that your algorithm compresses *all possible*
sequences of 10 decimal digits to 3 bytes or less? Or can
some of them come out longer?

--
Greg
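
The bit counts above are easy to verify (illustration only):

n = 4135124325
print(n.bit_length())               # 32 -- but only for this particular number
print((10**10 - 1).bit_length())    # 34 -- what an arbitrary 10-digit number needs

# A fixed 34-bit field keeps the leading zeros the decoder relies on:
print(format(n, '034b'))            # 0011110110011110001111110101100101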

Christian Gollwitzer

unread,
Oct 24, 2017, 3:48:51 AM10/24/17
to
On 23.10.17 at 12:13, Marko Rauhamaa wrote:
> Thomas Jollans <tj...@tjol.eu>:
>
>> On 2017-10-23 11:32, danceswi...@gmail.com wrote:
>>> According to this website. This is an uncompressable stream.
>>>
>>> https://en.m.wikipedia.org/wiki/Incompressible_string
>>>
>>> 1234999988884321
>>
>> No, it's not. According to that article, that string is incompressible
>> by a particular algorithm. I can see no more general claims.
>
> Here's a compression algorithm that manages to compress that string into
> a 0-bit string:
>
> * If the original string is 1234999988884321 (whatever that means),
> return the empty bit string.
>
> * Otherwise, prepend a don't-care bit to the original string and return
> the result of the concatenation.
>

...and that's why there is the "Kolmogorov complexity". You need to
append the decompression program to the data to show how much you really
saved, which will turn out to be nothing compared to the "trivial
decompressor"

print "1234999988884321"

Christian
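
In other words, the accounting has to include the program. For this one string the trivial decompressor is already close to optimal; a toy measurement, counting characters of source code:

data = "1234999988884321"
trivial_decompressor = 'print("1234999988884321")'
print(len(data))                  # 16 characters of "raw" data
print(len(trivial_decompressor))  # 25 characters of program, zero bytes of compressed data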

Gregory Ewing

unread,
Oct 24, 2017, 4:43:33 AM10/24/17
to
danceswi...@gmail.com wrote:
>
> My 8 year old can decode this back into base 10,

Keep in mind that your 8 year old has more information
than just the 32 bits you wrote down -- he can also
see that there *are* 32 bits and no more. That's
hidden information that you're not counting.

--
Greg

Gregory Ewing

unread,
Oct 24, 2017, 4:43:33 AM10/24/17
to
Paul Moore wrote:
> But that's not "compression", that's simply using a better encoding.
> In the technical sense, "compression" is about looking at redundancies
> that go beyond the case of how effectively you pack data into the
> bytes available.

There may be a difference in the way the terms are used, but
I don't think there's any fundamental difference. Compression
is about finding clever ways to make the encoding better.

Either way, the information-theoretic limits on the number
of bits needed are the same.

--
Greg

Paul Moore

unread,
Oct 24, 2017, 6:21:10 AM10/24/17
to
On 24 October 2017 at 09:43, Gregory Ewing <greg....@canterbury.ac.nz> wrote:
> Paul Moore wrote:
>>
>> But that's not "compression", that's simply using a better encoding.
>> In the technical sense, "compression" is about looking at redundancies
>> that go beyond the case of how effectively you pack data into the
>> bytes available.
>
>
> There may be a difference in the way the terms are used, but
> I don't think there's any fundamental difference. Compression
> is about finding clever ways to make the encoding better.

Agreed - I was trying (probably futilely, given the way this thread
has gone...) to make a distinction between purely local properties
that are typically considered in "how you encode the data" and the
detection of more global patterns, which is where what are typically
referred to as "compression" algorithms get their power. But sadly, I
don't think the OP is actually interested in understanding the
background, so the distinction wasn't really worth making :-(

> Either way, the information-theoretic limits on the number
> of bits needed are the same.

Precisely.
Paul

Ben Bacarisse

unread,
Oct 24, 2017, 6:23:32 AM10/24/17
to
danceswi...@gmail.com writes:

> Finally figured out how to turn this into a random binary compression
> program, since my transform can compress more than dec to binary. I took
> a random binary stream,

Forget random data. For one thing it's hard to define, but more
importantly no one cares about it. By its very nature, random data is
not interesting. What people want is a reversible compression algorithm
that works on *arbitrary data* -- i.e. on *any* file at all, no matter
how structured and *non-random* it is.

For example, run the complete works of Shakespeare through your program.
The result is very much not random data, but that's the sort of data
people want to compress. If you can compress the output of your
compressor you have made a good start. Of course what you really want
to be able to do is to compress the output that results from compressing
your compressed out. And, of course, you should not stop there. Since
you can compress *any* data (not just the boring random stuff) you can
keep going -- compressing the compressed output again and again until
you end up with a zero-length file.

Then you publish in a major journal. Post the link to the journal
article when you are done.

<snip>
--
Ben.
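
A real compressor makes the point immediately: fed its own output, it stops gaining almost at once. An illustration with zlib on some highly repetitive text (the exact sizes will vary):

import zlib

data = b"To be, or not to be, that is the question: " * 200
for i in range(4):
    print(i, len(data))
    data = zlib.compress(data, 9)
print(4, len(data))
# the first pass shrinks dramatically; later passes stall and then grow slightly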

Steve D'Aprano

unread,
Oct 24, 2017, 6:42:43 AM10/24/17
to
Indeed -- but let's give Dancerswithnumbers his due. *IF* he is right (a very
big "if" indeed) about being able to compress the Rand Corporation "Million
Random Digits" in binary form, as given, that alone would be an impressive
trick.

Compressing the digits in text form is not impressive in the least. As Ben
Bacarisse pointed out, most of us will probably already have half a dozen
programs that do that.

Paul Moore

unread,
Oct 24, 2017, 6:51:05 AM10/24/17
to
On 24 October 2017 at 11:23, Ben Bacarisse <ben.u...@bsb.me.uk> wrote:
> For example, run the complete works of Shakespeare through your program.
> The result is very much not random data, but that's the sort of data
> people want to compress. If you can compress the output of your
> compressor you have made a good start. Of course what you really want
> to be able to do is to compress the output that results from compressing
> your compressed out. And, of course, you should not stop there. Since
> you can compress *any* data (not just the boring random stuff) you can
> keep going -- compressing the compressed output again and again until
> you end up with a zero-length file.

Oh, and just for fun, if you are able to guarantee compressing
arbitrary data, then

1. Take a document you want to compress.
2. Compress it using your magic algorithm. The result is smaller.
3. Compress the compressed data. The result is still smaller.
4. Repeat until you hit 0 bytes.

Congratulations - apparently you have a reversible algorithm that
compresses every data set to an empty file. (Caveat - there's actually
"hidden data" here, as you need to know how many compressions it takes
to hit 0 bytes. Because you decrease the size every time, though, that
number must be no greater than the size of the original file).

Paul

Ben Bacarisse

unread,
Oct 24, 2017, 7:04:52 AM10/24/17
to
Paul Moore <p.f....@gmail.com> writes:

> On 24 October 2017 at 11:23, Ben Bacarisse <ben.u...@bsb.me.uk> wrote:
>> For example, run the complete works of Shakespeare through your program.
>> The result is very much not random data, but that's the sort of data
>> people want to compress. If you can compress the output of your
>> compressor you have made a good start. Of course what you really want
>> to be able to do is to compress the output that results from compressing
>> your compressed output. And, of course, you should not stop there. Since
>> you can compress *any* data (not just the boring random stuff) you can
>> keep going -- compressing the compressed output again and again until
>> you end up with a zero-length file.
>
> Oh, and just for fun, if you are able to guarantee compressing
> arbitrary data, then

It's a small point, but you are replying to a post of mine and saying
"you". That could make people think that /I/ am claiming to have a perfect
compression algorithm.

> 1. Take a document you want to compress.
> 2. Compress it using your magic algorithm. The result is smaller.
> 3. Compress the compressed data. The result is still smaller.
> 4. Repeat until you hit 0 bytes.

Isn't this just repeating what I said? I must not have written it
clearly enough.

<snip>
--
Ben.

Paul Moore

unread,
Oct 24, 2017, 7:27:20 AM10/24/17
to
On 24 October 2017 at 12:04, Ben Bacarisse <ben.u...@bsb.me.uk> wrote:
> Paul Moore <p.f....@gmail.com> writes:
>
>> On 24 October 2017 at 11:23, Ben Bacarisse <ben.u...@bsb.me.uk> wrote:
>>> For example, run the complete works of Shakespeare through your program.
>>> The result is very much not random data, but that's the sort of data
>>> people want to compress. If you can compress the output of your
>>> compressor you have made a good start. Of course what you really want
>>> to be able to do is to compress the output that results from compressing
>>> your compressed output. And, of course, you should not stop there. Since
>>> you can compress *any* data (not just the boring random stuff) you can
>>> keep going -- compressing the compressed output again and again until
>>> you end up with a zero-length file.
>>
>> Oh, and just for fun, if you are able to guarantee compressing
>> arbitrary data, then
>
> It's a small point, but you are replying to a post of mine and saying
> "you". That could make people think that /I/ am claiming to have a perfect
> compression algorithm.

Sorry. I intended the meaning "If one is able to..." but I was unclear. My bad.

>> 1. Take a document you want to compress.
>> 2. Compress it using your magic algorithm. The result is smaller.
>> 3. Compress the compressed data. The result is still smaller.
>> 4. Repeat until you hit 0 bytes.
>
> Isn't this just repeating what I said? I must not have written it
> clearly enough.

More accurately, I didn't read it carefully enough. Again sorry.

However, I guess it serves as an example of a compression algorithm -
we can trivially compress the content of our two posts into a single
post with just as much information content, by deleting my post :-)

Paul

Steve D'Aprano

unread,
Oct 24, 2017, 7:45:14 AM10/24/17
to
On Tue, 24 Oct 2017 09:23 pm, Ben Bacarisse wrote:

> Forget random data. For one thing it's hard to define,

That bit is true.

> but more importantly no one cares about it.

But that's wrong.

For instance:

- Encrypted data looks very much like random noise. With more and more data
traversing the internet in encrypted form, the ability to compress random
noise would be worth billions.

- Compressed data looks somewhat like random noise (with a bit of structure).
The more it is compressed, the more random it looks. If you could compress
random noise, you could take already compressed data, and compress it again,
saving even more space.

- Many multimedia formats (images, sound, video) are compressed using
dedicated encoders. The better the encoder, the more it compresses the data
(whether lossy or not) the harder it is to compress it further. If you could
compress random noise, you could compress JPGs, MP3s, h265-encoded MKVs,
etc, saving even more storage and transmission costs.

And most importantly:

- Random data is a superset of the arbitrary structured data you mention
below. If we could compress random data, then we could compress any data
at all, no matter how much or little structure it contained.

This is why the ability to compress random data (if it were possible, which it
is not) is interesting. It's not because people want to be able to compress
last night's lottery numbers, or tables of random digits.


> By its very nature, random data is
> not interesting. What people want is a reversible compression algorithm
> that works on *arbitrary data* -- i.e. on *any* file at all, no matter
> how structured and *non-random* it is.

In a sense you are right. Compressing randomly generated data would be a
parlour trick and not specifically very useful. But if you had such an
algorithm, that would change the face of the world.

It would be as revolutionary and paradigm breaking as a perpetual motion
machine, or discovery of a new continent the size of China in the middle of
the Atlantic, or that π actually does equal 22/7 exactly.

And just as impossible.


> For example, run the complete works of Shakespeare through your program.
> The result is very much not random data, but that's the sort of data
> people want to compress. If you can compress the output of your
> compressor you have made a good start. Of course what you really want
> to be able to do is to compress the output that results from compressing
> your compressed output. And, of course, you should not stop there. Since
> you can compress *any* data (not just the boring random stuff) you can
> keep going -- compressing the compressed output again and again until
> you end up with a zero-length file.

Indeed.

That proof by contradiction is yet another reason we know we can't compress
random data -- that is to say, *arbitrary* data. If we had a compression
program which could guarantee to ALWAYS shrink ANY file by at least one bit,
then we could just apply it over and over again, shrinking the compressed
file again and again, until we were left with a zero-byte file:

original.dat = 110 MB
original.dat.zip.zip.zip.zip.zip.zip.zip = 0 MB

And then reverse the process, to turn an empty file back into the original.

But given an empty file, how do you distinguish the empty file you get
from 'music.mp3' and the identical empty file you get from 'movie.avi'?

Obviously you cannot. So we know that the only way to *guarantee* to shrink
every possible file is if the compression is lossy.


> Then you publish in a major journal. Post the link to the journal
> article when you are done.

These days there are plenty of predatory journals which will be happy to take
Dancerswithnumber's money in return for publishing it in a junk journal.

https://en.wikipedia.org/wiki/Predatory_open_access_publishing

Steve D'Aprano

unread,
Oct 24, 2017, 7:48:40 AM10/24/17
to
On Tue, 24 Oct 2017 06:46 pm, danceswi...@gmail.com wrote:

> Greg, you're very smart, but you are missing a big key. I'm not padding,
> you are still thinking inside the box, and will never solve this by doing
> so. Yes! At least you see my accomplishment, this will compress any random
> file.

Talk is cheap.

Ben Bacarisse

unread,
Oct 24, 2017, 9:29:31 AM10/24/17
to
Steve D'Aprano <steve+...@pearwood.info> writes:

> On Tue, 24 Oct 2017 09:23 pm, Ben Bacarisse wrote:
>
>> Forget random data. For one thing it's hard to define,
>
> That bit is true.
>
>> but more importantly no one cares about it.
>
> But that's wrong.

All generalisations are false. I was being hyperbolic.

> For instance:
>
> - Encrypted data looks very much like random noise. With more and more data
> traversing the internet in encrypted form, the ability to compress random
> noise would be worth billions.
>
> - Compressed data looks somewhat like random noise (with a bit of structure).
> The more it is compressed, the more random it looks. If you could compress
> random noise, you could take already compressed data, and compress it again,
> saving even more space.
>
> - Many multimedia formats (images, sound, video) are compressed using
> dedicated encoders. The better the encoder, the more it compresses the data
> (whether lossy or not) the harder it is to compress it further. If you could
> compress random noise, you could compress JPGs, MP3s, h265-encoded MKVs,
> etc, saving even more storage and transmission costs.

But these are not random data. We care about these because they are
highly structured, non-random data.

> And most importantly:
>
> - Random data is a superset of the arbitrary structured data you mention
> below. If we could compress random data, then we could compress any data
> at all, no matter how much or little structure it contained.

Yes, that's part of my point. Arbitrary data includes random data but
it avoids arguments about what random means.

> This is why the ability to compress random data (if it were possible, which it
> is not) is interesting. It's not because people want to be able to compress
> last night's lottery numbers, or tables of random digits.

The trouble is a pedagogic one. Saying "you can't compress random data"
inevitably leads (though, again, this is just my experience) to endless
attempts to define random data. My preferred way out of that is to talk
about algorithmic complexity but for your average "I've got a perfect
compression algorithm" poster, that is step too far.

I think "arbitrary data" (thereby including the results of compression
by said algorithm) is the best way to make progress.

<snip>
>> Then you publish in a major journal. Post the link to the journal
>> article when you are done.
>
> These days there are plenty of predatory journals which will be happy to take
> Dancerswithnumber's money in return for publishing it in a junk
> journal.

Sure, but you usually get a huge advantage -- a text to criticise. Your
average Usenet crank will keep changing what they say to avoid being
pinned down. Plus you get to note the fact that the journal is junk.

--
Ben.

Ben Bacarisse

unread,
Oct 24, 2017, 9:34:31 AM10/24/17
to
Steve D'Aprano <steve+...@pearwood.info> writes:

> On Tue, 24 Oct 2017 06:46 pm, danceswi...@gmail.com wrote:
>
>> Greg, you're very smart, but you are missing a big key. I'm not padding,
>> you are still thinking inside the box, and will never solve this by doing
>> so. Yes! At least you see my accomplishment, this will compress any random
>> file.
>
> Talk is cheap.

But highly prized. Most Usenet cranks only want to be talked to (they
see it as being taken seriously, no matter how rude the respondents are)
so for the cost of something cheap (a little babbling) they get an
endless stream of highly prized attention.

--
Ben.

Lele Gaifax

unread,
Oct 24, 2017, 11:41:16 AM10/24/17
to
Steve D'Aprano <steve+...@pearwood.info> writes:

> But given an empty file, how do you distinguish the empty file you get
> from 'music.mp3' and the identical empty file you get from 'movie.avi'?

That's simple enough: of course one empty file would be
"music.mp3.zip.zip.zip", while the other would be
"movie.avi.zip.zip.zip.zip.zip"... some sort of
https://en.wikipedia.org/wiki/Water_memory applied to file system entries :-)

ciao, lele.
--
nickname: Lele Gaifax | Quando vivrò di quello che ho pensato ieri
real: Emanuele Gaifas | comincerò ad aver paura di chi mi copia.
le...@metapensiero.it | -- Fortunato Depero, 1929.

Tim Golden

unread,
Oct 24, 2017, 11:46:41 AM10/24/17
to
On 24/10/2017 16:40, Lele Gaifax wrote:
> Steve D'Aprano <steve+...@pearwood.info> writes:
>
>> But given an empty file, how do you distinguish the empty file you get
>> from 'music.mp3' and the identical empty file you get from 'movie.avi'?
>
> That's simple enough: of course one empty file would be
> "music.mp3.zip.zip.zip", while the other would be

I swear this looks like the lyrics of something or another...

"Music MP3 - zip - zip - zip"

TJG

Peter J. Holzer

unread,
Oct 24, 2017, 4:10:22 PM10/24/17
to
On 2017-10-23 04:21, Steve D'Aprano <steve+...@pearwood.info> wrote:
> On Mon, 23 Oct 2017 02:29 pm, Stefan Ram wrote:
>>
> If the probability of certain codes (either single codes, or sequences of
> codes) are non-equal, then you can take advantage of that by encoding the
> common cases into a short representation, and the uncommon and rare cases
> into a longer representation. As you say:
>
>
>> Otherwise, if ( 0, 0 ) is much more frequent,
>> we can encode ( 0, 0 ) by "0" and
>>
>> ( 0, 1 ) by "101",
>> ( 1, 0 ) by "110", and
>> ( 1, 1 ) by "111".
>>
>> And we could then use /less/ than two bits on the
>> average.
>
> That's incorrect. On average you use 2.5 bits.
>
> (1*1 bit + 3*3 bits divide by four possible outcomes, makes 2.5 bits.)

I disagree. If the distribution is not equal, then the average needs to
take the different probabilities into account.

Let's assume that (0, 0) has a probability of 75 %, (0, 1) a probability
of 15 % and (1, 0) and (1, 1) a probability of 5 % each.

Then the average length is

0.75 * 1 bit + 0.15 * 3 bits + 0.05 * 3 bits + 0.05 * 3 bits = 1.5 bits.

hp


--
_ | Peter J. Holzer | Fluch der elektronischen Textverarbeitung:
|_|_) | | Man feilt solange an seinen Text um, bis
| | | h...@hjp.at | die Satzbestandteile des Satzes nicht mehr
__/ | http://www.hjp.at/ | zusammenpaßt. -- Ralph Babel
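
For completeness, the same weighted average (and the entropy of that distribution, which no prefix code can beat) in a few lines of Python; the probabilities and codewords are the ones from the example above:

from math import log2

prob = {'00': 0.75, '01': 0.15, '10': 0.05, '11': 0.05}
code = {'00': '0',  '01': '101', '10': '110', '11': '111'}

avg_bits = sum(p * len(code[s]) for s, p in prob.items())
entropy  = -sum(p * log2(p) for p in prob.values())
print(avg_bits)   # 1.5 bits per pair with this particular code
print(entropy)    # about 1.15 bits per pair is the theoretical floor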

Peter Pearson

unread,
Oct 24, 2017, 5:18:22 PM10/24/17
to
On Tue, 24 Oct 2017 14:51:37 +1100, Steve D'Aprano wrote:
> On Tue, 24 Oct 2017 01:27 pm, danceswi...@gmail.com wrote:
>> Yes! Decoding in reverse is easy... sorry, so excited I could shout.
>
> Then this should be easy for you:
>
> http://marknelson.us/2012/10/09/the-random-compression-challenge-turns-ten/
>
> All you need to do is compress this file:
>
> http://marknelson.us/attachments/million-digit-challenge/AMillionRandomDigits.bin
>
> to less than 415241 bytes, and you can win $100.

Then, on Mon, 23 Oct 2017 21:13:00 -0700 (PDT), danceswithnumbers wrote:
> I did that quite a while ago.


But 352,954 kb > 415241 bytes, by several orders of magnitude; so
you didn't "do that". (Or are we using the European decimal point?)

If you're claiming 352,954 *bytes*, not kb, I invite you to explain
why you have not collected Mark Nelson's $100 prize, and untold fame
and glory; failing which, your credibility will evaporate.

--
To email me, substitute nowhere->runbox, invalid->com.

Ian Kelly

unread,
Oct 24, 2017, 5:25:55 PM10/24/17
to
On Tue, Oct 24, 2017 at 12:20 AM, Gregory Ewing
<greg....@canterbury.ac.nz> wrote:
> danceswi...@gmail.com wrote:
>>
>> I did that quite a while ago. 352,954 kb.
>
>
> Are you sure? Does that include the size of all the
> code, lookup tables, etc. needed to decompress it?

My bet is that danceswithnumbers does indeed have a file of that size
which is in some way derived from the million random digits, but
without any means of losslessly "decompressing" it (thus making it
junk data).

Steve D'Aprano

unread,
Oct 24, 2017, 6:11:15 PM10/24/17
to
On Wed, 25 Oct 2017 02:40 am, Lele Gaifax wrote:

> Steve D'Aprano <steve+...@pearwood.info> writes:
>
>> But given an empty file, how do you distinguish the empty file you get
>> from 'music.mp3' and the identical empty file you get from 'movie.avi'?
>
> That's simple enough: of course one empty file would be
> "music.mp3.zip.zip.zip", while the other would be
> "movie.avi.zip.zip.zip.zip.zip"... some sort of
> https://en.wikipedia.org/wiki/Water_memory applied to file system entries
> :-)


Does that mean if I name an empty file

serenity2-by-joss-whedon.avi.zip.zip.zip.zip.zip

Dancerswithnumbers' magic algorithm will recreate the movie from some
alternative universe where it actually exists?

Awesome.

Steve D'Aprano

unread,
Oct 24, 2017, 6:31:09 PM10/24/17
to
On Wed, 25 Oct 2017 07:09 am, Peter J. Holzer wrote:

> On 2017-10-23 04:21, Steve D'Aprano <steve+...@pearwood.info> wrote:
>> On Mon, 23 Oct 2017 02:29 pm, Stefan Ram wrote:
>>>
>> If the probability of certain codes (either single codes, or sequences of
>> codes) are non-equal, then you can take advantage of that by encoding the
>> common cases into a short representation, and the uncommon and rare cases
>> into a longer representation. As you say:
>>
>>
>>> Otherwise, if ( 0, 0 ) is much more frequent,
>>> we can encode ( 0, 0 ) by "0" and
>>>
>>> ( 0, 1 ) by "101",
>>> ( 1, 0 ) by "110", and
>>> ( 1, 1 ) by "111".
>>>
>>> And we could then use /less/ than two bits on the
>>> average.
>>
>> That's incorrect. On average you use 2.5 bits.
>>
>> (1*1 bit + 3*3 bits divide by four possible outcomes, makes 2.5 bits.)
>
> I disagree. If the distribution is not equal, then the average needs to
> take the different probabilities into account.

I think I would call that the *weighted* average rather than the average.

Regardless of what we call it, of course both you and Stefan are right in how
to calculate it, and such a variable-length scheme can be used to compress
the data.

E.g. given the sequence 00000011 which would take 8 bits in the obvious
encoding, we can encode it as "000111" which takes only 6 bits.

But the cost of this encoding scheme is that *some* bit sequences expand, e.g.
the 8 bit sequence 11111100 is encoded as "1111111110" which requires 10
bits.

The end result is that averaged over all possible bit sequences (of a certain
size), this encoding scheme requires MORE space than the obvious 0/1 bits.

But in practice we don't care much, because the data sequences we care about
are usually not "all possible bit sequences", but a heavily restricted subset
where there are lots of 00 pairs and fewer 01, 10, and 11 pairs.
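
That scheme is small enough to write out in full; here is a sketch with a round-trip check, showing one input that shrinks and one that grows:

ENC = {'00': '0', '01': '101', '10': '110', '11': '111'}
DEC = {v: k for k, v in ENC.items()}

def encode(bits):
    return ''.join(ENC[bits[i:i + 2]] for i in range(0, len(bits), 2))

def decode(code):
    out, cur = [], ''
    for bit in code:
        cur += bit
        if cur in DEC:           # the code is prefix-free, so this never misfires
            out.append(DEC[cur])
            cur = ''
    return ''.join(out)

for s in ('00000011', '11111100'):
    c = encode(s)
    assert decode(c) == s
    print(s, '->', c, '({} bits)'.format(len(c)))
# 00000011 -> 000111 (6 bits)
# 11111100 -> 1111111110 (10 bits)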

Chris Angelico

unread,
Oct 24, 2017, 7:10:11 PM10/24/17
to
On Wed, Oct 25, 2017 at 9:11 AM, Steve D'Aprano
<steve+...@pearwood.info> wrote:
> On Wed, 25 Oct 2017 02:40 am, Lele Gaifax wrote:
>
>> Steve D'Aprano <steve+...@pearwood.info> writes:
>>
>>> But given an empty file, how do you distinguish the empty file you get
>>> from 'music.mp3' and the identical empty file you get from 'movie.avi'?
>>
>> That's simple enough: of course one empty file would be
>> "music.mp3.zip.zip.zip", while the other would be
>> "movie.avi.zip.zip.zip.zip.zip"... some sort of
>> https://en.wikipedia.org/wiki/Water_memory applied to file system entries
>> :-)
>
>
> Does that mean if I name an empty file
>
> serenity2-by-joss-whedon.avi.zip.zip.zip.zip.zip
>
> Dancerswithnumbers' magic algorithm will recreate the movie from some
> alternative universe where it actually exists?
>
> Awesome.

Yes, but then you'd get
dmca-takedown-request.pdf.zip.zip.zip.zip.zip.zip.zip which would also
be empty.

ChrisA

Richard Damon

unread,
Oct 24, 2017, 7:15:39 PM10/24/17
to
My understanding of the 'Random Data Compressibility' challenge is
that it requires that the compression take ANY/ALL strings of up to N
bits and generate an output stream no longer than the input stream, and
sometimes less. It admits that, given non-uniformly distributed data, it
is possible to compress some patterns (the common ones) and expand others
(the uncommon ones) to lower the net average length. What it says can't
be done is to have a compression method that compresses EVERY input
pattern. That is where the 'Pigeon Hole' principle comes into play, which
the people who claim to be able to compress random data like to ignore
or just attempt to say doesn't apply.
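
The counting behind that principle fits in a few lines (illustration only):

n = 16
inputs = 2 ** n                                   # distinct n-bit inputs
shorter_outputs = sum(2 ** k for k in range(n))   # all outputs of 0 .. n-1 bits
print(inputs, shorter_outputs)                    # 65536 vs 65535: at least one n-bit
                                                  # input cannot map to a shorter output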

Gregory Ewing

unread,
Oct 25, 2017, 3:32:27 AM10/25/17
to
Steve D'Aprano wrote:
> - Encrypted data looks very much like random noise.

There's actually a practical use for that idea. If you can feed
the output of an encryption algorithm through a compressor and
make it smaller, it means there is a cryptographic weakness
in the algorithm that could potentially be exploited. Good
encryption algorithms produce output that looks almost completely
random to anyone who doesn't know how to decrypt it.

--
Greg
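
A sketch of that test, with os.urandom standing in for the output of a good cipher and zlib as the general-purpose compressor:

import os
import zlib

ciphertext = os.urandom(100000)          # stand-in for well-encrypted data
compressed = zlib.compress(ciphertext, 9)
print(len(ciphertext), len(compressed))  # the "compressed" form comes out slightly larger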

Gregory Ewing

unread,
Oct 25, 2017, 4:11:42 AM10/25/17
to
Ben Bacarisse wrote:
> The trouble is a pedagogic one. Saying "you can't compress random data"
> inevitably leads (though, again, this is just my experience) to endless
> attempts to define random data.

It's more about using terms without making sure everyone agrees
on the definitions being used.

In this context, "random data" really means "uniformly distributed
data", i.e. any bit sequence is equally likely to be presented as
input. *That's* what information theory says can't be compressed.

> I think "arbitrary data" (thereby including the results of compression
> by said algorithm) is the best way to make progress.

I'm not sure that's much better, because it doesn't home in
on the most important thing, which is the probability
distribution.

--
Greg

Gregory Ewing

unread,
Oct 25, 2017, 4:25:23 AM10/25/17
to
Lele Gaifax wrote:
> That's simple enough: of course one empty file would be
> "music.mp3.zip.zip.zip", while the other would be
> "movie.avi.zip.zip.zip.zip.zip"... some sort of
> https://en.wikipedia.org/wiki/Water_memory applied to file system entries :-)

If you're allowed to alternate between two compression
methods, then the way you decompress
music.mp3.zip.zip.tgz.zip...........tgz.zip.tgz
is to output 0 each time zip was applied and 1 each
time tar/gz was applied.

You may be able to take some shortcuts in some
cases, e.g. anything beginning with "movie.flv"
almost certainly contains a cute kitten video.
(Unless it's found inside an encrypted disk
partition, in which case it contains porn.)

--
Greg

Ian Kelly

unread,
Oct 25, 2017, 9:28:32 AM10/25/17
to
On 10/24/17, Richard Damon <Ric...@damon-family.org> wrote:
> My understanding of the 'Random Data Compressibility' challenge is
> that it requires that the compression take ANY/ALL strings of up to N
> bits and generate an output stream no longer than the input stream, and
> sometimes less.

That's incorrect, at least for the challenge being discussed. Here's the link:

http://marknelson.us/2012/10/09/the-random-compression-challenge-turns-ten/

The challenge is just to take a single known file of a million random
digits and make it smaller, including the size of the decompressor and
without hiding data. So far in 15 years nobody has succeeded even at
this, and nobody knows whether it's impossible. For instance, it may be
the case that the data in the file happens to be the nth prime, in
which case it could simply be compressed to the number n with a
decompressor that calculates the nth prime.

> It admits that, given non-uniformly distributed data, it is
> possible to compress some patterns (the common ones) and expand others
> (the uncommon ones) to lower the net average length. What it says can't
> be done is to have a compression method that compresses EVERY input
> pattern. That is where the 'Pigeon Hole' principle comes into play, which
> the people who claim to be able to compress random data like to ignore
> or just attempt to say doesn't apply.

There is a second challenge on that page that is as you state, but the
page admits that this second challenge is a bit of a troll since this
is provably impossible.