
Compressing million random digits file


Matt Mahoney

unread,
Jun 10, 2004, 4:50:43 PM6/10/04
to
Seriously, I believe I have found an artifact (a small one) that could be
exploited by a data compressor. According to
http://www.rand.org/publications/classics/randomdigits/
the data was generated using an "electronic roulette wheel", a noisy 100 KHz
pulse generator and 5-bit counter sampled about once per second. Each
sample in the range 0-19 generated one digit modulo 10. Counter values of
20-31 were discarded. The remaining data was punched onto 20,000 cards
holding 50 digits each.

According to the website, small statistical biases were found in this data,
and these were removed by adding the corresponding digit on the previous
card modulo 10. Thus, it should be possible to recover these biases by
subtracting the digit 50 positions back in the file. These biases would
most likely occur in the higher order counter bits.

I calculated the distribution of low (0-4) and high (5-9) digits in the
sequence
x[i] - x[i-d] mod 10, where x is the million digits and the offset d ranges
from 1 to 100 (with wraparound for the first d digits). Then I sorted the
list of 100 distributions from most skewed to least skewed. If the data
were random, we would expect the order of this list to be random as well.
Here is the actual list.

d 0-4 5-9 entropy
50 501369 498631 0.999994592
100 501230 498770 0.999995635
39 501049 498951 0.999996825
2 500998 499002 0.999997126
44 499004 500996 0.999997138
69 500923 499077 0.999997542
78 500875 499125 0.999997791
19 500856 499144 0.999997886
43 500767 499233 0.999998303
72 500749 499251 0.999998381
82 499255 500745 0.999998399
22 499283 500717 0.999998517
71 499311 500689 0.999998630
57 500688 499312 0.999998634
61 499320 500680 0.999998666
96 499324 500676 0.999998681
64 500672 499328 0.999998697
47 499346 500654 0.999998766
86 500612 499388 0.999998919
75 499427 500573 0.999999053
25 499439 500561 0.999999092
17 499457 500543 0.999999149
81 500542 499458 0.999999152
42 500535 499465 0.999999174
66 500500 499500 0.999999279
52 499529 500471 0.999999360
93 499532 500468 0.999999368
97 500456 499544 0.999999400
27 500455 499545 0.999999403
62 499549 500451 0.999999413
92 500441 499559 0.999999439
26 500439 499561 0.999999444
90 500434 499566 0.999999457
95 500432 499568 0.999999462
33 499571 500429 0.999999469
31 499579 500421 0.999999489
94 499587 500413 0.999999508
30 499598 500402 0.999999534
54 499599 500401 0.999999536
68 500394 499606 0.999999552
13 500377 499623 0.999999590
7 499628 500372 0.999999601
91 500367 499633 0.999999611
49 499648 500352 0.999999642
6 499655 500345 0.999999657
10 499675 500325 0.999999695
59 500296 499704 0.999999747
65 499706 500294 0.999999751
70 500291 499709 0.999999756
41 499716 500284 0.999999767
98 499718 500282 0.999999771
29 500270 499730 0.999999790
48 499732 500268 0.999999793
79 499737 500263 0.999999800
34 499739 500261 0.999999803
8 499743 500257 0.999999809
11 499747 500253 0.999999815
16 499752 500248 0.999999823
56 499765 500235 0.999999841
23 499765 500235 0.999999841
67 500226 499774 0.999999853
84 500214 499786 0.999999868
36 500210 499790 0.999999873
58 500191 499809 0.999999895
73 499826 500174 0.999999913
89 499829 500171 0.999999916
46 499838 500162 0.999999924
80 500154 499846 0.999999932
35 500152 499848 0.999999933
55 499851 500149 0.999999936
63 500132 499868 0.999999950
53 500125 499875 0.999999955
99 499880 500120 0.999999958
37 499879 500121 0.999999958
38 499882 500118 0.999999960
88 499883 500117 0.999999961
87 500112 499888 0.999999964
51 499892 500108 0.999999966
21 500107 499893 0.999999967
3 499896 500104 0.999999969
45 500102 499898 0.999999970
76 500102 499898 0.999999970
40 500102 499898 0.999999970
77 499901 500099 0.999999972
5 499919 500081 0.999999981
12 499926 500074 0.999999984
4 499932 500068 0.999999987
83 499934 500066 0.999999987
32 500058 499942 0.999999990
74 499940 500060 0.999999990
60 499948 500052 0.999999992
24 500048 499952 0.999999993
15 499950 500050 0.999999993
1 500043 499957 0.999999995
28 499962 500038 0.999999996
18 500038 499962 0.999999996
85 499961 500039 0.999999996
9 500032 499968 0.999999997
14 499978 500022 0.999999999
20 499994 500006 1.000000000

The odds that 50 would be at the top of the list are 1%. The odds that 50
and 100 would be the top two entries are 0.01%.

I checked higher values of d, but the trend does not continue. Also, I did
not find any correlation between adjacent digits with d = 50 or 100.
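
For reference, here is a minimal sketch of the scan described above. The file
name million.txt is an assumption (one million ASCII digits with line numbers
and spaces already stripped), and the entropy column is taken to be the binary
entropy of the low/high split:

from math import log2

# Sketch of the offset scan: for each offset d, count low (0-4) vs. high (5-9)
# digits of x[i] - x[i-d] mod 10 (with wraparound) and compute the binary
# entropy of that split, then sort offsets from most skewed to least skewed.
# million.txt is an assumed plain-digit file, not part of the original post.
x = [int(c) for c in open("million.txt").read() if c.isdigit()]
n = len(x)  # expect 1,000,000

def skew(d):
    low = sum(1 for i in range(n) if (x[i] - x[i - d]) % 10 < 5)  # x[i-d] wraps
    p = low / n
    return -(p * log2(p) + (1 - p) * log2(1 - p)), d, low, n - low

for h, d, low, high in sorted(skew(d) for d in range(1, 101)):
    print(d, low, high, "%.9f" % h)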

I doubt that this statistical anomaly could be exploited very well. The
entropy savings over the whole file is less than 6 bits for d = 50 or 5 bits
for d = 100. I doubt that anyone could build a decompressor smaller than 11
bits so I am not worried about any challenges (yet).

-- Matt Mahoney


Phil Carmody

unread,
Jun 10, 2004, 5:37:46 PM6/10/04
to
"Matt Mahoney" <matma...@yahoo.com> writes:
> d 0-4 5-9 entropy
> 50 501369 498631 0.999994592
> 100 501230 498770 0.999995635
...

> 75 499427 500573 0.999999053
> 25 499439 500561 0.999999092
...

> The odds that 50 would be at the top of the list are 1%. The odds that 50
> and 100 would be the top two entries are 0.01%.

What are the chances that 25 and 75 would be next to each other in the list?

Phil
--
1st bug in MS win2k source code found after 20 minutes: scanline.cpp
2nd and 3rd bug found after 10 more minutes: gethost.c
Both non-exploitable. (The 2nd/3rd ones might be, depending on the CRTL)

David A. Scott

unread,
Jun 10, 2004, 7:33:44 PM6/10/04
to
"Matt Mahoney" <matma...@yahoo.com> wrote in
news:DQ3yc.20779$Yd3....@newsread3.news.atl.earthlink.net:

>
> According to the website, small statistical biases were found in this
> data, and these were removed by adding the corresponding digit on the
> previous card modulo 10. Thus, it should be possible to recover these
> biases by subtracting the digit 50 positions back in the file. These
> biases would most likely occur in the higher order counter bits.
>
>

Thanks for the info. Now I don't trust it to be randomly generated.


David A. Scott
--
My Crypto code
http://cryptography.org/cgi-bin/crypto.cgi/Misc/scott19u.zip
http://cryptography.org/cgi-bin/crypto.cgi/Misc/scott16u.zip
http://www.jim.com/jamesd/Kong/scott19u.zip old version
My Compression code http://bijective.dogma.net/
**TO EMAIL ME drop the roman "five" **
Disclaimer: I am in no way responsible for any of the statements
made in the above text. For all I know I might be drugged.
As a famous person once said "any cryptographic
system is only as strong as its weakest link"

Matt Mahoney

unread,
Jun 10, 2004, 7:34:41 PM6/10/04
to

"Phil Carmody" <thefatphi...@yahoo.co.uk> wrote in message
news:87hdtjw...@nonospaz.fatphil.org...

> "Matt Mahoney" <matma...@yahoo.com> writes:
> > d 0-4 5-9 entropy
> > 50 501369 498631 0.999994592
> > 100 501230 498770 0.999995635
> ...
> > 75 499427 500573 0.999999053
> > 25 499439 500561 0.999999092
> ...
>
> > The odds that 50 would be at the top of the list are 1%. The odds that 50
> > and 100 would be the top two entries are 0.01%.
>
> What are the chances that 25 and 75 would be next to each other in the list?

I didn't even notice that. I'd say 2%, just like any other pair of numbers.

I don't think 50 and 100 would be remarkable if we didn't know how the data
was generated. The entropy reduction is only about the size of d in binary.
I can't explain the 100 based on the documentation. Maybe they made 2
passes through the card deck to improve the randomization and left out this
little detail.

-- Matt Mahoney


Matt Mahoney

unread,
Jun 10, 2004, 9:56:48 PM6/10/04
to

"David A. Scott" <daVvid_...@email.com> wrote in message
news:Xns9504B2C9864E3H1...@130.133.1.4...

> "Matt Mahoney" <matma...@yahoo.com> wrote in
> news:DQ3yc.20779$Yd3....@newsread3.news.atl.earthlink.net:
>
> >
> > According to the website, small statistical biases were found in this
> > data, and these were removed by adding the corresponding digit on the
> > previous card modulo 10. Thus, it should be possible to recover these
> > biases by subtracting the digit 50 positions back in the file. These
> > biases would most likely occur in the higher order counter bits.
> >
> >
>
> Thanks for the info. Now I don't trust it to be randomly generated.

It gets worse. Add up every 50'th digit starting at any position from 0 to
49. The result is always even. That's 50 more bits of entropy.
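
For anyone who wants to check that quickly, here is a sketch (million.txt is an
assumed file holding the plain million digits, whitespace and line numbers
removed):

# Check that every one of the 50 column sums (every 50th digit) is even.
# million.txt is a hypothetical file name, not from the original post.
x = [int(c) for c in open("million.txt").read() if c.isdigit()]
col_sums = [sum(x[i::50]) for i in range(50)]
print(all(s % 2 == 0 for s in col_sums))  # True, if the claim above holds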

Also, I was mistaken that x[i] - x[i-50] recovers the original pairs of
cards that were added. It is not possible to recover that information
directly. For example, the two sequences

1 2 3 4
0 3 2 5

both result in the following sequence of adjacent sums (where the first term
is wrapped around).

5 3 5 7

We know from the above result that the sum was indeed wrapped around, rather
than generated as the sum of 20,001 adjacent cards. Also, we know that
there was only one pass, or else the sums would be multiples of 4 (they are
not).

I think if the original sequence can be deduced from the 10^50 possibilities
that more biases might be found. (Correlations between adjacent digits,
perhaps).

-- Matt Mahoney


Mark Nelson

unread,
Jun 10, 2004, 10:20:49 PM6/10/04
to
Has anyone any opinion about ent, available here:

http://www.fourmilab.ch/random/

"This page describes a program, ent, which applies various tests to
sequences of bytes stored in files and reports the results of those tests.
The program is useful for those evaluating pseudorandom number generators
for encryption and statistical sampling applications, compression
algorithms, and other applications where the information density of a file
is of interest."

Here's what it has to say about the million random digits:
********************program output
Entropy = 7.999538 bits per byte.

Optimum compression would reduce the size
of this 415241 byte file by 0 percent.

Chi square distribution for 415241 samples is 266.11, and randomly
would exceed this value 50.00 percent of the times.

Arithmetic mean value of data bytes is 127.3824 (127.5 = random).
Monte Carlo value for Pi is 3.140883738 (error 0.02 percent).
Serial correlation coefficient is 0.000295 (totally uncorrelated = 0.0).
********************eof

So its internal model would be able to squeeze 24 bytes out of the file.

There are a couple of random number sites on the web that claim to use
quantum effects for true random number generation. Given time it would be
interesting to see how well they did when measured up against this file.

--
|
| Mark Nelson - ma...@ieee.org - http://MarkNelson.us
| The Ultimate Compression Resource - http://www.datacompression.info
| Sponsored by Xceedsoft.com - Serious About Components
|

"Matt Mahoney" <matma...@yahoo.com> wrote in message
news:Aj8yc.5466$Y3....@newsread2.news.atl.earthlink.net...

Matt Mahoney

unread,
Jun 10, 2004, 10:48:57 PM6/10/04
to

"Mark Nelson" <ma...@ieee.org> wrote in message
news:5G8yc.66602$3x.4024@attbi_s54...

> Has anyone any opinion about ent, available here:
>
> http://www.fourmilab.ch/random/
>
> "This page describes a program, ent, which applies various tests to
> sequences of bytes stored in files and reports the results of those tests.
> The program is useful for those evaluating pseudorandom number generators
> for encryption and statistical sampling applications, compression
> algorithms, and other applications where the information density of a file
> is of interest."
>
> Here's what it has to say about the million random digits:
> ********************program output
> Entropy = 7.999538 bits per byte.

I wonder if that's based on an order 0 distribution. We would expect such a
calculation to be slightly less than 8 because the probabilities are not all
exactly 1/256 if you just count bytes. If my calculations are correct, the
expected entropy for 415,241 random bytes is:

H = -sum p log2 p, where I set half of the p to 1/256 + 1 stdev and half to
1/256 - 1 stdev, giving 7.999527 bpc.
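
If ent is indeed doing an order 0 calculation, it is just the byte-frequency
entropy; a sketch follows (the .bin file name copies the one used elsewhere in
this thread, and whether ent computes exactly this is a guess):

from collections import Counter
from math import log2

# Order-0 (byte frequency) entropy of the binary file, in bits per byte.
# File name is assumed; adjust to wherever the .bin actually lives.
data = open("AMillionRandomDigits.bin", "rb").read()
n = len(data)
counts = Counter(data)
h = -sum((c / n) * log2(c / n) for c in counts.values())
print("%.6f bits per byte over %d bytes" % (h, n))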

I've been doing my tests on the base 10 representation. I think the
conversion to base 256 (arithmetic coded?) would hide even obvious
statistical anomalies.

Also, data can be non-random in ways that can't be detected. For example, a
string of encrypted zero bytes will pass all tests for randomness until you
guess the key. In general it is equivalent to finding the Kolmogorov
complexity, which is equivalent to deciding the halting problem.

I doubt the program tried converting the file to a base 10 representation
and adding up every 50th digit.

-- Matt Mahoney


George Johnson

unread,
Jun 10, 2004, 11:12:34 PM6/10/04
to
"Matt Mahoney" <matma...@yahoo.com> wrote in message
news:DQ3yc.20779$Yd3....@newsread3.news.atl.earthlink.net...

| Seriously, I believe I have found an artifact (a small one) that could be
| exploited by a data compressor. According to
| http://www.rand.org/publications/classics/randomdigits/
| the data was generated using an "electronic roulette wheel", a noisy 100 KHz
| pulse generator and 5-bit counter sampled about once per second. Each
| sample in the range 0-19 generated one digit modulo 10. Counter values of
| 20-31 were discarded. The remaining data was punched onto 20,000 cards
| holding 50 digits each.
|
| According to the website, small statistical biases were found in this data,
| and these were removed by adding the corresponding digit on the previous
| card modulo 10. Thus, it should be possible to recover these biases by
| subtracting the digit 50 positions back in the file. These biases would
| most likely occur in the higher order counter bits.
|
| I calculated the distribution of low (0-4) and high (5-9) digits in the
| sequence
| x[i] - x[i-d] mod 10, where x is the million digits and the offset d ranges
| from 1 to 100 (with wraparound for the first d digits). Then I sorted the
| list of 100 distributions from most skewed to least skewed. If the data
| were random, we would expect the order of this list to be random as well.
| Here is the actual list.

[clipped]

Good boy, you figured out something obvious I figured out decades ago.

If you shuffle a deck of cards, the limit of randomness is limited to
the order of shuffling (move every 4th card to the bottom of the deck for
100 times and the randomness will not increase or decrease). If you shuffle
using a good pseudo-random number pattern (every 3rd card, then 1 card, then
the 4th card, then 1 card, then the 5th card, etc...
31415926535897932384626433832795... or 14142135623730950488016887242097...
or the non-integer conversion of a number) then you can actually DECREASE
THE RANDOMNESS because the randomness level cannot be increased using a
random shuffling technique. Of course, shuffling is not the limit to
conversions applicable on any file, but merely the beginning.

The problems associated with this are computational time and fine-tuning
the choice of pseudo-random generation techniques before actually
compressing later. You can understand that it is less time-consuming trying
to educate this group (given the paranoia and personality flaws in their
genius). It is much more productive to slowly leave a trail of breadcrumbs
so they can figure out the obvious truths for themselves.


Phil Carmody

unread,
Jun 11, 2004, 4:40:40 AM6/11/04
to
"Matt Mahoney" <matma...@yahoo.com> writes:

Matt, can you mail me (or make downloadable) your base-10 version of
the file please? (no point in more than one person writing a program
to process the .bin).

Cheers,

Phil Carmody

unread,
Jun 11, 2004, 4:47:31 AM6/11/04
to
"George Johnson" <matr...@voyager.net> writes:
> Good boy, you figured out something obvious I figured out decades ago.

Oooh, patronising Matt, eh?

You'd better be bulletproof...

> If you shuffle a deck of cards, the limit of randomness is limited to
> the order of shuffling (move every 4th card to the bottom of the deck for
> 100 times and the randomness will not increase or decrease). If you shuffle
> using a good pseudo-random number pattern (every 3rd card, then 1 card, then
> the 4th card, then 1 card, then the 5th card, etc...
> 31415926535897932384626433832795... or 14142135623730950488016887242097...
> or the non-integer conversion of a number) then you can actually DECREASE
> THE RANDOMNESS because the randomness level cannot be increased using a
> random shuffling technique. Of course, shuffling is not the limit to
> conversions applicable on any file, but merely the beginning.

What a load of bullshit. You obviously haven't got the faintest idea of
the meaning of the terms you use in the above. Either that or you're
deliberately playing stupid so that you can get flamed for ignorance on
Usenet. Well, congratulations - it worked!


> The problems associated with this is computational time and fine-tuning
> the choice of pseudo-random generation techniques before actually
> compressing later. You can understand that it is less time-consuming trying
> to educate this group (given the paranoia and personality flaws in their
> genius). It is much more productive to slowly leave a trail of breadcrumbs
> so they can figure out the obvious truths for themselves.

My god, you've been studying the methods of the comp.compression loons
over the years, I see. Quite why you chose to adopt their wrongthink, I
really don't wish to know. Lithium should work.

Phil Carmody

unread,
Jun 11, 2004, 6:12:51 AM6/11/04
to
"Mark Nelson" <ma...@ieee.org> writes:
> > It gets worse. Add up every 50'th digit starting at any position from 0 to
> > 49. The result is always even. That's 50 more bits of entropy.
> >
> > Also, I was mistaken that x[i] - x[i-50] recovers the original pairs of
> > cards that were added. It is not possible to recover that information
> > directly. For example, the two sequences
> >
> > 1 2 3 4
> > 0 3 2 5
> >
> > both result in the following sequence of adjacent sums (where the first term
> > is wrapped around).
> >
> > 5 3 5 7
> >
> > We know from the above result that the sum was indeed wrapped around, rather
> > than generated as the sum of 20,001 adjacent cards. Also, we know that
> > there was only one pass, or else the sums would be multiples of 4 (they are
> > not).
> >
> > I think if the original sequence can be deduced from the 10^50 possibilities

10 possibilities, you just have to do it 50 times.

Here's the distribution and entropy rate (in bits/digit) for columns 0 and 1:

NB - I barely trust these figures, don't take them as gospel.

bash-2.05b$ for i in xa?; do perl -e '$_=<>;chomp$_;$d[$_]++for(split//,$_);$e-=(log($_/20000)/log(2)*$_)/20000 for(@d);print(join(" ",@d),"= $e\n")' < $i; done
1963 2021 1973 2073 2015 2071 1993 1940 1931 2020= 3.3215327526006
2020 1994 2022 2013 2129 2019 1940 1947 2014 1902= 3.32131186867644
1933 2021 2034 2078 2017 1998 1973 2014 1918 2014= 3.32156077081434
2015 1973 2077 2038 1947 1971 2072 1944 2014 1949= 3.32151272890723
1989 2071 1977 1946 1992 2021 1942 2072 1975 2015= 3.32159532378674
2071 1993 1940 1931 2020 1963 2021 1973 2073 2015= 3.3215327526006
2019 1940 1947 2014 1902 2020 1994 2022 2013 2129= 3.32131186867644
1998 1973 2014 1918 2014 1933 2021 2034 2078 2017= 3.32156077081434
1971 2072 1944 2014 1949 2015 1973 2077 2038 1947= 3.32151272890723
2021 1942 2072 1975 2015 1989 2071 1977 1946 1992= 3.32159532378674

bash-2.05b$ perl -pe '$_=substr($_,1,1);' < million.asc >| million.1.txt
bash-2.05b$ perl -e '$_=<>;for $i(0..9){$d=$i;for($j=0; $j<20000; ++$j){$d=(10+substr($_,$j,1)-$d)%10;print($d);}print("\n")}' < million.1.txt >| million.1.all
bash-2.05b$ split -l1 million.1.all
bash-2.05b$ for i in xa?; do perl -e '$_=<>;chomp$_;$d[$_]++for(split//,$_);$e-=(log($_/20000)/log(2)*$_)/20000 for(@d);print(join(" ",@d),"= $e\n")' < $i; done
2016 1943 2026 1996 2022 2018 2018 1923 1965 2073= 3.32160606457746
2087 2013 1928 2031 1987 2102 2006 1919 1945 1982= 3.32130953971763
1979 2072 2018 1919 2111 1975 2003 2028 1936 1959= 3.32135301752889
1944 1984 2063 2098 1907 2012 1997 2020 2042 1933= 3.32134325101331
1938 1935 2064 2051 1999 1929 2029 2011 2017 2027= 3.32153383436245
2018 2018 1923 1965 2073 2016 1943 2026 1996 2022= 3.32160606457746
2102 2006 1919 1945 1982 2087 2013 1928 2031 1987= 3.32130953971763
1975 2003 2028 1936 1959 1979 2072 2018 1919 2111= 3.32135301752888
2012 1997 2020 2042 1933 1944 1984 2063 2098 1907= 3.32134325101331
1929 2029 2011 2017 2027 1938 1935 2064 2051 1999= 3.32153383436245

bash-2.05b$ perl -pe '$_=substr($_,2,1);' < million.asc >| million.2.txt
bash-2.05b$ perl -e '$_=<>;for $i(0..9){$d=$i;for($j=0; $j<20000; ++$j){$d=(10+substr($_,$j,1)-$d)%10;print($d);}print("\n")}' < million.2.txt >| million.2.all
bash-2.05b$ split -l1 million.2.all
bash-2.05b$ for i in xa?; do perl -e '$_=<>;chomp$_;$d[$_]++for(split//,$_);$e-=(log($_/20000)/log(2)*$_)/20000 for(@d);print(join(" ",@d),"= $e\n")' < $i; done
1993 1968 2010 2003 2019 1982 2025 1982 2026 1992= 3.32186395277318
1948 1989 2022 2029 1957 1992 2010 2001 1990 2062= 3.32174594506632
2058 2002 2008 1976 2002 1985 1968 2018 2037 1946= 3.32175021930767
2000 2077 1956 1981 2004 1978 1993 2004 1974 2033= 3.32173881914972
2052 1954 2050 1984 1957 2012 2014 1949 2000 2028= 3.32169086532993
1982 2025 1982 2026 1992 1993 1968 2010 2003 2019= 3.32186395277318
1992 2010 2001 1990 2062 1948 1989 2022 2029 1957= 3.32174594506632
1985 1968 2018 2037 1946 2058 2002 2008 1976 2002= 3.32175021930767
1978 1993 2004 1974 2033 2000 2077 1956 1981 2004= 3.32173881914972
2012 2014 1949 2000 2028 2052 1954 2050 1984 1957= 3.32169086532993


lg(10) = 3.321928094887362347870319429


2 values stick out for column 0, viz. 1 and 6, but it's not so
clearcut for column 1, with 4 high, 6 low.


I don't know how significant these skews are. Would anyone (Matt/Mark?)
like to put a figure on them?


> > that more biases might be found. (Correlations between adjacent digits,
> > perhaps).


If you're looking for correlations between adjacent digits, then there are
only 100 possibilities, and you have to do the task 50 times.

Some of the 100 possibilities can be deemed less likely depending on
your view of the different entropy measurements above.

I think there's plenty of opportunity for cross-correlation.
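
In case it helps, here is roughly what those one-liners do per column, written
out longhand. The card layout (20,000 rows of 50 digits), the file name
million.txt, and the model (each published digit is the original digit plus
the digit on the previous card, mod 10) are all assumptions:

from math import log2

# For one column: guess the digit "before" the first card, peel off candidate
# original digits via x[j] = (y[j] - x[j-1]) mod 10, and score each of the 10
# guesses by the order-0 entropy of the recovered column.
# million.txt is a hypothetical plain-digit file.
digits = [int(c) for c in open("million.txt").read() if c.isdigit()]
cards = [digits[i:i + 50] for i in range(0, len(digits), 50)]  # 20,000 x 50

def unmix(col, guess):
    x, prev = [], guess
    for card in cards:
        prev = (card[col] - prev) % 10
        x.append(prev)
    return x

def entropy(seq):
    n = len(seq)
    counts = [seq.count(v) for v in range(10)]
    return -sum((c / n) * log2(c / n) for c in counts if c)

for guess in range(10):
    print(guess, "%.6f" % entropy(unmix(0, guess)), "bits/digit")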

Matt Mahoney

unread,
Jun 11, 2004, 9:48:27 AM6/11/04
to

"Phil Carmody" <thefatphi...@yahoo.co.uk> wrote in message
news:874qpiw...@nonospaz.fatphil.org...

> "Matt Mahoney" <matma...@yahoo.com> writes:
>
> Matt, can you mail me (or make downloadable) your base-10 version of
> the file please? (no point in more than oe person writing a program
> to process the .bin).

I downloaded from here
http://www.rand.org/publications/classics/randomdigits/randomdata.html

For my tests, I stripped out the line numbers and white space and checked
that I had exactly 1,000,000 bytes. I used the perl script:

while (<>) {
    chop;
    s/ //g;
    print substr($_,5,50);
}

I'm not sure of the details in converting to/from the binary format. I
figure I could work that out later if something fruitful turns up. I'm
assuming it's an arithmetic code. Note 10^6 x log2(10) / 8 =
415,241.01186... so a tiny bit of information was lost.

-- Matt Mahoney


John Reiser

unread,
Jun 11, 2004, 9:54:42 AM6/11/04
to
> Matt, can you mail me (or make downloadable) your base-10 version of
> the file please?

-----base10.c
#include <stdio.h>
#include <gmp.h>

/* gcc -o base10 base10.c -lgmp
*
* # Pre-pend the file length (as a four-byte big-endian number)
* # of AMillionRandomDigits.bin to create AMillionRandomDigits.mpz;
* # THIS STEP IS NOT SHOWN HERE.
*
* ./base10 < AMillionRandomDigits.mpz > AMillionRandomDigits.dec
*/
int main(void)
{
    mpz_t r1;

    mpz_init(r1);
    mpz_inp_raw(r1, stdin);
    mpz_out_str(stdout, 10, r1);
    return 0;
}
-----end base10.c

George Johnson

unread,
Jun 11, 2004, 11:02:08 AM6/11/04
to
"Phil Carmody" <thefatphi...@yahoo.co.uk> wrote in message
news:87zn7av...@nonospaz.fatphil.org...

Oh hardly. I am NOT bullshitting you.

An already random file can indeed be made less random by using a
pseudo-random generator to scramble the file more. Using orderly techniques
will just shuffle the randomness around, but not reduce the randomness.
With a pseudo-random shuffle or bit-flipping or bit-swapping or XOR'ing or a
mix of these, the file can be reduced in randomness dramatically enough for
compression.

The key is recognizing that pseudo-random generators are NOT RANDOM.

If it cannot generate random numbers then it cannot increase randomness.

However, it can be used to decrease randomness. The key is fine-tuning
your pseudo-random variables and recognizing the large increase in search
time, computational time, and temporary memory cost. I mean, you are
cynical and paranoid about your concepts being swiped and patented behind
your back, but goodness sakes, DUH! This technique is OBVIOUS to even the
most jaded person here I would think. Even though I have a 150 IQ, it
should be plain to anyone with even the barest grasp of Number Theory.

The resulting file then can be compressed with standard techniques.
You cannot compress it well though if the file is near the 50% range of
the entropy Bell Curve.
Think Einstein's Relativity Equation --- Space can be compressed in
exchange for an expansion of Time AND Time can be compressed for an
expansion in Space with both costing mucho Energy. Information (Space) can
be compressed in exchange for an expansion in Computational Time with a
temporary cost in Computational Memory (with an increase in the available
compression techniques stored in the compression program and the indexing
cost in the resulting file). The limits are the size available to store the
compression program, the required generic indexing of the resulting files
(as a ratio of the compression techniques in use by the program), and the
computational search time to use the compression techniques.

The Pigeonhole Principle pretty much denies the existence of data
compression itself (DUH!) Therefore, it is a waste of time and an annoyance
to lateral-thinking folks. Seriously, the Pigeonhole Principle denies the
existence of all existing compression programs because it treats numbers as
numbers rather than the real-world compression program's mindset of Numbers
& Indexing Data. It should be obvious that the better that future
compression become, there will be an increase in the time required to
compress information and the size of the compression programs themselves.
There are limits to compression (near-infinite computation time or
near-infinite temporary compression memory required or near-infinite
indexing required for the resulting file output inside the compressor
itself). Of which, the Pigeonhole Principle comes obviously into play, but
then again, the Pigeonhole Principle already DENIES THE EXISTENCE OF
COMPRESSION ITSELF.

The Pigeonhole Principle is to Data Compression what the Godwin
"Nazi-Coddling" Rule is to allowing fascist assholes to escape explaining
why they are Nazis. Seriously folks, this should be obvious to anyone with
a basic understanding of Information Theory and Number Theory... DUH! You
cannot randomize information by attempting to randomize it more. The result
is a DECREASE in randomness. You can shuffle it up using orderly
non-pseudo-random techniques, but that only rearranges the numbers in a
level of equal randomness. I mean, DUH... OBVIOUS... how can it not be
plainer?


Christian Siebert

unread,
Jun 11, 2004, 11:01:36 AM6/11/04
to
>> Matt, can you mail me (or make downloadable) your base-10 version of
>> the file please?
>
>
> -----base10.c
<snip>

Very nice, this gmp - thanks for the tip! But the base may vary only
from 2 to 36.

I had done something similar but much worse (in speed and coding) but
without boundaries (feel free to change it to your needs):

// toAnyBase.java - takes a file, interprets it as a big integer and
// prints its digits to a given base (in reverse order)

// call it with a file name and a base (no error checking is done!)

import java.io.*;
import java.math.*;

public class toAnyBase
{
    public static void main(String[] args)
        throws FileNotFoundException, IOException
    {
        RandomAccessFile raf = new RandomAccessFile(args[0], "r");
        BigInteger base = new BigInteger(args[1]);
        BigInteger tmp = null;
        Long flen = new Long(raf.length());
        byte[] ba = new byte[flen.intValue()];
        raf.read(ba);
        BigInteger code = new BigInteger(ba);
        while (code.bitCount() > 0) {
            tmp = code.remainder(base);
            System.out.println(tmp.toString());
            code = code.divide(base);
        }
        raf.close();
    }
}

Willem

unread,
Jun 11, 2004, 11:17:16 AM6/11/04
to
George wrote:
) However, it can be used to decrease randomness. The key is fine-tuning
) your pseudo-random variables and recognizing the large increase in search
) time, computational time, and temporary memory cost. I mean, you are
) cynical and paranoid about your concepts being swiped and patented behind
) your back, but goodness sakes, DUH! This techniques is OBVIOUS to even the
) most jaded person here I would think. Even though I have a 150 IQ, it
) should be plain to anyone with even the barest grasp of Number Theory.

You must mean Numerology.

I don't believe that you have an IQ of 150 if you seriously believe that
you can decrease randomness by shuffling data. At the very least, you
probably don't have a firm grasp on the meaning of the word 'random'.

I'll put it to you simply:

Suppose I have 52 cards, in a completely random order.
In other words, the order of the cards is one of 52! possibilities, with
equal probability.

Then, I 'shuffle' the cards, with a pseudo-random generator. 'Shuffling'
is a permutation that doesn't depend on the current ordering, obviously.

With a bit of simple mathematics, you can see that after this shuffling,
the order of the cards is another one of the 52! possibilities, with equal
probability. In other words, it is exactly as random as it was before.

In short:
shuffling only redistributes the probabilities, it can't change them.
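
A toy simulation of that argument, for anyone who wants to see it numerically
(my own sketch, with a 4-card deck so all 4! = 24 orderings can be tallied):

import random
from collections import Counter

# Start from a uniformly random ordering, apply one fixed (data-independent)
# permutation, and check the result is still uniform over all 24 orderings.
fixed_shuffle = [2, 0, 3, 1]              # an arbitrary fixed permutation
counts = Counter()
for _ in range(240000):
    order = random.sample(range(4), 4)    # uniformly random starting order
    counts[tuple(order[i] for i in fixed_shuffle)] += 1
print(len(counts), min(counts.values()), max(counts.values()))
# 24 orderings, each near 10000 -- exactly as random as before the "shuffle".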


) The Pigeonhole Principle pretty much denies the existence of data
) compression itself (DUH!)

You've got it. There is no such thing as data compression.
All we can do is remove the redundancy that most real-world data has.

) ...
) then again, the Pigeonhole Principle already DENIES THE EXISTENCE OF
) COMPRESSION ITSELF.

Prove it.

I'll give you a hint: All existing compressors will, on average,
expand the data that you run it on. It's just that they happen to make
'interesting' data (such as english text, photographs, etc.) smaller.


SaSW, Willem
--

Disclaimer: I am in no way responsible for any of the statements
made in the above text. For all I know I might be drugged or something..
No I'm not paranoid. You all think I'm paranoid, don't you !
#EOT

Matt Mahoney

unread,
Jun 11, 2004, 1:22:44 PM6/11/04
to
"Matt Mahoney" <matma...@yahoo.com> wrote in message
news:DQ3yc.20779$Yd3....@newsread3.news.atl.earthlink.net...
> Seriously, I believe I have found an artifact (a small one) that could be
> exploited by a data compressor. According to
> http://www.rand.org/publications/classics/randomdigits/

Here is a followup. As I posted previously, if you arrange the data in a
table with 20,000 rows (cards) and 50 columns, then the sum of every column
is even. The probability of this happening by chance is 2^-50. The more
likely explanation is that the data was generated by adding successive cards
from the original, biased data (x1...x20000), such that every input card was
used an even number of times.

If we assume that the order was maintained with wraparound (i.e. y1...y20000
= x1+x2, x2+x3, x3+x4,... x19999+x20000, x20000+x1) then if we alternately
add and subtract cards, then the sum should be 0 in all 50 columns. This
would mean that one output card is entirely predictable, allowing 166 bits
of compression. Also, it would be possible to derive the input sequence by
guessing the digits of the first card one at a time and using the
statistical biases over the 20,000 digit series to test each of the 10
possibilities. The biases found in the difference of adjacent output cards
(y[j] - y[j-1] = (x[j] + x[j+1]) - (x[j-1] + x[j]) = -x[j-1] + 2x[j] -
x[j+1]) are about 0.1-0.2%, suggesting the input bias should be large enough
to do this, perhaps 3%. A 3% bias would allow about 0.1% compression, or a
few hundred bytes. This would be big enough to write a real decompressor.
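
The alternating-sum test is short enough to spell out (a sketch; million.txt is
an assumed file of the plain digits, arranged as 20,000 cards of 50):

# If y[j] = x[j] + x[j+1] mod 10 with wraparound over an even number of cards,
# the alternating sum of every column must be 0 mod 10.
# million.txt is a hypothetical file name, not from the original post.
digits = [int(c) for c in open("million.txt").read() if c.isdigit()]
cards = [digits[i:i + 50] for i in range(0, len(digits), 50)]
alt = [sum((-1) ** j * card[col] for j, card in enumerate(cards)) % 10
       for col in range(50)]
print(alt)  # all zeros only if the ordered, wrapped-around model holds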

Alas, the alternating sum is not 0. I also exhaustively tested alternating
sums of all subranges of the data (about 2 x 10^8), in addition to
alternating sums of the form 0,1,0,-1; 1,0,-1,0; 1,1,-1,-1; and 1,-1,-1,1.
I did not find any other sequences that add up to 0 mod 10, or even to all
even digits. All other ranges have from 4 to 44 even digits in the 50 sums,
which would be the normal range for random data.

If you take just one card out of the ordered output sequence and move it,
then the alternating sum will be an apparently random sequence of even
digits. Thus, it is not possible to do a directed search for the output
permutation among the 20000! possibilities. Even if a brute force search
was feasible, it would fail because on average 20000!/5^50 of these would
sum to 0 by chance. This is still a huge number.

There are statistical biases in y[i]-y[i-d] for d = 1 or 2, with the bias
for 2 near the value expected for random data. I did not find biases for
higher d. The only explanation for d = 2 is that adjacent cards were
separated by one card in between, as if the cards were shuffled. Otherwise
there should be no bias because y[i]-y[i-2] = x[i+1]+x[i]-x[i-1]-x[i-2] is a
combination of 4 independent cards, which should be more evenly distributed
than y.

Knowing that the cards were shuffled doesn't help if the cards were shuffled
manually, which is likely given the high cost of using CPU time and memory
for shuffling algorithmically in 1947-1955. If we model the shuffle by
splitting the card stack into 2 piles and picking from each pile randomly,
then there are 2^20000 possible permutations. This still doesn't help
because 2^20000 >> 5^50. Also, if the cards were shuffled by hand, then the
deck was probably divided into stacks of a couple hundred cards each,
shuffled, then reassembled in random order.

It is possible that the shuffle could be reversed if it could be modeled by
picking from a set of less than 5^50 possible permutations. That I don't
know.

-- Matt Mahoney


Guenther von Knakspott

unread,
Jun 11, 2004, 3:23:02 PM6/11/04
to
Hail you dork, as a distinguished member of the horde of idiots showing
up recently, perhaps you can answer the following question: ¿where are
all you dimwits creeping out of? We urgently need to tell someone to
shut the door and lock it again.

"George Johnson" <matr...@voyager.net> wrote in message news:<10ci8oe...@corp.supernews.com>...

[snip Mahoneys op lest it be sullied any further]

>
> Good boy, you figured out something obvious I figured out decades ago.
>

This comes as a surprise, since one would reckon on first impression,
that you are some poor kid suffering from a particularly debasing bout
of puberty related neurosis. If we are forced to assume that you are
indeed several decades old, we are left with only one conclusion; you
are the victim of severe mental retardation, maybe you are an autist
of sorts. Can you instantly count spilled toothpicks?

> If you shuffle a deck of cards, the limit of randomness is limited to
> the order of shuffling (move every 4th card to the bottom of the deck for
> 100 times and the randomness will not increase or decrease). If you shuffle
> using a good pseudo-random number pattern (every 3rd card, then 1 card, then
> the 4th card, then 1 card, then the 5th card, etc...
> 31415926535897932384626433832795... or 14142135623730950488016887242097...
> or the non-integer conversion of a number) then you can actually DECREASE
> THE RANDOMNESS because the randomness level cannot be increased using a
> random shuffling technique. Of course, shuffling is not the limit to
> conversions applicable on any file, but merely the beginning.
>
> The problems associated with this is computational time and fine-tuning
> the choice of pseudo-random generation techniques before actually
> compressing later. You can understand that it is less time-consuming trying

It is quite obvious, that the concepts being discussed by Mahoney are,
and will remain forever, beyond the grasp of your impaired intellect.

> to educate this group (given the paranoia and personality flaws in their
> genius). It is much more productive to slowly leave a trail of breadcrumbs
> so they can figure out the obvious truths for themselves.

It would be better for all if you went back to discussing Harry Potter
and TV shows, even though these infantile activities seem to impose an
unbearably heavy duty on your pronouncedly feeble abilities.

Phil Carmody

unread,
Jun 11, 2004, 7:56:46 PM6/11/04
to
"George Johnson" <matr...@voyager.net> writes:
> Oh hardly. I am NOT bullshitting you.

I'll count to 10.

1, 2, 3, 4, 5, ...

> An already random file can indeed be made less random by using a

... 6, 7, 8, 9, 10

Bullshit.


*_PLONK_*

Ernst Berg

unread,
Jun 12, 2004, 2:56:26 AM6/12/04
to
Willem <wil...@stack.nl> wrote in message news:<slrnccjj7s....@toad.stack.nl>...

Pardon but, http://www.dogma.net/markn/articles/bwt/bwt.htm

Data Compression with the Burrows-Wheeler Transform

by Mark Nelson

The world is not yet exhausted; let me see something tomorrow which I
never saw before. -- Samuel Johnson

In all chaos there is a cosmos, in all disorder a secret order. --
Carl Jung


Isn't BWT the kind of Shuffle George is talking about?
Maybe I missed the point.

Ernst

Severian

unread,
Jun 12, 2004, 3:28:05 AM6/12/04
to
On 11 Jun 2004 12:23:02 -0700, guenther.v...@gmx.de (Guenther
von Knakspott) wrote:

Well said, again. Talk.origins needs someone like you to swat the
creationists.

--
Sev

SuperFly

unread,
Jun 12, 2004, 3:33:29 AM6/12/04
to

I think what we consider random in this newsgroup, is the group of
files which is not compressible by the usual one pass model+order0
coder method by at least one bit. And imho this is a rather subjective
definition, because it will decrease whenever someone comes up with
better models, or new AI methods to recognize patterns and model them.
I wonder if there is research available which defines random from a
pattern recognition point of view and gives it a coefficient which
tells us exactly which shade of random it is compared to some point of
reference. Kind of like those computerised decompressed image quality
checks.

Phil Carmody

unread,
Jun 12, 2004, 4:34:28 AM6/12/04
to
Ernst...@sbcglobal.net (Ernst Berg) writes:
(quoting Willem, quoting loon:)

> > ) then again, the Pigeonhole Principle already DENIES THE EXISTENCE OF
> > ) COMPRESSION ITSELF.
> >
> > Prove it.
> >
> > I'll give you a hint: All existing compressors will, on average,
> > expand the data that you run it on. It's just that they happen to make
> > 'interesting' data (such as english text, photographs, etc.) smaller.
> >
> >
> > SaSW, Willem
>
> Pardon but, http://www.dogma.net/markn/articles/bwt/bwt.htm
>
> Data Compression with the Burrows-Wheeler Transform
>
> by Mark Nelson

Willem is as usual entirely correct with his statement.

> Isn't BWT the kind of Shuffle George is talking about?
> Maybe I missed the point.

George has no point.

BWT does not make the data less random, it simply makes it more amenable
to compression by other common algorithms.

Note that DATA compresses exactly as well using algorithm (OTHER o BWT)
as BWT(DATA) does using algorithm OTHER. i.e. BWT(DATA) is no more
random than DATA.

Willem

unread,
Jun 12, 2004, 5:07:02 AM6/12/04
to
It's good usenet practice to snip the bits you're not replying to.
Especially if you're replying to one bit in the middle of a post.
I'll take a wild guess and snip my own post, so that your answer makes the
most sense.

)> I don't believe that you have an IQ of 150 if you seriously believe that
)> you can decrease randomness by shuffling data. At the very least, you
)> probably don't have a firm grasp on the meaning of the word 'random'.
)>
)> I'll put it to you simply:
)>
)> Suppose I have 52 cards, in a completely random order.
)> In other words, the order of the cards is one of 52! possibilities, with
)> equal probability.
)>
)> Then, I 'shuffle' the cards, with a pseudo-random generator. 'Shuffling'
)> is a permutation that doesn't depend on the current ordering, obviously.
)>
)> With a bit of simple mathematics, you can see that after this shuffling,
)> the order of the cards is another one of the 52! possibilities, with equal
)> probability. In other words, it is exactly as random as it was before.
)>
)> In short:
)> shuffling only redistributes the probabilities, it can't change them.

Ernst wrote:

) Pardon but, http://www.dogma.net/markn/articles/bwt/bwt.htm
)
) Data Compression with the Burrows-Wheeler Transform
)
) by Mark Nelson
)
) <snip some weird quotes>
)
) Isn't BWT the kind of Shuffle George is talking about?
) Maybe I missed the point.

Yes, you missed the bit where I (kinda) defined what 'shuffling' was.
I'll repeat it:

)> 'Shuffling' is a permutation that doesn't depend on the current
)> ordering, obviously.

BWT is a kind of sorting algorithm. This *does* depend on the current
ordering. Do you look at the card faces when you're shuffling cards ?
If you did, I would say you're cheating.

Now, a more relaxed definition of 'shuffling' could be 'a transformation
that is reversible without side-channel information', in which case BWT
almost qualifies, because it only uses a single extra index to reverse
the transform. But then, as Phil pointed out, the BWT doesn't decrease
randomness, it just maps high-order redundancy to low-order redundancy,
making subsequent compression easier.

Try running the BWT on a random source, for example.

Matt Mahoney

unread,
Jun 12, 2004, 10:44:21 AM6/12/04
to
"SuperFly" <no....@email.net> wrote in message
news:31alc01smf3utotlj...@4ax.com...

The randomness of a single string has a precise definition in the form of
its algorithmic or Kolmogorov complexity with respect to some programming
language (or Turing machine). You could use other definitions, but I don't
think we could come to any agreement about which one we ought to use. A
string may pass a chi squared test for randomness using an order 1 model
but not an order 2 model, for example. A source may be random enough for
deciding who gets an experimental drug and who gets a placebo, but might not
be random enough for generating RSA keys.

There is no foolproof test for algorithmic randomness. To prove an n-bit
string is random, you have to show that none of the 2^n - 1 programs shorter
than n bits will output it. Some of these programs may run forever, but you
won't know which ones. It can be proven that there is no test that can read
a program source and tell you if it halts or not.

The situation of the million random digits file is analogous to the case
where a standard reference was generated by encrypting a string of zeros,
and we have no idea about the security of the key. For all we know, it
could have come from /dev/random or it could be the author's dog's name
spelled backwards.

We know there are 50 bits of redundancy, thus the question. We know that a
permutation of the cards will give us a way to completely determine one
card, or 166 bits, and most likely yield a biased stream which might give up
several thousand more bits. The permutation may be a random shuffle, or may
be as simple as moving one card. Right now we don't know. The key may be
found tomorrow, or never.

-- Matt Mahoney

