http://datacompression.info/Miscellaneous/AMillionRandomDigits.bin
The source (appended to this message as well) is at:
http://datacompression.info/Miscellaneous/Convert.java
The bin file stymied every compressor I threw at it in a limited test. I'd
love to find a program that can at least squeeze a few bytes out of it.
If you choose to run the converter on your own system, be warned that the
conversion process took about 2 hours on a 900 MHz Celeron. Yes, Java has a
downside as well as an upside!
Now, who's going to volunteer to compare Gordon's file with mine to see if
we have the same bit stream, possibly reversed or otherwise mangled?
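As a quick sanity check on the size of the bin file: a million decimal
digits carry about log2(10) ~ 3.32 bits each, so the binary form should
land a bit over 415 KB. A minimal sketch (it ignores the possible extra
sign byte BigInteger.toByteArray() can add):

```java
public class ExpectedSize {
    // Bytes needed to hold n random decimal digits as one big binary
    // number: each digit carries log2(10) ~ 3.3219 bits of information.
    public static long expectedBytes(long digits) {
        double bits = digits * (Math.log(10) / Math.log(2));
        return (long) Math.ceil(bits / 8);
    }

    public static void main(String[] args) {
        // one million digits -> roughly 415 KB
        System.out.println(expectedBytes(1000000L));
    }
}
```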
----------------------------------Convert.java------------------------------
import java.io.*;
import java.math.BigInteger;

public class Convert {
    String fileName;
    final static int maxLines = 20000;
    String decimalData;

    public Convert( String fileName ) {
        this.fileName = fileName;
    }

    // Read the RAND text file and keep only the ten 5-digit groups on
    // each line, skipping the leading sequence number and the spaces.
    public void readData() throws Exception {
        BufferedReader in = new BufferedReader( new FileReader( this.fileName ) );
        StringBuffer buf = new StringBuffer( maxLines * 50 );
        for ( int i = 0 ; i < maxLines ; i++ ) {
            if ( ( i % 100 ) == 0 ) System.out.println( i );
            String s = in.readLine();
            s = s.substring( 8, 13 ) + s.substring( 14, 19 ) +
                s.substring( 21, 26 ) + s.substring( 27, 32 ) +
                s.substring( 34, 39 ) + s.substring( 40, 45 ) +
                s.substring( 47, 52 ) + s.substring( 53, 58 ) +
                s.substring( 60, 65 ) + s.substring( 66, 71 );
            buf.append( s );
        }
        in.close();
        decimalData = buf.toString();
    }

    // Treat the million digits as one huge decimal number and let
    // BigInteger do the base conversion to binary.
    public void writeBigInteger( String filename ) throws Exception {
        System.out.println( "About to convert to binary. This will take some time..." );
        BigInteger big = new BigInteger( decimalData );
        System.out.println( "Done!" );
        FileOutputStream outputStream = new FileOutputStream( filename );
        outputStream.write( big.toByteArray() );
        outputStream.close();
    }

    public static void main(String[] args) {
        Convert convert = new Convert( "A Million Random Digits.txt" );
        try {
            convert.readData();
            convert.writeBigInteger( "A Million Random Digits.bin" );
        }
        catch ( Exception e ) {
            System.out.println( "Exception caught: " + e );
        }
    }
}
----------------------------------EOF-------------------------------
--
|
| Mark Nelson - ma...@ieee.org - http://MarkNelson.us
| DataCompression.info - http://www.datacompression.info
|
I know this isn't a fair comparison, but I can't resist Java bashing.
Here are the line/token/character counts for the two implementations:
20 48 369 decbin.c
20 106 651 marknelson.java
--
#include <stdio.h>
int d[200000];
char c;
main(){
    register i,j,r,first,n;
    for (n=0;1 == scanf("%d",&d[n]);n++);
    first = 0;
    while (first < n) {
        r = 0;
        for (j=first;j<n;j++) {
            d[j] += r*100000;
            r = d[j]%256;
            d[j] /= 256;
            if (j==first && !d[j]) first++;
        }
        c = r;
        fwrite(&c,1,1,stdout);
    }
}
--
Gordon V. Cormack CS Dept, University of Waterloo, Canada N2L 3G1
gvco...@uwaterloo.ca http://cormack.uwaterloo.ca/cormack
>Gordon Cormack already posted a pointer to this, but out of personal
>interest I duplicated his work. I wrote a short Java program (got to
>love those Java libraries!) that converts the RAND Corp million digit
>text file to a 400K+ binary file, using the quite spiffy BigInteger
>class. I've posted the binary file temporarily at:
>
>http://datacompression.info/Miscellaneous/AMillionRandomDigits.bin
>
Just for the heck of it I threw a couple of my compressors at
it. One only expanded the file by "one byte". But none got it to
the same size or smaller. Just like the one from Patrick. Some get
close but none of mine quite beat it.
I know mine can't compress random data. But if I make one that compresses
this file I will let you know, since it would most likely mean the
data is really not random.
David A. Scott
--
My Crypto code
http://cryptography.org/cgi-bin/crypto.cgi/Misc/scott19.zip
http://cryptography.org/cgi-bin/crypto.cgi/Misc/scott16.zip
http://www.jim.com/jamesd/Kong/scott19u.zip old version
My Compression code http://bijective.dogma.net/
**TO EMAIL ME drop the roman "five" **
Disclaimer:I am in no way responsible for any of the statements
made in the above text. For all I know I might be drugged.
As a famous person once said "any cryptographic
system is only as strong as its weakest link"
Such files are good for research. Can you please point me to a bigger
file (about 10 MB)? (I already have the 3 MB $5000 challenge file.)
Sachin Garg. [India]
http://sachingarg.cjb.net
http://www.geocities.com/schngrg
There is a bug in your program, the array d[] is too small. The correct size
should be 220000.
The first column of digits.txt is a sequence number, not
part of the random sequence.
My program assumes this column has been removed. I forgot
to mention that. Sorry.
The files are the same, modulo reversal of the byte order. Here's the
program that shows it.
--
#include <string.h>
#include <stdio.h>
#include <stdlib.h>
#include <assert.h>
/* unsigned, so the %02x below doesn't sign-extend */
unsigned char me[10000000], mark[10000000];
int nme, nmark;
int main(){
    int i;
    FILE *fme = fopen("digits.binary","rb");
    FILE *fmark = fopen("AMillionRandomDigits.bin","rb");
    nme = fread(me,1,10000000,fme);
    nmark = fread(mark,1,10000000,fmark);
    printf("File length: me %d Mark %d\n",nme,nmark);
    assert(nme == nmark);
    for (i=0;i<nme;i++) {
        if (me[i] != mark[nme-i-1]) {
            printf("not the same at position %d (%d): %02x (%02x)\n",
                   i,nme-i-1,me[i],mark[nme-i-1]);
            exit(1);
        }
    }
    printf("Files are the same (bytes reversed)\n");
    return 0;
}
Yes, but the Java code is so esthetically pleasing!
And of course, Gordon, you are cheating a bit - your code won't read the
million digit file; it looks like you applied a bit of preprocessing. So no
more chest thumping, okay?
--
|
| Mark Nelson - ma...@ieee.org - http://MarkNelson.us
| DataCompression.info - http://www.datacompression.info
|
"Gordon V Cormack" <gvco...@uwaterloo.ca> wrote in message
news:ahnpgd$io5$1...@speedy.uwaterloo.ca...
Thanks for the confirmation, that's great!
--
|
| Mark Nelson - ma...@ieee.org - http://MarkNelson.us
| DataCompression.info - http://www.datacompression.info
|
"Gordon V Cormack" <gvco...@uwaterloo.ca> wrote in message
news:ahp9pe$mup$1...@speedy.uwaterloo.ca...
The random binary file is once again the winner, and still undefeated
champion!
Only Jules has been brash enough to guarantee victory. Will his Namathesque
confidence lead to comeuppance or humiliation?
I'm still guessing that there's a compressor out there somewhere that can
squeeze a byte or two out of it.
--
|
| Mark Nelson - ma...@ieee.org - http://MarkNelson.us
| DataCompression.info - http://www.datacompression.info
|
"SCOTT19U.ZIP_GUY" <david_...@emailv.com> wrote in message
news:9255DE4DAH110W...@216.168.3.40...
To each his own.
>And of course, Gordon, you are cheating a bit - your code won't read the
>million digit file, looks like you applied a bit of preprocessing. So no
>more chest thumping, okay?
Fair enough (it was an oversight on my part - when I saw your post I just
copied the file; I forgot I removed that column beforehand).
Alright, I'm thinking of participating in this 'contest' as well, but now
that we have two (2) random files in reverse order, things are becoming a
bit more complex, now aren't they? Are we now supposed to write a compressor
that can compress *both* files, or would being able to compress one of them
be enough? Although doing the impossible once is, well, erm, impossible,
doing it twice would make it even more impossible, right?
Sincerely,
Frank
"Mark Nelson" <ma...@ieee.org> schreef in bericht
news:va209.667340$cQ3.105906@sccrnsc01...
If you can reproduce it once you can reproduce it *n* times. It is the
same information.
In article <3d41015c$0$22324$fb62...@news1.zeelandnet.nl>,
Frank Schoep <fsc...@zeelandnet.nl> wrote:
>Hi,
>
>alright, I'm thinking of participating in this 'contest' as well, but now
>that we have two (2) random files in reverse order, things are becoming a
>bit more complex, now aren't they? Are we now supposed to write a compressor
>that can compress *both* files, or would being able to compress one of them
>be enough? Although doing the impossible once is, well, erm, impossible,
>doing it twice would make it even more impossible, right?
>
>Sincerely,
>Frank
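That "same information" point is just the standard counting argument:
there are 2^n bit strings of exactly n bits, but only 2^n - 1 strings
strictly shorter, so no lossless compressor can shrink every input. A
minimal sketch of the arithmetic:

```java
import java.math.BigInteger;

public class Counting {
    // Number of distinct bit strings of exactly n bits.
    static BigInteger exactly(int n) {
        return BigInteger.valueOf(2).pow(n);
    }

    // Number of distinct bit strings strictly shorter than n bits
    // (including the empty string): 2^0 + 2^1 + ... + 2^(n-1) = 2^n - 1.
    static BigInteger shorter(int n) {
        return exactly(n).subtract(BigInteger.ONE);
    }

    public static void main(String[] args) {
        // Always exactly one short: by the pigeonhole principle some
        // n-bit input must map to an output of n bits or more.
        System.out.println(exactly(20).subtract(shorter(20)));
    }
}
```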
>I know this isn't a fair comparison, but I can't resist Java bashing.
When Mark posted his program I was curious whether I could improve it
(ran 42 minutes on my system). I couldn't find anything in that
BigInteger(String) constructor, though.
When you posted your version, I assumed you had a much better
algorithm than whatever BigInteger does and I ported it to Java
(syntax isn't that different) and ran both overnight after I saw that
decbin wasn't that fast either. Now your program took 3h 51m (gcc 2.91
/ cygwin), the Java port of it 3h 46m (JRE 1.4.0_01). I wonder what
your results were! Not so much the five minutes by which the Java decbin
was faster, but the factor of ~5.5 between your program and Mark's?
>Here are the line/token/character counts for the two implementations:
>
> 20 48 369 decbin.c
> 20 106 651 marknelson.java
Ah, low token counts, the true sign of a good program... Actually, my
program is a bit smaller than Mark's, but can't quite reach yours.
Maybe the comparison would be fairer if Java's runtime library named its
classes st, br, fop etc. ("C style"), or if you used self-explanatory
variable names - on the other hand, that would have increased the
character count, so it's probably out of the question! ;-)
Anyway, here's my port:
=== convert2.java
import java.io.*;
public class convert2 {
    public static void main(String[] args) throws Exception {
        int[] d = new int[220000];
        int n = 0;
        StreamTokenizer st = new StreamTokenizer(
            new BufferedReader(new InputStreamReader(System.in)));
        int token;
        while ((token = st.nextToken()) != StreamTokenizer.TT_EOF)
            if (token == StreamTokenizer.TT_NUMBER)
                d[n++] = (int)st.nval;
        FileOutputStream out = new FileOutputStream(FileDescriptor.out);
        int first = 0;
        while (first < n) {
            int r = 0;
            for (int j = first; j < n; j++) {
                d[j] += r*100000;
                r = d[j]%256;
                d[j] /= 256;
                if (j == first && d[j] == 0) first++;
            }
            out.write(r);
        }
    }
}
===
Regards,
Marco
Hi,
Yep, I have a slightly modified adaptive arithmetic coder that can. But I am
having trouble implementing a part of it (will take a few days). But I can
predict it will work.
It can compress the $5000 challenge file too, but cannot compress enough to
cover the bytes used by the decompressor. :-(
And gotta mention, it compresses only a byte or two. ;-)
(Actually, a few tens of bytes.)
Sachin Garg. [India]
http://sachingarg.cjb.net
Mark, your news client, whichever you are using, doesn't put the ">" before
the lines when you reply. It makes it kind of hard to read your messages.
It would be good if you fixed it.
I'm astonished that the gcc version and java port ran at nearly the same
speed! I would have thought that C/C++ had a huge advantage for something so
bound up in computation. I'd like to know what JVM you are running. Perhaps
this program would make a good benchmark...
--
|
| Mark Nelson - ma...@ieee.org - http://MarkNelson.us
| DataCompression.info - http://www.datacompression.info
|
"Marco Schmidt" <marcos...@geocities.com> wrote in message
news:fri2kus3jgdm0i0q8...@4ax.com...
I'm not sure what you're saying??? I usually manually quote the bare minimum
context, just like in this message, and leave the original unchanged below
the sig. As in this message. Which part don't you like, the quoted line
above or the intact message below?
After my bashing from Chris, I will be doing whatever it takes to please, so
just say the word!
--
|
| Mark Nelson - ma...@ieee.org - http://MarkNelson.us
| DataCompression.info - http://www.datacompression.info
|
"Sachin Garg" <sch...@yahoo.com> wrote in message
news:ahs2pf$i72$1...@newsreader.mailgate.org...
The part below, actually. But I didn't notice that you were quoting the part
above. It's fine this way, but it would be better if you added the ">"
before every line of the previous message.
Don't take it too seriously.
>Interesting experiment, Marco! It sounds like the implementation of
>BigInteger is probably pretty good.
I don't know enough about that subject, so I can't tell. Maybe there
are specialized algorithms for this task (long ASCII number => binary
representation) that are much better.
>I'm astonished that the gcc version and java port ran at nearly the same
>speed! I would have thought that C/C++ had a huge advantage for something so
>bound up in computation.
Actually, numerical computing is one of the things that Java is pretty
good at: <http://math.nist.gov/javanumerics/>,
<http://www.javagrande.org/>. Things like array access are still
relatively expensive with Java, but Sun's Hotspot JIT compiler creates
good native code for everything that gets executed often (like the
nested while and for loops in decbin). gcc in the latest version with
all speed optimizations on may be better, or maybe the Java native
compiler JET (with array checking off, and all other speed
optimizations on) on the Java version of decbin, but that's not what I
was really interested in, just the relative performance between
BigInteger's algorithm and decbin.
>I'd like to know what JVM you are running? Perhaps this program
>would make a good benchmark...
The Windows version of the JVM that comes with Sun's latest stable
JDK:
$ java -version
java version "1.4.0_01"
Java(TM) 2 Runtime Environment, Standard Edition (build 1.4.0_01-b03)
Java HotSpot(TM) Client VM (build 1.4.0_01-b03, mixed mode)
Regards,
Marco
Unless Java has changed how it handles numeric computing in the last 5
years, I simply can't agree with you on that one.
See:
"How Java's Floating-Point Hurts Everyone Everywhere"
by Professor Kahan.
Phil
Why? If you find a general purpose compressor that can compress this
file by even just one byte you would have shown that either your claim
that the data is random was false or that it is possible to compress
random data and Jules Gilbert was right all along.
Either:
1. Your modifications are specific to the data files you are compressing.
If this is the case your claim is not very impressive as anyone can write
a specialist compressor that can compress any file in a set of up to
256 files to a single byte.
2. Your modified compressor is still a general purpose compressor,
implying:
2a. The data files are not random after all.
2b. You have found a general method of compressing random data.
This makes your claims equivalent to those of Jules Gilbert.
After all, the result of compressing a random file can't be
any more random so you can just put the compressed file back
into the compressor and you have your own perpetual compression
scheme. If you can't do this then you have to fall back to either
1 or 2a above.
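Point 1 can be made concrete with a toy sketch. The embedded file
contents here are hypothetical stand-ins for whatever set of up to 256
files the "specialist compressor" was built around; all of the real
information lives in the decompressor's table, not the one-byte output:

```java
import java.util.Arrays;
import java.util.List;

// Hypothetical specialist "compressor": it knows up to 256 files in
// advance and emits only the index of the one it is handed.
public class SpecialistCompressor {
    static final List<byte[]> KNOWN_FILES = Arrays.asList(
        "file contents A".getBytes(),   // stand-in data
        "file contents B".getBytes()
    );

    // "Compress": one byte, the index of the matching known file.
    static byte compress(byte[] data) {
        for (int i = 0; i < KNOWN_FILES.size(); i++)
            if (Arrays.equals(KNOWN_FILES.get(i), data)) return (byte) i;
        throw new IllegalArgumentException("not one of the known files");
    }

    // "Decompress": look the file back up by its index.
    static byte[] decompress(byte index) {
        return KNOWN_FILES.get(index & 0xFF);
    }

    public static void main(String[] args) {
        byte[] original = KNOWN_FILES.get(1);
        System.out.println("compressed to 1 byte, index " + compress(original));
    }
}
```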
> > The bin file stymied every compressor I threw at it in a limited test. I'd
> > love to find a program that can at least squeeze a few bytes out of it.
>
> Why? If you find a general purpose compressor that can compress this
> file by even just one byte you would have shown that either your claim
> that the data is random was false or that it is possible to compress
> random data and Jules Gilbert was right all along.
Oh, not at all.
If an off-the-shelf compressor managed to knock 10 bytes off the 400K bin
file, it certainly wouldn't disprove anything. It would just be an
interesting note.
For Jules to be right requires that size of the decompressor plus the size
of any data be less than the size of the input file. That's a bit different.
--
|
| Mark Nelson - ma...@ieee.org - http://MarkNelson.us
| The Ultimate Compression Resource - http://www.datacompression.info
| Sponsored by Xceedsoft - Serious About Components
|
"The 3 Wise Men US" <the3wi...@yahoo.com> wrote in message
news:dd9a1641.02080...@posting.google.com...
Off-the-shelf compressors compress by removing redundancy from data.
Random data doesn't have any redundancy. If you don't include the size
of the decompressor, the only way to "compress" a random binary file
is to hide information specific to the file in the decompressor.
The only way an off-the-shelf compressor could compress the file
is if it included some sort of random data compression "Easter Egg"
that just happened to work for this file, e.g. it searches for a
particular 4 byte sequence in the data and if this occurs in the
first 65536 bytes, it stores the offset to this sequence in the
second and third bytes of the compressed file. The first byte is
a header that tells the decompressor whether the compressed file
is a normal compressed file or an Easter Egg file. If the 4 byte
sequence occurred early enough in your data file (1 in 65536 chance)
then this compressor could "compress" it by one byte.
How else could an off-the-shelf compressor compress this data if
it is truly random?
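A minimal sketch of that hypothetical "Easter Egg" path. The magic
sequence and header byte values are invented for illustration; the net
saving is one byte because a 4-byte sequence is replaced by a 1-byte
header plus a 2-byte offset:

```java
import java.util.Arrays;

public class EasterEgg {
    static final byte[] MAGIC = {0x12, 0x34, 0x56, 0x78}; // invented magic
    static final byte EGG = 1, NORMAL = 0;                // header flags

    static byte[] compress(byte[] data) {
        // offset must fit in 2 bytes, i.e. within the first 65536 bytes
        int limit = Math.min(data.length - MAGIC.length, 65535);
        for (int i = 0; i <= limit; i++) {
            if (Arrays.equals(Arrays.copyOfRange(data, i, i + 4), MAGIC)) {
                // header (1) + offset (2) + data minus the 4 magic bytes
                byte[] out = new byte[data.length - 1];
                out[0] = EGG;
                out[1] = (byte) (i >> 8);
                out[2] = (byte) i;
                System.arraycopy(data, 0, out, 3, i);
                System.arraycopy(data, i + 4, out, 3 + i, data.length - i - 4);
                return out;
            }
        }
        // fall through: a real compressor would do its normal thing here
        byte[] out = new byte[data.length + 1];
        out[0] = NORMAL;
        System.arraycopy(data, 0, out, 1, data.length);
        return out;
    }

    static byte[] decompress(byte[] c) {
        if (c[0] == NORMAL) return Arrays.copyOfRange(c, 1, c.length);
        int i = ((c[1] & 0xFF) << 8) | (c[2] & 0xFF);
        byte[] out = new byte[c.length + 1];
        System.arraycopy(c, 3, out, 0, i);               // bytes before magic
        System.arraycopy(MAGIC, 0, out, i, 4);           // reinsert magic
        System.arraycopy(c, 3 + i, out, i + 4, c.length - 3 - i);
        return out;
    }
}
```

With a 1-in-65536 chance of the magic landing early enough, this is a
lottery ticket, not compression; but it shows how a one-byte "win" on a
specific random file could be engineered.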
>Off-the-shelf compressors compress by removing redundancy from data.
>Random data doesn't have any redundancy. If you don't include the size
>of the decompressor, the only way to "compress" a random binary file
>is to hide information specific to the file in the decompressor.
>
Actually, since a random file can be anything, I think it's highly
unlikely you get zero redundancy in the file. So assuming the
file has some redundancy, the question is: is there enough to
actually exploit?
>The only way an off-the-shelf compressor could compress the file
>is if it included some sort of random data compression "Easter Egg"
>that just happened to work for this file, e.g. it searches for a
>particular 4 byte sequence in the data and if this occurs in the
>first 65536 bytes, it stores the offset to this sequence in the
>second and third bytes of the compressed file. The first byte is
>a header that tells the decompressor whether the compressed file
>is a normal compressed file or an Easter Egg file. If the 4 byte
>sequence occurred early enough in your data file (1 in 65536 chance)
>then this compressor could "compress" it by one byte.
>
If the compressor is a general compressor which uses an assumed
static frequency distribution for a given symbol set, it has a
fair chance of compressing the file by a byte or two. If it's an
adaptive compressor, then the table information has to be stored
in the file, which tends to make the output file longer, so the chances
of compression go down.
The only real question for long files is whether there is one that has
"ZERO REDUNDANCY", and if so, so what? There are really an
unlimited number of general compressors that could use very
strange character sets. Example: the set A = 0, B = 10, C = 11 could
be used, in which case the file would have to have the right
distribution not to be compressible with it.
Take a long file whose length is a multiple of 256, and use the
normal alphabet where each symbol is one of the 256 possible byte
values. Suppose there is zero redundancy in this file at every order
tested using the 256-symbol set. Here is a general compressor that
compresses it: leave the file unchanged except don't write the last
byte. On decompression, count the occurrences of each symbol; at
end of file, if the occurrences are off by exactly one for one of the
symbols, write out that byte. So it would compress every file
of that length that had the required zero redundancy for that
symbol set. As for the rest of the files, who cares if they all expand;
it doesn't have to be bijective.
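That last-byte trick can be sketched directly. This is a toy under the
poster's stated assumption: the input length is a multiple of 256 and
all 256 byte values occur equally often, so dropping the last byte
leaves exactly one count short:

```java
import java.util.Arrays;

// "Compressor" for files where every byte value occurs equally often
// (the poster's zero-redundancy assumption): drop the last byte and
// let the decompressor recover it from the one count that comes up short.
public class LastByteTrick {
    static byte[] compress(byte[] data) {
        // the symbol counts of the truncated file identify the dropped byte
        return Arrays.copyOf(data, data.length - 1);
    }

    static byte[] decompress(byte[] c) {
        int[] count = new int[256];
        for (byte b : c) count[b & 0xFF]++;
        int expected = (c.length + 1) / 256;   // the original uniform count
        byte[] out = Arrays.copyOf(c, c.length + 1);
        for (int s = 0; s < 256; s++)
            if (count[s] == expected - 1) {    // the short one
                out[c.length] = (byte) s;
                break;
            }
        return out;
    }

    public static void main(String[] args) {
        byte[] data = new byte[256];
        for (int i = 0; i < 256; i++) data[i] = (byte) i;
        byte[] c = compress(data);
        System.out.println(data.length + " -> " + c.length
            + ", round trip ok: " + Arrays.equals(decompress(c), data));
    }
}
```

Of course it only "compresses" the vanishingly rare files that meet the
assumption, and expands nothing only because it refuses everything else;
as the poster says, it doesn't have to be bijective.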