
Determining whether two files are identical... is it reasonable to use Digest::MD5?


Kevin B. Pease

Nov 29, 2001, 11:15:03 AM

Hi all,

I'm writing a script which, as part of its processing, will need to
compare the contents of two arbitrary files -- during the course of its
run, I expect that the script will have to handle both binary and text
files. What I need to do is decide whether or not the files' contents are
identical. After reading a bit about the MD5 message-digest algorithm
(implemented in Digest::MD5), it seems as if this might be a reasonable way
to determine whether the file contents are identical.

From the MD5 RFC, section 1, Executive Summary:
[...] It is conjectured that it is computationally
infeasible to produce two messages having the same message digest, or to
produce any message having a given prespecified target message digest. [...]

==> Now, The question: Given the information above, is it reasonable
to expect that the code I've pasted below will reliably determine if the
contents of two given files are the same or different?

== code begins below ==
#!/tools/perl_5.005.02/bin/perl -w

use strict;
use Getopt::Long;
use Digest::MD5 qw{ md5_hex };
use vars qw{ $target_file $source_file $source_path $target_path };

my %target = ();
my %source = ();

# Read in the options from the command line. We need target file & path,
# as well as source file & path.
GetOptions( "target:s" => \$target_file,
            "source:s" => \$source_file,
            "from:s"   => \$source_path,
            "to:s"     => \$target_path );

my $full_target_name = "$target_path/$target_file";
my $full_source_name = "$source_path/$source_file";

# Undefine $/ temporarily to allow easy read of entire file into scalar.
my $temp_undef = $/;
undef $/;

# Open the target & source files, read them in, then set $/ back to its
# original value & close the filehandles.
open (FILE1, "< $full_target_name");
open (FILE2, "< $full_source_name");
my $target_contents = <FILE1>;
my $source_contents = <FILE2>;
$/ = $temp_undef;
close(FILE1);
close(FILE2);

# Produce MD5 Hex digest of target file's contents & source file's contents.
my $file1_hex = md5_hex($target_contents);
my $file2_hex = md5_hex($source_contents);

if ($file1_hex eq $file2_hex) {
    print "The files appear to be identical.\n";
} else {
    print "The files are not identical.\n";
    print "MD5 Hex digest of contents of $target_file = $file1_hex\n";
    print "MD5 Hex digest of contents of $source_file = $file2_hex\n";
}
=== End code inclusion ===

And, even if the code does work... is there an easier way to do it that
I haven't found in the FAQs?

Any information would be greatly appreciated... thank you!

Kevin Pease


Josh Vura-Weis

Nov 29, 2001, 1:44:55 PM
Kevin B. Pease wrote:

> # Produce MD5 Hex digest of target file's contents & source file's contents.
> my $file1_hex = md5_hex($target_contents);
> my $file2_hex = md5_hex($source_contents);
>
> if ($file1_hex eq $file2_hex) {
> print "The files appear to be identical.\n";
> } else {


Since you're already reading the whole files into memory, why not just do

if ($source_contents eq $target_contents) {
    # equal
} else {
    # not equal
}

Or, so that you don't have to read a huge file all at once, just read a
K at a time and compare that:

#!/usr/bin/perl -w

my ($f1, $f2) = @ARGV;

open (F1, $f1) || die $!;
open (F2, $f2) || die $!;

my ($buff1, $buff2);
my $len = 1024;
my $offset=0;
my ($numread1, $numread2);
while (1) {
    $numread1 = read(F1, $buff1, $len, $offset);
    $numread2 = read(F2, $buff2, $len, $offset);

    if ($numread1 != $numread2) {
        die "Not equal";
    }

    if ($buff1 ne $buff2) {
        die "Not equal";
    }

    last if ($numread1 < $len);
    $offset += $len;
}

print "equal\n";


This seems to work for both binary and text files. However, when I
tested it on a large (7M) .rpm file, I noticed that the perl process
grew from about 3M at startup to over 13M by the time it finished. It
doesn't look to me like I should be keeping more than about 2K in use at
once, so this seems strange to me. I'm guessing it may be linux
caching the file blocks I've already read and attributing that memory
to the perl process, but thought I should check to see if I'm doing
something wrong with my read call.

perl -v gives me : This is perl, v5.6.1 built for i686-linux-ld
and I'm running RH 7.1, kernel 2.4.9-6

Thanks,
Josh


Peter J. Acklam

Nov 29, 2001, 1:27:51 PM
"Kevin B. Pease" <kevin...@fmr.com> wrote:

> == code begins below ==
>

> [...]


>
> # Undefine $/ temporarily to allow easy read of entire file into scalar.
> my $temp_undef = $/;
> undef $/;
>
> # Open the target & source files, read them in, then set $/ back to its
> original value & close the filehandles.
> open (FILE1, "< $full_target_name");
> open (FILE2, "< $full_source_name");
> my $target_contents = <FILE1>;
> my $source_contents = <FILE2>;
> $/ = $temp_undef;
> close(FILE1);
> close(FILE2);

Firstly, you should check the sizes. If the sizes differ, the
files can't be identical, and there is no need to read the
contents at all. You can do this with "-s $file".

Secondly, always check the return value from open()!

Thirdly, since you read both files into memory, why bother using
MD5 at all? You could simply use "eq" or "ne" to compare the
data.

If the files are really large, I would compare fixed size blocks
and abort as soon as two different blocks are found.

I have included a simple file comparison program I once wrote
which illustrates what I mean.

------------------------------------------------------------------------
#!/usr/bin/perl -w
#
# filecmp - compare two files
#
# Compare two files and tell whether they are identical or not.

use strict; # restrict unsafe constructs

die "$0: expected two file names\n" unless @ARGV == 2;

my ($file1, $file2) = @ARGV;

die "$0: $file1: no such file or directory\n" unless -e $file1;
die "$0: $file1: not a regular file\n" unless -f _;
my $size1 = -s _;

die "$0: $file2: no such file or directory\n" unless -e $file2;
die "$0: $file2: not a regular file\n" unless -f _;
my $size2 = -s _;

if ($size1 != $size2) {
    print "files are different\n";
    exit;
}

$/ = \8192;

open FILE1, $file1 or die "$0: $file1: can't open for reading\n";
binmode FILE1;

open FILE2, $file2 or die "$0: $file2: can't open for reading\n";
binmode FILE2;

while (defined(my $buf1 = <FILE1>) and defined(my $buf2 = <FILE2>)) {
    next if $buf1 eq $buf2;
    print "files are different\n";
    exit;
}
print "files are identical\n";
------------------------------------------------------------------------

Peter

--
#!/local/bin/perl5 -wp -*- mode: cperl; coding: iso-8859-1; -*-
# matlab comment stripper (strips comments from Matlab m-files)
s/^((?:(?:[])}\w.]'+|[^'%])+|'[^'\n]*(?:''[^'\n]*)*')*).*/$1/x;

Joe Smith

Nov 29, 2001, 2:48:11 PM
In article <b7tN7.4$M3...@news-srv1.fmr.com>,

Kevin B. Pease <kevin...@fmr.com> wrote:
> I'm writing a script which, as part of its processing, will need to
>compare the contents of two arbitrary files -- during the course of its
>run, I expect that the script will have to handle both binary and text
>files. What I need to do is decide whether or not the files' contents are
>identical. After reading a bit about the MD5 message-digest algorithm
>(implemented in Digest::MD5), it seems as if this might be a reasonable way
>to determine whether the file contents are identical.

MD5 is a great way of determining whether your local copy of a file
matches a remote copy of a file. Get the remote system to calculate
the MD5 digest based on what it has, and send back just the digest.
Compare that to the locally calculated digest. The end result is
the exchange of a few bytes of data, instead of the entire file.

It is also handy if you are planning to store the MD5 digest. For example,
to check if the file has been tampered with, or to locate possibly
identical files before doing a byte-for-byte comparison.

The MD5 digest is not the best thing to use if you can do
a byte-for-byte comparison between two local files.

># Undefine $/ temporarily to allow easy read of entire file into scalar.
>my $temp_undef = $/;
>undef $/;
>
># Open the target & source files, read them in, then set $/ back to its
>original value & close the filehandles.
>open (FILE1, "< $full_target_name");
>open (FILE2, "< $full_source_name");
>my $target_contents = <FILE1>;
>my $source_contents = <FILE2>;
>$/ = $temp_undef;
>close(FILE1);
>close(FILE2);

Use local() instead of my for $/, and put it in a block.

{   # Scope of local() below
    local $/ = undef;
    open (FILE1, $file1) or die "open($file1): $!\n";
    my $contents1 = <FILE1>; close FILE1;
    open (FILE2, $file2) or die "open($file2): $!\n";
    my $contents2 = <FILE2>; close FILE2;
    if ($contents1 eq $contents2) {
        print "Identical: $file1 $file2\n";
    } else {
        print "Different: $file1 $file2\n";
    }
}   # Set $/ back to what it was

-Joe
--
See http://www.inwap.com/ for PDP-10 and "ReBoot" pages.

Malcolm Dew-Jones

Nov 29, 2001, 4:00:09 PM
Joe Smith (in...@best.com) wrote:
: In article <b7tN7.4$M3...@news-srv1.fmr.com>,

: Kevin B. Pease <kevin...@fmr.com> wrote:
: > I'm writing a script which, as part of it's processing, will need to

: MD5 is a great way of determining whether your local copy of a file
: matches a remote copy of a file. Get the remote system to calculate
: the MD5 digest based on what it has, and send back just the digest.
: Compare that to the locally calculated digest. The end result is
: the exchange of a few bytes of data, instead of the entire file.

: It is also handy if you are planning to store the MD5 digest. For example,
: to check if the file has been tampered with, or to locate possibly
: indentical files before doing a byte-for-byte comparison.

: The MD5 digest is not the best thing to use if you can do
: a byte-for-byte comparison between two local files.

$0.02

I wonder if zipping the files first will make it easier to tell them
apart.

If you zip the files before you compare them then that should increase the
differences, making it easier to tell them apart. The zipped sizes may
change, and even small differences may make the contents very different so
that just reading a few well placed chunks would detect the changes
(assuming the zipped files were even the same size).

You might have a reason to zip the files anyway, which is why this comes
to mind, and then the zip work would be free. You'd have to figure out
whatever options and/or zip routine to use that would not vary the zip
file except based on the contents of the file.


Randal L. Schwartz

Nov 29, 2001, 3:55:14 PM
>>>>> "Peter" == Peter J Acklam <pjac...@online.no> writes:

Peter> Firstly, you should check the sizes. If the sizes differ, the
Peter> files can't be identical, and there is no need to read the
Peter> contents at all. You can do this with "-s $file".

Peter> Secondly, always check the return value from open()!

Peter> Thirdly, since you read both files into memory, why bother using
Peter> MD5 at all? You could simply use "eq" or "ne" to compare the
Peter> data.

Peter> If the files are really large, I would compare fixed size blocks
Peter> and abort as soon as two different blocks are found.

Peter> I have included a simple file comparison program I once wrote
Peter> which illustrates what I mean.

Or you could just use File::Compare, in the CPAN for a few years now,
written by a good buddy of mine, Joseph Hall.

print "Just another Perl hacker,"

--
Randal L. Schwartz - Stonehenge Consulting Services, Inc. - +1 503 777 0095
<mer...@stonehenge.com> <URL:http://www.stonehenge.com/merlyn/>
Perl/Unix/security consulting, Technical writing, Comedy, etc. etc.
See PerlTraining.Stonehenge.com for onsite and open-enrollment Perl training!

Edward Avis

Nov 29, 2001, 3:19:22 PM
"Kevin B. Pease" <kevin...@fmr.com> writes:

>What I need to do is decide whether or not the files' contents are
>identical. After reading a bit about the MD5 message-digest
>algorithm (implemented in Digest::MD5), it seems as if this might be
>a reasonable way to determine whether the file contents are
>identical.

No. If the checksums are different then the files are different; but
if two files have the same MD5 sum they are not necessarily identical.
It is quite hard to produce two files with the same checksum, and you
would be very unlucky to accidentally find two different files with
the same MD5 sum, but do you want code that depends on luck?

You haven't explained why you cannot just compare the files' contents
byte for byte. In most cases this will be _less_ expensive than
computing the MD5 checksum, since you'll only need to read the first
N bytes of each file if they differ at the Nth byte. Computing the
checksums means reading in all of both files.

>open (FILE1, "< $full_target_name");
>open (FILE2, "< $full_source_name");
>my $target_contents = <FILE1>;
>my $source_contents = <FILE2>;

># Produce MD5 Hex digest of target file's contents & source file's contents.
>my $file1_hex = md5_hex($target_contents);
>my $file2_hex = md5_hex($source_contents);

You have the file contents, why not just compare them using eq? Later
on if you want to speed things up you can try something more
sophisticated like comparing 1000-byte chunks at a time, or checking
that the file sizes match before comparing contents.

(Computing an MD5 sum would be useful if, for example, one file were
stored on a faraway server and one locally. You could ask the server
to compute the checksum and send you that instead of the whole big
file.)

--
Ed Avis <ep...@doc.ic.ac.uk>
Finger for PGP key

Gaal Yahas

Nov 30, 2001, 1:18:00 AM
On Thu, Nov 29, 2001 at 10:44:55AM -0800, Josh Vura-Weis wrote:
> Or, so that you don't have to read a huge file all at once, just read a
> K at a time and compare that:
[...]

> while(1){
> $numread1 = read(F1, $buff1, $len, $offset);
> $numread2 = read(F2, $buff2, $len, $offset);
>
> if ($numread1 != $numread2){
> die "Not equal";
> }
>
> if ($buff1 ne $buff2){
> die "Not equal";
> }
[...]

> This seems to work for both binary and text files. However, when I
> tested it on a large (7M) .rpm file, I noticed that the perl process
> grew from about 3M at startup to over 13M by the time it finished. It
> doesn't look to me like I should be keeping more than about 2K in use at
> once, so this seems strange to me. I'm guessing it may be linux
> cacheing the file blocks I've already read and attributing that memory
> to the perl process, but thought I should check to see if I'm doing
> something wrong with my read call.

Four-arg read doesn't work the way you're expecting it to. The fourth
argument, OFFSET, is the offset to the SCALAR, not the offset in the
file. If you want to seek, seek. What your code above does is append
the newly read blocks to $buff1 and $buff2 on every read.

Since your reads are sequential, you don't need to seek anyway; after
each read, the file should be positioned nicely for the next one. Get
rid of your $offset variable, and the fourth argument to read, and
your code should work.
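Something along these lines (an untested sketch, reusing your F1/F2
filehandles and variable names, with the offset argument simply dropped):

  my ($buff1, $buff2);
  my $len = 1024;
  my ($numread1, $numread2);
  while (1) {
      # three-argument read() overwrites the buffer each time and
      # advances the file position on its own
      $numread1 = read(F1, $buff1, $len);
      $numread2 = read(F2, $buff2, $len);

      die "Not equal" if $numread1 != $numread2;
      die "Not equal" if $buff1 ne $buff2;

      last if $numread1 < $len;
  }
  print "equal\n";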

--
Gaal Yahas <ga...@forum2.org>
http://www.forum2.org/gaal/
http://www.livejournal.com/~gaal/

Mike Taylor

Nov 30, 2001, 10:59:53 AM
> Date: 29 Nov 2001 20:19:22 +0000
> From: Edward Avis <ep...@doc.ic.ac.uk>

>
>> What I need to do is decide whether or not the files' contents are
>> identical. After reading a bit about the MD5 message-digest
>> algorithm (implemented in Digest::MD5), it seems as if this might
>> be a reasonable way to determine whether the file contents are
>> identical.
>
> No. If the checksums are different then the files are different; but
> if two files have the same MD5 sum they are not necessarily identical.
> It is quite hard to produce two files with the same checksum, and you
> would be very unlucky to accidentally find two different files with
> the same MD5 sum, but do you want code that depends on luck?

This is NOT something to worry about.

The MD5 module produces 16-byte digests, that's 128 bits. So the
chance that any given file will have the same checksum as any given
other file is 1/2^128 == 1/340282366920938463463374607431768211456.

If you checked a hundred files every second (which would be very
fast), then you'd be likely to get one false hit every
3402823669209384634633746074317682115 seconds
= 56713727820156410577229101238628035 minutes
= 945228797002606842953818353977134 hours
= 39384533208441951789742431415714 days
= 107902830708060141889705291550 years
= 7707345050575724421 times the estimated age of the universe.

Let's not lose sleep over it :-)

_/|_ _______________________________________________________________
/o ) \/ Mike Taylor <mi...@miketaylor.org.uk> www.miketaylor.org.uk
)_v__/\ "Wagner's music is nowhere near as bad as it sounds" --
Mark Twain.


Mark-Jason Dominus

Nov 30, 2001, 10:38:54 AM

> It is quite hard to produce two files with the same checksum,

"Quite hard" is an understatement. So far, it has proved impossible.

This is despite years of efforts by experts, who would love to find a
collision.

> and you would be very unlucky to accidentally find two different
> files with the same MD5 sum, but do you want code that depends on
> luck?

Luck affects all of us; there is no getting around it. The world is
not entirely under our control. A stray cosmic ray might pass through
your computer's memory, altering one of the bits. The disk might
catch fire. A crazed maniac with a shotgun might shoot out the power
cables. A meteorite could strike the computer room. The printed
circuitry on the motherboard might spontaneously evaporate. All of
these things are more likely to occur than an accidental MD5
collision. We don't write our software to be robust in the face of
such unlikely events. We buy insurance for some of them; others are
too unlikely to even be worth insuring. They are just part of the
risk of doing business in the physical universe.

You probably have a lot fewer than a hundred million files. But even
with one hundred million different files, the chances of an accidental
MD5 collision are about 34,028,237,032,376,216,670,100 to one. This
is even smaller than the chance of the Detroit Lions beating Chicago
next Sunday.

Suppose it takes one extra second of programming time to write the
program without MD5, and the MD5-less solution runs just as fast. Say
the programmer is paid $50,000 per year, so that the additional cost
of the MD5-less version was 0.694 cents. The cost-benefit analysis
says the time was wasted unless the cost of a spurious match is at
least $236,307,201,613,723,726,724.46.

(The risk analysis may produce a smaller number, but the result will
be the same. Do you really want to bet the company on an absence of
MD5 collisions? Maybe you think a 1/10^22 chance is too large when
the whole company is at stake. But you have *already* bet the company
against much worse odds: deranged employees who burn the machine room
on the same day that your offsite backups are swallowed up by
earthquakes; terrorist attacks that destroy downtown Manhattan,
nuclear wars, and so on. It's a scary world, and MD5 collisions are
among the least of your worries.)

The 'spurious collision' argument against MD5 is not a good one,
because the likelihood of a spurious collision is a lot smaller than a
lot of risks that we take for granted every day. There are good
reasons to avoid MD5, but that is not one of them.


Bart Lateur

Nov 30, 2001, 7:23:04 AM
Edward Avis wrote...

> "Kevin B. Pease" <kevin...@fmr.com> writes:
>
> >What I need to do is decide whether or not the files' contents are
> >identical. After reading a bit about the MD5 message-digest
> >algorithm (implemented in Digest::MD5), it seems as if this might be
> >a reasonable way to determine whether the file contents are
> >identical.
>
> No. If the checksums are different then the files are different; but
> if two files have the same MD5 sum they are not necessarily identical.
> It is quite hard to produce two files with the same checksum, and you
> would be very unlucky to accidentally find two different files with
> the same MD5 sum, but do you want code that depends on luck?
>
> You haven't explained why you cannot just compare the files' contents
> byte for byte.

I can figure out a reason: what if you have a lot of files, and you
want to see if there are any duplicates? Surely, Digest::MD5 is a good
way to find out if there are any *candidates* for being identical
files. You can always double check by doing a direct comparison.

I have been using such a mechanism to find duplicate files in a total
of well over 25000 files for quite some time now. I haven't had *any*
false positives so far. The chance for it is pretty slim, anyway.
Search the web for "collisions" if you want to find out more.

Digest::MD5 does provide a way to generate a digest directly from a
filehandle, see the "addfile" method. Also see the remark WRT
binmode().
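
For completeness, a small sketch of that interface (the helper name
md5_of_file is just for illustration):

  use Digest::MD5;

  sub md5_of_file {
      my ($path) = @_;
      open(my $fh, $path) or die "open($path): $!";
      binmode($fh);               # matters for binary files on some platforms
      my $md5 = Digest::MD5->new;
      $md5->addfile($fh);         # reads the handle in chunks, no slurping
      close($fh);
      return $md5->hexdigest;
  }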

--
Bart.

Edward Avis

Nov 30, 2001, 7:43:41 AM
yf...@victoria.tc.ca (Malcolm Dew-Jones) writes:

>: MD5 is a great way of determining whether your local copy of a file
>: matches a remote copy of a file.

>: The MD5 digest is not the best thing to use if you can do
>: a byte-for-byte comparison between two local files.

>If you zip the files before you compare them then that should
>increase the differences, making it easier to tell them apart.

Not really - the MD5 checksum already puts the file content through a
lot of 'mangling' and does not particularly depend on 'large'
differences between files. Either a large difference or a small
difference in file content is likely to alter the checksum. Hence
there is no need to 'increase' the file differences by zipping them.

In some cases there must be two different files which have the same
MD5 sum, but differing sums when zipped. OTOH there are also cases
where two different files with different checksums end up with the
same checksum when you compare the zipped versions.

You could think up lots of elaborate schemes like first compare the
MD5 sums of the original files, then if that matches zip both files
and compare again, then ROT13 the results and compare the checksums a
third time. That would indeed be more trustworthy than a plain MD5
comparison, but still not guaranteed to indicate when two files
differ.

The easiest and simplest scheme - and the only one which is guaranteed
to give the right answer - is to compare the content of the two
files.

Kevin B. Pease

Nov 30, 2001, 2:23:25 PM
"Edward Avis" <ep...@doc.ic.ac.uk> wrote in message
news:xn9adx5...@texel03.doc.ic.ac.uk...

> "Kevin B. Pease" <kevin...@fmr.com> writes:
> >What I need to do is decide whether or not the files' contents are
> >identical. [...]
> [...]

> You haven't explained why you cannot just compare the files' contents
> byte for byte. In most cases this will be _less_ expensive than
> computing the MD5 checksum, since you'll only need to read the first
> N bytes of each file if they differ at the Nth byte. Computing the
> checksums means reading in all of both files.

The reason I was thinking of approaching it this way is due to the
nature of the application I'm working with... I work as a ClearCase /
Release engineer, and what I'm trying to do is automate some of my build
functionality, specifically, automatically checking in object files which
are different from the "last released" version of the same file. The
criterion I'm trying to check for is this --
    IF the files' contents are identical, AND the ClearCase config
    records are identical,
    THEN I don't want to check in the file. If either condition is
    false, then check in the file.
(For checking the config records, ClearCase provides a utility for
this already, I just need to call it & look at the results)

Due to how ClearCase handles the storage of versions at my site
(text-file deltas & binaries), I was thinking it would be easier to compute
the MD5 digest of each file I check in, and then simply query that value
when I'm trying to determine if a new version needs to be checked in. To
(over)simplify the issue a bit for the benefit of anybody unfamiliar with
ClearCase, ClearCase removes unused copies of checked-in files from its
network-wide storage area after a few days in order to save space; when
somebody tries to open a particular version of that file after it's removed,
ClearCase re-assembles a copy of that file on the fly from its source
database, and serves it out to the user requesting the file. I was hoping
that by using MD5, I could avoid the performance hit of having to regenerate
a copy of the file for the sole purpose of comparing it with the version I
just built. Instead, a query on the file version's attribute "MD5_HEX"
would tell me what the digest is, and I'd then just compute the MD5 digest
of the file I'm considering as a candidate for checkin...
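
Roughly, the check would look something like this sketch -- get_stored_md5()
here is only a hypothetical stand-in for whatever ClearCase query returns the
MD5_HEX attribute of the last released version, not a real API:

  use Digest::MD5;

  sub needs_checkin {
      my ($candidate, $released_version) = @_;
      my $stored = get_stored_md5($released_version);   # hypothetical helper
      open(my $fh, $candidate) or die "open($candidate): $!";
      binmode($fh);
      my $current = Digest::MD5->new->addfile($fh)->hexdigest;
      close($fh);
      return $current ne $stored;   # the config-record check still applies
  }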

However, if it's possible that two different files could conceivably
generate the same MD5 digest, then I guess I'm better off incurring that
performance hit for the sake of certainty. Thanks to everyone for their
suggestions and feedback. I've certainly got a few things to try, and some
general style critiques that will certainly be useful in future coding!

----------
Kevin Pease kevin...@fmr.com

Malcolm Dew-Jones

Nov 30, 2001, 4:04:08 PM
Edward Avis (ep...@doc.ic.ac.uk) wrote:
: yf...@victoria.tc.ca (Malcolm Dew-Jones) writes:

: >: MD5 is a great way of determining whether your local copy of a file
: >: matches a remote copy of a file.

: >: The MD5 digest is not the best thing to use if you can do
: >: a byte-for-byte comparison between two local files.

: >If you zip the files before you compare them then that should
: >increase the differences, making it easier to tell them apart.

: Not really - the MD5 checksum already puts the file content through a
: lot of 'mangling' and does not particularly depend on 'large'

Yes, but to do an MD5 checksum you have to process the whole file, which
you would not normally need to do, whereas you commonly have to zip files
anyway, so using the zip during a compare may require no extra work, as
long as you've thought ahead of time about the correct zip options to use.

As for being easier, a byte for byte compare will be more efficient if you
choose an ordering of the bytes that is most likely to encounter any
differences sooner. E.g., if you are comparing a series of program
versions then the files have large runs that are very similar, and a plain
byte for byte compare in the natural order of the file may have to check a
large part of the file to find the first difference. If the comparison
started in the middle, you might find the differences sooner (though maybe
not). The zipped versions, however, will have more differences, so a byte
for byte compare will detect a difference sooner, and starting the compare
in the middle (for example) would detect the difference straight away,
even if the files were virtually identical.

As I said though, the whole point of using a zip would be that you may
have a reason to zip the files anyway.


Edward Avis

Nov 30, 2001, 6:18:30 PM
"Kevin B. Pease" <kevin...@fmr.com> writes:

>IF The files' contents are identical, AND The ClearCase config
>records are identical Then I don't want to check in the file.

>I was hoping that by using MD5, I could avoid the performance hit of
>having to regenerate a copy of the file for the sole purpose of
>comparing it with the version I just built. Instead, a query on the
>file version's attribute "MD5_HEX" would tell me what the digest is,
>and I'd then just compute the MD5 digest of the file I'm considering
>as a candidate for checkin...
>
>However, if it's possible that two different files could conceivably
>generate the same MD5 digest, then I guess I'm better off incurring that
>performance hit for the sake of certainty.

Not really - I would say use the digest if it is faster or makes your
life easier. The chances of it being wrong are really rather small.

I'm afraid that with my earlier scaremongering I misunderstood your
question. I thought you were asking is MD5 a good way to compare
files in general - under circumstances where you have both sets of
file contents to hand. If you can't easily reconstruct one of the
files, or one of the MD5 checksums is already computed for you, then
you might as well go ahead and use it.

If you use TCP/IP then there is a small chance that anything you
download will be corrupted (such as the contents of this message).
The TCP checksum and any other checksums in the network stack can, of
course, get it wrong. But the chance is small enough that it's not
worth worrying about. A mismatch on MD5 sums is probably a lot more
likely than a wrong download that goes unnoticed, but still not
something you're likely to see.

To summarize: in principle you should not ever rely on the MD5 sum to
tell you if two files are equal. In practice feel free to take
shortcuts if a proper comparison is awkward.

Edward Avis

Nov 30, 2001, 6:08:32 PM
bart....@pandora.be (Bart Lateur) writes:

[is MD5 a good way to compare two files for equal contents?]

>>You haven't explained why you cannot just compare the files'
>>contents byte for byte.
>
>I can figure out a reason: what if you have a lot of files, and you
>want to see if there are any duplicates?

That's a good reason and I've done that myself sometimes (with the GNU
md5sum program). I meant if you are just comparing two files.

(In most cases though, just getting the file size will be a good
enough way to single out candidates. If all your files are the same
size then maybe a simple CRC checksum of the first (or last) thousand
bytes would do. MD5 probably isn't the thing to use if you are
worried about performance, though it certainly beats comparing each
pair of files individually.)
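
As a sketch of that size-first filtering (assuming a pre-built @files list
of paths):

  my %by_size;
  for my $file (@files) {
      my $size = -s $file;
      push @{ $by_size{$size} }, $file;   # bucket paths by file size
  }

  for my $size (keys %by_size) {
      my @candidates = @{ $by_size{$size} };
      next if @candidates < 2;            # unique size => cannot be a duplicate
      # only these candidates need a digest or a byte-for-byte compare
      print "possible duplicates (size $size): @candidates\n";
  }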

Another reasonably quick way to find duplicates - trying to steer this
back towards Perl a little - is to slurp the entire contents of each
file, put the files in a list and sort. This assumes you have enough
RAM to hold all the files, but it saves on the CPU time of comparing
every pair of files.

(Has anyone implemented a lazy string type in Perl where the contents
of the string are read from a file as needed? If this class overrode
cmp then comparing two such strings would mean reading the two files
only up to the point where they differ. But the programmer could
treat the objects more or less as strings.)

Edward Avis

Nov 30, 2001, 6:44:45 PM
Mike Taylor <mi...@tecc.co.uk> writes:

>>>After reading a bit about the MD5 message-digest
>>>algorithm (implemented in Digest::MD5), it seems as if this might
>>>be a reasonable way to determine whether the file contents are
>>>identical.

>>It is quite hard to produce two files with the same checksum, and
>>you would be very unlucky to accidentally find two different files
>>with the same MD5 sum, but do you want code that depends on luck?

>This is NOT something to worry about.

Perhaps not in practice - but in principle. I just don't like the
idea of writing code that I _know_ to be buggy. (Buggy in the sense
that it will give the wrong answer for some inputs.) Sometimes using
something like MD5 comparison is unavoidable, or so useful that you're
prepared to take the risk of getting a wrong answer. But not if there
is a clear and equally efficient alternative.

It's highly unlikely in the foreseeable future that someone will
'crack' MD5 and find a way to easily generate two different files with
the same MD5 checksum (or worse, to generate a file with a known
checksum). So you are almost certainly safe in using it for
security-related things (many things already do). OTOH, why make your
program depend on the security of MD5 if you don't have to?

For me this is similar to the problem of selecting two different
integers between 0 and 9:

my $a = int(rand() * 10);
my $b;
do { $b = int(rand() * 10) }
until ($a != $b);

Is this code acceptable? For a very unlucky user it might take years
to pick a value for $b! I would prefer to write

my $a = int(rand() * 10);
my $b = int(rand() * 9);
++$b if $b >= $a;

This is guaranteed to pick two different numbers first time and
doesn't depend on luck (other than deciding what the numbers will be,
of course). You can say that a user would need to be astronomically
unlucky to experience a noticeable delay from the first version, and
that is true. But with the second version you don't have to
consider these arguments or weigh up probabilities. You know that it
will work (modulo actual bugs). As a programmer I'd feel a bit
unhappy about the first version - whatever the probability, it just
feels wrong to write that kind of code.

Peter J. Acklam

Dec 1, 2001, 4:00:20 PM
Mike Taylor <mi...@tecc.co.uk> wrote:

> The MD5 module produces 16-bytes digests, that's 128 bits. So the
> chance that any given file will have the same checksum as any given
> other file is 1/2^128 == 1/340282366920938463463374607431768211456.

Your argument assumes independence. Is that an assumption that can
be justified? Is it really a fact that the probability of any two
random files having the same MD5 checksum is 1 to 2^128?

> Let's not lose sleep over it :-)

Would you say that if the software depending on your assumption
was used in life-support systems, or on a nuclear power plant?

Craig Berry

Dec 1, 2001, 5:28:34 PM
Josh Vura-Weis <wtp...@highwire.stanford.edu> wrote in news:3C068227.5030003
@highwire.stanford.edu:

> Or, so that you don't have to read a huge file all at once, just read a
> K at a time and compare that:
>
> #!/usr/bin/perl -w
>
> my ($f1, $f2) = @ARGV;

Doing

die "Not equal" unless -s $f1 == -s $f2;

right here would save a lot of work for most potential file pairs.

--
Craig Berry <http://www.cinenet.net/~cberry/>
"That which is now known, was once only imagined." - William Blake

Dave Tweed

Nov 30, 2001, 8:38:36 PM
Edward Avis wrote:
> Another reasonably quick way to find duplicates - trying to steer this
> back towards Perl a little - is to slurp the entire contents of each
> file, put the files in a list and sort. This assumes you have enough
> RAM to hold all the files, but it saves on the CPU time of comparing
> every pair of files.

If you're going to go that far, you might as well just use the file
contents as keys to a hash and find the duplicates directly, rather
than going to the trouble of sorting them and then scanning the list
for duplicates.

In either case, each file will be compared with roughly log2(N) others.
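
A sketch of that contents-as-key approach, assuming an @files list and
enough RAM to slurp each file once:

  my %seen;                 # file contents => first file seen with them
  for my $file (@files) {
      open(my $fh, $file) or die "open($file): $!";
      binmode($fh);
      my $contents = do { local $/; <$fh> };   # slurp the whole file
      close($fh);
      if (exists $seen{$contents}) {
          print "$file duplicates $seen{$contents}\n";
      } else {
          $seen{$contents} = $file;
      }
  }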

-- Dave Tweed

Mark-Jason Dominus

Dec 1, 2001, 10:22:25 PM

pjac...@online.no (Peter J. Acklam) asks:

> Mike Taylor <mi...@tecc.co.uk> wrote:
>
> > The MD5 module produces 16-bytes digests, that's 128 bits. So the
> > chance that any given file will have the same checksum as any given
> > other file is 1/2^128 == 1/340282366920938463463374607431768211456.
>
> Your argument assumes independence. Is that an assumption that can
> be justified?

Well, sort of. More yes than no.

> Is it really a fact that the probability of any two
> random files having the same MD5 checksum is 1 to 2^128?

It depends on the random distribution that selects the files. It may
also depend on complex properties of the MD5 algorithm itself.

However, if you make the simplifying assumption that the output of the
MD5 algorithm is a random function of the input, then the answer to
your question is 'yes'.

For a larger number n of files, the probability of a collision is
actually about n*(n-1)/(2^129), at least until n gets large enough for the
third-order effects to be significant. (Around 10^21 files or so, so
let's ignore them.) As I mentioned in another article, if you have
100 million files, then the probability of a collision is about
1e16/2^129, or roughly 1 in 10^22. This is Very Small.
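
(For anyone who wants to check that figure, the arithmetic fits
comfortably in ordinary floating point:

  my $n = 1e8;                                  # one hundred million files
  printf "%.3g\n", $n * ($n - 1) / 2**129;      # prints about 1.47e-23

which is indeed on the order of one chance in 10^22 or so.)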

> > Let's not lose sleep over it :-)
>
> Would you say that if the software depending on your assumption
> was used in life-support systems, or on a nuclear power plant?

That sounds suspiciously like a rhetorical device that is intended to
provoke an irrational panic response. But I will give you the benefit
of the doubt and answer your question.

Yes, I would say that, absolutely.

I might lose sleep worrying about the implementation of the MD5
algorithm. I might lose sleep worrying about the use to which the MD5
was put in the application. I might worry about simple hardware
failures. I might lose sleep worrying that a stray cosmic ray would
change the bits in the computer to cause a comparison to be
performed incorrectly. But I absolutely will not lose sleep by
worrying about the possibility of a chance MD5 hash collision.

According to

http://impact.arc.nasa.gov/introduction/faq-neo.html

we expect the Earth to be hit by one or two major meteorites (1
million megatons energy) every million years or so. A nuclear power
plant takes up maybe 1/80000000000 of the earth's surface and has a
lifetime of say 30 years. The likelihood of a major meteor strike on
your nuclear power plant is therefore about 1/80000000000 *
1.5/1000000 * 30 = 10^-15 or so. This is small, but Not So Small.

You have to allocate your resources where they will do the most good.
The likelihood of a meteor strike on your power plant is about
38 million times greater than the likelihood of an MD5 collision
between two of their files. If you're losing sleep over MD5
collisions in the power plant, you should be losing about 38,000,000
times as much sleep over meteor strikes.

And just think, the meteor might not strike the power plant at all.
It might strike *you*. NASA says that you might have as much as a
1/20000 chance of dying from a once-in-a-million-years meteor strike.
To get your chance of a random MD5 collision that high, you'd have to
have a filesystem that engraved several trillion files on the surface
of each and every proton in the entire universe.

We are engineers, not cave dwellers. We can be rational about this,
and not cower in superstitious fear of the unknown. Let's do that.

Edward Avis's suggestion that

do { $b = int(rand() * 10) }
until ($a != $b);

might take years is even more absurd. Leaving aside all arguments
about the putative randomness of rand(), the chance that this will
take *years* to complete goes beyond the merely astronomical. (The
odds that it will take as much as an eighth of a second are already
astronomical.) It goes beyond the chance that the atoms in the dust
on the floor will suddenly and spontaneously transform themselves into
a pure gold ingot; the gold ingot is a lot more likely than the code
above taking years to select $b. I don't want to argue the merits of
the code. I just want to make a second-level argument: You can make
good arguments about the merits of the code, and bad arguments about
the merits of the code, and the argument that the code might take a
long time to run is one of the worst of the bad arguments; it's
complete nonsense. You would do better to argue that the code is bad
because it's haunted by green goblins, because at least I can't prove
that there aren't any goblins.

I note that Edward's proposed alternative code is incorrect, by the
way. Who really needs to worry about a 10^-15 chance of an MD5
collision when the chance of the programmer screwing up seems to be
about fifty-fifty?

Joe Smith

Dec 1, 2001, 11:15:21 PM
In article <3c07...@news.victoria.tc.ca>,

Malcolm Dew-Jones <yf...@victoria.tc.ca> wrote:
>As for being easier, a byte for byte compare will be more efficient if you
>choose an ordering of the bytes that is most likely to encounter any
>differences sooner. E.g., if you are comparing a series of program
>versions then the files have large runs that are very similar, and a plain
>byte for byte compare in the natural order of the file may have to check a
>large part of the file to find the first difference. If the comparison
>started in the middle, then you might find the differences sooner (though
>maybe not), whereas the zipped versions will have more differences

Not really. If the plaintext files have a difference 80% into the file,
then the zipped version will also have its first difference 80% into
the zipped file. Granted, a zipped file has fewer bytes to compare, but
the difference will be the same relative percentage in zipped and unzipped.

> so a
>byte for byte compare will detect the difference sooner, and selecting
>(for example) to start the compare in the middle would detect the
>difference straight away, even if the files were virtually identical.

Nope. Changes at the end of a large file do not cause changes at
the beginning of the zip file. Most compression methods are one-pass;
a stream of bytes is compressed the first time around, instead of
using one pass to calculate an optimum and a second pass to compress
the data. Patterns at the end of the file do not affect the beginning.

Using perl's sysread() function is hard to beat. The overhead for
zipping a nonzipped file is not insignificant.
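
A sketch of what a sysread()-based compare might look like (assumes regular
files, where the up-front size check guarantees the reads line up; error
handling on the reads is omitted for brevity):

  sub files_equal {
      my ($file1, $file2) = @_;
      return 0 unless -s $file1 == -s $file2;       # cheap size check first
      open(my $fh1, $file1) or die "open($file1): $!";
      open(my $fh2, $file2) or die "open($file2): $!";
      binmode($fh1); binmode($fh2);
      my ($buf1, $buf2);
      while (my $n1 = sysread($fh1, $buf1, 8192)) {
          my $n2 = sysread($fh2, $buf2, 8192);
          return 0 unless defined $n2 and $n2 == $n1 and $buf1 eq $buf2;
      }
      return 1;                                     # hit EOF with no mismatch
  }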

Steffen Müller

Dec 1, 2001, 7:39:08 PM
"Peter J. Acklam" <pjac...@online.no> schrieb im Newsbeitrag
news:u1vabq...@online.no...

| Mike Taylor <mi...@tecc.co.uk> wrote:
| > Let's not lose sleep over it :-)
|
| Would you say that if the software depending on your assumption
| was used in life-support systems, or on a nuclear power plant?

perldoc -q "When shouldn't I program in Perl?"
comes to mind.

"The new, native-code compiler for Perl may eventually reduce the
limitations given in the previous statement to some degree, but understand
that Perl remains fundamentally a dynamically typed language, not a
statically typed one. You certainly won't be chastised if you don't trust
nuclear-plant or brain-surgery monitoring code to it. And Larry will sleep
easier, too--Wall Street programs not withstanding. :-)"

Just trying to connect this discussion to Perl.

Steffen
--
$_=q;0cb212c210b0bb010c0113bb0c410c0b516c0bb3d212c2b0b0b016b6cb2b2c21010c0
b41110b3bba0e0c0d2c4b2b6bc013d2c0d0b01012b0b0;;s/\n//g;s/(\d)/$1<2?$1:'0'x
$1/ge;s/([a-f])/'1'x(ord($1)-97)/ge;$o=$_;push@o,substr($o,$_*8,8) for(0..
24);for(@o){print"\0"x(26-$i).chr(oct('0b'.($_)))."\r";$i++};print"\n"#stm


Peter J. Acklam

Dec 2, 2001, 11:00:37 AM
"Steffen Müller" <5l25...@sneakemail.com> wrote:

> "Peter J. Acklam" <pjac...@online.no> schrieb:


>
> > Would you say that if the software depending on your
> > assumption was used in life-support systems, or on a nuclear
> > power plant?
>
> perldoc -q "When shouldn't I program in Perl?"
> comes to mind.
>
> "The new, native-code compiler for Perl may eventually reduce
> the limitations given in the previous statement to some degree,
> but understand that Perl remains fundamentally a dynamically
> typed language, not a statically typed one. You certainly won't
> be chastised if you don't trust nuclear-plant or brain-surgery
> monitoring code to it. And Larry will sleep easier, too--Wall
> Street programs not withstanding. :-)"
>
> Just trying to connect this discussion to Perl.

You have a point, and I wouldn't use Perl for brain-surgery
monitoring, but there is more software related to this than the
software actually running during a brain surgery. There is also
the software used to manage the software, CASE software, version
control software, test suites, etc. I believe Perl is suited for
some of the latter.

Craig Berry

Dec 2, 2001, 1:50:43 PM
"Steffen Müller" <5l25...@sneakemail.com> wrote in
news:9ubt4m$g1s$02$1...@news.t-online.com:

> "Peter J. Acklam" <pjac...@online.no> schrieb im Newsbeitrag
> news:u1vabq...@online.no...
>| Mike Taylor <mi...@tecc.co.uk> wrote:
>| > Let's not lose sleep over it :-)
>|
>| Would you say that if the software depending on your assumption was
>| used in life-support systems, or on a nuclear power plant?
>
> perldoc -q "When shouldn't I program in Perl?"
> comes to mind.

But note that this discussion is not Perl-specific; rather, it is a
language-independent question concerning how to compare files for equality,
and the risks of using message digesting for this purpose. So the FAQ entry
is irrelevant. All the points on this thread would be equally applicable to
writing file comparison code in C or Ruby or Ada or any other language.

Malcolm Dew-Jones

Dec 2, 2001, 1:42:06 PM
Joe Smith (in...@best.com) wrote:
: In article <3c07...@news.victoria.tc.ca>,

: Malcolm Dew-Jones <yf...@victoria.tc.ca> wrote:
: >As for being easier, a byte for byte compare will be more efficient if you
: >choose an ordering of the bytes that is most likely to encounter any
   ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^

: Not really. If the plaintext files have a difference 80% into the file,
: then the zipped version will also have its first difference 80% into
: the zipped file.

If you make an intelligent choice for the *ordering* of the bytes that
you compare, for example if you chose to start comparing at the *end* of
the file instead of at the beginning, then you will encounter the
difference straight away because the zip algorithm will likely propagate
the differences to the end of the file.

The effect of zipping is to spread small differences (i.e. even just a
single byte) into larger regions that are different. So for example, if
you choose to order the compare in randomly ordered chunks, then you are
more likely to find the differences sooner in the zipped files than
unzipped files.

You'll notice that a well chosen ordering of the comparison of bytes is
one of the key speedup techniques in a traditional grep (fgrep probably).
The algorithm determines the locations at which differences are most
likely to appear and compares those bytes first.

It all depends on choosing an ordering of the bytes used for the compare.

(And on the original assumption that you had some other reason to zip the
files anyway.)


Matt Sergeant

Dec 3, 2001, 5:04:35 AM
"Kevin B. Pease" <kevin...@fmr.com> wrote in message news:<b7tN7.4$M3...@news-srv1.fmr.com>...
> Hi all,

>
> I'm writing a script which, as part of its processing, will need to
> compare the contents of two arbitrary files -- during the course of its
> run, I expect that the script will have to handle both binary and text
> files. What I need to do is decide whether or not the files' contents are
> identical. After reading a bit about the MD5 message-digest algorithm

> (implemented in Digest::MD5), it seems as if this might be a reasonable way
> to determine whether the file contents are identical.

You should check CPAN first. See File::Same or File::Compare
(File::Same does processing on directories as well as individual
files, and does some caching, so it's a bit faster for large directory
compares).

File::Same uses MD5 for its cache.

Matt.

Edward Avis

Dec 3, 2001, 1:04:46 PM
Dave Tweed <dtw...@acm.org> writes:

>>Another reasonably quick way to find duplicates - trying to steer
>>this back towards Perl a little - is to slurp the entire contents of
>>each file, put the files in a list and sort.

>If you're going to go that far, you might as well just use the file
>contents as keys to a hash and find the duplicates directly,

>In either case, each file will be compared with roughly log2(N) others.

But which has better worst-case behaviour: a simple 'le' comparison
for sorting or Perl's hash function? In fact, which has better
performance full stop? (You can compare two strings stopping at the
character where they differ, whereas hashing means reading both
strings from beginning to end - back to the MD5 discussion.)

Which reminds me: I was wondering why Perl uses a hash implementation
for its associative arrays, er I mean hashes. Why not some more
'elegant' data structure like a binary tree or b-tree? I mean, I'm
not disputing that hashes work very well in practice, but who made the
lucky decision?

Edward Avis

Dec 3, 2001, 1:09:12 PM
Mark-Jason Dominus <m...@plover.com> writes:

>Edward Avis's suggestion that
>
> do { $b = int(rand() * 10) }
> until ($a != $b);
>
>might take years is even more absurd.

But it *might*. You can't argue with that. Especially if your random
number generator is buggy :-).

It might also just take three attempts to generate $b instead of one.
In some putative real-time system that depends on random numbers that
could be significant. Of course you would probably not be using Perl
under these circumstances.

>I note that Edward's proposed alternative code is incorrect, by the
>way. Who really needs to worry about a 10^-15 chance of an MD5
>collision when the chance of the programmer screwing up seems to be
>about fifty-fifty?

Argh. I just _knew_ this would happen, I spent longer than usual
checking the code in an attempt to avoid embarassment. I still can't
spot the bug, could you point it out?

Wolfram Pfeiffer

Dec 3, 2001, 1:56:59 PM
Malcolm Dew-Jones <yf...@victoria.tc.ca> wrote:
[ checking if files differ ]

> I wonder if zipping the files first will make it easier to tell them
> apart.

Apart from other reasons that have already been pointed out:

This might give wrong results if the zip-program used includes
meta-information like the time of compression into the archive.

Assume a file foo.pl, and you zip it first into foo1.archive and later
into foo2.archive. If the archiver included the time of compression,
your check will tell that the files differ, although the original files
within the archive are equal.

Gruss,
Wolfram

--
Even a god who were responsible for the platypus would have done better
than XML. -- Erik Naggum, comp.lang.lisp

Malcolm Dew-Jones

Dec 3, 2001, 8:11:32 PM
Wolfram Pfeiffer (wolfram....@bigfoot.com) wrote:

: Malcolm Dew-Jones <yf...@victoria.tc.ca> wrote:
: [ checking if files differ ]
: > I wonder if zipping the files first will make it easier to tell them
: > apart.

: Apart from other reasons that have already been pointed out:

I should not have posted this, and never will post anything like this
again, I assure you, other than the final thought at the bottom.

Removed by the poster was this caveat, requoted now from the third
(fourth?) paragraph of my original post...

"You'd have to figure out whatever options and/or zip routine to
use that would not vary the zip file except based on the contents
of the file. "

Also, some details such as differing meta information could likely be
avoided by skipping that information, assuming it is all at the front of
the zip file (I do not know if that is true), which is something the user
could also think about, or perhaps some compression programs do not add
this kind of information at all.

$0.02

This does make me wonder though, whether (probabilistically speaking) a
string equality function that compared strings in some non-standard order
might be more efficient on average than one that simply compares two
strings in the natural order.

/* pseudo C code, compare every tenth byte, starting
   at offsets 0 thru 9 */
str_equal(char *string1, char *string2) {
    for (i = 0; i < 10; i++)
    {
        char *p = &string1[i];
        char *q = &string2[i];
        while ( p < end_of_string_1 )
        {
            if (*p != *q) return 0;
            p += 10;
            q += 10;
        }
    }
    return 1;
}

The non-equality of two random strings should be detected just as soon as
in a normal string compare, but (perhaps) the non-equality of two strings
which are similar would be detected sooner, on average, than comparing
each byte in order. This requires knowing the length of the strings, but
in a language like perl that is presumably true (and of course you can't do
a comparison, just an equality test).
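
Translated into (deliberately naive) Perl, the same idea looks roughly like
this -- though in practice a plain 'eq' on the whole strings, being a single
C-level memory compare, will almost always win:

  sub str_equal_strided {
      my ($s1, $s2) = @_;
      return 0 if length($s1) != length($s2);
      my $len = length($s1);
      for my $start (0 .. 9) {                     # offsets 0 thru 9
          for (my $i = $start; $i < $len; $i += 10) {
              return 0 if substr($s1, $i, 1) ne substr($s2, $i, 1);
          }
      }
      return 1;
  }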


Joe Smith

Dec 3, 2001, 7:26:02 PM
In article <xn9lmgk...@sync22.doc.ic.ac.uk>,

Edward Avis <ep...@doc.ic.ac.uk> wrote:
>Which reminds me: I was wondering why Perl uses a hash implementation
>for its associative arrays, er I mean hashes. Why not some more
>'elegant' data structure like a binary tree or b-tree? I mean, I'm
>not disputing that hashes work very well in practice, but who made the
>lucky decision?

My guess is that it was Aho, Weinberger, and Kernighan, the developers
of 'awk'. Perl was designed to do the sorts of things that 'awk' and
'sed' do; 'awk' had associative arrays.

Dave Tweed

Dec 3, 2001, 6:51:44 PM
Edward Avis wrote:

> Dave Tweed <dtw...@acm.org> writes:
> >If you're going to go that far, you might as well just use the file
> >contents as keys to a hash and find the duplicates directly,
>
> >In either case, each file will be compared with roughly log2(N) others.
>
> But which has better worst-case behaviour: a simple 'le' comparsion
> for sorting or Perl's hash function? In fact, which has better
> performance full stop? (You can compare two strings stopping at the
> character where they differ, whereas hashing means reading both
> strings from beginning to end - back to the MD5 discussion.)

If you use a hash as I suggest, each file's contents gets hashed
exactly once, to find which "bucket" it goes into. This is O(N).
Each new file needs to be compared (for equality) with each file
that's already in the same bucket. The number of compares should
be roughly the same for the hash approach and for the sort
approach, O(N log N).

The hash approach finds the duplicates directly, but the sort
approach requires *another* pass of comparisons (O(N)) in order
to identify adjacent identical files in the sorted list.

Overall, I would expect the performance of both approaches to
be similar. I just think the logic of the hash approach would
be more direct, clearer to understand and easier to maintain.

-- Dave Tweed

Peter J. Acklam

Dec 4, 2001, 7:48:19 AM
Edward Avis <ep...@doc.ic.ac.uk> wrote:

> Mark-Jason Dominus <m...@plover.com> writes:
>
> >Edward Avis's suggestion that
> >
> > do { $b = int(rand() * 10) }
> > until ($a != $b);
> >
> >might take years is even more absurd.
>
> But it *might*. You can't argue with that.

Mark-Jason is being practical. Of course you are right, there is
a small probability that it might take years, but that quantity is
so small that it is only of theoretical value, at best. For all
practical purposes you can safely assume that it will not take years.

Peter

Peter J. Acklam

Dec 4, 2001, 7:43:39 AM
Mark-Jason Dominus <m...@plover.com> wrote:

> pjac...@online.no (Peter J. Acklam) asks:
>

> > Is it really a fact that the probability of any two random
> > files having the same MD5 checksum is 1 to 2^128?
>
> It depends on the random distribution that selects the files.
> It may also depend on complex properties of the MD5 algorithm
> itself.
>
> However, if you make the simplifying assumption that the output
> of the MD5 algorithm is a random function of the input, then the
> answer to your question is 'yes'.

I am not doubting that a probability of 2^-128 can be neglected
even in critical systems. What I was questioning was rather
whether that probability 2^-128 was correct.

Clearly, if one or a few checksums occurred frequently, and the
others hardly ever, then the probability of two identical
checksums could have been much higher than 2^-128. Apparently,
the MD5 algorithm is so well designed that this is nothing to
worry about.

It would be interesting to read more about the statistical
properties of the MD5 algorithm though. For instance, if the file
length is random (say, exponentially distributed) and each byte is
uniformly distributed (between 0x00 and 0xff), is the MD5 output
uniformly distributed?

Anyway, it is no longer a Perl issue, so I think I'll stop here.

Peter
