Compare two huge files without knowing the corresponding checksum information.

Hongyi Zhao

Jun 20, 2020, 10:12:10 PM

Hi,

I downloaded the intel parallel studio cluster edition from this location by two downloading tools: http://registrationcenter-download.intel.com/akdlm/irc_nas/tec/16526/parallel_studio_xe_2020_update1_cluster_edition.tgz.

Now, I want to check whether these two downloaded files are exactly the same, but I don't know the corresponding checksum information about them. Any hints for doing this job quickly and efficiently?

Regards,
HY

Bit Twister

Jun 21, 2020, 1:41:55 AM

On Sat, 20 Jun 2020 19:12:06 -0700 (PDT), Hongyi Zhao wrote:
> Hi,

Hi yourself.
I can recommend getting a real Usenet client that follows Usenet
guidelines and wraps your lines at around 72 characters, plus a free
Usenet account.

One or more subject matter experts may not bother responding if the
post is from Google Groups or if they have to reformat your text to
~72 characters per line.

> I downloaded the intel parallel studio cluster edition from this
> location by two downloading tools: http://registrationcenter-download.intel.com/akdlm/irc_nas/tec/16526/parallel_studio_xe_2020_update1_cluster_edition.tgz.
>
> Now, I want to check whether these two downloaded files are exactly
> the same

Personally I would have used cmp.

> but I don't know the corresponding checksum information about them.

Well, pick a checksum type and use that tool's syntax.

> Any hints for doing this job quickly and efficiently?

Think about it: you are using a program to read the binary file and
produce a checksum value. You do it again for the other file,
then compare the results.

If using cmp, it reads and compares both files and gives you a
result/status in one pass.

That seems more efficient than computing separate checksums and then
making the comparison.

Your checksum code could look something like

set -- $(md5sum $File_one)
F1_sum="$1"
set -- $(md5sum $File_two)
F2_sum="$1"

if [ "$F1_sum" != "$F2_sum" ] ; then
echo "$File_one does not match $File_two"
exit 1
fi

Whereas with cmp it would be

cpm --quiet --status $File_one $File_two
if [ $? -ne 0 ] ; then
echo "$File_one does not match $File_two"
exit 1
fi

Chris F.A. Johnson

Jun 21, 2020, 5:08:07 AM

On 2020-06-21, Bit Twister wrote:
...
> Whereas with cmp it would be
>
> cpm --quiet --status $File_one $File_two

I don't see the --status option in any version of cmp that I have, nor
in the POSIX spec.

> if [ $? -ne 0 ] ; then
> echo "$File_one does not match $File_two"
> exit 1
> fi

if ! cpm -s "$File_one" "$File_two"
then
echo "$File_one does not match $File_two"
exit 1
fi


--
Chris F.A. Johnson <http://cfajohnson.com/>
=========================== Author: ===============================
Shell Scripting Recipes: A Problem-Solution Approach (2005, Apress)
Pro Bash Programming: Scripting the GNU/Linux shell (2009, Apress)

Jens Schweikhardt

Jun 21, 2020, 6:01:53 AM

Chris F.A. Johnson <cfajo...@cfaj.ca> wrote
in <5sh3sg-...@news.cfaj.ca>:
# On 2020-06-21, Bit Twister wrote:
# ...
#> Whereas with cmp it would be
#>
#> cpm --quiet --status $File_one $File_two
#
# I don't see the --status option in any version of cmp that I have, nor
# in the POSIX spec..
#
#> if [ $? -ne 0 ] ; then
#> echo "$File_one does not match $File_two"
#> exit 1
#> fi
#
# if ! cpm -s "$File_one" "$File_two"
# then
# echo "$File_one does not match $File_two"
# exit 1
# fi

s/cpm/cmp/

Regards,

Jens
--
Jens Schweikhardt http://www.schweikhardt.net/
SIGSIG -- signature too long (core dumped)

Hongyi Zhao

Jun 21, 2020, 7:11:10 AM

On Sunday, June 21, 2020 at 1:41:55 PM UTC+8, Bit Twister wrote:
> On Sat, 20 Jun 2020 19:12:06 -0700 (PDT), Hongyi Zhao wrote:
> > Hi,
>
> Hi yourself.
> I can recommend getting a real Usenet client that follows Usenet
> guidelines and wraps your lines at around 72 characters, plus a free
> Usenet account.
>
> One or more subject matter experts may not bother responding if the
> post is from Google Groups or if they have to reformat your text to
> ~72 characters per line.
>
> > I downloaded the intel parallel studio cluster edition from this
> > location by two downloading tools: http://registrationcenter-download.intel.com/akdlm/irc_nas/tec/16526/parallel_studio_xe_2020_update1_cluster_edition.tgz.
> >
> > Now, I want to check whether these two downloaded files are exactly
> > the same
>
> Personally I would have used cmp.
>
> > but I don't know the corresponding checksum information about them.
>
> Well, pick a checksum type and use that tool's syntax.
>
> > Any hints for doing this job quickly and efficiently?
>
> Think about it: you are using a program to read the binary file and
> produce a checksum value. You do it again for the other file,
> then compare the results.
>
> If using cmp, it reads and compares both files and gives you a
> result/status in one pass.


How about using diff like the following:

$ if ! diff $File_one $File_two >/dev/null 2>&1; then
echo "$File_one does not match $File_two"
exit 1
fi

Bit Twister

Jun 21, 2020, 7:25:57 AM

On Sun, 21 Jun 2020 04:20:53 -0400, Chris F.A. Johnson wrote:
> On 2020-06-21, Bit Twister wrote:
> ...
>> Whereas with cmp it would be
>>
>> cpm --quiet --status $File_one $File_two

Sorry for the cmp misspell.

>
> I don't see the --status option in any version of cmp that I have, nor
> in the POSIX spec..

Ah, frap, too many open xterms. I was looking at the md5sum man
page instead of the cmp page when typing the code snippet.

Thanks to Jens Schweikhardt and you for catching my mistakes/screw-ups.

Janis Papanagnou

Jun 21, 2020, 10:01:42 AM

On 21.06.2020 07:41, Bit Twister wrote:
>> [...]
>
> [...]
>
> That seems more efficient than computing separate checksums and then
> making the comparison.

Indeed. Using a checksum makes sense if you compare new files against
existing ones. In that case you need to compute just one new checksum
per file and have a fast test of checksums only.
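
For that new-against-existing case, a minimal sketch; the file names are
placeholders, and GNU md5sum is assumed:

# Compute and store the checksum of the existing file once.
md5sum < existing.tgz > existing.md5

# Later, compare each new download against the stored digest only.
new_sum=$(md5sum < new-download.tgz)
[ "$new_sum" = "$(cat existing.md5)" ] && echo match || echo differ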

>
> Your checksum code could look somewhat like
>
> set -- $(md5sum $File_one)
> F1_sum="$1"
> set -- $(md5sum $File_two)
> F2_sum="$1"

Using 'set' is unnecessary if you redirect the (BTW quoted) input files
to md5sum; reading from stdin, md5sum prints just the digest and "-", so
the two results can be compared directly.

F1_sum=$(md5sum < "$File_one")
F2_sum=$(md5sum < "$File_two")

>
> if [ "$F1_sum" != "$F2_sum" ] ; then
> echo "$File_one does not match $File_two"
> exit 1
> fi
>
> Whereas with cmp it would be
>
> cpm --quiet --status $File_one $File_two
> if [ $? -ne 0 ] ; then

(And directly testing the 'cmp' command status has already been suggested
upthread.)

> echo "$File_one does not match $File_two"
> exit 1
> fi
>

Janis

Lew Pitcher

Jun 21, 2020, 11:25:43 AM

Whatever tool you use, it will have to read and process the entire contents
of each of the two files. So, depending on the size of the files, /not/ quickly.

You could just compare the two files, with tools like diff(1) or cmp(1)
running in "binary" mode.

You could also just checksum each of the two files and compare the
checksums. That would take md5sum(1) (again, running in "binary" mode).



--
Lew Pitcher
"In Skills, We Trust"

Janis Papanagnou

Jun 21, 2020, 11:52:51 AM

On 21.06.2020 17:25, Lew Pitcher wrote:
> On June 20, 2020 22:12, Hongyi Zhao wrote:
>
>> Hi,
>>
>> I downloaded the intel parallel studio cluster edition from this location
>> by two downloading tools:
>> http://registrationcenter-download.intel.com/akdlm/irc_nas/tec/16526/parallel_studio_xe_2020_update1_cluster_edition.tgz.
>>
>> Now, I want to check whether these two downloaded files are exactly the
>> same, but I don't know the corresponding checksum information about them.
>> Any hints for doing this job quickly and efficiently?
>
> Whatever tool you use, it will have to read and process the entire contents
> of each of the two files. So, depending on the size of the files, /not/ quickly.

Well, not quite; it depends on the data and on the tool. For example, cmp
will have an early exit if there's just status information to return; e.g. a
test on binary data:

$ time cmp f[10]
f0 f1 differ: byte 1, line 1

real 0m0.04s
user 0m0.00s
sys 0m0.00s

$ time cmp f[12]

real 0m11.85s
user 0m1.72s
sys 0m0.89s


Janis

Luuk

Jun 21, 2020, 2:43:23 PM

On 21-6-2020 07:41, Bit Twister wrote:
> On Sat, 20 Jun 2020 19:12:06 -0700 (PDT), Hongyi Zhao wrote:
>> Hi,
>
> Hi yourself.
> I can recommend getting a real Usenet client that follows Usenet
> guidelines and wraps your lines at around 72 characters, plus a free
> Usenet account.
>
> One or more subject matter experts may not bother responding if the
> post is from Google Groups or if they have to reformat your text to
> ~72 characters per line.
>

from: https://tools.ietf.org/html/draft-ietf-usefor-useage-01
paragraph 3.2.3
>> In plain-text articles (those with no MIME headers, or those with a
>> MIME Media Type of "text/plain") posting agents SHOULD endeavour to
>> keep the length of body lines within some reasonable limit. The size
>> of this limit is a matter of policy, the default being to keep within
>> 79 characters at most, and preferably within 72 characters (to allow
>> room for quoting in followups). Except where "format=flowed" is
>> being used (3.1.2.2), the line breaks shown to the poster during
>> editing SHOULD be exactly as they will appear in the posted article.




OK so far, but the text in paragraph 3.1.1, which is also about characters
per line, is more worrying:
>> NOTE: The reason for the figure 79 is to ensure that all lines
>> will fit in a standard 80-column screen without having to be
>> wrapped. The limit is 79 not 80 because, while 80 fit on a
>> line, any character in the last column often forces a line-wrap.

Who has an 80-column, character-oriented screen nowadays?


--
Luuk

Janis Papanagnou

Jun 21, 2020, 2:54:19 PM

On 21.06.2020 20:43, Luuk wrote:
> [big snip]
> Who has an 80-column, character-oriented screen nowadays?

Wrong question.

Janis

Keith Thompson

Jun 21, 2020, 3:16:56 PM

Hongyi Zhao <hongy...@gmail.com> writes:
> On Sunday, June 21, 2020 at 1:41:55 PM UTC+8, Bit Twister wrote:
[...]
>> If using cmp, it reads and compares both files and gives you a
>> result/status in one pass.
>
> How about using diff like the following:
>
> $ if ! diff $File_one $File_two >/dev/null 2>&1; then
> echo "$File_one does not match $File_two"
> exit 1
> fi

diff is designed for text files. It goes to a lot of effort to show you
*how* two files differ, looking for ranges of lines that match. Some
versions could also have problems with some binary files (for example,
there might be line length limits, and a large binary file might appear
to have extremely long lines when interpreted as text). cmp just tells
you whether two binary files differ or not, and if they do at what byte
offset the first difference occurs.

If you don't have both files on the same system, comparing checksums can
avoid the need to copy one of the files, or you can save the checksums
for later comparison. If they're on the same system, just use cmp;
that's what it's for.
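
A minimal sketch of that remote case, assuming sha256sum is installed on
both machines and ssh access is available; the host and file names are
placeholders:

# Hash the local copy and the remote copy without transferring either file.
local_sum=$(sha256sum big.tgz | awk '{print $1}')
remote_sum=$(ssh user@remotehost sha256sum big.tgz | awk '{print $1}')

[ "$local_sum" = "$remote_sum" ] && echo match || echo differ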

[...]

--
Keith Thompson (The_Other_Keith) Keith.S.T...@gmail.com
Working, but not speaking, for Philips Healthcare
void Void(void) { Void(); } /* The recursive call of the void */

Lew Pitcher

Jun 21, 2020, 4:00:09 PM

On June 21, 2020 14:43, Luuk wrote:
[snip]
> Who has an 80-column, character oriented, screen nowadays?

Ever read a book? Or a newspaper? (I know that these are obsolete media,
but...) Ever read a passage that doesn't include paragraph breaks?

The eye doesn't read; the mind does. And, there is an upper limit on the
amount of "clutter" the mind and eye can tolerate before something becomes
unreadable. "Walls of text", long lines, etc can confuse the mind, and you
either lose readability or comprehension at some arbitrary point.

Note that I said /arbitrary/. The point differs between readers.

But, 80 characters comes close to that point. Go much past 80
characters and you get the "wall of text" effect, or lose the point of the
sentence; at around 80 characters and below, text is still comprehensible.

Computers don't have this problem; they eat 64k character lines as easily as
80 character lines. But humans /do/ have this problem; we can read and write
coherently only in blocks about that large.

So, pick a number. 80? 100? 200? At some point, your number will be too
large. Or, you can (mostly) stick with an (to you, obsolete) established
standard and write and read coherent information.

Your choice. I've already made mine.

Cydrome Leader

Jun 23, 2020, 11:15:13 PM

diff --brief <(sha256 file1) <(sha256 file2)

Change sha256 to any checksumming program you like: md5, md5sum, etc.


Keith Thompson

Jun 24, 2020, 5:19:56 AM

Cydrome Leader <pres...@MUNGEpanix.com> writes:
> Hongyi Zhao <hongy...@gmail.com> wrote:
>> I downloaded the intel parallel studio cluster edition from this
>> location by two downloading tools:
>> http://registrationcenter-download.intel.com/akdlm/irc_nas/tec/16526/parallel_studio_xe_2020_update1_cluster_edition.tgz.
>>
>> Now, I want to check whether these two downloaded files are exactly
>> the same, but I don't know the corresponding checksum information about
>> them. Any hints for doing this job quickly and efficiently?
>
> diff --brief <(sha256 file1) <(sha256 file2)
>
> change sha256 to any checksumming program you like. It could be md5,
> md5sum etc.

What is the advantage of that over
cmp file1 file2
?

jo...@schily.net

Jun 24, 2020, 6:30:16 AM

In article <rcugfu$4ja$5...@reader1.panix.com>,
Cydrome Leader <pres...@MUNGEpanix.com> wrote:

>diff --brief <(sha256 file1) <(sha256 file2)

Besides the fact that this is slow....

diff --brief
diff: illegal option -- brief
usage: diff [-abBiNptw] [-c | -e | -f | -h | -n | -q | -u] file1 file2
diff [-abBiNptw] [-C number | -U number] file1 file2
diff [-abBiNptw] [-D string] file1 file2
diff [-abBiNptw] [-c | -e | -f | -h | -n | -q | -u] [-l] [-r] [-s] [-S name] directory1 directory2

--
EMail:jo...@schily.net (home) Jörg Schilling D-13353 Berlin
joerg.s...@fokus.fraunhofer.de (work) Blog: http://schily.blogspot.com/
URL: http://cdrecord.org/private/ http://sourceforge.net/projects/schilytools/files/

Kenny McCormack

Jun 24, 2020, 7:51:05 AM

In article <rcv9vk$5qe$1...@news2.open-news-network.org>,
<jo...@schily.net> wrote:
>In article <rcugfu$4ja$5...@reader1.panix.com>,
>Cydrome Leader <pres...@MUNGEpanix.com> wrote:
>
>>diff --brief <(sha256 file1) <(sha256 file2)
>
>Besides the fact that this is slow....

Actually, if your version of "diff" doesn't support --brief, then this will
run quite quickly indeed.

>diff --brief
>diff: illegal option -- brief
>usage: diff [-abBiNptw] [-c | -e | -f | -h | -n | -q | -u] file1 file2
> diff [-abBiNptw] [-C number | -U number] file1 file2
> diff [-abBiNptw] [-D string] file1 file2
> diff [-abBiNptw] [-c | -e | -f | -h | -n | -q | -u] [-l] [-r] [-s] [-S
>name] directory1 directory2

You are probably using some broken, old version of Unix.

In Linux, "man diff | grep brief" shows that --brief is an alias for -q.

--
Trump has normalized hate.

The media has normalized Trump.

Janis Papanagnou

Jun 24, 2020, 11:25:54 AM

On 24.06.2020 11:19, Keith Thompson wrote:
> Cydrome Leader <pres...@MUNGEpanix.com> writes:
>>
>> diff --brief <(sha256 file1) <(sha256 file2)
>
> What is the advantage of that over
> cmp file1 file2
> ?

The former will make your computer produce more heat, which might be
an advantage during the winter season.

Janis ;-)

jo...@schily.net

Jun 24, 2020, 11:34:46 AM

In article <rcven5$cqo$1...@news.xmission.com>,
Kenny McCormack <gaz...@shell.xmission.com> wrote:
>Actually, if your version of "diff" doesn't support --brief, then this will
>run quite quickly indeed.
>
>>diff --brief
>>diff: illegal option -- brief
>>usage: diff [-abBiNptw] [-c | -e | -f | -h | -n | -q | -u] file1 file2
>> diff [-abBiNptw] [-C number | -U number] file1 file2
>> diff [-abBiNptw] [-D string] file1 file2
>> diff [-abBiNptw] [-c | -e | -f | -h | -n | -q | -u] [-l] [-r] [-s] [-S
>>name] directory1 directory2
>
>You are probably using some broken, old version of Unix.
>
>In Linux, "man diff | grep brief" shows that --brief is an alias for -q.

OK, so you are not on UNIX....

Kaz Kylheku

Jun 24, 2020, 2:06:07 PM

On 2020-06-24, Keith Thompson <Keith.S.T...@gmail.com> wrote:
> Cydrome Leader <pres...@MUNGEpanix.com> writes:
>> Hongyi Zhao <hongy...@gmail.com> wrote:
>>> I downloaded the intel parallel studio cluster edition from this
>>> location by two downloading tools:
>>> http://registrationcenter-download.intel.com/akdlm/irc_nas/tec/16526/parallel_studio_xe_2020_update1_cluster_edition.tgz.
>>>
>>> Now, I want to check whether these two downloaded files are exactly
>>> the same, but I don't know the corresponding checksum information about
>>> them. Any hints for doing this job quickly and efficiently?
>>
>> diff --brief <(sha256 file1) <(sha256 file2)
>>
>> change sha256 to any checksumming program you like. It could be md5,
>> md5sum etc.
>
> What is the advantage of that over
> cmp file1 file2

The advantage is that if you already know that the files are different
and you see a sha256 match anyway, your 15 minutes of fame may
have arrived.

Cydrome Leader

Jun 26, 2020, 12:11:03 AM

Keith Thompson <Keith.S.T...@gmail.com> wrote:
> Cydrome Leader <pres...@MUNGEpanix.com> writes:
>> Hongyi Zhao <hongy...@gmail.com> wrote:
>>> I downloaded the intel parallel studio cluster edition from this
>>> location by two downloading tools:
>>> http://registrationcenter-download.intel.com/akdlm/irc_nas/tec/16526/parallel_studio_xe_2020_update1_cluster_edition.tgz.
>>>
>>> Now, I want to check whether these two downloaded files are exactly
>>> the same, but I don't know the corresponding checksum information about
>>> them. Any hints for doing this job quickly and efficiently?
>>
>> diff --brief <(sha256 file1) <(sha256 file2)
>>
>> change sha256 to any checksumming program you like. It could be md5,
>> md5sum etc.
>
> What is the advantage of that over
> cmp file1 file2

Portability with predictable results. Good luck with figuring out what
version of cmp may or may not exist on a system. Case in point: "tar" as
shipped on Solaris machines is, in general, worthless/obsolete. Who knows
if cmp on older systems has dumb issues like 2GB file limits. This is
probably not an issue for most folks, but the variation in how "classic"
commands run between different vendors and Linux can be large.

You can also save the checksums to compare files later. cmp is all or
nothing. Does it take more CPU time? Yes; so does ssh. No doubt there are
plenty of relics in this group still using telnet, to save power or to
keep the stress on their 10BASE2 networks and their SCO Release 3
machines down.



William Ahern

Jun 26, 2020, 1:00:11 AM

Cydrome Leader <pres...@mungepanix.com> wrote:
> Keith Thompson <Keith.S.T...@gmail.com> wrote:
>> Cydrome Leader <pres...@MUNGEpanix.com> writes:
>>> Hongyi Zhao <hongy...@gmail.com> wrote:
>>>> I downloaded the intel parallel studio cluster edition from this
>>>> location by two downloading tools:
>>>> http://registrationcenter-download.intel.com/akdlm/irc_nas/tec/16526/parallel_studio_xe_2020_update1_cluster_edition.tgz.
>>>>
>>>> Now, I want to check whether these two downloaded files are exactly
>>>> the same, but I don't know the corresponding checksum information about
>>>> them. Any hints for doing this job quickly and efficiently?
>>>
>>> diff --brief <(sha256 file1) <(sha256 file2)
>>>
>>> change sha256 to any checksumming program you like. It could be md5,
>>> md5sum etc.
>>
>> What is the advantage of that over
>> cmp file1 file2
>
> Portability with predictable results. Good luck with figuring out what
> version of cmp may or may not exist on a system.

cmp -s works identically on every AIX, Linux (including coreutils and
busybox), FreeBSD, OpenBSD, NetBSD, macOS, and Solaris system I've tried. By
contrast, there is no singular checksum utility other than POSIX cksum[1]
that exists on all these systems in the default install. For example, my
OpenBSD system has sha256, my Ubuntu Linux system has sha256sum, and my
macOS instance has shasum, all incompatible[2]. And even in environments with
the same command name, the implementations can still be
incompatible--different options and output format.

[1] cksum is pretty much useless because of the weak collision guarantee of
CRC. Though OpenBSD's cksum (uniquely, IIRC) has the -a option for
specifying an alternative algorithm, like sha256.

[2] It is relatively trivial to write shell code that can wrap a uniform
interface around whichever utility is available, if at all. But that seems
obtuse if cmp suffices.
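
A rough sketch of that wrapper idea, using only the tool names and options
mentioned above (sha256sum, shasum -a 256, and OpenBSD's sha256 -q); it
prints just the hex digest so callers can compare strings:

sha256_of() {
    if command -v sha256sum >/dev/null 2>&1; then   # GNU coreutils, busybox
        sha256sum < "$1" | awk '{print $1}'
    elif command -v shasum >/dev/null 2>&1; then    # macOS
        shasum -a 256 "$1" | awk '{print $1}'
    elif command -v sha256 >/dev/null 2>&1; then    # OpenBSD
        sha256 -q "$1"
    else
        echo "sha256_of: no SHA-256 utility found" >&2
        return 1
    fi
}

# Usage: [ "$(sha256_of file1)" = "$(sha256_of file2)" ] && echo match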

Keith Thompson

Jun 26, 2020, 2:11:00 AM

Cydrome Leader <pres...@MUNGEpanix.com> writes:
> Keith Thompson <Keith.S.T...@gmail.com> wrote:
>> Cydrome Leader <pres...@MUNGEpanix.com> writes:
[...]
>>> diff --brief <(sha256 file1) <(sha256 file2)
>>>
>>> change sha256 to any checksumming program you like. It could be md5,
>>> md5sum etc.
>>
>> What is the advantage of that over
>> cmp file1 file2
>
> Portability with predictable results.

Nope.

jo...@schily.net

Jun 26, 2020, 6:23:16 AM

In article <ftbgsg-...@wilbur.25thandClement.com>,
William Ahern <wil...@25thandClement.com> wrote:

>>>> change sha256 to any checksumming program you like. It could be md5,
>>>> md5sum etc.
>>>
>>> What is the advantage of that over
>>> cmp file1 file2
>>
>> Portability with predictable results. Good luck with figuring out what
>> version of cmp may or may not exist on a system.

cmp definitely does not result in a problem caused by a hash clash....

>cmp -s works identically on every AIX, Linux (including coreutils and
>busybox), FreeBSD, OpenBSD, NetBSD, macOS, and Solaris system I've tried. By

cmp -s has been on UNIX for at least 40 years, so do not expect it not to
work.

>[1] cksum is pretty much useless because of the weak collision guarantee of
>CRC. Though, OpenBSD's cksum (uniquely, IIRC) has the -a option for
>specifying alternative algorithm, like sha256.

I find it a bit disappointing to see that a typical UNIX installation still has
no command-line utility for SHA-3 sums.

The only portable implementation I am aware of is "mdigest" from schilytools.

jo...@schily.net

Jun 26, 2020, 6:42:29 AM

In article <rd3sgj$7rs$1...@reader1.panix.com>,
Cydrome Leader <pres...@MUNGEpanix.com> wrote:

>version of cmp may or may not exist on a system. Case in point: "tar" as
>shipped on Solaris machines is, in general, worthless/obsolete. Who knows

The "tar" shipped with Linux is worthless and unreliable.

If cmp fails, this is seen immediately.

If a "tar" implementation advertizes to support inremental backups, multi
volume support, long path name support and more and fails much later when
someone tries to unpack an archive with these properties, this is a real
problem. A "tar", that disregards basic rules for archive structuring thus
creates archives that may not be read by other implementations, this is a
problem. Since this "tar" is frequently even unable to read back own
archives (as mentioned before), this is a real problem.

>if cmp on older systems has dumb issues like 2GB file limits. This is
>probably not an issue for most folks, but the variation in how "classic"
>commands run between different vendors and Linux can be large.

I am not sure what OS you are using, but the Large File Summit was held in
1995. All commercial UNIXes implemented large file support in 1996, so a
UNIX either has a working set of basic utilities, or does not support
large files at all.

And BTW: even Linux is now safe as it started to support large files around
y2000.

Bit Twister

Jun 26, 2020, 6:51:02 AM

On Fri, 26 Jun 2020 10:23:12 -0000 (UTC), jo...@schily.net wrote:
> In article <ftbgsg-...@wilbur.25thandClement.com>,
> William Ahern <wil...@25thandClement.com> wrote:
>
>>>>> change sha256 to any checksumming program you like. It could be md5,
>>>>> md5sum etc.
>>>>
>>>> What is the advantage of that over
>>>> cmp file1 file2
>>>
>>> Portability with predictable results. Good luck with figuring out what
>>> version of cmp may or may not exist on a system.
>
> cmp definitely does not result in a problem caused by a hash clash....
>
>>cmp -s works identically on every AIX, Linux (including coreutils and
>>busybox), FreeBSD, OpenBSD, NetBSD, macOS, and Solaris system I've tried. By
>
> cmp -s has been on UNIX for at least 40 years, so do not expect it not to
> work.
>
>>[1] cksum is pretty much useless because of the weak collision guarantee of
>>CRC. Though OpenBSD's cksum (uniquely, IIRC) has the -a option for
>>specifying an alternative algorithm, like sha256.
>
> I find it a bit disappointing to see that a typical UNIX installation still has
> no command-line utility for SHA-3 sums.
>
> The only portable implementation I am aware of is "mdigest" from schilytools.

Speaking of which, I would think the
https://sourceforge.net/projects/cdrtools/files/alpha/
Download Latest Version link would point to a more current release.

I also wasted a fair amount of time trying to find the latest release of
cdrtools. It seems they are now in something like schily-2020-06-09.tar.bz2.

jo...@schily.net

Jun 26, 2020, 6:59:14 AM

In article <slrnrfbkog.s...@wb.home.test>,
Bit Twister <BitTw...@mouse-potato.com> wrote:

>Speaking of which, I would think the
> https://sourceforge.net/projects/cdrtools/files/alpha/
>Download Latest Version link would point to a more current release.
>
>I also wasted a fair amount of time trying to find the latest release of
>cdrtools. It seems they are now in something like schily-2020-06-09.tar.bz2.

Well, I thought that it is sufficient to have that in the README that is
displayed when you go to that directory.

Do you believe I should create a symlink from the cdrtools download to the
schilytools download?

I am not sure whether this is supported by the security rules on SF.

Making releases is a time-consuming task. This is why I stopped making separate
releases for every project 6 years ago.

Cydrome Leader

Jun 27, 2020, 2:20:48 AM

jo...@schily.net wrote:
> In article <rd3sgj$7rs$1...@reader1.panix.com>,
> Cydrome Leader <pres...@MUNGEpanix.com> wrote:
>
>>version of cmp may or may not exist on a system. Case in point: "tar" as
>>shipped on Solaris machines is, in general, worthless/obsolete. Who knows
>
> The "tar" shipped with Linux is worthless and unreliable.

So, uh, what's your favorite version of tar from an operating system that
died or went EOL?

Solaris? SCO? Tru64? Irix? No doubt they're all quite compatible in every
way possible.

> If cmp fails, this is seen immediately.

Yeah, I love to see EOF errors instead of a check of file sizes from the
start. Anything as old as biff has to be a great program.

> If a "tar" implementation advertizes to support inremental backups, multi
> volume support, long path name support and more and fails much later when
> someone tries to unpack an archive with these properties, this is a real
> problem. A "tar", that disregards basic rules for archive structuring thus
> creates archives that may not be read by other implementations, this is a
> problem. Since this "tar" is frequently even unable to read back own
> archives (as mentioned before), this is a real problem.

Example, please. I agree the documentation for GNU tar is complete and utter
shit, but I'd like to see this multi-volume, incremental, long-file-name
create/extract problem.

>>if cmp on older systems has dumb issues like 2GB file limits. This is
>>probably not an issue for most folks, but the variation in how "classic"
>>commands run between different vendors and Linux can be large.
>
> I am not sure what OS you are using, but the Large File Summit was held in
> 1995. All commercial UNIXes implemented large file support in 1996, so a
> UNIX either has a working set of basic utilities, or does not support
> large files at all.


Solaris 10 shipped with utilities that cannot handle files past 2GB, even
though UFS and ZFS could. There's plenty of junk that came with the
commercial operating systems.

> And BTW: even Linux is now safe as it started to support large files around
> y2000.

In the real world, outside summits, conferences, and white papers, you find
lots of limitations and problems that should not exist, but do. Bash was
broken for years, and nobody noticed, and the code was out in the open. The
patches for bash 4 actually broke some terrible code I had to work with.
cmp on your Data General isn't getting any updates.

Keith Thompson

Jun 27, 2020, 2:54:23 AM

Cydrome Leader <pres...@MUNGEpanix.com> writes:
> jo...@schily.net wrote:
[...]
>> If cmp fails, this is seen immediately.
>
> Yeah, I love to see EOF errors instead of a check of file sizes from the
> start. Anything as old as biff has to be a great program.

[...]

A quick experiment with GNU diffutils cmp version 3.7 shows that with
"cmp -s" (suppress all normal output) it doesn't read its input files if
they differ in size (or if they're the same file). I don't know whether
other versions of cmp implement this particular bit of cleverness.

Without "-s", of course, it has to read up to the point where the input
files differ.
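
One quick way to see that size short-circuit (a sketch, not from the
thread; truncate is the GNU coreutils tool, and the sparse test files
occupy almost no disk space):

truncate -s 1G a
truncate -s 2G b
time cmp -s a b   # exits almost instantly: the sizes already differ
echo $?           # 1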

jo...@schily.net

Jun 27, 2020, 5:19:50 AM

In article <rd6oft$f0d$1...@reader1.panix.com>,
Cydrome Leader <pres...@MUNGEpanix.com> wrote:
>jo...@schily.net wrote:
>> In article <rd3sgj$7rs$1...@reader1.panix.com>,
>> Cydrome Leader <pres...@MUNGEpanix.com> wrote:

>> The "tar" shipped with Linux is worthless and unreliable.
>
>So, uh, what's your favorite version of tar from an operating system that
>died or went EOL?

I would not be so harsh as to say that Linux died; it just has serious problems
with fixing known bugs. I have bugs reported in 1993 against gtar that are not
yet fixed, and I reported a bug against gmake in 1998 that was fixed early
this year, but at the same time an even worse bug was introduced. Before,
gmake did not apply make rules for include files before reading them; now
it applies the rules in a parallelized way without a concept for serializing.
So with the current gmake, you may need to run gmake many times to get through
a project, as it aborts because it applies rules in the wrong order.


>> If a "tar" implementation advertizes to support inremental backups, multi
>> volume support, long path name support and more and fails much later when
>> someone tries to unpack an archive with these properties, this is a real
>> problem. A "tar", that disregards basic rules for archive structuring thus
>> creates archives that may not be read by other implementations, this is a
>> problem. Since this "tar" is frequently even unable to read back own
>> archives (as mentioned before), this is a real problem.
>
>Example, please. I agree the documentation for GNU tar is complete and utter
>shit, but I'd like to see this multi-volume, incremental, long-file-name
>create/extract problem.

I recommend you read the gtar mailing list archives; they are full of bug
reports verifying that it is too unreliable to be considered for serious
work.


>Solaris 10 shipped with utilities that cannot handle files past 2GB, even
>though UFS and ZFS could. There's plenty of junk that came with the
>commercial operating systems.

Do you have more than FUD?


>In the real world, outside summits, conferences, and white papers, you find
>lots of limitations and problems that should not exist, but do. Bash was
>broken for years, and nobody noticed, and the code was out in the open. The
>patches for bash 4 actually broke some terrible code I had to work with.
>cmp on your Data General isn't getting any updates.

bash is still one of the better tools in the Linux world. It is a one-man show,
and that is a guarantee of premium quality.

Let me list the POSIX-compliant shells that are preferable over the others:

bosh Enhanced Bourne Shell - a one-man show
ksh93 A one-man show until recently, and good for that time
mksh MirBSD Korn Shell, based on the broken pdksh, but as a one-man
show mksh is of premium quality
bash A one-man show

The ksh93 modified by several Red Hat people is just a nightmare:
non-portable, slow, and missing many of the important features.


But since you asked, let me go back to gtar....

multi-volume A volume change that happens inside an extended header causes
gtar to be unable to read back follow up volumes. This is a
conceptional bug that is hard to fix.
Probability to happen in real life: 1-5% of all cases.

Reported in September 2004

incrementals A renamed directory at top level results in ***huge***
archives that usually overflow the filesystem while unpacking
such a series of incrementals for a restore.
This is a conceptional bug from the used proprietary archive
enhancements that cannot be fixed.

Reported in September 2004

incrementals A renamed directory followed by creating a new directory of the
same name results in an abort while trying to restore.
This is a conceptional bug from the used proprietary archive
enhancements that cannot be fixed.

Reported in September 2004

long paths Try unpacking the archive star/testscripts/longpath.tar.bz2
from the schilytools tarball.

symlinks Unpacking symlinks has caused varying problems for at least
20 years. Caused by strange "security" algorithms in gtar.
Probability low, but serious...

unreliability In general, it has not been verified that archives created
by gtar are later accepted by gtar. Such bug reports come
up every 2-5 years. Typical symptom: "...skipping to next
header" for an archive without visible problems that
unpacks fine with other implementations.

There are many options but too little functionality behind those options. This is
a result of missing overall planning for a global concept.

Cydrome Leader

Jun 28, 2020, 10:45:10 PM

We'll leave that to you, as you have noted some tar problems while not
skipping a beat to peddle, or reference, some junk from your website.
Sounds pretty cool. Show me how to create one of these errors, or all at
once for that matter. I'd love to test this on various systems.


jo...@schily.net

Jun 29, 2020, 4:52:13 AM

In article <rdbkjj$f8m$3...@reader1.panix.com>,
Cydrome Leader <pres...@MUNGEpanix.com> wrote:
>jo...@schily.net wrote:

>>>Solaris 10 shipped with utilities that cannot handle files past 2GB, even
>>>though UFS and ZFS could. There's plenty of junk that came with the
>>>commercial operating systems.
>>
>> Do you have more than FUD?
>
>We'll leave that to you, as you have noted some tar problems while not

In other words, you have no evidence, but you like the pluralis majestatis.


>> But since you asked, let me go back to gtar....
>>
>> multi-volume A volume change that happens inside an extended header causes
...
>> incrementals A renamed directory at top level results in ***huge***
...
>> incrementals A renamed directory followed by creating a new directory of the
...
VVVVVVVVVVVVVVVVVVVVVVVVVVVVVVVVVVVVVVVVVVVVVVVVVVVVVVVVVVVVVVV
>> long paths Try unpacking the archive star/testscripts/longpath.tar.bz2
>> from the schilytools tarball.
!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!

>> symlinks Unpacking symlinks has caused varying problems for at least
...
>> unreliability In general, it has not been verified that archives created
...
>Sounds pretty cool. Show me how to create one of these errors, or all at
>once for that matter. I'd love to test this on various systems.

I do not need to comment on that joke.

For the other bugs, I posted scripts to reproduce them long ago; verify e.g.
at Stack Exchange.

Cydrome Leader

Aug 8, 2020, 2:48:51 AM

Did you lose all your bitcoins in a tar catastrophe or something?