
Comparing Binary Files

lawren...@gmail.com

Oct 3, 2016, 12:52:18 AM
Instead of, for example

diff /dev/dvd disc.iso

which will only say “binary files differ” and nothing more, I use

diff -u <(xxd /dev/dvd) <(xxd disc.iso)

to compare hex dumps of the files instead.

This works great--for example, showing me that the only difference is a few extra sectors of zero padding at the end of the disc. However, it seems to use a lot of memory, even when no differences are found...

Janis Papanagnou

Oct 3, 2016, 1:25:45 AM
On 03.10.2016 06:52, lawren...@gmail.com wrote:
> Instead of, for example
>
> diff /dev/dvd disc.iso
>
> which will only say “binary files differ” and nothing more, I use
>
> diff -u <(xxd /dev/dvd) <(xxd disc.iso)
>
> to compare hex dumps of the files instead.
>
> This works great--for example, showing me that the only difference is a few
> extra sectors of zero padding at the end of the disc.

That does not seem to be a satisfying solution in the general case; even
a simple offset of a few additional bytes at the beginning of one file
will show the whole files as differing. In that case a “binary files differ”
message would be preferable. For the case of changed bytes or additional
data at the end of one file it's okay.
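
The offset case is easy to reproduce on a small scale. The following is only a sketch: the file names are invented for the example, and od(1) stands in for xxd so that it also runs where xxd is not installed. Both print 16 bytes per line by default, so the effect is the same.

```shell
# Sketch: one byte inserted at the front shifts every 16-byte dump line.
# od(1) is used instead of xxd; all file names are made up for the example.
tmp=$(mktemp -d)
seq 1 40 > "$tmp/orig.bin"                                  # small, non-repeating data
{ printf 'X'; cat "$tmp/orig.bin"; } > "$tmp/shifted.bin"   # same data, one extra leading byte

od -An -v -tx1 "$tmp/orig.bin"    > "$tmp/orig.hex"
od -An -v -tx1 "$tmp/shifted.bin" > "$tmp/shifted.hex"

# diff exits with status 1 when the inputs differ, hence the "|| true".
diff "$tmp/orig.hex" "$tmp/shifted.hex" > "$tmp/out.diff" || true

total=$(( $(wc -l < "$tmp/orig.hex") ))
# Lines of the first dump that diff could not match anywhere in the second.
unmatched=$(grep -c '^<' "$tmp/out.diff" || true)
echo "dump lines: $total, unmatched by diff: $unmatched"
rm -rf "$tmp"
```

When the two counts are equal, no line of the first dump survived the shift, which is the "whole files differing" outcome described above.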

> However, it seems to use a lot of memory, even when no differences are found...

Why? You are creating just two more processes, and the shell establishes two
pipes for the diff. I would not expect "a lot" of memory demand for that.

Janis

lawren...@gmail.com

Oct 3, 2016, 1:48:07 AM
On Monday, October 3, 2016 at 6:25:45 PM UTC+13, Janis Papanagnou wrote:
>
> On 03.10.2016 06:52, Lawrence D’Oliveiro wrote:
>
>> diff -u <(xxd /dev/dvd) <(xxd disc.iso)
>
> ... even simple byte offsets by a few additional bytes at the beginning of
> one file will show the whole files differing.

Won’t happen in this case.

Janis Papanagnou

Oct 3, 2016, 3:12:54 AM
Your posting sounded like a general suggestion, and could be perceived as
that by unaware readers.

Janis

Rakesh Sharma

Oct 4, 2016, 6:03:58 AM
Did you try the "tkdiff" tool?

Ivan Shmakov

Oct 4, 2016, 2:03:03 PM
>>>>> Janis Papanagnou <janis_pa...@hotmail.com> writes:
>>>>> On 03.10.2016 06:52, lawren...@gmail.com wrote:

[...]

>> diff -u <(xxd /dev/dvd) <(xxd disc.iso)

>> to compare hex dumps of the files instead.

>> This works great--for example, showing me that the only difference
>> is a few extra sectors of zero padding at the end of the disc.

> That does not seem to be a satisfying solution in the general case;
> even simple byte offsets by a few additional bytes at the beginning
> of one file will show the whole files differing. In that case a
> "binary files differ" message would be preferable. For the case of
> changed bytes or additional data at the end of one file it's okay.

As would be $ cmp -l, I suppose.
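
For reference, a minimal sketch of what cmp -l reports, using two made-up eight-byte files of equal length. For every position where the files disagree it prints the 1-based byte offset followed by the two byte values in octal.

```shell
# Sketch: list every differing byte with cmp -l (file names are invented).
tmp=$(mktemp -d)
printf 'abcdefgh' > "$tmp/a.bin"
printf 'abXdefgY' > "$tmp/b.bin"   # bytes 3 and 8 changed

# cmp exits with status 1 when the files differ, hence the "|| true".
report=$(cmp -l "$tmp/a.bin" "$tmp/b.bin" || true)
printf '%s\n' "$report"

differing=$(( $(printf '%s\n' "$report" | wc -l) ))
echo "differing bytes: $differing"
rm -rf "$tmp"
```

Like the hex-dump diff, this is a position-by-position comparison, so it shares the limitation with inserted bytes: after an insertion, nearly every later byte is reported.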

>> However, it seems to use a lot of memory, even when no differences
>> are found...

> Why? You are creating just two more processes, and the shell
> establishes two pipes for the diff. I would not expect "a lot"
> of memory demand for that.

Two reasons I can readily think of:

* xxd(1) can easily turn a 4.5 GB DVD+R image into a 20+ GB
hexdump;

* upon seeing that the files are indeed "binary", diff(1) may
refrain from running a memory-costly LCS algorithm and resort
to "exit (1) on the first byte that differs; exit (0) if none"
behavior that costs virtually no memory at all.
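
The size estimate in the first point can be checked with a little arithmetic. This is a sketch under one assumption that is not from the thread: that xxd's default layout produces one 68-byte text line for every 16 input bytes. The helper name hexdump_size is made up.

```shell
# Assumption: each xxd output line is 68 bytes for 16 input bytes
# (8-digit offset, colon, eight groups of four hex digits, two spaces,
# 16 ASCII characters, newline). Some xxd versions print a shorter
# offset, which changes the figure slightly.
bytes_in_per_line=16
bytes_out_per_line=68

hexdump_size() {
    # Round the input up to whole lines, then scale. This slightly
    # overestimates when the last line is partial.
    lines=$(( ($1 + bytes_in_per_line - 1) / bytes_in_per_line ))
    echo $(( lines * bytes_out_per_line ))
}

image=4500000000   # "4.5 GB" taken as 4.5e9 bytes
echo "hex dump of $image bytes: about $(hexdump_size "$image") bytes"
```

Under that assumption the dump is a little over four times the size of the image, which is the same order of magnitude as the figure quoted above.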

--
FSF associate member #7257 58F8 0F47 53F5 2EB2 F6A5 8916 3013 B6A0 230E 334A

lawren...@gmail.com

Oct 4, 2016, 2:32:26 PM
On Wednesday, October 5, 2016 at 7:03:03 AM UTC+13, Ivan Shmakov wrote:
> * upon seeing that the files are indeed "binary", diff(1) may
> refrain from running a memory-costly LCS algorithm ...

I wonder why it needs to do that before actually seeing the first difference...

lawren...@gmail.com

Oct 5, 2016, 3:18:50 AM
On Monday, October 3, 2016 at 8:12:54 PM UTC+13, Janis Papanagnou wrote:
> Your posting sounded like a general suggestion ...

Hmm ... how would you make it general?

Seems to me you’d have to have one byte per line:

diff -u <(xxd -c1 file1 | cut -c11,12) <(xxd -c1 file2 | cut -c11,12)

and watch out for the memory-hogging...
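
A small-scale sketch of the one-byte-per-line idea, to show that it does cope with an inserted byte. It is not the exact pipeline above: od(1) replaces "xxd -c1 | cut" so that the result does not depend on the width of xxd's offset column, temporary files replace the process substitution, and byte_lines is a made-up helper name.

```shell
# One hex byte per line. od(1) is used here in place of "xxd -c1 | cut".
byte_lines() {
    # Split od's space-separated hex bytes onto separate lines and
    # drop the empty lines that the splitting leaves behind.
    od -An -v -tx1 "$1" | tr -s ' ' '\n' | grep .
}

tmp=$(mktemp -d)
seq 1 40 > "$tmp/file1"
{ printf 'X'; cat "$tmp/file1"; } > "$tmp/file2"   # one byte inserted at the front

byte_lines "$tmp/file1" > "$tmp/file1.hex"
byte_lines "$tmp/file2" > "$tmp/file2.hex"

# diff exits with status 1 when the inputs differ, hence the "|| true".
diff "$tmp/file1.hex" "$tmp/file2.hex" > "$tmp/out.diff" || true
cat "$tmp/out.diff"

added=$(grep -c '^>' "$tmp/out.diff" || true)
removed=$(grep -c '^<' "$tmp/out.diff" || true)
echo "added lines: $added, removed lines: $removed"
rm -rf "$tmp"
```

The cost is that diff now sees one line per input byte, so the memory concern applies even more strongly to large files.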

Janis Papanagnou

Oct 5, 2016, 4:01:08 AM
On 04.10.2016 20:02, Ivan Shmakov wrote:
>>>>>> Janis Papanagnou <janis_pa...@hotmail.com> writes:
>>>>>> On 03.10.2016 06:52, lawren...@gmail.com wrote:
> [...]
>
> >> However, it seems to use a lot of memory, even when no differences
> >> are found...
>
> > Why? You are creating just two more processes, and the shell
> > establishes two pipes for the diff. I would not expect "a lot"
> > of memory demand for that.
>
> Two reasons I can readily think of:
>
> * xxd(1) can easily turn a 4.5 GB DVD+R image into a 20+ GB
> hexdump;

And where is the demand for "a lot of memory" in that process? There isn't any,
because you have only the constant-size IPC buffers; no bulky files are
created, nor would the diff command need to enlarge its internal buffers,
since all it sees is file handles.

(An increase in processing time would have been a different matter.)

Janis

Michael Paoli

Oct 6, 2016, 4:32:40 AM
Of course that's resource intensive, as diff will compare the full contents
of the hex dumps of each.

There's the highly standard Unix utility cmp(1).
A couple quick examples:
$ echo short > short
$ (cat short; 2>>/dev/null dd if=/dev/zero bs=200 count=1) > longer
$ cmp short longer
cmp: EOF on short
$ (cat short /dev/zero) | cmp - longer
cmp: EOF on longer
$ echo nope | cmp - short
- short differ: char 1, line 1
$

By default, cmp(1) only reads the files until it detects a difference.
There is no need to convert the format (e.g. to hex), nor to read beyond
the first difference detected, so it is much more efficient.
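
Building on that, here is a hedged sketch of how the original question (does the disc differ from the image only by zero padding at the end?) might be answered with cmp alone. The helper name and the test files are made up, and the sketch assumes regular files whose size wc -c can report cheaply; a raw device would be read in full just to get its length.

```shell
# Hypothetical helper: succeeds if $2 is $1 followed only by NUL bytes.
differs_only_by_zero_padding() {
    short_size=$(( $(wc -c < "$1") ))
    # 1. The longer file must start with the shorter file's contents.
    head -c "$short_size" "$2" | cmp -s - "$1" || return 1
    # 2. Everything after that must be NUL: deleting NULs leaves nothing.
    leftover=$(tail -c +"$(( short_size + 1 ))" "$2" | tr -d '\000' | wc -c)
    [ "$(( leftover ))" -eq 0 ]
}

# Small demonstration with invented files.
tmp=$(mktemp -d)
printf 'payload data' > "$tmp/image.iso"
{ cat "$tmp/image.iso"; head -c 2048 /dev/zero; } > "$tmp/disc.raw"

if differs_only_by_zero_padding "$tmp/image.iso" "$tmp/disc.raw"; then
    echo "only trailing zero padding differs"
else
    echo "real differences found"
fi
rm -rf "$tmp"
```

Every stage streams its input, so memory use should stay flat regardless of file size, unlike diffing the two hex dumps.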

lawren...@gmail.com

Oct 6, 2016, 5:43:37 PM
On Thursday, October 6, 2016 at 9:32:40 PM UTC+13, Michael Paoli wrote:
> ... diff will compare the full contents of the hex dumps of each.

But why should it use up memory buffering up sections that are identical?

Ivan Shmakov

Oct 7, 2016, 8:30:57 AM
>>>>> Janis Papanagnou <janis_pa...@hotmail.com> writes:
>>>>> On 04.10.2016 20:02, Ivan Shmakov wrote:
>>>>> Janis Papanagnou <janis_pa...@hotmail.com> writes:
>>>>> On 03.10.2016 06:52, lawren...@gmail.com wrote:

>>>> However, it seems to use a lot of memory, even when no differences
>>>> are found...

>>> Why? You are creating just two more processes, and the shell
>>> establishes two pipes for the diff. I would not expect "a lot"
>>> of memory demand for that.

>> Two reasons I can readily think of:

>> * xxd(1) can easily turn a 4.5 GB DVD+R image into a 20+ GB hexdump;

> And where is the demand of "lot of memory" in that process?

In the 'diff' part of the pipeline.

But if the real question is "why the longest common subsequence
algorithm, as implemented by diff(1), requires more memory to
compare larger inputs", then I'm afraid that being unfamiliar
with both the algorithm and (the insides of) the implementation,
I cannot answer that.

(Note the "maxresident" values in the transcript below.)

$ time diff -u \
-- <(seq -f %07.f 0 99999) <(seq -f %07.f 0 99999 | sed -e 12345d)
--- /dev/fd/63 2016-10-07 12:27:54.594595945 +0000
+++ /dev/fd/62 2016-10-07 12:27:54.598595754 +0000
@@ -12342,7 +12342,6 @@
 0012341
 0012342
 0012343
-0012344
 0012345
 0012346
 0012347
Command exited with non-zero status 1
0.00user 0.00system 0:00.25elapsed 3%CPU (0avgtext+0avgdata 2536maxresident)k
0inputs+0outputs (0major+693minor)pagefaults 0swaps
$ time diff -u -- \
<(seq -f %07.f 0 999999) <(seq -f %07.f 0 999999 | sed -e 123456d)
--- /dev/fd/63 2016-10-07 12:24:35.740077089 +0000
+++ /dev/fd/62 2016-10-07 12:24:35.744076899 +0000
@@ -123453,7 +123453,6 @@
 0123452
 0123453
 0123454
-0123455
 0123456
 0123457
 0123458
Command exited with non-zero status 1
0.02user 0.03system 0:02.84elapsed 2%CPU (0avgtext+0avgdata 16600maxresident)k
0inputs+0outputs (0major+4208minor)pagefaults 0swaps
$ command time diff -u -- \
<(seq -f %07.f 0 9999999) <(seq -f %07.f 0 9999999 | sed -e 1234567d)
--- /dev/fd/63 2016-10-07 12:23:28.371276047 +0000
+++ /dev/fd/62 2016-10-07 12:23:28.375275857 +0000
@@ -1234564,7 +1234564,6 @@
 1234563
 1234564
 1234565
-1234566
 1234567
 1234568
 1234569
Command exited with non-zero status 1
0.20user 0.34system 0:30.85elapsed 1%CPU (0avgtext+0avgdata 157224maxresident)k
250inputs+0outputs (5major+39359minor)pagefaults 0swaps
$

[...]