"find ... -exec sort -u {} ..." vs "find ... -exec cat {} \;| sort -u ... " vs "find ... | xargs sort -u"


hongy...@gmail.com

Nov 27, 2021, 8:25:27 AM
See my following testings:

$ find ./source -type f -exec sort -u {} -o aaa \;
$ find ./source -type f | xargs sort -o bbb -u
$ find ./source -type f -exec cat {} \;| sort -u -o ccc
$ wc aaa bbb ccc
17715 17898 157889 aaa
875968 1063102 9971040 bbb
875968 1063102 9971040 ccc
1769651 2144102 20099969 total

So, the first usage seems to be incorrect. But I can't understand why such a mistake would occur. Any hints will be highly appreciated.

Regards,
HZ

Chris Elvidge

Nov 27, 2021, 9:11:48 AM
What, exactly, are you trying to do?

--
Chris Elvidge
England

Lew Pitcher

Nov 27, 2021, 9:27:32 AM
On Sat, 27 Nov 2021 05:25:24 -0800, hongy...@gmail.com wrote:

> See my following testings:
>
> $ find ./source -type f -exec sort -u {} -o aaa \;
> $ find ./source -type f | xargs sort -o bbb -u
> $ find ./source -type f -exec cat {} \;| sort -u -o ccc
> $ wc aaa bbb ccc
> 17715 17898 157889 aaa
> 875968 1063102 9971040 bbb
> 875968 1063102 9971040 ccc
> 1769651 2144102 20099969 total
>
> So, the first usage seems to be incorrect. But I can't understand why
> such a mistake would occur.

There's no mistake. The first command doesn't do the same thing
as the second and third commands do.

The first command overwrites file aaa with the sorted contents of each
file found. While file aaa will, at times, contain the contents of each
file (sorted, of course), its final contents are the sorted contents of
the /last/ file found by the find(1) command.

The other two commands /attempt/ to sort the contents of /all/ the found
files into files bbb and ccc respectively.

The second command depends on xargs(1) to provide sort(1) with a list
of input files. As the size of this list depends on both the number of
files found, /and/ the maximum size of an argument vector (argv[]), there
is a chance that the file bbb will only contain the sorted contents of
a subset of the files found by find(1).

The third command uses cat(1) to concatenate the contents of all the found
files into one stream that sort(1) will sort into file ccc.
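
The overwrite behaviour of the first command can be reproduced on a small scale. The following is only a sketch: the temporary directory, the three file names and their contents are made up for illustration, and it assumes a sort(1) that accepts -o after the input file, as the command in the original post does.

```shell
# Sketch: why "-exec sort -u {} -o out \;" keeps only one file's data.
# The directory, file names and contents below are made up for illustration.
tmp=$(mktemp -d)
mkdir "$tmp/source"
printf 'b\na\n' > "$tmp/source/one"
printf 'd\nc\n' > "$tmp/source/two"
printf 'f\ne\n' > "$tmp/source/three"

# One sort per file; every run truncates and rewrites the same output file.
find "$tmp/source" -type f -exec sort -u {} -o "$tmp/aaa" \;

# One sort over the concatenated stream; nothing is overwritten.
find "$tmp/source" -type f -exec cat {} \; | sort -u -o "$tmp/ccc"

aaa_lines=$(wc -l < "$tmp/aaa" | tr -d ' ')
ccc_lines=$(wc -l < "$tmp/ccc" | tr -d ' ')
echo "aaa: $aaa_lines lines, ccc: $ccc_lines lines"
rm -rf "$tmp"
```

With three two-line files, aaa should end up with 2 lines (whichever file find visited last) and ccc with all 6.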

> Any hints will be highly appreciated.
>
> Regards,
> HZ




--
Lew Pitcher
"In Skills, We Trust"

hongy...@gmail.com

Nov 27, 2021, 8:53:49 PM
On Saturday, November 27, 2021 at 10:27:32 PM UTC+8, Lew Pitcher wrote:
> On Sat, 27 Nov 2021 05:25:24 -0800, hongy...@gmail.com wrote:
>
> > See my following testings:
> >
> > $ find ./source -type f -exec sort -u {} -o aaa \;
> > $ find ./source -type f | xargs sort -o bbb -u
> > $ find ./source -type f -exec cat {} \;| sort -u -o ccc
> > $ wc aaa bbb ccc
> > 17715 17898 157889 aaa
> > 875968 1063102 9971040 bbb
> > 875968 1063102 9971040 ccc
> > 1769651 2144102 20099969 total
> >
> > So, the first usage seems to be incorrect. But I can't understand why
> > such a mistake would occur.
> There's no mistake. The first command doesn't do the same thing
> as the second and third commands do.
>
> The first command overwrites file aaa with the sorted contents of each
> file found. While file aaa will, at times, contain the contents of each
> file (sorted, of course), its final contents are the sorted contents of
> the /last/ file found by the find(1) command.

Thank you for your prompt reply. Using the following method gives the same results as the other methods:

$ find ./source -type f -exec sort -u {} + > aaa


> The other two commands /attempt/ to sort the contents of /all/ the found
> files into files bbb and ccc respectively.
>
> The second command depends on xargs(1) to provide sort(1) with a list
> of input files. As the size of this list depends on both the number of
> files found, /and/ the maximum size of an argument vector (argv[]), there
> is a chance that the file bbb will only contain the sorted contents of
> a subset of the files found by find(1).
>
> The third command uses cat(1) to concatenate the contents of all the found
> files into one stream that sort(1) will sort into file ccc.

Therefore, the second method is not as reliable as the following two methods and should be avoided:

$ find ./source -type f -exec sort -u {} + > aaa
$ find ./source -type f -exec cat {} + | sort -u -o ccc

Regards,
HZ

hongy...@gmail.com

Nov 27, 2021, 9:04:48 PM
Based on the following explanation of the find command man page:

$ man find | egrep -A 12 -- '-exec command \{\} +'
-exec command {} +
This variant of the -exec action runs the specified command on the selected
files, but the command line is built by appending each selected file name at
the end; the total number of invocations of the command will be much less than
the number of matched files. The command line is built in much the same way
that xargs builds its command lines. Only one instance of `{}' is allowed
within the command, and (when find is being invoked from a shell) it should be
quoted (for example, '{}') to protect it from interpretation by shells. The
command is executed in the starting directory. If any invocation with the `+'
form returns a non-zero value as exit status, then find returns a non-zero
exit status. If find encounters an error, this can sometimes cause an immedi‐
ate exit, so some pending commands may not be run at all. This variant of
-exec always returns true.

It seems that using the + variant of the -exec action is faster:

$ time find ./source -type f -exec cat {} \;| sort -u -o ccc

real 0m3.251s
user 0m3.188s
sys 0m0.066s
$ time find ./source -type f -exec cat {} + | sort -u -o ccc

real 0m2.895s
user 0m2.827s
sys 0m0.075s

Regards,
HZ

Janis Papanagnou

Nov 27, 2021, 9:36:35 PM
On 28.11.2021 03:04, hongy...@gmail.com wrote:
> [...]
>
> It seems that using the + variant of the -exec action is faster:

Yes, but your numbers below are not very conclusive; a difference of
that size might just as well result from caching effects. Often you get
an order of magnitude or more of a speed increase. Note that post-processing
with sort unnecessarily affects any speed comparisons of find usage
variants. Note also that the find built-in -exec/+ could also be done
using xargs (-print0 | xargs -0).
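
To compare the variants without sort in the picture, one can simply count how often the command is started. This is a rough sketch, not a benchmark: the five empty files are made up, and "sh -c 'echo run'" is a stand-in for cat that prints one line per invocation.

```shell
# Sketch: count how many times find starts the command with "\;" versus "+".
# The five empty files are made up; "sh -c 'echo run'" stands in for cat.
tmp=$(mktemp -d)
for n in 1 2 3 4 5; do : > "$tmp/file$n"; done

# "\;" starts one process per file found.
per_file=$(find "$tmp" -type f -exec sh -c 'echo run' sh {} \; | wc -l | tr -d ' ')

# "+" appends as many names as fit, so a handful of files needs a single process.
batched=$(find "$tmp" -type f -exec sh -c 'echo run' sh {} + | wc -l | tr -d ' ')

# The xargs spelling of the same batching, with NUL-separated names.
via_xargs=$(find "$tmp" -type f -print0 | xargs -0 sh -c 'echo run' sh | wc -l | tr -d ' ')

echo "per file: $per_file, with +: $batched, with xargs -0: $via_xargs"
rm -rf "$tmp"
```

For five files this should report 5 starts for "\;" and 1 each for "+" and xargs -0; the process start-up cost saved that way is where the speed-up comes from.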

Janis

hongy...@gmail.com

Nov 28, 2021, 7:29:30 PM
On Sunday, November 28, 2021 at 10:36:35 AM UTC+8, Janis Papanagnou wrote:
> On 28.11.2021 03:04, hongy...@gmail.com wrote:
> > [...]
> >
> > It seems that using the + variant of the -exec action is faster:
> Yes, but your numbers below are not very conclusive; a difference of
> that size might just as well result from caching effects. Often you get
> an order of magnitude or more of a speed increase. Note that post-processing
> with sort unnecessarily affects any speed comparisons of find usage
> variants. Note also that the find built-in -exec/+ could also be done
> using xargs (-print0 | xargs -0).

Lew Pitcher pointed out the following shortcoming of the xargs-based solution [1]:

The second command depends on xargs(1) to provide sort(1) with a list
of input files. As the size of this list depends on both the number of
files found, /and/ the maximum size of an argument vector (argv[]), there
is a chance that the file bbb will only contain the sorted contents of
a subset of the files found by find(1).

[1] https://groups.google.com/g/comp.unix.shell/c/ha5t3U54GmY/m/M3tIUPtPBAAJ

Hence, I try to avoid using xargs.

Janis Papanagnou

Nov 28, 2021, 7:53:02 PM
On 29.11.2021 01:29, hongy...@gmail.com wrote:
> On Sunday, November 28, 2021 at 10:36:35 AM UTC+8, Janis Papanagnou wrote:
>> On 28.11.2021 03:04, hongy...@gmail.com wrote:
>>> [...]
>>>
>>> It seems that using the + variant of the -exec action is faster:
>> Yes, but your numbers below are not very conclusive; a difference of
>> that size might just as well result from caching effects. Often you get
>> an order of magnitude or more of a speed increase. Note that post-processing
>> with sort unnecessarily affects any speed comparisons of find usage
>> variants. Note also that the find built-in -exec/+ could also be done
>> using xargs (-print0 | xargs -0).
>
> Lew Pitcher pointed out the following shortcoming of the xargs-based solution [1]:
>
> The second command depends on xargs(1) to provide sort(1) with a list
> of input files. As the size of this list depends on both the number of
> files found, /and/ the maximum size of an argument vector (argv[]), there
> is a chance that the file bbb will only contain the sorted contents of
> a subset of the files found by find(1).
>
> [1] https://groups.google.com/g/comp.unix.shell/c/ha5t3U54GmY/m/M3tIUPtPBAAJ
>
> Hence, I try to avoid using xargs.

It's necessary to understand the mechanics behind the tools to make
an educated decision.

The point is that (in the examples) the use of 'cat' is exactly to
avoid that issue (otherwise we wouldn't need it and could just use
'sort' at the place where you have used 'cat').

I.e. "find -exec sort" or "find | xargs sort" _both_ *have* that
issue.

"find -exec cat | sort" or "find | xargs cat | sort" *don't* have
that issue.

(In the preceding examples the syntactic details have been omitted
for clarity, to better see the differences.)

Janis

hongy...@gmail.com

Nov 29, 2021, 12:01:32 AM
Wonderful explanation, which exposes a gap in my knowledge. But I still want to know how likely it is that this problem will occur. OTOH, I think this problem is, to some extent, related to the following values:

$ xargs --show-limits
Your environment variables take up 8115 bytes
POSIX upper limit on argument length (this system): 2086989
POSIX smallest allowable upper limit on argument length (all systems): 4096
Maximum length of command we could actually use: 2078874
Size of command buffer we are actually using: 131072
Maximum parallelism (--max-procs must be no greater): 2147483647

Execution of xargs will continue now, and it will try to read its input and run commands; if this is not what you wanted to happen, please type the end-of-file keystroke.
Warning: echo will be run at least once. If you do not want that to happen, then press the interrupt keystroke.


Anyway, if xargs fails to do the trick, maybe parallel [1] doesn't have this issue.

[1] https://www.gnu.org/software/parallel/
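
The limit in question can also be queried directly, and the batching it causes can be provoked on purpose. A small sketch, assuming getconf(1) reports ARG_MAX as a number on the system at hand; the names A to G are placeholders, not real files.

```shell
# Sketch: look at the limit that decides how many names fit into one command.
# ARG_MAX caps the combined size of arguments and environment for one exec.
arg_max=$(getconf ARG_MAX)
echo "ARG_MAX on this system: $arg_max bytes"

# Force a tiny limit to see the batching: -n 3 allows at most 3 names per run.
# "echo" prints one line per invocation, so counting lines counts commands.
batches=$(printf '%s\n' A B C D E F G | xargs -n 3 echo | wc -l | tr -d ' ')
echo "7 names at 3 per command -> $batches commands"
```

Seven names at three per command give three invocations (A B C, D E F, G). With the real limit the same split happens, only at a much larger number of names.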

hongy...@gmail.com

Nov 29, 2021, 12:10:19 AM
Yes. They give exactly the same result:

$ diff <( find ./source -type f -exec cat {} + | sort -u ) <( find ./source -type f | xargs cat | sort -u )
$

Janis Papanagnou

Nov 29, 2021, 1:05:13 AM
On 29.11.2021 06:01, hongy...@gmail.com wrote:
>>
>> I.e. "find -exec sort" or "find | xargs sort" _both_ *have* that
>> issue.
>>
>> "find -exec cat | sort" or "find | xargs cat | sort" *don't* have
>> that issue.
>
> Wonderful explanation, which exposes a gap in my knowledge. But I
> still want to know how likely it is that this problem will occur.

It's not so much a question about the probability but rather about
the reliability of a construct.

> OTOH, I think this problem is, to some extent, related to the
> following values:
>
> $ xargs --show-limits
> [...]

The issue stems from the fact of a limited exec-buffer size and
that [shell-external] commands will operate on that limited buffer.
Whenever your sample size - actually the argument list size - will
exceed that limit the outcome is unreliable and depends on the data
used; it may work in 10 cases and fail in 100, or vice versa, it
may work for all your application cases (because you are operating
only on toy data), or it may always fail (because you are working
with huge amounts of scientific data), or anything else.

To understand the issue it suffices to assume small values, say a
buffer-size of 15 and a few short arguments.

Say you have the file arguments A B C D ... Z and want to sort
them. Say in the buffer there's room for only 5, so that sorting
with above 'find'-based constructs will result in many calls;
    sort A B C D E
    sort F G H I J
    ...
    sort Z
and the output will be the concatenation of the individual calls.
A..E will be sorted, F..J will be sorted, etc. but A..Z will not
be sorted after the concatenation of the individual sorted parts.
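
The effect can be made visible by forcing a tiny batch size. This is a sketch under assumptions: the four one-line files are made up, and "xargs -n 2" merely imitates an exec buffer that holds two names.

```shell
# Sketch of the chunking effect with a forced tiny batch size.
# Four made-up one-line files; "xargs -n 2" stands in for a full exec buffer.
tmp=$(mktemp -d)
echo d > "$tmp/f1"; echo a > "$tmp/f2"; echo c > "$tmp/f3"; echo a > "$tmp/f4"

# Two separate sort runs (f1 f2, then f3 f4); their outputs are just concatenated.
chunked=$(printf '%s\n' "$tmp/f1" "$tmp/f2" "$tmp/f3" "$tmp/f4" |
          xargs -n 2 sort -u | paste -s -d ' ' -)

# cat also runs twice, but sort sees one stream and runs only once.
whole=$(printf '%s\n' "$tmp/f1" "$tmp/f2" "$tmp/f3" "$tmp/f4" |
        xargs -n 2 cat | sort -u | paste -s -d ' ' -)

echo "chunked: $chunked"   # a d a c  (unsorted overall, "a" repeated)
echo "whole:   $whole"     # a c d
rm -rf "$tmp"
```

Each chunk is sorted and de-duplicated on its own, yet the concatenation is neither.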

Very subtle errors can occur this way if one is not aware of that
fact; the result may look correct if one looks at the first few MB
of the result, but may actually be wrong.

Whether other tools (like the one mentioned below) circumvent the
exec-buffer issue must be checked - but I wouldn't expect them to.
What a tool would need is either the ability to see all data
in one call, or a way to create partly sorted data and make more sort
runs on that partly sorted data; merge-sort is an algorithm that
works that way (which had been used on sequentially operating
tape archives especially in former times).
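
With the standard tools, the second approach could look roughly like this: sort each batch into its own chunk file, then merge the chunks with "sort -m". It is only a sketch; the file names, the ".sorted" suffix and the forced batch size of 2 are assumptions made for the illustration.

```shell
# Sketch: sort in chunks, then merge the sorted chunks with "sort -m".
# File names and contents are made up; -n 2 forces several sort runs.
tmp=$(mktemp -d)
echo d > "$tmp/f1"; echo a > "$tmp/f2"; echo c > "$tmp/f3"; echo a > "$tmp/f4"

# Pass 1: each batch of two files is sorted into its own chunk file,
# named after the first file of the batch (f1.sorted, f3.sorted).
printf '%s\n' "$tmp/f1" "$tmp/f2" "$tmp/f3" "$tmp/f4" |
    xargs -n 2 sh -c 'sort -u "$@" > "$1.sorted"' sh

# Pass 2: -m merges already sorted inputs; -u drops duplicates across chunks.
merged=$(sort -m -u "$tmp"/*.sorted | paste -s -d ' ' -)
echo "merged: $merged"   # a c d
rm -rf "$tmp"
```

Note that the merge step itself takes the chunk names as arguments, so for a really huge number of chunks it would have to be repeated in rounds.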

Janis

Janis Papanagnou

Nov 29, 2021, 1:15:19 AM
As explained [in another context] more thoroughly upthread that may
also just be coincidence; it's a hint but it's certainly no proof.
(A difference would have proven it false, but no difference doesn't
say anything, strictly speaking.)

>
> $ diff <( find ./source -type f -exec cat {} + | sort -u ) <( find ./source -type f | xargs cat | sort -u )
> $

BTW, with 'xargs', in the general case it's usually better to use
NUL-separated data, as in

find ./source -type f -print0 | xargs -0 cat | sort -u
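
The reason is that plain xargs splits its input on blanks and newlines. A minimal sketch with a made-up file name containing a space, assuming a find and xargs that support -print0 and -0 (GNU and the BSDs do):

```shell
# Sketch: a file name containing a space, with and without NUL separation.
# The directory and file name are made up for illustration.
tmp=$(mktemp -d)
echo hello > "$tmp/two words.lst"

# Default xargs splits on blanks: cat is handed two names that do not exist.
plain=$(find "$tmp" -type f | xargs cat 2>/dev/null)

# -print0 / -0 pass each name intact, terminated by a NUL byte.
nul=$(find "$tmp" -type f -print0 | xargs -0 cat)

echo "plain: '$plain'"
echo "nul:   '$nul'"
rm -rf "$tmp"
```

The plain variant should print nothing (cat's error messages are discarded here), the NUL-separated one the file's content.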


Janis

hongy...@gmail.com

Nov 29, 2021, 3:21:27 AM
On Monday, November 29, 2021 at 2:15:19 PM UTC+8, Janis Papanagnou wrote:
> On 29.11.2021 06:10, hongy...@gmail.com wrote:
> > On Monday, November 29, 2021 at 8:53:02 AM UTC+8, Janis Papanagnou wrote:
> >> It's necessary to understand the mechanics behind the tools to make
> >> an educated decision.
> >>
> >> The point is that (in the examples) the use of 'cat' is exactly to
> >> avoid that issue (otherwise we wouldn't need it and could just use
> >> 'sort' at the place where you have used 'cat').
> >>
> >> I.e. "find -exec sort" or "find | xargs sort" _both_ *have* that
> >> issue.
> >>
> >> "find -exec cat | sort" or "find | xargs cat | sort" *don't* have
> >> that issue.
> >
> > Yes. They give exactly the same result:
> As explained [in another context] more thoroughly upthread that may
> also just be coincidence; it's a hint but it's certainly no proof.
> (A difference would have proven it false, but no difference doesn't
> say anything, strictly speaking.)

I don't quite understand what you mean above. What I meant is that the two methods you said *don't* have that issue give exactly the same result.

> >
> > $ diff <( find ./source -type f -exec cat {} + | sort -u ) <( find ./source -type f | xargs cat | sort -u )
> > $
> BTW, with 'xargs', in the general case it's usually better to use
> NUL-separated data, as in
>
> find ./source -type f -print0 | xargs -0 cat | sort -u

Thank you for pointing this out.

Janis Papanagnou

Nov 29, 2021, 4:08:50 AM
On 29.11.2021 09:21, hongy...@gmail.com wrote:
> On Monday, November 29, 2021 at 2:15:19 PM UTC+8, Janis Papanagnou
> wrote:
>> On 29.11.2021 06:10, hongy...@gmail.com wrote:
>>>
>>> Yes. They give exactly the same result:
>> As explained [in another context] more thoroughly upthread that may
>> also just be coincidence; it's a hint but it's certainly no proof.
>> (A difference would have proven it false, but no difference
>> doesn't say anything, strictly speaking.)
>
> I don't quite understand what you mean above. What I meant is that the
> two methods you said *don't* have that issue give exactly the same
> result.

The reasoning to assume that both are equivalent is non-conclusive.
* Observing a difference means you have proven it _wrong_.
* Observing no difference means you have _not proven_ it wrong
(other tests might still prove it wrong).
Not being able to prove something wrong does not automatically mean
that it is correct. - Think about it.
It may be true but that's not proven.
You might be able to change your data in a way where it fails.
Mind, I didn't say it's wrong, I said it's not proven to be correct.

Physics (and other sciences) is full of attempts to prove something
wrong without a chance to prove something as correct.

Janis

Chris Elvidge

Nov 29, 2021, 5:16:05 AM
On 29/11/2021 06:15 am, Janis Papanagnou wrote:
> On 29.11.2021 06:10, hongy...@gmail.com wrote:
>> On Monday, November 29, 2021 at 8:53:02 AM UTC+8, Janis Papanagnou wrote:
>>> It's necessary to understand the mechanics behind the tools to make
>>> an educated decision.
>>>
>>> The point is that (in the examples) the use of 'cat' is exactly to
>>> avoid that issue (otherwise we wouldn't need it and could just use
>>> 'sort' at the place where you have used 'cat').
>>>
>>> I.e. "find -exec sort" or "find | xargs sort" _both_ *have* that
>>> issue.
>>>
>>> "find -exec cat | sort" or "find | xargs cat | sort" *don't* have
>>> that issue.
>>
>> Yes. They give exactly the same result:
>
> As explained [in another context] more thoroughly upthread that may
> also just be coincidence; it's a hint but it's certainly no proof.
> (A difference would have proven it false, but no difference doesn't
> say anything, strictly speaking.)
>

You're trying to use logic again.

>>
>> $ diff <( find ./source -type f -exec cat {} + | sort -u ) <( find ./source -type f | xargs cat | sort -u )
>> $
>
> BTW, with 'xargs', in the general case it's usually better to use
> NUL-separated data, as in
>
> find ./source -type f -print0 | xargs -0 cat | sort -u
>
>
> Janis
>


--
Chris Elvidge
England

Janis Papanagnou

Nov 29, 2021, 6:21:00 AM
On 29.11.2021 11:16, Chris Elvidge wrote:
>
> You're trying to use logic again.

Don't recall; was that something bad?

Janis

Chris Elvidge

Nov 29, 2021, 7:28:18 AM
It is, w.r.t. HZ

--
Chris Elvidge
England

Lew Pitcher

Nov 29, 2021, 11:42:46 AM
[snip]

The caution, to me, is that unless you are aware of both the limits of the
tools that you use (like the argv[] limits imposed through xargs(1) ), and
the conditions under which you will use these tools (like the number of
files that find(1) will find, to pass along to xargs(1)), you rely on
"clever code". The problem with "clever code" isn't that it works, but
that it can fail in a non-obvious and unexpected manner, which becomes
difficult to detect, let alone debug.

For your find|xargs|sort solution, how would you have known, in any
particular execution of that pipeline, that find(1) would have exceeded
the number of filenames that xargs(1) could pass to sort(1)?
How would you debug that sort of problem?
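
One way to at least detect the situation is to let xargs report how many commands it would start for the file set in question. The sketch below uses a hypothetical helper, count_batches, that is not part of any tool; it replaces sort by "sh -c 'echo batch'", which prints one line per invocation. The five empty files and the forced limit of 2 are made up.

```shell
# Sketch: ask xargs how many commands it would run for a given file set.
# count_batches is a hypothetical helper written for this illustration.
count_batches() {
    # $1: directory to scan; remaining arguments: extra xargs options, if any
    dir=$1; shift
    find "$dir" -type f -print0 |
        xargs -0 "$@" sh -c 'echo batch' sh | wc -l | tr -d ' '
}

tmp=$(mktemp -d)
for n in 1 2 3 4 5; do : > "$tmp/file$n"; done

natural=$(count_batches "$tmp")        # with the system's real limit
forced=$(count_batches "$tmp" -n 2)    # pretend only 2 names fit per command
echo "batches: $natural (real limit), $forced (forced limit of 2)"
rm -rf "$tmp"
```

Any result above 1 means that "xargs sort" on the same input would have been split into several independent sort runs.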

Brian Kernighan had, for a while, as his .sig something to the effect that
Debugging is twice as hard as writing the code in the first place.
Therefore, if you write the code as cleverly as possible, you are, by
definition, not smart enough to debug it.

The find|xargs|sort solution looks to fall under the heading of "clever
code".

Just my opinion, of course.

Kaz Kylheku

Nov 29, 2021, 1:29:29 PM
To avoid learning how to understand programming, while a child goes from
birth to graduating with a master's degree in CS.

--
TXR Programming Language: http://nongnu.org/txr
Cygnal: Cygwin Native Application Library: http://kylheku.com/cygnal

Geoff Clare

Nov 30, 2021, 9:11:06 AM
hongy...@gmail.com wrote:

>> > $ find ./source -type f -exec sort -u {} -o aaa \;
>> > $ find ./source -type f | xargs sort -o bbb -u
>> > $ find ./source -type f -exec cat {} \;| sort -u -o ccc

> Using the following method will get the same results as other methods:
>
> $ find ./source -type f -exec sort -u {} + > aaa

Only if find is able to pass all the pathnames to a single execution
of sort. If there is more than one execution of sort, then this will
give a different result than all three of the other methods.

Unlike the first two, it will not lose any data, but the output will
be in "chunks" where each chunk contains sorted and de-duped data,
but the overall output is highly likely to be disordered at the chunk
boundaries and it may contain duplicates from different chunks.
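
Whether a given output file suffers from this can be checked after the fact with "sort -c -u", which exits non-zero at the first line that is out of order or repeated. A sketch with two made-up result files; the check should be run in the same locale as the sort that produced the file.

```shell
# Sketch: check whether an output file really is sorted and duplicate-free.
# "sort -c -u" exits non-zero at the first line that is out of order or repeated.
tmp=$(mktemp -d)
printf 'a\nd\na\nc\n' > "$tmp/chunked"   # made-up result of two separate sort -u runs
printf 'a\nc\nd\n'    > "$tmp/whole"     # made-up result of a single sort -u run

if sort -c -u "$tmp/chunked" 2>/dev/null; then chunked_ok=yes; else chunked_ok=no; fi
if sort -c -u "$tmp/whole"   2>/dev/null; then whole_ok=yes;   else whole_ok=no;   fi

echo "chunked: $chunked_ok, whole: $whole_ok"
rm -rf "$tmp"
```

The chunked file fails the check at its third line, the other one passes.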

> Therefore, the second method is not as reliable as the following two methods and should be avoided:
>
> $ find ./source -type f -exec sort -u {} + > aaa
> $ find ./source -type f -exec cat {} + | sort -u -o ccc

The first of these two is not reliable, the second is reliable (as
is the previous version with \; instead of +, but the version
with + is more efficient).

--
Geoff Clare <net...@gclare.org.uk>

hongy...@gmail.com

Dec 1, 2021, 1:56:36 AM
Thank you for your elaboration. Your analysis basically coincides with the one given by Janis Papanagnou [1]:

I.e. "find -exec sort" or "find | xargs sort" _both_ *have* that
issue.

"find -exec cat | sort" or "find | xargs cat | sort" *don't* have
that issue.


As a result, I currently use the following two approaches:

$ find ./source -type f -exec cat {} + | sort -uo american-english-exhaustive
or
$ find ./source -type f -print0 | xargs -0 cat | sort -uo american-english-exhaustive


[1] https://groups.google.com/g/comp.unix.shell/c/ha5t3U54GmY/m/K6bf2bHABAAJ
