On 29.11.2021 06:01, hongy...@gmail.com wrote:
>>
>> I.e. "find -exec sort" or "find | xargs sort" _both_ *have* that
>> issue.
>>
>> "find -exec cat | sort" or "find | xargs cat | sort" *don't* have
>> that issue.
>
> Wonderful explanation, which hits the flaw of my knowledge. But I
> still want to know what is the probability that this problem will
> occur.

It's not so much a question of probability as of the reliability
of a construct.

> OTOH, I think this problem is, to some extent, related to the
> following values:
>
> $ xargs --show-limits
> [...]

The issue stems from the limited size of the exec buffer, on which
[shell-external] commands have to operate. Whenever your sample
size - actually the argument list size - exceeds that limit, the
outcome is unreliable and depends on the data used; it may work in
10 cases and fail in 100, or vice versa, it may work for all your
application cases (because you are operating only on toy data), or
it may always fail (because you are working with huge amounts of
scientific data), or anything else.
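
To get a rough idea of that limit on a given system, something like
the following can be used. It is only a sketch: 'getconf ARG_MAX'
reports the system limit for arguments plus environment, and the
buffer that 'xargs' or 'find' actually use may be smaller; the
variable name arg_max is just for illustration.

```shell
# Query the system-wide limit (in bytes) for the combined size of
# the argument list and the environment passed to exec.
arg_max=$(getconf ARG_MAX)
echo "ARG_MAX: $arg_max bytes"
```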

To understand the issue it suffices to assume small values, say a
buffer-size of 15 and a few short arguments.

Say you have the file arguments A B C D ... Z and want to sort
them. Say in the buffer there's room for only 5, so that sorting
with the above 'find'-based constructs will result in many calls:

    sort A B C D E
    sort F G H I J
    ...
    sort Z

and the output will be the concatenation of the outputs of the
individual calls. A..E will be sorted, F..J will be sorted, etc.,
but A..Z as a whole will not be sorted after the concatenation of
the individually sorted parts.

Very subtle errors can occur this way if one is not aware of that
fact; the result may look correct if one looks at the first few MB
of the result, but may actually be wrong.
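
The effect can be reproduced on toy data without filling a real exec
buffer. The following is a minimal sketch that forces small batches
with 'xargs -n 2'; the file names a, b, c and the variable names are
made up for the demonstration, and the fixed 'printf' list stands in
for 'find' only to keep the file order predictable.

```shell
# Three small files with numeric lines, in a scratch directory.
demo_dir=$(mktemp -d)
printf '3\n9\n' > "$demo_dir/a"
printf '1\n7\n' > "$demo_dir/b"
printf '2\n8\n' > "$demo_dir/c"

list_files() {
    printf '%s\n' "$demo_dir/a" "$demo_dir/b" "$demo_dir/c"
}

# One sort per batch of 2 files: each batch is sorted on its own,
# but the concatenated output is not.
chunked=$(list_files | xargs -n 2 sort -n | paste -s -d ' ' -)

# cat may be run several times, but sort runs once and sees all data.
whole=$(list_files | xargs -n 2 cat | sort -n | paste -s -d ' ' -)

echo "chunked: $chunked"   # chunked: 1 3 7 9 2 8
echo "whole:   $whole"     # whole:   1 2 3 7 8 9
rm -rf "$demo_dir"
```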

Whether other tools (like the one mentioned below) circumvent the
exec-buffer issue must be checked - but I wouldn't expect them to.
A tool would need either to see all data in one call, or to create
partly sorted data and make further sort runs on that partly
sorted data; merge-sort is an algorithm that works that way (it
was used on sequentially operating tape archives, especially in
former times).
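
The second approach can be sketched with standard tools: sort each
part on its own, then merge the sorted parts with 'sort -m'. This is
only an illustration on toy data, not a claim about what any
particular tool does internally; the run_ prefix and the variable
names are made up.

```shell
work=$(mktemp -d)
printf '9\n3\n7\n1\n8\n2\n' > "$work/input"

# Step 1: cut the input into runs of 2 lines (run_aa, run_ab, ...).
split -l 2 "$work/input" "$work/run_"

# Step 2: sort each run on its own, in place.
for run in "$work"/run_*; do
    sort -n -o "$run" "$run"
done

# Step 3: merge the already sorted runs; -m merges, it does not
# re-sort, so it relies on every input being sorted beforehand.
merged=$(sort -n -m "$work"/run_* | paste -s -d ' ' -)
echo "$merged"   # 1 2 3 7 8 9
rm -rf "$work"
```

With many runs the list given to 'sort -m' can itself outgrow the
buffer, in which case the merge has to be done in several passes.
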
Janis