Issue with sort -un (unique and numeric) using numeric equality tests, not just order test

r.p....@gmail.com

Nov 23, 2009, 1:19:45 PM
I just confirmed something that was starting to bug me. The sort in
our current linux:

>sort (GNU coreutils) 6.10
>Copyright (C) 2008 Free Software Foundation, Inc.
>License GPLv3+: GNU GPL version 3 or later <http://gnu.org/licenses/gpl.html>
>This is free software: you are free to change and redistribute it.
>There is NO WARRANTY, to the extent permitted by law.
>
>Written by Mike Haertel and Paul Eggert.

has a nasty behavior when doing a unique and numeric ordering at the
same time. If there are two lines, 23lebron and 23jordan, it will
take the first one and presumably remove all lines "numerically
equivalent" to 23.
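
A minimal sketch to reproduce this, assuming a POSIX shell; the variable names are made up for illustration, and LC_ALL=C is an assumption added only to keep the comparison locale-independent:

```shell
# Two lines that share the same leading numeric prefix (23).
input=$(printf '23lebron\n23jordan\n')

# -n compares only the leading numeric prefix, and -u then keeps a single
# line out of each set of lines that compare equal under that ordering.
nu_out=$(printf '%s\n' "$input" | LC_ALL=C sort -nu)
nu_count=$(printf '%s\n' "$nu_out" | awk 'END { print NR }')

printf '%s\n' "$nu_out"
echo "lines kept by sort -nu: $nu_count"
```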

That's different from how it used to behave, I'm pretty sure. So I've
been losing lines here and there and probably doing a lot more error
checking and file recovery than I should have had to.

That's different from what you'd get if you said:

sort -u foo | sort -n

or

sort -n foo | sort -u
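
For comparison, a sketch of the two pipelines on the same two lines (illustrative variable names; stdin is used here instead of the file foo). In each pipeline the -u stage compares whole lines, so both lines should survive:

```shell
input=$(printf '23lebron\n23jordan\n')

# Whole-line uniqueness first, numeric ordering second.
u_then_n=$(printf '%s\n' "$input" | LC_ALL=C sort -u | LC_ALL=C sort -n)

# Numeric ordering first, whole-line uniqueness second.
n_then_u=$(printf '%s\n' "$input" | LC_ALL=C sort -n | LC_ALL=C sort -u)

# Count the surviving lines of each pipeline.
u_then_n_count=$(printf '%s\n' "$u_then_n" | awk 'END { print NR }')
n_then_u_count=$(printf '%s\n' "$n_then_u" | awk 'END { print NR }')

echo "sort -u | sort -n keeps $u_then_n_count lines"
echo "sort -n | sort -u keeps $n_then_u_count lines"
```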

I suspect it's different from what you'd think would happen when
comparing fields and applying uniqueness, since a sort -u -k 3 would
presumably sort by field 3, but apply the uniqueness test to the whole
line. In fact, it (now) behaves like the number example, where two
lines that are equal in field 3 are considered equal from the point of
view of uniqueness. I am pretty sure this isn't the way it used to
be. So

23 lebron true
23 jordan true

sort -k 4 -u

yields only the first line.
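
A sketch of that case, next to one way to get "order by key, unique by whole line" (sort on the key, then let uniq drop exact duplicate lines). The variable names are illustrative and the uniq pipeline is a suggested alternative, not something sort does on its own:

```shell
data=$(printf '23 lebron true\n23 jordan true\n')

# Neither line has a fourth field, so the key is empty on both lines and
# -u treats them as duplicates of each other.
k4_out=$(printf '%s\n' "$data" | LC_ALL=C sort -k 4 -u)
k4_count=$(printf '%s\n' "$k4_out" | awk 'END { print NR }')

# Sort by field 3 onward, then remove only lines that are identical in full.
whole_out=$(printf '%s\n' "$data" | LC_ALL=C sort -k 3 | uniq)
whole_count=$(printf '%s\n' "$whole_out" | awk 'END { print NR }')

echo "sort -k 4 -u keeps $k4_count line(s)"
echo "sort -k 3 | uniq keeps $whole_count line(s)"
```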

And finally, I suspect it's different from what the man page "says",
since -n is claimed to "compare according to string numerical value",
not "compare and test for uniqueness according to string numerical
value". The "-u" explanation is sufficiently vague that both
interpretations are admissible.

I suppose it's too late to go back to prior behavior (Or is it? How
long has this been going on? I've been relying on traditional
sort -nu behavior for 30+ years on various unixes...), but this was somewhat
disturbing.

pk

Nov 23, 2009, 2:12:12 PM
r.p....@gmail.com wrote:

> I just confirmed something that was starting to bug me. The sort in
> our current linux:
>
>>sort (GNU coreutils) 6.10
>>Copyright (C) 2008 Free Software Foundation, Inc.
>>License GPLv3+: GNU GPL version 3 or later <http://gnu.org/licenses/gpl.html>
>>This is free software: you are free to change and redistribute it.
>>There is NO WARRANTY, to the extent permitted by law.
>>
>>Written by Mike Haertel and Paul Eggert.
>
> has a nasty behavior when doing a unique and numeric ordering at the
> same time. If there are two lines, 23lebron and 23jordan, it will
> take the first one and presumably remove all lines "numerically
> equivalent" to 23.

You shouldn't use -n with that input. Those are not numeric values.



> That's different from what you'd get if you said:
>
> sort -u foo | sort -n,
>
> or
>
> sort -n foo | sort -u.

Obviously. Neither of the above two commands does what sort -nu would do with
your data. And even if those seem to work, it's just because "23jordan"
appears to be silently interpreted as just "23". I think there's some
obscure POSIX spec that says that this should be the case, but I wouldn't
rely too much on that.



> I suspect it's different from what you'd think would happen when
> comparing fields and applying uniqueness, since a sort -u -k 3 would
> presumably sort by field 3, but apply the uniqueness test to the whole
> line. In fact, it (now) behaves like the number example, where two
> lines that are equal in field 3 are considered equal from the point of
> view of uniqueness. I am pretty sure this isn't the way it used to
> be. So
>
> 23 lebron true
> 23 jordan true
>
> sort -k 4 -u
>
> yields only the first line.

Erm, yes, that is just expected behavior. The fourth field is the same for
all lines (empty), and -u checks for uniqueness of the key field.
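
To make the key-based uniqueness concrete, here is a small sketch (illustrative variable names, LC_ALL=C assumed for reproducibility) that restricts the key to a single field and counts what -u keeps:

```shell
data=$(printf '23 lebron true\n23 jordan true\n')

# Field 2 differs between the lines (lebron vs jordan), so both are kept.
by_name=$(printf '%s\n' "$data" | LC_ALL=C sort -k 2,2 -u)
by_name_count=$(printf '%s\n' "$by_name" | awk 'END { print NR }')

# Field 3 is "true" on both lines, so -u collapses them into one.
by_flag=$(printf '%s\n' "$data" | LC_ALL=C sort -k 3,3 -u)
by_flag_count=$(printf '%s\n' "$by_flag" | awk 'END { print NR }')

echo "unique on field 2: $by_name_count line(s)"
echo "unique on field 3: $by_flag_count line(s)"
```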



> And finally, I suspect it's different from what the man page "says",
> since -n is claimed to "compare according to string numerical value",
> not "compare and test for uniqueness according to string numerical
> value".

That is correct. It's -u that tests for uniqueness.

> The "-u" explanation is sufficiently vague that both interpretations are
> admissible.
>
> I suppose it's too late to go back to prior behavior (Or is it? How
> long has this been going on? I've been relying on traditional
> sort -nu behavior for 30+ years on various unixes...), but this was somewhat
> disturbing.

The answer is: you shouldn't be using sort -nu on lines like "23jordan".
