Re: sort by multiple columns

Spiros Bousbouras

Apr 19, 2023, 4:43:21 AM

On Wed, 19 Apr 2023 09:27:12 +0200
Martin Trautmann <t-us...@gmx.net> wrote:
>
> Hi all,
>
> how do I sort by multiple columns?
>
> Example:
> +++
> Borgentreich;D9386;Lindenstätte;1;;32;520150.696;5709236.354
> Borgentreich;D9444;Auf der Lindenstätte;1;;32;519950.850;5708982.109
> Borgentreich;D9444;Auf der Lindenstätte;2;;32;519926.937;5708966.116
> Borgentreich;D9444;Auf der Lindenstätte;3;;32;520008.619;5709083.464
> Borgentreich;D9444;Auf der Lindenstätte;4;;32;519860.278;5709041.468
> Borgentreich;T2960;Lindenstätte;12;;32;519622.835;5709023.590
> Borgentreich;T2960;Lindenstätte;6;;32;519696.745;5709038.833
> Borgentreich;T2960;Lindenstätte;4;;32;519722.956;5709043.915
> Borgentreich;T2960;Lindenstätte;15;;32;519489.638;5709077.693
> Borgentreich;T2960;Lindenstätte;24;;32;519518.763;5709090.026
> Borgentreich;T2960;Lindenstätte;18;;32;519559.108;5709037.356
> Borgentreich;T2960;Lindenstätte;14;;32;519596.623;5709013.684
> Borgentreich;T2960;Lindenstätte;16;;32;519569.141;5709017.854
> Borgentreich;T2960;Lindenstätte;22;;32;519540.257;5709072.032
> Borgentreich;T2960;Lindenstätte;26;;32;519503.270;5709103.321
> Borgentreich;T2960;Lindenstätte;2;;32;519758.267;5709057.635
> Borgentreich;T2960;Lindenstätte;10;;32;519648.417;5709028.865
> Borgentreich;T2960;Lindenstätte;11;;32;519607.438;5708989.545
> Borgentreich;T2960;Lindenstätte;3;;32;519732.686;5709020.833
> Borgentreich;T2960;Lindenstätte;7;;32;519678.983;5709007.380
> Borgentreich;T2960;Lindenstätte;9;;32;519651.859;5709000.462
> Borgentreich;T2960;Lindenstätte;5;;32;519708.841;5709015.137
> Borgentreich;T2960;Lindenstätte;1;;32;519778.725;5709026.584
> Borgentreich;T2960;Lindenstätte;8;;32;519673.036;5709040.372
> +++
>
> I want to sort
> * first by column 4, numerical,
> * second by column 2
> * third by column 3
>
> So the result should be
> +++
> Borgentreich;D9444;Auf der Lindenstätte;1;;32;519950.850;5708982.109
[...]
> Borgentreich;D9386;Lindenstätte;1;;32;520150.696;5709236.354

Why are these 2 lines sorted this way? Column 4 is the same ("1" in
both), so it boils down to how "D9444" and "D9386" get sorted. Which
comes first and why? It seems to me that "D9386" comes earlier than
"D9444".

Your locale may also turn out to be relevant, so you should mention
it.

Unrelated, but the first letter of your last name is Unicode code point
U+03A4, which is the Greek uppercase tau. Was this intentional or an
accident?
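
On the locale remark above: a minimal sketch of why the locale matters,
assuming a POSIX shell. Forcing LC_ALL=C makes sort compare raw bytes, so
every uppercase ASCII letter sorts before every lowercase one. Under another
locale (for example a UTF-8 one) the same input may come out in a different,
dictionary-like order. The variable name c_order is only for illustration.

```shell
# Collate four letters in the C locale: plain byte order.
c_order=$(printf '%s\n' b A a B | LC_ALL=C sort | paste -s -d ' ' -)
echo "$c_order"
```

Running `locale` shows which collation settings are currently in effect.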

Janis Papanagnou

Apr 19, 2023, 4:44:11 AM

From that specification I'd write

sort -t\; -k4n -k2 -k3

but your expected data below doesn't follow your own spec. So the
specification probably needs a correction.

(Option -s for a "stable sort" may also be part of your solution.)

Janis

>
> So the result should be
> +++
> Borgentreich;D9444;Auf der Lindenstätte;1;;32;519950.850;5708982.109
> Borgentreich;D9444;Auf der Lindenstätte;2;;32;519926.937;5708966.116
> Borgentreich;D9444;Auf der Lindenstätte;3;;32;520008.619;5709083.464
> Borgentreich;D9444;Auf der Lindenstätte;4;;32;519860.278;5709041.468
> Borgentreich;D9386;Lindenstätte;1;;32;520150.696;5709236.354
> Borgentreich;T2960;Lindenstätte;1;;32;519778.725;5709026.584
> Borgentreich;T2960;Lindenstätte;2;;32;519758.267;5709057.635
> Borgentreich;T2960;Lindenstätte;3;;32;519732.686;5709020.833
> Borgentreich;T2960;Lindenstätte;4;;32;519722.956;5709043.915
> Borgentreich;T2960;Lindenstätte;5;;32;519708.841;5709015.137
> Borgentreich;T2960;Lindenstätte;6;;32;519696.745;5709038.833
> Borgentreich;T2960;Lindenstätte;7;;32;519678.983;5709007.380
> Borgentreich;T2960;Lindenstätte;8;;32;519673.036;5709040.372
> Borgentreich;T2960;Lindenstätte;9;;32;519651.859;5709000.462
> Borgentreich;T2960;Lindenstätte;10;;32;519648.417;5709028.865
> Borgentreich;T2960;Lindenstätte;11;;32;519607.438;5708989.545
> Borgentreich;T2960;Lindenstätte;12;;32;519622.835;5709023.590
> Borgentreich;T2960;Lindenstätte;14;;32;519596.623;5709013.684
> Borgentreich;T2960;Lindenstätte;15;;32;519489.638;5709077.693
> Borgentreich;T2960;Lindenstätte;16;;32;519569.141;5709017.854
> Borgentreich;T2960;Lindenstätte;18;;32;519559.108;5709037.356
> Borgentreich;T2960;Lindenstätte;22;;32;519540.257;5709072.032
> Borgentreich;T2960;Lindenstätte;24;;32;519518.763;5709090.026
> Borgentreich;T2960;Lindenstätte;26;;32;519503.270;5709103.321
> +++
>
> I tried both
> sort -k4 -t";" -n | sort -k2,2 -t";" | sort -k3,3 -t";"
> and
> sort -k4 -t";" -n -k2,2 -k3,3
> and some permutations and reverted orders, without success.
> The sort by column 4 just gets lost or resorted.
>
> I'm not sure about the man page
> -k, --key=POS1[,POS2]
> start a key at POS1, end it at POS2 (origin 1)
>
> So I tried relative positions with
> -k3,1
> as well, without success.
>
> How do I apply the sort syntax properly?
>
> Thanks
> Martin
>

Janis Papanagnou

Apr 19, 2023, 5:51:47 AM

On 19.04.2023 10:44, Janis Papanagnou wrote:
> On 19.04.2023 09:27, Martin Τrautmann wrote:
>>
>> Hi all,
>>
>> how do I sort by multiple columns?

>>[...]

>>
>> I want to sort
>> * first by column 4, numerical,
>> * second by column 2
>> * third by column 3
>
> From that specification I'd write
>
> sort -t\; -k4n -k2 -k3

Oops... - make that

sort -t\; -k4,4n -k2,2 -k3,3

>
> but your expected data below doesn't follow your own spec. So the
> specification probably needs a correction.

You probably meant something like

sort -t\; -k3,3 -k4,4n -k2,2

Janis
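
On the correction from -k2 to -k2,2 above: a small sketch of the difference,
assuming a POSIX sort and the C locale. A key given as -k2 runs from field 2
to the end of the line, so it swallows all later fields and a following key
is rarely consulted. A key given as -k2,2 stops at the end of field 2. The
two input rows and the variable names are made up for illustration.

```shell
data='x;a;10
x;a;9'

# Open-ended key: compares "a;10" with "a;9" as strings, so "10" wins.
open_ended=$(printf '%s\n' "$data" | LC_ALL=C sort -t ';' -k2 -k3n)

# Bounded keys: field 2 ties, then field 3 is compared numerically.
bounded=$(printf '%s\n' "$data" | LC_ALL=C sort -t ';' -k2,2 -k3,3n)

printf 'open-ended:\n%s\nbounded:\n%s\n' "$open_ended" "$bounded"
```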

Helmut Waitzmann

Apr 21, 2023, 9:41:55 PM

> Martin Τραωτμανν <t-us...@gmx.net>:

>
> how do I sort by multiple columns?
>

[An example text…]

> I want to sort
> * first by column 4, numerical,
> * second by column 2
> * third by column 3

[…with sorted result]

The sorted result of your example has apparently been sorted
according to the following description:

First, group the lines sorted by column 3, that is, sort the
lines in a manner that results in alphabetically ascending values
in column 3.

Then, in each group of lines, that have got a common value in
column 3, sort the lines independently in a manner that results
in alphabetically ascending values in column 2.

Then, in each group of lines that have got common values in
columns 3 and 2 respectively, sort the lines independently in a
manner that results in numerically ascending values in column 4.

Finally, each group of lines that has equal values in columns
3, 2, and 4 according to the sort criteria specified above is
sorted according to a default sorting criterion that compares
the whole line.

This can be achieved using the following command line:

sort -t ';' -k 3,3 -k 2,2 -k 4,4n

You might read the description of the "sort" utility in the POSIX
standard
(<https://pubs.opengroup.org/onlinepubs/9699919799/utilities/sort.html#top>),
especially the last paragraph in the "OPTIONS" section
(<https://pubs.opengroup.org/onlinepubs/9699919799/utilities/sort.html#tag_20_119_04>):
"When there are multiple key fields, later keys shall be compared
only after all earlier keys compare equal. Except when the -u
option is specified, lines that otherwise compare equal shall be
ordered as if none of the options -d, -f, -i, -n, or -k were
present (but with -r still in effect, if it was specified) and
with all bytes in the lines significant to the comparison. The
order in which lines that still compare equal are written is
unspecified."
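
As a quick check of the command line above, here is a sketch that runs it on
six rows copied from the original example, with the C locale forced so that
the result does not depend on local collation settings (that choice is an
assumption, the thread does not say which locale was in use). The variable
name sorted is only for illustration.

```shell
sorted=$(LC_ALL=C sort -t ';' -k 3,3 -k 2,2 -k 4,4n <<'EOF'
Borgentreich;T2960;Lindenstätte;12;;32;519622.835;5709023.590
Borgentreich;D9386;Lindenstätte;1;;32;520150.696;5709236.354
Borgentreich;T2960;Lindenstätte;4;;32;519722.956;5709043.915
Borgentreich;D9444;Auf der Lindenstätte;2;;32;519926.937;5708966.116
Borgentreich;T2960;Lindenstätte;2;;32;519758.267;5709057.635
Borgentreich;D9444;Auf der Lindenstätte;1;;32;519950.850;5708982.109
EOF
)

# Show only columns 2 to 4 to make the ordering easy to read.
printf '%s\n' "$sorted" | cut -d ';' -f 2-4
```

The "Auf der Lindenstätte" rows come first, then D9386, then the T2960 rows
in numeric order of column 4, which matches the shape of the expected result
quoted earlier.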

Helmut Waitzmann

Apr 22, 2023, 9:42:11 PM

Martin Τrautmann <t-us...@gmx.net>:

> On Sat, 22 Apr 2023 03:33:43 +0200, Helmut Waitzmann wrote:
>>> I want to sort
>>> * first by column 4, numerical,
>>> * second by column 2
>>> * third by column 3
>>
>> […with sorted result]
>>
>>
>> The sorted result of your example has apparently been sorted
>> according to the following description:
>>
>> First, group the lines sorted by column 3, that is, sort the
>> lines in a manner that results in alphabetically ascending
>> values in column 3.
>

> That's a matter of concern how the sort works.
>
>
> If I want to pre-sort by 3 first, then sub-sort by column 2,
> that's fine. But when I pipe one sort to the other, the second
> sort will destroy the sort before. That's why i had my sort
> order in reverted order, using a pipe example.

That won't help, either: a pipe of (standard) "sort" invocations
won't solve the problem, because one cannot tell a standard
"sort" to compare on the given key option only. Each sort in the
pipe produces a total order of its own (according to its sort
criteria), comparing whole lines when the keys are equal.

With GNU‐"sort", though, a sorting pipe can solve the problem, if
one applies the "--stable" option to each (except the first) of
the "sort" invocations. Then the command

sort --stable -t ';' -n -k 4,4 |
sort --stable -t ';' -k 2,2 |
sort --stable -t ';' -k 3,3

will do the job. (Unfortunately the "--stable" option is not
part of the POSIX standard.)
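
A sketch comparing the stable pipeline with the single multi-key invocation,
assuming a sort that supports -s (the short form of --stable in GNU sort;
it is not in POSIX). The rows are shortened stand-ins modelled on the
example data, not the real records, and the C locale is forced.

```shell
data='t;T2960;Lind;12
t;D9386;Lind;1
t;T2960;Lind;4
t;D9444;Auf der Lind;2
t;T2960;Lind;2
t;D9444;Auf der Lind;1'

# Least significant key first; each later stable sort keeps ties in place.
piped=$(printf '%s\n' "$data" |
  LC_ALL=C sort -s -t ';' -k 4,4n |
  LC_ALL=C sort -s -t ';' -k 2,2 |
  LC_ALL=C sort -s -t ';' -k 3,3)

# The same ordering expressed as one invocation with three keys.
single=$(printf '%s\n' "$data" | LC_ALL=C sort -t ';' -k 3,3 -k 2,2 -k 4,4n)

[ "$piped" = "$single" ] && echo "same order"
```

The two agree here because no two rows tie on all three keys. With such ties
the pipeline would keep input order while the single call would fall back to
comparing whole lines.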

[A quote from the "sort" description in the POSIX standard]

> This description is much better than my man and info sort
>

Yes, that's my experience, too. I tend to read not only the
manual page or info documentation but also look into the
corresponding POSIX description (if the utility is part of the
POSIX standard), and then check, whether the manual page or info
documentation conflicts with the POSIX description.

> - but unfortunately I can't be sure that the POSIX info actually
> does work on my local sort implementation: sort 5.93 November
> 2005

Yes, that might happen. In practice, GNU tries to follow the
POSIX standard.

Janis Papanagnou

Apr 23, 2023, 8:02:56 AM

On 23.04.2023 13:28, Martin Τrautmann wrote:

> On Sun, 23 Apr 2023 03:33:47 +0200, Helmut Waitzmann wrote:
>>> If I want to pre-sort by 3 first, then sub-sort by column 2,
>>> that's fine. But when I pipe one sort to the other, the second
>>> sort will destroy the sort before. That's why i had my sort
>>> order in reverted order, using a pipe example.
>>
>> That won't help, either: A sorting pipe using (a standard)
>> "sort" won't solve the problem, because one cannot tell (a
>> standard) "sort" to do a sort on the given key option only. Each
>> sort in the pipe will be total (according to its sort criteria)
>> of its own.
>

> That was my problem - I expected that a pipe through several sorts would
> keep the order. I don't know why it doesn't.

Because sorting on one criterion generally doesn't impose any
restrictions on other criteria. That freedom allows a very
efficient implementation. But that's what stable sorting is
for: to define how records with equal keys are ordered, namely
in their original input order. Since Unix's 'sort' is able to
take multiple keys, there's of course less need to split the
sorting across several distinct processes connected by pipes.

Janis
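
To make the point about equal keys concrete, a two-line sketch, assuming a
sort with the -s option (GNU and BSD have it, POSIX does not) and the C
locale. Both rows have the same key in field 2. Without -s, sort breaks the
tie by comparing whole lines, which reorders the input. With -s, the input
order survives. The data and variable names are made up.

```shell
input='b;1
a;1'

# Tie on field 2 is broken by whole-line comparison: "a;1" moves up.
plain=$(printf '%s\n' "$input" | LC_ALL=C sort -t ';' -k 2,2)

# Stable sort: tied rows stay in their input order.
stable=$(printf '%s\n' "$input" | LC_ALL=C sort -s -t ';' -k 2,2)

printf 'plain:\n%s\nstable:\n%s\n' "$plain" "$stable"
```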

David W. Hodgins

Apr 23, 2023, 9:43:20 AM

On Sun, 23 Apr 2023 07:28:22 -0400, Martin Τrautmann <t-us...@gmx.net> wrote:
> That was my problem - I expected that a pipe through several sorts would
> keep the order. I don't know why it doesn't.

It may be easier to understand if you use temporary files instead of pipes.

Sort the input file by column 4, numerically, creating a first temporary file.
Sort the first temporary file by column 2, creating a second temporary file.
Sort the second temporary file by column 3, creating the output.

The last sort doesn't know that the prior two sorts have been done. It just
looks at the file it's given and sorts it by column 3.

Using a pipe just takes the output of the first and second sort and uses it
directly as input for the next sort. All the pipe does is eliminate the
need for a temporary file.

Keep in mind: when sorting a file, the last line in the input may end up becoming
the first line in the output. The sort cannot write anything to the pipe or
output file until it's sorted the entire input. With a pipe, the temporary
file is in RAM rather than being a named file on disk.

Regards, Dave Hodgins

Kenny McCormack

Apr 23, 2023, 10:36:35 AM

In article <op.13uwd4i...@hodgins.homeip.net>,
David W. Hodgins <dwho...@nomail.afraid.org> wrote:
...

>Keep in mind. When sorting a file, the last line in the input may end up
>becoming the first line in the output. The sort can not write anything to
>the pipe or output file until it's sorted the entire input. With a pipe,
>the temporary file is in ram rather than being a named file on disk.

This actually raises an interesting point. Pipes are not infinite in size,
and they could, theoretically block if enough is written on the write end
without anything being read from the read end. Though the limits are
likely very large nowadays on modern systems, I think the original
implementation was only 4096 bytes and the standards today (POSIX) may not
guarantee anything more than that (haven't checked).

For most programs, this is rarely a concern, since most pipelines write and
read more or less simultaneously in real time, but sort is an edge case for
the reason you explain above.

Something to keep in mind if you ever decide to sort very large files in a
pipeline. And it is probably a better idea not to do so; to sort it all at
once, using multiple key specifications on the command line.

--
Rich people pay Fox people to convince middle class people to blame poor people.

(John Fugelsang)

Janis Papanagnou

Apr 23, 2023, 10:54:18 AM

On 23.04.2023 16:36, Kenny McCormack wrote:
> [...]

>
> For most programs, this is rarely a concern, since most pipelines write and
> read more or less simultaneously in real time, but sort is an edge case for
> the reason you explain above.

Note also that quite a few sorting operations are used implicitly
(e.g. in 'ls', in the shell's '*' glob/pattern expansion, etc.).
For example, don't expect find | xargs ls to provide sorted
output.

>
> Something to keep in mind if you ever decide to sort very large files in a
> pipeline. [...]

However an instance of sort is implemented (in memory, with
temporary files, or whatever), my expectation is that
  whatever | sort
will have to produce sorted output. Isn't that guaranteed?

Janis

Kenny McCormack

Apr 23, 2023, 11:30:06 AM

In article <u23gql$3rkl5$1...@dont-email.me>,

Actually, I may be wrong about this. May have posted too quickly.

The bad case would be if a program produced a ton of output, but the reader
didn't read any of it. I'll have to think some more as to whether or not
that applies here.

--
The randomly chosen signature file that would have appeared here is more than 4
lines long. As such, it violates one or more Usenet RFCs. In order to remain
in compliance with said RFCs, the actual sig can be found at the following URL:
http://user.xmission.com/~gazelle/Sigs/GodDelusion

Spiros Bousbouras

Apr 23, 2023, 11:51:47 AM

On Sun, 23 Apr 2023 14:36:30 -0000 (UTC)
gaz...@shell.xmission.com (Kenny McCormack) wrote:
> In article <op.13uwd4i...@hodgins.homeip.net>,
> David W. Hodgins <dwho...@nomail.afraid.org> wrote:
> ...
> >Keep in mind. When sorting a file, the last line in the input may end up
> >becoming the first line in the output. The sort can not write anything to
> >the pipe or output file until it's sorted the entire input. With a pipe,
> >the temporary file is in ram rather than being a named file on disk.
>
> This actually raises an interesting point. Pipes are not infinite in size,
> and they could, theoretically block if enough is written on the write end
> without anything being read from the read end. Though the limits are
> likely very large nowadays on modern systems, I think the original
> implementation was only 4096 bytes and the standards today (POSIX) may not
> guarantee anything more than that (haven't checked).

I tried to find an argument which you can give to getconf to get the
answer to that but I didn't see anything. I don't think POSIX gives a constant
(in some C header) to get the answer to that. There is PIPE_BUF but this is
for atomic writes rather than total pipe capacity.
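
For reference, PIPE_BUF itself can be queried with getconf by passing a
pathname. A minimal sketch, assuming "/" as the path (any existing path
works). As noted above, this is the atomic-write limit, not the total pipe
capacity. POSIX requires it to be at least 512 bytes.

```shell
# Query the atomic-write limit for pipes and FIFOs under "/".
pipe_buf=$(getconf PIPE_BUF /)
echo "PIPE_BUF for /: $pipe_buf bytes"
```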

> For most programs, this is rarely a concern, since most pipelines write and
> read more or less simultaneously in real time, but sort is an edge case for
> the reason you explain above.
>
> Something to keep in mind if you ever decide to sort very large files in a
> pipeline. And it is probably a better idea not to do so; to sort it all at
> once, using multiple key specifications on the command line.

I don't see the problem. If sort is on the left of a pipe then it will
sort its whole input and then all it will do is write to the pipe. If sort
is on the right of a pipe then in the beginning it will only do reading
until it has read everything and then do the sorting. Obviously if you
have process1 | process2 and one side does reading or writing (whatever
applies) much slower than the other side then the fast side will block but
there's nothing special with sort about that. On the contrary, by the
nature of what it does, sort will only do reading or writing during part
of its operation.

--
Fans of both doomsday scenario movies and movies that show close-ups of Willem
Dafoe's pubic region should walk away eerily pleased from this one.
https://www.imdb.com/review/rw2553866/

Kaz Kylheku

Apr 23, 2023, 11:51:58 AM

On 2023-04-23, Kenny McCormack <gaz...@shell.xmission.com> wrote:
> The bad case would be if a program produced a ton of output, but the reader
> didn't read any of it. I'll have to think some more as to whether or not
> that applies here.

Limited pipe sizes cause two potential problems:

- deadlock: programs that both read and write a pipe may work when
tested with small messages, but lock up on larger ones.

- atomicity of writes: a write of a number of bytes smaller
than the pipe size can be read all together on the other end,
so the reading end will work correctly without checking for
a short read. When the message size exceeds the pipe size,
that breaks.

--
TXR Programming Language: http://nongnu.org/txr
Cygnal: Cygwin Native Application Library: http://kylheku.com/cygnal
Mastodon: @Kazi...@mstdn.ca

Spiros Bousbouras

Apr 23, 2023, 12:05:42 PM

On Sun, 23 Apr 2023 15:51:42 -0000 (UTC)
Spiros Bousbouras <spi...@gmail.com> wrote:
> I don't see the problem. If sort is on the left of a pipe then it will
> sort its whole input and then all it will do is write to the pipe. If sort
> is on the right of a pipe then in the beginning it will only do reading
> until it has read everything and then do the sorting. Obviously if you
> have process1 | process2 and one side does reading or writing (whatever
> applies) much slower than the other side then the fast side will block

To be precise, *may* block if the amount of data going through the pipe is
large enough.

Janis Papanagnou

Apr 23, 2023, 12:19:32 PM

On 23.04.2023 17:51, Spiros Bousbouras wrote:
>
> [...] If sort
> is on the right of a pipe then in the beginning it will only do reading
> until it has read everything and then do the sorting. [...]

This is [in principle] not necessarily the case. The sort algorithm
can start to sort subsets of the stream to create runs of already
sorted sequences. Mergesort, for example, is a good candidate for
such a process; it can use (e.g.) Heapsort to create larger runs in
memory and then needs fewer merge passes (which are typically costly
if done over files). How much data the Heapsort will process
may vary, but a size on the order of the pipe buffer is reasonable.

Disclaimer: I don't know how Unix's 'sort' is typically implemented,
but I expect some sophisticated implementation, since what I wrote
above is decades-old knowledge (at least since the 1980s, when I
implemented some hybrid sorting algorithms, or maybe even back to
Donald Knuth's work; but I don't recall whether it's covered in his
"Sorting and Searching" book).

Janis
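
The run-then-merge idea described above can be imitated by hand with
standard tools. This is only an illustration of the principle, not a claim
about how any particular sort is implemented internally. It assumes a seq
command that accepts a negative step, and the chunk size of 300 lines and
all file and variable names are arbitrary.

```shell
workdir=$(mktemp -d)

# Worst-case style input: 1000 numbers in descending order.
seq 1000 -1 1 > "$workdir/input"

# Cut the input into runs of 300 lines and sort each run on its own.
split -l 300 "$workdir/input" "$workdir/run."
for f in "$workdir"/run.*; do
  sort -n -o "$f" "$f"
done

# Merge the already sorted runs; -m merges without re-sorting.
sort -n -m "$workdir"/run.* > "$workdir/output"

first=$(head -n 1 "$workdir/output")
last=$(tail -n 1 "$workdir/output")
count=$(wc -l < "$workdir/output" | tr -d ' ')
rm -rf "$workdir"
echo "first=$first last=$last lines=$count"
```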

Felix Palmen

Apr 23, 2023, 12:22:07 PM

* Kenny McCormack <gaz...@shell.xmission.com>:

> David W. Hodgins <dwho...@nomail.afraid.org> wrote:
> ...
>>Keep in mind. When sorting a file, the last line in the input may end up
>>becoming the first line in the output. The sort can not write anything to
>>the pipe or output file until it's sorted the entire input. With a pipe,
>>the temporary file is in ram rather than being a named file on disk.
>
> This actually raises an interesting point. Pipes are not infinite in size,
> and they could, theoretically block if enough is written on the write end

> [...]

> Something to keep in mind if you ever decide to sort very large files in a
> pipeline. And it is probably a better idea not to do so; to sort it all at
> once, using multiple key specifications on the command line.

This won't be a concern here. You need the whole data to sort something,
so the sort utility must read until EOF anyways before doing its work.
So, the real concern is whether you'll have enough RAM.

The only alternative would be to sort on the file contents. I don't know
whether some sort utility can do that (it certainly would create other
issues when sorting by "text lines" of very different lengths), but
that's not possible with pipes anyways, they can't be seeked.

--
Dipl.-Inform. Felix Palmen <fe...@palmen-it.de> ,.//..........
{web} http://palmen-it.de {jabber} [see email] ,//palmen-it.de
{pgp public key} http://palmen-it.de/pub.txt // """""""""""
{pgp fingerprint} 6936 13D5 5BBF 4837 B212 3ACC 54AD E006 9879 F231

David W. Hodgins

Apr 23, 2023, 12:27:46 PM

On Sun, 23 Apr 2023 10:36:30 -0400, Kenny McCormack <gaz...@shell.xmission.com> wrote:
> This actually raises an interesting point. Pipes are not infinite in size,
> and they could, theoretically block if enough is written on the write end
> without anything being read from the read end. Though the limits are
> likely very large nowadays on modern systems, I think the original
> implementation was only 4096 bytes and the standards today (POSIX) may not
> guarantee anything more than that (haven't checked).

Just tested "sort bigfile|hexdump|less". htop shows it's using 917M of ram
and 2.5GB of virtual storage (reserved, not all used) to sort a 730M input
file.

After ending the less output ...
$ free -m
               total        used        free      shared  buff/cache   available
Mem:           15955        5715        2376         361        7863        9548
Swap:          32761           2       32758

There may be versions of sort that still limit how much RAM they can use, but
the version from the coreutils package is not one of them. Its only limit is
based on the amount of RAM and swap space available, and what the OOM killer
can make available if you do start to run out.

Also note it has options such as "--temporary-directory=DIR" to use disk files
for temporary storage instead of ram.

Regards, Dave Hodgins

Janis Papanagnou

Apr 23, 2023, 12:34:31 PM

On 23.04.2023 18:21, Felix Palmen wrote:
>
> This won't be a concern here. You need the whole data to sort something,
> so the sort utility must read until EOF anyways before doing its work.

See my recent reply on a different view.

> So, the real concern is whether you'll have enough RAM.

Not if sorting is (alternatively or also) done over files.

Even at times when 640k was considered immense memory by some, much
larger data sets had been sorted even then. (Speaking about real OS
computers, not about toys). In earth-bound computers there usually
was much more disk/drum/tape memory than kernel memory available.
Even if that's "legacy" the principles are still the same. - Unless
responsible folks start putting everything into a global memory
cloud. :-/

Janis

David W. Hodgins

Apr 23, 2023, 12:35:25 PM

On Sun, 23 Apr 2023 12:21:44 -0400, Felix Palmen <fe...@palmen-it.de> wrote:
> This won't be a concern here. You need the whole data to sort something,
> so the sort utility must read until EOF anyways before doing its work.
> So, the real concern is whether you'll have enough RAM.

Yes the sort has to read the entire input file before it can write anything
to the output file as the last record read may have to be the first one
written.

The way that's handled in low ram systems is to use temporary files, where it
sorts chunks into each temporary file and then merges the temporary files to
create the final output file.

By default the temporary files are stored in /tmp, which on most systems is
now a virtual file system kept in ram.

Either ensure /tmp is mounted on a disk file system with enough free space
or instruct sort to use another directory.

See "man sort" for the -T (aka --temporary-directory=DIR) option.

Regards, Dave Hodgins
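
A minimal usage sketch for the -T option mentioned above, assuming a sort
that supports it (the thread's man pages do). The scratch directory comes
from mktemp purely for illustration. On a real system it would be a
directory on a disk with enough free space. With input this small, sort will
probably not create any temporary file at all.

```shell
scratch=$(mktemp -d)

# Ask sort to put any temporary files it needs into $scratch.
result=$(printf '%s\n' 3 1 2 | sort -n -T "$scratch" | paste -s -d ' ' -)

rm -rf "$scratch"
echo "$result"
```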

Richard Harnden

Apr 23, 2023, 12:36:52 PM

My man page says:

  --radixsort
          Try to use radix sort, if the sort specifications allow.
          The radix sort can only be used for trivial locales (C and
          POSIX), and it cannot be used for numeric or month sort.
          Radix sort is very fast and stable.

  --mergesort
          Use mergesort. This is a universal algorithm that can
          always be used, but it is not always the fastest.

  --qsort
          Try to use quick sort, if the sort specifications allow.
          This sort algorithm cannot be used with -u and -s.

  --heapsort
          Try to use heap sort, if the sort specifications allow.
          This sort algorithm cannot be used with -u and -s.

Janis Papanagnou

Apr 23, 2023, 12:46:03 PM

On 23.04.2023 18:36, Richard Harnden wrote:
>
> My man page says:

Thanks for that, since my man page doesn't say anything about the
algorithms. Now we have some clue what 'sort' on Unix does; and it
seems that hybrid sorting algorithms aren't implemented, which is
really strange since Quicksort implementations usually use Insertion
Sort for small partitions, and upthread I already spoke about the
Mergesort/Heapsort hybrid. (Room for improvement? Or are they just
presuming that everything is doable with an arbitrarily large virtual
memory? Who knows.)

>
> --radixsort
> Try to use radix sort, if the sort specifications allow.
> The radix sort can only be used for trivial locales (C and
> POSIX), and it cannot be used for numeric or month sort.
> Radix sort is very fast and stable.
>
> --mergesort
> Use mergesort. This is a universal algorithm that can
> always be used, but it is not always the fastest.
>
> --qsort
> Try to use quick sort, if the sort specifications allow.
> This sort algorithm cannot be used with -u and -s.
>
> --heapsort
> Try to use heap sort, if the sort specifications allow.
> This sort algorithm cannot be used with -u and -s.
>

Janis

Felix Palmen

Apr 23, 2023, 1:00:06 PM

* Janis Papanagnou <janis_pap...@hotmail.com>:

> On 23.04.2023 18:21, Felix Palmen wrote:
>>
>> This won't be a concern here. You need the whole data to sort something,
>> so the sort utility must read until EOF anyways before doing its work.
>
> See my recent reply on a different view.

So, even if it starts working on "chunks", this won't change anything:
the data from the pipe must be read in order to work with it, so the
size of the pipe won't be a problem here.

It seems the assumption behind that concern was that all the data to be
sorted must fit into the pipe buffer. But this isn't the case.

>> So, the real concern is whether you'll have enough RAM.
>
> Not if sorting is (alternatively or also) done over files.

Sure this *can* be done, that's why I mentioned the possibility. I
wasn't aware sort utils these days actually do it.

Janis Papanagnou

Apr 23, 2023, 1:16:31 PM

On 23.04.2023 18:58, Felix Palmen wrote:
> * Janis Papanagnou <janis_pap...@hotmail.com>:
>> On 23.04.2023 18:21, Felix Palmen wrote:
>>>
>>> This won't be a concern here. You need the whole data to sort something,
>>> so the sort utility must read until EOF anyways before doing its work.

s/doing/finishing/

>> See my recent reply on a different view.
>
> So, even if it starts working on "chunks", this won't change anything:
> the data from the pipe must be read in order to work with it, so the
> size of the pipe won't be a problem here.
>
> It seems the idea assuming this was that the whole data to be sorted
> must fit into the pipe buffer. But this isn't the case.

It boils down to this; sorting can _start_ sorting with fewer data
(something like a pipe-full), it can also _continue_ sorting with
more parts of data, and to _finish_ sorting it naturally must have
had all data available.

Janis

Felix Palmen

Apr 23, 2023, 1:30:06 PM

* Janis Papanagnou <janis_pap...@hotmail.com>:
> s/doing/finishing/

Agreed.

> It boils down to this; sorting can _start_ sorting with fewer data
> (something like a pipe-full), it can also _continue_ sorting with
> more parts of data, and to _finish_ sorting it naturally must have
> had all data available.

All correct, but I really doubt the relevance of the parentheses. The
size of the pipe will never be of much interest (except maybe for
performance), mostly because you can't seek a pipe anyways.

David W. Hodgins

Apr 23, 2023, 1:47:07 PM

On Sun, 23 Apr 2023 12:58:41 -0400, Felix Palmen <fe...@palmen-it.de> wrote:
> It seems the idea assuming this was that the whole data to be sorted
> must fit into the pipe buffer. But this isn't the case.

As the last line of the input file(s) may be the first line of the final output,
all of the data must be sorted before anything is written to the pipe.

Either all of the data has to fit in RAM, or it has to be sorted in chunks,
with those chunks stored on disk and then merged to
produce the output.

The coreutils package's sort can use temporary (unnamed) files as needed in the
directory specified by the TMPDIR environment variable (/tmp on most systems).
They won't show up in ls as they are unnamed.

It's not clear from the man page if it will always use temporary files or only
if instructed to. So checking the source ...
https://git.savannah.gnu.org/gitweb/?p=coreutils.git;a=blob;f=src/sort.c;h=8ca7a88c48ec07eccd952b14739e427721466c5d;hb=HEAD

If I'm reading it right, it always uses temporary files doing a sort/merge.
Given that it started in 1988, it's not surprising that it's designed to work
in a low ram environment.

So if you're in a low ram environment either ensure the $TMPDIR directory is
not in ram, or include the --temporary-directory=DIR to specify another
directory that is on a disk file system with enough free space.

Regards, Dave Hodgins

Janis Papanagnou

Apr 23, 2023, 1:53:42 PM

On 23.04.2023 19:29, Felix Palmen wrote:
> * Janis Papanagnou <janis_pap...@hotmail.com>:
>> s/doing/finishing/
>
> Agreed.
>
>> It boils down to this; sorting can _start_ sorting with fewer data
>> (something like a pipe-full), it can also _continue_ sorting with
>> more parts of data, and to _finish_ sorting it naturally must have
>> had all data available.
>
> All correct, but I really doubt the relevance of the parentheses.

It's here just to demonstrate a magnitude, no less, no more.

But I seem to recall - faint memories from 4 decades ago - that
I/O-buffer size (similar to pipe-buffer size) was part of the
rationale about why to use such values and how to dimension it
(for optimum processing speed, yes, for performance as you say
below).

> The
> size of the pipe will never be of much interest (except maybe for
> performance), mostly because you can't seek a pipe anyways.

Seeking on the pipe isn't necessary since the pipe is just the
transfer medium, unstructured per se, with data likely even
cut off mid-record at the front or rear of a read (because
transmission is octet-based, not record-based). You'll transfer
it into a structured in-memory representation anyway.

Janis

Spiros Bousbouras

Apr 23, 2023, 2:03:41 PM

I think Kenny was worried in <u23fpe$2opsm$1...@news.xmission.com>
and <u23ito$2osbe$1...@news.xmission.com> about a deadlock situation where
no progress gets made because of low pipe capacity. I can't think of
a scenario where this can happen even if sort interleaves sorting and
reading from a pipe.

Janis Papanagnou

Apr 23, 2023, 2:17:53 PM

On 23.04.2023 19:46, David W. Hodgins wrote:
>
> If I'm reading it right, it always uses temporary files doing a sort/merge.
> Given that it started in 1988, it's not surprising that it's designed to
> work in a low ram environment.

Some test-run[*] finished here...

The data created and fed into 'sort' is larger than my free RAM.

$ time seq 1000000000 -1 1 | sort -n | N=1 is-sorted
0

real 58m8.18s
user 54m18.16s
sys 1m49.34s

Janis

[*] 'is-sorted' is an awk script, and "0" means it's okay (=sorted).
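
The 'is-sorted' script itself is not shown in the thread. A hypothetical stand-in, assuming one number per line and the same convention of printing 0 for sorted input (the function name is made up, and the N=1 setting from the command above is not modelled), might look like this:

```shell
# Hypothetical stand-in for the 'is-sorted' awk script mentioned above:
# prints 0 if the first field is in non-decreasing numeric order, else 1.
is_sorted() {
    awk 'NR > 1 && $1 + 0 < prev { bad = 1; exit }
         { prev = $1 + 0 }
         END { print bad + 0 }'
}

seq 10 -1 1 | sort -n | is_sorted    # sorted input
seq 10 -1 1 | is_sorted              # descending input
```

For a plain yes/no answer, sort -n -c does a similar check and reports through its exit status.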

John-Paul Stewart

Apr 23, 2023, 3:42:47 PM

On 4/23/23 10:36, Kenny McCormack wrote:
> This actually raises an interesting point. Pipes are not infinite in size,
> and they could, theoretically block if enough is written on the write end
> without anything being read from the read end. Though the limits are
> likely very large nowadays on modern systems, I think the original
> implementation was only 4096 bytes and the standards today (POSIX) may not
> guarantee anything more than that (haven't checked).

FWIW, the pipe(7) manpage from Debian GNU/Linux has a "Pipe capacity"
section that says in part:

Before Linux 2.6.11, the capacity of a pipe was the same as the
system page size (e.g., 4096 bytes on i386). Since Linux
2.6.11, the pipe capacity is 16 pages (i.e., 65,536 bytes in a
system with a page size of 4096 bytes). Since Linux 2.6.35,
the default pipe capacity is 16 pages, but the capacity can be
queried and set using the fcntl(2) F_GETPIPE_SZ and
F_SETPIPE_SZ operations. See fcntl(2) for more information.

So pipes on Linux aren't very large at all. I don't know how other Unix
systems compare.

Helmut Waitzmann

Apr 23, 2023, 3:57:53 PM

Martin Τrautmann <t-us...@gmx.net>:

> On Sun, 23 Apr 2023 03:33:47 +0200, Helmut Waitzmann wrote:

>>> If I want to pre-sort by 3 first, then sub-sort by column 2,
>>> that's fine. But when I pipe one sort to the other, the second
>>> sort will destroy the sort before. That's why i had my sort
>>> order in reverted order, using a pipe example.
>>
>> That won't help, either: A sorting pipe using (a standard)
>> "sort" won't solve the problem, because one cannot tell (a
>> standard) "sort" to do a sort on the given key option only.
>> Each sort in the pipe will be total (according to its sort
>> criteria) of its own.
>

> That was my problem - I expected that a pipe through several
> sorts would keep the order. I don't know why it doesn't.

Look at these sample lines:

1;0
1;1
1;2
0;0
0;1
0;2
2;0
2;1
2;2

To have this sequence of lines sorted in such a way that the
first field is sorted in ascending numeric order while the second
is sorted in descending numeric order, one could specify the two
sort criteria at once:

sort -t ';' -k 1nb,1 -k 2nr,2

How would the command line be if one would use two "sort"
invocations with each of them getting only one "-k" option
(replacing the "???" by the appropriate sort key specifications)?

first=??? ; second=???
sort -t ';' -k "$first" |
sort -t ';' -k "$second"

Or (if it's easier to understand, but it's equivalent) use an
intermediate file rather than a pipe:

first=??? ; second=???
sort -t ';' -k "$first" > file &&
sort -t ';' -k "$second" -- file

Try to answer the following questions:

Would the variable assignments

first=2nr,2
second=1nb,1

yield the correct result? Why or why not? Would they work if
one adds the GNU‐"sort" "--stable" option to the second "sort"
invocations? Why or why not?

When using the variant with the intermediate file, after having
run the first "sort" invocation, you might examine the
intermediate file and try to predict what would be the outcome of
the second "sort" invocation.
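
One way to check the first of these questions empirically, as a sketch: LC_ALL=C is set here only to make the tie-breaking byte-wise and reproducible, and the variable names other than first and second are made up for the example.

```shell
export LC_ALL=C   # assumption: fix the locale so ties break byte-wise

data='1;0
1;1
1;2
0;0
0;1
0;2
2;0
2;1
2;2'

# Both criteria in a single invocation.
one_pass=$(printf '%s\n' "$data" | sort -t ';' -k 1nb,1 -k 2nr,2)

# The proposed assignments, one key per invocation, no --stable.
first=2nr,2 ; second=1nb,1
two_pass=$(printf '%s\n' "$data" |
    sort -t ';' -k "$first" |
    sort -t ';' -k "$second")

printf 'one pass:   %s\n' "$(printf '%s' "$one_pass" | tr '\n' ' ')"
printf 'two passes: %s\n' "$(printf '%s' "$two_pass" | tr '\n' ' ')"
```

The two results differ: in the second invocation, lines with equal first fields are ordered by comparing the whole lines, so the descending order of the second field produced by the first invocation is lost.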

David W. Hodgins

Apr 23, 2023, 4:07:00 PM

On Sun, 23 Apr 2023 15:42:00 -0400, John-Paul Stewart <jpst...@personalprojects.net> wrote:
> So pipes on Linux aren't very large at all. I don't know how other Unix
> systems compare.

The pipe only has to store a minimum of one buffer of data. If the process
writing data to the pipe is faster than the one reading it, then the write
process will block while it waits for the reading process to catch up.
Likewise if the reading process is faster. It will just block while it waits
for the data to be ready.

Having more buffers will speed it up only if the processes run at different
speeds, with the slower one being inconsistent in its speed.

A good example of that is sort somefile | less.

Suppose the user presses page down repeatedly. Each time the faster sort process
has written enough data to fill the buffers, it gets blocked from writing until
the page down key is pressed and the less command reads the data for the next
screen full, freeing up some of the buffer space.

Note that when I write that the sort command is faster, by the time the first
screen full shows up in less, all of the data has been sorted; it just needs
to be written to the output. Until the data is sorted, the less command is
blocked, waiting for input.
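
A small sketch of that last point; the one-second delay is arbitrary and only there to make the effect visible. The line written last comes out first, so sort cannot have emitted anything before it saw end of input:

```shell
# The writer emits "b", pauses, then emits "a" and closes the pipe.
# sort must read to end-of-file before it can print its first line.
first_line=$( { echo b; sleep 1; echo a; } | sort | head -n 1 )
echo "$first_line"
```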

Regards, Dave Hodgins

Lew Pitcher

Apr 23, 2023, 4:41:43 PM

And fcntl(2) says
F_SETPIPE_SZ (int; since Linux 2.6.35)
Change the capacity of the pipe referred to by fd to be at least
arg bytes. An unprivileged process can adjust the pipe capacity
to any value between the system page size and the limit defined
in /proc/sys/fs/pipe-max-size (see proc(5)).

On my Linux (untuned 4.4.301 kernel), /proc/sys/fs/pipe-max-size
is set to
16:35 $ cat /proc/sys/fs/pipe-max-size
1048576
or 1 MiB

> So pipes on Linux aren't very large at all.

... unless you tune them upward.

> I don't know how other Unix systems compare.

I've seen some studies; Linux pipe buffer sizes seem comparable to
other systems, which range in the 20K to 64K default size range, and
top out at about 1 MB.

HTH
--
Lew Pitcher
"In Skills We Trust"

Helmut Waitzmann

Apr 23, 2023, 4:42:24 PM

Helmut Waitzmann <nn.th...@xoxy.net>:

> Look at these sample lines:
>
>
> 1;0
> 1;1
> 1;2
> 0;0
> 0;1
> 0;2
> 2;0
> 2;1
> 2;2
>
>
> To have this sequence of lines sorted in such a way that the
> first field is sorted in ascending numeric order while the
> second is sorted in descending numeric order,

I'm sorry, that is a quite misleading description. What I wanted
to say is that the sequence of lines should be sorted to look
like

0;2
0;1
0;0
1;2
1;1
1;0
2;2
2;1
2;0

and to achieve this…

vallor

Apr 24, 2023, 10:05:45 AM

On Sun, 23 Apr 2023 15:42:00 -0400, John-Paul Stewart wrote:

Could the actual pipe size perhaps be queried
and set with "ulimit"?

$ ulimit -a
[...]
pipe size (512 bytes, -p) 8
[...]

With: GNU bash, version 5.1.16
("help ulimit" for docs on the shell built-in...)

--
-v (Scott)

Janis Papanagnou

Apr 24, 2023, 11:04:04 AM

On 24.04.2023 16:05, vallor wrote:
>
> Could the actual pipe size perhaps be queried
> and set with "ulimit"?
>
> $ ulimit -a
> [...]
> pipe size (512 bytes, -p) 8
> [...]
>
> With: GNU bash, version 5.1.16
> ("help ulimit" for docs on the shell built-in...)

It's quite funny that every shell has its own formats; in bash you
have to do the math (8x512) while in ksh it's 4096. Other quantities
have different scaling, e.g. bytes vs. Kibytes. And some have units
not defined (in ulimit or ulimit --man), like "blocks".

# bash

pipe size (512 bytes, -p) 8

POSIX message queues (bytes, -q) 819200
file size (blocks, -f) unlimited

# ksh
pipe buffer size (bytes) (-p) 4096
message queue size (Kibytes) (-q) 800
file size (blocks) (-f) unlimited

And zsh's ulimit "doesn't know" pipe size?

Janis

Kaz Kylheku

Apr 24, 2023, 12:50:48 PM

On 2023-04-23, David W. Hodgins <dwho...@nomail.afraid.org> wrote:
> On Sun, 23 Apr 2023 15:42:00 -0400, John-Paul Stewart <jpst...@personalprojects.net> wrote:
>> So pipes on Linux aren't very large at all. I don't know how other Unix
>> systems compare.
>
> The pipe only has to store a minimum of one buffer of data. If the process

In fact, I suspect, a pipe doesn't have to store anything. It can be a
pure rendezvous. The write() call can block until the reader performs a
read(), or vice versa, at which time MIN(read_size, write_size) bytes
can be transferred directly between their respective buffers, that value
then being returned from the read and write.

Felix Palmen

Apr 24, 2023, 1:08:07 PM

* Kaz Kylheku <864-11...@kylheku.com>:

> In fact, I suspect, a pipe doesn't have to store anything. It can be a
> pure rendezvous. The write() call can block until the reader performs a
> read(), or vice versa, at which time MIN(read_size, write_size) bytes
> can be transferred directly between their respective buffers, that value
> then being returned from the read and write.

Yes. IIRC, L4 uses some similar mechanism for IPC. It needs support from
the scheduler of course. And to make it most efficient, the size should
be agreed upon on both sides, so that won't work with typical pipe
semantics.

Geoff Clare

Apr 25, 2023, 9:11:08 AM

Janis Papanagnou wrote:

> On 24.04.2023 16:05, vallor wrote:
>>
>> Could the actual pipe size perhaps be queried
>> and set with "ulimit"?
>>
>> $ ulimit -a
>> [...]
>> pipe size (512 bytes, -p) 8
>> [...]
>>
>> With: GNU bash, version 5.1.16
>> ("help ulimit" for docs on the shell built-in...)
>
> It's quite funny that every shell has its own formats; in bash you
> have to do the math (8x512) while in ksh it's 4096.

I believe the value ulimit is giving here is PIPE_BUF, not the
capacity of the pipe.

On my Linux system, much more than 4096 bytes can be written to
a pipe without anything being read from it:

$ dd if=/dev/zero | sleep 10
^C129+0 records in
128+0 records out
65536 bytes (66 kB, 64 KiB) copied, 2.04325 s, 32.1 kB/s

(I used Ctrl-C to send dd a SIGINT.)

$ ulimit -a | grep pipe

pipe size (512 bytes, -p) 8

$ getconf PIPE_BUF .
4096

In any case, on some systems "pipe capacity" is not a simple concept.
SVR4's STREAMS-based pipes have separate high-water and low-water
thresholds. (The writer blocks when high-water is reached but
doesn't unblock until enough has been read to take the level below
low-water.)
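
The PIPE_BUF side of this can be checked from the shell. A sketch, assuming a system that provides getconf: PIPE_BUF is the largest write to a pipe that is guaranteed to be atomic, which POSIX requires to be at least 512 bytes, and it says nothing about how much a pipe can hold in total.

```shell
# PIPE_BUF: maximum size of a write to a pipe that is guaranteed atomic.
pipe_buf=$(getconf PIPE_BUF /)
echo "PIPE_BUF = $pipe_buf bytes"

# Per the observation above, bash's 'ulimit -p' appears to report this
# value in 512-byte units rather than the pipe's capacity.
echo "in 512-byte blocks: $((pipe_buf / 512))"
```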

--
Geoff Clare <net...@gclare.org.uk>

Kenny McCormack

Apr 25, 2023, 9:29:53 AM

In article <6tukhj-...@ID-313840.user.individual.net>,
Geoff Clare <net...@gclare.org.uk> wrote:
...

>On my Linux system, much more than 4096 bytes can be written to
>a pipe without anything being read from it:
>
>$ dd if=/dev/zero | sleep 10
>^C129+0 records in
>128+0 records out
>65536 bytes (66 kB, 64 KiB) copied, 2.04325 s, 32.1 kB/s
>
>(I used Ctrl-C to send dd a SIGINT.)

Didn't somebody say upthread that the default limit on Linux is 64K?
So, kinda funny that you chose exactly 64K for your demonstration.

Anyway, you can (according to those same people) bump it up to 1M if
needed.

--
People who want to share their religious views with you
almost never want you to share yours with them. -- Dave Barry

David W. Hodgins

Apr 25, 2023, 11:01:20 AM

On Tue, 25 Apr 2023 09:29:48 -0400, Kenny McCormack <gaz...@shell.xmission.com> wrote:

> In article <6tukhj-...@ID-313840.user.individual.net>,
> Geoff Clare <net...@gclare.org.uk> wrote:
> ...
>> On my Linux system, much more than 4096 bytes can be written to
>> a pipe without anything being read from it:
>>
>> $ dd if=/dev/zero | sleep 10
>> ^C129+0 records in
>> 128+0 records out
>> 65536 bytes (66 kB, 64 KiB) copied, 2.04325 s, 32.1 kB/s
>>
>> (I used Ctrl-C to send dd a SIGINT.)
>
> Didn't somebody say upthread that the default limit on Linux is 64K?
> So, kinda funny that you chose exactly 64K for your demonstration.
>
> Anyway, you can (according to those same people) bump it up to 1M if
> needed.

It stopped after filling the output buffer, not the pipe. That data was still
waiting to be written to the pipe when the dd command was terminated.

Regards, Dave Hodgins

Geoff Clare

Apr 26, 2023, 8:41:08 AM

David W. Hodgins wrote:

> On Tue, 25 Apr 2023 09:29:48 -0400, Kenny McCormack <gaz...@shell.xmission.com> wrote:
>
>> In article <6tukhj-...@ID-313840.user.individual.net>,
>> Geoff Clare <net...@gclare.org.uk> wrote:
>> ...
>>> On my Linux system, much more than 4096 bytes can be written to
>>> a pipe without anything being read from it:
>>>
>>> $ dd if=/dev/zero | sleep 10
>>> ^C129+0 records in
>>> 128+0 records out
>>> 65536 bytes (66 kB, 64 KiB) copied, 2.04325 s, 32.1 kB/s
>>>
>>> (I used Ctrl-C to send dd a SIGINT.)
>>
>> Didn't somebody say upthread that the default limit on Linux is 64K?
>> So, kinda funny that you chose exactly 64K for your demonstration.

I didn't actively choose 64K. I haven't ever changed the pipe size on
a Linux system, so the size used was whatever is the default.

>> Anyway, you can (according to those same people) bump it up to 1M if
>> needed.
>
> It stopped after filling the output buffer, not the pipe. That data was still
> waiting to be written to the pipe when the dd command was terminated.

Only one block (of 512 bytes) was waiting to be written. A feature
of dd is that it reads and writes exactly the block sizes you tell it
to (or 512 bytes by default). The dd output:

128+0 records out

means it had successfully written 128 blocks (of 512 bytes) to the pipe
when it exited. The "129+0 records in" is what shows it had read one
extra block that was waiting to be written.

If I tell dd to read and write one byte at a time, it does exactly that:

$ dd bs=1 if=/dev/zero | sleep 10
^C65537+0 records in
65536+0 records out
65536 bytes (66 kB, 64 KiB) copied, 4.88295 s, 13.4 kB/s
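
The arithmetic behind both runs, as a quick shell check. The 4096-byte page size is the figure quoted from pipe(7) earlier in the thread, not something measured here, and the variable names are made up for the example:

```shell
# First run: 128 blocks of 512 bytes were written before dd blocked.
run_512=$((128 * 512))
# Second run: 65536 blocks of 1 byte each.
run_1=$((65536 * 1))
# Default Linux pipe capacity per pipe(7): 16 pages of 4096 bytes.
capacity=$((16 * 4096))

echo "$run_512 $run_1 $capacity"
```

All three come to 65536 bytes, which is consistent with dd having filled the pipe exactly in both runs.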

--
Geoff Clare <net...@gclare.org.uk>

Eric Pozharski

Apr 26, 2023, 1:33:12 PM

with <u265ou$csqa$1...@dont-email.me> Janis Papanagnou wrote:
> On 24.04.2023 16:05, vallor wrote:

>> Could the actual pipe size perhaps be queried and set with "ulimit"?

*SKIP*

> And zsh's ulimit "doesn't know" pipe size?

Funny thing, looking through /usr/include/**/resource.h suggests that size
of pipe has nothing to do with setrlimit(2) or ulimit(3). Weird.

--
Torvalds' goal for Linux is very simple: World Domination
Stallman's goal for GNU is even simpler: Freedom

Chris Elvidge

Apr 29, 2023, 7:01:22 AM

On 29/04/2023 11:01, Martin Τrautmann wrote:
> On Sun, 23 Apr 2023 09:43:06 -0400, David W. Hodgins wrote:

>> On Sun, 23 Apr 2023 07:28:22 -0400, Martin Τrautmann <t-us...@gmx.net> wrote:
>>> That was my problem - I expected that a pipe through several sorts would
>>> keep the order. I don't know why it doesn't.
>>

>> It may be easier to understand if you use temporary files instead of pipes.
>>
>> Sorting the input file by column 4, numerical creating a first temporary file.
>> Sort the first temporary file by column 2 creating a second temporary file.
>> Sort the second temporary file by column 3 creating the output.
>>
>> The last sort doesn't know that the prior two sorts have been done. It just
>> looks at the file it's given and sorts it by column 3.
>>
>> Using a pipe just takes the output of the first and second sort and uses it
>> directly as input for the next sort. All the pipe does is eliminate the
>> need for a temporary file.
>
> But if I sort by one column only, then through the pipe by another
> column only, the second sort SHOULD respect the previous sort.
> Unfortunately, I feel it doesn't.

Of course it doesn't. How does the second sort know that the first sort
even happened?

>
>> Keep in mind. When sorting a file, the last line in the input may end up becoming
>> the first line in the output. The sort can not write anything to the pipe or
>> output file until it's sorted the entire input. With a pipe, the temporary
>> file is in ram rather than being a named file on disk.
>
> So the sort via a file actually should work the same as via the pipe?
>

--
Chris Elvidge
England

Richard Harnden

Apr 29, 2023, 8:39:04 AM

On 29/04/2023 13:12, Martin Τrautmann wrote:

> On Sat, 29 Apr 2023 12:01:14 +0100, Chris Elvidge wrote:
>> On 29/04/2023 11:01, Martin Τrautmann wrote:
>>> On Sun, 23 Apr 2023 09:43:06 -0400, David W. Hodgins wrote:
>>>> On Sun, 23 Apr 2023 07:28:22 -0400, Martin Τrautmann <t-us...@gmx.net> wrote:
>>>>> That was my problem - I expected that a pipe through several sorts would
>>>>> keep the order. I don't know why it doesn't.
>>>>
>>>> It may be easier to understand if you use temporary files instead of pipes.
>>>>
>>>> Sorting the input file by column 4, numerical creating a first temporary file.
>>>> Sort the first temporary file by column 2 creating a second temporary file.
>>>> Sort the second temporary file by column 3 creating the output.
>>>>
>>>> The last sort doesn't know that the prior two sorts have been done. It just
>>>> looks at the file it's given and sorts it by column 3.
>>>>
>>>> Using a pipe just takes the output of the first and second sort and uses it
>>>> directly as input for the next sort. All the pipe does is eliminate the
>>>> need for a temporary file.
>>>
>>> But if I sort by one column only, then through the pipe by another
>>> column only, the second sort SHOULD respect the previous sort.
>>> Unfortunately, I feel it doesn't.
>>
>> Of course it doesn't. How does the second sort know that the first sort
>> even happened?
>

> It should sort on the given column only, but keep anything else as it
> was. I guess that's my misconception - however, sort seems to be allowed
> to resort anything else however it likes. That's the difference e.g. to
> an excel spreadsheet, which does keep the former sort.

You want a stable sort, then. Check if you have a '-s' option.
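
A quick way to check, as a sketch: -s is not in POSIX, so this simply tries the option on empty input and looks at the exit status. The variable name is made up for the example.

```shell
# Probe for a stable-sort option without touching any real data.
if sort -s </dev/null >/dev/null 2>&1; then
    has_stable=yes
else
    has_stable=no
fi
echo "sort -s supported: $has_stable"
```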

Lew Pitcher

Apr 29, 2023, 4:33:30 PM

On Sat, 29 Apr 2023 20:23:18 +0200, Martin Τrautmann wrote:

> On Sat, 29 Apr 2023 13:38:58 +0100, Richard Harnden wrote:
>>> It should sort on the given column only, but keep anything else as it
>>> was. I guess that's my misconception - however, sort seems to be allowed
>>> to resort anything else however it likes. That's the difference e.g. to
>>> an excel spreadsheet, which does keep the former sort.
>>
>> You want a stable sort, then. Check if you have a '-s' option.
>

> wow, cool
>
> -s, --stable
> stabilize sort by disabling last-resort comparison
>
> I do not understand what that means. But it worked

From the option summary, the meaning is less than obvious. However
some versions of the manpage include an explanation:

"A pair of lines is compared as follows: if any key fields have
been specified, 'sort' compares each pair of fields, in the
order specified on the command line, according to the associated
ordering options, until a difference is found or no fields are
left.
...
Finally, as a last resort when all keys compare equal (or if no
ordering options were specified at all), 'sort' compares the
entire lines. ... The '-s' (stable) option disables this
last-resort comparison so that lines in which all fields
compare equal are left in their original relative order.
..."

In the case of a file that has already been sorted, either on a
key occurring before the key-to-be-sorted, or on a key that follows
(but is not adjacent to) the key-to-be sorted, this "last resort
comparison" may result in a record that sorts out-of-sequence
with respect to the prior sort order. To ensure that the order
from a prior sort is not lost, you have to disable this "last
resort comparison".

[snip]

David W. Hodgins

Apr 29, 2023, 4:45:51 PM

On Sat, 29 Apr 2023 14:23:18 -0400, Martin Τrautmann <t-us...@gmx.net> wrote:

> On Sat, 29 Apr 2023 13:38:58 +0100, Richard Harnden wrote:

>>> It should sort on the given column only, but keep anything else as it
>>> was. I guess that's my misconception - however, sort seems to be allowed
>>> to resort anything else however it likes. That's the difference e.g. to
>>> an excel spreadsheet, which does keep the former sort.
>>
>> You want a stable sort, then. Check if you have a '-s' option.
>

> wow, cool
>
> -s, --stable
> stabilize sort by disabling last-resort comparison
>
> I do not understand what that means. But it worked

See https://unix.stackexchange.com/questions/64102/why-is-sort-changing-the-order-of-lines-with-identical-sort-keys

Regards, Dave Hodgins

Helmut Waitzmann

May 1, 2023, 7:36:39 AM

Martin Τrautmann <t-us...@gmx.net>:
> On Sat, 29 Apr 2023 12:05:17 +0200, Martin Τrautmann wrote:

>> Would you achieve this via a pipe as well?
>>
>
> When I sort by column 2 first and only, I end up with
>
> 0;2
> 1;2
> 2;2
> 0;1
> 1;1
> 2;1
> 0;0
> 1;0
> 2;0
>
> Why that? I would expect
>
> 1;2
> 0;2
> 2;2
> 1;1
> 0;1
> 2;1
> 1;0
> 0;0
> 2;0
>

[Sorted by using the

sort -t ';' -k 2nr,2

command]

> So why does it resort by first column as well?
>

Because that is the way "sort" is supposed to work. The POSIX
standard, especially the last paragraph in the "OPTIONS" section
(<https://pubs.opengroup.org/onlinepubs/9699919799/utilities/sort.html#tag_20_119_04>)
says: "Except when the -u option is specified, lines that
otherwise compare equal shall be ordered as if none of the
options -d, -f, -i, -n, or -k were present (but with -r still in
effect, if it was specified) and with all bytes in the lines
significant to the comparison. The order in which lines that
still compare equal are written is unspecified."

In your case, the lines "1;2" and "0;2" for example compare equal
when compared according to the "-k 2nr,2" key specification.
Because of that equality, these two equal comparing lines are
ordered as a last resort, as if by the

sort

command line, and that of course will sort "0;2" before "1;2".

With GNU sort, you may specify the "--stable" option (which
unfortunately is not part of the POSIX standard) to suppress that
last resort ordering.
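
The two lines from the example show the effect in isolation. A sketch assuming GNU sort, with LC_ALL=C set only so that the last-resort comparison is byte-wise; the variable names are made up:

```shell
export LC_ALL=C   # assumption: byte-wise comparison for the tie-break

# Both lines have the same key (field 2), so only the tie-break differs.
default_order=$(printf '1;2\n0;2\n' | sort -t ';' -k 2nr,2)
stable_order=$(printf '1;2\n0;2\n' | sort -s -t ';' -k 2nr,2)

printf 'default: %s\n' "$(printf '%s' "$default_order" | tr '\n' ' ')"
printf 'stable:  %s\n' "$(printf '%s' "$stable_order" | tr '\n' ' ')"
```

Without -s the whole-line comparison puts "0;2" first; with -s the input order "1;2", "0;2" is kept.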

> Since it does that, both a pipe and a second sort from a
> temporary file still fail, since they also ignore the temporary
> sort of the other column.

Yes, "sort" without the GNU "sort" "--stable" option will always
do a total ordering, ignoring and destroying any order that has
been done to its input before. That's what we've been discussing
the whole thread and that's what makes the GNU "sort" "--stable"
option a nice thing to have.
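
Applied to the question that started the thread (column 4 numeric, then column 2, then column 3), the two approaches can be compared as below. This is a sketch assuming GNU sort; the rows are taken from the original example but cut down to their first four fields, and LC_ALL=C is set only to keep comparisons byte-wise.

```shell
export LC_ALL=C   # assumption: byte-wise comparison throughout

rows='Borgentreich;T2960;Lindenstätte;12
Borgentreich;D9444;Auf der Lindenstätte;1
Borgentreich;T2960;Lindenstätte;1
Borgentreich;D9386;Lindenstätte;1
Borgentreich;D9444;Auf der Lindenstätte;2
Borgentreich;T2960;Lindenstätte;2'

# All three keys in one invocation, most significant first.
one_pass=$(printf '%s\n' "$rows" | sort -t ';' -k 4n,4 -k 2,2 -k 3,3)

# One key per invocation: least significant key first, and every later
# pass stable so that it keeps the order established before it.
pipeline=$(printf '%s\n' "$rows" |
    sort -t ';' -k 3,3 |
    sort -s -t ';' -k 2,2 |
    sort -s -t ';' -k 4n,4)

[ "$one_pass" = "$pipeline" ] && echo "same order"
printf '%s\n' "$one_pass" | cut -d ';' -f 2,4
```

Note the reversed order of the keys in the pipeline: the last sort in the chain decides the primary order, so the most significant key has to come last.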

Helmut Waitzmann

May 1, 2023, 7:36:41 AM

Martin Τrautmann <t-us...@gmx.net>:

> So the sort via a file actually should work the same as via the
> pipe?

Yes. At least the result will be the same. When using a pipe,
the first sort must hold the data either in its virtual memory, if
they fit into it, or else in a temporary file.

Ben Bacarisse

May 1, 2023, 10:02:32 AM

Helmut Waitzmann <nn.th...@xoxy.net> writes:

Sorry, piggybacking...

> Yes, "sort" without the GNU "sort" "--stable" option will always do a
> total ordering, ignoring and destroying any order that has been done to
> its input before. That's what we've been discussing the whole thread and
> that's what makes the GNU "sort" "--stable" option a nice thing to
> have.

There's an old trick that was common back in the day of adding a line
number (or similar) and then removing it. You could then either
explicitly sort on that number or make sure that the number has leading
zeros so the default sort restores the original order:

nl -n rz data | sort -t ';' -k 2nr,2 | cut -f2-
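
A worked run of that pipeline on a few of the sample lines, as a sketch. The input is fed on standard input instead of from a file named data, and LC_ALL=C is set only to keep the last-resort comparison byte-wise:

```shell
export LC_ALL=C   # assumption: byte-wise comparison for the tie-break

# nl -n rz prefixes each line with a zero-padded number and a tab, so the
# last-resort whole-line comparison falls back to the original position.
decorated=$(printf '%s\n' '1;1' '1;2' '0;2' '0;1' '2;2' |
    nl -n rz |
    sort -t ';' -k 2nr,2 |
    cut -f 2-)

printf '%s\n' "$decorated"
```

Lines with equal second fields come out in their input order ("1;2", "0;2", "2;2", then "1;1", "0;1"), which is what a stable sort would give.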

--
Ben.

Lew Pitcher

May 1, 2023, 2:57:20 PM

On Mon, 01 May 2023 20:27:57 +0200, Martin Τrautmann wrote:

> On Mon, 01 May 2023 13:19:24 +0200, Helmut Waitzmann wrote:
>>> So why does it resort by first column as well?
>>>
>>
>> Because that is the way "sort" is supposed to work.
>

> How should I know that this is supposed that way? If I tell "sort" to
> sort by a certain column only, why would I have to expect that it will
> sort by something else as well?

As Helmut said, "because that is the way 'sort' is supposed to work".
The Open Group defines the interface and results for each of the common
'Unix' utilities, "sort[1]" included, and their definition of sort says
that
"When there are multiple key fields, later keys shall be compared
only after all earlier keys compare equal. ... [L]ines that otherwise
compare equal shall be ordered as if none of the options -d, -f, -i,
-n, or -k were present ... and with all bytes in the lines
significant to the comparison."

The "Rationale" section /does/ seem to give implementations some leeway:
"Implementations are encouraged to perform the recommended further
byte-by-byte comparison of lines that collate equally, even though
this may affect efficiency."
The key phrase here is "are encouraged", implying that this behaviour,
while specified, is not absolutely required.

[1] https://pubs.opengroup.org/onlinepubs/9699919799/utilities/sort.html

Kaz Kylheku

May 1, 2023, 3:13:10 PM

On 2023-05-01, Martin Τrautmann <t-us...@gmx.net> wrote:

> On Mon, 01 May 2023 13:19:24 +0200, Helmut Waitzmann wrote:
>>> So why does it resort by first column as well?
>>>
>>
>> Because that is the way "sort" is supposed to work.
>

> How should I know that this is supposed that way? If I tell "sort" to
> sort by a certain column only, why would I have to expect that it will
> sort by something else as well?

Sorting, in computer science, may be stable or unstable. If you've
not read the documentation of a sorting system thoroughly,
you have no basis for expecting it to be one way or the other.

When you expect stable sort, you're still expecting "sorting by
something else".

Under stable sort, all records are imagined to be put into
correspondence with the natural numbers, in their original input order.
When two records are considered equal by the sorting comparison
function, they are in fact not considered equal but further
compared by their original order number: the record with the lower
number is considered lesser.

Some sorting algorithms achieve that behavior implicitly, by never
exchanging the relative position of items that are equal by the sorting
comparison function. Algorithms which only move an element past
neighbours that compare strictly unequal, or which merge runs taking
the earlier item on ties, will be like this: merge sort, insertion
sort, bubble sort. (Shell sort is not: its long-distance exchanges can
carry an item past an equal one.)

Some sorting algorithms, like quicksort, will wreck the original order
for equal keys. Quicksort has a partitioning step whereby it chooses
some middle key value, and then separates records into two groups: those
higher and those lower.

If all you know is that some program sorts, you have no idea
which kind of algorithm it is using.

Helmut Waitzmann

May 1, 2023, 3:27:26 PM

Ben Bacarisse <ben.u...@bsb.me.uk>:

> Helmut Waitzmann <nn.th...@xoxy.net> writes:
>
>> Yes, "sort" without the GNU "sort" "--stable" option will
>> always do a total ordering, ignoring and destroying any order
>> that has been done to its input before. That's what we've been
>> discussing the whole thread and that's what makes the GNU
>> "sort" "--stable" option a nice thing to have.
>
> There's an old trick that was common back in the day of adding a
> line number (or similar) and then removing it. You could then
> either explicitly sort on that number or make sure that the
> number has leading zeros so the default sort restores the
> original order:
>
> nl -n rz data | sort -t ';' -k 2nr,2 | cut -f2-
>

I'm stunned. Thank you for presenting this solution! And thank
you, Martin, for initiating this interesting topic!

I prefer

grep -F -n -- ''

over

nl -n rz

though, because it doesn't get confused by header, body, and
footer lines (see the "nl(1)" manual):

# Sort according to the second numerical field, descending:
#
sort -t ';' -k 2nr,2 |

# To each line, prepend an additional numeric field, ascending,
# separated by a ";", in order to "save" the sorted result for
# later retrieval thus making the second sort below a "stable"
# one:
#
grep -F -n -- '' | sed -e 's/:/;/' |

# Sort according to the original first - now second - numerical
# field, ascending, and the "saved" sort in the first numerical
# field, ascending, thus getting a stable sort:
#
sort -t ';' -k 2nb,2 -k 1nb,1 |

# Finally remove the leading field of the saved sort result:
#
cut -d ';' -f 2-

Of course this is no better than just doing

sort -t ';' -k 1n,1 -k 2nr,2

but it might be helpful when there are either more than 10 sort
keys (POSIX only requires that "sort" shall at least allow 10
sort keys) or different delimiters ("-t" option), which can't be
specified in one sort invocation.

Keith Thompson

May 1, 2023, 8:49:35 PM

Martin Τrautmann <t-us...@gmx.net> writes:

> On Mon, 1 May 2023 18:57:13 -0000 (UTC), Lew Pitcher wrote:
>> On Mon, 01 May 2023 20:27:57 +0200, Martin Τrautmann wrote:
>>> On Mon, 01 May 2023 13:19:24 +0200, Helmut Waitzmann wrote:
>>>>> So why does it resort by first column as well?
>>>>
>>>> Because that is the way "sort" is supposed to work.
>>>
>>> How should I know that this is supposed that way? If I tell "sort" to
>>> sort by a certain column only, why would I have to expect that it will
>>> sort by something else as well?
>>
>> As Helmut said, "because that is the way 'sort' is supposed to work".
>> The Open Group defines the interface and results for each of the common
>> 'Unix' utilities, "sort[1]" included, and their definition of sort says
>> that
>> "When there are multiple key fields, later keys shall be compared
>> only after all earlier keys compare equal. ... [L]ines that otherwise
>> compare equal shall be ordered as if none of the options -d, -f, -i,
>> -n, or -k were present ... and with all bytes in the lines
>> significant to the comparison."
>

> So where is that information available on my computer? Sorry, but I
> really did not think about using a genealogy search first to find out
> how someone thought something should behave. No, it was not obvious to
> me. When -k tells me about first and last key to sort by, I just did not
> expect a bonus sort.

Nobody is expecting you to know this inherently. Helmut told you
'Because that is the way "sort" is supposed to work.' I don't think he
meant to imply that there was anything wrong with you for not already
knowing it. You asked; he answered.

You *should* be able to get this information with `man sort`. If you
have the GNU coreutils implementation of sort, the man page doesn't
mention re-sorting by the whole line (which is IMHO unfortunate), but at
the bottom of the man page there is a reference to the full
documentation:

Full documentation <https://www.gnu.org/software/coreutils/sort>
or available locally via: info '(coreutils) sort invocation'

If you have an implementation other than GNU coreutils, `man sort` is
likely to describe it in more detail. `sort --help` is also a good
thing to try.

It's also good to know about the POSIX standard:
<https://pubs.opengroup.org/onlinepubs/9699919799/toc.htm>
This is the standard for the behavior of Unix tools, but not all
implementations follow it completely, and most provide extra
functionality.

--
Keith Thompson (The_Other_Keith) Keith.S.T...@gmail.com
Working, but not speaking, for XCOM Labs
void Void(void) { Void(); } /* The recursive call of the void */

Spiros Bousbouras

May 2, 2023, 7:24:52 AM

On Tue, 2 May 2023 12:43:00 +0200
Martin Τrautmann <t-us...@gmx.net> wrote:

> On Mon, 01 May 2023 17:49:24 -0700, Keith Thompson wrote:
> > You *should* be able to get this information with `man sort`. If you
> > have the GNU coreutils implementation of sort, the man page doesn't
> > mention re-sorting by the whole line (which is IMHO unfortunate), but at
> > the bottom of the man page there is a reference to the full
> > documentation:
> >
> > Full documentation <https://www.gnu.org/software/coreutils/sort>
> > or available locally via: info '(coreutils) sort invocation'
> >
> > If you have an implementation other than GNU coreutils, `man sort` is
> > likely to describe it in more detail. `sort --help` is also a good
> > thing to try.
>

> No, mine says
>
> SEE ALSO
> The full documentation for sort is maintained as a Texinfo
> manual. If the info and sort programs are properly installed at your
> site, the command
>
> info sort
>
> should give you access to the complete manual.
>
> sort 5.93 November 2005
> SORT(1)
>
> And info sort does not provide more details here.

On the other hand
https://www.gnu.org/software/coreutils/manual/html_node/sort-invocation.html
says

A pair of lines is compared as follows: sort compares each pair of
fields (see --key), in the order specified on the command line,
according to the associated ordering options, until a difference is
found or no fields are left. If no key fields are specified, sort uses a
default key of the entire line. Finally, as a last resort when all keys
compare equal, sort compares entire lines as if no ordering options
other than --reverse (-r) were specified. The --stable (-s) option
disables this last-resort comparison so that lines in which all fields
compare equal are left in their original relative order. The --unique
(-u) option also disables the last-resort comparison.

With GNU software it is worth checking (and perhaps also downloading) the
online documentation, which usually has a lot more detail than what is
automatically installed, whether that is the info or the man pages. (GNU
tar is a prominent example.)
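
A small illustration of the last-resort comparison quoted above. This is only a sketch on made-up input, and it assumes a sort that supports -s (GNU and BSD sort do). LC_ALL=C is set just to make the byte ordering predictable.

```shell
# All three lines have the same key (field 2), so only the tie-breaking differs.
input='b;1
a;1
c;1'

# Default: ties on the key are broken by comparing the entire line.
printf '%s\n' "$input" | LC_ALL=C sort -t';' -k2,2n
# prints a;1, b;1, c;1 (one per line)

# With -s the last-resort comparison is disabled, so input order is kept.
printf '%s\n' "$input" | LC_ALL=C sort -s -t';' -k2,2n
# prints b;1, a;1, c;1 (one per line)
```

This is the "bonus sort" that surprised the original poster: without -s, lines that tie on every key still get ordered by their full text.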

Keith Thompson

May 2, 2023, 4:43:02 PM

Martin Τrautmann <t-us...@gmx.net> writes:
> On Mon, 01 May 2023 17:49:24 -0700, Keith Thompson wrote:

>> You *should* be able to get this information with `man sort`. If you
>> have the GNU coreutils implementation of sort, the man page doesn't
>> mention re-sorting by the whole line (which is IMHO unfortunate), but at
>> the bottom of the man page there is a reference to the full
>> documentation:
>>
>> Full documentation <https://www.gnu.org/software/coreutils/sort>
>> or available locally via: info '(coreutils) sort invocation'
>>
>> If you have an implementation other than GNU coreutils, `man sort` is
>> likely to describe it in more detail. `sort --help` is also a good
>> thing to try.
>

> No, mine says
>
> SEE ALSO
> The full documentation for sort is maintained as a Texinfo
> manual. If the info and sort programs are properly installed at your
> site, the command
>
> info sort
>
> should give you access to the complete manual.
>
> sort 5.93 November 2005
> SORT(1)
>
> And info sort does not provide more details here.

Yours is quite old. If you don't have the "info" documentation
installed, "info sort" falls back to showing you the man page. Is that
what you're seeing? The info documentation for COREUTILS-5_92 (which I
presume is very close to the version you have) says:

If no key fields are specified, @command{sort} uses a default key of
the entire line. Finally, as a last resort when all keys compare
equal, @command{sort} compares entire lines as if no ordering options
other than @option{--reverse} (@option{-r}) were specified. The
@option{--stable} (@option{-s}) option disables this @dfn{last-resort
comparison} so that lines in which all fields compare equal are left
in their original relative order. The @option{--unique}
(@option{-u}) option also disables the last-resort comparison.

(That's from the raw coreutils.texi file; the info documentation is
generated from it.)

At least on modern Ubuntu, the "coreutils" package installs all the
documentation. Perhaps on your system the tools and the documentation
are in separate packages, for example "coreutils" and "coreutils-doc".

[...]

Dr Eberhard W Lisse

May 5, 2023, 4:35:12 AM

mlr --fs 'semicolon' --ocsv --hi --ho --from t.ssv sort -n 4 -f 2,3

mfg, el

On 19/04/2023 09:27, Martin Τrautmann wrote:
>
> Hi all,
>
> how do I sort by multiple columns?
>
> Example:
> +++
> Borgentreich;D9386;Lindenstätte;1;;32;520150.696;5709236.354
> Borgentreich;D9444;Auf der Lindenstätte;1;;32;519950.850;5708982.109
> Borgentreich;D9444;Auf der Lindenstätte;2;;32;519926.937;5708966.116
> Borgentreich;D9444;Auf der Lindenstätte;3;;32;520008.619;5709083.464
> Borgentreich;D9444;Auf der Lindenstätte;4;;32;519860.278;5709041.468
> Borgentreich;T2960;Lindenstätte;12;;32;519622.835;5709023.590
> Borgentreich;T2960;Lindenstätte;6;;32;519696.745;5709038.833
> Borgentreich;T2960;Lindenstätte;4;;32;519722.956;5709043.915
> Borgentreich;T2960;Lindenstätte;15;;32;519489.638;5709077.693
> Borgentreich;T2960;Lindenstätte;24;;32;519518.763;5709090.026
> Borgentreich;T2960;Lindenstätte;18;;32;519559.108;5709037.356
> Borgentreich;T2960;Lindenstätte;14;;32;519596.623;5709013.684
> Borgentreich;T2960;Lindenstätte;16;;32;519569.141;5709017.854
> Borgentreich;T2960;Lindenstätte;22;;32;519540.257;5709072.032
> Borgentreich;T2960;Lindenstätte;26;;32;519503.270;5709103.321
> Borgentreich;T2960;Lindenstätte;2;;32;519758.267;5709057.635
> Borgentreich;T2960;Lindenstätte;10;;32;519648.417;5709028.865
> Borgentreich;T2960;Lindenstätte;11;;32;519607.438;5708989.545
> Borgentreich;T2960;Lindenstätte;3;;32;519732.686;5709020.833
> Borgentreich;T2960;Lindenstätte;7;;32;519678.983;5709007.380
> Borgentreich;T2960;Lindenstätte;9;;32;519651.859;5709000.462
> Borgentreich;T2960;Lindenstätte;5;;32;519708.841;5709015.137
> Borgentreich;T2960;Lindenstätte;1;;32;519778.725;5709026.584
> Borgentreich;T2960;Lindenstätte;8;;32;519673.036;5709040.372
> +++
>
> I want to sort
> * first by column 4, numerical,
> * second by column 2
> * third by column 3
>
> So the result should be
> +++
> Borgentreich;D9444;Auf der Lindenstätte;1;;32;519950.850;5708982.109
> Borgentreich;D9444;Auf der Lindenstätte;2;;32;519926.937;5708966.116
> Borgentreich;D9444;Auf der Lindenstätte;3;;32;520008.619;5709083.464
> Borgentreich;D9444;Auf der Lindenstätte;4;;32;519860.278;5709041.468
> Borgentreich;D9386;Lindenstätte;1;;32;520150.696;5709236.354
> Borgentreich;T2960;Lindenstätte;1;;32;519778.725;5709026.584
> Borgentreich;T2960;Lindenstätte;2;;32;519758.267;5709057.635
> Borgentreich;T2960;Lindenstätte;3;;32;519732.686;5709020.833
> Borgentreich;T2960;Lindenstätte;4;;32;519722.956;5709043.915
> Borgentreich;T2960;Lindenstätte;5;;32;519708.841;5709015.137
> Borgentreich;T2960;Lindenstätte;6;;32;519696.745;5709038.833
> Borgentreich;T2960;Lindenstätte;7;;32;519678.983;5709007.380
> Borgentreich;T2960;Lindenstätte;8;;32;519673.036;5709040.372
> Borgentreich;T2960;Lindenstätte;9;;32;519651.859;5709000.462
> Borgentreich;T2960;Lindenstätte;10;;32;519648.417;5709028.865
> Borgentreich;T2960;Lindenstätte;11;;32;519607.438;5708989.545
> Borgentreich;T2960;Lindenstätte;12;;32;519622.835;5709023.590
> Borgentreich;T2960;Lindenstätte;14;;32;519596.623;5709013.684
> Borgentreich;T2960;Lindenstätte;15;;32;519489.638;5709077.693
> Borgentreich;T2960;Lindenstätte;16;;32;519569.141;5709017.854
> Borgentreich;T2960;Lindenstätte;18;;32;519559.108;5709037.356
> Borgentreich;T2960;Lindenstätte;22;;32;519540.257;5709072.032
> Borgentreich;T2960;Lindenstätte;24;;32;519518.763;5709090.026
> Borgentreich;T2960;Lindenstätte;26;;32;519503.270;5709103.321
> +++
>
> I tried both
> sort -k4 -t";" -n | sort -k2,2 -t";" | sort -k3,3 -t";"
> and
> sort -k4 -t";" -n -k2,2 -k3,3
> and some permutations and reverted orders, without success.
> The sort by column 4 just gets lost or resorted.
>
> I'm not sure about the man page
> -k, --key=POS1[,POS2]
> start a key at POS1, end it at POS2 (origin 1)
>
> So I tried relative positions with
> -k3,1
> as well, without success.
>
> How do I apply the sort syntax properly?
>
> Thanks
> Martin

Martin Τrautmann

May 5, 2023, 8:25:35 AM

On Fri, 5 May 2023 10:35:01 +0200, Dr Eberhard W Lisse wrote:
> mlr --fs 'semicolon' --ocsv --hi --ho --from t.ssv sort -n 4 -f 2,3

miller looks very powerful to me, but unfortunately it's not available here.

Kenny McCormack

May 5, 2023, 8:27:02 AM

In article <slrnu59tbj....@ID-685.user.individual.de>,

Some sort of import/export restriction in your country?

--
If you ask a Trumper who is to blame for the debacle of Jan 6, they will almost certainly say
something about Antifa/BLM/something/whatever. This shows just how screwed up they are; they can't
even get their narrative straight. What they *should* say is "Eugene Goodman". If not for him, the plot
would probably have succeeded, so he (Eugene) is clearly to blame for the failure.

Kenny McCormack

May 5, 2023, 10:54:34 AM

In article <slrnu5a50m....@ID-685.user.individual.de>,
Martin Τrautmann <tr...@gmx.de> wrote:
>On Fri, 5 May 2023 12:26:56 -0000 (UTC), Kenny McCormack wrote:
>> In article <slrnu59tbj....@ID-685.user.individual.de>,
>> Martin Τrautmann <tr...@gmx.de> wrote:
>>>On Fri, 5 May 2023 10:35:01 +0200, Dr Eberhard W Lisse wrote:
>>>> mlr --fs 'semicolon' --ocsv --hi --ho --from t.ssv sort -n 4 -f 2,3
>>>
>>>miller looks very powerful to me, but unfortunately it's not available here.
>>
>> Some sort of import/export restriction in your country?
>

>Error: Port miller requires a full Xcode installation, which was not
>found on your system.
>
>...and I've not enough space for that, 256 GB SSD only.

(Quoting Arte Johnson) Interesting... But not very.

I did a little research on "Miller" (which I had never heard of before
today). I don't see the point of it. It just seems like another AWK (or
Perl or Ruby or Python or ...). I.e., what I am saying is that the only
reason to use the "traditional" Unix tools (cut, join, comm, sort, sed,
grep, etc, etc) is because you just don't want to learn anything new (*).
If you were going to learn something new, just learn AWK - which can do all
the things that any/all of those "traditional" tools can do - and much more.
Why bother to learn this "Miller" thing?

(*) Or for pedagogical reasons. I sometimes see people - who obviously
know better - post sed | grep | cut | sed | sort | grep | ...
pipelines on various boards because they assume their audience is more
comfortable that way.

Finally, I note that you mentioned Xcode. That's a Mac/Apple thing. Is
this "Miller" a specifically Apple thing?

--
To be evangelical is to spend every waking moment hovering around
two emotional states: fear and rage. Evangelicals are seriously the
angriest and most vicious bunch of self-pitying, constantly-moaning
whinybutts I've ever encountered.

gerg

May 5, 2023, 7:46:06 PM

In article <slrnu5a50m....@ID-685.user.individual.de>,

Martin Τrautmann <tr...@gmx.de> wrote:
>On Fri, 5 May 2023 12:26:56 -0000 (UTC), Kenny McCormack wrote:

>> In article <slrnu59tbj....@ID-685.user.individual.de>,
>> Martin Τrautmann <tr...@gmx.de> wrote:
>>>On Fri, 5 May 2023 10:35:01 +0200, Dr Eberhard W Lisse wrote:
>>>> mlr --fs 'semicolon' --ocsv --hi --ho --from t.ssv sort -n 4 -f 2,3
>>>
>>>miller looks very powerful to me, but unfortunately it's not available here.
>>
>> Some sort of import/export restriction in your country?
>

>Error: Port miller requires a full Xcode installation, which was not
>found on your system.
>
>...and I've not enough space for that, 256 GB SSD only.
>

Homebrew is a thing on MacOS. A thing that seems to include miller v6.7.0.
<https://formulae.brew.sh/formula/miller#default>

(Homebrew only needs the Xcode runtime, not the full install)

--
::::::::::::: Greg Andrews ::::: ge...@panix.com :::::::::::::
I have a map of the United States that's actual size.
-- Steven Wright

Popping Mad

May 6, 2023, 2:04:38 AM

On 4/19/23 03:27, Martin Τrautmann wrote:
>
> Hi all,
>
> how do I sort by multiple columns?
>

awk

Martin Τrautmann

May 6, 2023, 3:54:56 AM

On Fri, 5 May 2023 14:53:24 -0000 (UTC), Kenny McCormack wrote:
> Finally, I note that you mentioned Xcode. That's a Mac/Apple thing. Is
> this "Miller" a specifically Apple thing?

Absolutely not. But if I want to install it the easy way I do a
sudo port install miller
which requires more of the Xcode installation than I actually have.

Kaz Kylheku

May 6, 2023, 5:47:10 AM

On 2023-05-06, Martin Τrautmann <t-us...@gmx.net> wrote:

> On Sat, 6 May 2023 02:03:24 -0400, Popping Mad wrote:
>> On 4/19/23 03:27, Martin Τrautmann wrote:
>>>
>>> Hi all,
>>>
>>> how do I sort by multiple columns?
>>>
>>
>> awk
>

> Nope. "awk" alone does not do the job.

You may be able to cobble something together with GNU Awk, which provides:

- control over the traversal order of an associative array (via
PROCINFO["sorted_in"]), so that a for loop visits it in sorted order.

- the asort function for sorting an associative array by its values
(the indices are clobbered to a 1..N enumeration).

- the asorti function, which sorts the indices instead: they
become the values, and the indices go to 1..N.

A user-defined comparison function can be used with asort and asorti;
it receives all four relevant values: left key and value,
right key and value.
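
As a rough sketch of the asort route with a user-defined comparison: this assumes GNU Awk 4.0 or later installed as gawk, and the names sort_multi_gawk and by_4n_2_3 are made up for the illustration. It orders by field 4 numerically, then by fields 2 and 3 as strings.

```shell
# Hypothetical helper: sort semicolon-separated lines with gawk alone.
sort_multi_gawk() {
    gawk '
    # Comparison function for asort(): i1/i2 are indices, v1/v2 the lines;
    # a and b are local scratch arrays.
    function by_4n_2_3(i1, v1, i2, v2,    a, b) {
        split(v1, a, ";"); split(v2, b, ";")
        # Field 4: adding 0 forces a numeric comparison.
        if (a[4] + 0 != b[4] + 0) return (a[4] + 0 < b[4] + 0) ? -1 : 1
        # Fields 2 and 3: appending "" forces a string comparison.
        if ((a[2] "") != (b[2] "")) return ((a[2] "") < (b[2] "")) ? -1 : 1
        if ((a[3] "") != (b[3] "")) return ((a[3] "") < (b[3] "")) ? -1 : 1
        return 0
    }
    { lines[NR] = $0 }    # collect every input line
    END {
        n = asort(lines, sorted, "by_4n_2_3")
        for (i = 1; i <= n; i++) print sorted[i]
    }' "$@"
}
```

It reads the files named as arguments, or standard input if there are none. Unlike sort, it holds the whole input in memory.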

--
TXR Programming Language: http://nongnu.org/txr
Cygnal: Cygwin Native Application Library: http://kylheku.com/cygnal
Mastodon: @Kazi...@mstdn.ca

Kenny McCormack

May 6, 2023, 7:50:57 AM

In article <slrnu5c1i5....@ID-685.user.individual.de>,

Martin Τrautmann <tr...@gmx.de> wrote:
>On Sat, 6 May 2023 02:03:24 -0400, Popping Mad wrote:
>> On 4/19/23 03:27, Martin Τrautmann wrote:
>>>
>>> Hi all,
>>>
>>> how do I sort by multiple columns?
>>>
>>
>> awk
>

>Nope. "awk" alone does not do the job.

Yes. As Kaz explains.

--
"Women should not be enlightened or educated in any way. They should be
segregated because they are the cause of unholy erections in holy men.

-- Saint Augustine (354-430) --

Dr Eberhard W Lisse

May 6, 2023, 10:01:21 AM

I use Homebrew on my Macs, which has it but you can also pull the
tar.gz from

https://github.com/johnkerl/miller/releases/tag/v6.7.0

and after extracting cd to the directory and run

xattr -cr mlr

greetings, el

On 05/05/2023 16:35, Martin Τrautmann wrote:
[...]

Dr Eberhard W Lisse

May 6, 2023, 10:04:38 AM

Homebrew only needs the Xcode Command Line Tools, but only if you
wish to or have to compile sources. It's mainly binaries only and
works perfectly well without them if you can tolerate the squealing
:-)-O

I don't want to start a war on that but I find Homebrew more complete.

el

On 06/05/2023 01:45, gerg wrote:
[...]

Dr Eberhard W Lisse

May 6, 2023, 10:22:43 AM

On 05/05/2023 16:53, Kenny McCormack wrote:
[...]

> (Quoting Arte Johnson) Interesting... But not very.
>
> I did a little research on "Miller" (which I had never heard of
> before today).

Obviously not enough research, but then that's a bliss :-)-O

> I don't see the point of it. It just seems like another AWK (or
> Perl or Ruby or Python or ...). I.e., what I am saying is that
> the only reason to use the "traditional" Unix tools (cut, join,
> comm, sort, sed, grep, etc, etc) is because you just don't want
> to learn anything new (*). If you were going to learn something
> new, just learn AWK - which can do all the things that any/all
> of those "traditional" tools can do - and much more. Why bother
> to learn this "Miller" thing?

Because it is extremely powerful and can do a lot of things awk
can't do or not do as well or as easily.

It has become one of my few go-to tools to slice, dice and
look at CSV (and variants):

MILLER
QSV
CSVQ
CSVIEW
CSVLENS
(TYPST)

The first three and typst (which is a new but VERY promising
typesetting software) are under active development, the other two
not so much but they are just viewers and stable.

I have written myself a large number of bash functions using
combinations of the above which make my life (Gynecologist six
months before retirement, doing my own admin and book keeping)
much easier (and time is money :-)-O)

From reconciling bank statements, through declaring Value Added and
Employees' Tax, to submitting claims to Funders.

I have been able to retire some long used Perl Scripts which are
much longer and in spite of having commented them very well I
don't understand any more :-)-O

[...]

> Finally, I note that you mentioned Xcode. That's a Mac/Apple
> thing. Is this "Miller" a specifically Apple thing?

See above, my research comment.

That means, in my understanding, that MacPorts wants to compile
from source and for that it apparently wants the full XCode.

el

Kenny McCormack

May 7, 2023, 5:54:31 AM

In article <slrnu5ep43....@ID-685.user.individual.de>,
Martin Τrautmann <tr...@gmx.de> wrote:
...
>I agree that it might be able. But there's no solution given.
>But why look for an awk solution, if there's a reasonable sort solution
>which helped me to learn about sort options and principles.

Short answer: If the discussion of AWK (and various other tools) doesn't
make sense to you, you should probably just ignore it.

>naming just "awk" is not even as helpful as claiming "excel"

It wasn't really intended to be "helpful", in the sense in which you are
using that term. It was a side discussion, inspired by the introduced
diversion of yet another tool (the one called "Miller").

Finally, I should note that both of the following are true:

1) I never liked the Unix "sort" command. I always found its command
line syntax "weird". So, I've pretty much avoided using it, all
these years.
2) It's not worth learning AWK just to sort a file. But if you already
know AWK, it will seem like a natural way to sort a file.

--
"We are in the beginning of a mass extinction, and all you can talk
about is money and fairy tales of eternal economic growth."

- Greta Thunberg -

Chris Elvidge

May 7, 2023, 9:11:10 AM

On 07/05/2023 13:00, Martin Τrautmann wrote:

> On Sun, 7 May 2023 09:52:38 -0000 (UTC), Kenny McCormack wrote:
>> In article <slrnu5ep43....@ID-685.user.individual.de>,
>> Martin Τrautmann <tr...@gmx.de> wrote:
>> ...
>>> I agree that it might be able. But there's no solution given.
>>> But why look for an awk solution, if there's a reasonable sort solution
>>> which helped me to learn about sort options and principles.
>>
>> Short answer: If the discussion of AWK (and various other tools) doesn't
>> make sense to you, you should probably just ignore it.
>

> I'd like to learn more about awk - but I don't like it that much to
> actually find out how awk would do this job, since awk can do that much
> more than just sort.
>

A start:

sed & awk
Second Edition
Dale Dougherty and Arnold Robbins
https://doc.lagout.org/operating%20system%20/linux/Sed%20%26%20Awk.pdf

--
Chris Elvidge
England

Janis Papanagnou

May 7, 2023, 10:21:42 AM

This is probably not a good suggestion, where Martin wrote "[...] how
awk would do this job."; we were talking about sorting in this thread,
and that book mostly just refers to how to use the _external_ 'sort'
command (that was IMO anyway the right tool to do the job) called from
within awk. There's no mention of 'asort()' (or other sort functions
built into GNU awk). The book does show how to write a sorting
function in awk itself; but re-implementation should IMO not
be the approach to sorting things with awk. Unix's 'sort' fits
exactly and has powerful flexibility to solve the elementary sorting
tasks efficiently as we've already seen some weeks ago in this thread.
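
For illustration, calling the external sort from within awk might look roughly like this. It is a sketch, not taken from the book: the function name sort_via_awk and the numeric filter on field 4 are invented for the example, and the sort keys are the ones used earlier in this thread.

```shell
# Hypothetical helper: awk selects the lines, the external sort orders them.
sort_via_awk() {
    awk -F';' '
    BEGIN { cmd = "sort -t\";\" -k4,4n -k2,2 -k3,3" }
    # Only lines with a purely numeric 4th field are passed on to sort.
    $4 ~ /^[0-9]+$/ { print | cmd }
    # Closing the pipe makes awk wait until sort has written its output.
    END { close(cmd) }
    ' "$@"
}
```

The print | cmd construct is part of POSIX awk, so this should not depend on GNU Awk.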

Thanks for the link, BTW. That way I could have a peek into that book
and see which areas it covers and in which depth.

Janis

Kenny McCormack

May 7, 2023, 12:42:00 PM

In article <u38c1p$3dcuv$1...@dont-email.me>,
Janis Papanagnou <janis_pap...@hotmail.com> wrote:
...

>Thanks for the link, BTW. That way I could have a peek into that book
>and see which areas it covers and in which depth.

If I were going to make a book recommendation (for AWK), it would be for
Robbins's EAP. That covers GAWK specifically - which is the only AWK
anyone should be using nowadays (other than TAWK, of course).

--
The scent of awk programmers is a lot more attractive to women than
the scent of perl programmers.

(Mike Brennan, quoted in the "GAWK" manual)

Janis Papanagnou

May 7, 2023, 3:26:28 PM

On 07.05.2023 18:39, Kenny McCormack wrote:
> In article <u38c1p$3dcuv$1...@dont-email.me>,
> Janis Papanagnou <janis_pap...@hotmail.com> wrote:
> ...
>> Thanks for the link, BTW. That way I could have a peek into that book
>> and see which areas it covers and in which depth.
>
> If I were going to make a book recommendation (for AWK), it would be for
> Robbins's EAP. That covers GAWK specifically - which is the only AWK
> anyone should be using nowadays (other than TAWK, of course).

Effective Awk Programming
It should be noted that it (also) covers functionality that you
won't find in traditional awk, in the versions that you find on
commercial Unix systems. The book marks GNU specific/non-standard
extensions, though.

The Awk Programming Language
For "plain old" Awk I'd still recommend (also) the old book from
Awk's authors (Aho, Kernighan, Weinberger).

Sed & Awk
The book behind Chris' link isn't bad; besides all the basic Awk
knowledge it shows, for example, how to combine Awk with other
tools through pipes. The emphasis on Sed, though, looks (to me)
quite anachronistic. In the light of Awk the presence of Sed is
of arguable value. We may consider it just two independent topics
(Sed, Awk) in one book. (Only that they save duplication of the
Regexp chapter.) It probably would have made more sense if the
book would show the strength of the specific tools, and what's
the difference between the same functionality in one tool or the
other. As presented it's just one book for two separate tools.

For the mentioned aspect of combining tools I recall that a book
called (something like) "The Unix Tool-Chest" - or was it "The
Unix Programming Environment" (Kernighan, Pike?) - was very good
and, as far as I recall, also more complete and with convincing
examples.

The inherent shortcomings of one-dimensional pipe-processing vs.
(stateful) processing (e.g. in Awk) should also be kept in mind
for contexts where it matters.

Janis

Keith Thompson

May 7, 2023, 5:42:22 PM

Chris Elvidge <ch...@mshome.net> writes:
[...]

> A start:
>
> sed & awk
> Second Edition
> Dale Dougherty and Arnold Robbins

> https://[...]/Sed%20%26%20Awk.pdf

That's a pirated copy. The second edition was published in 1997
and is still under copyright.

https://www.oreilly.com/library/view/sed-awk/1565922255/

Kaz Kylheku

May 7, 2023, 8:03:29 PM

On 2023-05-07, Kenny McCormack <gaz...@shell.xmission.com> wrote:
> In article <u38c1p$3dcuv$1...@dont-email.me>,
> Janis Papanagnou <janis_pap...@hotmail.com> wrote:
> ...
>>Thanks for the link, BTW. That way I could have a peek into that book
>>and see which areas it covers and in which depth.
>
> If I were going to make a book recommendation (for AWK), it would be for
> Robbins's EAP. That covers GAWK specifically - which is the only AWK
> anyone should be using nowadays (other than TAWK, of course).

I agree that if you want to write in Awk because you see yourself as an
Awk developer, where all your target systems will somehow be running the
implementation that you prefer, then that probably would be GNU Awk.

While that's a nice sentiment, one big reason people write a script
in awk is that nothing has to be installed, and the problem they
are trying to solve is outside of the algorithmic expressiveness
of the shell and (other) utilities.

For that to work out, those people have to work with whatever awk is
installed.

As soon as you have the leeway to add something else to the target
system's installation, non-Awk choices are likely on the table too.

For instance, on a good many Linux-based embedded systems, you
get BusyBox Awk. (BusyBox puts all utilities into a single executable,
so your /usr/bin/awk is a hard link to busybox.)

Mawk appears in the default installation of some GNU/Linux distros;
if you have a dependency on GNU Awk, you have to manipulate the
system installation, or ask your downstream user to do that.
(If that's okay, again, other tools could be on the table.)

MacOS's awk doesn't appear to be GNU Awk; it's something from BSD.
If you want Awk code to run on every installation of MacOS without
installing anything other than that code, that's the awk it
has to work with.

(There are also other proprietary Unixes. No longer highly relevant, but
still there.)

Awk is a POSIX tool; you can use POSIX's description of Awk as a guide
for writing portable code. (And even then, if you're dealing with
installations of legacy Unixes, be prepared for some "broken old Awk").

(Maybe what you meant is that GNU Awk is the only Awk that system
vendors should be installing in their images so that everyone could then
just write GNU Awk code, worrying only about what version is installed.
There are probably some systems that would be better off that way.
BusyBox Awk is used for some good reasons, like saving flash space in
constrained systems. Either way, what people should be doing and what
they are doing is different.)

Keith Thompson

May 7, 2023, 8:18:37 PM

Kaz Kylheku <864-11...@kylheku.com> writes:
[...]

> For instance, on a good many Linux-based embedded systems, you
> get BusyBox Awk. (BusyBox puts all utilities into a single executable,
> so your /usr/bin/awk is a hard link to busybox.)

A minor point: busybox applets, including awk, are typically *symbolic*
links to busybox. (There might be a configuration option to use hard
links; I haven't checked.)

Benjamin Esham

May 8, 2023, 11:46:44 AM

Martin Τrautmann wrote:

> Hi all,
>
> how do I sort by multiple columns?
>

> [snip]

>
> I want to sort
> * first by column 4, numerical,
> * second by column 2
> * third by column 3
>
> So the result should be
> +++
> Borgentreich;D9444;Auf der Lindenstätte;1;;32;519950.850;5708982.109
> Borgentreich;D9444;Auf der Lindenstätte;2;;32;519926.937;5708966.116
> Borgentreich;D9444;Auf der Lindenstätte;3;;32;520008.619;5709083.464
> Borgentreich;D9444;Auf der Lindenstätte;4;;32;519860.278;5709041.468
> Borgentreich;D9386;Lindenstätte;1;;32;520150.696;5709236.354
> Borgentreich;T2960;Lindenstätte;1;;32;519778.725;5709026.584

> [snip]
> +++

Hi Martin,

Are you sure that this example result is correct? The second, third, and
fourth lines have values of 2, 3, and 4 in column 4, but then the fifth line
has a value of 1 in column 4. That doesn't seem to match your description of
the sorting logic, unless I'm missing something.

Assuming your description of the logic is right, though, and assuming you
have access to GNU sort (I'm using "sort (GNU coreutils) 9.1"), I think you
can use

sort -t ';' -k 4g -k 2 -k 3

The result this gives is

> Borgentreich;D9386;Lindenstätte;1;;32;520150.696;5709236.354
> Borgentreich;D9444;Auf der Lindenstätte;1;;32;519950.850;5708982.109
> Borgentreich;T2960;Lindenstätte;1;;32;519778.725;5709026.584
> Borgentreich;D9444;Auf der Lindenstätte;2;;32;519926.937;5708966.116
> Borgentreich;T2960;Lindenstätte;2;;32;519758.267;5709057.635
> Borgentreich;D9444;Auf der Lindenstätte;3;;32;520008.619;5709083.464
> Borgentreich;T2960;Lindenstätte;3;;32;519732.686;5709020.833
> Borgentreich;D9444;Auf der Lindenstätte;4;;32;519860.278;5709041.468
> Borgentreich;T2960;Lindenstätte;4;;32;519722.956;5709043.915
> Borgentreich;T2960;Lindenstätte;5;;32;519708.841;5709015.137
> Borgentreich;T2960;Lindenstätte;6;;32;519696.745;5709038.833
> Borgentreich;T2960;Lindenstätte;7;;32;519678.983;5709007.380
> Borgentreich;T2960;Lindenstätte;8;;32;519673.036;5709040.372
> Borgentreich;T2960;Lindenstätte;9;;32;519651.859;5709000.462
> Borgentreich;T2960;Lindenstätte;10;;32;519648.417;5709028.865
> Borgentreich;T2960;Lindenstätte;11;;32;519607.438;5708989.545
> Borgentreich;T2960;Lindenstätte;12;;32;519622.835;5709023.590
> Borgentreich;T2960;Lindenstätte;14;;32;519596.623;5709013.684
> Borgentreich;T2960;Lindenstätte;15;;32;519489.638;5709077.693
> Borgentreich;T2960;Lindenstätte;16;;32;519569.141;5709017.854
> Borgentreich;T2960;Lindenstätte;18;;32;519559.108;5709037.356
> Borgentreich;T2960;Lindenstätte;22;;32;519540.257;5709072.032
> Borgentreich;T2960;Lindenstätte;24;;32;519518.763;5709090.026
> Borgentreich;T2960;Lindenstätte;26;;32;519503.270;5709103.321
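
One detail about the key syntax may be worth illustrating: a key given as -k 2 runs from field 2 to the end of the line, whereas -k 2,2 stops at the end of field 2. The two made-up lines below sort differently under the two forms. LC_ALL=C is assumed so that the comparison is by byte value.

```shell
two_lines='x;a;z;1
x;a b;y;1'

# Key runs from field 2 to end of line: compares "a;z;1" with "a b;y;1".
printf '%s\n' "$two_lines" | LC_ALL=C sort -t';' -k2
# prints x;a b;y;1 first, because the space sorts before the semicolon

# Key is field 2 only: compares "a" with "a b".
printf '%s\n' "$two_lines" | LC_ALL=C sort -t';' -k2,2
# prints x;a;z;1 first, because "a" is a prefix of "a b"
```

With the data in this thread both forms appear to give the same order, but bounded keys such as -k4,4n -k2,2 -k3,3 make sure that each later key is actually consulted.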

Hope this helps!

Benjamin

Benjamin Esham

May 9, 2023, 10:54:18 AM

Martin Τrautmann wrote:

> On Mon, 08 May 2023 11:46:35 -0400, Benjamin Esham wrote:
>> Martin Τrautmann wrote:
>>
>>> Hi all,
>>>
>>> how do I sort by multiple columns?
>>>
>>> [snip]
>>>
>>> I want to sort
>>> * first by column 4, numerical,
>>> * second by column 2
>>> * third by column 3
>>>
>>> So the result should be
>>> +++
>>> Borgentreich;D9444;Auf der Lindenstätte;1;;32;519950.850;5708982.109
>>> Borgentreich;D9444;Auf der Lindenstätte;2;;32;519926.937;5708966.116
>>> Borgentreich;D9444;Auf der Lindenstätte;3;;32;520008.619;5709083.464
>>> Borgentreich;D9444;Auf der Lindenstätte;4;;32;519860.278;5709041.468
>>> Borgentreich;D9386;Lindenstätte;1;;32;520150.696;5709236.354
>>> Borgentreich;T2960;Lindenstätte;1;;32;519778.725;5709026.584
>>> [snip]
>>> +++
>>
>> Hi Martin,
>>
>> Are you sure that this example result is correct? The second, third, and
>> fourth lines have values of 2, 3, and 4 in column 4, but then the fifth
>> line has a value of 1 in column 4. That doesn't seem to match your
>> description of the sorting logic, unless I'm missing something.
>

> Maybe my description is wrong, determined what is a "sticky" sort from
> other applications.
>
> The first sort is done numerical - resulting in the correct order of 1
> to 4, from column 4.
>
> The next order is by column 2, keeping the numerical sort of column 4
> and grouping those together.
>
> But then I want to do the final sort on column 3, which does resort by
> those names, but does group "Lindenstätte" together, keeping the sort
> order of the former D9386 vs. T2960
>
> So in "sort" terms, the expected order is not 4>2>3, but 3>2>4,
> depending on how the sort actually does proceed. That's why I had given
> this example - that's the order I have to apply in spreadsheets or
> relational databases for a stepwise pipe.

Ah, I think I understand. What you are calling the "first" sort is the
"innermost" sort, i.e., the sort that is applied *last*, and only if it is
necessary to break the tie between two rows that have been considered equal
by all of the previous sorting steps.
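
To make that concrete: with sort(1), the key listed first on the command line is the most significant one, while a stepwise pipeline has to apply the least significant sort first and keep every later step stable. A sketch on a few made-up lines, assuming a sort with the -s option (available in GNU and BSD sort):

```shell
rows='x;T;L;2
x;D;L;1
x;D;A;2
x;T;L;1
x;D;A;1'

# One invocation: field 3 is most significant, then field 2, then field 4.
printf '%s\n' "$rows" | LC_ALL=C sort -t';' -k3,3 -k2,2 -k4,4n

# Stepwise: least significant key first; -s keeps the earlier order among ties.
printf '%s\n' "$rows" | LC_ALL=C sort -t';' -k4,4n |
    LC_ALL=C sort -s -t';' -k2,2 | LC_ALL=C sort -s -t';' -k3,3
```

Both commands should print the same five lines in the same order. Without -s on the later steps, each step's last-resort comparison would reshuffle the ties and undo the earlier work.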

If I weren't able to get the results I wanted from sort(1), not even with
the GNU extensions, I would probably jump straight to importing the data
into an in-memory SQLite database and expressing the sort as a SQL query. In
your case, you would need to replace the semicolon separators with commas
and add a header line like

col1,col2,col3,col4,col5,col6,col7,col8

at the top of the file. Then, assuming the input is in a file named "input",
you could run

sqlite3 -noheader :memory: '.mode csv' .output \
    ".import '|cat -' data" \
    'select * from data order by col3, col2, cast(col4 as integer)' \
    < input > output.csv

This produces a CSV file that obeys the informal standard that empty values,
and values with spaces in them, are double-quoted. If you don't want that,
you could replace the semicolons in the input with tabs, change the commas
to tabs in the header line, and run with ".mode tabs" instead of ".mode
csv". This would give you tab-separated output with no extra double quotes.
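
The input preparation described above (semicolons to commas, plus a header line) could be done along these lines. This is only a sketch: make_csv_input and raw.ssv are hypothetical names, and the tr approach assumes that no field itself contains a comma or a double quote.

```shell
# Hypothetical helper: print the header line, then the data with ";" turned into ",".
make_csv_input() {
    echo 'col1,col2,col3,col4,col5,col6,col7,col8'
    tr ';' ',' < "$1"
}

# Usage (file names are assumptions): make_csv_input raw.ssv > input
```

For the tab-separated variant, the same idea applies with tabs in place of the commas in both the header and the tr call.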

Of course, you allude to having this data in a relational database already.
I don't know the details of your situation, but in general a database seems
much better suited to this kind of manipulation than command-line tools that
can only deal with text.

Hope this helps!

Benjamin

Janis Papanagnou

May 9, 2023, 2:08:42 PM

On 09.05.2023 19:41, Martin Τrautmann wrote:

> On Tue, 09 May 2023 10:54:09 -0400, Benjamin Esham wrote:
>> Ah, I think I understand. What you are calling the "first" sort is the
>> "innermost" sort, i.e., the sort that is applied *last*, and only if it is
>> necessary to break the tie between two rows that have been considered equal
>> by all of the previous sorting steps.
>

> Yeah, that's it. And the proper sort command had been given before:
>
> sort -t\; -k4,4n -k2,2 -k3,3

In that light it's interesting how long (in time and number of posts)
this thread got. ;-)

Janis

David W. Hodgins

May 9, 2023, 2:43:23 PM

Easy to check as I use leafnode. :-)

# grep ^'Subject:' /var/spool/news/comp/unix/shell/*|grep 'sort by multiple columns'|wc -l
85

First post ...
Message-ID: <slrnu3v5vd....@ID-685.user.individual.de>
From: Martin =?UTF-8?Q?=CE=A4rautmann?= <t-us...@gmx.net>
Newsgroups: comp.unix.shell
Subject: sort by multiple columns
Date: Wed, 19 Apr 2023 09:27:12 +0200

Regards, Dave Hodgins

David W. Hodgins

May 9, 2023, 5:27:29 PM

On Tue, 09 May 2023 16:46:26 -0400, Martin Τrautmann <t-us...@gmx.net> wrote:

> On Tue, 09 May 2023 14:41:33 -0400, David W. Hodgins wrote:
>> Easy to check as I use leafnode. :-)
>>
>> # grep ^'Subject:' /var/spool/news/comp/unix/shell/*|grep 'sort by multiple columns'|wc -l
>> 85
>

> But the result would not have been sorted properly yet. Maybe you could
> use sort to sort the digits suitably first, then use awk to compute the
> correct checksum?

:-)

Regards, Dave Hodgins

Benjamin Esham

May 10, 2023, 10:45:15 AM

Partly my fault, sorry everyone :-) I was catching up with the newsgroup
after a while and missed that the "right answer" had already been posted.

Benjamin