
What is GNU 'tac -b' for?


Martijn Dekker

Jun 25, 2019, 8:39:16 PM
Consider:

$ printf '%s\n' one two three | tac -b

According to 'man tac':
-b, --before: attach the separator before instead of after

So I would expect this output ('\n' is a real linefeed):

\nthree\ntwo\none

Instead, the output is:

\n\nthree\ntwoone

What use is that?

The behaviour seems to be as documented in "info coreutils 'tac
invocation'":

`-b'
`--before'
The separator is attached to the beginning of the record that it
precedes in the file.

But, as we see above, that creates a double separator at the beginning
and two unseparated records at the end.

Can anyone think of a valid use case for that, or is this a bug?

- M.

--
/ modernish -- harness the shell \
https://github.com/modernish/modernish

Janis Papanagnou

Jun 26, 2019, 3:09:10 AM
On 26.06.2019 02:39, Martijn Dekker wrote:
> Consider:
>
> $ printf '%s\n' one two three | tac -b
>
> According to 'man tac':
> -b, --before: attach the separator before instead of after
>
> So I would expect this output ('\n' is a real linefeed):
>
> \nthree\ntwo\none
>
> Instead, the output is:
>
> \n\nthree\ntwoone
>
> What use is that?
>
> The behaviour seems to be as documented in "info coreutils 'tac invocation'":
>
> `-b'
> `--before'
> The separator is attached to the beginning of the record that it
> precedes in the file.
>
> But, as we see above, that creates a double separator at the beginning and two
> unseparated records at the end.
>
> Can anyone think of a valid use case for that, or is this a bug?

Looks like a bug to me. (And I see no use case for that.)

On my system I see another strange behaviour...

$ printf $'one\ntwo\nthree\n' | od -c
0000000 o n e \n t w o \n t h r e e \n
0000016
$ printf $'one\ntwo\nthree\n' | tac | od -c
0000000

Obviously tac creates no output if sent to a pipe. - What am I missing?

And...

$ printf $'one\ntwo\nthree\n' | tac -b


three
$ printf $'one\ntwo\nthree\n' | tac -b | od -c
0000000


Janis

>
> - M.
>

Keith Thompson

Jun 26, 2019, 5:58:28 AM
Janis Papanagnou <janis_pa...@hotmail.com> writes:
[...]
> On my system I see another strange behaviour...
>
> $ printf $'one\ntwo\nthree\n' | od -c
> 0000000 o n e \n t w o \n t h r e e \n
> 0000016
> $ printf $'one\ntwo\nthree\n' | tac | od -c
> 0000000

I don't see that on my system (Ubuntu 18.04, coreutils 8.28):

$ printf $'one\ntwo\nthree\n' | od -c
0000000 o n e \n t w o \n t h r e e \n
0000016
$ printf $'one\ntwo\nthree\n' | tac | od -c
0000000 t h r e e \n t w o \n o n e \n
0000016
$

> Obviously tac creates no output if sent to a pipe. - What am I missing?

I don't know.

> And...
>
> $ printf $'one\ntwo\nthree\n' | tac -b
>
>
> three
> $ printf $'one\ntwo\nthree\n' | tac -b | od -c
> 0000000

$ printf $'one\ntwo\nthree\n' | tac -b


three
twoone$ printf $'one\ntwo\nthree\n' | tac -b | od -c
0000000 \n \n t h r e e \n t w o o n e
0000016


In the "printf ... | tac -b" case, I wonder of some of the output
is being clobbered by your shell prompt. Try adding " ; sleep 3"
to delay the prompt.
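For example:

$ printf $'one\ntwo\nthree\n' | tac -b ; sleep 3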

--
Keith Thompson (The_Other_Keith) ks...@mib.org <http://www.ghoti.net/~kst>
Will write code for food.
void Void(void) { Void(); } /* The recursive call of the void */

John-Paul Stewart

Jun 26, 2019, 12:24:27 PM
On 2019-06-25 8:39 p.m., Martijn Dekker wrote:
> Consider:
>
> $ printf '%s\n' one two three | tac -b
>
> According to 'man tac':
> -b, --before: attach the separator before instead of after
>
> So I would expect this output ('\n' is a real linefeed):
>
>     \nthree\ntwo\none
>
> Instead, the output is:
>
>     \n\nthree\ntwoone
>
> What use is that?
[snip]
> Can anyone think of a valid use case for that, or is this a bug?

I think it is just that the documentation is unclear that the -b option
affects how tac splits its input into records. Also note that tac
doesn't add or remove separators, it strictly re-orders input.

The original input was "one\ntwo\nthree\n", which 'tac -b' sees as four
input records: (no separator) "one", "\ntwo", "\nthree", and "\n" (just
a separator). Reverse the order of those, and you get exactly the
output you saw. If you change the input to "\none\ntwo\nthree" with a
leading newline instead, then you'll get the expected output.

To really see what's going on, try all of these strings:

"one\ntwo\nthree\n"
"one\ntwo\nthree"
"\none\ntwo\nthree"

Run each through both 'tac -b' and plain 'tac' and you'll see what I
mean about it not adding/removing separators and that the -b affects how
the input is broken into records, not the output.
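For instance, working out that record split by hand (with \n a real
linefeed), I would expect something like this (worth verifying against
your own tac):

$ printf 'one\ntwo\nthree\n' | tac -b | od -c
0000000 \n \n t h r e e \n t w o o n e
0000016
$ printf '\none\ntwo\nthree' | tac -b | od -c
0000000 \n t h r e e \n t w o \n o n e
0000016

The first is the 'double separator at the beginning' output from the
original post; the second is the expected output once the input carries
a leading separator instead of a trailing one.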


Janis Papanagnou

Jun 26, 2019, 2:54:48 PM
On 26.06.2019 09:09, Janis Papanagnou wrote:
> On 26.06.2019 02:39, Martijn Dekker wrote:
>> Consider:
>>
>> $ printf '%s\n' one two three | tac -b
>>
>> According to 'man tac':
>> -b, --before: attach the separator before instead of after
>>
>> So I would expect this output ('\n' is a real linefeed):
>>
>> \nthree\ntwo\none
>>
>> Instead, the output is:
>>
>> \n\nthree\ntwoone
>>
>> What use is that?
>> [...]
>>
>> Can anyone think of a valid use case for that, or is this a bug?
>
> Looks like a bug to me. (And I see no usecase for that.)

Re-thinking about it, one thought came to mind (not sure whether it
makes sense, technically or otherwise): what about processing languages
that are written from right to left? Would they have (in computer
processing) their line terminators on the left side of each line? If
so, that could be an application case.

$ printf $'\nthree\ntwo\none' | tac -b

would create \none, \ntwo, and \nthree; and adding another tac -b

$ printf $'\nthree\ntwo\none' | tac -b | tac -b

would re-create the original sequence.

Janis

Stephane Chazelas

Jun 28, 2019, 6:05:10 AM
2019-06-26 02:39:11 +0200, Martijn Dekker:
[...]
> Can anyone think of a valid use case for that, or is this a bug?
[...]

Can be useful in things like:

~$ printf 'header: %s\n' foo bar | tac -s 'header: '
bar
foo
header: header: <no-eol>
~$ printf 'header: %s\n' foo bar | tac -s 'header: ' -b
header: bar
header: foo

--
Stephane

Grant Taylor

Jun 28, 2019, 8:00:49 PM
On 6/28/19 4:00 AM, Stephane Chazelas wrote:
> Can be useful in things like:
>
> ~$ printf 'header: %s\n' foo bar | tac -s 'header: '
> bar
> foo
> header: header: <no-eol>
> ~$ printf 'header: %s\n' foo bar | tac -s 'header: ' -b
> header: bar
> header: foo

Intriguing.

Thank you for sharing Stephane.



--
Grant. . . .
unix || die

Kaz Kylheku

Jun 28, 2019, 8:52:29 PM
You've cracked the nut.

But your code doesn't serve as a good motivating example; it is
confounded by the fact that for the given line-oriented data sample, we
would be better off working with the default newline separator as a
record terminator:

$ printf 'header: %s\n' foo bar | tac
header: bar
header: foo

Perhaps the following is a clearer demonstration of your finding:

$ echo -n 'a:b:c:' | tac -s ':' -b ; echo
::c:ba

Huh? That looks like a bug?! But, what if we have record start
symbols rather than terminator symbols:

$ echo -n ':a:b:c' | tac -s ':' -b ; echo
:c:b:a

Aha!

Thus the ::c:ba output is explained like this: in a:b:c: the colon is
treated as leader, and so the data is considered to be:

<missing-leader> a <leader> b <leader> c <leader> <empty-record>

Thus we get the output:

<leader> <empty-record> <leader> c <leader> b <missing-leader> a

This could benefit from being better documented. The "separator"
terminology is flawed, for starters.

Martijn Dekker

Jul 1, 2019, 8:03:40 PM
Thanks to everyone who responded in this thread, particularly John-Paul
Stewart:

> I think it is just that the documentation is unclear that the -b option
> affects how tac splits its input into records. Also note that tac
> doesn't add or remove separators, it strictly re-orders input.

That's the essential bit of insight I needed in a nutshell. Stéphane and
Kaz also gave very useful examples.

With the help of all your responses I've created a new cross-platform
'tac' implementation, in shell and awk, as a module for the modernish
shell library so it can be used on any POSIX system, not just where GNU
coreutils are available (note that installing modernish does not require
a compiler and you can install it in your home directory on multi-user
systems). Modernish 'tac' acts identically to GNU 'tac' with the
examples you all gave in this thread, as long as the input is text.

Brief documentation (hopefully better than the GNU one):
https://github.com/modernish/modernish#user-content-use-sysbasetac

View the code (the awk code is the interesting bit):
https://github.com/modernish/modernish/blob/master/lib/modernish/mdl/sys/base/tac.mm

I've added a couple of new options to this 'tac':
* -B is like -b except it acts like I originally expected -b to act:
it expects separators to follow records in the input, but makes them
precede records in the output. I can't quite figure out a concrete
use case for that; perhaps one of you can think of some. :)
* -P is for paragraph mode: output a text's last paragraph first, with
paragraphs separated by at least two linefeeds. (This is easy to do
in awk by using an empty record separator; see the sketch after this
list.)
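
A minimal awk sketch of that empty-RS idea (hypothetical, not the actual
modernish code; plain POSIX awk, holding one copy of the paragraphs in
memory):

awk 'BEGIN { RS = ""; ORS = "\n\n" }   # empty RS: each record is a paragraph
     { p[NR] = $0 }                    # remember the paragraphs in order
     END { for (i = NR; i >= 1; i--) print p[i] }' file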

One other important difference is that the '-r' option causes it to
accept an extended (as opposed to basic) regex as the separator; this is
inevitable as awk can't parse basic regexes.

It would be fun if some of you could try to break this and post your
findings.

To get started quickly, get the current modernish development code and
make yourself a little wrapper script:

git clone https://github.com/modernish/modernish
modernish/install.sh -n
cat > ~/bin/mtac <<'EOF'
#! /usr/bin/env modernish
#! use sys/base/tac
tac "$@"
EOF
chmod +x ~/bin/mtac

Now you have an 'mtac' command in ~/bin that makes it easy to test this
implementation.

To get rid of modernish again:

modernish/uninstall.sh -fn

Thanks,

Stephane Chazelas

Jul 2, 2019, 12:45:14 PM
2019-07-02 02:03:35 +0200, Martijn Dekker:
[...]
> With the help of all your responses I've created a new cross-platform 'tac'
> implementation, in shell and awk, as a module for the modernish shell
> library so it can be used on any POSIX system, not just where GNU coreutils
> are available (note that installing modernish does not require a compiler
> and you can install it in your home directory on multi-user systems).
> Modernish 'tac' acts identically to GNU 'tac' with the examples you all gave
> in this thread, as long as the input is text.
[...]

Note that while "tac" is a GNU-specific command, most other
systems have "tail -r" to achieve the same effect (though some
have limitations for non-seekable input IIRC).

--
Stephane

Ed Morton

Jul 2, 2019, 2:08:41 PM
On 7/1/2019 7:03 PM, Martijn Dekker wrote:
> Thanks to everyone who responded in this thread, particularly John-Paul
> Stewart:
>
>> I think it is just that the documentation is unclear that the -b option
>> affects how tac splits its input into records.  Also note that tac
>> doesn't add or remove separators, it strictly re-orders input.
>
> That's the essential bit of insight I needed in a nutshell. Stéphane and
> Kaz also gave very useful examples.
>
> With the help of all your responses I've created a new cross-platform
> 'tac' implementation, in shell and awk, as a module for the modernish
> shell library so it can be used on any POSIX system, not just where GNU
> coreutils are available (note that installing modernish does not require
> a compiler and you can install it in your home directory on multi-user
> systems). Modernish 'tac' acts identically to GNU 'tac' with the
> examples you all gave in this thread, as long as the input is text.
>
> Brief documentation (hopefully better than the GNU one):
> https://github.com/modernish/modernish#user-content-use-sysbasetac
>
> View the code (the awk code is the interesting bit):
> https://github.com/modernish/modernish/blob/master/lib/modernish/mdl/sys/base/tac.mm

To be able to handle large files you'd be better off using:

cat -n | sort -rn | cut -f2-

to reverse the line order than what you're currently doing, which is
reading the whole input into a string then splitting that string into an
array, thereby requiring more than twice the memory of the input file size.

For example you can do this to reverse a file:

$ seq 5 | cat -n | sort -rn | cut -f2-
5
4
3
2
1

so I'd build a solution around that process of adding line numbers then
using sort (to do the heavy lifting since it's designed to handle huge
files and can do paging, etc. as necessary to handle them) to reorder
them, then cut to remove the line numbers you added at the start.

It's easy with awk to convert paragraphs to/from lines when necessary if
that's a concern, e.g.:

#####
$ cat file
Wee, sleekit, cowrin, tim'rous beastie,
O, what a panic's in thy breastie!
Thou need na start awa sae hasty,
Wi' bickering brattle!
I wad be laith to rin an' chase thee,
Wi' murd'ring pattle!

I'm truly sorry man's dominion,
Has broken nature's social union,
An' justifies that ill opinion,
Which makes thee startle
At me, thy poor, earth-born companion,
An' fellow-mortal!

#####

$ awk -v RS= '{gsub(/@/,"@A"); gsub(/#/,"@B"); gsub(ORS,"#")}1' file |
cat -n | sort -rn | cut -f2- |
awk -v ORS='\n\n' '{gsub(/#/,RS); gsub(/@B/,"#"); gsub(/@A/,"@")}1'
I'm truly sorry man's dominion,
Has broken nature's social union,
An' justifies that ill opinion,
Which makes thee startle
At me, thy poor, earth-born companion,
An' fellow-mortal!

Wee, sleekit, cowrin, tim'rous beastie,
O, what a panic's in thy breastie!
Thou need na start awa sae hasty,
Wi' bickering brattle!
I wad be laith to rin an' chase thee,
Wi' murd'ring pattle!

#####

Obviously you could add the NR with awk but I left the cat -n in as I
like the consistency of always handling `cat -n | sort -rn | cut -f2-`
the same way regardless of other things you're doing. You can also do
whatever else you need to do wrt other tac options using awk outside of
that main line-sorting pipeline.

Regards,

Ed.

Chris Elvidge

Jul 2, 2019, 2:13:32 PM
How many UUOC can you get in one post?


--

Chris Elvidge, England

Ed Morton

Jul 2, 2019, 4:26:51 PM
You didn't see any in my post but I'd expect someone could create a post
with many of them if they wanted to. If you think any of the `cat`s in
my post are useless then you misunderstood what they're doing and why -
feel free to ask if you have questions.

Ed.

Martijn Dekker

Jul 3, 2019, 10:46:30 AM
Op 02-07-19 om 20:08 schreef Ed Morton:
> To be able to handle large files you'd be better off using:
>
>     cat -n | sort -rn | cut -f2-
>
> to reverse the line order than what you're currently doing which is
> reading the whole input into a string then splitting that string into an
> array thereby more than requiring twice the memory of the input file size.

Thanks. I did think about that. Unfortunately I can't think of another
way of handling arbitrary (and arbitrarily varying) separators that are
matched by a regular expression, while remembering each individual
separator, as GNU tac does. Your idea only works for lines of text.

(Also, 'cat -n' (count lines) is not POSIX -- though its functionality
can easily be duplicated in awk.)

If only POSIX awk supported a regex RS, like GNU awk does, then this
wouldn't be a problem... Instead, I have to abuse FS to be able to match
regex separators, which involves reading the entire document and then
splitting it as one 'field'.

Maybe it would be better to add another awk script for use with a
single-character, non-regex separator. This case can be handled
straightforwardly with RS and one array, so there's no need for two
copies of the document to exist in memory.
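A rough sketch of what that single-character variant might look like
(hypothetical, not the modernish code; note that, as written, it always
re-attaches a separator after the last record even if the input lacked a
final one, which a faithful 'tac' must not do):

awk -v sep='\n' '
    BEGIN { RS = sep; ORS = "" }   # sep is the single-character separator
    { rec[NR] = $0 }               # one array, one copy of the data
    END { for (i = NR; i >= 1; i--) print rec[i] sep }'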

[re: paragraph mode]
> It's easy with awk to convert paragraphs to/from lines when necessary
> if that's a concern, e.g.:
[...]
> $ awk -v RS= '{gsub(/@/,"@A"); gsub(/#/,"@B"); gsub(ORS,"#")}1' file |
> cat -n | sort -rn | cut -f2- |
> awk -v ORS='\n\n' '{gsub(/#/,RS); gsub(/@B/,"#"); gsub(/@A/,"@")}1'

I'm guessing the point of this is that none of the utilities in this
pipeline need to hold a large file entirely in working memory (with
'sort' doing its own paging where needed, as you pointed out).

On the other hand, for smaller files, it comes at the cost of invoking
multiple processes instead of just the one. Although I could reduce it
to three by eliminating 'cat -n' and making awk do that work.
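For example, for the plain line-reversing case that would be something
like (a sketch; tab-separated so that cut's default delimiter works):

awk '{print NR "\t" $0}' | sort -rn | cut -f2-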

Thanks for the ideas -- they did give me stuff to think about.

Martijn Dekker

Jul 3, 2019, 11:21:03 AM
Below, gtac = GNU tac, mtac = modernish tac.

$ printf 'fourXXXXthreeXXXtwoXXoneX' | gtac -r -s 'XX*'; echo
oneXXtwoXXXthreeXXXXfourX
$ printf 'fourXXXXthreeXXXtwoXXoneX' | mtac -r -s 'XX*'; echo
oneXtwoXXthreeXXXfourXXXX

With -b (separator precedes record in both input and output):

$ printf 'XXXXfourXXXthreeXXtwoXone' | gtac -b -r -s 'XX*'; echo
XoneXtwoXXthreeXXXfourXXX
$ printf 'XXXXfourXXXthreeXXtwoXone' | mtac -b -r -s 'XX*'; echo
XoneXXtwoXXXthreeXXXXfour

The regex 'XX*' means: match one or more Xes, in both basic and extended
regular expressions, right?

So, mtac does what I would expect, and the gtac output smells like some
off-by-one bug in GNU tac to me.

Am I missing something?

_________
As an aside, here's something GNU tac can't do: separator follows record
in input, precedes record in output.

$ printf 'fourXXXXthreeXXXtwoXXoneX' | mtac -B -r -s 'XX*'; echo
XoneXXtwoXXXthreeXXXXfour

- M.

Ed Morton

Jul 4, 2019, 12:09:22 PM
On 7/3/2019 9:46 AM, Martijn Dekker wrote:
> Op 02-07-19 om 20:08 schreef Ed Morton:
>> To be able to handle large files you'd be better off using:
>>
>>      cat -n | sort -rn | cut -f2-
>>
>> to reverse the line order than what you're currently doing which is
>> reading the whole input into a string then splitting that string into
>> an array thereby more than requiring twice the memory of the input
>> file size.
>
> Thanks. I did think about that. Unfortunately I can't think of another
> way of handling arbitrary (and arbitrarily varying) separators that are
> matched by a regular expression, while remembering each individual
> separator, as GNU tac does. Your idea only works for lines of text.

True.

> (Also, 'cat -n' (count lines) is not POSIX -- though its functionality
> can easily be duplicated in awk.)

Also true.

> If only POSIX awk supported a regex RS, like GNU awk does, then this
> wouldn't be a problem... Instead, I have to abuse FS to be able to match
> regex separators, which involves reading the entire document and then
> splitting it as one 'field'.

If you posted a worst-case example (concise, testable sample input and
expected output) I'd be happy to take a look to see if I can come up
with a better approach.

>
> Maybe it would be better to add another awk script for use with a
> single-character, non-regex separator. This case can be handled
> straightforwardly with RS and one array, so there's no need for two
> copies of the document to exist in memory.

Yeah, probably as that'll be by far the most common use case so no point
slowing that down for the general case that'll rarely be used. I've
never actually seen tac used with a regexp as the separator.

>
> [re: paragraph mode]
>> It's easy with awk to convert paragraphs to/from lines when necessary
>> if that's a concern, e.g.:
> [...]
>> $ awk -v RS= '{gsub(/@/,"@A"); gsub(/#/,"@B"); gsub(ORS,"#")}1' file |
>> cat -n | sort -rn | cut -f2- |
>> awk -v ORS='\n\n' '{gsub(/#/,RS); gsub(/@B/,"#"); gsub(/@A/,"@")}1'
>
> I'm guessing the point of this is that none of the utilities in this
> pipeline need to hold a large file entirely in working memory (with
> 'sort' doing its own paging where needed, as you pointed out).

Right.

>
> On the other hand, for smaller files, it comes at the cost of invoking
> multiple processes instead of just the one. Although I could reduce it
> to three by eliminating 'cat -n' and making awk do that work.

For small files you don't care about performance so don't make any
changes to optimize for small files at any cost to handling large files
or clarity of the script or anything else.

>
> Thanks for the ideas -- they did give me stuff to think about.

You're welcome.

Ed.

>
> - M.
>

Helmut Waitzmann

Jul 5, 2019, 9:52:26 AM
Martijn Dekker <mar...@inlv.demon.nl>:
> Op 02-07-19 om 20:08 schreef Ed Morton:

>> To be able to handle large files you'd be better off using:
>>
>>     cat -n | sort -rn | cut -f2-
>>

[…]

> (Also, 'cat -n' (count lines) is not POSIX -- though its functionality
> can easily be duplicated in awk.)
>

An alternative would be to use

grep -F -n -e '' | sort -rn | cut -f 2- -d :

Janis Papanagnou

Jul 5, 2019, 10:56:28 AM
Or use of the (almost forgotten?) pr(1) command (supported with a bunch of
options to suppress headers etc. etc.).

Janis

Kenny McCormack

Jul 5, 2019, 11:43:55 AM
In article <qfnoeo$ipl$1...@news-1.m-online.net>,
Janis Papanagnou <janis_pa...@hotmail.com> wrote:
> Or use of the (almost forgotten?) pr(1) command (supported with a bunch of
> options to suppress headers etc. etc.).

nl

--
The randomly chosen signature file that would have appeared here is more than 4
lines long. As such, it violates one or more Usenet RFCs. In order to remain
in compliance with said RFCs, the actual sig can be found at the following URL:
http://user.xmission.com/~gazelle/Sigs/Seriously

Martijn Dekker

Jul 5, 2019, 12:25:52 PM
Op 04-07-19 om 18:09 schreef Ed Morton:
> On 7/3/2019 9:46 AM, Martijn Dekker wrote:
>> Op 02-07-19 om 20:08 schreef Ed Morton:
>>> To be able to handle large files you'd be better off using:
>>>
>>>      cat -n | sort -rn | cut -f2-
>>>
>>> to reverse the line order than what you're currently doing which is
>>> reading the whole input into a string then splitting that string into
>>> an array thereby more than requiring twice the memory of the input
>>> file size.
>>
>> Thanks. I did think about that. Unfortunately I can't think of another
>> way of handling arbitrary (and arbitrarily varying) separators that
>> are matched by a regular expression, while remembering each individual
>> separator, as GNU tac does. Your idea only works for lines of text.
>
> True.

And that unfortunately gives a problem: a lack of final linefeed in the
input cannot be handled in a way that is compatible with GNU 'tac',
because 'tac' is not actually line-oriented.

$ printf 'one\ntwo\nthree' | gtac | od -a
0000000 t h r e e t w o nl o n e nl
0000015
$ printf 'one\ntwo\nthree' | cat -n | sort -rn | cut -f2- | od -a
0000000 t h r e e nl t w o nl o n e nl
0000016

An extra newline is added after 'three'. A proper reimplementation of
'tac' should not pull separators out of a hat.


>> (Also, 'cat -n' (count lines) is not POSIX -- though its functionality
>> can easily be duplicated in awk.)
>
> Also true.

'pr -t -n' is a highly portable alternative (flagged by Janis, elsewhere
in thread).


>> If only POSIX awk supported a regex RS, like GNU awk does, then this
>> wouldn't be a problem... Instead, I have to abuse FS to be able to
>> match regex separators, which involves reading the entire document and
>> then splitting it as one 'field'.
>
> If you posted a worst-case example (concise, testable sample input and
> expected output) I'd be happy to take a look to see if i can come up
> with a better approach.

One usage example given in GNU 'tac' source code is reversing input
character by character, which seems like a good "worst-case" example.
Any text file can be your sample input.

gtac -r -s 'x\|[^x]'

With modernish tac we have to use an ERE which means '\|' becomes '|':

mtac -r -s 'x|[^x]'

Works fine, but it's very slow.

Also, trying this out on a text file with UTF-8 characters (in a UTF-8
locale) shows that GNU 'tac' actually reverses byte by byte, not
character by character. Modernish tac supports multibyte characters
through awk.

____
Another possible 'worst-case' example is the use of a regex separator
that matches separators of varying lengths -- for instance, use any
string of numbers as the separator:

printf 'lorem12ipsum786dolor1337sit16384amet' \
| tac -r -s '[0-9][0-9]*'

Output should be:

ametsit16384dolor1337ipsum786lorem12

And it appears that GNU 'tac' does this wrong:

amet4836sit1733dolor168ipsum72lorem1

Martijn Dekker

Jul 5, 2019, 2:18:35 PM
As I wrote elsewhere, unfortunately I can't use line-oriented utilities
to faithfully reimplement the non-line-oriented 'tac', as line-oriented
utilities pull a final linefeed out of a hat if it's not in the input.

I might add a new -L option that enables this method, making modernish
'tac' explicitly line-oriented for better performance with large files.

In any case, the question of what is a portable line numbering filter is
inherently interesting...


>> On 05.07.2019 15:52, Helmut Waitzmann wrote:
>>> Martijn Dekker <mar...@inlv.demon.nl>:
>>>> Op 02-07-19 om 20:08 schreef Ed Morton:
>>>>> To be able to handle large files you'd be better off using:
>>>>> cat -n | sort -rn | cut -f2-
>>>> (Also, 'cat -n' (count lines) is not POSIX -- though its functionality can
>>>> easily be duplicated in awk.)
>>> An alternative would be to use
>>> grep -F -n -e '' | sort -rn | cut -f 2- -d :

Good call. POSIX does specify that an empty regex for 'grep' matches
every line, so this is very usable.

http://pubs.opengroup.org/onlinepubs/9699919799/utilities/grep.html

> In article <qfnoeo$ipl$1...@news-1.m-online.net>,
> Janis Papanagnou <janis_pa...@hotmail.com> wrote:
>> Or use of the (almost forgotten?) pr(1) command (supported with a bunch of
>> options to suppress headers etc. etc.).

This may be an even better idea. The -t option completely turns off page
separation and padding with empty lines, so 'pr -t' equals 'cat'. So 'pr
-t -n' is a straightforward line numbering filter for arbitrary input
text. Its output is identical to that of the non-standard 'cat -n'.

I've also verified this is POSIX and works on macOS, the BSDs, Linux,
Solaris, and (as verified at <https://unix50.org/>) all the way back to
AT&T UNIX System III (1982), so this should be extremely portable.

http://pubs.opengroup.org/onlinepubs/9699919799/utilities/pr.html

As an aside, something like 'pr -t -3' will format text into 3 columns.
I've been wanting a portable way to do that for a while (as the 'column'
command is not portable).

Thanks for reminding us of this useful utility!


Op 05-07-19 om 17:43 schreef Kenny McCormack:
> nl

That looks like most obvious POSIX command to use.

http://pubs.opengroup.org/onlinepubs/9699919799/utilities/nl.html

Unfortunately, if you don't want to divide the input text into pages, 'nl'
cannot be used with completely arbitrary input text. While you can
specify custom page delimiter characters with '-d' it's impossible to
turn page separation off. Even if you use '-p' to specify that line
numbering should not be restarted at logical page delimiters, any line
matching those page delimiter characters is still deleted.

I've also attempted to use an empty page delimiter (-d '') to turn off
page delimitation, but in POSIX that is unspecified; in real life, it
turns off page delimitation on GNU 'nl', but BSD 'nl' simply reverts to
the default '\:'.


Thanks everyone,

Janis Papanagnou

Jul 5, 2019, 3:12:57 PM
On 05.07.2019 20:18, Martijn Dekker wrote:
>> In article <qfnoeo$ipl$1...@news-1.m-online.net>,
>> Janis Papanagnou <janis_pa...@hotmail.com> wrote:
>>> Or use of the (almost forgotten?) pr(1) command (supported with a bunch of
>>> options to suppress headers etc. etc.).
>
> This may be an even better idea. The -t option completely turns off page
> separation and padding with empty lines, so 'pr -t' equals 'cat'. So 'pr -t
> -n' is a straightforward line numbering filter for arbitrary input text. It's
> output is identical to that of the non-standard 'cat -n'.
>
> I've also verified this is POSIX and works on macOS, the BSDs, Linux, Solaris,
> and (as verified at <https://unix50.org/>) all the way back to AT&T UNIX
> System III (1982), so this should be extremely portable.
>
> http://pubs.opengroup.org/onlinepubs/9699919799/utilities/pr.html
>
> As an aside, something like 'pr -t -3' will format text into 3 columns. I've
> been wanting a portable way to do that for a while (as the 'column' command is
> not portable).
>
> Thanks for reminding us of this useful utility!

You're welcome. But be careful WRT compatibility or standard behaviour with
really old systems. I worked with it in the 1980's and seem to recall that
the UNIX variant I used at that time (basically the Amdahl variant "UTS" of
AT&T's UNIX Version 7, which was later based on System III and System V)
behaved differently; e.g. it truncated lines longer than the default (72),
or you had to provide some larger but fixed(!) value as a command-line
option; so it would depend on the actual data and might be inappropriate
for use in the tac case. (Maybe that's all academic now in the POSIX era,
and my memories may also mislead me, but a warning might nonetheless be in
order, just to be on the safe side.)

Janis

Stephane Chazelas

Jul 6, 2019, 5:25:14 AM
2019-07-05 20:18:29 +0200, Martijn Dekker:
[...]
> > > On 05.07.2019 15:52, Helmut Waitzmann wrote:
> > > > Martijn Dekker <mar...@inlv.demon.nl>:
> > > > > Op 02-07-19 om 20:08 schreef Ed Morton:
> > > > > > To be able to handle large files you'd be better off using:
> > > > > > cat -n | sort -rn | cut -f2-
> > > > > (Also, 'cat -n' (count lines) is not POSIX -- though its functionality can
> > > > > easily be duplicated in awk.)
> > > > An alternative would be to use
> > > > grep -F -n -e '' | sort -rn | cut -f 2- -d :
>
> Good call. POSIX does specify that an empty regex for 'grep' matches every
> line, so this is very usable.

Note that grep -n '' is not portable in practice, as on some
systems an empty regexp recalls the last regexp (which for grep
doesn't make sense). That's the case for /bin/grep on Solaris,
for instance (which doesn't support -F nor -e either):

$ grep ''
grep: RE error 41: No remembered search string

(fgrep -n '' works on Solaris)

With GNU grep at least, if --colour is enabled, it also becomes
very inefficient.

I usually use grep '^' instead.

For numbering lines

nl -ba -d'
'

(with the newline attached to the -d option)

should be portable and POSIX.

--
Stephane

Martijn Dekker

Jul 6, 2019, 3:50:17 PM
Op 06-07-19 om 11:23 schreef Stephane Chazelas:
> grep -n '' is not portable in practice, where on some systems an
> empty regexp recalls the last regexp (which for grep doesn't
> make sense). That's the case of /bin/grep on Solaris for
> instance (which doesn't support -F nor -e either):

Are there other current grep utilities that do this? Everything else
aims to be POSIX compliant these days, AFAIK.

Personally I don't care about the ancient-style /bin/* utilities on
Solaris; it comes with POSIX-conforming versions in /usr/xpg{7,6,4}/bin.

The modernish installer uses this to get a good default utility path:
DEFPATH=$(
PATH=/usr/xpg7/bin:/usr/xpg6/bin:/usr/xpg4/bin:/bin:/usr/bin:$PATH \
getconf PATH 2>/dev/null
) || DEFPATH=/bin:/usr/bin:/sbin:/usr/sbin

However:
[...]
> I usually use grep '^' instead.

That actually makes more sense, and seems like good practice.

> For numbering lines
>
> nl -ba -d'
> '
>
> (with the newline attached to the -d option)
>
> should be portable and POSIX.

Doesn't this technically specify that the logical page delimiter is
$'\n:'? POSIX says: "If only one character is entered, the second
character shall remain the default character ':'" and it doesn't mention
anything about a possibility of disabling the thing altogether.

I imagine that this portably disables the logical page delimiter in
practice only because no implementation bothers to match a delimiter
that spans two lines. It seems like a bit of a hack to me.

Is there any reason to prefer this over 'pr -t -n'?

Thanks,

Stephane Chazelas

Jul 6, 2019, 5:35:11 PM
2019-07-06 21:50:11 +0200, Martijn Dekker:
> Op 06-07-19 om 11:23 schreef Stephane Chazelas:
> > grep -n '' is not portable in practice, where on some systems an
> > empty regexp recalls the last regexp (which for grep doesn't
> > make sense). That's the case of /bin/grep on Solaris for
> > instance (which doesn't support -F nor -e either):
>
> Are there other current grep utilities that do this? Everything else aims to
> be POSIX compliant these days, AFAIK.
>
> Personally I don't care about the ancient-style /bin/* utilities on Solaris;
> it comes with POSIX-conforming versions in /usr/xpg{7,6,4}/bin.
[...]

Note that xpg4/6/7 are only installed by default in the
"full" (workstation/server) deployments of Solaris (or at least
it was the case for Solaris 10). I would agree with you it's not
worth trying to deal with Solaris ancient /bin/* tools though.

[...]
> Is there any reason to prefer this over 'pr -t -n'?
[...]

With GNU pr:

$ printf '\f\n' | pr -t -n | od -tc
0000000 \f
0000001

On Solaris:

$ printf '\f\n' | pr -t -n | od -tc
0000000 \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n
*
0000100 \n \n 1 \t \n
0000111

Anyway, I just found nl -ba -d$'\n' is not portable either. On NetBSD:

nl: invalid delim argument --

There's always:

awk '{print NR, $0}'

And

sed = | paste - -

--
Stephane

Helmut Waitzmann

Jul 7, 2019, 5:25:54 PM
Janis Papanagnou <janis_pa...@hotmail.com>:
I didn't forget pr(1), I consider it unusable for numbering lines:

On my system,

seq 1 1 20 | pr -t -n' 1' | sed -n -e '8,12p'

(using "sed" merely to reduce output size of the example) gives
the following output:

8 8
9 9
0 10
1 11
2 12

<http://pubs.opengroup.org/onlinepubs/9699919799/utilities/pr.html#tag_20_93_04>
says:

-n[char][width]
Provide width-digit line numbering (default for width shall
be 5). The number shall occupy the first width column
positions of each text column of default output or each line
of -m output. If char (any non-digit character) is given, it
shall be appended to the line number to separate it from
whatever follows (default for char is a <tab>).

I think there is no way to prevent pr(1) from stripping
digits from the line number if there are more than allowed by the
"width" info.

At least according to the POSIX standard, "grep -n" does not have
this limitation.

Janis Papanagnou

Jul 7, 2019, 6:06:46 PM
On 07.07.2019 23:25, Helmut Waitzmann wrote:
> Janis Papanagnou <janis_pa...@hotmail.com>:
>> On 05.07.2019 15:52, Helmut Waitzmann wrote:
>
>>> An alternative would be to use
>>> grep -F -n -e '' | sort -rn | cut -f 2- -d :
>>
>> Or use of the (almost forgotten?) pr(1) command (supported with a bunch of
>> options to suppress headers etc. etc.).
>
> I didn't forget pr(1), I consider it unusable for numbering lines:

Numbering lines works perfectly in my environment and worked on every
Unix system where I had used it before.

> On my system,
> seq 1 1 20 | pr -t -n' 1' | sed -n -e '8,12p'
>
> (using "sed" merely to reduce output size of the example) gives the following
> output:
> 8 8
> 9 9
> 0 10
> 1 11
> 2 12
>
> <http://pubs.opengroup.org/onlinepubs/9699919799/utilities/pr.html#tag_20_93_04>
> says:
> -n[char][width]
> Provide width-digit line numbering (default for width shall
> be 5). The number shall occupy the first width column
> positions of each text column of default output or each line
> of -m output. If char (any non-digit character) is given, it
> shall be appended to the line number to separate it from
> whatever follows (default for char is a <tab>).
>
> I think, there is no way to prevent pr(1) from stripping digits from the line
> number, if there are more than given by the "width" info.

I wonder then why you use a fixed width of 1 if you want your numbers
to contain more than one digit. Just omit the 1 then. Works for me.

Janis

Ralf Damaschke

Jul 8, 2019, 8:32:01 AM
Janis Papanagnou schrieb:

> On 07.07.2019 23:25, Helmut Waitzmann wrote:

>> I didn't forget pr(1), I consider it unusable for numbering lines:
>
> Numbering lines works perfectly in my environment and worked on every
> Unix system where I had used it before.
>
>> On my system,
>> seq 1 1 20 | pr -t -n' 1' | sed -n -e '8,12p'
>>
>> (using "sed" merely to reduce output size of the example) gives the following
>> output:
>> 8 8
>> 9 9
>> 0 10
>> 1 11
>> 2 12
[...]
>> I think, there is no way to prevent pr(1) from stripping digits from the line
>> number, if there are more than given by the "width" info.
>
> I wonder then why you use a fixed number of 1 if you want your numbers
> to contain more than 1 digit. Just omit the number 1 then. Works for me.

Probably just to keep the example small. Obviously another test with
seq 1 999999 | pr -t -n | sed -n -e '99998,100002p'
exhibits the same problem with line numbers wrapping.

Janis Papanagnou

Jul 8, 2019, 12:03:05 PM
Yes, this inherent problem matches the issue with the 72-char truncation
that I mentioned upthread. You can of course increase the number of digits
beyond 5 by specifying a larger (fixed) one, say 18 digits (allowing files
with 10^18-1 lines to be processed); but even if you can choose such a
number to fit all practical cases, there's always someone who, for any N,
will claim to need N+1 digits (probably just for the sake of argument).
The 72-char truncation issue I consider much more severe, given the
arbitrary data line lengths we nowadays process in many areas (but that
issue may no longer exist in modern/standard implementations).

When using pr(1) we should keep in mind that, given the indication provided
by its options, this is a tool to format output for printing [in old ages].
(Even the man page says so: "pr - convert text files for printing".)
Multi-column, numbering of non-empty lines, page header information etc.
are such indications.

So while the tool can be used as one component for (a standard way of)
general purpose data processing it's not its original purpose[*], and I
think that specialized "data processing" tools as mentioned in this thread
are generally better suited for processing of arbitrary data.

Janis

[*] I always thought about pr(1) as being the last processing/formatting
component in a pipe before the final "| lpr" or "| opr".

Ralf Damaschke

Jul 8, 2019, 7:09:01 PM
Janis Papanagnou wrote:

> Yes, this inherent problem matches the issue with the 72-char truncation
> that I mentioned upthread. You can of course increase the number of digits
> beyond 5 by specifying a larger (fixed) one, say, use 18 digits (allowing
> to process files with 10^18-1 lines); but even if you can choose such a
> number to fit all practical cases there's always someone saying for any N
> that he'd need N+1 digits for his case (probably just for the argument) -
> the 72-char truncation issue I consider to be much more severe given the
> arbitrary data line lengths we nowadays process in many areas (but that
> issue maybe isn't existing any more in modern/standard implementations).
>
> When using pr(1) we should keep in mind that, given the indication provided
> by its options, this is a tool to format output for printing [in old ages].
> (Even the man page says so: "pr - convert text files for printing".)
> Multi-column, numbering of non-empty lines, page header information etc.
> are such indications.
>
> So while the tool can be used as one component for (a standard way of)
> general purpose data processing it's not its original purpose[*], and I
> think that specialized "data processing" tools as mentioned in this thread
> are generally better suited for processing of arbitrary data.
>
> Janis
>
> [*] I always thought about pr(1) as being the last processing/formatting
> component in a pipe before the final "| lpr" or "| opr".

You seem to agree with Helmut's conclusion that pr(1) is unusable for
line numbering. Actually it sounds as if you objected to its use even
before your suggestion:
| Or use of the (almost forgotten?) pr(1) command (supported with a bunch of
| options to suppress headers etc. etc.).

Truncation of input lines for single-column output by some(?) legacy
implementations is a non-issue with POSIX implementations, though; I
myself cannot recall seeing that even on Version 6 or 7 UNIX.


Janis Papanagnou

Jul 8, 2019, 11:45:49 PM
On 09.07.2019 01:08, Ralf Damaschke wrote:
> Janis Papanagnou wrote:
>> [...]
>>
>> When using pr(1) we should keep in mind that, given the indication provided
>> by its options, this is a tool to format output for printing [in old ages].
>> (Even the man page says so: "pr - convert text files for printing".)
>> Multi-column, numbering of non-empty lines, page header information etc.
>> are such indications.
>>
>> So while the tool can be used as one component for (a standard way of)
>> general purpose data processing it's not its original purpose[*], and I
>> think that specialized "data processing" tools as mentioned in this thread
>> are generally better suited for processing of arbitrary data.
>>
>> [*] I always thought about pr(1) as being the last processing/formatting
>> component in a pipe before the final "| lpr" or "| opr".
>
> You seem to agree with Helmut's conclusion that pr(1) is unusable for
> line numbering. Actually it sounds as if you objected to its use even
> before your suggestion:
> | Or use of the (almost forgotten?) pr(1) command (supported with a bunch of
> | options to suppress headers etc. etc.).

It's not "unusable". I think it's a "rusty" old tool whose interface
and design is not as one would define it nowadays. It is formatting
text data for printing on text devices. (We're in the HTML and TeX era.)
Nowadays we see more often specialized tools usually designed without
limitations or inconsistencies. To illustrate; pr(1) creates numbers
right aligned for nice formatting/printing, but that requires (to be
able to process data in a single pass or pipe) knowledge of the width
of the numbers or (as Helmut pointed out) it will get truncated. Or,
as I pointed out, you can increase the width to some _fixed_ value;
that will make it "usable". But (as a computer scientist) I dislike
hard coded arbitrary values, more so if avoidable. I prefer a default
behaviour that covers the general case, and options to control and
define specific behaviour, not hacks. With pr(1) you have to _disable_
the print header, and _specify_ some arbitrary width when processing
large data streams; to make it usable for the given purpose.

(Bear in mind where we came from: using tac, then a hack that adds line
numbers so that sort can do the job for us, which in turn needs tools
to create the sequence numbers.)

To summarize: you can use pr(1), and you can pick options to make it
work just fine (for practical purposes and with standard versions), but
to create sequence numbers for general cases I prefer other tools.

> Truncation of input lines for single column output by some(?) legacy
> implementations is a non-issue with POSIX implementations though, I
> myself cannot recall to see that even on Version 6 or 7 UNIX.

As I said upthread, that's what I thought I remembered about the first
version I used, but I'm not sure. (If someone has an old UNIX running,
or source code or docs to analyse, we could become sure about that.
From the old manual-like book on my shelf I cannot derive that info.)

Janis

Benjamin Esham

Jul 10, 2019, 2:04:13 PM
Martijn Dekker wrote:

>> Janis Papanagnou wrote:
>>
>>> Or use of the (almost forgotten?) pr(1) command (supported with a bunch
>>> of options to suppress headers etc. etc.).
>
> This may be an even better idea. The -t option completely turns off page
> separation and padding with empty lines, so 'pr -t' equals 'cat'. So 'pr
> -t -n' is a straightforward line numbering filter for arbitrary input
> text. It's output is identical to that of the non-standard 'cat -n'.
>
> I've also verified this is POSIX and works on macOS, the BSDs, Linux,
> Solaris, and (as verified at <https://unix50.org/>) all the way back to
> AT&T UNIX System III (1982), so this should be extremely portable.
>
> http://pubs.opengroup.org/onlinepubs/9699919799/utilities/pr.html
>
> As an aside, something like 'pr -t -3' will format text into 3 columns.
> I've been wanting a portable way to do that for a while (as the 'column'
> command is not portable).
>
> Thanks for reminding us of this useful utility!

This really excited me for a second--I have a blog post [1] that offers a
sequence of commands to extract some data from Git and display it in a nice
table, and I was always slightly annoyed that I had to offer an alternate
version for users who didn't have column(1) installed. (The pipeline
produces data in CSV format but with semicolons for delimiters. If you have
column(1) you can format it nicely with "column -s ';' -t"; my fallback was
"sed -e 's/;/\t/g'", which gets most of the way there but produces jagged-
looking columns sometimes.) It would be great to be able to replace those
two commands with a single one that would work on macOS, Linux, BSD, etc.

It took me some tweaking to get pr(1) to produce output similar to what I
wanted. Starting from an input like

3 years, 1 month ago;Parker Moore;v1-stable
12 months ago;Ben Balter;pages-as-documents
10 months ago;Jordon Bedwell;make-jekyll-parallel

I added

tr ';' '\n'

to the pipeline to put each data "cell" on its own line. Then I could run

pr -t -3 -a -i1000 -w`tput cols`

to produce output like

3 years, 1 month ago Parker Moore v1-stable
12 months ago Ben Balter pages-as-documents
10 months ago Jordon Bedwell make-jekyll-parallel

Not too bad! The arguments I used were as follows:

* -t: Omit headers and footers and don't print extra blank lines at the end
of the output.
* -3: Produce three-column output.
* -a: Use the first three input lines as the values in the first output row;
the next three input lines as the values in the next output row; and so
on. Without this option, the first three input lines would become the
first, second, and third *rows* in the first output *column*.
* -i1000: This is a kludge to prevent the output from including any tabs--I
wanted spaces to be used for alignment instead. (What this is really
saying is to format the output as if the tab stops were at columns 1001,
2001, 3001, etc., effectively ensuring that no tabs are actually printed.)
* -w`tput cols`: Specify that the output should be exactly as wide as the
current terminal. (I used the backtick syntax because I wanted something
that would work equally well under bash, zsh, and tcsh, the latter of
which is still used by many Mac people.) You could of course use something
like -w72 to specify that you want 72-column-wide output.

The main downsides (for my use case) are these:

* The columns are always equally sized. It's not too egregious in the
example above, but it can look really weird when one column contains very
short values and another contains very long values--there will be a ton of
empty space after the former.
* The output width has to be specified manually; as far as I can tell,
there's no option to produce output that's just as wide as it needs to be
but no wider. (Of course, this "limitation" makes perfect sense in the
tool's original context, where the paper has a certain width and you
always want the output to be sized to that width.)

(If I'm wrong about either of these, I would gladly be corrected.)

In contrast, column(1) does both of these things in the way I'd prefer: it
sizes each column according to its widest value, leaving only two spaces
between the right edge of one column and the left edge of the next. The
total width of the output is determined by the values in the input. pr(1)
will keep its columns the same size, even when this requires that the values
in one column be truncated (and even when a different column still has
plenty of whitespace around it).

I could see pr(1) being perfectly useful for some formatting tasks (and I'm
always delighted to learn about a "new" tool that has quietly been sitting
on my computer since I bought it!). In general, though, I would guess that
if you want a portable replacement for column(1) then you might be better
off writing a script in portable Awk instead.
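
For what it's worth, a rough sketch of that awk approach might look like
this (untested and hypothetical; it buffers the whole input, splits on ';'
like the example above, and pads each column to its widest value plus two
spaces):

awk -F ';' '
{
    if (NF > nf) nf = NF
    for (i = 1; i <= NF; i++) {
        cell[NR, i] = $i
        if (length($i) > w[i]) w[i] = length($i)   # widest value in column i
    }
}
END {
    for (r = 1; r <= NR; r++) {
        line = ""
        for (i = 1; i <= nf; i++)
            line = line sprintf("%-" ((i < nf) ? w[i] + 2 : w[i]) "s", cell[r, i])
        sub(/ *$/, "", line)                        # trim trailing padding
        print line
    }
}'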

[1]: https://esham.io/2017/07/git-branch-authors

Best regards,

--
Benjamin Esham
https://esham.io

Helmut Waitzmann

Jul 10, 2019, 5:56:24 PM
Benjamin Esham <use...@esham.io>:

> pr -t -3 -a -i1000 -w`tput cols`
>
[…]

> * -i1000: This is a kludge to prevent the output from including any tabs--I
> wanted spaces to be used for alignment instead. (What this is really
> saying is to format the output as if the tab stops were at columns 1001,
> 2001, 3001, etc., effectively ensuring that no tabs are actually printed.)

I'd like to suggest ‘-i\ 1’. That should replace spaces with
spaces; and the output ‘tab’ – i. e. space – stop width would be
1. => Each space would be replaced by exactly one space; no
space character sequence would be replaced by a tab character.
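
Plugged into the earlier command line, if I follow the suggestion
correctly, that would be something like (untested):

pr -t -3 -a -i' 1' -w`tput cols`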

Benjamin Esham

Jul 10, 2019, 10:57:17 PM
Way better! Thank you for the suggestion.