
Convert CR LF to CR


Harry

Jul 9, 2021, 12:19:10 PM
I have a text file in which I need to convert CR LF to CR.

Sample input
--
LF
LF
Line 1 CR LF
Line 2 CR LF
LF
LF
LF
Line 3 CR LF
...
--

Expected output
--
LF
LF
Line 1 CR
Line 2 CR
LF
LF
LF
Line 3 CR
...
--

I have tried the following, but the output file is unchanged.

sed -e 's#\r\n#\r#g' < infile > outfile
sed -e 's#\x0D\x0A#\x0D#g' < infile > outfile

Any suggestions?

TIA

Ed Morton

Jul 9, 2021, 12:30:25 PM
If that first sed command didn't do it then your input file isn't as
expected, assuming you don't REALLY have a blank char between each CR
and LF as your sample input shows (i.e. you have CRLF, not CR LF). Run
`cat -Ev` on your input file and copy/paste the output here.
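
For systems where `cat -Ev` is not available, a rough equivalent of this check can be sketched with `od` and `grep`. The sample data and the `make_sample` helper below are hypothetical stand-ins for the real input file.

```shell
# Hypothetical sample: two LF-only lines followed by two CRLF-terminated lines.
make_sample() { printf '\n\nLine 1\r\nLine 2\r\n'; }

# od -c prints every byte; CR shows up as \r and LF as \n.
make_sample | od -An -c

# Count the lines whose last character before the LF is a CR.
cr=$(printf '\r')
crlf_lines=$(make_sample | grep -c "${cr}\$")
printf 'lines ending in CR: %d\n' "$crlf_lines"
```

If the count is zero, the file does not contain CRLF pairs at all and no substitution can be expected to change it.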

Ed.

Kenny McCormack

Jul 9, 2021, 12:44:08 PM
In article <3ec51006-7355-44b5...@googlegroups.com>,
--- Cut Here ---
#!/usr/bin/awk
gsub(/CR LF/,"CR") + 1
--- Cut Here ---

--

Prayer has no place in the public schools, just like facts
have no place in organized religion.
-- Superintendent Chalmers

Harry

Jul 9, 2021, 12:59:06 PM
Here you go ...

$ sed -e 's#\r\n#\r#g' < infile > outfile

$ cat -Ev infile
$
$
MSH|blablabla^M$
OBX|blablabla^M$
$
$
$
MSH|blablabla^M$
OBX|blablabla^M$
$
$
$

$ cat -Ev outfile
$
$
MSH|blablabla^M$
OBX|blablabla^M$
$
$
$
MSH|blablabla^M$
OBX|blablabla^M$
$
$
$

Harry

Jul 9, 2021, 1:05:54 PM
On Friday, July 9, 2021 at 9:44:08 AM UTC-7, Kenny McCormack wrote:

> --- Cut Here ---
> #!/usr/bin/awk
> gsub(/CR LF/,"CR") + 1
> --- Cut Here ---

$ sh xx.awk < infile > outfile
xx.awk: line 2: syntax error near unexpected token `/CR'
xx.awk: line 2: `gsub(/CR LF/,"CR") + 1'

/u/tmp3
$ vi xx.awk

/u/tmp3
$ sh xx.awk < infile > outfile
xx.awk: line 2: syntax error near unexpected token `/\r\n/,"\r"'
xx.awk: line 2: `gsub(/\r\n/,"\r") + 1'

Ed Morton

Jul 9, 2021, 1:28:48 PM
Ah, I missed it first time but you're including `\n` in the regexp.
Replace that with `$` as the `\n` has already been consumed by sed
before the regexp is applied to the line so what you really want to
match is just a `\r` at the end of a line:

$ sed 's/\r$//' infile | cat -Ev
$
$
MSH|blablabla$
OBX|blablabla$
$
$
$
MSH|blablabla$
OBX|blablabla$
$
$
$
$
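
The point about the newline already being consumed can be checked with a small sketch. It uses a literal CR held in a shell variable rather than the `\r` escape, since that escape is not guaranteed in every sed; the sample data and `make_sample` helper are made up for illustration.

```shell
cr=$(printf '\r')
make_sample() { printf 'MSH|a\r\nOBX|b\r\n\n'; }

# sed strips the trailing LF before matching, so a pattern containing "\n"
# finds nothing on any single line: both CRLF lines come through untouched.
unchanged=$(make_sample | sed "s/${cr}\\n/${cr}/g" | grep -c "${cr}\$" || true)

# Anchoring on end-of-line does match, but it deletes the CR, and sed then
# writes the LF back, so every line ends in a bare LF.
stripped=$(make_sample | sed "s/${cr}\$//" | grep -c "${cr}\$" || true)

printf 'CR-terminated lines after the CR+LF pattern: %d\n' "$unchanged"
printf 'CR-terminated lines after the CR+EOL pattern: %d\n' "$stripped"
```

The first count staying at 2 is the "no change" symptom from the original post; the second dropping to 0 shows that `\r$` matches but produces LF endings, not CR endings.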

Regards,

Ed.

Harry

Jul 9, 2021, 1:36:54 PM
On Friday, July 9, 2021 at 10:28:48 AM UTC-7, Ed Morton wrote:

> $ sed 's/\r$//' infile | cat -Ev

Outfile now looks like this, but it is not what I want. I need "blablabla CR" instead of "blablabla LF".
--
LF
LF
MSH|blablabla LF
OBX|blablabla LF
LF
LF
LF
MSH|blablabla
OBX|blablabla
LF
LF
LF
--

Harry

Jul 9, 2021, 1:40:07 PM
On Friday, July 9, 2021 at 9:19:10 AM UTC-7, Harry wrote:

The infile is uuencoded below, and also rot13'ed because I could not post without the latter.
Anyone can try this infile.

TIA

$ uuencode infile - | rot13
ortva 770 -
Z"@V-4GN\8SDN8SDN8SDN#0V/0RN\8SDN8SDN8SDN#0U*"@V-4GN\8SDN8SDN
78SDN#0V/0RN\8SDN8SDN8SDN#0U*"@U`
`
raq

Ed Morton

Jul 9, 2021, 4:31:41 PM
OK, sorry, now I understand. I've never seen anyone ask for CRLF to
become CR before, it's always LF. Using GNU sed for `-z` to read the
whole file as a single string:

$ sed -z 's/\r\n/\r/g' infile | cat -Ev
$
$
MSH|blablabla^MOBX|blablabla^M$
$
$
MSH|blablabla^MOBX|blablabla^M$
$
$
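
A byte-level check of the `-z` approach might look like the sketch below. It assumes GNU sed (the `-z` option and the `\r` escape are GNU extensions) and guards for that; the sample data is hypothetical.

```shell
make_sample() { printf '\nMSH|a\r\nOBX|b\r\n\n'; }

# -z is a GNU sed extension, so check for GNU sed before relying on it.
if sed --version 2>/dev/null | grep -q 'GNU'; then
    # With -z the record separator is NUL, so the whole text is one record
    # and \r\n can match across what used to be line boundaries.
    make_sample | sed -z 's/\r\n/\r/g' | od -An -c
    bytes_out=$(make_sample | sed -z 's/\r\n/\r/g' | wc -c)
    lf_out=$(make_sample | sed -z 's/\r\n/\r/g' | tr -cd '\n' | wc -c)
    printf 'bytes written: %d, LF bytes left: %d\n' "$bytes_out" "$lf_out"
else
    printf '%s\n' 'this sed is not GNU sed; -z is unavailable'
fi
```

The 16-byte sample loses exactly the two LFs that followed a CR, while the two stand-alone LFs survive.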

Regards,

Ed.

Harry

Jul 9, 2021, 5:04:38 PM
On Friday, July 9, 2021 at 1:31:41 PM UTC-7, Ed Morton wrote:

> OK, sorry, now I understand. I've never seen anyone ask for CRLF to
> become CR before, it's always LF. Using GNU sed for `-z` to read the
> whole file as a single string:
>
> $ sed -z 's/\r\n/\r/g' infile | cat -Ev
> $
> $
> MSH|blablabla^MOBX|blablabla^M$
> $
> $
> MSH|blablabla^MOBX|blablabla^M$
> $
> $

Ed, it is almost there. The first (MSH) lines end with CR.
But the second (OBX) lines still end with CR LF.

I need to make the lines end with CR because they are HL7 messages,
which are Health care messages which have CR as segment separators.

I searched the web for dos2mac to no avail; of course dos2unix and unix2dos are there.
In the old days, Apple computers ended all lines with CR.

Please help me make all of the (MSH, OBX) lines work.

Thanks

Keith Thompson

Jul 9, 2021, 5:08:34 PM
I'm curious why you'd want that. Using just CR as a line ending is
rare; the most recent example I can think of is MacOS before OSX.
And I can't really think of a good reason to want different endings
on different lines in the same file.

Not saying it's wrong, but it's unusual (and I'm wondering if more
information about why you need this might lead to a better solution).

--
Keith Thompson (The_Other_Keith) Keith.S.T...@gmail.com
Working, but not speaking, for Philips
void Void(void) { Void(); } /* The recursive call of the void */

Harry

Jul 9, 2021, 5:10:55 PM
Ed,
Sorry, actually it works.
The outfile now looks like this; the LF after OBX comes from the following blank (LF-only) line.
--
LF
LF
MSH|bla CR
OBX|bla CR LF
LF
LF
...
--

Kaz Kylheku

Jul 9, 2021, 5:29:19 PM
awk '/\r$/ { printf("%s", $0); next } 1'

Explanation: note that the requirements amount to processing an ordinary
Unix-style LF-terminated file such that:

- Every line that ends in CR is output without the terminating LF.

- All other lines are passed through.

All we have to do is match lines ending in a CR and output them
using printf("%s", ...) without any newline.
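
The two rules above can be verified by counting bytes, as in this sketch. The `crlf_to_cr` wrapper and the sample data are hypothetical names introduced for the test; the awk program itself is the one-liner from this post.

```shell
crlf_to_cr() { awk '/\r$/ { printf("%s", $0); next } 1'; }
make_sample() { printf '\n\nMSH|a\r\nOBX|b\r\n\n\n'; }

# Show the converted bytes: \r is CR, \n is LF.
make_sample | crlf_to_cr | od -An -c

lf_before=$(make_sample | tr -cd '\n' | wc -c)
lf_after=$(make_sample | crlf_to_cr | tr -cd '\n' | wc -c)
cr_after=$(make_sample | crlf_to_cr | tr -cd '\r' | wc -c)
printf 'LF before: %d, LF after: %d, CR after: %d\n' "$lf_before" "$lf_after" "$cr_after"
```

Only the two LFs that followed a CR disappear; the four blank-line LFs and both CRs are kept.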

Harry

Jul 9, 2021, 5:37:10 PM
On Friday, July 9, 2021 at 2:29:19 PM UTC-7, Kaz Kylheku wrote:

> awk '/\r$/ { printf("%s", $0); next } 1'
>
> Explanation: note that the requirements amount to processing an ordinary
> Unix-style LF-terminated file such that:
>
> - Every line that ends in CR is output without the terminating LF.
>
> - All other lines are are passed through.
>
> All we have to do is match lines ending in a CR and output them
> using printf("%s", ...) without any newline.

Kaz,

Your solution works as well as Ed's.
BTW, what does the "1" at the end of the awk command mean?

Thanks

Harry

Jul 9, 2021, 5:53:08 PM
On Friday, July 9, 2021 at 2:08:34 PM UTC-7, Keith Thompson wrote:

> I'm curious why you'd want that. Using just CR as a line ending is
> rare; the most recent example I can think of is MacOS before OSX.
> And I can't really think of a good reason to want different endings
> on different lines in the same file.

Keith

To make my story short ...

I am working in Healthcare where my role is an Interface analyst.
Health Care messages (HL7) come from other foreign Healthcare systems
and post to our Hospital Information system.

Previously I used Oracle's sqldeveloper to manually query the HL7 messages stored
in our database, and I then used some shell scripts to massage the SQL output
in order to do some of my tasks.

Oracle now has another tool called sqlcl, which allows the query to be automated.
But the sqlcl output has some issues (such as showing HL7 segments ending in CR LF
instead of just CR as the segment separator).

So, that is the reason I need a shell script fix.

Hope this explains my reason, without confusing more ....

Thanks

David W. Hodgins

Jul 9, 2021, 5:59:36 PM
On Fri, 09 Jul 2021 17:04:35 -0400, Harry <harryoo...@hotmail.com> wrote:
> I search the web for dos2mac with no aval, of course dos2unix and unix2dos are there.

$ cat test
one
two
$ unix2dos -n test test.dos
unix2dos: converting file test to file test.dos in DOS format...
$ unix2mac -n test test.mac
unix2mac: converting file test to file test.mac in Mac format...
$ hexdump test
0000000 6e6f 0a65 7774 0a6f
0000008
$ hexdump test.dos
0000000 6e6f 0d65 740a 6f77 0a0d
000000a
$ hexdump test.mac
0000000 6e6f 0d65 7774 0d6f
0000008
$ rpm -q -f /usr/bin/unix2mac
dos2unix-7.4.2-1.mga8
$ rpm -q -i dos2unix|grep ^URL
URL : http://www.xs4all.nl/~waterlan/dos2unix.html

This is on a Mageia 8 x86_64 installation.

Regards, Dave Hodgins

--
Change dwho...@nomail.afraid.org to davidw...@teksavvy.com for
email replies.

Kenny McCormack

Jul 9, 2021, 6:21:06 PM
In article <b053012b-041b-4882...@googlegroups.com>,
Harry <harryoo...@hotmail.com> wrote:
...
>Your solution works as well as Ed's.

High praise, indeed.

--
"Every time Mitt opens his mouth, a swing state gets its wings."

(Should be on a bumper sticker)

Harry

Jul 9, 2021, 6:22:11 PM
On Friday, July 9, 2021 at 2:59:36 PM UTC-7, David W. Hodgins wrote:
> On Fri, 09 Jul 2021 17:04:35 -0400, Harry <harryoo...@hotmail.com> wrote:
> > I search the web for dos2mac with no aval, of course dos2unix and unix2dos are there.
> $ cat test
> one
> two
> $ unix2dos -n test test.dos
> unix2dos: converting file test to file test.dos in DOS format...
> $ unix2mac -n test test.mac
> unix2mac: converting file test to file test.mac in Mac format...

I'd been there, d/l dos2unix-7.4.2-win64.zip.
4 exe there :

dos2unix.exe
unix2dos.exe
mac2unix.exe
unix2mac.exe

But no dos2mac.exe nor mac2dos.exe.

David W. Hodgins

Jul 9, 2021, 6:47:56 PM
On Fri, 09 Jul 2021 18:22:08 -0400, Harry <harryoo...@hotmail.com> wrote:
> I'd been there, d/l dos2unix-7.4.2-win64.zip.
> 4 exe there :
>
> dos2unix.exe
> unix2dos.exe
> mac2unix.exe
> unix2mac.exe
>
> But no dos2mac.exe nor mac2dos.exe.

Run it in two steps. Dos to unix, and then unix to mac or vice-versa. Not ideal, but
easy to do.

Harry

Jul 9, 2021, 7:01:39 PM
On Friday, July 9, 2021 at 3:47:56 PM UTC-7, David W. Hodgins wrote:

> > But no dos2mac.exe nor mac2dos.exe.
> Run it in two steps. Dos to unix, and then unix to mac or vice-versa. Not ideal, but
> easy to do.

Dave

It would not work for me, as the second step unix2mac will change the
blank LF lines to CR lines, which is not what I want.
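
The problem with the two-step route can be illustrated without the dos2unix package. In this sketch, `tr -d '\r'` and `tr '\n' '\r'` are only approximations of what dos2unix and unix2mac do (an assumption, not the real tools), and the sample data is made up.

```shell
make_sample() { printf '\nMSH|a\r\nOBX|b\r\n\n'; }

# Stand-ins (assumptions): tr -d '\r' approximates dos2unix,
# tr '\n' '\r' approximates unix2mac.
two_step() { tr -d '\r' | tr '\n' '\r'; }

lf_left=$(make_sample | two_step | tr -cd '\n' | wc -c)
cr_now=$(make_sample | two_step | tr -cd '\r' | wc -c)
printf 'LF remaining: %d, CR count: %d\n' "$lf_left" "$cr_now"
```

Every LF becomes a CR, including the ones on the blank lines that were supposed to stay as LF.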

Thanks

Kaz Kylheku

Jul 9, 2021, 7:19:23 PM
On 2021-07-09, Harry <harryoo...@hotmail.com> wrote:
> On Friday, July 9, 2021 at 2:29:19 PM UTC-7, Kaz Kylheku wrote:
>
>> awk '/\r$/ { printf("%s", $0); next } 1'
>>
>> Explanation: note that the requirements amount to processing an ordinary
>> Unix-style LF-terminated file such that:
>>
>> - Every line that ends in CR is output without the terminating LF.
>>
>> - All other lines are are passed through.
>>
>> All we have to do is match lines ending in a CR and output them
>> using printf("%s", ...) without any newline.
>
> Kaz,
>
> Your solution works as well as Ed's.

In one try, using code that can reasonably be expected to work on a SVR4
Unix box from 1988.

--
TXR Programming Language: http://nongnu.org/txr
Cygnal: Cygwin Native Application Library: http://kylheku.com/cygnal

Keith Thompson

Jul 9, 2021, 8:18:36 PM
My first suggestion is to stop thinking of the chunks of your input and
output files as "lines". The whole thing is a sequence of characters.
Your goal is to copy that sequence, replacing every occurrence of CR LF
by CR, while copying any LF characters *not* preceded by CR unchanged.

Here's a C program that should do the job. It reads from stdin and
writes to stdout. (Of course this requires having access to a C
development environment so you can build and run the program.)

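#include <stdio.h>

/* Copy stdin to stdout, dropping any LF that immediately follows a CR. */
int main(void) {
    int prev = 0;  /* previously read character; 0 before any input */
    int c;
    while ((c = getchar()) != EOF) {
        if (! (prev == '\r' && c == '\n')) {
            putchar(c);
        }
        prev = c;
    }
    return 0;
}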

Don't worry about the performance of reading and writing a character at
a time. It's all buffered.

It should do what you want if you run it in a Unix-like environment
(which seems a safe assumption given where you posted).

(Under Windows CRLF pairs would be translated to LF on input, and LF to
CRLF on output. You could avoid that by using binary streams for input
and output, avoiding translation of line terminators. But Unix treats
text and binary files the same way, so it shouldn't be necessary for
you.)

Spiros Bousbouras

Jul 10, 2021, 7:02:56 AM
It is a pattern. It means always match (1 stands for true; it is one of the
many ways to indicate boolean true in AWK) and, since no explicit action is
given, the default action happens, which is to print the whole line. So if a
line matches the pattern /\r$/ then { printf("%s", $0); next } will be executed,
and the next command will load the next line and start matching the
patterns from the beginning; if /\r$/ does not match then the next
"pattern" is tried, which is 1, which always matches, so the whole line gets
printed.
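
A small sketch of both behaviours, using made-up two-line input: a bare `1` acts like `{ print }`, and `next` keeps a record from reaching any later rule.

```shell
# A pattern with no action prints the record; 1 is always true.
out_one=$(printf 'a\nb\n' | awk '1')
out_print=$(printf 'a\nb\n' | awk '{ print }')

# next abandons the remaining rules for the current record, so "b" is
# handled by the first rule only and never reaches the trailing 1.
out_next=$(printf 'a\nb\n' | awk '/b/ { print "skipped: " $0; next } 1')

printf '%s\n' "$out_one" "$out_next"
```

Without the `next`, the line "b" would be printed twice: once by the first rule and once by the `1`.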

By the way, for this kind of thing the first thing to cross my mind would be
to use C because for this problem it actually provides the clearest code. So
something like Keith's code in <87k0lz6...@nosuchdomain.example.com>
would be my preference.

--
The famous --do-what-I-mean-not-what-I-write compiler switch.
https://lwn.net/Articles/508991/

Janis Papanagnou

Jul 10, 2021, 7:09:01 AM
On 10.07.2021 13:02, Spiros Bousbouras wrote:
>
> By the way , for this kind of thing the first thing to cross my mind would be
> to use C because for this problem it actually provides the clearest code.

And likely also the fastest, which would matter if - as I would expect
with health data - a lot of data has to be handled.

> So
> something like Keith's code in <87k0lz6...@nosuchdomain.example.com>
> would be my preference.

For small data sets a quick awk prototype solution might be preferable,
and also if there's no C development environment available in the OP's
working context.

Janis

Harry

Jul 10, 2021, 10:37:08 AM
On Saturday, July 10, 2021 at 4:02:56 AM UTC-7, Spiros Bousbouras wrote:

> > BTW, what is the "1" mean/for at the end pf the awk cmd ?

> It is a pattern. It means always match (1 stands for true ; one of the many
> ways to indicate boolean true in AWK) and , since no explicit action is given
> , the default action will happen which is print the whole line. So if a line
> matches the pattern /\r$/ then { printf("%s", $0); next } will be executed
> and the next command will load the next line and start matching the
> patterns from the beginning ; if /\r$/ does not match then the next
> "pattern" is tried which is 1 which always matches so the whole line gets
> printed.

Thanks for the explanation.

Harry

Jul 10, 2021, 10:39:20 AM
On Friday, July 9, 2021 at 5:18:36 PM UTC-7, Keith Thompson wrote:

> #include <stdio.h>
> int main(void) {
> int prev = 0;
> int c;
> while ((c = getchar()) != EOF) {
> if (! (prev == '\r' && c == '\n')) {
> putchar(c);
> }
> prev = c;
> }
> }

It works very well, thanks.

Keith Thompson

Jul 10, 2021, 7:55:15 PM
I thought of using C because I found it clearer to think of the input
and output as character-oriented, not line-oriented. awk processing is
line-oriented.

On the other hand, there's more than one way to describe the problem.

One way, which led to my C solution, was:

Copy input *characters* to output, except that every occurrence of
CRLF is replaced by CR.

An equivalent way to describe it is:

Copy input *lines* to output, except that any line ending in CR is
printed without a trailing newline (think "echo -n" vs. "echo").

(This assumes a "line" is terminated by a single LF character.)

The input is line-oriented. The output is not.

That led me to this awk solution, which is similar to the one Kaz
posted, except that mine is a bit more verbose. It might be easier for
someone who's not an awk expert to follow. (My awk is a bit rusty).

#!/usr/bin/awk -f

{
    if (/\r$/) {
        printf("%s", $0);
    }
    else {
        print
    }
}

Awk's input is performed a line at a time, discarding the terminating
newline character. The "print" statement adds a newline character (and
prints $0, the input line, if you don't give it an argument). "printf"
prints exactly what you tell it to, printing a newline only if you
specify it. (`awk '{print}'` copies input to output *except* that if
the input doesn't end in a newline, it will add one.)
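
The difference between "print" and "printf" can be seen by counting output bytes for a three-byte input that has no trailing newline (a made-up input, chosen so the numbers are easy to read):

```shell
# Input is the three bytes "abc" with no terminating newline.
with_print=$(printf 'abc' | awk '{ print }' | wc -c)
with_printf=$(printf 'abc' | awk '{ printf("%s", $0) }' | wc -c)
printf 'print wrote %d bytes, printf wrote %d bytes\n' "$with_print" "$with_printf"
```

"print" appends the output record separator (a newline by default), giving 4 bytes; "printf" writes only the 3 bytes it was given.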

And here's a Perl one-liner solution:

perl -pe 'chomp if /\r$/'

Perl, like awk, is line-oriented, but it doesn't discard newlines on
input. "chomp" deletes a trailing newline. "-p" tells Perl to do an
awk-like loop copying input lines to output. An equivalent Perl script
that doesn't use special command-line options:

#!/usr/bin/perl

# The following two lines don't matter in this case, but are good Perl
# practice in general.
use strict;
use warnings;

while (<>) {
    chomp if /\r$/;
    print;
}
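
As a sanity check, the Perl one-liner and the awk one-liner posted earlier in the thread can be compared byte for byte. This is only a sketch: it assumes perl is installed (and skips the comparison otherwise), and the sample data is hypothetical.

```shell
make_sample() { printf '\nMSH|a\r\nOBX|b\r\n\n'; }

if command -v perl >/dev/null 2>&1; then
    # Dump both results with od so that CR and LF bytes are compared exactly.
    via_perl=$(make_sample | perl -pe 'chomp if /\r$/' | od -An -c)
    via_awk=$(make_sample | awk '/\r$/ { printf("%s", $0); next } 1' | od -An -c)
    if [ "$via_perl" = "$via_awk" ]; then
        printf '%s\n' 'perl and awk produce identical bytes'
    else
        printf '%s\n' 'perl and awk outputs differ'
    fi
fi
```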

Janis Papanagnou

Jul 10, 2021, 9:57:28 PM
On 11.07.2021 01:55, Keith Thompson wrote:
>
> I thought of using C because I found it clearer to think of the input
> and output as character-oriented, not line-oriented.

Definitely.

> awk processing is line-oriented.

Unless you use GNU Awk with a "stream processing" setup.

>
> On the other hand, there's more than one way to describe the problem.

Sure.

Janis

Keith Thompson

Jul 10, 2021, 11:41:12 PM
Janis Papanagnou <janis_pa...@hotmail.com> writes:
> On 11.07.2021 01:55, Keith Thompson wrote:
>> I thought of using C because I found it clearer to think of the input
>> and output as character-oriented, not line-oriented.
>
> Definitely.
>
>> awk processing is line-oriented.
>
> Unless you use GNU Awk with a "stream processing" setup.

I don't see anything about that in the gawk manual. Do you have a
reference?

For myself, if a problem isn't strictly line-oriented I generally treat
that as a sign that I should use something other than awk, but I'm sure
it's more powerful than what I'm aware of.

(There's a comp.lang.awk newsgroup, so perhaps this isn't quite topical.)

>> On the other hand, there's more than one way to describe the problem.
>
> Sure.

Janis Papanagnou

Jul 11, 2021, 6:35:21 AM
On 11.07.2021 05:41, Keith Thompson wrote:
> Janis Papanagnou <janis_pa...@hotmail.com> writes:
>> On 11.07.2021 01:55, Keith Thompson wrote:
>>> I thought of using C because I found it clearer to think of the input
>>> and output as character-oriented, not line-oriented.
>>
>> Definitely.
>>
>>> awk processing is line-oriented.
>>
>> Unless you use GNU Awk with a "stream processing" setup.
>
> I don't see anything about that in the gawk manual. Do you have a
> reference?

I don't recall whether it was mentioned in Arnold's Awk book and/or
also discussed in comp.lang.awk, and myself I rarely used it and it
was decades ago. A quick test (from memory) shows how to activate
that feature, by setting RS to NUL. The following script

awk 'BEGIN{RS="\0"} END{print NR}'

will print 1 (for non-empty files), meaning that it has processed one
record.

> For myself, if a problem isn't strictly line-oriented I generally treat
> that as a sign that I should use something other than awk, but I'm sure
> it's more powerful than what I'm aware of.

Actually you can expect that GNU Awk supports quite some features that
the Awk standard doesn't, and quite useful features, like allowing RS
to be a regular expression (just for a prominent and important example).

Given GNU Awk's widespread availability (on many platforms), and being
open source, and being still actively supported and developed, makes it
effectively the quasi standard tool for awk.

Mind that here in comp.unix.shell there's regularly code posted that
contains bash-specifics, often without this fact being mentioned. I
think it should always be mentioned when non-standard constructs are
(necessarily or unnecessarily) used, so folks can decide whether the
solution is suited or not, and the same (IMO to a lesser degree,
because of being a quasi-standard) is true for GNU Awk.

> (There's a comp.lang.awk newsgroup, so perhaps this isn't quite topical.)

Awk solutions are fine here. For specific Awk language or tool specific
discussions the awk newsgroup might be better suited. Though I don't
see that we went into any gory awk details here with our discussion.

Janis

Kenny McCormack

Jul 11, 2021, 7:55:46 AM
In article <scehh5$76d$1...@news-1.m-online.net>,
Janis Papanagnou <janis_pa...@hotmail.com> wrote:
...
>>> Unless you use GNU Awk with a "stream processing" setup.
>>
>> I don't see anything about that in the gawk manual. Do you have a
>> reference?
>
>I don't recall whether it was mentioned in Arnold's Awk book and/or
>also discussed in comp.lang.awk, and myself I rarely used it and it
>was decades ago.

I don't think "stream processing" is a GAWK "thing". I think you are
mis-remembering that (to put it kindly).

From what you say below, I am going to guess that what you are attempting
to evoke is the idea of reading in a whole file at once, and then
processing through the buffer. Sort of like what you might do in a
low-level language like C.

>A quick test (from memory) shows how to activate
>that feature, by setting RS to NUL. The following script
>
> awk 'BEGIN{RS="\0"} END{print NR}'

This doesn't do what you think it does. In fact, setting RS to a NUL
character is perfectly legitimate and will do exactly that. This can be
useful when processing the /proc/*/environ files in Linux. E.g.,

gawk 'BEGIN { RS = "\0" } { print NR,$0 }' /proc/self/environ | less

>will print 1 (for non-empty files), meaning that it has processed one
>record.

Only if there are no actual nulls in the file.
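
The dependence on the input can be shown with a short sketch. It assumes gawk is installed (other awks may treat a NUL in RS differently) and uses made-up data shaped like a /proc environ file.

```shell
if command -v gawk >/dev/null 2>&1; then
    # A text stream without NUL bytes: RS="\0" never matches, so one record.
    text_records=$(printf 'one\ntwo\n' | gawk 'BEGIN { RS = "\0" } END { print NR }')
    # NUL-delimited data, shaped like /proc/<pid>/environ: one record per entry.
    nul_records=$(printf 'A=1\0B=2\0C=3\0' | gawk 'BEGIN { RS = "\0" } END { print NR }')
    printf 'text input: %d record(s), NUL-delimited input: %d record(s)\n' \
        "$text_records" "$nul_records"
fi
```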

Now, what I *think* you are going for is that if you set RS="^$", then it
*is* guaranteed to never match. This has been discussed in the newsgroup
(comp.lang.awk), and has been codified as the following include file
(readfile.awk), which I have in my "awksrc" directory (accessed via the
AWKPATH environment variable) on all of my systems:

--- Cut Here ---
# readfile.awk --- read an entire file at once
#
# Original idea by Denis Shirokov, cosm...@gmail.com, April 2013
#

function readfile(file, tmp, save_rs)
{
    save_rs = RS
    RS = "^$"
    getline tmp < file
    close(file)
    RS = save_rs

    return tmp
}
--- Cut Here ---
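
Applied to the CR LF question, the same whole-file idea might look like the sketch below. It sets RS directly instead of calling readfile(), assumes gawk, and uses hypothetical sample data.

```shell
make_sample() { printf '\nMSH|a\r\nOBX|b\r\n\n'; }

if command -v gawk >/dev/null 2>&1; then
    # RS="^$" never matches in a non-empty input, so $0 holds the whole file,
    # trailing newline included; printf writes it back without adding one.
    slurped=$(make_sample |
        gawk 'BEGIN { RS = "^$" } { gsub(/\r\n/, "\r"); printf("%s", $0) }' |
        wc -c)
    printf 'bytes after conversion: %d\n' "$slurped"
fi
```

The 16-byte sample shrinks to 14 bytes because only the two LFs that follow a CR are dropped. As noted above, this holds the entire file in memory, so it suits inputs that are not too large.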

This could be used to solve OP's problem (assuming we ever actually figure
out what OP's problem is), assuming the file isn't too large.

P.S. I posted on this thread a few days ago. To date, mine is the only
actual solution to OP's problem as originally described. Everything else
(i.e., all the other posts on this thread) has been noise.

--
BigBusiness types (aka, Republicans/Conservatives/Independents/Liberatarians/whatevers)
don't hate big government. They *love* big government as a means for them to get
rich, sucking off the public teat. What they don't like is *democracy* - you know,
like people actually having the right to vote and stuff like that.

Janis Papanagnou

Jul 11, 2021, 8:21:11 AM
On 11.07.2021 13:55, Kenny McCormack wrote:
> In article <scehh5$76d$1...@news-1.m-online.net>,
> Janis Papanagnou <janis_pa...@hotmail.com> wrote:
> ...
>>>> Unless you use GNU Awk with a "stream processing" setup.
>>>
>>> I don't see anything about that in the gawk manual. Do you have a
>>> reference?
>>
>> I don't recall whether it was mentioned in Arnold's Awk book and/or
>> also discussed in comp.lang.awk, and myself I rarely used it and it
>> was decades ago.
>
> I don't think "stream processing" is a GAWK "thing". I think you are
> mis-remembering that (to put it kindly).

It is, strictly, not stream processing and that's why I wrote "stream
processing" (in quotes). But I don't think I am misremembering because
it serves the intended purpose, which was to regard the various line
separator characters in the text file as part of the data (as opposed
to being record/line separators).

>
> From what you say below, I am going to guess that what you are attempting
> to evoke is the idea of reading in a whole file at once, and then
> processing through the buffer.

Effectively, yes.

> Sort of like what you might do in a low-level language like C.

Not quite. In C I usually do real stream processing, not filling the
whole file in the record buffer (in an array).

>
>> A quick test (from memory) shows how to activate
>> that feature, by setting RS to NUL. The following script
>>
>> awk 'BEGIN{RS="\0"} END{print NR}'
>
> This doesn't do what you think it does.

It always did what I thought it would do. (What do you think I have
thought it would do?)

> In fact, setting RS to a NUL
> character is perfectly legitimate and will do exactly that. This can be
> useful when processing the /proc/*/environ files in Linux. E.g.,
>
> gawk 'BEGIN { RS = "\0" } { print NR,$0 }' /proc/self/environ | less
>
>> will print 1 (for non-empty files), meaning that it has processed one
>> record.
>
> Only if there are no actual nulls in the file.

We have been considering text files in this thread (not binary). Myself
I have never considered awk to be a binary file processor, but if one
wants to use it that way I'm of course fine with it.

>
> Now, what I *think* you are going for is that if you set RS="^$", then it
> *is* guaranteed to never match.

This is an alternative (and works for your binary file case as well).

Janis

> [...]


Ben Bacarisse

Jul 11, 2021, 9:08:06 AM
gaz...@shell.xmission.com (Kenny McCormack) writes:

> P.S. I posted on this thread a few days ago. To date, mine is the only
> actual solution to OP's problem as originally described. Everything else
> (i.e., all the other posts on this thread) has been noise.

Including the three the OP said worked well? Odd. I thought your
suggestion was intended as a joke (at the OP's expense) but maybe you
were being serious.

--
Ben.

Janis Papanagnou

Jul 12, 2021, 5:28:18 AM
On 11.07.2021 05:41, Keith Thompson wrote:
>> Unless you use GNU Awk with a [sort of] "stream processing" setup.
>
> I don't see anything about that in the gawk manual. Do you have a
> reference?

It's indeed mentioned in Arnold Robbins's book and also in the GNU Awk
manual online (see chapter "Record Splitting with gawk" for "\0").
https://www.gnu.org/software/gawk/manual/gawk.html

The current manual version mentions also "^$" in the context of a
getline based readfile function
https://www.gnu.org/software/gawk/manual/gawk.html#Readfile-Function

And finally there's mention of an extension library function to read
an entire file at once
https://www.gnu.org/software/gawk/manual/gawk.html#Extension-Sample-Readfile

Janis

Kenny McCormack

Jul 12, 2021, 6:23:34 AM
In article <sch1vd$u3m$1...@news-1.m-online.net>,
Janis Papanagnou <janis_pa...@hotmail.com> wrote:
>On 11.07.2021 05:41, Keith Thompson wrote:
>>> Unless you use GNU Awk with a [sort of] "stream processing" setup.
>>
>> I don't see anything about that in the gawk manual. Do you have a
>> reference?
>
>It's indeed mentioned in Arnold Robbin's Book and also in the GNU Awk
>manual online (see chapter "Record Splitting with gawk" for "\0").
>https://www.gnu.org/software/gawk/manual/gawk.html

As I've already explained to you, there is absolutely nothing special or
unique about setting RS = "\0".

It works entirely as expected. I even gave you an example of using it to
parse Linux proc "environ" files. You can also use it to parse the
"cmdline" file. There are other files in /proc as well that are delimited
by null characters.

--
If you ask a Trumper who is to blame for the debacle of Jan 6, they will almost certainly say
something about Antifa/BLM/something/whatever. This shows just how screwed up they are; they can't
even get their narrative straight. What they *should* say is "Eugene Goodman". If not for him, the plot
would probably have succeeded, so he (Eugene) is clearly to blame for the failure.

Janis Papanagnou

Jul 12, 2021, 8:00:43 AM
On 12.07.2021 12:23, Kenny McCormack wrote:
> In article <sch1vd$u3m$1...@news-1.m-online.net>,
> Janis Papanagnou <janis_pa...@hotmail.com> wrote:
>> On 11.07.2021 05:41, Keith Thompson wrote:
>>>> Unless you use GNU Awk with a [sort of] "stream processing" setup.
>>>
>>> I don't see anything about that in the gawk manual. Do you have a
>>> reference?
>>
>> It's indeed mentioned in Arnold Robbin's Book and also in the GNU Awk
>> manual online (see chapter "Record Splitting with gawk" for "\0").
>> https://www.gnu.org/software/gawk/manual/gawk.html
>
> As I've already explained to you, there is absolutely nothing special or
> unique about setting RS = "\0".

Depends on your mental image of "special". Special is that it's not
portably working; it's even explained in the GNU Awk manual IIRC,
since you cannot expect awks to not fail when using C strings as an
Awk string implementation. In GNU Awk, OTOH, it is _technically_ not
special, but it had been mentioned in the past that it can be used
to achieve what we're talking about here in this thread. That's all.
Is that so hard to accept?

I'm not sure what your problem is. Are you trying to teach me how to
breathe? Do you feel to be misunderstood by me?

>
> It works entirely as expected.

Yes. I also wrote upthread "It always did what I thought it would do."
(in response to your "This doesn't do what you think it does." hubris).

> I even gave you an example of using it to
> parse Linux proc "environ" files. You can also use it to parse the
> "cmdline" file. There are other files in /proc as well that are delimited
> by null characters.

So what? I know all that. Repeating your trivial statements is nothing
but unnecessary noise (to use your habitual style of word choices).

Janis

Ed Morton

Jul 12, 2021, 8:59:08 AM
On 7/11/2021 5:35 AM, Janis Papanagnou wrote:
<snip>
> The following script
>
> awk 'BEGIN{RS="\0"} END{print NR}'
>
> will print 1 (for non-empty files), meaning that it has processed one
> record.

Assuming gawk, use RS="^$" instead of RS="\0". An input file could
contain NULs and still be processed by gawk (but not some other awks as
it's then not a valid POSIX text file), but "^$" only matches an empty
file and therefore cannot match any string in a non-empty file.
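
The difference between the two settings shows up on input that contains both an empty line and a NUL byte. This sketch assumes gawk; the `mixed` helper and its data are made up for the test.

```shell
if command -v gawk >/dev/null 2>&1; then
    # Input with an empty line and an embedded NUL byte.
    mixed() { printf 'a\n\nb\0c\n'; }
    by_nul=$(mixed | gawk 'BEGIN { RS = "\0" } END { print NR }')
    by_never=$(mixed | gawk 'BEGIN { RS = "^$" } END { print NR }')
    printf 'RS="\\0": %d records, RS="^$": %d record\n' "$by_nul" "$by_never"
fi
```

With RS="\0" the NUL splits the input into two records, while RS="^$" yields a single record: neither the NUL nor the empty line counts as a match.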

Ed.

Kenny McCormack

Jul 12, 2021, 9:58:58 AM
In article <schean$k7q$1...@dont-email.me>,
Like a stopped clock...

--
People who want to share their religious views with you
almost never want you to share yours with them. -- Dave Barry

Janis Papanagnou

Jul 12, 2021, 10:25:35 AM
Hi Ed,

since you killfiled a prominent member you obviously haven't seen all
replies.

On 12.07.2021 14:59, Ed Morton wrote:
> On 7/11/2021 5:35 AM, Janis Papanagnou wrote:
> <snip>
>> The following script
>>
>> awk 'BEGIN{RS="\0"} END{print NR}'
>>
>> will print 1 (for non-empty files), meaning that it has processed one
>> record.
>
> Assuming gawk, use RS="^$" instead of RS="\0".

That has already been suggested...

> An input file could
> contain NULs and still be processed by gawk (but not some other awks as
> it's then not a valid POSIX text file), but "^$" only matches an empty
> file and therefore cannot match any string in a non-empty file.

...and I replied that we're considering text files here (no NULs).

Seen from a wider perspective all the options are arguable and depend
on the constraints and tools used.

This has actually all been already disputed in this newsgroup 8 years
ago (and the discussion was spanning many months).

To quote one of my years old replies (just as an example, but it may
also answer another poster's vigorous attack) from that old thread:

A '\0' character had never been a reliable separator in case of
binary files. And WRT text files using RS=SUBSEP would be even
better than the suggested RS="\0" in cases when your programs
shall run on other (and older) awks as well.

There are yet more aspects to consider. At that time you posted (also
just for another example) to use "\n$" in certain application contexts.
Another poster suggested an equivalent of RS=".*" (IIRC). And so on.
In short; what fits best depends.

The "^$" is very appealing because you can use it for binary files as
well (although I'm using awk as _text_-processor). There's a few things
I don't like much with it, though. Similar to "\0" it is non-portable.
It's more cryptic, no one seems to be perfectly sure how it works - at
least that was the impression I got from the discussion 8 years ago,
and also now I immediately start a test case to see whether it matches
an empty line (just to have some confidence). Because of that fact I
think it should explicitly be documented and supported by a rationale
that explains the _mechanism_. (Of course there could also just be a
statement that it's sort of a special pattern that works "magically",
so that no [technical] explanation needs to be formulated.) The GNU
Awk manual has the topic incoherently spread across three chapters.
It would certainly be helpful to have a more coherent picture and a
guidance or suggestion (with all the caveats, e.g. about what happens
with a RS="^$" statement in other awks). Performance issues have also
been addressed by Arnold in the past; I think that the @load option is
the fastest because it bypasses the regexp processing. Such guidance
would help the user make an informed choice about what to use when.

Until we have such a "directive" I fear we'll repeat our discussions
every couple years again, and they seem to not be quickly terminated
discussions on every re-iteration. ;-)

Janis

>
> Ed.

Ben Bacarisse

unread,
Jul 12, 2021, 10:26:43 AM7/12/21
to
Keith Thompson <Keith.S.T...@gmail.com> writes:

> Janis Papanagnou <janis_pa...@hotmail.com> writes:
>> On 11.07.2021 01:55, Keith Thompson wrote:
>>> I thought of using C because I found it clearer to think of the input
>>> and output as character-oriented, not line-oriented.
>>
>> Definitely.
>>
>>> awk processing is line-oriented.
>>
>> Unless you use GNU Awk with a "stream processing" setup.
>
> I don't see anything about that in the gawk manual. Do you have a
> reference?
>
> For myself, if a problem isn't strictly line-oriented I generally treat
> that as a sign that I should use something other than awk, but I'm sure
> it's more powerful than what I'm aware of.

You can write something like your C solution using gawk if you set
RS="().". The ()s are needed to make RS longer than one character, but
having done that, every character is then available, one by one, in RT.

Something like (IIRC)

awk 'BEGIN{RS="()."} p!="\r" || RT!="\n" {printf RT} {p=RT}'

That's how I'd interpret doing "stream processing" in GAWK.
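
For comparison, a line-oriented sketch that should also work in awks
without RT. This is an illustration under assumptions, not the command
above: the function name crlf_to_cr and the sample input are made up,
and a final line that lacks a terminating LF would get one added.

```shell
# Drop the LF only on lines that end in CR; keep lone LFs as they are.
# awk splits on LF by default, so a CRLF line arrives with a trailing CR.
crlf_to_cr() {
    awk '{ if ($0 ~ /\r$/) printf "%s", $0; else print }'
}

# Made-up sample: LF, "Line 1" CRLF, "Line 2" CRLF, LF, "Line 3" CRLF
printf '\nLine 1\r\nLine 2\r\n\nLine 3\r\n' | crlf_to_cr | od -c
```

Piping through od -c makes the remaining CR and LF characters visible.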

--
Ben.

Ed Morton

unread,
Jul 12, 2021, 11:25:21 AM7/12/21
to
On 7/12/2021 9:25 AM, Janis Papanagnou wrote:
<snip>
> The "^$" is very appealing because you can use it for binary files as
> well (although I'm using awk as _text_-processor). There's a few things
> I don't like much with it, though. Similar to "\0" it is non-portable.
> It's more cryptic, no one seems to be perfectly sure how it works

I don't know why it wouldn't just work like any other RS. gawk looks for
where a string matching that regexp occurs in the file and uses that to
identify the end of a record. If no such string exists in the file the
whole file is stored in $0. Since "^" means "start-of-string" and "$"
means "end-of-string" the only possible way that "^$" could match a
string in a file would be if there was nothing between the start and end
of the file, i.e. the file is empty. Using RS='^$' for any file is no
different than using RS='foo' when foo doesn't exist in the file.
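
A quick way to see this, as a sketch (here 'foo' is simply assumed not
to occur in the made-up sample input):

```shell
# RS='foo' never matches in this input, so awk sees one record
# holding the entire input, newlines included.
nr=$(printf 'Line 1\nLine 2\n' | awk -v RS='foo' 'END { print NR }')
echo "records: $nr"
```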

Ed.

Kenny McCormack

unread,
Jul 12, 2021, 11:50:17 AM7/12/21
to
In article <schmss$lpb$1...@dont-email.me>,
This is not your "stopped clock" moment, I'm afraid.

--
The randomly chosen signature file that would have appeared here is more than 4
lines long. As such, it violates one or more Usenet RFCs. In order to remain
in compliance with said RFCs, the actual sig can be found at the following URL:
http://user.xmission.com/~gazelle/Sigs/Security

Janis Papanagnou

unread,
Jul 12, 2021, 4:49:26 PM7/12/21
to
On 12.07.2021 17:25, Ed Morton wrote:
> On 7/12/2021 9:25 AM, Janis Papanagnou wrote:
>> It's more cryptic, no one seems to be perfectly sure how it works
>
> I don't know why it wouldn't just work like any other RS.

The responses in that other thread left a different impression to me,
that it wasn't obvious or as clear as it should be.

> gawk looks for
> where a string matching that regexp occurs in the file and uses that to
> identify the end of a record. If no such string exists in the file the
> whole file is stored in $0.

If all we want is a pattern that is principally non-existing wouldn't
it be clearer to use something like "$^" (i.e. "^$" reversed), which
is a meta-character sequence that does obviously not make any sense.

A meta-character sequence that may make sense, and that makes you
think about what constitutes a string in a context where we are just
in the process of defining the parsing unit by setting RS, opens room
for confusion. YMMV.

It seems to me that an explanation of "$^" (one that cannot exist per
definition of the meaning of '^' and '$') is clearer than the version
that requires a more verbose explanation like the one here (that had
in this or similar form also already been given 8 years ago, IIRC):

> Since "^" means "start-of-string" and "$"
> means "end-of-string" the only possible way that "^$" could match a
> string in a file would be if there was nothing between the start and end
> of the file, i.e. the file is empty. Using RS='^$' for any file is no
> different than using RS='foo' when foo doesn't exist in the file.
>
> Ed.

Janis

Ed Morton

unread,
Jul 12, 2021, 4:52:15 PM7/12/21
to
On 7/12/2021 9:25 AM, Janis Papanagnou wrote:
> Hi Ed,
>
> since you killfiled a prominent member you obviously haven't seen all
> replies.

The only person I have killfiled is Kenny and that's well-deserved. If
he's yammering away somewhere in this thread I'm sure that, as always,
it'll be wrong and/or redundant and/or infantile name-calling so I know
I'm not missing anything there.

Actually I may still have that other well-known netkook Alan Conor aka
Tom Newton killfiled too but I haven't seen him post anything in years
so I doubt if he's who you're referring to.

Ed.

Ed Morton

unread,
Jul 12, 2021, 4:59:54 PM7/12/21
to
On 7/12/2021 3:49 PM, Janis Papanagnou wrote:
> On 12.07.2021 17:25, Ed Morton wrote:
>> On 7/12/2021 9:25 AM, Janis Papanagnou wrote:
>>> It's more cryptic, no one seems to be perfectly sure how it works
>>
>> I don't know why it wouldn't just work like any other RS.
>
> The responses in that other thread left a different impression to me,
> that it wasn't obvious or as clear as it should be.

I don't recall the previous conversation on it you're referring to but
it seems very clear and simple to me.

>> gawk looks for
>> where a string matching that regexp occurs in the file and uses that to
>> identify the end of a record. If no such string exists in the file the
>> whole file is stored in $0.
>
> If all we want is a pattern that is principally non-existing wouldn't
> it be clearer to use something like "$^" (i.e. "^$" reversed), which
> is a meta-character sequence that does obviously not make any sense.

`$^` means the literal chars `$` then `^` since `$` is only an anchor
metachar at the end of a regexp or subexpression and `^` only at the
beginning. Look:

$ echo 'a$^b' | grep '$^'
a$^b

See
https://pubs.opengroup.org/onlinepubs/9699919799/basedefs/V1_chap09.html#tag_09_03_08
for more info.

Regards,

Ed.

Janis Papanagnou

unread,
Jul 12, 2021, 5:05:18 PM7/12/21
to
On 12.07.2021 22:52, Ed Morton wrote:
> On 7/12/2021 9:25 AM, Janis Papanagnou wrote:
>> Hi Ed,
>>
>> since you killfiled a prominent member you obviously haven't seen all
>> replies.
>
> The only person I have killfiled is Kenny and [...]

And, IIRC, he's the one that suggested to use "^$".
(Or was it that he suggested to not use "\0"?)
Anyway...

William Unruh

unread,
Jul 12, 2021, 5:06:25 PM7/12/21
to
On 2021-07-12, Janis Papanagnou <janis_pa...@hotmail.com> wrote:
> On 12.07.2021 17:25, Ed Morton wrote:
>> On 7/12/2021 9:25 AM, Janis Papanagnou wrote:
>>> It's more cryptic, no one seems to be perfectly sure how it works
>>
>> I don't know why it wouldn't just work like any other RS.
>
> The responses in that other thread left a different impression to me,
> that it wasn't obvious or as clear as it should be.
>
>> gawk looks for
>> where a string matching that regexp occurs in the file and uses that to
>> identify the end of a record. If no such string exists in the file the
>> whole file is stored in $0.
>
> If all we want is a pattern that is principally non-existing wouldn't
> it be clearer to use something like "$^" (i.e. "^$" reversed), which
> is a meta-character sequence that does obviously not make any sense.

Isn't $^ something that occurs at the end of every line? (End of this
line, beginning of the next)

Actually, I also have problems with ^$ since that would seem to mean an
empty line, which is certainly possible (LFLF) would seem to have an
empty line in it. But clearly I would have to know EXACTLY how awk
determines the start and end of a line.

Janis Papanagnou

unread,
Jul 12, 2021, 5:10:40 PM7/12/21
to
We were talking about GNU Awk, weren't we?

Consequently I have tested GNU Awk:

$ echo $'a$^b\na$^b' | awk 'BEGIN{RS="$^"}{print NR, $0}'
1 a$^b
a$^b


Janis

Janis Papanagnou

unread,
Jul 12, 2021, 5:15:13 PM7/12/21
to
On 12.07.2021 23:06, William Unruh wrote:
> On 2021-07-12, Janis Papanagnou <janis_pa...@hotmail.com> wrote:
>> On 12.07.2021 17:25, Ed Morton wrote:
>>> On 7/12/2021 9:25 AM, Janis Papanagnou wrote:
>>>> It's more cryptic, no one seems to be perfectly sure how it works
>>>
>>> I don't know why it wouldn't just work like any other RS.
>>
>> The responses in that other thread left a different impression to me,
>> that it wasn't obvious or as clear as it should be.
>>
>>> gawk looks for
>>> where a string matching that regexp occurs in the file and uses that to
>>> identify the end of a record. If no such string exists in the file the
>>> whole file is stored in $0.
>>
>> If all we want is a pattern that is principally non-existing wouldn't
>> it be clearer to use something like "$^" (i.e. "^$" reversed), which
>> is a meta-character sequence that does obviously not make any sense.
>
> Isn't $^ something that occurs at the end of every line? (End of this
> line, beginning of the next)

Well, on close look it may have a similar interpretation problem (but
probably to a lesser degree?).

>
> Actually, I also have problems with ^$ since that would seem to mean an
> empty line, which is certainly possible (LFLF) would seem to have an
> empty line in it. But clearly I would have to know EXACTLY how awk
> determines the start and end of a line.

I think it boils down to have a good documentation and/or guidance in
the manual. (Then these threads could be significantly shortened. ;-)

Janis

Spiros Bousbouras

unread,
Jul 12, 2021, 5:23:00 PM7/12/21
to
On Mon, 12 Jul 2021 16:25:30 +0200
Janis Papanagnou <janis_pa...@hotmail.com> wrote:

[On AWK dark corners.]
> The GNU
> Awk manual has the topic incoherently spread across three chapters.
> It would certainly be helpful to have a more coherent picture and a
> guidance or suggestion (with all the caveats, e.g. about what happens
> with a RS="^$" statement in other awks). Performance issues have also
> been addressed by Arnold in the past; I think that the @load option is
> the fastest because it bypasses the regexp processing. Such guidance
> would help the user make an informed choice about what to use when.
>
> Until we have such a "directive" I fear we'll repeat our discussions
> every couple years again, and they seem to not be quickly terminated
> discussions on every re-iteration. ;-)

The only person who can provide authoritative answers is Arnold Robbins. I
think he reads comp.lang.awk but I'm not sure if he reads comp.unix.shell
so you should have crossposted to comp.lang.awk (which I've done). As I'm
typing this, I can see that several more posts have been made in the thread
discussing esoteric issues regarding AWK. It would have served everyone best
if these had also appeared on comp.lang.awk.

Ed Morton

unread,
Jul 12, 2021, 5:55:22 PM7/12/21
to
On 7/12/2021 4:10 PM, Janis Papanagnou wrote:
> On 12.07.2021 22:59, Ed Morton wrote:
>> On 7/12/2021 3:49 PM, Janis Papanagnou wrote:
>>> On 12.07.2021 17:25, Ed Morton wrote:
>>>> On 7/12/2021 9:25 AM, Janis Papanagnou wrote:
>>>>> It's more cryptic, no one seems to be perfectly sure how it works
>>>>
>>>> I don't know why it wouldn't just work like any other RS.
>>>
>>> The responses in that other thread left a different impression to me,
>>> that it wasn't obvious or as clear as it should be.
>>
>> I don't recall the previous conversation on it you're referring to but
>> it seems very clear and simple to me.
>>
>>>> gawk looks for
>>>> where a string matching that regexp occurs in the file and uses that to
>>>> identify the end of a record. If no such string exists in the file the
>>>> whole file is stored in $0.
>>>
>>> If all we want is a pattern that is principally non-existing wouldn't
>>> it be clearer to use something like "$^" (i.e. "^$" reversed), which
>>> is a meta-character sequence that does obviously not make any sense.
>>
>> `$^` means the literal chars `$` then `^` since `$` is only an anchor
>> metachar at the end of a regexp or subexpression and `^` only at the
>> beginning. Look:
>>
>> $ echo 'a$^b' | grep '$^'
>> a$^b
>
> We were talking about GNU Awk, weren't we?

Yeah, but there's no magic about GNU awk's regexp handling. My mistake
was assuming BREs and EREs were the same in regard to anchors.

>
> Consequently I have tested GNU Awk:
>
> $ echo $'a$^b\na$^b' | awk 'BEGIN{RS="$^"}{print NR, $0}'
> 1 a$^b
> a$^b

I stand corrected, apparently that's a difference between BREs and EREs:

$ echo 'a$^b' | grep '$^'
a$^b

$ echo 'a$^b' | grep -E '$^'
$

$ echo 'a$^b' | sed 's/$^/X/'
aXb

$ echo 'a$^b' | sed -E 's/$^/X/'
a$^b

See
https://pubs.opengroup.org/onlinepubs/9699919799/basedefs/V1_chap09.html#tag_09_04_09
which I thought was saying you could use any character before the `^`
and it wouldn't match which was supported by this test:

$ printf 'ax^b\nax^b\n' | awk 'BEGIN{RS="x^"}{print NR, $0}'
1 ax^b
ax^b

but then I can't explain this which is apparently just ignoring the RS
setting:

$ printf 'a.^b\na.^b\n' | awk 'BEGIN{RS=".^"}{print NR, $0}'
1 a.^b
2 a.^b

Regards,

Ed.

Ed Morton

unread,
Jul 12, 2021, 6:06:53 PM7/12/21
to
On 7/12/2021 4:06 PM, William Unruh wrote:
> On 2021-07-12, Janis Papanagnou <janis_pa...@hotmail.com> wrote:
>> On 12.07.2021 17:25, Ed Morton wrote:
>>> On 7/12/2021 9:25 AM, Janis Papanagnou wrote:
>>>> It's more cryptic, no one seems to be perfectly sure how it works
>>>
>>> I don't know why it wouldn't just work like any other RS.
>>
>> The responses in that other thread left a different impression to me,
>> that it wasn't obvious or as clear as it should be.
>>
>>> gawk looks for
>>> where a string matching that regexp occurs in the file and uses that to
>>> identify the end of a record. If no such string exists in the file the
>>> whole file is stored in $0.
>>
>> If all we want is a pattern that is principally non-existing wouldn't
>> it be clearer to use something like "$^" (i.e. "^$" reversed), which
>> is a meta-character sequence that does obviously not make any sense.
>
> Isn't $^ something that occurs at the end of every line? (End of this
> line, beginning of the next)

You're mixing up strings and lines and records. `^` means start of a
string and `$` means the end of string.

In a line-oriented tool like grep or sed (without the GNU -z option and
without using a hold space), the string in question is the line that was
just read into memory and so `^` and `$` can be used to find the
start/end of the line because the line in question is the whole string.

In a record-oriented tool like awk, when used with the default RS of
`\n`, the `^` and `$` can be used the same way as in sed or grep, but
when used with a different RS the record can contain newlines so the `^`
and `$` do not match the start and end of lines any more, they match the
start and end of the record. If you use an RS that can't exist in the
input then the whole input file is the record and so `^` matches the
start of the input file while `$` matches the end of the input file.

So no, `$^` does not occur anywhere in any input if we're assuming `$`
and `^` to be anchor metachars in that expression (which they apparently
are when using an ERE such as awk uses).

>
> Actually, I also have problems with ^$ since that would seem to mean an
> empty line,

No, it's an empty string which could be a line given some RS values or
it could be an empty record that's part of a file given other RS values
or it could be an empty file given yet other RS values.

> which is certainly possible (LFLF) would seem to have an
> empty line in it. But clearly I would have to know EXACTLY how awk
> determines the start and end of a line.

A line starts with the character following `^` or the previous `\n`. A
line ends with the character before `$` or the next `\n`.

Ed.

Ed Morton

unread,
Jul 12, 2021, 6:45:45 PM7/12/21
to
Here are some examples showing what's matched between `^` and `$` when the
string in memory (the current awk record) is a line, a multi-line
paragraph, and a whole file, all based on the value of `RS`:

The sample input (courtesy of Robert Burns "Tam O'Shanter" written/set
near my home town):

$ cat file
Ah, gentle dames! it gars me greet,
To think how mony counsels sweet,

How mony lengthen'd, sage advices,
The husband frae the wife despises!

Read 1 line at a time:

$ awk 'match($0,/^.*$/) { print "<" substr($0,RSTART,RLENGTH) ">" }' file
<Ah, gentle dames! it gars me greet,>
<To think how mony counsels sweet,>
<>
<How mony lengthen'd, sage advices,>
<The husband frae the wife despises!>

Read 1 paragraph at a time:

$ awk -v RS='' 'match($0,/^.*$/) { print "<" substr($0,RSTART,RLENGTH) ">" }' file
<Ah, gentle dames! it gars me greet,
To think how mony counsels sweet,>
<How mony lengthen'd, sage advices,
The husband frae the wife despises!>

Read the whole file at once (`RS='^$'` could be `RS='anything nonexistent'`):

$ awk -v RS='^$' 'match($0,/^.*$/) { print "<" substr($0,RSTART,RLENGTH) ">" }' file
<Ah, gentle dames! it gars me greet,
To think how mony counsels sweet,

How mony lengthen'd, sage advices,
The husband frae the wife despises!
>

As you can see `^` and `$` always match the start and end of the record
that awk is currently processing, whether that's a line, or a paragraph,
or a whole file. The only time when `^` and `$` also identify the
start/end of a line is when the whole record is a single line.

To find lines in the multi-line paragraph case, you need to test for
`^|\n` at the start of the lines (`^` to find the start of the first
line, `\n` to find subsequent) and/or `\n|$` at the end of the lines
(`$` to find the end of the last line, `\n` to find previous), depending
on what you want to do with it, and you also need to account for the
fact that if you find `\n` it'll be part of the matching string, unlike
`^` or `$`, e.g.

$ awk -v RS='' 'match($0,/^[^\n]*/) { print "<" substr($0,RSTART,RLENGTH) ">" }' file
<Ah, gentle dames! it gars me greet,>
<How mony lengthen'd, sage advices,>

$ awk -v RS='' 'match($0,/\n[^\n]*/) { print "<" substr($0,RSTART+1,RLENGTH-1) ">" }' file
<To think how mony counsels sweet,>
<The husband frae the wife despises!>
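
If all lines of a paragraph are wanted rather than just the first or
the second, one option is to split the record on `\n`. A minimal sketch
with made-up input (the array name `line` is arbitrary):

```shell
# Paragraph mode (RS=''): each record is one blank-line-separated block.
out=$(printf 'a1\na2\n\nb1\nb2\n' |
    awk -v RS='' '{
        n = split($0, line, "\n")          # one array element per line
        for (i = 1; i <= n; i++)
            print NR "." i ": <" line[i] ">"
    }')
printf '%s\n' "$out"
```

This avoids having to handle the `\n` that becomes part of the match in
the `match()` approach, at the cost of holding the lines in an array.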

Regards,

Ed.

Janis Papanagnou

unread,
Jul 13, 2021, 2:05:51 AM7/13/21
to
$ printf 'a.^b\na.^b\n' | awk 'BEGIN{RS="[.]^"}{print NR, $0}'
1 a.^b
a.^b

It seems ".^" had not been "ignored" as RS but interpreted as string?

Janis

>
> Regards,
>
> Ed.

Janis Papanagnou

unread,
Jul 13, 2021, 2:39:25 AM7/13/21
to
...which still wouldn't explain the outcome of your test case, though.

Hmm..

> Janis
>
>>
>> Regards,
>>
>> Ed.
>

Ed Morton

unread,
Jul 13, 2021, 9:16:27 AM7/13/21
to
I talked to Arnold and he thinks it's a bug which he's investigating.
See https://lists.gnu.org/archive/html/bug-gawk/2021-07/msg00026.html.

Ed.
