
Reading the whole file in one go in GAWK


Kenny McCormack

Jun 14, 2013, 8:05:20 AM
I have the need to read in the whole file as a single record in a GAWK
program. I seem to remember reading recently that there was some special
feature (recently added, sometime in the 4.x series) that supported this.
However, I could not find it in the doc (not that I searched exhaustively).

Anyway, it looks like one way is to set RS="zzzzzzzzzzzzzzzzzzzzzzzzzzzzz",
i.e., something that never occurs in your text, but is there a more
"systemic" way to do this?

Note that this is *not* the same as setting RS="", since that has a defined
meaning, that has to do with multiple blank lines.
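
To make that distinction concrete, here is a minimal sketch on made-up sample input. With RS="" awk goes into paragraph mode, so blocks separated by blank lines arrive as separate records rather than as one record for the whole file:

```shell
# Two "paragraphs" separated by a blank line (illustrative data only).
records=$(printf 'a\nb\n\nc\nd\n' | awk 'BEGIN { RS = "" } END { print NR }')
echo "$records"   # 2
```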

--
Here's a simple test for Fox viewers:

1) Sit back, close your eyes, and think (Yes, I know that's hard for you).
2) Think about and imagine all of your ridiculous fantasies about Barack Obama.
3) Now, imagine that he is white. Cogitate on how absurd your fantasies
seem now.

See? That wasn't hard, was it?

Janis Papanagnou

Jun 14, 2013, 8:22:52 AM
On 14.06.2013 14:05, Kenny McCormack wrote:
> I have the need to read in the whole file as a single record in a GAWK
> program. I seem to remember reading recently that there was some special
> feature (recently added, sometime in the 4.x series) that supported this.
> However, I could not find it in the doc (not that I searched exhaustively).
>
> Anyway, it looks like one way is to set RS="zzzzzzzzzzzzzzzzzzzzzzzzzzzzz",
> i.e., something that never occurs in your text, but is there a more
> "systemic" way to do this?
>
> Note that this is *not* the same as setting RS="", since that has a defined
> meaning, that has to do with multiple blank lines.

RS="\0" is what you are looking for.

In gawk RS="\0" is different from RS="".

Janis

Kenny McCormack

Jun 14, 2013, 9:21:14 AM
In article <kpf1ur$2m1$1...@speranza.aioe.org>,
Janis Papanagnou <janis_pa...@hotmail.com> wrote:
...
>RS="\0" is what you are looking for.
>
>In gawk RS="\0" is different from RS="".

I believe this is incorrect. Setting RS to the null character does exactly
that. I use this construct to parse files in /proc, which are null byte
delimited.

Try this (on a Linux system):

gawk 'BEGIN {RS="\0"} {print NR,$0}' /proc/self/environ

P.S. Of course, in real life, setting it to "\0" will usually work - for
the same reason that setting to "zzzzzzzzzzzzzzzzzzzzzzz" works.

--
Modern Conservative: Someone who can take time out
from demanding more flag burning laws, more abortion
laws, more drug laws, more obscenity laws, and more
police authority to make warrantless arrests to remind
us that we need to "get the government off our backs".

Janis Papanagnou

Jun 14, 2013, 9:38:50 AM
On 14.06.2013 15:21, Kenny McCormack wrote:
> In article <kpf1ur$2m1$1...@speranza.aioe.org>,
> Janis Papanagnou <janis_pa...@hotmail.com> wrote:
> ...
>> RS="\0" is what you are looking for.
>>
>> In gawk RS="\0" is different from RS="".
>
> I believe this is incorrect. Setting RS to the null character does exactly
> that.

I am not sure what you are saying here. I've done that a couple times
on Linux and Cygwin. (Details on page 56 of the GAWK manual - page
refers to the May 2013 version; RS="\0" was there in earlier versions
as well; at least in 3.1, IIRC.)

> I use this construct to parse files in /proc, which are null byte
> delimited.

If you parse "\0"-delimited data you have another problem; one with
the data. If I use "\0" or "zzz...z", either way the line is aborted
after the string terminator, isn't it?

$ echo $'a\0b\0c\0d\0' | awk 'BEGIN{RS="zzz"}{print NR,"=="$0"=="}' | od -c
0000000 1 = = a \n = = \n
0000011

Back at home I'll check behaviour of $'a\0b\0c\0d\0' on Linux as well.

>
> Try this (on a Linux system):
>
> gawk 'BEGIN {RS="\0"} {print NR,$0}' /proc/self/environ
>
> P.S. Of course, in real life, setting it to "\0" will usually work - for
> the same reason that setting to "zzzzzzzzzzzzzzzzzzzzzzz" works.

You had been asking: "but is there a more "systemic" way to do this?".
Using "\0" instead of "zzzzz...z" is a more systematic way in GNU awk.

Janis

Kenny McCormack

Jun 15, 2013, 9:06:12 AM
In article <kpf6d8$fq6$1...@speranza.aioe.org>,
Janis Papanagnou <janis_pa...@hotmail.com> wrote:
>On 14.06.2013 15:21, Kenny McCormack wrote:
>> In article <kpf1ur$2m1$1...@speranza.aioe.org>,
>> Janis Papanagnou <janis_pa...@hotmail.com> wrote:
>> ...
>>> RS="\0" is what you are looking for.
>>>
>>> In gawk RS="\0" is different from RS="".
>>
>> I believe this is incorrect. Setting RS to the null character does exactly
>> that.
>
>I am not sure what you are saying here. I've done that a couple times
>on Linux and Cygwin. (Details on page 56 of the GAWK manual - page
>refers to the May 2013 version; RS="\0" was there in earlier versions
>as well; at least in 3.1, IIRC.)

Sorry for the somewhat elliptical phrasing. What I meant by "does exactly
that" is that it does, in fact, set the record separator to the null
character and does, in fact, parse your input using the null character as
the record separator. I.e., there's nothing "magical" about setting the
record separator to the null character. It's not like GAWK sees that
RS == "\0" and says "Aha! The user has specified a magic value that tells me
to read the entire file into memory as a single record." Effectively,
setting RS="\0" is logically identical to setting it to "zzzzzzzzzzzz".

The example code that I gave, parsing the Linux /proc/self/environ file was
intended to demonstrate that there do exist real world examples of files
that use the null character as a record delimiter and thus that setting
RS="\0" does have real world usefulness. I was entirely serious when I
said that I frequently do just that - and by "just that", I mean parsing
the Linux "environ" files using GAWK (setting RS="\0").

Finally, I did find and read the section you alluded to above, in the big
GAWK PDF file. Interesting reading, although it, too, is a bit elliptical.
It does state that the only way to do this (read the entire file) is to set
RS to a value known not to occur in your data - and that this is thus,
theoretically impossible to do in complete generality. This, in turn,
suggests that maybe there *should* be some magical sentinel value or some
flag or something you could set - to get this functionality. My (obviously
flawed) memory was that there *was* some magic function to do this -
something like "FileGet" or some such - but it seems I've gotten my
languages confused with each other. The tribulations of using so many of
them on a daily basis...

Some things that I found a bit weird about the phrasing of that section of
the GAWK PDF file:

1) It says "You might think ..." (implying that you'd be wrong to do so)
and then a line or two later essentially says "and you'd be right" (at
least for GAWK, although it might not work in other AWKs - to which I say
"But does anyone really care about other AWKs at this point?")

2) Then it says "All other AWKs ...", to which I say "All generalizations
are false." Not to sound like a broken record, but TAWK is fine with
RS="\0" and does, in fact, store strings internally "Pascal style", not "C
style". In fact, if I were to hazard a guess, I'd say that most (if not
all - heh heh) "modern" AWKs (i.e., the ones we care about) have made this
step into the modern world. FWIW, besides GAWK & TAWK, there's MAWK and
there's "BWK's One True AWK". Not sure if there are any others...

3) The section concludes by saying that the "best" way to do this is
not to futz with RS at all, but rather to read the file in the normal, line
at a time mode, concatenating it together - and then, presumably,
processing the result in the "END" block. Now, while I agree to some extent
with the underlying spirit of this advice, and I myself often give newbies
the advice to not futz with the built-in variables (FS, RS, etc) and just
"do it in code", the fact remains that doing so often breaks AWK's
"pattern-action" model. I.e., one of the basic problems with AWK is that
its nifty and oh-so-kewl "pattern-action" model is fragile and often
becomes unusable if your input isn't amenable to its use. This all argues
for a built-in way to read the whole file as a single record (i.e., one
that doesn't depend on setting RS to something you just hope won't be found
in your data).

Finally, note that where this all comes from is that I was writing a
program using the new [*] FPAT functionality in GAWK, where my FPAT
matching records span multiple lines (and include the newlines). Having
read the recommendation (in the quoted PDF file) that "the best way ...", I
suppose I could have built it all up and then used "patsplit()" in the
"END" block, but that's just so ugly. As I say, it breaks AWK's lovely
pattern-action model. Incidentally, I did change my program to do RS="\0"
instead of all the Zs, but I still think there should be a "more systemic"
way to do it.

[*] New in GAWK (as of 4.x, I think), but of course, TAWK has had it for
years...

--

First of all, I do not appreciate your playing stupid here at all.

- Thomas 'PointedEars' Lahn -

pk

Jun 15, 2013, 9:11:11 AM
On Sat, 15 Jun 2013 13:06:12 +0000 (UTC), gaz...@shell.xmission.com (Kenny
McCormack) wrote:

> Finally, I did find and read the section you alluded to above, in the big
> GAWK PDF file. Interesting reading, although it, too, is a bit
> elliptical. It does state that the only way to do this (read the entire
> file) is to set RS to a value known not to occur in your data - and that
> this is thus, theoretically impossible to do in complete generality.

So what's wrong with:

awk '{ data = data sep $0; sep = RS } END { do whatever with tot }'

it's not really "in one go", but the end result is the same.
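
A runnable sketch of that idea on made-up input, with one variable name used consistently throughout. The names "data", "sep" and "lines" are only illustrative:

```shell
# Rebuild the input in one variable, then work on it in END.
# sep is empty for the first record, so no separator is prepended.
result=$(printf 'one\ntwo\nthree\n' | awk '
    { data = data sep $0; sep = RS }
    END {
        n = split(data, lines, "\n")   # the text is whole again here
        print n, length(data)
    }')
echo "$result"   # 3 13
```

This should run in any awk, since it never changes RS.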

Kenny McCormack

Jun 15, 2013, 9:19:21 AM
If you had read my entire post (yes, I know it was long), you'd see that I
address this point. The fact is, yes, you can do it that way, but it is
UGLY.

P.S. I won't bother addressing all the bugs and typos in the above code
snippet. Unlike the denizens of comp.lang.c, I know perfectly well that it
was intended as conversational only, not nit-pickedly correct, usable code.

--
Faced with the choice between changing one's mind and proving that there is
no need to do so, almost everyone gets busy on the proof.

- John Kenneth Galbraith -

Aharon Robbins

Jun 15, 2013, 3:00:32 PM
In article <kpf0u0$q7c$1...@news.xmission.com>,
Kenny McCormack <gaz...@shell.xmission.com> wrote:
>I have the need to read in the whole file as a single record in a GAWK
>program. I seem to remember reading recently that there was some special
>feature (recently added, sometime in the 4.x series) that supported this.
>However, I could not find it in the doc (not that I searched exhaustively).
>
>Anyway, it looks like one way is to set RS="zzzzzzzzzzzzzzzzzzzzzzzzzzzzz",
>i.e., something that never occurs in your text, but is there a more
>"systemic" way to do this?
>
>Note that this is *not* the same as setting RS="", since that has a defined
>meaning, that has to do with multiple blank lines.

A nice solution given by Denis Shirokov is

RS = "^$"

From 4.1, you can use

@load "readfile"

to get the readfile extension and then do

string = readfile("/some/path")

HTH,

Arnold
--
Aharon (Arnold) Robbins arnold AT skeeve DOT com
P.O. Box 354 Home Phone: +972 8 979-0381
Nof Ayalon
D.N. Shimshon 9978500 ISRAEL

Kenny McCormack

Jun 15, 2013, 3:09:07 PM
In article <kpidkg$45k$1...@dont-email.me>,
Aharon Robbins <arn...@skeeve.com> wrote:
...
>A nice solution given by Denis Shirokov is
>
> RS = "^$"

Interesting. I'll have to think about that one...

>From 4.1, you can use
>
> @load "readfile"
>
>to get the readfile extension and then do
>
> string = readfile("/some/path")

That looks kewl, but, unless I'm missing something, looks like it suffers
the same issue that I raise in my previous (long) post - that it breaks the
nifty AWK "pattern action" model. I.e., once you've got "string", how do
you run the pattern-actions against it?

--
The motto of the GOP "base": You can't *be* a billionaire, but at least you
can vote like one.

Joe User

Jun 15, 2013, 6:21:03 PM
On Sat, 15 Jun 2013 19:00:32 +0000, Aharon Robbins wrote:

> A nice solution given by Denis Shirokov is
>
> RS = "^$"
>
> From 4.1, you can use
>
> @load "readfile"
>
> to get the readfile extension and then do
>
> string = readfile("/some/path")

I was unable to find readfile in ftp.armory.com. Whereabouts on the web
can I find these include files?

Thanks.




--
The smart way to keep people passive and obedient
is to strictly limit the spectrum of acceptable
opinion, but allow very lively debate within that
spectrum.

-- Noam Chomsky

Joe User

Jun 15, 2013, 6:33:00 PM
On Sat, 15 Jun 2013 19:09:07 +0000, Kenny McCormack wrote:

> That looks kewl, but, unless I'm missing something, looks like it
> suffers the same issue that I raise in my previous (long) post - that it
> breaks the nifty AWK "pattern action" model. I.e., once you've got
> "string", how do you run the pattern-actions against it?

Any match would match the entire file. This seems to work:

gawk -v RS='^$' '/./ {print FILENAME ": " length($0)}' *.txt


--
In his own soul a man bears the source
from which he draws all his sorrows and his joys.

-- Sophocles

Aharon Robbins

Jun 16, 2013, 1:12:39 PM
In article <fv5v8a-...@c100.static-216-228-92-121.apid.com>,
Joe User <ax...@yahoo.com> wrote:
>On Sat, 15 Jun 2013 19:00:32 +0000, Aharon Robbins wrote:
>
>> A nice solution given by Denis Shirokov is
>>
>> RS = "^$"
>>
>> From 4.1, you can use
>>
>> @load "readfile"
>>
>> to get the readfile extension and then do
>>
>> string = readfile("/some/path")
>
>I was unable to find readfile in ftp.armory.com. Whereabouts on the web
>can I find these include files?

In the gawk 4.1 distribution.

Aharon Robbins

Jun 16, 2013, 1:14:01 PM
In article <kpie4j$p3s$1...@news.xmission.com>,
Kenny McCormack <gaz...@shell.xmission.com> wrote:
>In article <kpidkg$45k$1...@dont-email.me>,
>Aharon Robbins <arn...@skeeve.com> wrote:
>...
>>A nice solution given by Denis Shirokov is
>>
>> RS = "^$"
>
>Interesting. I'll have to think about that one...
>
>>From 4.1, you can use
>>
>> @load "readfile"
>>
>>to get the readfile extension and then do
>>
>> string = readfile("/some/path")
>
>That looks kewl, but, unless I'm missing something, looks like it suffers
>the same issue that I raise in my previous (long) post - that it breaks the
>nifty AWK "pattern action" model. I.e., once you've got "string", how do
>you run the pattern-actions against it?

You can't. You can use split etc. Looks like the ^$ solution is what
you want.

cj2kxvm

Jun 16, 2013, 5:45:04 PM
Try this:

gawk 'BEGIN {RS=UndefVar} {print NR,$0}' /proc/self/environ

and this is what I tried on my Linux system to see how it works:

/usr/bin/gawk 'BEGIN {ORS=RS=UndefVar} {gsub(/\0/,"\\0");print NR,$0}' /proc/self/environ /proc/self/environ | leafpad

It works with 'GNU Awk 3.1.6'.

Jacques

Joe User

Jun 16, 2013, 6:06:24 PM
On Sun, 16 Jun 2013 17:12:39 +0000, Aharon Robbins wrote:

> In article <fv5v8a-...@c100.static-216-228-92-121.apid.com>, Joe
> User <ax...@yahoo.com> wrote:
>>On Sat, 15 Jun 2013 19:00:32 +0000, Aharon Robbins wrote:
>>
>>> A nice solution given by Denis Shirokov is
>>>
>>> RS = "^$"
>>>
>>> From 4.1, you can use
>>>
>>> @load "readfile"
>>>
>>> to get the readfile extension and then do
>>>
>>> string = readfile("/some/path")
>>
>>I was unable to find readfile in ftp.armory.com. Whereabouts on the web
>>can I find these include files?
>
> In the gawk 4.1 distribution.

Oh. I thought readfile() was just an awk function. Thanks.

--
An undertaking of great advantage,
but nobody to know what it is.

-- Business description used to find
speculators during a long-ago
stock mania

Kenny McCormack

Jun 17, 2013, 8:01:20 AM
In article <kpkrop$46v$2...@dont-email.me>,
Aharon Robbins <arn...@skeeve.com> wrote:
...
>>I.e., once you've got "string", how do
>>you run the pattern-actions against it?
>
>You can't. You can use split etc.

Indeed. The question was basically rhetorical.
(As in, "If I want C, I know where to find it...")

>Looks like the ^$ solution is what you want.

Yep.

--

Some of the more common characteristics of Asperger syndrome include:

* Inability to think in abstract ways (eg: puns, jokes, sarcasm, etc)
* Difficulties in empathising with others
* Problems with understanding another person's point of view
* Hampered conversational ability
* Problems with controlling feelings such as anger, depression
and anxiety
* Adherence to routines and schedules, and stress if expected routine
is disrupted
* Inability to manage appropriate social conduct
* Delayed understanding of sexual codes of conduct
* A narrow field of interests. For example a person with Asperger
syndrome may focus on learning all there is to know about
baseball statistics, politics or television shows.
* Anger and aggression when things do not happen as they want
* Sensitivity to criticism
* Eccentricity
* Behaviour varies from mildly unusual to quite aggressive
and difficult

Brian Donnell

Jun 17, 2013, 2:29:28 PM
I'm a hobbyist using The One True Awk, version 20070501 on OS X 10.6.8.
TOTA accepts only a single char for RS. Following a suggestion by Ed Morton
in the 8-27-10 thread 'Question about: RS = "\0"', I tried to read in a whole
file by setting RS=SUBSEP. On the command line, this seems only to set RS =
capital S; but when set in the BEGIN block, it seems to read in the whole file,
even if the file contains "\034", the default value of SUBSEP given in _The Awk
Programming Language_. I suppose \034 is an octal value. (Help me here--I'm
way out of my depth)--is it also some kind of file separator value in shell?

Best wishes, Brian Donnell

Janis Papanagnou

Jun 17, 2013, 4:21:53 PM
If you set awk -v RS=SUBSEP the awk variable RS will get the string
"SUBSEP" assigned; it's the same as if you'd do awk 'BEGIN {RS="SUBSEP"}' .
It's different if you write awk 'BEGIN {RS=SUBSEP}' , i.e. without quotes,
where you defined awk variable RS by the awk variable SUBSEP, not by the
string literal "SUBSEP".

Using SUBSEP (the variable, not the string) as RS in a GNU awk program
will split the input at each \034 byte, as expected.
If your awk behaves differently I say it is broken.
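
The difference between the two spellings can be checked directly. This is a minimal sketch that should behave the same in any awk; the shell variable names are only illustrative:

```shell
# On the command line, -v RS=SUBSEP assigns the six-character string "SUBSEP".
from_cli=$(awk -v RS=SUBSEP 'BEGIN { print RS }')
# Inside the program, SUBSEP is a variable holding one byte (octal 034 by default).
from_begin=$(awk 'BEGIN { RS = SUBSEP; print length(RS) }')
echo "$from_cli $from_begin"   # SUBSEP 1
```

An awk that honours only the first character of RS would then split on "S" in the command-line case, which matches what you observed.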

Janis

>
> Best wishes, Brian Donnell
>

Digi

Jul 1, 2013, 9:48:08 PM
On Friday, June 14, 2013 at 15:05:20 UTC+3, Kenny McCormack wrote:
> I have the need to read in the whole file as a single record in a GAWK
> program. I seem to remember reading recently that there was some special
> feature (recently added, sometime in the 4.x series) that supported this.
> However, I could not find it in the doc (not that I searched exhaustively).
>
> Anyway, it looks like one way is to set RS="zzzzzzzzzzzzzzzzzzzzzzzzzzzzz",
> i.e., something that never occurs in your text, but is there a more
> "systemic" way to do this?

You can set the global variable `RS' to a regular expression that matches any data of length greater than zero. Then perform a single `getline' and the whole file's data will be available in the variable `RT'.
That is, the separator is any data with length > 0, and the separator text (which is actually the whole file's data) ends up in the global variable `RT'.
If the file has zero length then of course it will not match the regexp in RS, but in that case the separator will be an empty string as well.
Either way, the global variable `RT' can be used to read the whole file's data.

I found that the following has the best performance:

RS=".{1,}"
if ( getline a < file ) { # file data is in `RT'
}

but of course the fastest and best way is simply to use the dynamic extension `readfile'

D

Digi

Jul 2, 2013, 10:21:04 PM
On Tuesday, July 2, 2013 at 4:48:08 UTC+3, Digi wrote:
I think that

Yes, in that case, why not? Maybe, for the type of data you want to read, you can define an RS combination that is certain never to occur:

RS="SIWEl LLORAc"

I don't believe this combination (just as an example) will ever cause trouble.

Also, if you are running gawk in byte mode, you can use [^\x00-\xFF]

Håkon Hægland

Jan 5, 2014, 5:11:53 AM
What is now the "official" way of doing this?

a) RS="zzzzzzzzzzzzzzzzzzzzzzzzzzzzz"
b) RS="\0"
c) RS = "^$"
d) @load "readfile"; BEGIN {str = readfile(ARGV[1])}
e) BEGIN {RS=".{1,}"} { str=RT }
f) RS="[^\x00-\xFF]"



Janis Papanagnou

Jan 5, 2014, 8:48:36 AM
On 05.01.2014 11:11, Håkon Hægland wrote:
> What is now the "official" way of doing this?

[ "this" = Subject: Reading the whole file in one go in GAWK ]

(Not sure what you think qualifies an "official" way.)

>
> a) RS="zzzzzzzzzzzzzzzzzzzzzzzzzzzzz"
> b) RS="\0"
> c) RS = "^$"
> d) @load "readfile"; BEGIN {str = readfile(ARGV[1])}
> e) BEGIN {RS=".{1,}"} { str=RT }
> f) RS="[^\x00-\xFF]"

AFAICT;
a) is the only defined standard way of doing that;
it's clumsy and not universally guaranteed for
any data, though. For text files it's probably
better to have some control character defined
as the first character of RS, say, RS = SUBSEP.
b) is the one that I think is even documented in
the GNU awk manual.
c) of similar quality like b) (but relies on RS
being a regexp, which might be an issue only if
you also consider portability across awks).
d) relies on non-standard features @load/readfile()
that may not be available in older gawk versions.
e) is (for my taste) too obscure and smells like an
[unnecessary] hack. It relies on features that
may not be known to standard awk users and not
be available if using older versions of gawk.
f) I'm not sure whether that works reliably in UTF-8
or other non 8 bit character sets resp. locales,
or whether behaviour would even be affected by
the locale setting.
Most proposals imply non-standard features as well
(e.g. regexps in RS, or use of RT or regexp ranges);
not bad per se in the given gawk context, but in any
case I'd use something that can be easily understood
and that is not too clumsy or complex. So, if gawk
is given, then I'd use the documented one b), but c)
seems reasonable as well.

Janis

Ed Morton

Jan 5, 2014, 1:42:44 PM
On 1/5/2014 7:48 AM, Janis Papanagnou wrote:
> On 05.01.2014 11:11, Håkon Hægland wrote:
>> What is now the "official" way of doing this?
>
> [ "this" = Subject: Reading the whole file in one go in GAWK ]
>
> (Not sure what you think qualifies an "official" way.)
>
>>
>> a) RS="zzzzzzzzzzzzzzzzzzzzzzzzzzzzz"
>> b) RS="\0"
>> c) RS = "^$"
>> d) @load "readfile"; BEGIN {str = readfile(ARGV[1])}
>> e) BEGIN {RS=".{1,}"} { str=RT }
>> f) RS="[^\x00-\xFF]"

g) { rec = rec $0 RS } END {$0 = rec; ... }
h) awk -v rs="the real RS" '{gsub(rs,SUBSEP)}1' file |
awk 'BEGIN{RS=SUBSEP} ...'

>
> AFAICT;
> a) is the only defined standard way of doing that;
> it's clumsy and not universally guaranteed for
> any data, though. For text files it's probably
> better to have some control character defined
> as the first character of RS, say, RS = SUBSEP.

"a" is also gawk-specific. In a POSIX awk only the first char of an RS string is
actually used as the RS.

> b) is the one that I think is even documented in
> the GNU awk manual.

Correct. That's the one I use for that reason.

> c) of similar quality like b) (but relies on RS
> being a regexp, which might be an issue only if
> you also consider portability across awks).

Also gawk-specific due to multiple chars.

> d) relies on non-standard features @load/readfile()
> that may not be available in older gawk versions.

Never seen it before. Can't imagine why I'd use that.

> e) is (for my taste) too obscure and smells like an
> [unnecessary] hack. It relies on features that
> may not be known to standard awk users and not
> be available if using older versions of gawk.

Indeed. gawk-specific and there's better gawk solutions.

> f) I'm not sure whether that works reliably in UTF-8
> or other non 8 bit character sets resp. locales,
> or whether behaviour would even be affected by
> the locale setting.

Again, gawk-specific at best.

> Most proposals imply non-standard features as well
> (e.g. regexps in RS, or use of RT or regexp ranges);
> not bad per se in the given gawk context, but in any
> case I'd use something that can be easily understood
> and that is not too clumsy or complex. So, if gawk
> is given, then I'd use the documented one b), but c)
> seems reasonable as well.

g and h work in any awk.
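
A minimal sketch of "g" on made-up input. Reassigning $0 in END makes awk split the rebuilt text into fields again, so the usual field machinery is available for the whole file:

```shell
fields=$(printf 'a b\nc d\n' | awk '
    { rec = rec $0 RS }              # append every record plus its separator
    END { $0 = rec; print NF }       # reassigning $0 re-splits into fields
')
echo "$fields"   # 4
```

With the default FS, newlines count as field separators too, hence four fields.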

Ed.

>
> Janis
>

Håkon Hægland

Jan 5, 2014, 4:56:18 PM
On Sunday, January 5, 2014 7:42:44 PM UTC+1, Ed Morton wrote:

Thanks for the input to Janis and Ed Morton.

I am currently only using gawk. But yes, I agree that it is nice if the setting is portable across different versions of awk. That does not necessarily make it a good choice if you are only using gawk, though.
So for the moment I narrow the discussion to gawk, and then I wonder whether RS="\0" is an efficient way of reading the whole file into $0. And I guess it will fail for binary files containing "\0" as interior bytes. What I am trying to convey is that the current situation does not seem completely satisfactory. A solution could be, for instance, for the next release of gawk to state explicitly: "If you specify RS="^$", the whole file is read into $0 in the most efficient way." Then at least everybody would know what to expect, and there would be no more discussion about this topic.

About other versions of awk I do not know very much; as I said, I have only used gawk. Portability is currently not my main issue. Rather, I would like to see a documented standard for gawk, so that there is no doubt about what to do when you want to read the whole file in one go in an efficient manner.

Aharon Robbins

Jan 5, 2014, 11:03:32 PM
The manual in the repo already documents RS = "^$". RS = "\0" will be the
same as RS = "" on any other awk.

In article <15b69430-d32e-4d07...@googlegroups.com>,

Janis Papanagnou

Jan 6, 2014, 1:21:00 AM
On 06.01.2014 05:03, Aharon Robbins wrote:
> The manual in the repo already documents RS = "^$". RS = "\0" will be the
> same as RS = "" on any other awk.

You're saying that the behaviour with respect to RS="\0" will change
with the next version of gawk?
That will break existing programs, depending on the actual installed
gawk version. Curious; what thoughts led to that change?
Why not keep the old behaviour in addition to documenting the newly
preferred one?

Janis

Aharon Robbins

Jan 6, 2014, 1:01:30 PM
In article <ladi0c$mo4$1...@news.m-online.net>,
Janis Papanagnou <janis_pa...@hotmail.com> wrote:
>On 06.01.2014 05:03, Aharon Robbins wrote:
>> The manual in the repo already documents RS = "^$". RS = "\0" will be the
>> same as RS = "" on any other awk.
>
>You're saying that the behaviour with respect to RS="\0" will change
>with the next version of gawk?

No. I am saying that an awk program that uses RS = "\0" when run with
gawk will treat the zero byte as the record separator. However, the same
awk program when run with any other awk will act as if the assignment had
been RS = "".

OK? :-)

Arnold

Kenny McCormack

Jan 6, 2014, 1:21:07 PM
In article <laer1q$5d0$1...@dont-email.me>,
Aharon Robbins <arn...@skeeve.com> wrote:
>In article <ladi0c$mo4$1...@news.m-online.net>,
>Janis Papanagnou <janis_pa...@hotmail.com> wrote:
>>On 06.01.2014 05:03, Aharon Robbins wrote:
>>> The manual in the repo already documents RS = "^$". RS = "\0" will be the
>>> same as RS = "" on any other awk.
>>
>>You're saying that the behaviour with respect to RS="\0" will change
>>with the next version of gawk?
>
>No. I am saying that an awk program that uses RS = "\0" when run with
>gawk will treat the zero byte as the record separator.

I'm actually not sure whether Janis really did suffer a language barrier
problem (the multiple ways in which the word "will" is used in English), or
if he is just kidding around. I could see it going either way.

In any event, I think you'd do better not to use that word. The sentence
above works just as well as:

I am saying that an awk program that uses RS = "\0", when run with
gawk, treats the zero byte as the record separator.

>However, the same
>awk program when run with any other awk will act as if the assignment had
>been RS = "".

False. TAWK does the right thing. I wouldn't be surprised if Mawk did,
too. I think that, in future, we should accept that the phrase "any other
awk" is to be interpreted as "Many vendor-supplied AWKs".

>OK? :-)

No worries.

--
Religion is regarded by the common people as true,
by the wise as foolish,
and by the rulers as useful.

(Seneca the Younger, 65 AD)

Aharon Robbins

Jan 6, 2014, 1:57:47 PM
In article <laes6j$11h$1...@news.xmission.com>,
Kenny McCormack <gaz...@shell.xmission.com> wrote:
>>However, the same
>>awk program when run with any other awk will act as if the assignment had
>>been RS = "".
>
>False. TAWK does the right thing.

News to me.

>I wouldn't be surprised if Mawk did,too.

I'm sure it doesn't do the right thing (unless something has changed) since
Mike Brennan told me that mawk uses C strings.

Janis Papanagnou

Jan 6, 2014, 5:41:44 PM
On 06.01.2014 19:01, Aharon Robbins wrote:
> In article <ladi0c$mo4$1...@news.m-online.net>,
> Janis Papanagnou <janis_pa...@hotmail.com> wrote:
>> On 06.01.2014 05:03, Aharon Robbins wrote:
>>> The manual in the repo already documents RS = "^$". RS = "\0" will be the
>>> same as RS = "" on any other awk.
>>
>> You're saying that the behaviour with respect to RS="\0" will change
>> with the next version of gawk?
>
> No. I am saying that an awk program that uses RS = "\0" when run with
> gawk will treat the zero byte as the record separator. However, the same
> awk program when run with any other awk will act as if the assignment had
> been RS = "".
>
> OK? :-)

Umm, partly. But enough to re-inspect an older gawk to get the answer... :-)

I seem to have falsely assumed that gawk behaved differently in the past
versions; that a RS="\0" (as a hard-coded exception) would always cause the
whole file to be read. However, trying a gawk 3.1.6 that I found somewhere on
my system on a ksh93 .sh_history file (whose entries are separated by '\0's)
I became aware that this was actually not the case. A '\0' character had never
been a reliable separator in case of binary files. And WRT text files using
RS=SUBSEP would be even better than the suggested RS="\0" in cases when your
programs shall run on other (and older) awks as well.

Janis

>
> Arnold
>

Kenny McCormack

Jan 6, 2014, 5:55:02 PM
In article <laeubb$qqf$2...@dont-email.me>,
Aharon Robbins <arn...@skeeve.com> wrote:
>In article <laes6j$11h$1...@news.xmission.com>,
>Kenny McCormack <gaz...@shell.xmission.com> wrote:
>>>However, the same
>>>awk program when run with any other awk will act as if the assignment had
>>>been RS = "".
>>
>>False. TAWK does the right thing.
>
>News to me.

Heh. It's always dangerous to make generalizations on Usenet.
Always, some jerk (heh heh) will come along and provide a counter-example.

>>I wouldn't be surprised if Mawk did,too.
>
>I'm sure it doesn't do the right thing (unless something has changed) since
>Mike Brennan told me that mawk uses C strings.

OK - duly noted.

--
Modern Conservative: Someone who can take time out
from demanding more flag burning laws, more abortion
laws, more drug laws, more obscenity laws, and more
police authority to make warrantless arrests to remind
us that we need to "get the government off our backs".

Janis Papanagnou

Jan 6, 2014, 5:56:36 PM
On 06.01.2014 19:21, Kenny McCormack wrote:
> In article <laer1q$5d0$1...@dont-email.me>,
> Aharon Robbins <arn...@skeeve.com> wrote:
>> In article <ladi0c$mo4$1...@news.m-online.net>,
>> Janis Papanagnou <janis_pa...@hotmail.com> wrote:
>>> On 06.01.2014 05:03, Aharon Robbins wrote:
>>>> The manual in the repo already documents RS = "^$". RS = "\0" will be the
>>>> same as RS = "" on any other awk.
>>>
>>> You're saying that the behaviour with respect to RS="\0" will change
>>> with the next version of gawk?
>>
>> No. I am saying that an awk program that uses RS = "\0" when run with
>> gawk will treat the zero byte as the record separator.
>
> I'm actually not sure whether Janis really did suffer a language barrier
> problem (the multiple ways in which the word "will" is used in English), or
> if he is just kidding around. [...]

I wasn't kidding.

(I cannot rule out that the language might be a source of confusion because
I'm not a native speaker. But even in my mother tongue(s) I occasionally
observe that I don't get the intention of a specific formulation from someone,
because it relies on (sometimes disputable, sometimes presumed, sometimes
prejudice-based) context, or is just ambiguous.[*])

Janis

[*] Communication is an (at least) two subject based process. To understand
what I mean, see http://en.wikipedia.org/wiki/Mutual_information, but the
picture on http://de.wikipedia.org/wiki/Transinformation may be more suitable.

b2do...@gmail.com

Jan 28, 2014, 4:06:28 PM
Janis--Since I'm responding on the FizzBuzz thread, I thought I'd respond here too. I understand
what you are saying. The thing I found interesting was that I could set a variable in the BEGIN
block that I couldn't set on the command line.

Thanks again for your instructive responses. Brian Donnell

Janis Papanagnou

Jan 28, 2014, 10:56:36 PM
On 28.01.2014 22:06, b2do...@gmail.com wrote:
> On Monday, 17 June 2013 13:21:53 UTC-7, Janis Papanagnou wrote:
>> On 17.06.2013 20:29, Brian Donnell wrote:
>>
[...]
>>>
>>> I'm a hobbyist using The One True Awk, version 20070501 on OS X 10.6.8.
>>> TOTA accepts only a single char for RS. Following a suggestion by Ed Morton
>>> in the 8-27-10 thread 'Question about: RS = "\0"', I tried to read in a whole
>>> file by setting RS=SUBSEP. On the command line, this seems only to set RS =
>>> capital S; but when set in the BEGIN block, it seems to read in the whole file,
>>> even if the file contains "\034", the default value of SUBSEP given in _The Awk
>>> Programming Language_. I suppose \034 is an octal value. (Help me here--I'm
>>> way out of my depth)--is it also some kind of file separator value in shell?
>>
>> If you set awk -v RS=SUBSEP the awk variable RS will get the string
>> "SUBSEP" assigned; it's the same as if you'd do awk 'BEGIN {RS="SUBSEP"}' .
>> It's different if you write awk 'BEGIN {RS=SUBSEP}' , i.e. without quotes,
>> where you defined awk variable RS by the awk variable SUBSEP, not by the
>> string literal "SUBSEP".
>>
>> Using SUBSEP (the variable, not the string) in a GNU awk program as RS,
>> it will split the input string as expected at each \034 byte, as expected.
>> If your awk behaves differently I say it is broken.
>>
>> Janis
>>
>>> Best wishes, Brian Donnell
>
> Janis--Since I'm responding on the FizzBuzz thread, I thought I'd respond here too. I understand
> what you are saying. The thing I found interesting was that I could set a variable in the BEGIN
> block that I couldn't set on the command line.

The reason is that in the _awk program context_ the awk variable SUBSEP
has a well defined meaning while on the shell level this is only a string;
the shell does not know awk variables.

awk -v RS=SUBSEP 'BEGIN {...}'

...SUBSEP is a _string literal_ on shell level. A _shell variable_ could
be used, e.g., like this

awk -v RS="${SUBSEP}" 'BEGIN {...}'

(assuming there is anything sensible defined in _shell variable_ SUBSEP).

So the first form is equivalent to

awk 'BEGIN {RS="SUBSEP"}'

while the second one is equivalent to

awk 'BEGIN {RS="..."}'

(where '...' is a placeholder for the content of the shell variable SUBSEP).

So what you are trying to do is inherit a predefined _awk variable_
from the _shell context_, where it is simply not known.
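
The three forms above can be checked directly. This is a small sketch that assumes only a POSIX awk and a printf that understands octal escapes. The shell variable name sep is arbitrary:

```shell
# Form 1: the shell passes the literal text SUBSEP, so RS becomes that string.
awk -v RS=SUBSEP 'BEGIN { print (RS == "SUBSEP") }'    # prints 1

# Form 2: inside the program, SUBSEP is the awk variable (default "\034").
awk 'BEGIN { RS = SUBSEP; print (RS == "\034") }'      # prints 1

# Form 3: a shell variable that really holds the \034 byte.
sep=$(printf '\034')
awk -v RS="$sep" 'BEGIN { print (RS == "\034") }'      # prints 1
```

Only the second and third forms give RS the \034 byte; the first one never looks at awk's SUBSEP at all.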

Janis

b2do...@gmail.com

Jan 28, 2014, 11:24:38 PM
Thanks, Janis. Brian

jhank...@gmail.com

Feb 6, 2014, 4:36:51 PM
To tie back in the example of parsing the Linux environ file:

gawk 'BEGIN{RS=SUBSEP} {print NR,$0}' /proc/self/environ | less -U

is functionally equivalent to:

gawk -vRS=$'\034' '{print NR,$0}' /proc/self/environ | less -U

...assuming your shell is bash and supports $'string' notation. My bash is: GNU bash, version 4.2.45(1)-release (x86_64-pc-linux-gnu)

(The "| less -U" makes it easier to see the NUL bytes in the output..."| cat -A" will as well.)

Kenny McCormack

Feb 6, 2014, 6:26:04 PM
In article <64055e7a-bab3-4fa4...@googlegroups.com>,
<jhank...@gmail.com> wrote:
...
>To tie back in the example of parsing the Linux environ file:
>
>gawk 'BEGIN{RS=SUBSEP} {print NR,$0}' /proc/self/environ | less -U

I *still* don't understand why you people think that setting RS to SUBSEP
has anything to do with parsing Linux environ files. It is as if you
actually do believe that there is some cosmic law that says that the \034
character can never occur in data. Unfortunately, it is trivial in any
shell to create environment variables where either the variable name or the
value (or both) contain this character. That would pretty much break any
solution based on an assumption that SUBSEP can be used as some kind of
separator character.

Furthermore, a lot of people somehow think that setting RS to the null
character (\0) will somehow result in "Reading the whole file in one go in
GAWK". This is easily disproven by trying to read in an environ file where
the null character is used as a field separator (the equivalent of "white
space" in normal text files).

Oh well. So things are when you don't read very well...

--
Faced with the choice between changing one's mind and proving that there is
no need to do so, almost everyone gets busy on the proof.

- John Kenneth Galbraith -

jonathan...@gmail.com

Feb 6, 2014, 7:39:50 PM
On Thursday, February 6, 2014 5:26:04 PM UTC-6, Kenny McCormack wrote:
> In article <64055e7a-bab3-4fa4...@googlegroups.com>,
> <jhank...@gmail.com> wrote:
> ...
> >To tie back in the example of parsing the Linux environ file:
> >
> >gawk 'BEGIN{RS=SUBSEP} {print NR,$0}' /proc/self/environ | less -U
>
> I *still* don't understand why you people think that setting RS to SUBSEP
> has anything to do with parsing Linux environ files. It is as if you
> actually do believe that there is some cosmic law that says that the \034
> character can never occur in data. Unfortunately, it is trivial in any
> shell to create environment variables where either the variable name or the
> value (or both) contain this character. That would pretty much break any
> solution based on an assumption that SUBSEP can be used as some kind of
> separator character.

Sorry, I did not state my intent clearly. I was providing an example that I felt might clarify the question raised earlier with the environ example about SUBSEP being interpreted as a string by bash, but as the char value \034 by awk.

I agree completely about making assumptions when using any string/byte value as an EOF marker. This is why EOF is often indicated out-of-band in read()-like library calls.

> Furthermore, a lot of people somehow think that setting RS to the null
> character (\0) will somehow result in "Reading the whole file in one go in
> GAWK". This is easily disproven by trying to read in an environ file where
> the null character is used as a field separator (the equivalent of "white
> space" in normal text files).

Exactly.

So, in the environ example, RS=SUBSEP "works" accidentally, until such time as \034 appears in an environment value. RS="\0" assigns one key=value pair per record (IIRC, \0 will never occur in the environ keys or values as they are C strings), which was not the desired behavior. Using RS="^$" reads the entire environ into one record, including NUL bytes, and does not break when \034 occurs in an environ value, and is a "correct" way to get the desired behavior, at least in gawk.
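
The three cases can be compared on made-up data shaped like an environ file. This sketch assumes gawk is installed; records_with and fake_environ are hypothetical helpers, and the variable names and values in the sample are invented:

```shell
# records_with is a hypothetical helper: $1 is a gawk statement that sets RS;
# it prints the number of records found on stdin.
records_with() {
    gawk "BEGIN { $1 } END { print NR }"
}

# Fake environ data: three NUL-terminated entries; B's value contains \034.
fake_environ() {
    printf 'A=1\000B=x\034y\000C=3\000'
}

fake_environ | records_with 'RS = SUBSEP'   # prints 2: splits inside B's value
fake_environ | records_with 'RS = "\0"'     # prints 3: one record per key=value pair
fake_environ | records_with 'RS = "^$"'     # prints 1: the whole input as one record
```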

-Jonathan Hankins

Kenny McCormack

Feb 6, 2014, 8:05:31 PM
In article <f3143156-f403-4e2a...@googlegroups.com>,
<jonathan...@gmail.com> wrote:
...
>So, in the environ example, RS=SUBSEP "works" accidentally, until such time as
>\034 appears in an environment value.

Right.

>RS="\0" assigns one key=value pair per
>record (IIRC, \0 will never occur in the environ keys or values as they are C
>strings), which was not the desired behavior.

Right again. I misspoke when I said that it acted like a field separator.
I should have said it acts as a record separator - like a newline in
"normal" text files.

>Using RS="^$" reads the entire
>environ into one record, including NUL bytes, and does not break when \034 occurs
>in an environ value, and is a "correct" way to get the desired behavior, at least
>in gawk.

Exactly. That's the take-home from all of this.
People should not be recommending anything other than:

RS="^$"

when people ask about "Reading the whole file in one go in GAWK".

Ed Morton

Feb 7, 2014, 8:00:32 AM
On 2/6/2014 7:05 PM, Kenny McCormack wrote:
<snip>
> People should not be recommending anything other than:
>
> RS="^$"
>
> when people ask about "Reading the whole file in one go in GAWK".
>

I've seen that mentioned a couple of times now, can anyone provide a pointer to
some documentation that states and explains that?

The gawk manual at the bottom of
http://www.gnu.org/software/gawk/manual/gawk.html#Records says to use '\0' in
gawk but that it's non-portable:

----
You might think that for text files, the nul character, which consists of a
character with all bits equal to zero, is a good value to use for RS in this case:

BEGIN { RS = "\0" } # whole file becomes one record?

gawk in fact accepts this
----

Hence, I suspect, so many people using it.

Regards,

Ed.

Kenny McCormack

Feb 7, 2014, 8:56:30 AM
In article <ld2ldh$86s$1...@dont-email.me>,
Ed Morton <morto...@gmail.com> wrote:
>On 2/6/2014 7:05 PM, Kenny McCormack wrote:
><snip>
>> People should not be recommending anything other than:
>>
>> RS="^$"
>>
>> when people ask about "Reading the whole file in one go in GAWK".
>>
>
>I've seen that mentioned a couple of times now, can anyone provide a
>pointer to some documentation that states and explains that?

I don't think it is in any manual, but it was posted here by Arnold some
time in the last 6 months or so. That's the first time I'd ever seen it,
and I was quickly and firmly converted to it. I believe Arnold actually
credited someone else (whose name escapes me ATM) with its discovery/invention.

The sense I get from (non-rhetorical) you is that you don't like it,
because it looks "hackish". I would imagine that in certain contexts (such
as working for a bank or insurance company), using it would be considered
"uncouth". But to my mind, the fact is that (rhetorical) you *want*
something like this - something "out of band" - that doesn't involve just
setting RS to a character that you hope and pray won't occur in your data.

It'd be even better, of course, if there was simply a setting that said "I
want to read the whole file in one go" - but for now, we must settle for
the "caret dollar" trick. Of course, the "readfile" extension, which is
available in some GAWK installations, does also speak to this issue.

>The gawk manual at the bottom of
>http://www.gnu.org/software/gawk/manual/gawk.html#Records says to use '\0' in
>gawk but that it's non-portable:
>
>----
>You might think that for text files, the nul character, which consists of a
>character with all bits equal to zero, is a good value to use for RS in this case:
>
> BEGIN { RS = "\0" } # whole file becomes one record?
>
>gawk in fact accepts this
>----

I've pointed out in the past that this section of the manual is poorly
worded. I expect it will be changed in a future version. Usually, when
people start a paragraph or section of text out with "You might think...",
the endpoint of that paragraph or section is "But, in fact you'd be wrong;
instead, this other thing is true". E.g.

You might think the moon is made of green cheese, but, in fact, it is
made of (insert rigorous scientific discussion of the actual makeup of
the moon here).

Here (in the section of the GAWK manual quoted above), it starts out with
"You might think..." and ends with "and you are right." This is an odd use of
language. I think the authors of this section are using the phrase "You
might think" in the sense of "Your mental processes could (and should) lead
you to a good understanding, which is that you can use RS="\0" and it *will*
work, most of the time (but not always, e.g., the Linux environ files)".

>Hence, I suspect, so many people using it.

Agreed. But, as I've shown, it isn't a 100% solution (e.g., the Linux
environ files), so it (the recommendation to use it) shouldn't be accepted
as gospel. I know I keep using the Linux environ files as my example, but
that's not the only case. In fact, there's really no good reason to assume
that any arbitrary text file that you encounter in the real world won't
have null characters in it.

Finally, note that issues of "portability" shouldn't enter into this
thread, given that the Subject title explicitly mentions GAWK. It reminds
me of the idiocy of comp.lang.c, where the regs mindlessly insist that
"portability" is and always will be the uber-goal, despite the poster's
insistence that it just isn't relevant in the instant case.

Ed Morton

Feb 7, 2014, 9:02:37 AM
On 2/7/2014 7:56 AM, Kenny McCormack wrote:
> In article <ld2ldh$86s$1...@dont-email.me>,
> Ed Morton <morto...@gmail.com> wrote:
>> On 2/6/2014 7:05 PM, Kenny McCormack wrote:
>> <snip>
>>> People should not be recommending anything other than:
>>>
>>> RS="^$"
>>>
>>> when people ask about "Reading the whole file in one go in GAWK".
>>>
>>
>> I've seen that mentioned a couple of times now, can anyone provide a
>> pointer to some documentation that states and explains that?
>
> I don't think it is in any manual, but it was posted here by Arnold some
> time in the last 6 months or so. That's the first time I'd ever seen it,
> and I was quickly and firmly converted to it. I believe Arnold actually
> credited someone else (whose name escapes me ATM) with its discovery/invention.

OK, well hopefully Arnold can point us to something...

> The sense I get from (non-rhetorical) you is that you don't like it,
> because it looks "hackish".

Not in the slightest, I just haven't seen anything anywhere to document it so
I'm reluctant to use it just on some hearsay. I'd be happy to use it if it is in
fact a solution, and of course I'd like to understand why.

Ed.

Janis Papanagnou

Feb 7, 2014, 9:40:46 AM
On 07.02.2014 14:56, Kenny McCormack wrote:
> In article <ld2ldh$86s$1...@dont-email.me>,
> Ed Morton <morto...@gmail.com> wrote:
>> On 2/6/2014 7:05 PM, Kenny McCormack wrote:
>> <snip>
>>> People should not be recommending anything other than:
>>>
>>> RS="^$"
>>>
>>> when people ask about "Reading the whole file in one go in GAWK".
>>
>> I've seen that mentioned a couple of times now, can anyone provide a
>> pointer to some documentation that states and explains that?
>
> I don't think it is in any manual, but it was posted here by Arnold some
> time in the last 6 months or so. That's the first time I'd ever seen it,
> and I was quickly and firmly converted to it. I believe Arnold actually
> credited someone else (whose name escapes me ATM) with its discovery/invention.

On 2013-06-15, Arnold wrote:
>>>>
>>>> A nice solution given by Denis Shirokov is
>>>>
>>>> RS = "^$"

> [...]
>
> Agreed. But, as I've shown, it isn't a 100% solution (e.g., the Linux
> environ files), so it (the recommendation to use it) shouldn't be accepted
> as gospel. I know I keep using the Linux environ files as my example, but
> that's not the only case.

Or if processing ksh's .sh_history file on Unix systems in general.[*]

> In fact, there's really no good reason to assume
> that any arbitrary text file that you encounter in the real world won't
> have null characters in it.
> [...]

Janis

[*] Though I'm not sure why that file would be read in one chunk; using
'\0' as RS seems even more natural in that case. But with non-GNU awks
I suppose it's not possible to use the standard RS="..." mechanism to parse
records that are separated by '\0'?

Ed Morton

Feb 7, 2014, 9:58:40 AM
With non-GNU awks I assume it's not possible to use RS="^$" since that's 2
characters in which case POSIX states that only the first character is used. I
think we're 100% only talking about how to read a whole file with GNU awk.

Any insight into why "^$" works? It looks a lot like it should mean just a
single blank line.

Ed.

Kenny McCormack

Feb 7, 2014, 10:12:07 AM
In article <ld2sb2$gm7$1...@dont-email.me>,
Ed Morton <morto...@gmail.com> wrote:
...
>With non-GNU awks I assume it's not possible to use RS="^$" since that's 2
>characters in which case POSIX states that only the first character is used. I
>think we're 100% only talking about how to read a whole file with GNU awk.

Exactly. Again, as the thread title is:

Reading the whole file in one go in GAWK

so-called "portability" (or "POSIX") is irrelevant.

>Any insight into why "^$" works? It looks a lot like it should mean just a
>single blank line.

I hadn't thought about it much until now; I was willing to just accept it
as "magic" (which, as I argued in my previous post, is how you *should*
view it - that is, as a setting that says "Read the file in one go").

But here's my analysis:

1) First, realize that newlines are no longer special. So, imagine the
contents of the file with "quarks" in place of the newlines. So,
you end up with one very long string.

2) Now try to pattern match ^$ against that long string. The only way
that ^$ can match is if the string is of zero length. Therefore, it
never matches (except in the degenerate case of the file itself being
zero length). So, we have achieved our goal of having a reg exp
that never matches.

In a lot of ways, it is the same as setting RS="", except for the fact that
that value ("") is "magic" - it means to (so to speak) match "paragraphs".
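
A small sketch of the difference, assuming gawk is installed. The helper name sample is made up, and its input deliberately contains an empty line, which is exactly what "^$" would match if the anchors applied per line:

```shell
# Input with an empty line in the middle: "a", "", "b".
sample() { printf 'a\n\nb\n'; }

# Anchors apply to the input as one long string, so the empty line is not a match.
sample | gawk 'BEGIN { RS = "^$" } END { print NR }'    # prints 1

# RS = "" is the special paragraph mode and splits at the empty line.
sample | gawk 'BEGIN { RS = "" } END { print NR }'      # prints 2
```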

--
The motto of the GOP "base": You can't *be* a billionaire, but at least you
can vote like one.

Janis Papanagnou

Feb 7, 2014, 10:14:55 AM
My question was rather about how to parse files (like ksh's .sh_history,
where records/lines are separated by '\0') if you don't have a GNU awk
available. In GNU awk you can indeed use RS="\0". But how would you parse
such a file record-wise with other awks, where (AFAIK) RS="\0" is
functionally the same as RS="", and the latter does not process the file
line by line?
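
One common workaround, which is not taken from this thread, is to translate the NUL bytes into newlines before awk sees the data. The sketch below assumes a tr that accepts octal escapes and, importantly, that no record itself contains a newline (multi-line history entries would break it). The helper name nul_records_to_lines is made up:

```shell
# nul_records_to_lines is a hypothetical helper: it turns NUL separators into
# newlines so that an awk with the default RS sees one record per entry.
nul_records_to_lines() {
    tr '\000' '\n'
}

# Prints "1: ls -l" and "2: pwd" on separate lines.
printf 'ls -l\000pwd\000' | nul_records_to_lines | awk '{ print NR ": " $0 }'
```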

> I think we're 100% only talking about how to read a whole file with GNU awk.
>
> Any insight into why "^$" works? It looks a lot like it should mean just a
> single blank line.

I don't know. There must be some special handling in awks anyway, since
there's also the case RS="" to consider. Given how Arnold pointed us to the
feature, maybe it's a fortuitous side-effect of the actual implementation?

Actually, if awk had not originally defined RS="" to read the file
multi-line/blockwise, RS="^$" might have been the pattern easier to
associate with blockwise processing, and RS="" the one easier to associate
with whole-file-in-one-chunk processing.[*]

Janis

>
> Ed.

Kenny McCormack

Feb 7, 2014, 10:21:31 AM
In article <ld2t9f$skm$1...@news.m-online.net>,
Janis Papanagnou <janis_pa...@hotmail.com> wrote:
>On 07.02.2014 15:58, Ed Morton wrote:
>> On 2/7/2014 8:40 AM, Janis Papanagnou wrote:
>>>
>>> [*] Though not sure why that file would be read in one chunk; rather using
>>> the '\0' as RS then seems even more natural then. But with non-GNU awk's
>>> I suppose it's not possible using the standard RS="..." mechanism to parse
>>> lines of records that are separated by '\0'?
>>
>> With non-GNU awks I assume it's not possible to use RS="^$" since that's 2
>> characters in which case POSIX states that only the first character is used.
>
>My question was rather meant how to parse files (like ksh's .sh_history,
>where records/lines are separated by '\0') if you don't have a GNU awk
>available.

Who cares?

Completely OT in this thread.

And, for all practical purposes, OT in this group (as I've argued elsewhere
[Google for "self-inflicted wound"]).

--
I've been watching cat videos on YouTube. More content and closer to
the truth than anything on Fox.

Janis Papanagnou

Feb 7, 2014, 10:38:10 AM
On 07.02.2014 16:21, Kenny McCormack wrote:
> In article <ld2t9f$skm$1...@news.m-online.net>,
> Janis Papanagnou <janis_pa...@hotmail.com> wrote:
>> On 07.02.2014 15:58, Ed Morton wrote:
>>> On 2/7/2014 8:40 AM, Janis Papanagnou wrote:
>>>>
>>>> [*] Though not sure why that file would be read in one chunk; rather using
>>>> the '\0' as RS then seems even more natural then. But with non-GNU awk's
>>>> I suppose it's not possible using the standard RS="..." mechanism to parse
>>>> lines of records that are separated by '\0'?
>>>
>>> With non-GNU awks I assume it's not possible to use RS="^$" since that's 2
>>> characters in which case POSIX states that only the first character is used.
>>
>> My question was rather meant how to parse files (like ksh's .sh_history,
>> where records/lines are separated by '\0') if you don't have a GNU awk
>> available.
>
> Who cares?
>
> Completely OT in this thread.

I assumed that if I put that topic-related point in a footnote it would be
okay and no one would be offended. My apologies to you and the group. I will
open a separate posting for that question.

>
> And, for all practical purposes, OT in this group (as I've argued elsewhere
> [Google for "self-inflicted wound"]).

I disagree that questions about the awk language (like the one I made) would
be off-topic here in comp.lang.awk.

Janis

>

de...@gissoft.eu

Feb 7, 2014, 10:52:08 AM
> Any insight into why "^$" works? It looks a lot like it should mean just a
> single blank line.
>
> Ed.

The point of /^$/ is to provide a regexp that matches nothing except the empty string. Even if the file we want to read is empty (i.e., has length 0), the data returned by getline will also be the empty string, because the separator matches immediately at the beginning of the file.

Note that /^$/ is a perfectly normal regular expression, so there should be no compatibility issues.

Another way to read the whole file in one shot is to use RT.
We can set RS to a regexp that, at the opposite extreme, can match anything:

RS=".{1,}"; getline; return RT

The separator matches from the beginning to the end of the file, provided the file contains at least one character. The trick is to turn the whole file into one huge separator and then read it from RT, the variable that holds the separator text. If the file is empty, RT is simply the empty string.

My recommendation, though, is to go back to the origin of the question. What data are you planning to read? Is there really no special string that we could assign to RS to avoid splitting?
I don't believe it.

Just set RS to an explicit (stupid) value:

RS="Abrakadabra" or any other value that we CAN CLAIM is never found in the data we want to read

and forget about the problem =)

I have also heard that dynamic gawk extensions are available somewhere. I have never seen a place where they can simply be downloaded, but I know there was a dynamic extension, readfile(), that simply reads a file in one shot.

Ed Morton

Feb 7, 2014, 10:52:38 AM
Agreed, I think. I thought about this a bit more before I read the above and my
analysis was this: I think what's happening here is that in an RE "^" followed
by any char means "start of string" and that COULD only match at the start of
the file. Similarly "$" preceded by any char means "end of string" which COULD
only match at the end of the file. So, there is no string in a non-empty file
that satisfies the criteria that it exists at both the start and end of the file
so the result is gawk sees a single record since the RS was not found.

Ed.

Ed Morton

Feb 7, 2014, 11:51:32 AM
On 2/7/2014 9:52 AM, de...@gissoft.eu wrote:
>> Any insight into why "^$" works? It looks a lot like it should mean just a
>> single blank line.
>>
>> Ed.
>
> The point of /^$/ is to provide a regexp that matches nothing except the empty string. Even if the file we want to read is empty (i.e., has length 0), the data returned by getline will also be the empty string, because the separator matches immediately at the beginning of the file.
>
> Note that /^$/ is a perfectly normal regular expression, so there should be no compatibility issues.

Compatibility issues? If you mean with non-GNU awks it's not the RE that's the
issue, it's that other awks can't have a multi-char RS.

>
> Another way to read the whole file in one shot is to use RT.
> We can set RS to a regexp that, at the opposite extreme, can match anything:
>
> RS=".{1,}"; getline; return RT

I understand what you're trying to do here but it's significantly more
complicated (it'd need even more than the posted code segment to handle
un-readable files, set $1, etc.) than setting RS="^$" and is also gawk-specific
due to use of RT so it's not worth considering IMHO when RS="^$" is an option.

Ed.

Ed Morton

Feb 7, 2014, 12:01:27 PM
I'm now convinced Kenny's analysis (and belatedly mine) was correct. It's also
interesting (to me at least) that with RS="^$" the last newline in the file is
part of the record while if you use RS="\n$" it's not and I think that might be
preferable. For example a common use of this would be to read a file as 1 record
with each line as a field:

$ awk -v RS='^$' -F'\n' '{for (i=1;i<=NF;i++) print "<" $i ">" }' file
<line 1>
<second line>
<the third and final line>
<>

$ awk -v RS='\n$' -F'\n' '{for (i=1;i<=NF;i++) print "<" $i ">" }' file
<line 1>
<second line>
<the third and final line>

Note that with the first version you need to be aware that there is an extra
field after the terminating newline so I THINK that setting RS="\n$" to read a
whole file is actually a better solution than RS="^$" and it'll work just fine
for files without terminating newlines too as then the RS just won't match anything.

Thoughts?

Ed.

Jonathan Hankins

Feb 7, 2014, 1:04:00 PM
You're right. This is mentioned in the gawk manual, (section 4.1 in HEAD):

NOTE: Remember that in `awk', the `^' and `$' anchor
metacharacters match the beginning and end of a _string_, and not
the beginning and end of a _line_. As a result, something like
`RS = "^[[:upper:]]"' can only match at the beginning of a file.
This is because `gawk' views the input file as one long string
that happens to contain newline characters in it. It is thus best
to avoid anchor characters in the value of `RS'.

Also, the specific mention of RS="^$" is in the gawk.texi file from the HEAD branch in the git repo, as Arnold said. Here is the excerpt:

---

File: gawk.info, Node: Readfile Function, Prev: Getlocaltime Function, Up: General Functions

10.2.8 Reading A Whole File At Once
-----------------------------------

Often, it is convenient to have the entire contents of a file available
in memory as a single string. A straightforward but naive way to do
that might be as follows:

function readfile(file,    tmp, contents)
{
    if ((getline tmp < file) < 0)
        return

    contents = tmp
    while ((getline tmp < file) > 0)
        contents = contents RT tmp

    close(file)
    return contents
}

This function reads from `file' one record at a time, building up
the full contents of the file in the local variable `contents'. It
works, but is not necessarily efficient.

The following function, based on a suggestion by Denis Shirokov,
reads the entire contents of the named file in one shot:

# readfile.awk --- read an entire file at once

function readfile(file,    tmp, save_rs)
{
    save_rs = RS
    RS = "^$"
    getline tmp < file
    close(file)
    RS = save_rs

    return tmp
}

It works by setting `RS' to `^$', a regular expression that will
never match if the file has contents. `gawk' reads data from the file
into `tmp' attempting to match `RS'. The match fails after each read,
but fails quickly, such that `gawk' fills `tmp' with the entire
contents of the file. (*Note Records::, for information on `RT' and
`RS'.)

In the case that `file' is empty, the return value is the null
string. Thus calling code may use something like:

contents = readfile("/some/path")
if (length(contents) == 0)
    # file was empty ...

This tests the result to see if it is empty or not. An equivalent
test would be `contents == ""'.

---
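
For anyone who wants to try the second function from the excerpt, here is a minimal sketch that assumes gawk and mktemp are available. The wrapper name whole_file_length and the sample file contents are made up for illustration:

```shell
# whole_file_length is a hypothetical wrapper around a readfile() like the one
# quoted above: it prints the number of characters in the file named by $1.
whole_file_length() {
    gawk '
    function readfile(file,    tmp, save_rs)
    {
        save_rs = RS
        RS = "^$"              # cannot match while the file has contents
        getline tmp < file     # so a single read returns everything
        close(file)
        RS = save_rs
        return tmp
    }
    BEGIN { print length(readfile(ARGV[1])) }' "$1"
}

sample=$(mktemp)
printf 'line 1\nline 2\n' > "$sample"
whole_file_length "$sample"    # prints 14: two 7-character lines, newlines included
rm -f "$sample"
```

Because the program has only a BEGIN rule, gawk never reads the file as main input; readfile() is the only reader.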

-Jonathan Hankins

Jonathan Hankins

Feb 7, 2014, 1:10:59 PM
Some other nuggets from the manual (section 4.1 in HEAD):

"Reaching the end of an input file terminates the current input
record, even if the last character in the file is not the character in
`RS'. (d.c.)"

"[...]The newline, because it
matches `RS', is not part of either record.

When `RS' is a single character, `RT' contains the same single
character. However, when `RS' is a regular expression, `RT' contains
the actual input text that matched the regular expression.

If the input file ended without any text that matches `RS', `gawk'
sets `RT' to the null string."

---

-Jonathan Hankins

Ed Morton

Feb 7, 2014, 1:35:34 PM
Finally - a reference! Thanks for that. Don't suppose you know if/where it exists online do you? I expect to be telling people to use either RS="^$" or RS="\n$" in future and saying "go look up the git whatever-it-is" isn't a good general purpose reference to point people to for more info (much like this thread!).

Using RS="^$" is exactly correct for "read the whole file" but I think RS="\n$" provides the more typically (even if incorrectly) expected behavior:

------------------
$ cat file
line 1
line 2

$ gawk -v RS='\n$' -F'\n' '{for (i=1;i<=NF;i++) print "<" $i ">" }' file
<line 1>
<line 2>

$ gawk -v RS='^$' -F'\n' '{for (i=1;i<=NF;i++) print "<" $i ">" }' file
<line 1>
<line 2>
<>

$ gawk -v RS='\n$' '{print "<" $0 ">"}' file
<line 1
line 2>

$ gawk -v RS='^$' '{print "<" $0 ">"}' file
<line 1
line 2
>

------------------

so I need to now ponder what the best advice is to give if/when the question comes up in future - either way there's a caveat to explain...

Ed.

Jonathan Hankins

Feb 7, 2014, 2:10:28 PM
Sure! The example with RS = "^$" is in the gawk.texi manual in gawk-4.1-stable, so anyone with gawk 4.1 should have the manual with the excerpt I posted. It is NOT in gawk-4.0-stable's manual.

> Using RS="^$" is exactly correct for "read the whole file" but I think RS="\n$" provides the more typically (even if incorrectly) expected behavior:
>
> ------------------
> $ cat file
> line 1
> line 2
>
> $ gawk -v RS='\n$' -F'\n' '{for (i=1;i<=NF;i++) print "<" $i ">" }' file
> <line 1>
> <line 2>
>
> $ gawk -v RS='^$' -F'\n' '{for (i=1;i<=NF;i++) print "<" $i ">" }' file
> <line 1>
> <line 2>
> <>
>
> $ gawk -v RS='\n$' '{print "<" $0 ">"}' file
> <line 1
> line 2>
>
> $ gawk -v RS='^$' '{print "<" $0 ">"}' file
> <line 1
> line 2
> >
>
> ------------------
>
> so I need to now ponder what the best advice is to give if/when the question comes up in future - either way there's a caveat to explain...

I would say, in the interest of KISS, if you are NOT able/wanting to use the readfile extension in 4.1 as discussed early in this thread, then:

1) Use RS = "^$" to read the entire file into one record.

The contents of $0 after the read should be identical to what you'd get from cat.

2) Only use RS = "\n$" if you want to read the entire file, minus any final newline, into one record.

You could use this to, for example, detect files without a final newline character, by testing the value of RT in END {}. I think this one is generally more prone to confuse people, and wouldn't advocate for it over RS = "^$" when the intent is to read the entire file into one record. In *general*, if you are using awk and care about newlines in your input, you should probably be doing record-per-line (or record-per-paragraph, etc.) processing.

-Jonathan Hankins

admini...@mintywestapps.net

Feb 14, 2014, 9:30:00 PM
On Friday, June 14, 2013 8:05:20 PM UTC+8, Kenny McCormack wrote:
> I have the need to read in the whole file as a single record in a GAWK
> program.

What a great list of replies.

After reading everyone's posts, which helped me think about it, how about the additional solution of a range:

1. Simple form:
/^/,/$/

2. Command line:
gawk '/^/,/$/' <filename>

3. With test:
gawk '/^/,/$/{if(NR==1) print}' /proc/self/environ

Such a cheeky question, but I did learn something from the replies.

Thank you list members.

admini...@mintywestapps.net

Feb 15, 2014, 4:56:30 PM
gawk 'BEGIN{RS="/^/,/$/"} {bf[FILENAME]=$0} END{for(i in bf) print "Filename = " i ":\n" bf[i]}' /proc/self/environ /etc/group

admini...@mintywestapps.net

Feb 17, 2014, 12:24:36 PM
On Friday, June 14, 2013 8:05:20 PM UTC+8, Kenny McCormack wrote:
> I have the need to read in the whole file as a single record in a GAWK
> program. I seem to remember reading recently that there was some special
> feature (recently added, sometime in the 4.x series) that supported this.
> However, I could not find it in the doc (not that I searched exhaustively).
>
> Anyway, it looks like one way is to set RS="zzzzzzzzzzzzzzzzzzzzzzzzzzzzz",
> i.e., something that never occurs in your text, but is there a more
> "systemic" way to do this?
>
> Note that this is *not* the same as setting RS="", since that has a defined
> meaning, that has to do with multiple blank lines.
>
> --
> Here's a simple test for Fox viewers:
>
> 1) Sit back, close your eyes, and think (Yes, I know that's hard for you).
> 2) Think about and imagine all of your ridiculous fantasies about Barack Obama.
> 3) Now, imagine that he is white. Cogitate on how absurd your fantasies
> seem now.
>
> See? That wasn't hard, was it?

Range:
/^/,/$/

Kenny McCormack

Feb 17, 2014, 12:31:29 PM
In article <38e654df-7140-4c7c...@googlegroups.com>,
<admini...@mintywestapps.net> wrote:
...
>Range:
>/^/,/$/

Dumb.

I don't get why people are settling into and recommending this stupid
range, which is exactly equivalent to, longer than, and more complex than the
correctly recommended version: /^$/

--

Prayer has no place in the public schools, just like facts
have no place in organized religion.
-- Superintendent Chalmers

admini...@mintywestapps.net

Feb 17, 2014, 1:03:45 PM
> >Range:
> >/^/,/$/
>
> Dumb.
>
> I don't get why people are settling into and recommending this stupid
> range, which is exactly equivalent to, longer than, and more complex than the
> correctly recommended version: /^$/

If using (g)awk to span many files /^$/ joins files in a misleading join.

Hence:
grep -h -E -e \"<Exp>\" adds "\n" to last row

(*)awk: Does not.

(*)awk joins records ending with \0 with the next file.
When dealing with millions of records/files, it's not so easy to discover that (misleading \0) join.


Joe User

Feb 17, 2014, 2:26:35 PM
I do this:

$ gawk 'BEGIN { RS="^$"; } 1 {print "rlen = " length($0) } ENDFILE {print "End " FILENAME}' <(echo -ne '\0') <(echo -ne '\0')
rlen = 1
End /dev/fd/63
rlen = 1
End /dev/fd/62

This seems right to me.

Could you provide a simple bash sequence that demonstrates the problem?


--
There is nothing so useless as doing efficiently
that which should not be done at all.

-- Peter F. Drucker

admini...@mintywestapps.net

Feb 17, 2014, 7:20:31 PM
Thank you.

Prior to reading this thread I would not have considered adding \0 to FS.

Solution:
gawk -F=":|\0" '{print $1 ":" $2}' `gawk 'BEGIN{for(i=1;i<=4;i++) print "/proc/self/environ"}'`

Cool.

Quicker than grep -h -E -e.

I need to test this before I comment further.

Thanks.

admini...@mintywestapps.net

Feb 19, 2014, 6:51:08 AM
On Tuesday, February 18, 2014 1:31:29 AM UTC+8, Kenny McCormack wrote:
> ...
> >Range:
> >/^/,/$/
>
> Dumb.
>
> I don't get why people are settling into and recommending this stupid
> range, which is exactly equivalent to, longer than, and more complex than the
> correctly recommended version: /^$/
>
> --
> Prayer has no place in the public schools, just like facts
> have no place in organized religion.
> -- Superintendent Chalmers


I've done some performance testing and I can say that '^$' is too slow.

On small files of a couple of million rows using '^$' has about a 100 x 1 equivalent performance loss.

The range sample I proposed was slow at 10 x 1 the performance loss.

For full performance 1 x 1 (with \0 processing):

gawk 'BEGIN{RS="/?/,ENDFILE";} {print "\nFilename = " FILENAME;print; print "Filecount = " NR}' /proc/self/environ /etc/group /tmp/*.DATA

Also you should make note that if you seriously were to use ^$, then really you should alter FS too (to fit your purpose).

Aharon Robbins

Feb 20, 2014, 12:30:20 PM
In article <80647d8a-6d87-44de...@googlegroups.com>,
<admini...@mintywestapps.net> wrote:
>If using (g)awk to span many files /^$/ joins files in a misleading join.

This is not true. End of file always terminates a record.
--
Aharon (Arnold) Robbins arnold AT skeeve DOT com
P.O. Box 354 Home Phone: +972 8 979-0381
Nof Ayalon
D.N. Shimshon 9978500 ISRAEL

Aharon Robbins

Feb 20, 2014, 12:36:23 PM
In article <666bf4d0-648f-42bf...@googlegroups.com>,
<admini...@mintywestapps.net> wrote:
>I've done some performance testing and I can say that '^$' is too slow.
>
>On small files of a couple of million rows using '^$' has about a 100 x
>1 equivalent performance loss.
>
>The range sample I proposed was slow at 10 x 1 the performance loss.

It is not searching for a range.

>For full performance 1 x 1 (with \0 processing):
>
>gawk 'BEGIN{RS="/?/,ENDFILE";} {print "\nFilename = " FILENAME;print;
>print "Filecount = " NR}' /proc/self/environ /etc/group /tmp/*.DATA

This says "To find the record separator, look for an optional slash,
followed by the nine characters slash, comma, E, N, D, F, I, L, E".
If it's faster than ^$, that is an artifact of the regexp engine.

$ cat data
Record 1 /,ENDFILE Record 2

$ gawk 'BEGIN { RS = "/?/,ENDFILE" } ; { print }' data
Record 1
Record 2

I suspect that the readfile() loadable extension described in the
gawk manual will be the fastest way to read a file as a single string,
since it bypasses the regexp engine and all the memory management
of growing the buffer.

admini...@mintywestapps.net

Feb 21, 2014, 5:41:47 AM
Cool. Thanks for that explanation.

I'm still intrigued with this thread/concept.

So that means it's as simple as using EOF | ENDFILE, doesn't it?

Example:

$cat data
Record 1 /,ENDFILE Record 2 EOF Record 3 hello Record 4 1 /,EOF Record 5 there ,EOF Record 6 ENDFILE Record 7

1.
gawk 'BEGIN { RS=ENDFILE } ; { print "Filecount:" NR;print}' data data

2.
gawk 'BEGIN { RS=EOF} ; { print "Filecount:" NR;print}' data data

Both output:
Filecount:1
Record 1 /,ENDFILE Record 2 EOF Record 3 hello Record 4 1 /,EOF Record 5 there ,EOF Record 6 ENDFILE Record 7
Filecount:2
Record 1 /,ENDFILE Record 2 EOF Record 3 hello Record 4 1 /,EOF Record 5 there ,EOF Record 6 ENDFILE Record 7

I'll do more performance testing, when I get a chance.

Thank you.

admini...@mintywestapps.net

Feb 21, 2014, 5:52:46 AM
On Friday, February 21, 2014 1:30:20 AM UTC+8, Aharon Robbins wrote:
> >If using (g)awk to span many files /^$/ joins files in a misleading join.
>
> This is not true. End of file always terminates a record.
> --
> Aharon (Arnold) Robbins arnold AT skeeve DOT com
> P.O. Box 354 Home Phone: +972 8 979-0381
> Nof Ayalon
> D.N. Shimshon 9978500 ISRAEL

I've been caught out many times with:
ssh remotefiltering "gawk ..." $FILELIST | gawk 'localprocessing'

I'll try to fish out an old testcase and open a different post (if I can find an example).

Aharon Robbins

Feb 21, 2014, 8:09:27 AM
In article <91ec3dd9-c2fb-43a4...@googlegroups.com>,
<admini...@mintywestapps.net> wrote:
>Cool. Thanks for that explanation.
>
>I'm still intrigued with this thread/concept.
>
>So that means it's as simple as using EOF | ENDFILE, doesn't it?

No.

>Example:
>
>$cat data
>Record 1 /,ENDFILE Record 2 EOF Record 3 hello Record 4 1 /,EOF Record 5
>there ,EOF Record 6 ENDFILE Record 7
>
>1.
>gawk 'BEGIN { RS=ENDFILE } ; { print "Filecount:" NR;print}' data data

This must be an old version of gawk. You should have gotten a syntax error
with gawk 4.x. Since you didn't, you are assigning the value of an empty
string to RS.

>2.
>gawk 'BEGIN { RS=EOF} ; { print "Filecount:" NR;print}' data data

EOF is not at all special. Same thing, you are assigning a null
string to RS.
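
To make the point concrete, here is a minimal sketch (the sample input is made up) that should behave the same in any POSIX awk. EOF is an ordinary, uninitialized variable, so the assignment is equivalent to RS = "" and selects paragraph mode:

```shell
# EOF is just an uninitialized variable, so RS = EOF is the same as RS = "",
# which splits records on blank lines (paragraph mode).
out=$(printf 'a\nb\n\nc\n' | awk 'BEGIN { RS = EOF } { print NR ": " $0 }')
printf '%s\n' "$out"
```

The input holds two paragraphs, so two records come out, which is why a file with no blank lines appears to be read "in one go".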

I suggest spending some time reading the gawk manual. At least
chapters 1 through 11 in the current manual.

Kenny McCormack

Feb 21, 2014, 8:49:53 AM
In article <le7j67$big$1...@dont-email.me>,
Aharon Robbins <arn...@skeeve.com> wrote:
...
>I suggest spending some time reading the gawk manual. At least
>chapters 1 through 11 in the current manual.

Snarky. I like it.

--
A Catholic woman tells her husband to buy Viagra.

A Jewish woman tells her husband to buy Pfizer.

admini...@mintywestapps.net

Feb 21, 2014, 8:57:07 AM
On Friday, February 21, 2014 9:09:27 PM UTC+8, Aharon Robbins wrote:

> I suggest spending some time reading the gawk manual. At least
> chapters 1 through 11 in the current manual.

Okay.

A quick scan of systems I'm using @ home:
GNU Awk 3.1.8
mawk - damn fast!
GNU Awk 4.1.0

Explains a few things.

admini...@mintywestapps.net

Feb 22, 2014, 4:20:57 PM
> This is not true. End of file always terminates a record.
Read about ssl & ssh - perhaps start with cryptography.

Ref.
http://en.wikipedia.org/wiki/OpenSSL

(grumpy)

admini...@mintywestapps.net

Feb 22, 2014, 4:55:30 PM
> >I've done some performance testing and I can say that '^$' is too slow.
> >
> >On small files of a couple of million rows using '^$' has about a 100 x
> >1 equivalent performance loss.
> >
> >The range sample I proposed was slow at 10 x 1 the performance loss.
>
> It is not searching for a range.
>
> >For full performance 1 x 1 (with \0 processing):
> >
> >gawk 'BEGIN{RS="/?/,ENDFILE";} {print "\nFilename = " FILENAME;print;
> >print "Filecount = " NR}' /proc/self/environ /etc/group /tmp/*.DATA
>
> This says "To find the record separator, look for an optional slash,
> followed by the nine characters slash, comma, E, N, D, F, I, L, E".
> If it's faster than ^$, that is an artifact of the regexp engine.
>
> $ cat data
> Record 1 /,ENDFILE Record 2
>
> $ gawk 'BEGIN { RS = "/?/,ENDFILE" } ; { print }' data
> Record 1
> Record 2
>
> I suspect that the readfile() loadable extension described in the
> gawk manual will be the fastest way to read a file as a single string,
> since it bypasses the regexp engine and all the memory management
> of growing the buffer.

Bypassing FS processing... (and others) - Yes, true.

I've many tests to perform on this (list topic) thread - why? Because I've come to a point where I care about the future of (g)awk.

Why? - There are potentials.

(Re)-reading 4+ enhancements ( as instructed by maintenance + enhancements ).

Wow -

Example:
Performance - on the fly example, if ^$ is poor performance, how to enhance and not flame the idea and embrace the users/ops of this thread.

why not:
([\0].+)(plus others - as an example)\1($^)

Concentrating on performance - enhancing the concept, not killing it. Reduce the CPU usage to minimal compares + adding zero match at the end.

pk

Feb 22, 2014, 4:50:34 PM
On Sat, 22 Feb 2014 13:20:57 -0800 (PST), admini...@mintywestapps.net
wrote:
This wouldn't even deserve an answer, but...do you think a wikipedia
page about cryptography knows better than Arnold whether *IN AWK* end of
file always terminates a record or not?

admini...@mintywestapps.net

Feb 22, 2014, 6:51:08 PM
On Sunday, February 23, 2014 5:50:34 AM UTC+8, pk wrote:
> On Sat, 22 Feb 2014 13:20:57 -0800 (PST), minty
> wrote:
>
> > > This is not true. End of file always terminates a record.
> > Read about ssl & ssh - perhaps start with cryptography.
> >
> > Ref.
> > http://en.wikipedia.org/wiki/OpenSSL
> >
> > (grumpy)
>
> This wouldn't even deserve an answer, but...do you think a wikipedia
> page about cryptography knows better than Arnold whether *IN AWK* end of
> file always terminates a record or not?

Wink.

You got it...

It's a bitch when \0 ends a terminated EOF and ssh +(g)awk joins them... Hence grep -h -v -E -e "Bluah" $wiouhweiouhwoieuhowehow solves the problem.

(but thanks - I've enjoyed this thread and grown)

Aharon Robbins

Feb 23, 2014, 8:12:44 AM
In article <99480dc5-2b0a-49b0...@googlegroups.com>,
<admini...@mintywestapps.net> wrote:
>Concentrating on performance - enhancing the concept, not killing it.
>Reduce the CPU usage to minimal compares + adding zero match at the end.

Another suggestion from the same source is RS = "[^\000-\377]"
I would be interested to know the performance as compared to RS = "^$"

You should then try

@load "readfile"

in gawk 4.1.0. I suspect it will be the fastest.

Kenny McCormack

Feb 23, 2014, 8:32:14 AM
In article <lecs4b$71j$1...@dont-email.me>,
Aharon Robbins <arn...@skeeve.com> wrote:
>In article <99480dc5-2b0a-49b0...@googlegroups.com>,
> <admini...@mintywestapps.net> wrote:
>>Concentrating on performance - enhancing the concept, not killing it.
>>Reduce the CPU usage to minimal compares + adding zero match at the end.
>
>Another suggestion from the same source is RS = "[^\000-\377]"
>I would be interested to know the performance as compared to RS = "^$"
>
>You should then try
>
> @load "readfile"
>
>in gawk 4.1.0. I suspect it will be the fastest.

As I think more and more about this topic, I keep asking myself "Why *not*
use 'readfile' ?" I.e., why *are* we still discussing various ways to
contort RS - rather than just saying "Use readfile"?

Apart from the so-called "portability" concerns (i.e., the fact that it is
not only GAWK-specific, but [more or less]
latest-version-of-GAWK-specific), the objection that has been raised in the
past is that you can't (so goes the objection) use it in combination with
the "regular" AWK pattern/action input loop. But now I'm not sure that
this objection holds any water.

Couldn't you just do:

BEGIN { $0 = readfile("file") }
# And away you go...
/foo/ { ... }

Above is untested, so I may have gotten some of the details wrong.

--
Both the leader of the Mormon Church and the leader of the Catholic
church claim infallibility. Is it any surprise that these two orgs
revile each other? Anybody with any sense knows that 80-yr old codgers
are hardly infallible. Some codgers this age do well to find the crapper
in time and remember to zip-up.

Jonathan Hankins

Feb 24, 2014, 3:16:22 PM
On Sunday, February 23, 2014 7:32:14 AM UTC-6, Kenny McCormack wrote:

> Couldn't you just do:
>
> BEGIN { $0 = readfile("file") }
> # And away you go...
> /foo/ { ... }
>
> Above is untested, so I may have gotten some of the details wrong.

Doesn't work. Per the manual (7.1.4.2), $0 is undefined in BEGIN. Assigning to it directly, or using getline() gives it a value, field splitting happens, etc. BUT, when the first non-BEGIN pattern is processed, it starts up the input loop, which either 1) clobbers the $0 set in BEGIN when first record is read, or 2) exits when there is no record to read.

Note that in case 2) above, $0 has the value from the assignment in BEGIN inside any END block, but still, I believe that you can't "pre-load" $0 and then launch into the pattern/action processing and process what you've "pre-loaded".
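
Case 1) is easy to reproduce. The following is only a sketch with made-up input, and it should behave the same in any POSIX awk:

```shell
# The value assigned to $0 in BEGIN is replaced as soon as the first record is read,
# so the main rule never sees the "preloaded" text.
out=$(printf 'from input\n' | awk 'BEGIN { $0 = "preloaded" } { print "main rule saw: " $0 }')
printf '%s\n' "$out"
```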

I think, as far as using readfile() and getting an "awkish idiom" while processing it, your best bet is if/switch inside BEGIN.

I think RS="^$" is documented, and gives you idiomatic processing. I have been following the discussion but am having a little trouble understanding the particular issue minty is concerned with.

-Jonathan

Kenny McCormack

Feb 24, 2014, 3:19:39 PM
In article <03e5cedc-0405-47d2...@googlegroups.com>,
Jonathan Hankins <jhank...@gmail.com> wrote:
...
>I think, as far as using readfile() and getting an "awkish idiom" while
>processing it, your best bet is if/switch inside BEGIN.

Yeah, I guess you're right.

As far as I'm concerned, that breaks it. Makes it a non-starter.

As they say, if I wanted to write regular procedural code, I know where to
look for Perl.

--
"Remember when teachers, public employees, Planned Parenthood, NPR and PBS
crashed the stock market, wiped out half of our 401Ks, took trillions in
TARP money, spilled oil in the Gulf of Mexico, gave themselves billions in
bonuses, and paid no taxes? Yeah, me neither."

Jonathan Hankins

Feb 24, 2014, 3:48:23 PM
On Monday, February 24, 2014 2:19:39 PM UTC-6, Kenny McCormack wrote:
> As far as I'm concerned, that breaks it. Makes it a non-starter.
>
> As they say, if I wanted to write regular procedural code, I know where to
> look for Perl.

One case where I have used it is to read and process an infrequently-updated file from a well-known location before processing files specified on the command line.

For example, I periodically pull oui.txt down to $HOME:

http://standards.ieee.org/develop/regauth/oui/oui.txt

Then, if I have a script that uses the MAC-to-vendor mappings from oui.txt to do something with files specified on the command line and/or STDIN, I don't need to hard-code the sometimes-changing mappings in the awk script, nor do I have to impose some special logic that I'll never remember, like "FNR == 1 { # this should be oui.txt }", etc.

But yeah, getline() and readfile() don't take advantage of idiomatic pattern/action awk syntax, and I think are there for cases when you need them.

I am still confused about the ssh, \0, EOF issue minty is talking about *shrug*

-Jonathan

Kenny McCormack

Feb 24, 2014, 4:35:07 PM
In article <9bee7583-3da9-42bb...@googlegroups.com>,
Jonathan Hankins <jhank...@gmail.com> wrote:
...
>But yeah, getline() and readfile() don't take advantage of idiomatic
>pattern/action awk syntax, and I think are there for cases when you need them.

Agreed.

>I am still confused about the ssh, \0, EOF issue minty is talking about *shrug*

Yes. I think several posters in this thread have gotten themselves
confused about what we are talking about in the thread. They have somehow
conflated the issues of "reading the file all in one go" and that of
"parsing Linux environ files" (aka, the question of setting RS="\0", which
really boils down to "Does your version of AWK use the C string type to
store strings?" - which basically means that you can't use the null
character in your data [*]).

The two issues really have nothing to do with each other other than that
they do both involve using RS. But, as you can deduce from my earlier post
(in which I used the word "contort"), I really don't think that we should
be spending our time searching for some magical weird value for RS, when
what we really want is a system/global setting (i.e., something in PROCINFO)
that says "Read the file in one go" (but still allow me to use the usual
AWK pattern/action mechanism).

Against that, of course, I am aware of Arnold's (and others') resistance to
"doing in the kernel" what can be done (however awkwardly) in user-space.

[*] Yes, I have read the recent posts about "mawk" which imply that, even
though mawk uses C strings, it somehow magically gets around this
limitation in certain situations (specifically, in re: the RS variable) [**].

[**] Or maybe I've mis-read these posts and "mawk" no longer uses "C
strings" as its basic string data type. Who can tell???

--
But the Bush apologists hope that you won't remember all that. And they
also have a theory, which I've been hearing more and more - namely,
that President Obama, though not yet in office or even elected, caused the
2008 slump. You see, people were worried in advance about his future
policies, and that's what caused the economy to tank. Seriously.

(Paul Krugman - Addicted to Bush)

Jonathan Hankins

Feb 24, 2014, 5:54:47 PM
On Monday, February 24, 2014 3:35:07 PM UTC-6, Kenny McCormack wrote:

> The two issues really have nothing to do with each other other than that
> they do both involve using RS. But, as you can deduce from my earlier post
> (in which I used the word "contort"), I really don't think that we should
> be spending our time searching for some magical weird value for RS, when
> what we really want is a system/global setting (i.e., something in PROCINFO)
> that says "Read the file in one go" (but still allow me to use the usual
> AWK pattern/action mechanism).

(http://www.gnu.org/software/gawk/manual/gawk.html#Multiple-Line)

The fundamental issue is that the special case RS="" (or RS=UndefinedVar, because undefined vars are equivalent to empty string) "uses up" the only syntactically-correct "out of band" value you could assign to RS to specify a value that could never occur in (single char) or match (multi char, a regexp) your input.

There are various "special" values that people put in RS to make it have this "entire file" behavior: "^$", "/^/,/$/" (barf), "IReallyHopeThisIsn'tInMyData", "\0", etc. Some of them are problematic because they aren't fool-proof. Some may or may not have performance issues for particular data sets, based on what people (not me) are claiming. Some of them are not very mnemonic, IMO.

Well, let's imagine briefly that this special case of RS="" had never existed. We could use RS="\n\n+" to enable "paragraph mode", which it very clearly expresses. And we could use RS="" to enable "entire file mode", which it (IMO) very clearly expresses "there is no record separator, so don't separate the records". We could then use gawk idiomatically to process entire files per record.

So, assuming this would generally be considered a "good thing", let's further imagine how we could modify the gawk we have currently to OPTIONALLY enable this hypothetical behavior:

1) Default value of RS is still "\n". (current behavior)

2) Setting PROCINFO["whatever"]=1 (or any true value?) enables this hypothetical behavior and explicitly breaks compatibility for those who so desire. (new behavior)

3) When PROCINFO["whatever"] is true, you must now explicitly set RS="\n\n+" to obtain the original special case behavior ("paragraph mode") of RS="" (new behavior)

4) When PROCINFO["whatever"] is true, setting RS="" (the null string) enables "entire file" mode (new behavior)

This would allow you to process $0 idiomatically, outside of BEGIN/END, in the part of the gawk logic that works with the pattern/action statements:

----

# hypothetical code
BEGIN { PROCINFO["whatever"]=1; RS="" }
/foo/ { foo++;print FILENAME " contains foo" }
/bar/ { bar++;print FILENAME " contains bar" }
END {
print "Number of files containing the string foo: " foo
print "Number of files containing the string bar: " bar
}

----

Given file contents:

file1: "hello foo"
file2: "goodbye dearest bar"
file3: "you are the bar to my foo"
file4: "what the bar are you talking about?"

----

Run:

gawk -f myscript.gawk file1 file2 file3 file4

----

Output:

file1 contains foo
file2 contains bar
file3 contains foo
file3 contains bar
file4 contains bar
Number of files containing the string foo: 2
Number of files containing the string bar: 3

----

The two potential benefits I see to this are:

1) RS="" seems more mnemonic than RS="^$", or at least more memorable, to me:
(in my hypothetical case)
RS="" # there is no record separator
(our current "best" option)
RS="^$" # record separator is a regexp that can never match

2) My hypothetical changes should allow the RS-logic and/or its tie-in with regexp engine to be bypassed during runtime. Gawk MAY already optimize it out now when RS="^$". And there's an even better chance that the contribution to the runtime performance is so insignificant that it's not worth the programmer time to change. If there really IS some performance impact when RS="^$" for huge datasets, maybe it is worth it.

In conclusion, I think it would have been nice if RS="" disabled record splitting, and RS="\n\n+" enabled "paragraph mode", but RS="^$" IS (recently) documented and DOES provide what we need to use the pattern/action logic of gawk on entire file contents, without using readfile() in BEGIN, possibly with yet unproven performance implications.

I've never looked at gawk's source and have no idea if my hypothetical changes are feasible, IF they are even desirable...maybe I'll have a look :-)

-Jonathan

Jonathan Hankins

Feb 24, 2014, 6:58:04 PM
On Monday, February 24, 2014 4:54:47 PM UTC-6, Jonathan Hankins wrote:

> I've never looked at gawk's source and have no idea if my hypothetical changes are feasible, IF they are even desirable...maybe I'll have a look :-)

So I had a look :-) I knew this at some point but had to be reminded when I re-read the manual (4.8) based on what I saw in the code:

"There is an important difference between 'RS = ""' and 'RS = "\n\n+"'. In the first case, leading newlines in the input data file are ignored, and if a file ends without extra blank lines after the last record, the final newline is removed from the record. In the second case, this special processing is not done. (d.c.)"

So there is extra magic to the special case. Still, if my hypothetical gawk "feature" was explicitly enabled (PROCINFO["whatever"]=1), it would basically be switching this special case off, which wouldn't matter, since you would be turning off the special "paragraph mode" anyway. If anyone is interested, I was looking at gawk-4.1.0/io.c:set_RS() and io.c:rsnullscan(). At first glance, if I were going to try to implement my "feature", I'd probably implement PROCINFO["whatever"] and a branch in set_RS() when RS="" to set matchrec = rsnullscan in the default/PROCINFO["whatever"] != 1 case and matchrec = rsnullsnarf(?) in the PROCINFO["whatever"] == 1 case.

Kudos to Arnold et al -- the code was very clear and I found what I was looking for almost immediately!

-Jonathan

Aharon Robbins

Feb 25, 2014, 8:39:37 AM
In article <64aea97e-8053-416d...@googlegroups.com>,
Jonathan Hankins <jhank...@gmail.com> wrote:
>On Monday, February 24, 2014 4:54:47 PM UTC-6, Jonathan Hankins wrote:
>
>> I've never looked at gawk's source and have no idea if my hypothetical
>>changes are feasible, IF they are even desirable...maybe I'll have a
>>look :-)
>
>So I had a look :-) I knew this at some point but had to be reminded
>when I re-read the manual (4.8) based on what I saw in the code:
>
>"There is an important difference between 'RS = ""' and 'RS = "\n\n+"'.
>In the first case, leading newlines in the input data file are ignored,
>and if a file ends without extra blank lines after the last record, the
>final newline is removed from the record. In the second case, this
>special processing is not done. (d.c.)"

Indeed. It pays to remember that in standard awk only the first character
of RS is relevant, and that RS = <regexp> is a gawk extension.

>Kudos to Arnold et al -- the code was very clear and I found what I was
>looking for almost immediately!

Thanks. I try. I have to maintain the code and long periods can go by
in between when I look at particular chunks of the code.

>Still, if my hypothetical
>gawk "feature" was explicitly enabled (PROCINFO["whatever"]=1),

In principle, setting something like PROCINFO["read_whole_files"] = 1
isn't a bad way to convey what you want.

I strongly believe that this can be attained simply by enhancing the
readfile extension with an input parser. The xxx_can_take_file() routine
would check the appropriate index and value in PROCINFO, and if true,
respond true. It would then accept the file and read it all in at one
go the first time the get_a_record routine is called, and return EOF
after that. Voila!

Awk code then becomes something like:

@load "readfile" # this would be the enhanced version
BEGIN { PROCINFO["read_whole_files"] = 1 }
# rest of the program here, as usual

Jonathan, since you've shown that you're willing to dive into code,
why not check out the chapter in the gawk documentation on the extension
facility? Then look at the code for the readfile extension and see if
you can build an input parser as I've outlined.

If so, and you're willing to sign the paperwork, I'd be willing to add
the changes to the gawk distribution. (After appropriate code review of
course. :-)

Thanks,

Arnold

de...@gissoft.eu

Feb 25, 2014, 6:05:40 PM
This is good news. I also studied the source code of gawk and implemented several fixes (in particular for the isarray() issue that was recently discussed here). However, the question is: how can we combine fixes and changes that were implemented by different people? I assume something changed in file io.c and it was a successful fix. Now you have some useful changes in the same file. How can we combine them?

I wonder if it is possible to write code that finds the needed places in the source code of gawk and makes the required changes?

Kaz Kylheku

Feb 25, 2014, 7:24:16 PM
On 2014-02-25, de...@gissoft.eu <de...@gissoft.eu> wrote:
> here). However, the question is: how can we combine fixes and changes that
> were implemented by different people?

Incredibly easily.

- patch/quilt
- diff3
- git rebase/merge/cherry-pick/...

For a single source file, if you have the ancestral version that both people
modified (let's call that "base") and the two changed versions (let's call them
"mine" and "yours"), I can merge them using

diff3 -m yours base mine > base-new

any conflicts are marked off in base-new, which we can resolve manually. Then we can replace
base with base-new.
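
A small worked case of that diff3 invocation, as a sketch: the three five-line files are invented for illustration, and the two changes touch different lines so the merge comes out clean.

```shell
tmp=$(mktemp -d)
printf 'one\ntwo\nthree\nfour\nfive\n' > "$tmp/base"    # common ancestor
printf 'ONE\ntwo\nthree\nfour\nfive\n' > "$tmp/mine"    # my change: first line
printf 'one\ntwo\nthree\nfour\nFIVE\n' > "$tmp/yours"   # your change: last line
# Three-way merge; the common ancestor goes in the middle.
merged=$(cd "$tmp" && diff3 -m yours base mine)
printf '%s\n' "$merged"
rm -rf "$tmp"
```

The merged text carries both edits. Had both sides changed the same line, diff3 would instead have wrapped the competing versions in conflict markers for manual resolution.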

(Conflicts are not always easy to resolve, but usually are. Sometimes they
are purely semantic, and not textual. Say: one developer changes an important
API, and another adds some lines of code that use the old API. Those lines of
code are nowhere near any lines that the other developer modified, so the
merge is clean, but the program doesn't build. Or worse, builds, but doesn't
run.)

All other merging basically follows from this paradigm, except it's more
convenient. Rather than toiling file by file, you do it for an entire tree of
code: merge the changes on your branch against my branch in one step.
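
A rough sketch of the same idea applied to a whole tree with git. The repository, file names, branch names ("mine", "yours") and contents are all invented for the example; the point is only that one merge command handles every file that either side touched.

```shell
work=$(mktemp -d)
cd "$work"
git init -q repo
cd repo
git config user.name "Example User"             # placeholder identity
git config user.email "user@example.invalid"

# Common ancestor ("base") with two files.
printf 'one\ntwo\nthree\nfour\nfive\nsix\nseven\n' > file1.txt
printf 'alpha\nbeta\ngamma\n' > file2.txt
git add file1.txt file2.txt
git commit -q -m "base"
base_branch=$(git symbolic-ref --short HEAD)    # whatever the default branch is called

# "yours": change the last line of file1.txt.
git checkout -q -b yours
printf 'one\ntwo\nthree\nfour\nfive\nsix\nSEVEN\n' > file1.txt
git commit -q -am "yours: edit end of file1"

# "mine": change the first line of file1.txt and also edit file2.txt.
git checkout -q "$base_branch"
git checkout -q -b mine
printf 'ONE\ntwo\nthree\nfour\nfive\nsix\nseven\n' > file1.txt
printf 'alpha\nBETA\ngamma\n' > file2.txt
git commit -q -am "mine: edit start of file1, edit file2"

# One step merges the whole tree; git does the per-file three-way merges.
git merge -q --no-edit yours

cat file1.txt file2.txt
```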

> Suppose something was changed in the file io.c
> and it was a successful fix. Now you have some useful changes in the same
> file. How can we combine them?

Merging is done every day in software development.

> I wonder whether it is possible to write code that finds the needed
> places in the gawk source code and makes the required changes?

Decades late with that idea. Source code control has been around since the early
1970s (SCCS), and merging tools have existed for decades as well.

Patches are still useful. They can be harder to integrate. It's worth learning
the quilt utility, which makes it easy to develop and maintain "stacks" of patches
applied to a code base, and to migrate these stacks of patches from one version
of a code base to another. quilt came out of the Linux kernel culture.
Linux used to be based on developers moving around patches.

If you wanted to play with someone's neat scheduler idea or virtual memory
improvement or whatever, you'd get that patch and apply it. That developer
might have "rediffed" versions of the patch against various kernel revisions
so you could pick the one that cleanly applies to yours.
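
To make the patch workflow concrete, here is a small sketch using plain diff and patch (not quilt). The directory names v1, v1-patched, v2 and the file f are hypothetical stand-ins for two revisions of a code base and a developer's modified copy.

```shell
work=$(mktemp -d)
cd "$work"
mkdir v1 v1-patched v2

# Revision 1 and the developer's modified copy of it (line "d" becomes "D").
printf 'a\nb\nc\nd\ne\nf\ng\n' > v1/f
printf 'a\nb\nc\nD\ne\nf\ng\n' > v1-patched/f

# The patch as it would be passed around. diff exits 1 when the files
# differ, which is expected here.
diff -u v1/f v1-patched/f > idea.patch

# Revision 2 has drifted: an extra line at the top shifts everything down.
printf 'new-header\na\nb\nc\nd\ne\nf\ng\n' > v2/f
cp -r v2 v2-orig

# Apply the old patch to the newer revision. The context lines still
# match, so patch applies the hunk at an offset.
patch -p1 -d v2 < idea.patch

# "Rediff" it: regenerate the patch against revision 2.
diff -u v2-orig/f v2/f > idea-v2.patch

cat v2/f
```

When the surrounding context has changed too much, patch rejects the hunk and the change has to be ported by hand, which is the part quilt helps to organize.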

Today you can just pull a branch from that developer's git repo, and if
you want the changes in a different version, then you can "rebase" them
to that version. Or you can "cherry-pick" individual commits.
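
A small sketch of cherry-picking, with an invented repository in which a "feature" branch holds two commits and only the second one is wanted. All names and contents are placeholders.

```shell
work=$(mktemp -d)
cd "$work"
git init -q repo
cd repo
git config user.name "Example User"             # placeholder identity
git config user.email "user@example.invalid"

printf 'base\n' > core.txt
git add core.txt
git commit -q -m "base"
main_branch=$(git symbolic-ref --short HEAD)

# The other developer's branch: two independent ideas, one commit each.
git checkout -q -b feature
printf 'idea A\n' > a.txt
git add a.txt
git commit -q -m "idea A"
printf 'idea B\n' > b.txt
git add b.txt
git commit -q -m "idea B"

# Back on our branch, take only the newest commit of "feature" (idea B).
git checkout -q "$main_branch"
git cherry-pick feature > /dev/null

ls
```

After this, the working tree has core.txt and b.txt but not a.txt, because only the commit that added b.txt was copied over.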

GNU Awk has a git repo. If you want to work on Awk, clone the repo. Make the
changes and create your own branch locally. You can make your git repo public
so other people can see your changes.

You can easily integrate your changes with upstream.
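
A sketch of that workflow, assuming a local directory named "upstream" stands in for the real gawk repository so that the example runs without network access. The branch name my-fix and the file names are hypothetical.

```shell
work=$(mktemp -d)
cd "$work"

# Stand-in for the upstream project.
git init -q upstream
(
  cd upstream &&
  git config user.name "Upstream Maintainer" &&
  git config user.email "maintainer@example.invalid" &&
  printf 'original\n' > main.txt &&
  git add main.txt &&
  git commit -q -m "initial upstream commit"
)

# Clone it and do the work on a local branch.
git clone -q upstream mywork
cd mywork
git config user.name "Example User"
git config user.email "user@example.invalid"
main_branch=$(git symbolic-ref --short HEAD)
git checkout -q -b my-fix
printf 'my fix\n' > fix.txt
git add fix.txt
git commit -q -m "my fix"

# Meanwhile, upstream moves on.
(
  cd ../upstream &&
  printf 'upstream change\n' > news.txt &&
  git add news.txt &&
  git commit -q -m "upstream change"
)

# Bring the local branch up to date by replaying it on top of upstream.
git fetch -q origin
git rebase -q "origin/$main_branch"

git log --oneline
```

The log then shows the local fix as the newest commit, sitting on top of both upstream commits.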

jonathan...@gmail.com

Feb 25, 2014, 7:58:41 PM
On Tuesday, February 25, 2014 7:39:37 AM UTC-6, Aharon Robbins wrote:

> Jonathan, since you've shown that you're willing to dive into code,
> why not check out the chapter in the gawk documentation on the extension
> facility? Then look at the code for the readfile extension and see if
> you can build an input parser as I've outlined.

Thanks for that -- between the bits you mentioned and what I've skimmed tonight in the manual, I think I will tinker with it a bit. If I get stuck with a design question, I may contact you off list for clarification, if you don't mind?

Thanks -- the extension facility you've incorporated looks very well thought out.

-Jonathan

admini...@mintywestapps.net

Feb 25, 2014, 8:36:05 PM
On Wednesday, February 26, 2014 8:24:16 AM UTC+8, Kaz Kylheku wrote:
> > here). However, the question is: how can we combine fixes and changes that
> > were implemented by different people?
>
> Incredibly easily.

Could we please start a new thread on gawk4 development (and maintenance)?

After reading most of the manual I can see many new possibilities and yet-to-be-discovered problems.

Great job.

I've found a few bugs and am collecting test cases to submit.

I won't submit segfaults to this list but the bugs, okay.

Perhaps even a new group for gawk4 could be called for.

Aharon Robbins

Feb 26, 2014, 3:52:02 AM
In article <af15b227-9701-47b0...@googlegroups.com>,
<admini...@mintywestapps.net> wrote:
>I've found a few bugs and am collecting test cases to submit.

These should be sent to the bug reporting address. The readers of
this newsgroup may be interested also, but postings here are not the
way to report bugs.

Please submit separate emails for each bug instead of one mail
with several reports.

>I won't submit segfaults to this list but the bugs, okay.

If you can reliably reproduce a segfault, you should also submit
information on that to the bug reporting address.