
getline summary: variants, caveats, applications, tips, misuse, howto


Ed Morton

Nov 5, 2006, 10:23:21 AM
Getline Summary, November 5th 2006
----------------------------------

The following summary, composed by Ed Morton to address the recurring
issue of getline (mis)use, was based primarily on information from the
book "Effective Awk Programming", Third Edition By Arnold Robbins;
(http://www.oreilly.com/catalog/awkprog3) with review and additional
input from many of the comp.lang.awk regulars, including Steve Calfee,
Martin Cohen, Manuel Collado, Jürgen Kahrs, Kenny McCormack, Janis
Papanagnou, Anton Treuenfels, Thomas Weidenfeller, John LaBadie and
Edward Rosten.


Getline
-------
getline is fine when used correctly (see below for a list of those
cases), but it's best avoided by default because:

a) It allows people to stick to their preconceived ideas of how to
program rather than learning the easier way that awk was designed to
read input. It's like C programmers continuing to do procedural
programming in C++ rather than learning the new paradigm and the
supporting language constructs.

b) It has many insidious caveats that come back to bite you either
immediately or in future. The succeeding discussion captures some of
those and explains when getline IS appropriate.

As the book "Effective Awk Programming", Third Edition By Arnold
Robbins; (http://www.oreilly.com/catalog/awkprog3) which provides much
of the source for this discussion says:

"The getline command is used in several different ways and should not be
used by beginners. ... come back and study the getline command after you
have reviewed the rest ... and have a good knowledge of how awk works."

Variants
--------
The following summarises the eight variants of getline applications,
listing which variables are set by each one:

Variant                  Variables Set
getline                  $0, ${1...NF}, NF, FNR, NR, FILENAME
getline var              var, FNR, NR, FILENAME
getline < file           $0, ${1...NF}, NF
getline var < file       var
command | getline        $0, ${1...NF}, NF
command | getline var    var
command |& getline       $0, ${1...NF}, NF
command |& getline var   var

The "command |& ..." variants are GNU awk (gawk) extensions. gawk also
populates the ERRNO builtin variable if getline fails.

Although calling getline is very rarely the right approach (see
below), if you need to do it the safest ways to invoke getline are:

if/while ( (getline var < file) > 0)
if/while ( (command | getline var) > 0)
if/while ( (command |& getline var) > 0)

since those do not affect any of the builtin variables and they allow
you to correctly test for getline succeeding or failing. If you
need the input record split into separate fields, just call "split()"
to do that.
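
As a rough sketch of the first of those forms, the following reads a
small scratch file from a BEGIN block and uses split() to get at the
fields. The file contents and the names "line", "f" and "n" are made up
for illustration; any POSIX awk should behave the same way.

```shell
tmp=$(mktemp)
printf 'alice 30\nbob 25\n' > "$tmp"

out=$(awk -v file="$tmp" 'BEGIN {
    # Test the return value: 1 = got a record, 0 = EOF, -1 = error.
    while ((getline line < file) > 0) {
        n = split(line, f)    # fields land in f[1..n]; $0 and NF stay untouched
        print f[1], "is", f[2], "(" n " fields)"
    }
    close(file)
}')
printf '%s\n' "$out"
rm -f "$tmp"
```

Because only "line" is assigned, nothing the main record loop relies on
($0, NF, NR, FNR) is disturbed.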

Caveats
-------
Users of getline have to be aware of the following non-obvious effects
of using it:

a) Normally FILENAME is not set within a BEGIN section, but a
non-redirected call to getline will set it.

b) Calling "getline < FILENAME" is NOT the same as calling "getline".
The second form will read the next record from FILENAME while
the first form will read the first record again.

c) Calling getline without a var to be set will update $0, the fields
and NF so they will have different values for subsequent processing than
they had for prior processing in the same condition/action block.

d) Many of the getline variants above set some but not all of the
builtin variables, so you need to be very careful that it's
setting the ones you need/expect it to.

e) According to POSIX, `getline < expression' is ambiguous if
expression contains unparenthesized operators other than `$'; for
example, `getline < dir "/" file' is ambiguous because the
concatenation operator is not parenthesized. You should write it
as `getline < (dir "/" file)' if you want your program to be
portable to other awk implementations.

f) In POSIX-compliant awks (e.g. gawk --posix) a failure of getline
(e.g. trying to read from a non-readable file) will be fatal to
the program, otherwise it won't.

g) Unredirected getline can defeat the simple and usual rule to handle
input file transitions:

FNR==1 { ... start of file actions ... }

File transitions can occur at getlines, so FNR==1 also needs to be
checked after each unredirected getline (i.e. one that is not reading
from a specific file name).
e.g. if you want to print the first line of each of these files:

$ cat file1
a
b
$ cat file2
c
d

you'd normally do:

$ awk 'FNR==1{print}' file1 file2
a
c

but if a "getline" snuck in, it could have the unexpected consequence of
skipping the test for FNR==1 and so not printing the first line of the
second file.

$ awk 'FNR==1{print}/b/{getline}' file1 file2
a

h) Using getline in the BEGIN section to skip lines makes your program
difficult to apply to multiple files. e.g. with data like...

some header line
----------------
data line 1
data line 2
...
data line 10000

you may consider using...

BEGIN { getline header; getline }
{ whatever_using_header_and_data_on_the_line() }

instead of...

FNR == 1 { header = $0 }
FNR < 3 { next }
{ whatever_using_header_and_data_on_the_line() }

but the getline version would not work on multiple files since the BEGIN
section would only be executed once, before the first file is processed,
whereas the non-getline version would work as-is. This is one example of
the common case where the getline command itself isn't directly causing
the problem, but the type of design you can end up with if you select a
getline approach is not ideal.
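
Caveats (b) and (c) are easy to see on a three-line scratch file. This
is only an illustrative sketch; the file contents and the variable names
"again" and "before" are invented here.

```shell
tmp=$(mktemp)
printf 'a\nb\nc\n' > "$tmp"

out=$(awk 'NR == 1 {
    getline again < FILENAME   # caveat (b): opens the file afresh, so reads "a" again
    close(FILENAME)            # closes only the redirected stream
    before = $0
    getline                    # caveat (c): $0 is replaced by the next record
    print "redirected read:", again
    print "$0 before:", before, "after:", $0
}' "$tmp")
printf '%s\n' "$out"
rm -f "$tmp"
```

The redirected read returns "a" a second time, while the plain getline
moves the main input on to "b" and overwrites $0 in the middle of the
action.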

Applications
------------
getline is an appropriate solution for the following:

a) Reading from a pipe, e.g.:

command = "ls"
while ( (command | getline var) > 0) {
print var
}
close(command)

b) Reading from a coprocess, e.g.:

command = "LC_ALL=C sort"
n = split("abcdefghijklmnopqrstuvwxyz", a, "")

for (i = n; i > 0; i--)
print a[i] |& command
close(command, "to")

while ((command |& getline var) > 0)
print "got", var
close(command)

c) In the BEGIN section, reading some initial data that's referenced
during processing multiple subsequent input files, e.g.:

BEGIN {
while ( (getline var < ARGV[1]) > 0) {
data[var]++
}
close(ARGV[1])
ARGV[1]=""
}
$0 in data

d) Recursive-descent parsing of an input file or files, e.g.:

awk 'function read(file) {
while ( (getline < file) > 0) {
if ($1 == "include") {
read($2)
} else {
print > ARGV[2]
}
}
close(file)
}
BEGIN{
read(ARGV[1])
ARGV[1]=""
close(ARGV[2])
}1' file1 tmp

In all other cases, it's clearest, simplest, less error-prone, and
easiest to maintain to let awk's normal text-processing read the records.
In the case of "c", whether to use the BEGIN+getline approach or just
collect the data within the awk condition/action part after
testing for the first file is largely a style choice.

"a" above calls the UNIX command "ls" to list the current directory
contents, then prints the result one line at a time.

"b" above writes the letters of the alphabet in reverse order, one per
line, down the two-way pipe to the UNIX "sort" command. It then closes
the write end of the pipe, so that sort receives an end-of-file
indication. This causes sort to sort the data and write the sorted
data back to the gawk program. Once all of the data has been read,
gawk terminates the coprocess and exits. Closing the write end is particularly
necessary in order to use the UNIX "sort" utility as part of a coprocess since
sort must read all of its input data before it can produce any output.
The sort program does not receive an end-of-file indication until gawk
closes the write end of the pipe. Other programs can be invoked as just:

command = "program"
do {
print data |& command
command |& getline var
} while (data left to process)
close(command)

Note that calling close() with a second argument is also gawk-specific.

"c" above reads every record of the first file passed as an argument to
awk into an array and then for every subsequent file passed as an
argument will print every record from that file that matches any of
the records that appeared in the first file (and so are stored in the
"data" array). This could alternatively have been implemented as:

# fails if first file is empty
NR==FNR{ data[$0]++; next }
$0 in data

or:

FILENAME==ARGV[1] { data[$0]++; next }
$0 in data

or:

FILENAME=="" { data[$0]++; next }
$0 in data

or (gawk only):

ARGIND==1 { data[$0]++; next }
$0 in data
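
To see that the BEGIN+getline form of "c" and the NR==FNR alternative
give the same answer, here is a small sketch run against two scratch
files. The file names "keys" and "data" and their contents are
assumptions made purely for the demonstration.

```shell
dir=$(mktemp -d)
printf 'b\nd\n' > "$dir/keys"
printf 'a\nb\nc\nd\n' > "$dir/data"

# BEGIN+getline form: load the first file, then blank it out of ARGV
# so the main loop only sees the second file.
out_getline=$(awk 'BEGIN {
    while ((getline var < ARGV[1]) > 0) data[var]++
    close(ARGV[1])
    ARGV[1] = ""
}
$0 in data' "$dir/keys" "$dir/data")

# Same job done inside the normal record loop.
out_loop=$(awk 'NR==FNR { data[$0]++; next } $0 in data' "$dir/keys" "$dir/data")

printf '%s\n' "$out_getline" "$out_loop"
rm -rf "$dir"
```

Both commands print the records of "data" that also appear in "keys".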

"d" above not only expands all the lines that say "include subfile", but
by writing the result to a tmp file, resetting ARGV[1] (the highest
level input file) and not resetting ARGV[2] (the tmp file), it then lets
awk do any normal record parsing on the result of the expansion since
that's now stored in the tmp file. If you don't need that, just do the
"print" to stdout and remove any other references to a tmp file or
ARGV[2]. In this case, since it's convenient to use $1 and $2, and no
other part of the program references any builtin variables, getline was
used without populating an explicit variable. This method is limited in
its recursion depth to the total number of open files the OS permits at
one time.
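
A minimal sketch of that simpler variant of "d", printing straight to
stdout with no tmp file, might look like the following. The file names
"main" and "sub" and their contents are hypothetical.

```shell
dir=$(mktemp -d)
printf 'top 1\ninclude sub\ntop 2\n' > "$dir/main"
printf 'sub 1\nsub 2\n' > "$dir/sub"

out=$(cd "$dir" && awk 'function read(file) {
    while ((getline < file) > 0) {
        if ($1 == "include")
            read($2)          # descend into the included file
        else
            print             # ordinary line: copy it to stdout
    }
    close(file)
}
BEGIN { read(ARGV[1]) }' main)

printf '%s\n' "$out"
rm -rf "$dir"
```

Since the program consists only of a BEGIN section, awk never starts its
own record loop, so there is no need to reset ARGV[1].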

Tips
----
The following tips may help if, after reading the above, you discover
you have an appropriate application for getline or if you're looking for
an alternative solution to using getline:

a) If you need to distinguish between a normal EOF or some read or
opening error, you have to use gawk's ERRNO variable or code it as:

if/while ( (e = (getline var < file)) > 0) { ... }
close(file)
if(e < 0) some_error_handling

b) Don't forget to close() any file you open for reading. The
common idiom for getline and other methods of opening files/streams is:

cmd="some command"
do something with cmd
close(cmd)

c) A common misapplication of getline is to just skip a few lines of an
input file. The following discusses how to do that without using getline
with all that implies as discussed above. This discussion builds on the
common awk idiom to "decrement a variable to zero" by putting the
decrement of the variable as the second term in an "and" clause with the
first part being the variable itself, so the decrement only occurs if
the variable is non-zero:

i) Print the Nth record after some pattern:

awk 'c&&!--c;/pattern/{c=N}' file

ii) Print every record except the Nth record after some pattern:

awk 'c&&!--c{next}/pattern/{c=N}1' file

iii) Print the N records after some pattern:

awk 'c&&c--;/pattern/{c=N}' file

iv) Print every record except the N records after some pattern:

awk 'c&&c--{next}/pattern/{c=N}1' file
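
The four idioms can be tried side by side on an eight-line scratch file.
This is just a sketch with the pattern /3/ and N=2 picked arbitrarily;
the two "every record except" forms carry a trailing "1" pattern so that
the records which are not skipped actually get printed.

```shell
tmp=$(mktemp)
for i in 1 2 3 4 5 6 7 8; do echo "line $i"; done > "$tmp"

# i) the Nth record after the pattern
nth=$(awk -v N=2 'c&&!--c; /3/{c=N}' "$tmp")
# ii) everything except the Nth record after the pattern
all_but_nth=$(awk -v N=2 'c&&!--c{next} /3/{c=N} 1' "$tmp")
# iii) the N records after the pattern
next_n=$(awk -v N=2 'c&&c--; /3/{c=N}' "$tmp")
# iv) everything except the N records after the pattern
all_but_next_n=$(awk -v N=2 'c&&c--{next} /3/{c=N} 1' "$tmp")

printf '%s\n' "i) $nth" "iii) $next_n"
rm -f "$tmp"
```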

In the following example (the input has no blank lines and all output
is aligned with the left-hand column) you want to print $0 for the
second record following the record that contains some pattern, e.g. the
number 3:

$ cat file
line 1
line 2
line 3
line 4
line 5
line 6
line 7
line 8
$ awk '/3/{getline;getline;print}' file
line 5

That works just fine. Now let's see the concise way to do it without
getline:

$ awk 'c&&!--c;/3/{c=2}' file
line 5

It's not quite so obvious at a glance what that does, but it uses an
idiom that most awk programmers could do well to learn and it is briefer
and avoids all those getline caveats.

Now let's say we want to print the 5th line after the pattern instead of
the 2nd line. Then we'd have:

$ awk '/3/{getline;getline;getline;getline;getline;print}' file
line 8
$ awk 'c&&!--c;/3/{c=5}' file
line 8

i.e. we have to add a whole series of additional getline calls to the
getline version, as opposed to just changing the counter from 2 to 5 for
the non-getline version. In reality, you'd probably completely rewrite
the getline version to use a loop:

$ awk '/3/{for (c=1;c<=5;c++) getline; print}' file
line 8

Still not as concise as the non-getline version, has all the getline
caveats and required a redesign of the code just to change a counter.

Now let's say we also have to print the word "Eureka" if the number 4
appears in the input file. With the getline version, you now have to do
something like:

$ awk '/3/{for (c=1;c<=5;c++) { getline; if ($0 ~ /4/) print "Eureka!" }
print}' file
Eureka!
line 8

whereas with the non-getline version you just have to do:

$ awk 'c&&!--c;/3/{c=5}/4/{print "Eureka!"}' file
Eureka!
line 8

i.e. with the getline version, you have to work around the fact that
you're now processing records outside of the normal awk work-loop,
whereas with the non-getline version you just have to drop your test for
"4" into the normal place and let awk's normal record processing deal
with it like it always does.

Actually, if you look closely at the above you'll notice we just
unintentionally introduced a bug in the getline version. Consider what
would happen in both versions if 3 and 4 appear on the same line. The
non-getline version would behave correctly, but to fix the getline
version, you'd need to duplicate the condition somewhere, e.g. perhaps
something like this:

$ awk '/3/{for (c=1;c<=5;c++) { if ($0 ~ /4/) print "Eureka!"; getline }
if ($0 ~ /4/) print "Eureka!"; print}' file
Eureka!
line 8

Now consider how the above would behave when there aren't 5 lines left
in the input file or when the last line of the file contains both a 3
and a 4. i.e. there are still design questions to be answered and bugs
that will appear at the limits of the input space.
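
As a sketch of the first of those limits, take the simpler loop version
(without the Eureka test) and a scratch file with only two records after
the match. When getline hits end of file it returns 0 and leaves $0
alone, so the getline version ends up printing the last record it
managed to read, while the counter version prints nothing.

```shell
tmp=$(mktemp)
for i in 1 2 3 4 5; do echo "line $i"; done > "$tmp"

# getline version: the return value is never checked, so the three
# failed reads go unnoticed and the stale $0 is printed.
with_getline=$(awk '/3/{for (c=1;c<=5;c++) getline; print}' "$tmp")

# counter version: c never reaches zero, so nothing is printed.
without_getline=$(awk 'c&&!--c; /3/{c=5}' "$tmp")

printf 'getline: [%s]\ncounter: [%s]\n' "$with_getline" "$without_getline"
rm -f "$tmp"
```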

Ignoring those bugs since this is not intended as a discussion on
debugging getline programs, let's say you no longer need to print the
5th record after the number 3 but still have to do the Eureka on 4. With
the getline version, you'd strip out the test for 3 and the getline
stuff to be left with:

$ awk '{if ($0 ~ /4/) print "Eureka!"}' file
Eureka!

which you'd then presumably rewrite as:

$ awk '/4/{print "Eureka!"}' file
Eureka!

which is what you get just by removing everything involving the test for
3 and counter in the non-getline version (i.e. "c&&!--c;/3/{c=5}"):

$ awk '/4/{print "Eureka!"}' file
Eureka!

i.e. again, one small requirement change required a complete redesign of
the getline code, but just the absolute minimum necessary tweak to the
non-getline version.

So, what you see above in the getline case was significant redesign
required for every tiny requirement change, much larger amounts of
handwritten code required, insidious bugs introduced during development
and challenging design questions at the limits of your input space,
whereas the non-getline version always had less code, was much easier to
modify as requirements changed, and was much more obvious, predictable,
and correct in how it would behave at the limits of the input space.

Kenny McCormack

Nov 5, 2006, 11:16:16 AM
In article <pI-dna4bi7VxnNPY...@comcast.com>,

Ed Morton <mor...@lsupcaemnt.com> wrote:
>Getline Summary, November 5th 2006
>----------------------------------
>
>The following summary, composed by Ed Morton to address the recurring
>issue of getline (mis)use, was based primarily on information from the
>book "Effective Awk Programming", Third Edition By Arnold Robbins;
>(http://www.oreilly.com/catalog/awkprog3) with review and additional
>input from many of the comp.lang.awk regulars, including Steve Calfee,
>Martin Cohen, Manuel Collado, Jürgen Kahrs, Kenny McCormack, Janis
>Papanagnou, Anton Treuenfels, Thomas Weidenfeller, John LaBadie and
>Edward Rosten.

Not to take anything away from what you've written, but, just for
conciseness, doesn't it all boil down to:
1) Never use non-redirected getline (i.e., it only makes sense
if you are using a named file, a pipe, or a co-process)
2) Almost never use getline w/o a named variable (i.e., if you
are using the versions that populate $0, be sure you know
what you're doing and why)

Ed Morton

Nov 5, 2006, 12:39:34 PM
Kenny McCormack wrote:

That's about what this part was trying to say:

> Although calling getline is very rarely the right approach (see
> below), if you need to do it the safest ways to invoke getline are:
>
> if/while ( (getline var < file) > 0)
> if/while ( (command | getline var) > 0)
> if/while ( (command |& getline var) > 0)

Regards,

Ed.

Janis

Nov 6, 2006, 9:05:35 AM
Ed Morton wrote:
>
> Getline Summary, November 5th 2006
> <snip>

>
> or:
>
> FILENAME==ARGV[1] { data[$0]++; next }
> $0 in data
>
> or:
>
> FILENAME=="" { data[$0]++; next }
> $0 in data

I am somewhat concerned about these examples; the former is expecting
arguments (does not fit using stdin, just explicit arguments; okay).
Though the latter compares to "", what is this supposed to do? On my
gawk (on Cygwin) reading from stdin will have a "-" in FILENAME (if
reading from stdin was the intention). Or just a typo?

Thinking about that (and with recent discussions in mind) I wonder
about a (or "the") sophisticated way to write awk programs that work
consistently with both, stdin and arguments. Usually I avoid ARGV[].

Janis

Ed Morton

Nov 6, 2006, 1:59:15 PM
Janis wrote:
> Ed Morton wrote:
>
>>Getline Summary, November 5th 2006
>><snip>
>>
>>or:
>>
>> FILENAME==ARGV[1] { data[$0]++; next }
>> $0 in data
>>
>>or:
>>
>> FILENAME=="" { data[$0]++; next }
>> $0 in data
>
>
> I am somewhat concerned about these examples; the former is expecting
> arguments (does not fit using stdin, just explicit arguments; okay).
> Though the latter compares to "", what is this supposed to do? On my
> gawk (on Cygwin) reading from stdin will have a "-" in FILENAME (if
> reading from stdin was the intention). Or just a typo?

It's a typo. I originally wrote it as:

FILENAME=="<specific name>"

but I was going to put it on a web page and when it got to the server
everything between the "<...>"s got stripped and I didn't notice, then
when I did the copy/paste back to the NG, it ended up as you saw. Thanks
for catching it.

>
> Thinking about that (and with recent discussions in mind) I wonder
> about a (or "the") sophisticated way to write awk programs that work
> consistently with both, stdin and arguments. Usually I avoid ARGV[].

I think the challenge would be to see if you can come up with an
application that lets you read stdin twice without explicitly storing
the input. That's what I often manipulate ARGV[] for when reading from
files. e.g.:

$ cat file
the record
$ awk 'BEGIN{ARGV[ARGC++]=ARGV[1]}{print FILENAME":",$0}' file
file: the record
file: the record
$ echo "the record" | awk 'BEGIN{ARGV[ARGC++]=ARGV[1]}{print FILENAME":",$0}'
-: the record

Regards,

Ed.

Ed Morton

Nov 11, 2006, 9:53:17 AM
Getline Summary, November 11th 2006
-----------------------------------

The following summary, composed by Ed Morton to address the recurring
issue of getline (mis)use, was based primarily on information from the
book "Effective Awk Programming", Third Edition By Arnold Robbins;
(http://www.oreilly.com/catalog/awkprog3) with review and additional
input from many of the comp.lang.awk regulars, including Steve Calfee,
Martin Cohen, Manuel Collado, Jürgen Kahrs, Kenny McCormack, Janis
Papanagnou, Anton Treuenfels, Thomas Weidenfeller, John LaBadie and
Edward Rosten.


Getline
-------
getline is fine when used correctly (see below for a list of those
<snip>

or:

FILENAME=="specificFileName" { data[$0]++; next }
