More on BEGINFILE / ENDFILE

Aharon Robbins

Jan 20, 2009, 2:10:28 PM

Hi all. Here are my thoughts on Manuel's notes; apologies for the long
delay in replying.

First, thanks for the detailed post!

> To begin with, here are a couple of issues that require clarification:
>
> - Inside a BEGINFILE block, the predefined variables FILENAME, FNR, NF
> and $0-$NF are not mutually consistent. FILENAME and FNR are updated,
> but NF and $0-$NF still keep their previous values. To be consistent,
> the latter should have null values.

I disagree; I see these as like the END block, where $0 - $NF still
have their values.

> - BEGINFILE and ENDFILE, as well as the return code of getline, are
> proper places to report input errors to the user code, instead of
> automatically handling them as fatal. The current behaviour is to set
> ERRNO in BEGINFILE for non-existent files, but also to generate a fatal
> error message and abort the whole process if the user doesn't issue a
> nextfile command. It seems that this policy is somehow related to that
> of getline in the unpatched gawk-3.1.6. Wouldn't it be better to leave
> the handling of input errors entirely to the user whenever they can be
> reported?
>
> - Error handling of non-readable files (like directories) has
> been discussed in a recent thread on gnu.utils.bug.

There are two issues here: getline, and BEGINFILE. They are not related.

I agreed that getline on a directory should not be fatal and have patched
the gawk-stable tree to make this change.

The BEGINFILE issue is a little harder. The current behavior when
there is no BEGINFILE is that a directory or other problematic file
on the command line is a fatal error.

I think this should remain the behavior; I have provided a hook for
the programmer to catch problems in BEGINFILE, but I think it's up
to the programmer to make use of it. I don't want to change the
current default for when there is no BEGINFILE, or have a BEGINFILE
that doesn't skip a bad file silently skip the bad file for the
programmer.

If there is a really compelling argument as to why I'm mistaken,
I'm willing to listen and consider it.

> With respect to the interaction between BEGINFILE/ENDFILE and
> nextfile/next/getline/exit these are my personal expectations:
>
> - Both BEGINFILE and ENDFILE, if present, should always be executed for
> each opened (or attempted to be opened) input file argument, even if the
> process is stopped prematurely.

Assuming that the current file is readable, this should be what happens;
a nextfile in the regular code should run the ENDFILE rule and then
move to the BEGINFILE rule for the next file.

> - getline should be forbidden inside BEGINFILE/ENDFILE. The current
> patch is OK in this point.

Right.

> - getline must execute BEGINFILE/ENDFILE blocks at input file changes.
> The current patch is OK in this point.

Right.

Re other posts on this: trust me, allowing anything but `getline var < "a_file"'
inside a BEGINFILE or ENDFILE would be a very, very bad idea. When I
mentioned getting into "severe recursive brain meltdown", it was my brain
whose meltdown I was trying to avoid. :-)

> - nextfile should terminate BEGINFILE (if executed inside BEGINFILE) and
> then transfer to ENDFILE, if present. The current patch is buggy in
> this aspect.

Umm, unconditionally? Even if the file was "bad"?

My original thought was that if you did a nextfile inside BEGINFILE, it
was only because there was no reason to process the file at all.

I want to think about this one a bit; you may be right though, that the
symmetry is important.

The code to make this happen could get ugly though. Sigh.

> - next should be forbidden inside BEGINFILE/ENDFILE blocks. The current
> patch is OK in this point.

Right.

> - exit should execute ENDFILE for the current file, if any, and then
> transfer to END. The current patch is buggy in this aspect.

I think this is right. I will try to fix it.

> What should I do next to further test the patch or to report the test
> details?

I will look at revising the patch. It would be helpful if you could
package up your tests somehow for inclusion in the gawk test suite
as part of the patch.

Your post was quite helpful; I appreciate your time and effort.

Thanks again,

Arnold
--
Aharon (Arnold) Robbins arnold AT skeeve DOT com
P.O. Box 354 Home Phone: +972 8 979-0381
Nof Ayalon Cell Phone: +972 50 729-7545
D.N. Shimshon 99785 ISRAEL

Manuel Collado

Jan 20, 2009, 5:55:03 PM

Aharon Robbins wrote:

> Hi all. Here are my thoughts on Manuel's notes; apologies for the long
> delay in replying.
>
> First, thanks for the detailed post!
>
>> To begin with, here are a couple of issues that require clarification:
>>
>> - Inside a BEGINFILE block, the predefined variables FILENAME, FNR, NF
>> and $0-$NF are not mutually consistent. FILENAME and FNR are updated,
>> but NF and $0-$NF still keep their previous values. To be consistent,
>> the latter should have null values.
>
> I disagree; I see these as like the END block, where $0 - $NF still
> have their values.

Well, it is probably wise to keep the current behavior unaltered, but it
could be seen as a bit inconsistent. The unpatched gawk-3.1.6 keeps NF and
$0-$NF in the END rule, but sets FILENAME and FNR according to the last
input file argument, even if it is a zero-length file ($0 is the last
record of the previous file in this case).

mawk and nawk also behave like gawk, but "The One True AWK" sets NF and
$0-$NF to null values in the END rule.

Well, I'm not formally asking for a change. Just trying to clarify the
rules.

>
>> - BEGINFILE and ENDFILE, as well as the return code of getline, are
>> proper places to notify input errors to the user code, instead of
>> automatically handle them as fatal. The current behaviour is to set
>> ERRNO in BEGINFILE for non-existent files, but also to generate a fatal
>> error message and abort the whole process if the user doesn't issue a
>> nextfile command. It seems that this policy is somehow related to that
>> of getline in the unpatched gawk-3.1.6. Shouldn't it be better to fully
>> leave to the user the right of handling input errors if they can be
>> notified? - Error handling of non readable files (like directories) has
>> been discussed in a recent thread on gnu.utils.bug.
>
> There are two issues here: getline, and BEGINFILE. They are not related.
>
> I agreed that getline on a directory should not be fatal and have patched
> the gawk-stable tree to make this change.

The documentation says that getline will return -1 for errors, and at
first sight it doesn't distinguish between open errors and read errors,
or between different kinds of files or pipes. I was surprised to find
that getline doesn't return when trying to read from a non-existent
file.

>
> The BEGINFILE issue is a little harder. The current behavior when
> there is no BEGINFILE is that a directory or other problematic file
> on the command line is a fatal error.
>
> I think this should remain the behavior; I have provided a hook for
> the programmer to catch problems in BEGINFILE, but I think it's up
> to the programmer to make use of it. I don't want to change the
> current default for when there is no BEGINFILE, or have a BEGINFILE
> that doesn't skip a bad file silently skip the bad file for the
> programmer.

I certainly agree. Few programmers actually look at the return status of
service invocations, so it is probably wise to let the gawk core handle
errors by default. But there should be a way to let the user code handle
input errors, on demand, when they can be appropriately notified, via
getline, BEGINFILE or ENDFILE. Perhaps a command-line flag?

Please also consider that ENDFILE is an appropriate place to catch
errors while reading records. And the return code of getline can be a
catch-all place.

BTW, xgawk treats XML parsing errors as non-fatal. This poses a
problem, because there is no satisfactory place to report them.
Currently, these error notifications must be handled at the next
non-error input event, or inside the END rule. An ENDFILE block would be
a much better place to handle them.

>
> If there is a really compelling argument as to why I'm mistaken,
> I'm willing to listen and consider it.

Being able to handle all input errors in user code, on demand, can
be seen as an improvement. This could be a first step towards
programming non-stop services in gawk.

As discussed in a previous message, the alternatives are:
- nextfile means skipping all further processing of the current file
- nextfile means skipping the remaining records of the current file (but
not the possible final action).

Perhaps there is an analogy between BEGINFILE/ENDFILE/nextfile, and
BEGIN/END/exit. exit inside BEGIN doesn't terminate the whole process,
but gives control to the END rule, if present.
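
The BEGIN/END/exit half of this analogy can be checked with any POSIX awk. A minimal sketch (the messages and the two input lines are made up for illustration):

```shell
# exit inside BEGIN stops everything else, but the END rule still runs.
out=$(printf 'one\ntwo\n' | awk '
BEGIN { print "begin"; exit }
      { print "record: " $0 }    # never reached: no input is read after exit
END   { print "end" }
')
printf '%s\n' "$out"
```

Only the BEGIN and END messages appear; the record rule is skipped. The proposal is that nextfile inside BEGINFILE would relate to ENDFILE in the same way.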

>
>> - next should be forbidden inside BEGINFILE/ENDFILE blocks. The current
>> patch is OK in this point.
>
> Right.
>
>> - exit should execute ENDFILE for the current file, if any, and then
>> transfer to END. The current patch is buggy in this aspect.
>
> I think this is right. I will try to fix it.
>
>> What should I do next to further test the patch or to report the test
>> details?
>
> I will look at revising the patch. It would be helpful if you could
> package up your tests somehow for inclusion in the gawk test suite
> as part of the patch.

Perhaps my stuff is too large to be included in the test suite (30~40
test runs for just a single feature). Anyway I'll try to make them
readable by other people and not just by me.

>
> Your post was quite helpful; I appreciate your time and effort.
>
> Thanks again,
>
> Arnold

Regards,
--
Manuel Collado - http://lml.ls.fi.upm.es/~mcollado

Aharon Robbins

Jan 21, 2009, 12:12:56 AM

Hi Manuel.

>>> - Inside a BEGINFILE block, the predefined variables FILENAME, FNR, NF
>>> and $0-$NF are not mutually consistent. FILENAME and FNR are updated,
>>> but NF and $0-$NF still keep their previous values. To be consistent,
>>> the latter should have null values.
>>
>> I disagree; I see these as like the END block, where $0 - $NF still
>> have their values.
>
>Well, it is probably wise to keep the current behavior unaltered, but it
>could be seen as bit inconsistent. The unpatched gawk-3.1.6 keeps NF and
>$0-$NF in the END rule, but sets FILENAME and FNR according to the last
>input file argument, even if it is a zero-length file ($0 is the last
>record of the previous file in this case).

This falls out of the implementation; it's a real corner case.

>mawk and nawk also behaves like gawk, but "The One True AWK" sets NF and
>$0-$NF to null values in the END rule.

Here "The One True AWK" is not POSIX compliant. mawk and gawk are.
I'm pretty sure Brian Kernighan knows about this, but it's not a
trivial fix; I've looked at his code.

>Well, I'm not formally asking for a change. Just trying to clarify the
>rules.

I can try to make the documentation more explicit on these issues.

>>> - BEGINFILE and ENDFILE, as well as the return code of getline, are
>>> proper places to notify input errors to the user code, instead of
>>> automatically handle them as fatal. The current behaviour is to set
>>> ERRNO in BEGINFILE for non-existent files, but also to generate a fatal
>>> error message and abort the whole process if the user doesn't issue a
>>> nextfile command. It seems that this policy is somehow related to that
>>> of getline in the unpatched gawk-3.1.6. Shouldn't it be better to fully
>>> leave to the user the right of handling input errors if they can be
>>> notified? - Error handling of non readable files (like directories) has
>>> been discussed in a recent thread on gnu.utils.bug.
>>
>> There are two issues here: getline, and BEGINFILE. They are not related.
>>
>> I agreed that getline on a directory should not be fatal and have patched
>> the gawk-stable tree to make this change.
>
>The documentation says that getline will return -1 for errors, and at
>first sight doesn't make distinctions among open errors or read errors,
>nor among different kind of files or pipes. I've been surprised by the
>fact that getline doesn't return when trying to read from a non-existent
>file.

Um, I don't know what you're referring to here:

$ gawk-3.1.6 'BEGIN { print (getline x < "/no/file") }'
-1

Am I missing something?

>> The BEGINFILE issue is a little harder. The current behavior when
>> there is no BEGINFILE is that a directory or other problematic file
>> on the command line is a fatal error.
>>
>> I think this should remain the behavior; I have provided a hook for
>> the programmer to catch problems in BEGINFILE, but I think it's up
>> to the programmer to make use of it. I don't want to change the
>> current default for when there is no BEGINFILE, or have a BEGINFILE
>> that doesn't skip a bad file silently skip the bad file for the
>> programmer.
>
>I certainly agree. Few programmers actually look at the return status of
>service invocations, so it is probably wise to let the gawk core handle
>errors by default. But there should be a way to let the user code handle
>input errors, on demand, when they can be appropriately notified, via
>getline, BEGINFILE or ENDFILE. Perhaps a command-line flag?

I don't understand what you're suggesting. getline returns -1; the
programmer must check for that. Inside a BEGINFILE, ERRNO can be non-empty;
the programmer must check for that. The hooks are there.

>Please also consider that ENDFILE is an appropriate place to catch
>errors while reading records.

What kind of errors show up while reading records that are catchable?
Can you give me an explicit example, because I'm not understanding you.

>And the return code of getline can be a catch-all place.

See above.

>BTW, xgawk sees XML parsing errors as non-fatal. And this poses a
>problem, because there is no satisfactory place where to notify them.
>Currently, these error notifications must be handled at the next
>non-error input event, or inside the END rule. An ENDFILE block would be
>a much better place to handle them.

How? By setting some special variable? Show me some code so I can
understand what you mean, please.

>> If there is a really compelling argument as to why I'm mistaken,
>> I'm willing to listen and consider it.
>
>Being able to handle all input errors in user code, on demand, can
>be seen as an improvement. This could be a first step towards
>programming non-stop services in gawk.

Again, I don't understand what you're looking for.

>>> - nextfile should terminate BEGINFILE (if executed inside BEGINFILE) and
>>> then transfers to ENDFILE, if present. The current patch is buggy in
>>> this aspect.
>>
>> Umm, unconditionally? Even if the file was "bad"?
>>
>> My original thought was that if you did a nextfile inside BEGINFILE, it
>> was only because there was no reason to process the file at all.
>>
>> I want to think about this one a bit; you may be right though, that the
>> symmetry is important.
>>
>> The code to make this happen could get ugly though. Sigh.
>
>As discussed in a previous message, the alternatives are:
>- nextfile means skipping all further processing of the current file
>- nextfile means skipping the remaining records of the current file (but
>not the possible final action).

The current code does the first; you are suggesting the second, and
I am starting to think that you are right.

>Perhaps there is an analogy between BEGINFILE/ENDFILE/nextfile, and
>BEGIN/END/exit. exit inside BEGIN doesn't terminate the whole process,
>but gives control to the END rule, if present.

Yes, I see that.

>Perhaps my stuff is too large to be included in the test suite (30~40
>test runs for just a single feature). Anyway I'll try to make them
>readable by other people and not just by me.

That would be helpful; at least for my testing. :-)

Thanks,

Manuel Collado

Jan 21, 2009, 4:37:39 AM

Aharon Robbins wrote:

> Hi Manuel.
>
>>>> - Inside a BEGINFILE block, the predefined variables FILENAME, FNR, NF
>>>> and $0-$NF are not mutually consistent. FILENAME and FNR are updated,
>>>> but NF and $0-$NF still keep their previous values. To be consistent,
>>>> the latter should have null values.
>>> I disagree; I see these as like the END block, where $0 - $NF still
>>> have their values.

>> [snipped...]

>
>> Well, I'm not formally asking for a change. Just trying to clarify the
>> rules.
>
> I can try to make the documentation more explicit on these issues.

OK. I was thinking about $0 as the FNR-th record of FILENAME. Now I've
started to think about $0 as the global NR-th input record, and this
seems consistent, even if FILENAME doesn't contain this record.

>
>>>> - BEGINFILE and ENDFILE, as well as the return code of getline, are
>>>> proper places to notify input errors to the user code, instead of
>>>> automatically handle them as fatal. The current behaviour is to set
>>>> ERRNO in BEGINFILE for non-existent files, but also to generate a fatal
>>>> error message and abort the whole process if the user doesn't issue a
>>>> nextfile command. It seems that this policy is somehow related to that
>>>> of getline in the unpatched gawk-3.1.6. Shouldn't it be better to fully
>>>> leave to the user the right of handling input errors if they can be
>>>> notified? - Error handling of non readable files (like directories) has
>>>> been discussed in a recent thread on gnu.utils.bug.
>>> There are two issues here: getline, and BEGINFILE. They are not related.
>>>
>>> I agreed that getline on a directory should not be fatal and have patched
>>> the gawk-stable tree to make this change.
>> The documentation says that getline will return -1 for errors, and at
>> first sight doesn't make distinctions among open errors or read errors,
>> nor among different kind of files or pipes. I've been surprised by the
>> fact that getline doesn't return when trying to read from a non-existent
>> file.
>
> Um, I don't know what you're referring to here:
>
> $ gawk-3.1.6 'BEGIN { print (getline x < "/no/file") }'
> -1
>
> Am I missing something?

I was speaking about non-redirected getline:

$ gawk 'BEGIN { print getline x }' /no/file
gawk: fatal: cannot open file `/no/file' for reading (No such file or
directory)

>
>>> The BEGINFILE issue is a little harder. The current behavior when
>>> there is no BEGINFILE is that a directory or other problematic file
>>> on the command line is a fatal error.
>>>
>>> I think this should remain the behavior; I have provided a hook for
>>> the programmer to catch problems in BEGINFILE, but I think it's up
>>> to the programmer to make use of it. I don't want to change the
>>> current default for when there is no BEGINFILE, or have a BEGINFILE
>>> that doesn't skip a bad file silently skip the bad file for the
>>> programmer.
>> I certainly agree. Few programmers actually look at the return status of
>> service invocations, so it is probably wise to let the gawk core handle
>> errors by default. But there should be a way to let the user code handle
>> input errors, on demand, when they can be appropriately notified, via
>> getline, BEGINFILE or ENDFILE. Perhaps a command-line flag?
>
> I don't understand what you're suggesting. getline returns -1; the
> programmer must check for that. Inside a BEGINFILE, ERRNO can be non-empty;
> the programmer must check for that. The hooks are there.

See above.

>
>> Please also consider that ENDFILE is an appropriate place to catch
>> errors while reading records.
>
> What kind of errors show up while reading records that are catchable?
> Can you give me an explicit example, because I'm not understanding you.

I wonder if the following code could be used to test for readability of
a set of files, and report failures:

BEGINFILE {
    if (ERRNO) {
        print "error opening " FILENAME ": " ERRNO
        ERRNO = ""
    }
}
ENDFILE {
    if (ERRNO) {
        print "error reading " FILENAME " after line " FNR ": " ERRNO
    }
}

>
>> And the return code of getline can be a catch-all place.
>
> See above.

Same by using getline alone:

BEGIN {
    while ((getline x) != 0) {
        if ((getline x) < 0) {
            print "error reading " FILENAME " after line " FNR ": " ERRNO
        }
    }
}

If input errors are not flagged as fatal, the above examples will
process all arguments, even if some of them are unreadable.

[snipped ...]

Aharon Robbins

Jan 22, 2009, 12:28:47 AM

In article <gl6qdb$kb9$1...@heraldo.rediris.es>,

Manuel Collado <m.co...@invalid.domain> wrote:
>> Um, I don't know what you're referring to here:
>>
>> $ gawk-3.1.6 'BEGIN { print (getline x < "/no/file") }'
>> -1
>>
>> Am I missing something?
>
>I was speaking about non-redirected getline:
>
>$ gawk 'BEGIN { print getline x }' /no/file
>gawk: fatal: cannot open file `/no/file' for reading (No such file or directory)

Ah. Well, this goes back to the fact that command-line filenames are fatal
errors when they can't be opened. It is historical practice and consistent
with the other awks:

$ mawk 'BEGIN { print getline x; print 1 }' /no/file
mawk: cannot open /no/file (No such file or directory)

$ nawk 'BEGIN { print getline x; print 1 }' /no/file
nawk: can't open file /no/file
source line number 1

I understand your point, but I worry that it is too big a break with
historical compatibility.

>>> Please also consider that ENDFILE is an appropriate place to catch
>>> errors while reading records.
>>
>> What kind of errors show up while reading records that are catchable?
>> Can you give me an explicit example, because I'm not understanding you.
>
>I wonder if the following code could be used to test for readability of
>a set of files, and report failures:
>
>BEGINFILE {
> if (ERRNO) {
> print "error opening " FILENAME ": " ERRNO
> ERRNO = ""
> }
>}

That works now; you should then add a nextfile to skip the bad file.
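
A minimal sketch of that pattern, assuming a gawk that ships BEGINFILE (still a patch at the time of this thread; released gawk has it from 4.0 on). The file names and the message text are placeholders, and the guard at the top is only there so the script does nothing on systems without gawk:

```shell
# Requires gawk with BEGINFILE support; plain POSIX awk lacks it.
command -v gawk >/dev/null 2>&1 || { echo "gawk not found" >&2; exit 0; }

tmp=$(mktemp -d)
printf 'a\nb\n' > "$tmp/good.txt"

# In BEGINFILE, a non-empty ERRNO means the file could not be opened.
# nextfile then skips it instead of letting gawk stop with a fatal error.
out=$(gawk '
BEGINFILE {
    if (ERRNO != "") {
        print "skipping " FILENAME ": " ERRNO > "/dev/stderr"
        nextfile
    }
}
{ n++ }                 # count the records of the files that did open
END { print n + 0 }
' "$tmp/missing.txt" "$tmp/good.txt" 2>/dev/null)

printf '%s\n' "$out"
rm -rf "$tmp"
```

Without the nextfile, the missing file is still a fatal error, which is the default being defended here.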

>ENDFILE {
> if (ERRNO) {
> print "error reading " FILENAME " after line " FNR ": " ERRNO
> }
>}

I don't think anything would ever get here; if the file cannot be opened
or is a directory, you catch it in the BEGINFILE block.

If the file was already successfully opened, and if, say, an NFS file
went away and read() returned -1, then getline would return -1, instead
of falling into the ENDFILE block.

>Same by using getline alone:
>
>BEGIN {
> while ((getline x) != 0) {
> if ((getline x) < 0) {
> print "error reading " FILENAME " after line " FNR ": " ERRNO
> }
> }
>}

(Your loop, BTW, checks the file's readability for every record read;
not very efficient.)

The gawk manual shows you how to do this by simply looping through ARGV,
using a redirected getline to test for readability and then removing
the bad element from ARGV. Changing gawk so that the above would work
would break historical compatibility for no really good reason.
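
A sketch of that ARGV-pruning approach, along the lines of the manual's readable.awk but simplified and written from memory, so treat the details as an assumption rather than the manual's exact text. It runs with any POSIX awk; the file names are placeholders:

```shell
tmp=$(mktemp -d)
printf 'x\ny\n' > "$tmp/ok.txt"

out=$(awk '
BEGIN {
    for (i = 1; i < ARGC; i++) {
        if (ARGV[i] ~ /^[a-zA-Z_][a-zA-Z0-9_]*=/ || ARGV[i] == "-")
            continue                  # variable assignment or standard input
        if ((getline junk < ARGV[i]) < 0)
            delete ARGV[i]            # cannot be read: drop it from the file list
        else
            close(ARGV[i])            # readable: let the main loop reopen it
    }
}
{ n++ }                               # main loop sees only the surviving files
END { print n + 0 }
' "$tmp/missing.txt" "$tmp/ok.txt")

printf '%s\n' "$out"
rm -rf "$tmp"
```

The missing file never reaches the main input loop, so no fatal error is raised and the two records of the readable file are counted.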

>If input errors are not flagged as fatal, the above examples will
>process all arguments, even some of them are unreadable.

I understand, but I think the current BEGINFILE semantics give you
the hooks to handle things adequately.

Manuel Collado

Jan 22, 2009, 4:41:02 AM

Aharon Robbins wrote:

> In article <gl6qdb$kb9$1...@heraldo.rediris.es>,
> Manuel Collado <m.co...@invalid.domain> wrote:
>>> Um, I don't know what you're referring to here:
>>>
>>> $ gawk-3.1.6 'BEGIN { print (getline x < "/no/file") }'
>>> -1
>>>
>>> Am I missing something?
>> I was speaking about non-redirected getline:
>>
>> $ gawk 'BEGIN { print getline x }' /no/file
>> gawk: fatal: cannot open file `/no/file' for reading (No such file or directory)
>
> Ah. Well, this goes back to the fact that command-line filenames are fatal
> errors when they can't be opened. It is historical practice and consistent
> with the other awks:
>
> $ mawk 'BEGIN { print getline x; print 1 }' /no/file
> mawk: cannot open /no/file (No such file or directory)
>
> $ nawk 'BEGIN { print getline x; print 1 }' /no/file
> nawk: can't open file /no/file
> source line number 1
>
> I understand your point, but I worry that it is too big a break with
> historical compatibility.

If the feature is desirable, a new command-line switch, like
"--no-fatal-input", could be used to explicitly request a change to the
historical behavior.

>
>>>> Please also consider that ENDFILE is an appropriate place to catch
>>>> errors while reading records.
>>> What kind of errors show up while reading records that are catchable?
>>> Can you give me an explicit example, because I'm not understanding you.
>> I wonder if the following code could be used to test for readability of
>> a set of files, and report failures:
>>
>> BEGINFILE {
>> if (ERRNO) {
>> print "error opening " FILENAME ": " ERRNO
>> ERRNO = ""
>> }
>> }
>
> That works now; you should then add a nextfile to skip the bad file.
>
>> ENDFILE {
>> if (ERRNO) {
>> print "error reading " FILENAME " after line " FNR ": " ERRNO
>> }
>> }
>
> I don't think anything would ever get here; If the file cannot be opened
> or is a directory, you catch it in the BEGINFILE block.
>
> If the file was already successfully opened, and if, say, an NFS file
> went away and read() returned -1, then getline would return -1, instead
> of falling into the ENDFILE block.

Hmm, you said that input errors are always fatal, and are not reported
through unredirected getline. Right?

And the normal case is to read the file by the input loop, with no
explicit getline.

>
>> Same by using getline alone:
>>
>> BEGIN {
>> while ((getline x) != 0) {
>> if ((getline x) < 0) {
>> print "error reading " FILENAME " after line " FNR ": " ERRNO
>> }
>> }
>> }
>
> (Your loop, BTW, checks the file's readability for every record read;
> not very efficient.)

Well, a file is fully readable if every record is readable.

>
> The gawk manual shows you how to do this by simply looping through ARGV,
> using a redirected getline to test for readability and then removing
> the bad element from ARGV.

This is cumbersome.

> ... Changing gawk so that the above would work
> would break historical compatibility for no really good reason.

See above (run-time switch).

>
>> If input errors are not flagged as fatal, the above examples will
>> process all arguments, even some of them are unreadable.
>
> I understand, but I think the current BEGINFILE semantics give you
> the hooks to handle things adequately.

In most cases yes, but not always. In particular the XML extension of
xgawk really needs a place to report errors in the middle of a file,
without stopping further processing of other files. I see ENDFILE as an
excellent point for handling these errors (XML parsing and encoding
conversion errors).

>
> Thanks,

Thanks for your attention.

Aharon Robbins

Jan 26, 2009, 2:57:32 PM

In article <gl9evo$60e$1...@heraldo.rediris.es>,

Manuel Collado <m.co...@invalid.domain> wrote:
>If the feature is desirable, a new command-line switch, like
>"--no-fatal-input", could be used to explicitly request to change the
>historical behavior.

Gawk already has too many options, in my humble opinion. If you need
this, just add -f readable.awk to the command line. Voila, your unreadable
files are no longer in ARGV to bother you. You can customize it to
print errors or whatever you need.

>>>>> Please also consider that ENDFILE is an appropriate place to catch
>>>>> errors while reading records.
>>>> What kind of errors show up while reading records that are catchable?
>>>> Can you give me an explicit example, because I'm not understanding you.
>>> I wonder if the following code could be used to test for readability of
>>> a set of files, and report failures:
>>>
>>> BEGINFILE {
>>> if (ERRNO) {
>>> print "error opening " FILENAME ": " ERRNO
>>> ERRNO = ""
>>> }
>>> }
>>
>> That works now; you should then add a nextfile to skip the bad file.
>>
>>> ENDFILE {
>>> if (ERRNO) {
>>> print "error reading " FILENAME " after line " FNR ": " ERRNO
>>> }
>>> }
>>
>> I don't think anything would ever get here; If the file cannot be opened
>> or is a directory, you catch it in the BEGINFILE block.
>>
>> If the file was already successfully opened, and if, say, an NFS file
>> went away and read() returned -1, then getline would return -1, instead
>> of falling into the ENDFILE block.
>
>Hummm.., you said that input errors are always fatal, and not reported
>throughout unredirected getline. Right?
>
>And the normal case is to read the file by the input loop, with no
>explicit getline.

It's not clear what you're asking. A read error has two possible outcomes
in gawk:

1. If reading via getline from a command-line file, an error is returned.
2. If reading via the main input loop, the error is fatal.

You seem to be suggesting that in case 2, the error not be fatal, but
instead go into the ENDFILE block with ERRNO set.

Given the current structure of the code, this might be doable. I have
to think about this some more as to whether it's the right thing to do.
It is something I hadn't thought about.

>>> Same by using getline alone:
>>>
>>> BEGIN {
>>> while ((getline x) != 0) {
>>> if ((getline x) < 0) {
>>> print "error reading " FILENAME " after line " FNR ": " ERRNO
>>> }
>>> }
>>> }
>>
>> (Your loop, BTW, checks the file's readability for every record read;
>> not very efficient.)
>
>Well, a file is fully readable if every record is readable.

Yes, but your code actually reads two records per iteration... :-)

I think you want:

BEGIN {
    while ((val = (getline x)) != 0) {
        if (val < 0)
            print "error reading " FILENAME " after line " FNR ": " ERRNO
    }
}
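
To see the difference between the two loops, count iterations over four input lines. A small sketch, runnable with any POSIX awk; the counter n is only for illustration, and break stands in for the error message:

```shell
# Original loop: the getline in the body consumes a second record.
double=$(printf '1\n2\n3\n4\n' | awk 'BEGIN {
    while ((getline x) != 0) {
        if ((getline x) < 0)
            break
        n++
    }
    print n + 0
}')

# Corrected loop: one getline per iteration, its result saved in val.
single=$(printf '1\n2\n3\n4\n' | awk 'BEGIN {
    while ((val = (getline x)) != 0) {
        if (val < 0)
            break
        n++
    }
    print n + 0
}')

printf 'double=%s single=%s\n' "$double" "$single"
```

The first loop finishes after two iterations because each pass reads two records; the second makes one pass per record.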

Anyway, in practice, it is hard to have a case where a file is readable
part way through the processing and then suddenly becomes unreadable.
I have to wonder if trying to catch it isn't a case of diminishing returns.

>> The gawk manual shows you how to do this by simply looping through ARGV,
>> using a redirected getline to test for readability and then removing
>> the bad element from ARGV.
>
>This is cumbersome.

Not any more so than your double getline loop... :-)

>>> If input errors are not flagged as fatal, the above examples will
>>> process all arguments, even some of them are unreadable.
>>
>> I understand, but I think the current BEGINFILE semantics give you
>> the hooks to handle things adequately.
>
>In most cases yes, but not always. In particular the XML extension of
>xgawk really needs a place to report errors in the middle of a file,
>without stopping further processing of other files. I see ENDFILE as an
>excellent point for handling these errors (XML parsing and encoding
>conversion errors).

Do these errors show up as a result of calls to read? Are they currently
fatal errors?

In other words, with the ENDFILE patch in place, can you not make use of
it as you want? What other changes are needed in the gawk internals to give
you what you're looking for?

Manuel Collado

Jan 30, 2009, 5:28:29 PM

Aharon Robbins wrote:

> In article <gl9evo$60e$1...@heraldo.rediris.es>,
> Manuel Collado <m.co...@invalid.domain> wrote:

> [...]

Yes. This is exactly what I was suggesting.

>
> Given the current structure of the code, this might be doable. I have
> to think about this some more as to whether it's the right thing to do.
> It is something I hadn't thought about.
>
>>>> Same by using getline alone:
>>>>
>>>> BEGIN {
>>>> while ((getline x) != 0) {
>>>> if ((getline x) < 0) {
>>>> print "error reading " FILENAME " after line " FNR ": " ERRNO
>>>> }
>>>> }
>>>> }
>>> (Your loop, BTW, checks the file's readability for every record read;
>>> not very efficient.)
>> Well, a file is fully readable if every record is readable.
>
> Yes, but your code actually reads two records per iteration... :-)

Ooops! My mistake. It was late night when I wrote it. My neurons slipped :-(

>
> I think you want:
>
> BEGIN {
> while ((val = (getline x)) != 0) {
> if (val < 0)
> print "error reading " FILENAME " after line " FNR ": " ERRNO
> }
> }

Of course.

>
> Anyway, in practice, it is hard to have a case where a file is readable
> part way through the processing and then suddenly becomes unreadable.

Well, if AWK gains true Unicode support sometime in the future, we could
have errors in the middle of a UTF-8 file, like "invalid byte sequence".

> I have to wonder if trying to catch it isn't a case of diminishing returns.

I'm not a salesman :-)

Seriously, forcing every user to take responsibility for error
handling would diminish returns. But allowing it for users who explicitly
request it, instead of forbidding it, would be a good thing.

>
>>> The gawk manual shows you how to do this by simply looping through ARGV,
>>> using a redirected getline to test for readability and then removing
>>> the bad element from ARGV.
>> This is cumbersome.
>
> Not any more so than your double getline loop... :-)

Ditto.

>
>>>> If input errors are not flagged as fatal, the above examples will
>>>> process all arguments, even if some of them are unreadable.
>>> I understand, but I think the current BEGINFILE semantics give you
>>> the hooks to handle things adequately.
>> In most cases yes, but not always. In particular the XML extension of
>> xgawk really needs a place to report errors in the middle of a file,
>> without stopping further processing of other files. I see ENDFILE as an
>> excellent point for handling these errors (XML parsing and encoding
>> conversion errors).
>
> Do these errors show up as a result of calls to read? Are they currently
> fatal errors?

In xgawk, reading input in XML mode is handled by feeding input text
chunks to the expat parser which in turn delivers "records" in the form
of XML SAX events. So non-well-formed XML files generate faulty "records"
in the middle of the file, and they are considered non-fatal. The
current action is to set ERRNO and automatically ignore the rest of the
file and proceed to the next input file.

This means that the error notification arrives together with the next
valid record, by which time FILENAME and the other special variables
have already been updated to refer to the new record rather than the
faulty one (no problem if the faulty file is the last one).

We could create a special XMLERROR event to report non-fatal errors at
the same level as normal records, but the addition of the ENDFILE
feature opens another possibility: reporting non-fatal errors and
automatically continuing with the next input file. This new
possibility can also be used to report errors in regular text files,
not only XML ones.

>
> In other words, with the ENDFILE patch in place, can you not make use of
> it as you want?

Yes. It can be used.

> What other changes are needed in the gawk internals to give
> you what you're looking for?

None w.r.t. XML processing. But I've always been surprised by the fact
that input errors are so strictly flagged as fatal. I guess that this
early design decision stems from the fact that there was no adequate
place to report them to the user code, other than the return code
of getline. And treating faulty records read via getline as non-fatal
while treating faulty records from the main input as fatal would
certainly be an inconsistency.

But the addition of the ENDFILE block creates a new place for reporting
main input read errors. Awk scripts with an ENDFILE rule can potentially
catch all input read errors in a uniform way.

All said, I agree that adding new features to gawk must be done very
carefully. Perhaps tawk can be used as a reference model in this
particular case.

Regards.

Aharon Robbins

Feb 3, 2009, 2:15:53 PM

In article <gma4un$lft$1...@localhost.localdomain>,
Aharon Robbins <arn...@skeeve.com> wrote:
>In article <glvuuu$ghk$1...@heraldo.rediris.es>,
>Manuel Collado <m.co...@invalid.domain> wrote:
>>> It's not clear what you're asking. A read error has two possible outcomes
>>> in gawk:
>>>
>>> 1. If reading via getline from a command-line file, an error is returned.
>>> 2. If reading via the main input loop, the error is fatal.
>>>
>>> You seem to be suggesting that in case 2, the error not be fatal, but
>>> instead go into the ENDFILE block with ERRNO set.
>>
>>Yes. This is exactly what I was suggesting.
>
>OK. The diff below should do this. It is relative to the BEGINFILE
>patch. I will be updating that patch on http://www.skeeve.com shortly.

I think the patch is wrong. I will just update the patch on skeeve.com
and post a note here when it's ready.

Sorry for the noise.

Aharon Robbins

Feb 3, 2009, 2:12:23 PM

In article <glvuuu$ghk$1...@heraldo.rediris.es>,
Manuel Collado <m.co...@invalid.domain> wrote:
>> It's not clear what you're asking. A read error has two possible outcomes
>> in gawk:
>>
>> 1. If reading via getline from a command-line file, an error is returned.
>> 2. If reading via the main input loop, the error is fatal.
>>
>> You seem to be suggesting that in case 2, the error not be fatal, but
>> instead go into the ENDFILE block with ERRNO set.
>
>Yes. This is exactly what I was suggesting.

OK. The diff below should do this. It is relative to the BEGINFILE
patch. I will be updating that patch on http://www.skeeve.com shortly.
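
To make the proposed behaviour concrete, here is a minimal C sketch of
the decision being discussed. It is an illustration only, not the patch
itself: on_main_loop_error, errno_text and the outcome enum are
hypothetical names, and the traditional flag merely stands in for
gawk's do_traditional.

```c
#include <assert.h>
#include <errno.h>   /* errno values such as EIO are what callers pass in */
#include <stdio.h>
#include <string.h>

/* Hypothetical outcomes of a read error in the main input loop. */
enum outcome { OUTCOME_FATAL, OUTCOME_ENDFILE };

/* Hypothetical stand-in for the awk-level ERRNO variable. */
char errno_text[128];

/*
 * Decide what a main-loop read error means.
 * In traditional mode the error stays fatal; otherwise the error text
 * is saved so that an ENDFILE-style hook can inspect it, and processing
 * can move on to the next input file.
 */
enum outcome on_main_loop_error(int traditional, int errcode)
{
    assert(errcode > 0);    /* only called for real errors */

    if (traditional)
        return OUTCOME_FATAL;

    snprintf(errno_text, sizeof errno_text, "%s", strerror(errcode));
    return OUTCOME_ENDFILE;
}
```

The point of the sketch is only that the reader no longer decides on
its own to abort; the caller chooses between the old fatal behaviour
and handing the error to the program.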

>> Anyway, in practice, it is hard to have a case where a file is readable
>> part way through the processing and then suddenly becomes unreadable.
>
>Well, if AWK gains true Unicode support sometime in the future, we could
>have errors in the middle of a UTF-8 file, like "invalid byte sequence".

I don't even want to think about this. This is the job iconv is meant to do.

>> Do these errors show up as a result of calls to read? Are they currently
>> fatal errors?
>
>In xgawk, reading input in XML mode is handled by feeding input text
>chunks to the expat parser which in turn delivers "records" in the form
>of XML SAX events. So non-well-formed XML files generate faulty "records"
>in the middle of the file, and they are considered non-fatal. The
>current action is to set ERRNO and automatically ignore the rest of the
>file and proceed to the next input file.
>
>This means that the error notification arrives together with the next
>valid record, by which time FILENAME and the other special variables
>have already been updated to refer to the new record rather than the
>faulty one (no problem if the faulty file is the last one).
>
>We could create a special XMLERROR event to report non-fatal errors at
>the same level as normal records, but the addition of the ENDFILE
>feature opens another possibility: reporting non-fatal errors and
>automatically continuing with the next input file. This new
>possibility can also be used to report errors in regular text files,
>not only XML ones.

I think the diff below gives you what you want, as long as your version
of get_a_record puts an appropriate value into the *errcode variable.
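
The contract described here can be sketched roughly as follows. This is
a simplified model under stated assumptions, not gawk code: struct
source, read_record and read_all are hypothetical stand-ins for the
IOBUF, get_a_record and its caller, and the "file" is simulated by a
counter so that the example is self-contained.

```c
#include <assert.h>
#include <errno.h>
#include <stddef.h>
#include <stdio.h>

/* Hypothetical, much reduced stand-in for gawk's IOBUF. */
struct source {
    const char *name;
    int records_left;   /* records that will read cleanly */
    int fail_errno;     /* errno to report afterwards, 0 for a clean EOF */
};

/*
 * Hypothetical stand-in for get_a_record(): returns a record length,
 * or EOF. On a read error it returns EOF and, if the caller supplied
 * one, stores the errno value in *errcode instead of aborting.
 */
int read_record(struct source *src, int *errcode)
{
    assert(src != NULL);

    if (src->records_left > 0) {
        src->records_left--;
        return 1;               /* pretend a one-byte record was read */
    }
    if (src->fail_errno != 0 && errcode != NULL)
        *errcode = src->fail_errno;
    return EOF;
}

/*
 * Caller side: read until EOF, then hand back whatever error the
 * reader reported (0 if the file simply ended). Returns the number
 * of records read.
 */
int read_all(struct source *src, int *saved_errno)
{
    int count = 0;
    int errcode = 0;

    while (read_record(src, &errcode) != EOF)
        count++;
    *saved_errno = errcode;
    return count;
}
```

A reader that never writes to *errcode looks exactly like a clean end
of file to the caller, which is why an alternative record reader has to
fill it in for its errors to be noticed.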

>Manuel Collado - http://lml.ls.fi.upm.es/~mcollado

Thanks again for the feedback.

Arnold
----------------------------------------------------------------------------------
--- io.c.save 2008-12-25 08:59:30.000000000 +0200
+++ io.c 2009-02-03 22:29:31.000000000 +0200
@@ -353,7 +353,8 @@
 			fname = arg->stptr;
 			errno = 0;
 			curfile = iop_open(fname, binmode("r"), &mybuf, & isdir, FALSE);
-			update_ERRNO();
+			if (! do_traditional)
+				update_ERRNO();

 			/* This is a kludge. */
 			unref(FILENAME_node->var_value);
@@ -442,17 +443,25 @@
 	char *begin;
 	register int cnt;
 	int retval = 0;
+	int errcode = 0;

 	if (at_eof(iop) && no_data_left(iop))
 		cnt = EOF;
 	else if ((iop->flag & IOP_CLOSED) != 0)
 		cnt = EOF;
 	else
-		cnt = get_a_record(&begin, iop, NULL);
+		cnt = get_a_record(&begin, iop, & errcode);

 	if (cnt == EOF) {
 		cnt = 0;
 		retval = 1;
+		if (errcode > 0) {
+			if (do_traditional)
+				fatal(_("error reading input file `%s': %s"),
+					iop->name, strerror(errcode));
+			else
+				update_ERRNO_saved(errcode);
+		}
 	} else {
 		NR += 1;
 		FNR += 1;
@@ -959,10 +968,12 @@
 			lintwarn(_("close: `%.*s' is not an open file, pipe or co-process"),
 				(int) tmp->stlen, tmp->stptr);

-		/* update ERRNO manually, using errno = ENOENT is a stretch. */
-		cp = _("close of redirection that was never opened");
-		unref(ERRNO_node->var_value);
-		ERRNO_node->var_value = make_string(cp, strlen(cp));
+		if (! do_traditional) {
+			/* update ERRNO manually, using errno = ENOENT is a stretch. */
+			cp = _("close of redirection that was never opened");
+			unref(ERRNO_node->var_value);
+			ERRNO_node->var_value = make_string(cp, strlen(cp));
+		}

 		free_temp(tmp);
 		return tmp_number((AWKNUM) -1.0);
@@ -3037,13 +3048,10 @@
 			iop->flag |= IOP_AT_EOF;
 			return EOF;
 		} else if (iop->count == -1) {
-			if (! do_traditional && errcode != NULL) {
+			iop->flag |= IOP_AT_EOF;
+			if (errcode != NULL)
 				*errcode = errno;
-				iop->flag |= IOP_AT_EOF;
-				return EOF;
-			} else
-				fatal(_("error reading input file `%s': %s"),
-					iop->name, strerror(errno));
+			return EOF;
 		} else {
 			iop->dataend = iop->buf + iop->count;
 			iop->off = iop->buf;
