Note to standards jockeys: No, this isn't a bug in your precious GAWK in
the usual "standards" sense. So, don't even bother.
The fact is that, under certain conditions, it *is* a mis-feature, and it
would be nice to at least have the option of continuing. I note in
passing that TAWK handles this rather better - you get a warning about a
missing file, but the script continues. Ideal, of course, would be a
settable option, so you can select the behavior that you want.
Note: I am talking about files read in the "automatic input loop", not
via "getline".
Obviously one solution would be to hack (fix) the GAWK source code and
recompile, but that is inconvenient for me (due to some reasons beyond
the scope of this document). So, I elected to fix it via an "interposer".
See below.
This solution works for me under Linux - you may need to adjust
accordingly for your environment.
$ cat open_fix.c
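/* A lib to fix the GAWK missing files problem */
/* Usage: export LD_PRELOAD=/path/to/this/lib */
#define _GNU_SOURCE
#include <dlfcn.h>
#include <stdarg.h>
#include <stdio.h>
#include <unistd.h>
#include <sys/types.h>
#include <sys/stat.h>
#include <fcntl.h>

/* The real open64(), looked up on first use */
static int (*myopen64) (const char *, int, ...);

int open64(const char *path, int flags, ...) {
    mode_t mode = 0;
    int ret;

    /* The optional third argument (mode) is only present with O_CREAT */
    if (flags & O_CREAT) {
        va_list ap;
        va_start(ap, flags);
        mode = (mode_t) va_arg(ap, int);
        va_end(ap);
    }
    if (!myopen64)
        myopen64 = (int (*)(const char *, int, ...)) dlsym(RTLD_NEXT, "open64");
    ret = myopen64(path, flags, mode);
    /* On any failure, substitute /dev/null for the requested file */
    return ret != -1 ? ret : myopen64("/dev/null", flags, mode);
}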
$ gcc -fPIC -W -Wall -Werror -c open_fix.c
$ ld -G -h libopen_fix.so.1 -ldl -o libopen_fix.so open_fix.o
$ LD_PRELOAD=./libopen_fix.so gawk '{print FILENAME,$0}' goodfile badfile goodfile1
Have fun!
> Obviously one solution would be to hack (fix) the GAWK source code and
> recompile, but that is inconvenient for me (due to some reasons beyond
> the scope of this document). So, I elected to fix it via an "interposer".
What's wrong with checking if the file exists like this:
awk '{print FILENAME,$0}' "$( [ -f file ]&&echo file||echo /dev/null )" etc.
(except the fact that one usually doesn't post that to a newsgroup)
Maybe because that would be impractical if you're processing many file
arguments...?
awk '{print FILENAME,$0}' prefix*.ext
Janis
Oops, makes not much sense with wildcards.
1) The above is ugly.
2) The above is ugly.
3) It doesn't scale. My issue is that I have many, many input files and
I'd hate to have to code that kludge in a loop.
4) The issue is that the files can appear or disappear at any time - so
there is a race condition going on - you can't really rely on the above
to work.
5) I was going to point out that the above is shell, so OT, but then again,
I suppose C is OT as well. But not quite so much OT as shell is.
Anyway, it works for me, and that's the important thing.
I have always thought that it is better to enhance (fix) the language,
than to kludge around it.
> In article <g8h7u1$4pq$1...@aioe.org>, pk <p...@pk.invalid> wrote:
>>On Wednesday 20 August 2008 14:48, Kenny McCormack wrote:
>>
>>> Obviously one solution would be to hack (fix) the GAWK source code and
>>> recompile, but that is inconvenient for me (due to some reasons beyond
>>> the scope of this document). So, I elected to fix it via an
>>> "interposer".
>>
>>What's wrong with checking if the file exists like this:
>>
>>awk '{print FILENAME,$0}' "$( [ -f file ]&&echo file||echo /dev/null )"
>>etc.
>>
>>(except the fact that one usually doesn't post that to a newsgroup)
>>
>
> 1) The above is ugly.
> 2) The above is ugly.
Very good technical reasons.
> 3) It doesn't scale. My issue is that I have many, many input files and
> I'd hate to have to code that kludge in a loop.
You still have to code *your* kludge.
> 4) The issue is that the files can appear or disappear at any time - so
> there is a race condition going on - you can't really rely on the
> above to work.
This is a good reason (which you didn't mention before).
> 5) I was going to point out that the above is shell, so OT, but then
> again, I suppose C is OT as well. But not quite so much OT as shell is.
Yes, of course you are the one who decides that, I had forgot it.
>>> What's wrong with checking if the file exists like this:
>>>
>>> awk '{print FILENAME,$0}' "$( [ -f file ]&&echo file||echo /dev/null
>>> )" etc.
>>
>>
>> Maybe because that would be impractical if you're processing many file
>> arguments...?
>>
>> awk '{print FILENAME,$0}' prefix*.ext
>
> Oops, makes not much sense with wildcards.
You're right, you need even more ugly kludges in that case, while the
interposer works fine because it only sees the filenames as expanded by the
shell.
>> Maybe because that would be impractical if you're processing many file
>> arguments...?
>>
>> awk '{print FILENAME,$0}' prefix*.ext
>
> Oops, makes not much sense with wildcards.
In that case the shell does all the work and the resulting file list
contains only files that actually exist. If we want to be picky, there's
still the race condition problem between the moment the shell expands the
list and awk tries to open each file.
Indeed.
>> 3) It doesn't scale. My issue is that I have many, many input files and
>> I'd hate to have to code that kludge in a loop.
>
>You still have to code *your* kludge.
One man's kludge is another man's thing of beauty.
>> 4) The issue is that the files can appear or disappear at any time - so
>> there is a race condition going on - you can't really rely on the
>> above to work.
>
>This is a good reason (which you didn't mention before).
Yes. In fact, that's the real problem - the race condition between when
the shell expands the filenames and when AWK gets around to reading them.
By the way, my input file specification is: /proc/*/cmdline
>> 5) I was going to point out that the above is shell, so OT, but then
>> again, I suppose C is OT as well. But not quite so much OT as shell is.
>
>Yes, of course you are the one who decides that, I had forgot it.
Yes. I am the boss here. And don't nobody be forgettin' it!
Yes, that's why I cancelled my original message and added this comment.
I think it's still a Good Thing to let an "invisible" layer handle that
instead of using explicit workarounds for each of the given files, and to
avoid "non-scalable" (as Kenny called it) shell constructs. That was my
intention in introducing the wildcard example in the first place: to show
the problem that arises with many file arguments.
Janis
Good post. Thanks.
I like your concept of an "invisible layer".
I agree. A fatal error in this situation stinks. You could work around it, of
course, with an up-front getline test:
$ ls f1 f2 f3
ls: cannot access f2: No such file or directory
f1 f3
$ cat f1 f3
f1, line 1
f3, line 1
$ awk '1' f1 f2 f3
f1, line 1
awk: (FILENAME=f1 FNR=1) fatal: cannot open file `f2' for reading (No such file or directory)
$ awk 'BEGIN{if ((getline<ARGV[2])<0) ARGV[2]="/dev/null"; else
close(ARGV[2])}1' f1 f2 f3
f1, line 1
f3, line 1
$ echo "f2, line 1" > f2
$ awk 'BEGIN{if ((getline<ARGV[2])<0) ARGV[2]="/dev/null"; else
close(ARGV[2])}1' f1 f2 f3
f1, line 1
f2, line 1
f3, line 1
You can't tell from the getline if the file's missing or just can't be opened
but you probably wouldn't care.
Ed.
This approach seems sensible to me. And rather than use LD_PRELOAD
to solve the problem, why not use an xgawk include file? If
you stick the following file in /usr/share/xgawk/fixopen.awk
BEGIN {
    for (i = 1; i < ARGC; i++) {
        if ((getline < ARGV[i]) < 0)
            delete ARGV[i]
        else
            close(ARGV[i])
    }
}
then you can say:
bash-3.1$ xgawk '1; END {print "DONE"}' /tmp/does_not_exist
xgawk: cmd. line:1: fatal: cannot open file `/tmp/does_not_exist' for reading (No such file or directory)
vs.
bash-3.1$ xgawk -i fixopen '1; END {print "DONE"}' /tmp/does_not_exist
DONE
That seems perhaps easier to maintain than using LD_PRELOAD. Plus
it has the advantage of deleting the file from the argument list,
so there's no need to open /dev/null in its place.
Regards,
Andy
P.S. I recognize that this has a problem in the case where the
user is passing other types of information (besides filenames)
on the command line. For some people that may be a problem;
I tend not to pass non-filename arguments very often...
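A rough sketch of how the include-file idea might cope with the non-filename case from the P.S.: skip anything that looks like a var=value assignment (or "-") and only test the rest. The wrapper name awk_skip_unreadable is made up for illustration, it uses plain awk rather than xgawk's -i option, and the up-front check is still subject to the race condition discussed elsewhere in this thread.

```sh
# awk_skip_unreadable PROGRAM [ARG...]
# Hypothetical wrapper: prepends a BEGIN rule that drops unopenable
# file operands, then runs PROGRAM as usual.
awk_skip_unreadable() {
    prog=$1
    shift
    awk '
BEGIN {
    for (i = 1; i < ARGC; i++) {
        # Leave assignments, stdin ("-") and empty slots alone.
        if (ARGV[i] ~ /^[A-Za-z_][A-Za-z0-9_]*=/ || ARGV[i] == "-" || ARGV[i] == "")
            continue
        if ((getline junk < ARGV[i]) < 0)
            delete ARGV[i]      # open failed: remove it from the list
        else
            close(ARGV[i])      # opened fine: rewind for the main loop
    }
}
'"$prog" "$@"
}
```

Something like `awk_skip_unreadable '{ print FILENAME, $0 }' sep=: f1 f2 f3` would then keep the assignment and skip a missing f2. One thing to watch: if every file operand ends up deleted, awk falls back to reading standard input, so redirect from /dev/null if that is not wanted.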
Read the rest of the thread and you will understand why this is a
non-starter.
It's historical practice. Unix awk has worked this way since forever. IF
you don't need the filenames, you could always use
cat /proc/*/cmdline 2>/dev/null | awk 'program text'
In any case, it would not be a good idea to change gawk's default
behavior in this case.
Kenny: You are, of course, welcome to fork the gawk code base and create
a language that works to your specifications. You have my blessings.
--
Aharon (Arnold) Robbins arnold AT skeeve DOT com
P.O. Box 354 Home Phone: +972 8 979-0381
Nof Ayalon Cell Phone: +972 50 729-7545
D.N. Shimshon 99785 ISRAEL
> Kenny: You are, of course, welcome to fork the gawk code base and create
> a language that works to your specifications. You have my blessings.
In this particular case, I think a command line switch to enable the
behavior could be enough.
Yeah, but I think it'd make sense to see the default behavior changed and just
do the abort if a new switch or the existing "--compat/traditional" switch was
being used.
On the other hand, I've never actually encountered this problem in real use so
it's just an opinion...
Ed.
>> In this particular case, I think a command line switch to enable the
>> behavior could be enough.
>>
>
> Yeah, but I think it'd make sense to see the default behavior changed and
> just do the abort if a new switch or the existing "--compat/traditional"
> switch was being used.
That would break the scripts that relay on the traditional behavior, but
again I never wrote one that does and I don't know how many of those are
there around (if any).
> On the other hand, I've never actually encountered this problem in real
> use so it's just an opinion...
Mine too.
> That would break the scripts that relay
Should be "rely", of course.
That's a rather cryptic response. I have read the thread. I have
always found LD_PRELOAD solutions to be hacks that are difficult to
maintain.
The point is that xgawk already exists as a testbed for new gawk
features. It currently contains some features that could help with
your problem. And it would also be a good place to add a patch to
address this particular issue, if you feel the existing xgawk
facilities aren't rich enough.
Regards,
Andy
I agree with 'pk' on this one. A switch to invoke the "non-traditional"
behavior is the way to go. While I *admire* the TAWK way, I tend to
agree that the "traditional" Unix/GAWK way is what most users expect.
>On the other hand, I've never actually encountered this problem in real use so
>it's just an opinion...
True. And that's what makes this whole thread rather, shall we say,
unique. It is hard to imagine a real world instance of this _other than_
when dealing with /proc...
Still, I think that the LD_PRELOAD method is good - obviously this
syntax functions as a "switch" - if I want this functionality, I use
LD_PRELOAD; if I don't, I don't. As I said, if I were really serious
about making this a permanent change, I'd fix it in the source, but it's
just not feasible for me to do that at the moment.
While I usually would agree with that, in this case we're talking about
something that almost never happens. I doubt anyone would add that switch
every time they invoke awk just in case it does, so a switch to invoke the
"new" behavior would probably never get used, and those who would fall over
this problem still will. There's also an alternative workaround using
getline IF you need to deal with it, so it's just pointless to add a switch
to turn ON the new behavior.
On the other hand, making the new behavior the default would almost certainly
not cause anyone any problems, and if it does they can add the new switch.
Ed.
I agree quite strongly! Having existing awk programs - including all those
relied upon for normal system functioning - suddenly have the potential to fail
silently rather than verbosely in the case of a missing file would be a very
bad idea.
I have no objection to an option to enable alternate behavior, though I'm among
those who would have little use for it.
John
--
John DuBois spc...@armory.com KC6QKZ/AE http://www.armory.com/~spcecdt/
It doesn't have to be silent, and there's no reason for it to be a
catastrophic failure like today. There's no real reason an application should
want a significant difference between trying to open a missing file vs trying
to open an unreadable file like today. And a missing file is handled
inconsistently today between being opened by getline vs being opened in the
normal work loop, so handling of missing files could seriously be considered
as broken right now, and this proposal is a fix.
> I have no objection to an option to enable alternate behavior, though I'm among
> those who would have little use for it.
Right, but then no-one would actually use it as I mentioned elsethread.
Ed.
+++ ./io.c 2008-08-22 10:30:05.534799000 -0400
@@ -316,6 +316,11 @@ nextfile(int skipping)
             if (isdir && do_traditional)
                 continue;
 #endif
+            if (whiny_users) {
+                warning(_("cannot open file `%s' for reading (%s)"),
+                        fname, strerror(errno));
+                continue;
+            }
             goto give_up;
         }
         curfile->flag |= IOP_NOFREE_OBJ;
I imagine that should satisfy the various constituencies.
Regards,
Andy
Interesting. Looks like we may have to frequently mention here on the
newsgroup, for the benefit of the various newbies, the need to set
WHINY_USERS in order to get proper functionality of GAWK.
Note that I am sort-of, semi, half-kidding. I do strongly believe that
array sorting is just natural and should always be on (unless your
arrays are really, really huge, or your machine was made during the Stone
Age, I can't see how it can cost much). However, as my posts here have made
clear, I'm not all that certain that this "file not found" issue is in
need of an over-arching solution. I.e., I could see turning
WHINY_USERS on for the array sorting, but not necessarily
wanting/needing this other feature turned on.
I suppose I should search the current sources to see what, if any, other
effects may have been tied to WHINY_USERS.
This need is set by all of the existing awk code out there, most of which is
not run interactively, and approximately none of which does any sort of error
checking on availability of input files. I do *not* want that code to continue
to produce output, exit successfully, etc. if input files are not available.
>there's no real reason an application should want a
>significant difference between trying to open a missing file vs trying to open
>an unreadable file like today
What significant difference?
> and a missing file is handled inconsistently
>today between being opened by getline vs being opened in the normal work loop
This is exactly the difference that *should* exist. In a getline loop, there
is a failure indication intrinsically available to the code. If a file is
simply skipped, there isn't.
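For illustration, here is a minimal sketch of what "a failure indication intrinsically available to the code" looks like in practice. The function name awk_getline_files is invented for this example, and it simply prints each record prefixed by its file name; the point is only that the return value of getline lets the program decide what to do when an open fails.

```sh
# awk_getline_files FILE...
# Hypothetical example: read every operand with getline inside BEGIN,
# so each open failure is visible to the program as a return value.
awk_getline_files() {
    awk '
BEGIN {
    for (i = 1; i < ARGC; i++) {
        file = ARGV[i]
        # getline returns 1 per record, 0 at end of file, -1 on error.
        while ((rc = (getline line < file)) > 0)
            print file ": " line
        if (rc < 0)
            print "warning: cannot read " file > "/dev/stderr"
        close(file)
    }
}
' "$@"
}
```

Because the open and the read happen in the same step, there is no separate existence check that could go stale between the test and the use.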
In fact, let me put it this way: If I was designing the language today, I would
make it behave (almost) exactly as it does. A file that couldn't be opened for
any reason would, by default, be a fatal error. What I might do differently:
a) provide a command-line option to make it a non-fatal error; and b) provide a
failure block which, if used, would make it an otherwise-silent non-event:
something like OPENFAIL { }.
Oh yes, and add SIGNAL { } too. Having to wrap gawk script in a shell wrapper
to catch signals -- well, it can be done, like the shell wrapper for open fail.
Grant.
--
http://bugsplatter.id.au/
It does that today if the input file is empty and *I* don't care if it's empty
or can't be opened or doesn't exist.
>
>>there's no real reason an application should want a
>>significant difference between trying to open a missing file vs trying to open
>>an unreadable file like today
>
>
> What significant difference?
There isn't one in general. I was looking at a difference that only exists on
cygwin:
$ ls -l f?
-rw-r--r-- 1 morton mkgroup-l-d 11 Aug 20 16:58 f1
---------- 1 morton mkgroup-l-d 0 Aug 23 18:49 f2
-rw-r--r-- 1 morton mkgroup-l-d 11 Aug 20 16:59 f3
$ gawk '1' f1 f2 f3
f1, line 1
f3, line 1
$ rm -f f2
$ gawk '1' f1 f2 f3
f1, line 1
gawk: (FILENAME=f1 FNR=1) fatal: cannot open file `f2' for reading (No such file or directory)
It looked like it was quietly skipping the unreadable file, but when I added
content to that file:
$ ls -l f?
-rw-r--r-- 1 morton mkgroup-l-d 11 Aug 20 16:58 f1
---------- 1 morton mkgroup-l-d 14 Aug 23 18:51 f2
-rw-r--r-- 1 morton mkgroup-l-d 11 Aug 20 16:59 f3
$ gawk '1' f1 f2 f3
f1, line 1
file2, line 1
f3, line 1
I see it's just that cygwin is ignoring the unreadable permission of f2.
>
>>and a missing file is handled inconsistently
>>today between being opened by getline vs being opened in the normal work loop
>
>
> This is exactly the difference that *should* exist. In a getline loop, there
> is a failure indication intrinsically available to the code. If a file is
> simply skipped, there isn't.
That's just a design choice. You could choose to set some standard variable and
have it available for anyone who cared to test, probably in the END section.
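A sketch of how such a variable could be emulated today with a prepended BEGIN rule, under some loud assumptions: the names awk_warn_missing, NSKIPPED and SKIPPED are invented here (no awk provides them), the check happens up front rather than at open time, and the exit status 2 is an arbitrary choice.

```sh
# awk_warn_missing PROGRAM [ARG...]
# Hypothetical wrapper: warn about unopenable operands, keep going, and
# expose the count in NSKIPPED so an END rule (or the caller) can test it.
awk_warn_missing() {
    prog=$1
    shift
    awk '
BEGIN {
    for (i = 1; i < ARGC; i++) {
        if (ARGV[i] ~ /^[A-Za-z_][A-Za-z0-9_]*=/ || ARGV[i] == "-" || ARGV[i] == "")
            continue
        if ((getline junk < ARGV[i]) < 0) {
            print "warning: cannot open " ARGV[i] > "/dev/stderr"
            SKIPPED[++NSKIPPED] = ARGV[i]
            ARGV[i] = "/dev/null"   # placeholder that yields no records
        } else
            close(ARGV[i])
    }
}
'"$prog"'
END { if (NSKIPPED > 0) exit 2 }
' "$@"
}
```

With this, `awk_warn_missing '{ print } END { print "skipped", NSKIPPED }' f1 f2 f3` still produces output for the readable files, but the non-zero exit status means a calling script does not mistake the run for a complete success.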
> In fact, let me put it this way: If I was designing the language today, I would
> make it behave (almost) exactly as it does. A file that couldn't be opened for
> any reason would, by default, be a fatal error.
Why? If you're going to do that, why not make an empty file a fatal error too?
If you care about it, why not test all the files up front and then not open any
of them rather than producing partial output? Those are rhetorical questions - I
don't really care what the answers are as what to do is just a matter of
opinion, BUT a fatal error tears the rug out from under you in terms of handling
various input conditions.
> What I might do differently: a) provide a command-line option to make it a
> non-fatal error; and b) provide a failure block which, if used, would make
> it an otherwise-silent non-event: something like OPENFAIL { }.
I agree with both of those, though obviously I'd switch the default behavior.
Ed.
I think what most people are arguing is that you *can't* change the
default behavior, however much we wish it had been done right (IOHO) in
the beginning, because it *might* break existing code. The situation is
much the same as that which has Solaris keeping two very broken programs
around (and making them the default on the default PATH). I am
referring, of course, to their keeping /bin/awk (very broken) and
/bin/sh (original sh, warts and all, even as the world is moving towards
the so-called "POSIX" shell).
P.S. IOHO: In our humble opinion
In the current savannah CVS source, I see only 3 places where WHINY_USERS
matters:
1. Array index sorting (using qsort) in "for" loops.
2. The new patch to turn an open failure into a warning instead of a
fatal error.
3. In the profiling code, there is a place in pp_string_fp where
it changes how characters are printed (octal vs. %c). I'm not
exactly sure how that manifests itself.
Perhaps WHINY_USERS should be a bitmask to selectively enable one or more
of these features.
Regards,
Andy
This whole discussion reminds me of the BEGINFILE/ENDFILE proposal
that was discussed in this group in 2006. If that extension were
implemented, then it would be much easier to handle missing files
(e.g. the BEGINFILE rule could test whether the file is readable and
skip it if not). There has always been an issue of whether a gawk
program might be interested in knowing whether a zero-length file
was supplied as an argument. Currently, there's no simple way to
detect that situation. But with BEGINFILE/ENDFILE, that situation
could be detected, and I think we could also find a way to handle
unreadable files. Perhaps this is worth implementing as an
xgawk extension?
Regards,
Andy
> There has always been an issue of whether a gawk
> program might be interested in knowing whether a zero-length file
> was supplied as an argument. Currently, there's no simple way to
> detect that situation. But with BEGINFILE/ENDFILE, that situation
> could be detected, and I think we could also find a way to handle
> unreadable files. Perhaps this is worth implementing as an
> xgawk extension?
You probably remember that Peter Saveliev has already
implemented this and supplied a documented patch
against xgawk:
http://lml.ls.fi.upm.es/~mcollado/xmlgawk/b-e-g-summary.html
http://xgawk.radlinux.org/Articles/patch-fileworks/show
This was motivated by a different whiny user, so that characters with
the high bit set (e.g. Chinese) come out in the output in the same way
they went in.
>Perhaps WHINY_USERS should be a bitmask to selectively enable one or
>more of these features.
Nah.
Arnold
I seem to recall that the gawk manual shows you exactly how to do this
with a library file if it's important to you. I think it was less than 20
lines of code, and all it takes is adding a -f xxx.awk (I forget the name)
to the command line.
We can argue back and forth forever as to what the "right" design decision
is, but as is the case with many things in awk, we (I) am constrained by
both historical practice and standards. Since there's an easy workaround
for those who want it, using standard awk features, I don't see this as
a major issue.
Yes, the gawk ARGIND extension gives the ability to
detect zero-length files. It is discussed here:
http://www.gnu.org/manual/gawk/html_node/Empty-Files.html
I think many of the issues discussed here can be addressed by including
awk code libraries, such as that one. Note that xgawk makes it a bit
easier to do this (by adding an @include directive). However, it seems to
me that BEGINFILE would still be needed if there's a desire to be able
to change parsing mode based on the filename. Suppose, for example, that
you wanted to change the value of RS based on the filename (or, in the
case of xgawk, you want to switch to XML parsing mode for a file that ends
in .xml). This needs to be done before the 1st record is read. Is there
any way to do this without having a BEGINFILE hook? In practice, what I do
is use getline to take control of this process instead of passing the
filename on the command line. That solves the problem for me, but I know
that some people really detest getline and prefer to use command-line file
arguments for all processing...
Regards,
Andy
I do need the filenames. Still, this is an interesting "yet another
shell/script kludge" solution to the problem. I imagine if one really
wanted to go down this route, one could kludge something up with
something like "pr" that does preserve the filenames - probably piping
the output of "pr" through AWK to do the cleanup, etc, etc.
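One way such a filename-preserving kludge might look, as a sketch only: let the shell open each file and hand the name to awk through the environment. The function name awk_each and the FNAME variable are invented for this example. Note the trade-offs: awk is started once per file, so variables and END rules do not span files, and FILENAME is empty because awk reads standard input.

```sh
# awk_each PROGRAM FILE...
# Hypothetical wrapper: run PROGRAM once per file on stdin. If the shell
# cannot open a file it prints its own warning and moves on to the next.
awk_each() {
    prog=$1
    shift
    status=0
    for f in "$@"; do
        # The name is passed via the environment: use ENVIRON["FNAME"].
        FNAME=$f awk "$prog" < "$f" || status=1
    done
    return "$status"
}
```

For the /proc case this would be called as `awk_each '{ print ENVIRON["FNAME"] ": " $0 }' /proc/*/cmdline`. A process that vanishes between glob expansion and open just produces a shell warning for that one name, which is roughly the TAWK behavior described at the top of the thread.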
>In any case, it would not be a good idea to change gawk's default
>behavior in this case.
Agreed - although it now seems that you have, in fact, put together an
"official" patch for this.
>Kenny: You are, of course, welcome to fork the gawk code base and create
>a language that works to your specifications. You have my blessings.
Well, thank you for that. But as you know, I'm not really interested in
changing the "official" sources. Rather, I explicitly went for a
solution that doesn't require recompiling GAWK itself (but still has the
benefit of being a "system" (C-code level) solution).
Here is my final (?) code for this. It functions as a drop-in
replacement for "gawk" (can be used in the #! line). Compile as
indicated in the comments. Yes, this is unabashedly Linux-specific.
/* A lib to fix the GAWK missing files problem */
/* Usage: export LD_PRELOAD=/path/to/this/lib */
/* Compile via:
 * gcc -s -W -Wall -Werror -fpic -pie -rdynamic -o libopen_fix.so open_fix.c -ldl
 */
#define _GNU_SOURCE
#include <dlfcn.h>
#include <stdarg.h>
#include <stdio.h>
#include <stdlib.h>
#include <unistd.h>
#include <sys/types.h>
#include <sys/stat.h>
#include <fcntl.h>
#include <time.h>

int open64(const char *path, int flags, ...) {
    static int (*real_open64) (const char *, int, ...);
    static char *warn;
    mode_t mode = 0;
    int ret;

    /* The optional third argument (mode) is only present with O_CREAT */
    if (flags & O_CREAT) {
        va_list ap;
        va_start(ap, flags);
        mode = (mode_t) va_arg(ap, int);
        va_end(ap);
    }
    /* First call: look up the real open64() and check OPENFIX_WARN */
    if (!real_open64) {
        real_open64 = (int (*)(const char *, int, ...)) dlsym(RTLD_NEXT, "open64");
        if ((warn = getenv("OPENFIX_WARN"))) {
            time_t t = time(0);
            fprintf(stderr, "openfix: Initializing at: %s", ctime(&t));
        }
    }
    if ((ret = real_open64(path, flags, mode)) != -1) return ret;
    if (warn)
        fprintf(stderr, "openfix: open failed for '%s' (flags = %d)\n", path, flags);
    /* Any failed open is replaced by an open of /dev/null */
    return real_open64("/dev/null", flags, mode);
}

/* Run as a program: preload this same file into gawk, then exec gawk */
int main(int argc, char **argv) {
    char *buff = malloc(512);
    int ret;

    if (!buff)
        perror("malloc"), exit(1);
    /* readlink() does not NUL-terminate, so leave room for the terminator */
    if ((ret = readlink("/proc/self/exe", buff, 511)) == -1)
        perror("readlink"), exit(1);
    buff[ret] = 0;
    setenv("LD_PRELOAD", buff, 1);
    if (getenv("OPENFIX_WARN"))
        printf("LD_PRELOAD = '%s'\n", getenv("LD_PRELOAD"));
    execvp("gawk", argv);
    perror("gawk");
    exit(!!argc);
}
Yes, it does. Checking for Readable Data Files and Checking For
Zero-length Files, both on pg. 195 of the Oct. 2007 version of the
manual, GAWK: Effective AWK Programming.
Keep in mind that none of these "script kludges" are effective in the
cases where it matters (e.g., when processing files in /proc).
That's true. I suspect the only way to handle such a situation is to
have a BEGINFILE rule that is called after the open has been attempted.
If the file open failed, then BEGINFILE might be called with ERRNO set
to a non-NULL string. Inside the BEGINFILE rule, one could call
nextfile to skip on error. If nextfile is not called from BEGINFILE,
then this would be a fatal error (after BEGINFILE processing has
completed). That approach would give the same behavior as now in the
default case (where there is no BEGINFILE rule), but it would provide a
hook for recovering if the file open fails (by calling nextfile from
BEGINFILE if ERRNO is non-null).
Unfortunately, I don't think either of the 2 BEGINFILE/ENDFILE patches
currently floating around gives this behavior. But perhaps it could
be achieved?
Regards,
Andy
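A sketch of the hook proposed above. As far as I know, gawk releases from 4.0 onward document a BEGINFILE rule with essentially these semantics (ERRNO non-empty on a failed open, nextfile to skip, fatal error otherwise), so this needs such a gawk and is not something the 2008 sources or the patches mentioned here provide. The wrapper name gawk_skip_failed is made up for illustration.

```sh
# gawk_skip_failed PROGRAM [ARG...]
# Hypothetical wrapper (assumes gawk 4.0 or later): prepend a BEGINFILE
# rule that warns about, and then skips, any file that failed to open.
gawk_skip_failed() {
    prog=$1
    shift
    gawk '
BEGINFILE {
    # ERRNO is empty when the open succeeded.
    if (ERRNO != "") {
        print "warning: cannot open " FILENAME ": " ERRNO > "/dev/stderr"
        nextfile
    }
}
'"$prog" "$@"
}
```

Since the rule runs after the open has actually been attempted, a call such as `gawk_skip_failed '{ print FILENAME, $0 }' /proc/*/cmdline` is not exposed to the check-then-open race that the script-level workarounds have.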
I don't understand why you guys are still beating this horse.
The problem has been solved.
Time to move on - and slay other dragons.
> I don't understand why you guys are still beating this horse.
>
> The problem has been solved.
This is probably because of other ongoing discussions
that are not visible here in comp.lang.awk.