In article <kpf6d8$fq6$1...@speranza.aioe.org>,
Janis Papanagnou <janis_pa...@hotmail.com> wrote:
>On 14.06.2013 15:21, Kenny McCormack wrote:
>> In article <kpf1ur$2m1$1...@speranza.aioe.org>,
>> Janis Papanagnou <janis_pa...@hotmail.com> wrote:
>> ...
>>> RS="\0" is what you are looking for.
>>>
>>> In gawk RS="\0" is different from RS="".
>>
>> I believe this is incorrect. Setting RS to the null character does exactly
>> that.
>
>I am not sure what you are saying here. I've done that a couple times
>on Linux and Cygwin. (Details on page 56 of the GAWK manual - page
>refers to the May 2013 version; RS="\0" was there in earlier versions
>as well; at least in 3.1, IIRC.)

Sorry for the somewhat elliptical phrasing. What I meant by "does exactly
that" is that it does, in fact, set the record separator to the null
character and does, in fact, parse your input using the null character as
the record separator. I.e., there's nothing "magical" about setting the
record separator to the null character. It's not like GAWK sees that
RS == "\0" and says "Aha! The user has specified a magic value that tells me
to read the entire file into memory as a single record." Effectively,
setting RS="\0" is logically identical to setting it to "zzzzzzzzzzzz".

The example code that I gave, parsing the Linux /proc/self/environ file, was
intended to demonstrate that there do exist real world examples of files
that use the null character as a record delimiter and thus that setting
RS="\0" does have real world usefulness. I was entirely serious when I
said that I frequently do just that - and by "just that", I mean parsing
the Linux "environ" files using GAWK (setting RS="\0").

Finally, I did find and read the section you alluded to above, in the big
GAWK PDF file. Interesting reading, although it, too, is a bit elliptical.
It does state that the only way to do this (read the entire file) is to set
RS to a value known not to occur in your data - and that this is thus
theoretically impossible to do in complete generality. This, in turn,
suggests that maybe there *should* be some magical sentinel value or some
flag or something you could set - to get this functionality. My (obviously
flawed) memory was that there *was* some magic function to do this -
something like "FileGet" or some such - but it seems I've gotten my
languages confused with each other. The tribulations of using so many of
them on a daily basis...

Some things that I found a bit weird about the phrasing of that section of
the GAWK PDF file:

1) It says "You might think ..." (implying that you'd be wrong to do so)
and then a line or two later essentially says "and you'd be right" (at
least for GAWK, although it might not work in other AWKs - to which I say
"But does anyone really care about other AWKs at this point?").

2) Then it says "All other AWKs ...", to which I say "All generalizations
are false." Not to sound like a broken record, but TAWK is fine with
RS="\0" and does, in fact, store strings internally "Pascal style", not "C
style". In fact, if I were to hazard a guess, I'd say that most (if not
all - heh heh) "modern" AWKs (i.e., the ones we care about) have made this
step into the modern world. FWIW, besides GAWK & TAWK, there's MAWK and
there's "BWK's One True AWK". Not sure if there are any others...

3) The section concludes by saying that the "best" way to do this is
not to futz with RS at all, but rather to read the file in the normal,
line-at-a-time mode, concatenating it together - and then, presumably,
processing the result in the "END" block. Now, while I agree to some extent
with the underlying spirit of this advice, and I myself often give newbies
the advice to not futz with the built-in variables (FS, RS, etc) and just
"do it in code", the fact remains that doing so often breaks AWK's
"pattern-action" model. I.e., one of the basic problems with AWK is that
its nifty and oh-so-kewl "pattern-action" model is fragile and often
becomes unusable if your input isn't amenable to its use. This all argues
for a built-in way to read the whole file as a single record (i.e., one
that doesn't depend on setting RS to something you just hope won't be found
in your data).
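
The concatenate-and-process-in-END approach from point 3 can be sketched
as follows. This is only an illustration of the pattern, not code from the
manual; the names slurp_in_end, buf and chunk are arbitrary:

```shell
# Portable sketch: rebuild the input in a variable, one line at a time,
# then do the real work in END. No pattern-action rules see the data.
slurp_in_end() {
    awk '{ buf = buf $0 "\n" }                 # re-append the stripped newline
         END { n = split(buf, chunk, "\n\n")   # e.g. blank-line-separated chunks
               print "chunks=" n }'
}

printf 'a\nb\n\nc\n' | slurp_in_end
```

For this input the sketch should print chunks=2. Note that all the
processing now lives in END, and that a final line without a trailing
newline gains one in buf.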

Finally, note that where this all comes from is that I was writing a
program using the new [*] FPAT functionality in GAWK, where my FPAT
matching records span multiple lines (and include the newlines). Having
read the recommendation (in the quoted PDF file) that "the best way ...", I
suppose I could have built it all up and then used "patsplit()" in the
"END" block, but that's just so ugly. As I say, it breaks AWK's lovely
pattern-action model. Incidentally, I did change my program to do RS="\0"
instead of all the Zs, but I still think there should be a "more systemic"
way to do it.

[*] New in GAWK (as of 4.x, I think), but of course, TAWK has had it for
years...

--
First of all, I do not appreciate your playing stupid here at all.
- Thomas 'PointedEars' Lahn -