On 07.02.2014 19:34, Jonathan Hankins wrote:
> On Friday, February 7, 2014 9:38:28 AM UTC-6, Janis Papanagnou wrote:
>>
>> In GNU awk you can use RS="\0" to parse files with '\0' terminated
>> records (like ksh's .sh_history file). I'd be interested how one would
>> parse data files with '\0' terminated lines using non-GNU awk's.[*]
>>
>
> I believe a pre-condition to answering this for any specific awk
> implementation is understanding whether or not the implementation uses C
> strings for file I/O. Since C strings are NUL-delimited, handling input
> containing NUL characters would not be supported in such an
> implementation.
>
> It's conceivable that any given implementation could use something other
> than C strings for file I/O, which is part of why GAWK is able to handle
> this type of input.
File I/O will use the read/write functions with buffers and lengths, so
I suppose we should focus on whether the data is internally processed
as, and thus restricted by, C strings.
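For reference, the GNU awk behaviour that my question takes as given
can be demonstrated with artificial data (rather than an actual
.sh_history file):
printf 'one\0two\0three\0' |
gawk 'BEGIN{RS="\0"} {printf "record %d: %s\n", NR, $0}'
which should print the three records "one", "two", "three" on
separate lines.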
>
> Once you have determined that a specific implementation can handle NUL
> characters in file I/O, you have to determine if the implementation's
> grammar parser allows you to process the input in a way that will give you
> the result you need. For example, can the string literal "\0" be used to
> specify NUL? Can RS therefore be set to NUL, and if so, does it "do the
> right thing" and split records on NUL characters? Do other built-in
> functions such as split() accept NUL in the regular expression argument,
> and if so, do they "do the right thing"?
Well, yes; it's crucial what the language _grammar_ defines, but the
actual parser of an implementation must also support what the awk
grammar generally defines. The original awk book by Aho, Weinberger,
and Kernighan defines what is possible in regexps, and the escape
sequence \ddd (with ddd being one to three octal digits) allows one to
specify the ASCII NUL character, e.g. as \0. Therefore I'd conclude
that an awk implementation that has problems processing data with NULs
would be broken; i.e. if there's no other restriction explicitly
mentioned. Generally I don't expect an implementation restriction
("C strings") to invalidate a language grammar or a language
specification.
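To make that concrete, here's a small check that could be run against
any awk in question (an embedded NUL is fed through a record, and the
regexp /\0/ is used to make it visible):
printf 'a\0b\n' |
awk '{n=gsub(/\0/,"<NUL>");print n, $0}'
By the reasoning above this ought to print "1 a<NUL>b"; whether a
particular implementation actually does so is exactly the open
question.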
You can certainly do split($0,a,/\0/), and my expectation would be that
this works in non-GNU awks as well. (But I have no other awks available
at the moment to check.) For consistency I'd hope that strings in awk
wouldn't treat the NUL differently either, so that RS="\0" or
split($0,a,"\0") would do the trick as well. (But I seem to recall
having heard that RS="\0" behaves like RS="" in other awks, so I
suppose that split($0,a,"\0") might then also have a problem and
probably behave differently from split($0,a,/\0/).)
If split($0,a,/\0/) works in other awks, that would be the answer to
my question. (If split($0,a,"\0") or RS="\0" also works, even better.)
Can someone confirm how split($0,a,/\0/) behaves? A possible test
case could be
printf "A\n\0B\n\0C\n" |
awk 'BEGIN{RS=SUBSEP}
{n=split($0,a,/\0/);for(i=1;i<=n;i++)printf "%d: \"%s\"\n",i,a[i]}'
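And the analogous test with a string separator, to compare "\0" (as a
string) against /\0/ (as a regexp):
printf "A\n\0B\n\0C\n" |
awk 'BEGIN{RS=SUBSEP}
{n=split($0,a,"\0");for(i=1;i<=n;i++)printf "%d: \"%s\"\n",i,a[i]}'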
>
> Note that these issues are not awk specific -- you could be asking the
> same questions about any utility, such as tr, sort, etc.
Well, if the tool specifies restrictions on processing NUL that's fine;
otherwise I'd assume that all characters are processable. (I am aware
that there can be bad surprises in some cases anyway.)
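(As a pragmatic fallback, if some awk turns out not to cope with NULs
at all, one could let tr translate them away first, provided the data
doesn't also contain real newlines, e.g.
printf 'one\0two\0three\0' |
tr '\0' '\n' |
awk '{printf "record %d: %s\n", NR, $0}'
but that of course sidesteps the question rather than answering it.)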
Janis
>
> -Jonathan Hankins
>