How to parse data files with '\0' terminated lines?

Janis Papanagnou

Feb 7, 2014, 10:38:28 AM

In GNU awk you can use RS="\0" to parse files with '\0' terminated records
(like ksh's .sh_history file). I'd be interested how one would parse data
files with '\0' terminated lines using non-GNU awk's.[*]

Janis

[*] @Kenny, yes, I use gawk, so this is actually an academic question,
but I am interested in serious answers anyway and hope you don't mind me
posting this question.
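
For reference, the gawk behaviour described in the question can be sketched as follows. This is a minimal illustration that assumes gawk is installed under the name "gawk"; the function name list_nul_records is hypothetical and only serves the example.

```shell
# list_nul_records: hypothetical helper, assumes GNU awk is installed as "gawk".
# Reads NUL-terminated records from stdin (or from the files given as
# arguments) and prints each record with its record number.
list_nul_records() {
    gawk 'BEGIN { RS = "\0" } { printf "%d: %s\n", NR, $0 }' "$@"
}
```

It could be called as "list_nul_records somefile" or fed from a pipe, e.g. printf 'one\0two\0' | list_nul_records.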

Jonathan Hankins

Feb 7, 2014, 1:34:26 PM

I believe a pre-condition to answering this for any specific awk implementation is understanding whether or not the implementation uses C strings for file I/O. Since C strings are NUL-delimited, handling input containing NUL characters would not be supported in such an implementation.

It's conceivable that any given implementation could use something other than C strings for file I/O, which is part of why GAWK is able to handle this type of input.

Once you have determined that a specific implementation can handle NUL characters in file I/O, you have to determine if the implementation's grammar parser allows you to process the input in a way that will give you the result you need. For example, can the string literal "\0" be used to specify NUL? Can RS therefore be set to NUL, and if so, does it "do the right thing" and split records on NUL characters? Do other built-in functions such as split() accept NUL in the regular expression argument, and if so, do they "do the right thing"?

Note that these issues are not awk specific -- you could be asking the same questions about any utility, such as tr, sort, etc.

-Jonathan Hankins

Janis Papanagnou

Feb 8, 2014, 4:45:00 AM

On 07.02.2014 19:34, Jonathan Hankins wrote:
> On Friday, February 7, 2014 9:38:28 AM UTC-6, Janis Papanagnou wrote:
>>
>> In GNU awk you can use RS="\0" to parse files with '\0' terminated
>> records (like ksh's .sh_history file). I'd be interested how one would
>> parse data files with '\0' terminated lines using non-GNU awk's.[*]
>>
>

> I believe a pre-condition to answering this for any specific awk
> implementation is understanding whether or not the implementation uses C
> strings for file I/O. Since C strings are NUL-delimited, handling input
> containing NUL characters would not be supported in such an
> implementation.
>
> It's conceivable that any given implementation could use something other
> than C strings for file I/O, which is part of why GAWK is able to handle
> this type of input.

File I/O will use the read/write functions with buffers and lengths, so
I suppose we should focus on whether data is internally processed as and
restricted by the C strings.

>
> Once you have determined that a specific implementation can handle NUL
> characters in file I/O, you have to determine if the implementation's
> grammar parser allows you to process the input in a way that will give you
> the result you need. For example, can the string literal "\0" be used to
> specify NUL? Can RS therefore be set to NUL, and if so, does it "do the
> right thing" and split records on NUL characters? Do other built-in
> functions such as split() accept NUL in the regular expression argument,
> and if so, do they "do the right thing"?

Well, yes; what the language _grammar_ defines is crucial, but the
actual parser of an implementation must support what the awk grammar
generally defines. The original awk book by Aho, Weinberger, and
Kernighan defines what is possible in regexps, and the escape sequence
\ddd (with ddd being a sequence of 1-3 octal digits) allows one to define
the ASCII NUL character, e.g. as \0. Therefore I'd conclude that an
awk implementation that has problems processing data with NULs would
be broken; i.e. if there's no other restriction explicitly mentioned.
Generally I don't expect an implementation restriction ("C strings")
to invalidate a language grammar or a language specification.

You can certainly do split($0,a,/\0/), and my expectation would be that
this works in non-GNU awks as well. (But I have no other awks available
at the moment to check.) For consistency reasons I'd hope that strings
in awk wouldn't handle the NUL differently either, so that RS="\0"
or split($0,a,"\0") would do the trick as well. (But I seem to recall
having heard that RS="\0" would behave like RS="" in other awks, and
so I suppose that split($0,a,"\0") might also have a problem
and probably behave differently from split($0,a,/\0/).)

If split($0,a,/\0/) worked in other awks, that would be the
answer to my question. (If split($0,a,"\0") or RS="\0" also worked,
even better.)

Can someone confirm how split($0,a,/\0/) would behave? A possible test
case could be

printf "A\n\0B\n\0C\n" |
awk 'BEGIN{RS=SUBSEP}
{n=split($0,a,/\0/);for(i=1;i<=n;i++)printf "%d: \"%s\"\n",i,a[i]}'
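
For an awk that cannot hold NUL in its strings at all, one way around the question is to keep NUL away from awk entirely. The following is a minimal sketch, not something proposed in the thread: it assumes the data never contains the byte \001, and the function name nul_records is hypothetical. tr turns each newline into \001 and each NUL into a newline, so awk sees ordinary newline-terminated records and restores the embedded newlines afterwards.

```shell
# nul_records: hypothetical helper name.
# Assumption: the input contains no \001 bytes, so \001 can stand in
# for the newlines embedded inside the NUL-terminated records.
nul_records() {
    tr '\n\0' '\001\n' |
    awk '{ gsub(/\001/, "\n"); printf "%d: \"%s\"\n", NR, $0 }'
}

# Same sample data as the test case above.
printf 'A\n\0B\n\0C\n' | nul_records
```

The choice of \001 as the placeholder is arbitrary; any byte known to be absent from the data would do.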

>
> Note that these issues are not awk specific -- you could be asking the
> same questions about any utility, such as tr, sort, etc.

Well, if the tool specifies restrictions on processing NUL that's fine,
otherwise I'd assume that all characters are processable. (I am aware
that there can be bad surprises in some cases anyway.)

Janis

>
> -Jonathan Hankins
>

Kenny McCormack

Feb 8, 2014, 7:25:12 AM

In article <ld4uas$c4d$1...@news.m-online.net>,
Janis Papanagnou <janis_pa...@hotmail.com> wrote:
...

>Well, if the tool specifies restrictions on processing NUL that's fine,
>otherwise I'd assume that all characters are processable. (I am aware
>that there can be bad surprises in some cases anyway.)

I'm pretty sure that Arnold has mentioned once or twice that "mawk" a) uses
"C strings" and b) (therefore) cannot handle nulls in strings. And,
indeed, the man page ("man mawk") says:

BUGS
mawk cannot handle ascii NUL \0 in the source or data files. You
can output NUL using printf with %c, and any other 8 bit character is
acceptable input.

As you know, I don't much care for "standards" and language-lawyering, so
I'm not going to get all steamy about how mawk must therefore be a "bad"
implementation. I just accept things as they are...

That said, and FWIW, I'm pretty sure it [*] is a "dark corner" sort of thing.
I'm also pretty sure that GAWK counts it as a "GAWK feature" (that is, a
Good Thing) that GAWK handles it correctly.

[*] And by "it", we mean an AWK implementation being able to handle nulls
in strings.

--
Faced with the choice between changing one's mind and proving that there is
no need to do so, almost everyone gets busy on the proof.

- John Kenneth Galbraith -

Dave Sines

Feb 9, 2014, 9:57:34 AM

Kenny McCormack <gaz...@shell.xmission.com> wrote:
> In article <ld4uas$c4d$1...@news.m-online.net>,
> Janis Papanagnou <janis_pa...@hotmail.com> wrote:
> ...
>>Well, if the tool specifies restrictions on processing NUL that's fine,
>>otherwise I'd assume that all characters are processable. (I am aware
>>that there can be bad surprises in some cases anyway.)
>
> I'm pretty sure that Arnold has mentioned once or twice that "mawk" a) uses
> "C strings" and b) (therefore) cannot handle nulls in strings. And,
> indeed, the man page ("man mawk") says:
>
> BUGS
> mawk cannot handle ascii NUL \0 in the source or data files. You
> can output NUL using printf with %c, and any other 8 bit character is
> acceptable input.

That's no longer the case.

http://www.invisible-island.net/mawk/CHANGES-contents.html#t20090726

Aharon Robbins

Feb 14, 2014, 6:41:48 AM

In article <ujhksax...@perseus.wenlock-data.co.uk>,

Dave Sines <dave.gma...@googlemail.com.invalid> wrote:
>> BUGS
>> mawk cannot handle ascii NUL \0 in the source or data files. You
>> can output NUL using printf with %c, and any other 8 bit character is
>> acceptable input.
>
>That's no longer the case.
>
> http://www.invisible-island.net/mawk/CHANGES-contents.html#t20090726

It looks like this is only for RS and not general purpose handling of NUL
characters.
--
Aharon (Arnold) Robbins arnold AT skeeve DOT com
P.O. Box 354 Home Phone: +972 8 979-0381
Nof Ayalon
D.N. Shimshon 9978500 ISRAEL

Dave Sines

Feb 14, 2014, 11:23:22 AM

Aharon Robbins <arn...@skeeve.com> wrote:
> In article <ujhksax...@perseus.wenlock-data.co.uk>,
> Dave Sines <dave.gma...@googlemail.com.invalid> wrote:
>>> BUGS
>>> mawk cannot handle ascii NUL \0 in the source or data files. You
>>> can output NUL using printf with %c, and any other 8 bit character is
>>> acceptable input.
>>
>>That's no longer the case.
>>
>> http://www.invisible-island.net/mawk/CHANGES-contents.html#t20090726
>
> It looks like this is only for RS and not general purpose handling of NUL
> characters.

That came in release 20090920 (note caveat re stdin) with an update for
split() in 20100419.

<http://www.invisible-island.net/mawk/CHANGES-index.html#t20090920>

+ two changes for embedded nulls, allows FS to be either a null or
contain a character class with null, e.g., '\000' or '[ \000]':

+ modify built-in regular expression functions to accept embedded
nulls.

+ modify input reader FINgets() to accept embedded nulls in data read
from files. Data read from standard input is line-buffered, and is
still null-terminated.

<http://www.invisible-island.net/mawk/CHANGES-index.html#t20100419>

+ modify split() to handle embedded nulls in the string to split, e.g.,
BEGIN{s="a\0b"; print length(s); n = split(s,f,""); print n}
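
Along the lines of the changelog example, a quick probe can show whether a given awk keeps a NUL inside a string. This is a rough sketch only: the function name probe_nul and the wording of the messages are made up for illustration, and the result depends entirely on the awk being tested. An awk that preserves the NUL should report a length of 3 for "a\0b".

```shell
# probe_nul: hypothetical helper; pass the awk binary to test (default: awk).
# Prints whether the string "a\0b" keeps all three characters.
probe_nul() {
    "${1:-awk}" 'BEGIN {
        s = "a\0b"
        if (length(s) == 3)
            print "NUL kept in strings"
        else
            print "NUL not kept in strings"
    }'
}

probe_nul awk
```

The same check can be pointed at another implementation, e.g. probe_nul mawk or probe_nul gawk, if those are installed.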