Using a regexp as field separator does not work!

Ronny

Jul 10, 2008, 10:01:03 AM

I'm using gawk 3.1.0 (Windows native) and 3.1.6 (Cygwin). My input file
contains fields which are separated by a vertical bar, optionally followed by
spaces. Here is a small test program for this file format:

BEGIN {
    print "run starts"
    FS="| *"
}

{
    print "processing line with",NF,"fields {",$0,"}"
}

When using the following 2-line input file:

# set art
0,1,4|set art

I get as output:

run starts
processing line with 3 fields { # set art }
processing line with 2 fields { 0,1,4|set art }

Still, a space is taken as field separator!

What am I doing wrong?

Ronald

Loki Harfagr

Jul 10, 2008, 10:22:15 AM

Thu, 10 Jul 2008 07:01:03 -0700, Ronny did cat :

You want a regexp but you described a string, change:
FS="| *"
to be:
FS=/| */

Ronny

Jul 10, 2008, 10:40:57 AM

On 10 Jul., 16:22, Loki Harfagr <l...@thedarkdesign.free.fr.INVALID>
wrote:

> Thu, 10 Jul 2008 07:01:03 -0700, Ronny did cat :

> > BEGIN {
> > print "run starts"
> > FS="| *"
> > }

> > Still, a space is taken as field separator!

> You want a regexp but you described a string, change:
> FS="| *"
> to be:
> FS=/| */

Agreed, your solution works. Now I reread the man pages more carefully, and
there is one thing which puzzles me:

... If FS is a single character, fields are separated by that character.
If FS is the null string, then each individual character becomes a
separate field. Otherwise, FS is expected to be a full regular
expression...

I agree that this explains why I have to use // for describing my regexp. But
now I wonder what is the semantics of my original code? Taking the man page
literally, FS can be either the null string, or a string of length one, or a
regexp. It doesn't say anything about FS being a string of length > 1. If this
is forbidden, I would have expected an error message, but it was accepted...

Ronald

Ronny

Jul 10, 2008, 10:56:01 AM

Plus, I also just noticed an error in my regexp. Since the
vertical bar is a metacharacter, I should write it as

/\| */

Ronald

Ronny

Jul 10, 2008, 11:04:00 AM

Hmmmm... still does not work. Here is my modified program:

BEGIN {
    print "run starts"
    FS=/[|] */
}

{
    print "processing line with",NF,"fields {STATUS=",$1," CMD=",$2,"}"
}

My input file contains:

0,1,4|set art

and I get as result:

run starts
processing line with 2 fields {STATUS= CMD= ,1,4|set art }

This means it does get two fields, but splits at the beginning of the line!

Playing around, I found that the correct way to write the FS assignment goes
like this:

FS="[|] *"

So the problem was not the usage of a string instead of a regexp, but that my
regexp was wrong....
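
A likely explanation for the split at the start of the line, assuming the usual awk rule that a bare regexp constant used as a value is shorthand for ($0 ~ /re/): inside BEGIN, $0 is empty, so FS=/[|] */ assigns the match result 0, and the separator becomes the one-character string "0". A minimal sketch using the sample line from above (the shell variable names are only for illustration):

```shell
# A bare /re/ used as a value is evaluated as ($0 ~ /re/), giving 0 or 1.
# In BEGIN, $0 is empty, so /[|] */ cannot match and FS is set to 0.
fs_value=$(awk 'BEGIN { FS = /[|] */; print FS }')
echo "FS is now: $fs_value"

# With FS="0" the sample line is split at its leading "0",
# leaving an empty first field and the rest of the line in $2.
second_field=$(printf '0,1,4|set art\n' | awk 'BEGIN { FS = /[|] */ } { print $2 }')
echo "second field: $second_field"
```

This matches the output shown above: an empty STATUS and a CMD that starts with ",1,4".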

Ronald

Dave B

Jul 10, 2008, 11:45:46 AM

Ronny wrote:

> Agreed, your solution works. Now I reread the man pages more
> carefully, and
> there is one thing which puzzles me:
>
> ... If FS is a single
> character, fields are separated by that character. If FS is
> the null string, then each individual character becomes a separate

> field. Otherwise, FS is expected to be a full regular expression...

>
> I agree that this explains why I have to use // for describing my
> regexp.

No, you don't. The following works:

awk -v FS='\\| *'

The problem is that "|" must be escaped twice.
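
To illustrate the two levels of escaping (a sketch only; the input line is the sample from the first post with a few extra spaces added): the string value \\| * is first reduced to \| * by awk's string-escape processing, and only then compiled as a regexp, in which \| is a literal bar. A bracket expression avoids the backslashes altogether.

```shell
line='0,1,4|   set art'

# String escapes turn \\| into \|, which the regexp then reads as a literal bar.
with_backslashes=$(printf '%s\n' "$line" |
    awk -v FS='\\| *' '{ print NF ": [" $1 "] [" $2 "]" }')

# Inside a bracket expression the bar is not a metacharacter, so no escaping is needed.
with_brackets=$(printf '%s\n' "$line" |
    awk -v FS='[|] *' '{ print NF ": [" $1 "] [" $2 "]" }')

echo "$with_backslashes"
echo "$with_brackets"
```

Both spellings should report two fields, with the spaces after the bar swallowed by the separator.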

> But now I wonder what is the semantics of my original code?

FS="| *"

This uses as field separator either nothing, or *zero* or more spaces. An
empty FS is undefined by the standard, and is allowed as a special case by
GNU awk, but only through the special syntax FS='', or FS=, or -F ''.

In your case, I'm not sure how the regex is parsed. It seems to behave as if
runs of space are used as separator, but not with awk's default semantics:

$ echo ' abc de f' |awk -v FS='| *' '{print NF;for(i=1;i<=NF;i++)print$i}'
4

abc
de
f

But I'm not sure about what goes on behind the scenes here. Hopefully
someone will shed some light here.

> Taking the man page literally, FS can be either the null string, or a
> string of length one, or a regexp. It doesn't say anything about FS
> being a string of length > 1.

In that case, it's taken as a regex.

> If this is forbidden, I would have expected an error message, but it
> was accepted...

Because it's perfectly valid.

--
awk 'BEGIN{O="~"~"~";o="=="=="==";o+=+o;x=o""o;while(X++<x-o-O)c=c"%c";
X=O""O;printf c,O+x*o*o+X,(X+x)*(O+o)-o,+X*X-o-O,o+x*o*o+X,x*o*o+X-o-o,
x*(o+o)+X-O,+X*X-X+o+o,x+x+x-o,o+X+O+o+x*o*o,x+O+x*o*o,x*o*o+x+O+o+o+O,
x+o+x*o*o,x+x*o*o+O,o+x+x*o*o,o+X*o*o,X+x*o*o,x*o*o+O+x,x+x*o*o-O,X-O}'

Ed Morton

Jul 10, 2008, 12:03:10 PM

An FS that's a single blank character is a special case that treats contiguous
sequences of any white space as a single separator and, importantly, strips off
leading white space from the record. You specified an FS that's an RE instead of
a single blank character so you do not get the "special" behavior, just like you
wouldn't if you specified a literal blank character:

-------------
$ echo ' abc de f' |awk -v FS=' ' '{print NF;for(i=1;i<=NF;i++)print$i}'
3
abc
de
f

$ echo ' abc de f' |awk -v FS=' *' '{print NF;for(i=1;i<=NF;i++)print$i}'
4

abc
de
f
$ echo ' abc de f' |awk -v FS='[ ]' '{print NF;for(i=1;i<=NF;i++)print$i}'
5

abc
de

f
--------------

Note the leading white space above when you don't use the single blank character FS.

>
>
>>Taking the man page literally, FS can be either the null string, or a
>>string of length one, or a regexp. It doesn't say anything about FS
>>being a string of length > 1.
>
>
> In that case, it's taken as a regex.
>
>
>>If this is forbidden, I would have expected an error message, but it
>>was accepted...
>
>
> Because it's perfectly valid.
>

For the OP, this is what you want:

BEGIN {
    print "run starts"
    FS="[|] *"
}

{
    print "processing line with",NF,"fields {",$0,"}"
}
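
As a quick check, the corrected separator can be run against the two-line sample input from the first post. This is only a sketch: the print statement is shortened here to show the field count and the first field, and the variable name result is made up for the example.

```shell
# Run the corrected FS on the two-line sample input from the first post.
result=$(printf '# set art\n0,1,4|set art\n' | awk '
BEGIN { FS = "[|] *" }
{ print NF " field(s): first=[" $1 "]" }
')
echo "$result"
# 1 field(s): first=[# set art]
# 2 field(s): first=[0,1,4]
```

The comment line no longer splits on its spaces, and the data line splits only at the bar.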

Regards,

Ed.

Dave B

Jul 10, 2008, 12:27:35 PM

Ed Morton wrote:

>> FS="| *"
>>[snip]

> An FS that's a single blank character is a special case that treats contiguous
> sequences of any white space as a single separator and, importantly, strips off
> leading white space from the record. You specified an FS that's an RE instead of
> a single blank character so you do not get the "special" behavior, just like you
> wouldn't if you specified a literal blank character:

Yes, I know that...but I'm not sure how awk determines it has encountered a
"field separator" if the regex '| *' is used as FS. What part of the
alternation is used? It seems that it's effectively treated like ' *', but
the corner case where a "nothingness" matches (which is allowed by ' *', and
which would thus make it behave similarly as if FS='') never happens. These
seem to be equivalent:

$ echo ' abc de f' |awk -v FS=' *' '{print NF;for(i=1;i<=NF;i++)print$i}'

$ echo ' abc de f' |awk -v FS='| *' '{print NF;for(i=1;i<=NF;i++)print$i}'

But I don't know why.

> For the OP, this is what you want:
>
> BEGIN {
> print "run starts"
> FS="[|] *"

or FS='\\| *'

Ed Morton

Jul 10, 2008, 12:46:23 PM

On 7/10/2008 11:27 AM, Dave B wrote:
> Ed Morton wrote:
>
>
>>>FS="| *"
>>>[snip]
>>
>>An FS that's a single blank character is a special case that treats contiguous
>>sequences of any white space as a single separator and, importantly, strips off
>>leading white space from the record. You specified an FS that's an RE instead of
>>a single blank character so you do not get the "special" behavior, just like you
>>wouldn't if you specified a literal blank character:
>
>
> Yes, I know that...but I'm not sure how awk determines it has encountered a
> "field separator" if the regex '| *' is used as FS. What part of the
> alternation is used? It seems that it's effectively treated like ' *', but
> the corner case where a "nothingness" matches (which is allowed by ' *', and
> which would thus make it behave similarly as if FS='') never happens. These
> seem to be equivalent:
>
> $ echo ' abc de f' |awk -v FS=' *' '{print NF;for(i=1;i<=NF;i++)print$i}'
>
> $ echo ' abc de f' |awk -v FS='| *' '{print NF;for(i=1;i<=NF;i++)print$i}'
>
> But I don't know why.

I'm not sure I understand the question. An FS of a null character is another
special case, just like an FS of a single blank character is a special case. A
null character appearing as part of an FS that's an RE isn't treated the same as
a null character that IS an FS, just like a single blank character appearing as
part of an FS that's an RE isn't treated the same as a blank character that IS
an FS.

So, when you write FS='| *' you're saying the FS is either nothing at all OR a
sequence of zero or more blanks. Yes, that doesn't make sense so you can
obviously optimize it to ' *' but there's plenty of REs we see people write that
could be optimized and awk doesn't try to analyze and warn you about any of them
other than a useless backslash.

Ed.

Dave B

Jul 10, 2008, 1:13:11 PM

Ed Morton wrote:

>> Yes, I know that...but I'm not sure how awk determines it has encountered a
>> "field separator" if the regex '| *' is used as FS. What part of the
>> alternation is used? It seems that it's effectively treated like ' *', but
>> the corner case where a "nothingness" matches (which is allowed by ' *', and
>> which would thus make it behave similarly as if FS='') never happens. These
>> seem to be equivalent:
>>
>> $ echo ' abc de f' |awk -v FS=' *' '{print NF;for(i=1;i<=NF;i++)print$i}'
>>
>> $ echo ' abc de f' |awk -v FS='| *' '{print NF;for(i=1;i<=NF;i++)print$i}'
>>
>> But I don't know why.
>
> I'm not sure I understand the question. An FS of a null character is another
> special case, just like an FS of a single blank character is a special case. A
> null character appearing as part of an FS that's an RE isn't treated the same as
> a null character that IS an FS, just like a single blank character appearing as
> part of an FS that's an RE isn't treated the same as a blank character that IS
> an FS.

Agreed, although strictly speaking we don't have "null characters", but
rather "null" or "empty" regexes here.

> So, when you write FS='| *' you're saying the FS is either nothing at all OR a
> sequence of zero or more blanks. Yes, that doesn't make sense so you can
> obviously optimize it to ' *' but there's plenty of REs we see people write that
> could be optimized and awk doesn't try to analyze and warn you about any of them
> other than a useless backslash.

Ok, I'll try to explain better. Suppose we have FS='x|y', and the input is

fooxbarybaz

We know that, when awk encounters 'x', FS matches, so awk decides that that
'x' is a field separator. The same happens when awk gets to 'y', later.
Every time, awk (actually, awk's regex engine) has used a certain part of
the alternation in the FS regex to try a match and decide if a piece of
input was to be considered a field separator (of course, this is a simple
regex, but it can be more complex, with each part matching longer strings,
or with a different structure. Also, I'm using a regex of the form x|y
because it's similar to the case at hand).
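
For reference, the x|y case can be tried directly (a small sketch; the variable name fields is just for illustration):

```shell
# Either side of the alternation can act as the separator.
fields=$(echo 'fooxbarybaz' |
    awk -v FS='x|y' '{ print NF; for (i = 1; i <= NF; i++) print $i }')
echo "$fields"
# 3
# foo
# bar
# baz
```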

Now, if FS='| *' (again an alternation), and the input is

abc

in theory, awk should immediately find a match for FS, since the part to the
left of "|" is an empty regex, which matches at the beginning of the string,
at the end, and between any two characters. And, the part to the right of
the "|" (" *") also allows matching an empty string, although awk should
not get that far, since the part to the left of the "|" already matches. But
this does not happen. Also, as each character is examined, awk should find a
match for the empty regex between any two characters, but again that doesn't
happen. But awk DOES know how to do that, because if you do

a="abc"; gsub(/| */,"X",a)

you correctly get

XaXbXcX

So, my doubt was: why isn't awk matching the null regex (using either part
of the alternation appearing in FS)? I guess the answer is: because FS is
special and does not work that way, unless FS is explicitly set to '' (GNU
awk only). Ok. But then, how does it work? Why does awk choose to treat a
nonsense FS like '| *' as if it were ' *'? What's the logic behind that?
Hope this was clearer.

loki harfagr

Jul 10, 2008, 1:25:24 PM

On Thu, 10 Jul 2008 08:04:00 -0700, Ronny wrote:

> Playing around, I found that the correct way to write the FS assignment
> goes like this:
>
> FS="[|] *"
>

That's right, I don't know what I was thinking of !-)
I first started to post my reply with that solution, then
I thought I'd better check your sample before posting and
then quickly got into testing other forms and stupidly just
checked the numbers and not the fields; in the end I
found that the *new* regexp form (/| */) was much nicer than the
one I was used to using ("[|] *") and posted it straight away.

> So the problem was not the usage of a string instead of a regexp, but
> that my
> regexp was wrong....

Now I feel stupid and sorry, but at least I'm happy to see
that what I thought previously was still right, and next
time I'll hold on to it !-)
I hope the underlying reason is that I was simply dying from the
heat and not that I'm about to really blow a fuse !-)

Again, sorry for the blund...

Ed Morton

Jul 10, 2008, 9:32:46 PM

Yes, it's clearer. I think the inclusion of a leading "|" is a red herring since
the null string it would match is just a subset of what could be matched by " *"
so you'd expect exactly the same behavior in either gsub() or in setting FS for
" *" or "| *" and that is what you get.

The problem I think I see, though, is that I'd expect this:

$ echo "abc" | awk '{gsub(/ */,"X")}1'
XaXbXcX

to produce the same output as this:

$ echo "abc" | awk '{gsub(//,"X")}1'
XaXbXcX

which it does, or this:

$ echo "abc" | awk 'BEGIN{FS=" *";OFS="X"}{$1=$1}1'
abc

which it doesn't and I can't think of any reason for that.

Regards,

Ed.

Dave B

Jul 11, 2008, 4:09:43 AM

Ed Morton wrote:

> Yes, it's clearer. I think the inclusion of a leading "|" is a red herring since
> the null string it would match is just a subset of what could be matched by " *"

That seems reasonable. But still, even ' *' does not match the null string
(but see below).

> $ echo "abc" | awk 'BEGIN{FS=" *";OFS="X"}{$1=$1}1'
> abc
>
> which it doesn't and I can't think of any reason for that.

Ok, this last is the one example that illustrates the point, that I wasn't
able to make so clearly.

But, rummaging through gawk source distribution, I found a note in the NEWS
file, that says:

Changes from 2.11beta to 2.11.1 (production)
--------------------------------------------
...snip...
Fixed FS splitting to never match null strings, per book.

Now, I don't know what the "book" was at the time of gawk 2.11. However,
looking at the sources (field.c), it seems that a null match is simply
ignored (at least, that's my understanding after a quick skim through the
re_parse_field function; I may be wrong though). So, that's why the ' *'
matches only when it can match one or more spaces.
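
If null matches really are skipped during field splitting, as the NEWS entry and the tests in this thread suggest, then FS=' *' should split exactly like FS=' +'. Here is a small sketch to check that on whichever awk is installed. The helper name split_with is made up for this example, and the result for awks not tested in this thread is an assumption.

```shell
# split_with is a hypothetical helper: it prints NF, then each field in brackets.
split_with() {
    printf '%s\n' ' abc de  f' |
        awk -v FS="$1" '{
            printf "%d", NF
            for (i = 1; i <= NF; i++) printf " [%s]", $i
            print ""
        }'
}

star=$(split_with ' *')   # regexp that could also match the empty string
plus=$(split_with ' +')   # regexp that needs at least one space

echo "FS=' *' -> $star"
echo "FS=' +' -> $plus"
```

If the two lines agree, the empty matches between letters were ignored, and only the leading space produced an empty first field.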

And, finally, I guess all of that is very gawk-specific.

Ed Morton

Jul 11, 2008, 9:45:58 AM

On 7/11/2008 3:09 AM, Dave B wrote:
> Ed Morton wrote:
>
>
>>Yes, it's clearer. I think the inclusion of a leading "|" is a red herring since
>>the null string it would match is just a subset of what could be matched by " *"
>
>
> That seems reasonable. But still, even ' *' does not match the null string
> (but see below).
>
>
>>$ echo "abc" | awk 'BEGIN{FS=" *";OFS="X"}{$1=$1}1'
>>abc
>>
>>which it doesn't and I can't think of any reason for that.
>
>
> Ok, this last is the one example that illustrates the point, that I wasn't
> able to make so clearly.
>
> But, rummaging through gawk source distribution, I found a note in the NEWS
> file, that says:
>
> Changes from 2.11beta to 2.11.1 (production)
> --------------------------------------------
> ...snip...
> Fixed FS splitting to never match null strings, per book.
>
>
> Now, I don't know what the "book" was at the time of gawk 2.11. However,
> looking at the sources (field.c), it seems that a null match is simply
> ignored (at least, that's my understanding after a quick skim through the
> re_parse_field function; I may be wrong though). So, that's why the ' *'
> matches only when it can match one or more spaces.
>
> And, finally, I guess all of that is very gawk-specific.
>

No, it applies to nawk and /usr/xpg4/bin/awk on Solaris too. They're all
consistent with gawk's FS treatment, but there are varying results for the gsub():

---------------
$ echo "abc" | nawk '{gsub(//,"X")}1'
nawk: empty regular expression
source line number 1
context is
>>> {gsub(//,"X") <<<

$ echo "abc" | nawk '{gsub(/ */,"X")}1'
XaXbXcX

$ echo "abc" | nawk '{gsub(/| */,"X")}1'
nawk: illegal primary in regular expression | * at *
source line number 1
context is
{gsub(/| >>> */,"X") <<<

$ echo "abc" | nawk 'BEGIN{FS=" *";OFS="X"}{$1=$1}1'
abc
-----------------
$ echo "abc" | /usr/xpg4/bin/awk '{gsub(//,"X")}1'
XaXbXcX

$ echo "abc" | /usr/xpg4/bin/awk '{gsub(/ */,"X")}1'
XaXbXcX

$ echo "abc" | /usr/xpg4/bin/awk '{gsub(/| */,"X")}1'
abc

$ echo "abc" | /usr/xpg4/bin/awk 'BEGIN{FS=" *";OFS="X"}{$1=$1}1'
abc
-------------------
$ echo "abc" | gawk '{gsub(//,"X")}1'
XaXbXcX

$ echo "abc" | gawk '{gsub(/ */,"X")}1'
XaXbXcX

$ echo "abc" | gawk '{gsub(/| */,"X")}1'
XaXbXcX

$ echo "abc" | gawk --posix '{gsub(/| */,"X")}1'
XaXbXcX

$ echo "abc" | gawk 'BEGIN{FS=" *";OFS="X"}{$1=$1}1'
abc
-------------------

It's interesting that gawk even in POSIX mode has a different result from
/usr/xpg4/bin/awk for gsub(/| */,"X").

There is something in the POSIX standard about any EREs containing NUL
characters producing undefined behavior. I wonder if this is related.

Regards,

Ed.

Kenny McCormack

Jul 11, 2008, 10:37:31 AM

In article <48776416...@lsupcaemnt.com>,
Ed Morton <mor...@lsupcaemnt.com> wrote:
...

>> And, finally, I guess all of that is very gawk-specific.
>>
>
>No, it applies to nawk and /usr/xpg4/bin/awk on Solaris too. They're all
>consistent with gawks FS treatment, but there's varying results for the gsub():

You have to get in tune with the lingo. In the terminology of the
standards jockeys (both here [cla], over in the shell groups, and, of
course, in standards jockey heaven [clc]), something is referred to as
"implementation specific" even if it works on 99% of the systems extant.

In fact, this is usually the case. It is almost always a very common
"extension", present, for all practical purposes, in everybody's
environment, yet they refer to it as "implementation specific".

Dave B

Jul 11, 2008, 11:12:46 AM

Ed Morton wrote:

> It's interesting that gawk even in POSIX mode has a different result from
> /usr/xpg4/bin/awk for gsub(/| */,"X").
>
> There is something in the POSIX standard about any EREs containing NUL
> characters producing undefined behavior. I wonder if this is related.

Thanks for the tests.

I found this, that *could* explain something:

"A vertical-line appearing first or last in an ERE, or immediately following
a vertical-line or a left-parenthesis, or immediately preceding a
right-parenthesis, produces undefined results".

So yeah, even if 99% of the implementations support it, you showed that that
does not turn it into something whose outcome can be relied upon.

Instead, what seems to be consistent is the handling of FS, although I
haven't been able to find something about that in the standard. But it's
still valuable information to save.