How does awk `split' work?

PRC

May 8, 2008, 6:49:28 AM

The results of running the following script
gawk 'BEGIN {
n = split("(a,b,c,d)", a, /[(,)]/);
printf("n=%d\n", n);
for(i=1; i<=n; i++)
printf(" -%s-\n",a[i]);
}'
are
n=6
--
-a-
-b-
-c-
-d-
--

instead of
n=4
-a-
-b-
-c-
-d-
which is what I expected.

I have no idea how these results come about.

Dave B

May 8, 2008, 6:53:58 AM

You are telling awk to use any one of '(', ',' or ')' as the field
separator for splitting.

Given your string '(a,b,c,d)' awk sees six fields:

- one empty field before the '('
- a
- b
- c
- d
- one empty field after the ')'

Try this:

$ echo '(a,b,c,d)' | awk -F '[(,)]' '{print NF}'
6
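
To see where the two empty fields sit, it can help to print every field with its index. This is only a small sketch; the variable name `fields` and the `index=[value]` output format are my own choices for illustration:

```shell
# Print each field produced by the separator class [(,)] together with its index.
# Fields 1 and 6 are the empty strings before '(' and after ')'.
fields=$(echo '(a,b,c,d)' | awk -F '[(,)]' '{
    for (i = 1; i <= NF; i++)
        printf "%d=[%s]\n", i, $i
}')
echo "$fields"
```

The empty brackets at index 1 and index 6 are the two extra fields.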

--
D.

PRC

May 8, 2008, 7:13:38 AM

I see
But if FS is space, awk will skip empty fields. Why does awk work in
different ways for cases where FS is space and where FS is regular
expression?

Dave B

May 8, 2008, 7:21:17 AM

On Thursday 8 May 2008 13:13, PRC wrote:

> I see
> But if FS is space, awk will skip empty fields. Why does awk work in
> different ways for cases where FS is space and where FS is regular
> expression?

Because space is explicitly defined to be a special case.

Compare:

$ echo ' a b ' | awk '{print NF}'
2
$ echo ',,a,,b,,' | awk -F, '{print NF}'
7

From the standard:

The following describes FS behavior:

1. If FS is a null string, the behavior is unspecified.

2. If FS is a single character:

   1. If FS is <space>, skip leading and trailing <blank>s; fields
      shall be delimited by sets of one or more <blank>s.

   2. Otherwise, if FS is any other character c, fields shall be
      delimited by each single occurrence of c.

3. Otherwise, the string value of FS shall be considered to be an
   extended regular expression. Each occurrence of a sequence matching
   the extended regular expression shall delimit fields.

And splitting in split() works the same way.
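
As a rough illustration of the three cases applied to split(), here is a sketch that passes the separator as a string in each call. The array names x, y, z and the shell variable `result` are arbitrary, and I am assuming an awk that follows the rules quoted above:

```shell
result=$(awk 'BEGIN {
    # Single space: blanks are skipped at both ends and runs of blanks delimit.
    n1 = split("  a  b  ", x, " ")
    # Any other single character: every single occurrence delimits a field.
    n2 = split(",,a,,b,,", y, ",")
    # Anything longer is taken as an extended regular expression.
    n3 = split(",,a,,b,,", z, ",+")
    printf "%d %d %d\n", n1, n2, n3
}')
echo "$result"
```

The three counts should come out as 2, 7 and 4 respectively.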

--
D.

Janis

May 8, 2008, 7:55:03 AM

On 8 May, 13:13, PRC <panruoc...@gmail.com> wrote:

[Please don't top-post.]

> I see
> But if FS is space, awk will skip empty fields. Why does awk work in
> different ways for cases where FS is space and where FS is regular
> expression?

In your example you didn't use FS. Generally, there are some
special cases built into the semantics of FS/RS for spaces and
null strings, I suppose to combine a concise awk interface with
powerful features. Besides that, your program behaves exactly
the same way if you specify a space as a regexp:

n = split(" a b c d ", a, / /);

In your application, since you know the data and delimiters,
just change your loop

for(i=2; i<n; i++)
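
Put together with the original script, that might look like the sketch below. Printing n - 2 as the count and capturing the output in a shell variable named `out` are my additions for illustration:

```shell
out=$(awk 'BEGIN {
    n = split("(a,b,c,d)", a, /[(,)]/)
    # a[1] and a[n] are the empty strings outside the parentheses, so skip them.
    printf "n=%d\n", n - 2
    for (i = 2; i < n; i++)
        printf "-%s-\n", a[i]
}')
echo "$out"
```

This only works because the string is known to start and end with a delimiter.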

Janis

Ed Morton

May 8, 2008, 7:56:43 AM

and if you want to literally use a single blank character as the field
separator, specify it as '[ ]':

$ echo '  a  b  ' | awk '{print NF}'
2

$ echo '  a  b  ' | awk -F'[ ]' '{print NF}'
7

and if you want to use repetitions of a given character (or RE), specify it as
'<pattern>+':

$ echo ',,a,,b,,' | awk -F, '{print NF}'
7

$ echo ',,a,,b,,' | awk -F',+' '{print NF}'
4

and if you want it treated the same as the default FS, you need to strip away
any leading and trailing occurrences of the FS:

$ echo ',,a,,b,,' | awk -F',+' '{gsub("^"FS"|"FS"$","");print NF}'
2

So, the default FS behavior is a shorthand that lets us write:

$ echo ' a b ' | awk '{print NF}'
2

instead of:

$ echo ' a b ' | awk -F'[[:blank:]]+' '{gsub("^"FS"|"FS"$","");print NF}'
2

Ed.