fourth argument to split

Aharon Robbins

Dec 8, 2008, 2:30:39 PM

Hi. I missed some of the discussion and the articles expired on my news
server but it doesn't make much difference.

In my opinion the correct way to go is with the version where seps[n]
is the text that follows $n, for 1 <= n <= NF. In addition, if there
is leading whitespace (where the third argument is " "), then seps[0]
is that text.

This is "better" because the meaning of seps[n] for 1 <= n <= NF is
always the same, and because field separators are the stuff between fields.

The inconsistency of having a seps[0] is minor, easily explained, and
logical. This behavior also makes it easy to reconstruct the record with
the straightforward fragment (as was pointed out):

    line = seps[0]
    for (i = 1; i <= NF; i++)
        line = line $i seps[i]
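
The fragment above can be tried as a complete program. This is only a minimal sketch, assuming gawk 4.0 or later, where a four-argument split() is available as a gawk extension. The helper name rebuild() and the sample strings are made up for illustration and are not part of any patch discussed here.

```awk
# Sketch only: requires gawk (four-argument split() is a gawk extension).
# rebuild() is a hypothetical helper name used for illustration.
function rebuild(str, fs,    flds, seps, n, i, line) {
    n = split(str, flds, fs, seps)
    line = seps[0]                  # leading whitespace; only set when fs is " "
    for (i = 1; i <= n; i++)
        line = line flds[i] seps[i] # seps[i] is the text that follows field i
    return line
}

BEGIN {
    s = "  alpha   beta gamma "
    print (rebuild(s, " ") == s) ? "match" : "mismatch"
    t = "a:b::c"
    print (rebuild(t, ":") == t) ? "match" : "mismatch"
}
```

Note that reading seps[0] or seps[n] when split() did not set them yields "" and, as in any awk, creates the element as a side effect.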

Steffen, please prepare a fresh patch with these semantics; initial
patches to gawk.texi, gawk.1 and awkcard.in would be of enormous
value as well.

Please include a test program and data as well.

Thanks to everyone who participated in this thread.

Arnold
--
Aharon (Arnold) Robbins arnold AT skeeve DOT com
P.O. Box 354 Home Phone: +972 8 979-0381
Nof Ayalon Cell Phone: +972 50 729-7545
D.N. Shimshon 99785 ISRAEL

Ed Morton

Dec 8, 2008, 3:06:48 PM

On Dec 8, 1:30 pm, arn...@skeeve.com (Aharon Robbins) wrote:
> Hi. I missed some of the discussion and the articles expired on my news
> server but it doesn't make much difference.
>
> In my opinion the correct way to go is with the version where seps[n]
> is the text that follows $n, for 1 <= n <= NF. In addition, if there
> is leading whitespace (where the third argument is " "), then seps[0]
> is that text.

That wasn't quite one of the choices. Do you mean:

...the version where seps[n] is the text that follows $n, for 1 <= n <
NF. In addition, if there is leading whitespace (where the third
argument is " "), then seps[0] is that text and if there is trailing
whitespace (where the third argument is " "), then seps[NF] is that
text.

or do you mean to populate seps[NF] with a NULL string if there's no
trailing whitespace or the separator != " "?

>
> This is "better" because the meaning of seps[n] for 1 <= n <= NF is
> always the same, and because field separators are the stuff between fields.

I assume you mean

...the meaning of seps[n] for 1 <= n < NF is always the same

but with the other proposal you could say:

...the meaning of seps[n] for 1 < n <= NF is always the same

so that statement doesn't seem to make either one "better".

>
> The inconsistency of having a seps[0] is minor, easily explained, and
> logical. This behavior also makes it easy to reconstruct the record with
> the straightforward fragment (as was pointed out):
>
>     line = seps[0]
>     for (i = 1; i <= NF; i++)
>         line = line $i seps[i]

That was true for the other proposal too:

    line = seps[1]
    for (i=1; i<=NF; i++)
        line = line $i seps[i+1]

but with the latter version you aren't potentially creating elements
in the seps[] array by printing it, so that statement doesn't seem to
make the former one "better" either.
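
The side effect mentioned above, where merely reading an unset element creates it, can be seen with a small sketch. It assumes gawk 4.0 or later and its four-argument split(). The names count(), growth() and safe_rebuild() are hypothetical and exist only for this illustration.

```awk
# Sketch only: requires gawk (four-argument split() is a gawk extension).
# count(), growth() and safe_rebuild() are hypothetical helper names.
function count(arr,    k, c) {
    c = 0
    for (k in arr)
        c++
    return c
}

# Returns how many elements were added to seps[] just by reading
# seps[0] and seps[n] after the split.
function growth(str, fs,    flds, seps, n, before, junk) {
    n = split(str, flds, fs, seps)
    before = count(seps)
    junk = seps[0] seps[n]          # plain references create missing elements
    return count(seps) - before
}

# Rebuilds the string without touching elements that split() did not set.
function safe_rebuild(str, fs,    flds, seps, n, i, line) {
    n = split(str, flds, fs, seps)
    line = (0 in seps) ? seps[0] : ""
    for (i = 1; i <= n; i++)
        line = line flds[i] ((i in seps) ? seps[i] : "")
    return line
}

BEGIN {
    print "elements added by reading:", growth("a:b:c", ":")
    print safe_rebuild("a:b:c", ":")
}
```

The "in" test costs a little verbosity but leaves the array exactly as split() filled it.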

If you're comfortable with your decision, then that's fine and I hope
I don't sound like a sore loser, but it sounds above like maybe some
of the parts of the discussion you missed were important.

Ed.

Manuel Collado

Dec 8, 2008, 6:08:06 PM

Aharon Robbins wrote:

> Hi. I missed some of the discussion and the articles expired on my news
> server but it doesn't make much difference.

This newsgroup is archived at Google:

http://groups.google.es/group/comp.lang.awk/topics

Regards.
--
Manuel Collado - http://lml.ls.fi.upm.es/~mcollado

Ed Morton

Dec 9, 2008, 11:16:32 AM

On Dec 8, 2:06 pm, Ed Morton <mortons...@gmail.com> wrote:
> On Dec 8, 1:30 pm, arn...@skeeve.com (Aharon Robbins) wrote:
>
> > Hi. I missed some of the discussion and the articles expired on my news
> > server but it doesn't make much difference.
>
> > In my opinion the correct way to go is with the version where seps[n]
> > is the text that follows $n, for 1 <= n <= NF. In addition, if there
> > is leading whitespace (where the third argument is " "), then seps[0]
> > is that text.
>
> That wasn't quite one of the choices. Do you mean:
>
> ...the version where seps[n] is the text that follows $n, for 1 <= n <
> NF. In addition, if there is leading whitespace (where the third
> argument is " "), then seps[0] is that text and if there is trailing
> whitespace (where the third argument is " "), then seps[NF] is that
> text.
>
> or do you mean to populate seps[NF] with a NULL string if there's no
> trailing whitespace or the separator != " "?

I hate to throw another wrench in the works, but if you do have in
mind adding a NULL string in that case, that might solve several
problems with the original "start at zero" proposal. It'd mean that
the 4th argument to "split()" wouldn't be "sepsArray[]", it'd be
"ftsArray[]", and it would contain the field terminators rather than
field separators, just as RT contains the Record Terminator rather than
the record separator. In that case ftsArray[1] has GOT to contain the
field terminator following field 1, so if there was a leading field
separator, then ftsArray[0] would HAVE to contain it, and if there
were no trailing field separator then ftsArray[NF] would be NULL, just
as RT is NULL if there's no record terminator at the end of the file.

I think that'd be easy for people to grasp as the ftsArray semantics
would be consistent with the RT semantics (other than ftsArray[0]),
and would leave no ambiguity about whether a field separator is
associated with the field before or after it since field terminators
are clearly associated with the preceding field they terminate.

If we could also agree to always populate ftsArray[0] with a NULL
string if no preceding field separator exists (just like split()
populates fldsArray[1] with a NULL string if FS!=" " and the record
starts with FS) then it'd resolve my other 2 issues:

Given:

n = split($0,flds,FS,fts)

1) n, which is length(flds), would always equal length(fts) - 1 so we
wouldn't have to guess or compute the length of fts[].
2) this would not add elements to the fts[] array:

    line=fts[0]
    for (i=1;i<=NF;i++)
        line = line flds[i] fts[i]

We actually could make FT a keyword like RT and just have split()
populate that FT[] array like awk populates RT and match() populates
RSTART and RLENGTH rather than adding a 4th argument to split() but
that's a detail.

Regards,

Ed.

Steffen Schuler

Dec 9, 2008, 6:04:08 PM

Aharon Robbins wrote:
[...]

> Steffen, please prepare a fresh patch with these semantics; initial
> patches to gawk.texi, gawk.1 and awkcard.in would be of enormous
> value as well.
>
> Please include a test program and data as well.

[...]

Hi Arnold, hi awk-users,

attached is a tested patch with documentation of the gawk-extension
"fourth argument of split".

The patch has the semantics that you, Arnold, want; it includes patches
for gawk.texi, gawk.1, and awkcard.in and provides the test case
splitarg4 with test data.

After patching, automake and autoconf have to be run.

Arnold, I'm not a native English writer. Please review the documentation
changes.

Remark: awkcard.in has to be corrected for the parameter "-r": the text
is too big for the box bounds.

--
Steffen

patch-gawk-splitarg4.diff

Aharon Robbins

Dec 10, 2008, 10:47:46 PM

Hi Ed.

In article <de7e5fdc-828c-402a...@13g2000yql.googlegroups.com>,

Ed Morton <morto...@gmail.com> wrote:
>On Dec 8, 1:30 pm, arn...@skeeve.com (Aharon Robbins) wrote:
>> Hi. I missed some of the discussion and the articles expired on my news
>> server but it doesn't make much difference.
>>
>> In my opinion the correct way to go is with the version where seps[n]
>> is the text that follows $n, for 1 <= n <= NF. In addition, if there
>> is leading whitespace (where the third argument is " "), then seps[0]
>> is that text.
>
>That wasn't quite one of the choices. Do you mean:
>
>...the version where seps[n] is the text that follows $n, for 1 <= n <
>NF. In addition, if there is leading whitespace (where the third
>argument is " "), then seps[0] is that text and if there is trailing
>whitespace (where the third argument is " "), then seps[NF] is that
>text.

Yes.

>or do you mean to populate seps[NF] with a NULL string if there's no
>trailing whitespace or the separator != " "?

Effectively, it doesn't matter if seps[NF] is explicitly assigned ""
or just referenced, since that's what gets returned. But in any case
I meant the earlier statement.

>If you're comfortable with your decision, then that's fine and I hope
>I don't sound like a sore loser, but it sounds above like maybe some
>of the parts of the discussion you missed were important.

I think I saw all the important points of the discussion, but I'm
comfortable with my decision; it's how I would have designed the
feature on my own.

I think that this was an interesting, worthwhile and fruitful
experiment for everyone in the group to participate in, and I
appreciate your contribution as well as everyone else's.

Thanks again,

Ed Morton

Dec 10, 2008, 11:12:00 PM

On Dec 10, 9:47 pm, arn...@skeeve.com (Aharon Robbins) wrote:
> Hi Ed.
>
> In article <de7e5fdc-828c-402a-a8c2-9b7aa88e1...@13g2000yql.googlegroups.com>,

Now that I've got my head around thinking of the array as containing
field terminators and realise I can get it populated as I want by just
a tweak:

n = split($0,flds,FS,fts); fts[0]; fts[NF]

then I'm fine with it too and I'd also like to thank Manuel and
Steffen in particular for participating in the discussion.
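
The tweak above can be checked with a short sketch, assuming gawk 4.0 or later and its four-argument split(). It uses the return value n in place of NF so that it works on any string, and the helper names count() and extra_terminators() are hypothetical.

```awk
# Sketch only: requires gawk (four-argument split() is a gawk extension).
# count() and extra_terminators() are hypothetical helper names.
function count(arr,    k, c) {
    c = 0
    for (k in arr)
        c++
    return c
}

# After the tweak, fts[] should hold indices 0..n, i.e. n + 1 elements,
# so the value returned here is expected to be 1 for any input.
function extra_terminators(str, fs,    flds, fts, n) {
    n = split(str, flds, fs, fts)
    fts[0]; fts[n]                  # bare references create "" where unset
    return count(fts) - n
}

BEGIN {
    print extra_terminators(" a b ", " ")
    print extra_terminators("a:b:c", ":")
}
```

With the two bare references in place, a reconstruction loop over 0..n no longer changes the size of fts[].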

Regards,

Ed.

Janis Papanagnou

Dec 11, 2008, 6:01:37 AM

Ed Morton wrote:
> [...]

> We actually could make FT a keyword like RT and just have split()
> populate that FT[] array like awk populates RT and match() populates
> RSTART and RLENGTH rather than adding a 4th argument to split() but
> that's a detail.

Not necessarily. Having a fourth argument could make optimizations
possible that are hard (or practically impossible for certain cases)
to implement otherwise.

Janis