
unpatsplit()


charlemagn...@gmail.com

Feb 5, 2017, 1:52:41 PM
I'm a big fan of patsplit() in gawk and find it curious there is no unpatsplit() as it's essential. It's fairly easy to write an unpatsplit() but I've never seen one in 3rd party libraries or anywhere.

An example.

str = "https://archive.org/2016-04.01.010101/http://google.com"

To remove the "-" and "." from the date portion "2016-04.01.010101":

patsplit(str, field, "/", sep)
gsub(/[.]|[-]/, "", sep[3])
str = unpatsplit(field, sep)

This is a demo (it could be done with split()), but the same technique can be used when processing large text files: cut out field sections using a regex, process the fields and/or separators, and reassemble.
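
For readers who want to run the demo above, here is a minimal sketch for gawk (patsplit() is a gawk extension). The unpatsplit() defined here is only a stand-in written for this illustration, not a built-in or library function; the poster's own version appears later in the thread.

```awk
# Stand-in unpatsplit(): rebuilds sep[0] field[1] sep[1] ... field[n] sep[n].
# The "in" tests avoid creating separator elements that patsplit() did not set.
function unpatsplit(field, sep,    i, n, out) {
    n = length(field)
    out = (0 in sep) ? sep[0] : ""
    for (i = 1; i <= n; i++)
        out = out field[i] ((i in sep) ? sep[i] : "")
    return out
}

BEGIN {
    str = "https://archive.org/2016-04.01.010101/http://google.com"
    # The pattern "/" makes every slash a field; the text between
    # slashes lands in sep[], so the date portion is sep[3].
    patsplit(str, field, "/", sep)
    gsub(/[-.]/, "", sep[3])
    print unpatsplit(field, sep)
}
```

With gawk this should print https://archive.org/20160401010101/http://google.com, since only sep[3] is modified before the pieces are glued back together.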

Ed Morton

Feb 5, 2017, 6:27:53 PM
On 2/5/2017 12:52 PM, charlemagn...@gmail.com wrote:
> I'm a big fan of patsplit() in gawk and find it curious there is no unpatsplit() as it's essential. It's fairly easy to write an unpatsplit()

There's your answer. If the awk language provided facilities to do things that
are easy to do otherwise then we'd have a bloated, incomprehensible syntax and
there's already a tool/language available for anyone who wants those
characteristics.

Ed.

charlemagn...@gmail.com

Feb 5, 2017, 6:57:37 PM
On Sunday, February 5, 2017 at 6:27:53 PM UTC-5, Ed Morton wrote:

> There's your answer. If the awk language provided facilities to do things that
> are easy to do otherwise then we'd have a bloated, incomprehensible syntax and
> there's already a tool/language available for anyone who wants those
> characteristics.

Of course, agreed. Maybe my post was not clear. As noted, it's easy enough to write in userspace. Rather, I'm surprised that "3rd party libraries" do not include one, that no one has ever mentioned it, and that such a thing does not appear to exist anywhere I have looked, nor is it mentioned in the docs. Yet it's an essential technique for getting the most out of patsplit().

Kaz Kylheku

Feb 5, 2017, 7:15:05 PM
On 2017-02-05, charlemagn...@gmail.com <charlemagn...@gmail.com> wrote:
> On Sunday, February 5, 2017 at 6:27:53 PM UTC-5, Ed Morton wrote:
>
>> There's your answer. If the awk language provided facilities to do things that
>> are easy to do otherwise then we'd have a bloated, incomprehensible syntax and
>> there's already a tool/language available for anyone who wants those
>> characteristics.
>
> Of course agreed. Maybe my post was not clear. As noted it's easy
> enough to write

Sure; just evidently very hard for you to describe in terms
of concrete requirements. :)

> libraries" did not include them, that no one has ever mentioned it,
> that basically such a thing does not appear to exist anywhere that I

Until you specify what exactly the function does, it exists exactly as much
as the sound of a one-handed clap.

Marc de Bourget

Feb 6, 2017, 11:45:51 AM
Hi Charlemagne, just out of curiosity: Can you show us your userspace solution?

charlemagn...@gmail.com

Feb 6, 2017, 4:48:11 PM
On Monday, February 6, 2017 at 11:45:51 AM UTC-5, Marc de Bourget wrote:

> Hi Charlemagne, just out of curiosity: Can you show us your userspace solution?

Sure, if you can improve or break it, that would be great.

Given two arrays created by patsplit() (field and sep), recombine them into a string.

function unpatsplit(field,sep,    c,o) {

    if(length(field) > length(sep)) return

    o = sep[0]
    c = 1
    while(c < length(field) + 1) {
        o = o field[c] sep[c]
        c++
    }
    return o

}

Ed Morton

Feb 6, 2017, 9:31:22 PM
On 2/6/2017 3:48 PM, charlemagn...@gmail.com wrote:
> On Monday, February 6, 2017 at 11:45:51 AM UTC-5, Marc de Bourget wrote:
>
>> Hi Charlemagne, just out of curiosity: Can you show us your userspace solution?
>
> Sure, if you can improve or break that would be great.
>
> Given two arrays created by patsplit() (field and sep) recombine into a string.
>
> function unpatsplit(field,sep, c,o) {
>
> if(length(field) > length(sep)) return

idk why the above would be useful.

>
> o = sep[0]
> c = 1
> while(c < length(field) + 1) {
> o = o field[c] sep[c]
> c++
> }

Instead of

c = 1
while(c < length(field) + 1) {
    o = o field[c] sep[c]
    c++
}

consider:

for(c=1; c < length(field) + 1; c++) {
    o = o field[c] sep[c]
}

> return o
>
> }
>

Andrew Schorr

Feb 7, 2017, 8:50:19 AM
On Monday, February 6, 2017 at 4:48:11 PM UTC-5, charlemagn...@gmail.com wrote:
> Given two arrays created by patsplit() (field and sep) recombine into a string.
>
> function unpatsplit(field,sep, c,o) {
>
> if(length(field) > length(sep)) return
>
> o = sep[0]
> c = 1
> while(c < length(field) + 1) {
> o = o field[c] sep[c]
> c++
> }
> return o
>
> }

I think this is fairly similar to the "join" extension function that we were discussing in 2015. I have been waiting for gawk 4.2 before implementing it, because there's a new API feature in there that will make it much easier to finish. Once 4.2 comes out, I can probably finish "join" pretty quickly.

Reading between the lines -- I'm claiming this is not peculiar to patsplit; the same should work with the regular split function.

Regards,
Andy

Kaz Kylheku

Feb 7, 2017, 9:35:44 AM
On 2017-02-07, Ed Morton <morto...@gmail.com> wrote:
> Instead of
>
> c = 1
> while(c < length(field) + 1) {
> o = o field[c] sep[c]
> c++
> }
>
> consider:
>
> for(c=1; c < length(field) + 1; c++) {

Still slightly silly.

x < y + 1 --> x <= y

:)

Ed Morton

Feb 7, 2017, 10:08:33 AM
Agreed that's the obvious next step for this code, I was just trying to get the
OP to consider "for" instead of "while" with exactly the same statements he
already had.

Ed.

Ed Morton

Feb 7, 2017, 10:35:15 AM
Right, and it's not specific to undoing any kind of split() either, it's just
flattening an array into a string. The only messy part is what to do with the
two different possible separator specifications - a string to be applied between
all array elements vs an array of separator strings each to be applied between
pairs of array elements, e.g.

function arr2str(valsArr,sepsStr,    idx,str) {
    str = valsArr[1]
    for (idx=2; idx<=length(valsArr); idx++) {
        str = str sepsStr valsArr[idx]
    }
    return str
}

vs

function arr2str(valsArr,sepsArr,    idx,str) {
    str = sepsArr[0]
    for (idx=1; idx<=length(valsArr); idx++) {
        str = str valsArr[idx] sepsArr[idx]
    }
    return str
}

The above is just to highlight needing to handle sepsStr vs sepsArr. For any
kind of array flattening function to be generally applicable you'd probably want
to traverse the arrays using the `in` operator with whatever PROCINFO["sorted_in"]
setting is present (probably defaulting to increasing numeric indices) to
handle arrays of non-numeric indices and sparse arrays, e.g.:

function arr2str(valsArr,sepsStr,    idx,str,cnt,oldOrder) {
    oldOrder = PROCINFO["sorted_in"]
    if ( PROCINFO["sorted_in"] == "" ) {
        PROCINFO["sorted_in"] = "@ind_num_asc"
    }
    for (idx in valsArr) {
        str = (cnt++ ? str sepsStr : "") valsArr[idx]
    }
    PROCINFO["sorted_in"] = oldOrder
    return str
}

You may even want to use asorti() to create new contiguous arrays to hold the
indices of the valsArr and sepsArr in the 2nd case above. It'll take some thought...

Regards,

Ed.

Joe User

Feb 7, 2017, 1:27:40 PM
This is what I put into my site.awk library:

function unpatsplit(field, sep,    c, o) {
    #
    # Given two arrays created by patsplit() recombine into a string.
    #
    if (!isarray(field) || !isarray(sep)) return;
    o = (0 in sep) ? sep[0] : ""
    for (c=1; c<=length(field); c++) \
        o = o ((c in field) ? field[c] : "") \
              ((c in sep) ? sep[c] : "");
    return o
}

I try to check arguments, and I try not to modify arrays by referencing
non-existent indices.
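
The side effect being avoided here can be seen in a few lines. This is only an illustrative sketch (the variable names are made up for the example): in awk, merely reading an array element creates it, while an "in" test does not.

```awk
BEGIN {
    arr[1] = "a"
    before = length(arr)              # one element so far

    x = arr[2]                        # a bare reference creates arr[2] as ""
    after_ref = length(arr)

    if (3 in arr) y = arr[3]          # the "in" test does not create arr[3]
    after_in = length(arr)

    print before, after_ref, after_in
}
```

This prints "1 2 2": the array grew after the plain reference but not after the membership test, which is why the function above guards each access with "in".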


Kaz Kylheku

Feb 7, 2017, 2:02:10 PM
On 2017-02-07, Joe User <berg...@ardmore.net> wrote:
> This is what I put into my site.awk library:
>
> function unpatsplit(field, sep , c, o) {
> #
> # Given two arrays created by patsplit() recombine into a string.
> #
> if (!isarray(field) || !isarray(sep)) return;
> o = (0 in sep) ? sep[0] : ""

sep[0] --> "I believe arrays are zero-based."

> for (c=1; c<=length(field); c++) \
> o = o ((c in field) ? field[c] : "") \
> ((c in sep) ? sep[c] : "");
> return o
> }

fields from 1 to length(fields) --> "Changed my mind; one-based!"

Joe User

Feb 7, 2017, 2:38:51 PM
Kaz Kylheku wrote:

> On 2017-02-07, Joe User <berg...@ardmore.net> wrote:
>> This is what I put into my site.awk library:
>>
<<snip>>
> sep[0] --> "I believe arrays are zero-based."
>
>> for (c=1; c<=length(field); c++) \
>> o = o ((c in field) ? field[c] : "") \
>> ((c in sep) ? sep[c] : "");
>> return o
>> }
>
> fields from 1 to length(fields) --> "Changed my mind; one-based!"

"I believe awk arrays are associative, not 0 or 1 based."

patsplit assigns sep[0] and field[1...]. See the man page.


Kaz Kylheku

Feb 7, 2017, 2:42:45 PM
Ouch; that is scatter-brained.

Joe User

Feb 7, 2017, 2:59:16 PM
Kaz Kylheku wrote:

> Ouch; that is scatter-brained.

No problem.

Ed Morton

Feb 7, 2017, 11:07:22 PM
No, it's necessary to accommodate what field splitting has always done, which is
to ignore leading and trailing blanks when assigning fields with the default FS.
When you do that and want to save the field separators, whatever comes before
field 1 and after field $NF has to be stored somewhere in the separators array
so this:

split($0,flds,FS,seps) -> seps[0] flds[1] seps[1] ... flds[NF] seps[NF]

where seps[0] holds any white space that comes before $1 and seps[NF] holds any
white space that comes after $NF makes more sense than any alternative approach.
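
A small round-trip check of the layout described above, as a sketch for gawk (the fourth argument to split() is a gawk extension; the sample line is made up for the example):

```awk
BEGIN {
    line = "  alpha   beta "
    # A single-space separator requests default whitespace splitting,
    # so leading and trailing blanks are not part of any field.
    n = split(line, flds, " ", seps)

    # Reassemble as seps[0] flds[1] seps[1] ... flds[n] seps[n].
    out = seps[0]
    for (i = 1; i <= n; i++)
        out = out flds[i] seps[i]

    print n, (out == line ? "round-trip ok" : "mismatch")
}
```

With gawk this should print "2 round-trip ok": the two leading blanks come back from seps[0] and the trailing blank from seps[2], so nothing is lost.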

Ed.

Andrew Schorr

Feb 8, 2017, 8:40:25 AM
On Tuesday, February 7, 2017 at 10:35:15 AM UTC-5, Ed Morton wrote:
> Right, and it's not specific to undoing any kind of split() either, it's just
> flattening an array into a string. The only messy part is what to do with the
> two different possible separator specifications - a string to be applied between
> all array elements vs an array of separator strings each to be applied between
> pairs of array elements, e.g.

The extension function can check the type of the 2nd argument to ascertain whether it is a scalar string or an array. If the 2nd argument is missing, I think we had decided to use OFS. I'm not sure about all the array sorting stuff. The idea was to unsplit (or unpatsplit) a line, and in those cases, the array subscripts are consecutive, positive integers. Trying to support more scenarios could lead to messy, complicated, and slow code.

Regards,
Andy

Ed Morton

Feb 8, 2017, 9:44:18 AM
On 2/8/2017 7:40 AM, Andrew Schorr wrote:
> On Tuesday, February 7, 2017 at 10:35:15 AM UTC-5, Ed Morton wrote:
>> Right, and it's not specific to undoing any kind of split() either, it's just
>> flattening an array into a string. The only messy part is what to do with the
>> two different possible separator specifications - a string to be applied between
>> all array elements vs an array of separator strings each to be applied between
>> pairs of array elements, e.g.
>
> The extension function can check the type of the 2nd argument to ascertain whether it is a scalar string or an array.

Good point, in user space we could use isarray().

> If the 2nd argument is missing, I think we had decided to use OFS.

Makes sense. Can you differentiate between missing and zero-or-null in a built
in function? Obviously if the user explicitly provides a null string it'd mean
they want nothing between each array field so ideally:

arr2str(a,"") -> a[1] a[2] ... a[n]

and similarly where var has never been used before:

arr2str(a,var) -> a[1] a[2] ... a[n]

vs when the arg is missing:

arr2str(a) -> a[1] OFS a[2] OFS ... a[n]

I don't know of a way in a user-defined function to differentiate between the
2nd and 3rd cases above (hint: being able to test which args are present in a
user-defined function would be a useful capability to have!).

> I'm not sure about all the array sorting stuff. The idea was to unsplit (or unpatsplit) a line, and in those cases, the array subscripts are consecutive, positive integers. Trying to support more scenarios could lead to messy, complicated, and slow code.

Is there reason to think that recombining arrays with those characteristics
would be the most common use of this though? I'd have thought by far the most
common use of a flattening function would be to simplify printing arrays so
instead of writing, for example:

PROCINFO["sorted_in"] = "whatever"
sep = ""
for (i in arr) {
    printf "%s%s", sep, arr[i]
    sep = OFS
}
print ""

we could just write:

PROCINFO["sorted_in"] = "whatever"
print arr2str(arr)

and that'd be much more generally useful if we don't create the function to only
work for arrays with consecutive, positive integer indices. I wouldn't expect
the code to be particularly messy or slow and if it was too slow, people can
always roll their own just like today.

At the end of the day, though, this function is easy to write in user space
(unlike, for example strptime(), hint, hint) so idk if it's worth your while
providing it and cluttering up the language unnecessarily.

Regards,

Ed.

>
> Regards,
> Andy
>

Andrew Schorr

Feb 8, 2017, 10:18:56 AM
On Wednesday, February 8, 2017 at 9:44:18 AM UTC-5, Ed Morton wrote:
> > If the 2nd argument is missing, I think we had decided to use OFS.
>
> Makes sense. Can you differentiate between missing and zero-or-null in a built
> in function?

Yes.

> I don't know of a way in a user-defined function to differentiate between the
> 2nd and 3rd cases above (hint: being able to test which args are present in a
> user-defined function would be a useful capability to have!).

It can be done. Please try this:

function f(x) {
    print (x == "") && (x == 0)
}

BEGIN {
    f()
    f("")
}

> Is there reason to think that recombining arrays with those characteristics
> would be the most common use of this though?

That was the original request. It solves the problem of joining into a string an array indexed by integers from 1 to N. It may not solve all problems.

> I'd have thought by far the most
> common use of a flattening function would be to simplify printing arrays so
> instead of writing, for example:
>
> PROCINFO["sorted_in"] = "whatever"
> sep = ""
> for (i in arr) {
> printf "%s%s", sep, arr[i]
> sep = OFS
> }
> print ""
>
> we could just write:
>
> PROCINFO["sorted_in"] = "whatever"
> print arr2str(arr)
>
> and that'd be much more generally useful if we don't create the function to only
> work for arrays with consecutive, positive integer indices. I wouldn't expect
> the code to be particularly messy or slow and if it was too slow, people can
> always roll their own just like today.

Please trust me -- the code will be messier if we have to inspect the array indices to discern what type they are and then make decisions about how to sort them.

> At the end of the day, though, this function is easy to write in user space
> (unlike, for example strptime(), hint, hint) so idk if it's worth your while
> providing it and cluttering up the language unnecessarily.

I think your variant should be implemented in user space. In my experience, using split or patsplit, then tweaking the values, and then wanting to recreate the string is sufficiently common that a join extension function would be useful.

If you'd like to write a more generic version of this, please go ahead and give it a try. The same applies to strptime -- does this need to be in core gawk, or can it be in an extension library, and if it can be done in a library, and it's so needed, why hasn't somebody written the library? We welcome contributions to the gawkextlib project.

Regards,
Andy

Ed Morton

Feb 8, 2017, 10:41:35 AM
On 2/8/2017 9:18 AM, Andrew Schorr wrote:
> On Wednesday, February 8, 2017 at 9:44:18 AM UTC-5, Ed Morton wrote:
>>> If the 2nd argument is missing, I think we had decided to use OFS.
>>
>> Makes sense. Can you differentiate between missing and zero-or-null in a built
>> in function?
>
> Yes.
>
>> I don't know of a way in a user-defined function to differentiate between the
>> 2nd and 3rd cases above (hint: being able to test which args are present in a
>> user-defined function would be a useful capability to have!).
>
> It can be done. Please try this:
>
> function f(x) {
> print (x == "") && (x == 0)
> }
>
> BEGIN {
> f()
> f("")
> }

That only differentiates between the 2 cases I mentioned we could differentiate
between. Try adding a call to f(var) where var is an uninitialized variable and
it will be treated the same as f() - those are the 2 cases I was asking if you
can differentiate between.
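
The three cases can be compared side by side with a sketch like the following. It is the same test as in the quoted code, only returning the result instead of printing it, plus the extra call; "var" is simply a name that is never assigned anywhere.

```awk
function f(x) {
    # 1 only when x compares equal to both "" and 0,
    # which is how an unassigned value behaves
    return ((x == "") && (x == 0))
}

BEGIN {
    print f()        # argument omitted
    print f("")      # explicit null string
    print f(var)     # var has never been assigned
}
```

This prints 1, 0 and 1: the explicit null string is told apart, but the omitted argument and the unassigned variable look identical inside the function.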
No, as I mentioned it's not hard to do it in user space so I don't think we
should provide it. I'm just saying IF you're going to provide it then here's
some things to consider.

> The same applies to strptime -- does this need to be in core gawk,

It needs to be in core gawk so it's available wherever its opposite strftime()
is available without any extra arguments or build steps. But we've already been
down this path and I know where it ends and we can get by with
mktime()+split()+etc instead.

Ed.

Kenny McCormack

Feb 8, 2017, 10:46:07 AM
In article <f412035a-86f5-4f8b...@googlegroups.com>,
Andrew Schorr <asc...@telemetry-investments.com> wrote:
...
>> (unlike, for example strptime(), hint, hint) so ...
...
>If you'd like to write a more generic version of this, please go ahead and
>give it a try. The same applies to strptime -- does this need to be in
>core gawk, or can it be in an extension library, and if it can be done in
>a library, and it's

Ah, but that's the old subject coming up again.

strptime() is easy to write in user-space (by the definition of "userspace"
that I use - and which Andy ought to be using - but see below). I did it
in about 20 minutes a long time ago (and published it here).

>so needed, why hasn't somebody written the library? We welcome
>contributions to the gawkextlib project.

But, alas. I don't consider "gawkextlib" to be "user-space". "gawkextlib"
is a project owned and controlled by Andy (and perhaps others - I don't
know if anyone else is involved; in any case that's not material to my
argument).

So, while it is easy to write strptime() and use it in your own projects,
it doesn't need to be nor should it be part of "gawkextlib".

--
"Remember when teachers, public employees, Planned Parenthood, NPR and PBS
crashed the stock market, wiped out half of our 401Ks, took trillions in
TARP money, spilled oil in the Gulf of Mexico, gave themselves billions in
bonuses, and paid no taxes? Yeah, me neither."

Mike Sanders

Feb 8, 2017, 1:32:44 PM
Thinking to myself: A definition of userspace?

It's really simple in my mind... When Joe shared
his idiomata (plural of idiom) he got some smart-ass
replies up thread. But once he falls in line like
a 'true awker' & puts away all that silly thinkin'
for himself stuff, he'll be a-okay.

AFRICAN, n. A n*gger that votes our way.
- The Devil's Dictionary, Ambrose Bierce

--
later on,
Mike

http://busybox.hypermart.net

Andrew Schorr

Feb 9, 2017, 11:28:02 AM
On Wednesday, February 8, 2017 at 10:46:07 AM UTC-5, Kenny McCormack wrote:
> >so needed, why hasn't somebody written the library? We welcome
> >contributions to the gawkextlib project.
>
> But, alas. I don't consider "gawkextlib" to be "user-space". "gawkextlib"
> is a project owned and controlled by Andy (and perhaps others - I don't
> know if anyone else is involved; in any case that's not material to my
> argument).
>
> So, while it is easy to write strptime() and use it in your own projects,
> it doesn't need to be nor should it be part of "gawkextlib".

The "should" part seems to be some kind of religious issue that I can't understand.

The point of gawkextlib is not to control anything; it is to make useful extensions available to gawk users in a central location. It's there to help. Use it or don't use it, as you please.

Regards,
Andy

P.S. And yes, gawkextlib has multiple developers. The tools are currently limited, but ultimately, it would be nice to have a setup where anybody could contribute an extension. The standard SourceForge tools don't offer a way to do that, and I don't think GitHub gets us there either. So somebody may have to build a custom website to do this the right way. Maybe some day.


Marc de Bourget

Feb 9, 2017, 2:11:05 PM
Le mercredi 8 février 2017 19:32:44 UTC+1, Mike Sanders a écrit :
> Thinking to myself: A definition of userspace?
>
> It's really simple in my mind... When Joe shared
> his idiomata (plural of idiom) he got some smart-ass
> replies up thread. But once he falls in line like
> a 'true awker' & puts away all that silly thinkin'
> for himself stuff, he'll be a-okay.
>

So true!

Kenny McCormack

Feb 9, 2017, 2:11:57 PM
In article <1e19c07b-464b-426d...@googlegroups.com>,
Andrew Schorr <asc...@telemetry-investments.com> wrote:
>On Wednesday, February 8, 2017 at 10:46:07 AM UTC-5, Kenny McCormack wrote:
>> >so needed, why hasn't somebody written the library? We welcome
>> >contributions to the gawkextlib project.

Note: I get - and I understand - the rest of your fine post. I think we
both know where we each stand. But I wanted to comment on just the two
lines above.

You ask: (if it [*] is) so needed, why hasn't somebody written the library?

The answer to that is: *somebody*, namely me, *has*. It is (and has been
for quite some time) available at:

http://shell.xmission.com:PORT/strptime.zip

(where "PORT" is 65401)

[*] Where by "it", we mean a strptime() function for GAWK.

However, for my own personal reasons, it is not nor will it ever be part of
"gawkextlib".

--
Pensacola - the thinking man's drink.

Andrew Schorr

Feb 9, 2017, 6:21:02 PM
On Thursday, February 9, 2017 at 2:11:57 PM UTC-5, Kenny McCormack wrote:
> The answer to that is: *somebody*, namely me, *has*. It is (and has been
> for quite some time) available at:
>
> http://shell.xmission.com:PORT/strptime.zip
>
> (where "PORT" is 65401)
>
> [*] Where by "it", we mean a strptime() function for GAWK.
>
> However, for my own personal reasons, it is not nor will it ever be part of
> "gawkextlib".

Super. So problem solved. There should be joy in Mudville. It's a shame you are unwilling to share it in a well-known location where others can find it more easily, but that's up to you.

Cheers,
Andy

Kenny McCormack

Feb 9, 2017, 6:30:56 PM
In article <dcf6901c-44d0-4ca8...@googlegroups.com>,
Indeed. Quite so.

--
Shikata ga nai...

Mike Sanders

Feb 9, 2017, 7:02:22 PM
Marc de Bourget <marcde...@gmail.com> wrote:

> So true!

Darn, 'I rant'... don't follow my example Marc.
Just wish that the use-net de rigueur of trampling
others for sport would ease up sometimes.

Folks ought to teach each other in the spirit
of learning (learning is fun right?) but instead,
everybody's afraid somehow :/

Back to work for me! You? Build something in awk
& be happy =)

Manuel Collado

Feb 12, 2017, 8:17:10 AM
Here is a possible implementation of 'join()' in the 'userspace':

function join(field, sep,    fixsep, mulsep, k, out) {

    mulsep = isarray(sep)                       # flag for an array of separators
    if (!mulsep) fixsep = sep                   # fixed scalar separator
    if (fixsep==0 && fixsep=="") fixsep = OFS   # default scalar separator

    if (mulsep && 0 in sep) out = sep[0]
    for (k=1; k in field; k++) {
        out = out field[k]
        if (mulsep && k in sep) {
            out = out sep[k]
        } else if ((k+1) in field) {
            out = out fixsep
        }
    }
    return out
}

HTH
--
Manuel Collado - http://lml.ls.fi.upm.es/~mcollado

Ed Morton

Feb 12, 2017, 12:22:24 PM
This would be simpler, clearer and more efficient:

function arr2str(flds, seps,    numFlds, fldNr, outStr) {
    numFlds = length(flds)
    if ( isarray(seps) ) {
        # an array of separators
        outStr = (0 in seps ? seps[0] : "")
        for (fldNr=1; fldNr<=numFlds; fldNr++) {
            outStr = outStr flds[fldNr] (fldNr in seps ? seps[fldNr] : "")
        }
    }
    else {
        # fixed scalar separator
        outStr = "
        for (fldNr=1; fldNr<=numFlds; fldNr++) {
            outStr = (fldNr>1 ? outStr seps : "") flds[fldNr]
        }
    }
    return outStr
}

Regards,

Ed.

Ed Morton

Feb 12, 2017, 12:24:40 PM
On 2/12/2017 11:22 AM, Ed Morton wrote:

> outStr = "

Of course that should be:

outStr = ""

and yes, that statement's not necessary, but I like it for clarity and symmetry
with the array-of-seps case. Full corrected version:
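
The corrected listing itself did not survive in this archive. Assuming the only change is the one-character fix described above, it would presumably read as follows. The BEGIN block is an added usage sketch, not part of the original post, and the whole thing needs gawk for isarray() and the four-argument split().

```awk
function arr2str(flds, seps,    numFlds, fldNr, outStr) {
    numFlds = length(flds)
    if ( isarray(seps) ) {
        # an array of separators
        outStr = (0 in seps ? seps[0] : "")
        for (fldNr=1; fldNr<=numFlds; fldNr++) {
            outStr = outStr flds[fldNr] (fldNr in seps ? seps[fldNr] : "")
        }
    }
    else {
        # fixed scalar separator
        outStr = ""
        for (fldNr=1; fldNr<=numFlds; fldNr++) {
            outStr = (fldNr>1 ? outStr seps : "") flds[fldNr]
        }
    }
    return outStr
}

# Usage sketch (an addition for illustration only).
BEGIN {
    split("a:b:c", parts, ":", s)
    print arr2str(parts, s)      # separators taken from the s[] array
    print arr2str(parts, "-")    # one fixed separator between fields
}
```

The first call should reproduce "a:b:c" and the second should give "a-b-c".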

Joe User

Feb 12, 2017, 2:12:33 PM
Manuel Collado wrote:

> Here is a possible implementation of 'join()' in the 'userspace':

I have one suggestion for that excellent function.

I imagine commonly using 'join' to remove fields, like:

patsplit($0, fields, FS, seps)
delete fields[3]
delete fields[8]
s = join(fields, seps)

I think your function would truncate the output string at the first deleted
field.

This is not a complaint, just pointing out a possible improvement.

Iterating with 'for (k in field)' would fix that, with the external dependency
on PROCINFO["sorted_in"] to make sure the indices are accessed in numeric
order.
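
The difference between the two loop styles can be sketched in isolation. This is an illustration only, with made-up variable names; it relies on gawk for PROCINFO["sorted_in"], and it concatenates fields without any separators to keep the point visible.

```awk
BEGIN {
    split("a b c d", fields, " ")
    delete fields[3]

    # Counting loop with an "in" test: stops at the first missing index.
    for (k = 1; (k in fields); k++)
        counted = counted fields[k]

    # for-in with an explicit numeric ordering: visits every remaining index.
    PROCINFO["sorted_in"] = "@ind_num_asc"
    for (k in fields)
        scanned = scanned fields[k]

    print counted, scanned
}
```

This prints "ab abd": the counting loop never reaches field 4, while the ordered for-in scan skips only the deleted element.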


Andrew Schorr

Feb 13, 2017, 4:29:58 AM
On Sunday, February 12, 2017 at 12:22:24 PM UTC-5, Ed Morton wrote:
> This would be simpler, clearer and more efficient:

You are missing OFS support.

Regards,
Andy

Andrew Schorr

Feb 13, 2017, 4:31:17 AM
On Sunday, February 12, 2017 at 2:12:33 PM UTC-5, Joe User wrote:
> patsplit($0, fields, FS, seps)
> delete fields[3]
> delete fields[8]
> s = join(fields, seps)

In that situation, does one include separators for the missing fields?

Regards,
Andy

Marc de Bourget

Feb 13, 2017, 5:42:29 AM
Le dimanche 12 février 2017 20:12:33 UTC+1, Joe User a écrit :

> Iterating with 'for (k in field)' would fix that, with the external dependency
> on PROCINFO["sorted_in"] to make sure the indices are accessed in numeric
> order.

BTW, as mentioned earlier, I still think PROCINFO["sorted_in"] is not needed here.
If all indices are positive integers, GAWK sorts them numerically with "for...in".
Maybe someone who knows the GAWK source can confirm this.

Kenny McCormack

Feb 13, 2017, 6:29:56 AM
In article <4d812961-3c5b-49d5...@googlegroups.com>,
Marc de Bourget <marcde...@gmail.com> wrote:
Interesting.

Obviously, the standards-obsessed will point out that even if it (seems to)
work, it is not guaranteed so you can't rely on it. Equally clearly, if
you are using (current versions of) GAWK, why *not* set PROCINFO? Why take
the chance? And (as noted below), if you are not using a current version
of GAWK (or TAWK, which has always had this as a documented feature), you
probably won't get this (serendipitous) sorting anyway.

FWIW, I tested with this program (command line):

$ gawk 'BEGIN { srand();for (i=1; i<20; i++) A[int(rand()*1000)];for (i in A) print i}'

Using an old version of gawk, they come out in random (not sorted) order.
Using a current version of gawk, they come out in sorted order.
(And, yes, I tested it several times, and it did always come out sorted.
That's not proof, of course...)

A couple of other notes:
1) You really should set PROCINFO anyway, because somebody else might
have set it (i.e., if you're trying to make this routine generic
(include-able in other programs)). I.e., if the caller has set
PROCINFO to some other value (say, to get alphabetic sorting),
then you'll want to save the old value, set it to what you need,
and then set it back.
2) That all said (putting on the usual newsgroup-smarty-pants mode),
the more interesting question is the one that I think you're
actually raising. Which is: Why does this actually work? Is there
something about the current GAWK array subscript hashing algorithm
that makes this work? Why does it work now when it didn't work in
the past?
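
The save, set and restore pattern from note 1 might look like the sketch below. The function name join_sorted() and its parameters are hypothetical, chosen only for this illustration, and the code assumes gawk.

```awk
function join_sorted(arr, sep,    had, old, k, out, first) {
    # Remember whether the caller had a traversal order, and which one.
    had = ("sorted_in" in PROCINFO)
    if (had) old = PROCINFO["sorted_in"]

    PROCINFO["sorted_in"] = "@ind_num_asc"   # the order this routine needs
    first = 1
    for (k in arr) {
        out = out (first ? "" : sep) arr[k]
        first = 0
    }

    # Put things back exactly as they were found.
    if (had) PROCINFO["sorted_in"] = old
    else delete PROCINFO["sorted_in"]
    return out
}

BEGIN {
    arr[10] = "c"; arr[2] = "b"; arr[1] = "a"
    PROCINFO["sorted_in"] = "@ind_num_desc"  # the caller's own preference
    print join_sorted(arr, ",")
    print PROCINFO["sorted_in"]
}
```

This should print "a,b,c" followed by "@ind_num_desc", showing that the caller's setting is untouched after the call. Testing with "in" first also avoids creating an empty "sorted_in" element when none existed.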

--
The randomly chosen signature file that would have appeared here is more than 4
lines long. As such, it violates one or more Usenet RFCs. In order to remain
in compliance with said RFCs, the actual sig can be found at the following URL:
http://user.xmission.com/~gazelle/Sigs/CLCtopics

Ed Morton

Feb 13, 2017, 9:28:21 AM
Sorry, I don't understand - what do you mean by that? I was just providing an
improved version of Manuel's function and I think what I provided would behave
the same way his did. I also don't understand what OFS has to do with converting
an array into a string using specified separators. Are you thinking that if
someone doesn't provide a "seps" argument at all then we should use OFS as the
separator? I'd disagree with that in favor of letting that case fall into the
"it's not an array" leg and using a null string separator (i.e. no separator).

Ed.

> Regards,
> Andy
>

Marc de Bourget

Feb 13, 2017, 11:56:26 AM
I think GAWK developers want to ensure that array sorting works as expected
and the same way as real (zero-based) arrays are processed in other languages.

Marc de Bourget

Feb 13, 2017, 12:02:56 PM
Hi Ed, I have tested your code with the OP's example, seems to work great.

Joe User

Feb 13, 2017, 12:32:26 PM
I vote NO. To just delete fields, only include one sep[] per existing
field[].


Joe User

Feb 13, 2017, 1:17:14 PM
charlemagn...@gmail.com wrote:

> I'm a big fan of patsplit() in gawk and find it curious there is no
> unpatsplit() as it's essential. It's fairly easy to write an unpatsplit()
> but I've never seen one in 3rd party libraries or anywhere.
>snip<

So, after everyone has had his say, this seems to work:

function cat_array(field, sep,    fixsep, mulsep, k, out, osort) {
    #
    # Joins an array of values into a string, with a (possible)
    # array of separators, fixed separator, or the default separator OFS.
    # Can be used immediately following patsplit() or split()
    # To use no field separator, use cat_array(field, "").
    #
    osort = PROCINFO["sorted_in"]                        # method for index sorting
    if (osort) PROCINFO["sorted_in"] = "@ind_num_asc";   # iterate field[k] in order
    mulsep = isarray(sep)                                # flag for an array of separators
    if (!mulsep) fixsep = sep;                           # fixed scalar separator
    if (fixsep==0 && fixsep=="") fixsep = OFS;           # default scalar separator
    if (mulsep && 0 in sep) out = sep[0];
    for (k in field) {
        out = out field[k]
        if (mulsep && k in sep) {
            out = out sep[k]
        } else if ((k+1) in field) {
            out = out fixsep
        }
    }
    if (osort) PROCINFO["sorted_in"] = osort;            # restored method for index sorting
    return out
}

Kenny McCormack

Feb 13, 2017, 2:37:32 PM
In article <8496a$58a1f825$adf2c12d$13...@API-DIGITAL.COM>,
Joe User <ax...@yahoo.com> wrote:
...
>So, after everyone has had his say, this seems to work:
>
>function cat_array(field, sep , fixsep, mulsep, k, out, osort) {
> #
> # Joins an array of values into a string, with a (possible)
> # array of separators, fixed separator, or the default separator OFS.
> # Can be used immediately following patsplit() or split()
> # To use no field separator, use cat_array(field, "").
> #
> osort = PROCINFO["sorted_in"] # method for index sorting

It is probably slightly superior to check to see if PROCINFO[] has been set
at all, to avoid creating an empty element:

if ("sorted_in" in PROCINFO) osort = PROCINFO["sorted_in"] # method for index sorting

Incidentally, all of the following values of PROCINFO[] seem to be
equivalent (but of course, this is not guaranteed):
1) unset (no "sorted_in" element)
2) "" (present, but empty string)
3) "@unsorted"

--
There are two kinds of Republicans: Billionaires and suckers.
Republicans: Please check your bank account and decide which one is you.

Joe User

Feb 13, 2017, 4:38:30 PM
Kenny McCormack wrote:

> It is probably slightly superior to check to see if PROCINFO[] has been
> set at all, to avoid creating an empty element:

Closer to perfection:

function cat_array(field, sep,    fixsep, mulsep, k, out, osort) {
    #
    # Joins an array of values into a string, with a (possible)
    # array of separators, fixed separator, or the default separator OFS.
    # Can be used immediately following patsplit() or split().
    # To use no field separator, use cat_array(field, "").
    #
    if (! isarray(field)) return field;  # Argument should be array
    osort = ("sorted_in" in PROCINFO ? PROCINFO["sorted_in"] : "")  # method for index sorting

Ed Morton

Feb 14, 2017, 12:58:23 AM
On 2/13/2017 3:38 PM, Joe User wrote:
> Kenny McCormack wrote:
>
>> It is probably slightly superior to check to see if PROCINFO[] has been
>> set at all, to avoid creating an empty element:
>
> Closer to perfection:
>
> function cat_array(field, sep , fixsep, mulsep, k, out, osort) {
> #
> # Joins an array of values into a string, with a (possible)
> # array of separators, fixed separator, or the default separator OFS.
> # Can be used immediately following patsplit() or split()
> # To use no field separator, use cat_array(field, "").
> #
> if (! isarray(field)) return field; # Argument should be array

If this function is called with a non-array first argument the user should get
an error as they HAVE made a mistake, the function should not return a string
and hide the error.

> osort = ("sorted_in" in PROCINFO ? PROCINFO["sorted_in"] : "") # method for index sorting
> if (osort) PROCINFO["sorted_in"] = "@ind_num_asc"; # iterate field[k] in order

So the default sorting order will be random (hash) if sorted_in isn't set? I
don't think that's as useful as if the default order was ind_num_asc and we use
sorted_in if it was set so the caller can control the order.

> mulsep = isarray(sep) # flag for an array of separators
> if (!mulsep) fixsep = sep; # fixed scalar separator
> if (fixsep==0 && fixsep=="") fixsep = OFS; # default scalar separator

Why would we want to test the separator and use OFS if it's not initialized? We
can't tell when a function is being called with no argument

foo(arr)

vs an uninitialized variable:

foo(arr,var)

so don't try to guess what the user meant when they called the function - just
use whatever is passed in. There's nothing fancy about OFS, it's just a string,
so if they want to use OFS as a separator they can just specify that:

foo(arr,OFS)

> if (mulsep && 0 in sep) out = sep[0];

That assumes sep[] starts at zero and field[] starts at a value greater than
zero. That's not necessary.

> for (k in field) {
> out = out field[k]
> if (mulsep && k in sep) {
> out = out sep[k]
> } else if ((k+1) in field) {
> out = out fixsep
> }
> }

The above is less efficient and less clear than it could be; just do the
"mulsep" test before the loop, not every time through it. Yes, you'll duplicate
the for loop line, but that's better than the alternative.

> if (osort) PROCINFO["sorted_in"] = osort; # restored method for index sorting
> return out
> }


This function will clearly, efficiently, and consistently put "seps" values
between the corresponding "flds" values for any array indices using any sorting
algorithm specified in PROCINFO (numerically ascending indices by default):

function arr2str(flds, seps,    clrSortedIn, idx, fldNr, outStr) {
    if ( ! ("sorted_in" in PROCINFO) ) {
        PROCINFO["sorted_in"] = "@ind_num_asc"
        clrSortedIn = 1
    }

    if ( isarray(seps) ) {
        # an array of separators
        for (idx in flds) {
            outStr = outStr (fldNr++ && (idx in seps) ? seps[idx] : "") flds[idx]
        }
    }
    else {
        # fixed scalar separator
        for (idx in flds) {
            outStr = outStr (fldNr++ ? seps : "") flds[idx]
        }
    }

    if ( clrSortedIn ) {
        delete PROCINFO["sorted_in"]
    }

    return outStr
}

and if you want to use it to undo the result of a split() just add the preceding
and succeeding seps[] values outside of the function call:

str = seps[0] arr2str(flds,seps) seps[length(flds)]

Regards,

Ed.

Ed Morton

Feb 14, 2017, 1:21:11 AM
function arr2str(flds, seps, clrSortedIn, idx, prevIdx, fldNr, outStr) {
> if ( ! ("sorted_in" in PROCINFO) ) {
> PROCINFO["sorted_in"] = "@ind_num_asc"
> clrSortedIn = 1
> }
>
> if ( isarray(seps) ) {
> # an array of separators
> for (idx in flds) {
> outStr = outStr (fldNr++ && (idx in seps) ? seps[idx] : "")
> flds[idx]
            outStr = outStr (fldNr++ && (prevIdx in seps) ? seps[prevIdx] : "") flds[idx]
            prevIdx = idx

Andrew Schorr

Feb 14, 2017, 4:34:53 AM
On Monday, February 13, 2017 at 9:28:21 AM UTC-5, Ed Morton wrote:
> Sorry, I don't understand - what do you mean by that? I was just providing an
> improved version of Manuel's function and I think what I provided would behave
> the same way his did.

Manuel's function refers to OFS, and yours does not. So how can it have the same behavior?

> I also don't understand what OFS has to do with converting
> an array into a string using specified separators.

Split is to normal input record processing as join is to $0 reconstitution. When you assign to a value of $n, $0 is recreated using OFS. So that's why OFS is the default separator unless the user explicitly overrides it.

> Are you thinking that if
> someone doesn't provide a "seps" argument at all then we should use OFS as the
> separator? I'd disagree with that in favor of letting that case fall into the
> "it's not an array" leg and using a null string separator (i.e. no separator).

It's of course a matter of taste. In user-space, one can do whatever one wants. If I implement a join extension, I'll have to choose how it behaves. I think it should be consistent with $0 reconstitution.

Regards,
Andy

Andrew Schorr

Feb 14, 2017, 4:37:56 AM
On Monday, February 13, 2017 at 6:29:56 AM UTC-5, Kenny McCormack wrote:
> 2) That all said (putting on the usual newsgroup-smarty-pants mode),
> the more interesting question is the one that I think you're
> actually raising. Which is: Why does this actually work? Is there
> something about the current GAWK array subscript hashing algorithm
> that makes this work? Why does it work now when it didn't work in
> the past?

I think we have covered this in the past. The current version of gawk has some optimizations for representing arrays with integer subscripts. A side-effect is that the results are sorted in some cases. Good code should NOT rely upon this. It could change in a future version. It is also untrue for arrays containing negative integer subscripts:

bash-4.2$ gawk 'BEGIN {for (i = -1; i <= 1; i++) x[i]; for (i in x) print i}'
0
-1
1

Regards,
Andy

Kenny McCormack

Feb 14, 2017, 6:44:58 AM
In article <0302204b-f9a4-4c87...@googlegroups.com>,
Andrew Schorr <asc...@telemetry-investments.com> wrote:
...
>It is also untrue for arrays containing
>negative integer subscripts:
>
>bash-4.2$ gawk 'BEGIN {for (i = -1; i <= 1; i++) x[i]; for (i in x) print i}'
>0
>-1
>1

Indeed. Marc's original claim explicitly said that it worked when all the
subscripts were positive integers.

--
"Women should not be enlightened or educated in any way. They should be
segregated because they are the cause of unholy erections in holy men.

-- Saint Augustine (354-430) --

Ed Morton

Feb 14, 2017, 11:36:52 AM
On 2/14/2017 3:34 AM, Andrew Schorr wrote:
> On Monday, February 13, 2017 at 9:28:21 AM UTC-5, Ed Morton wrote:
>> Sorry, I don't understand - what do you mean by that? I was just providing an
>> improved version of Manuel's function and I think what I provided would behave
>> the same way his did.
>
> Manuel's function refers to OFS, and yours does not. So how can it have the same behavior?

Yes, you're right, I deleted that line because this (simplified) code cannot work:

function join(field, sep, ...) {
if (sep==0 && sep=="") sep = OFS # default scalar separator
}

Given this call to the above:

join(theArr,theSep)

when "theSep" is an uninitialized variable, it is being used where either an
array or a string is expected. It's not an array, so it's a string and should be
treated as a null string, but the above code would mistakenly treat it as OFS,
since a user space function cannot distinguish between an absent argument and an
argument that's an uninitialized variable. That is IMHO simply 100% wrong, but
even if you disagree you've got to admit it's misleading/confusing at best and
completely unnecessary when you can just remove the line:

if (sep==0 && sep=="") sep = OFS # default scalar separator

and let the caller pass in OFS as the sep argument if/when they want it. All
clear, simple, and obvious.

>
>> I also don't understand what OFS has to do with converting
>> an array into a string using specified separators.
>
> Split is to normal input record processing as join is to $0 reconstitution. When you assign to a value of $n, $0 is recreated using OFS. So that's why OFS is the default separator unless the user explicitly overrides it.

No. We need to have that side-effect behavior when doing $1=$1 because we have
no way to specify the separator when performing that operation otherwise and
it's consistent with calling split() with no separator argument. That's not the
case when writing a function to combine array elements - just use what is
specified on the arg list.

>> Are you thinking that if
>> someone doesn't provide a "seps" argument at all then we should use OFS as the
>> separator? I'd disagree with that in favor of letting that case fall into the
>> "it's not an array" leg and using a null string separator (i.e. no separator).
>
> It's of course a matter of taste. In user-space, one can do whatever one wants. If I implement a join extension, I'll have to choose how it behaves. I think it should be consistent with $0 reconstitution.

Clearly I disagree. One cannot do whatever one wants, because what one wants is
a way to distinguish between an absent argument and an uninitialized variable. I
think it's more important to make "join()" consistent with "split()" than with
reconstructing $0, and when you call split() with an uninitialized variable as
the separator it does NOT assume you really want it to use FS; it simply uses
the null string. This is the consistency we need (assume var is an uninitialized
variable):

split(str,flds,var) == split(str,flds,"")
join(arr,var) == join(arr,"")

Not this inconsistency:

split(str,flds,var) == split(str,flds,"")
join(arr,var) == join(arr,OFS)

since we CANNOT implement what we'd like in a user space join function:

split(str,flds,var) == split(str,flds,"")
join(arr,var) == join(arr,"")
split(str,flds) == split(str,flds,FS)
join(arr) == join(arr,OFS)

since there's no way within join() to differentiate between an uninitialized
variable argument and no argument.
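
A quick sketch of the split() side of that comparison, assuming gawk in its default mode (POSIX leaves a null-string separator unspecified, and --posix or --traditional may behave differently). The variable name "unset" is just an illustrative name for something that has never been assigned:

```awk
BEGIN {
    str = "ab cd"
    print split(str, f)          # 2: third argument omitted, FS is used
    print split(str, f, unset)   # 5: uninitialized variable acts as "", one element per character
    print split(str, f, "")      # 5: explicit null string, same result
}
```

So the built-in really does treat "no argument" and "uninitialized argument" differently, which is the distinction a user space function has no way to see.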

Here's another thing to consider - we CANNOT call split() or patsplit() in a way
that uses FS implicitly and populates a separator array so why try to do it with
join()? What I mean is we can't write:

split(str,flds,,seps)

we must write:

split(str,flds,FS,seps)

instead. Since we can't use FS implicitly with any split() function why are we
trying to use OFS implicitly with a join() function? Keep it simple...

Regards,

Ed.



>
> Regards,
> Andy
>

Manuel Collado

Feb 15, 2017, 9:28:06 AM
El 14/02/2017 17:36, Ed Morton escribió:
> [...]
> ... we CANNOT implement what we'd like in a user space join function:

I assume this means "... what YOU would like ...".

>
> split(str,flds,var) == split(str,flds,"")
> join(arr,var) == join(arr,"")
> split(str,flds) == split(str,flds,FS)
> join(arr) == join(arr,OFS)

The last is probably the most important behavior:

join(arr) == join(arr,OFS)

should always hold.

>
> since there's no way within join() to differentiate between an
> uninitialized variable argument and no argument.

Ok. In the user's space a missing argument and a null value argument are
the same. So the very nature of AWK is that:

join(arr) == join(arr, var) # uninitialized var

To let them behave differently, join() has to be implemented as an
extension. Then it will have the extra capabilities of a true built-in
function.

Regards.

Ed Morton

Feb 15, 2017, 9:53:53 AM
On 2/15/2017 8:28 AM, Manuel Collado wrote:
> El 14/02/2017 17:36, Ed Morton escribió:
>> [...]
>> ... we CANNOT implement what we'd like in a user space join function:
>
> I assume this means "... what YOU would like ...".

No, it means what we ALL have stated we'd like (and which is consistent with
*split()) which is for this function to use OFS as the separator if and only if
the separator argument is missing.

>
>>
>> split(str,flds,var) == split(str,flds,"")
>> join(arr,var) == join(arr,"")
>> split(str,flds) == split(str,flds,FS)
>> join(arr) == join(arr,OFS)
>
> The last is probably the most important behavior:
>
> join(arr) == join(arr,OFS)
>
> should always hold.
>
>>
>> since there's no way within join() to differentiate between an
>> uninitialized variable argument and no argument.
>
> Ok. In the user's space a missing argument and a null value argument are the
> same. So the very nature of AWK is that:
>
> join(arr) == join(arr, var) # uninitialized var
>
> To let them behave differently, join() has to be implemented as an extension.
> Then it will have the extra capabilities of a true built-in function.

Right and on the flip side if it's going to be implemented as a user space
function then it should not assume an unset variable separator argument should
be replaced with OFS, as I've been arguing.

Ed.
>
> Regards.

Kenny McCormack

Feb 15, 2017, 10:26:01 AM
In article <o81ohj$1uka$1...@gioia.aioe.org>,
Manuel Collado <m.co...@domain.invalid> wrote:
...
(Ed says)
>>since there's no way within join() to differentiate between an
>>uninitialized variable argument and no argument.

(And you concur)
>Ok. In the user's space a missing argument and a null value argument are
>the same. So the very nature of AWK is that:
>
> join(arr) == join(arr, var) # uninitialized var
(Note: I refer to this functionality below as "OFS support")

>To let them behave differently, join() has to be implemented as an
>extension. Then it will have the extra capabilities of a true built-in
>function.

Comments:

1) I'm not sure that "OFS support" is all that important, although I
guess it would be nice if it can be done without too much trouble.
2) Amazingly, you can distinguish the two cases. Observe:
$ gawk4 'function fiz(foo) { return foo == 0 && foo == "" }
BEGIN { print fiz();print fiz("");print fiz(0)}'
1
0
0
$
The seemingly silly expression: foo == 0 && foo == ""
which is mathematically meaningless, does, in fact cater to the
uninitialized case. Function fiz() returns 1 (true) when called with
no args. It returns 0 if passed with most other args (but see PS
note below).
3) TAWK has a couple of nifty functions, argcount() and argval(), which
speak well to the case at hand. These would be nice additions to
GAWK.

P.S. I just noticed that if 'bar' is an uninitialized variable, then
fiz(bar) also returns 1. Not sure what this does to the overall argument.

P.P.S. I agree that to really do this right, an extension lib function is
needed, but I don't think Ed considers that to be "user-space". I get the
impression that, eventually, Andy is going to favor us with a good
extension lib solution to this.

--
Trump has normalized hate.

The media has normalized Trump.

Ed Morton

Feb 15, 2017, 11:04:12 AM
On 2/15/2017 9:26 AM, Kenny McCormack wrote:
> In article <o81ohj$1uka$1...@gioia.aioe.org>,
> Manuel Collado <m.co...@domain.invalid> wrote:
> ...
> (Ed says)
>>> since there's no way within join() to differentiate between an
>>> uninitialized variable argument and no argument.
>
> (And you concur)
>> Ok. In the user's space a missing argument and a null value argument are
>> the same. So the very nature of AWK is that:
>>
>> join(arr) == join(arr, var) # uninitialized var
> (Note: I refer to this functionality below as "OFS support")
>
>> To let them behave differently, join() has to be implemented as an
>> extension. Then it will have the extra capabilities of a true built-in
>> function.
>
> Comments:
>
> 1) I'm not sure that "OFS support" is all that important, although I
> guess it would be nice if it can be done without too much trouble.
> 2) Amazingly, you can distinguish the two cases. Observe:
> $ gawk4 'function fiz(foo) { return foo == 0 && foo == "" }
> BEGIN { print fiz();print fiz("");print fiz(0)}'

Sigh. Once again, for the 3rd or 4th time in this thread, those are not the 2
cases under discussion. We are all very well aware of that common test for an
uninitialized variable.

> 1
> 0
> 0
> $
> The seemingly silly expression: foo == 0 && foo == ""
> which is mathematically meaningless, does, in fact cater to the
> uninitialized case. Function fiz() returns 1 (true) when called with
> no args. It returns 0 if passed with most other args (but see PS
> note below).
> 3) TAWK has a couple of nifty functions, argcount() and argval(), which
> speak well to the case at hand. These would be nice additions to
> GAWK.
>
> P.S. I just noticed that if 'bar' is an uninitialized variable, then
> fiz(bar) also returns 1. Not sure what this does to the overall argument.

That IS the entire argument - that there is no way within a user-defined
function to distinguish between a missing argument and an uninitialized variable
as an argument.
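
For anyone who wants to see it, a small sketch reusing the fiz() test quoted above; "bar" is simply a variable that has not been assigned at the point of the second call:

```awk
function fiz(foo) { return foo == 0 && foo == "" }

BEGIN {
    print fiz()       # 1: argument missing
    print fiz(bar)    # 1: argument present but uninitialized, same answer
    bar = ""
    print fiz(bar)    # 0: now a real (null) string
    print fiz(0)      # 0: a real number
}
```

The first two calls are indistinguishable from inside fiz(), which is the whole point.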

Ed.

Kaz Kylheku

Feb 15, 2017, 11:24:59 AM
Not so fast! This clever trick doesn't actually answer the question
"was I called with an explicitly passed argument?".

In fact, it has a somewhat more useful behavior. Try this one:

function lowlevel(arg)
{
    print arg == 0 && arg == ""
}

function highlevel(arg)
{
    lowlevel(arg)
}

BEGIN {
    highlevel()
    highlevel(0)
    highlevel("")
}

Output:

1
0
0

Even though lowlevel's caller passes an argument, that argument is
itself undefined. The undefinedness of highlevel's arg propagates to the
lower level function.

This is useful because it means that your trick can be used to
delegate argument defaulting from high level interface functions
to lower level implementation functions.

If you make a join(input, sep) which can be usefully called as join(x),
then any wrapper functions which rely on join can also have a default
argument corresponding to sep quite easily.

function joinwrapper(a, b, c, sep)
{
# ...
join(whatever, sep)
}


joinwrapper doesn't itself have to contain boilerplate code which
evaluates sep == 0 && sep == "" and calls join in two different ways.
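
A runnable sketch of that delegation, under some assumptions: this join() takes an explicit element count n (unlike the two-argument join(input, sep) above) so it works without PROCINFO, and joinwrapper() here just adds brackets. Both are illustrative, not anyone's posted implementation:

```awk
# Hypothetical join(): uses OFS when sep is missing or uninitialized.
function join(arr, n, sep,    i, out) {
    if (sep == 0 && sep == "") sep = OFS
    for (i = 1; i <= n; i++)
        out = out (i > 1 ? sep : "") arr[i]
    return out
}

# The wrapper passes sep straight through; it needs no defaulting logic.
function joinwrapper(arr, n, sep) {
    return "[" join(arr, n, sep) "]"
}

BEGIN {
    n = split("a b c", parts)
    print joinwrapper(parts, n)         # [a b c]  sep defaulted inside join()
    print joinwrapper(parts, n, "-")    # [a-b-c]
    print joinwrapper(parts, n, "")     # [abc]    explicit null string is honored
}
```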

Ed Morton

Feb 15, 2017, 11:32:29 AM
On 2/15/2017 10:04 AM, Ed Morton wrote:
> On 2/15/2017 9:26 AM, Kenny McCormack wrote:
>> In article <o81ohj$1uka$1...@gioia.aioe.org>,
>> Manuel Collado <m.co...@domain.invalid> wrote:
>> ...
>> (Ed says)
>>>> since there's no way within join() to differentiate between an
>>>> uninitialized variable argument and no argument.
>>
>> (And you concur)
>>> Ok. In the user's space a missing argument and a null value argument are
>>> the same. So the very nature of AWK is that:
>>>
>>> join(arr) == join(arr, var) # uninitialized var
>> (Note: I refer to this functionality below as "OFS support")
>>
>>> To let them behave differently, join() has to be implemented as an
>>> extension. Then it will have the extra capabilities of a true built-in
>>> function.
>>
>> Comments:
>>
>> 1) I'm not sure that "OFS support" is all that important, although I
>> guess it would be nice if it can be done without too much trouble.

FWIW I agree with the above. I really don't care if we fill in an OFS when no
argument is provided or not. There's an excellent reason to use FS when no
separator is provided to split() (if we HAD to provide FS as the separator then
you'd need to manipulate FS to add escaping, since it'd be treated as a dynamic
instead of constant regexp) but no good reason to do anything special when no
separator is provided to a join function, since that argument is always just a
literal string. I personally think for clarity I'd always provide OFS instead of
skipping that argument, but YMMV I suppose and it really doesn't matter either way.

What I do care about is that we should not use OFS when a separator argument IS
provided, even if that argument is an uninitialized variable.

Ed.

Manuel Collado

Feb 15, 2017, 11:41:53 AM
El 15/02/2017 16:26, Kenny McCormack escribió:
> [...]
> Comments:
>
> 1) I'm not sure that "OFS support" is all that important, although I
> guess it would be nice if it can be done without too much trouble.

It is important if join() should be consistent with split() as much as
possible.

> 2) Amazingly, you can distinguish the two cases. Observe:
> $ gawk4 'function fiz(foo) { return foo == 0 && foo == "" }
> BEGIN { print fiz();print fiz("");print fiz(0)}'
> 1
> 0
> 0
> $
> The seemingly silly expression: foo == 0 && foo == ""
> which is mathematically meaningless, does, in fact cater to the
> uninitialized case. Function fiz() returns 1 (true) when called with
> no args. It returns 0 if passed with most other args (but see PS
> note below).
> [...]
> P.S. I just noticed that if 'bar' is an uninitialized variable, then
> fiz(bar) also returns 1. Not sure what this does to the overall argument.

Well, the problem is the split() specification. An omitted third
argument means to use FS, while a <null> third argument means to split
the string into individual characters.

I think this is inconsistent. A NULL VALUE third argument should have
the same effect as an omitted third argument. It is a NULL STRING
third argument which should mean to split into characters.

Alas! The current behavior of split() is carved in stone, and cannot be
changed. The most we can do is to consider a <null> third argument a bad
practice corner case that should be avoided. I.e., a null string should
be explicitly given to split into individual characters.

> 3) TAWK has a couple of nifty functions, argcount() and argval(), which
> speak well to the case at hand. These would be nice additions to
> GAWK.

The extension API of gawk already provides these facilities.

Kenny McCormack

Feb 15, 2017, 11:42:00 AM
In article <o81u2d$4ie$1...@dont-email.me>,
Ed Morton <morto...@gmail.com> wrote:
...
>Sigh. Once again, for the 3rd or 4th time in this thread, those are not the 2
>cases under discussion. We are all very well aware of that common test for an
>uninitialized variable.

For what it is worth, I'd never seen it before reading this thread.

Anyway, is that even such a big deal? I mean, why *would* you pass in an
uninitialized variable as an arg? Put on your best pedagogic hat and
answer this question:

Which is better form?

A) somefunc(x)
B) x="";somefunc(x)

where, in case A), x has never been assigned a value. Wouldn't you reduce
the grade of the student who turned in case A)? Wouldn't you give the
student who turned in case B a pat on the head?

The point being that my function returns 0 in case B.
And case A is bad form.

--
I've been watching cat videos on YouTube. More content and closer to
the truth than anything on Fox.

Kaz Kylheku

Feb 15, 2017, 11:44:03 AM
On 2017-02-15, Kenny McCormack <gaz...@shell.xmission.com> wrote:
If you don't program with undefined variables, such that they occur
only by accident, not much.

If you want to be able to write join(arr, undef_var) such that it
means the same as join(arr, ""), whereas join(arr) means
something else, like join(arr, OFS), then you're out of luck.

If you mostly code with defined variables, except for certain
idioms like seen[$0]++ and whatnot, you're okay too.

The unifying principle here is that an undefined value of the
second argument of join has a single, well-defined meaning.

You have effectively extended the language with a new "context".
Normally, an undefined value can be in an arithmetic context,
producing zero. Or else in a string context, producing "".
Now you have a "join context": undefined as second arg to
join produces OFS.

Someone who isn't aware of "join context" will read join(x, y)
and assume that the regular rules apply: y is either blank
or zero depending on whether join uses it arithmetically
or string-wise.

--
TXR Programming Language: http://nongnu.org/txr
Music DIY Mailing List: http://www.kylheku.com/diy
ADA MP-1 Mailing List: http://www.kylheku.com/mp1

Kaz Kylheku

Feb 15, 2017, 11:52:44 AM
Given that, oh, the + operator in 42 + x will use 0 when x
is undefined, the idea that join(arr, x) must not substitute
OFS when x is undefined isn't very compelling.

Kenny McCormack

Feb 15, 2017, 11:53:19 AM
In article <o820ce$g42$1...@gioia.aioe.org>,
Manuel Collado <m.co...@domain.invalid> wrote:
...
>> 3) TAWK has a couple of nifty functions, argcount() and argval(), which
>> speak well to the case at hand. These would be nice additions to
>> GAWK.
>
>The extension API of gawk already provides these facilities.

You think I don't know that?

But Ed doesn't consider extension libs to be "user-space".
And we're mostly doing all this for Ed's benefit.

P.S. No, actually, the extension lib facility *doesn't* provide it for
(Ed's definition of) user-space functions. It only provides it for
functions written in C (i.e., in an extension lib).

--
BigBusiness types (aka, Republicans/Conservatives/Independents/Liberatarians/whatevers)
don't hate big government. They *love* big government as a means for them to get
rich, sucking off the public teat. What they don't like is *democracy* - you know,
like people actually having the right to vote and stuff like that.

Kaz Kylheku

Feb 15, 2017, 12:02:16 PM
On 2017-02-15, Kenny McCormack <gaz...@shell.xmission.com> wrote:
> In article <o81u2d$4ie$1...@dont-email.me>,
> Ed Morton <morto...@gmail.com> wrote:
> ...
>>Sigh. Once again, for the 3rd or 4th time in this thread, those are not the 2
>>cases under discussion. We are all very well aware of that common test for an
>>uninitialized variable.
>
> For what it is worth, I'd never seen it before reading this thread.
>
> Anyway, is that even such a big deal? I mean, why *would* you pass in an
> uninitialized variable as an arg?

Because use of uninitialized vars is idiomatic, at least in small
Awk programs, and you like that about Awk?

> Put on your best pedagogic hat and
> answer this question:
>
> Which is better form?
>
> A) somefunc(x)
> B) x="";somefunc(x)

Of course undefined vars are garbage for software development.
But they make for terse awk one-liners.

The kicker is that some people might want join() to be useful
in such terse one-liners.

The thing is, with the undef -> OFS substitution, we cannot
say that it is *not* useful for terse one-liners. It *can*
be exploited to shorten some code.

The main counterargument boils down to a failure to
mimic the behavior of split(), which distinguishes
a missing arg from a present, undefined one.

Andrew Schorr

Feb 15, 2017, 12:43:23 PM
On Wednesday, February 15, 2017 at 10:26:01 AM UTC-5, Kenny McCormack wrote:
> P.P.S. I agree that to really do this right, an extension lib function is
> needed, but I don't think Ed considers that to be "user-space". I get the
> impression that, eventually, Andy is going to favor us with a good
> extension lib solution to this.

I have it on my to-do list, but it will be tough if we can't all agree on sensible behavior. Since extension API functions can easily distinguish between an absent argument and a null argument, this issue is moot. I'm more concerned with how to handle arrays where some indices have been deleted...

Regards,
Andy

Ed Morton

Feb 15, 2017, 1:12:21 PM
FWIW I have no problem at all with an extension function mapping an un-provided
argument to OFS or a null string or whatever else anyone thinks is reasonable.
OFS is as good a value to use as any in that case and it does match $1=$1 behavior.

IMHO this is the functionality we would want from an extension function:

function arr2str(flds, seps,    clrSortedIn, idx, prevIdx, fldNr, outStr) {
    if ( ! ("sorted_in" in PROCINFO) ) {
        PROCINFO["sorted_in"] = "@ind_num_asc"
        clrSortedIn = 1
    }

    if ( isarray(seps) ) {
        # an array of separators
        for (idx in flds) {
            outStr = outStr (fldNr++ && (prevIdx in seps) ? seps[prevIdx] : "") flds[idx]
            prevIdx = idx
        }
    }
    else {
        # fixed scalar separator
        # (pseudocode: only an extension function can make this test)
        seps = (magic_argument_present_test == true ? seps : OFS)
        for (idx in flds) {
            outStr = outStr (fldNr++ ? seps : "") flds[idx]
        }
    }

    if ( clrSortedIn ) {
        delete PROCINFO["sorted_in"]
    }

    return outStr
}

so it simply loops through the indices of "flds" in whatever order the caller
specifies with PROCINFO["sorted_in"] (@ind_num_asc by default), inserting the
"seps" value after each correspondingly indexed "flds" value (null string if
index absent from seps array). The above will work no matter if the user wants
to sort on numeric indices or string values or call a user-defined function or
anything else.

If you WANT to use the above to undo the result of a *split() just call the
above then add the preceding and succeeding seps[] values outside of the
function call:

str = seps[0] arr2str(flds,seps) seps[length(flds)]

but please don't build that functionality/assumption into the function because
there is no way in general to identify the "seps" index that comes before or
after the first "flds" index (consider how to get a pre/post seps value if
PROCINFO["sorted_in"]="@val_str_desc" or similar) and trying to special case it
for numeric indices with @ind_num_asc would be a completely unnecessary hack
given how utterly trivial it is for a caller to add the pre/post seps references
outside of the function call.
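
A minimal sketch of that caller-side reassembly, assuming gawk (for patsplit()). The helper between() is a made-up, stripped-down stand-in for arr2str() that only handles indices 1..n; the sample string and regexp are arbitrary:

```awk
# Hypothetical helper: concatenate flds[1..n], putting seps[i-1] before flds[i].
function between(flds, seps, n,    i, out) {
    for (i = 1; i <= n; i++)
        out = out (i > 1 ? seps[i-1] : "") flds[i]
    return out
}

BEGIN {
    str = "ab12cd345ef"
    n = patsplit(str, flds, /[0-9]+/, seps)

    # Leading and trailing separators are added by the caller, not the function.
    print (seps[0] between(flds, seps, n) seps[n] == str)   # 1: exact round trip

    # Edit the fields, then reassemble the same way.
    for (i = 1; i <= n; i++) flds[i] = "<" flds[i] ">"
    print seps[0] between(flds, seps, n) seps[n]            # ab<12>cd<345>ef
}
```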

Note that that suggestion is also consistent with $1=$1 behavior as that action
strips leading/trailing spaces ("separators") off $0 with the default FS:

$ echo ' a b ' | awk '{$1=$1; print "<"$0">"}'
<a b>

Regards,

Ed.

Ed Morton

Feb 15, 2017, 1:27:56 PM
On 2/15/2017 10:41 AM, Kenny McCormack wrote:
> In article <o81u2d$4ie$1...@dont-email.me>,
> Ed Morton <morto...@gmail.com> wrote:
> ...
>> Sigh. Once again, for the 3rd or 4th time in this thread, those are not the 2
>> cases under discussion. We are all very well aware of that common test for an
>> uninitialized variable.
>
> For what it is worth, I'd never seen it before reading this thread.

It's there, multiple times... re-read it if you like (or not, either way...).

> Anyway, is that even such a big deal? I mean, why *would* you pass in an
> uninitialized variable as an arg? Put on your best pedagogic hat and
> answer this question:
>
> Which is better form?

There is no universal "better form". Which piece of code is "better" than some
other piece of code depends on the context in which it is developed and to be
used. It's only in 2 identical contexts with identical requirements that you can
decide which code is "better".

>
> A) somefunc(x)
> B) x="";somefunc(x)

A is idiomatic awk, B is not, therefore B is better (in some contexts). Unless
the script is large (whatever that means) I don't want to see every string
variable in an awk script unnecessarily initialized to "" and I don't believe
for a second that you do so either.

> where, in case A), x has never been assigned a value. Wouldn't you reduce
> the grade of the student who turned in case A)? Wouldn't you give the
> student who turned in case B a pat on the head?

No, I'd explain to the case B student that this is awk, not C, and the
extraneous code he wrote is just taking up space and cycles.

> The point being that my function returns 0 in case B.
> And case A is bad form.

No, case A is idiomatic awk.

Ed.

Ed Morton

Feb 15, 2017, 1:29:59 PM
On 2/15/2017 12:27 PM, Ed Morton wrote:
> On 2/15/2017 10:41 AM, Kenny McCormack wrote:
>> In article <o81u2d$4ie$1...@dont-email.me>,
>> Ed Morton <morto...@gmail.com> wrote:
>> ...
>>> Sigh. Once again, for the 3rd or 4th time in this thread, those are not the 2
>>> cases under discussion. We are all very well aware of that common test for an
>>> uninitialized variable.
>>
>> For what it is worth, I'd never seen it before reading this thread.
>
> It's there, multiple times... re-read it if you like (or not, either way...).
>
>> Anyway, is that even such a big deal? I mean, why *would* you pass in an
>> uninitialized variable as an arg? Put on your best pedagogic hat and
>> answer this question:
>>
>> Which is better form?
>
> There is no universal "better form". Which piece of code is "better" than some
> other piece of code depends on the context in which it is developed and to be
> used. It's only in 2 identical contexts with identical requirements that you can
> decide which code is "better".
>
>>
>> A) somefunc(x)
>> B) x="";somefunc(x)
>
> A is idiomatic awk, B is not, therefore B is better (in some contexts). Unless

A is idiomatic awk, B is not, therefore _A_ is better (in some contexts).
Unless

I wish there was some way to edit these posts! Maybe it's time to retire usenet
and all jump onto stackoverflow....

Ed Morton

Feb 15, 2017, 1:43:28 PM
On 2/15/2017 11:43 AM, Andrew Schorr wrote:
<snip>
> Since extension API functions can easily distinguish between an absent argument and a null argument

How hard would it be to provide an extension function that provides the
capability to distinguish between un-initialized vs un-provided arguments to
user-defined functions?

I think that'd be far more useful than the join() function we're discussing
since anyone can easily write a sensible join() function in user space but we
currently CANNOT write a user space function that behaves differently with
un-initialized vs un-provided arguments.

Ed.

Joe User

Feb 15, 2017, 2:03:46 PM
Ed Morton wrote:

> This function will clearly, efficiently, and consistently put "seps"
> values between the corresponding "flds" values for any array indices using
> any sorting algorithm specified in PROCINFO (numerically ascending indices
> by default):
>
> function arr2str(flds, seps, clrSortedIn, idx, fldNr, outStr) {
<snip>

The function you supplied is a completely reasonable way to do things.

Andrew Schorr

Feb 16, 2017, 4:24:39 AM
On Wednesday, February 15, 2017 at 1:43:28 PM UTC-5, Ed Morton wrote:
> How hard would it be to provide an extension function that provides the
> capability to distinguish between un-initialized vs un-provided arguments to
> user-defined functions?

I'm not sure exactly what you have in mind, but I don't think it's possible. Can you please elaborate?

> I think that'd be far more useful than the join() function we're discussing
> since anyone can easily write a sensible join() function in user space but we
> currently CANNOT write a user space function that behaves differently with
> un-initialized vs un-provided arguments.

You are starting to convince me that the join extension is a bad idea, since your idea of what it should do departs significantly from what I had in mind.

Regards,
Andy

Andrew Schorr

Feb 16, 2017, 4:26:19 AM
On Wednesday, February 15, 2017 at 1:12:21 PM UTC-5, Ed Morton wrote:
> IMHO this is the functionality we would want from an extension function:
...

That differs significantly from what I had in mind. Perhaps it's best not to do it at all and instead let people define their own functions.

Regards,
Andy

Manuel Collado

Feb 16, 2017, 5:29:16 AM
Hi, all.

I regret the tone of this long discussion. The spirit of Free Software
is that anybody can develop a tool, and offer it for free if he/she
wants to.

If the offered tool doesn't suit your taste, just don't use it. But
please don't blame anybody for trying to help other people.

Ed Morton

Feb 16, 2017, 10:58:12 AM
Manuel - no-one is blaming anyone for anything, we're just trying to figure out
what a "join()" tool should do before Andy decides if he'll provide one or not.
We do not want gawk cluttered up with extensions no-one will use or to miss
opportunities to get generally useful tools instead of specialized ones for a
similar purpose.

Ed.

Ed Morton

Feb 16, 2017, 11:16:03 AM
On 2/16/2017 3:24 AM, Andrew Schorr wrote:
> On Wednesday, February 15, 2017 at 1:43:28 PM UTC-5, Ed Morton wrote:
>> How hard would it be to provide an extension function that provides the
>> capability to distinguish between un-initialized vs un-provided arguments to
>> user-defined functions?
>
> I'm not sure exactly what you have in mind, but I don't think it's possible. Can you please elaborate?

I fairly frequently want to be able to write a function that has a default value
it uses when an argument is missing, just like many builtin functions do. To
write that function I'd need to be able to write:

function foo(x) {
    if (ispresent(x)) {
        print "present"
    }
    else {
        print "absent"
    }
}

so I could get this behavior (where var is an uninitialized variable):

BEGIN {
    foo(var) # -> "present"
    foo() # -> "absent"
}

We have no way to write that function today as there is no "ispresent()"
function to test if an argument was provided to the function or not. We used to
have a similar problem with distinguishing between arrays and scalars before
"isarray()" was provided.

>
>> I think that'd be far more useful than the join() function we're discussing
>> since anyone can easily write a sensible join() function in user space but we
>> currently CANNOT write a user space function that behaves differently with
>> un-initialized vs un-provided arguments.
>
> You are starting to convince me that the join extension is a bad idea, since your idea of what it should do departs significantly from what I had in mind.

That's fine but I do think there's value in the function I proposed because then
we could just write:

print str2arr(flds)

instead of manually writing this loop or similar which is a staple of awk scripts:

for (i=1; i<=length(flds); i++) {
    printf "%s%s", flds[i], (i<length(flds) ? OFS : ORS)
}

Note that the above also covers the case where we just want to print the fields
as we could now do:

split($0,arr)
print str2arr(arr)

instead of this or similar:

for (i=1; i<=NF; i++) {
    printf "%s%s", $i, (i<NF ? OFS : ORS)
}

so we're talking about reducing code that occurs in a HUGE number of scripts.

The function I described can also be used for trivially recombining the result
of any *split() if/when desired:

split(str,flds,FS,seps)
str = seps[0] str2arr(flds,seps) seps[length(flds)]

which I think was the original intent of the function you were considering
providing.

Thanks for listening!

Ed.
>
> Regards,
> Andy
>

Ed Morton

Feb 16, 2017, 11:31:27 AM
On 2/16/2017 10:15 AM, Ed Morton wrote:

> print str2arr(flds)

and of course that should be named arr2str() throughout. I keep forgetting I
can't edit these posts and posting it without carefully reading the damn thing
for typos first. I'm going to repost the corrected version. sorry for the noise.


Ed Morton

Feb 16, 2017, 11:39:38 AM
On 2/16/2017 3:24 AM, Andrew Schorr wrote:
> On Wednesday, February 15, 2017 at 1:43:28 PM UTC-5, Ed Morton wrote:
>> How hard would it be to provide an extension function that provides the
>> capability to distinguish between un-initialized vs un-provided arguments to
>> user-defined functions?
>
> I'm not sure exactly what you have in mind, but I don't think it's possible. Can you please elaborate?

Updated from previous similar post, please disregard that previous one, sorry.

I fairly frequently want to be able to write a function that has a default value
it uses when an argument is missing as is the case with the function we've been
discussing and just like many builtin functions do. To write that function I'd
need to be able to write:

function foo(x) {
    if (ispresent(x)) {
        print "present"
    }
    else {
        print "absent"
    }
}

so I could get this behavior (where var is an uninitialized variable):

BEGIN {
    foo(var) # -> "present"
    foo() # -> "absent"
}

We have no way to write that function today as there is no "ispresent()"
function to test if an argument was provided to the function or not. We used to
have a similar problem with distinguishing between arrays and scalars before
"isarray()" was provided.

>
>> I think that'd be far more useful than the join() function we're discussing
>> since anyone can easily write a sensible join() function in user space but we
>> currently CANNOT write a user space function that behaves differently with
>> un-initialized vs un-provided arguments.
>
> You are starting to convince me that the join extension is a bad idea, since your idea of what it should do departs significantly from what I had in mind.

That's fine but I do think there's value in the function I proposed because then
we could just write:

print arr2str(flds)

instead of manually writing this loop or similar which is a staple of awk scripts:

for (i=1; i<=length(flds); i++) {
    printf "%s%s", flds[i], (i<length(flds) ? OFS : ORS)
}

Note that the above also covers the case where we just want to print the fields
as we could now do:

split($0,arr)
print arr2str(arr)

instead of this or similar:

for (i=1; i<=NF; i++) {
    printf "%s%s", $i, (i<NF ? OFS : ORS)
}

so we're talking about reducing code that occurs in a HUGE number of scripts.

The function I described can also be used for trivially recombining the result
of any *split() if/when desired:

split(str,flds,FS,seps)
str = seps[0] arr2str(flds,seps) seps[length(flds)]

which I think was the original intent of the function you were considering
providing.
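
For readers who want something concrete, here is a minimal sketch of what such a function might look like. It is a simplified, hypothetical version, not the arr2str() posted earlier in this thread: it assumes the array is indexed 1..n without gaps, ignores PROCINFO["sorted_in"], and falls back to OFS wherever seps has no entry. The shell wrapper name arr2str_demo is made up for the example.

```shell
arr2str_demo() {
  awk '
    # Hypothetical sketch: join flds[1..n], placing seps[i] between
    # flds[i] and flds[i+1] when it exists, otherwise OFS.
    function arr2str(flds, seps,    n, i, k, out) {
      for (k in flds) n++                  # portable element count
      for (i = 1; i <= n; i++) {
        out = out flds[i]
        if (i < n) out = out ((i in seps) ? seps[i] : OFS)
      }
      return out
    }
    BEGIN {
      split("a b c", f, " ")
      print arr2str(f)                     # seps omitted: joined with OFS
      s[1] = "-"; s[2] = "."
      print arr2str(f, s)                  # explicit separators
    }'
}
arr2str_demo
```

Testing "(i in seps)" rather than the value of seps[i] lets a caller pass an empty string as a real separator while still getting OFS when the seps array is omitted entirely.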

btw I'm using the name arr2str() rather than join() as arr2str() is extremely
clear and specific to flattening one specific array (this function focuses on
the flds array, the seps array is collateral and optional) whereas join()
implies merging of 2 or more equally important data streams. What are you
join()-ing if you only have 1 array? There's no definition of a "join" operation
I'm aware of that doesn't require at least 2 similar data streams so IMHO join()
would be a bad name for this function.

Kenny McCormack

Feb 16, 2017, 12:30:05 PM
In article <o84j4j$8sm$1...@dont-email.me>,
Ed Morton <morto...@gmail.com> wrote:
...
>I fairly frequently want to be able to write a function that has a default
>value it uses when an argument is missing, just like many builtin
>functions do. To write that function I'd need to be able to write:
>
>function foo(x) {
> if (ispresent(x)) {
> print "present"
> }
> else {
> print "absent"
> }
>}
>
>so I could get this behavior (where var is an uninitialized variable):
>
>BEGIN {
> foo(var) # -> "present"
> foo() # -> "absent"
>}

If I may intercede here, I think I can explain this for Andy's benefit.

I concur with Ed's explanation above, and would add that I had previously
posted something similar - that is, I provided a function defined as:

function fiz(foo) { return foo == 0 && foo == "" }

and showed how if you call it without an argument, i.e., print fiz(), it
returns 1 (indicating that the variable foo is uninitialized), while if you
pass it any literal value, it returns 0. However, Ed then responded that
the problem with fiz() as defined above is that if you call it with an
uninitialized variable, it behaves as if you had passed it no variable at
all. I.e.,
print fiz()
and
print fiz(bar)

both return 1 if bar has not been assigned a value prior to the call. I
think (and correct me if I am wrong), Ed's whole point is that he would
like to be able to distinguish between the above two cases (that is, fiz()
and fiz(bar)).
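
The two indistinguishable cases can be reproduced with a short sketch. fiz() is copied from the definition above; the wrapper name fiz_demo and the variable name bar are only illustrative, and nothing here needs gawk specifically.

```shell
fiz_demo() {
  awk '
    # fiz() as defined above: true only when foo is still uninitialized
    function fiz(foo) { return foo == 0 && foo == "" }
    BEGIN {
      # order: no argument, unassigned variable, empty string, zero
      print fiz(), fiz(bar), fiz(""), fiz(0)
    }'
}
fiz_demo   # prints: 1 1 0 0
```

The first two results are the same, which is exactly the ambiguity under discussion: from inside the function, a missing argument and an unassigned variable look identical.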

Now, going further afield, I noted that I think this is kind of
silly/irrelevant, because how often do you pass an uninitialized variable
to a function? Manuel responded that doing so is, of course, bad
programming style in a "real" program, but that it is common in AWK
"one-liners". Now, having thought about it for a day or so, here is where
I disagree with Manuel. Suppose I am writing a one-liner (shell command)
and want to split a string by single characters. Surely, I will write:

$ gawk 'BEGIN {split("ABCDEFGHI",A,"");for (i in A) print i,A[i]}'

Are you seriously suggesting that any sensible person would write:

$ gawk 'BEGIN {split("ABCDEFGHI",A,x);for (i in A) print i,A[i]}'

instead?
(relying on the uninitialized variable x to pass "" to the split() function)

I think not. So, in the final analysis, I do think this whole concern (of
Ed/Manuel) really is much-ado-about-nothing. But, still, I do find it
interesting from a theoretical point-of-view. Hence my continued posting
in this thread.

Next, leaving aside the political side of this (the "What is user-space?"
question) since that is fodder for a whole 'nother post, let me just say
that Ed would probably be satisfied if you (Andy) wrote a join() function
as an extension library function, and were able to get it into the gawk
distribution.

Now, let me be clear on this so that there is no chance of misunderstanding.
I am *NOT* saying it has to be in the gawk core (i.e., in the main gawk
executable), but it does have to be in the distribution - so that like,
say, inplace and filefuncs, it gets installed on the typical system when
the system administrator does either "make install" or "apt-get install gawk"
(or similar for your distribution or whatever they do on Windows systems).
And thus the typical gawk program can simply do "-l join" or @load "join"
and be good to go.

The point - and again, let me be as clear as possible - is that if it were
standardized by being in the distribution, Ed would probably be happy (or
happy enough) even if its behavior wasn't 100% to his liking. The fact
that it is standardized would allow us to finally put this baby to bed, and
then we could go on to argue/discuss other things.

And, finally, note that, now that Arnold no longer reads/posts to this
newsgroup, you, Andy, are the only person here who is (or might be) capable
of doing so. So, we are entrusting this task to you...

--
> No, I haven't, that's why I'm asking questions. If you won't help me,
> why don't you just go find your lost manhood elsewhere.

CLC in a nutshell.

Kenny McCormack

Feb 16, 2017, 12:51:04 PM
In article <o84nis$ks6$1...@news.xmission.com>,
Kenny McCormack <gaz...@shell.xmission.com> wrote:
...
>Next, leaving aside the political side of this (the "What is user-space?"
>question) since that is fodder for a whole 'nother post, let me just say
>that Ed would probably be satisfied if you (Andy) wrote a join() function
>as an extension library function, and were able to get it into the gawk
>distribution.

Let me also add, as a clarification/amplification of one of my earlier
posts, that I still do think it would be good if GAWK were to implement, in
the core language, the argcount() and argval() functions from TAWK. These
would, inter alia, solve Ed's main problem in this thread.

To clarify, in TAWK:
argcount() returns the number of args passed to the current function.

argval(N) returns the value of the N'th arg passed to the current function.

Thus, a user-defined function can quickly tell how many args were passed in.
Furthermore, using argval(), you can process an arbitrary number of args.

Thus, for example, a max() function could be written (in TAWK) as:

--- Cut Here ---
function max() {
    local ac = argcount(),mx = argval(1), i, tmp

    if (ac == 0) return "Invalid call - no args at all!"
    for (i=2; i<=ac; i++)
        if ((tmp = argval(i)) > mx)
            mx = tmp
    return mx
}
BEGIN {
    print max()
    print max(1)
    print max(-1,-5)
    print max(-1,5,-5)
}
--- Cut Here ---

and invoked as: awkw -w -f max

The point of the -w will be clear if you are familiar with TAWK.

--
Modern Conservative: Someone who can take time out from demanding more
flag burning laws, more abortion laws, more drug laws, more obscenity
laws, and more police authority to make warrantless arrests to remind
us that we need to "get the government off our backs".

Ed Morton

Feb 16, 2017, 1:44:50 PM
On 2/16/2017 11:51 AM, Kenny McCormack wrote:
> In article <o84nis$ks6$1...@news.xmission.com>,
> Kenny McCormack <gaz...@shell.xmission.com> wrote:
> ...
>> Next, leaving aside the political side of this (the "What is user-space?"
>> question) since that is fodder for a whole 'nother post, let me just say
>> that Ed would probably be satisfied if you (Andy) wrote a join() function
>> as an extension library function, and were able to get it into the gawk
>> distribution.
>
> Let me also add, as a clarification/amplification of one of my earlier
> posts, that I still do think it would be good if GAWK were to implement, in
> the core language, the argcount() and argval() functions from TAWK. These
> would, inter alia, solve Ed's main problem in this thread.
>
> To clarify, in TAWK:
> argcount() returns the number of args passed to the current function.
>
> argval(N) returns the value of the N'th arg passed to the current function.
>
> Thus, a user-defined function can quickly tell how many args were passed in.

Yes since any optional argument is always at the end of the args list that would
be better functionality to have than a simple "ispresent()" check on an argument
and would cover that functionality. It would have many uses, including allowing
us to write general-purpose debugging functions in user space to simply print
all the args passed to a function. I'm surprised argval is a function instead of
an array for symmetry to ARGV but nbd.

Ed.

Kenny McCormack

Feb 16, 2017, 5:06:05 PM
In article <o84rri$an6$1...@dont-email.me>,
Ed Morton <morto...@gmail.com> wrote:
...
>Yes since any optional argument is always at the end of the args list
>that would be better functionality to have than a simple "ispresent()"
>check on an argument and would cover that functionality. It would have
>many uses, including allowing us to write general-purpose debugging
>functions in user space to simply print all the args passed to a
>function. I'm surprised argval is a function instead of an array for
>symmetry to ARGV but nbd.

I like that! And you wouldn't need a separate way of getting the count,
since you could just do: length(arr)

But note that proper implementation will take some thought. You can't just
make it a global variable, since functions can call other functions. It
would have to magically spring into existence for each function.

Maybe have a function "getargs", that populates an array. So, you would do:

function foo() { getargs(A);for (i in A) print i,A[i] }

This is not perfect, but maybe something like that.

--
The randomly chosen signature file that would have appeared here is more than 4
lines long. As such, it violates one or more Usenet RFCs. In order to remain
in compliance with said RFCs, the actual sig can be found at the following URL:
http://user.xmission.com/~gazelle/Sigs/DanaC

Digi

Feb 21, 2017, 9:50:50 AM
hello

1. gsub(/[.]|[-]/,... ?

more correct will be: gsub(/[\.\-]/,.....

2. I call the function that you are talking about _retab(). You can find it inside the file that I shared at http://www.filedropper.com/arr

In user space this is the most effective way to merge two arrays with sequential numeric indices.

patsplit(t, A, /.../, B)
d = _retab(A, B)
t == d

The function also has extra parameters that you can easily understand by exploring the source.
The shared file also contains other functions like _retab():
_reta(A)
_retas(A,"separator")

ps. Just ignore the size of the shared source ;)

D

Janis Papanagnou

Feb 21, 2017, 11:03:33 AM
On 21.02.2017 15:50, Digi wrote:
> hello
>
> 1. gsub(/[.]|[-]/,... ?
>
> more correct will be: gsub(/[\.\-]/,.....

There is a comparative for 'correct'?

So this one is "most correct" then?

gsub(/[.-]/,.....

Or why is one "more correct" than the other?
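
A quick way to check that the alternation and the single bracket expression match the same characters is sketched below, using the date string from the start of the thread. The wrapper name strip_demo is made up. Inside a bracket expression "." is literal, and "-" is literal when it comes last.

```shell
strip_demo() {
  # $1: string to clean; prints the result of both regex forms side by side
  printf '%s\n' "$1" | awk '{
    a = $0; b = $0
    gsub(/[.]|[-]/, "", a)   # alternation of two one-character brackets
    gsub(/[.-]/, "", b)      # one bracket expression holding both characters
    print a, b
  }'
}
strip_demo "2016-04.01.010101"
```

Both columns should come out identical, so the choice between the forms is about readability and speed rather than correctness.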

Janis

> [...]

Digi

Feb 21, 2017, 3:49:33 PM
it's faster. your variation is more correct.

Andrew Schorr

Feb 22, 2017, 5:26:57 PM
On Thursday, February 16, 2017 at 5:06:05 PM UTC-5, Kenny McCormack wrote:
> Ed Morton wrote:
> ...
> >Yes since any optional argument is always at the end of the args list
> >that would be better functionality to have than a simple "ispresent()"
> >check on an argument and would cover that functionality. It would have
> >many uses, including allowing us to write general-purpose debugging
> >functions in user space to simply print all the args passed to a
> >function. I'm surprised argval is a function instead of an array for
> >symmetry to ARGV but nbd.
>
> I like that! And you wouldn't need a separate way of getting the count,
> since you could just do: length(arr)
>
> But note that proper implementation will take some thought. You can't just
> make it a global variable, since functions can call other functions. It
> would have to magically spring into existence for each function.
>
> Maybe have a function "getargs", that populates an array. So, you would do:
>
> function foo() { getargs(A);for (i in A) print i,A[i] }
>
> This is not perfect, but maybe something like that.

Thanks for the discussion. If these ideas were easy to implement, I'd take a swing at it. But my cursory inspection of the codebase did not convince me that it wouldn't be pretty messy to address these issues. I'm not convinced that it's terribly important to be able to tell the difference between an argument that wasn't passed and the case where the passed argument was an uninitialized variable. I've written tens of thousands of lines of AWK, and this has never been problematic for me. In terms of full variadic function support, I think this would be cool, but, as I mentioned, it's probably a pain to implement. This has also never been a huge problem for me. I think the AWKish way to deal with this is to pass an array with a variable number of elements instead of passing a variable number of arguments. Of course, anybody is always welcome to fork the code and patch in these features. If there's an elegant solution, perhaps the developers could be convinced to adopt the patch.
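
To make the array-passing approach concrete, here is a minimal sketch of a max() that takes an array and an element count instead of a variable number of arguments. The names max_arr, vals and max_demo are hypothetical, and the values are compared numerically.

```shell
max_demo() {
  awk '
    # Hypothetical helper: largest value in arr[1..n], compared as numbers.
    # With n < 1 it returns an uninitialized value.
    function max_arr(arr, n,    i, mx) {
      if (n < 1) return
      mx = arr[1] + 0
      for (i = 2; i <= n; i++)
        if (arr[i] + 0 > mx) mx = arr[i] + 0
      return mx
    }
    BEGIN {
      n = split("-1 5 -5", vals, " ")
      print max_arr(vals, n)
      n = split("-1 -5", vals, " ")
      print max_arr(vals, n)
    }'
}
max_demo
```

Passing the count explicitly avoids relying on length(arr), which not every awk supports for arrays.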

Regards,
Andy

Kenny McCormack

Feb 22, 2017, 6:14:44 PM
In article <3b73bc96-f24f-4707...@googlegroups.com>,
Andrew Schorr <asc...@telemetry-investments.com> wrote:
...
>Thanks for the discussion. If these ideas were easy to implement, I'd take
>a swing at it. But my cursory inspection of the codebase did not convince
>me that it wouldn't be pretty messy to address these issues. I'm not
>convinced that it's terribly important to be able to tell the difference
>between an argument that wasn't passed and the case where the passed
>argument was an uninitialized variable. I've written tens of thousands of
>lines of AWK, and this has never been problematic for me. In terms of full
>variadic function support, I think this would be cool, but, as I
>mentioned, it's probably a pain to implement. This has also never been a
>huge problem for me. I think the AWKish way to deal with this is to pass
>an array with a variable number of elements instead of passing a variable
>number of arguments. Of course, anybody is always welcome to fork the code
>and patch in these features. If there's an elegant solution, perhaps the
>developers could be convinced to adopt the patch.

Comments:

1) Upon more thought, I think I'd stick with the original idea of
argcount() and argval(N). The array idea is slightly weird.
2) Yes, I agree it is not all that necessary - we've certainly managed
to live our lives until now without it - but it would be nice.
3) It seems to me that the trick to this is to figure out a way to get
the caller's arg list. I.e., for function B, called from A, to be
able to inspect the args that were passed to A. Then it would be
straightforward to write argcount() and argval() as extension
library functions.

Now, in an extension library, we have a function declared like
this:

awk_bool_t (*api_get_argument)(awk_ext_id_t id, size_t count,
awk_valtype_t wanted,
awk_value_t *result);

This is actually hidden by a macro, but the point is that the
ext_id obviously points the function at a particular parameter
block. It seems to me that if I could somehow get that similar
pointer "uplevel", so to speak, then I could report on my caller's
args.

Further, in my short analysis of this, it seems to me that the
debugger does this sort of functionality. I looked at debug.c some
and it seems like a little reverse engineering of that would be
fruitful.

--
The randomly chosen signature file that would have appeared here is more than 4
lines long. As such, it violates one or more Usenet RFCs. In order to remain
in compliance with said RFCs, the actual sig can be found at the following URL:
http://user.xmission.com/~gazelle/Sigs/Pearls

Andrew Schorr

Feb 22, 2017, 7:28:32 PM
On Wednesday, February 22, 2017 at 6:14:44 PM UTC-5, Kenny McCormack wrote:
> 3) It seems to me that the trick to this is to figure out a way to get
> the caller's arg list. I.e., for function B, called from A, to be
> able to inspect the args that were passed to A. Then it would be
> straightforward to write argcount() and argval() as extension
> library functions.
>
> Now, in an extension library, we have a function declared like
> this:
>
> awk_bool_t (*api_get_argument)(awk_ext_id_t id, size_t count,
> awk_valtype_t wanted,
> awk_value_t *result);
>
> This is actually hidden by a macro, but the point is that the
> ext_id obviously points the function at a particular parameter
> block. It seems to me that if I could somehow get that similar
> pointer "uplevel", so to speak, then I could report on my caller's
> args.
>
> Further, in my short analysis of this, it seems to me that the
> debugger does this sort of functionality. I looked at debug.c some
> and it seems like a little reverse engineering of that would be
> fruitful.

I didn't spend that long thinking about how to do this, but you're welcome to take a shot at it. I don't think the extensions are really going "uplevel" though. The stack frame is for the external function call itself. And the code path for implementing AWK functions is quite different. Overall, it's a question of cost vs benefit. I don't think we really NEED this, and I think the implementation is probably messy (but I'm not sure), and then there's always the issue of polluting the namespace and making gawk ever more complicated and difficult to maintain.

Regards,
Andy