fast method for concatenating strings in array or $0?

Josef Frank

Jun 5, 2015, 10:00:51 AM

To filter out some columns from a text table I presently do the following:

# start awk script
BEGIN {
    ... # generate array "selected_columns" as filter list
    ... # by reading in from text file or something else, e.g.:
    n = split("1 210 340", wanted)
    for (k = 1; k <= n; k++)
        selected_columns[wanted[k]] = 1    # field numbers become the keys
}
{
    outputline = ""
    for (i = 1; i <= NF; i++) {
        if (i in selected_columns) {
            outputline = outputline "\t" $i
        }
    }
    print outputline
}
# end awk script

where "selected_columns" denotes an array whose keys have been created
to contain just the numbers of the fields I want to be output. This list
can easily contain about 1e4 or more items.

The approach above means that I have to repeatedly concatenate the
present content of the line as processed so far with the next field in
question after testing it for being in the selection list. I assume this
means continuous reallocation as the string grows. At least time
measurements suggest quadratic behaviour (see http://snag.gy/gaAt9.jpg)
supporting this assumption.

Question: Is there some method available that would allow to avoid
growing the output string field by field in pure awk (maybe one that
also works in other implementations, not only gawk) and would provide a
mechanism like python's or tcl's "join" functions:
"\t".join([fields[i] for i in selected_columns])

(thus outsourcing the work to C and allocating the memory only once; see
https://paolobernardi.wordpress.com/2012/11/06/python-string-concatenation-vs-list-join/)

Best Josef

Ed Morton

Jun 5, 2015, 10:12:30 AM

No. The only alternative to constructing a string from the identified fields and
printing it later is to print each field as it's identified.
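
A minimal sketch of the print-as-you-go alternative, assuming tab-separated input; the column list "1 3" and the sample rows are made up for illustration. Each selected field goes straight to awk's output buffer, so no intermediate string grows:

```shell
# Sketch only: the column list and the sample input are illustrative.
print_selected() {
  awk '
  BEGIN {
      FS = OFS = "\t"
      n = split("1 3", wanted, " ")
      for (k = 1; k <= n; k++)
          selected_columns[wanted[k]] = 1   # field numbers become keys
  }
  {
      sep = ""
      for (i = 1; i <= NF; i++) {
          if (i in selected_columns) {
              printf "%s%s", sep, $i        # emit at once, no growing string
              sep = OFS
          }
      }
      printf "\n"
  }'
}

printf 'a\tb\tc\nd\te\tf\n' | print_selected
```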

Ed.

glen herrmannsfeldt

Jun 5, 2015, 2:35:25 PM

Josef Frank <josef...@gmx.li> wrote:

(snip)

> The approach above means that I have to repeatedly concatenate the
> present content of the line as processed so far with the next field in
> question after testing it for being in the selection list. I assume this
> means continuous reallocation as the string grows. At least time
> measurements suggest quadratic behaviour (see http://snag.gy/gaAt9.jpg)
> supporting this assumption.

The general solution to such problems is "divide and conquer".

> Question: Is there some method available that would allow to avoid
> growing the output string field by field in pure awk (maybe one that
> also works in other implementations, not only gawk) and would provide a
> mechanism like python's or tcl's "join" functions:
> "\t".join([fields[i] for i in selected_columns])

If you want to concatenate N strings, concatenate groups of sqrt(N) strings
into sqrt(N) longer strings, then concatenate those strings.

This can be done recursively, but most likely one level is enough.
(You can round off sqrt(N) and it still works.)

This reduces the number of times the long result string is reallocated
from O(N) to O(sqrt(N)).
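
A rough sketch of the one-level version, assuming a hypothetical helper chunked_join(a, n, sep) (not a built-in). Whether it actually helps depends on how a given awk implements string concatenation, so measure before relying on it:

```shell
# Sketch only: chunked_join() is a hypothetical helper, not an awk built-in.
demo_chunked_join() {
  awk '
  # Join a[1..n] with sep via about sqrt(n) intermediate chunks.
  function chunked_join(a, n, sep,    size, i, inchunk, nout, chunk, out) {
      size = int(sqrt(n)) + 1                  # rounded chunk length
      for (i = 1; i <= n; i++) {
          chunk = (inchunk++ ? chunk sep : "") a[i]
          if (inchunk == size || i == n) {     # chunk full, or input exhausted
              out = (nout++ ? out sep : "") chunk
              chunk = ""
              inchunk = 0
          }
      }
      return out
  }
  { n = split($0, f, " "); print chunked_join(f, n, "-") }'
}

echo "a b c d e f g" | demo_chunked_join
```

Only the short chunks are grown element by element; the long result is appended to once per chunk.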

-- glen

Kaz Kylheku

Jun 5, 2015, 4:57:00 PM

On 2015-06-05, glen herrmannsfeldt <g...@ugcs.caltech.edu> wrote:
> If you want to concatenate N strings, concatenate sqrt(n) strings
> into sqrt(n) longer strings, then concatenate those strings.
>
> This can be done recursively, but most likely one level is enough.
> (You can round off sqrt(n) and it still works.)
>
> This reduces the reallocations from O(N) to O(sqrt(N))

If you just catenate fixed-size groups of strings (such as pairs,
then pairs of catenated pairs and so on), you get it down to a logarithmic
number of passes. (Think: similar to merge sort.)

You don't need recursion, just iteration: take the array of N strings, and join
groups of up to K strings (K is some selected constant) to form an array of ceil(N/K)
strings. If the resulting array has exactly one string, you are done.
Otherwise, repeat.

Since the array shrinks, it can be done in place in one pass from the
bottom of the array on up.
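
The iterative scheme could look roughly like this; group_join and its parameters are names invented for the sketch, and the input array is overwritten as the passes proceed:

```shell
# Sketch only: group_join() is a hypothetical helper; it destroys its input array.
demo_group_join() {
  awk '
  # Repeatedly join groups of up to k neighbours until one string is left.
  function group_join(a, n, sep, k,    i, j, m, s) {
      if (k < 2) k = 2                  # k = 1 would never shrink the array
      while (n > 1) {
          m = 0
          for (i = 1; i <= n; i += k) {
              s = a[i]
              for (j = i + 1; j < i + k && j <= n; j++)
                  s = s sep a[j]
              a[++m] = s                # m <= i, so no unread slot is clobbered
          }
          n = m                         # the array shrank to ceil(n/k) strings
      }
      return (n == 1 ? a[1] : "")
  }
  { n = split($0, f, " "); print group_join(f, n, "-", 3) }'
}

echo "a b c d e f g" | demo_group_join
```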

Andrew Schorr

Jun 5, 2015, 11:28:40 PM

On Friday, June 5, 2015 at 10:00:51 AM UTC-4, Josef Frank wrote:
> Question: Is there some method available that would allow to avoid
> growing the output string field by field in pure awk (maybe one that
> also works in other implementations, not only gawk) and would provide a
> mechanism like python's or tcl's "join" functions:
> "\t".join([fields[i] for i in selected_columns])

One could imagine writing an extension function that joins together the elements of an array. So the awk code would look like this:

BEGIN {
    nf = split("1 210 340", selected_columns)
}

{
    delete x
    for (i = 1; i <= nf; i++)
        x[i] = $selected_columns[i]
    print array_join(x, 1, nf)
}

where array_join concatenates the 1..nf elements of the array x.

Or another approach might be:

BEGIN {
    nf = split("1 210 340", selected_columns)
}

{
    split($0, f)
    print array_join_selected(f, selected_columns)
}

where array_join_selected concatenates the elements of array f that are in the selected_columns array.
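
To pin down the intended semantics only (this is a plain awk stand-in, not the C extension, and it brings no speed-up), array_join_selected could behave like the sketch below. The use of OFS as the separator and the column list "1 3" are assumptions:

```shell
# Sketch only: a pure-awk stand-in for the proposed extension function.
demo_join_selected() {
  awk '
  # Join f[sel[1]], f[sel[2]], ... with OFS (separator choice is assumed).
  function array_join_selected(f, sel,    k, out) {
      for (k = 1; k in sel; k++)
          out = (k > 1 ? out OFS : "") f[sel[k]]
      return out
  }
  BEGIN { nf = split("1 3", selected_columns, " ") }
  {
      split($0, f)
      print array_join_selected(f, selected_columns)
  }'
}

echo "a b c d" | demo_join_selected
```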

The implementation of those functions is left as an exercise for the reader. But I believe this can achieve essentially the same performance as Python.

The challenge here is really to design a good general-purpose join function that is most applicable. That requires a bit more thought. But I believe the problem can be solved in an extension. A contribution would be welcome.

Regards,
Andy

Andrew Schorr

Jun 7, 2015, 9:44:48 AM

On Friday, June 5, 2015 at 11:28:40 PM UTC-4, Andrew Schorr wrote:
> The implementation of those functions is left as an exercise for the reader. But I believe this can achieve essentially the same performance as Python.
>
> The challenge here is really to design a good general-purpose join function that is most applicable. That requires a bit more thought. But I believe the problem can be solved in an extension. A contribution would be welcome.

I'll sweeten the pot. If we can form a consensus on the best API, I'll code it up. So how should this work? One idea:

array_join(f, [sep])

where f is an array with integer subscripts ranging from 1 through N and string values. If "sep" is omitted, a single space would be used by default.

I think that should be fairly easy to implement in C.

Regards,
Andy

Janis Papanagnou

Jun 7, 2015, 10:03:16 AM

You are talking about a "sequence of strings (stored in array) to single
string" concatenation, right? - So why not the straightforward design to
just concatenate the strings, and if one wants a separator to specify it
explicitly using that optional 'sep' argument.

Janis

Kenny McCormack

Jun 7, 2015, 10:14:53 AM

In article <ml1iv2$e5o$1...@news.m-online.net>,
Janis Papanagnou <janis_pa...@hotmail.com> wrote:
...

>You are talking about a "sequence of strings (stored in array) to single
>string" concatenation, right? - So why not the straightforward design to
>just concatenate the strings, and if one wants a separator to specify it
>explicitly using that optional 'sep' argument.

Isn't that what he said?


Janis Papanagnou

Jun 7, 2015, 10:16:08 AM

On 07.06.2015 15:44, Andrew Schorr wrote:
>

> I'll sweeten the pot. If we can form a consensus on the best API, I'll
> code it up. So how should this work? One idea:
>
> array_join(f, [sep])
>
> where f is an array with integer subscripts ranging from 1 through N and
> string values. If "sep" is omitted, a single space would be used by
> default.
>
> I think that should be fairly easy to implement in C.

WRT "fairly easy": have you considered how to handle 'f' if it's not a flat
array but a multi-dimensional (and potentially inhomogeneous) one?

Janis

>
> Regards, Andy
>
>
>

Janis Papanagnou

Jun 7, 2015, 10:22:42 AM

On 07.06.2015 16:14, Kenny McCormack wrote:
> In article <ml1iv2$e5o$1...@news.m-online.net>,
> Janis Papanagnou <janis_pa...@hotmail.com> wrote:
> ...
>> You are talking about a "sequence of strings (stored in array) to single
>> string" concatenation, right? - So why not the straightforward design to
>> just concatenate the strings, and if one wants a separator to specify it
>> explicitly using that optional 'sep' argument.
>
> Isn't that what he said?

My question was: Why a space as default? - I always find it a bit strange
if the [presumed] intent - to do a plain concatenation - would be a special
case where one needs to write: array_join(f, "") while array_join(f)
would silently add some separator. - Yes, that's just a detail, but since
we've been asked to discuss design I'd have liked to hear the rationale for
Andrew's suggestion.

Janis

Joe User

Jun 7, 2015, 11:13:25 AM

On Sun, 07 Jun 2015 06:44:46 -0700, Andrew Schorr wrote:

> I'll sweeten the pot. If we can form a consensus on the best API, I'll
> code it up. So how should this work? One idea:
>
> array_join(f, [sep])

In my site library, I have these two functions:

cat_array(a, delimiter, maxi)

cat_arrayi(a, delimiter)

If delimiter is not supplied, a space is used.

If maxi is used, indices greater than maxi are not considered.

cat_arrayi concatenates all array indices.
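
The source of these functions was not posted, so the following is only a guess at what cat_array might look like given the description above (values at indices 1, 2, ... are joined; an omitted delimiter means a space; an omitted maxi means no upper limit):

```shell
# Sketch only: a guessed implementation, not the actual site-library code.
demo_cat_array() {
  awk '
  function cat_array(a, delimiter, maxi,    i, out) {
      if (delimiter == "") delimiter = " "     # default separator
      for (i = 1; (i in a) && (maxi == "" || i <= maxi); i++)
          out = (i > 1 ? out delimiter : "") a[i]
      return out
  }
  BEGIN {
      split("x y z", a, " ")
      print cat_array(a)             # all elements, space-separated
      print cat_array(a, ",", 2)     # indices above 2 are ignored
  }'
}

demo_cat_array
```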

--
The last stage of grief isn't
acceptance. It's exploitation.

-- Lewis Black

Ed Morton

Jun 7, 2015, 11:48:35 AM

The function should be named "join()" and it should do the opposite of
"split()". In other words if I can do:

str = "a.b;c"
numElts = split(str,flds,/[[:punct:]]/,seps)

to create flds and seps from str, then I should be able to do:

numElts = join(str,flds,seps)

to recreate "str" from flds and seps.

The rules for what to do based on the values of the 3rd arg "seps" are:

If it is missing then use the value of OFS as the separator, just like
recompiling $0 when we assign to a field.
If it is a string then use that string as the separator between all elements.
If it is an array then use those array elements as the separators in order.

Regards,

Ed.

Ed Morton

Jun 7, 2015, 12:18:40 PM

Here's a mockup of how join() should work:

$ cat tst.awk
function join(str,arr,sep,    cnt) {
    if (isarray(sep)) {
        str = sep[0]
        for (cnt=1; cnt in arr; cnt++) {
            str = str arr[cnt] sep[cnt]
        }
    }
    else {
        sep = (sep=="" ? OFS : sep)
        for (cnt=1; cnt in arr; cnt++) {
            str = (cnt>1? str sep : "") arr[cnt]
        }
    }
    STR = str
    return cnt-1    # cnt stops one past the last element
}
BEGIN {
    OFS = "-"

    split("a.b;c",flds,/[[:punct:]]/,seps)

    join(str,flds,seps)
    print STR

    join(str,flds,"")
    print STR

    join(str,flds,"<->")
    print STR
}

$ awk -f tst.awk
a.b;c
a-b-c
a<->b<->c

The above uses a global var STR to simulate being able to populate the arg str
from within the function, and it uses sep=="" to simulate being able to test for
a missing argument.

I'm in 2 minds about the value of having join() return the number of elements in
the array vs having it just return the string (so the synopsis would then be
"str=join(flds,seps)" ) - beyond the symmetry with split(), I think the main
value in having it return a count might just be that it'd then be able to return
a negative value if some error occurred such as someone passing a string instead
of an array as the first arg.

Ed.

Ed Morton

Jun 7, 2015, 12:38:35 PM

After more thought about how existing functions work and going through some
scenarios, I don't believe there is any value in having join() return a count
even for reporting error cases (since they'd all produce actual awk failures and
error messages) so I'm now proposing

str = join(flds,seps)

with the rules above for how seps is handled:

If it is missing then use the value of OFS as the separator, just like
recompiling $0 when we assign to a field.
If it is a string then use that string as the separator between all elements.
If it is an array then use those array elements as the separators in order.

e.g. with sep=="" simulating the case of seps being absent:

$ cat tst.awk
function join(arr,sep,    str, cnt) {

    if (isarray(sep)) {
        str = sep[0]
        for (cnt=1; cnt in arr; cnt++) {
            str = str arr[cnt] sep[cnt]
        }
    }
    else {
        sep = (sep=="" ? OFS : sep)
        for (cnt=1; cnt in arr; cnt++) {
            str = (cnt>1? str sep : "") arr[cnt]
        }
    }

    return str

}
BEGIN {
    OFS = "-"

    split("a.b;c",flds,/[[:punct:]]/,seps)

    print join(flds,seps)
    print join(flds,"")
    print join(flds,"<->")
}
$

$ awk -f tst.awk
a.b;c
a-b-c
a<->b<->c

Regards,

Ed.

Kenny McCormack

Jun 7, 2015, 3:02:23 PM

In article <ml1k3h$ns1$1...@news.m-online.net>,

Yes, I get it now.
Good point - there should not be any default separator.

But, conversely, Ed actually makes a pretty good point that the default
should be OFS - as it is in most other similar situations.

--
For instance, Standard C says that nearly all extensions to C are prohibited. How
silly! GCC implements many extensions, some of which were later adopted as part of
the standard. If you want these constructs to give an error message as
“required” by the standard, you must specify ‘--pedantic’, which was
implemented only so that we can say “GCC is a 100% implementation of the
standard”, not because there is any reason to actually use it.

Josef Frank

Jun 7, 2015, 3:19:24 PM

On 6/5/2015 16:12 Ed Morton wrote:

> No. The only alternative to constructing a string from the identified
> fields and printing it later is to print each field as it's identified.

Actually this was the original approach I used. I just thought I could
save some time by eliminating some I/O function call overhead by first
building up the output string in memory instead of printing any field
one by one.

The funny part is: It barely works! Below 5,000 elements almost only
startup time matters; above that it mostly saves less than 10%, up
to a certain limit (e.g. if using substrings of five characters this
limit is reached after about 250,000 concatenated elements). Beyond that
the naive version (just printing each field as it's processed) wins
with a margin that increases with the number of elements, supported by awk
output buffering and the disk caching provided by the OS, as this
behaves linearly.

Here are some measurements (times in µs; just ignore the fake precision,
only the first two or three digits are reliable):

array length    concatenated output (µs)    printing each field (µs)
           0                       31813                       31076
       1,000                       32915                       33989
       2,000                       34317                       34854
       5,000                       37982                       40399
      10,000                       43199                       46708
      20,000                       52683                       59929
      50,000                       84844                       97340
     100,000                      138034                      159136
     200,000                      267914                      284408
     300,000                      433229                      402483
     500,000                      879778                      653846
   1,000,000                     2686592                     1326441
   2,000,000                     9103545                     2481502
   3,000,000                    19225792                     3672013

Best
Josef

Ed Morton

Jun 7, 2015, 6:10:34 PM

On 6/7/2015 2:02 PM, Kenny McCormack wrote:
> In article <ml1k3h$ns1$1...@news.m-online.net>,
> Janis Papanagnou <janis_pa...@hotmail.com> wrote:
>> On 07.06.2015 16:14, Kenny McCormack wrote:
>>> In article <ml1iv2$e5o$1...@news.m-online.net>,
>>> Janis Papanagnou <janis_pa...@hotmail.com> wrote:
>>> ...
>>>> You are talking about a "sequence of strings (stored in array) to single
>>>> string" concatenation, right? - So why not the straightforward design to
>>>> just concatenate the strings, and if one wants a separator to specify it
>>>> explicitly using that optional 'sep' argument.
>>>
>>> Isn't that what he said?
>>
>> My question was: Why a space as default? - I always find it a bit strange
>> if the [presumed] intent - to do a plain concatenation - would be a special
>> case where one needs to write: array_join(f, "") while array_join(f)
>> would silently add some separator. - Yes, that's just a detail, but since
>> we've been asked to discuss design I'd have liked to hear the rationale for
>> Andrew's suggestion.
>>
>> Janis
>>
>
> Yes, I get it now.
> Good point - there should not be any default separator.
>
> But, conversely, Ed actually makes a pretty good point that the default
> should be OFS - as it is in most other similar situations.
>

... and for symmetry with split() which splits using FS by default. In case it's
not obvious, I think symmetry with split() is key to a successful join() function.

Ed.

Janis Papanagnou

Jun 8, 2015, 1:11:59 AM

On 08.06.2015 00:10, Ed Morton wrote:
> On 6/7/2015 2:02 PM, Kenny McCormack wrote:
>> In article <ml1k3h$ns1$1...@news.m-online.net>,
>> Janis Papanagnou <janis_pa...@hotmail.com> wrote:
>>>> In article <ml1iv2$e5o$1...@news.m-online.net>,
>>>> Janis Papanagnou <janis_pa...@hotmail.com> wrote:
>>>> ...
>>>>> You are talking about a "sequence of strings (stored in array) to single
>>>>> string" concatenation, right? - So why not the straightforward design to
>>>>> just concatenate the strings, and if one wants a separator to specify it
>>>>> explicitly using that optional 'sep' argument.
>>>>

>>> My question was: Why a space as default? - I always find it a bit strange
>>> if the [presumed] intent - to do a plain concatenation - would be a special
>>> case where one needs to write: array_join(f, "") while array_join(f)
>>> would silently add some separator. - Yes, that's just a detail, but since
>>> we've been asked to discuss design I'd have liked to hear the rationale for
>>> Andrew's suggestion.
>>

>> Yes, I get it now.
>> Good point - there should not be any default separator.
>>
>> But, conversely, Ed actually makes a pretty good point that the default
>> should be OFS - as it is in most other similar situations.

I considered that point as well, but...

> ... and for symmetry with split() which splits using FS by default. In case
> it's not obvious, I think symmetry with split() is key to a successful join()
> function.

...in my book, the intent of such a definition would be to have some
semantically inverse function; but that isn't the case. We could have
strings a[1]="Hello world" and a[2]="Hi there" in the array and after
join and split it would be "Hello", "world", "Hi", "there" instead.

It certainly depends on from which function set context we look at it.
Do we want (as the original posting asks for) a string concatenation?
Or do we want some flattening/unflattening function - note my question
about the implementation provided multi-dimensional arrays - where we'd
have a set of complementary functions. Or this (space or FS) suggestion
which is somewhere in between.

I have no strong opinion against the FS variant, but, all that said, I
can't really say I'd consider it a neat design.

Janis

Ed Morton

Jun 8, 2015, 10:14:31 AM

You'd only have that if you chose to use " " as the separator on the join and
split. If you chose to use RS or some other separator that doesn't exist in the
array contents instead then you'd have:

a[1]="Hello world"; a[2]="Hi there"
-> join(a,RS) -> s="Hello world\nHi there"
-> split(s,a,RS) -> a[1]="Hello world"; a[2]="Hi there"

I do get your point that if you use the default args for those functions and
they have their default values and the array contents contained those values
then you would create a different array than you started with but that's
completely consistent with what happens today when records are recompiled:

$1="Hello world"; $2="Hi there" -> NF=2, $0="Hello world Hi there"
$0=$0 -> NF=4; $1="Hello"; $2="world"; $3="Hi"; $4="there"
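
The recompilation behaviour can be checked with a throwaway script (default FS and OFS assumed; the dummy input line only serves to trigger the rule):

```shell
# Sketch only: shows that rebuilding and re-splitting a record is not a round trip.
recompile_demo() {
  echo x | awk '{
      $1 = "Hello world"; $2 = "Hi there"   # two fields with embedded spaces
      print NF ": " $0                      # record rebuilt with OFS
      $0 = $0                               # force re-splitting on FS
      print NF ": " $1 "|" $2 "|" $3 "|" $4
  }'
}

recompile_demo
```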

>
> It certainly depends on from which function set context we look at it.
> Do we want (as the original posting asks for) a string concatenation?
> Or do we want some flattening/unflattening function - note my question
> about the implementation provided multi-dimensional arrays - where we'd
> have a set of complementary functions. Or this (space or FS) suggestion
> which is somewhere in between.

I don't think join() should do anything with multi-dimensional arrays. split()
doesn't create them, it would complicate/muddy the API, and the usefulness is
limited. Maybe if someday someone decides a multi_split() function that takes 2
separators is useful then we could create a matching multi_join function that
does the inverse but until then....

Ed.

Andrew Schorr

Jun 8, 2015, 7:41:03 PM

Thanks everyone for the interesting discussion. I think Ed's idea that the join function should be the inverse of split has a certain elegance to it. But in that case, shouldn't the default separator between elements be FS rather than OFS? My initial suggestion of using a space as the default separator was based on the Python join function's behavior, but I agree that was a bad idea. For awk, FS or OFS make more sense as the default.

Regarding the handling of arrays containing sub-arrays: I think that's an error. The argument array in such a case would violate the function's specification. The array must contain scalar values with sequential integer indices increasing from 1. The only question is whether to issue a warning message and return an empty string, or to throw a fatal error?

We could also have different variants in the join library if there are good cases to be made for multiple functions with different approaches.

Regards,
Andy

Ed Morton

Jun 8, 2015, 11:51:58 PM

On 6/8/2015 6:41 PM, Andrew Schorr wrote:
> Thanks everyone for the interesting discussion. I think Ed's idea that the join function should be the inverse of split has a certain elegance to it. But in that case, shouldn't the default separator between elements be FS rather than OFS? My initial suggestion of using a space as the default separator was based on the Python join function's behavior, but I agree that was a bad idea. For awk, FS or OFS make more sense as the default.

Andy - no because FS can be a regexp so THEN what would you do and as well as
being symmetrical with split() it's intuitive for join() when combining array
elements into a string to be consistent with the way fields are combined into $0
when you assign to a field.

awk reading a record splits a string on FS into $1, etc.
=> split(string,arr) splits a string on FS into arr[1], etc.

awk doing $1=$1 combines $1, etc. into a string using OFS as separator
=> join(arr) combines arr[1], etc. into a string using OFS as separator
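
A quick illustration with a regexp FS (the values "[.;]" and "-" are just examples): the input separators cannot be reproduced from FS, whereas OFS is a plain string that can always be inserted:

```shell
# Sketch only: FS is a regexp here, so it cannot serve as an output separator.
rebuild_with_ofs() {
  awk '
  BEGIN { FS = "[.;]"; OFS = "-" }
  {
      $1 = $1      # assigning to a field rebuilds $0 using OFS
      print
  }'
}

printf 'a.b;c\n' | rebuild_with_ofs
```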

Thanks for taking this on!

Ed.

Ed Morton

Jun 9, 2015, 12:01:36 AM

On 6/8/2015 6:41 PM, Andrew Schorr wrote:

> Thanks everyone for the interesting discussion. I think Ed's idea that the join function should be the inverse of split has a certain elegance to it. But in that case, shouldn't the default separator between elements be FS rather than OFS? My initial suggestion of using a space as the default separator was based on the Python join function's behavior, but I agree that was a bad idea. For awk, FS or OFS make more sense as the default.
>
> Regarding the handling of arrays containing sub-arrays: I think that's an error. The argument array in such a case would violate the function's specification. The array must contain scalar values with sequential integer indices increasing from 1. The only question is whether to issue a warning message and return an empty string, or to throw a fatal error?

Oh and for that I'd throw a fatal error just like you would for trying to use an
array in a string context. Best I can tell it'd be just a plain old coding error
and nothing useful would come of having it return a NULL string and keep
running. Unless someone else has a use-case....

> We could also have different variants in the join library if there are good cases to be made for multiple functions with different approaches.

Right - multi_split() and multi_join() if a need arose....

Ed.

> Regards,
> Andy
>

Andrew Schorr

Jun 10, 2015, 11:09:00 AM

On Monday, June 8, 2015 at 11:51:58 PM UTC-4, Ed Morton wrote:
> Andy - no because FS can be a regexp so THEN what would you do and as well as
> being symmetrical with split() it's intuitive for join() when combining array
> elements into a string to be consistent with the way fields are combined into $0
> when you assign to a field.

That's a good point. So OFS it is. When I have some free time, I'll take a stab at implementing this.

Regards,
Andy

Ed Morton

Jun 10, 2015, 10:22:50 PM

Great, and will the synopsis be:

str = join(fldsArr[,sepStr|sepsArr])

with the rules for how the 2nd arg is handled:

If it is missing then use the value of OFS as the separator.
If it is a string then use that string as the separator.
If it is an array then use those array elements as the separators.

With the intent of that last one being to recompile a string that
split(str,fldsArr,fieldSep,sepsArr) started with so it'd concatenate
seps[0] flds[1] seps[1] ... flds[N] seps[N]?

Regards,

Ed.

Andrew Schorr

Jun 11, 2015, 7:17:04 PM

On Wednesday, June 10, 2015 at 10:22:50 PM UTC-4, Ed Morton wrote:
> Great, and will the synopsis be:
>
> str = join(fldsArr[,sepStr|sepsArr])
>
> with the rules for how the 2nd arg is handled:
>
> If it is missing then use the value of OFS as the separator.
> If it is a string then use that string as the separator.
> If it is an array then use those array elements as the separators.
>
> With the intent of that last one being to recompile a string that
> split(str,fldsArr,fieldSep,sepsArr) started with so it'd concatenate
> seps[0] flds[1] seps[1] ... flds[N] seps[N]?

I think that makes the most sense. Does anybody have a different suggestion for how this should work?

Regards,
Andy

Joep van Delft

Jun 12, 2015, 4:48:20 PM

On Thu, 11 Jun 2015 16:17:02 -0700 (PDT)
Andrew Schorr <asc...@telemetry-investments.com> wrote:

> Does anybody have a different
> suggestion for how this should work?

Just wanted to say that I care about this new feature, and that
Ed's proposal seems to elegantly strike the balance between 'too much'
and 'too little'.

Thanks,

Joep

Andrew Schorr

Jun 25, 2015, 10:38:35 PM

On Friday, June 12, 2015 at 4:48:20 PM UTC-4, Joep van Delft wrote:
> Just wanted to say that I care about this new feature, and that
> Ed's proposal seems to elegantly strike the balance between 'too much'
> and 'too little'.

Agreed. And I haven't forgotten about this. I started to code it up, but then realized that the extension API is a bit lacking for doing this optimally. So we may want to enhance the extension API a bit to make this easier to implement. I will update once I have more info.

Regards,
Andy

Andrew Schorr

Feb 16, 2017, 12:54:05 PM

On Wednesday, June 10, 2015 at 10:22:50 PM UTC-4, Ed Morton wrote:
> Great, and will the synopsis be:
>
> str = join(fldsArr[,sepStr|sepsArr])
>
> with the rules for how the 2nd arg is handled:
>
> If it is missing then use the value of OFS as the separator.
> If it is a string then use that string as the separator.
> If it is an array then use those array elements as the separators.
>
> With the intent of that last one being to recompile a string that
> split(str,fldsArr,fieldSep,sepsArr) started with so it'd concatenate
> seps[0] flds[1] seps[1] ... flds[N] seps[N]?

This behavior does not match the more recent discussion about unpatsplit. So you can see why I am confused...

Regards,
Andy

Kenny McCormack

Feb 16, 2017, 1:03:19 PM

In article <0ed55eb3-66c9-4a37...@googlegroups.com>,

Two comments:
1) It doesn't really matter (what the exact specifications for this
proposed function are). As I argue in a previous post (about an
hour ago), what Ed really wants is something standardized. I'm
sure that whatever you come up with, will be fine.
2) It is normal that there will be "specification drift" in a thread
like this. I think different people have proposed various versions
of a specification and even the same people have proposed, over
time, different specifications.

--
Faced with the choice between changing one's mind and proving that there is
no need to do so, almost everyone gets busy on the proof.

- John Kenneth Galbraith -

Ed Morton

Feb 16, 2017, 2:01:32 PM

The above describes what an unpatsplit() would do but I now realize an
unpatsplit() is just one possible application of a more widely applicable
arr2str() function with which anyone can define their own trivial user-space
function:

function unpatsplit(flds,seps) {
return seps[0] arr2str(flds,seps) seps[length(flds)]
}

if they really want an unpatsplit() function, but that's probably no longer
necessary given how trivial its code is.
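
A portable sketch of that relationship; since POSIX awk has neither isarray() nor a guaranteed length(array), the element count n is passed explicitly and the array-separator variant gets its own hypothetical name, arr2str_seps. The sample flds and seps values are made up:

```shell
# Sketch only: arr2str_seps() and the explicit count n are assumptions.
demo_unpatsplit() {
  awk '
  # flds[1] seps[1] flds[2] ... seps[n-1] flds[n]
  function arr2str_seps(flds, seps, n,    i, out) {
      for (i = 1; i <= n; i++) {
          out = out flds[i]
          if (i < n) out = out seps[i]
      }
      return out
  }
  # Add the leading and trailing separators around the joined fields.
  function unpatsplit(flds, seps, n) {
      return seps[0] arr2str_seps(flds, seps, n) seps[n]
  }
  BEGIN {
      n = 3
      flds[1] = "a"; flds[2] = "b"; flds[3] = "c"
      seps[0] = "["; seps[1] = "."; seps[2] = ";"; seps[3] = "]"
      print arr2str_seps(flds, seps, n)
      print unpatsplit(flds, seps, n)
  }'
}

demo_unpatsplit
```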

We can also do so much more with a general purpose arr2str() than just recombine
an array that's the result of a previous *split(), e.g. to print the contents of
an array ordered by the numerically decreasing contents (not indices) of the
array, separated by commas using the arr2str() function I defined becomes simply:

PROCINFO["sorted_in"]="@val_num_desc"
print arr2str(arr,",")

Want to print an array by increasing string indices with OFS between values?
That's trivial too:

PROCINFO["sorted_in"]="@ind_str_asc"
print arr2str(arr)

Like I say, vastly more applications than just an un*split().

You could make a case for an optional 3rd arg for the sorting order like
existing gawk sorting functions have so you don't NEED to populate PROCINFO to
specify a sorting order:

print arr2str(arr,",","@val_num_desc")
print arr2str(arr,OFS,"@ind_str_asc")

That would be a very sensible enhancement to what's been discussed so far.

Regards,

Ed.
>
> Regards,
> Andy
>