gawk gensub - using backref sub-expressions in a function?

Janis Papanagnou

Apr 1, 2013, 7:10:56 AM

In GNU awk I want to substitute all matching sub-strings by a calculated
value of each sub-expression, as outlined here:

buf = gensub (/([0-9]+)\/([0-9]+)/, func(\\1,\\2), "g", buf)

This does not work, it gives syntax errors at the backref positions; it
seems that backrefs are only available if part of a substitution string.
Or am I missing something? Is such a replacement possible at all in gawk
without looping and doing each match separately?

Janis

pk

Apr 1, 2013, 7:20:29 AM

Writing it as follows does not give an error, but it doesn't do what it's
expected to do either (as a side note, it appears you cannot have a
function called "func"):

echo 'a 10/23 f' | gawk '
function f(a, b) { return a + b }

{
    buf = $0
    buf = gensub(/([0-9]+)\/([0-9]+)/, f("\\1", "\\2"), "g", buf)
    print buf
}'

The above prints "a 0 f" but I guess the expected result should be "a 33 f".
A quick test shows that f() _is_ invoked, but it receives "\1" and "\2" as
arguments, so it seems awk invokes it before doing any match and
replacement, which kind of makes sense, since if you do

do_something(10, 44, calculate(1,2))

obviously calculate() must be evaluated before do_something() is invoked.

Janis Papanagnou

Apr 1, 2013, 7:38:12 AM

On 01.04.2013 13:20, pk wrote:
> On Mon, 01 Apr 2013 13:10:56 +0200, Janis Papanagnou
> <janis_pa...@hotmail.com> wrote:
>
>> In GNU awk I want to substitute all matching sub-strings by a calculated
>> value of each sub-expression, as outlined here:
>>
>> buf = gensub (/([0-9]+)\/([0-9]+)/, func(\\1,\\2), "g", buf)
>>
>> This does not work, it gives syntax errors at the backref positions; it
>> seems that backrefs are only available if part of a substitution string.
>> Or am I missing something? Is such a replacement possible at all in gawk
>> without looping and doing each match separately?
>
> Writing it as follows does not give error, but it doesn't do what it's
> expected to do either (as a side note, it appears you cannot have a
> function called "func"):

Oh! Actually I didn't try the name 'func', just wanted to make the example
most clear by that name. (BTW; is not being able to use the name "func" a
bug? I don't recall having seen it as a reserved word or syntactic keyword.)

>
> echo 'a 10/23 f' | gawk '
>
> function f(a, b) {return a + b}
>
> { buf = $0
> buf = gensub (/([0-9]+)\/([0-9]+)/, f("\\1","\\2"), "g", buf)
> print buf
> }'
>
> The above prints "a 0 f" but I guess the expected result should be "a 33 f".
> A quick test shows that f() _is_ invoked, but it receives "\1" and "\2" as
> arguments, so it seems awk invokes it before doing any match and
> replacement, which kind of makes sense, since if you do
>
> do_something(10, 44, calculate(1,2))
>
> obviously calculate() must be evaluated before do_something() is invoked.

Indeed, it makes no sense otherwise.

And there's no eval to work around it. So the question remains: Is such a
replacement possible at all in gawk without looping and doing each match
separately? - I suppose the answer is "No!", then.

Thanks!

Janis

Robert Figura

Apr 1, 2013, 7:51:07 AM

On Mon, 01 Apr 2013 13:38:12 +0200
Janis Papanagnou <janis_pa...@hotmail.com> wrote:

> On 01.04.2013 13:20, pk wrote:
> > (as a side note, it appears you cannot have a
> > function called "func"):

> Oh! Actually I didn't try the name 'func', just wanted to make the example
> most clear by that name. (BTW; is not being able to use the name "func" a
> bug? I don't recall to have seen it as reserved word or syntactic keyword.)

It's a working shortcut for "function". I got it somewhere from the manual.

func f() { print }
{ f() }

Kind Regards,
Robert Figura

cha...@cs.tu-berlin.de

Apr 1, 2013, 8:06:38 AM

Janis Papanagnou <janis_pa...@hotmail.com> wrote:

: In GNU awk I want to substitute all matching sub-strings by a calculated

s=gensub(/\[!([^;]+);([^]]+)\]/,anchor("","\\1","\\2"),"g",s)

works for me as expected, at least from 3.1.7 to 4.0.74, on FreeBSD.

I think "string" is crucial here? Or is it that gensub essentially just
operates on the (string) result of the supplied function?

HTH

- --------------------------------chelImQo'----------------------------------- -
Sebastian F. Mix, Irenenstrasse 21a, D-10317 Berlin, Tel: ++4930 521 1034, /(a\
++176 511 92 357 cha...@furry.de GCode3.12 GCS/S d?- s+:- a E--- C+(+) \p)/
USX+ P- L- W++ N+++ w--- M- !V PS+++ Y+ PGP+ 5+ X++ R-- b++(+) e+ h+ r-- y*

Robert Figura

Apr 1, 2013, 9:10:38 AM

On Mon, 1 Apr 2013 12:06:38 +0000 (UTC)
cha...@cs.tu-berlin.de wrote:

> Janis Papanagnou <janis_pa...@hotmail.com> wrote:
>
> : In GNU awk I want to substitute all matching sub-strings by a calculated
> : value of each sub-expression, as outlined here:
>
> : buf = gensub (/([0-9]+)\/([0-9]+)/, func(\\1,\\2), "g", buf)
>
> : This does not work, it gives syntax errors at the backref positions; it
> : seems that backrefs are only available if part of a substitution string.
> : Or am I missing something? Is such a replacement possible at all in gawk
> : without looping and doing each match separately?
>
> s=gensub(/\[!([^;]+);([^]]+)\]/,anchor("","\\1","\\2"),"g",s)
>
> works for me as expected, at least from 3.1.7 to 4.0.74, on FreeBSD.
>
> I think "string" is crucial here? Or is it that gensub essentially just
> operates on the (string) result of the supplied function?

couldn't find anchor() in my 4.0.1 gawk manual. meanwhile i found this:

$ cat ~/t.awk
func regMap(buf, reg, f) {
    s = buf
    r = ""
    while(s && match(s, reg, m)) {
        if(!m[0,"length"]) {
            # avoid endless loop on zero length matches: pass a char
            r = r substr(s, 1, 1)
            s = substr(s, 2)
            continue
        }
        r = r substr(s, 1, m[0,"start"]-1) @f(m)
        s = substr(s, m[0,"start"] + m[0,"length"])
    }
    return r s
}
func f(m) {
    a = m[1]
    b = m[2]
    return a b b a;
}
{ print regMap($0, "(a*)(b*)", "f") }

$ echo 'foazbzaabbk' | awk -f ~/t.awk
foaazbbzaabbbbaak

The drawback is that you have to give the regex as a string. Since
regexes aren't available as a data type in gawk, one would need to change
gawk's parser to make something like this work:

{ print regMap($0, PROTECT(/(a*)(b*)/), "f") }
# with PROTECT() returning a string representation

Sorry if that is too gawk-centric; the algorithm does not require gawk's "@"
notation, but I don't see how to do functors in vanilla awk in a single
process.

Kind Regards
- Robert Figura

Ed Morton

Apr 1, 2013, 10:45:13 AM

"No" is the answer - you need a loop with match(). I haven't looked at your
question in much detail so I could be way off but I suspect using the added
array arg for gawk match() might make your life easier.

Ed.

cha...@cs.tu-berlin.de

Apr 1, 2013, 8:05:14 PM

Robert Figura <nc-fi...@netcologne.de> wrote:
: On Mon, 1 Apr 2013 12:06:38 +0000 (UTC)
: cha...@cs.tu-berlin.de wrote:

: > Janis Papanagnou <janis_pa...@hotmail.com> wrote:
: >
: > : In GNU awk I want to substitute all matching sub-strings by a calculated
: > : value of each sub-expression, as outlined here:
: >
: > : buf = gensub (/([0-9]+)\/([0-9]+)/, func(\\1,\\2), "g", buf)
: >
: > : This does not work, it gives syntax errors at the backref positions; it
: > : seems that backrefs are only available if part of a substitution string.
: > : Or am I missing something? Is such a replacement possible at all in gawk
: > : without looping and doing each match separately?
: >
: > s=gensub(/\[!([^;]+);([^]]+)\]/,anchor("","\\1","\\2"),"g",s)
: >
: > works for me as expected, at least from 3.1.7 to 4.0.74, on FreeBSD.
: >
: > I think "string" is crucial here? Or is it that gensub essentially just
: > operates on the (string) result of the supplied function?

: couldn't find anchor() in my 4.0.1 gawk manual. meanwhile i found this:

Just an example of a user-defined function, in this case one to
construct a html-anchor: (remove space after '<', tin is allergic
to html tags, it seems)

#construct an anchor
function anchor(name, href, s){
return "< a" (name?(" name=\"" name "\""):"") (href?(" href=\"" href "\""):"") ">" s "< /a>"
}

which, when supplied with s being

Like [!blub.jpg;this!]

in the gensub invocation as above yields s being

Like < a href="blub.jpg">this!< /a>

(minus two spaces, see above) as expected.

: $ cat ~/t.awk
: func regMap(buf, reg, f) {
: s = buf
: r = ""
: while(s && match(s, reg, m)) {
: if(!m[0,"length"]) {
: # avoid endless loop on zero length matches: pass a char
: r = r substr(s, 1, 1)
: s = substr(s, 2)
: continue
: }
: r = r substr(s, 1, m[0,"start"]-1) @f(m)
: s = substr(s, m[0,"start"] + m[0,"length"])
: }
: return r s
: }
: func f(m) {
: a = m[1]
: b = m[2]
: return a b b a;
: }
: { print regMap($0, "(a*)(b*)", "f") }

: $ echo 'foazbzaabbk' | awk -f ~/t.awk
: foaazbbzaabbbbaak

A nice one, if a bit dangerous in the general case with regard to m. Using

function regMap(buf, reg, f, m){

instead should fix this.

: The drawback is that you have to give the regex as string. Since, in
: gawk, regex aren't available as datatypes one would need to change
: gawk's parser to make something like this work:

: { print regMap($0, PROTECT(/(a*)(b*)/), "f") }
: # with PROTECT() returning a string representation

These kinds of problems showing up come with the territory, I think: awk
not being a full-fledged programming language. The question is how much
solving them may cost. Some may have pretty simple solutions, like
modularization/name spaces; this one probably might be solved most
cleanly with a preprocessor or as a side effect of introducing eval();
and some may just turn out to be too hairy internally, like signal
handling.

Robert Figura

Apr 2, 2013, 8:07:26 AM

Hi,

cha...@cs.tu-berlin.de wrote:

> Robert Figura <nc-fi...@netcologne.de> wrote:
> : cha...@cs.tu-berlin.de wrote:
> : > Janis Papanagnou <janis_pa...@hotmail.com> wrote:

> : > : In GNU awk I want to substitute all matching sub-strings by a calculated
> : > : value of each sub-expression, as outlined here:
> : >
> : > : buf = gensub (/([0-9]+)\/([0-9]+)/, func(\\1,\\2), "g", buf)
> : >
> : > : This does not work, it gives syntax errors at the backref positions; it
> : > : seems that backrefs are only available if part of a substitution string.
> : > : Or am I missing something? Is such a replacement possible at all in gawk
> : > : without looping and doing each match separately?
> : >
> : > s=gensub(/\[!([^;]+);([^]]+)\]/,anchor("","\\1","\\2"),"g",s)

The function anchor() will be called before gensub(). So anchor() may
return a string with backrefs, which your implementation seems to do:

> function anchor(name, href, s){
> return "< a" (name?(" name=\"" name "\""):"") (href?(" href=\"" href "\""):"") ">" s "< /a>"
> }

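A C?A:B expression whose condition is a backref string such as "\\1"
will always evaluate the A part, because a non-empty string is true no
matter what the group later matches. That is effectively dead code; I'd
take it as a hint that something won't work as expected.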

> A nice one, if a bit dangerous in the general case with regard to m. Using
>
> function regMap(buf, reg, f, m){

function regMap(buf, reg, f, m, r) {

Sure. Thanks :)

> These kinds of problems showing up come with the territory, I think: awk
> not being a full-fledged programming language.

Isn't that a bit harsh?

> this one probably might be solved most
> cleanly with a preprocessor

Preprocessor. Didn't think about that one here. Maybe because my
guts wouldn't want to be involved in parsing regex. I wonder if there's
a complete-ish awk parser written in awk out there...

Janis Papanagnou

Apr 5, 2013, 7:13:01 AM

After re-thinking about the statement I have to (sort of) correct it a
bit.

It makes no sense in the context of any general function "do_something"
(to take the name from your example). But in the context of a specific
function (like gensub), a function that supports back-references, it
would also make sense that this function would apply a lazy evaluation
concept for the replacement expression. This is currently not supported
by gawk (and probably never will be), but I think it would be possible,
it would make sense, and it would be helpful.

Janis

Aharon Robbins

Apr 5, 2013, 7:25:49 AM

Hi.

In article <kjmbjs$cef$1...@speranza.aioe.org>,
Janis Papanagnou <janis_pa...@hotmail.com> wrote:
>It makes no sense in the context of any general function "do_something"
>(to take the name from your example). But in the context of a specific
>function (like gensub), a function that supports back-references, it
>would also make sense that this function would apply a lazy evaluation
>concept for the replacement expression.

No, not really. Gensub takes a replacement string as the argument
specifying how to build the result, just like sub and gsub do. Gawk is
not the shell, nor is it perl, nor does it desire to be either of them.

>This is currently not supported by gawk (and probably never will be),

Right on both.

>but I think it would be possible,

Just because something is possible doesn't mean it's a good idea.

And in this case, it is actually quite difficult, since it means runtime
(re-)evaluation of an expression that has already been parsed. It's
not something I would care to try, since it would VERY badly warp the
code for something that could be accomplished in a more straightforward
manner with awk code.

>it would make sense,

Only in a language with dynamic evaluation everywhere, like lisp, shell,
perl, or whatever. Not in [g]awk.

>and it would be helpful.

I think only marginally so, to be honest; the ROI for the real work
needed is too low.

Sorry to rain on your parade... :-)

Arnold
--
Aharon (Arnold) Robbins arnold AT skeeve DOT com
P.O. Box 354 Home Phone: +972 8 979-0381
Nof Ayalon
D.N. Shimshon 9978500 ISRAEL

Janis Papanagnou

Apr 5, 2013, 8:44:09 AM

Am 05.04.2013 13:25, schrieb Aharon Robbins:
> Hi.
>
> In article <kjmbjs$cef$1...@speranza.aioe.org>,
> Janis Papanagnou <janis_pa...@hotmail.com> wrote:
>> It makes no sense in the context of any general function "do_something"
>> (to take the name from your example). But in the context of a specific
>> function (like gensub), a function that supports back-references, it
>> would also make sense that this function would apply a lazy evaluation
>> concept for the replacement expression.
>
> No, not really. Gensub takes a replacement string as the argument
> specifying how to build the result, just like sub and gsub do. Gawk is
> not the shell, nor is it perl, nor does it desire to be either of them.
>
>> This is currently not supported by gawk (and probably never will be),
>
> Right on both.
>
>> but I think it would be possible,
>
> Just because something is possible doesn't mean it's a good idea.

You think - apart from implementation difficulties - it is a bad idea?
Given the application I outlined upthread I have to disagree. YMMV :-)

>
> And in this case, it is actually quite difficult,

Yes, I have thought so. And I did not expect that to get implemented.

The point of my posting was to correct a statement that I originally
agreed to; whether such an operational semantics would make sense or
not.

> since it means runtime
> (re-)evaluation of an expression that has already been parsed.

I am aware that it may not fit well. On the other hand I'd think that
with an intermediate-code interpreter (as present in gawk) it may be
even easier to implement than with a straight compiler.

Given the gawk specific version gensub that already supports non-regular
features like back-references, having another semantic extension in the
context of that function would also not have undesired side-effects; the
effects would anyway be local to a non-standard function.

But okay, I don't want to beat a dead horse.

> It's
> not something I would care to try, since it would VERY badly warp the
> code

Yes, I thought so.

> for something that could be accomplished in a more straightforward
> manner with awk code.

You mean re-looping again and again over subexpressions? - Well... :-/

>
>> it would make sense,
>
> Only in a language with dynamic evaluation everywhere, like lisp, shell,
> perl, or whatever. Not in [g]awk.

It's certainly easier in those languages, sure.

>
>> and it would be helpful.
>
> I think only marginally so, to be honest; the ROI for the real work
> needed is too low.
>
> Sorry to rain on your parade... :-)

Don't worry :-) As said, I didn't mean that to get implemented (even
though I'd have appreciated it ;-). Just wanted to correct my statement.

On a related note, Arnold... - a question I am not sure whether it had
already been mentioned here...

The back-references implementation may require a non-finite automaton.
Are there two matching implementations in gawk, one that uses the classic
fast matching, and one that uses some backtracking for the back-refs,
selected depending on the contents of the actual regexp, or
is there only one algorithm implemented?

Janis

>
> Arnold
>

Aharon Robbins

Apr 5, 2013, 9:32:04 AM

In article <kjmgun$s9p$1...@speranza.aioe.org>,
Janis Papanagnou <janis_pa...@hotmail.com> wrote:
>> since it means runtime
>> (re-)evaluation of an expression that has already been parsed.
>
>I am aware that it may not fit well. On the other hand I'd think that
>with an intermediate-code interpreter (as present in gawk) it may be
>even easier to implement than with a straight compiler.
>
>Given the gawk specific version gensub that already supports non-regular
>features like back-references, having another semantic extension in the
>context of that function would also not have undesired side-effects; the
>effects would anyway be local to a non-standard function.

They're not related. The backreference access has to do with the way
the regexp evaluator works, and is not connected to how the parser
and intermediate-code interpreter work. Getting the values from the
backreferences out of gensub and into some other piece of code to run in
order to generate the final replacement string would be terribly painful.

>> for something that could be accomplished in a more straightforward
>> manner with awk code.
>
>You mean re-looping again and again over subexpressions? - Well... :-/

Yes, exactly.

>On a related note, Arnold... - a question I am not sure whether it had
>already been mentioned here...
>
>The back-references implementation may require a non-finite automata.

Actually, in gawk's case it doesn't, since they are not used as
part of the regexp, but only in the replacement part.

>Are there two matching implementations in gawk, one that use the classic
>fast matching, and one that use some backtracking for the back-refs,
>selected depending on the contents of the actual regexp expression, or
>is there only one algorithm implemented?

There are indeed two matchers, one that uses a dfa matcher, and another
that uses backtracking if needed. (See Russ Cox's excellent [but fairly
technical] papers on regexp matching for very interesting discussions.)

The dfa matcher returns only a "yes it matched, no it didn't match"
answer, whereas the other matcher returns where it matched information,
and is the one used for sub, gsub, and gensub.

The cases where "did it match" is sufficient are things like

/pattern/ { action }
if (/pattern/)
if (str ~ pattern)

The full matcher is needed for just about everything else:

- Field splitting when FS is a regexp
- Record splitting when RS is a regexp
- The match function (setting RSTART and RLENGTH)
- sub, gsub, gensub
- split with a regular expression
- maybe others I'm not remembering off the top of my head

Even in most of the latter cases, gawk will run the dfa matcher
first to see if a match is even there, before running the full matcher
to extract the start and end of the match, since the dfa matcher is much
much faster.

I hope this answers your question.