awk lookahead/behind regex

Peng Yu

Nov 9, 2011, 9:35:24 AM

Hi,

I don't see that the gawk manual describes lookahead/lookbehind regexes.
Are they available in GNU awk?

--
Regards,
Peng

pk

Nov 9, 2011, 9:36:56 AM

On Wed, 9 Nov 2011 06:35:24 -0800 (PST), Peng Yu <peng...@gmail.com>
wrote:

> Hi,
>
> I don't see that the gawk manual describes lookahead/lookbehind regexes.
> Are they available in GNU awk?

No. If you really need them (which isn't always the case), use Perl or
another language that supports them.

Ed Morton

Nov 9, 2011, 10:28:55 AM

Peng Yu <peng...@gmail.com> wrote:

> Hi,
>
> I don't see that the gawk manual describes lookahead/lookbehind regexes.
> Are they available in GNU awk?
>

Not in any search pattern, but in gensub() you can refer to a previously
matched RE segment using \\1 etc., e.g.:

$ echo "abxycd" | gawk '$0=gensub(/(b.)(.c)/,"| \\2 \\1 |","g")'
a| yc bx |d

$ echo "abxycd" | gawk '$0=gensub(/b./,"| & |","g")'
a| bx |ycd

If you tell us what you want to do with some sample input and expected
output, maybe we can make a suggestion...

Regards,

Ed.

Posted using www.webuse.net

pk

Nov 9, 2011, 10:35:57 AM

On Wed, 09 Nov 2011 15:28:55 GMT, "Ed Morton" <morto...@gmail.com> wrote:

> Peng Yu <peng...@gmail.com> wrote:
>
> > Hi,
> >
> > I don't see that the gawk manual describes lookahead/lookbehind regexes.
> > Are they available in GNU awk?
> >
>
> Not in any search pattern, but in gensub() you can refer to a previously
> matched RE segment using \\1 etc., e.g.:

Yes, but that's not lookahead/lookbehind. Lookaround (the generic name for
both) is a zero-width assertion about what comes before or after a certain
part of the expression. For example (using Perl notation),

/foo(?=bar)/

matches /foo/ but only when followed by bar (and "bar" is not part of the
match, as it would be if we used /foobar/).
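
For comparison, here is a minimal awk sketch of the same idea. Since awk has no lookahead, one option is to match the longer pattern /foobar/ and then shrink the reported match by the length of the suffix. The function name foo_before_bar and the sample inputs are made up for illustration, and the approach assumes the suffix is a fixed string rather than a regex.

```shell
# Emulate /foo(?=bar)/: report where "foo" starts and how long the match is,
# but only when "bar" follows. Assumes the suffix "bar" is a literal string.
foo_before_bar() {
  awk '{
    if (match($0, /foobar/)) {
      len = RLENGTH - length("bar")   # drop the lookahead part from the match
      print RSTART, len, substr($0, RSTART, len)
    } else {
      print "no match"
    }
  }'
}

printf '%s\n' xfoobar foobaz | foo_before_bar
```

For the two sample lines this prints "2 3 foo" and then "no match": the first line matches at position 2 with only "foo" counted as the match, and the second line has no "foo" followed by "bar".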

Ed Morton

Nov 9, 2011, 11:25:55 AM

I just wasn't sure if that's really what the OP needed, though, as I've seen
questions like this where the poster really just needed to access the
matched RE segment(s) in the replacement string and may not have been aware
of alternatives that produce the output they want.

In the example you gave, whether or not you actually need lookahead would
depend on what you're going to do with the matched RE. If you just want to
replace "foo" when followed by "bar" with the word "MATCH" for example, then
something like this:

$ cat file
foobob
foobar
foozed
$
$ gawk -v re="foo" -v sfx="bar" 'match($0,re sfx) {
$0 = "MATCH" substr($0,RSTART+RLENGTH-length(sfx))
}1' file
foobob
MATCHbar
foozed

or simply:

gawk 'match($0,/foobar/) {$0 = "MATCH" substr($0,RSTART+length("foo"))}1' file

or any one of many other possible solutions MIGHT be just fine.

Off the top of my head, I can't think of a case where there isn't a
reasonable alternative to lookaround using existing awk constructs. I
daresay they exist, but maybe the OP's case isn't one of them.

Ed.


pk

Nov 9, 2011, 11:45:04 AM

As I said, sometimes lookaround isn't really needed, although people
rapidly get used to it and think they can do just about anything with it,
even when it's not necessary.

If you only need the boolean answer "match/no match", then lookaround really
adds nothing (in fact, it adds unnecessary overhead).

If you need to pull out matching parts (perhaps in loops), then lookaround
can be useful. In my experience, it is most useful in the negative version:
/foo(?!bar)/ matches foo only if it is not followed by bar, which in awk
would require something like !/foobar/ && /foo/ (plus the labor to pull out
the match with match() and RSTART/RLENGTH etc., which isn't usually needed
in Perl).

But he asked "do awk regex have lookaround", and the answer is "no". If he
shows a concrete example, perhaps we will see that it doesn't really need
it and can be solved with awk, although perhaps in a bit more awkward way
(no pun intended).
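
As a rough illustration of the negative case /foo(?!bar)/, the sketch below walks over every occurrence of "foo" on a line and prints the line as soon as one of them is not followed by "bar". The function name foo_not_bar and the sample lines are hypothetical, and the code assumes both "foo" and "bar" are literal strings.

```shell
# Print lines that contain at least one "foo" NOT immediately followed by "bar".
# Assumes literal strings; a regex suffix would need a different check.
foo_not_bar() {
  awk '{
    s = $0
    while (match(s, /foo/)) {
      rest = substr(s, RSTART + RLENGTH)       # text right after this "foo"
      if (substr(rest, 1, 3) != "bar") { print; break }
      s = rest                                 # keep looking further right
    }
  }'
}

printf '%s\n' foobar foobaz "foobar foo" | foo_not_bar
```

This prints "foobaz" and "foobar foo". The second of those shows where the per-occurrence check and a whole-line test such as !/foobar/ && /foo/ part ways: the line contains "foobar", so the whole-line test rejects it, yet it also contains a "foo" that is not followed by "bar".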

mic...@gortel.phys.ualberta.ca

Nov 9, 2011, 12:27:12 PM

Ed Morton <morto...@gmail.com> wrote:

> $ gawk -v re="foo" -v sfx="bar" 'match($0,re sfx) {
> $0 = "MATCH" substr($0,RSTART+RLENGTH-length(sfx))
> }1' file

Careful with that, as it is pretty fragile. It will transform
"xyzfoobar" into "MATCHbar" too, which is likely not what you want.
Moreover, if 'sfx' stands for, say, "at least one digit followed by a
lowercase letter", then '-length(sfx)' can be right only by accident.
Your example from an earlier message was much better.

Michal
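
One possible way around both problems, sketched under the assumption that the part being replaced ("foo") is a literal string while the suffix may be any regex: keep the text before RSTART and skip ahead by length(re) instead of subtracting length(sfx). The function name replace_before and the sample inputs are invented for this example, and only the first match on each line is handled.

```shell
# Replace "foo" with "MATCH" only when the suffix regex follows, keeping
# whatever precedes the match. Assumes `re` is a literal string, so its
# length can be used directly; `sfx` may be any regex.
replace_before() {
  awk -v re="foo" -v sfx="[0-9]+[a-z]" 'match($0, re sfx) {
    $0 = substr($0, 1, RSTART - 1) "MATCH" substr($0, RSTART + length(re))
  } 1'
}

printf '%s\n' xyzfoo12a foobob | replace_before
```

The first line becomes "xyzMATCH12a" (prefix and suffix both preserved) and "foobob" passes through unchanged.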

Peng Yu

Nov 9, 2011, 4:46:56 PM

On Nov 9, 9:28 am, "Ed Morton" <mortons...@gmail.com> wrote:

> Peng Yu <pengyu...@gmail.com> wrote:
>
> > Hi,
>
> > I don't see that the gawk manual describes lookahead/lookbehind regexes.
> > Are they available in GNU awk?
>
> Not in any search pattern, but in gensub() you can refer to a previously
> matched RE segment using \\1 etc., e.g.:
>
> $ echo "abxycd" | gawk '$0=gensub(/(b.)(.c)/,"| \\2 \\1 |","g")'
> a| yc bx |d
>
> $ echo "abxycd" | gawk '$0=gensub(/b./,"| & |","g")'
> a| bx |ycd
>
> If you tell us what you want to do with some sample input and expected
> output, maybe we can make a suggestion...

Suppose I have a string of the format [0-9]+[a-zA-Z]. For example,
100M20K28U10K is one such string.

I want to extract the number before, say, an English letter, for example
'K' or 'U' in this example. What is the best way to do this in awk?

Kenny McCormack

Nov 9, 2011, 6:18:33 PM

In article <204d7e4c-7ff9-4597...@y7g2000vbe.googlegroups.com>,

I think this will get you started:

split(str,A,/[A-Z]+/)

Or you might even be interested in FPAT (present in TAWK forever, now
available in GAWK flavor as well!)

--
Faced with the choice between changing one's mind and proving that there is
no need to do so, almost everyone gets busy on the proof.

- John Kenneth Galbraith -
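
A minimal sketch of how the split() suggestion above might be used on the sample string. The function name numbers is made up, and the separator class is widened to [A-Za-z] on the assumption that lowercase letters can also occur. Because the string ends in a letter, split() produces a trailing empty element, which the loop skips.

```shell
# Split the line on runs of letters and print each non-empty numeric piece.
numbers() {
  awk '{
    n = split($0, A, /[A-Za-z]+/)
    for (i = 1; i <= n; i++)
      if (A[i] != "") print A[i]     # the last element is empty here
  }'
}

echo 100M20K28U10K | numbers
```

This prints 100, 20, 28 and 10 on separate lines. It gives the numbers only; which letter followed each number is lost, so pairing them up needs a different approach.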

Ed Morton

Nov 9, 2011, 6:38:52 PM

On 11/9/2011 3:46 PM, Peng Yu wrote:
<snip>

> Suppose I have a string of the format [0-9]+[a-zA-Z]. For example,
> 100M20K28U10K is one such string.
>
> I want to extract the number before, say, an English letter, for example
> 'K' or 'U' in this example. What is the best way to do this in awk?

I don't know if it's the BEST way as there are alternatives, but one way is:

$ cat file
100M20K28U10K

$ awk -v let="K" 'match($0,"[[:digit:]]+" let) {
print substr($0,RSTART,RLENGTH-1)}' file
20

$ awk -v let="U" 'match($0,"[[:digit:]]+" let) {
print substr($0,RSTART,RLENGTH-1)}' file
28

There's also:

$ awk -v let="K" '{ sub(let ".*",""); sub(/.*[[:alpha:]]/,"") }1' file
20

$ awk -v let="U" '{ sub(let ".*",""); sub(/.*[[:alpha:]]/,"") }1' file
28

Of course those just get the number before the first occurrence of the
letter. Maybe you want all of them? Then one way would be:

$ awk -v let="K" '{ n=split($0,a,let); for (i=1; i<n; i++) {
sub(/.*[[:alpha:]]/,"",a[i]); print a[i] } }' file
20
10

It all depends what you want...

Regards,

Ed.
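
Another possible variant, sketched here with a hypothetical helper named nums_before: walk the string one number/letter pair at a time with match() and print the number whenever the letter equals the one asked for. Unlike the commands above, the letter is compared as a plain string rather than used as a regex. It assumes the input consists of digit runs each followed by a single ASCII letter.

```shell
# Print every number that is immediately followed by the given letter.
# Usage: nums_before LETTER  (reads lines from standard input)
nums_before() {
  awk -v want="$1" '{
    s = $0
    while (match(s, /[0-9]+[A-Za-z]/)) {
      num    = substr(s, RSTART, RLENGTH - 1)        # the digits
      letter = substr(s, RSTART + RLENGTH - 1, 1)    # the letter after them
      if (letter == want) print num
      s = substr(s, RSTART + RLENGTH)                # continue after this pair
    }
  }'
}

echo 100M20K28U10K | nums_before K
```

For K this prints 20 and 10; for U it would print 28.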

mic...@gortel.phys.ualberta.ca

Nov 10, 2011, 8:35:54 PM

Peng Yu <peng...@gmail.com> wrote:
>
> Suppose I have a string of the format [0-9]+[a-zA-Z]. For example,
> 100M20K28U10K is one such string.
>
> I want to extract the number before, say, an English letter, for example
> 'K' or 'U' in this example. What is the best way to do this in awk?

Try

echo 100M20K28U10K | awk -F "[a-zA-Z]" '{print $1, $2, $3, $4}'

Also

echo 100M20K28U10K | awk '{print $0 + 0}'

will print '100' due to the way in which awk, by definition, converts
strings to numbers. You can use that in a loop.

Michal
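
A sketch of what such a loop could look like, with the made-up function name leading_numbers: take the numeric value of the front of the string, strip that number and its letter, and repeat. One caveat worth stating as an assumption: the conversion also understands exponent notation, so input in which a number is followed by e or E and more digits (for example 20E28) would be read as a single large number. The sample input avoids that.

```shell
# Repeatedly convert the leading digits to a number, then drop them
# together with the letter that follows.
leading_numbers() {
  awk '{
    s = $0
    while (s ~ /^[0-9]/) {
      print s + 0                       # numeric value of the leading digits
      sub(/^[0-9]+[A-Za-z]?/, "", s)    # drop that number and its letter
    }
  }'
}

echo 100M20K28U10K | leading_numbers
```

This prints 100, 20, 28 and 10 on separate lines.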

waja...@gmail.com

Mar 2, 2015, 12:41:34 AM

Lookaround is still significantly useful when you are writing one-line commands, such as processing piped data in bash, where using match() makes the command unnecessarily long.