
awk regex, storing a match?

rabbits77

Jan 25, 2009, 5:41:06 PM
How would I express this perl regex in awk?
$ echo 12abcat | perl -e '$a=<STDIN>; $a =~ /(2.*c)/; print "$1\n";'
This returns
2abc
Which is the part of the string that matches the part of the regex in
parentheses. By wrapping part of the regex I am telling perl to store
the matching region of the string in a variable called $1.
Is it possible to do this in awk? That is, store the part of the string
that matches in a variable?

Ed Morton

Jan 25, 2009, 7:56:25 PM

awk -v re='a|b' '
function extract(str,regexp)
{ RMATCH = (match(str,regexp) ? substr(str,RSTART,RLENGTH) : "")
return RSTART
}
extract($0,re) { print RMATCH }
'

Ed.
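
For reference, a minimal sketch of how the function above might be run against the original poster's input. The shell wrapper name extract_demo is only an illustration and is not part of the post; the regex is passed in through -v just as above.

```sh
# Hypothetical wrapper: passes its first argument to awk as the regex.
extract_demo() {
  awk -v re="$1" '
    # Store the matched text in the global RMATCH and return RSTART,
    # which is 0 (false) when nothing matched.
    function extract(str, regexp) {
      RMATCH = (match(str, regexp) ? substr(str, RSTART, RLENGTH) : "")
      return RSTART
    }
    extract($0, re) { print RMATCH }
  '
}

echo 12abcat | extract_demo '2.*c'   # prints: 2abc
```

Note that match() only reports the whole match, so this corresponds to a Perl regex with the parentheses around the entire pattern.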

Rajan

Jan 25, 2009, 10:09:46 PM
"Ed Morton" <morto...@gmail.com> wrote in message
news:12f95179-cfd3-4412...@p2g2000prn.googlegroups.com...

Easier in gawk
$ echo 12abcat | gawk -v re="2.*c" '{match($0,re,RMATCH); print RMATCH[0]}'
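
A short sketch of how the same gawk feature can pick out a parenthesized group, which is closer to Perl's $1. It assumes gawk is installed, since the three-argument match() is a GNU extension; the names capture_demo and m are made up for the example.

```sh
# Requires gawk: match() with an array argument is not in POSIX awk.
capture_demo() {
  gawk '{
    # m[0] holds the whole match, m[1] the first parenthesized group;
    # m[1, "start"] and m[1, "length"] give that group position and size.
    if (match($0, /1(2.*c)/, m))
      print m[1], m[1, "start"], m[1, "length"]
  }'
}

# Skip quietly when gawk is not available.
command -v gawk >/dev/null 2>&1 && echo 12abcat | capture_demo   # prints: 2abc 2 4
```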

Ed Morton

Jan 25, 2009, 10:12:16 PM

And just in case we're playing golf rather than looking for something
we can use in longer scripts:

$ echo 12abcat |
perl -e '$a=<STDIN>; $a =~ /(2.*c)/; print "$1\n";'

2abc
$ echo 12abcat |
awk 'match($0,/2.*c/){$0=substr($0,RSTART,RLENGTH)}1'
2abc

Regards,

Ed.

jaialai.t...@gmail.com

Jan 25, 2009, 11:02:46 PM
Hey, that is cool!
But what about using the matched part in the regex itself?
Like, how would you do this perl regex in awk?
$ echo ca12cat | perl -e '$a=<STDIN>; $a =~ /(ca)(12\1)/; print "$2\n";'
12ca
the (12\1) means "12 followed by the stuff that was matched by the
*first* parenthesized part of the regex".
print $2 means "print out the stuff that was matched by the *second*
parenthesized part of the regex".


pk

Jan 26, 2009, 4:51:35 AM
On Monday 26 January 2009 05:02, jaialai.t...@gmail.com wrote:

>> awk -v re='a|b' '
>> function extract(str,regexp)
>> { RMATCH = (match(str,regexp) ? substr(str,RSTART,RLENGTH) : "")
>> return RSTART}
>>
>> extract($0,re) { print RMATCH }
>> '
> Hey, that is cool!
> But what about using the matched part in the regex itself?
> Like, how would you do this perl regex in awk?
> $ echo ca12cat | perl -e '$a=<STDIN>; $a =~ /(ca)(12\1)/; print "$2\n";'
> 12ca
> the (12\1) means "12 followed by the stuff that was matched by the
> *first* parenthesized part of the regex".
> print $2 means "print out the stuff that was matched by the *second*
> parenthesized part of the regex".

AFAIK you can't in awk. Backreferences are not supported. GNU awk supports
backreferences for substitutions using the gensub() function, but you can't
pull them out like you do with perl, although you can put together hacks
like (not the same as your example)

$ echo ca12cat | gawk '{s=gensub(/(..)(..).*/,"\\2","g"); print s}'
12

but in any case using backreferences during the match itself is not
supported AFAICT.

Ed Morton

Jan 26, 2009, 9:09:35 AM

That's not supported directly in awk so you'd need something like
(untested):

awk -v re='ca' '
function extract(str,regexp)
{ RMATCH = (match(str,regexp) ? substr(str,RSTART,RLENGTH) : "")
return RSTART
}

extract($0,re) && extract($0,re"12"RMATCH) { print RMATCH }
'

Regards,

Ed.
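
A possible variation on the idea above, sketched under some assumptions: it prints only what Perl's second group would hold (12ca for the sample input) and compares the repeated text literally with substr() instead of splicing it into a new regex. The names backref, backref_demo, GROUP2, g1 and lit are invented for this example. It only tries the leftmost-longest match of the first group at each position, so it is not a full backtracking emulation, and anchors such as ^ in the first group will not behave as they would on the whole record.

```sh
# Hypothetical helper: emulate /(G1)(LIT\1)/ and print the second group.
backref_demo() {
  awk -v g1="$1" -v lit="$2" '
    function backref(str, g1, lit,    pos, off, m, want) {
      pos = 1
      while (pos <= length(str) && match(substr(str, pos), g1)) {
        off  = pos + RSTART - 1          # match position in the full string
        m    = substr(str, off, RLENGTH) # text matched by the first group
        want = lit m                     # literal text expected right after it
        if (substr(str, off + RLENGTH, length(want)) == want) {
          GROUP2 = want
          return off
        }
        pos = off + 1                    # retry from the next character
      }
      GROUP2 = ""
      return 0
    }
    backref($0, g1, lit) { print GROUP2 }
  '
}

echo ca12cat | backref_demo 'ca' '12'   # prints: 12ca
```

Because the repeated text is compared as a plain string, regex metacharacters in the first match need no escaping here.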

pk

Jan 26, 2009, 9:17:35 AM
On Monday 26 January 2009 15:09, Ed Morton wrote:

> That's not supported directly in awk so you'd need something like
> (untested):
>
> awk -v re='ca' '
> function extract(str,regexp)
> { RMATCH = (match(str,regexp) ? substr(str,RSTART,RLENGTH) : "")
> return RSTART
> }
>
> extract($0,re) && extract($0,re"12"RMATCH) { print RMATCH }
> '

However, be careful with that one if RMATCH happens to contain regex
metacharacters!
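
To illustrate the pitfall, here is a minimal sketch of escaping a string by hand before reusing it as a dynamic regex in awk. The function name re_quote and the wrapper quote_demo are assumptions made for the example, not standard awk. The escaping is done character by character because the handling of backslashes in sub()/gsub() replacement strings differs between awk implementations.

```sh
quote_demo() {
  awk '
    # Prefix each ERE metacharacter with a backslash. "]" and "}" are left
    # alone: they are ordinary once "[" and "{" have been escaped.
    function re_quote(s,    i, c, out) {
      out = ""
      for (i = 1; i <= length(s); i++) {
        c = substr(s, i, 1)
        if (index("\\.^$*+?()[{|", c))
          out = out "\\"
        out = out c
      }
      return out
    }
    function extract(str, regexp) {
      RMATCH = (match(str, regexp) ? substr(str, RSTART, RLENGTH) : "")
      return RSTART
    }
    {
      first = substr($0, 1, 2)          # the first two characters, e.g. ".*"
      extract($0, first)
      print "unquoted: " RMATCH         # ".*" used as a regex matches everything
      extract($0, re_quote(first))
      print "quoted: " RMATCH           # "\.\*" matches only the literal text
    }
  '
}

echo '.*cat' | quote_demo
# prints:
# unquoted: .*cat
# quoted: .*
```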

Ed Morton

Jan 26, 2009, 9:48:42 AM

Probably OT but - any idea what perl does in that situation?

Ed.

pk

Jan 26, 2009, 12:27:02 PM

Apparently it does the same:

$ echo '.*cat' | perl -ne 'm/^(..)/;
$m=$1; # save what matched in $1 (".*")
m/($m)/; # try a new match with that...
$nm=$1; # and save what matched
print "first match: $m, new match: $nm\n";'
first match: .*, new match: .*cat

so $m is interpolated to .*, and that is used to do the match, which matches
the whole string of course.

But...perl also has the \Q...\E escape sequences, which quote everything
in between so that it is regex-safe (something that would have to be
done manually in awk), so:

$ echo '.*cat' | perl -ne 'm/^(..)/;
$m=$1;
print "\Q$m\E\n";
m/(\Q$m\E)/;
$nm=$1;
print "first match: $m, new match: $nm\n";'
\.\*
first match: .*, new match: .*

Yeah, too easy that way :(

Aharon Robbins

Jan 26, 2009, 2:21:07 PM

This is correct; no awk version that I know of supports this (I don't
know what tawk does).
--
Aharon (Arnold) Robbins arnold AT skeeve DOT com

Janis Papanagnou

Jan 26, 2009, 7:58:23 PM
Aharon Robbins wrote:
> In article <glk11r$ug8$1...@news.motzarella.org>, pk <p...@pk.invalid> wrote:
>
>>On Monday 26 January 2009 05:02, jaialai.t...@gmail.com wrote:
>>
>>but in any case using backreferences during the match itself is not
>>supported AFAICT.
>
>
> This is correct; no awk version that I know of supports this (I don't
> know what tawk does).

Which is understandable, since such backreferences go beyond Chomsky
type-3 (regular) grammars, i.e. beyond what finite state machines can
handle and what true regular expressions can express.

I've been told that perl even has severe performance issues with
backreferences in some applications.

Janis

r.p....@gmail.com

Jan 29, 2009, 3:45:39 PM
echo 12abcat | awk 'match($0,/2.*c/){print substr($0,RSTART,RLENGTH)}'

i.e., why use the one extra step of printing after reassigning?

Ed Morton

Jan 29, 2009, 3:55:33 PM

Because the above will not print anything if the pattern is not found
while to me it looked like the perl statement the OP wanted to find an
awk equivalent for would print the input record whether it was
modified or not. Of course, the perl syntax is so cryptic it might
actually invoke nasal demons for all I know but I'm pretty sure it
will print SOMETHING....

Ed.
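
A small sketch that makes the difference visible by feeding both awk one-liners a matching and a non-matching line. The function names are invented for the example. The third variant is an assumption about the Perl behaviour: as far as I can tell, print "$1\n" prints an empty line when nothing matched, rather than the input record.

```sh
# Replace the record with the match when there is one, then always print.
print_always() {
  awk 'match($0, /2.*c/) { $0 = substr($0, RSTART, RLENGTH) } 1'
}

# Print only when the pattern matched.
print_on_match() {
  awk 'match($0, /2.*c/) { print substr($0, RSTART, RLENGTH) }'
}

# Print the match, or an empty line when there is none.
print_match_or_empty() {
  awk '{ print (match($0, /2.*c/) ? substr($0, RSTART, RLENGTH) : "") }'
}

printf '12abcat\nxyz\n' | print_always          # prints: 2abc, then xyz
printf '12abcat\nxyz\n' | print_on_match        # prints: 2abc
printf '12abcat\nxyz\n' | print_match_or_empty  # prints: 2abc, then an empty line
```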
