
convert to lowercase in regex only


er1ch

Sep 21, 2012, 7:48:50 AM
Hello group.

I know that the tolower("FoO") function converts each character in "FoO"
to its lowercase equivalent "foo".

What I'm trying to do is apply this function to regex matches, i.e. I want
to convert every string in <Html Brackets> to <lowercase html
characters>. This can be done with a simple sed command, so I
inserted this system command:

system(" sed -e 's/<[^>]*>/\\L&\\E/g' 0.ZZZ ")

and it works for me, but I have an awk script ;)

Can this be done with awk alone? I googled and experimented with no
success, i.e. either none or all of the chars in my file are tolower-ed.

Thanks in advance, erch


Ed Morton

Sep 21, 2012, 8:37:07 AM
There's nothing built in to the language to let you operate on the result of a
*sub() but you can just write a function to do it using match() and substr(), e.g.:

function mklow(old,regexp,    tgt,new)
{
    tgt = old
    while ( match(tgt,regexp) ) {
        new = new substr(tgt,1,RSTART-1) tolower(substr(tgt,RSTART,RLENGTH))
        tgt = substr(tgt,RSTART+RLENGTH)
    }
    return new tgt
}
{ print mklow($0,"<[^>]+>") }

Regards,

Ed.





Ed Morton

Sep 21, 2012, 12:34:35 PM
It feels to me like this is a fairly common question (in various flavours)
so here's a more general function showing how to update segments of a
string as identified by an RE and an operation:

$ cat tst.awk
function modre(string,regexp,op,    head,curr,tail,val)
{
    tail = string
    while ( match(tail,regexp) ) {
        curr = substr(tail,RSTART,RLENGTH)
        if (op == "tolower") {
            curr = tolower(curr)
        }
        else if (op == "toupper") {
            curr = toupper(curr)
        }
        else if (op == "length") {
            curr = length(curr)
        }
        else if (op == "int") {
            curr = int(curr)
        }
        else if (op == "exp") {
            curr = exp(curr)
        }
        else if (op ~ /^[-+*/]/) {
            val = op
            sub(/^./,"",val)
            if (op ~ /^[+]/) {
                curr = curr + val
            }
            else if (op ~ /^[-]/) {
                curr = curr - val
            }
            else if (op ~ /^[*]/) {
                curr = curr * val
            }
            else if (op ~ /^[/]/) {
                curr = curr / val
            }
        }
        else { # add operations above this line
            curr = op
        }
        head = head substr(tail,1,RSTART-1) curr
        tail = substr(tail,RSTART+RLENGTH)
    }
    return head tail
}
{
    # string changes:
    print 1, modre( $0, "<[^>]+>", "tolower" )
    print 2, modre( $0, "<[^>]+>", "toupper" )
    print 3, modre( $0, "<[^>]+>", "length" )
    print 4, modre( $0, "<[^>]+>", "foo" )
    print 5, modre( $0, "<[^>]+>" )

    # numeric changes:
    print 6, modre( $0, "[[:digit:]]+", "int" )
    print 7, modre( $0, "[[:digit:]]+", "exp" )
    print 8, modre( $0, "[[:digit:]]+", "+159" )

    # both changes:
    print 9, modre( modre($0,"[[:digit:]]+","*2"), "<[^>]+>","tolower")
}

$ cat file
here <abc> are 0003 <DeFgHi> string <KLMNO> and 002 numeric examples

$ awk -f tst.awk file
1 here <abc> are 0003 <defghi> string <klmno> and 002 numeric examples
2 here <ABC> are 0003 <DEFGHI> string <KLMNO> and 002 numeric examples
3 here 5 are 0003 8 string 7 and 002 numeric examples
4 here foo are 0003 foo string foo and 002 numeric examples
5 here  are 0003  string  and 002 numeric examples
6 here <abc> are 3 <DeFgHi> string <KLMNO> and 2 numeric examples
7 here <abc> are 20.0855 <DeFgHi> string <KLMNO> and 7.38906 numeric examples
8 here <abc> are 162 <DeFgHi> string <KLMNO> and 161 numeric examples
9 here <abc> are 6 <defghi> string <klmno> and 4 numeric examples

Obviously you should add/remove operations and change REs as you see fit.

Regards,

Ed.

Posted using www.webuse.net

er1ch

Sep 22, 2012, 12:07:52 PM
On Fri, 21 Sep 2012 16:34:35 +0000, Ed Morton wrote:

> Ed Morton <morto...@gmail.com> wrote:
>
[...]
>> > I know that the tolower("FoO") function converts each character in
>> > "FoO" to its lowercase equivalent "foo".
>> >
>> > What I'm trying to do is to apply this function to regexes, i.e. I
>> > want to replace every string in <Html Brackets> into <lowercase html
>> > characters>. This can be done with a simple sed command. Thus, I
[...]
>> > Can this be done with awk means? I googled and experimented with no
[...]
>> There's nothing builtin to the language to let you operate on the
>> result
[...]

> It feels to me like this is a fairly common question (in various
> flavours) so here's a more general function showing how to update
> segments of a string as identified by an RE and an operation:

I really am grateful for your replies, they are more a tutorial than a
"simple help".

[...your code...]

> Obviously you should add/remove operations and change REs as you see
> fit.

Of course my code and whatever (if any) it does is my responsibility.
Thank you, this was very helpful.

erch

Janis Papanagnou

Sep 23, 2012, 4:34:39 AM
On 21.09.2012 18:34, Ed Morton wrote:
[...]
>
> It feels to me like this is a fairly common question (in various flavours)
> so here's a more general function showing how to update segments of a
> string as identified by an RE and an operation:
>
> $ cat tst.awk
> function modre(string,regexp,op, head,curr,tail,val)
> {
> tail = string
> while ( match(tail,regexp) ) {
> curr = substr(tail,RSTART,RLENGTH)
> if (op == "tolower") {
> curr = tolower(curr)
> }
> else if (op == "toupper") {
> curr = toupper(curr)
> }
> else if (op == "length") {
> curr = length(curr)

I noticed that you used length() here instead of curr = RLENGTH, probably
for symmetry reasons. I asked myself whether awk stores the string length as
an explicit attribute of the string variable or determines it with strlen().
Does anybody know how common awks implement length()?

Janis
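
A small sketch of the observation above: after a successful match(), RLENGTH already holds the length of the matched text, so length(substr(...)) and RLENGTH should agree. The helper name match_lengths is made up for this illustration, and the input is plain ASCII (byte versus character counting in multibyte locales is not covered here).

```sh
# match_lengths is a hypothetical helper, not part of the thread code.
# For the first <...> tag on each input line it prints RLENGTH and
# length() of the matched substring side by side.
match_lengths() {
    awk '{
        if (match($0, /<[^>]+>/)) {
            curr = substr($0, RSTART, RLENGTH)
            print RLENGTH, length(curr)
        }
    }'
}

printf 'here <DeFgHi> string\n' | match_lengths
```

For this input both numbers are 8, so which of the two to use is a matter of style unless the cost of length() matters in a given awk.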

> }
> else if (op == "int") {
> curr = int(curr)
> }
> else if (op == "exp") {
> curr = exp(curr)
> }
> else if (op ~ /^[-+*/]/) {
> val = op
> sub(/^./,"",val)
> if (op ~ /^[+]/) {
> curr = curr + val
> }
> else if (op ~ /^[-]/) {
> curr = curr - val
> }
> else if (op ~ /^[*]/) {
> curr = curr * val
> }
> else if (op ~ /^[/]/) {
> curr = curr / val
> }
> }
> else { # add operations above this line
> curr = op
> }
> head = head substr(tail,1,RSTART-1) curr
> tail = substr(tail,RSTART+RLENGTH)
> }
> return head tail
> }
[...]

Dave Gibson

Sep 23, 2012, 8:43:17 AM
Janis Papanagnou <janis_pa...@hotmail.com> wrote:
> On 21.09.2012 18:34, Ed Morton wrote:

>> while ( match(tail,regexp) ) {
>> curr = substr(tail,RSTART,RLENGTH)

>> else if (op == "length") {
>> curr = length(curr)
>
> I noticed that you used length() here, instead of curr = RLENGTH ;
> probably for symmetry reasons. I asked myself whether awk stores
> the string length as explicit attribute of the string variable or
> determines it by strlen(). Does anybody know how common awks implement
> length() ?

stored length:

gawk
mawk

strlen:

bwk's awk
busybox awk

internal wide-character strlen-a-like:

heirloom awk

Ed Morton

Sep 23, 2012, 9:40:36 AM
On 9/23/2012 3:34 AM, Janis Papanagnou wrote:
> On 21.09.2012 18:34, Ed Morton wrote:
> [...]
>>
>> It feels to me like this is a fairly common question (in various flavours)
>> so here's a more general function showing how to update segments of a
>> string as identified by an RE and an operation:
>>
>> $ cat tst.awk
>> function modre(string,regexp,op, head,curr,tail,val)
>> {
>> tail = string
>> while ( match(tail,regexp) ) {
>> curr = substr(tail,RSTART,RLENGTH)
>> if (op == "tolower") {
>> curr = tolower(curr)
>> }
>> else if (op == "toupper") {
>> curr = toupper(curr)
>> }
>> else if (op == "length") {
>> curr = length(curr)
>
> I noticed that you used length() here, instead of curr = RLENGTH ; probably
> for symmetry reasons.

Yes, I was just trying to establish a simple, obvious pattern and didn't even
think about RLENGTH.

I was also thinking you could do [most of] this with gawk's indirect function
calls (http://www.gnu.org/software/gawk/manual/gawk.html#Indirect-Calls) but as
the manual says:

"Unfortunately, indirect function calls cannot be used with the built-in
functions."

which is pretty disappointing to say the least! Yes, I could write wrapper
functions for all of them but it's easier just to write the if/elses above.
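
The wrapper approach mentioned above could look roughly like this. It is only a sketch: it needs gawk (indirect calls via @ are a gawk extension), and the wrapper names my_tolower, my_toupper, my_length and the apply() helper are invented for the example.

```sh
# indirect_demo is a hypothetical helper; it requires gawk.
# Each built-in gets a thin user-defined wrapper, and apply() picks the
# wrapper by name and calls it indirectly with the @ syntax.
indirect_demo() {
    gawk '
        function my_tolower(s) { return tolower(s) }
        function my_toupper(s) { return toupper(s) }
        function my_length(s)  { return length(s) }

        # op is the built-in name; fn is a local holding the wrapper name
        function apply(op, s,    fn) {
            fn = "my_" op
            return @fn(s)
        }

        { print apply("tolower", $0), apply("toupper", $0), apply("length", $0) }
    '
}

# usage: printf 'FoO\n' | indirect_demo
# prints: foo FOO 3
```

Whether this is nicer than the if/else chain is debatable: every new operation still needs one wrapper line, but the dispatch itself shrinks to a single call.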

Anyone know why we can't use indirect calls with the built-in functions?

Ed.

Ed Morton

Sep 23, 2012, 11:33:32 AM
On 9/22/2012 11:07 AM, er1ch wrote:
> On Fri, 21 Sep 2012 16:34:35 +0000, Ed Morton wrote:
>
>> Ed Morton <morto...@gmail.com> wrote:
>>
> [...]
>>>> I know that the tolower("FoO") function converts each character in
>>>> "FoO" to its lowercase equivalent "foo".
>>>>
>>>> What I'm trying to do is to apply this function to regexes, i.e. I
>>>> want to replace every string in <Html Brackets> into <lowercase html
>>>> characters>. This can be done with a simple sed command. Thus, I
<snip>
>> It feels to me like this is a fairly common question (in various
>> flavours) so here's a more general function showing how to update
>> segments of a string as identified by an RE and an operation:
>
> I really am grateful for your replies, they are more a tutorial than a
> "simple help".

Darn, now I'm embarrassed into writing it better :-). Below is how, given a bit
of thought, I'd really implement a function to modify strings matching an RE
based on a specified operation:

$ cat file
here <abc> are 0003 <DeFgHi> string <KLMNO> and 002 numeric examples
$
$ cat modre.awk
function modre(string,regexp,op,delta,    head,tail,old,new)
{
    tail = string
    while ( match(tail,regexp) ) {
        old = substr(tail,RSTART,RLENGTH)

        if      (op == "tolower") { new = tolower(old) }
        else if (op == "toupper") { new = toupper(old) }
        else if (op == "length")  { new = length(old) }
        else if (op == "int")     { new = int(old) }
        else if (op == "exp")     { new = exp(old) }
        else if (op == "+")       { new = old + delta }
        else if (op == "-")       { new = old - delta }
        else if (op == "*")       { new = old * delta }
        else if (op == "/")       { new = old / delta }
        else {
            printf "ERROR: modre() invalid op \"%s\".\n", op | "cat>&2"
            exit 1
        }

        head = head substr(tail,1,RSTART-1) new
        tail = substr(tail,RSTART+RLENGTH)
    }
    return head tail
}
{
    # string change examples:
    print 1, modre( $0, "<[^>]+>", "tolower" )
    print 2, modre( $0, "<[^>]+>", "toupper" )
    print 3, modre( $0, "<[^>]+>", "length" )

    # numeric change examples:
    print 4, modre( $0, "[[:digit:]]+", "int" )
    print 5, modre( $0, "[[:digit:]]+", "+", 159 )
    print 6, modre( $0, "[[:digit:]]+", "*", 5 )

}

$ awk -f modre.awk file
1 here <abc> are 0003 <defghi> string <klmno> and 002 numeric examples
2 here <ABC> are 0003 <DEFGHI> string <KLMNO> and 002 numeric examples
3 here 5 are 0003 8 string 7 and 002 numeric examples
4 here <abc> are 3 <DeFgHi> string <KLMNO> and 2 numeric examples
5 here <abc> are 162 <DeFgHi> string <KLMNO> and 161 numeric examples
6 here <abc> are 15 <DeFgHi> string <KLMNO> and 10 numeric examples

It no longer supports assignment, as you'd simply use gsub() for that; it only
supports cases where you need to perform some operation on the result of the match.
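
For the plain-assignment case, a gsub() call along these lines should be enough. This is a minimal sketch; the function name replace_tags and the replacement text foo are just placeholders.

```sh
# replace_tags is a hypothetical helper, not part of the thread code.
# gsub() replaces every <...> tag in $0 with a fixed string and returns
# the number of replacements, which is printed in front of the line.
replace_tags() {
    awk '{ n = gsub(/<[^>]+>/, "foo"); print n ": " $0 }'
}

printf 'here <abc> are 0003 <DeFgHi> string\n' | replace_tags
```

Because the replacement here does not depend on the matched text, no match()/substr() loop is needed.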

Regards,

Ed.


Robert Figura

Sep 24, 2012, 6:23:09 AM
Ed Morton <morto...@gmail.com> wrote:

> Anyone know why we can't use indirect calls with the built in functions?

Here's just a *wild guess* as i recently *skimmed* gawk's source while
trying to find a way to run the END block from an extension:

Internals are implemented directly in the finite automaton (built from
the grammar) but functions are implemented as instruction lists which
are accompanied by some extra architecture. I further guess it has not
been fixed because that implies some lone (ugly?) special cases in
the main loop.

Or i got it all wrong...
Kind Regards
- Robert Figura

--
/* mandlsig.c 0.42 (c) by Robert Figura */
I=1702;float O,o,i;main(l){for(;I--;putchar("oO .,\nt>neo.ckgel-t\
agidif@<ra urig FrtbeRo"[I%74?I>837&874>I?I^833:l%5:5]))for(O=o=l=
0;O*O+o*o<(16^l++);o=2*O*o+I/74/11.-1,O=i)i=O*O-o*o+I%74*.04-2.2;}

Aharon Robbins

Sep 24, 2012, 1:07:22 PM
In article <20120924122309.4d...@netcologne.de>,
Robert Figura <nc-fi...@netcologne.de> wrote:
>Ed Morton <morto...@gmail.com> wrote:
>
>> Anyone know why we can't use indirect calls with the built in functions?

It is an implementation issue. Indirect function calls are fairly simple -
get the string value, look up the function, set things up to run the
function's byte code, and voila.

With a little thought, it *might* be possible to make indirect calls
to built-ins work, but it'd take some work, and I'm not sure I'm strong
enough in the force w.r.t. the new internals. Also, see below.

>Here's just a *wild guess* as i recently *skimmed* gawk's source while
>trying to find a way to run the END block from an extension:

Yow. Painful. In the current gawk-4.0-stable I don't think you can
do that. In master you can't do that either, but an extension can do
the moral equivalent of the C atexit() function.

>Internals are implemented directly in the finite automaton (built from
>the grammar) but functions are implemented as instruction lists which
>are accompanied by some extra architecture. I further guess it has not
>been fixed because that implies some of lone (ugly?) special cases in
>the main loop.

Just in the one case for indirect function calls.

An additional, significant, problem is that if you want to do something like

fun = "gsub"
result = @fun(/pat/, repl, target)

you can't, because gsub() takes a regex literal, which has a different
meaning in a regular call, and the target is updated by reference, which
isn't possible for regular functions. It's a thorny issue.
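
To illustrate the by-reference point with portable awk (a sketch, with strip_x and by_value_demo as made-up names): gsub() changes its target in place, while a user-defined function that receives a scalar only changes its own local copy.

```sh
# by_value_demo is a hypothetical helper, not part of the thread code.
by_value_demo() {
    awk '
        # s is a scalar parameter, so this gsub() edits a local copy only
        function strip_x(s) { gsub(/x/, "", s); return s }

        BEGIN {
            t = "axbxc"
            n = gsub(/x/, "", t)    # built-in: t itself is modified
            print n, t

            u = "axbxc"
            strip_x(u)              # return value ignored on purpose
            print u                 # u is unchanged in the caller
        }
    '
}

by_value_demo
```

The first line printed is "2 abc" and the second is "axbxc", which is why a user-level wrapper around gsub() cannot simply mimic the built-in's calling convention.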

For most of the built-ins, some sort of trampoline to make indirection
work is likely possible. (tolower, sin, strftime, etc.) Not that I'm
making any promises...

And FWIW, lots of goodness happening these days in the master branch. :-)

Arnold
--
Aharon (Arnold) Robbins arnold AT skeeve DOT com
P.O. Box 354 Home Phone: +972 8 979-0381
Nof Ayalon Cell Phone: +972 50 729-7545
D.N. Shimshon 99785 ISRAEL

Robert Figura

Sep 24, 2012, 3:58:34 PM
arn...@skeeve.com (Aharon Robbins) wrote:
> Robert Figura <nc-fi...@netcologne.de> wrote:
> > trying to find a way to run the END block from an extension
>
> Yow. Painful. In the current gawk-4.0-stable I don't think you can
> do that. In master you can't do that either, but an extension can do
> the moral equivalent of the C atexit() function.

For some months now i am hallucinating gawk scripts inspired by plan9's
plumber[1]. At some point having exec() felt handy and i wrote a tiny
extension. Well, thanks for the hint, that might work for me.

> > Internals are implemented directly in the finite automaton (built from
> > the grammar) but functions are implemented as instruction lists which
> > are accompanied by some extra architecture. I further guess it has not
> > been fixed because that implies some of lone (ugly?) special cases in
> > the main loop.
>
> Just in the one case for indirect function calls.
>
> An additional, significant, problem is that if you want do something like
>
> fun = "gsub"
> result = @fun(/pat/, repl, target)
>
> you can't, because gsub() takes a regex literal, which has a different
> meaning in a regular call, and the target is updated by reference, which
> isn't possible for regular functions. It's a thorny issue.
>
> For most of the built-ins, some sort of trampoline to make indirection
> work is likely possible. (tolower, sin, strftime, etc.) Not that I'm
> making any promises...

Arrr. The regex arguments. And references. Sources for trouble and
desire.

> And FWIW, lots of goodness happening these days in the master branch. :-)

You seem to have been working on providing a better API lately, which
is nice. The term trampoline would mean "small wrappers"... That's also
a neat metaphor, not sure if i heard it before. I think i'm going to
keep an eye on gawk's repo [2].

Kind Regards
- Robert Figura

[1] "Plumbing and Other Utilities", Rob Pike
http://doc.cat-v.org/plan_9/4th_edition/papers/plumb
[2] git://git.sv.gnu.org/gawk.git

Aharon Robbins

Sep 24, 2012, 5:08:22 PM
In article <20120924215834.95...@netcologne.de>,
Robert Figura <nc-fi...@netcologne.de> wrote:
>arn...@skeeve.com (Aharon Robbins) wrote:
>> Robert Figura <nc-fi...@netcologne.de> wrote:
>> > trying to find a way to run the END block from an extension
>>
>> Yow. Painful. In the current gawk-4.0-stable I don't think you can
>> do that. In master you can't do that either, but an extension can do
>> the moral equivalent of the C atexit() function.
>
>For some months now i am hallucinating gawk scripts inspired by plan9's
>plumber[1]. At some point having exec() felt handy and i wrote a tiny
>extension. Well, thanks for the hint, that might work for me.

I read that paper when it came out but it's been a long time. I seem to
remember thinking that it was sort of awk inspired (describe the data and
then what to do with it).

>> And FWIW, lots of goodness happening these days in the master branch. :-)
>
>You seem to have been working on providing a better API lately,

That's an understatement. :-)

>The term trampoline would mean "small wrappers"... That's also
>a neat metaphor, not sure if i heard it before.

It comes from compiler terminology / technology, or else from OS technology
for assembly code to handle an interrupt and then jump into C. I don't
remember.

>I think i'm going to keep an eye on gawk's repo [2].

Always a good idea. :-)

Thanks,