
Reading hexadecimal numbers


Janis Papanagnou

Sep 30, 2012, 5:35:01 PM
I want to read in data that is defined in a hexadecimal representation,
something like 070F (and then operate on the decimal values). In GNU awk
there's a special function that I can use; say, val = strtonum("0x"$1)
I didn't find anything like that in standard awk, so I would have to
implement the conversion by hand in a function of my own[*]; or does anyone
have a neat idea that I did not think of?

Janis

[*] Quite trivial; but if avoidable I'd prefer some built-in instead of
an explicit loop and a mapping table.

pop

Sep 30, 2012, 8:34:50 PM
Janis Papanagnou said the following on 9/30/2012 4:35 PM:
I tried several things in sprintf to no avail in the past... Here is a
conversion program I obtained many years ago that may give you a
starting point:

function str2int(nstr) {
    value=0; radix=10; chr=1;
    if( match(nstr,"0|-0")==1 ) radix=8;
    if( match(nstr,"o|O")==1 ) {radix=8; chr=2;}
    if( match(nstr,"0d|0D")==1 ) {radix=10; chr=3;}
    if( match(nstr,"d|D")==1 ) {radix=10; chr=2;}
    if( match(nstr,"0x|0X")==1 ) {radix=16; chr=3;}
    if( match(nstr,"x|X")==1 ) {radix=16; chr=2;}
    lwrdigit="0123456789abcdef";
    sign=1;
    for( loop=chr; (chr=tolower(substr(nstr,loop,1))) != ""; ++loop) {
        if( index(lwrdigit,chr) ) {
            digit = (index(lwrdigit,chr) -1);
        } else if( chr == "-" ) {
            sign *= -1;
            continue;
        } else break;
        value=(value*=radix)+digit;
    }
    value*=sign;
    return(value);
}


HTH
pop is Mark

Grant

Sep 30, 2012, 8:45:40 PM
I use the table lookup, not neat, gets the job done ;)

Grant.

pop

Sep 30, 2012, 8:45:50 PM
pop said the following on 9/30/2012 7:34 PM:
> Janis Papanagnou said the following on 9/30/2012 4:35 PM:
>> I want to read in data that is defined in a hexadecimal representation,
>> something like 070F (and then operate on the decimal values). In GNU awk
>> there's a special function that I can use; say, val = strtonum("0x"$1)
>> I didn't find anything like that in standard awk, so I would have to
>> implement the conversion by hand in an own function[*]; or has anyone
>> some neat idea that I did not think of?
>>
>> Janis
>>
>> [*] Quite trivial; but if avoidable I'd prefer some built-in instead of
>> an explicit loop and a mapping table.
>>
here is another, more generic, from a magazine (I think):

function mystrtonum(str,    ret, chars, n, i, k, c)
{
    if (str ~ /^0[0-7]*$/) {
        n = length(str)
        ret = 0
        for (i = 1; i <= n; i++) {
            c = substr(str, i, 1)
            if ((k = index("01234567", c)) > 0) {
                k--
            }
            ret = ret * 8 + k
        }
    } else {
        if (str ~ /^0[xX][[:xdigit:]]+/) {
            str = substr(str, 3)
            n = length(str)
            ret = 0
            for (i = 1; i <= n; i++) {
                c = substr(str, i, 1)
                c = tolower(c)
                if ((k = index("0123456789", c)) > 0) {
                    k--
                } else {
                    if ((k = index("abcdef", c)) > 0) {
                        k += 9
                    }
                }
                ret = ret * 16 + k
            }
        } else {
            if (str ~ /^[-+]?([0-9]+([.][0-9]*([Ee][0-9]+)?)?|([.][0-9]+([Ee][-+]?[0-9]+)?))$/) {
                ret = str + 0
            } else {
                ret = "NAN"
            }
        }
    }
    return ret
}


pop->Mark

Janis Papanagnou

Sep 30, 2012, 9:08:56 PM
On 01.10.2012 02:45, pop wrote:
> pop said the following on 9/30/2012 7:34 PM:
>> Janis Papanagnou said the following on 9/30/2012 4:35 PM:
>>> I want to read in data that is defined in a hexadecimal representation,
>>> something like 070F (and then operate on the decimal values). In GNU awk
>>> there's a special function that I can use; say, val = strtonum("0x"$1)
>>> I didn't find anything like that in standard awk, so I would have to
>>> implement the conversion by hand in an own function[*]; or has anyone
>>> some neat idea that I did not think of?
>>>
>>> Janis
>>>
>>> [*] Quite trivial; but if avoidable I'd prefer some built-in instead of
>>> an explicit loop and a mapping table.
>>>
> here is another, more generic, from a magazine (I think):

Thanks. I think for my purpose something simpler would suffice, say, like

function h (arg, res, s)
{
    for (s=arg; length(s); s=substr(s,2))
        res = res*16 + ht[substr(s,1,1)]
    return res
}

BEGIN {
    n=split ("0,1,2,3,4,5,6,7,8,9,A,B,C,D,E,F",t,",")
    for (i=1; i<=n; i++) ht[t[i]]=i-1
}

{ print $1, h($1) }


Janis

Ed Morton

Oct 1, 2012, 10:11:45 AM
Janis Papanagnou <janis_pa...@hotmail.com> wrote:

> On 01.10.2012 02:45, pop wrote:
> > pop said the following on 9/30/2012 7:34 PM:
> >> Janis Papanagnou said the following on 9/30/2012 4:35 PM:
> >>> I want to read in data that is defined in a hexadecimal representation,
> >>> something like 070F (and then operate on the decimal values). In GNU awk
> >>> there's a special function that I can use; say, val = strtonum("0x"$1)
> >>> I didn't find anything like that in standard awk, so I would have to
> >>> implement the conversion by hand in an own function[*]; or has anyone
> >>> some neat idea that I did not think of?
> >>>
> >>> Janis
> >>>
> >>> [*] Quite trivial; but if avoidable I'd prefer some built-in instead of
> >>> an explicit loop and a mapping table.
> >>>
> > here is another, more generic, from a magazine (I think):
>
> Thanks. I think for my purpose something simpler would suffice, say, like
>
> function h (arg, res, s)
> {
> for (s=arg; length(s); s=substr(s,2))
> res = res*16 + ht[substr(s,1,1)]
> return res
> }
>
> BEGIN {
> n=split ("0,1,2,3,4,5,6,7,8,9,A,B,C,D,E,F",t,",")
> for (i=1; i<=n; i++) ht[t[i]]=i-1
> }
>
> { print $1, h($1) }
>

It could be a bit simpler still:

function h (arg, res, s)
{
    for (s=arg; length(s); s=substr(s,2))
        res = res*16 + index("123456789ABCDEF",substr(s,1,1))
    return res
}

{ print $1, h($1) }

Regards,

Ed.

Posted using www.webuse.net

Janis Papanagnou

Oct 1, 2012, 10:47:51 AM
Indeed.

Now let's add one more function to make it more stable WRT arguments...

function h (arg, res, s)
{
    for (s=toupper(arg); length(s); s=substr(s,2))
        res = res*16 + index("123456789ABCDEF",substr(s,1,1))
    return res
}

Not bulletproof (we would have to check index()), but nice and compact.

Janis

Ed Morton

Oct 1, 2012, 11:23:09 AM
You could just check in the loop condition that s only contains one or more
valid characters instead of checking if it has a length:

function h (arg, res, s)
{
    for (s=toupper(arg); s~/^[0123456789ABCDEF]+$/; s=substr(s,2))
        res = res*16 + index("123456789ABCDEF",substr(s,1,1))
    return res
}

{ print $1, h($1) }

If the function returns an empty string then the input was invalid. I thought
about making the test for [:xdigit:] but that wouldn't work with nawk (unlike
the rest of the function) so it didn't seem worthwhile.
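
For anyone who wants to try the validated version from the command line, here is a minimal sketch, assuming a POSIX shell and awk. The wrapper name hex2dec is hypothetical; the awk function is the one posted above, unchanged apart from layout.

```shell
# hex2dec: print the decimal value of each argument on its own line,
# or an empty line if the argument is not purely hexadecimal.
hex2dec() {
    printf '%s\n' "$@" | awk '
    function h(arg,    res, s) {
        # The loop only runs while s is non-empty and purely hexadecimal,
        # so index() is never called with "" or an invalid character.
        for (s = toupper(arg); s ~ /^[0123456789ABCDEF]+$/; s = substr(s, 2))
            res = res * 16 + index("123456789ABCDEF", substr(s, 1, 1))
        return res
    }
    { print h($1) }'
}

hex2dec 070F ff 12XY
```

With these arguments the first two lines should be the decimal values and the third should be empty, since 12XY fails the test before the first iteration and res is never assigned.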

Janis Papanagnou

Oct 1, 2012, 11:58:35 AM
Am 01.10.2012 17:23, schrieb Ed Morton:
> Janis Papanagnou <janis_pa...@hotmail.com> wrote:
>
>> [...]
>
> You could just check in the loop condition that s only contains one or more
> valid characters instead of checking if it has a length:

It's always nice to see how awk programs can be improved with minimal
changes. :-)

Janis

>
> function h (arg, res, s)
> {
> for (s=toupper(arg); s~/^[0123456789ABCDEF]+$/; s=substr(s,2))
> res = res*16 + index("123456789ABCDEF",substr(s,1,1))
> return res
> }
>
> { print $1, h($1) }
>
> [...]

Ed Morton

Oct 1, 2012, 1:58:46 PM
Janis Papanagnou <janis_pa...@hotmail.com> wrote:

> Am 01.10.2012 17:23, schrieb Ed Morton:
> > Janis Papanagnou <janis_pa...@hotmail.com> wrote:
> >
> >> [...]
> >
> > You could just check in the loop condition that s only contains one or more
> > valid characters instead of checking if it has a length:
>
> It's always nice to see how awk programs can be improved with minimal
> changes. :-)

Yes it is, and I learned a few things along the way. One was that the empty
string "" matches the first character of any other string according to index():

$ awk 'BEGIN{ print index("123","1") }'
1
$ awk 'BEGIN{ print index("123","") }'
1
$ awk 'BEGIN{ print index("123","4") }'
0

I expected a 0 return since "1" != "" so I wouldn't expect them to both exist at
exactly the same position in "123". Seems to me like the sub-string "1" exists
at position 1 and the first empty substring "" occurs before it:

$ awk 'BEGIN{ foo="123"; sub("1","X",foo); print foo }'
X23
$ awk 'BEGIN{ foo="123"; sub("","X",foo); print foo }'
X123

so therefore that'd be position 0.

Manuel Collado

Oct 1, 2012, 2:32:04 PM
El 01/10/2012 19:58, Ed Morton escribió:
>[...]
> Yes it is, and I learned a few things along the way. One was that the empty
> string "" matches the first character of any other string according to index():
>
> $ awk 'BEGIN{ print index("123","1") }'
> 1
> $ awk 'BEGIN{ print index("123","") }'
> 1
> $ awk 'BEGIN{ print index("123","4") }'
> 0
>
> I expected a 0 return since "1" != "" so I wouldn't expect them to both exist at
> exactly the same position in "123". Seems to me like the sub-string "1" exists
> at position 1 and the first empty substring "" occurs before it:
>
> $ awk 'BEGIN{ foo="123"; sub("1","X",foo); print foo }'
> X23
> $ awk 'BEGIN{ foo="123"; sub("","X",foo); print foo }'
> X123
>
> so therefore that'd be position 0.

No surprise for me. My assumption is that, because the empty string has
0 length, it occurs at exactly the same position of the character that
follows (or precedes) it.

I.e., the empty string as RE can match any virtual empty string before
or after any real character of a given string.

Regards,
--
Manuel Collado - http://lml.ls.fi.upm.es/~mcollado



Manuel Collado

Oct 1, 2012, 2:36:29 PM
Ooops! Should be just "... that follows it ..."

Janis Papanagnou

Oct 1, 2012, 2:50:52 PM
I don't find it too astonishing either, but because of the special
handling that GNU awk has WRT splitting on "" (empty string patterns),
I'd be interested whether all non-GNU awks behave the same way with
index().

The book of A., K. and W. just says: "The function index(s,t) returns
the leftmost position where the string t begins in s, or zero if t does
not occur in s."; which would fit.

Janis
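
Whatever a given awk returns for index(s, ""), a caller that cares can sidestep the question. A minimal sketch, assuming a POSIX shell and awk; the name safe_index (shell wrapper and awk function alike) is hypothetical and not part of any awk:

```shell
# safe_index STRING SUBSTRING: like index(), but an empty SUBSTRING
# is reported as "not found" (0).
safe_index() {
    awk -v s="$1" -v t="$2" '
    # Treat an empty search string as "not found" instead of relying on
    # what index(s, "") returns in a particular awk.
    function safe_index(s, t) { return (t == "") ? 0 : index(s, t) }
    BEGIN { print safe_index(s, t) }'
}

safe_index 123 1
safe_index 123 ""
safe_index 123 4
```

Only the middle call differs from plain index() as shown in the transcript above; the other two are passed straight through.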

Ed Morton

Oct 1, 2012, 3:29:36 PM
I think you were right the first time:

$ awk 'BEGIN{ foo="123"; gsub("","-",foo); print foo }'
-1-2-3-

as that'd explain the trailing "-".

Ed.

>
> >
> > I.e., the empty string as RE can match any virtual empty string before
> > or after any real character of a given string.
> >
> > Regards,
>


Posted using www.webuse.net

Anton Treuenfels

Oct 1, 2012, 6:57:00 PM

"Ed Morton" <morto...@gmail.com> wrote in message
news:201210011...@webuse.net...
function h(arg, res, s) {

    if ( match(arg, /^[0-9A-F]+/i) ) {
        for ( s = 1; s <= RLENGTH; s++ )
            res = res * 16 + index( "123456789ABCDEF", toupper(substr(arg,s,1)) )
        return res
    }
    return "NaN"
}

Although I'm not certain I understand the reluctance just to use toupper()
on 'arg' and be done with it. I suppose another "local" variable could be
added and the uppercase version of 'arg' stored in that (or keep it in 's'
and use the new local as the index value). Or even - heresy! - use RSTART as
a writeable variable...

s = toupper( arg )
do {
    res = res * 16 + index( "123456789ABCDEF", substr(s, RSTART, 1) )
} while ( ++RSTART <= RLENGTH )

Do look odd, don't it?

- Anton Treuenfels




Grant

Oct 1, 2012, 7:27:11 PM
In the context of url parameter decoding, lifted from running code:

# urldecode module
# Copyright (C) 2008 Grant Coady <gr...@bugsplatter.id.au> GPLv2
#
BEGIN {
    for (i = 0; i < 16; i++) {
        hexc[sprintf("%c", i + (i > 9 ? 55 : 48))] = i
    }
    for (i = 32; i < 127; i++) {
        ++charset[sprintf("%c", i)]
    }
}
function urldecode(s,    a, b, c, d, i)
{
    d = ""
    for (i = 1; i <= length(s); i++) {
        c = substr(s, i, 1)
        if (c == "%") {
            a = toupper(substr(s, ++i, 1))
            b = toupper(substr(s, ++i, 1))
            c = sprintf("%c", hexc[a] * 16 + hexc[b])
        }
        else {
            sub(/+/, " ", c)
        }
        d = d (c in charset ? c : " ")
    }
    return d
}
# read and decode url query string to global url_query_parm[]
function url_decode_query(    a, i)    # trashes $0 record buffer
{
    $0 = ENVIRON["QUERY_STRING"]
    gsub(/&/, " ")    # get key=value pair fields
    for (i = 1; i <= NF; i++) {
        split($i, a, "=")    # separate key, value, then decode
        url_query_parm[a[1]] = urldecode(a[2])
    }
}
...
Written and tested with gawk 3.1.7 on slackware-11.0.

Grant.

Janis Papanagnou

Oct 2, 2012, 3:10:03 AM
On 02.10.2012 00:57, Anton Treuenfels wrote:
>
> "Ed Morton" <morto...@gmail.com> wrote in message
> news:201210011...@webuse.net...
>> Janis Papanagnou <janis_pa...@hotmail.com> wrote:
[...]
No reluctance here; we already used toupper() on arg. (see above)

> I suppose another "local" variable could be added
> and the uppercase version of 'arg' stored in that (or keep it in 's' and use
> the new local as the index value). Or even - heresy! - use RSTART as a

Using RSTART is interesting as well.

Janis

Ed Morton

Oct 2, 2012, 8:08:51 AM
On 10/1/2012 5:57 PM, Anton Treuenfels wrote:
<snip>
> function h(arg, res, s) {
>
> if ( match(arg, /^[0-9A-F]+/i) ) {

Hang on. Referee!!! What's that "i" at the end of the RE - to signify the case
should be ignored? I've never seen that before and I can't find any reference to
it in the gawk manual or the POSIX standard but I see it works with gawk 4.0.0:

$ echo "aBc" | awk '/b/'
$ echo "aBc" | awk '/b/i'
aBc

Is that a gawk thing or a POSIX thing or what?

You need a "$" at the end of the RE or an arg of "123XYZ" would match.

> for ( s = 1; s <= RLENGTH; s++ )
> res = res * 16 + index( "123456789ABCDEF", toupper(substr(arg,s,1)) )

That would call toupper() once for every character in arg, it'd obviously be
much more efficient to call it once before the loop.

> return res
> }
> return "NaN"
> }
>
> Although I'm not certain I understand the reluctance just to use toupper() on
> 'arg' and be done with it.

There's no reluctance AFAIK.

> I suppose another "local" variable could be added and
> the uppercase version of 'arg' stored in that (or keep it in 's' and use the new
> local as the index value). Or even - heresy! - use RSTART as a writeable
> variable...

I don't think I'd do that, I'm too easily confused!

> s = toupper( arg )
> do {
> res = res * 16 + index( "123456789ABCDEF", substr(s, RSTART, 1) )
> } while ( ++RSTART <= RLENGTH )
>
> Do look odd, don't it?

Yes.

Ed.
>
> - Anton Treuenfels
>
>
>
>

Ed Morton

Oct 2, 2012, 8:28:07 AM
On 10/2/2012 7:08 AM, Ed Morton wrote:
> On 10/1/2012 5:57 PM, Anton Treuenfels wrote:
> <snip>
>> function h(arg, res, s) {
>>
>> if ( match(arg, /^[0-9A-F]+/i) ) {
>
> Hang on. Referee!!! What's that "i" at the end of the RE - to signify the case
> should be ignored? I've never seen that before and I can't find any reference to
> it in the gawk manual or the POSIX standard but I see it works with gawk 4.0.0:
>
> $ echo "aBc" | awk '/b/'
> $ echo "aBc" | awk '/b/i'
> aBc
>
> Is that a gawk thing or a POSIX thing or what?

A quick test shows that appending "i" to the RE for a caseless match in

gawk 3.0.4 - works
nawk - works
/usr/xpg4/bin/awk - gives a syntax error

Interesting....

Ed.

Ed Morton

Oct 2, 2012, 8:41:09 AM
On 10/2/2012 7:08 AM, Ed Morton wrote:
> On 10/1/2012 5:57 PM, Anton Treuenfels wrote:
> <snip>
>> function h(arg, res, s) {
>>
>> if ( match(arg, /^[0-9A-F]+/i) ) {
>
> Hang on. Referee!!! What's that "i" at the end of the RE - to signify the case
> should be ignored? I've never seen that before and I can't find any reference to
> it in the gawk manual or the POSIX standard but I see it works with gawk 4.0.0:
>
> $ echo "aBc" | awk '/b/'
> $ echo "aBc" | awk '/b/i'
> aBc
>
> Is that a gawk thing or a POSIX thing or what?

A quick test shows that appending "i" to the RE for a caseless match in

gawk 3.0.4 - works
gawk --posix - works
nawk - works
/usr/xpg4/bin/awk - gives a syntax error

Interestingly (to me!):

1) I normally think of /usr/xpg4/bin/awk as a POSIX awk but it doesn't accept
"/.../i" while gawk --posix does.
2) The case-insensitivity section of the gawk book
(http://www.gnu.org/software/gawk/manual/gawk.html#Case_002dsensitivity) doesn't
mention it, nor does the POSIX standard
(http://pubs.opengroup.org/onlinepubs/009695399/utilities/awk.html).

Regards,

Ed.

none Aharon Robbins

Oct 2, 2012, 9:38:28 AM

The Baggins is tricksy!
-- Gollum

In article <k4elgk$rr0$1...@dont-email.me>,
Ed Morton <morto...@gmail.com> wrote:
>On 10/1/2012 5:57 PM, Anton Treuenfels wrote:
><snip>
>> function h(arg, res, s) {
>>
>> if ( match(arg, /^[0-9A-F]+/i) ) {
>
>Hang on. Referee!!! What's that "i" at the end of the RE - to signify the case
>should be ignored? I've never seen that before and I can't find any
>reference to
>it in the gawk manual or the POSIX standard

That's because it is indeed not there.

It is a perl-ism NOT IMPLEMENTED by gawk, nor by any other awk.

>but I see it works with gawk 4.0.0:

No. It's not doing what you think it is.

>$ echo "aBc" | awk '/b/'
>$ echo "aBc" | awk '/b/i'
>aBc

Let's take /b/i and break it down into its component parts:

/b/ # Does expression match /b/ ? Result is zero in this case
# INVISIBLE concatenation operator
i # An uninitialized variable, equal to ""

Now, rewrite this as

0 ""

and you get

"0" # not a null string, nor a zero value (see the gawk doc)

In other words, the expression result is true!!

That's why the string was printed, since the default action is to print.

Concatenation should have had a real operator. But it's too late now.
Besides, this is the kind of stuff which gives the language lawyers
their kicks. :-)

You may now shake your heads in wonder and enjoy the rest of your day.

Arnold
--
Aharon (Arnold) Robbins arnold AT skeeve DOT com
P.O. Box 354 Home Phone: +972 8 979-0381
Nof Ayalon Cell Phone: +972 50 729-7545
D.N. Shimshon 99785 ISRAEL
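
To make the decomposition above concrete, and to show one portable way of getting the case-insensitive match that /b/i was meant to produce, here is a minimal sketch assuming a POSIX shell and awk. The function names concat_value and imatch are hypothetical. The first spells out the concatenation explicitly instead of relying on /b/i being accepted by the parser; the second folds case with tolower() and expects a lowercase pattern.

```shell
# concat_value: print the string that ($0 ~ /b/) concatenated with ""
# evaluates to, for each input line.
concat_value() {
    awk '{ x = ($0 ~ /b/) ""; print x }'
}

# imatch PATTERN: print input lines that match the lowercase PATTERN,
# ignoring case, by lowercasing the record before the comparison.
imatch() {
    awk -v re="$1" 'tolower($0) ~ re'
}

printf 'aBc\n' | concat_value
printf 'aBc\nxyz\n' | imatch b
```

The first call prints the string "0", which is the value that ends up being tested as a pattern; the second prints only the aBc line.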

Ed Morton

Oct 2, 2012, 10:23:02 AM
Doh! I should have tested with something that didn't match either case too!

Thanks for the explanation. Since it's just string concatenation, any idea why
/usr/xpg4/bin/awk gives a syntax error for it:

$ echo "aBc" | /usr/xpg4/bin/awk '/f/i'
/usr/xpg4/bin/awk: syntax error Context is:
>>> /f/i <<<

I noticed that these don't produce the same result as each other:

$ echo "aBc" | gawk '(/f/)i'
aBc
$ echo "aBc" | /usr/xpg4/bin/awk '(/f/)i'
$

which led me to try this:

$ echo "aBc" | gawk '"0"'
aBc
$ echo "aBc" | /usr/xpg4/bin/awk '"0"'
$
$ echo "aBc" | /usr/xpg4/bin/awk '/f/"0"'
/usr/xpg4/bin/awk: syntax error Context is:
>>> /f/"0" <<<
$ echo "aBc" | /usr/xpg4/bin/awk '1"0"'
aBc
$ echo "aBc" | /usr/xpg4/bin/awk '0"0"'
$

Again, interesting....

Regards,

Ed.

>
> Concatenation should have had a real operator. But it's too late now.
> Besides, this is the kind of stuff which gives the language lawyers
> their kicks. :-)
>
> You may now shake your heads in wonder and enjoy the rest of your day.
>
> Arnold


Posted using www.webuse.net

Janis Papanagnou

Oct 2, 2012, 11:27:35 AM
Am 02.10.2012 15:38, schrieb none Aharon Robbins:
> The Baggins is tricksy!
> -- Gollum
>
> In article <k4elgk$rr0$1...@dont-email.me>,
> Ed Morton <morto...@gmail.com> wrote:
>> On 10/1/2012 5:57 PM, Anton Treuenfels wrote:
>> <snip>
>>> function h(arg, res, s) {
>>>
>>> if ( match(arg, /^[0-9A-F]+/i) ) {
>>
>> Hang on. Referee!!! What's that "i" at the end of the RE - to signify the case
>> should be ignored? I've never seen that before and I can't find any
>> reference to
>> it in the gawk manual or the POSIX standard
>
> That's because it is indeed not there.
>
> It is a perl-ism NOT IMPLEMENTED by gawk, nor by any other awk.
>
>> but I see it works with gawk 4.0.0:
>
> No. It's not doing what you think it is.
>
>> $ echo "aBc" | awk '/b/'
>> $ echo "aBc" | awk '/b/i'
>> aBc
>
> Let's take /b/i and break it down into its component parts:
>
> /b/ # Does expression match /b/ ? Result is zero in this case
> # INVISIBLE concatenation operator
> i # An uninitialized variable, equal to ""

LOL - that's along the line of something I posted here a while ago

awk 'Doh! Uh-oh, oh no; no! (not again)'

which can also be decomposed until one understands what it does.

Janis ;-)

Janis Papanagnou

Oct 2, 2012, 11:33:47 AM
Am 02.10.2012 16:23, schrieb Ed Morton:
> Aharon Robbins <arnold@chumley.(none)> wrote:
>
[...]
>>
>> Let's take /b/i and break it down into its component parts:
>>
>> /b/ # Does expression match /b/ ? Result is zero in this case
>> # INVISIBLE concatenation operator
>> i # An uninitialized variable, equal to ""
>>
>> Now, rewrite this as
>>
>> 0 ""
>>
>> and you get
>>
>> "0" # not a null string, nor a zero value (see the gawk doc)
>>
>> In other words, the expression result is true!!
>>
>> That's why the string was printed, since the default action is to print.
>
> Doh! I should have tested with something that didn't match either case too!
>
> Thanks for the explanation. Since it's just string concatenation, any idea why
> /usr/xpg4/bin/awk gives a syntax error for it:

I'd suppose that /.../ is not guaranteed to be a part of an expression
that evaluates to a boolean value; it could as well be a syntactical
construct of the language. I wouldn't be astonished about that, and I
would avoid such constructs.

Curious; what does XPG4 awk do with a construct like...?

{ if (/abc/) print "okay" }


Janis

>
> $ echo "aBc" | /usr/xpg4/bin/awk '/f/i'
> /usr/xpg4/bin/awk: syntax error Context is:
> >>> /f/i <<<
>
[...]


Ed Morton

Oct 2, 2012, 11:59:06 AM
Janis Papanagnou <janis_pa...@hotmail.com> wrote:
<snip>
> Curious; what does XPG4 awk do with a construct like...?
>
> { if (/abc/) print "okay" }

The right thing, same as gawk:

$ echo "abc" | /usr/xpg4/bin/awk '{ if (/abc/) print "okay" }'
okay
$

$ echo "abc" | gawk '{ if (/abc/) print "okay" }'
okay
$

$ echo "def" | /usr/xpg4/bin/awk '{ if (/abc/) print "okay" }'
$

$ echo "def" | gawk '{ if (/abc/) print "okay" }'
$

Ed.

Posted using www.webuse.net

Anton Treuenfels

Oct 2, 2012, 1:16:41 PM

"Ed Morton" <morto...@gmail.com> wrote in message
news:k4elgk$rr0$1...@dont-email.me...
> On 10/1/2012 5:57 PM, Anton Treuenfels wrote:
> <snip>
>> function h(arg, res, s) {
>>
>> if ( match(arg, /^[0-9A-F]+/i) ) {
>
> Hang on. Referee!!! What's that "i" at the end of the RE - to signify the
> case should be ignored? I've never seen that before and I can't find any
> reference to it in the gawk manual or the POSIX standard but I see it
> works with gawk 4.0.0:
>
> $ echo "aBc" | awk '/b/'
> $ echo "aBc" | awk '/b/i'
> aBc
>
> Is that a gawk thing or a POSIX thing or what?

Um, it's a TAWK thing I took to be more general than it really is,
apparently. It can go away if 'arg' is subjected to toupper() before the
match.

TAWK also has an 's' qualifier for "shortest possible match" instead of the
default "longest possible match", but I've not yet had occasion to use it.

> You need a "$" at the end of the RE or an arg of "123XYZ" would match.

Mmm, don't think so. At least, not the whole string. Actually when I've done
this in my own code I use:

/[1-9A-F][0-9A-F]*/

which neatly skips over any leading zeroes and also ignores any radix
specifier such as prefix "0x" (C) or "$" (Motorola assembly) or suffix "H"
(Intel assembly). But then I've already guaranteed what I'm matching is some
form of hexadecimal notation, so it's pretty safe by that point (if it
doesn't match, then the value portion of the string must be all "0"
characters, so the result is zero).
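
A minimal sketch of that unanchored approach, assuming a POSIX shell and awk and assuming (as stated above) that the input is already known to be some form of hexadecimal notation. The names hex_any and h are hypothetical, and toupper() stands in for the TAWK qualifier:

```shell
# hex_any: print the decimal value of each argument, ignoring leading
# zeroes and any radix prefix or suffix around the hex digits.
hex_any() {
    printf '%s\n' "$@" | awk '
    function h(arg,    res, s, i) {
        s = toupper(arg)
        # No match means every value digit is 0 (assuming the input is
        # already known to be hexadecimal notation), so the result is 0.
        if (!match(s, /[1-9A-F][0-9A-F]*/))
            return 0
        # Convert only the matched span, from RSTART for RLENGTH characters.
        for (i = RSTART; i < RSTART + RLENGTH; i++)
            res = res * 16 + index("123456789ABCDEF", substr(s, i, 1))
        return res
    }
    { print h($1) }'
}

hex_any 0x1F '$00ff' 1FH 0x00
```

This relies on the prefix and suffix characters not being hex digits themselves, which holds for "0x", "$" and "H".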

>> for ( s = 1; s <= RLENGTH; s++ )
>> res = res * 16 + index( "123456789ABCDEF",
>> toupper(substr(arg,s,1)) )
>
> That would call toupper() once for every character in arg, it'd obviously
> be much more efficient to call it once before the loop.
>
>> return res
>> }
>> return "NaN"
>> }
>>
>> Although I'm not certain I understand the reluctance just to use
>> toupper() on
>> 'arg' and be done with it.
>
> There's no reluctance AFAIK.

Um, I may not have been terribly clear, sorry. Most of the code I saw
assigned the result to 's' instead of back to 'arg'. That necessitated a
local variable which seemed unnecessary to me.

>> I suppose another "local" variable could be added and
>> the uppercase version of 'arg' stored in that (or keep it in 's' and use
>> the new
>> local as the index value). Or even - heresy! - use RSTART as a writeable
>> variable...
>
> I don't think I'd do that, I'm too easily confused!

*grin*

Janis Papanagnou

Oct 2, 2012, 1:54:29 PM
On 02.10.2012 19:16, Anton Treuenfels wrote:
>>>
>>> Although I'm not certain I understand the reluctance just to use toupper() on
>>> 'arg' and be done with it.
>>
>> There's no reluctance AFAIK.
>
> Um, I may not have been terribly clear, sorry. Most of the code I saw assigned
> the result to 's' instead of back to 'arg'. That necessitated a local variable
> which seemed unnecessary to me.

This is a different issue, though.

And, yes, I wouldn't do that because it makes the code less clear. YMMV,
of course. Probably less an issue if you'd not call it 'arg' then, but
something that indicates that the variable is mangled and changed while
passed through the function. I guess it just depends whether that's to
be considered a good thing to do, or not.
The gain of saving space for a scalar variable is certainly irrelevant.

Janis

> [...]


Ed Morton

Oct 2, 2012, 2:26:58 PM
Anton Treuenfels <teamt...@yahoo.com> wrote:

>
> "Ed Morton" <morto...@gmail.com> wrote in message
> news:k4elgk$rr0$1...@dont-email.me...
> > On 10/1/2012 5:57 PM, Anton Treuenfels wrote:
> > <snip>
> >> function h(arg, res, s) {
> >>
> >> if ( match(arg, /^[0-9A-F]+/i) ) {
> >
> > Hang on. Referee!!! What's that "i" at the end of the RE - to signify the
> > case should be ignored? I've never seen that before and I can't find any
> > reference to it in the gawk manual or the POSIX standard but I see it
> > works with gawk 4.0.0:
> >
> > $ echo "aBc" | awk '/b/'
> > $ echo "aBc" | awk '/b/i'
> > aBc
> >
> > Is that a gawk thing or a POSIX thing or what?
>
> Um, it's a TAWK thing I took to be more general than it really is,
> apparently. It can go away if 'arg' is subjected to toupper() before the
> match.
>
> TAWK also has an 's' qualifier for "shortest possible match" instead of the
> default "longest possible match", but I've not yet had occasion to use it.

Ah, got it. Now I understand.

> > You need a "$" at the end of the RE or an arg of "123XYZ" would match.
>
> Mmm, don't think so. At least, not the whole string.

You're using the match() for 2 things:

1) validating the input is a hex number, and
2) determining the length of the hex number

In my view of what constitutes valid input, FWIW, you can only do "1" if you
catch "123XYZ" as not being a hex number and to do that you need to match on
/^..$/ rather than just /^../.

If you allow 123XYZ as valid input then you should probably also allow LMN123XYZ
and then start your loop below at RSTART.
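
A minimal sketch of the strict reading, where one match() call both validates the input and supplies the length. It assumes a POSIX shell and awk; the names hex_strict and h are hypothetical, and toupper() is applied once up front rather than relying on a case-insensitivity qualifier:

```shell
# hex_strict: print the decimal value of each argument, or NaN if the
# argument is not made up entirely of hex digits.
hex_strict() {
    printf '%s\n' "$@" | awk '
    function h(arg,    res, s, i) {
        s = toupper(arg)                  # convert case once, before the loop
        if (!match(s, /^[0-9A-F]+$/))     # anchored at both ends: 123XYZ fails
            return "NaN"
        for (i = 1; i <= RLENGTH; i++)    # RLENGTH is the whole string here
            res = res * 16 + index("123456789ABCDEF", substr(s, i, 1))
        return res
    }
    { print h($1) }'
}

hex_strict 1f 123XYZ
```

Dropping the "$" from the regexp would make 123XYZ convert as 123 instead of being rejected, which is the difference under discussion.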

> Actually when I've done
> this in my own code I use:
>
> /[1-9A-F][0-9A-F]*/
>
> which neatly skips over any leading zeroes and also ignores any radix
> specifier such as prefix "0x" (C) or "$" (Motorola assembly) or suffix "H"
> (Intel assembly). But then I've already guaranteed what I'm matching is some
> form of hexadecimal notation, so it's pretty safe by that point (if it
> doesn't match, then the value portion of the string must be all "0"
> characters, so the result is zero).
>
> >> for ( s = 1; s <= RLENGTH; s++ )
> >> res = res * 16 + index( "123456789ABCDEF",
> >> toupper(substr(arg,s,1)) )
> >
> > That would call toupper() once for every character in arg, it'd obviously
> > be much more efficient to call it once before the loop.
> >
> >> return res
> >> }
> >> return "NaN"
> >> }
> >>
> >> Although I'm not certain I understand the reluctance just to use
> >> toupper() on
> >> 'arg' and be done with it.
> >
> > There's no reluctance AFAIK.
>
> Um, I may not have been terribly clear, sorry. Most of the code I saw
> assigned the result to 's' instead of back to 'arg'. That necessitated a
> local variable which seemed unnecessary to me.

Ah, now I understand where you're coming from. Personally, it just never
occurred to me NOT to use a local variable and modify that but YMMV I suppose.

Ed.

Posted using www.webuse.net

Janis Papanagnou

Oct 4, 2012, 5:32:44 AM
Am 02.10.2012 17:59, schrieb Ed Morton:
> Janis Papanagnou <janis_pa...@hotmail.com> wrote:
> <snip>
>> Curious; what does XPG4 awk do with a construct like...?
>>
>> { if (/abc/) print "okay" }
>
> The right thing, same as gawk:
> [...]

Can you please also check this code on XPG4 awk:

    function x (arg)
    {
        if (arg) print "hi"
        return 1
    }
    x(/abc/)

(I seem to recall that there was an awk that accepts that.)

Thanks.

GNU awk for example doesn't seem to accept it. If /abc/ would
be a basic boolean expression it should, but obviously it isn't;
one has to call that as: x($0~/abc/)

Janis

Kenny McCormack

Oct 4, 2012, 5:55:35 AM
In article <k4jl3p$2q7$1...@speranza.aioe.org>,
Janis Papanagnou <janis_pa...@hotmail.com> wrote:
...
>GNU awk for example doesn't seem to accept it. If /abc/ would
>be a basic boolean expression it should, but obviously it isn't;
>one has to call that as: x($0~/abc/)

Works for me (with a warning):

$ cat food
function x (arg) {
    if (arg) print "hi (program output)"
    return 1
}
{x(/abc/)}
$ gawk -f food
gawk: food:5: warning: regexp constant for parameter #1 yields boolean value
abc
hi (program output)
def
ghi
abdef
abcdef
hi (program output)
^C
$

--
Is God willing to prevent evil, but not able? Then he is not omnipotent.
Is he able, but not willing? Then he is malevolent.
Is he both able and willing? Then whence cometh evil?
Is he neither able nor willing? Then why call him God?
~ Epicurus

Aharon Robbins

Oct 4, 2012, 6:30:22 AM
In article <k4jl3p$2q7$1...@speranza.aioe.org>,
Janis Papanagnou <janis_pa...@hotmail.com> wrote:
>Can you please also check this code on XPG4 awk:
>
> function x (arg)
> {
> if (arg) print "hi"
> return 1
> }
> x(/abc/)
>
>(I seem to recall that there was an awk that accepts that.)
>
>Thanks.
>
>GNU awk for example doesn't seem to accept it. If /abc/ would
>be a basic boolean expression it should, but obviously it isn't;
>one has to call that as: x($0~/abc/)
>
>Janis
>

Um, I beg your pardon?

$ cat foo.awk
function x (arg)
{
    if (arg) print "hi"
    return 1
}

x(/abc/)

$ echo abc | ./gawk -f foo.awk
gawk: foo.awk:7: warning: regexp constant for parameter #1 yields boolean value
hi
abc

Gawk clearly accepts it and it works like it is supposed to. How did you
test this?

In addition, it works correctly with nawk, mawk, and the MKS awk (the
code for /usr/xpg4/bin/awk from Open Solaris).

Janis Papanagnou

Oct 4, 2012, 7:32:10 AM
Am 04.10.2012 12:30, schrieb Aharon Robbins:
> In article <k4jl3p$2q7$1...@speranza.aioe.org>,
> Janis Papanagnou <janis_pa...@hotmail.com> wrote:
[...]
>> (I seem to recall that there was an awk that accepts that.)
>>
>> Thanks.
>>
>> GNU awk for example doesn't seem to accept it. [...]

>
> Um, I beg your pardon?
[...]
> gawk: foo.awk:7: warning: regexp constant for parameter #1 yields boolean value
> hi
> abc
>
> Gawk clearly accepts it and it works like it is supposed to. How did you
> test this?

Argh! - I am so used to awk operating quietly without any of those (IME
rare) warning-level diagnostics that I sloppily mistook it for an error.
I had some faint memories WRT an error concerning a regexp-constant and
thought this was it. Thanks for the correction, Arnold and Kenny.

>
> In addition, it works correctly with nawk, mawk, and the MKS awk (the
> code for /usr/xpg4/bin/awk from Open Solaris).

Are there also warning messages in those awks? I have to admit that the
intention and usefulness of that specific message is not apparent to me.

Janis

Ed Morton

Oct 4, 2012, 8:25:44 AM
On 10/4/2012 4:32 AM, Janis Papanagnou wrote:
<snip>
> Can you please also check this code on XPG4 awk:
>
> function x (arg)
> {
> if (arg) print "hi"
> return 1
> }
> x(/abc/)
<snip>

$ echo "abc\ndef" | nawk -f tst.awk
hi
abc
def
$ echo "abc\ndef" | /usr/xpg4/bin/awk -f tst.awk
hi
abc
def
$ echo "abc\ndef" | gawk -f tst.awk
gawk: tst.awk:6: warning: regexp constant for parameter #1 yields boolean value
hi
abc
def

Regards,

Ed.

Aharon Robbins

Oct 6, 2012, 4:02:15 PM
Hi Janis.

In article <k4js3n$la8$1...@speranza.aioe.org>,
Janis Papanagnou <janis_pa...@hotmail.com> wrote:
>Am 04.10.2012 12:30, schrieb Aharon Robbins:
>[...]
>>gawk: foo.awk:7: warning: regexp constant for parameter #1 yields boolean value
>
>Are there also warning messages in those awks? I have to admit that the
>intention and usefulness of that specific message is not apparent to me.

With sub/gsub/gensub/match (at least), a regexp constant is passed as-is
to the function to be used as a regular expression. This is possible
since they are built-in functions.

User defined functions are different. In that case, /foo/ is a simple
expression, whose meaning is `$0 ~ /foo/'. This is evaluated at the point
of call and yields either 1 or 0 (a boolean value). Said 1 or 0 is passed
into the user defined function as the value of the parameter, not the
original regexp.

The warning message is intended to remind you of this since it is unlikely
that this is really what you want. I think I "upgraded" this message from
a lint warning to an always-on warning after making this mistake myself
one too many times. (That is generally the case with the always-on
warnings. :-)
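
For anyone who wants to see the rule in action, here is a minimal sketch, assuming a POSIX awk is installed as `awk`. The function name `show` and the sample input are made up for illustration. The match is written out as `$0 ~ /abc/`, which is what a bare `/abc/` argument means, so gawk stays quiet:

```shell
# Assumes a POSIX awk installed as "awk"; show() is a hypothetical helper.
out=$(printf 'abc\ndef\n' | awk '
function show(arg) { return "arg=" arg }   # arg is a number here, never a regexp
{ print $0 ": " show($0 ~ /abc/) }         # the match is evaluated before the call
')
printf '%s\n' "$out"
```

The function only ever sees the 1 or 0 that the match produced; the pattern itself never reaches it.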

HTH,

Janis Papanagnou

Oct 7, 2012, 5:26:54 AM
On 06.10.2012 22:02, Aharon Robbins wrote:
> Hi Janis.
>
> In article <k4js3n$la8$1...@speranza.aioe.org>,
> Janis Papanagnou <janis_pa...@hotmail.com> wrote:
>> Am 04.10.2012 12:30, schrieb Aharon Robbins:
>> [...]
>>> gawk: foo.awk:7: warning: regexp constant for parameter #1 yields boolean value
>>
>> Are there also warning messages in those awks? I have to admit that the
>> intention and usefulness of that specific message is not apparent to me.
>
> With sub/gsub/gensub/match (at least), a regexp constant is passed as-is
> to the function to be used as a regular expression. This is possible
> since they are built-in functions.
>
> User defined functions are different. In that case, /foo/ is a simple
> expression, whose meaning is `$0 ~ /foo/'. This is evaluated at the point
> of call and yields either 1 or 0 (a boolean value). Said 1 or 0 is passed
> into the user defined function as the value of the parameter, not the
> original regexp.

Yes, that is how I would have assumed it to behave. To me it's comparable
to other usages like if(/foo/) or /foo/{action} .

OTOH, { x = /abc/ ; print x } is not considered worth a warning? - Not
for me, but I think assigning a regexp constant to a function parameter
or to a variable should not make any difference WRT a warning message.

>
> The warning message is intended to remind you of this since it is unlikely
> that this is really what you want.

Hmm.. - well, my impulse would be to pass regexp expressions as strings;
as I understand it there are no values in variables and parameters other
than scalar (number/string) and arrays (and maybe some array extensions
in gawk), but specifically no regexp constants. I thought that is quite
clear from the awk language.

> I think I "upgraded" this message from
> a lint warning to an always-on warning after making this mistake myself
> one too many times.

Seriously? It's hard to believe that you made that mistake. :-)

> (That is generally the case with the always-on warnings. :-)

I now understand why you made that visible, but, frankly, I still don't
see how that is more of an issue to always report than, say, an unintended
assignment like

{ if ($1 = 5) print "foo" ; else print "bar" }

IMO, if one is unsure about the depicted issue (or the issue in topic),
or if one gets problems, why not just activate lint?

My vote would go to remove unnecessary warnings like that from being
displayed constantly.

Janis

>
> HTH,
>
> Arnold
>

Aharon Robbins

Oct 7, 2012, 6:48:45 AM
Hi Janis.

In article <k4rhr5$u1f$1...@news.m-online.net>,
Janis Papanagnou <janis_pa...@hotmail.com> wrote:
>> User defined functions are different. In that case, /foo/ is a simple
>> expression, whose meaning is `$0 ~ /foo/'. This is evaluated at the point
>> of call and yields either 1 or 0 (a boolean value). Said 1 or 0 is passed
>> into the user defined function as the value of the parameter, not the
>> original regexp.
>
>Yes, that is how I would have assumed it to behave. To me it's comparable
>to other usages like if(/foo/) or /foo/{action} .

You are an experienced awk user. What's obvious to you is not obvious
to many other people.

>OTOH, { x = /abc/ ; print x } is not considered worth a warning?

It's different, I think, although I will be the first to admit that
gawk does not issue all the warnings that might be useful.

>> I think I "upgraded" this message from
>> a lint warning to an always-on warning after making this mistake myself
>> one too many times.
>
>Seriously? It's hard to believe that you made that mistake. :-)

Seriously.

>I now understand why you made that visible, but, frankly, I still don't
>see how that is more of an issue to always report than, say, an unintended
>assignment like
>
> { if ($1 = 5) print "foo" ; else print "bar" }

It's mostly an implementation issue; in some cases it's easier to add
warnings than in others.

>My vote would go to remove unnecessary warnings like that from being
>displayed constantly.

As James Kirk once said, "I was not aware that this was a democracy."
:-)

Thanks,

Ed Morton

Oct 8, 2012, 12:26:12 PM
Janis Papanagnou <janis_pa...@hotmail.com> wrote:

> On 06.10.2012 22:02, Aharon Robbins wrote:
> > Hi Janis.
> >
> > In article <k4js3n$la8$1...@speranza.aioe.org>,
> > Janis Papanagnou <janis_pa...@hotmail.com> wrote:
> >> Am 04.10.2012 12:30, schrieb Aharon Robbins:
> >> [...]
> >>> gawk: foo.awk:7: warning: regexp constant for parameter #1 yields boolean value
> >>
> >> Are there also warning messages in those awks? I have to admit that the
> >> intention and usefulness of that specific message is not apparent to me.
> >
> > With sub/gsub/gensub/match (at least), a regexp constant is passed as-is
> > to the function to be used as a regular expression. This is possible
> > since they are built-in functions.
> >
> > User defined functions are different. In that case, /foo/ is a simple
> > expression, whose meaning is `$0 ~ /foo/'. This is evaluated at the point
> > of call and yields either 1 or 0 (a boolean value). Said 1 or 0 is passed
> > into the user defined function as the value of the parameter, not the
> > original regexp.
>
> Yes, that is how I would have assumed it to behave. To me it's comparable
> to other usages like if(/foo/) or /foo/{action} .

Given a function like this:

function func(re) { if ($1 ~ re) print }

I'd have expected this:

{ func(/abc/) }

to behave like this (i.e. print $0 if $1 contains "abc"):

{ func("abc") }

rather than like this (i.e. print $0 if $1 contains "1"):

{ func($0~/abc/) }

given the way the built-in functions treat RE arguments so I think a warning is
helpful in this case.

Anyone who doesn't like getting the warning can just re-write the argument
explicitly as in the last case above to get rid of it and that syntax will be
clearer to anyone reading the code too.
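
The difference between the two readings can be sketched as follows, assuming a POSIX awk installed as `awk`. The name `show_if` and the three input records are hypothetical; the function is renamed because gawk treats `func` as a keyword:

```shell
# Assumes a POSIX awk installed as "awk". show_if() is a hypothetical stand-in
# for func(), because gawk treats "func" as a keyword.
input='abc foo
10 abc
0 xyz'

# Pattern passed as a string: prints records whose $1 matches abc.
as_string=$(printf '%s\n' "$input" | awk '
function show_if(re) { if ($1 ~ re) print }
{ show_if("abc") }')

# Match evaluated at the call: show_if() receives 1 or 0 and uses that as the pattern.
as_match=$(printf '%s\n' "$input" | awk '
function show_if(re) { if ($1 ~ re) print }
{ show_if($0 ~ /abc/) }')

printf 'string:\n%s\n' "$as_string"
printf 'match:\n%s\n' "$as_match"
```

With the explicit match, the second and third records are printed because their first field happens to contain the digit that the comparison produced, which is rarely what anyone intends.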

Ed.


Posted using www.webuse.net

Ed Morton

Oct 8, 2012, 1:04:24 PM
Aharon Robbins <arn...@skeeve.com> wrote:

> Hi Janis.
>
> In article <k4rhr5$u1f$1...@news.m-online.net>,
> Janis Papanagnou <janis_pa...@hotmail.com> wrote:
> >> User defined functions are different. In that case, /foo/ is a simple
> >> expression, whose meaning is `$0 ~ /foo/'. This is evaluated at the point
> >> of call and yields either 1 or 0 (a boolean value). Said 1 or 0 is passed
> >> into the user defined function as the value of the parameter, not the
> >> original regexp.
> >
> >Yes, that is how I would have assumed it to behave. To me it's comparable
> >to other usages like if(/foo/) or /foo/{action} .
>
> You are an experienced awk user. What's obvious to you is not obvious
> to many other people.
>
> >OTOH, { x = /abc/ ; print x } is not considered worth a warning?
>
> It's different, I think,

I agree. While a warning in the above case would, I think, also be warranted,
it's a lot less likely someone will make a mistake with code like that as
there's no extremely common precedent in the language that would lead anyone to
think the above means something other than what it does mean.

For the case of /abc/ as a function argument however, we have some of the most
frequently used built-in functions treat it one way and user-defined functions
treat it totally differently:

*sub(/abc/) = pass "abc" to the function as an RE
match(/abc/) = pass "abc" to the function as an RE
split(/abc/) = pass "abc" to the function as an RE
func(/abc/) = compare "abc" to "$0" and pass the result to the function as
an integer value

Clearly that last one having completely different semantics to the others is
non-intuitive and easily open to mis-interpretation.
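
For comparison, a small sketch of the built-in side of that list, assuming a POSIX awk installed as `awk`; the input string and variable names are only illustrative. Each built-in receives the regexp literal as a pattern, not as the result of a match against $0:

```shell
# Assumes a POSIX awk installed as "awk"; input and variable names are illustrative.
out=$(echo 'a1b22c' | awk '{
    n = gsub(/[0-9]+/, "#")          # literal used as a pattern; $0 becomes a#b#c
    print n, $0
    print match($0, /b/)             # position of the first match
    m = split($0, parts, /#/)        # literal used as the separator
    print m, parts[2]
}')
printf '%s\n' "$out"
```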

Ed.

Kenny McCormack

Oct 8, 2012, 1:06:21 PM
In article <201210081...@webuse.net>,
Ed Morton <morto...@gmail.com> wrote:
...
>Anyone who doesn't like getting the warning can just re-write the argument
>explicitly as in the last case above to get rid of it and that syntax will be
>clearer to anyone reading the code too.

All of which suggests that this all qualifies as a "bug". A bug of the "If
I had it to do over again (from scratch), I'd have done it differently."
variety.

That said:
1) I agree with Ed - that, given where we are today - and given that we
aren't going to do what I suggest below - that the best solution is to
observe that the warning is there and can't be turned off, so the best
workaround is to code it (the pattern match) explicitly. Which, as Ed says,
both makes the warning go away and makes your code a little clearer.

2) When I see things like this (bugs of the "If I had it to do over
again..." variety), I often think (as I do in this case) that the problem
could be fixed with a "pragma". That is, you could have something like
"pragma newregexpbehavior" at the top of the file (program) and then it
would behave like Ed "expected" it to - i.e., that reg exp parameters get
passed as strings (*).

3) Or you could just do it the way TAWK does (making reg exps true
"first class objects").

(*) Of course, one might object to this and say that if you want to pass a
string, you should just pass a string. Which is true, of course...

--
The motto of the GOP "base": You can't be a billionaire, but at least you
can vote like one.

Janis Papanagnou

Oct 8, 2012, 2:53:45 PM
Hi Arnold

On 07.10.2012 12:48, Aharon Robbins wrote:
> Hi Janis.
>
> In article <k4rhr5$u1f$1...@news.m-online.net>,
> Janis Papanagnou <janis_pa...@hotmail.com> wrote:
[...]
[ about f(/abc/) semantics ]
>
>> OTOH, { x = /abc/ ; print x } is not considered worth a warning?
>
> It's different, I think, although I will be the first to admit that
> gawk does not issue all the warnings that might be useful.

I suppose you are right that gawk might not issue all warnings that are
potentially useful. OTOH, as mentioned, I am grateful for any missing
unnecessary warning. :-) And certainly, what's useful or not is in the
eye of the beholder (so I will abstain from judging about usefulness).

Where I disagree is that, from a semantic point of view, the assignment
case is substantially different from the function argument passing here;
I think it isn't (and I have a strong opinion on that - but anyway, who
cares; see "democracy" below ;-).

For a scientific explanation I want to refer to the observation[*] that
program variables can be considered as degenerated parameters of [(in
the general case) recursive] functions; as in

function R (m) { if (cond(m)) then return R(f(m)) else return g(m) }

degenerated to

x = m # initialisation

x = f(x) # modification, placed in a loop that depends on cond(m)

return g(x)

where cond(m) can be any condition depending on m, including m itself
(i.e. the identity).

(That said; I'm sure the semantics in question could be easier explained
than by the above derivation, but it's at least included herein.)
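
The derivation above can be sketched in awk itself, assuming a POSIX awk installed as `awk`. The bodies chosen for cond(), f() and g() are hypothetical; any functions of one argument would do:

```shell
# Assumes a POSIX awk installed as "awk"; cond(), f() and g() are made-up examples.
out=$(awk '
function cond(m) { return m % 2 == 0 }      # hypothetical condition: m is even
function f(m)    { return m / 2 }           # hypothetical step
function g(m)    { return m }               # hypothetical final mapping
function R(m)    { if (cond(m)) return R(f(m)); else return g(m) }
function loop(m,   x) {
    x = m                                   # initialisation
    while (cond(x)) x = f(x)                # modification, repeated while cond holds
    return g(x)
}
BEGIN { print R(48), loop(48) }')
printf '%s\n' "$out"
```

The parameter m of the recursive form and the variable x of the loop play the same role, which is the point of the argument.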

[*] See Bauer, Wössner: "Algorithmische Sprache und Programmentwicklung"
[1984], Ch.5 - this is written in German, though, but you can find that
also in Partsch: "Specification and Transformation of Programs" [1990],
Ch.7.

> [...]
>
> It's mostly an implementation issue; in some cases it's easier to add
> warnings than in others.

Aha, I see.

>
>> My vote would go to remove unnecessary warnings like that from being
>> displayed constantly.
>
> As James Kirk once said, "I was not aware that this was a democracy."
> :-)

Ah, well - obviously it isn't... :-)

(BTW, from which film or episode is that quote? - I cannot find it.)

>
> Thanks,
>
> Arnold
>

Thanks as well,

Janis

Janis Papanagnou

Oct 8, 2012, 3:13:34 PM
On 08.10.2012 19:04, Ed Morton wrote:
> [...]
>
> For the case of /abc/ as a function argument however, we have some of the most
> frequently used built-in functions treat it one way and user-defined functions
> treat it totally differently:
>
> *sub(/abc/) = pass "abc" to the function as an RE
> match(/abc/) = pass "abc" to the function as an RE
> split(/abc/) = pass "abc" to the function as an RE
> func(/abc/) = compare "abc" to "$0" and pass the result to the function as
> an integer value
>
> Clearly that last one having completely different semantics to the others is
> non-intuitive and easily open to mis-interpretation.

While I see where you are coming from I think that this comparison is not
appropriate; awk's built-in functions seem to be different from user defined
functions in some respects, implementation-wise (as we hear here from time
to time, with effects on behaviour of some gawk features), syntactically
(no space between name and parenthesis), and, yes, also WRT the /.../ regexp
constants that they uniquely seem to allow. That special behaviour must be
known.

As mentioned upthread other standard contexts of occurrences of /.../ have
$0~/.../ semantics. So it depends, what you trade for the one or the other
semantics decision.

Janis

>
> Ed.
>
> Posted using www.webuse.net
>

Janis Papanagnou

Oct 8, 2012, 3:23:19 PM
On 08.10.2012 18:26, Ed Morton wrote:
> Janis Papanagnou <janis_pa...@hotmail.com> wrote:
[...]
>>
>> Yes, that is how I would have assumed it to behave. To me it's comparable
>> to other usages like if(/foo/) or /foo/{action} .
>
> Given a function like this:
>
> function func(re) { if ($1 ~ re) print }

Now, I think, you're arguing from the back end of the horse.

>
> I'd have expected this:
>
> { func(/abc/) }
>
> to behave like [...]

We must define what /abc/ at the position of a function argument
means, no more, no less, and this semantics should be independent
of the implementation of the function.

Anyway... I think I see where you're coming from and that's okay.
I accept your point of view.

Janis

Ed Morton

Oct 8, 2012, 6:01:29 PM
Janis Papanagnou <janis_pa...@hotmail.com> wrote:
> On 07.10.2012 12:48, Aharon Robbins wrote:
<snip>
> > As James Kirk once said, "I was not aware that this was a democracy."
<snip>
> (BTW, from which film or episode is that quote? - I cannot find it.)

I think it's a misquote from The Corbomite Maneuver (1966):

Lieutenant Dave Bailey: Sir, we gonna just let it hold us here? We got phaser
weapons. I vote we blast it.
Capt. Kirk: I'll keep that in mind, Mr. Bailey, when this becomes a democracy.

Regards,

Ed (still sitting at the nerd table....)

Aharon Robbins

Oct 9, 2012, 11:24:36 AM
In article <201210082...@webuse.net>,
Indeed, Ed, I think your nerd points outnumber mine. I knew the quote
was approximate when I wrote it, but that is indeed the episode I was
referring to. I only remembered that it was one of the earlier episodes.

Aharon Robbins

Oct 9, 2012, 11:31:06 AM
Hi Janis.

In article <k4v7e0$shp$1...@news.m-online.net>,
Janis Papanagnou <janis_pa...@hotmail.com> wrote:
>Where I disagree is that, from a semantic point of view, the assignment
>case is substantially different from the function argument passing here;
>I think it isn't (and I have a strong opinion on that - but anyway, who
>cares; see "democracy" below ;-).

We are in agreement that semantically it is NOT different. What I meant
by "it's different" is that `x = /abc/' is (more) obvious to someone reading
the code as meaning `x = ($0 ~ /abc/)', whereas it is less obvious to
someone looking at code

foo(/abc/)

that it means

foo($0 ~ /abc/)

And thus a warning is more justified in the latter case than in the former.

>>> My vote would go to remove unnecessary warnings like that from being
>>> displayed constantly.
>>
>> As James Kirk once said, "I was not aware that this was a democracy."
>> :-)
>
>Ah, well - obviously it isn't... :-)
>
>(BTW, from which film or episode is that quote? - I cannot find it.)

See Ed's post for the exact quote. I knew I was only quoting things
approximately.

Anton Treuenfels

Oct 9, 2012, 10:41:06 PM

"Kenny McCormack" <gaz...@shell.xmission.com> wrote in message
news:k4v16d$7is$1...@news.xmission.com...
> In article <201210081...@webuse.net>,
> Ed Morton <morto...@gmail.com> wrote:

> 3) Or you could just do it the way TAWK does (making reg exps true
> "first class objects").

Oh, I am SO spoiled..you mean in most AWK variants they're not?

To me the "natural" interpretation of:

x = /abc/

is to assign the RE '/abc/' to the variable 'x.' Then I can use the variable
'x' wherever I can use the literal '/abc/'. The concept that in non-TAWK
AWKs it does (and should) mean:

x = $0 ~ /abc/

is frankly a bit unsettling to me. The ability of the same AWK variable to
be a number or a string depending on context (and what's been assigned most
recently) is uncontroversial. Why then can't they be an RE?

Similarly, passing a literal RE to a function as an RE doesn't seem
unreasonable, but instead consistent. No special cases to worry about. An
AWK parser can, I think, in general only verify that the argument(s) of
a user-defined function are valid expressions. Their types can't be checked
at parse time because (for one thing) there's no way to tell from the formal
parameter list what types are actually expected. So...if REs could be
assigned to variables the same principle would let them be used as function
arguments.

It's probably long past the time this could be implemented, though. Too many
programs would break. It does make the thought of porting some of my TAWK
programs to other AWKs more daunting than I realized.

I suppose the implied comparison '$0 ~ /RE/' is reasonable if variables can
only be numbers or strings. I shall have to try to remember this limitation
in any samples I post, though...let me know if I slip!

- Anton Treuenfels

Ed Morton

Oct 9, 2012, 10:56:10 PM
On 10/9/2012 9:41 PM, Anton Treuenfels wrote:
<snip>
> To me the "natural" interpretation of:
>
> x = /abc/
>
> is to assign the RE '/abc/' to the variable 'x.' Then I can use the variable 'x'
> wherever I can use the literal '/abc/'.

In every situation I can think of off the top of my head (but I admittedly
haven't spent a lot of time thinking about it), doing that would make your code
unnecessarily cryptic. Could you give an example of when it would be desirable?

Ed.


Kaz Kylheku

Oct 10, 2012, 1:03:10 AM
For example, you can put some logic into a function, and pass a first-class
regex as an argument:

new_array = grep(array_of_strings, /regex/) # nothing cryptic here

/regex/ is not being assigned to a variable here, but of course the formal
parameter of a function is a kind of variable.
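
In awks without first-class regexps, roughly the same thing can be had by passing the pattern as a string. A minimal sketch, assuming a POSIX awk installed as `awk`: this grep() is a hypothetical helper, and since awk functions cannot return arrays it fills a result array and returns the count instead:

```shell
# Assumes a POSIX awk installed as "awk"; grep() here is a hypothetical helper.
out=$(awk '
# grep(): copy the elements of src[1..n] that match re into dst; return the count.
function grep(src, n, re, dst,   i, k) {
    k = 0
    for (i = 1; i <= n; i++)
        if (src[i] ~ re) dst[++k] = src[i]
    return k
}
BEGIN {
    n = split("apple banana cherry avocado", words, " ")
    split("", hits)                          # make hits an empty array
    k = grep(words, n, "^a", hits)           # pattern passed as a string
    for (i = 1; i <= k; i++) print hits[i]
}')
printf '%s\n' "$out"
```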

Ed Morton

Oct 10, 2012, 9:31:00 AM
I get that usage, but I was asking specifically about being able to just assign
/abc/ to a variable as in:

x = /abc/

Regards,

Ed.

Kenny McCormack

Oct 10, 2012, 9:54:44 AM
In article <k53tak$20b$1...@dont-email.me>,
Ed Morton <morto...@gmail.com> wrote:
...
>I get that usage, but I was asking specifically about being able to just
>assign /abc/ to a variable as in:
>
> x = /abc/

Let's be honest. The only reason your mind sees the above as:

x = $0 ~ /abc/

is because that's the way most AWKs work and (most importantly in many
people's eyes) that's what the "standard" says it should mean. Your whole
"clarity" argument boils down to this.

Clearly, the way TAWK does it is a) non-standard and b) superior.
But it is certainly true that in many people's minds, the first trumps the
latter.

If you could unwind your mind and see it with fresh eyes, I think you'd
agree that:

x = /abc/

should mean: assign the reg exp to x (and make x be of type
regular_expression), which is exactly what TAWK does.

Now, one caveat. This *does* create an asymmetry between expressions in the
"pattern space" vs. expressions in actions (inside {}s). To be entirely
consistent, TAWK should cease to recognize:

/foo/ {print}

and instead require you to write:

$0 ~ /foo/ {print}

But that *would* be going too far (would break too much existing code).

P.S. I think you may have raised, somewhat indirectly, the issue of
"usefulness". I.e., is any of this useful? Why would anyone want to assign
a reg exp to a variable? I can't give a full answer to this, but I'll say
that in my time working with TAWK, I've used it very rarely, but my memory
is that the times that I have, it's been quite useful (a real saver). I
wish I could give a concrete example.

One thing, though, is that (as we've been discussing) it allows you to write
user functions with the same argument passing semantics as (some of) the
built-ins.

--
Modern Christian: Someone who can take time out from
complaining about "welfare mothers popping out babies we
have to feed" to complain about welfare mothers getting
abortions that PREVENT more babies to be raised at public
expense.

Ed Morton

Oct 10, 2012, 10:23:40 AM
On 10/10/2012 8:54 AM, Kenny McCormack wrote:
> In article <k53tak$20b$1...@dont-email.me>,
> Ed Morton <morto...@gmail.com> wrote:
> ...
>> I get that usage, but I was asking specifically about being able to just
>> assign /abc/ to a variable as in:
>>
>> x = /abc/
>
> Let's be honest. The only reason your mind sees the above as:
>
> x = $0 ~ /abc/
>
> is because that's the way most AWKs work and (most importantly in many
> people's eyes) that's what the "standard" says it should mean.

Well, if we're being honest - if I saw that construct I wouldn't have any
immediate opinion of what it means. I'd have to think about it for a bit and try
a test or 2 to figure out what it does. After that if it proved to be the
equivalent of:

x = ($0 ~ /abc/) # store the result of comparing /abc/ to $0

then I'd think that was understandable and made sense given the other contexts
in which we use /abc/ to mean ($0 ~ /abc/), and if it proved to be the
equivalent of:

x = /abc/ # store the RE /abc/

then I'd be very surprised given that AFAIK non-array awk variables can only be
strings or numbers and it'd make me start re-thinking what values each variable
might have in the rest of the program.

<snip>
> One thing, though, is that (as we've been discussing) it allows you to write
> user functions with the same argument passing semantics as (some of) the
> built-ins.

Yes, as I mentioned in the previous 2 posts I understand the usefulness of that,
I'm just trying to understand when you'd really want to store an RE in a
variable as above.

Ed.


Kaz Kylheku

Oct 10, 2012, 12:59:22 PM
It wouldn't make sense to rule out this usage, while allowing it to be
passed to functions.

One use for assignment might be to set up a configuration area in the program
where it is customized.

What if the regex is to occur in multiple places? If it's not in a variable,
it has to be repeated.

I mean, how about the fact that you can assign a field or record separating regex
on the command line? That's a sort of variable assignment!

Aharon Robbins

Oct 10, 2012, 2:04:27 PM
In article <201210100...@kylheku.com>,
Kaz Kylheku <k...@kylheku.com> wrote:
>One use for assignment might be to set up a configuration area in the program
>where it is customized.
>
>What if the regex is to occur in multiple places? If it's not in a variable,
>it has to be repeated.

This is why, since ~ 1988, awk has allowed the use of string variables
(and constants) where regexp constants are typically used. Thus

BEGIN { x = "foo{3,5}(bar)+" }
...
$0 ~ x { ... }
{ ... ; if (match(a, x)) ... }

It's not a 1-for-1 swap, and there are some caveats to be aware of
when doing this, documented in the gawk manual, but this kind of thing
is possible.

Just for the record, while I agree it would have been nice to have /regex/
as first class objects, it wasn't done, and it's too late now.
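
A runnable sketch of that approach, assuming a POSIX awk installed as `awk`; the variable name `ver` and the input are hypothetical. It also shows one of the caveats: inside a string the backslash has to be doubled to get a literal dot:

```shell
# Assumes a POSIX awk installed as "awk"; ver and the input lines are illustrative.
out=$(printf 'release v1.2 ok\nrelease v1x2 no\n' | awk '
BEGIN { ver = "v[0-9]+\\.[0-9]+" }     # in a string, a literal dot needs \\. (as a regexp constant: \.)
$0 ~ ver { hits++ }                     # the string is used as a pattern here...
{ if (match($0, ver)) print RSTART, RLENGTH }   # ...and again here
END { print hits + 0 }')
printf '%s\n' "$out"
```

The pattern is defined once and used in two places; only the first input line matches.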

Ed Morton

Oct 10, 2012, 2:24:36 PM
Aharon Robbins <arn...@skeeve.com> wrote:

> In article <201210100...@kylheku.com>,
> Kaz Kylheku <k...@kylheku.com> wrote:
> >One use for assignment might be to set up a configuration area in the program
> >where it is customized.
> >
> >What if the regex is to occur in multiple places? If it's not in a variable,
> >it has to be repeated.
>
> This is why, since ~ 1988, awk has allowed the use of string variables
> (and constants) where regexp constants are typically used. Thus
>
> BEGIN { x = "foo{3,5}(bar)+" }
> ...
> $0 ~ x { ... }
> { ... ; if (match(a, x)) ... }
>
> It's not a 1-for-1 swap, and there are some caveats to be aware of
> when doing this, documented in the gawk manual, but this kind of thing
> is possible.

Right and it works well and lets you build up REs from arguments or input file
contents or whatever by using normal string operations like concatenation. Now,
let's imagine we also allowed

x = /abc/

Can you build that up from variables or input? I mean with the way things are
now I can create a variable to hold a string to use in an RE context by doing
this (assume $1 contains "ab"):

x = $1 "c"
$0 ~ x

Could I alternatively do this:

y = $1 /c/
$0 ~ y

Does that mean there's some nameless operator for "RE concatenation" that
converts strings or numbers to REs just like string concatenation? Could I then
do this to create an RE from a string:

x = "abc" //

?

Is there some other way to create one of these REs from a [input] string? How do
I know the intent of the above wasn't to do:

y = $1 ($0 ~ /c/)

I just find it all confusing and unnecessary.
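
The string-building approach that works today can be sketched like this, assuming a POSIX awk installed as `awk`; the variable `prefix` stands in for the "ab" that $1 was assumed to hold, and the anchors are added only to make the result easy to check:

```shell
# Assumes a POSIX awk installed as "awk"; prefix and the input are illustrative.
out=$(printf 'abc\nxabc\nabcd\n' | awk -v prefix=ab '
BEGIN { re = "^" prefix "c$" }         # ordinary string concatenation builds the pattern
$0 ~ re')
printf '%s\n' "$out"
```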

Ed.

Anton Treuenfels

Oct 10, 2012, 7:19:31 PM

"Ed Morton" <morto...@gmail.com> wrote in message
news:k52o4b$u52$1...@dont-email.me...

> In every situation I can think of off the top of my head (but I admittedly
> haven't spent a lot of time thinking about it), doing that would make your
> code unnecessarily cryptic. Could you give an example of when it would be
> desirable?

As part of an assembler I wrote in TAWK I did this for defining what label
variants should look like:

======================================================

# source language: Thompson AWK 4.0

# first created: 03/08/03
# last revision: 06/09/12

# public function prefix: "SYM"

# ----------------------------

# public constants

# for ease of consistent definition, all symbol/label patterns
# are defined here (even if not used in this file)

# base label form (default; most others add prefixes/suffixes to this)

global SYMglobal = /^[_A-Z](\.?[_A-Z0-9])*$/i

# label types differ based on their initial and final characters

global SYMlocLabel = /^@/ # local label
local varLabel = /^]/ # variable label
local glbLabel = /^[_A-Z]/ # global label
local autLabel = /^[bfl]/ # auto label (internal form)

# "user label" - all the forms available to the user (in label field)

local userLabel = /^[]@]?[_A-Z](\.?[_A-Z0-9])*\$?:?$/i

# branch target "auto-labels"
# - automatically replaced by assembler-generated labels

# branch target (in label field)

local branchLabel = /^-\+?$|^\+-?$/

# branch target references (in expression field)

local fwdtargetRef = /\++/
local baktargetRef = /-+/

# user labels recognized in expressions
# suffixes:
# - b = branch target (decorated)
# - g = global
# - l = local
# - Ub = branch target (undecorated = complete expression)
# - v = variable

global SYMnumLabel_glv = /^[]@]?[_A-Z](\.?[_A-Z0-9])*\:?/i
global SYMnumLabel_b = /^:(\++|-+)/
global SYMnumLabel_Ub = /^(\++|-+)$/

global SYMstrLabel_glv = /^[]@]?_?[A-Z]([_\.]?[A-Z0-9])*_?\$:?/i

# macro text formal argument name pattern
# - replaced by the text of its actual argument during macro expansion
# suffix:
# - e = embedded match (within longer string)

global SYMtxtArg = /^\?[_A-Z](\.?[_A-Z0-9])*$/i
global SYMtxtArg_e = /\?[_A-Z](\.?[_A-Z0-9])*/i

# macro label formal argument name
#- assigned the value of its actual argument during macro expansion

global SYMlblArg = /^[]@][_A-Z](\.?[_A-Z0-9])*\$?$/i

=================================================

I can - and HAVE, several times - changed the REs defined here. Usually the
purpose was to make the patterns less restrictive, eliminating needless
limitations. The point is I do that once here, and the effect ripples out to
wherever these variables are used. I don't have to worry about missing any
use of an RE literal in some other file.

In point of fact, these are only the definitions in the current released
version of the assembler. The development version has slightly different
definitions yet again.
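
For readers on other awks, a rough portable counterpart of the same "define once" idea keeps the patterns in string variables. This is only a sketch, assuming a POSIX awk installed as `awk`: the two variable names are borrowed from the listing above, the classification logic and input are made up, and TAWK's /i flag is replaced by explicit character ranges:

```shell
# Assumes a POSIX awk installed as "awk"; the classification below is illustrative.
out=$(printf '@loop\nStart.1\n9bad\n' | awk '
BEGIN {
    # String equivalents of two of the patterns above; the TAWK /.../i flag is
    # replaced by explicit upper- and lower-case ranges.
    SYMglobal   = "^[_A-Za-z](\\.?[_A-Za-z0-9])*$"
    SYMlocLabel = "^@"
}
{
    if ($1 ~ SYMlocLabel)    kind = "local"
    else if ($1 ~ SYMglobal) kind = "global"
    else                     kind = "other"
    print $1, kind
}')
printf '%s\n' "$out"
```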

And from a different source file of that assembler:

==========================================================

# field separators

global CKfieldSep = "," # default (could be changed, though not so far)
local fsStopChar

# character, string and regular expression literal operand patterns
# - basic pattern is delimiter, (1) any single char except delimiter or
# backslash or (2) backslash plus next char as a pair, delimiter

global CKcharLitToken = /^'([^'\\]|\\.)+'/
global CKstrLitToken = /^"([^"\\]|\\.)*"/
global CKregexLitToken = /^\/([^/\\]|\\.)+\/i?/

# char and string literal escape codes

local allEscPat = /\\(.|(\$|0?X)[0-9A-F]+|([0-9]|0[A-F])[0-9A-F]*H)/i

{ ....and a bit later.... }

INIT {

    # field split scan stop characters

    fsStopChar[ "," ] = /[,"'\\\(\/]/
    fsStopChar[ "=" ] = /[="'\\]/
    fsStopChar[ ":" ] = /[:"'\\]/

}

{ ..okay, even I'm not daring enough to try to statically initialize an
array element :) ...}

{...and a bit later the above are used in this function (note especially the
use of 'stopch' to hold different REs): }

# split string into fields
# - but must not split where separator is escaped or within literal
# - or, if separator is ",", within possible function call

global function CKsplitfield(str, result, fsep) {

    local i, j
    local ch
    local stopch, inparen
    local fcount, fval
    local field

    # no split char -> no split processing
    # - still, we make sure result is definitely a string ('0' is ambiguous)

    if ( !index(str, fsep) ) {
        result[ 1 ] = str ""
        return( 1 )
    }

    # check if split necessary

    i = j = 1
    inparen = fcount = 0
    stopch = fsStopChar[ fsep ]
    while ( match(str, stopch, j) ) {

        # assume we're going to restart just after this stop char

        j = RSTART + 1

        # field separator ?

        if ( (ch = substr(str, RSTART, 1)) == fsep ) {
            field[ ++fcount ] = substr( str, i, RSTART - i )
            i = j
            continue
        }

        # open parenthesis ?
        # - removes field split char from stops, adds close parenthesis

        if ( ch == "(" ) {
            inparen++
            stopch = /["'\\\(\/\)]/
            continue
        }

        # close parenthesis ?

        if ( ch == ")" ) {
            if ( --inparen == 0 )
                stopch = fsStopChar[ fsep ]
            continue
        }

        # must be double quote, single quote, slash or escape char
        # - if we can match a pattern, skip to its end

        if ( ch == "\"" )
            match( str, CKstrLitToken, RSTART )
        else if ( ch == "'" )
            match( str, CKcharLitToken, RSTART )
        else if ( ch == "/" )
            match( str, CKregexLitToken, RSTART )
        else
            match( str, allEscPat, RSTART )

        # if found match to pattern, restart after it
        # - if no match there's a good chance it's an error,
        #   but we'll restart just after the stop char anyway

        if ( RSTART )
            j = RSTART + RLENGTH
    }

    # unconditionally take everything left
    # - if i == 1 here, then any split char was escaped or within literal
    # - does this happen often enough to be worth checking for ?
    # - if i > length(str), then the last char in str must have been
    #   a field separator
    # - which is an error we want to catch anyway !

    field[ ++fcount ] = substr( str, i )

    # eliminate leading and trailing whitespace and save result

    i = fcount
    do {
        fval = field[ i ]
        if ( !match(fval, /[^ \t]/) ) {
            UMerror( "BadField", "#" i "> <" str )
            fcount = 0
        }
        else {
            if ( RSTART > 1 )
                fval = substr( fval, RSTART )
            if ( match(fval, /[ \t]+$/) )
                fval = substr( fval, 1, RSTART-1 )
            result[ i ] = fval
        }
    } while ( --i )

    return( fcount )
}

========================================

So there's a couple of examples. HTH.

- Anton Treuenfels



Ed Morton

Oct 10, 2012, 11:23:42 PM
On 10/10/2012 6:19 PM, Anton Treuenfels wrote:
>
> "Ed Morton" <morto...@gmail.com> wrote in message
> news:k52o4b$u52$1...@dont-email.me...
>
>> In every situation I can think of off the top of my head (but I admittedly
>> haven't spent a lot of time thinking about it), doing that would make your
>> code unnecessarily cryptic. Could you give an example of when it would be
>> desirable?
>
> As part of an assembler I wrote in TAWK I did this for defining what label
> variants should look like:
<snip>
> global SYMglobal = /^[_A-Z](\.?[_A-Z0-9])*$/
<snip>
> So there's a couple of examples. HTH.

I think so. It looks like you're using them the way you'd otherwise use strings
to contain the REs, but you just don't need to escape backslashes like you would
in a string used in an RE context. Is that right?

Given the above definition and this statement:

SYMglobal { print "found" }

would it mean "print found if the variable SYMglobal is populated" or "print
found if $0 matches the RE contained in SYMglobal"?

If SYMglobal was a string populated as the equivalent:

SYMglobal = "^[_A-Z](\\.?[_A-Z0-9])*$"

it'd be the former and you'd need to write:

$0 ~ SYMglobal { print "found" }

to get the latter behavior so I'm wondering....

Also, can a variable change its type to/from an RE and if so what does that
syntax look like?

Ed.

Anton Treuenfels

Oct 11, 2012, 10:58:28 PM

"Ed Morton" <morto...@gmail.com> wrote in message
news:k55e3v$d43$1...@dont-email.me...
> On 10/10/2012 6:19 PM, Anton Treuenfels wrote:
>>
>> "Ed Morton" <morto...@gmail.com> wrote in message
>> news:k52o4b$u52$1...@dont-email.me...
>>
>>> In every situation I can think of off the top of my head (but I
>>> admittedly
>>> haven't spent a lot of time thinking about it), doing that would make
>>> your
>>> code unnecessarily cryptic. Could you give an example of when it would
>>> be
>>> desirable?
>>
>> As part of an assembler I wrote in TAWK I did this for defining what
>> label
>> variants should look like:
> <snip>
>> global SYMglobal = /^[_A-Z](\.?[_A-Z0-9])*$/
> <snip>
>> So there's a couple of examples. HTH.
>
> I think so. It looks like you're using them the way you'd otherwise use
> strings to contain the REs, but you just don't need to escape backslashes
> like you would in a string used in an RE context. Is that right?

Not quite. The point is 'SYMglobal' is used in multiple places. As has
often been pointed out, the meaning of a literal of any type is not always
obvious from context. Naming a literal can help (1) make its meaning clearer
in context and (2) make it easier to change the underlying value by making
it harder to miss a use of it. If TAWK offered the equivalent of a
"constant" declaration I'd use that for these particular definitions, but as
it doesn't I use variables.

> Given the above definition and this statement:
>
> SYMglobal { print "found" }
>
> would it mean "print found if the variable SYMglobal is populated" or
> "print found if $0 matches the RE contained in SYMglobal"?
>
> If SYMglobal was a string populated as the equivalent:
>
> SYMglobal = "^[_A-Z](\\.?[_A-Z0-9])*$"
>
> it'd be the former and you'd need to write:
>
> $0 ~ SYMglobal { print "found" }
>
> to get the latter behavior so I'm wondering....

Seems correct as far as I can tell.

> Also, can a variable change its type to/from an RE and if so what does
> that syntax look like?

A TAWK variable can hold any type at any time by simple assignment. Generally
the nastiest thing I do is use the same variable in the same function as
both a scalar and an array, depending on branches. In arrays, individual
elements of one array can be either scalars or arrays, depending on need.
Functions do not have to return the same type from each and every possible
exit point, and whatever the result types, they can all be assigned to the
same variable when the function returns.

My mental picture of TAWK variables has always been a 'C' structure
consisting of an integer type identifier and a union that can hold either a
value or a pointer to a value, depending on the current type. The idea of
them as "all the same, yet different" makes it easy to accept that
user-defined function arguments can have multiple types - not just each
formal argument, but even the same formal argument at different times (just
try that in a strongly-typed language!) - and how one function can return
multiple types (ditto!).

- Anton Treuenfels


Ed Morton

Oct 12, 2012, 9:37:50 AM
On 10/11/2012 9:58 PM, Anton Treuenfels wrote:
>
> "Ed Morton" <morto...@gmail.com> wrote in message
> news:k55e3v$d43$1...@dont-email.me...
>> On 10/10/2012 6:19 PM, Anton Treuenfels wrote:
>>>
>>> "Ed Morton" <morto...@gmail.com> wrote in message
>>> news:k52o4b$u52$1...@dont-email.me...
>>>
>>>> In every situation I can think of off the top of my head (but I admittedly
>>>> haven't spent a lot of time thinking about it), doing that would make your
>>>> code unnecessarily cryptic. Could you give an example of when it would be
>>>> desirable?
>>>
>>> As part of an assembler I wrote in TAWK I did this for defining what label
>>> variants should look like:
>> <snip>
>>> global SYMglobal = /^[_A-Z](\.?[_A-Z0-9])*$/
>> <snip>
>>> So there's a couple of examples. HTH.
>>
>> I think so. It looks like you're using them the way you'd otherwise use
>> strings to contain the REs, but you just don't need to escape backslashes like
>> you would in a string used in an RE context. Is that right?
>
> Not quite. The point is 'SYMglobal' is used in multiple places.

Right but so could it be if the variable held a string representing an RE,
unless I'm missing something about your statement.

> As has
> often been pointed out, the meaning of a literal of any type is not always
> obvious from context. Naming a literal can help (1) make its meaning clearer in
> context and (2) make it easier to change the underlying value by making it
> harder to miss a use of it. If TAWK offered the equivalent of a "constant"
> declaration I'd use that for these particular definitions, but as it doesn't I
> use variables.
>
>> Given the above definition and this statement:
>>
>> SYMglobal { print "found" }
>>
>> would it mean "print found if the variable SYMglobal is populated" or "print
>> found if $0 matches the RE contained in SYMglobal"?

You didn't answer the above question.

>>
>> If SYMglobal was a string populated as the equivalent:
>>
>> SYMglobal = "^[_A-Z](\\.?[_A-Z0-9])*$"
>>
>> it'd be the former and you'd need to write:
>>
>> $0 ~ SYMglobal { print "found" }
>>
>> to get the latter behavior so I'm wondering....
>
> Seems correct as far as I can tell.
>
>> Also, can a variable change its type to/from an RE and if so what does that
>> syntax look like?
>
> A TAWK variable can hold any type at any time by simple assignment.

OK, I'm not being clear. I can do this with a variable holding an RE:

$ cat file
abc
def
abcxyz here
ghi
$ awk -v var="abc" ' $0 ~ var "xyz" ' file
abcxyz here

What does something like that (i.e. compose an RE from parts) look like if the
variable ("var" above) holds an RE constant instead of a string? You can set var
in a BEGIN statement if you can't set it outside of awk.

Ed.

Anton Treuenfels

Oct 12, 2012, 6:41:57 PM

"Ed Morton" <morto...@gmail.com> wrote in message
news:k596ff$peb$1...@dont-email.me...
Or perhaps something about TAWK. As a compiler rather than an interpreter, a
literal RE is transformed into internal form just once at compile time. REs
in string form work as expected, but must wait until run time to be
transformed, where there's a chance the same RE will be unnecessarily
transformed multiple times.

TAWK also offers a regex() function to transform a string RE into internal
form just to avoid that possibility. I suppose it would work just as well on
a command-line variable assignment of a string representing an RE, though
I've never done it myself.
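
To make the literal-versus-string difference concrete, here is a minimal sketch in portable awk run from a POSIX shell. It does not use TAWK or its regex() function; the sample labels and the names `labels`, `sym`, `const_out` and `dyn_out` are made up for illustration. It only shows that the string form needs its backslash doubled to describe the same pattern.

```shell
# Sample label candidates (hypothetical data, not from the assembler).
labels='FOO.BAR
9lives
_tmp.1'

# Regex constant: the backslash is written once.
const_out=$(printf '%s\n' "$labels" | awk '/^[_A-Z](\.?[_A-Z0-9])*$/ { print }')

# Dynamic regex held in a string: the string literal is unescaped first,
# so "\\." is needed for the regex engine to see "\.".
dyn_out=$(printf '%s\n' "$labels" | awk '
    BEGIN { sym = "^[_A-Z](\\.?[_A-Z0-9])*$" }
    $0 ~ sym { print }')

printf 'constant: %s\n' "$const_out"
printf 'dynamic:  %s\n' "$dyn_out"
```

Both forms should select the same line here; whether an implementation recompiles the string form on every use is an implementation detail this sketch does not measure.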

>> As has
>> often been pointed out, the meaning of a literal of any type is not
>> always
>> obvious from context. Naming a literal can help (1) make its meaning
>> clearer in
>> context and (2) make it easier to change the underlying value by making
>> it
>> harder to miss a use of it. If TAWK offered the equivalent of a
>> "constant"
>> declaration I'd use that for these particular definitions, but as it
>> doesn't I
>> use variables.
>>
>>> Given the above definition and this statement:
>>>
>>> SYMglobal { print "found" }
>>>
>>> would it mean "print found if the variable SYMglobal is populated" or
>>> "print
>>> found if $0 matches the RE contained in SYMglobal"?
>
> You didn't answer the above question.

Thought I did below (meaning the whole block, including the above).

>>>
>>> If SYMglobal was a string populated as the equivalent:
>>>
>>> SYMglobal = "^[_A-Z](\\.?[_A-Z0-9])*$"
>>>
>>> it'd be the former and you'd need to write:
>>>
>>> $0 ~ SYMglobal { print "found" }
>>>
>>> to get the latter behavior so I'm wondering....
>>
>> Seems correct as far as I can tell.
>>
>>> Also, can a variable change its type to/from an RE and if so what does
>>> that
>>> syntax look like?
>>
>> A TAWK variable can hold any type at any time by simple assignment.
>
> OK, I'm not being clear. I can do this with a variable holding an RE:
>
> $ cat file
> abc
> def
> abcxyz here
> ghi
> $ awk -v var="abc" ' $0 ~ var "xyz" ' file
> abcxyz here
>
> What does something like that (i.e. compose an RE from parts) look like if
> the variable ("var" above) holds an RE constant instead of a string? You
> can set var in a BEGIN statement if you can't set it outside of awk.

Never tried that, but I suspect it would not work in any AWK variant. As far
as I can tell - and if definitive statements exist I have never seen them -
any RE can only ever be the last element of a syntactically legal
expression. Eg., you can write:

$0 ~ /RE/

but not:

/RE/ ~ $0

The pattern match operators are NOT commutative (I think I tested this once
just to make sure). Similarly, if a parameter of a built-in function is an
RE, the only thing allowed to follow it is an expression terminator such as
"," or ")". You can probably write something like:

($0 ~ /RE/) + 1

but the parentheses "isolate" that portion of the expression and again, the
RE is the last element of the sub-expression it appears in.

In your example, if 'var' held an RE then it would not be the last element
of the expression it appears in. Reversing the order:

'$0 ~ "xyz" var'

won't help because then a string follows a pattern match operator, which is
not legal syntax. The parser should complain.

But now I'm thinking, why wouldn't:

$0 ~ /RE/ + 1

work? If the pattern match operator has a higher precedence than the
addition operator, then it should be legal.
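
For reference, the POSIX awk precedence table lists ~ and !~ below the additive operators, so in awk itself the unparenthesized form is not grouped as (match) + 1. The parenthesized form is unambiguous, and a minimal sketch of it follows. It assumes a POSIX shell and any POSIX awk; the sample input and the names `n` and `match_plus_one` are made up for illustration.

```shell
# The match expression yields 1 or 0, so adding 1 gives 2 for a matching
# line and 1 for a non-matching one.
match_plus_one=$(printf 'abc\nxyz\n' | awk '{ n = ($0 ~ /abc/) + 1; print n }')

printf '%s\n' "$match_plus_one"
```

This says nothing about how the assembler's own expression parser treats the same text; it only shows what awk does with the grouped form.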

(pause while I look at my assembler's expression parser...)

Okay, I'm back. The parser doesn't currently recognize a binary numeric
operator following an RE. I may have to make some changes!

- Anton Treuenfels

Anton Treuenfels

Oct 12, 2012, 6:52:36 PM

"Anton Treuenfels" <teamt...@yahoo.com> wrote in message
news:36udnc90KsAnAeXN...@earthlink.com...

> Okay, I'm back. The parser doesn't currently recognize a binary numeric
> operator following an RE. I may have to make some changes!
>
> - Anton Treuenfels

And apparently I was looking at the wrong part of the state table. The
parser DOES recognize

"str" matchop /RE/

as having a numeric result and will next accept anything that can legally
follow a number. Whew! A bare /RE/ (as in a function argument) can only be
followed by an end-of-expression marker of some kind.

Regarding your example, Ed, I still don't think it's legal syntax in any AWK
variant. A string cannot directly follow an RE (as in your example) or a
pattern match operator (as in my alteration of your example).

- Anton Treuenfels

Janis Papanagnou

Oct 12, 2012, 8:33:55 PM
On 09.10.2012 00:01, Ed Morton wrote:
> Janis Papanagnou <janis_pa...@hotmail.com> wrote:
>> On 07.10.2012 12:48, Aharon Robbins wrote:
> <snip>
>>> As James Kirk once said, "I was not aware that this was a democracy."
> <snip>
>> (BTW, from which film or episode is that quote? - I cannot find it.)
>
> I think it's a misquote from The Corbomite Maneuver (1966):
>
> Lieutenant Dave Bailey: Sir, we gonna just let it hold us here? We got phaser
> weapons. I vote we blast it.
> Capt. Kirk: I'll keep that in mind, Mr. Bailey, when this becomes a democracy.

Now I know why that sounded unfamiliar to me; the German dubbing
is different, something like: "I'll revisit your suggestion, Mr. Bailey,
if I won't come up with something better."

Janis

PS: Arnold, you compared my awk warning vote with a phaser blast?! Wow!
Now just wait until I've armed and activated my Corbomite! :-)

Ed Morton

Oct 13, 2012, 9:25:09 AM
OK, thanks for indulging me! I've learned something...

Ed.

Aharon Robbins

Oct 13, 2012, 2:10:14 PM
In article <44qdnaZ_SKOlAuXN...@earthlink.com>,
Anton Treuenfels <teamt...@yahoo.com> wrote:
>
>"Anton Treuenfels" <teamt...@yahoo.com> wrote in message
>news:36udnc90KsAnAeXN...@earthlink.com...
>
>> Okay, I'm back. The parser doesn't currently recognize a binary numeric
>> operator following an RE. I may have to make some changes!
>>
>> - Anton Treuenfels
>
>And apparently I was looking at the wrong part of the state table. The
>parser DOES recognize
>
>"str" matchop /RE/
>
>as having a numeric result and will next accept anything that can legally
>follow a number. Whew! A bare /RE/ (as in a function argument) can only be
>followed by an end-of-expression marker of some kind.

So - how do you know this? Do you have source code for tawk? If so,
how did you get it?

>Regarding your example, Ed, I still don't think it's legal syntax in any AWK
>variant. A string cannot directly follow an RE (as in your example) or a
>pattern match operator (as in my alteration of your example).

foo ~ var1 var2
--> foo ~ (var1 var2)

is what Ed meant. I'd have to double check how gawk actually parses this.

(The other option would be

foo ~ var1 var2
---> (foo ~ var1) var2
)
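
One way to tell the two readings apart without reading a grammar is to run Ed's command next to an explicitly grouped variant. This is a minimal sketch assuming a POSIX shell and a POSIX awk; the shell variable names `input`, `implicit` and `match_first` are made up for illustration.

```shell
# Recreate Ed's sample file contents.
input='abc
def
abcxyz here
ghi'

# Implicit grouping, exactly as in Ed's command line.
implicit=$(printf '%s\n' "$input" | awk -v var="abc" '$0 ~ var "xyz"')

# Forced "match first, then concatenate": the pattern value is the string
# "1xyz" or "0xyz", which is non-empty and therefore true for every line.
match_first=$(printf '%s\n' "$input" | awk -v var="abc" '($0 ~ var) "xyz"')

printf 'implicit:\n%s\n' "$implicit"
printf 'match first:\n%s\n' "$match_first"
```

Since Ed's output was the single line "abcxyz here", the implicit form behaves as foo ~ (var1 var2), which agrees with concatenation sitting above ~ in the POSIX precedence table.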

Thanks,

Arnold

Anton Treuenfels

Oct 13, 2012, 11:13:07 PM

"Aharon Robbins" <arn...@skeeve.com> wrote in message
news:k5caq6$34e$1...@dont-email.me...
> In article <44qdnaZ_SKOlAuXN...@earthlink.com>,
> Anton Treuenfels <teamt...@yahoo.com> wrote:
>>
>>"Anton Treuenfels" <teamt...@yahoo.com> wrote in message
>>news:36udnc90KsAnAeXN...@earthlink.com...
>>
>>> Okay, I'm back. The parser doesn't currently recognize a binary numeric
>>> operator following an RE. I may have to make some changes!
>>>
>>> - Anton Treuenfels
>>
>>And apparently I was looking at the wrong part of the state table. The
>>parser DOES recognize
>>
>>"str" matchop /RE/
>>
>>as having a numeric result and will next accept anything that can legally
>>follow a number. Whew! A bare /RE/ (as in a function argument) can only be
>>followed by an end-of-expression marker of some kind.
>
> So - how do you know this? Do you have source code for tawk? If so,
> how did you get it?

Mmm, sorry if I misled you. I was referring to the expression parser I wrote
for my assembler. It reflects my understanding of the Rules of Regular
Expressions. The described behavior is what I wrote it to do (or not do, as
the case may be) based on observation, since I have never seen a complete
list of what's legal and what's not.

>>Regarding your example, Ed, I still don't think it's legal syntax in any
>>AWK
>>variant. A string cannot directly follow an RE (as in your example) or a
>>pattern match operator (as in my alteration of your example).
>
> foo ~ var1 var2
> --> foo ~ (var1 var2)
>
> is what Ed meant. I'd have to double check how gawk actually parses this.
>
> (The other options would be
>
> foo ~ var1 var2
> ---> (foo ~ var1) var2
> )

Either way, to be legal doesn't there have to be an implied type conversion
somewhere along the line?

> foo ~ var1 var2
> --> foo ~ (var1 var2)

Assuming "foo", "var1" and "var2" are all strings, expressed as parsed
Reverse Polish this might look something like:

foo var1 var2 _CONCAT _S2RE _MATCH

where _CONCAT is an operator "added" to make the implied concatenation
explicit (and evaluation easier) and _S2RE is an operator "added" or
"injected" during parsing to make things turn out right when it is noticed
an RE is needed rather than a string. Similarly,

> foo ~ var1 var2
> ---> (foo ~ var1) var2

might look like:

foo var1 _S2RE _MATCH _N2S var2 _CONCAT

where _N2S is an operator "added" or "injected" during parsing to convert a
number to a string.
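
The two Reverse Polish forms can be mirrored in awk by writing the grouping out explicitly and printing the value of each expression. This is a minimal sketch assuming a POSIX shell and a POSIX awk; the sample strings and the names `a`, `b` and `values` are made up for illustration, and the comments map each line to the _S2RE and _N2S steps only by analogy.

```shell
values=$(awk 'BEGIN {
    foo = "abcxyz"; var1 = "abc"; var2 = "xyz"

    # concatenate, use the string as an RE, match: result is a number
    a = (foo ~ (var1 var2))

    # match, convert the number to a string, concatenate: result is a string
    b = ((foo ~ var1) var2)

    print a
    print b
}')

printf '%s\n' "$values"
```

The first expression yields 1 and the second yields "1xyz", so both groupings are accepted by awk and each involves an implied conversion along the way.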

Either can be legal if a parser is designed that way, I suppose. I've never
seen a string on the right-hand side of a pattern match operator (or if I have I
don't recall) though, so I never really considered the possibility. I have
seen contexts in which a string is "automagically" converted to an RE,
usually a built-in function of some kind, so I imagine such parsers have the
equivalent of an _S2RE operator hidden inside them.

On a side note, an interesting question is what should the evaluator do if
it turns out the string does not represent a legal RE, which can't be
determined at parse time if the string is a variable rather than a literal.
TAWK tends to complain directly to the console, which makes it hard for a
program to handle such happenings elegantly.

But what I understood Ed to be asking involved the types of "var1" and
"var2" being different. That is, "foo" and "var2" are strings, but "var1" is
an RE. It's easy enough to coerce "var2" to an RE as well, but I have no
idea in the world what it would mean to concatenate two REs.

Of course as soon as I wrote that I thought "What if it means RE1|RE2"? How
easy that would be depends, I suppose, on what the internal representation
of an RE is like. If each "clause" is internally a separate RE and
evaluation moves along from one alternative to the next, it might not be so
bad. If instead "it's all one big thing" then joining two separate REs would
be problematic.

But in any case an "RE1|RE2" interpretation requires an implied
concatenation I'm not aware of in any AWK variant. In fact I don't think
it's possible as it is to concatenate two separate RE literals, implicitly
or explicitly (of course, why would anyone want to?). The only way it can
happen is to start from strings and convert the combined result to an RE -
and one of the strings would have to have the "|" operator in it as well.
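
Starting from strings, an "RE1|RE2" combination can be sketched as below. This assumes a POSIX shell and a POSIX awk; the two sample expressions and the names `re1`, `re2`, `alt` and `alt_out` are made up for illustration.

```shell
alt_out=$(printf 'abc\ndef\nghi\n' | awk '
    BEGIN {
        re1 = "^a"
        re2 = "i$"
        # Parenthesize each part so "|" applies to the whole of each RE.
        alt = "(" re1 ")|(" re2 ")"
    }
    $0 ~ alt { print }')

printf '%s\n' "$alt_out"
```

Here the "|" is supplied by the joining string rather than by either operand, and the combined string is only treated as an RE when it reaches the match operator.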

- Anton Treuenfels
