Is gawk POSIX-compliant in handling numeric-string to numeric-string comparisons?

Ed Morton

Aug 2, 2018, 1:32:51 AM
The POSIX awk spec
(http://pubs.opengroup.org/onlinepubs/9699919799/utilities/awk.html) says:

----
Comparisons (with the '<', "<=", "!=", "==", '>', and ">=" operators) shall be
made numerically if both operands are numeric, if one is numeric and the other
has a string value that is a numeric string, or if one is numeric and the other
has the uninitialized value. Otherwise, operands shall be converted to strings
as required and a string comparison shall be made
----

which best I can tell means that if a numeric string is being compared to
another numeric string then that should be a string comparison since numeric
string to numeric string isn't one of the 3 stated combinations that cause a
numeric comparison (numeric vs numeric, numeric vs numeric-string, and numeric
vs uninitialized).

The gawk manual, however, in its comparison table at
https://www.gnu.org/software/gawk/manual/gawk.html#Variable-Typing says:

----
        +----------------------------------------------
        |       STRING          NUMERIC         STRNUM
--------+----------------------------------------------
        |
STRING  |       string          string          string
        |
NUMERIC |       string          numeric         numeric
        |
STRNUM  |       string          numeric         numeric
--------+----------------------------------------------
----

and sure enough, using gawk (with or without the --posix flag) we see:

$ echo '07 7' |
awk '{print typeof($1), $1, "==", typeof($2), $2, "is " ($1 == $2 ? "true" : "false")}'
strnum 07 == strnum 7 is true

So - is gawk operating counter to the POSIX standard here or is there some piece
of the puzzle I'm not seeing?

Ed.
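
A minimal sketch of the two rule sets being contrasted above, assuming only the three operand kinds in the gawk table (the uninitialized value is left out). The names compare_kind, posix_kind, and gawk_kind are hypothetical labels, and posix_kind encodes the literal reading of the quoted POSIX sentence, not an official interpretation:

```shell
#!/bin/sh
# Hypothetical helper: report which kind of comparison each rule set selects.
# Operand kinds are passed as the labels "string", "numeric", or "strnum".
compare_kind() {
    awk -v a="$1" -v b="$2" '
    # Literal reading of the quoted POSIX sentence: numeric only when both
    # operands are numeric, or one is numeric and the other is a numeric string.
    function posix_kind(x, y) {
        if (x == "numeric" && y == "numeric") return "numeric"
        if ((x == "numeric" && y == "strnum") || (x == "strnum" && y == "numeric")) return "numeric"
        return "string"
    }
    # The gawk manual table: string comparison only when either operand is a
    # plain string, numeric otherwise.
    function gawk_kind(x, y) {
        if (x == "string" || y == "string") return "string"
        return "numeric"
    }
    BEGIN { print "posix=" posix_kind(a, b), "gawk=" gawk_kind(a, b) }'
}

compare_kind strnum strnum
compare_kind numeric strnum
```

The strnum/strnum pair is the only cell of the table where the two functions disagree, which is the case the question is about.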

Janis Papanagnou

Aug 2, 2018, 2:49:43 AM
You quoted POSIX: "Comparisons [...] shall be made numerically if [...] one
is numeric and the other has a string value that is a numeric string [...]"

One difference is that POSIX does not consider the STRNUM type for input
data. So for POSIX an (external) input of "7" is considered numeric (and the
interpretation of "07" may be whatever it is since the rule already applies
if one argument is numeric). Informally it would be considered by POSIX as
numeric string 07 == numeric 7 is true

IIUC.

Was that what puzzled you?

Janis

Ed Morton

Aug 2, 2018, 7:56:56 AM
POSIX says:

----
A string value shall be considered a **numeric string** if it comes from one of
the following:
**Field variables**
----

Emphasis above mine. If that's not saying it does consider the STRNUM type for
input data then what is it saying? The conditions after the above statement say,
among other things, that a successful call to strtod() determines if a field is
a STRNUM or not and I'd expect that both 07 and 7 would have such a result.

> So for POSIX an (external) input of "7" is considered numeric (and the
> interpretation of "07" may be whatever it is since the rule already applies
> if one argument is numeric). Informally it would be considered by POSIX as
> numeric string 07 == numeric 7 is true

I'm just not seeing how it's not considered by POSIX as numeric string 07 ==
numeric string 7 given the parts I've quoted from the POSIX spec (and after
reading the rest of the spec related to this area).

Ed.
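
A small sketch, not from the thread, that makes the two candidate outcomes visible side by side: the fields are compared once as awk sees them and once with both operands forced to strings by concatenating "". The function name cmp_fields is a hypothetical label, and the forced-string result only approximates what a literal reading of the POSIX text would require:

```shell
#!/bin/sh
# Compare two fields as awk does by default, then again with both operands
# turned into plain strings by concatenating the empty string.
cmp_fields() {
    printf '%s %s\n' "$1" "$2" | awk '{
        asis   = ($1 == $2) ? "true" : "false"
        forced = (($1 "") == ($2 "")) ? "true" : "false"
        print "as-is=" asis, "forced-string=" forced
    }'
}

cmp_fields 07 7
```

With awks that treat numeric-looking fields as numeric strings compared numerically (gawk, as shown earlier), the first result should be true and the second false.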

Geoff Clare

Aug 2, 2018, 8:41:04 AM
Input data is accessed via field variables ($1, $2, etc.), and POSIX
says:

Each field variable shall have a string value or an uninitialized
value when created. [...] If appropriate, the field variable shall
be considered a numeric string (see Expressions in awk).

(The referenced section describes the circumstances under which it is
considered a numeric string.) It does not say that a field variable
can have a numeric value.

Therefore I think Ed has a point - the gawk manual and behaviour
do not match POSIX. However, this seems like a defect in POSIX
to me, as other awk implementations behave the same way.

--
Geoff Clare <net...@gclare.org.uk>

Ed Morton

Aug 2, 2018, 10:04:34 AM
On 8/2/2018 7:33 AM, Geoff Clare wrote:
<snip>
> Therefore I think Ed has a point - the gawk manual and behaviour
> do not match POSIX. However, this seems like a defect in POSIX
> to me, as other awk implementations behave the same way.
>

Agreed. Here's another related example of what POSIX says vs what actual awks do
and a contradiction in the POSIX standard:

POSIX says under "Variables and Special Variables":
---
Uninitialized variables, including scalar variables, array elements, and field
variables, shall have an uninitialized value. An uninitialized value shall have
both a numeric value of zero and a string value of the empty string.
---

So according to POSIX an uninitialized variable and an uninitialized field are
treated identically. OK, got it. But then you have how gawk (and other awks such
as BSD/OSX awk) actually behave:

Uninitialized variable:

$ awk 'BEGIN{print typeof(x), x, (x=="" ? "==" : "!="), typeof(""), ""}'
untyped == string
$ awk 'BEGIN{print typeof(x), x, (x==0 ? "==" : "!="), typeof(0), 0}'
untyped == number 0


Uninitialized field:

$ echo 'a b ' | awk '{print typeof($3), $3, ($3=="" ? "==" : "!="), typeof(""), ""}'
unassigned == string
$ echo 'a b ' | awk '{print typeof($3), $3, ($3==0 ? "==" : "!="), typeof(0), 0}'
unassigned != number 0

Note the difference between the two for the comparison against a number 0.

Now, how awk actually behaves makes sense wrt this other part of the standard
under "Expressions in awk":

---
Syntax | Name | Type of Result | Associativity
$expr | Field reference | String | N/A
...
A string value shall be considered a numeric string if it comes from one of the
following:

Field variables

and an implementation-dependent condition corresponding to either case (a) or
(b) below is met.
----

where case (a) is a call to strtod() and case (b) is checking whether the field
value parses as a NUMERIC token.

So according to this part of the POSIX spec the type of a field, $expr, is
String and it can only become Numeric String if it satisfies the criteria in the
rest of that section (strtod() result or is a NUMERIC token). Therefore, since
"" is not a number by any of the stated criteria, an uninitialized field should
be a String with value "" and then the awk behavior above makes perfect sense.

If only that didn't contradict the required behavior given the quote at the
start of this post from the other part of the POSIX standard.

So now what? Can/should we get the POSIX spec changed to match actual awk
behavior? Maybe there are some awks out there that actually behave as POSIX says
they should in that first quote, idk. Do we need gawk to behave differently when
--posix is enabled?

Regards,

Ed.
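
A portable probe for the same contrast, as a sketch that avoids gawk's typeof() so it can be run against any awk. The function name uninit_probe and the output labels are hypothetical; each comparison prints 1 for true and 0 for false. The last value is the implementation-dependent one, so no expected result is given for it:

```shell
#!/bin/sh
# Report how the local awk compares an unset variable and a nonexistent
# field against 0 and against the empty string.
uninit_probe() {
    printf 'a b\n' | awk '{
        print "var_eq_0=" (x == 0), "var_eq_empty=" (x == ""), "field_eq_empty=" ($3 == ""), "field_eq_0=" ($3 == 0)
    }'
}

uninit_probe
```

Running it under several awks gives a quick way to see which ones follow the "Variables and Special Variables" wording for fields and which follow the behavior shown above.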

Ed Morton

Aug 2, 2018, 12:30:38 PM
On 8/2/2018 9:04 AM, Ed Morton wrote:
> On 8/2/2018 7:33 AM, Geoff Clare wrote:
> <snip>
>> Therefore I think Ed has a point - the gawk manual and behaviour
>> do not match POSIX.  However, this seems like a defect in POSIX
>> to me, as other awk implementations behave the same way.
>>
>
> Agreed. Here's another related example of what POSIX says vs what actual awks do
> and a contradiction in the POSIX standard:
>
> POSIX says under "Variables and Special Variables":

....

Another difference was just brought to my attention:

POSIX:
---
Field variables shall have the uninitialized value when created from $0 using FS
and the variable does not contain any characters.
---

So given a CSV with contents "a,,c" split on comma the value of $2 should be
"the uninitialized value" which we know from elsewhere in the POSIX spec is
zero-or-null.

How awk (thankfully) actually works is to simply treat $2 as a null string:

$ echo 'a,,c' | awk -F, 'BEGIN{print typeof($2), $2, ($2=="" ? "==" : "!="),
typeof(""), ""}'
unassigned == string
$ echo 'a,,c' | awk -F, 'BEGIN{print typeof($2), $2, ($2==0 ? "==" : "!="),
typeof(0), 0}'
unassigned != number 0

Regards,

Ed.

Janis Papanagnou

Aug 2, 2018, 3:06:30 PM
On 02.08.2018 14:33, Geoff Clare wrote:
>
> Input data is accessed via field variables ($1, $2, etc.), and POSIX
> says:
>
> Each field variable shall have a string value or an uninitialized
> value when created. [...] If appropriate, the field variable shall
> be considered a numeric string (see Expressions in awk).

Ah, with this quote it's indeed inconsistent.

> (The referenced section describes the circumstances under which it is
> considered a numeric string.) It does not say that a field variable
> can have a numeric value.
>
> Therefore I think Ed has a point - the gawk manual and behaviour
> do not match POSIX. However, this seems like a defect in POSIX

It seems so.

> to me, as other awk implementations behave the same way.

Janis


Kaz Kylheku

Aug 2, 2018, 4:00:12 PM
On 2018-08-02, Geoff Clare <ge...@clare.See-My-Signature.invalid> wrote:
> Janis Papanagnou wrote:
>> One difference is that POSIX does not consider the STRNUM type for input
>> data. So for POSIX an (external) input of "7" is considered numeric (and the
>> interpretation of "07" may be whatever it is since the rule already applies
>> if one argument is numeric). Informally it would be considered by POSIX as
>> numeric string 07 == numeric 7 is true
>
> Input data is accessed via field variables ($1, $2, etc.), and POSIX
> says:
>
> Each field variable shall have a string value or an uninitialized
> value when created. [...] If appropriate, the field variable shall
> be considered a numeric string (see Expressions in awk).

What is a "-v foo=bar" variable set on the command line?

"Input data" or not?

rici...@gmail.com

Aug 3, 2018, 2:54:25 AM
Yes, according to Posix (although Posix doesn't use the phrase "input data"). The origins of string values which are to be considered potential numeric strings are:

> 1. Field variables
> 2. Input from the getline() function
> 3. FILENAME
> 4. ARGV array elements
> 5. ENVIRON array elements
> 6. Array elements created by the split() function
> 7. A command line variable assignment
> 8. Variable assignment from another numeric string variable

Command line assignments (#7 in that list) include both '-v var=value' command-line options and 'var=value' command line operands.
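
A short sketch of two of the listed origins (a command-line assignment and split() elements) next to a pair of string constants, which are not in the list. The function name strnum_sources and the output labels are hypothetical, and the results assume an awk that compares two numeric strings numerically, as gawk does:

```shell
#!/bin/sh
# Compare "07" with "7" when the values arrive via -v, via split(), and as
# string constants written in the program text. Prints 1 for true, 0 for false.
strnum_sources() {
    awk -v a=07 -v b=7 'BEGIN {
        n = split("07 7", parts, " ")
        print "cmdline=" (a == b), "split=" (parts[1] == parts[2]), "literal=" ("07" == "7")
    }'
}

strnum_sources
```

Only the constants are compared as plain strings; the same characters coming from -v or split() are eligible to be numeric strings.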

Ed Morton

Aug 5, 2018, 8:53:16 AM
So from the above, discussions on other forums, and talking to Arnold, gawk and
other awks aren't strictly POSIX compliant in this regard but it's the POSIX
standard that needs to change, not the awk implementations, otherwise something
as basic as this (example courtesy of Arnold):

echo '5.0 10.0' | awk '$1 < $2'

would evaluate to false so I've opened a ticket with the Open Group to try to
get the standard updated. The ticket is at
https://help.opengroup.org/hc/en-us/requests/193457, you'll need to log in to
see it but even then idk if you'll be able to or if it's just me who can see it.

Thanks all for the feedback.

Ed.
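
To spell out why Arnold's example would evaluate to false under a string comparison: "5.0" and "10.0" differ at the first character, and "5" sorts after "1". A minimal sketch (lt_fields is a hypothetical name) that forces each kind of comparison explicitly, so the result does not depend on how the awk in use types its fields:

```shell
#!/bin/sh
# Evaluate $1 < $2 twice: once with both operands forced to numbers (+ 0)
# and once with both forced to strings (concatenating "").
# Prints 1 for true and 0 for false.
lt_fields() {
    printf '%s %s\n' "$1" "$2" | awk '{
        print "numeric=" (($1 + 0) < ($2 + 0)), "string=" (($1 "") < ($2 ""))
    }'
}

lt_fields 5.0 10.0
```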

Geoff Clare

Aug 6, 2018, 8:41:06 AM
Since you used BEGIN as the pattern, what you proved here is that the
awk version you tried this in does not conform to the POSIX requirement
regarding "the uninitialized value", not that $2 was not being treated
as "the uninitialized value".

Solaris does what POSIX says:

$ echo 'a,,c' | awk -F, 'BEGIN {print ($2==0 ? "==" : "!=")}'
==

and also without the BEGIN:

$ echo 'a,,c' | awk -F, '{print ($2==0 ? "==" : "!=")}'
==

Ed Morton

Aug 6, 2018, 8:51:02 AM
On 8/6/2018 7:36 AM, Geoff Clare wrote:
> Ed Morton wrote:
>
>> On 8/2/2018 9:04 AM, Ed Morton wrote:
>>
>> Another difference was just brought to my attention:
>>
>> POSIX:
>> ---
>> Field variables shall have the uninitialized value when created from $0 using FS
>> and the variable does not contain any characters.
>> ---
>>
>> So given a CSV with contents "a,,c" split on comma the value of $2 should be
>> "the uninitialized value" which we know from elsewhere in the POSIX spec is
>> zero-or-null.
>>
>> How awk (thankfully) actually works is to simply treat $2 as a null string:
>>
>> $ echo 'a,,c' | awk -F, 'BEGIN{print typeof($2), $2, ($2=="" ? "==" : "!="),
>> typeof(""), ""}'
>> unassigned == string
>> $ echo 'a,,c' | awk -F, 'BEGIN{print typeof($2), $2, ($2==0 ? "==" : "!="),
>> typeof(0), 0}'
>> unassigned != number 0
>
> Since you used BEGIN as the pattern, what you proved here is that the

Aaarghhh!!! Yeah, that was a mistake - trying too many combinations and messed
up when posting the above, thanks for catching it.

> awk version you tried this in does not conform to the POSIX requirement
> regarding "the uninitialized value", not that $2 was not being treated
> as "the uninitialized value".

What I meant to post was (using gawk 4.2.0):

$ echo 'a,,c' | awk -F, '{print typeof($2), $2, ($2=="" ? "==" : "!="),
typeof(""), ""}'
string == string

$ echo 'a,,c' | awk -F, '{print typeof($2), $2, ($2==0 ? "==" : "!="),
typeof(0), 0}'
string != number 0

>
> Solaris does what POSIX says:
>
> $ echo 'a,,c' | awk -F, 'BEGIN {print ($2==0 ? "==" : "!=")}'
> ==
>
> and also without the BEGIN:
>
> $ echo 'a,,c' | awk -F, '{print ($2==0 ? "==" : "!=")}'
> ==
>

There's a few awks on Solaris - is that old, broken awk (/bin/awk aka oawk) or
nawk or /usr/xpg[46]/bin/awk?

Ed.

Martin Neitzel

Aug 6, 2018, 6:24:03 PM
>> Solaris does what POSIX says:
>> $ echo 'a,,c' | awk -F, 'BEGIN {print ($2==0 ? "==" : "!=")}'
>> ==
>
>There's a few awks on Solaris - is that old, broken awk (/bin/awk aka oawk) or
>nawk or /usr/xpg[46]/bin/awk?

Yes. :-)

$ uname -a
SunOS carlton 5.10 Generic_147148-26 i86pc i386 i86pc

$ ls -li /usr/{xpg*/,}bin/*awk* | sort
4993 -r-xr-xr-x 2 root bin 80164 Apr 21 2011 /usr/bin/awk
4993 -r-xr-xr-x 2 root bin 80164 Apr 21 2011 /usr/bin/oawk
5037 -r-xr-xr-x 1 root bin 110080 Apr 21 2011 /usr/bin/nawk
57332 -r-xr-xr-x 1 root bin 66800 Dec 9 2011 /usr/xpg4/bin/awk

$ echo 'a,,c' | /usr/xpg4/bin/awk -F, 'BEGIN {print ($2==0 ? "==" : "!=")}'
==

$ echo 'a,,c' | nawk -F, 'BEGIN {print ($2==0 ? "==" : "!=")}'
!=

$ echo 'a,,c' | oawk -F, 'BEGIN {print ($2==0 ? "==" : "!=")}'
awk: syntax error near line 1
awk: illegal statement near line 1

That's oawk not knowing about the ?: ternary. To check anyway:

$ echo 'a,,c' | oawk -F, '$2 ==0'
$

$ echo 'a,,c' | oawk -F, '$2+0==0'
a,,c
$

(All of this was with LC_NUMERIC="en_US.UTF-8")

I am too kind to my sleeping neighbours (and too lazy) to fire up the
X4140 in the next room with Solaris-11.3 (or .2?) to cross-check.

Martin Neitzel

Martin Neitzel

Aug 6, 2018, 6:39:03 PM
I just wrote:
> [solaris-10 awk tests with erroneous BEGINs]

confirmed w/o the BEGIN, too.

Martin

Geoff Clare

Aug 7, 2018, 8:11:04 AM
Ed Morton wrote:

> So from the above, discussions on other forums, and talking to Arnold, gawk and
> other awks aren't strictly POSIX compliant in this regard but it's the POSIX
> standard that needs to change, not the awk implementations, otherwise something
> as basic as this (example courtesy of Arnold):
>
> echo '5.0 10.0' | awk '$1 < $2'
>
> would evaluate to false so I've opened a ticket with the Open Group to try to
> get the standard updated. The ticket is at
> https://help.opengroup.org/hc/en-us/requests/193457, you'll need to log in to
> see it but even then idk if you'll be able to or if it's just me who can see it.

It's just you who can see it (and Open Group staff, which includes me).
I have raised an Austin Group defect report based on your help desk ticket:

http://austingroupbugs.net/view.php?id=1198

That one is viewable by anyone (but you need an account to comment).

Ed Morton

Aug 7, 2018, 9:18:21 AM
Thanks Geoff.

Ed.