Google Groups no longer supports new Usenet posts or subscriptions. Historical content remains viewable.
Dismiss

gawk 4.0.1 vs. 4.1.1 "rshift" and "and"

123 views
Skip to first unread message

catsk...@gmail.com

unread,
Oct 2, 2016, 2:08:04 PM10/2/16
to
Can anyone explain why this program yields different results for -4 to -1.

BEGIN {
print(PROCINFO["version"])
for (i=-4; i<=4; i++) {
printf("%+2d %s\n",i,bits2str(i))
}
exit(0)
}
function bits2str(bits, data) {
if (bits == 0) {
return("0")
}
for (; bits != 0; bits = rshift(bits,1)) {
data = (and(bits,1) ? "1" : "0") data
}
while ((length(data) % 8) != 0) {
data = "0" data
}
return(data)
}

4.0.0l
-4 11111111111111111111111111111100
-3 11111111111111111111111111111101
-2 11111111111111111111111111111110
-1 11111111111111111111111111111111
+0 0
+1 00000001
+2 00000010
+3 00000011
+4 00000100

4.1.1
-4 00111111111111111111111111111111111111111111111111111100
-3 00111111111111111111111111111111111111111111111111111101
-2 00111111111111111111111111111111111111111111111111111110
-1 00111111111111111111111111111111111111111111111111111111
+0 0
+1 00000001
+2 00000010
+3 00000011
+4 00000100

Janis Papanagnou

unread,
Oct 2, 2016, 3:28:54 PM10/2/16
to
On 02.10.2016 20:07, catsk...@gmail.com wrote:
> Can anyone explain why this program yields different results for -4 to -1.

Maybe use of a different underlying floating point data type (with a length
of the mantissa of 31+1 in gawk 4.0 vs. 53+1 in gawk 4.1), or alternatively
use of a 32 bit integer in 4.0 vs. a float in 4.1? - Just guessing.

Janis

Andrew Schorr

unread,
Oct 2, 2016, 8:10:59 PM10/2/16
to
On Sunday, October 2, 2016 at 3:28:54 PM UTC-4, Janis Papanagnou wrote:
> Maybe use of a different underlying floating point data type (with a length
> of the mantissa of 31+1 in gawk 4.0 vs. 53+1 in gawk 4.1), or alternatively
> use of a 32 bit integer in 4.0 vs. a float in 4.1? - Just guessing.

I believe that the internal calculations are done using 64-bit integers in both cases. I believe the change in behavior results from modifications to the way those integer results are converted back into floating point. The code in floatcomp.c:adjust_uint was changed between 4.0 and 4.1. Here is the relevant comment:

/*
* If uintmax_t is so wide that AWKNUM cannot represent all its
* values, strip leading nonzero bits of integers that are so large
* that they cannot be represented exactly as AWKNUMs, so that their
* low order bits are represented exactly, without rounding errors.
* This is more desirable in practice, since it means the user sees
* integers that are the same width as the AWKNUM fractions.
*/

A few further thoughts:

1. Please try running the gawk command with the --lint argument. You should see warning messages such as these:

gawk: test.awk:14: warning: and: argument 2 negative value -4 will give strange results
gawk: test.awk:13: warning: rshift(-4.000000, 1.000000): negative values will give strange results

2. I don't think it's reasonable to pad with zeroes on the left if the number of bits is not a multiple of 8. You seem to be making unwarranted assumptions about how the values are represented internally.

3. What are you actually trying to accomplish? Inside gawk, numeric values are stored as IEEE 754 double-precision numbers, which I think yields around 53 bits of precision. It doesn't really make sense to perform these bit operations on negative values, as per the warning messages.

Regards,
Andy


Janis Papanagnou

unread,
Oct 2, 2016, 9:55:41 PM10/2/16
to
On 03.10.2016 02:10, Andrew Schorr wrote:
> On Sunday, October 2, 2016 at 3:28:54 PM UTC-4, Janis Papanagnou wrote:
>> [...]
> [...]
> 2. I don't think it's reasonable to pad with zeroes on the left if the
> number of bits is not a multiple of 8. You seem to be making unwarranted
> assumptions about how the values are represented internally.

(Not me, actually, rather the OP posted this strange code. ;-)

>
> 3. What are you actually trying to accomplish? Inside gawk, numeric values
> are stored as IEEE 754 double-precision numbers, which I think yields
> around 53 bits of precision. It doesn't really make sense to perform these
> bit operations on negative values, as per the warning messages.

Here I disagree with you; the fact how the external value representation
is (invisible to the user) _internally_ stored *should* not be of concern,
specifically in such a (conceptually) strange case where a floating point
encoding is used and/or conversions take place implicitly. The language
user should get a well defined "Bit-Manipulation" concept at the interface
without being bothered with conceptually irrelevant floating point issues.
It is (seemingly) documented, but as so many other delegated functions in
GNU Awk it's not clearly defined where it says "widest C unsigned integer
type". (While for many applications it's probably sufficient, it's not
suited for applications who have to rely on well defined conditions; those
folks, though, would probably use Ada and not use Awk for such purposes.)
Usually bit operations in computer languages are done on an integral data
type, and usually a two's complement representation is used for negative
values. It's at least obvious to assume shifts (with a semantic of being
either a logical or an arithmetic shift) also with negative values. GNU
Awk's bit manipulation features are, IMO, anyway not very sophisticated
(it's size restricted, most functions can be done with arithmetic as well,
there's no I/O capabilities (as the OP needed to implement that himself),
and basic bit-functions are missing). Probably yet another area for GNU
Awk's extension library.

Janis

>
> Regards, Andy
>
>

Andrew Schorr

unread,
Oct 3, 2016, 9:22:38 AM10/3/16
to
On Sunday, October 2, 2016 at 9:55:41 PM UTC-4, Janis Papanagnou wrote:
> Here I disagree with you; the fact how the external value representation
> is (invisible to the user) _internally_ stored *should* not be of concern,
> specifically in such a (conceptually) strange case where a floating point
> encoding is used and/or conversions take place implicitly. The language
> user should get a well defined "Bit-Manipulation" concept at the interface
> without being bothered with conceptually irrelevant floating point issues.

But in what way is this "well-defined"? I just don't follow what the OP expects to happen here with negative values. It's very dependent on the internal representation, so it seems to me to be rather uninteresting and not useful.

Try running with -M, and it goes into an infinite loop!

> It is (seemingly) documented, but as so many other delegated functions in
> GNU Awk it's not clearly defined where it says "widest C unsigned integer
> type". (While for many applications it's probably sufficient, it's not
> suited for applications who have to rely on well defined conditions; those
> folks, though, would probably use Ada and not use Awk for such purposes.)
> Usually bit operations in computer languages are done on an integral data
> type, and usually a two's complement representation is used for negative
> values. It's at least obvious to assume shifts (with a semantic of being
> either a logical or an arithmetic shift) also with negative values. GNU
> Awk's bit manipulation features are, IMO, anyway not very sophisticated
> (it's size restricted, most functions can be done with arithmetic as well,
> there's no I/O capabilities (as the OP needed to implement that himself),
> and basic bit-functions are missing). Probably yet another area for GNU
> Awk's extension library.

I don't think "usual" behavior is a good basis for writing a robust program. I still question what the OP wants to achieve. This is simply not well-defined with respect to the awk language. Note that these functions are gawk extensions, so we're really in uncharted territory with this.

Regards,
Andy


Janis Papanagnou

unread,
Oct 3, 2016, 10:55:48 AM10/3/16
to
On 03.10.2016 15:22, Andrew Schorr wrote:
> On Sunday, October 2, 2016 at 9:55:41 PM UTC-4, Janis Papanagnou wrote:
>> Here I disagree with you; the fact how the external value representation
>> is (invisible to the user) _internally_ stored *should* not be of
>> concern, specifically in such a (conceptually) strange case where a
>> floating point encoding is used and/or conversions take place implicitly.
>> The language user should get a well defined "Bit-Manipulation" concept at
>> the interface without being bothered with conceptually irrelevant
>> floating point issues.
>
> But in what way is this "well-defined"?

By "well defined" I was referring to specifying the supported width of
the bit array, or by supporting bit arrays with unrestricted length.

> I just don't follow what the OP
> expects to happen here with negative values.

It's actually a standard behaviour in Bit-Manipulation libraries. Why do
you think there are terms like "logical" and "arithmetical" bit-shifts
existing? (See Wikipedia, for example, or a 101 computer science course.)

> It's very dependent on the
> internal representation, so it seems to me to be rather uninteresting and
> not useful.

Conceptual behaviour shouldn't be dependent of the internal representation.
Only the length should be the primary factor, and the shift semantic the
secondary factor, you don't need more.

I suspect you seem to view it from the Awk implementation side; starting
from the given Awk specific fact that "everything" is a floating point
number, and some logic defined on top of that container data type, as
opposed to the view of bit-manipulation working on bit-strings or integral
data types (where any [in awk differing] underlying container data type
should be hidden from the user if necessary).

>
> Try running with -M, and it goes into an infinite loop!

Don't know what you are refering to by "it". (If you mean the OP's code
I won't comment on it, with or without -M, since it's anyway Bad Code.)

>
>> It is (seemingly) documented, but as so many other delegated functions
>> in GNU Awk it's not clearly defined where it says "widest C unsigned
>> integer type". (While for many applications it's probably sufficient,
>> it's not suited for applications who have to rely on well defined
>> conditions; those folks, though, would probably use Ada and not use Awk
>> for such purposes.) Usually bit operations in computer languages are done
>> on an integral data type, and usually a two's complement representation
>> is used for negative values. It's at least obvious to assume shifts (with
>> a semantic of being either a logical or an arithmetic shift) also with
>> negative values. GNU Awk's bit manipulation features are, IMO, anyway not
>> very sophisticated (it's size restricted, most functions can be done with
>> arithmetic as well, there's no I/O capabilities (as the OP needed to
>> implement that himself), and basic bit-functions are missing). Probably
>> yet another area for GNU Awk's extension library.
>
> I don't think "usual" behavior is a good basis for writing a robust
> program.

A robust behaviour is well documented. Shifting negative values is standard
since the introduction of [signed] integer registers and shift functions.
If lacking documentation (or even if documented) one can expect that "usual"
behaviour is supported, presuming that usual behaviour is reflecting state
of the art since at least the mid of the last century.

> I still question what the OP wants to achieve.

Two-fold answer; a) he wanted to shift negative numbers - a valid request,
b) he does some strange padding (and whatnot) - which makes not much sense
in this Awk context.

(I wouldn't be astonished if it's only a test program to see how GNU Awk
behaves, before the OP would use these functions in his "production code".)

> This is simply not
> well-defined with respect to the awk language. Note that these functions
> are gawk extensions, so we're really in uncharted territory with this.

Since Bit-Manipulation is an GNU Awk extension to standard Awk the developer
was free to define it sensibly. (I already commented upthread what I think
about the state of that implementation.)

Why "uncharted territory"? (The functions are listed in the documentation.)

Janis

>
> Regards, Andy
>
>

Andrew Schorr

unread,
Oct 3, 2016, 12:56:45 PM10/3/16
to
Sorry, I misspoke previously. The documentation says the following regarding these functions:

"For all of these functions, first the double-precision
floating-point value is converted to the widest C unsigned integer
type, then the bitwise operation is performed. If the result cannot be
represented exactly as a C `double', leading nonzero bits are removed
one by one until it can be represented exactly. The result is then
converted back into a C `double'. (If you don't understand this
paragraph, don't worry about it.)"

Furthermore, I now see that the bits2str function comes straight out of the gawk documentation, including the left-padding with zeros behavior. It even includes an example with a negative number:

compl(0x99) = 0x3fffffffffff66 = 00111111111111111111111111111111111111\
111111111101100110

I think the code implements the behavior defined in the documentation, and I think this explains what the OP is seeing. If that's not the case, then a bug report should be filed.

If somebody doesn't like this behavior, then it's certainly possible to develop an alternate library with different behavior, implemented either as AWK source code or using a C extension library. For an AWK implementation, we are constrained by the fact that all scalars are either strings or floating-point numbers, but one can use strings to encode integers if desired. In the case of a C library, we could possibly pass around cookies that point to arbitrary types. The gawkextlib MPFR extension library actually encodes the arbitrary-precision floating-point values in AWK strings, so there's an example of how this sort of thing can be done, although this approach certainly has some shortcomings.

It's hard to suggest the best solution unless we know what the OP is actually trying to accomplish. It's possible that AWK may not be the best language for this.

Regards,
Andy

catsk...@gmail.com

unread,
Oct 3, 2016, 1:52:05 PM10/3/16
to
On Monday, October 3, 2016 at 12:56:45 PM UTC-4, Andrew Schorr wrote:
> Sorry, I misspoke previously. The documentation says the following regarding these functions:
>
> "For all of these functions, first the double-precision
> floating-point value is converted to the widest C unsigned integer
> type, then the bitwise operation is performed. If the result cannot be
> represented exactly as a C `double', leading nonzero bits are removed
> one by one until it can be represented exactly. The result is then
> converted back into a C `double'. (If you don't understand this
> paragraph, don't worry about it.)"
>
> Furthermore, I now see that the bits2str function comes straight out of the gawk documentation, including the left-padding with zeros behavior. It even includes an example with a negative number:

Thanks for pointing out the example came from the GNU awk manual.
http://www.gnu.org/software/gawk/manual/gawk.html#index-bits2str_0028_0029-user_002ddefined-function

I know Janis, who I have the utmost respect for, said the "OP posted this strange code" and in a subsequent post "it's anyway Bad Code" so I hope Arnold isn't offended.

>
> compl(0x99) = 0x3fffffffffff66 = 00111111111111111111111111111111111111\
> 111111111101100110

Under gawk 4.1.1 I am actually seeing
compl(0x99) = 0x1fffffffffff66 = ...
Is this a bug or does the documentation just need to be updated?

>
> I think the code implements the behavior defined in the documentation, and I think this explains what the OP is seeing. If that's not the case, then a bug report should be filed.
>
> If somebody doesn't like this behavior, then it's certainly possible to develop an alternate library with different behavior, implemented either as AWK source code or using a C extension library. For an AWK implementation, we are constrained by the fact that all scalars are either strings or floating-point numbers, but one can use strings to encode integers if desired. In the case of a C library, we could possibly pass around cookies that point to arbitrary types. The gawkextlib MPFR extension library actually encodes the arbitrary-precision floating-point values in AWK strings, so there's an example of how this sort of thing can be done, although this approach certainly has some shortcomings.
>
> It's hard to suggest the best solution unless we know what the OP is actually trying to accomplish. It's possible that AWK may not be the best language for this.

I use this program along with many others to test newer versions of gawk before migrating forward. When I saw the bits2str function behave differently between 4.0 and 4.1 I raised a question, that's all.

The testbits.awk program, along with my original example, also fails with --bignum and it should not.

Regards,
Dan Nielsen

>
> Regards,
> Andy

Andrew Schorr

unread,
Oct 3, 2016, 8:09:19 PM10/3/16
to
On Monday, October 3, 2016 at 1:52:05 PM UTC-4, catsk...@gmail.com wrote:
> I know Janis, who I have the utmost respect for, said the "OP posted this strange code" and in a subsequent post "it's anyway Bad Code" so I hope Arnold isn't offended.

:-)

> > compl(0x99) = 0x3fffffffffff66 = 00111111111111111111111111111111111111\
> > 111111111101100110
>
> Under gawk 4.1.1 I am actually seeing
> compl(0x99) = 0x1fffffffffff66 = ...
> Is this a bug or does the documentation just need to be updated?

Please upgrade to the latest 4.1.4 release. In that one, I think the result should be consistent with its documentation: 0x3fffffffffff66.

> I use this program along with many others to test newer versions of gawk before migrating forward. When I saw the bits2str function behave differently between 4.0 and 4.1 I raised a question, that's all.

Ah. Why do you test with negative values? Is that something you actually use?

> The testbits.awk program, along with my original example, also fails with --bignum and it should not.

That's a fair point. It does seem to go into an infinite loop when run with -M. That's a bug. Could you please file a bug report for that issue? Note that it's a bug in the bits2str function, not gawk itself.

Another possible bug is that the compl function gives wildly different results when run with and without -M. For example:

bash-4.3$ ./gawk --version | head -1
GNU Awk 4.1.4, API: 1.1 (GNU MPFR 3.1.2, GNU MP 6.0.0)
bash-4.3$ ./gawk 'BEGIN {print compl(1)}'
18014398509481982
bash-4.3$ ./gawk -M 'BEGIN {print compl(1)}'
-2

Regards,
Andy

catsk...@gmail.com

unread,
Oct 3, 2016, 10:44:01 PM10/3/16
to
> Please upgrade to the latest 4.1.4 release. In that one, I think the result should be consistent with its documentation: 0x3fffffffffff66.

I will as soon as becomes available on Olaf Schoenfeldt's site:
http://www.klabaster.com/freeware.htm

> Another possible bug is that the compl function gives wildly different results when run with and without -M. For example:
>
> bash-4.3$ ./gawk --version | head -1
> GNU Awk 4.1.4, API: 1.1 (GNU MPFR 3.1.2, GNU MP 6.0.0)
> bash-4.3$ ./gawk 'BEGIN {print compl(1)}'
> 18014398509481982
> bash-4.3$ ./gawk -M 'BEGIN {print compl(1)}'
> -2
>

My results are different.
GAWK_4 --version | GAWK_4 "/Awk/" -
GNU Awk 4.1.1, API: 1.1 (GNU MPFR 3.1.2, GNU MP 5.1.2)
GAWK_4 "BEGIN{print compl(1)}"
9007199254740990
GAWK_4 --bignum "BEGIN{print compl(1)}"
-2

Marc de Bourget

unread,
Oct 4, 2016, 1:48:47 AM10/4/16
to
Le mardi 4 octobre 2016 04:44:01 UTC+2, catsk...@gmail.com a écrit :
> > Please upgrade to the latest 4.1.4 release. In that one, I think the result should be consistent with its documentation: 0x3fffffffffff66.
>
> I will as soon as becomes available on Olaf Schoenfeldt's site:
> http://www.klabaster.com/freeware.htm
>

You can download the official Windows GAWK port 4.1.4 from here:
http://sourceforge.net/projects/ezwinports/files/

BTW, I have finally found out how to compile GAWK on Windows with MinGW.
I'll write a instruction from scratch here, but it will take some weeks
because I don't have enough time at the moment.

Andrew Schorr

unread,
Oct 4, 2016, 8:07:27 PM10/4/16
to
On Monday, October 3, 2016 at 10:44:01 PM UTC-4, catsk...@gmail.com wrote:
> My results are different.
> GAWK_4 --version | GAWK_4 "/Awk/" -
> GNU Awk 4.1.1, API: 1.1 (GNU MPFR 3.1.2, GNU MP 5.1.2)
> GAWK_4 "BEGIN{print compl(1)}"
> 9007199254740990
> GAWK_4 --bignum "BEGIN{print compl(1)}"
> -2

That's not totally surprising. There were some changes made to that code between 4.1.1 and 4.1.4. I'm not terribly happy with this "compl" behavior, but not sure what should be done about it. For non-negative integers that don't overflow doubles, these bitwise functions seem to make sense. But once we get involved with negative numbers, the behavior is just weird.

With -M, rshift(-1, 1) = -1, and that sort of makes sense. But it causes the bits2str function to go into an infinite loop. Maybe the MPFR rshift implementation should use mpz_tdiv_q_2exp instead of mpz_fdiv_q_2exp. With that change, rshift(-1, 1) is 0.

Regards,
Andy

Janis Papanagnou

unread,
Oct 5, 2016, 4:58:25 AM10/5/16
to
On 03.10.2016 19:52, catsk...@gmail.com wrote:
> On Monday, October 3, 2016 at 12:56:45 PM UTC-4, Andrew Schorr wrote:
>>
>> Furthermore, I now see that the bits2str function comes straight out of
>> the gawk documentation, including the left-padding with zeros behavior.
>> It even includes an example with a negative number:
>
> Thanks for pointing out the example came from the GNU awk manual.
> http://www.gnu.org/software/gawk/manual/gawk.html#index-bits2str_0028_0029-user_002ddefined-function
>
> I know Janis, who I have the utmost respect for, said the "OP posted this
> strange code" and in a subsequent post "it's anyway Bad Code" so I hope
> Arnold isn't offended.

Frankly, I have no idea who contributed that piece of code. And you've
certainly noticed that I made a statement about the code and not about
the [unknown] author of the code. (Though ranks and titles have also
never been a hindrance to me, to name a duck as "duck" if it has all of
a duck's properties. Even my own code hacks, where I did not put effort
in, are not exempt from my own criticism. ;-) Furthermore, given what
I've seen of Arnolds code, I also don't believe it stems from him, but I
don't know.

WRT the code (and one or another issue may have been mentioned before)
I think at least three issues are bad; 1) padding with "0" even in case
of negative numbers, 2) returning (inconsistently) an unpadded "0" in
case of a zero argument, 3) returning "0..0" (padded to eight digits) on
(empty or not) string arguments. - A bit much for my taste in a function
that has only seven lines of effective code. It's arguable whether also
4) the padding per se is bad, in cases where no digit grouping is also
done, and it's also arguable whether it's good to 5) implement a loop
for the padding instead of just calculating the padding string length
once.

Janis
0 new messages