gawk rounding issue

Janis Papanagnou

May 14, 2015, 1:41:56 AM
To get the number of digits of a decimal number we can use the log function.
Ideally that would be log10(x), but since there's none available in awk we
need to calculate log(x)/log(10). This seems to work fine in gawk:

$ awk 'BEGIN {print (log(999)/log(10))+1}'
3.99957
$ awk 'BEGIN {print (log(1000)/log(10))+1}'
4

But to get whole integer numbers for the number of digits we also need to
apply the int() function:

$ awk 'BEGIN {print int(log(999)/log(10))+1}'
3
$ awk 'BEGIN {print int(log(1000)/log(10))+1}'
3

Doh! - And this result is really bad. It seems that implicit rounding issues
don't work well if those two functions, log() and int(), are combined. This
is especially strange since the complex log() division (where rounding issues
are more likely to be expected) already provided the right result, so int()
would not need to "guess" a correct rounding.

(At that point there are certainly many people who will be tempted to point me
to Goldberg's "What Every Computer Scientist Should Know About Floating-Point
Arithmetic" paper.) But note that other languages don't have that issue; e.g.
ksh:

$ echo $(( log(999)/log(10)+1 ))
3.99956548822598231
$ echo $(( int(log(999)/log(10))+1 ))
3
$ echo $(( log(1000)/log(10)+1 ))
4
$ echo $(( int(log(1000)/log(10))+1 ))
4

For now my workaround will be to implement a divide by 10 loop, or a switch
with a handful of hard coded values.
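
For illustration, the divide-by-10 loop mentioned above could be sketched as follows. This is only a sketch under assumptions: the wrapper name `ndigits_loop` is made up here, and the argument is assumed to be a non-negative integer small enough to be held exactly in a double.

```shell
# Hypothetical helper: count decimal digits by repeated division by 10.
# Assumes a non-negative integer argument.
ndigits_loop() {
    awk -v n="$1" 'BEGIN {
        d = 1
        while (n >= 10) {        # strip one digit per iteration
            n = int(n / 10)
            d++
        }
        print d
    }'
}

ndigits_loop 999
ndigits_loop 1000
```

No logarithm is involved, so the 999/1000 boundary is decided by exact integer comparisons.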

But my question is whether this should be considered a bug in gawk? (A first
impulse is to say "No"; certainly the easiest answer in FP arithmetic issues.)
And an [OT] question: what does ksh do that gawk doesn't do (or cannot do)?
If not a bug; could that be changed in gawk to work more accurately, the way,
say, ksh does?

Janis

Kenny McCormack

May 14, 2015, 5:26:10 AM
In article <mj1cj3$6a4$1...@news.m-online.net>,
Janis Papanagnou <janis_pa...@hotmail.com> wrote:
>To get the number of digits of a decimal number we can use the log function.
>Ideally that would be log10(x), but since there's none available in awk we
>need to calculate log(x)/log(10). This seems to work fine in gawk:
>
>$ awk 'BEGIN {print (log(999)/log(10))+1}'
>3.99957
>$ awk 'BEGIN {print (log(1000)/log(10))+1}'
>4
>
>But to get whole integer numbers for the number of digits we also need to
>apply the int() function:
>
>$ awk 'BEGIN {print int(log(999)/log(10))+1}'
>3
>$ awk 'BEGIN {print int(log(1000)/log(10))+1}'
>3
>
>Doh! - And this result is really bad. It seems that implicit rounding issues
>don't work well if those two functions, log() and int(), are combined.

It seems to me that the key question here is "What are you looking for?"
I would imagine that you're not really interested in workarounds - of which
there are dozens. You're not interested in either being "usered" or "XYed".

Just for the record, here's a workaround:

$ gawk '{print length(sprintf("%d",$1))}'
1000
4
999
3
999.999999
3
^C
$

Anyway, I did a little investigation, and, although I can't really prove
anything, I *think* the issue is that there's just enough round-off error
in your calculations to cause the actual value to be just a smidge below 4.
That is, I don't think the error is in the int() function, but in the logs
and the division. The vagaries of AWK and how it converts numbers for
printing probably account for the result (without int()) being displayed as
4 even though it is actually just shy of 4. Again, this is all conjecture,
but it is based on the next observation (results):

# Here, we call the log10 function directly, and we get better results...
# Note: This particular version of 'gawk' has 'call_any' compiled in.
$ gawk '{print int(call_any("dd","log10",$1))+1}'
1000
4
999
3
1001
4
999.99
3
^C
$

This suggests that adding log10 to the core gawk language would be a good
idea. In the meantime, of course, you might want to write an extension
lib...

Anyway, I don't know if any of this helps you or not, but what else can I say?

P.S. I also wonder if using the new GMP/MPFR stuff might give more
interesting results. I don't know, because I have to admit that I don't
really understand it (the GMP/MPFR stuff).

--
Is God willing to prevent evil, but not able? Then he is not omnipotent.
Is he able, but not willing? Then he is malevolent.
Is he both able and willing? Then whence cometh evil?
Is he neither able nor willing? Then why call him God?
~ Epicurus

Kaz Kylheku

May 14, 2015, 10:02:19 AM
On 2015-05-14, Janis Papanagnou <janis_pa...@hotmail.com> wrote:
> For now my workaround will be to implement a divide by 10 loop, or a switch
> with a handful of hard coded values.

The man pages for several awks say that the int() function truncates.
Some say toward zero. POSIX says:

int(x)
Return the argument truncated to an integer. Truncation shall be toward 0
when x>0.

It appears to be similar to doing an (int) cast (or implicit double to int
conversion) in the C language.

Of course, truncation is bad news whenever you have a floating-point result
which *approximates* an integer (and could err on either side), and you just
want that integer!

The classic C solution for this is to add 0.5 and then do the conversion:

double y = 3.999;
int x = (int) (y + 0.5); /* unnecessary cast, just for emphasis */

In C99, the round, roundl, and roundf functions were introduced. These
are a better solution because the hardware may have a way to do this
operation faster than doing an addition of 0.5.

I think that in awk, you can just apply the classic C solution: add 0.5,
then apply int().
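
A minimal sketch of that classic idiom in awk, for reference. The wrapper name `round_half_up` is made up for this example, and it assumes non-negative input. Note that this rounds to the nearest integer, which is a different operation from truncation.

```shell
# Hypothetical helper: round a non-negative value to the nearest integer
# by adding 0.5 and truncating with int().
round_half_up() {
    awk -v x="$1" 'BEGIN { print int(x + 0.5) }'
}

round_half_up 3.999
round_half_up 3.4
round_half_up 2.5
```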

> And an [OT] question; what does ksh what gawk doesn't do (or cannot do)?

It probably adds 0.5 to the floating-point value before converting it.
(Or perhaps it auto-detects the presence of the round function in the
library at build time and uses it.)

> If not a bug; could that be changed in gawk to work more accurately, the way,
> say, ksh does?

Not without breaking POSIX compliance. The fix is to introduce a round()
function (and get it standardized).

Kaz Kylheku

May 14, 2015, 10:05:33 AM
On 2015-05-14, Janis Papanagnou <janis_pa...@hotmail.com> wrote:
> To get the number of digits of a decimal number we can use the log function.

It occurs to me that this could be a useful approach:

$ gawk 'BEGIN { printf("%e\n", 123456) }' # use sprintf to a string
1.234560e+05
         ^^^
Use sprintf instead, extract the exponent, clamp to -1 if it is negative,
add 1.
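
The exponent idea can be sketched like this. The wrapper name `ndigits_exp` is an assumption of this example, not something from the thread, and the clamp follows the description above literally.

```shell
# Hypothetical helper: derive the digit count from the exponent of "%e" output.
ndigits_exp() {
    awk -v n="$1" 'BEGIN {
        s = sprintf("%e", n)                  # e.g. "1.234560e+05"
        e = substr(s, index(s, "e") + 1) + 0  # exponent as a number: "+05" -> 5
        if (e < 0)
            e = -1                            # clamp negative exponents to -1
        print e + 1
    }'
}

ndigits_exp 123456
ndigits_exp 999
ndigits_exp 1000
```

One caveat: plain `%e` keeps only six digits after the decimal point, so a value such as 99999999 is formatted as 1.000000e+08 and reported with one digit too many. A wider precision (for example `%.15e`) should push that limit further out.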

Janis Papanagnou

May 14, 2015, 10:36:16 AM
On 05/14/15 11:26, Kenny McCormack wrote:
> In article <mj1cj3$6a4$1...@news.m-online.net>,
> Janis Papanagnou <janis_pa...@hotmail.com> wrote:
>> To get the number of digits of a decimal number we can use the log function.
>> Ideally that would be log10(x), but since there's none available in awk we
>> need to calculate log(x)/log(10). This seems to work fine in gawk:
>>
>> $ awk 'BEGIN {print (log(999)/log(10))+1}'
>> 3.99957
>> $ awk 'BEGIN {print (log(1000)/log(10))+1}'
>> 4
>>
>> But to get whole integer numbers for the number of digits we also need to
>> apply the int() function:
>>
>> $ awk 'BEGIN {print int(log(999)/log(10))+1}'
>> 3
>> $ awk 'BEGIN {print int(log(1000)/log(10))+1}'
>> 3
>>
>> Doh! - And this result is really bad. It seems that implicit rounding issues
>> don't work well if those two functions, log() and int(), are combined.
>
> It seems to me that the key question here is "What are you looking for?"
> I would imagine that you're not really interested in workarounds - of which
> there are dozens. You're not interested in either being "usered" or "XYed".

(Don't know what "usered" means.)

>
> Just for the record, here's a workaround:
>
> $ gawk '{print length(sprintf("%d",$1))}'

Good to know.

> 1000
> 4
> 999
> 3
> 999.999999
> 3
> ^C
> $
>
> Anyway, I did a little investigation, and, although I can't really prove
> anything, I *think* the issue is that there's just enough round-off error
> in your calculations to cause the actual value to be just a smidge below 4.

That's also what I think. Where I start wondering is that *without* the
int() bracket around the log() division there's a correct result, thus
I assumed correct rounding is at least possible - but maybe it's just
correct display and incorrect internal representation? (Which would at
least leave a bad taste. - If "collapsing" an expression to a variable
I'd understand that binary representation problems could lead to such an
effect, but in expressions - that's what I experienced 35 years ago when
I was disassembling a scientific calculator - there's the possibility to
achieve a better rounding behaviour.) My suspicion (without having any
evidence) went along the lines that maybe either the int() function may
internally be missing some "correct"-rounding call, while other functions
do that, or that some "collapsing" of a sub-expression to an inherently
inaccurate stored [temporary] variable destroys the existing accuracy
information that's necessary to show (like ksh does) the correct result.

> That is, I don't think the error is in the int() function, but in the logs
> and the division. The vagaries of AWK and how it converts numbers for
> printing probably account for the result (without int()) being displayed as
> 4 even though it is actually just shy of 4. Again, this is all conjecture,
> but it is based on the next observation (results):
>
> # Here, we call the log10 function directly, and we get better results...
> # Note: This particular version of 'gawk' has 'call_any' compiled in.
> $ gawk '{print int(call_any("dd","log10",$1))+1}'

Assuming "call_any()" directly calls the (C-)library functions, I'd be
interested in what that result would look like if a function composition
similar to the awk sample program (based on the log() division) were
invoked. (I assume it's only possible with variable assignments of
intermediate results, thus bearing the danger of leading to the same
flawed result.) - Note: I'd also use log10() of course, but would use the
division just to better understand where the issue stems from.

> 1000
> 4
> 999
> 3
> 1001
> 4
> 999.99
> 3
> ^C
> $
>
> This suggests that adding log10 to the core gawk language would be a good
> idea. In the meantime, of course, you might want to write an extension
> lib...
>
> Anyway, I don't know if any of this helps you or not, but what else can I say?

Thanks for the thorough reply. Actually for my case I went the way to let
the shell evaluate the expression and pass the value as variable to awk,
since (from the awk's program's perspective) it's a constant. I usually
also do no heavy FP calculation with awk (just percentages and such), so
this is no pressing issue for me. I just wanted to bring that question to
the public's attention; because of the discrepancy (if compared to ksh)
there *might* be a subtle bug.

>
> P.S. I also wonder if using the new GMP/MPFR stuff might give more
> interesting results. I don't know, because I have to admit that I don't
> really understand it (the GMP/MPFR stuff).

Well, I've used it once where I operated on large numbers, but I'd not
consider MPFR to be the appropriate answer in case it's a rounding issue.

Janis

Janis Papanagnou

May 14, 2015, 10:52:23 AM
On 05/14/15 16:02, Kaz Kylheku wrote:
> On 2015-05-14, Janis Papanagnou <janis_pa...@hotmail.com> wrote:
>> For now my workaround will be to implement a divide by 10 loop, or a switch
>> with a handful of hard coded values.
>
> The man pages for several awks say that the int() function truncates.
> Some say toward zero. POSIX says:
>
> int(x)
> Return the argument truncated to an integer. Truncation shall be toward 0
> when x>0.
>
> It appears to be similar to doing an (int) cast (or implicit double to int
> conversion) in the C language.

Though, while (e.g.) ksh has the same semantics of int(), ksh does it right.

>
> Of course, truncation is bad news whenever you have a floating-point result
> which *approximates* an integer (and could err on either side), and you just
> want that integer!
>
> The classic C solution for this is to add 0.5 and then do the conversion:
>
> double y = 3.999;
> int x = (int) (y + 0.5); /* unnecessary cast, just for emphasis */

I don't think this works correctly around the awk implicit rounding issue:

$ awk 'BEGIN {print int(0.5+log(1000)/log(10))+1}'
4
$ awk 'BEGIN {print int(0.5+log(999)/log(10))+1}'
4

The latter should be 3. - Note I don't want "rounding" in the sense that
any FP number should be rounded to the next integer. I expect *implicit*
rounding (e.g. as defined by IEEE) to address the binary representation
problem of decimal numbers. The semantics should still be "truncation",
as ksh (or POSIX) may define it for int().

>
> In C99, the round, roundl, and roundf functions were introduced. These
> are a better solution because the hardware may have a way to do this
> operation faster than doing an addition of 0.5.
>
> I think that in awk, you can just apply the classic C solution: add 0.5,
> then apply int().
>
>> And an [OT] question; what does ksh what gawk doesn't do (or cannot do)?
>
> It probably adds 0.5 to the floating-point value before converting it.
> (Or perhaps it auto-detects the presence of the round function in the
> library at build time and uses it.)

(As mentioned above; adding 0.5 will do explicit rounding to an integer;
and BTW also not even "IEEE correctly". What I'd want would be a "correct"
truncation.)

>
>> If not a bug; could that be changed in gawk to work more accurately, the way,
>> say, ksh does?
>
> Not without breaking POSIX compliance. The fix is to introduce a round()
> function (and get it standardized).

Certainly useful. Currently I need an accurately working int() function.

Janis

Andrew Schorr

May 14, 2015, 10:44:23 PM
On Thursday, May 14, 2015 at 5:26:10 AM UTC-4, Kenny McCormack wrote:
> Anyway, I did a little investigation, and, although I can't really prove
> anything, I *think* the issue is that there's just enough round-off error
> in your calculations to cause the actual value to be just a smidge below 4.
> That is, I don't think the error is in the int() function, but in the logs
> and the division. The vagaries of AWK and how it converts numbers for
> printing probably account for the result (without int()) being displayed as
> 4 even though it is actually just shy of 4.

You are spot on. This is not hard to prove:

bash-4.2$ gawk 'BEGIN {print log(1000)/log(10)-3}'
-4.44089e-16

So yes, the ratio is epsilon less than 3. When you add 1, it's just slightly
less than 4. And apparently default print logic rounds it to 4:

bash-4.2$ gawk 'BEGIN {print log(1000)/log(10)+1}'
4

If you add a lot more precision, you can see the problem:

bash-4.2$ gawk 'BEGIN {printf "%.17g\n", log(1000)/log(10)+1}'
3.9999999999999996

It therefore makes sense that the int() function truncates the ratio to 2.
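
The display effect can be reproduced without calling log() at all. The following sketch (the function name `show_ofmt_vs_int` is made up here) builds a double just below 3 from the difference shown above and formats it three ways. It relies on plain `print` using OFMT, whose default is "%.6g", for non-integral values.

```shell
# Hypothetical demo: the same value seen through print, printf and int().
show_ofmt_vs_int() {
    awk 'BEGIN {
        x = 3 - 4.44089e-16      # a double just below 3
        print x                  # formatted via OFMT ("%.6g"): looks like 3
        printf "%.17g\n", x      # enough digits to show the stored value
        print int(x)             # truncation works on the stored value
    }'
}

show_ofmt_vs_int
```

So the rounding happens only when the number is converted to text for output; int() never sees a rounded value.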

And yes, using MPFR seems to solve it:

bash-4.2$ gawk -M 'BEGIN {print int(log(1000)/log(10))+1}'
4

Regards,
Andy

Kenny McCormack

May 15, 2015, 9:20:36 AM
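In article <201505140...@kylheku.com>,
Kaz Kylheku <k...@kylheku.com> wrote:
...
>Use sprintf instead, extract the exponent, clamp to -1 if it is negative,
>add 1.
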
Well, isn't that pretty much equivalent, functionally, to my "%d" approach
(in my first post on this thread) ?

--
Rich people pay Fox people to convince middle class people to blame poor people.

(John Fugelsang)

Kenny McCormack

May 15, 2015, 9:58:05 AM
In article <mj2bsv$jh5$1...@news.m-online.net>,
Janis Papanagnou <janis_pa...@hotmail.com> wrote:
...
>> It seems to me that the key question here is "What are you looking for?"
>> I would imagine that you're not really interested in workarounds - of which
>> there are dozens. You're not interested in either being "usered" or "XYed".
>
>(Don't know what "usered" means.)

A neologism of my own creation - the use of "user" as a verb.
It refers to the habit of Usenet (and online fora in general) posters of
treating people as if they were just dumb users who, as I often put it,
"Just want their SAS [*] to work". That is, they treat the posters as if
they were sub-morons, who just need the manual read to them.

It is bad enough in the general case, where, maybe, statistically, it is
the right call, but it is often employed with well-known posters, for whom
it is well understood that it is uncalled for. As it would be in your case...

[*] A well known Statistical Analysis System (as its initials imply).

XYed, of course, means to treat every posting as an instance of the "XY
problem", where the real problem is, of course, assumed to be something
almost completely unrelated to the actual words found in the posting.

>> Just for the record, here's a workaround:
>>
>> $ gawk '{print length(sprintf("%d",$1))}'
>
>Good to know.

Thanks.

...
>Assuming "call_any()" directly calls the (C-)library functions I'd be
>interested in how that result would look like if there's a function
>composition similary to the awk sample program (based on the log()
>division) would be invoked. (I assume it's only possible with variable
>assignments of intermediate results, thus bearing the danger to lead
>to the same flawy result.)

I don't quite follow this paragraph...

>> P.S. I also wonder if using the new GMP/MPFR stuff might give more
>> interesting results. I don't know, because I have to admit that I don't
>> really understand it (the GMP/MPFR stuff).
>
>Well, I've used it once where I operated on large numbers, but I'd not
>consider MPFR to be the appropriate answer in case it's a rounding issue.

Again, I'm not sure what you are saying here, but, according to Andy, using
MPFR does solve this problem.

--
Faith doesn't give you the answers; it just stops you from asking the questions.

Janis Papanagnou

May 15, 2015, 10:14:03 AM
On 05/15/15 15:58, Kenny McCormack wrote:
> In article <mj2bsv$jh5$1...@news.m-online.net>,
> Janis Papanagnou <janis_pa...@hotmail.com> wrote:
> ...
>
>>> Just for the record, here's a workaround:
>>>
>>> $ gawk '{print length(sprintf("%d",$1))}'
>>
>> Good to know.
>
> Thanks.

Oh, how stupid I was - or maybe just tired! - Now that I re-read your code...
Why not just:

{ n=1000; print length(n"") }


Janis

> [...]


Kaz Kylheku

May 15, 2015, 10:17:28 AM
On 2015-05-15, Kenny McCormack <gaz...@shell.xmission.com> wrote:
> In article <201505140...@kylheku.com>,
> Kaz Kylheku <k...@kylheku.com> wrote:
>>On 2015-05-14, Janis Papanagnou <janis_pa...@hotmail.com> wrote:
>>> To get the number of digits of a decimal number we can use the log function.
>>
>>It occurs to me that this could be a useful approach:
>>
>> $ gawk 'BEGIN { printf("%e\n", 123456) }' # use sprintf to a string
>> 1.234560e+05
>> ^^^
>>Use sprintf instead, extract the exponent, clamp to -1 if it is negative,
>>add 1.
>
> Well, isn't that pretty much equivalent, functionally, to my "%d" approach
> (in my first post on this thread) ?

It's equivalent in that we compute a textual representation and then extract
something from it.

Your approach converts the value to an integer using %d, so it could have range
limitation if the awk implementation isn't bignum-based. It works in awk, but
not other awks that only have 32 (or 64) bit integers.

The above approach will work for numbers like 1.234E+307 quite portably.

Now here is a best-of-both approach, which is possible if we replace %d by the
floating-point specifier %f.

$ awk 'BEGIN { printf("%.0f\n", 1.234E+307) }'
1234000000000000045880777533191982933910913052614812954088954246079376776237
8731330514281278718423267307348619050576264899946629773087716423248039054813
3951200047547218046256465559892470051077817890675171115650690077229596907931
2183889774728966070333234609540778135034014659539089433030794717712459981914
1120
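
Combined with length(), the %.0f conversion gives a digit counter along these lines. This is a sketch; the wrapper name `ndigits_f` is an assumption of this example.

```shell
# Hypothetical helper: count digits of the integer part via "%.0f".
ndigits_f() {
    awk -v n="$1" 'BEGIN { print length(sprintf("%.0f", n)) }'
}

ndigits_f 999
ndigits_f 1000
ndigits_f 1.234E+307
```

Unlike %d, which truncates, %.0f rounds to the nearest integer, so 999.999999 is formatted as 1000 and counted as four digits.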

Kenny McCormack

May 15, 2015, 10:27:49 AM
In article <201505150...@kylheku.com>,
Kaz Kylheku <k...@kylheku.com> wrote:
...
>Your approach converts the value to an integer using %d, so it could have range
>limitation if the awk implementation isn't bignum-based. It works in awk, but
>not other awks that only have 32 (or 64) bit integers.

It works in awk but not in awk?

>The above approach will work for numbers like 1.234E+307 quite portably.

Yeah, but...

Given that the starting point of this thread was using log and log10
(simulated in code), I think it is pretty clear that the OP is primarily
interested in small numbers (with lengths like 3 and 4...)

Anyway, it is, of course, all interesting.

--
They say compassion is a virtue, but I don't have the time!

- David Byrne -

Janis Papanagnou

May 15, 2015, 10:33:08 AM
On 05/15/15 16:27, Kenny McCormack wrote:
>
> Given that the starting point of this thread was using log and log10
> (simulated in code), I think it is pretty clear that the OP is primarily
> interested in small numbers (with lengths like 3 and 4...)

Indeed. And with the upthread mentioned method it works up to values of 10^15:

awk 'BEGIN{for (n=1; n<=100000000000000000; n*=10)
print length(n""), length((n-1)"")}'

1 1 # 0 is also one digit
2 1
3 2
4 3
...
14 13
15 14
16 15
17 17 # rounding discrepancies start here
18 18

Janis

Kenny McCormack

May 15, 2015, 10:34:24 AM
In article <mj4uva$i1l$1...@news.m-online.net>,
Janis Papanagnou <janis_pa...@hotmail.com> wrote:
>On 05/15/15 15:58, Kenny McCormack wrote:
>> In article <mj2bsv$jh5$1...@news.m-online.net>,
>> Janis Papanagnou <janis_pa...@hotmail.com> wrote:
>> ...
>>
>>>> Just for the record, here's a workaround:
>>>>
>>>> $ gawk '{print length(sprintf("%d",$1))}'
>>>
>>> Good to know.
>>
>> Thanks.
>
>Oh, how stupid I was - or maybe just tired! - Now that I re-read your code...
>Why not just:
>
> { n=1000; print length(n"") }

Well, you don't need the "". And, for what it is worth, I assumed you
wanted it to "do the right thing" (i.e., give the same basic numerical
results as solutions using log/log10 would) for non-integers (e.g., 3.123).
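
A small sketch of the difference for a non-integer (the function name `len_demo` is made up here): length() applied to a number measures its string form, decimal point and fraction included, while wrapping the value in int() first counts only the integer part.

```shell
# Hypothetical demo: length() of a number vs. length() of its integer part.
len_demo() {
    awk 'BEGIN {
        n = 3.123
        print length(n)          # length of the string "3.123"
        print length(int(n))     # length of the string "3"
    }'
}

len_demo
```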

--
Genesis 2:7 And the LORD God formed man of the dust of the ground, and
breathed into his nostrils the breath of life; and man became a living soul.

Janis Papanagnou

May 15, 2015, 10:46:28 AM
On 05/15/15 16:34, Kenny McCormack wrote:
> In article <mj4uva$i1l$1...@news.m-online.net>,
> Janis Papanagnou <janis_pa...@hotmail.com> wrote:
>> Why not just:
>>
>> { n=1000; print length(n"") }
>
> Well, you don't need the "".

Ah, of course, since it's a string function. Even better then.

> And, for what it is worth, I assumed you
> wanted it to "do the right thing" (i.e., give the same basic numerical
> results as solutions using log/log10 would) for non-integers (e.g., 3.123).

Sorry for having given that impression. No, I just wanted the number of
digits in decimal numbers.

I seem to have just thought too "complicated", missed the obvious simple
way.

Thanks again for your inspiring post.

Janis

Kaz Kylheku

May 15, 2015, 1:05:54 PM
On 2015-05-15, Kenny McCormack <gaz...@shell.xmission.com> wrote:
> In article <201505150...@kylheku.com>,
> Kaz Kylheku <k...@kylheku.com> wrote:
> ...
>>Your approach converts the value to an integer using %d, so it could have range
>>limitation if the awk implementation isn't bignum-based. It works in awk, but
>>not other awks that only have 32 (or 64) bit integers.
>
> It works in awk but not in awk?

Gawk.

>>The above approach will work for numbers like 1.234E+307 quite portably.
>
> Yeah, but...
>
> Given that the starting point of this thread was using log and log10
> (simulated in code), I think it is pretty clear that the OP is primarily
> interested in small numbers (with lengths like 3 and 4...)

For which that is pointless. Go to text, count digits.

> Anyway, it is, of course, all interesting.

Numerically the log approach is interesting. It's a problem in which there are
discrete steps, like from 999 to 1000, which is a ~0.1% jump. Piece of cake
for floating-point right? But at the same time, there are exact integral
solutions, which, if they are underestimated by a hairbreadth, will convert to
the wrong integer.

Is there a way to make that approach work?

One would be to do the rounding, but on a fine precision. I mistakenly
wrote about adding 0.5 earlier, but in fact, adding a *tiny* epsilon
before the integer conversion could work: a number which is not so large that
it causes the wrong answer for the largest 99999...999 that we care about.
A tiny epsilon that will tip 2.9999...9x to 3.000000...0y and correct
the truncation.
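
A sketch of the tiny-epsilon variant follows. The wrapper name `ndigits_eps` and the value 1e-9 are assumptions chosen for illustration. With this epsilon the result should be right for integers of up to about eight digits; beyond that, 99999...999 lies closer than 1e-9 (in log10 terms) to the next power of ten and gets tipped over as well.

```shell
# Hypothetical helper: log-based digit count with a small epsilon added
# before truncation. Assumes an integer argument >= 1 with few digits.
ndigits_eps() {
    awk -v n="$1" 'BEGIN {
        eps = 1e-9                           # illustrative choice, see note above
        print int(log(n) / log(10) + eps) + 1
    }'
}

ndigits_eps 999
ndigits_eps 1000
```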

Another approach is to assume that the result is off by one, and test
it. Suppose you're told that D is the number of digits in N, but it
could be off by one in either direction. What can you do?

Something like:

    M = simple_pow(10, D)
    L = M/10
    H = M*10

    OUT = D    # default: D is correct
    if (N < M) {
        # D is correct or overestimates
        if (N < L)
            OUT = D - 1
    } else {
        # D underestimates: D + 1 must be right
        OUT = D + 1
    }

Simple_pow is:

function simple_pow(x, y,    acc)
{
    if (y == 0)
        return 1
    acc = x
    while (--y)
        acc *= x
    return acc
}

Of course we can also make "real" pow like this:

function pow(x, y)
{
    return exp(log(x) * y)
}
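
Put together as a runnable sketch, the estimate-and-correct idea might look like this. The wrapper name `ndigits_checked` is made up for this example, and the input is assumed to be an integer of at least 1 that a double holds exactly.

```shell
# Hypothetical helper: take the log-based estimate, then correct it by
# comparing against exact powers of ten. Assumes an integer argument >= 1.
ndigits_checked() {
    awk -v n="$1" '
    function simple_pow(x, y,    acc) {
        if (y == 0)
            return 1
        acc = x
        while (--y)
            acc *= x
        return acc
    }
    BEGIN {
        d = int(log(n) / log(10)) + 1    # estimate, possibly off by one
        m = simple_pow(10, d)            # 10^d
        if (n >= m)
            d++                          # estimate was too low
        else if (n < m / 10)
            d--                          # estimate was too high
        print d
    }'
}

ndigits_checked 999
ndigits_checked 1000
```

The comparisons use powers of ten built by repeated multiplication, so the final answer does not depend on which side of the integer the log quotient happened to land.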

Janis Papanagnou

May 15, 2015, 1:35:49 PM
On 05/15/15 19:05, Kaz Kylheku wrote:
> [...]
>
> One would be to do the rounding, but on a fine precision. I mistakenly
> wrote about adding 0.5 earlier, but in fact, adding a *tiny* epsilon

Exactly. (I wouldn't want to do such adjustments manually, though.)

> before the integer conversion could work: a number which is not so large that
> it causes the wrong answer for the largest 99999...999 that we care about.
> A tiny epsilon that will tip 2.9999...9x to 3.000000...0y and correct
> the truncation.

I was under the impression that such "exact rounding on small scales"
would be part of the FP libraries (or even FP coprocessors).

Janis

> [...]

Kaz Kylheku

May 15, 2015, 2:58:39 PM
Yes; unfortunately, truncation isn't rounding, at any scale, right?

If the value is even the smallest possible epsilon below 3.0, it
goes to 2.

In this particular problem, we need values which are *near* 3.0 to
go to the integer 3. But "near" does not mean "[2.5, 3.5)".

It's more like [3 - epsilon, 4.0 - epsilon) -> 3.

In general, it seems there is utility in float->integer truncation which
exhibits an area of "gravity" around the integers. There are situations
in which 2.999999999999 is intended to be an approximation of 3.0
as far as the application is concerned, even though 2.99997 clearly
goes to 2.

Janis Papanagnou

May 15, 2015, 3:48:05 PM
On 05/15/15 20:58, Kaz Kylheku wrote:
> On 2015-05-15, Janis Papanagnou <janis_pa...@hotmail.com> wrote:
>> On 05/15/15 19:05, Kaz Kylheku wrote:
>>> [...]
>>>
>>> One would be to do the rounding, but on a fine precision. I mistakenly
>>> wrote about adding 0.5 earlier, but in fact, adding a *tiny* epsilon
>>
>> Exactly. (I wouldn't want to do such adjustments manually, though.)
>>
>>> before the integer conversion could work: a number which is not so large that
>>> it causes the wrong answer for the largest 99999...999 that we care about.
>>> A tiny epsilon that will tip 2.9999...9x to 3.000000...0y and correct
>>> the truncation.
>>
>> I was under the impression that such "exact rounding on small scales"
>> would be part of the FP libraries (or even FP coprocessors).
>
> Yes; unfortunately, truncation isn't rounding, at any scale, right?

Also here we should distinguish the rounding on small scale vs. rounding OR
truncation on large (function) scale. The rounding that I am speaking about
is the small scale thing; that's what I seem to recall having read in Goldberg's
paper (or was it some IEEE spec about even/odd least significant bits to be
handled in a specific way; one round up one down?). Or what I saw in that
scientific calculator library 35 years ago, doing some clever things to get
the results (like the one in my example) right. IOW; the small scale thing
to (also?) overcome the inherent binary representation problem, the large
scale thing as a user-chosen function in the mathematical application domain.

Janis

> [...]

glen herrmannsfeldt

May 15, 2015, 4:40:47 PM
Janis Papanagnou <janis_pa...@hotmail.com> wrote:
> On 05/15/15 20:58, Kaz Kylheku wrote:
>> On 2015-05-15, Janis Papanagnou <janis_pa...@hotmail.com> wrote:

(snip on int(log10(1000))

>>> Exactly. (I wouldn't want to do such adjustments manually, though.)
(snip)
>>> I was under the impression that such "exact rounding on small scales"
>>> would be part of the FP libraries (or even FP coprocessors).

>> Yes; unfortunately, truncation isn't rounding, at any scale, right?

> Also here we should distinguish the rounding on small scale vs. rounding OR
> truncation on large (function) scale. The rounding that I am speaking about
> is the small scale thing; that what I seem to recall had read in Goldberg's
> paper (or was it some IEEE spec about even/odd least significant bits to be
> handled in a specific way; one round up one down?).

OK, but there is no right answer to this problem on a binary machine.
A machine with decimal floating point (IBM makes some of those) could
get it right in the case of log10(). Less obvious in the case of
log(x)/log(10).

I would add a tiny amount before the int, but note that at some point
int(log(999.999999)/log(10)+x) will go to 3.

> Or what I saw in that scientific calculator library 35 years ago,
> doing some clever things to get the results (like the one in my
> example) right. IOW; the small scale thing to (also?) overcome
> the inherent binary representation problem, the large scale thing
> as a user-chosen function in the mathematical application domain.

I have seen people do this for plot labels. Usually rounding up a
tiny bit isn't so bad, rounding down a tiny bit is.

-- glen

Kenny McCormack

May 15, 2015, 4:55:27 PM
In article <mj5lk4$jf1$1...@speranza.aioe.org>,
glen herrmannsfeldt <g...@ugcs.caltech.edu> wrote:
...
>OK, but there is no right answer to this problem on a binary machine.
>A machine with decimal floating point (IBM makes some of those) could
>get it right in the case of log10(). Less obvious in the case of
>log(x)/log(10).

Note, FWIW, that I *did* get the right answer when I called "log10"
directly from the C library (using "call_any()"). No other tricks were
necessary in order to get the right answer.

I have no doubt that if either of the following become true at some future
point, everybody will easily get the right answer:
1) log10() is added to "core" gawk.
2) Someone writes and distributes an extension library that implements
log10().

--
The last time a Republican cared about you, you were a fetus.

Janis Papanagnou

May 16, 2015, 10:44:37 AM
On 05/15/15 22:40, glen herrmannsfeldt wrote:
[...]
>
> OK, but there is no right answer to this problem on a binary machine.
> A machine with decimal floating point (IBM makes some of those) could
> get it right in the case of log10(). Less obvious in the case of
> log(x)/log(10).

I wouldn't have posted if I hadn't noticed ksh doing the calculation
right.

>
> I would add a tiny amount before the int, but note that at some point
> int(log(999.999999)/log(10)+x) will go to 3.
>
>> Or what I saw in that scientific calculator library 35 years ago,
>> doing some clever things to get the results (like the one in my
>> example) right. IOW; the small scale thing to (also?) overcome
>> the inherent binary representation problem, the large scale thing
>> as a user-chosen function in the mathematical application domain.
>
> I have seen people do this for plot labels. Usually rounding up a
> tiny it isn't so bad, rounding down a tiny bit is.

(I've also done it in the past, incidentally exactly when correcting
some plot program discrepancies. But, AFAICT, mainly because those
programs hadn't implemented it correctly either; for a scale plot some
programmers still seem to add FP increments in a loop, instead of doing
an integer loop and scaling the result appropriately. Anyway. Just BTW.)
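
To illustrate the plot-scale remark, here is a minimal sketch that reaches
the tick value 1.0 in ten steps of 0.1 in both ways. The function name
"ticks" and the step size are assumptions made up for this illustration;
they do not come from any plot program mentioned in the thread.

```shell
# Compare two ways of reaching the tick value 1.0 in ten steps of 0.1.
ticks() {
    awk 'BEGIN {
        # Variant 1: add the floating-point increment in a loop.
        # Each addition rounds, and the rounding errors accumulate.
        acc = 0
        for (i = 1; i <= 10; i++) acc += 0.1

        # Variant 2: loop over integers and scale each tick on its own.
        # Every tick is the result of a single division, so nothing accumulates.
        for (i = 1; i <= 10; i++) scaled = i / 10

        printf "accumulated: %.17g (%s)\n", acc, (acc == 1 ? "equal to 1" : "not equal to 1")
        printf "scaled:      %.17g (%s)\n", scaled, (scaled == 1 ? "equal to 1" : "not equal to 1")
    }'
}
ticks
```

With IEEE 754 doubles the accumulated value ends up slightly below 1, while
the scaled value is exactly 1.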

Janis

Luuk

May 16, 2015, 12:37:50 PM
On 14-5-2015 07:41, Janis Papanagnou wrote:
> To get the number of digits of a decimal number we can use the log function.
> Ideally that would be log10(x), but since there's none available in awk we
> need to calculate log(x)/log(10). This seems to work fine in gawk:
>
> $ awk 'BEGIN {print (log(999)/log(10))+1}'
> 3.99957
> $ awk 'BEGIN {print (log(1000)/log(10))+1}'
> 4
>
> But to get whole integer numbers for the number of digits we also need to
> apply the int() function:
>
> $ awk 'BEGIN {print int(log(999)/log(10))+1}'
> 3
> $ awk 'BEGIN {print int(log(1000)/log(10))+1}'
> 3
>


Indeed the log() function to calculate the number of digits...

But is this not simpler? (and not suffering from rounding errors ;)


$ awk 'func noi(x){ y=index(x ".",".")-1; return length(substr(x,0,y))
}BEGIN{ print noi(1000) }'
4
$ awk 'func noi(x){ y=index(x ".",".")-1; return length(substr(x,0,y))
}BEGIN{ print noi(999) }'
3

Janis Papanagnou

May 16, 2015, 1:30:51 PM
On 05/16/15 18:37, Luuk wrote:
> On 14-5-2015 07:41, Janis Papanagnou wrote:
>> To get the number of digits of a decimal number we can use the log function.
>> Ideally that would be log10(x), but since there's none available in awk we
>> need to calculate log(x)/log(10). This seems to work fine in gawk:
>>
>> $ awk 'BEGIN {print (log(999)/log(10))+1}'
>> 3.99957
>> $ awk 'BEGIN {print (log(1000)/log(10))+1}'
>> 4
>>
>> But to get whole integer numbers for the number of digits we also need to
>> apply the int() function:
>>
>> $ awk 'BEGIN {print int(log(999)/log(10))+1}'
>> 3
>> $ awk 'BEGIN {print int(log(1000)/log(10))+1}'
>> 3
>>
>
>
> Indeed the log() function to calculate the number of digits...
>
> But is this not simpler? (and not suffering from rounding errors ;)

You are correct in the observation that when working on strings you don't
suffer from all those arithmetic rounding issues. But I fail to see why the
code below is simpler than just using "length(x)" (as proposed upthread) for
the case of simple decimal numbers (which was my question).

>
>
> $ awk 'func noi(x){ y=index(x ".",".")-1; return length(substr(x,0,y)) }BEGIN{
> print noi(1000) }'
> 4
> $ awk 'func noi(x){ y=index(x ".",".")-1; return length(substr(x,0,y)) }BEGIN{
> print noi(999) }'
> 3

(Note: I suppose your code needs GNU awk, and even a recent one, to handle
values like 123.4e95, right? And leading signs need separate processing.)

Janis

>

Kenny McCormack

May 16, 2015, 1:38:04 PM
In article <mj7us9$me$1...@news.m-online.net>,
Janis Papanagnou <janis_pa...@hotmail.com> wrote:
...
>(Note: I suppose your code needs GNU awk and even a recent one to handle
>values like 123.4e95, right? And leading signs need separate processing.)

It looks to me like Luuk was trying to solve, via string-mashing code, the
same problem that I was letting "sprintf" and "%d" handle - that of
integerizing the passed value.

You (as the OP) have made it clear that you don't require that fine-grained
a solution, so, as you say, just simply using "length" is good enough.

And note that where I had proposed:

print length(sprintf("%d",x))

one could more simply just do:

print length(int(x))
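
As a minimal sketch of the string-length approach wrapped for reuse: the
helper name "ndigits" is hypothetical, and the sign handling is an addition
prompted by the "leading signs need separate processing" remark upthread.
It assumes values small enough that awk prints the truncated number as a
plain integer.

```shell
# Count the digits of the integer part of a number by string length,
# avoiding log() entirely.
ndigits() {
    awk -v x="$1" 'BEGIN {
        n = int(x)            # truncate toward zero
        if (n < 0) n = -n     # a leading minus sign is not a digit
        print length(n "")    # integral values convert to a string like "%d"
    }'
}

ndigits 1000
ndigits 999
ndigits -42
```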

P.S. I'm not really sure why you think that you need GAWK (and/or a recent
version of GAWK) to handle "123.4e95". Care to elaborate?

--
"The anti-regulation business ethos is based on the charmingly naive notion
that people will not do unspeakable things for money." - Dana Carpender

Quoted by Paul Ciszek (pciszek at panix dot com). But what I want to know
is why is this diet/low-carb food author making pithy political/economic
statements?

Nevertheless, the above quote is dead-on, because, the thing is - business
in one breath tells us they don't need to be regulated (which is to say:
that they can morally self-regulate), then in the next breath tells us that
corporations are amoral entities which have no obligations to anyone except
their officers and shareholders, then in the next breath they tell us they
don't need to be regulated (that they can morally self-regulate) ...

Janis Papanagnou

May 16, 2015, 1:58:08 PM
On 05/16/15 19:38, Kenny McCormack wrote:
>
> P.S. I'm not really sure why you think that you need GAWK (and/or a recent
> version of GAWK) to handle "123.4e95". Care to elaborate?

I don't recall having seen (in old, non-GNU awks) long FP numbers expanded like:

$ awk 'BEGIN {print 123.45e95}'
12344999999999999126881965579722334747027638465843291769040083971836900515121603858307419692072960


Do I misremember?


BTW, I am also puzzled at the moment about this:

$ awk 'BEGIN {OFMT="%e" ; print 123.45 }'
1.234500e+02

$ awk 'BEGIN {OFMT="%e" ; print 123.45e95 }'
12344999999999999126881965579722334747027638465843291769040083971836900515121603858307419692072960


Janis

glen herrmannsfeldt

May 16, 2015, 3:02:51 PM
Janis Papanagnou <janis_pa...@hotmail.com> wrote:
> On 05/15/15 22:40, glen herrmannsfeldt wrote:
> [...]

>> OK, but there is no right answer to this problem on a binary machine.
>> A machine with decimal floating point (IBM makes some of those) could
>> get it right in the case of log10(). Less obvious in the case of
>> log(x)/log(10).

> I wouldn't have posted if I hadn't noticed ksh doing the calculation
> right.

I use tcsh, but I just looked and don't see anything in the ksh man
page about log.

Most shells barely do integer arithmetic, so I would be surprised to
see them doing floating point.

gawk does floating point, with the usual floating-point results.

log(1000) (that is, base e) is transcendental, and has no finite
precision exact value in any base.

On systems, such as Fortran, that have a log10() function, it is
usually generated by computing a log2 value and multiplying by
the appropriate constant. (On machines with base 2 floating
point, the exponent is used for the whole number part of the
log2 result.)

I don't know at all the state of math libraries for decimal
floating point systems. On those, you might have some expectation
that log10(1000) is exactly 3, but maybe not even then.
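
A minimal sketch for looking at what is actually stored, which may help
explain why print and int() seem to disagree in the original posting: print
converts non-integral numbers through OFMT (default "%.6g"), so a quotient a
hair below 3 is displayed as 3, while int() truncates the stored value. The
function name "log10_views" is made up for this illustration, and the digits
shown by "%.17g" depend on the platform's math library.

```shell
# Show one quotient three ways: as print displays it, with 17 significant
# digits, and after truncation by int().
log10_views() {
    awk -v x="$1" 'BEGIN {
        q = log(x) / log(10)
        print q                  # non-integral values go through OFMT ("%.6g")
        printf "%.17g\n", q      # enough digits to tell neighbouring doubles apart
        print int(q)             # truncation sees the stored value, not the rounded text
    }'
}

log10_views 1000
```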

-- glen

Luuk

May 16, 2015, 4:12:08 PM
I happen to have an old version of gawk on my Windows PC

D:\wbin>gawk "BEGIN { print 123.45e95 }"
1.2345e+097

D:\wbin>gawk "BEGIN {OFMT=\"%e\" ; print 123.45 }"
1.234500e+002

D:\wbin>gawk "BEGIN {OFMT=\"%e\" ; print 123.45 }"
1.234500e+002

D:\wbin>gawk "BEGIN {OFMT=\"%e\" ; print 123.45e95 }"
1.234500e+097

D:\wbin>gawk -W version
GNU Awk 3.1.0
Copyright (C) 1989, 1991-2001 Free Software Foundation.

This program is free software; you can redistribute it and/or modify
it under the terms of the GNU General Public License as published by
the Free Software Foundation; either version 2 of the License, or
(at your option) any later version.

This program is distributed in the hope that it will be useful,
but WITHOUT ANY WARRANTY; without even the implied warranty of
MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE. See the
GNU General Public License for more details.

You should have received a copy of the GNU General Public License
along with this program; if not, write to the Free Software
Foundation, Inc., 59 Temple Place - Suite 330, Boston, MA 02111-1307, USA.

D:\wbin>



P.S. The change of subject was indeed because I did not read the entire
original thread, which had the 'workaround' printf("%d", nr)


Janis Papanagnou

May 16, 2015, 7:39:08 PM
On 05/16/15 21:02, glen herrmannsfeldt wrote:
> Janis Papanagnou <janis_pa...@hotmail.com> wrote:
>> On 05/15/15 22:40, glen herrmannsfeldt wrote:
>> [...]
>
>>> OK, but there is no right answer to this problem on a binary machine.
>>> A machine with decimal floating point (IBM makes some of those) could
>>> get it right in the case of log10(). Less obvious in the case of
>>> log(x)/log(10).
>
>> I wouldn't have posted if I hadn't noticed ksh doing the calculation
>> right.
>
> I use tcsh, but I just looked and don't see anything in the ksh man
> page about log.

Which ksh's man page? - The official one can be found here:
http://www2.research.att.com/sw/download/man/man1/ksh.html
Search for the paragraph "Arithmetic Evaluation":
"[...] Evaluations are performed using double precision
floating point arithmetic or long double precision
floating point for systems that provide this data type."

>
> Most shells barely do integer arithmetic, so I would be surprised to
> see them doing floating point.

Your surprise has no factual significance. Especially since I
copy/pasted the shell code and result in my original posting
already.

Janis

> [...]

Janis Papanagnou

May 16, 2015, 7:42:14 PM
On 05/16/15 22:12, Luuk wrote:
> On 16-5-2015 19:58, Janis Papanagnou wrote:
[...]
>> BTW, I am also puzzled at the moment about this:
>>
>> $ awk 'BEGIN {OFMT="%e" ; print 123.45 }'
>> 1.234500e+02
>>
>> $ awk 'BEGIN {OFMT="%e" ; print 123.45e95 }'
>> 12344999999999999126881965579722334747027638465843291769040083971836900515121603858307419692072960
>
> I happen to have an old version of gawk on my Windows PC

Thanks for the confirmation! That's exactly how I remembered it.

>
> D:\wbin>gawk "BEGIN { print 123.45e95 }"
> 1.2345e+097
>
> D:\wbin>gawk "BEGIN {OFMT=\"%e\" ; print 123.45 }"
> 1.234500e+002
>
> D:\wbin>gawk "BEGIN {OFMT=\"%e\" ; print 123.45 }"
> 1.234500e+002
>
> D:\wbin>gawk "BEGIN {OFMT=\"%e\" ; print 123.45e95 }"
> 1.234500e+097
>
> D:\wbin>gawk -W version
> GNU Awk 3.1.0
> [...]


Janis

glen herrmannsfeldt

May 16, 2015, 9:54:25 PM
Janis Papanagnou <janis_pa...@hotmail.com> wrote:

(snip)
>> I use tcsh, but I just looked and don't see anything in the ksh man
>> page about log.

> Which ksh's man page? - The official one can be found here:
> http://www2.research.att.com/sw/download/man/man1/ksh.html
> Search for the paragraph "Arithmetic Evaluation":
> "[...] Evaluations are performed using double precision
> floating point arithmetic or long double precision
> floating point for systems that provide this data type."

Seems that on this system, ksh actually runs zsh, and man ksh
gives the zsh man page, which doesn't say anything about log.

Can you give the input to ksh which shows the log result?

I tried some things, but none worked.

>> Most shells barely do integer arithmetic, so I would be surprised to
>> see them doing floating point.

> Your surprise has no factual significance. Especially since I
> copy/pasted the shell code and result in my original posting
> already.

In any case, pretty much in general for floating point you can't
expect exact results. If things work well, the results will only be
off in the last bit, which for cases like this means a 50% chance
that it does what you expect.

It might be that there are some versions of log10() that special
case exact, within the available precision, powers of 10, but I don't
know any that do that. If you want to round up, add an appropriate
(for the case at hand) small amount.
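
A minimal sketch of the "add a small amount" idea: the helper name
"digits_eps" and the tolerance 1e-9 are assumptions chosen for illustration,
not values from the thread. As noted upthread, any such tolerance means that
inputs lying extremely close below a power of ten get counted as if they had
reached it.

```shell
# Digit count via log(), nudged upward by a small tolerance before
# truncation so that exact powers of ten are not undercounted.
digits_eps() {
    awk -v x="$1" 'BEGIN {
        eps = 1e-9    # hypothetical tolerance; pick one suited to the data
        print int(log(x) / log(10) + eps) + 1
    }'
}

digits_eps 999
digits_eps 1000
```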

-- glen

Janis Papanagnou

May 17, 2015, 9:32:05 AM
On 05/17/15 03:54, glen herrmannsfeldt wrote:
> Janis Papanagnou <janis_pa...@hotmail.com> wrote:
>
> (snip)
>>> I use tcsh, but I just looked and don't see anything in the ksh man
>>> page about log.
>
>> Which ksh's man page? - The official one can be found here:
>> http://www2.research.att.com/sw/download/man/man1/ksh.html
>> Search for the paragraph "Arithmetic Evaluation":
>> "[...] Evaluations are performed using double precision
>> floating point arithmetic or long double precision
>> floating point for systems that provide this data type."
>
> Seems that on this system, ksh actually runs zsh, and man ksh
> gives the zsh man page, which doesn't say anything about log.
>
> Can you give the input to ksh which shows the log result?

I thought I did that already upthread. Here's some code again...

$ ksh --version
version sh (AT&T Research) 93u+ 2012-08-01

$ ksh -c 'print $((log10(1000))) $((log(1000))) $((log(10))) $((log(1000)/log(10)))'
3 6.90775527898213705 2.30258509299404568 3


Janis

> [...]


Andrew Schorr

May 17, 2015, 11:12:41 AM
On Saturday, May 16, 2015 at 7:42:14 PM UTC-4, Janis Papanagnou wrote:
> On 05/16/15 22:12, Luuk wrote:
> > On 16-5-2015 19:58, Janis Papanagnou wrote:
> [...]
> >> BTW, I am also puzzled at the moment about this:
> >>
> >> $ awk 'BEGIN {OFMT="%e" ; print 123.45 }'
> >> 1.234500e+02
> >>
> >> $ awk 'BEGIN {OFMT="%e" ; print 123.45e95 }'
> >> 12344999999999999126881965579722334747027638465843291769040083971836900515121603858307419692072960
> >
> > I happen to have an old version of gawk on my Windows PC
>
> Thanks for the confirmation! That's exactly how I remembered it.
>
> >
> > D:\wbin>gawk "BEGIN { print 123.45e95 }"
> > 1.2345e+097

This behavior changed in gawk 3.1.6. In 3.1.5, if a numeric value is integral and can fit into a long int, it is converted to a string using "%ld"; otherwise, the conversion is done using OFMT. In 3.1.6, this was changed to print integral values that don't fit inside a long int using "%.0f". I'm not sure whether this new behavior is an improvement. One might argue that "%ld" should be used only for integer values that can be represented accurately inside an IEEE 754 double precision value, i.e. integers in the range [-2^53, 2^53].
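
A minimal sketch of two related checks, with made-up function names
("big_as_e", "beyond_2_53"). The first uses printf with an explicit format,
which does not go through the number-to-string conversion that print
applies, so the exponent form comes out regardless of version. The second
shows why 2^53 is the natural limit for exactly represented integers,
assuming awk numbers are IEEE 754 doubles.

```shell
# Force exponent notation with an explicit printf format instead of
# relying on how print converts a large integral value.
big_as_e() {
    awk 'BEGIN { printf "%e\n", 123.45e95 }'
}

# Above 2^53 not every integer has its own double: 2^53 + 1 rounds
# back to 2^53.
beyond_2_53() {
    awk 'BEGIN { print ((2^53 + 1 == 2^53) ? "same" : "different") }'
}

big_as_e
beyond_2_53
```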

Regards,
Andy

glen herrmannsfeldt

May 17, 2015, 11:23:26 AM
Janis Papanagnou <janis_pa...@hotmail.com> wrote:

(snip)
>> Seems that on this system, ksh actually runs zsh, and man ksh
>> gives the zsh man page, which doesn't say anything about log.

>> Can you give the input to ksh which shows the log result?

> I thought I did that already upthread. Here's some code again...

> $ ksh --version
> version sh (AT&T Research) 93u+ 2012-08-01
>
> $ ksh -c 'print $((log10(1000))) $((log(1000))) $((log(10)))
> $((log(1000)/log(10)))'
> 3 6.90775527898213705 2.30258509299404568 3

OK, I was still thinking about gawk when I saw those, and not
thinking about a ksh print statement. (I normally use tcsh, and
don't follow features of ksh much at all.)

Seems that I have zsh 4.3.10, which does know about $((...))
expressions, but doesn't know log or sqrt. It doesn't know floating
point, either, as print $((1/3)) gives 0.

-- glen

Janis Papanagnou

May 17, 2015, 11:31:21 AM
On 05/17/15 17:23, glen herrmannsfeldt wrote:
> Janis Papanagnou <janis_pa...@hotmail.com> wrote:
>
> (snip)
>>> Seems that on this system, ksh actually runs zsh, and man ksh
>>> gives the zsh man page, which doesn't say anything about log.
>
>>> Can you give the input to ksh which shows the log result?
>
>> I thought I did that already upthread. Here's some code again...
>
>> $ ksh --version
>> version sh (AT&T Research) 93u+ 2012-08-01
>>
>> $ ksh -c 'print $((log10(1000))) $((log(1000))) $((log(10)))
>> $((log(1000)/log(10)))'
>> 3 6.90775527898213705 2.30258509299404568 3
>
> OK, I was still thinking about gawk when I saw those, and not
> thinking about a ksh print statement.

(Note, in the samples of my original posting I had used 'echo'.)

> (I normally use tcsh, and
> don't follow features of ksh much at all.)

(For programming you should use shells from the Bourne family.)

>
> Seems that I have zsh 4.3.10, which does know about $((...))
> expressions,

Of course; since it's defined by POSIX.

> but doesn't know log or sqrt.

Which are not required by POSIX.

> It doesn't know floating
> point, either, as print $((1/3)) gives 0.

Most shells do not support FP.

Janis

>
> -- glen
>
