Recursive function does not work when the if condition is >= 1024, but 1000 is no problem. Why?


Homer Li

Jun 2, 2016, 5:32:00 AM
Hello,
I want to automatically convert directory sizes to a suitable unit.

# cat /tmp/test
10000 /dir
1001969264 /dir0
1002969264 /dir1
1005969264 /dir2
2005969264 /dir3

I need output like this (num>=1000):
# sh /tmp/tmp
9.77 MiB /dir
955.55 GiB /dir0
956.51 GiB /dir1
959.37 GiB /dir2
1.87 TiB /dir3

# cat /tmp/tmp
/opt/gawk-4.1.3/bin/awk '
function calc(num,i,value){
if (num>=1000) calc(num/1024,i+1,value)
else printf "%.2f %s %s\n", num,a[i+1],value
};
BEGIN {
split("KiB MiB GiB TiB PiB EiB ZiB YiB",a)
};
{b[$1]=$2};
END{
for (i in b) {
calc(i,0,b[i])
}
}' /tmp/test


If I change "if (num>=1000) calc(num/1024,i+1,value)" to "if (num>=1024) calc(num/1024,i+1,value)",
only dir3 is converted. Why?

# sh /tmp/tmp1
10000.00 KiB /dir
1001969264.00 KiB /dir0
1002969264.00 KiB /dir1
1005969264.00 KiB /dir2
1.87 TiB /dir3

# cat /tmp/tmp1
/opt/gawk-4.1.3/bin/awk '
function calc(num,i,value){
if (num>=1024) calc(num/1024,i+1,value)
else printf "%.2f %s %s\n", num,a[i+1],value
};
BEGIN {
split("KiB MiB GiB TiB PiB EiB ZiB YiB",a)
};
{b[$1]=$2};
END{
for (i in b) {
calc(i,0,b[i])
}
}' /tmp/test


Thanks.


Dave Sines

Jun 2, 2016, 10:24:36 AM
String comparison -- "1002969264" is less than "1024". This is happening
because you're invoking the calc function in the END block with a string
(array indices are strings) as its first argument.

Force numeric comparison in the calc function:

if (num + 0 >= 1024) calc(num/1024, i+1, value)
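
For illustration, a minimal sketch of the difference between the two comparisons. It assumes a POSIX-style awk on the PATH; the function name compare_index and the array contents are made up for this example and are not part of the original script.

```shell
# Hypothetical helper: compare an array index against 1024 in two ways.
compare_index() {
    awk 'BEGIN {
        b["1002969264"] = "/dir1"
        for (k in b) {
            # k is an array index, i.e. a plain string, so this is a
            # string comparison: "1002969264" sorts before "1024".
            r = (k >= 1024) ? ">= 1024" : "< 1024"
            print "as-is: " r
            # k + 0 is a number, so this is a numeric comparison.
            r = (k + 0 >= 1024) ? ">= 1024" : "< 1024"
            print "plus zero: " r
        }
    }'
}
compare_index
```

On an awk that follows POSIX here, this is expected to print "as-is: < 1024" followed by "plus zero: >= 1024".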

Homer Li

Jun 2, 2016, 9:56:48 PM
Hi Dave, thank you.

Geoff Clare

Jun 3, 2016, 8:41:03 AM
Dave Sines wrote:

> Homer Li <01ja...@gmail.com> wrote:
>>
>> # cat /tmp/test
>> 10000 /dir
>> 1001969264 /dir0
>> 1002969264 /dir1
>> 1005969264 /dir2
>> 2005969264 /dir3
>>
[...]
>> # cat /tmp/tmp
>> /opt/gawk-4.1.3/bin/awk '
>> function calc(num,i,value){
>> if (num>=1000) calc(num/1024,i+1,value)
>> else printf "%.2f %s %s\n", num,a[i+1],value
>> };
>> BEGIN {
>> split("KiB MiB GiB TiB PiB EiB ZiB YiB",a)
>> };
>> {b[$1]=$2};
>> END{
>> for (i in b) {
>> calc(i,0,b[i])
>> }
>> }' /tmp/test
>>
>>
>> if I modify "if (num>=1000) calc(num/1024,i+1,value)" to
>> "if (num>=1024) calc(num/1024,i+1,value)".
>>
>> Only dir3 will be switched. why ?
>
> String comparison -- "1002969264" is less than "1024". This is happening
> because you're invoking the calc function in the END block with a string
> (array indices are strings) as its first argument.
>
> Force numeric comparison in the calc function:
>
> if (num + 0 >= 1024) calc(num/1024, i+1, value)

A better fix would be to use the directory name as the array index
instead of the size, as the current code will lose data if two
directories happen to have the same size. I.e.:

{b[$2]=$1}
END{
for (i in b) {
calc(b[i],0,i)
}
}' /tmp/test

Awk should automatically do a numeric comparison in this case, because
b[i] got its value from a field variable.
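
Put together, the name-keyed variant might look like the sketch below. It reads the data from standard input instead of /tmp/test so that it is self-contained, assumes (as the expected output in the original post implies) that the sizes are in KiB, and uses a made-up function name, humanize.

```shell
# Hypothetical wrapper around the name-keyed version of the script.
humanize() {
    awk '
    function calc(num, i, value) {
        if (num >= 1024) calc(num / 1024, i + 1, value)
        else printf "%.2f %s %s\n", num, a[i + 1], value
    }
    BEGIN { split("KiB MiB GiB TiB PiB EiB ZiB YiB", a) }
    # Key: directory name. Value: size, a numeric string taken from a field.
    { b[$2] = $1 }
    END { for (name in b) calc(b[name], 0, name) }
    '
}
# for-in order is unspecified, so sort the result by directory name.
printf '10000 /dir\n1001969264 /dir0\n2005969264 /dir3\n' | humanize | LC_ALL=C sort -k3
```

Two directories with the same size now produce two output lines, because the size is no longer the key.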

--
Geoff Clare <net...@gclare.org.uk>

Marc de Bourget

Jun 3, 2016, 5:49:47 PM
On Friday, June 3, 2016 at 14:41:03 UTC+2, Geoff Clare wrote:
> Dave Sines wrote:
>
Very good hint, Geoff!

One more hint: With TAWK, there is no problem with the original issue,
where num is a string (it is an array index, as Dave mentioned correctly):
function calc(num,i,value){
if (num >= 1024)
...
}

TAWK decides to make a numeric comparison. From the TAWK manual, page 45:
"If you ... compare a string with a number, TAWK will determine whether to
use numeric or string comparison. If the string consists only of a number,
with optional leading spaces or tabs and an optional + or - sign, then TAWK
converts the string to a number and performs a numeric comparison."

I am not sure whether this is good or bad; I only add it as a hint.
It is probably better to force a numeric comparison by adding 0.
The TAWK manual also states that it is better to force the comparison.

Andrew Schorr

Jun 6, 2016, 8:03:09 PM
On Friday, June 3, 2016 at 5:49:47 PM UTC-4, Marc de Bourget wrote:
> TAWK decides to make a numeric comparison. From the TAWK manual, page 45:
> "If you ... compare a string with a number, TAWK will determine whether to
> use numeric or string comparison. If the string consists only of a number,
> with optional leading spaces or tabs and an optional + or - sign, then TAWK
> converts the string to a number and performs a numeric comparison."
>
> I am not sure whether this is good or bad; I only add it as a hint.
> It is probably better to force a numeric comparison by adding 0.
> The TAWK manual also states that it is better to force the comparison.

I believe this TAWK behavior is in violation of the POSIX awk specification:
http://pubs.opengroup.org/onlinepubs/9699919799/utilities/awk.html

"Comparisons (with the '<', "<=", "!=", "==", '>', and ">=" operators) shall be made numerically if both operands are numeric, if one is numeric and the other has a string value that is a numeric string, or if one is numeric and the other has the uninitialized value. Otherwise, operands shall be converted to strings as required and a string comparison shall be made using the locale-specific collation sequence. The value of the comparison expression shall be 1 if the relation is true, or 0 if the relation is false."

Elsewhere, it states clearly that array subscripts shall be strings:

"The awk language supplies arrays that are used for storing numbers or strings. Arrays need not be declared. They shall initially be empty, and their sizes shall change dynamically. The subscripts, or element identifiers, are strings, providing a type of associative array capability."

And it defines elsewhere under what circumstances a value may be treated as a numeric string, and array subscripts do not qualify.

As such, it is wiser to write portable code by using +0 to force a numeric comparison.
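
A small sketch of the three cases, assuming an awk that follows the POSIX rules quoted above. The function name classify and the labels are made up for this example; "numeric" or "string" is inferred from whether the comparison with 1024 comes out true.

```shell
# Hypothetical helper: which kind of comparison does each operand trigger?
classify() {
    printf '1002969264\n' | awk '{
        # $1 comes from input and looks like a number: a numeric string.
        r = ($1 >= 1024) ? "numeric" : "string"
        print "field: " r
        # The same text used as an array subscript is a plain string.
        seen[$1]
        for (k in seen) {
            r = (k >= 1024) ? "numeric" : "string"
            print "subscript: " r
        }
        # A string constant is never a numeric string.
        s = "1002969264"
        r = (s >= 1024) ? "numeric" : "string"
        print "constant: " r
    }'
}
classify
```

Under those rules only the field is compared numerically; the subscript and the constant are compared as strings, which is why adding 0 is the portable choice.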

Regards,
Andy


Kenny McCormack

Jun 7, 2016, 7:09:00 AM
In article <f3740875-bea4-4f13...@googlegroups.com>,
Andrew Schorr <asc...@telemetry-investments.com> wrote:
...
>I believe this TAWK behavior is in violation of the POSIX awk specification:
> http://pubs.opengroup.org/onlinepubs/9699919799/utilities/awk.html

TAWK does not strive to, nor does it need to, be POSIX compliant.

For one thing, its treatment of a stand-alone reg exp (see below) is both
clearly superior to and contrary to POSIX:

x = /foo/

>
>Elsewhere, it states clearly that array subscripts shall be strings:
>

Yes, TAWK has always been smart about array indexes that are, in fact,
numbers. This is sensible, since most of the time (though by no means all)
array indexes *are* numbers. I.e., the user is using AWK arrays in pretty
much the same style as one uses arrays in other, more pedestrian,
programming languages.

>
>As such, it is wiser to write portable code by using +0 to force a numeric
>comparison.
>

True. I don't disagree with this.

Still, it is nice to have a programming language do the right thing, even
when we humans slip up.

--
The randomly generated signature file that would have appeared here is more than 4
lines in length. As such, it violates one or more Usenet RFPs. In order to remain in
compliance with said RFPs, the actual sig can be found at the following web address:
http://www.xmission.com/~gazelle/Sigs/Windows

Andrew Schorr

Jun 7, 2016, 4:06:00 PM
On Tuesday, June 7, 2016 at 7:09:00 AM UTC-4, Kenny McCormack wrote:
> TAWK does not strive to, nor does it need to, be POSIX compliant.

Fair enough, but this is the comp.lang.awk newsgroup, not comp.lang.tawk. TAWK can do whatever it wants, but it doesn't implement the AWK language, and it's unsupported. So it seems like a waste of time and energy to focus on TAWK. I spend my time working with gawk which implements the AWK language with some useful extensions.

Regards,
Andy

Kenny McCormack

Jun 7, 2016, 4:38:24 PM
In article <2d5bfc28-b3b4-40f6...@googlegroups.com>,
Andrew Schorr <asc...@telemetry-investments.com> wrote:
>On Tuesday, June 7, 2016 at 7:09:00 AM UTC-4, Kenny McCormack wrote:
>> TAWK does not strive to, nor does it need to, be POSIX compliant.
>
>Fair enough, but this is the comp.lang.awk newsgroup, not comp.lang.tawk. TAWK
>can do whatever it wants, but it doesn't implement the AWK language, and it's

Nonsense.

But you are free to use this newsgroup as you see fit, as are the rest of us.

Your work on the 2nd best AWK implementation (*) is greatly appreciated.
It's my 2nd favorite programming language.

(*) And the 1st best open source version of AWK...

P.S. Your attitude (that only POSIX AWK is AWK) is akin to saying that,
say, only American English is English (and that those of us in England,
Australia, India, etc. are not speaking English). Obviously, this is
nonsense.

--
The book "1984" used to be a cautionary tale;
Now it is a "how-to" manual.

Kaz Kylheku

Jun 7, 2016, 4:43:52 PM
On 2016-06-07, Andrew Schorr <asc...@telemetry-investments.com> wrote:
> On Tuesday, June 7, 2016 at 7:09:00 AM UTC-4, Kenny McCormack wrote:
>> TAWK does not strive to, nor does it need to, be POSIX compliant.
>
> Fair enough, but this is the comp.lang.awk newsgroup, not comp.lang.tawk.

Laugh, no, no.

This is the comp.lang.awk newsgroup, not comp.lang.posix.awk.

> it doesn't implement the AWK language and it's unsupported.

And to ensure it is as unsupported as possible, keep it out of the Awk
newsgroup!

> spend my time working with gawk which implements the AWK language with some useful extensions.

When you use Gawk extensions, your *program* is not written in the Awk
language. Discussing those extensions is exactly as off-topic as
discussing the non-POSIX-compliant aspects of Tawk.

(Gawk also needs some --posix flag to be POSIX, and that is almost
never used by anyone. Based on the description of --posix in the
manual, we can write a POSIX test case that doesn't work on Gawk without
that option.)

It is glaring hypocrisy to engage in Gawk-specific discussion, yet
insist that Tawk is off-topic.

--
TXR Programming Language: http://nongnu.org/txr
Music DIY Mailing List: http://www.kylheku.com/diy
ADA MP-1 Mailing List: http://www.kylheku.com/mp1

Andrew Schorr

Jun 7, 2016, 9:10:12 PM
On Tuesday, June 7, 2016 at 4:43:52 PM UTC-4, Kaz Kylheku wrote:
> It is glaring hypocrisy to engage in Gawk-specific discussion, yet
> insist that Tawk is off-topic.

Ha. You got me -- I am undoubtedly a complete and utter hypocrite. :-)

I should not have said that TAWK is off-topic. However, I still don't think it's a great thing that TAWK implements array subscripts in a way that conflicts with the standard and with how other awk implementations behave. FYI, internally, gawk does have optimizations for integer subscripts, so there is a performance benefit in the common case of using integer indices. But the subscript values should always behave as if they were strings. I'm in the midst of trying to fix a few bugs related to this aspect, so it's a topic of some concern at the moment.

Cheers,
Andy

Kaz Kylheku

Jun 7, 2016, 9:27:38 PM
On 2016-06-08, Andrew Schorr <asc...@telemetry-investments.com> wrote:
> On Tuesday, June 7, 2016 at 4:43:52 PM UTC-4, Kaz Kylheku wrote:
>> It is glaring hypocrisy to engage in Gawk-specific discussion, yet
>> insist that Tawk is off-topic.
>
> Ha. You got me -- I am undoubtedly a complete and utter hypocrite. :-)
>
> I should not have said that TAWK is off-topic. However, I still don't
> think it's a great thing that TAWK implements array subscripts in a
> way that conflicts with the standard and with how other awk
> implementations behave.

Though it conflicts in some ways, you can write code that will work
the same way, e.g. array["0001234"+0] = 42; print array[1234].

That is to say, we can abstract away a common dialect (let's call it
"Portable Awk") in which it is "implementation-defined" whether integer
array subscripts are treated as integer keys from strings, or whether
they are always reduced to decimal strings which are then used as keys.

In the more rigidly defined "POSIX Awk" this matter is not
implementation-defined; the keying is required to be string-based.

You can write in "Portable Awk" by avoiding a dependence on that
"implementation-defined" choice, and only use arithmetically
normalized values as array indices: never a string with extra
spaces, leading zeros or a gratuitous plus sign.
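
As a rough sketch of that discipline, the snippet below normalizes a key with "+ 0" before storing it and contrasts that with a raw string key. The function and array names are made up for this example; the result of the second lookup is exactly the kind of thing that would differ on an implementation that treats numeric-looking subscripts as integers.

```shell
# Hypothetical helper: normalized versus raw numeric-looking subscripts.
normalized_keys() {
    awk 'BEGIN {
        # "+ 0" turns the text into the number 1234, so the key is "1234".
        array["0001234" + 0] = 42
        print array[1234]
        # Raw string key: with string-based keying, "0001234" is not "1234".
        raw["0001234"] = 42
        r = (1234 in raw) ? "found" : "not found"
        print r
    }'
}
normalized_keys
```

With string-based keying, as POSIX requires, this is expected to print 42 and then "not found". Code that only ever stores normalized keys never reaches the second situation.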

Are portability issues even a big deal? Does anyone have 250,000 lines
of Awk to port from one Awk to another?

Lastly, is it even a good idea to ever write code in any Awk dialect whose
correctness depends on "001234" and 1234 being distinct indices?

> FYI, internally, gawk does have optimizations
> for integer subscripts, so there is a performance benefit in the
> common case of using integer indices. But the subscript values should
> always behave as if they were strings. I'm in the midst of trying to
> fix a few bugs related to this aspect, so it's a topic of some concern
> at the moment.

Bugs, you say? Would these bugs affect "Portable Awk" programs or just
those foolishly optimistic ones written in POSIX Awk? ;)

Andrew Schorr

Jun 8, 2016, 9:15:34 AM
On Tuesday, June 7, 2016 at 9:27:38 PM UTC-4, Kaz Kylheku wrote:
> Are portability issues even a big deal? Does anyone have 250,000 lines
> of Awk to port from one Awk to another?

Personally, I have perhaps tens of thousands of lines of awk code. I would not like to have to port them.

> Lastly, is it even a good idea to ever write code in any Awk dialect whose
> correctness depends on "001234" and 1234 being distinct indices?

Believe it or not, we receive bug reports about just such things. Maybe it's not a good idea, but eventually somebody will try something like this and complain if it doesn't work properly. One can then debate what "properly" means, but it's helpful to be able to rely upon a specification to define what's supposed to happen.

Regards,
Andy

Kenny McCormack

Jun 8, 2016, 9:28:06 AM
In article <a891b1e2-c295-46d2...@googlegroups.com>,
Andrew Schorr <asc...@telemetry-investments.com> wrote:
>On Tuesday, June 7, 2016 at 9:27:38 PM UTC-4, Kaz Kylheku wrote:
>> Are portability issues even a big deal? Does anyone have 250,000 lines
>> of Awk to port from one Awk to another?
>
>Personally, I have perhaps tens of thousands of lines of awk code. I would not
>like to have to port them.

And nobody is suggesting that you do. I think it is pretty normal to write
an AWK script for a specific version of AWK - be it TAWK, GAWK, MAWK, or
just plain (minimalist) POSIX AWK.

>> Lastly, is it even a good idea to ever write code in any Awk dialect whose
>> correctness depends on "001234" and 1234 being distinct indices?

I'm not really sure what we are talking about here (thread drift and all
that), but the following program generates the same output on both GAWK
(4.1) and TAWK (1996) - i.e., nothing (other than a newline):

BEGIN { A["001234"] = "test";print A[1234]}

>Believe it or not, we receive bug reports about just such things. Maybe it's not
>a good idea, but eventually somebody will try something like this and complain if
>it doesn't work properly. One can then debate what "properly" means, but it's
>helpful to be able to rely upon a specification to define what's supposed to
>happen.

Interesting.

--
BigBusiness types (aka, Republicans/Conservatives/Independents/Libertarians/whatevers)
don't hate big government. They *love* big government as a means for them to get
rich, sucking off the public teat. What they don't like is *democracy* - you know,
like people actually having the right to vote and stuff like that.

Martin Neitzel

Jun 8, 2016, 12:54:02 PM
+1 for any awk variants being on-topic here.

> an AWK script for a specific version of AWK - be it TAWK, GAWK, MAWK, or
> just plain (minimalist) POSIX AWK.

nitpick: "minimalist" would rather be "old awk" (no user-defined
functions, no regex variables, fewer built-in functions).

Martin

Kenny McCormack

Jun 8, 2016, 1:00:02 PM
In article <o8Gp1...@gaertner.de>,
For that matter, there's probably some internal version of AWK that only
supports field splitting and addition...

For all practical purposes, POSIX AWK is the base/minimum nowadays.

(Yes, I know about /bin/awk on Solaris...)

--
Modern Conservative: Someone who can take time out from demanding more
flag burning laws, more abortion laws, more drug laws, more obscenity
laws, and more police authority to make warrantless arrests to remind
us that we need to "get the government off our backs".

Marc de Bourget

Jun 9, 2016, 6:11:28 AM
On Wednesday, June 8, 2016 at 19:00:02 UTC+2, Kenny McCormack wrote:
> For that matter, there's probably some internal version of AWK that only
> supports field splitting and addition...
>

So at least for this one, we could almost use C and strtok() instead :-)

Kaz Kylheku

Jun 9, 2016, 8:48:15 AM
Or maybe an improved, re-entrant one with regexes called strtawk().

Mike Sanders

Jun 10, 2016, 12:00:51 PM
Kaz Kylheku <545-06...@kylheku.com> wrote:

> Or maybe an improved, re-entrant one with regexes called strtawk().

strtok_r() and pcre, next up a repl...

Makes one appreciate the convenience of gawk all the more.

--
Mike Sanders
www.peanut-software.com