Numeric and string comparisons

Harriet Bazley

May 3, 2016, 5:12:04 PM
I've just discovered the reason for a bug in my script: it's reading in
numeric data from a file but unexpectedly comparing the values as strings,
e.g. it thinks 37 is smaller than 7.

As a workaround I've currently put a "*1" into both sides of my if()
statement, i.e. if (views*1 > totalviews*1).

Is there a better way to force a numeric comparison? (Possibly I should be
doing this at the point where I'm reading the input, for one thing....)

--
Harriet Bazley == Loyaulte me lie ==

A poor workman blames his tools.

pop

May 3, 2016, 5:59:29 PM
Harriet Bazley wrote on 5/3/2016 4:01 PM:
> I've just discovered the reason for a bug in my script: it's reading in
> numeric data from a file but unexpectedly comparing the values as strings,
> e.g. it thinks 37 is is smaller than 7.
>
> As a workaround I've currently put a "*1" into both sides of my if()
> statement, i.e. if (views*1 > totalviews*1).
>
> Is there a better way to force a numeric comparison? (Possibly I should be
> doing this at the point where I'm reading the input, for one thing....)
>
This is normal behaviour for awk/gawk comparisons, and what you did is an
accepted workaround. Another workaround is to add 0 to each variable:
if (views+0 > totalviews+0)
There is also a gawk extension, strtonum(), which your awk may not have:
if (strtonum(views) > strtonum(totalviews))
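
A minimal sketch of the behaviour discussed here, assuming only POSIX awk features. The variable names views and totalviews follow the original post; the sample values and the use of substr() to produce string-typed values are assumptions made for illustration.

```shell
# Values produced by a string operation compare as strings unless forced.
as_strings=$(awk 'BEGIN {
    views = substr("37 views", 1, 2)       # "37", string attribute
    totalviews = substr("7 views", 1, 1)   # "7", string attribute
    print (views > totalviews ? "37 > 7" : "37 <= 7")
}')

# Adding 0 to both sides forces a numeric comparison.
as_numbers=$(awk 'BEGIN {
    views = substr("37 views", 1, 2)
    totalviews = substr("7 views", 1, 1)
    print (views+0 > totalviews+0 ? "37 > 7" : "37 <= 7")
}')

echo "string comparison:  $as_strings"
echo "numeric comparison: $as_numbers"
```

The first comparison goes character by character, so "3" sorts before "7" and 37 appears smaller; the second compares the numbers 37 and 7.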

Hope This Helps
pop->Mark

Marc de Bourget

May 4, 2016, 4:49:30 AM
For further information see here:
http://awk.info/?keys2awk

Harriet Bazley

May 4, 2016, 10:09:55 AM
On 3 May 2016 as I do recall,
pop wrote:

[snip]

> there is also a gawk extension which yours may not have strtonum()
> if (strtonum(views) > strtonum(totalviews))
>

Thanks. Is this actually more efficient? It's certainly less obscure to
read....

--
Harriet Bazley == Loyaulte me lie ==

Those who neglect the past do not deserve the future.

pop

May 4, 2016, 9:29:22 PM
Harriet Bazley wrote on 5/3/2016 8:41 PM:
> On 3 May 2016 as I do recall,
> pop wrote:
>
> [snip]
>
>> there is also a gawk extension which yours may not have strtonum()
>> if (strtonum(views) > strtonum(totalviews))
>>
>
> Thanks. Is this actually more efficient? It's certainly less obscure to
> read....
>
Actually, adding 0 or multiplying by 1 is more efficient if you are only
expecting decimal numbers. strtonum() is more useful when the strings to be
operated on mix decimal, hex and octal numbers, but it is less efficient.

regards;
pop->Mark

Kaz Kylheku

May 4, 2016, 10:21:30 PM
On 2016-05-05, pop <p_...@hotmail.com> wrote:
> Harriet Bazley wrote on 5/3/2016 8:41 PM:
>> On 3 May 2016 as I do recall,
>> pop wrote:
>>
>> [snip]
>>
>>> there is also a gawk extension which yours may not have strtonum()
>>> if (strtonum(views) > strtonum(totalviews))
>>>
>>
>> Thanks. Is this actually more efficient? It's certainly less obscure to
>> read....
>>
> Actually, adding 0 or multiply by 1 is more efficient if you are only
> expecting decimal numbers.

Completely meaningless if you aren't compiling to machine language, and
executing hundreds of millions of iterations of that in a tight loop.

pop

May 5, 2016, 6:20:27 AM
Yes, you are correct; I was answering the OP question about which was
more efficient; difference is negligible for normal applications and if
strtonum() makes the code clearer, then it is a good choice.

--
regards;
pop->Mark

Kenny McCormack

May 5, 2016, 8:07:27 AM
In article <9f46a47a5...@blueyonder.co.uk>,
Harriet Bazley <harriet...@blueyonder.co.uk> wrote:
>I've just discovered the reason for a bug in my script: it's reading in
>numeric data from a file but unexpectedly comparing the values as strings,
>e.g. it thinks 37 is is smaller than 7.
>
>As a workaround I've currently put a "*1" into both sides of my if()
>statement, i.e. if (views*1 > totalviews*1).

The pretty much standard workaround is to add 0 (+0) to force something to
be numeric and to append an empty string ("") to force it to be a string.
Others have given these and other workarounds, but I am more interested in
how/why you need to do this. The thing is, in my too many years of AWK
programming, I think the number of times I've needed to do these things can
be counted on the fingers of both hands. I just tried to set up a test case
now to see if I could make it happen, but it didn't work - every time it did
the comparison "right" - I could not get it to "fail". GAWK is pretty
"smart" about doing a numeric comparison when it should do so.

A fairly common use of the "+0" hack is to extract a leading numeric string
from a larger string. E.g., you have something like "123:456" in a
variable, and you want to extract the 123. You do var+0 and you get 123.
I mention this because I can't think off-hand of a common usage of the
'append ""' trick.
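
Both idioms can be sketched in a few lines, assuming only POSIX awk. The "123:456" value comes from the post above; the "10 9" input line and the shell variable names are made up for illustration.

```shell
# +0 keeps only the leading numeric part of a string.
leading=$(awk 'BEGIN { var = "123:456"; print var+0 }')

# Appending "" forces a string comparison: as strings, "10" sorts before "9".
forced_string=$(echo "10 9" | awk '{ print (($1 "") < ($2 "") ? "10 sorts first" : "9 sorts first") }')

# Without the "", two numeric-looking input fields compare as numbers.
plain_fields=$(echo "10 9" | awk '{ print ($1 < $2 ? "10 is smaller" : "9 is smaller") }')

echo "$leading"
echo "$forced_string"
echo "$plain_fields"
```

One plausible use of the 'append ""' form is deliberately lexical ordering of values that happen to look like numbers, such as version fragments or zero-padded codes.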

Anyway, could you say more about your actual use case? Help me out in
being able to actually test this.

--
Here's a simple test for Fox viewers:

1) Sit back, close your eyes, and think (Yes, I know that's hard for you).
2) Think about and imagine all of your ridiculous fantasies about Barack Obama.
3) Now, imagine that he is white. Cogitate on how absurd your fantasies
seem now.

See? That wasn't hard, was it?

Harriet Bazley

May 8, 2016, 5:33:22 PM
On 5 May 2016 as I do recall,
pop wrote:


[snip]

> I was answering the OP question about which was
> more efficient; difference is negligible for normal applications and if
> strtonum() makes the code clearer, then it is a good choice.
>
Thanks!

--
Harriet Bazley == Loyaulte me lie ==

Radioactive cats have 18 half-lives.

Harriet Bazley

May 8, 2016, 5:53:33 PM
On 5 May 2016 as I do recall,
Kenny McCormack wrote:

> In article <9f46a47a5...@blueyonder.co.uk>,
> Harriet Bazley <harriet...@blueyonder.co.uk> wrote:
> >I've just discovered the reason for a bug in my script: it's reading in
> >numeric data from a file but unexpectedly comparing the values as strings,
> >e.g. it thinks 37 is is smaller than 7.
> >
> >As a workaround I've currently put a "*1" into both sides of my if()
> >statement, i.e. if (views*1 > totalviews*1).
>
> The pretty much standard workaround is to add 0 (+0) to force something to
> be numeric and to append an empty string ("") to force it to be a string.
> Others have given these and other workarounds, but I am more interested in
> how/why you need to do this.
>
[snip]

> Anyway, could you say more about your actual use case? Help me out in
> being able to actually test this.
>
The script reads two input files; the first is a set of saved input values
written out by previous runs and the second is a Web page.

During the initial file (FS=="|") it assigns the input fields to
two-dimensional arrays indexed as [title,date], e.g.
totalviews[$1,date]=$2

During the second file it extracts the table data from the input HTML and
compares this to the latest recorded values, e.g.
views=striptags($4)
if(views > totalviews[title,date]) {.........}

The striptags function does a couple of gensubs/gsubs and presumably returns
a string value.

--
Harriet Bazley == Loyaulte me lie ==

It is better to have loved and lost than just to have lost.

Ed Morton

May 9, 2016, 8:17:36 AM
That's not enough to identify the problem. Post a complete, minimal script along
with concise, testable sample input and expected output that demonstrates your
problem if you'd like us to be able to help you figure out what is going on.

Ed.

Kenny McCormack

May 9, 2016, 9:47:08 AM
In article <ngpuub$a29$1...@dont-email.me>,
Ed Morton <morto...@gmail.com> wrote:
...
>That's not enough to identify the problem. Post a complete, minimal script along
>with concise, testable sample input and expected output that demonstrates your
>problem if you'd like us to be able to help you figure out what is going on.
>
> Ed.

Well said, Ed. Thanks.

--
To most Christians, the Bible is like a software license. Nobody
actually reads it. They just scroll to the bottom and click "I agree."

- author unknown -

Kaz Kylheku

May 9, 2016, 10:26:47 AM
On 2016-05-09, Ed Morton <morto...@gmail.com> wrote:
> That's not enough to identify the problem. Post a complete, minimal script along
> with concise, testable sample input and expected output that demonstrates your
> problem if you'd like us to be able to help you figure out what is going on.

I would add: "That is, if you still need help after confronting the
problem in these bare terms."

Marc de Bourget

May 9, 2016, 10:57:17 AM
As for code examples: you need to add 0 if you use code like this:

BEGIN {
str1 = "7mm"
str2 = "111dd"
number1 = substr(str1,1,1)
number2 = substr(str2,1,3)
if (number1+0 < number2+0)
print "number1: " number1 " is smaller than number2: " number2
else
print "number1: " number1 " is greater or less than number2: " number2
}

With if (number1 < number2) you get wrong results.

Marc de Bourget

May 9, 2016, 11:05:38 AM
Of course the last line should read "greater than or equal to number2" instead.

BEGIN {
    str1 = "7mm"
    str2 = "111dd"
    number1 = substr(str1,1,1)
    number2 = substr(str2,1,3)
    if (number1+0 < number2+0)
        print "number1: " number1 " is smaller than number2: " number2
    else
        print "number1: " number1 " is greater than or equal to number2: " number2
}

BTW, I don't use this kind of code but this is to illustrate the pitfalls.

Ed Morton

May 9, 2016, 11:21:25 AM
Yes, we know how numeric strings work and the situations in which they become
strings vs numbers, but we're trying to help Harriet debug her code so we need
to see HER code.

Ed.

Kenny McCormack

May 9, 2016, 11:26:57 AM
In article <e3c8ff0a-a139-4df4...@googlegroups.com>,
Thanks. I've rewritten it slightly as shown below:

BEGIN {
str1 = "7mm"
str2 = "111dd"
print "number1 =",number1 = substr(str1,1,1)
print "number2 =",number2 = substr(str2,1,3)
print "Note: 7 really is less than 111!"
print "Without '+0':"
print "\tnumber1 is",number1 < number2 ? "less" : "greater or equal","than number2"
print "With '+0':"
print "\tnumber1 is",number1+0 < number2+0 ? "less" : "greater or equal","than number2"
}

Output:

number1 = 7
number2 = 111
Note: 7 really is less than 111!
Without '+0':
number1 is greater or equal than number2
With '+0':
number1 is less than number2

--
Shikata ga nai...

Kenny McCormack

May 9, 2016, 11:30:23 AM
In article <ngq9n0$icm$1...@dont-email.me>,
Ed Morton <morto...@gmail.com> wrote:
...
>Yes, we know how numeric strings work and the situations in which they
>become strings vs numbers, but we're trying to help Harriet debug her code
>so we need to see HER code.
>
> Ed.

Actually, I've given up on Harriet's returning to this thread. From her
perspective, the problem is solved. OP's rarely return to the thread once
their problem is solved.

Thanks go to Marc for illustrating how to reproduce the problem (for my own
research purposes).

--
Watching ConservaLoons playing with statistics and facts is like watching a
newborn play with a computer. Endlessly amusing, but totally unproductive.

Marc de Bourget

May 9, 2016, 2:30:03 PM
Thank you Kenny.
Maybe we should give Harriet some time; her last post was less than 24 hours ago.

I've tested a bit more and I could imagine two reasons (probably both) for the failure with Harriet's code:
1. Using gsub in her striptags function may force a string as a result (like Harriet already mentioned correctly).
2. Further, I could imagine that the input file contains blanks before or after the numbers. This would also produce strings as a result.

However - and here Ed is right - it would be better to have Harriet's code instead of making assumptions.

A very interesting thread, anyway. Maybe Arnold can clarify exactly when a value is treated as a string (I assume after using substr or gsub etc.) and when it is treated as a number in a comparison, please?

Kaz Kylheku

May 9, 2016, 2:54:09 PM
On 2016-05-09, Kenny McCormack <gaz...@shell.xmission.com> wrote:
> OP's rarely return to the thread once their problem is solved.

Of course, that's why we are all here, buddy.

Ed Morton

May 10, 2016, 1:05:07 AM
It's already explained very clearly in the book Effective Awk Programming,
4th Edition, by Arnold Robbins (http://shop.oreilly.com/product/0636920033820.do).

You can also see the excerpt in question at
http://www.gnu.org/software/gawk/manual/gawk.html#Typing-and-Comparison but IMHO
anyone using gawk should buy the book for their own benefit and as a thank-you
to Arnold for all the effort that's gone into the tool and the documentation.

Ed.

Marc de Bourget

May 10, 2016, 5:13:29 AM
Thanks Ed, I already own this book. Thank you for the hint:
I have forgotten about this chapter, this makes things much clearer.

However, I still have some questions:
1. Is it possible to read the operand's attribute (STRING, NUMERIC, STRNUM)?
2. I assume "getline input" from a file also forces the STRNUM attribute?
3. When adding content to an array, what exactly is the content's attribute?
(e.g. the OP's code: totalviews[title,date])
4. Why does a number preceded and followed by spaces seem to have the STRING attribute?

Please excuse if the answers may be listed somewhere in Arnold's book :-)

Ed Morton

May 10, 2016, 9:52:53 AM
On 5/10/2016 4:13 AM, Marc de Bourget wrote:
> Thanks Ed, I already own this book. Thank you for the hint:
> I have forgotten about this chapter, this makes things much clearer.
>
> However, I still have some questions:
> 1. Is it possible to read the operand's attribute (STRING, NUMERIC, STRNUM)?

Not that I know of (except isarray() of course) but then what would be the use
for a function like that? If you want a string comparison then append "", if you
want numeric then add 0, I can't think of any reason to test the type of a
variable first. You CAN test the VALUE of a variable of course:

$ awk 'BEGIN{print ((x!="")&&(x+0==x)?"":"non-")"number"}'
non-number
$
$ awk -v x="" 'BEGIN{print ((x!="")&&(x+0==x)?"":"non-")"number"}'
non-number
$
$ awk -v x="3" 'BEGIN{print ((x!="")&&(x+0==x)?"":"non-")"number"}'
number
$
$ awk -v x="3e4" 'BEGIN{print ((x!="")&&(x+0==x)?"":"non-")"number"}'
number

and that has its uses, e.g. for validating input data. The x!="" is necessary
to avoid incorrectly identifying an uninitialized (and so zero-or-null) variable
as a number.

> 2. I assume "getline input" from a file also forces the STRNUM attribute?

Yes, that's specifically stated in the section I referenced.

> 3. When adding content to an array, what exactly is the content's attribute?
> (e.g. the OP's code: totalviews[title,date])

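The same as if you had assigned to a scalar variable. Array indices are always
strings, but an array element keeps the attribute of whatever was assigned to
it, so a numeric-looking input field stays a numeric string, just as it would
in a scalar variable.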
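
A small sketch of this point, assuming POSIX awk semantics for numeric strings. The array name a and the sample values are hypothetical and only serve the illustration.

```shell
# Field values stored in an array keep their numeric-string attribute,
# so the comparison is numeric.
from_fields=$(echo "37 7" | awk '{ a["x"] = $1; a["y"] = $2; print (a["x"] > a["y"] ? "numeric" : "string") }')

# String constants stored the same way compare as strings ("37" sorts before "7").
from_constants=$(awk 'BEGIN { a["x"] = "37"; a["y"] = "7"; print (a["x"] > a["y"] ? "numeric" : "string") }')

echo "fields:    $from_fields"
echo "constants: $from_constants"
```

In other words, storing a value in an array does not change how it compares; what matters is where the value came from.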

> 4. Why does a number preceded and followed by spaces seems to have the STRING attribute?

Because it has non-numeric characters (spaces) as part of its value and so awk
knows it's a string, not a number.

> Please excuse if the answers may be listed somewhere in Arnold's book :-)

No problem.

Ed.

Mike Sanders

May 10, 2016, 10:54:58 AM
Marc de Bourget <marcde...@gmail.com> wrote:

> 1. Is it possible to read the operand's attribute (STRING, NUMERIC, STRNUM)?

Not exactly what you may be seeking, but handy nevertheless...

PROCINFO["identifiers"]

<https://www.gnu.org/software/gawk/manual/gawk.html#Auto_002dset>

--
Mike Sanders
www.peanut-software.com

Harriet Bazley

May 10, 2016, 9:23:28 PM
On 9 May 2016 as I do recall,
I have figured out what was going on; it was Kenny McCormack who was
subsequently asking for *my* assistance to help him test the behaviour of
the language... or so I had supposed.

I apologise if my data was insufficient for the required purpose, but
posting a self-contained and functioning mini-script is a less than trivial
piece of adaptation, and the string substitution is clearly the source of
the problem; it's just difficult for me to prove offhand what the 'type' of
a given return value is.

(And it's currently 2:30am - after a quick glance at the scale of the task
involved I'm not in a position to do it right now.)

--
Harriet Bazley == Loyaulte me lie ==

Money is the root of all evil - and man needs roots

Kenny McCormack

May 10, 2016, 10:10:05 PM
In article <aa12577e5...@blueyonder.co.uk>,
Harriet Bazley <harriet...@blueyonder.co.uk> wrote:
...
>I have figured out what was going on; it was Kenny McCormack who was
>subsequently asking for *my* assistance to help him test the behaviour of
>the language... or so I had supposed.

Correct.

>I apologise if my data was insufficient for the required purpose, but but
>posting a self-contained and functioning mini-script is a less than trivial
>piece of adaptation, and the string substitution is clearly the source of
>the problem; it's just difficult for me to prove offhand what the 'type' of
>a given return value is.

Yeah, it is fine. I think Marc has given me what I need.

--
If you think you have any objections to anything I've said above, please
navigate to this URL:

http://www.xmission.com/~gazelle/Truth

This should clear up any misconceptions you may have.

Harriet Bazley

May 11, 2016, 3:47:28 AM
On 11 May 2016 as I do recall,
Kenny McCormack wrote:

> In article <aa12577e5...@blueyonder.co.uk>,
> Harriet Bazley <harriet...@blueyonder.co.uk> wrote:
> ...
> >I have figured out what was going on; it was Kenny McCormack who was
> >subsequently asking for *my* assistance to help him test the behaviour of
> >the language... or so I had supposed.
>
> Correct.
>
> >I apologise if my data was insufficient for the required purpose, but but
> >posting a self-contained and functioning mini-script is a less than trivial
> >piece of adaptation, and the string substitution is clearly the source of
> >the problem; it's just difficult for me to prove offhand what the 'type' of
> >a given return value is.
>
> Yeah, it is fine. I think Marc has given me what I need.
>
Here's a minimal case (now that I've had five hours' sleep):


END{
views=striptags("<td>46")
totalviews=striptags("<td>7")

if(views<totalviews)
{
printf("%d is less than %d\n", views, totalviews) }
}


function striptags(HTML)
{
#strip anything between < and next occurrence of >
gsub(/<\/?[^>]*>/,"",HTML)
return(HTML)
}

------------------------------------------------------------

This will reliably print mathematical rubbish!

--
Harriet Bazley == Loyaulte me lie ==

Micro Credo: Never trust a computer bigger than you can lift.

Kenny McCormack

May 11, 2016, 5:38:20 AM
In article <8c347a7e5...@blueyonder.co.uk>,
Harriet Bazley <harriet...@blueyonder.co.uk> wrote:
...
>Here's a minimal case (now that I've had five hours' sleep):
>
>
>END{
>views=striptags("<td>46")
>totalviews=striptags("<td>7")
>
>if(views<totalviews)
> {
> printf("%d is less than %d\n", views, totalviews) }
>}
>
>
>function striptags(HTML)
>{
>#strip anything between < and next occurrence of >
> gsub(/<\/?[^>]*>/,"",HTML)
> return(HTML)
>}
>
>------------------------------------------------------------
>
>This will reliably print mathematical rubbish!

Got it. Thanks.

--
It's possible that leasing office space to a Starbucks is a greater liability
in today's GOP than is hitting your mother on the head with a hammer.

Ed Morton

May 11, 2016, 7:52:36 AM
OK, you're dealing with strings right out the gate and with nothing to convert
them to numbers so of course you're getting string comparison, same as if you wrote:

$ awk 'function f(n){return n} BEGIN{x=f("46"); y=f("7"); print (x>y?x:y)}'
7

If you want the return value from that function to always be numeric then you
need to add zero to the value it returns (i.e. 'return(HTML+0)' ) so you get
the same effect as:

$ awk 'function f(n){return n+0} BEGIN{x=f("46"); y=f("7"); print (x>y?x:y)}'
46

Regards,

Ed.

Aharon Robbins

May 11, 2016, 11:50:18 AM
In article <12da5089-20ec-4068...@googlegroups.com>,
Marc de Bourget <marcde...@gmail.com> wrote:
>1. Is it possible to read the operand's attribute (STRING, NUMERIC, STRNUM)?

The master branch has a typeof() function that provides this information.
--
Aharon (Arnold) Robbins arnold AT skeeve DOT com

Marc de Bourget

May 11, 2016, 4:31:36 PM
Thank you all for this interesting thread and Harriet's introducing example.

Ed's hint to Arnold's book excerpt was very useful:
http://www.gnu.org/software/gawk/manual/gawk.html#Typing-and-Comparison
For me the two most important sentences are:
"A string constant or the result of a string operation has the string attribute." ...
"In short, when one operand is a “pure” string, such as a string constant, then a string comparison is performed. Otherwise, a numeric comparison is performed."

So it is clear that using e.g. substr or gsub already yields a string, so there is no need for string concatenation with "". This is a slightly different but related topic: having programmed AWK for more than 20 years, I made heavy use of adding "" to ensure strings or adding 0 to ensure numbers in the past. I'm going to avoid string concatenation in the future where it isn't needed because it is time consuming (of course using substr is even more time consuming). However, AFAIK concatenation is less time consuming in AWK than in Ruby or Python, where a new object is always built.
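
The two sentences quoted from the manual can be illustrated with three one-liners. This is only a sketch assuming POSIX awk behaviour; the input values and shell variable names are invented for the example.

```shell
# 1. Two input fields that look numeric: numeric comparison.
fields=$(echo "37 7" | awk '{ print ($1 > $2 ? "numeric" : "string") }')

# 2. A field against a string constant: string comparison.
field_vs_constant=$(echo "37" | awk '{ print ($1 > "7" ? "numeric" : "string") }')

# 3. Results of a string operation (here substr): string comparison.
string_op=$(echo "37 7" | awk '{ print (substr($1, 1) > substr($2, 1) ? "numeric" : "string") }')

echo "$fields $field_vs_constant $string_op"
```

Case 3 is the situation in Harriet's script: the values pass through gsub/gensub before they are compared, so they arrive as pure strings.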

Harriet Bazley

May 11, 2016, 9:02:40 PM
On 11 May 2016 as I do recall,
Ed Morton wrote:

> On 5/11/2016 2:46 AM, Harriet Bazley wrote:

[snip]

> > function striptags(HTML)
> > {
> > #strip anything between < and next occurrence of >
> > gsub(/<\/?[^>]*>/,"",HTML)
> > return(HTML)
> > }

[snip]

> If you want the return value from that function to always be numeric then you
> need to add zero to its return code (i.e. 'return(HTML+0)' )

I tried that; sadly I'd forgotten that I also need it to extract strings
from the same HTML table with which to access the indexed arrays!

Best workaround seems to be to force the returned values to numeric at the
four points in the script where I actually need them as integers.
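
A sketch of that approach, assuming POSIX awk: striptags() keeps returning strings so titles still work as array indices, and +0 is applied only where the values are compared. The table cells and the one-dimensional totalviews[title] array are simplified stand-ins for the real script.

```shell
result=$(awk '
function striptags(html) {
    # Strip anything between < and the next >; the result is always a string.
    gsub(/<\/?[^>]*>/, "", html)
    return html
}
BEGIN {
    title = striptags("<td>Story title</td>")   # stays a string, usable as an index
    views = striptags("<td>46</td>")
    totalviews[title] = striptags("<td>7</td>")
    # Force numbers only here, where the values are compared.
    if (views+0 > totalviews[title]+0)
        print title ": " views " > " totalviews[title]
    else
        print title ": " views " <= " totalviews[title]
}')
echo "$result"
```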

--
Harriet Bazley == Loyaulte me lie ==

I mean to live forever - or die trying!

Aharon Robbins

May 11, 2016, 11:37:36 PM
In article <51c03951-3cb0-411e...@googlegroups.com>,
Marc de Bourget <marcde...@gmail.com> wrote:
>time consuming). However, AFAIK concatenation is less time consuming in
>AWK than in Ruby or Python where they always build new objects.

Gawk has one significant optimization. The case of

scalar = scalar other_value

uses realloc instead of malloc new, copy values, free old. Other
cases do build a new string from scratch.

However, being aware of this when you write your scripts can lead to
programs that run immensely faster than they would otherwise.
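
For reference, the "scalar = scalar other_value" form looks like this in a loop. The sketch only shows the shape of the statement; whether a given awk applies the realloc optimization is an implementation detail that cannot be seen from the output, and the variable names are made up.

```shell
# Build a string by appending to the same scalar on every iteration.
built=$(awk 'BEGIN {
    s = ""
    for (i = 1; i <= 5; i++)
        s = s i ","        # the "scalar = scalar other_value" form
    print s
}')
echo "$built"
```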

I'm not sure if I can correctly optimize

val1 = val2 ""

into something that just gives val2 a string value without actually
also creating a separate string to assign to val1. I'll have to
investigate that one.

Marc de Bourget

May 12, 2016, 3:32:35 AM
> > On 5/11/2016 2:46 AM, Harriet Bazley wrote:
>
> I tried that; sadly I'd forgotten that I also need it to extract strings
> from the same HTML table with which to access the indexed arrays!

Hi Harriet, I don't understand exactly what you mean :-)
Can you complete your mini-script with the arrays, please?

Marc de Bourget

May 13, 2016, 8:52:13 AM
> I'm not sure if I can correctly optimize
>
> val1 = val2 ""
>
> into something that just gives val2 a string value without actually
> also creating a separate string to assign to val1. I'll have to
> investigate that one.
> --
> Aharon (Arnold) Robbins arnold AT skeeve DOT com

Thank you Arnold, sounds great. One more question, please:
val1 = val2 + 0
Is this assignment time consuming or no problem?

Janis Papanagnou

May 13, 2016, 8:59:36 AM
On 13.05.2016 14:52, Marc de Bourget wrote:
>
> [...] One more question, please:
> val1 = val2 + 0
> Is this assignment time consuming or no problem?

Presuming that val2 is a string, the implicit string->int conversion would
certainly be the most costly here. (Not the addition, not the assignment.)

Janis

Kenny McCormack

May 13, 2016, 9:12:47 AM
In article <ngvk54$1km$1...@dont-email.me>,
Aharon Robbins <arn...@skeeve.com> wrote:
>In article <12da5089-20ec-4068...@googlegroups.com>,
>Marc de Bourget <marcde...@gmail.com> wrote:
>>1. Is it possible to read the operand's attribute (STRING, NUMERIC, STRNUM)?
>
>The master branch has a typeof function to bring this information.

Continuing the tradition of GAWK catching up with where TAWK was 20 years
ago. Seriously, I am very much looking forward to seeing this and some of
the other goodies promised in the "master branch".

Well done!

--

There are many self-professed Christians who seem to think that because
they believe in Jesus' sacrifice they can reject Jesus' teachings about
how we should treat others. In this country, they show that they reject
Jesus' teachings by voting for Republicans.

Kenny McCormack

May 13, 2016, 9:16:31 AM
In article <nh4j3n$fle$1...@news.m-online.net>,
FWIW, the actual conversion is probably just a simple call to strtod().
Note that strtod() does indeed do the "extract out the leading numeric
string" functionality, which is the core of how AWK handles string to
numeric conversions.
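
The prefix behaviour can be observed directly. This sketch assumes a POSIX awk and says nothing about which C function a particular implementation calls internally; the sample strings are invented.

```shell
prefixes=$(awk 'BEGIN {
    print "12abc" + 0      # leading digits are used: 12
    print " 42 " + 0       # surrounding blanks are skipped: 42
    print "3e2xyz" + 0     # an exponent is part of the number: 300
    print "abc" + 0        # no leading number at all: 0
}')
echo "$prefixes"
```

Note that a string with no numeric prefix quietly becomes 0, so +0 is not a validity check.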

--
Conservatives want smaller government for the same reason criminals want fewer cops.

Janis Papanagnou

May 13, 2016, 10:34:54 AM
On 13.05.2016 15:16, Kenny McCormack wrote:
> In article <nh4j3n$fle$1...@news.m-online.net>,
> Janis Papanagnou <janis_pa...@hotmail.com> wrote:
>> On 13.05.2016 14:52, Marc de Bourget wrote:
>>>
>>> [...] One more question, please:
>>> val1 = val2 + 0
>>> Is this assignment time consuming or no problem?
>>
>> Presuming that val2 is a string, the implicit string->int conversion would
>> certainly be the most costly here. (Not the addition, not the assignment.)
>
> FWIW, the actual conversion is probably just a simple call to strtod().

Yes, and (as most such conversions) the most costly in above statement.

In the "opposite" case (with an int->string conversion), var2 = var1 "" ,
the conversion would still be costly but the copy of a string as well as
the implicit (non-optimized) string concatenation would also account more
than an int copy and int addition. - That's why (I think) Arnold is
considering an optimization.

> Note that strtod() does indeed do the "extract out the leading numeric
> string" functionality, which is the core of how AWK handles string to
> numeric conversions.

Interesting.

Janis

Harriet Bazley

May 17, 2016, 2:14:12 PM
On 12 May 2016 as I do recall,
Given input in the form
<tr >
<td ><a href='story_eyes_story.php?storyid=n'>Story title</a></td>
<td >9,285</td>
<td >10</td>
<td >4</td>
</tr>

what the script actually does is more akin to this (where totalviews and
totalvisitors are set up elsewhere from data read from the preceding file):

BEGIN{
RS="<tr >"
FS="\t<td >"
date=strftime("01/%m/%Y")
}



/story_eyes_story.php\?storyid/ {#scanning data from Web page
views=strtonum(striptags($4))
title=striptags($2)
visitors=striptags($5)
if(views > totalviews[title,date])
{
printf("%30s %4s %6s\n",title,"+"views-totalviews[title,date],"(+"visitors-totalvisitors[title,date]")")
changes++
}
}




function striptags(HTML)
{
#strip </td> and anything following (i.e. clean up end of last record)
b= gensub("([^>]*)</td>.*","\\1",1,HTML)
#strip anything between < and next occurrence of >
gsub(/<\/?[^>]*>/,"",b)
return(b)

}


Forcing all calls to striptags() to return integer values results in its
failing to return the story title correctly, such that the array indexing
fails. You'd need two versions of the function, one for strings and one
for integers - or a parameter passed in to tell it what type was wanted.
--
Harriet Bazley == Loyaulte me lie ==

Questions are a burden to others, but answers are a prison for oneself.

Marc de Bourget

May 17, 2016, 2:36:26 PM
> Forcing all call to striptags() to return integer values results in its
> failing to return the story title correctly, such that the array indexing
> fails. You'd need two versions of the function, one for strings and one
> for integers - or a parameter passed in to tell it what type was wanted.

Hi Harriet, yes, if I have understood it correctly, you could try something like:

views=striptags($4, 1)
title=striptags($2, 0)

function striptags(HTML, integer)
{
#strip </td> and anything following (i.e. clean up end of last record)
b= gensub("([^>]*)</td>.*","\\1",1,HTML)

#strip anything between < and next occurrence of >

gsub(/<\/?[^>]*>/,"",b)
if (integer)
return(b) + 0
else
return(b) ""
}

Marc de Bourget

May 17, 2016, 2:48:55 PM
or better:

function striptags(HTML, integer)
{
#strip </td> and anything following (i.e. clean up end of last record)
b= gensub("([^>]*)</td>.*","\\1",1,HTML)

#strip anything between < and next occurrence of >

gsub(/<\/?[^>]*>/,"",b)
if (integer)
return(b) + 0
else
return(b)
}

Of course, as we discussed intensively above, the last concatenation isn't needed because of the preceding gsub.

Marc de Bourget

May 17, 2016, 4:19:46 PM
and better still, with a shorter call of the same function:

views=striptags($4, 1)
title=striptags($2)

function striptags(HTML, integer)
{
...

Ed Morton

May 18, 2016, 10:43:50 AM
On 5/11/2016 7:49 PM, Harriet Bazley wrote:
> On 11 May 2016 as I do recall,
> Ed Morton wrote:
>
>> On 5/11/2016 2:46 AM, Harriet Bazley wrote:
>
> [snip]
>
>>> function striptags(HTML)
>>> {
>>> #strip anything between < and next occurrence of >
>>> gsub(/<\/?[^>]*>/,"",HTML)
>>> return(HTML)
>>> }
>
> [snip]
>
>> If you want the return value from that function to always be numeric then you
>> need to add zero to its return code (i.e. 'return(HTML+0)' )
>
> I tried that; sadly I'd forgotten that I also need it to extract strings
> from the same HTML table with which to access the indexed arrays!
>
> Best workaround seems to be to force the returned values to numeric at the
> four points in the script where I actually need them as integers.
>

You could always add a test for a number inside the function and add zero when
appropriate, e.g. in the new_striptags() function below:

$ cat tst.awk
function old_striptags(HTML)
{
#strip anything between < and next occurrence of >
gsub(/<\/?[^>]*>/,"",HTML)
return(HTML)
}

function new_striptags(HTML)
{
#strip anything between < and next occurrence of >
gsub(/<\/?[^>]*>/,"",HTML)
return(HTML+0 == HTML ? HTML+0 : HTML)
}

{
old_ret = old_striptags($0)
print "old:", old_ret, (old_ret > 7 ? ">" : "<="), 7

new_ret = new_striptags($0)
print "new:", new_ret, (new_ret > 7 ? ">" : "<="), 7

print ""
}
$
$ cat file
<string>foo</string>
<integer>37</integer>
$
$ awk -f tst.awk file
old: foo > 7
new: foo > 7

old: 37 <= 7
new: 37 > 7

If all you REALLY want to identify are integers (not, for example exponents like
"3e7") then you could change "HTML+0==HTML" to "HTML~/^[0-9]+$/" and so on.

Ed.

Marc de Bourget

May 18, 2016, 1:02:01 PM
Le mercredi 18 mai 2016 16:43:50 UTC+2, Ed Morton a écrit :
...
> function new_striptags(HTML)
> {
> #strip anything between < and next occurrence of >
> gsub(/<\/?[^>]*>/,"",HTML)
> return(HTML+0 == HTML ? HTML+0 : HTML)
> }
>
> {

Hi Ed, great code. I really like this snippet of automatic identification.
Two remarks: if I have understood Harriet correctly, the story title should for some reason always be handled as a string (but I may be wrong), and for a VERY large number of input records the comparison in the return statement will be a bit slower than the other variant. But I admit your code is much nicer :-). I'll add it to my favourite AWK snippets collection.

Janis Papanagnou

May 18, 2016, 1:23:40 PM
On 18.05.2016 19:01, Marc de Bourget wrote:
> Le mercredi 18 mai 2016 16:43:50 UTC+2, Ed Morton a écrit :
> ...
>> function new_striptags(HTML)
>> {
>> #strip anything between < and next occurrence of >
>> gsub(/<\/?[^>]*>/,"",HTML)
>> return(HTML+0 == HTML ? HTML+0 : HTML)
>> }
>>
>> {
>
> Hi Ed, great code. I really like this snippet of automatic identification.

I haven't followed this thread in detail, but I'm unsure whether the last four
of these test cases

<string>foo</string>
<integer>37</integer>
<string> foo</string>
<integer> 37</integer>
<string>foo </string>
<integer>37 </integer>

are handled as intended (the padded <integer> values contain spaces but still
represent integers). As it stands they all run into the 'else' branch of the
conditional expression. The final comment that Ed gave, using a regexp
comparison, could easily be extended to cover such white space padding of
HTML data.

Janis

> [...]

Marc de Bourget

May 18, 2016, 2:48:57 PM
Hi Janis,

good hint. It depends on what you want to get as a result.
So, I'll add the alternative which ensures that integers with leading or trailing spaces are handled as integers (I think this is what you expect most of the time):

function new_striptags(HTML)
{
#strip anything between < and next occurrence of >
gsub(/<\/?[^>]*>/,"",HTML)
return (HTML ~ /^ *[0-9]+ *$/ ? HTML+0 : HTML)
}

Ed Morton

May 18, 2016, 3:39:58 PM
change " *" to "[[:space:]]*" to include all white space.
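
Applied to Marc's function, that change might look like the sketch below. It assumes an awk that supports POSIX character classes; the striptags_ws wrapper name and the sample input are made up for illustration.

```shell
# Sketch (wrapper name is hypothetical): treat digits padded with any
# POSIX white space (blanks, tabs, ...) as a number.
striptags_ws() {
    awk '
    function new_striptags(HTML) {
        # strip anything between < and the next occurrence of >
        gsub(/<\/?[^>]*>/, "", HTML)
        return (HTML ~ /^[[:space:]]*[0-9]+[[:space:]]*$/ ? HTML+0 : HTML)
    }
    { print "[" new_striptags($0) "]" }'
}

printf '<integer>\t37 </integer>\n<string> foo</string>\n' | striptags_ws
```

The brackets in the output only make the surviving white space visible: the tab-padded integer comes back as a bare number, while the padded string keeps its leading blank.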

Ed.

Marc de Bourget

May 18, 2016, 4:00:56 PM
It seems [[:space:]] is GAWK specific. I don't know this one with TAWK or MAWK.

Ed Morton

May 18, 2016, 4:10:30 PM
No, [[:space:]] is POSIX. If it doesn't work in the awk you are using then get a
modern/current awk.

Ed.

Marc de Bourget

May 18, 2016, 4:27:38 PM
> No, [[:space:]] is POSIX. If it doesn't work in the awk you are using then get a
> modern/current awk.
>
> Ed.

I have never needed it. I always use tab separated files so I don't need to remove tab characters (ASCII 9).

Marc de Bourget

May 18, 2016, 4:59:45 PM
I've just had a look at this site:
http://www.regular-expressions.info/posixbrackets.html

All this POSIX stuff like [:word:] seems useless to me because it doesn't use accents. "Céline" is a word, but not for POSIX. It's only useful for English words, so I don't miss this POSIX stuff at all. POSIX's poor definition of a "word" is [A-Za-z0-9_] which is far too primitive. I always build my own character classes which are better suited for my needs.

Janis Papanagnou

May 18, 2016, 5:25:57 PM
That's what locale settings are for. Consider these test cases:

$ echo 'Öse' |
awk 'match($0,/[[:alpha:]]+/){print substr($0,RSTART,RLENGTH)}'
Öse

$ echo 'Öse' |
LC_ALL=C awk 'match($0,/[[:alpha:]]+/){print substr($0,RSTART,RLENGTH)}'
se

The first one runs in my default locale (which has umlauts as part of the
alpha character class), and the second one is explicitly set to use the
standard C locale, which does not know about umlauts (or accents).

Janis

Ed Morton

May 18, 2016, 6:12:04 PM
You misunderstand. Character classes exist to make your scripts portable across
locales; they are there precisely to include accented characters and other
characters that are considered part of a "word" in your specific locale but not
in other locales. They also protect you from locales where, for example, `[a-z]`
means `[aAbBcC...yYz]` instead of what you probably expect, `[abc...z]`.

If the character classes aren't behaving as you expect it's because the locale
setting is wrong in your environment. Google "locale" for more info.

Ed.

Harriet Bazley

May 18, 2016, 6:59:39 PM
On 18 May 2016 as I do recall,
Yours is a lot more readable, though :-)


(I don't remember the a ? b : c operators from when I learnt awk; I first
encountered this when learning lua some ten years or so later. Has it been
added to the language?)

--
Harriet Bazley == Loyaulte me lie ==

USER ERROR: replace user and press any key to continue.

Andrew Schorr

May 18, 2016, 8:05:00 PM
On Wednesday, May 18, 2016 at 10:43:50 AM UTC-4, Ed Morton wrote:
> function new_striptags(HTML)
> {
> #strip anything between < and next occurrence of >
> gsub(/<\/?[^>]*>/,"",HTML)
> return(HTML+0 == HTML ? HTML+0 : HTML)
> }

I hate to be a party pooper, but the (x+0 == x) test works only for strnum values, not for strings. Here's an example:

bash-4.2$ echo junk5.555555555 | gawk '{x = $1; print typeof(x); gsub(/junk/,"", x); print typeof(x); print x; print (x+0 == x)}'
strnum
string
5.555555555
0

Without the gsub, everything is good:

bash-4.2$ echo 5.555555555 | gawk '{x = $1; print typeof(x); print (x+0 == x)}'
strnum
1

When gsub succeeds, it changes the type from strnum to string. If the gsub doesn't match, it doesn't seem to have this effect:

bash-4.2$ echo 5.555555555 | gawk '{x = $1; print typeof(x); gsub(/junk/,"", x); print typeof(x); print x; print (x+0 == x)}'
strnum
strnum
5.555555555
1

When a string contains a numeric value like "5.555555555", the (x+0 == x) test occurs as a string comparison. So x+0 is converted back to a string using CONVFMT:

bash-4.2$ echo junk5.555555555 | gawk '{x = $1; print typeof(x); gsub(/junk/,"", x); print typeof(x); print x; print (x+0 == x); print (x+0)""}'
strnum
string
5.555555555
0
5.55556

This is a really subtle and annoying issue. Also, you should note that the (x+0 == x) test ignores leading and trailing white space. So a strnum containing " 5" or "5 " would pass the (x+0 == x) test. That may or may not be what you want.
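
The CONVFMT effect can be made visible with nothing but POSIX awk features, as in the sketch below. The convfmt_demo name is invented, and raising CONVFMT is shown only to illustrate the mechanism; it is not a general fix, because a value with more digits than the format keeps would still fail the test.

```shell
# x is a plain string here, so (x+0 == x) is a string comparison and x+0 is
# first converted back to a string using CONVFMT (default "%.6g").
convfmt_demo() {
    awk 'BEGIN {
        x = "5.555555555"
        print (x+0 == x)      # compares "5.55556" with "5.555555555"
        CONVFMT = "%.15g"
        print (x+0 == x)      # compares "5.555555555" with "5.555555555"
    }'
}

convfmt_demo
```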

Lastly, if you have a variable of uncertain type, you can convert it to a strnum forcibly using the split or match functions. So you could do something like this:

bash-4.2$ cat /tmp/test.awk
function isnumeric(x, f) {
match(x, /^(.*)$/, f)
x = f[1]
return (x+0 == x) # ignoring white-space issues
}

bash-4.2$ gawk -i /tmp/test.awk 'BEGIN {x = "5.555555555"; print (x+0 == x); print isnumeric(x)}'
0
1

This is really yucky stuff. You can also use split() to convert a string to a strnum if you know of an FS character that is guaranteed not to be in the string. Maybe that's faster than match. I haven't tested the relative performance, but I guess that the match() call is pretty slow.
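
A rough sketch of the split() idea, assuming SUBSEP never occurs in the value (that choice of separator and the split_demo name are assumptions made for the example). POSIX describes split() elements as getting the same numeric-string treatment as input fields, but it is worth verifying on the awk actually in use.

```shell
# Sketch: split() on a separator assumed never to occur in the value yields
# an element that is treated like input, so a numeric-looking string is
# compared as a number.
split_demo() {
    awk 'BEGIN {
        x = "37"                  # string constant: compared as a string
        print (x > 7)
        split(x, f, SUBSEP)       # f[1] is now a numeric string (strnum)
        print (f[1] > 7)
    }'
}

split_demo
```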

Regards,
Andy




Janis Papanagnou

May 19, 2016, 4:20:30 AM
On 19.05.2016 00:54, Harriet Bazley wrote:
> [...]
>
> (I don't remember the a ? b : c operators from when I learnt awk; I first
> encountered this when learning lua some ten years or so later. Has it been
> added to the language?)

The conditional expression has existed for three decades; it's already
described in A., K., and W.'s original Awk book.

Janis

Marc de Bourget

May 19, 2016, 4:54:06 AM
Le mercredi 18 mai 2016 23:25:57 UTC+2, Janis Papanagnou a écrit :
> That's what locale settings are for. Consider these test cases:
>
> $ echo 'Öse' |
> awk 'match($0,/[[:alpha:]]+/){print substr($0,RSTART,RLENGTH)}'
> Öse
>
> $ echo 'Öse' |
> LC_ALL=C awk 'match($0,/[[:alpha:]]+/){print substr($0,RSTART,RLENGTH)}'
...

Hi Janis, GNU confirmed LC_ALL locale can't be set for native MS Windows :-)
(see https://groups.google.com/forum/#!topic/comp.lang.awk/coXxXOpeoXU).

Marc de Bourget

May 19, 2016, 5:29:22 AM
Hi Janis, I'm interested: Can you get a list of all "[[:alpha:]]" characters affected by your locale setting on Linux?

Janis Papanagnou

May 19, 2016, 6:51:36 AM
On 19.05.2016 11:29, Marc de Bourget wrote:
> Le jeudi 19 mai 2016 10:54:06 UTC+2, Marc de Bourget a écrit :
>> Le mercredi 18 mai 2016 23:25:57 UTC+2, Janis Papanagnou a écrit :
>>> That's what locale settings are for. Consider these test cases:
>>>
>>> $ echo 'Öse' |
>>> awk 'match($0,/[[:alpha:]]+/){print substr($0,RSTART,RLENGTH)}'
>>> Öse
>>>
>>> $ echo 'Öse' |
>>> LC_ALL=C awk 'match($0,/[[:alpha:]]+/){print substr($0,RSTART,RLENGTH)}'
>> ...
>>
>> Hi Janis, GNU confirmed LC_ALL locale can't be set for native MS Windows :-)
>> (see https://groups.google.com/forum/#!topic/comp.lang.awk/coXxXOpeoXU).

LC_ALL is an environment variable; the locale is defined by what follows
the '=', something like C, POSIX, fr_FR, de_DE, de_DE.UTF-8, or somesuch.

The linked thread suggested:
"The Cygwin port behaves like Linux. No problem setting a utf-8 locale."
(Why do you still take all that hassle instead of just installing Cygwin?
I know you once failed to do so, but a new try from scratch would be much
more satisfying - and not only for you -, and more likely to get something
working.)

And isn't there (on "native" Windows) some other setting for the locale?
(I can't believe that Windows isn't locale-aware in the third millennium.)

>
> Hi Janis, I'm interested: Can you get a list of all "[[:alpha:]]" characters affected by your local setting on Linux?

Frankly, I'm not sure how to iterate over all UTF-8 characters to create
such a list. - But why do you want that? - My locale is a DE locale and,
I suppose, of little use for you if you're in a, say (derived from your
name), FR domain. You should assume that setting the appropriate locale
will do the "Right Thing"[tm] for you. (Otherwise send a bug report to
the vendor.)

Janis

Andrew Schorr

May 19, 2016, 12:40:18 PM
On Wednesday, May 18, 2016 at 8:05:00 PM UTC-4, Andrew Schorr wrote:
> bash-4.2$ cat /tmp/test.awk
> function isnumeric(x, f) {
> match(x, /^(.*)$/, f)
> x = f[1]
> return (x+0 == x) # ignoring white-space issues
> }

On 2nd thought, this might be more efficient:

function isnumeric(x, f) {
if (x+0 == x)
return 1

Marc de Bourget

May 19, 2016, 2:03:31 PM
Hi Andy,

what do you think of our (actually Ed's) alternative version?

function new_striptags(HTML)
{
#strip anything between < and next occurrence of >
gsub(/<\/?[^>]*>/,"",HTML)
return (HTML ~ /^[ \t]*[0-9]+[ \t]*$/ ? HTML+0 : HTML)
}

BTW, I have now added "\t" to the space as a character class.

Mike Sanders

May 19, 2016, 4:18:42 PM
Janis Papanagnou <janis_pa...@hotmail.com> wrote:

> And isn't there (on "native" Windows) some other setting for the locale?

Yes 'chcp' (CHange Code Page) <http://ss64.com/nt/chcp.html> is one way.

--
Mike Sanders
www.peanut-software.com

Janis Papanagnou

May 19, 2016, 4:58:55 PM
On 19.05.2016 22:18, Mike Sanders wrote:
> Janis Papanagnou <janis_pa...@hotmail.com> wrote:
>
>> And isn't there (on "native" Windows) some other setting for the locale?
>
> Yes 'chcp' (CHange Code Page) <http://ss64.com/nt/chcp.html> is one way.

A code-page is not exactly a locale definition. Or is that in the Windows
world more than just a table of integer-code-to-character mapping?

Janis

Janis Papanagnou

May 19, 2016, 5:04:26 PM
(I should have inspected your link before posting.)

It at least mentions that, beyond the code-page, there is also a Locale:

"Change the active console Code Page. The default
code page is determined by the Windows Locale."

So I would expect that with a proper locale-setting you will also get
the correct character classes for the defined locale.

Janis

Kenny McCormack

May 19, 2016, 5:29:51 PM
In article <nhl9op$7e6$1...@news.m-online.net>,
Janis Papanagnou <janis_pa...@hotmail.com> wrote:
...
>So I would expect that with a proper locale-setting you will also get
>the correct character classes for the defined locale.

Changed the Subject: for you (Why can't you people learn to do this
yourselves? Why must I do everything?)

Please continue this discussion under this new sub-thread. Thank you.

--
The randomly generated signature file that would have appeared here is more than 4
lines in length. As such, it violates one or more Usenet RFPs. In order to remain in
compliance with said RFPs, the actual sig can be found at the following web address:
http://www.xmission.com/~gazelle/Sigs/Seneca

Marc de Bourget

May 19, 2016, 5:30:40 PM
Le jeudi 19 mai 2016 23:04:26 UTC+2, Janis Papanagnou a écrit :
> ...
> So I would expect that with a proper locale-setting you will also get
> the correct character classes for the defined locale.
>
> Janis

Hi, Janis. OK, /[[:alpha:]]+/ works properly with my Windows Locale.
I should have tested this one more thoroughly. I'm sorry about that.
Thank you!

Marc de Bourget

May 19, 2016, 5:36:13 PM
Hi Kenny,

I'll try to get back to the original thread, so I'd like to ask:
function new_striptags(HTML)
{
#strip anything between < and next occurrence of >
gsub(/<\/?[^>]*>/,"",HTML)

return (HTML ~ /^[ \t]*[0-9]+[ \t]*$/ ? HTML+0 : HTML)
}

Can we consider this one a kind of final version for Harriet, or are there still issues?

Janis Papanagnou

May 19, 2016, 5:55:45 PM
On 19.05.2016 23:29, Kenny McCormack wrote:
> In article <nhl9op$7e6$1...@news.m-online.net>,
> Janis Papanagnou <janis_pa...@hotmail.com> wrote:
> ...
>> So I would expect that with a proper locale-setting you will also get
>> the correct character classes for the defined locale.
>
> Changed the Subject: for you (Why can't you people learn to do this
> yourselves? Why must I do everything?)

I perceived the character-class/locale issue as a detail/sidetrack of
the original issue. Thus I'm not convinced by your complaints that the
subject should be changed.

>
> Please continue this discussion under this new sub-thread. Thank you.

(If this thread would be terminated I'd also not be unhappy.)

Janis

Harriet Bazley

May 19, 2016, 7:18:46 PM
On 19 May 2016 as I do recall,
To be honest, since I'm using strtonum() when setting up the
totalviews/totalvisitors arrays as well and these lines are *not* read from
HTML input, an auto-detecting striptags(HTML) is only going to replace half
the occurrences in the script anyway. Which is why it seems simpler and
probably clearer to stick to the explicit force-to-numeric approach where
relevant.

e.g.
(FS=="|") {#scanning saved internal data file
totalviews[$1,date]=strtonum($2) #force to numeric
totalvisitors[$1,date]=strtonum($3)
}

(Sample input: "Story title|59|18")
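
For an awk without strtonum() (it is a gawk extension), the same forcing can be sketched with "+0", assuming the data is plain decimal; unlike strtonum(), "+0" does not interpret hex or octal prefixes. The load_totals name, the date value and the END block are made up for the example.

```shell
# Sketch: portable stand-in for strtonum() when the data is plain decimal.
# The date value and the final comparison are invented for illustration.
load_totals() {
    awk -F'|' -v date="2016-05-19" '
    {
        totalviews[$1, date]    = $2 + 0   # force to numeric
        totalvisitors[$1, date] = $3 + 0
    }
    END {
        t = "Story title"
        print totalviews[t, date], totalvisitors[t, date], (totalviews[t, date] > 7)
    }'
}

printf 'Story title|59|18\n' | load_totals
```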


--
Harriet Bazley == Loyaulte me lie ==

We have met the enemy, and he is us.

Harriet Bazley

May 19, 2016, 7:18:48 PM
On 19 May 2016 as I do recall,
Found it: "7.12 Conditional Expressions". :-)


--
Harriet Bazley == Loyaulte me lie ==

Cleanliness is next to impossible.

Andrew Schorr

May 19, 2016, 8:13:21 PM
Using a regexp is fine, but it's really slow. And that matches only unsigned integers. To match scientific notation requires a really nasty regexp.

Regards,
Andy

Andrew Schorr

May 19, 2016, 10:19:28 PM
On Thursday, May 19, 2016 at 8:13:21 PM UTC-4, Andrew Schorr wrote:
> Using a regexp is fine, but it's really slow. And that matches only unsigned integers. To match scientific notation requires a really nasty regexp.

Also, if floating-point numbers are a possibility, then there are issues with this. Consider:

bash-4.2$ gawk 'BEGIN {x = "5.555555555"; y = x+0; print y; print x}'
5.55556
5.555555555

When you assign x+0 to a new variable, you lose the original string representation. This can also apply to really large integer values:

bash-4.2$ gawk 'BEGIN {x = "12345678901234567"; y = x+0; print y; print x}'
12345678901234568
12345678901234567

There are lots of subtle issues here. It depends on your use case. If you care only about the numeric value, then x+0 is fine.

And as has been suggested before, I like using character classes. For integers, something like /^[[:space:]]*[+-]?[[:digit:]]+[[:space:]]*$/. I leave floating-point scientific notation as an exercise for the reader. :-)
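
One possible take on that exercise, offered as a sketch rather than a vetted pattern: it accepts an optional sign, digits with an optional fraction (or a bare fraction such as ".5") and an optional exponent. It deliberately does not accept hex, "inf" or "nan", and the looks_numeric name is invented for the example.

```shell
# Hypothetical helper: print 1 for lines that look like a decimal or
# scientific-notation number (optionally padded with white space), else 0.
looks_numeric() {
    awk '{
        ok = ($0 ~ /^[[:space:]]*[-+]?([0-9]+\.?[0-9]*|\.[0-9]+)([eE][-+]?[0-9]+)?[[:space:]]*$/)
        print ok
    }'
}

printf '%s\n' '37' ' -3.5e7 ' '.5' '3e' 'foo' | looks_numeric
```

The first three sample lines match; "3e" (an exponent with no digits) and "foo" do not.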

Regards,
Andy




Kaz Kylheku

May 20, 2016, 12:09:24 AM
On 2016-05-20, Andrew Schorr <asc...@telemetry-investments.com> wrote:
> On Thursday, May 19, 2016 at 8:13:21 PM UTC-4, Andrew Schorr wrote:
>> Using a regexp is fine, but it's really slow. And that matches only unsigned integers. To match scientific notation requires a really nasty regexp.
>
> Also, if floating-point numbers are a possibility, then there are issues with this. Consider:
>
> bash-4.2$ gawk 'BEGIN {x = "5.555555555"; y = x+0; print y; print x}'
> 5.55556
> 5.555555555
>
> When you assign x+0 to a new variable, you lose the original string representation. This can also apply to really large integer values:

Your point stands for substantially more precise numbers.

But in this case, it's just due to the way gawk printed it by default
when no formatting was given, which we can fix:

$ gawk 'BEGIN {x = "5.555555555"; y = x+0; printf("%.15g\n", y); print x}'
5.555555555
5.555555555

5.555555555 is well below the 15 digit limit of the ability of an IEEE
64 bit double to preserve the decimal digits.
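
Both cases can be put side by side in a small round-trip check. This is a sketch assuming IEEE 64-bit doubles; the roundtrip_demo name is invented, and the two values are the ones already used in this thread.

```shell
# Convert each string to a number and back, then compare with the original
# text. Both operands of == are strings, so this is a string comparison.
roundtrip_demo() {
    awk 'BEGIN {
        a = "5.555555555"          # 10 significant digits
        b = "12345678901234567"    # 17 significant digits
        print (sprintf("%.15g", a+0) == a)
        print (sprintf("%.17g", b+0) == b)
    }'
}

roundtrip_demo
```

The first comparison succeeds because the value fits comfortably within 15 digits; the second fails because the 17-digit integer is not exactly representable as a double.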

Marc de Bourget

May 20, 2016, 4:02:14 PM
Le vendredi 20 mai 2016 01:18:46 UTC+2, Harriet Bazley a écrit :
> To be honest, since I'm using strtonum() when setting up the
> totalviews/totalvisitors arrays as well and these lines are *not* read from
> HTML input, an auto-detecting striptags(HTML) is only going to replace half
> the occurrences in the script anyway. Which is why it seems simpler and
> probably clearer to stick to the explicit force-to-numeric approach where
> relevant.
>
> e.g.
> (FS=="|") {#scanning saved internal data file
> totalviews[$1,date]=strtonum($2) #force to numeric
> totalvisitors[$1,date]=strtonum($3)
> }
>
> (Sample input: "Story title|59|18")

Hi Harriet, great if this code works best for you. Of course I don't want to impose other code on you. Some more thoughts:

"strtonum" reminds me of my Delphi "strtoint" function. For some reason, I don't like the name "strtonum" very much because it implies kind of persistence.
I'm just wondering why there aren't any persistent data types like integer in AWK and other scripting languages. The data type of a variable can change with each line of the script, so we can never be sure of the actual type (unless we know the rules very exactly). Why can't we simply declare "int views" like in C if we want to be sure the "views" variable is always an integer? Of course if so, "inttostr" and "strtoint" conversion functions would be needed.
Sometimes I really wish there were data types like "int" and "string". There must be very important and obvious reasons why they don't exist (maybe I have simply forgotten them at the moment).

Kaz Kylheku

May 20, 2016, 6:47:49 PM
On 2016-05-20, Marc de Bourget <marcde...@gmail.com> wrote:
> "strtonum" reminds me of my Delphi "strtoint" function. For some
> reason, I don't like the name "strtonum" very much because it implies
> kind of persistence.
> I'm just wondering why there aren't any persistent data types like
> integer in AWK and other scripting languages. The data type of a
> variable can change with each line of the script, so we can never be
> shure of the actual type (unless we know very exactly the rules). Why
> can't we simply declare "int views" like in C if we want to be shure
> the "views" variable is always an integer? Of course if so, "inttostr"
> and "strtoint" conversion functions were needed.

No clue what you're talking about here. Persistence usually refers
to putting objects into non-volatile storage (disk, etc) so they get
revived if you stop your system and re-start it.

Variables are usually stable in mainstream scripting languages,
including Awk. If you have some datum in variable X, the expression X+0
doesn't affect what is stored in X. Only an assignment to X will do
that. X + 0 retrieves the value in X, and then converts it; but the
original is left alone.

There are plenty of strongly typed dynamic languages to choose from
if you don't like "duck typing" such as strings which look like numbers
being acceptable to numeric operators.

Python:

>>> 2 + "3"
Traceback (most recent call last):
File "<stdin>", line 1, in <module>
TypeError: unsupported operand type(s) for +: 'int' and 'str'


Lisp:

[1]> (+ 2 "3")

*** - +: "3" is not a number

Ruby:

2 + "3"
String can't be coerced into Fixnum
(repl):1:in `+'
(repl):1:in `initialize'
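
For contrast, a small sketch of how Awk treats the same kind of expression: the operator decides the conversion, and no diagnostic is issued. The coerce_demo wrapper name is invented for the example.

```shell
# Awk converts silently in whichever direction the operator requires.
coerce_demo() {
    awk 'BEGIN {
        print 2 + "3"     # arithmetic: the string is converted to a number
        print 2 "3"       # juxtaposition: the number is converted to a string
        print "3x" + 2    # only the leading numeric prefix of the string is used
    }'
}

coerce_demo
```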

Marc de Bourget

May 21, 2016, 4:24:21 AM
Neither in Ruby nor in Python can you declare "int views" like in C.
There isn't anything better; the behaviour is even more cumbersome:
I really don't like Ruby messages like "can't convert nil into String".

So what I meant by "persistent" is that I was wondering why it isn't possible to declare "int views" like in C to ensure that an integer always stays an integer.
This would prevent the type of "views" from being changed, as in Harriet's example:
gsub(/<\/?[^>]*>/,"",views)

My question was more theoretical: Why can't you declare "int views" like in C? Of course this would change the AWK language too dramatically and will not happen. But what are the main reasons for that, I assume simplicity?

Janis Papanagnou

May 21, 2016, 4:53:47 AM
On 21.05.2016 10:24, Marc de Bourget wrote:
> Neither in Ruby nor in Python you can declare "int views" like in C. [...]

What are "int views" supposed to be? I've never heard that; neither in context
of the C language nor generally. - You better try to use standard terminology.

>
> So what I meant by "persistent" is that I was wondering why it isn't

(As pointed out already elsethread, "persistent" is also a wrong term here.)

> possible to declare "int views" like in C to ensure an integer always to be
> an integer.

You mean a variable declaration of [fixed] integer type?

It's not possible in Awk because it has another paradigm than languages with
fixed (or even strong) typing.

> This would avoid the type of "views" to be changed like in
> Harriet's example: gsub(/<\/?[^>]*>/,"",views)
>
> My question was more theoretical: Why can't you declare "int views" like in
> C? Of course this would change the AWK language too dramatically and will
> not happen. But what are the main reasons for that, I assume simplicity?

Simplicity is certainly one reason. (You can learn Awk in a few hours.)
And not declaring variables allows (in conjunction with default values for
variables) for short programs, e.g. the typical one-liners.

Note that in Awk there's a lot missing that full blown programming languages
have; very obvious is (with the exception of associative arrays) the complete
lack of data structures.

I suggest refraining from complaints of the form "Why does Awk not support
feature XYZ that is supported in language ABC?" before understanding (from
reading or from your own experience) what Awk is designed for.

Janis

Marc de Bourget

May 21, 2016, 5:14:46 AM
Le samedi 21 mai 2016 10:53:47 UTC+2, Janis Papanagnou a écrit :

Thank you, Janis!
> What are "int views" supposed to be? I've never heard that; neither in context
> of the C language nor generally. - You better try to use standard terminology.

"int" is "int" in C like in "int main(void)" and "views" is the variable Harriet used, so the declaration in C would be "int views".

> (As pointed out already elsethread, "persistent" is also a wrong term here.)

Maybe this word is unusual in this context (I'm not a native English speaker) but I can't see it is wrong.

>
> You mean a variable declaration of [fixed] integer type?

Yes!

> I suggest to abstain from complaints of the form "Why does Awk not support
> feature XYZ that is supported in language ABC?" before understanding (from
> lecture or own experience) what Awk is designed for.

This wasn't meant in any way as a reproach or complaint, just a question derived from the original issue of the thread. Maybe if there was the need to declare variables I wouldn't like it anymore because this would no longer be "AWK".

Mike Sanders

May 21, 2016, 10:14:37 AM
Janis Papanagnou <janis_pa...@hotmail.com> wrote:

> What are "int views" supposed to be? I've never heard that;
> neither in context of the C language nor generally. - You
> better try to use standard terminology.

Yet, he is... please consider:

int x = 0; /* standard c */

> ...the complete lack of data structures.

<http://troubleshooters.com/codecorn/awk/index.htm#_Using_Arrays_to_Structure_Data>

#include <stdio.h>

/* a record with three string members */
struct person {
    char * lname;
    char * fname;
    char * phone;
};

void printperson (struct person *p){
    printf("%s\n", p->lname);
    printf("%s\n", p->fname);
    printf("%s\n", p->phone);
}
int main(int argc, char * argv[]){
    struct person p;
    p.lname="Litt";
    p.fname="Steve";
    p.phone="123-456-7890";
    printperson(&p);
    return 0;
}

#!/usr/local/bin/mawk -We

function printperson(p){
print p["lname"]
print p["fname"]
print p["phone"]
}

BEGIN{
person["lname"] = "Litt"
person["fname"] = "Steve"
person["phone"] = "123-456-7890"
printperson(person)
exit 0
}
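
The same idea stretches to several records by putting a record id into the subscript, much like the totalviews[$1,date] arrays earlier in this thread. This is only a sketch: it emulates a table of records with flat keys and is not a nested type; the people_demo name and the second person are invented for the example.

```shell
# Flat emulation of an array of records: the subscript is (id, field name),
# which awk joins into a single key using SUBSEP.
people_demo() {
    awk 'BEGIN {
        # one "record" per id; the second person is invented for the example
        people[1, "lname"] = "Litt"; people[1, "fname"] = "Steve"
        people[2, "lname"] = "Doe";  people[2, "fname"] = "Jane"
        for (id = 1; id <= 2; id++)
            print people[id, "fname"], people[id, "lname"]
    }'
}

people_demo
```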

--
Mike Sanders
www.peanut-software.com

Janis Papanagnou

May 21, 2016, 11:03:25 AM
On 21.05.2016 11:14, Marc de Bourget wrote:
> Le samedi 21 mai 2016 10:53:47 UTC+2, Janis Papanagnou a écrit :
>
> Thank you, Janis!
>> What are "int views" supposed to be? I've never heard that; neither in
>> context of the C language nor generally. - You better try to use standard
>> terminology.
>
> "int" is "int" in C like in "int main(void)" and "views" is the variable
> Harriet used, so the declaration in C would be "int views".

Aha. But you don't need to explain the C language to me. If you'd
posted context or written a complete statement ("int views;")
it might have been more obvious and not been taken for a technical term.

>
>> (As pointed out already elsethread, "persistent" is also a wrong term
>> here.)
>
> Maybe this word is unusual in this context (I'm not a native English
> speaker) but I can't see it is wrong.

Someone else explained it already upthread. (I'm also no native speaker.)

>
>>
>> You mean a variable declaration of [fixed] integer type?
>
> Yes!
>
>> I suggest to abstain from complaints of the form "Why does Awk not
>> support feature XYZ that is supported in language ABC?" before
>> understanding (from lecture or own experience) what Awk is designed for.
>
> This wasn't meant in any way as a reproach or complaint, just a question
> derived from the original issue of the thread. Maybe if there was the need
> to declare variables I wouldn't like it anymore because this would no
> longer be "AWK".

Yes, it would be another language then. Declaring variables makes sense.
Strong typing makes sense. But you have to view language design decisions
in the respective contexts. Now what's the aim of your question, which was
"Why can't you declare "int views" like in C?"? The simple answer is:
"Because there's neither a syntax nor an operational semantics defined
for that in Awk." Or even simpler: "It's just impossible to do in Awk.".
If the intent of your question was to get a rationale for the authors'
design decision in the 1970's, I suppose we can only speculate (or is
there some paper about that?). What we see are the actual advantages for
terse programs.

Janis

Janis Papanagnou

May 21, 2016, 11:12:06 AM
On 21.05.2016 16:14, Mike Sanders wrote:
> Janis Papanagnou <janis_pa...@hotmail.com> wrote:
>
> [C code snipped]
>
> [rudimentary awk code emulation of the C code snipped]

On that line of argumentation some people say that C is Object Oriented
because with pointers, function-pointers, and structs you can build the
same operational semantics as in C++. (And this is of course nonsense.)

Have a look into true data structuring; elementary structured types and
compositions of such types. You simply can't express that in awk.

Janis

Mike Sanders

May 21, 2016, 11:34:28 AM
Janis Papanagnou <janis_pa...@hotmail.com> wrote:

> (And this is of course nonsense.)

Such a stern teacher... Well, maybe he did not, but I'd say he made a
good effort nevertheless towards disambiguating the idea of structs.

He (like several of us) is not the awk guru you are, Janis. We need patience
& understanding to coax these ideas into code & it's always nice to have
help. =)

--
Mike Sanders
www.peanut-software.com

Kaz Kylheku

May 21, 2016, 2:15:06 PM
On 2016-05-21, Marc de Bourget <marcde...@gmail.com> wrote:
> Neither in Ruby nor in Python you can declare "int views" like in C.
> There isn't anything better, the behaviour is even more cumbersome:
> I really don't like Ruby messages like "can't convert nil into String".

So you don't like strong types, and diagnostics.

> So what I meant by "persistent" is that I was wondering why it isn't possible to declare "int views" like in C to ensure an integer always to be an integer.
> This would avoid the type of "views" to be changed like in Harriet's example:
> gsub(/<\/?[^>]*>/,"",views)

But you want C declarations?

You do realize that if you could statically declare views to be integer,
then you would have a type mismatch. Something like:

foo.awk:42:argument 3 of gsub expected to be of type string, not int.

But this is "cumbersome", like Ruby complaining that a nil isn't a
String, right?

Or else, gsub would have to be overloaded and convert the int to
string for you. But you don't like that either because then it is not
"avoid[ing] the type of 'views' [being] changed".

I suspect you need to learn more about programming language design (not
to mention computer science in general), in order to know exactly what
you want: what the choices are and what are the tradeoffs involved and
deeper implications of type systems on program structure and what is
easy to express and what isn't.

Marc de Bourget

May 21, 2016, 4:01:53 PM
Le samedi 21 mai 2016 20:15:06 UTC+2, Kaz Kylheku a écrit :

> I suspect you need to learn more about programming language design (not
> to mention computer science in general), in order to know exactly what
> you want: what the choices are and what are the tradeoffs involved and
> deeper implications of type systems on program structure and what is
> easy to express and what isn't.

Hi Kaz,

as Mike mentioned very correctly, Janis and you take this all too seriously.

I'm not an expert like you but I think I have enough experience in programming (Delphi, AWK, Ruby, Python, VBScript, ...), so I'm just thinking a bit about the pros and cons of real data types and about the ideal scripting language. With Delphi (Object Pascal) I don't have any problems with Integer and String data types and using inttostr and strtoint, so I'm wondering why they don't exist in scripting languages.

Probably at the end of these reflections the result will be that the AWK handling of data types is the best. Maybe Arnold knows more about the original intentions of the inventors of AWK in the seventies (if he likes to join this conversation). AFAIK one of the AWK inventors was also one of the C inventors. It is really interesting how it is possible to create both a very complicated language like C (with pointers and structures) and a very primitive language (in the best sense of the word) like AWK.

Ed Morton

May 21, 2016, 8:26:41 PM
The C vs Awk comparison is, at best, comparing apples and oranges. C is a
general purpose programming language. Awk is a domain specific language for
manipulating text. Although they have syntactic similarities, they're for 2 very
different purposes. So far we've been discussing what C has that Awk doesn't
(data structures and types) but you can flip that:

Why doesn't C have an implicit `while read` loop that splits each line it reads
into space-separated fields like Awk does?
Why does C force you to declare variables when Awk lets you just use them?
Why in C does it cause a memory error when you write past the end of an array
when Awk just creates the new entries for you?
And so on...
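
A minimal sketch of the Awk conveniences listed above, assuming a POSIX awk and a made-up three-line input; the function name `count_by_first_field` and the array name `seen` are hypothetical, chosen only for illustration:

```shell
# Count how often each first field occurs (sample input below is invented).
count_by_first_field() {
    awk '
        # No explicit read loop: this action runs once per input line,
        # and $1 is already the first whitespace-separated field.
        # The array seen[] is never declared; each entry is created on first use.
        { seen[$1]++ }
        END {
            for (k in seen) printf "%s %d\n", k, seen[k]
        }
    ' | sort    # for-in order is unspecified, so sort for stable output
}

printf 'a 1\nb 2\na 3\n' | count_by_first_field
```

With that input the sketch should print `a 2` and `b 1`. The C equivalent would need a read loop, a tokenizer, declarations, and some kind of hash table.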

The answer is the same as why Awk doesn't have data structures and types -
because that's just not the purpose of that language. You can have the same
discussion about shell, or the language used in Makefiles, or anything else that
was created for a specific purpose in a specific domain and has a language
associated with it. If you try to create a language that's all things to all
people you end up with an incomprehensible mess, possibly starting with a p :-).

Ed.

Marc de Bourget

May 22, 2016, 3:38:55 AM5/22/16
to
On Sunday, 22 May 2016 02:26:41 UTC+2, Ed Morton wrote:
> The answer is the same as why Awk doesn't have data structures and types -
> because that's just not the purpose of that language. You can have the same
> discussion about shell, or the language used in Makefiles, or anything else that
> was created for a specific purpose in a specific domain and has a language
> associated with it. If you try to create a language that's all things to all
> people you end up with an incomprehensible mess, possibly starting with a p :-).
>
> Ed.

Hi Ed,
yes, thank you, I agree about the differences in purpose between AWK and C.
About the incomprehensible mess, possibly starting with a p ...
Do you mean Perl or Python?

If you mean Python, I must admit I find Python very good in general.
I wouldn't consider it messy; the mere existence of significant indentation makes for clean code. Programs written in Python run much faster than programs written in Ruby. Ruby is not bad, but it is really slow.

There are some things I don't like, but Python is good; to tell the truth, it is extremely good.
However, there are still a lot of things much easier to get done with AWK.
So every language has its own main purpose where it is best.

Ed Morton

May 22, 2016, 8:43:30 AM5/22/16
to
On 5/22/2016 2:38 AM, Marc de Bourget wrote:
> On Sunday, 22 May 2016 02:26:41 UTC+2, Ed Morton wrote:
>> The answer is the same as why Awk doesn't have data structures and types -
>> because that's just not the purpose of that language. You can have the same
>> discussion about shell, or the language used in Makefiles, or anything else that
>> was created for a specific purpose in a specific domain and has a language
>> associated with it. If you try to create a language that's all things to all
>> people you end up with an incomprehensible mess, possibly starting with a p :-).
>>
>> Ed.
>
> Hi Ed,
> yes, thank you, I agree about the differences in purpose between AWK and C.
> About the incomprehensible mess, possibly starting with a p ...
> Do you mean Perl or Python?

I do not mean Python (http://www.zoitz.com/archives/13).

Ed.

Andrew Schorr

May 22, 2016, 10:04:14 AM5/22/16
to
On Friday, May 20, 2016 at 12:09:24 AM UTC-4, Kaz Kylheku wrote:
> But in this case, it's just due to the way gawk printed it by default
> when no formatting was given, which we can fix:
>
> $ gawk 'BEGIN {x = "5.555555555"; y = x+0; printf("%.15g\n", y); print x}'
> 5.555555555
> 5.555555555
>
> 5.555555555 is well below the 15 digit limit of the ability of an IEEE
> 64 bit double to preserve the decimal digits.

The point is that it is impossible to set CONVFMT to a value that will always reproduce the original string. Here's another example that may help make this clearer:

bash-4.2$ gawk 'BEGIN {x = "1e2"; y = x+0; print x; print y; print (y == x)}'
1e2
100
0

The results are a bit different when x is a strnum instead of a string:

bash-4.2$ gawk 'BEGIN {split("1e2", f); x = f[1]; y = x+0; print x; print y; print (y == x)}'
1e2
100
1

The bottom line is that strings and strnums have subtly different properties that can impact your results.
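
As a hedged illustration of how this plays out in the comparison that started the thread, here is a minimal sketch assuming a POSIX awk; the function name `compare_demo` is hypothetical, and 37 and 7 are the values from the original question:

```shell
# Show numeric vs. string comparison of the same two input fields.
compare_demo() {
    printf '37 7\n' | awk '{
        # Fields that look numeric are strnums, so this compares as numbers.
        print ($1 > $2)
        # Concatenating "" forces string type, so "37" sorts before "7".
        sa = $1 ""; sb = $2 ""
        print (sa > sb)
        # Adding 0 forces a numeric comparison again.
        print ((sa + 0) > (sb + 0))
    }'
}

compare_demo
```

This should print 1, 0 and 1 in that order: the same two values compare differently depending on whether awk treats them as strnums, strings, or numbers.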

Regards,
Andy

Mike Sanders

May 22, 2016, 2:36:54 PM5/22/16
to
Ed Morton <morto...@gmail.com> wrote:

> I do not mean Python (http://www.zoitz.com/archives/13).

Chuckle, recalls a sig line I once read:

Perl: The only language that looks the
same before and after AES encryption...

--
Mike Sanders
www.peanut-software.com

Marc de Bourget

May 23, 2016, 1:58:34 AM5/23/16
to
Hi Ed, thank you for clarifying which language you meant :-)
