On 23.09.2023 13:39, Michael Sanders wrote:
> On Saturday, September 23, 2023 at 3:45:10 AM UTC-5, Janis Papanagnou wrote:
> [...]
>> Is it guaranteed that the expression evaluates correctly in all
>> awks? The intermediate results can become as large as 70866960411
>> (0x108000001B, which is larger than 32 bit), which needs to be
>> representable in awk (not sure what POSIX awk defines/requires
>> here).
>
> Huh? From my initial post in this thread:
>
> # modulo simulates 32-bit unsigned integer
> hash = (hash * 33 + ord(char)) % 2147483647
>
> Modulo wraps at 2147483647, it'll never outgrow
> its limits. Maybe I'm not grokking what you meant.
With this formula the hash variable may (because of the modulus)
get values up to 2147483647-1, so within the expression the value
may grow beyond that limit to (2147483647-1)*33+126 because of
the previous (potentially large) hash value calculated.
(There's a good chance that modern systems and languages calculate
that without problem, but if we just assume "32-bit integer" then
it will (typically implicitly) overflow and produce wrong results.
A comment in the code may be useful and is often sufficient here,
so that folks using/porting the code to their environment may have
a visible caveat and may check that. - Myself I haven't inspected
the POSIX specs on that, that's why I formulated it as a question.)
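One way to check a particular awk is to let it compute the worst case
directly. This is only a sketch: the function name check_intermediate is
made up for illustration, and it assumes the largest character code is
126, as in the table from the original post. If the awk in question keeps
numbers as double-precision floats (common, but worth verifying for your
implementation), both values come out exact.

```shell
# Hypothetical helper: prints the largest intermediate value of the
# expression and its remainder after the modulus.
check_intermediate() {
    awk 'BEGIN {
        m = 2147483647
        worst = (m - 1) * 33 + 126    # largest value before the modulus is applied
        # %.0f is used because some awks may clamp large values printed with %d
        printf "%.0f %.0f\n", worst, worst % m
    }'
}
check_intermediate
```

An awk that handles the range prints "70866960444 93"; any other output
means the intermediate value was not represented exactly.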
>> [...]
>>
>> (I would also try to avoid name clashes like ord() and ord[],
>> BTW. Here it's simple, because the function is superfluous;
>> just replace the ord(char) at the calling side by ord[char]
>> and remove the function completely.)
>
> And yet, this rationale assumes an extension to the language,
> which implies *only* gawk though right? [...]
Maybe I misunderstand you. My suggestion here is not relying on
extensions; it's basically just a change of the expression marked
with '#<<':
  for(i = 1; i <= n; i++) {
    char = substr(str, i, 1)
    hash = (hash * 33 + ord[char]) % 2147483647 #<<
  }
  # function ord(char) ## deleted
  # once in the BEGIN section (as you've done it)
  for(i=32;i<=126;i++) ord[sprintf("%c", i)] = i
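Put together as something runnable, the array-only variant might look
like the sketch below. The wrapper name djb_hash and the start value 5381
are assumptions for illustration (the start value used in the original
post is not shown here); only the loop body and the table setup are taken
from the snippets above.

```shell
# Hypothetical wrapper: hashes its first argument with the ord[] table
# instead of an ord() function. Uses only standard awk features.
djb_hash() {
    printf '%s\n' "$1" | awk '
    BEGIN {
        # lookup table, built once; characters outside 32..126 stay unset
        for (i = 32; i <= 126; i++) ord[sprintf("%c", i)] = i
    }
    {
        hash = 5381                    # assumed start value, not from the post
        n = length($0)
        for (i = 1; i <= n; i++) {
            char = substr($0, i, 1)
            # an unset ord[char] counts as 0 in the arithmetic
            hash = (hash * 33 + ord[char]) % 2147483647
        }
        printf "%.0f\n", hash
    }'
}
djb_hash "ab"
```

With the assumed start value, the call above prints 5863208.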
>
>> What will the code produce with, say, "extended" ASCII, or the
>> common ISO 8859-x family of character sets? Is there any reason
>> to restrict it here?
>>
>> What are the criteria for the chosen character subset? How will
>> or how should a TAB character in the data change the result?
>
> No issues with TAB (see my function ord()), & non 7bit character
> sets? Dunno... only need lower ASCII here, hence the hard-coded
> limit:
>
> min: 32, max: 126
I understand that. My question about TAB was aiming at what hash this
sample text should produce:
  Hello<Blank>brave<Blank>new<Tab>World!<Newline>
Shall the Blank be part of the hash but the Tab not? - Appears to
me to be inconsistent, and if it is "as designed" it should have
an explicit and clear comment on that difference (and also that
(and maybe also why) e.g. [other] control characters [deliberately]
evaluate to 0).
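To make the difference visible, the table lookup can be printed for a few
characters. This is just an illustrative sketch (the name show_ord is made
up); it builds the same 32..126 table and shows what Blank, Tab and
Newline contribute to the sum.

```shell
# Hypothetical helper: shows the value each character adds to the hash
# when the table only covers 32..126.
show_ord() {
    awk 'BEGIN {
        for (i = 32; i <= 126; i++) ord[sprintf("%c", i)] = i
        # unset entries are empty strings, which %d prints as 0
        printf "blank=%d tab=%d newline=%d\n", ord[" "], ord["\t"], ord["\n"]
    }'
}
show_ord
```

This prints "blank=32 tab=0 newline=0", so a Tab is treated like any
other character that is missing from the table.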
>
> If so compelled, test other sets & post your findings here.
Not sure what that means - by "sets" you mean "code-sets"? - ...
> Would be interested to read what you come up with.
...if so then it would be simple; I'd include the whole range of
8-bit characters (including control characters), i.e. 0..255 to
cover all octet streams (and also take care not to exclude the
\n, or RS and FS in Awk parlance).
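A rough sketch of that idea follows; the name octet_hash and the start
value 5381 are assumptions for illustration. It reads standard input,
forces the C locale so that characters are handled as single bytes, and
adds back the newline that awk strips as record separator. Two caveats:
a final line without a newline is hashed as if it had one, and NUL bytes
in the input are handled differently across awk implementations.

```shell
# Hypothetical wrapper: hashes all of stdin with a table covering 0..255.
octet_hash() {
    LC_ALL=C awk '
    BEGIN {
        for (i = 0; i <= 255; i++) ord[sprintf("%c", i)] = i
        hash = 5381                    # assumed start value
    }
    {
        line = $0 "\n"                 # restore the newline removed as RS
        n = length(line)
        for (i = 1; i <= n; i++)
            hash = (hash * 33 + ord[substr(line, i, 1)]) % 2147483647
    }
    END { printf "%.0f\n", hash }'
}
printf 'a\n' | octet_hash
```

With these assumptions the call above prints 5863120, and a Tab now
changes the result just like a Blank does.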
>> As I said, just a couple of more or less obvious questions.
>
> Thanks Janis, as always, appreciate your insights. Hoping to get
> back to using tin as my news reader soon, the google groups
> interface is perhaps not my cup of tea. :/
In your private environment at least you are free to choose. The
horror is when a company forces you to switch from Unix to Windows,
when all applications are getting Web based, and if you're not
allowed to install "local" tools.
Janis