Google Groups no longer supports new Usenet posts or subscriptions. Historical content remains viewable.
Dismiss

djb2 hash using awk

85 views
Skip to first unread message

Michael Sanders

unread,
Sep 22, 2023, 3:50:10 PM9/22/23
to

Greetings folks, just sharing the knowledge (& test posting using google groups).

Beware wordwrap...

function djb2(str, hash, i, char) {

# initialize hash to an arbitrary prime number
hash = 5381
n = length(str)

# iterate over characters and compute hash
for(i = 1; i <= n; i++) {
char = substr(str, i, 1)
# modulo simulates 32-bit unsigned integer
hash = (hash * 33 + ord(char)) % 2147483647
}

return hash
}

function ord(char) {

return (char ~ /[[:digit:]]/ ? int(char) : int(ord[char]))

}

BEGIN {

for(i=32;i<=126;i++) ord[sprintf("%c", i)] = i

print djb2("my key is...")

}

notes:

https://stackoverflow.com/questions/1579721/why-are-5381-and-33-so-important-in-the-djb2-algorithm

https://stackoverflow.com/questions/10696223/reason-for-the-number-5381-in-the-djb-hash-function

https://groups.google.com/g/comp.lang.c/c/lSKWXiuNOAk

https://en.wikipedia.org/wiki/Daniel_J._Bernstein

Janis Papanagnou

unread,
Sep 23, 2023, 4:45:10 AM9/23/23
to
On 22.09.2023 21:50, Michael Sanders wrote:
>
> Greetings folks, just sharing the knowledge (& test posting using google groups).

Hello,

the post through Google groups seems technically fine - it certainly
wasn't in the past -, only Usenet posting conventions seem violated
(line length).

But the Awk code rises a couple questions...

> Beware wordwrap...
>
> function djb2(str, hash, i, char) {
>
> # initialize hash to an arbitrary prime number
> hash = 5381
> n = length(str)

Variable 'n' not declared as local (just for good measure, and in
case you want to use the code in other code contexts).

>
> # iterate over characters and compute hash
> for(i = 1; i <= n; i++) {
> char = substr(str, i, 1)
> # modulo simulates 32-bit unsigned integer

The previous comment suggests that the subsequent code should use
the constant 2147483648 if you want a 32-bit unsigned integer. On
the other hand such formulas often use primes (and 2147483647 is
one). So it leaves me with uncertainty what is actually intended.

> hash = (hash * 33 + ord(char)) % 2147483647

Is it guaranteed that the expression evaluates correctly in all
awks? The intermediate results can become as large as 70866960411
(0x108000001B, which is larger than 32 bit), which needs to be
representable in awk (not sure what POSIX awk defines/requires
here).

> }
>
> return hash
> }
>
> function ord(char) {
>
> return (char ~ /[[:digit:]]/ ? int(char) : int(ord[char]))

Why use this costly pattern match and conditional expression
if you can anyway simply access the ord[] array where all data
is already present?

(I would also try to avoid name clashes like ord() and ord[],
BTW. Here it's simple, because the function is superfluous;
just replace the ord(char) at the calling side by ord[char]
and remove the function completely.)

>
> }
>
> BEGIN {
>
> for(i=32;i<=126;i++) ord[sprintf("%c", i)] = i

What will the code produce with, say, "extended" ASCII, or the
common ISO 8859-x family of character sets? Is there any reason
to restrict it here?

What are the criteria for the chosen character subset? How will
or how should a TAB character in the data change the result?

>
> print djb2("my key is...")
>
> }

As I said, just a couple of more or less obvious questions.

Janis

Janis Papanagnou

unread,
Sep 23, 2023, 4:49:01 AM9/23/23
to
On 23.09.2023 10:45, Janis Papanagnou wrote:
> On 22.09.2023 21:50, Michael Sanders wrote:
>> [...]
>> # modulo simulates 32-bit unsigned integer
>
> The previous comment suggests that the subsequent code should use
> the constant 2147483648 if you want a 32-bit unsigned integer. [...]

Erm..., _unsigned_ 32 bit integer would of course mean 4294967296.

Janis

Manuel Collado

unread,
Sep 23, 2023, 4:54:29 AM9/23/23
to
There is something wrong:

$ LC_ALL=C gawk 'function ord(char) {
return (char ~ /[[:digit:]]/ ? int(char) : int(ord[char]))
}

BEGIN {print ord("x")}'
gawk: cmd. line:3: error: function `ord' called with space between name
and `(',
or used as a variable or an array

Please note ord[char] vs. ord(char)

HTH

El 22/9/23 a las 21:50, Michael Sanders escribió:
--
Manuel Collado - http://mcollado.z15.es

Michael Sanders

unread,
Sep 23, 2023, 7:39:07 AM9/23/23
to
On Saturday, September 23, 2023 at 3:45:10 AM UTC-5, Janis Papanagnou wrote:

> Variable 'n' not declared as local (just for good measure, and in
> case you want to use the code in other code contexts).

Aye, I spotted this *after* I posted (of course...). Updated
function signature below:

function djb2(str, hash, n, i, char)

> Is it guaranteed that the expression evaluates correctly in all
> awks? The intermediate results can become as large as 70866960411
> (0x108000001B, which is larger than 32 bit), which needs to be
> representable in awk (not sure what POSIX awk defines/requires
> here).

Huh? From my initial post in this thread:

# modulo simulates 32-bit unsigned integer
hash = (hash * 33 + ord(char)) % 2147483647

Modulo wraps at 2147483647, it'll never outgrow
it limits. Maybe I'm not grokking what you meant.

And down-thread you remarked:

> Erm..., _unsigned_ 32 bit integer would of course mean 4294967296

That's the max, but certainly not the only possibility...

> [...]
>
> (I would also try to avoid name clashes like ord() and ord[],
> BTW. Here it's simple, because the function is superfluous;
> just replace the ord(char) at the calling side by ord[char]
> and remove the function completely.)

And yet, this rational assumes an extension to the language,
which implies *only* gawk though right? What about the *standard*
awks (tested dbj2() using only mawk & busybox's awk thus far -
the smallest implementations out there). Still, have you by chance
invoked gawk as:

gawk --traditional

I cant on my end as the hard drive died on porkchop.bsd,
my OpenBSD box...

> What will the code produce with, say, "extended" ASCII, or the
> common ISO 8859-x family of character sets? Is there any reason
> to restrict it here?
>
> What are the criteria for the chosen character subset? How will
> or how should a TAB character in the data change the result?

No issues with TAB (see my function ord()), & non 7bit character
sets? Dunno... only need lower ASCII here, hence the hard-coded
limit:

min: 32, max: 126

If so compelled, test other sets & post your findings here.
Would be interested to read what you come up with.

> As I said, just a couple of more or less obvious questions.

Thanks Janis, as always, appreciate your insights. Hoping to get
back to using tin as my news reader soon, the google groups
interface its perhaps not my cup of tea. :/

Michael Sanders

unread,
Sep 23, 2023, 7:42:54 AM9/23/23
to
Hi Manuel.

Question, have you tried: --traditional or --posix options?

See also:

https://www.gnu.org/software/gawk/manual/html_node/Compatibility-Mode.html

Michael Sanders

unread,
Sep 23, 2023, 8:25:41 AM9/23/23
to
On Saturday, September 23, 2023 at 3:45:10 AM UTC-5, Janis Papanagnou wrote:

> (I would also try to avoid name clashes like ord() and ord[],
> BTW. Here it's simple, because the function is superfluous;
> just replace the ord(char) at the calling side by ord[char]
> and remove the function completely.)

One more quick question... does gawk have an ord() builtin
function or not?

Janis Papanagnou

unread,
Sep 23, 2023, 9:17:37 AM9/23/23
to
On 23.09.2023 13:39, Michael Sanders wrote:
> On Saturday, September 23, 2023 at 3:45:10 AM UTC-5, Janis Papanagnou wrote:
> [...]
>> Is it guaranteed that the expression evaluates correctly in all
>> awks? The intermediate results can become as large as 70866960411
>> (0x108000001B, which is larger than 32 bit), which needs to be
>> representable in awk (not sure what POSIX awk defines/requires
>> here).
>
> Huh? From my initial post in this thread:
>
> # modulo simulates 32-bit unsigned integer
> hash = (hash * 33 + ord(char)) % 2147483647
>
> Modulo wraps at 2147483647, it'll never outgrow
> it limits. Maybe I'm not grokking what you meant.

With this formula the hash variable may (because of the modulus)
get values up to 2147483647-1, so within the expression the value
may grow beyond that limit to (2147483647-1)*33+126 because of
the previous (potentially large) hash value calculated.

(There's a good chance that modern systems and languages calculate
that without problem, but if we just assume "32-bit integer" then
it will (typically implicitly) overflow and produce wrong results.
A comment in the code may be useful and is often sufficient here,
so that folks using/porting the code to their environment may have
a visible caveat and may check that. - Myself I haven't inspected
the POSIX specs on that, that's why I formulated it as question.)

>> [...]
>>
>> (I would also try to avoid name clashes like ord() and ord[],
>> BTW. Here it's simple, because the function is superfluous;
>> just replace the ord(char) at the calling side by ord[char]
>> and remove the function completely.)
>
> And yet, this rational assumes an extension to the language,
> which implies *only* gawk though right? [...]

Maybe I misunderstand you. My suggestion here is not relying on
extensions; it's basically just a change of the expression '#<<'

for(i = 1; i <= n; i++) {
char = substr(str, i, 1)
hash = (hash * 33 + ord[char]) % 2147483647 #<<
}

# function ord(char) ## deleted

# once in the BEGIN section (as you've done it)
for(i=32;i<=126;i++) ord[sprintf("%c", i)] = i

>
>> What will the code produce with, say, "extended" ASCII, or the
>> common ISO 8859-x family of character sets? Is there any reason
>> to restrict it here?
>>
>> What are the criteria for the chosen character subset? How will
>> or how should a TAB character in the data change the result?
>
> No issues with TAB (see my function ord()), & non 7bit character
> sets? Dunno... only need lower ASCII here, hence the hard-coded
> limit:
>
> min: 32, max: 126

I understand that. My question with TAB was aiming at what this
sample text should calculate

Hello<Blank>brave<Blank>new<Tab>World!<Newline>

Shall the Blank be part of the hash but the Tab not? - Appears to
me to be inconsistent, and if it is "as designed" it should have
an explicit and clear comment on that difference (and also that
(and maybe also why) e.g. [other] control characters [deliberately]
evaluate to 0).

>
> If so compelled, test other sets & post your findings here.

Not sure what that means - by "sets" you mean "code-sets"? - ...

> Would be interested to read what you come up with.

...if so then it would be simple; I'd include the whole range of
8-bit characters (including control characters), i.e. 0..255 to
cover all octet streams (and take also means to not exclude the
\n, or RS and FS in Awk parlance).

>> As I said, just a couple of more or less obvious questions.
>
> Thanks Janis, as always, appreciate your insights. Hoping to get
> back to using tin as my news reader soon, the google groups
> interface its perhaps not my cup of tea. :/

In your private environment at least you are free to choose. The
horror is when a company forces you to switch from Unix to Windows,
when all applications are getting Web based, and if you're not
allowed to install "local" tools.

Janis

Janis Papanagnou

unread,
Sep 23, 2023, 9:21:14 AM9/23/23
to
Not that I know of. - I seem to recall that when I once needed that
I implemented an array (as you did). - There's a chance that GNU Awk
has something in some extension libraries (but I'm too lazy to check).

Janis

Ben Bacarisse

unread,
Sep 23, 2023, 5:29:40 PM9/23/23
to
Michael Sanders <chrom...@gmail.com> writes:

> Greetings folks, just sharing the knowledge (& test posting using
> google groups).
>
> Beware wordwrap...
>
> function djb2(str, hash, i, char) {
>
> # initialize hash to an arbitrary prime number
> hash = 5381
> n = length(str)
>
> # iterate over characters and compute hash
> for(i = 1; i <= n; i++) {
> char = substr(str, i, 1)
> # modulo simulates 32-bit unsigned integer
> hash = (hash * 33 + ord(char)) % 2147483647

This modulus does nor do what you claim you want to do (as has been
pointed out). To simulate 32-bit unsigned arithmetic you need

hash = (hash * 33 + ord(char)) % 4294967296

here. Mind you, there is no formal definition of what the DJB2 hash is.
The only reliable reference is a C function that uses unsigned long int
calculations. This could be 32-bits wide, but is often in wider.

> }
>
> return hash
> }
>
> function ord(char) {
>
> return (char ~ /[[:digit:]]/ ? int(char) : int(ord[char]))

What version of AWK, run with what flags, accepts this? I can't get it
past gawk, nawk, mawk nor original-awk.

> }
>
> BEGIN {
>
> for(i=32;i<=126;i++) ord[sprintf("%c", i)] = i
>
> print djb2("my key is...")
>
> }

--
Ben.

Michael Sanders

unread,
Sep 23, 2023, 10:20:53 PM9/23/23
to
On Saturday, September 23, 2023 at 4:29:40 PM UTC-5, Ben Bacarisse wrote:

> This modulus does nor do what you claim you want to do (as has been
> pointed out).

No sir, it does, in fact.

# modulo simulates 32-bit unsigned integer

https://en.wikipedia.org/wiki/2,147,483,647

https://www.bing.com/search?FORM=U523DF&PC=U523&q=is+this+2147483647+a+32-bit+number

https://www.google.com/search?q=is+this+2147483647+a+32-bit+number

> What version of AWK, run with what flags, accepts this? I can't get it
> past gawk, nawk, mawk nor original-awk.

https://gnuwin32.sourceforge.net/packages/mawk.htm

https://frippery.org/busybox/

Michael Sanders

unread,
Sep 23, 2023, 10:50:44 PM9/23/23
to
On Friday, September 22, 2023 at 2:50:10 PM UTC-5, Michael Sanders wrote:

> Greetings folks, just sharing the knowledge (& test posting using google groups).

Note to self: built a version for gawk & tested online here:

https://ideone.com/ZZVQl2

script follows:

function djb2gawk(str, n, i, hash, ascii) {

hash = 5381
n = length(str)

for(i = 1; i <= n; ++i) {
char = substr(str, i, 1)
ascii = alphanum(char)
hash = (hash * 33 + ascii) % ((2^32) - 1)
}

return hash
}

function alphanum(char) {

return index("abcdefghijklmnopqrstuvwxyzABCDEFGHIJKLMNOPQRSTUVWXYZ0123456789", char)
}

BEGIN {
str = "comp.lang.awk"
print djb2gawk(str)
}

Janis Papanagnou

unread,
Sep 24, 2023, 12:39:04 AM9/24/23
to
On 24.09.2023 04:20, Michael Sanders wrote:
> On Saturday, September 23, 2023 at 4:29:40 PM UTC-5, Ben Bacarisse wrote:
>
>> This modulus does nor do what you claim you want to do (as has been
>> pointed out).
>
> No sir, it does, in fact.
>
> # modulo simulates 32-bit unsigned integer

For unsigned ints with a range that is implementable by binary words
x % 4294967296 (modulo) is equivalent to x & 4294967295 (bitwise-and),
x % 2147483648 (modulo) is equivalent to x & 2147483647 (bitwise-and).
(Mind the differing last digit of modulo and bitwise-and expressions!)
The expressions on the first line are for 32-bit unsigned integer and
the expressions on the second line are for 31-bit unsigned integer
calculations.

Your sample (the fact I pointed out in an earlier post and Ben repeated
here) uses x % 2147483647 which doesn't match a "32-bit simulation".
(Compare it with the four forms above and think about it.)

To add something new with this post I want to point out that a modulus
with primes (like 2147483647, a Mersenne prime) is regularly done e.g.
in random number generators or in cryptography. In these areas you may
have large numbers x and a comparable small prime modulus. Since you
don't have such large registers for the x you can iterate over parts
of the number sequentially (using a loop) to perform the operation.
This insight can also be applied in case of the issue you have with
your original code where internal overflows can arise; in case that
the prime number _2147483647_ was *intended* (as opposed to 31/32 bit
unsigned integer operations) you can avoid these computation errors
by that iterative method (here, in fact, a simple split of x in only
two parts would suffice for the "iteration").

Janis

Manuel Collado

unread,
Sep 24, 2023, 4:27:47 AM9/24/23
to
El 23/9/23 a las 15:21, Janis Papanagnou escribió:
Yes, ord() and chr() are included in the gawk distribution as a dynamic
extension.

gawk -l ordchr 'BEGIN{print ord("x")}'
120

HTH

Michael Sanders

unread,
Sep 24, 2023, 6:30:27 AM9/24/23
to
On Saturday, September 23, 2023 at 11:39:04 PM UTC-5, Janis Papanagnou wrote:

> Your sample (the fact I pointed out in an earlier post and Ben repeated
> here) uses x % 2147483647 which doesn't match a "32-bit simulation".

That's not correct & there is is no error using 2147483647, it works just
as I intended. But elsewhere... I do think my choice of naming ord()
& ord[] were poor, as these may be intrinsic to some versions of
a given awk's namespace.

> (Mind the differing last digit of modulo and bitwise-and expressions!)

See the dbj2gawk() version I posted elsewhere in this thread. But of
course the older awks lack bitwise...

> Since you don't have such large registers...

Exactly, hence the signed int masquerading as an unsigned int.
We dont want to get sidetracked on that point, there are lots of
other 32-bit numbers that can be chosen in any event (& ought to
be to for security reasons). I took one that works with mawk,
busybox awk. Just think about this & you'll see what I mean.

> in case that the prime number _2147483647_ was *intended*...

Yes, intended.

Thanks Janis.

Michael Sanders

unread,
Sep 24, 2023, 6:31:46 AM9/24/23
to
On Sunday, September 24, 2023 at 3:27:47 AM UTC-5, Manuel Collado wrote:

Yes it does help. Earnest thanks Manuel!

Janis Papanagnou

unread,
Sep 24, 2023, 9:28:06 AM9/24/23
to
On 24.09.2023 12:30, Michael Sanders wrote:
> On Saturday, September 23, 2023 at 11:39:04 PM UTC-5, Janis Papanagnou wrote:
>
>> Your sample (the fact I pointed out in an earlier post and Ben repeated
>> here) uses x % 2147483647 which doesn't match a "32-bit simulation".
>
> That's not correct & there is is no error using 2147483647, it works just
> as I intended.

I cannot see how "it works" if it is not "working" for the valid
31-bit number 2147483647; the modulus 2147483647 % 2147483647 == 0
thus it doesn't "simulate" 31-bit arithmetic (and also not 32-bit
arithmetic) with that modulus. But maybe you just have a different
understanding when speaking about unsigned integer "simulation".

For an operation x % 2147483647 you are (as I previously mentioned,
similar as in some coding theory applications) doing an arithmetic
reduction (or "folding") of longer numbers to smaller ones. That's
fine per se, but you are not "simulating" unsigned integer math. If
you wanted to say that you are restricting computation and numeric
output to 31 bit values then it's okay. With the already mentioned
flaw that the expressions exceeds 31 and 32 bit in the expression,
so the math processor needs to support larger integers or FP math
with sufficient resolution in the mantissa. Which is another point
why your expression is not [reliably] working [correctly] within a
32 bit range. To illustrate let's assume that the expression will
(as shown to be possible) overflow the 32 bit, then the result will
effectively have implicitly undergone an implicit x % 2147483648
operation and then an explicit x % 2147483647 operation; together
res = x % 2147483648 % 2147483647. By this there is algorithmically
a discontinuity introduced which is almost always not desired.

> But elsewhere... I do think my choice of naming ord()
> & ord[] were poor, as these may be intrinsic to some versions of
> a given awk's namespace.
>
>> (Mind the differing last digit of modulo and bitwise-and expressions!)
>
> See the dbj2gawk() version I posted elsewhere in this thread. But of
> course the older awks lack bitwise...

Yes, that's why you'd use the modulus; but x % 2147483648 are
necessary to handle all 31 bit unsigned integer values and not
x % 2147483647.

Notwithstanding that the use the latter may be algorithmically
needed. Such algorithms may choose also other primes p in x % p.
But still note the undesired introduced discontinuity mentioned
above in case of the overflows within expressions.

Janis

Janis Papanagnou

unread,
Sep 24, 2023, 9:37:24 AM9/24/23
to
On 24.09.2023 06:38, Janis Papanagnou wrote:
>
> To add something new with this post I want to point out that a modulus
> with primes (like 2147483647, a Mersenne prime) is regularly done e.g.
> in random number generators or in cryptography. In these areas you may
> have large numbers x and a comparable small prime modulus. Since you
> don't have such large registers for the x you can iterate over parts
> of the number sequentially (using a loop) to perform the operation. [...]

I just found on an old disk some Awk code where I used this method
to compute and check IBANs some years ago. Here's the diff of two
approaches, one using external multiple-precision arithmetic using
'bc' co-process (GNU Awk) and one working on chunks of the numeric
data string.

$ diff iban1 iban2
59a60,72
> function mod(n, s, l, z, m) {
> while (length(s) > 4) {
> z = substr(s, 1, 4)
> m = z % n
> s = m substr(s, 5)
> }
> return s % n
> }
>
> function mod97(s) {
> return mod(97, s)
> }
>
66,67c79
< print "98 - ( " iban " % 97 )" |& "bc"
< "bc" |& getline checksum
---
> checksum = 98 - mod97(iban)
72,73c84
< print iban " % 97 == 1" |& "bc"
< "bc" |& getline ok
---
> ok = mod97(iban) == 1

(Just for illustration purposes in case someone's interested in that
method.)

Janis

Ben Bacarisse

unread,
Sep 24, 2023, 3:39:49 PM9/24/23
to
Michael Sanders <chrom...@gmail.com> writes:

> On Saturday, September 23, 2023 at 4:29:40 PM UTC-5, Ben Bacarisse wrote:
>
>> This modulus does nor do what you claim you want to do (as has been
>> pointed out).
>
> No sir, it does, in fact.

I can only repeat that it does not. A simple test shows that it does
not, so it is probable that you don't know what "simulates 32-bit
unsigned integer" means.

Positive integers are not "unsigned". 32-bit unsigned arithmetic has no
sign representation at all, and the operations are all done modulo
2**32.

Even if you interpreted "32-bit unsigned integers" to mean that the
range should be that of the positive (but signed) 32-bit integers, your
modulus is still wrong. For that, you would need to use 2147483648 as
the modulus, giving 0 to 2147483647 inclusive.
What do you think these links show other than the obvious and
uncontested fact that the largest signed 32-bit number is a 32-bit
number?

>> What version of AWK, run with what flags, accepts this? I can't get it
>> past gawk, nawk, mawk nor original-awk.
>
> https://gnuwin32.sourceforge.net/packages/mawk.htm

What version of mawk is that? On my system:

$ mawk -W version
mawk 1.3.4 20200120
Copyright 2008-2019,2020, Thomas E. Dickey
Copyright 1991-1996,2014, Michael D. Brennan

random-funcs: srandom/random
regex-funcs: internal
compiled limits:
sprintf buffer 8192
maximum-integer 2147483647
$ mawk -f hash.awk
mawk: hash.awk: line 19: syntax error at or near [
mawk: hash.awk: line 25: syntax error at or near [

Lines 19 and 25 have 'ord['... where ord has been previously defined as
a function.

> https://frippery.org/busybox/

Ah, yes, busybox awk accepts it. I wonder why? It's definitely an
outlier. Don't use ord as the name of a function and of an array.

--
Ben.

Ben Bacarisse

unread,
Sep 24, 2023, 3:45:33 PM9/24/23
to
Michael Sanders <chrom...@gmail.com> writes:

> On Sunday, September 24, 2023 at 3:27:47 AM UTC-5, Manuel Collado wrote:
>
> Yes it does help. Earnest thanks Manuel!

Note that the gawk library extension 'ord' will not do what your ord
function does, particularly with digits. But treating digits
differently to other characters is not what DJB2 does, so the library
extension ord will fix a bug.

--
Ben.

Michael Sanders

unread,
Sep 24, 2023, 5:17:32 PM9/24/23
to
On Sunday, September 24, 2023 at 2:39:49 PM UTC-5, Ben Bacarisse wrote:

[...]

Ben, listen the number wraps (or folds as stated elsewhere)
to zero at its max using modulo. Thats obvious. I, you, Janis
(& likely others) see that.

Now the thing is, the comment I placed in the script was the
word 'simulates', if you want to fritter away your time debating
that, have at it. Me? Too busy to deal with the snarky attitude.

The scripts I posted, both dbj2() & dbj2gawk() work fine & correctly
on my end & it seems rich to me that you cant get them to run but
otherwise know that I'm wrong. Hmmm... None of this is the point,
we simply want a large number (32bit - 1 as is the case) to lesson
the chance of collisions in the hash & that's it... I cant believe
you & Janis cant see that for what it is. Good grief, I should've known...

Janis Papanagnou

unread,
Sep 24, 2023, 6:45:53 PM9/24/23
to
On 24.09.2023 23:17, Michael Sanders wrote:
> [...]

You are right, we all wasted our time.


Kenny McCormack

unread,
Sep 24, 2023, 7:36:52 PM9/24/23
to
In article <ueqe6u$1i163$1...@dont-email.me>,
And you've got no one to blame but yourself.

You & Ben.

It was pretty clear OP wasn't interested in the usual "human compiler"
treatment, but you guys went ahead with it anyway.

--
The book "1984" used to be a cautionary tale;
Now it is a "how-to" manual.

Janis Papanagnou

unread,
Sep 24, 2023, 8:38:06 PM9/24/23
to
On 25.09.2023 01:36, Kenny McCormack wrote:
> In article <ueqe6u$1i163$1...@dont-email.me>,
> Janis Papanagnou <janis_pap...@hotmail.com> wrote:
>> On 24.09.2023 23:17, Michael Sanders wrote:
>>> [...]
>>
>> You are right, we all wasted our time.
>>
>
> And you've got no one to blame but yourself.

Sure.

>
> You & Ben.
>
> It was pretty clear OP wasn't interested in the usual "human compiler"
> treatment, but you guys went ahead with it anyway.
>

The OP initially said about his interest it's "sharing the knowledge".
Anyone tell us what that would be in this thread. - Rhetorical, don't
bother.

He could as well have contributed BEGIN { print 42 } . "It's working."
And it's not rising all the questions the OP's code did and still does.

And (with your contribution and my reply) we continue wasting our time,
don't we, Kenny?

Ben Bacarisse

unread,
Sep 24, 2023, 9:27:50 PM9/24/23
to
Michael Sanders <chrom...@gmail.com> writes:

> On Sunday, September 24, 2023 at 2:39:49 PM UTC-5, Ben Bacarisse wrote:
>
> [...]
>
> Ben, listen the number wraps (or folds as stated elsewhere)
> to zero at its max using modulo. Thats obvious. I, you, Janis
> (& likely others) see that.

Indeed we do all know that.

> Now the thing is, the comment I placed in the script was the
> word 'simulates', if you want to fritter away your time debating
> that, have at it. Me? Too busy to deal with the snarky attitude.

No, I have no interest in debating it. I thought you might want to know
how to implement the behaviour described by the comment you wrote,
especially as that behaviour was a key part of how the original DJB hash
was defined.

> The scripts I posted, both dbj2() & dbj2gawk() work fine & correctly
> on my end & it seems rich to me that you cant get them to run but
> otherwise know that I'm wrong.

Of course I can get your code to run. It's trivial to fix the bug of
using ord for two incompatible purposes, but I find it rich that you
think it at all odd that someone could know your code is wrong without
running it. That's what programmers do every day.

> Hmmm... None of this is the point,
> we simply want a large number

Both the subject line of the thread, and the links you gave in the
original post, suggest you wanted to implement a specific hash function:
DJB's 32-bit "multiply by 33" hash. But your code does not do that.
For example, the DJB hash of the string "abcdef" is 4048079738, but your
code gives 1900599327. Your hash can go no higher than 2**31 - 2, but
the original one uses the full range of 32-bit unsigned int.

> (32bit - 1 as is the case) to lesson
> the chance of collisions in the hash & that's it...

This is why halving the range by using only 31 bits is not a good idea.
It's not a /terrible/ idea because 31 bits is still a huge range, but
since 32-bit wrapping is free on most hardware, there is absolutely no
upside to using modulo 2147483647 arithmetic for a hash.

And your saying "32bit - 1 as is the case" means you still don't know
what the code you wrote is really doing.

> I cant believe
> you & Janis cant see that for what it is. Good grief, I should've
> known...

Well now you know!

1) DJB's has hash usually uses 32-bit unsigned arithmetic. (Very old
implementation might use 16-bit arithmetic, and some newer ones might
use 64 bits.)

2) You can do that in many modern awks with modulo 4294967296
arithmetic.

3) In gawk you could use its logical 'and' function along with its
hexadecimal numbers to make it very clear what's going on

hash = and(hash * 33 + ord(c), 0xFFFFFFFF)

--
Ben.

Michael Sanders

unread,
Sep 25, 2023, 3:27:01 AM9/25/23
to
On Sunday, September 24, 2023 at 8:27:50 PM UTC-5, Ben Bacarisse wrote:

> For example, the DJB hash of the string "abcdef" is 4048079738, but your
> code gives 1900599327. Your hash can go no higher than 2**31 - 2, but
> the original one uses the full range of 32-bit unsigned int.
>
> [...]
>
> This is why halving the range by using only 31 bits is not a good idea.
> It's not a /terrible/ idea because 31 bits is still a huge range, but
> since 32-bit wrapping is free on most hardware, there is absolutely no
> upside to using modulo 2147483647 arithmetic for a hash.
>
> And your saying "32bit - 1 as is the case" means you still don't know
> what the code you wrote is really doing.

Reader's from the future, re-read the 1st paragraph...

Here's where good ol' Ben has (unwittingly, chuckle) proven, & in his own
words mind you, the importance of using of using a unique key number.

Michael Sanders

unread,
Sep 25, 2023, 3:28:26 PM9/25/23
to
On Friday, September 22, 2023 at 2:50:10 PM UTC-5, Michael Sanders wrote:

> Greetings folks, just sharing the knowledge (& test posting using google groups).

For the sake of completeness, a key generator (you supply a 32bit key):

# Michael Sanders 2023: key generator for dbj2() & dbj2gawk()
#
# creates a key block (25 keys, 5x5 matrix) as shown below:
#
# 2147483647 2147483646 2147483645 2147483644 2147483643
# 2147483642 2147483641 2147483640 2147483639 2147483638
# 2147483637 2147483636 2147483635 2147483634 2147483633
# 2147483632 2147483631 2147483630 2147483629 2147483628
# 2147483627 2147483626 2147483625 2147483624 2147483623

BEGIN {

KEYMAX = 2147483647 # your key's max number
BLOCKS = 5 # number of blocks to create

for (x = KEYMAX; x >= 0; x--) {
printf("%010d ", x);
if ((KEYMAX - x) % 5 == 4) {
printf("\n");
if (++y % 5 == 0) {
printf("\n")
if (--BLOCKS < 1) exit
}
}
}

}

Ben Bacarisse

unread,
Sep 25, 2023, 6:50:45 PM9/25/23
to
What's a key number?

--
Ben.

Michael Sanders

unread,
Sep 25, 2023, 11:32:24 PM9/25/23
to
On Monday, September 25, 2023 at 5:50:45 PM UTC-5, Ben Bacarisse wrote:

> What's a key number?

It doesn't matter Ben. I do appreciate your
willingness to help despite our p*ssing contest.
Moving forward, here's an intended usage example.

Given a CSV file of: Smith, John, HASH

And parsed as, oh something like:

if ($1 == "Smith" && $2 == "John") {

if ($3 == dbj2("policy_number")) {...}
}

We've gained simple a way to obfuscate small chucks
of patient records.

Ben Bacarisse

unread,
Sep 26, 2023, 7:20:30 AM9/26/23
to
Michael Sanders <chrom...@gmail.com> writes:

> On Monday, September 25, 2023 at 5:50:45 PM UTC-5, Ben Bacarisse wrote:
>
>> What's a key number?
>
> It doesn't matter Ben.

OK. I thought you might want to be understood.

--
Ben.

Mike Sanders

unread,
Sep 26, 2023, 3:11:33 PM9/26/23
to
Well, about to bring this little project to its conclusion =)
A special thanks (in no particular order) to:

Janis Papanagnou, Manuel Collado, & Ben Bacarisse.

Great insights, advice & help, appreciate you all.

tags: awk, sh, djb2, hash, keys, crypto

# Michael Sanders 2023: djb2 - simple string hash for awk
#
# https://porkchop.neocities.org/notes/djb2.txt
#
# usage example...
#
# given a CSV file of: Smith, John, HASH
#
# you could deploy djb2() in the following manner:
#
# if ($1 == "Smith" && $2 == "John" && $3 == djb2("medical_condition")) {...}
#
# if a unique number is desired, use one of the key generators below

function djb2(str, hash, x, y, ascii) {

hash = 5381
y = length(str)

for(x = 1; x <= y; x++) {
ascii = substr(str, x, 1)
hash = (hash * 33 + alphanum(ascii)) % ((2^32) - 1) # 32-bit key or
# hash = (hash * 33 + alphanum(ascii)) % 2147483647 # generated key
}

return hash

}

function alphanum(tmp) {

return index("abcdefghijklmnopqrstuvwxyzABCDEFGHIJKLMNOPQRSTUVWXYZ0123456789", tmp)

}

BEGIN {

print djb2("sensitive data")

}

# -----------------------------------------------------------------

# Michael Sanders 2023: key generator for djb2.awk
#
# creates 1250 random & unique keys, where each block
# contains 25 keys within a 5x5 matrix, sample output:
#
# 4044119583 2096446957 0146530746 2307162583 1145196028
# 2074570158 0782926302 0126577871 1957612719 1264962473
# 2768662034 3770197738 1165669763 3328823363 1160752685
# 1609089706 2699782284 0621370497 3966710650 2024400675
# 1453519246 1174954443 1588003602 3567230071 2877329117

BEGIN {

MIN = 0123456789 # min 10 digit num
MAX = 4294967295 # max 10 digit num
BLX = 50 # number of blocks

seed = systime(); srand(seed)
printf("\nSeed: %s\n\n", seed)

for (x = 1; x <= BLX; x++) {
for (y = 1; y <= 5; y++) {
s = ""
for (z = 1; z <= 5; z++) {
do {key = int(rand() * (MAX - MIN + 1)) + MIN} while(key in keys)
keys[key]
s = (s == "" ? sprintf("%010d", key) : s " " sprintf("%010d", key))
}
printf("%s\n", s)
}
if (x < BLX) printf("%s", "\n")
}
}

# -----------------------------------------------------------------

# Michael Sanders 2023: key generator2 for older awks using dbj2.awk
#
# use this key generator instead if your awk lacks the abilty to
# use either large integers or the systime() builtin
#
# creates key blocks (25 keys, 5x5 matrix) as shown below:
#
# 2147483647 2147483646 2147483645 2147483644 2147483643
# 2147483642 2147483641 2147483640 2147483639 2147483638
# 2147483637 2147483636 2147483635 2147483634 2147483633
# 2147483632 2147483631 2147483630 2147483629 2147483628
# 2147483627 2147483626 2147483625 2147483624 2147483623

BEGIN {

KEYMAX = 2147483647 # your 10 digit max number
BLOCKS = 50 # number of blocks to create

for (x = KEYMAX; x >= 0; x--) {
printf("%010d ", x);
if ((KEYMAX - x) % 5 == 4) {
printf("\n");
if (++y % 5 == 0) {
printf("\n")
if (--BLOCKS < 1) exit
}
}
}

}

# -----------------------------------------------------------------

notes:

https://stackoverflow.com/questions/1579721/why-are-5381-and-33-so-important-in-the-djb2-algorithm

https://stackoverflow.com/questions/10696223/reason-for-the-number-5381-in-the-djb-hash-function

https://groups.google.com/g/comp.lang.c/c/lSKWXiuNOAk

https://en.wikipedia.org/wiki/Daniel_J._Bernstein

https://groups.google.com/g/comp.lang.awk/c/FZNpn8Sxf9k

# eof

--
:wq
Mike Sanders

Mike Sanders

unread,
Sep 26, 2023, 3:21:21 PM9/26/23
to
Ben Bacarisse <ben.u...@bsb.me.uk> wrote:

> OK. I thought you might want to be understood.

Well, I wake up in a new world everyday it seems =)

Have look up thread somewhere or another (subject
contains '[Complete]'.

--
:wq
Mike Sanders

Keith Thompson

unread,
Sep 26, 2023, 5:31:45 PM9/26/23
to
pork...@invalid.foo (Mike Sanders) writes:
[...]
> function djb2(str, hash, x, y, ascii) {
>
> hash = 5381
> y = length(str)
>
> for(x = 1; x <= y; x++) {
> ascii = substr(str, x, 1)
> hash = (hash * 33 + alphanum(ascii)) % ((2^32) - 1) # 32-bit key or
> # hash = (hash * 33 + alphanum(ascii)) % 2147483647 # generated key
> }
>
> return hash
>
> }
>
> function alphanum(tmp) {
>
> return index("abcdefghijklmnopqrstuvwxyzABCDEFGHIJKLMNOPQRSTUVWXYZ0123456789", tmp)
>
> }
[...]

Why do you call this "djb2"?

Here's a description of the djb2 hash function from
<https://www.eecs.yorku.ca/~oz/hash.html>:

this algorithm (k=33) was first reported by dan bernstein many years
ago in comp.lang.c. another version of this algorithm (now favored
by bernstein) uses xor: hash(i) = hash(i - 1) * 33 ^ str[i]; the
magic of number 33 (why it works better than many other constants,
prime or not) has never been adequately explained.

unsigned long
hash(unsigned char *str)
{
unsigned long hash = 5381;
int c;

while (c = *str++)
hash = ((hash << 5) + hash) + c; /* hash * 33 + c */

return hash;
}

I see some similarities between the original djb and your hash function,
and yours was undoubtedly inspired by djb2, but they're very much not the
same thing. Would you agree that calling it "djb2" is misleading?

I find it odd that your function treats all non-alphanumeric characters
as the same. Is there a reason for that?

--
Keith Thompson (The_Other_Keith) Keith.S.T...@gmail.com
Will write code for food.
void Void(void) { Void(); } /* The recursive call of the void */

Mike Sanders

unread,
Sep 26, 2023, 6:20:58 PM9/26/23
to
Keith Thompson <Keith.S.T...@gmail.com> wrote:

> Why do you call this "djb2"?

Well, because that is what its called.

> Here's a description of the djb2 hash function from
> <https://www.eecs.yorku.ca/~oz/hash.html>:

Yes, one of the links in the notes section has a reference to that url.
Did you not read the notes?

> this algorithm (k=33) was first reported by dan bernstein many years
> ago in comp.lang.c. another version of this algorithm (now favored
> by bernstein) uses xor: hash(i) = hash(i - 1) * 33 ^ str[i]; the
> magic of number 33 (why it works better than many other constants,
> prime or not) has never been adequately explained.

I know, intesting stuff I thought!

> unsigned long
> hash(unsigned char *str)
> {
> unsigned long hash = 5381;
> int c;
>
> while (c = *str++)
> hash = ((hash << 5) + hash) + c; /* hash * 33 + c */
>
> return hash;
> }


Here's my take in virtually indentical take in c (uint32_t):

uint32_t djb2(const char *str) {

uint32_t hash = 5381;
int c;
while ((c = *str++))
hash = ((hash << 5) + hash) + c; /* hash * 33 + c */

return hash;
}

> I see some similarities between the original djb and your hash function,
> and yours was undoubtedly inspired by djb2, but they're very much not the
> same thing. Would you agree that calling it "djb2" is misleading?

Nope, wont change it either. =) I've provided my notes, so others can study it,
use it, extend it, trace it back to its source if so compelled... I make
no claims as to being its originator. Heck the *cough* notes *cough* alone
demonstrate that. I feel its 'nifty' for no reason than its short & sweet.
awk (or least I) needed this functionality. Besides, I'm hard-pressed to think
anyone will mistake it as anything misleading. I reckon no good deed will
go unpunished. Such is life on usenet...

> I find it odd that your function treats all non-alphanumeric characters
> as the same. Is there a reason for that?

Yes, just dealing (a-z, A-Z, 0-9) here. Personally don't want to imbue
it with extra, uneeded baggage. Feel free to modify it to suit your tastes.

--
:wq
Mike Sanders

Keith Thompson

unread,
Sep 26, 2023, 6:38:51 PM9/26/23
to
pork...@invalid.foo (Mike Sanders) writes:
> Keith Thompson <Keith.S.T...@gmail.com> wrote:
>> Why do you call this "djb2"?
>
> Well, because that is what its called.

You mean that's what you decided to call it. I'm asking why.

>> Here's a description of the djb2 hash function from
>> <https://www.eecs.yorku.ca/~oz/hash.html>:
>
> Yes, one of the links in the notes section has a reference to that url.
> Did you not read the notes?

I did. That's how I noticed that you've implemented a different
algorithm than any version of the original djb2.

[...]

>> I see some similarities between the original djb and your hash function,
>> and yours was undoubtedly inspired by djb2, but they're very much not the
>> same thing. Would you agree that calling it "djb2" is misleading?
>
> Nope, wont change it either. =)
[...]

Noted.

What you've posted is not djb2. You shouldn't claim that it is. You've
made it clear that you intend to continue to do so.

Mike Sanders

unread,
Sep 26, 2023, 7:12:14 PM9/26/23
to
Keith Thompson <Keith.S.T...@gmail.com> wrote:

> You mean that's what you decided to call it. I'm asking why.

Well, I don't think I have an anwser that will satisfy you.

> What you've posted is not djb2. You shouldn't claim that it is. You've
> made it clear that you intend to continue to do so.

What do you feel I ought to do to clarify the problem with this cavet:

It wont be taken down or renamed.

Perhaps add clarification to the document how my take only deals with
alphanumeric chars? Let's work it out.

--
:wq
Mike Sanders

Keith Thompson

unread,
Sep 26, 2023, 7:53:48 PM9/26/23
to
pork...@invalid.foo (Mike Sanders) writes:
> Keith Thompson <Keith.S.T...@gmail.com> wrote:
>> You mean that's what you decided to call it. I'm asking why.
>
> Well, I don't think I have an anwser that will satisfy you.

Then post an answer that doesn't satisfy me. So far you've said nothing.

>> What you've posted is not djb2. You shouldn't claim that it is. You've
>> made it clear that you intend to continue to do so.
>
> What do you feel I ought to do to clarify the problem

Rename it.

> with this cavet:
> It wont be taken down or renamed.

If you wrote a hash function that yields a 160-bit result that differs
from the result returned by the actual sha1 function, would you release
it under the name "sha1"? Assuming your answer is no, how is this
different? (Admittedly djb2 isn't as well established.)

> Perhaps add clarification to the document how my take only deals with
> alphanumeric chars? Let's work it out.

Work it out yourself.

Mike Sanders

unread,
Sep 26, 2023, 8:49:17 PM9/26/23
to
Keith Thompson <Keith.S.T...@gmail.com> wrote:

> Work it out yourself.

Will do.

--
:wq
Mike Sanders

Mike Sanders

unread,
Sep 26, 2023, 9:32:07 PM9/26/23
to
Mike Sanders <pork...@invalid.foo> wrote:

> Will do.

https://porkchop.neocities.org/notes/mHash.txt

--
:wq
Mike Sanders

Janis Papanagnou

unread,
Sep 27, 2023, 4:04:03 AM9/27/23
to
On 26.09.2023 23:31, Keith Thompson wrote:
> pork...@invalid.foo (Mike Sanders) writes:
> [...]
>
> Why do you call this "djb2"?
>
> Here's a description of the djb2 hash function from
> <https://www.eecs.yorku.ca/~oz/hash.html>:
> [...]
>
> I see some similarities between the original djb and your hash function,
> and yours was undoubtedly inspired by djb2, but they're very much not the
> same thing. Would you agree that calling it "djb2" is misleading?
>
> I find it odd that your function treats all non-alphanumeric characters
> as the same. Is there a reason for that?
>

Indeed. I as well was misleaded by the impression I've got that he
was trying to implement (just in a wrong way) an existing algorithm.
I noticed that (my) mistake not before I saw two versions of his code
in one post, each using different character set encodings - which of
course would lead to different results even if we compare only the
OP's own implementations' results. This arbitrariness in conjunction
with the inherent flaws the code versions had (and may still have;
I don't bother any more) makes clear that we obviously have a "Not
Invented Here" case (https://en.wikipedia.org/wiki/Not_invented_here),
which most likely makes all in depth discussions void anyway. Glad
it became obvious after all. (Bad that I noticed that too late. And I
see that the OP just recently followed your suggestion to rename his
hash function.)

It's always good to get information about the _intention_ of a piece
of posted code early and with the original post; that way bandworm
threads and misunderstandings could be avoided.

<OT>
Let's come back to the actual application and application domain ...

Inferred from the OP's posting where he wrote: dbj2("policy_number"))
and "simple a way to obfuscate small chucks of patient records." ...

In statistical medicine context there's demand to anonymize patient
data. For statistical purposes and cause-effect insights it's usually
necessary to be able to assign data to an "individual entity" without
disclosing his identity. But a hash code does not support that; it
can lead to "ID"-clashes that assigns different results to one same
subject. You need to create (real) keys. - Remember Ben's confusion
about (the OP's misnomer) "key number"! The OP may seem to require
keys but he's creating a hash value.

Of course if a hash value is sufficient - only the OP knows this! -
then I'd have used some existing and proven algorithm instead of
inventing my own (with possible flaws or non-obvious bad properties).
(But this as well has the OP to decide and is not an Awk question
anyway.)
</OT>

Janis

Mike Sanders

unread,
Sep 27, 2023, 6:45:20 AM9/27/23
to
Wow... I've simply attempted to learn & share that with others.
I cant recall every seeing such, I dunno, at a loss for words.
It would've taken me a few more iterations to bring it all togther
but with this environment, its not okay.
--
:wq
Mike Sanders

Michael Sanders

unread,
Sep 27, 2023, 2:13:23 PM9/27/23
to
On Wednesday, September 27, 2023 at 3:04:03 AM UTC-5, Janis Papanagnou wrote:

> Of course if a hash value is sufficient - only the OP knows this!

key = 32bitmax - random_lesser

(admin issues keys so no duplicates)

0 new messages