Random numbers in GAWK

Kenny McCormack

unread,

Sep 22, 2016, 12:53:49 PM9/22/16

to

I've become concerned about the "dispersion" of the random numbers
generated by GAWK. Now, I know this isn't statistically rigorous, so,
please, no "Well, it could be right"/"You don't know statistics"/"Blah,
blah, blah" type answers.

So, I do (think of this as rolling a single die):

$ echo "10 6" | gawk4 'BEGIN {srand()} { for (i=1; i<=$1; i++) x[int(rand()*$2)]++;for (i in x) print
i,x[i];delete x}'
0 2
1 3
4 2
5 3
$

And note with concern that neither 2 or 3 came up. The counts of 3 (two of
them!) look a little heavy, and, in fact, when I run this experiment
multiple times, I frequently get 4s or even 5s (as the count).

So, one starts to think in terms of other random number generators (since
there seems to be a general problem of the rand()/etc functions that come
with C not being very good). Note that gawk uses "random()", which is
supposed to be better than "rand()", but still doesn't look very good to
me.

So, naturally, one starts to think in terms of using /dev/urandom (on
Linux) and so I knocked this one up:

--- Cut Here ---
# Note: The input to this is: $1 == # of iterations, $2 = multiplier.
BEGIN { cmd = "tput setaf 1;printf \"*\";tput setaf 0";cmd | getline star;close(cmd) }
{
delete x; delete nums
mult = $2; n = 0
for (cmd = "od -b -N "$1" /dev/urandom"; cmd | getline;)
for (i=2; i<=NF; i++)
nums[++n] = strtonum("0" $i)
close(cmd)
for (i in nums) x[int(nums[i]/256*mult)]++
for (i=tot=0; i<mult; i++) print i,x[i] ? x[i] : star,tot+=x[i]
}
--- Cut Here ---
Run the above with inputs like "10 6" and variations thereof.

And the results seem to be somewhat better, but still not great.

What I want is pretty much that every number come up at least once (and
with roughly equal frequency) - that seems reasonable - but, obviously, I
don't want to do anything brute force-ish to ensure that outcome.

Anyone have any better ideas?

--
Marshall: 10/22/51
Jessica: 4/4/79

Janis Papanagnou

unread,

Sep 22, 2016, 1:40:44 PM9/22/16

to

On 22.09.2016 18:53, Kenny McCormack wrote:
> [...]

>
> And the results seem to be somewhat better, but still not great.
>
> What I want is pretty much that every number come up at least once (and
> with roughly equal frequency) - that seems reasonable - but, obviously, I
> don't want to do anything brute force-ish to ensure that outcome.
>
> Anyone have any better ideas?

You seem to be seeking some hybrid between a (P)RNG and a LFSR. I think
you can't get both, but you could probably combine the methods. A PRNG
(modulo weak algorithms) may produce sequences that are lacking certain
numbers, while LFSR will create a sequence that won't repeat numbers
unless all numbers have already been created but it's predictable. There
are other options, like cryptographic algorithms (see, e.g., "Numerical
recipes in C" where even DES is used for that purpose) that will produce
in a large domain practically no clashes. The problem is, though, that
once you apply the modulo operator to map the large number domain down
to the application domain you again can't be sure that some numbers are
not duplicated while others completely missing. Is what you want maybe
a LFSR-sequence where the to be used numbers are selected by any (even
primitive, say, a linear congruential) RNG?

Janis

Bruce Horrocks

unread,

Sep 22, 2016, 4:46:37 PM9/22/16

to

On 22/09/2016 17:53, Kenny McCormack wrote:
> Anyone have any better ideas?

Roll the dice (n-6) times and add 1 to each total. :-)

--
Bruce Horrocks
Surrey
England
(bruce at scorecrow dot com)

Andrew Schorr

unread,

Sep 22, 2016, 9:05:22 PM9/22/16

to

Hi,

A couple of thoughts. First, the master branch, which eventually should be released as 4.2, has a revised rand() function that is designed to give "more random" results. Second, the gawk rand() function actually uses the C library random() function, not rand(), FWIW. Here are some results using the master branch:

bash-4.3$ echo "100 6" | ./gawk 'BEGIN {srand()} { for (i=1; i<=$1; i++) x[int(rand()*$2)]++;for (i in x) print i,x[i];delete x}'
0 21
1 16
2 18
3 14
4 15
5 16

Regards,
Andy

P.S. Your original sample size of 10 strikes me as quite small.

Kenny McCormack

unread,

Sep 23, 2016, 1:01:45 AM9/23/16

to

In article <6fff5276-1533-4966...@googlegroups.com>,
Andrew Schorr <asc...@telemetry-investments.com> wrote:
>Hi,

Indeed. Hello.

>A couple of thoughts. First, the master branch, which eventually should be
>released as 4.2, has a revised rand() function that is designed to give "more
>random" results.

Good to hear.

>Second, the gawk rand() function actually uses the C library

>random() function, not rand().

Yes, understood. From the O.P.:

* Note that gawk uses "random()", which is
* supposed to be better than "rand()", but still doesn't look very good to
* me.

>FWIW. Here are some results using the master branch:
>
>bash-4.3$ echo "100 6" | ./gawk 'BEGIN {srand()} { for (i=1; i<=$1; i++)
>x[int(rand()*$2)]++;for (i in x) print i,x[i];delete x}'
>0 21
>1 16
>2 18
>3 14
>4 15
>5 16

This does look better. Looking forward to it!

>P.S. Your original sample size of 10 strikes me as quite small.

Intentionally so. Where this is all coming from is that I'm using a random
number to pick one of a small number of strings to print. The total number
of possible strings is about 10 and the number of picks is also about 10.
I've noticed in practice that the program almost always prints the same
string multiple times and skips many of them in any given run. This
obviously makes the output look like sensible.

--
The key difference between faith and science is that in science, evidence that
doesn't fit the theory tends to weaken the theory (that is, make it less likely to be
believed), whereas in faith, contrary evidence just makes faith stronger (on the
assumption that Satan is testing you - trying to make you abandon your faith).

Anton Treuenfels

unread,

Sep 23, 2016, 8:01:11 PM9/23/16

to

"Kenny McCormack" <gaz...@shell.xmission.com> wrote in message
news:ns2cvp$v42$1...@news.xmission.com...
> In article <6fff5276-1533-4966...@googlegroups.com>,

> Intentionally so. Where this is all coming from is that I'm using a
> random
> number to pick one of a small number of strings to print. The total
> number
> of possible strings is about 10 and the number of picks is also about 10.
> I've noticed in practice that the program almost always prints the same
> string multiple times and skips many of them in any given run. This
> obviously makes the output look like sensible.

If you want to print all the strings in a given run, discard each one you've
picked and choose the next from the smaller pool. Re-fill the pool if you
need more.

Or choose the next string from the whole pool each time, but each time a
particular string is chosen, weight it to reduce the probability of its
being chosen again.

- Anton Treuenfels

Manuel Collado

unread,

Sep 24, 2016, 8:22:58 AM9/24/16

to

El 24/09/2016 a las 2:00, Anton Treuenfels escribió:
>
> "Kenny McCormack" <gaz...@shell.xmission.com> wrote in message
> news:ns2cvp$v42$1...@news.xmission.com...
>> In article <6fff5276-1533-4966...@googlegroups.com>,
>> Intentionally so. Where this is all coming from is that I'm using a
>> random
>> number to pick one of a small number of strings to print. The total
>> number
>> of possible strings is about 10 and the number of picks is also about 10.
>> I've noticed in practice that the program almost always prints the same
>> string multiple times and skips many of them in any given run. This
>> obviously makes the output look like sensible.
>
> If you want to print all the strings in a given run, discard each one
> you've picked and choose the next from the smaller pool. Re-fill the
> pool if you need more.

To print once each element in a set, in a random order:

mcollado@PC6 /cygdrive/c/pruebas/awk
$ cat randomperm.awk
BEGIN {
size = 10
srand()
# fill a 1..size array
for (k=1; k<=size; k++) sample[k] = k
# print the elements in a random order
for (k=size; k>0; k--) {
x = int(rand()*k)+1
print size-k+1, sample[x]
sample[x] = sample[k]
}
}

mcollado@PC6 /cygdrive/c/pruebas/awk
$ awk -f randomperm.awk
1 1
2 5
3 7
4 8
5 10
6 3
7 4
8 9
9 2
10 6

mcollado@PC6 /cygdrive/c/pruebas/awk
$ awk -f randomperm.awk
1 1
2 9
3 6
4 2
5 5
6 7
7 3
8 8
9 10
10 4

HTH.