Re-tooling a Python PRNG data generator to use tuple hashes

20 views
Skip to first unread message

Phil Hibbs

unread,
Oct 23, 2017, 10:12:02 AM10/23/17
to Procedural Content Generation
I need to move away from sequential random numbers, and I've been inspired by this article:

First of all, is there anything in that article that rings alarm bells, or is it a good idea?

Right now I have a data generator written in Python that is structured like this:

num_people = randint(1,100)
people = [dict() for x in range(num_people)]
for person in people:
    person['surname'] = choice(surname_list)
    person['forename'] = choice(forename_list)

The problem with this is that I have to generate all the people and the same attributes in the same order in order to get repeatable data sets by setting the random seed.

I want to be able to insert a new line of code in between the two in the loop:

    person['midname'] = choice(forename_list) if randint(0,99) < 50 else ''

My intended approach to this is to have a tuple that describes each random number that is used, and to xxhash that with a seed to get a pseudorandom number.

For example:

h1_groupseed=1

h2_peoplecount=1
h2_people=2

h4_surname=1
h4_forename=2

num_people = pghash([h1_groupseed,h2_peoplecount]).hashint(1,100)
people = [dict() for x in range(num_people)]
for h3_index, person in enumerate(people,1):
    person['surname'] = pghash([h1_groupseed,h2_people,h3_index,h4_surname]).choice(surname_list)
person['forename'] = pghash([h1_groupseed,h2_people,h3_index,h4_forename]).choice
(forename_list)
The remaining problem is, how do I go about converting the hash into a random number?

My instinct is that I can just take the 64-bit hash and convert it into a double by dividing it by 2^64, and then into an integer range by multiplying by the maximum value and adding 1.

I could approximate a normal or gaussian distribution by splitting the hash into two 32-bit halves and doing a Box-Muller on them.

Would this be good enough, or is there anything I am missing about generating good pseudo-random numbers?

Phil Hibbs.

Peter Mawhorter

unread,
Oct 23, 2017, 11:40:21 AM10/23/17
to procedur...@googlegroups.com
What you're planning makes sense to me, and that article is indeed good advice. Since you're in Python, I think you can just use % to mod your hash values into a given range when you need to, without worrying about the number of bits involved (and this should be faster than converting to a float/double and back). Something like:

def in_range(hash_value, lower_bound, upper_bound):
  return lower_bound + hash_value % (upper_bound - lower_bound)

If you're using a high quality hash function like xxHash mentioned in that article, the chances that this modulus brings out a pattern in the results should be pretty low, and you can always do some tests or look at the source if you're worried about that. In that case you could also use something like this to select from a list:

person['surname'] = surname_list[in_range(<hash computation>, 0, len(surname_list))]

Another technique that might be useful at some point if name collisions are too frequent for your liking would be to use a hash value to seed a shuffle of your lists and then index directly by ID, which will make collisions rare-to-impossible depending on the lengths of your name lists and whether they're the same length or not.

Just an implementation note: I tend to write functions like "def surname(person_id, group_id):" to wrap up the logic into one place where I can re-use it if necessary across contexts, and then I'd just call that function in the for loop. That way it's easy to change internal hash values or re-compute a person's name in another part of the code if necessary (if the correct seeds are available) without potentially getting into copy/paste bugs. Writing things this way also forces you to define things as strict functions and allows you to see exactly which inputs are necessary, so you won't accidentally rely on a global variable or something like that that can later cause weird bugs when you decide to change it.

-Peter Mawhorter

--

---
You received this message because you are subscribed to the Google Groups "Procedural Content Generation" group.
To unsubscribe from this group and stop receiving emails from it, send an email to proceduralcontent+unsubscribe@googlegroups.com.
For more options, visit https://groups.google.com/d/optout.

Phil Hibbs

unread,
Oct 24, 2017, 6:34:32 AM10/24/17
to Procedural Content Generation
Ok, here's the initial draft of my hash-based replacement for the random.Random class:

from __future__ import division
import xxhash
from numpy import sqrt, log, sin, cos, pi

def pgh(seed, tuple):
    x = xxhash.xxh64(seed=seed)
    x.update(','.join(tuple))
    return x.hexdigest()

def gaussian(u1, u2):
    z1 = sqrt(-2*log(u1))*cos(2*pi*u2)
    z2 = sqrt(-2*log(u1))*sin(2*pi*u2)
    return z1,z2

class pghash:
    def __init__(self, seed, tuple):
        self.hex = pgh(seed, tuple)

    def pgvalue(self):
        return int(self.hex, 16)

    def pghalves(self):
        return self.hex[:8], self.hex[8:]

    def pgvalues(self):
        return int(self.hex[:8], 16), int(self.hex[8:], 16)

    def random(self):
        return self.value() / 2**64

    def randint(self, min, max):
        return int(self.random() * max + min)

    def gauss(self, mu, sigma):
        xx = self.pgvalues()
        uu = [xx[0]/2**32, xx[1]/2**32]
        return gaussian(uu[0],uu[1])[0]
 
Phil.

Phil Hibbs

unread,
Oct 24, 2017, 6:35:28 AM10/24/17
to Procedural Content Generation
Hm, that didn't work. Lets try that again.

Phil Hibbs

unread,
Nov 1, 2017, 12:49:11 PM11/1/17
to Procedural Content Generation
I now have a working Python module!


Next step: extensive testing, and publish on pypi.

Phil.

Phil Hibbs

unread,
Nov 1, 2017, 1:04:12 PM11/1/17
to Procedural Content Generation
Reply all
Reply to author
Forward
0 new messages