
Generating a unique identifier


Steven D'Aprano

Sep 7, 2007, 8:03:23 AM

I have an application that will be producing many instances, using them
for a while, then tossing them away, and I want each one to have a unique
identifier that won't be re-used for the lifetime of the Python session.

I can't use the id() of the object, because that is only guaranteed to be
unique during the lifetime of the object.

For my application, it doesn't matter if the ids are predictable, so I
could do something as simple as this:

def unique_id():
    n = 1234567890
    while True:
        yield n
        n += 1

unique_id = unique_id()

while Application_Is_Running:
    make_an_object(id=unique_id.next())
    do_stuff_with_objects()
    delete_some_of_them()

which is easy enough, but I thought I'd check if there was an existing
solution in the standard library that I missed. Also, for other
applications, I might want them to be rather less predictable.

--
Steven.

Will Maier

Sep 7, 2007, 8:14:34 AM

On Fri, Sep 07, 2007 at 12:03:23PM -0000, Steven D'Aprano wrote:
[...]

> which is easy enough, but I thought I'd check if there was an existing
> solution in the standard library that I missed. Also, for other
> applications, I might want them to be rather less predictable.

2.5 includes the uuid module for RFC 4122 universally-unique IDs:

http://docs.python.org/lib/module-uuid.html
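
For instance, a minimal sketch of the two common constructors (uuid4() is
random and unpredictable, uuid1() is time/MAC-based and guessable):

import uuid

print uuid.uuid4()       # random (version 4) UUID
print uuid.uuid1()       # time- and MAC-based (version 1); unique but guessable
print uuid.uuid4().hex   # 32 hex digits, handy as a plain string key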

--

[Will Maier]-----------------[will...@ml1.net|http://www.lfod.us/]

Marc 'BlackJack' Rintsch

Sep 7, 2007, 8:45:12 AM

On Fri, 07 Sep 2007 12:03:23 +0000, Steven D'Aprano wrote:

> I have an application that will be producing many instances, using them
> for a while, then tossing them away, and I want each one to have a unique
> identifier that won't be re-used for the lifetime of the Python session.
>
> I can't use the id() of the object, because that is only guaranteed to be
> unique during the lifetime of the object.
>
> For my application, it doesn't matter if the ids are predictable, so I
> could do something as simple as this:
>
> def unique_id():
>     n = 1234567890
>     while True:
>         yield n
>         n += 1
>
> unique_id = unique_id()
>
> while Application_Is_Running:
>     make_an_object(id=unique_id.next())
>     do_stuff_with_objects()
>     delete_some_of_them()
>
> which is easy enough, but I thought I'd check if there was an existing
> solution in the standard library that I missed.

For that easy solution you can use `itertools.count()`.
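
A sketch of that drop-in (in Python 2.5 you pull values with the
iterator's next() method):

import itertools

unique_id = itertools.count(1234567890)
print unique_id.next()   # 1234567890
print unique_id.next()   # 1234567891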

Ciao,
Marc 'BlackJack' Rintsch

kyos...@gmail.com

Sep 7, 2007, 9:58:10 AM

On Sep 7, 7:03 am, Steven D'Aprano <st...@REMOVE-THIS-cybersource.com.au> wrote: [...]

You could always use the md5 module along with time.time() to generate
a unique id.
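
A minimal sketch of that idea (the _serial counter is an added
assumption: time.time() alone can repeat within the timer's resolution,
so a serial number is mixed in to break ties):

import md5, time, itertools

_serial = itertools.count()

def unique_id():
    # hash the clock plus a serial number
    return md5.new('%r-%d' % (time.time(), _serial.next())).hexdigest()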

Mike

Wildemar Wildenburger

Sep 7, 2007, 10:06:07 AM

Will Maier wrote:
> On Fri, Sep 07, 2007 at 12:03:23PM -0000, Steven D'Aprano wrote:
> [...]
>> which is easy enough, but I thought I'd check if there was an existing
>> solution in the standard library that I missed. Also, for other
>> applications, I might want them to be rather less predictable.
>
> 2.5 includes the uuid module for RFC 4122 universally-unique IDs:
>
> http://docs.python.org/lib/module-uuid.html
>
Jesus Christ! What in all the world *doesn't* Python do already?!

Can't we now just write up a module for the standard lib that takes care
of writing all the other code as well and just forget about getting our
hands dirty with programming altogether?

/W
(just in case: ;)!)

Paul Rubin

Sep 7, 2007, 11:42:45 AM

Steven D'Aprano <st...@REMOVE-THIS-cybersource.com.au> writes:
> def unique_id():
>     n = 1234567890
>     while True:
>         yield n
>         n += 1

unique_id = itertools.count(1234567890)

> which is easy enough, but I thought I'd check if there was an existing
> solution in the standard library that I missed. Also, for other
> applications, I might want them to be rather less predictable.

import os

def unique_id():
    return os.urandom(10).encode('hex')

Paul Rubin

Sep 7, 2007, 11:47:58 AM

Paul Rubin <http://phr...@NOSPAM.invalid> writes:
> def unique_id():
>     return os.urandom(10).encode('hex')

Sorry, make that 32 or 40 instead of 10, if the number of id's is large,
to make birthday collisions unlikely.

If you don't want the id's to be that large, you can implement a
Feistel cipher using md5 or sha as the round function pretty
straightforwardly, then just feed successive integers through it.
That also guarantees uniqueness, at least within one run of the
program. I have some sample code around for that, let me know if you
need it.

Ben Finney

Sep 7, 2007, 7:47:42 PM

Will Maier <will...@ml1.net> writes:

> On Fri, Sep 07, 2007 at 12:03:23PM -0000, Steven D'Aprano wrote:
> [...]
> > which is easy enough, but I thought I'd check if there was an
> > existing solution in the standard library that I missed. Also, for
> > other applications, I might want them to be rather less
> > predictable.
>
> 2.5 includes the uuid module for RFC 4122 universally-unique IDs:
>
> http://docs.python.org/lib/module-uuid.html

I second this recommendation. If you want arbitrary unique IDs that
are not a function of the data they identify, the simplest solution is
a monotonically-increasing serial number. If you want more than that,
you might as well go straight to the standard UUIDs.

Even if you're not using Python 2.5, grab the external 'python-uuid'
package for your operating system and use that. If you use the options
for including random elements, the generated UUIDs will even satisfy
your "unpredictable" requirement.

--
\ "This sentence contradicts itself -- no actually it doesn't." |
`\ -- Douglas Hofstadter |
_o__) |
Ben Finney

Steven D'Aprano

Sep 7, 2007, 9:31:46 PM

On Fri, 07 Sep 2007 08:42:45 -0700, Paul Rubin wrote:

> Steven D'Aprano <st...@REMOVE-THIS-cybersource.com.au> writes:
>> def unique_id():
>>     n = 1234567890
>>     while True:
>>         yield n
>>         n += 1
>
> unique_id = itertools.count(1234567890)

Sweet!

I really must make itertools second-nature. I always forget it.


>> which is easy enough, but I thought I'd check if there was an existing
>> solution in the standard library that I missed. Also, for other
>> applications, I might want them to be rather less predictable.
>
> def unique_id():
>     return os.urandom(10).encode('hex')

Any time I see something using a random number to generate IDs, I worry
about collisions. Am I being paranoid? (But even paranoids write code
with bugs...)


Here's something which is a little less predictable than a straight
counter:

import random

def unpredictable_counter(n=12345678):
    while True:
        n += random.randint(1, 68)
        yield n

def more_unpredictable_counter(n=1234567):
    uc = unpredictable_counter(n)
    pool = []
    while True:
        if not pool:
            pool = [None]*99
            for i in range(99):
                pool[i] = uc.next()
            random.shuffle(pool)
        yield pool.pop()


--
Steven.

Paul Rubin

Sep 7, 2007, 9:32:31 PM

Ben Finney <bignose+h...@benfinney.id.au> writes:
> > http://docs.python.org/lib/module-uuid.html
> I second this recommendation. If you want arbitrary unique IDs that
> are not a function of the data they identify, the simplest solution is
> a monotonically-increasing serial number. If you want more than that,
> you might as well go straight to the standard UUIDs.

The standard UUIDs as described in that URL don't make any serious
attempt to be unguessable. It's better to use a string from os.urandom().

To Steven: double oops, I was thinking of hex digits rather than bytes
when I suggested reading 32 (length of an md5 hash) or 40 (length of
an sha1 hash) from os.urandom. That should have said 16 or 20 bytes.

If k is the number of unique id's you'll actually use, and N is the
number of possible random uid's (so for 16 bytes, N=2**128 since
16 bytes is 128 bits), the probability of a collision occurring is
approximately exp(-k**2 / (2*N)), assuming k is small compared
with sqrt(N).

Paul Rubin

Sep 7, 2007, 10:17:44 PM

Steven D'Aprano <st...@REMOVE-THIS-cybersource.com.au> writes:
> > unique_id = itertools.count(1234567890)
>
> Sweet!
>
> I really must make itertools second-nature. I always forget it.

Beware that count() output (like enumerate) is always <= sys.maxint.
It raises an exception if it overflows. IMO this is a bug.
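
If that bites, a plain generator sidesteps it, since Python longs are
unbounded (a sketch; count_longs is a hypothetical name):

def count_longs(n=0):
    # like itertools.count(), but rolls over into longs instead of raising
    while True:
        yield n
        n += 1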

> > def unique_id():
> > return os.urandom(10).encode('hex')
>
> Any time I see something using a random number to generate IDs, I worry
> about collisions. Am I being paranoid? (But even paranoids write code
> with bugs...)

Well, the idea is to make the string long enough (I shouldn't have
picked 10 bytes) to make the probability of collisions acceptably low.
The probability is about exp(-k**2/(2*N)) where k is the number of
id's you use and N is the number of possible ID's. So with
os.urandom(20), if you use 1 billion (= approx 2**30) id's,
the probability is about exp(-(2**60 / 2*2**160)) = 1/exp(2**101)
which is extremely small. Using random strings is probably safer
from collisions than trying to maintain distinct initial counters
across multiple runs of a program, on multiple computers, etc.

The classic example is ethernet cards. Each one is supposed to have a
unique hardware 48-bit MAC address, with routers etc. relying on the
uniqueness. Some organization is supposed to assign a unique code to
each manufacturer, and then the manufacturer uses that code as a
prefix and assigns addresses in sequence within that space. That
works fine until sales go up, manufacturers start opening multiple
factories operating out of the same space, low-rent manufacturers
start making up their own codes, etc. So companies that buy large
numbers of cards often find multiple cards with the same address,
causing various hassles. If they just assigned 128-bit MAC addresses
at random it's very unlikely that there would ever be a collision.

> Here's something which is a little less predictable than a straight
> counter:

It's still very easy to generate valid id's from that, or guess the
next one with non-negligible probability. IMO it's almost never worth
messing with a "little less predictable". If you're trying to prevent
an actual opponent from guessing something, use full-strength
cryptography.

You could try something like this:

import sha, os, itertools

radix = 2**32  # keep this == 2**(some even number <= 80)
secret_key = os.urandom(20)

def prp(key, n):
    # pseudo-random permutation (8-round Feistel network)
    # transform n to a unique number < radix**2
    assert 0 <= n < radix**2

    def F(i, x):
        return int(sha.new('%s,%x,%x' % (key, i, x)).hexdigest(), 16) % radix

    a, b = divmod(n, radix)
    for i in xrange(8):
        a ^= F(i, b)
        a, b = b, a
    return radix*a + b

unique_id = (prp(secret_key, i) for i in itertools.count())

It should be pretty secure and (unless I made an error) is a
permutation from [0:radix**2] to [0:radix**2], similar to DES.
It's invertible if you know the secret key (I'll leave that as an
exercise). 8 rounds is probably overkill for radix 2**32.
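
For the curious, a sketch of that inverse (prp_inverse is a hypothetical
name; it reuses radix, secret_key and the same F construction from the
code above, and simply undoes the rounds in reverse order):

def prp_inverse(key, m):
    # invert prp(): recover n such that prp(key, n) == m
    assert 0 <= m < radix**2

    def F(i, x):
        return int(sha.new('%s,%x,%x' % (key, i, x)).hexdigest(), 16) % radix

    a, b = divmod(m, radix)
    for i in reversed(xrange(8)):
        a, b = b ^ F(i, a), a
    return radix*a + b

assert all(prp_inverse(secret_key, prp(secret_key, n)) == n
           for n in (0, 1, 12345, radix**2 - 1))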

Paul Rubin

Sep 7, 2007, 10:22:43 PM

Paul Rubin <http://phr...@NOSPAM.invalid> writes:
> the probability is about exp(-(2**60 / 2*2**160)) = 1/exp(2**101)

Bah. Got that backwards and calculated wrong. The probability of no
collisions is

exp(-(2**60) / (2*2**160)) = exp(-(2**-101)) = approx (1 - 2**-101)

which is very close to 1. The probability of collision is about
2**-101 which is close to 0. Proofread first, then post. I keep
getting that in the wrong order.
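
A quick numeric check of that bound (a sketch; collision_probability is
a hypothetical helper, and for tiny x the exact 1 - exp(-x) underflows
to 0.0 in floats, so the small-x approximation x is used instead):

import math

def collision_probability(k, bits):
    # birthday bound: chance of any collision among k random bits-bit IDs
    x = float(k) ** 2 / (2.0 * 2.0 ** bits)
    if x < 1e-8:
        return x   # 1 - exp(-x) ~= x for small x
    return 1.0 - math.exp(-x)

print collision_probability(2**30, 80)    # 10 random bytes: ~ 4.8e-07
print collision_probability(2**30, 128)   # 16 random bytes: ~ 1.7e-21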

Steve Holden

Sep 7, 2007, 10:36:27 PM

Right. The words "birthday attack" should have been in the back of your
mind while you were writing that message.

regards
Steve
--
Steve Holden +1 571 484 6266 +1 800 494 3119
Holden Web LLC/Ltd http://www.holdenweb.com
Skype: holdenweb http://del.icio.us/steve.holden
--------------- Asciimercial ------------------
Get on the web: Blog, lens and tag the Internet
Many services currently offer free registration
----------- Thank You for Reading -------------

Steven D'Aprano

Sep 7, 2007, 10:58:33 PM

On Fri, 07 Sep 2007 08:47:58 -0700, Paul Rubin wrote:

> Paul Rubin <http://phr...@NOSPAM.invalid> writes:
>> def unique_id():
>> return os.urandom(10).encode('hex')
>
> Sorry, make that 32 or 40 instead of 10, if the number of id's is large,
> to make birthday collisions unlikely.

I did a small empirical test, and with 16 million ids, I found no
collisions.

However, I did find that trying to dispose of a set of 16 million short
strings caused my Python session to lock up for twenty minutes until I
got fed up and killed the process. Should garbage-collecting 16 million
strings really take 20+ minutes?


> If you don't want the id's to be that large, you can implement a Feistel
> cipher using md5 or sha as the round function pretty straightforwardly,
> then just feed successive integers through it. That also guarantees
> uniqueness, at least within one run of the program. I have some sample
> code around for that, let me know if you need it.

I'm not sure that I need it, but I would certainly be curious to see it.


Thanks,


--
Steven.

Steven D'Aprano

Sep 7, 2007, 11:24:54 PM

On Fri, 07 Sep 2007 19:17:44 -0700, Paul Rubin wrote:

>> Here's something which is a little less predictable than a straight
>> counter:
>
> It's still very easy to generate valid id's from that, or guess the next
> one with non-negligible probability.

Well, in my specific application, the only consequence of guessing a
valid id is that you get to congratulate yourself for guessing a valid
id :)

As it turned out, the only reason I had for avoiding consecutive ids
was that they didn't look random, and since no human being (including
myself) was actually going to be looking at them except during
debugging, that didn't really matter. So I used the itertools.count()
solution.


Thanks to all who replied.

--
Steven.

Paul Rubin

Sep 8, 2007, 12:15:50 AM

Steven D'Aprano <st...@REMOVE-THIS-cybersource.com.au> writes:
> > Sorry, make that 32 or 40 instead of 10, if the number of id's is large,
> > to make birthday collisions unlikely.
>
> I did a small empirical test, and with 16 million ids, I found no
> collisions.

16 million 32-byte ids? With string and dictionary overhead that's
probably on the order of 1 GB. Anyway, 16 bytes is enough, as
mentioned elsewhere.

> However, I did find that trying to dispose of a set of 16 million short
> strings caused my Python session to lock up for twenty minutes until I
> got fed up and killed the process. Should garbage-collecting 16 million
> strings really take 20+ minutes?

Maybe your system was thrashing, or maybe the GC was happening during
allocation (there was some discussion of that a while back).

> > If you don't want the id's to be that large, you can implement a Feistel

> I'm not sure that I need it, but I would certainly be curious to see it.

I posted some code elsewhere in the thread.

Hrvoje Niksic

Sep 8, 2007, 5:38:35 AM

Steven D'Aprano <st...@REMOVE-THIS-cybersource.com.au> writes:

> Should garbage-collecting 16 million strings really take 20+
> minutes?

It shouldn't. For testing purposes I've created a set of 16 million
strings like this:

s = set()
for n in xrange(16000000):
    s.add('somerandomprefix' + str(n))  # prefix makes the strings a bit larger

It takes about 20 seconds to create the set. Quitting Python
takes 4-5 seconds. This is stock Python 2.5.1.

stephen.l...@googlemail.com

Sep 10, 2007, 6:44:09 PM

On Sep 7, 1:03 pm, Steven D'Aprano <st...@REMOVE-THIS-cybersource.com.au> wrote: [...]

This is very simple. Use a class variable that increments every time a
new object is created. If you need it in a number of different types of
object, you could wrap the code below in a metaclass and tag all of
your classes with the metaclass (see the sketch after the example).
Unless you are making billions of objects, I think this should suffice.

class MyObj:
    id_count = 0

    def __init__(self):
        self.id = self.id_count
        MyObj.id_count += 1
        print self.id, MyObj.id_count

MyObj()
MyObj()
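
A minimal sketch of that metaclass variant (AutoId, Widget and Gadget
are hypothetical names; one counter is shared by every tagged class):

import itertools

class AutoId(type):
    # metaclass: stamp every new instance with an id from a shared counter
    _counter = itertools.count()

    def __call__(cls, *args, **kwargs):
        obj = super(AutoId, cls).__call__(*args, **kwargs)
        obj.id = AutoId._counter.next()
        return obj

class Widget(object):
    __metaclass__ = AutoId

class Gadget(object):
    __metaclass__ = AutoId

print Widget().id, Gadget().id, Widget().id   # 0 1 2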
