creating unique numeric IDs in datastore (sample code)


vrypan

Apr 27, 2008, 2:10:53 AM
to Google App Engine
I wrote some code to implement the equivalent of a "unique
auto_increment index" in the datastore (or rather, to work around the
lack of one).

You can find the code here: http://vrypan.net/log/2008/04/27/unique-integer-ids-in-google-datastore/

It looks like it's working, but I'm a Python newbie, so please point
out any mistakes or performance considerations!
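
The code is at the link above; the general idea -- a counter entity
incremented inside a transaction, whose new value becomes the id of the
record you are about to store -- can be sketched roughly like this (the
model and function names here are illustrative, not the actual code
from the post):

from google.appengine.ext import db

class GlobalCounter(db.Model):
    count = db.IntegerProperty(default=0)

def next_id(counter_name='global'):
    """Atomically bump the named counter and return its new value."""
    def txn():
        counter = GlobalCounter.get_by_key_name(counter_name)
        if counter is None:
            counter = GlobalCounter(key_name=counter_name)
        counter.count += 1
        counter.put()
        return counter.count
    return db.run_in_transaction(txn)

# usage: store the returned value in your own IntegerProperty, e.g.
# article.seq_id = next_id()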

Jeff Hinrichs

Apr 27, 2008, 11:28:46 AM
to Google App Engine
I don't think that an auto-increment field is the way to go. It is
viable when you only have one database, but I don't think that is how
GAE operates. Someone step in and correct me if I'm wrong. The
datastore for your app is going to/can be replicated out to other
machines based on geographic usage. This would mean that there exist
times when datastore' != datastore'' -- over time datastore' would be
synced with datastore'' so that datastore' == datastore''. This leads
one to believe that there will be times when an auto-increment field
will not be synchronizable, or that the result of the synchronization
will be less than satisfactory. My belief that auto-increment fields
are the wrong idea in this environment is strengthened by the fact
that they are not offered as an intrinsic data type in the Model or
Expando classes.

The way to go, in my opinion, is to use UUIDs. (see links below)
http://docs.python.org/lib/module-uuid.html
http://www.faqs.org/rfcs/rfc4122.html

1) Data access is very expensive; generating a UUID should be faster.
2) UUID1 or UUID4 would be the types to consider.
3) UUID1 is preferable, as it mixes in some machine-specific
information, which should make the chance of a collision even more
remote than with UUID4 (random).
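
A rough sketch of what this could look like (the model and property
names below are made up for illustration; only the uuid calls come from
the standard library):

import uuid
from google.appengine.ext import db

class Customer(db.Model):
    name = db.StringProperty(required=True)
    uid = db.StringProperty(required=True)

# uuid1() mixes in a host identifier and a timestamp; uuid4() is purely random.
uid = str(uuid.uuid1())
customer = Customer(key_name=uid, uid=uid, name='Alice')
customer.put()  # one write; the id exists before the entity is ever stored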

Maybe one of the Google engineers would want to comment. Also, it
would be nice if GAE supplied a UUID property as one of the datastore
value types.

-Jeff

Lee O

Apr 27, 2008, 4:51:33 PM
to google-a...@googlegroups.com
Note that there are also unique ids for each object in the datastore. They also increment. The only problem is that they are not specific to a model type, but rather simply based on all objects in the store.
--
Lee Olayvar
http://www.leeolayvar.com

Dave Hanson

Apr 27, 2008, 6:11:46 PM
to Google App Engine
When I need an id field, I use the unique id for each entity in the
datastore. It's a bit of a pain, because you must save new objects
once before their ids are available, e.g.,

class Label(db.Model):
    name = db.StringProperty(required=True)
    ...
    id = db.IntegerProperty()

class LabelHandler(webapp.RequestHandler):
    def post(self):
        ...
        label = Label(name=self.request.get('name'))
        label.put()
        label.id = label.key().id()
        label.put()
        ...

vrypan

Apr 27, 2008, 8:15:55 PM
to Google App Engine
Everyone mentioning object.key().id() is right. There are some
implications, however:

1. Is key().id() really faster than my suggestion?
It should be. But by how much?

2. I like Dave Hanson's example. However, what happens if someone
tries to access label.id before it is stored back?
Ensuring consistency there may take more resources -- or not?

3. My "counter" approach has an extra bonus: it's useful when you need
to know the number of objects stored. Remember, there is no "select
count()" in the datastore, and the key().id() sequence is not well
defined (the next object you store may have an id that is not the last
object's id + 1).

Jeff, I don't think your synchronization concerns are valid in this
case. It looks like the implementation I suggested is consistent (the
counter increments take place in a transaction), and the actual piece
of information that needs to be replicated between servers is a
relatively small object. That said, I have no insight into how Google's
infrastructure works, so I may be totally wrong. :-)

It would be nice if Google provided a stress-testing service for our
apps. Something like ab (the Apache benchmarking tool) for App Engine,
hosted by Google?

--Panayotis.

Jeff Hinrichs

Apr 27, 2008, 9:02:15 PM
to Google App Engine


On Apr 27, 7:15 pm, vrypan <vry...@gmail.com> wrote:
> Everyone mentioning object.key().id() is right. There are some
> implications, however:
>
> 1. Is key().id() really faster than my suggestion?
> It should be. But by how much?
>
> 2. I like Dave Hanson's example. However, what happens if someone
> tries to access label.id before it is stored back?
> Ensuring consistency there may take more resources -- or not?
>
> 3. My "counter" approach has an extra bonus: it's useful when you need
> to know the number of objects stored. Remember, there is no "select
> count()" in the datastore, and the key().id() sequence is not well
> defined (the next object you store may have an id that is not the last
> object's id + 1).
>
> Jeff, I don't think your synchronization concerns are valid in this
> case. It looks like the implementation I suggested is consistent (the
> counter increments take place in a transaction), and the actual piece
> of information that needs to be replicated between servers is a
> relatively small object. That said, I have no insight into how Google's
> infrastructure works, so I may be totally wrong. :-)

I, too, am hoping someone from Google will weigh in on the issue.
Since I don't have a GAE account yet, I've had to do most of my
design/testing based on assumptions about synchronization and
replication. And I could be very wrong. ;) However, using UUIDs in
places where I would have used auto-increment is not that big of a
deal. A UUID is a surrogate key, just as an auto-increment field is,
and it can be created prior to "put"ting the record, so additional db
interactions are not necessary. So the idea does have certain
advantages. As for keeping a count of records, that can be memoized,
so the penalty for a COUNT(*) could be amortized over a large number
of requests. (I'm not a big fan of surrogate keys, but I am a
pragmatist on the topic.)
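
As a rough illustration of that memoization idea (the helper below is
made up, not something from GAE or this thread), a cached count using
memcache might look like:

from google.appengine.api import memcache
from google.appengine.ext import db

def cached_count(model_class, ttl=300):
    """Return the entity count, recomputing it at most once per ttl seconds."""
    cache_key = 'count:%s' % model_class.kind()
    count = memcache.get(cache_key)
    if count is None:
        # count() is capped at 1000 results in the current datastore, so
        # for large kinds you would maintain a stored counter instead.
        count = model_class.all().count()
        memcache.set(cache_key, count, time=ttl)
    return count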

While the dev environment is apparently quite good at mocking the
Python portion and basic table operations, it does leave a lot
unanswered about the implications of replication and synchronization --
I'm not sure anyone (non-Google) completely understands it with regard
to those two topics. And I don't fault the dev environment; it's a nice
piece of work, but it isn't GAE proper <g>, nor could it be expected to
be.

Regards,

Jeff

Jeremey Barrett

Apr 28, 2008, 12:41:49 AM
to google-a...@googlegroups.com
On Sun, Apr 27, 2008 at 5:11 PM, Dave Hanson <d...@drhanson.net> wrote:
>
> When I need an id field, I use the unique id for each entity in the
> datastore. It's a bit of a pain, because you must save new objects
> once before their ids are available, e.g.,
>
> class Label(db.Model):
>     name = db.StringProperty(required=True)
>     ...
>     id = db.IntegerProperty()

This would be better done with a property, I think:

class Label(db.Model):
    name = db.StringProperty(required=True)

    def get_id(self):
        return self.key().id()
    id = property(get_id)

Then you can do:

label = Label(name='foo')
label.put()
print label.id

You could put more intelligence into get_id() of course.

I'm not sure there's a compelling reason to do any of this, though,
since label.key().id() is quite simple if you want a simple numeric id.

Regards,
Jeremey.

Ben the Indefatigable

Apr 28, 2008, 8:43:17 AM
to Google App Engine
On Apr 27, 11:28 am, dundeemt <dunde...@gmail.com> wrote:
> I don't think that an auto-increment field is the way to go. It is
> viable when you only have one database, but I don't think that is how
> GAE operates. Someone step in and correct me if I'm wrong. The
> datastore for your app is going to/can be replicated out to other
> machines based on geographic usage. This would mean that there exist
> times when datastore' != datastore'' -- over time datastore' would be
> synced with datastore'' so that datastore' == datastore''. This leads
> one to believe that there will be times when an auto-increment field
> will not be synchronizable, or that the result of the synchronization
> will be less than satisfactory. My belief that auto-increment fields
> are the wrong idea in this environment is strengthened by the fact
> that they are not offered as an intrinsic data type in the Model or
> Expando classes.

On Apr 27, 8:15 pm, vrypan <vry...@gmail.com> wrote:
> Jeff, I don't think your synchronization concerns are valid in this
> case. It looks like the implementation I suggested is consistent (the
> counter increments take place in a transaction), and the actual piece
> of information that needs to be replicated between servers is a
> relatively small object. That said, I have no insight into how Google's
> infrastructure works, so I may be totally wrong. :-)

Jeff, synchronization is not a concern. The datastore hides the issue
of multiple machines. Even when you don't have much data, it is likely
that the master copies of different entities of your data reside on
different machines, because your data is never assigned or bound to one
machine; the BigTable infrastructure is shared by many applications
from the beginning. Rest assured that "datastore == datastore", as you
put it. The whole point of a transaction on an entity is that you can
rely on the integrity of that entity regardless of the datastore's
underlying implementation.

Mahmoud

Apr 28, 2008, 10:37:39 AM
to Google App Engine
Why do you need an auto_increment id?

The datastore already generates a globally _unique_ key for each
entity. Uniqueness is assured by choosing a long enough random
alphanumeric string, which makes collisions practically impossible
and eliminates the need for expensive transactions. Moreover, there is
also a numerical id that is unique within each entity type.
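
For illustration (the Note model is invented; key() and key().id() are
the actual datastore calls), the two identifiers look like this:

from google.appengine.ext import db

class Note(db.Model):
    title = db.StringProperty()

note = Note(title='hello')
note.put()
print str(note.key())    # globally unique key, encodable as a string
print note.key().id()    # numeric id, unique among entities of this kind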

As for record/entity counts, numerous posts have suggested storing the
count somewhere in the datastore. However, one would have to update
the count whenever entities are added or deleted. This adds a penalty
to write operations, but makes fetching the count trivial.

-Mahmoud

vrypan

Apr 28, 2008, 11:09:26 AM
to Google App Engine
On Apr 28, 5:37 pm, Mahmoud <mahmoud.ar...@gmail.com> wrote:
> Why do you need an auto_increment id?
>
> As for record/entity counts, numerous posts have suggested storing the
> count somewhere in the datastore. However, one would have to update
> the count whenever entities are added or deleted. This adds a penalty
> to write operations, but makes fetching the count trivial.
>

Storing the count is exactly the same problem! After all, if storing
the count was so trivial, you would use its current value as a unique,
auto-increment id, no?
You have to update the counter in a transaction, making sure you read
the actual current value and that two concurrent actions don't both
take it from x to x+1.
Back to where we started :-)

Ben the Indefatigable

Apr 28, 2008, 12:59:34 PM
to Google App Engine
A better title for this post might have been "sequence number" or
"count" rather than "unique ID". I imagine there are genuine uses for
this, because it really just boils down to run_in_transaction on a
function that increments the number.

Jeremey Barrett

Apr 28, 2008, 3:35:32 PM
to google-a...@googlegroups.com
On Mon, Apr 28, 2008 at 10:09 AM, vrypan <vry...@gmail.com> wrote:
>
> Storing the count is exactly the same problem! After all, if storing
> the count was so trivial, you would use its current value as a unique,
> auto-increment id, no?

An auto-incremented id is equal to the count only as long as you never
delete anything. :)

Here's a thought:

class Counter(db.Model):
    # Counter is defined before Foo so Foo's ReferenceProperty can refer to it.
    count = db.IntegerProperty()

    def incr(self):
        db.run_in_transaction(self.__increment)

    def decr(self):
        db.run_in_transaction(self.__decrement)

    def __increment(self):
        counter = Counter.get(self.key())
        counter.count += 1
        self.count = counter.count
        counter.put()

    def __decrement(self):
        counter = Counter.get(self.key())
        counter.count -= 1
        self.count = counter.count
        counter.put()

class Foo(db.Model):
    name = db.StringProperty()
    counter = db.ReferenceProperty(Counter, required=True)

# to initialize once and only once
foo_count = Counter(count=0, key_name="foo_counter")
foo_count.put()

# Then:

f = Foo(name="some_foo", counter=foo_count)
f.put()
f.counter.incr()


Regards,
Jeremey.

Jeremey Barrett

Apr 28, 2008, 3:42:19 PM
to google-a...@googlegroups.com
On Mon, Apr 28, 2008 at 2:35 PM, Jeremey Barrett
<jeremey...@gmail.com> wrote:
>
> Here's a thought:
>
> class Foo(db.Model):
>     name = db.StringProperty()
>     counter = db.ReferenceProperty(Counter, required=True)

FWIW, the relationship here is not required, you could do without it.
It might have its uses though, depending on what you're doing.

Regards,
Jeremey.

Dave Hanson

Apr 29, 2008, 12:52:21 AM
to Google App Engine
On Apr 27, 9:41 pm, "Jeremey Barrett" <jeremey.barr...@gmail.com>
wrote:
> ... This would be better done with a property, I think:
>
> class Label(db.Model):
>     name = db.StringProperty(required=True)
>
>     def get_id(self):
>         return self.key().id()
>     id = property(get_id)

Good suggestion. At first, I thought I needed to store the ids in the
datastore in a db property so I could access them in templates, but
after seeing your suggestion, I realized that, with a little work, I
could omit the db property and use the Python property as you show
above. I switched to using properties for the id fields of my entities
and for other synthesized properties. I lost the caching effect of
storing the synthesized properties, but that benefit was tiny (and
nonexistent for one of my synthesized properties :-)

pick...@gmail.com

Apr 29, 2008, 4:38:03 AM
to Google App Engine
class Sequence(db.Model):
    @staticmethod
    def next():
        seq = Sequence()
        seq.put()
        id = seq.key().id()
        seq.delete()
        return id

print Sequence.next()

I'm using it in my GAE application, and it seems OK.

Brett Morgan

Apr 29, 2008, 6:07:02 AM
to google-a...@googlegroups.com
Let me get this straight in my head. You are creating an object,
stuffing it into the database, remembering its key, and then deleting
it?

So, what is the advantage over doing this in the class that you want
unique identifiers in?

--

Brett Morgan http://brett.morgan.googlepages.com/

Filip

Apr 29, 2008, 7:24:06 AM
to Google App Engine
Wouldn't this approach create contention on the datastore, because
every call needs to update the counter? The new Google App Engine blog
post http://googleappengine.blogspot.com/2008/04/posted-by-ken-ashcraft-software.html
warns against doing that, because the application will have scaling
issues.

Filip

Jeff Hinrichs

Apr 29, 2008, 10:11:19 AM
to Google App Engine
That is always good to know, but according to
http://googleappengine.blogspot.com/2008/04/posted-by-ken-ashcraft-software.html,
I still think I am doing the right thing by using UUIDs instead --
although my initial reasoning for going that route was incorrect. The
blog says to avoid contention on a single datastore element, which a
UUID does, and also to minimize the number of writes to the database.
Since the KeyName/ID cannot be used like a normal property in a GQL
query, if you wish to do so you must put/get/put, which violates the
rule of minimizing writes. Whereas a UUID is computed before the entity
is stored, so it only requires a single put to save the data and the
surrogate key.

Regards,

Jeff

vrypan

Apr 29, 2008, 10:41:41 AM
to Google App Engine
Filip, you're right.

I've ended up "distributing" my counter across multiple counters (1024
in my case), and my app picks one of them at random on each increment.
It adds complexity, but in my case it may work, and the chance of
creating a bottleneck is very low. (Of course, depending on your needs
you may use many more counters.)

On the other hand, calculating the total number of records is not
trivial, but it is still feasible: you just have to sum up the values
of all the counters. Not as easy as a simple lookup, but still better
than going through millions of records.
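
A minimal sketch of that sharded approach (this is not my actual code;
the names and the shard count are only illustrative):

import random
from google.appengine.ext import db

NUM_SHARDS = 20  # I use 1024; pick whatever matches your write rate

class CounterShard(db.Model):
    count = db.IntegerProperty(default=0)

def increment():
    """Add 1 to one randomly chosen shard, inside a transaction."""
    shard_name = 'shard-%d' % random.randint(0, NUM_SHARDS - 1)
    def txn():
        shard = CounterShard.get_by_key_name(shard_name)
        if shard is None:
            shard = CounterShard(key_name=shard_name)
        shard.count += 1
        shard.put()
    db.run_in_transaction(txn)

def total():
    """Sum every shard: more work than one lookup, far less than a full scan."""
    shards = CounterShard.get_by_key_name(
        ['shard-%d' % i for i in range(NUM_SHARDS)])
    return sum(s.count for s in shards if s is not None)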

--Panayotis.


On Apr 29, 2:24 pm, Filip <filip.verhae...@gmail.com> wrote:
> Wouldn't this approach create contention on the datastore, because
> every call needs to update the counter? The new Google App Engine blog
> post http://googleappengine.blogspot.com/2008/04/posted-by-ken-ashcraft-so...

Ben the Indefatigable

Apr 29, 2008, 10:52:20 AM
to Google App Engine


On Apr 29, 10:11 am, dundeemt <dunde...@gmail.com> wrote:
> I still think I am doing the right thing by using UUIDs instead --
> although my initial reasoning for going that route was incorrect. The
> blog says to avoid contention on a single datastore element, which a
> UUID does, and also to minimize the number of writes to the database.
> Since the KeyName/ID cannot be used like a normal property in a GQL
> query, if you wish to do so you must put/get/put, which violates the
> rule of minimizing writes. Whereas a UUID is computed before the entity
> is stored, so it only requires a single put to save the data and the
> surrogate key.
>
> Regards,
>
> Jeff

Yes, as I said in my next post, to me this technique is for sequence
numbers or counts, not unique IDs. I do not agree with using this
technique for unique IDs since they are already provided implicitly.

Jeremey Barrett

Apr 29, 2008, 11:11:42 AM
to google-a...@googlegroups.com
It would depend on what you used it for. This approach or something
like it might make sense where counting is expensive and updating
the cached counter is not going to be a point of contention. You absolutely
would not use it for something like global unique IDs (which AppEngine
gives you already).

Regards,
Jeremey.

Ben the Indefatigable

Apr 29, 2008, 11:59:10 AM
to Google App Engine
This thread is *not* about a "global counter" hit on every request to
your app in the sense that Ken is talking about; it is about a counter
on *writes* that occur because an entity is added or modified.

Ken Ashcraft's warning was this:

"Avoid contention on datastore entities. If every request to your app
reads or writes a particular entity, latency will increase as your
traffic goes up because reads and writes on a given entity are
sequential. One example construct you should avoid at all costs is the
global counter, i.e. an entity that keeps track of a count and is
updated or read on every request. There are some interesting ways of
simulating this behavior that don't require reads/writes on every
request, and we'll talk about a handy way to cache entities for reads
below."

Ken does not give a suggestion for implementing a global counter; he
only says to avoid it. Later in his post he shows that you can reduce
*reads* by caching your read in a global variable.

So, Ken's concern about lock contention probably does not apply to
this thread. I understand that Ken pointed to "contention", which
evokes a lock "bottleneck", which is what you tried to address with a
"distributed" counter. But Ken was talking about avoiding a counter on
every request, not every write. Since the ratio of reads to writes in
web applications is usually around 1000 to 1, what Ken is pointing out
is that you don't want to introduce a write on every request! That's
pretty simple, actually; I think Ken's mention of "contention"
distracted from the point.

Panayotis, as you stated, your distributed-counter solution only gives
you a count in such a way that determining the count is an expensive
query. That is only useful if you need some way of getting the count
without running the datastore count() query over a huge set of
entities. It will not work for applications of counters or sequence
numbers where you need to know the exact number.

If a true counter or sequence ID is critical to your application, the
contention of incrementing a counter in a single entity should be
okay. If that entity is being modified and used constantly, it will
remain in memory. The datastore commits changes by distributing them
to the memory of multiple machines and appending to a commit log. If
you can do without it, then by all means avoid it. If you are trying
to keep a grand total of records, you might use your distributed
counter, or more likely you will have somewhere more natural to keep
subcounts, such as other categories or divisions within your site.

Ben the Indefatigable

Apr 29, 2008, 12:32:04 PM
to Google App Engine
In this post, Marce of Google describes a mechanism for generating a
real-number ID used to select a random record while avoiding a
"uniformly incrementing counter," which goes to show that if you use
your imagination you may find a satisfactory alternative to the
sequence number/counter problem, depending on your situation.

http://groups.google.com/group/google-appengine/msg/363502ea2291e01d