Creating unique key names

80 views
Skip to first unread message

Andy Freeman

unread,
Nov 21, 2008, 1:01:11 PM11/21/08
to Google App Engine
I'd need to create a unique name to use to name an entity group
without using any datastore operations.

Yes, time.time() and datetime.datetime.now() are both pretty good, but
humor me.

Is os.getpid() available?

Is there any way to get the hostname, IP, or MAC address where the GAE
app is running? (The ways that I know involve the socket module,
which I doubt is available.)

David Symonds

unread,
Nov 21, 2008, 8:32:29 PM11/21/08
to google-a...@googlegroups.com
On Sat, Nov 22, 2008 at 5:01 AM, Andy Freeman <ana...@earthlink.net> wrote:

> I'd need to create a unique name to use to name an entity group
> without using any datastore operations.

I don't understand what you're trying to do. Entity groups don't have names.


Dave.

Andy Freeman

unread,
Nov 21, 2008, 9:10:28 PM11/21/08
to Google App Engine
Entity groups are named by the key of the root, which can be
constructed from a kind and a key name.

From http://code.google.com/appengine/docs/datastore/keysandentitygroups.html

"You can create an entity with an ancestor path without first creating
the parent entity. To do so, you create a Key for the ancestor using a
kind and key name, then use it as the parent of the new entity. All
entities with the same root ancestor belong to the same entity group,
whether or not the root of the path represents an actual entity."

Suppose that I want to atomically create an entity group with two
nodes, one the parent of the other.

class A(db.Model):
pass

class B(db.Model):
pass

a = A(key_name='unique')
b = B(parent=a.key())

db.put([a,b])

Note that there's a bug in the SDK (and maybe GAE as well) in this
area

http://code.google.com/p/googleappengine/issues/detail?id=883

David Symonds

unread,
Nov 22, 2008, 12:21:25 AM11/22/08
to google-a...@googlegroups.com
On Sat, Nov 22, 2008 at 1:10 PM, Andy Freeman <ana...@earthlink.net> wrote:

> Entity groups are named by the key of the root, which can be
> constructed from a kind and a key name.

Okay, that's not really the entity group being named, but the root
entity being named.

> Suppose that I want to atomically create an entity group with two
> nodes, one the parent of the other.

But *why* exactly do you want to do this? In your original email, you
said "Is there any way to get the hostname, IP, or MAC address where
the GAE app is running?", which is a fundamentally flawed question.
Your App Engine application doesn't run on a single machine, IP or MAC
address, because it runs on potentially hundreds of machines
concurrently. If you're wanting to use the aforementioned details to
get a unique identifier you're also doomed to failure, because there's
no guarantee that there will be only one instance of your application
on a machine at any one time.

If you tell us more about *what* you're trying to achieve, rather than
*how* you think you can achieve it, we can give you a better idea.


Dave.

Andy Freeman

unread,
Nov 22, 2008, 4:50:03 PM11/22/08
to Google App Engine
> > Suppose that I want to atomically create an entity group with two
> > nodes, one the parent of the other.
> But *why* exactly do you want to do this?

Because I want "a set of one or more entities that can be manipulated
in a single transaction. Entity group relationships tell App Engine to
store several entities in the same part of the distributed network. A
transaction sets up datastore operations for an entity group, and all
of the operations are applied as a group, or not at all if the
transaction fails."

> In your original email, you
> said "Is there any way to get the hostname, IP, or MAC address where
> the GAE app is running?", which is a fundamentally flawed question.
> Your App Engine application doesn't run on a single machine, IP or MAC
> address, because it runs on potentially hundreds of machines
> concurrently. If you're wanting to use the aforementioned details to
> get a unique identifier you're also doomed to failure, because there's
> no guarantee that there will be only one instance of your application
> on a machine at any one time.

The fact that GAE uses many machines and concurrently is why the full
hostname, IP, or MAC address or some other machine identifier is
useful in creating a unique identifier on GAE. (If my application
always ran on the same machine, the process id and time would be
sufficient.)

Network identifiers, such as MAC, full hostname, and IP, are very
likely to be unique at a given time within a given network
environment, such as Google. (Yes, even across datacenters.)
However, a given machine may be running multiple processes, so it's
possible that time.time() will return the same time in different
processes. However, since GAE instances run in single-threaded
processes and process ids at a given time are unique on a given
machine, the triple(machine id, time, processid,) is unique.

Yes, I realize that time being reset or time inconsistencies between
different machines (that happen to have the same IP or hostname at
different times) can cause duplication. That's low enough probability
that I can live with it (and I can use urandom to reduce the odds
still more).

David Symonds

unread,
Nov 22, 2008, 8:07:02 PM11/22/08
to google-a...@googlegroups.com
On Sun, Nov 23, 2008 at 8:50 AM, Andy Freeman <ana...@earthlink.net> wrote:

>> > Suppose that I want to atomically create an entity group with two
>> > nodes, one the parent of the other.
>> But *why* exactly do you want to do this?
>
> Because I want "a set of one or more entities that can be manipulated
> in a single transaction. Entity group relationships tell App Engine to
> store several entities in the same part of the distributed network. A
> transaction sets up datastore operations for an entity group, and all
> of the operations are applied as a group, or not at all if the
> transaction fails."

Yes, I understand transactions and entity groups. Why do you need to
create an entity group *atomically*?

> The fact that GAE uses many machines and concurrently is why the full
> hostname, IP, or MAC address or some other machine identifier is
> useful in creating a unique identifier on GAE. (If my application
> always ran on the same machine, the process id and time would be
> sufficient.)

If you create a new entity, it will automatically be assigned a unique
key at the datastore level. What's wrong with just using that?


Dave.

yejun

unread,
Nov 22, 2008, 8:43:44 PM11/22/08
to Google App Engine
UUID should be ok, which use system urandom as seed by default.
os.getpid isn't available nor unique across processes.

Andy Freeman

unread,
Nov 22, 2008, 9:38:13 PM11/22/08
to Google App Engine
> Yes, I understand transactions and entity groups. Why do you need to
> create an entity group *atomically*?

For the same reason that transactions are useful - incomplete groups
are wrong (in my application) and I'd rather not deal with them.

> If you create a new entity, it will automatically be assigned a unique
> key at the datastore level. What's wrong with just using that?

Each db.put has significant overhead. If I can generate a unique name
without a db.put, I can reduce the number of db.puts that my
application does by a factor of 2.

Andy Freeman

unread,
Nov 22, 2008, 9:46:13 PM11/22/08
to Google App Engine
> os.getpid isn't available

Thanks.

> nor unique across processes.

Huh? During the life of a process A on a machine B, no other process
on B will have the same process id as A.

Two different processes on a given machine may have the same id if
their lifetimes are disjoint and two processes on different machines
may have the same process id at the same time, but the latter is just
why some sort of machine identifier is important.
> > Dave.- Hide quoted text -
>
> - Show quoted text -

David Symonds

unread,
Nov 23, 2008, 12:16:40 AM11/23/08
to google-a...@googlegroups.com
On Sun, Nov 23, 2008 at 1:38 PM, Andy Freeman <ana...@earthlink.net> wrote:

>> Yes, I understand transactions and entity groups. Why do you need to
>> create an entity group *atomically*?
>
> For the same reason that transactions are useful - incomplete groups
> are wrong (in my application) and I'd rather not deal with them.

A two-stage insertion would work fine, then. Insert the first object
with a "not-ready" flag set, insert the second object, then go back
and flip the "not-ready" flag.

>> If you create a new entity, it will automatically be assigned a unique
>> key at the datastore level. What's wrong with just using that?
>
> Each db.put has significant overhead. If I can generate a unique name
> without a db.put, I can reduce the number of db.puts that my
> application does by a factor of 2.

Generating an application-globally-unique name is hard, and App Engine
is mainly focused on applications where the writes are few and the
reads are many. Until you've benchmarked it, it sounds like you are
prematurely optimising for an operation that should be relatively
rare, while seriously increasing the complexity of your system. Who
said the datastore would be on the same machine as the running
application, anyway?


Dave.

yejun

unread,
Nov 23, 2008, 6:46:13 AM11/23/08
to Google App Engine
There's no currently available method to identify process and machine.
Your best bet is random id generated during module initialization.

Jon McAlister

unread,
Nov 24, 2008, 7:59:47 PM11/24/08
to Google App Engine
Correct, we don't provide pid or hostname but os.urandom() will yield
a secure random seed, and every runtime process is uniquely seeded, so
you don't need to worry about any two distinct runtime processes ever
generating the same random numbers.

If you don't feel like random numbers give you enough of a guarantee
for your app, then you should use the datastore.

Josh Heitzman

unread,
Nov 24, 2008, 10:18:43 PM11/24/08
to Google App Engine
Like Jon McAlister said either use a random number or create a new
entity when one of your modules is loaded and treat that entity's key
as the globally unique process ID (i.e. MAC address + pid). The later
solution only requires one put during any given processes lifetime so
it shouldn't be a perf problem.

Andy Freeman

unread,
Dec 9, 2008, 11:32:54 AM12/9/08
to Google App Engine
> A two-stage insertion would work fine, then. Insert the first object
> with a "not-ready" flag set, insert the second object, then go back
> and flip the "not-ready" flag.

If the application has a datastore or quota timeout between the "store
to get a key" and the store for the actual data, the "store to get a
key" results in an incomplete group.

Note that the "not ready flag" adds a third store and another
opportunity for an incomplete group (namely a complete group that is
marked not-ready).

These are precisely the sorts of problems that transactions solve.

> Generating an application-globally-unique name is hard.

No, it's not hard at all. The MAC address, time, and process id
suffice (for single threaded applications like GAE, otherwise the
thread id is also needed) and there are other info combos that can be
used to generate such names. Moreover, there are plenty of methods
that can be used to generate a unique name from such information that
don't expose the underlying data. (Google may well want to make it
impossible to determine MAC addresses or process ids, for example.)

It's only hard within the current GAE API because that API doesn't
expose enough information or provide a method that generates a unique
name without exposing the information used to generate said name.

> Who said the datastore would be on the same machine as the running
> application, anyway?

Huh? Nothing that I've written said anything about the machines where
the datastore is running. That information isn't necessary for an
application to generate a unique name.

Asking that question suggests that you don't understand what I've been
talking about.

On Nov 22, 9:16 pm, "David Symonds" <dsymo...@gmail.com> wrote:
> On Sun, Nov 23, 2008 at 1:38 PM, Andy Freeman <ana...@earthlink.net> wrote:
> >> Yes, I understand transactions and entity groups. Why do you need to
> >> create an entity group *atomically*?
>
> > For the same reason that transactions are useful - incomplete groups
> > are wrong (in my application) and I'd rather not deal with them.
>
>

Andy Freeman

unread,
Dec 9, 2008, 11:47:55 AM12/9/08
to Google App Engine
On Nov 24, 4:59 pm, Jon McAlister <jon...@google.com> wrote:
> Correct, we don't provide pid or hostname but os.urandom() will yield
> a secure random seed, and every runtime process is uniquely seeded, so
> you don't need to worry about any two distinct runtime processes ever
> generating the same random numbers.

From http://docs.python.org/library/os.html#os.urandom
os.urandom(n)
Return a string of n random bytes suitable for cryptographic use.

I'm pretty sure that I'll get dups if I call os.urandom(1), or even
os.urandom(3), enough times in different processes. (For os.urandom
(1), I'm guaranteed a dup by the 256th call and should get a dup
before 20 tries; os.urandom(3) should yield dups before 4500 tries.)

So, how many bytes do I need to request to get a reasonable guarantee
of uniqueness?

And, wouldn't it be better to simply provide a function that
guarantees uniqueness without involving the datastore? It doesn't
even have to return the same string each time it's called within a
given sandbox instance's lifetime.

As I wrote above, such a function can be provided that doesn't reveal
anything about google's infrastructure (which might be why os.getpid()
and functions that reveal machine specific info are unavailable).
> > > > - Show quoted text -- Hide quoted text -

Andy Freeman

unread,
Dec 9, 2008, 11:57:11 AM12/9/08
to Google App Engine
> The later
> solution only requires one put during any given processes lifetime so
> it shouldn't be a perf problem.

It introduces a clean-up problem. I can't delete such an object
until after I delete all entity groups named using said object's key.
(GAE is free to reused generated keys.)

I shouldn't have mentioned the overhead of puts. The real problem is
cleanup and consistency, a problem that transactions are designed to
solve.

yejun

unread,
Dec 9, 2008, 1:01:08 PM12/9/08
to Google App Engine
You don't have to use global data entity. For example use a datastore
backed global count as base number.
Your unique id can be generated by that count multiply by a big number
+ a local count.

Andy Freeman

unread,
Dec 10, 2008, 9:56:11 AM12/10/08
to Google App Engine
A local count can be unique across URL requests that happen to hit the
same process, but it can't be unique across processes. In that, a
local count is like time.

A datastore counter eliminates the dangling object problem because a
single counter can be used to manage an arbitrarily large number of
entity group types (when combined with time), but its update (once for
every process that needs to generate unique names) is a bottleneck.
Of course, that bottleneck can be addressed with multiple counters (so
the count is actually a count/counter-name pair), reducing the problem
to one of uniquely naming the counters, but that's easily solvable.
(Figuring out when there's sufficient congestion to justify creating a
new counter is a little tricky but it's good enough to create "more
than enough" counters.)

Thanks.
> > > - Show quoted text -- Hide quoted text -

ryan

unread,
Dec 11, 2008, 5:40:23 PM12/11/08
to Google App Engine
hi all! there's lots of good information in this thread on how to
generate partially and fully unique ids, along with pros and cons. i'm
also curious about the actual use case, though. specifically...

On Nov 22, 6:38 pm, Andy Freeman <ana...@earthlink.net> wrote:
>
> Each db.put has significant overhead. If I can generate a unique name
> without a db.put, I can reduce the number of db.puts that my
> application does by a factor of 2.
...
> It introduces a clean-up problem. I can't delete such an object
> until after I delete all entity groups named using said object's key.
> (GAE is free to reused generated keys.)
>
> I shouldn't have mentioned the overhead of puts. The real problem is
> cleanup and consistency, a problem that transactions are designed to
> solve.

i'm curious how a named key buys you anything here beyond what an id-
based key gives you.

as background, from my perspective, named keys are useful when you
have an external identifier and you want to enforce and exploit
uniqueness in a namespace that you control. they also let you do a
limited form of querying inside transactions, via get() with a key
that you construct in memory. we found this useful enough that we
added get_by_key_name() as syntactic sugar for it.

i suspect most of the use cases for named keys fit into that mold. as
you mentioned, they don't generally help with optimization, since for
a given use case, named keys will require the same number of put()s as
id-based keys. i'm also not sure how they help with cleanup/garbage
collection or consistency. queries and transactions don't
differentiate between root entities with named keys vs. id-based keys.

btw, you probably already know this, but just in case, you can create
a new root entity and one or more children of that entity in the same
transaction. e.g.

def create():
parent = Foo()
parent.put()
child = Foo(parent=parent)
child.put()
return parent

parent = run_in_transaction(create)

i'm still curious, so i'm probably missing something...?

Andy Freeman

unread,
Dec 12, 2008, 1:06:26 PM12/12/08
to Google App Engine
> i'm curious how a named key buys you anything here beyond what an id-
> based key gives you.

With named keys, it's trivial to create all of the entities in a given
group with a single put, which is a transaction. With id-based keys,
multiple puts are required. In a case that I care about (see below),
I can't bundle those puts into a single user-defined transaction.

Which reminds me - is the transaction for db.put([a, b,]) (for a and b
in the same entity group) run in the datastore or is it run using an
application-side transaction like

def put_a_b_in_transaction():
a.put()
b.put()

db.run_in_transaction(put_a_b_in_transaction)

If db.put([a, b,]) is run in the datastore, are there any interesting
differences from the application-side transaction version? (I'd
expect that the application-side transaction requires some back and
forth with the datastore which offers some opportunities for the
transaction to be aborted for quota reasons.)

> def create():
> parent = Foo()
> parent.put()
> child = Foo(parent=parent)
> child.put()
> return parent
>
> parent = run_in_transaction(create)
>
> i'm still curious, so i'm probably missing something...?

You're missing the ability to read my mind. I haven't mentioned that
(my) parent has references to (some of) its children.

The "one put per entity in a transaction" rule means that I can't
update parent in that transaction after the child put. I can't use
parent as a parent until it has a valid key. I can't generate a
child's key until I can get a valid key for its parent. Since I want
parent to have (some) child keys and I can't update it (in the same
transaction) after I put it, I need to be able to generate all keys
before any puts.

Yes, I could add the parent's key to the relevant children and use
parent.child_set (the collection back-link) to get to those children,
but that's a query and even if it's as efficient as db.get, a parent
entity that doesn't have a child's key can't ask memcache for said
children memcache'd by their key.

Yes, I could memcache a parent's child keys, but we're starting to
pile hack upon hack.

Since globally unique keys have other uses, I'd rather use them here
as well and make things easy.

FWIW, caring (too much) about how named entities work is how I found
issue 883

http://code.google.com/p/googleappengine/issues/detail?id=883&can=4&colspec=ID%20Type%20Status%20Priority%20Stars%20Owner%20Summary%20Log%20Component
.

ryan

unread,
Dec 12, 2008, 8:11:03 PM12/12/08
to Google App Engine
On Dec 12, 10:06 am, Andy Freeman <ana...@earthlink.net> wrote:
>
> You're missing the ability to read my mind. I haven't mentioned that
> (my) parent has references to (some of) its children.
>
> The "one put per entity in a transaction" rule means that I can't
> update parent in that transaction after the child put. I can't use
> parent as a parent until it has a valid key. I can't generate a

ahhh...got it. ok. in that case, you're right. using id-based keys
would make your life easier, since we do the heavy lifting of
generating a unique identifier (the parent key), but it would require
an extra put().

we've actually recently come across the same use case internally, a
couple of times. to address it, we've considered adding a "get next
key" operation. you'd provide the kind and parent key, if any, and it
would allocate an id you can attach to that key for use in a later put
(). you'd use it to decouple id allocation and entity insertion, which
is what you want to do here.

the datastore reserves ids in batches, so in the common case this
operation would take well under 1ms, similar to memcache. if the
datastore has used up its last batch and needs to allocate a new one,
that will take around 10-20ms, but that's very rare.

having said that, i'm still a little skeptical that you truly need to
optimize this. relative to your overall traffic pattern, i'm guessing
this operation is fairly rare, so i wonder if avoiding a single extra
put() is really that important. still, let us know what you think of
the "get next key" operation.


> Which reminds me - is the transaction for db.put([a, b,]) (for a and b
> in the same entity group) run in the datastore or is it run using an
> application-side transaction like

good question! unlike transactions, batch puts, gets, and deletes are
run in the datastore. if they hit contention, they're retried similar
to transactions, except that the retries happen in the datastore, not
in the application side python API. also, the work for each entity
group is done in parallel, up to a point, as opposed to transactions.
given that, you should generally see both asymptotic and constant
factor speedups with batch writes and reads, relative to doing them
individually in application code.

as for quotas, those are only checked at the beginning of each call,
not while the call is running in the datastore itself.

Andy Freeman

unread,
Dec 13, 2008, 11:53:39 AM12/13/08
to Google App Engine
> having said that, i'm still a little skeptical that you truly need to
> optimize this. relative to your overall traffic pattern, i'm guessing
> this operation is fairly rare, so i wonder if avoiding a single extra
> put() is really that important. still, let us know what you think of
> the "get next key" operation.

It's not the cost of the extra put, it's that said extra put can't be
in the transaction that creates the rest of the entity group. Because
of the "only one put per entity in a transaction", that extra put is
either before the transaction that creates all of the children or
after. Either way, the sequence can result in an incomplete/
inconsistent/incorrect group that I have to write code to handle. (If
it's before, I can have a root that has no children. If it's after, I
can have a root that doesn't refer to children that exist.)

Also, there's a certain cleanliness in creating a bunch of db.Model
instances and storing them with a single db.put, and now there's the
"it's run in the datastore" pragmatic argument.

Of course, it isn't just how common an operation is, its position in
the execution path matters. For example, if it's part of startup and
it fails too often, that's a big deal even if the whole application
spends very little time in startup. In my case, it's part of a loop
preamble and my application spends almost all of its time in the
loop. In some cases, being able to start the loop quickly is a big
deal.

As a practical matter, the less often my application creates such
groups, the less effort that I want to expend on ensuring their
consistency/correctness/completeness during creation. (I don't even
want to spend time ensuring consistency/correctness/completeness of
common operations but ....) I'd much rather spend time speeding up
common operations or adding features. Plus, the less common something
is, the less often it will run into trouble, which can make it harder
to detect such trouble and to write the code that handles it
correctly. (In my case, it's on a critical code path, so it has to
work reliably.)

If db.get_next_key comes with the relevant changes to db.Model, I'd
probably be happy. (Your use case must be somewhat different if
db.get_next_key satisfies it but
get_application_process_instance_unique_string doesn't because "get
next key" changes faster.) The "probably" is because I don't know if
I'm going to take a hit because it's a datastore operation that can
increase the number of application failures.

> as for quotas, those are only checked at the beginning of each call,
> not while the call is running in the datastore itself.

I think of every datastore transaction, url fetch, and the "return the
result to the user" as an opportunity for GAE to abort my
application. Yes, I know that GAE can abort my application between
such operations but those are the only things that create "state" that
I have to manage if an abort happens.

While I can believe that there are cases where minimizing the number
of abort opportunities isn't the right approach, I suspect that
they're fairly rare so minimizing them is one of my rules of thumb.

Thanks for your help and attention,
-andy
Reply all
Reply to author
Forward
0 new messages