Assigning datastore IDs based on UUIDs from an external DB

Murph

Aug 4, 2011, 5:04:22 PM
to google-a...@googlegroups.com
Hi folks,

I've been pondering the best approach to modelling objects whose GAE datastore entities correspond to a subset of objects in an external DB whose primary keys are UUIDs.  As I expect most of my records in GAE to be quite small, I feel it's worth avoiding the storage overhead of simply using the 36-character UUID as the key_name, and I've come up with the following to generate datastore uint63 IDs from UUIDs (Python, but the questions are more general GAE efficiency questions):

MASK_64 = 2**64-1

class UUID(uuid.UUID):

    def get_id(self):
        return abs((self.int & MASK_64) ^ (self.int >> 64))

    id = property(get_id)

class UUIDModel(Model):
    @classmethod
    def get_by_uuid(cls, uuids, **kwds):
        uuids, multiple = datastore.NormalizeAndTypeCheck(uuids, (UUID, str))
        def normalize(uuid):
            if isinstance(uuid, str):
                return UUID(uuid)
            else:
                return uuid
        uuids = [normalize(uuid) for uuid in uuids]
        ids = [uuid.id for uuid in uuids]
        entities = cls.get_by_id(ids, **kwds)
        for index, entity in enumerate(entities):
            if entity is not None and entity.uuid != uuids[index]:
                raise BadValueError('UUID hash collision detected!')
        if multiple:
            return entities
        else:
            return entities[0]

    @classmethod
    def get_or_insert_by_uuid(cls, uuid, **kwds):
        if isinstance(uuid, str):
            uuid = UUID(uuid)
        id = uuid.id
        def txn():
            entity = cls.get_by_id(id, parent=kwds.get('parent'))
            if entity is None:
                entity = cls(key=Key.from_path(cls.kind(), id,
                                               parent=kwds.get('parent')),
                             uuid=uuid,
                             **kwds)
                entity.put()
            elif entity.uuid != uuid:
                raise BadValueError('UUID hash collision detected!')
            return entity
        return db.run_in_transaction(txn)

    uuid = UUIDProperty('UUID')

I won't be using GAE's auto-assigned IDs for the model classes whose IDs are derived from the external DB's UUIDs, so I'm not terribly worried about the probability of ID collision; 2**63 is still a very large number space compared to the number of records I expect to have.  My reason for using a custom hash of the UUID into a uint63 is that Python's hash function isn't guaranteed to remain consistent across future Python versions.  The reason for using uint63 is that the datastore classes throw an exception on negative int64s used as IDs.  Had the datastore supported int128 or uint127 for IDs, I would have used the UUIDs more directly.  I'm using the UUID-derived value as the GAE key so that I can do direct get_by_id() calls when I already know the UUID, rather than having to run a filtered query on it.
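
For concreteness, here's a quick usage sketch; the Customer kind, the name property, and the UUID value are made up purely for illustration:

# Usage sketch only; assumes the UUID / UUIDModel classes above.
from google.appengine.ext import db

class Customer(UUIDModel):
    name = db.StringProperty()

# UUID string as it might arrive from the external DB (made-up value).
external_uuid = 'f81d4fae-7dec-11d0-a765-00a0c91e6bf6'

# Create-or-fetch keyed by the external primary key; the datastore ID is the
# folded hash of the UUID, and the full UUID is stored in a property so that
# hash collisions can be detected.
customer = Customer.get_or_insert_by_uuid(external_uuid, name='Alice')

# Later lookups go straight to get_by_id() via the same hash, with no query.
same_customer = Customer.get_by_uuid(external_uuid)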

So, on to the questions.  The above seems to work just fine for me in early prototype stages of development, but I'm wondering if there's a downside to this technique?  Will I hit any performance, space, or general efficiency penalties with the datastore by using IDs which are essentially randomly assigned throughout the entire 63bit ID space?  Is there anything about this which strikes people as a terrible idea and would justify me having a major rethink about my approach?  What techniques are others using when they have externally assigned UUIDs as primary keys for some of their model classes?

Murph

Aug 9, 2011, 6:31:04 AM
to google-a...@googlegroups.com
After a little testing, I discovered a flaw in my previously posted technique, namely that it was generating uint64 IDs, when the datastore will only accept non-zero uint63s.  I'd also forgotten to include my UUIDProperty class.  Here's the updated version.  The question remains whether there's likely to be any horrible loss of efficiency or other problems from assigning IDs which will be essentially randomly distributed through the possible number space (the external DB mostly uses UUIDv4, so the distribution should be quite random).  As previously stated, auto-assigned IDs will not be used for these models, so collisions with or disruption of auto-assignment are not major concerns.  The code probably still needs the odd bit of polishing here and there.

# Imports for the names used below (standard GAE Python SDK modules):
import uuid

from google.appengine.api import datastore
from google.appengine.ext import db
from google.appengine.ext.db import (BadKeyError, BadValueError, ByteString,
                                      Key, Model, Property)

MASK_64 = 2**64-1
MASK_63 = 2**63-1


class UUID(uuid.UUID):
    # Could use this, but Python doesn't guarantee future stability of
    # hash values
    # def get_id(self):
    #     return abs(hash(self.int))

    def get_id(self):
        """ Returns a positive, non-zero 63 bit integer from the
        UUID's int128 """
        x = ((self.int & MASK_64) ^ (self.int >> 64)) & MASK_63
        if (x == 0):
            x = 1
        return x

    id = property(get_id)


class UUIDProperty(Property):
    """A UUID property, stored as a 16 byte binary string."""

    data_type = UUID

    def get_value_for_datastore(self, model_instance):
        uuid = super(UUIDProperty,
                     self).get_value_for_datastore(model_instance)
        if uuid is None:
            return None
        return ByteString(uuid.bytes)

    def make_value_from_datastore(self, value):
        if value is None:
            return None
        return UUID(bytes=value)

    def validate(self, value):
        if value is not None and not isinstance(value, self.data_type):
            try:
                value = self.data_type(value)
            except TypeError, err:
                raise BadValueError('Property %s must be convertible '
                                    'to a %s instance (%s)' %
                                    (self.name, self.data_type.__name__, err))
        value = super(UUIDProperty, self).validate(value)
        if value is not None and not isinstance(value, self.data_type):
            raise BadValueError('Property %s must be a %s instance' %
                                (self.name, self.data_type.__name__))
        return value



class UUIDModel(Model):
    @classmethod
    def get_by_uuid(cls, uuids, **kwds):
        uuids, multiple = datastore.NormalizeAndTypeCheck(uuids, (UUID, str))
        def normalize(uuid):
            if isinstance(uuid, str):
                return UUID(uuid)
            else:
                return uuid
        uuids = [normalize(uuid) for uuid in uuids]
        ids = [uuid.id for uuid in uuids]
        entities = cls.get_by_id(ids, **kwds)
        for index, entity in enumerate(entities):
            if entity is not None and entity.uuid != uuids[index]:
                raise BadKeyError('UUID hash collision detected (class %s): '
                                  '%s / %s'
                                  % (cls.kind(), entity.uuid, uuids[index]))

        if multiple:
            return entities
        else:
            return entities[0]

    @classmethod
    def get_or_insert_by_uuid(cls, uuid, **kwds):
        if isinstance(uuid, str):
            uuid = UUID(uuid)
        id = uuid.id
        def txn():
            entity = cls.get_by_id(id, parent=kwds.get('parent'))
            if entity is None:
                entity = cls(key=Key.from_path(cls.kind(), id,
                                               parent=kwds.get('parent')),
                             uuid=uuid,
                             **kwds)
                entity.put()
            elif entity.uuid != uuid:
                raise BadKeyError('UUID hash collision detected (class %s): '
                                  '%s / %s'
                                  % (cls.kind(), entity.uuid, uuid))
            return entity
        return db.run_in_transaction(txn)

    uuid = UUIDProperty('UUID')

Tim Hoffman

Aug 9, 2011, 7:44:56 AM
to google-a...@googlegroups.com
A couple of questions.

Why don't you use them as key_names rather than IDs?  Then you know you will never get a clash.

Also, random keys/IDs are supposedly better than monotonically increasing IDs, as I understand it.  There's been a bit written about not using an incrementing timestamp to create IDs (I'm trying to find the article).  It had something to do with key distribution across the underlying Google infrastructure: sequential keys (i.e. incrementing on the right-most side, as in aaaaa1, aaaaa2) had an impact on write performance.  Mind you, I could be imagining this.

Maybe someone else could chime in with this detail ;-)

Rgds

T

Tim Hoffman

Aug 9, 2011, 8:21:05 AM
to google-a...@googlegroups.com

tempy

Aug 9, 2011, 8:32:47 AM
to Google App Engine
Murph, I've been doing exactly what you're doing: creating GAE entities with UUIDs corresponding to UUID keys from an external DB.  I just stick the UUID into the key's name property, and it works just fine.  I think you're overcomplicating things: GAE storage is dirt cheap; it's CPU time you have to worry about, so you should spend your optimization cycles on getting request speeds down and doing as much work as possible asynchronously, as opposed to esoteric hashing which will in the end win you peanuts.
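
A minimal sketch of that key_name approach, with a made-up kind, property, and UUID value:

# Sketch of the key_name approach; names and values are illustrative only.
from google.appengine.ext import db

class Customer(db.Model):
    name = db.StringProperty()

external_uuid = 'f81d4fae-7dec-11d0-a765-00a0c91e6bf6'

# get_or_insert() runs in a transaction keyed on the key_name, so there is
# no collision risk and no separate UUID property is needed.
customer = Customer.get_or_insert(external_uuid, name='Alice')

# Direct lookup by the external primary key, again without a query.
same_customer = Customer.get_by_key_name(external_uuid)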

Robert Kluin

Aug 10, 2011, 1:38:49 AM
to google-a...@googlegroups.com
I tend to agree.  I'm not sure how many records you're expecting, but it sure seems like a lot of complicated overhead to avoid some storage cost.  Obviously, if you're expecting a large number of entities, or you've got a lot of indexes and/or reference properties, it might save some storage space, but you'll need several tens of millions of entities to save a few dollars.  Have you calculated your expected savings from this?
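
As a very rough back-of-the-envelope (every figure below is an assumption rather than a measurement):

# Rough estimate of storage saved by an 8-byte integer ID versus a
# 36-character key_name.  Per-key overhead and index counts are guesses.
BYTES_SAVED_PER_KEY = 36 - 8        # assumed difference per key occurrence
KEY_COPIES_PER_ENTITY = 1 + 3       # assumed: the entity plus ~3 index rows
ENTITIES = 10 * 1000 * 1000         # ten million entities

saved_gb = (ENTITIES * BYTES_SAVED_PER_KEY * KEY_COPIES_PER_ENTITY
            / float(2 ** 30))
print 'roughly %.1f GB saved across %d entities' % (saved_gb, ENTITIES)
# About 1 GB -- at typical per-GB-month storage prices, that's pocket change.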


Robert

