Hi folks,
I've been pondering the best approach to modelling objects whose records in my GAE datastore correspond to a subset of objects in an external DB whose primary keys are UUIDs. Since I expect most of my records in GAE to be quite small, I feel it's worth avoiding the storage overhead of using the 36-character UUID string as the key_name, so I've come up with the following to generate uint63 datastore IDs from UUIDs
(Python, but the questions are more general GAE efficiency questions):
    import uuid

    from google.appengine.api import datastore
    from google.appengine.ext import db
    from google.appengine.ext.db import BadValueError, Key, Model

    MASK_63 = 2**63 - 1
    MASK_64 = 2**64 - 1


    class UUID(uuid.UUID):

        def get_id(self):
            # XOR-fold the 128-bit UUID down to 64 bits, then clear the
            # sign bit: the datastore rejects negative int64 IDs, so the
            # usable space is effectively a uint63.
            return ((self.int & MASK_64) ^ (self.int >> 64)) & MASK_63

        id = property(get_id)


    class UUIDModel(Model):

        uuid = UUIDProperty('UUID')  # custom property, sketched below

        @classmethod
        def get_by_uuid(cls, uuids, **kwds):
            # Mirrors the get_by_id() calling convention: accepts a
            # single UUID or string, or a list of them.
            uuids, multiple = datastore.NormalizeAndTypeCheck(
                uuids, (UUID, basestring))

            def normalize(uuid):
                if isinstance(uuid, basestring):
                    return UUID(uuid)
                return uuid

            uuids = [normalize(uuid) for uuid in uuids]
            ids = [uuid.id for uuid in uuids]
            entities = cls.get_by_id(ids, **kwds)
            # Verify the stored UUID against the requested one, in case
            # two distinct UUIDs ever hash to the same 63-bit ID.
            for index, entity in enumerate(entities):
                if entity is not None and entity.uuid != uuids[index]:
                    raise BadValueError('UUID hash collision detected!')
            if multiple:
                return entities
            return entities[0]

        @classmethod
        def get_or_insert_by_uuid(cls, uuid, **kwds):
            if isinstance(uuid, basestring):
                uuid = UUID(uuid)
            id = uuid.id
            # Pop parent out of kwds: the Model constructor rejects an
            # explicit parent= alongside key=, since the key already
            # encodes the ancestor path.
            parent = kwds.pop('parent', None)

            def txn():
                entity = cls.get_by_id(id, parent=parent)
                if entity is None:
                    entity = cls(key=Key.from_path(cls.kind(), id,
                                                   parent=parent),
                                 uuid=uuid, **kwds)
                    entity.put()
                elif entity.uuid != uuid:
                    raise BadValueError('UUID hash collision detected!')
                return entity

            return db.run_in_transaction(txn)

I won't be using GAE's auto-assigned IDs for the model classes whose IDs are derived from the external DB's UUIDs, so I'm not terribly worried about the probability of an ID collision: 2**63 is still a very large space compared to the number of records I expect to have. I hash the UUID myself, rather than relying on Python's built-in hash(), because hash() isn't guaranteed to stay consistent across future Python versions. I mask the result down to a uint63 because the datastore classes throw an exception on negative int64 IDs. Had the datastore supported int128 or uint127 IDs, I would have used the UUIDs directly. Finally, I use the hashed UUID as the datastore ID so that I can call get_by_id() directly whenever I already know the UUID, rather than having to do a filtered query on it.
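UUIDProperty isn't shown above; it's just a small custom db.Property. A minimal version along the lines of what I'm using, which persists the UUID as its canonical 36-character string and rebuilds a UUID instance on load, would look something like this:

    class UUIDProperty(db.Property):
        # Minimal sketch: stores the canonical 36-character string in
        # the datastore, returns a UUID instance to application code.
        data_type = UUID

        def get_value_for_datastore(self, model_instance):
            value = super(UUIDProperty, self).get_value_for_datastore(
                model_instance)
            if value is not None:
                value = str(value)
            return value

        def make_value_from_datastore(self, value):
            if value is not None:
                value = UUID(value)
            return value

        def validate(self, value):
            value = super(UUIDProperty, self).validate(value)
            if value is not None and not isinstance(value, UUID):
                value = UUID(str(value))
            return value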
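For concreteness, here's how a model ends up being used (Customer and its name field are made up for illustration):

    class Customer(UUIDModel):
        name = db.StringProperty()

    # Create-or-fetch, idempotently, inside a transaction:
    customer = Customer.get_or_insert_by_uuid(
        '6ba7b810-9dad-11d1-80b4-00c04fd430c8', name='Alice')

    # Direct key lookup, no query needed; a list of UUIDs returns a list:
    same_customer = Customer.get_by_uuid(
        '6ba7b810-9dad-11d1-80b4-00c04fd430c8')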
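To put a rough number on the collision risk: by the birthday approximation, the chance of any collision among n IDs drawn uniformly from a 63-bit space is about n**2 / 2**64, which stays tiny at the record counts I expect:

    def collision_probability(n, bits=63):
        # Birthday-bound approximation for n IDs drawn uniformly at
        # random from a space of 2**bits values.
        return float(n * (n - 1)) / 2 ** (bits + 1)

    print collision_probability(10**6)  # ~5.4e-08
    print collision_probability(10**7)  # ~5.4e-06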
So, on to the questions. The above seems to work just fine in the early prototype stages of development, but I'm wondering whether there's a downside to this technique. Will I hit any performance, space, or other efficiency penalties in the datastore by using IDs that are essentially randomly distributed across the entire 63-bit ID space? Does anything about this strike people as a terrible idea that would justify a major rethink of my approach? What techniques are others using when some of their model classes have externally assigned UUIDs as primary keys?