Ulid Vs Ksuid

Barton Ostby

Aug 4, 2024, 6:51:41 PM

to ittotwhiba

In the world of software development, unique identifiers play a crucial role in various applications, from databases to distributed systems. They serve as keys to uniquely identify entities, ensuring data integrity, efficient querying, and seamless integration between different systems. In this article, we'll delve into the concept of unique identifiers, exploring their importance, characteristics, and some popular implementations such as cuid, ksuid, nanoid, sid, slugid, suuid, ulid, uniqid, and uuid.

A unique identifier, often abbreviated as UID, is a string or number assigned to each entity within a system that ensures its distinctiveness from other entities. These identifiers are typically immutable, meaning they remain unchanged throughout the entity's lifecycle. Unique identifiers are essential in scenarios where distinguishing between entities is critical, such as database records, messaging systems, and distributed architectures.

Unique identifiers are fundamental building blocks in modern software systems, facilitating data management, interoperability, and scalability. Understanding the characteristics and nuances of different unique identifier implementations allows developers to choose the most suitable solution for their specific use cases. Whether it's generating compact, URL-safe IDs for web applications or ensuring global uniqueness in distributed systems, the right choice of unique identifier can significantly impact the performance, security, and reliability of software applications.

You would have to perform more complex comparisons to see if ids match, since you lose the binary representation. v1 and v1mc UUIDs are sort of ordered but not necessarily between multiple nodes, and they cycle fairly rapidly.

Maybe include milliseconds or more depending on how fast and frequently items are generated. Also maybe node ID before sequence number depending on whether you'd rather have ones with the same node ID together, or have the same sequence number together for a given timestamp.

The problem with using a sequence generator, though, is predictability and also collisions. Avoiding collisions would require coordination between nodes, which is not feasible. That's why you shouldn't use things like custom random functions for secrets :D

I guess I was assuming the nodes were in a relatively trusted environment. Like, maybe distributed geographically but communicating on a secure network, where nodes are owned by the same entity. Also being able to rely on some authorities for unique node IDs (and time synchronization).

This requires synchronization though, which is another problem in itself. It means having to handle a sort of "god machine" that releases node IDs every time a node comes online, which is still tricky for serverless environments, unless you consider every single process of the app a separate node. Keep in mind that the authority could be offline, unreachable, or too slow, and it is yet another machine to handle, monitor, keep updated, secure, and so on...

Unless tracing back the originating node from the ID is paramount (which could have security issues in itself if an ID leaks outside the trusted area), I believe letting go of the whole idea of embedding a node identifier in the final ID is a way to sidestep all of these things.

I think I would need to know what this is for to go further. The simplest thing satisfying your original criteria would probably be a timestamp down to the nanosecond + random per-item hex string to avoid collisions.
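A minimal sketch of that "timestamp + random hex string" idea, in Python. The function name, the 20-digit padding, and the 8 random bytes are my own assumptions for illustration, not something specified in the thread:

```python
import secrets
import time
from typing import Optional


def timestamp_random_id(ts_ns: Optional[int] = None, random_bytes: int = 8) -> str:
    """Return '<nanosecond timestamp>-<random hex>' as a sortable string."""
    if ts_ns is None:
        # Wall-clock time: ordering across machines is only as good as
        # their clock synchronization.
        ts_ns = time.time_ns()
    # Zero-pad to a fixed width so string order matches numeric order.
    # The random suffix only breaks ties; it carries no ordering.
    return f"{ts_ns:020d}-{secrets.token_hex(random_bytes)}"
```

Two IDs created in the same nanosecond still differ in the random suffix, but their relative order is then arbitrary.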

I had the idea that I wanted to construct a scalable distributed event store on top of AWS and ran into similar questions (ordering in a distributed system). Dynamo for distributed event storage and S3 for initial replay storage (due to high cost of Dynamo replays).

I started thinking about how to place the events in S3 in such a way that they could be replayed in some semblance of order. Each event stream is totally ordered by itself by StreamID + Version, but there is no order indicated across streams. Dynamo doesn't support any such order because streams are distributed across different servers.

I looked at using the event envelope to construct the S3 file name which would be in lexicographical order. That way listeners would ask S3 for a file listing by name and they would come back in order. That would include a node-specific timestamp and the node id in the file name. The timestamp would be the default "ordering", and the node id would be used as an arbitrary tie-breaker. And if we become aware of time skew on a specific node, we could also correct for it. Many listeners only care about specific types of events. So I included the event type in the name so that only needed events had to be fully fetched. Ultimately the design ended up with pretty long file names like timestamp/nodeid/streamid/event version/event type.
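As a rough illustration of that naming scheme, here is a sketch that builds a key of the form timestamp/nodeid/streamid/event version/event type. The exact timestamp format and the zero-padding width are assumptions of mine; the original design did not spell them out:

```python
from datetime import datetime, timezone


def event_key(ts: datetime, node_id: str, stream_id: str,
              version: int, event_type: str) -> str:
    """Build a key whose lexicographic order follows the timestamp.

    `ts` should be timezone-aware; naive datetimes are treated as local time.
    """
    # Fixed-width UTC timestamp, so string order equals chronological order.
    stamp = ts.astimezone(timezone.utc).strftime("%Y%m%dT%H%M%S.%fZ")
    # Zero-pad the version so that 10 does not sort before 9.
    return f"{stamp}/{node_id}/{stream_id}/{version:010d}/{event_type}"
```

With fixed-width fields, a plain sorted listing of key names comes back in timestamp order, with the node id acting as the tie-breaker for identical timestamps.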

There are huge problems with this, however. I could nitpick a bunch of them, but I will jump to the overarching problem. At the scale where I need to distribute the event store (and thus worry about cross-node ordering), full replays would be impractically time consuming and the volume of events would generally be unwieldy for listeners. That's even if I could pick the perfect ordered key up front. The scale of this really needs a change in tactics.

Ultimately I decided that cases where I need grouping and ordering of data had to be exercised in the small. And if these need to be aggregated to larger scale, the smaller systems would have to publish "external" events that rolled up events to a higher granularity. For example, a lot of events may go into placing an Order and the sequence in which they happen matters a lot. But external listeners will not be interested in handling all those details. They instead prefer a single OrderPlaced event with all details included. So in the Order system I'll have a listener go back and construct one for external publishing. These could be published to a stream processing platform such as Kafka for larger integration scenarios. And Kafka already has some metaphors for how to make that work from a subscriber's perspective.

Moral of the story (applicable to ordered GUIDs I think) is that ordering in a distributed scenario wasn't the best problem for me to solve. It is possible to power through the ordering problem (also dealing with clock skew or failed clocks that claim the event happened in 1970), but it requires extra work and ongoing upkeep. And I still don't end up with a system that has the right granularity for large scale.

Thanks for the detailed explanation and I can see how complicating the architecture to allow ordering is not worth it in your case, especially because you have what I believe is an "event sourced" architecture, with multiple levels of granularity of events.

Our case is very limited in scope, and will ultimately become a very simple pub/sub system. The question of what sort of "universal" IDs to choose for the events moving through the app and being sent outside came up, and so I opened this discussion thread.

I don't foresee any real drawbacks in choosing a GUID that's also sortable right from the start. What do you think? With IDs that are inherently sortable, the consumer can use that property or ignore it as they wish.

The only real downside is that making them sortable will encourage you to use and depend on that feature. And when clock skew comes into play, then the code that depends on sortability will probably behave unexpectedly. When your code processes the data out of order you could observe strange things like issuing updates for data that hasn't been inserted yet. How: On a cloud provider, they may restart your code on a different node (for failure or maintenance or no reason) whose clock may be skewed from the previous node. They usually do have time sync but it is best effort -- no guarantees about clock accuracy between servers. The margin of error should be small enough that you don't have a problem normally. But if you ever do have the issue it will be hard to diagnose. Also note that if a hardware clock does fail, it is common for it to reset to zero. That would be a little easier to diagnose, but either way recovery (changing the IDs? mapping them to new IDs?) could be painful.

For an event store I use a big integer called Position (not an auto increment) to have total ordering for that node. It guarantees that whatever happens later has a larger Position than whatever happens before. But it doesn't provide a global perspective. You can get a global order by reading from nodes in Position order and then choosing the lowest timestamp among those events. When you do it that way, you know it is a "best effort" ordering, and you may be able to account for some known skewed timestamps. Even if timestamps are off, you are guaranteed that events from the same node are in order. I haven't actually had to do a global order across nodes yet, but there is one on the horizon to merge different event stores.
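One way to sketch that merge is with `heapq.merge`, which always takes the smallest head among the per-node streams. The `Event` type and its field names are hypothetical, made up for this example; the tie-break on node id is also my assumption:

```python
import heapq
from typing import Iterable, Iterator, NamedTuple


class Event(NamedTuple):
    node_id: str
    position: int      # per-node total order
    timestamp: float   # node-local clock, possibly skewed
    payload: str


def merge_streams(streams: Iterable[Iterable[Event]]) -> Iterator[Event]:
    """Best-effort global order over streams already in Position order."""
    # Only the current head of each stream is ever compared, so events
    # from one node can never overtake each other, even when that
    # node's timestamps are skewed.
    return heapq.merge(*streams, key=lambda e: (e.timestamp, e.node_id))
```

Because the result is lazy, listeners can consume the merged order without loading every stream into memory.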

Instagram uses a custom scheme (instagram-engineering.com/sharding...) because they need time-sortable unique IDs across multiple DB servers (shards). They implemented it in PL/pgSQL, composing an ID from the current time, the shard ID, and an auto-increment, giving 1024 IDs per shard per millisecond. Pretty cool, check out the link.
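To make the composition concrete, here is a Python sketch of a time + shard + sequence ID packed into one integer. The 10-bit sequence follows from the 1024 IDs per millisecond mentioned above. The 13-bit shard width is my recollection of the linked post, and the custom epoch value is purely hypothetical, so treat both as assumptions:

```python
import time
from typing import Optional, Tuple

EPOCH_MS = 1_700_000_000_000  # hypothetical custom epoch, ms since Unix epoch
SHARD_BITS = 13               # assumed shard-id width
SEQ_BITS = 10                 # 2**10 = 1024 IDs per shard per millisecond


def compose_id(shard_id: int, seq: int, now_ms: Optional[int] = None) -> int:
    """Pack (milliseconds since EPOCH_MS, shard id, sequence) into an int."""
    if now_ms is None:
        now_ms = time.time_ns() // 1_000_000
    if not 0 <= shard_id < (1 << SHARD_BITS):
        raise ValueError("shard_id out of range")
    elapsed = now_ms - EPOCH_MS
    # Time occupies the high bits, so numeric order follows time first.
    return ((elapsed << (SHARD_BITS + SEQ_BITS))
            | (shard_id << SEQ_BITS)
            | (seq % (1 << SEQ_BITS)))


def split_id(value: int) -> Tuple[int, int, int]:
    """Recover (elapsed ms, shard id, sequence) from a composed ID."""
    seq = value & ((1 << SEQ_BITS) - 1)
    shard_id = (value >> SEQ_BITS) & ((1 << SHARD_BITS) - 1)
    return value >> (SHARD_BITS + SEQ_BITS), shard_id, seq
```

In the scheme described above the sequence comes from a database auto-increment; here it is simply passed in by the caller.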

I didn't write the final goal explicitly, but I wrote the requirements. The final goal is to have traceable unique IDs for events in a distributed system that are also sortable, which is a very handy property. There's no mystery to it :D

Anyway, back on topic. This is a tricky question because globally unique and predictable are two properties that clash directly with each other. One option is to take a look at Microsoft SQL Server's Sequential ID, a sortable GUID.

There are a couple of disadvantages:

Sure, efficiency and convenience mostly. Let's say we use UUIDs. They work mostly well until these IDs land in a place far from your system. Someone decides to store events on S3 using the UUID as the file name, and suddenly you have gigabytes of events that can't be sorted well unless you peek inside each file to find the timestamp.
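For contrast, here is a sketch of an ID that keeps such a listing sorted by name alone. It follows the ULID layout (a 48-bit millisecond timestamp followed by 80 random bits, encoded as 26 Crockford Base32 characters). It is an illustration under those assumptions, not a full ULID implementation: in particular it makes no attempt at monotonic ordering within the same millisecond.

```python
import secrets
import time
from typing import Optional

# Crockford Base32 alphabet; its characters are in ascending ASCII order,
# so encoded strings sort the same way as the numbers they encode.
CROCKFORD = "0123456789ABCDEFGHJKMNPQRSTVWXYZ"


def ulid_like(now_ms: Optional[int] = None) -> str:
    """Return a 26-character, time-prefixed, lexicographically sortable ID."""
    if now_ms is None:
        now_ms = time.time_ns() // 1_000_000
    # High 48 bits: timestamp in ms. Low 80 bits: randomness.
    value = (now_ms << 80) | secrets.randbits(80)
    chars = []
    for _ in range(26):
        chars.append(CROCKFORD[value & 0x1F])  # lowest 5 bits first
        value >>= 5
    return "".join(reversed(chars))
```

Used as a file name, IDs from different milliseconds sort chronologically without opening the file, subject to the same clock-skew caveats discussed earlier in the thread.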