[ANN] Flake 0.4.0: Decentralized, k-ordered unique ID generator


Max Countryman

Jun 2, 2016, 9:03:55 PM
to clo...@googlegroups.com
Hi,

I’m happy to announce a new release of Flake, the decentralized, k-ordered unique ID generator.

Flake 0.4.0 includes a number of important breaking changes, but by far the most important is dropping `generate` in favor of `generate!`, which now returns a ByteBuffer. Previously `generate` returned a BigInteger; however, this arbitrarily limits how an application can handle IDs and goes against the spirit of the Erlang implementation. To maintain backwards compatibility, a helper `flake->bigint` was added to the core namespace. Applications which already consume flakes should update their calls from `generate` to `generate!` and wrap them with `flake->bigint` where BigIntegers are desirable or already in use.
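
For anyone migrating, a minimal sketch (assuming the functions live in the core namespace as described above; the alias is just for illustration):

(require '[flake.core :as flake])

;; Call init! once at startup so the last-used timestamp can be persisted.
(flake/init!)

;; 0.3.x: (flake/generate) returned a BigInteger.
;; 0.4.0: generate! returns a ByteBuffer; wrap it if a BigInteger is needed.
(def id (flake/generate!))
(def id-bigint (flake/flake->bigint id))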

Github: https://github.com/maxcountryman/flake
Changes: https://github.com/maxcountryman/flake/blob/master/CHANGELOG.md

Thanks!


Max

Mark Engelberg

Jun 2, 2016, 11:39:17 PM
to clojure
This is interesting. Is it faster than a UUID for generation and/or comparison for equality?


Max Countryman

Jun 3, 2016, 11:40:26 AM
to clo...@googlegroups.com
Hi Mark,

I haven’t done any benchmarking comparing flakes to UUIDs. However, the primary benefits of flake IDs over a traditional UUID (e.g., UUID-1) are that flakes do not require coordination (i.e., to avoid clock skew and duplicate IDs), provide k-ordering (UUID-1’s bit ordering breaks this), and use the standard Unix epoch. It would be interesting to compare performance, but the features of flakes are certainly their primary selling point.


Max

Max Countryman

Jun 3, 2016, 12:24:14 PM
to clo...@googlegroups.com
Hi again,

I just released Flake 0.4.1, which contains a bugfix where `timer/read-timestamp` would throw an uncaught exception when the provided path did not exist.

Thanks,


Max

Max Countryman

Jun 16, 2016, 10:52:50 AM
to clo...@googlegroups.com
Hi again Mark,

I set up some basic micro-benchmarks out of curiosity. Full disclosure: I’m not entirely sure I’m using criterium correctly here, but the results on my machine seem to show comparable performance: flake generation is slightly faster (by a few nanoseconds) and flake comparison is slightly slower (also by a few nanoseconds) compared to Java’s random UUID.
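
For reference, the comparison was along these lines (a rough sketch, not the exact benchmark code in the repo):

(require '[criterium.core :as crit]
         '[flake.core :as flake])

(flake/init!)

;; Generation: flake vs. Java's random UUID.
(crit/quick-bench (flake/generate!))
(crit/quick-bench (java.util.UUID/randomUUID))

;; Comparison: both ByteBuffer and UUID implement Comparable.
(let [a (flake/generate!) b (flake/generate!)]
  (crit/quick-bench (.compareTo a b)))
(let [a (java.util.UUID/randomUUID) b (java.util.UUID/randomUUID)]
  (crit/quick-bench (.compareTo a b)))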

Just to reiterate what I said before: while it’s encouraging to see that performance is comparable to at least one UUID implementation, that really isn’t the reason to use flakes in the first place.

Also if my tests aren’t set up correctly, I’d certainly appreciate any help correcting them!


Max

Bruno Bonacci

Jun 17, 2016, 9:14:47 AM
to Clojure
Hi Max,

That's an interesting library, thanks.

Does the library guarantee monotonically increasing IDs? E.g., protection against clock resets and other clock fluctuations?

Another thing I've noticed is that you are using (System/currentTimeMillis) to get the wall clock on every generation.

(System/currentTimeMillis) causes a low-level system call, which in turn causes a context switch.

Maybe one way to improve this could be to use an initial (System/currentTimeMillis) on the first init! and then use System/nanoTime to calculate the time elapsed since init. The advantage would be that System/nanoTime runs in user space (not kernel space) and doesn't require a system call (so no context switch).

This could really help the case of bulk production of IDs and any other burst situation.

Bruno

Max Countryman

Jun 17, 2016, 10:31:31 AM
to clo...@googlegroups.com
Hi Bruno,

On Jun 17, 2016, at 03:49, Bruno Bonacci <bruno....@gmail.com> wrote:

Hi Max,

That's an interesting library, thanks.

Does the library guarantee monotonically increasing IDs? E.g., protection against clock resets and other clock fluctuations?

Like the Erlang implementation, Flake asks the user to call an init function; this writes out the last known timestamp to disk in a separate thread. The next time the Flake process runs, it compares the current time to the time written to disk and throws an exception if time seems to be moving in the wrong direction.

Apart from that, the generation function compares timestamps as well, again throwing an exception if the last known timestamp is in the future relative to the current one.

These are the same guarantees that Boundary’s implementation seems to make, with what I believe is a slight improvement on the persisted timer (writing out the actual flake timestamp instead of just an arbitrary interval).
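
Conceptually, the generation-time check looks something like this (an illustrative sketch; the names are mine, not Flake’s actual internals):

;; Reject timestamps that appear to move backwards relative to the last
;; known timestamp.
(defn check-monotonic! [last-ts current-ts]
  (when (> last-ts current-ts)
    (throw (ex-info "Clock appears to be moving backwards"
                    {:last-ts last-ts :current-ts current-ts}))))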

Another thing I've noticed is that you are using (System/currentTimeMillis) to get the wall clock on every generation.

(System/currentTimeMillis) causes a low-level system call, which in turn causes a context switch.

Maybe one way to improve this could be to use an initial (System/currentTimeMillis) on the first init! and then use System/nanoTime to calculate the time elapsed since init. The advantage would be that System/nanoTime runs in user space (not kernel space) and doesn't require a system call (so no context switch).

This could really help the case of bulk production of IDs and any other burst situation.

I really like this idea. I’m certainly open to pull requests if you wanted to take a stab at it; otherwise I may try my hand at making this improvement. :)

Bruno Bonacci

Jun 21, 2016, 8:38:58 AM
to Clojure


Another thing I've noticed is that you are using (System/currentTimeMillis) to get the wall clock on every generation.

(System/currentTimeMillis) causes a low-level system call, which in turn causes a context switch.

Maybe one way to improve this could be to use an initial (System/currentTimeMillis) on the first init! and then use System/nanoTime to calculate the time elapsed since init. The advantage would be that System/nanoTime runs in user space (not kernel space) and doesn't require a system call (so no context switch).

This could really help the case of bulk production of IDs and any other burst situation.

I really like this idea. I’m certainly open to pull requests if you wanted to take a stab at it; otherwise I may try my hand at making this improvement. :)

Hi, this change is actually easier than it sounds. Looking at the code, I came across a couple of things which I think could be done better.

1) use of filesystem persistence.

I'm not too sure that the file-based persistence is a good idea. Maybe this is good, idiomatic design in Erlang, but it definitely doesn't look right in Clojure.

In particular, I'm not too sure that by storing the init-time epoch we actually accomplish anything at all. I would argue that there are a number of problems there: race conditions on the data, tmp files being purged, and it still doesn't protect against the clock drifting during use.

2) use of CAS (atom) for storing the VM state.
If it is truly decentralized then you shouldn't need an atom at all. The disadvantage of CAS is that, when many threads race to make the same change, only one will succeed and all the others will fail and retry. This means that if you have 100 threads (for example), only 1 will succeed and the other 99 will fail and retry. Again, on the second round only 1 will succeed and 98 will retry, and so on.
Therefore the total number of attempts will be 100 + 99 + 98 + ... + 1, i.e. n(n+1)/2 attempts for n threads (5050 for 100 threads).

If you want to develop a real "decentralized" ID generator, I think you need to drop the atom in favour of a thread-local store.
Now, to do so and make collisions impossible, we need to add more bits:

  •     64 bits - ts (i.e. a timestamp )
  •     48 bits - worker-id/node (i.e. MAC address)
  •     32 bits - worker-id/process (pid) 
  •     64 bits - worker-id/thread (thread num)
  •     32 bits - seq-no (i.e. a counter)
By adding the process ID (pid) and the thread ID there is no possibility of two systems running and creating the same ID at the same time.
Finally, by using thread-local storage there is no need for process-level coordination (the atom) and no risk of retries from threads stepping on each other's toes.

With such a setup, 100 threads will be able to increment their own thread-local counters independently (given that you have 100 execution cores).
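
To make this concrete, here is a rough sketch of the thread-local approach (my names; the MAC-address/node lookup is elided):

;; Each thread increments its own counter, so no atom/CAS coordination
;; is needed.
(def ^ThreadLocal seq-no
  (proxy [ThreadLocal] []
    (initialValue [] 0)))

(defn next-seq-no! []
  (let [n (.get seq-no)]
    (.set seq-no (inc n))
    n))

(defn thread-local-id
  "Timestamp + pid + thread id + per-thread counter, following the field
  list above (node/MAC elided)."
  []
  {:ts     (System/currentTimeMillis)
   :pid    (.pid (java.lang.ProcessHandle/current)) ; Java 9+; on older JVMs
                                                    ; parse the RuntimeMXBean name
   :thread (.getId (Thread/currentThread))
   :seq    (next-seq-no!)})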

What do you think?
Bruno

Bruno Bonacci

Jun 21, 2016, 8:41:01 AM
to Clojure
Sorry, it looks like the images are only visible in the Google Groups web view.


Bruno


Brian Platz

Jun 21, 2016, 1:01:00 PM
to Clojure
Bruno,

I think the more you can reduce the chance of collision the better, and the thread-local capability is a good idea, but in the process you've almost doubled the bits.

For me anyhow, an ID needs to be producible at a reasonable rate (1 million per second per machine is good for me), have a near-zero probability of collision, and take up the least amount of space possible.

Under those criteria, I think 128 bits is a reasonable target, and I would expect the thread-safe atom to handle such volume (although I haven't tested).

If you need a billion per second and don't want 100 machines producing them, then I think you are at the point of needing thread independence and probably having to increase the bit count, and your ideas provide a good path toward such a solution.

Your comment on the file persistence is a good one; I wonder if the potential problems are real enough to warrant the risks.

My other curiosity is whether System/nanoTime is guaranteed to increase across threads. I know that at least a while ago this guarantee did not exist.

-Brian

Max Countryman

Jun 21, 2016, 7:29:54 PM
to clo...@googlegroups.com
Brian,

I think you make good points here, especially with regard to the size of IDs.

I’d also like to point out that while adding the process and thread IDs helps, it doesn’t eliminate the possibility of duplicate IDs: this is why it’s necessary to write out the last used timestamp in a separate thread.

Just a clarification with regard to disk persistence: we aren’t writing out the epoch; we’re writing out the last used timestamp periodically, in its own thread. Yes, the `init!` API is cumbersome, but it’s an important safety valve which helps protect against duplicate IDs.

My understanding from reading the documentation and various StackOverflow answers is that System/nanoTime is monotonic, but I don’t know what guarantees it makes across threads.


Max


Max Countryman

Jun 21, 2016, 7:57:53 PM
to clo...@googlegroups.com
I also released Flake 0.4.2 today, which includes an important bugfix (two competing threads could have caused duplicate IDs in certain circumstances) as well as a new method for deriving timestamps.

Bruno Bonacci

Jun 22, 2016, 7:25:50 AM
to Clojure
Hi,

Yes, adding more bits to the ID is certainly undesirable; however, it isn't uncommon. For example, LinkedIn just published a paper on Ambry, a distributed datastore which uses a total of *40 bytes* to identify a blob (8 bytes for the partition + 32 bytes for the UUID; see: http://dprg.cs.uiuc.edu/docs/SIGMOD2016-a/ambry.pdf).

To answer Brian on the "potential" problem of clock drift, I would recommend having a look at https://aphyr.com/posts/299-the-trouble-with-timestamps. Besides the hardware problems, you have to account for things like the ntpd daemon, which is meant to synchronize clocks. To keep them in sync it accelerates or decelerates the clock speed on your machine; however, if it falls too far behind it will do a hard reset, so what you might see is your System/currentTimeMillis calls jumping back and forward (non-monotonic). With VMs (such as in cloud environments) this problem is far worse and more frequent.

The process ID on many platforms is just 16 bits, though some platforms have 32 bits (http://unix.stackexchange.com/questions/16883/what-is-the-maximum-value-of-the-pid-of-a-process), and the thread ID in Java is a long (https://docs.oracle.com/javase/7/docs/api/java/lang/Thread.html#getId()), which is unfortunate. It would be nice if there were a way to read which physical CPU core is executing a particular thread (while the thread is running); that way you could replace the processId and threadId with just a CPU core ID. The number of physical threads (cores + hyperthreads) is typically much smaller and fits in 16 bits. However, to my knowledge there is no way in Java to retrieve such a value.

The other consideration is that even if your benchmarks show you can get over 1 million IDs per second on a single thread, the required coordination of the CAS and the retry formula sent earlier will greatly reduce performance as you scale the number of threads. This falls under Amdahl's law (https://en.wikipedia.org/wiki/Amdahl%27s_law) and, more precisely, under the Universal Scalability Law (http://www.perfdynamics.com/Manifesto/USLscalability.html and https://www.infoq.com/articles/top-10-performance-mistakes), under which, as you increase the number of processors, performance hits a plateau (Amdahl) or even degrades (USL). If you run these tests I'm pretty sure you'll see similar effects.

Since it is impossible to generate TOTALLY unique IDs without coordination, the real question is: what is the price of coordination for a given risk of collision?

I guess each project should decide which is more important: monotonicity or performance.

@Max
System/nanoTime has NO RELATION to the wall clock and must not be used in such a way.
System/nanoTime can also have negative values (this is platform-dependent); it is guaranteed to be monotonic, but not with respect to System/currentTimeMillis. The change I was suggesting is like the following:

  * at init! you initialize your base timestamp with start-timestamp = (System/currentTimeMillis) and in addition create a new field, timer = (System/nanoTime)
  * when you create a new flake, instead of reading a new value of System/currentTimeMillis, you make a new call to (System/nanoTime) and compute the time elapsed since timer, so new-timestamp = start-timestamp + ((System/nanoTime) - timer) / 1000000
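
In Clojure, the two steps come out roughly like this (a sketch using the names above):

;; Captured once at init!.
(def start-timestamp (System/currentTimeMillis))
(def timer (System/nanoTime))

;; Wall-clock estimate derived from the monotonic nanoTime delta since init.
(defn new-timestamp []
  (+ start-timestamp
     (quot (- (System/nanoTime) timer) 1000000)))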

There is another potential issue with generate! in the case of a hard clock reset: if a reset happens, exceptions will be thrown in a tight loop, potentially causing a DoS.

Bruno


Brian Platz

Jun 22, 2016, 9:17:27 AM
to Clojure


On Wednesday, June 22, 2016 at 7:25:50 AM UTC-4, Bruno Bonacci wrote:

To answer Brian on the "potential" problem of clock drift, I would recommend having a look at https://aphyr.com/posts/299-the-trouble-with-timestamps. Besides the hardware problems, you have to account for things like the ntpd daemon, which is meant to synchronize clocks. To keep them in sync it accelerates or decelerates the clock speed on your machine; however, if it falls too far behind it will do a hard reset, so what you might see is your System/currentTimeMillis calls jumping back and forward (non-monotonic). With VMs (such as in cloud environments) this problem is far worse and more frequent.

In Max's library he is only calling out to System/currentTimeMillis once at startup; he then determines the nanoTime offset and only uses nanoTime from there. nanoTime is supposed to be immune to wall-clock time, and thus to ntpd changes.

Because it will ignore ntpd changes, there could be a delta from the wall clock as ntpd changes the time. So that is a risk to be aware of, and if you were super concerned about it, a method to re-sync could probably be devised (similar to what is used for leap seconds). I've noticed this particular issue with a sleeping laptop. Assuming the process isn't extremely long-lived, I think you'd be sufficiently close not to worry about it.

The process ID on many platforms is just 16 bits, though some platforms have 32 bits (http://unix.stackexchange.com/questions/16883/what-is-the-maximum-value-of-the-pid-of-a-process), and the thread ID in Java is a long (https://docs.oracle.com/javase/7/docs/api/java/lang/Thread.html#getId()), which is unfortunate. It would be nice if there were a way to read which physical CPU core is executing a particular thread (while the thread is running); that way you could replace the processId and threadId with just a CPU core ID. The number of physical threads (cores + hyperthreads) is typically much smaller and fits in 16 bits. However, to my knowledge there is no way in Java to retrieve such a value.

If you were willing to take on another 64 bits, an idea could be to take the last 16 (or maybe 32) bits of the thread ID combined with random bits to round out the 64.

The random bits could be generated once per thread, combined with a ThreadLocal counter where a CAS is still done to avoid issuing the same ID at the same time increment in the same thread -- or the counter could be dropped entirely and the random bits generated with every ID. I'm not sure which would perform better, but I like the randomness per ID, which makes IDs harder to guess.
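
A sketch of that layout (the names and the 16/48 bit split are just illustrative):

;; Built once per thread: the low 16 bits of the thread ID in the high
;; positions, 48 random bits filling out the 64-bit worker field.
(def ^ThreadLocal worker-bits
  (proxy [ThreadLocal] []
    (initialValue []
      (bit-or (bit-shift-left (bit-and (.getId (Thread/currentThread)) 0xFFFF) 48)
              (.nextLong (java.util.concurrent.ThreadLocalRandom/current)
                         (bit-shift-left 1 48))))))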

-Brian

Bruno Bonacci

Jun 22, 2016, 9:27:26 AM
to Clojure
On Wed, Jun 22, 2016 at 2:17 PM, Brian Platz <brian...@place.works> wrote:


On Wednesday, June 22, 2016 at 7:25:50 AM UTC-4, Bruno Bonacci wrote:

To answer Brian on the "potential" problem of clock drift, I would recommend having a look at https://aphyr.com/posts/299-the-trouble-with-timestamps. Besides the hardware problems, you have to account for things like the ntpd daemon, which is meant to synchronize clocks. To keep them in sync it accelerates or decelerates the clock speed on your machine; however, if it falls too far behind it will do a hard reset, so what you might see is your System/currentTimeMillis calls jumping back and forward (non-monotonic). With VMs (such as in cloud environments) this problem is far worse and more frequent.

In Max's library he is only calling out to System/currentTimeMillis once at startup; he then determines the nanoTime offset and only uses nanoTime from there. nanoTime is supposed to be immune to wall-clock time, and thus to ntpd changes.

Because it will ignore ntpd changes, there could be a delta from the wall clock as ntpd changes the time. So that is a risk to be aware of, and if you were super concerned about it, a method to re-sync could probably be devised (similar to what is used for leap seconds). I've noticed this particular issue with a sleeping laptop. Assuming the process isn't extremely long-lived, I think you'd be sufficiently close not to worry about it.


@Brian, the clock drift problem was in answer to your question, and generally it referred to the pure System/currentTimeMillis implementation. The System/nanoTime version, as it is now, is completely broken, as it offsets against the wall clock and not the previous System/nanoTime, but that is easy to fix.

The process ID on many platforms is just 16 bits, though some platforms have 32 bits (http://unix.stackexchange.com/questions/16883/what-is-the-maximum-value-of-the-pid-of-a-process), and the thread ID in Java is a long (https://docs.oracle.com/javase/7/docs/api/java/lang/Thread.html#getId()), which is unfortunate. It would be nice if there were a way to read which physical CPU core is executing a particular thread (while the thread is running); that way you could replace the processId and threadId with just a CPU core ID. The number of physical threads (cores + hyperthreads) is typically much smaller and fits in 16 bits. However, to my knowledge there is no way in Java to retrieve such a value.

If you were willing to take on another 64 bits, an idea could be to take the last 16 (or maybe 32) bits of the thread ID combined with random bits to round out the 64.

The random bits could be generated once per thread, combined with a ThreadLocal counter where a CAS is still done to avoid issuing the same ID at the same time increment in the same thread -- or the counter could be dropped entirely and the random bits generated with every ID. I'm not sure which would perform better, but I like the randomness per ID, which makes IDs harder to guess.

I thought about adding randomness too, but I'm not too sure that it would actually solve any problems rather than create new ones (true random generators vs. hybrid ones). This needs more thought.

Max Countryman

Jun 22, 2016, 11:00:35 AM
to clo...@googlegroups.com

> On Jun 22, 2016, at 06:27, Bruno Bonacci <bruno....@gmail.com> wrote:
>
> @Brian, the clock drift problem was in answer to your question, and generally it referred to the pure System/currentTimeMillis implementation. The System/nanoTime version, as it is now, is completely broken, as it offsets against the wall clock and not the previous System/nanoTime, but that is easy to fix.

Note that the current implementation mirrors Skuld’s.

When calculating a timestamp we provide an epoch, derived at startup. The epoch is constructed by sampling the difference between System/currentTimeMillis and System/nanoTime several times and averaging the results; this is meant to detect jitter. We don’t use the wall clock at any point after that for constructing flakes.
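
Sketched out, the idea is roughly the following (the sample count and names are illustrative, not the exact code):

;; Average the wall-clock/nanoTime offset over several samples to smooth
;; out jitter; a flake timestamp is then the epoch plus the nanoTime millis.
(defn derive-epoch []
  (let [samples 10
        offsets (repeatedly samples
                            #(- (System/currentTimeMillis)
                                (quot (System/nanoTime) 1000000)))]
    (quot (reduce + offsets) samples)))

(defn flake-timestamp [epoch]
  (+ epoch (quot (System/nanoTime) 1000000)))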

Because timestamps are persisted to disk, the next time we run our generator we ensure that the epoch used is always larger than the last timestamp written. (We could also use the epoch to derive a timestamp and ensure the derived timestamp is larger, which, thinking about it now, might be better.) This protects against clock skew between runs.

To address your question about clock skew during runs, it’s true that System/nanoTime is not necessarily monotonic. However, the process of generating a new flake always rejects timestamps that appear in the past relative to the last flake, so a monotonic property is built into our program. As you note, this could result in thrashing if the delta is extreme or System/nanoTime is somehow always returning smaller values, but I don’t know of a better way of addressing that issue currently.

I think the salient point here, though, is that we do enforce a monotonic property at the application level, which should protect us against duplicate IDs.