what does "K-sorted" mean?

davidnicol

unread,

Oct 20, 2010, 10:13:02 AM10/20/10

to Twitter Development Talk

the matt harris informs us:
6) Why not restrict IDs to 53bits?
A Snowflake ID is composed:
* 41bits for millisecond precision time (69 years)
* 10bits for a configured machine identity (1024 machines)
* 12bits for a sequence number (4096 per machine)
The factor influencing the length of the ID is the time. For a 53bit
ID this
would mean only 31bits are available for the time. 31bits is only
enough for
24 days (2147483648/(1000*60*60*24)) of time.
Reducing the resolution of the timestamp would prevent a K-sorted
resolution
of 1 second or less.
__END__

Why does the precision need to be 1000x the resolution? That makes no
sense to me.
The only references I can find for "K sorted" are concerned with
questions about "K sorted lists" where K is the number of different
lists. Sorting the streams from all thousand machines into one stream
you will have K=1024, sure, but if all those streams have millisecond
precision timestamps, the sorted output will have resolution of two
milliseconds. Half-second precision is sufficient to keep everything
in the right second bucket, possibly even whole-second precision.

Or is "K sorted" something besides a mondegreen that happens to
include a letter?

John Kalucki

unread,

Oct 20, 2010, 12:17:29 PM10/20/10

to twitter-deve...@googlegroups.com

K-sorted means roughly sorted, where no item is no more than K positions from it's totally ordered position. A sequence is k-sorted IFF, for all i,r, 1<= i <= r <= n, i<= r-k implies that a(i) <= a(r).

The generation scheme has to allow sufficient IDs to be generated in a non-coordinated way to cover expected TPS well into the future. If you remove bits from the timestamp, you'll need to add those bits back in to the other fields, or you might starve for IDs. This scheme allows for 2^24 Ids to be generated per millisecond, but that 24 bit space must be sparse to allow for uncoordinated tweet generation.

-John

--
Twitter developer documentation and resources: http://dev.twitter.com/doc
API updates via Twitter: http://twitter.com/twitterapi
Issues/Enhancements Tracker: http://code.google.com/p/twitter-api/issues/list
Change your membership to this group: http://groups.google.com/group/twitter-development-talk

M. Edward (Ed) Borasky

unread,

Oct 20, 2010, 12:48:13 PM10/20/10

to twitter-deve...@googlegroups.com, John Kalucki

Quoting John Kalucki <jo...@twitter.com>:

> K-sorted means roughly sorted, where no item is no more than K positions
> from it's totally ordered position. A sequence is k-sorted IFF, for all i,r,
> 1<= i <= r <= n, i<= r-k implies that a(i) <= a(r).
>
> The generation scheme has to allow sufficient IDs to be generated in a
> non-coordinated way to cover expected TPS well into the future. If you
> remove bits from the timestamp, you'll need to add those bits back in to the
> other fields, or you might starve for IDs. This scheme allows for 2^24 Ids
> to be generated per millisecond, but that 24 bit space must be sparse to
> allow for uncoordinated tweet generation.
>
> -John

If the tweets are being generated by an abstract stochastic or
deterministic process and *must* have IDs assigned within a certain
window after they arrive, yes. But if the tweets are being generated
by a *finite* number, say, 2^30, of *human* users, a small fraction of
whom are actually posting a tweet at any given time and who have been
given only a guarantee of *eventual* consistency and who are paying
Twitter nothing to cache, deliver and archive their tweets, I'm not so
sure.

This whole discussion reminds me of the situation a few years back
when the world as we knew it was going to come to a sorry end because
someone measured Internet traffic and discovered it was "fractal", for
some definition of that overused term. "OMG - how can we do capacity
planning when the distributions don't have all the moments our
textbooks said they should?"

It turned out that they *could* do Internet capacity planning,
although it is a bit harder than processor capacity planning (but not
as hard as the Linux kernel's memory manager.) ;-) And it turned out
that Internet traffic wasn't "really" fractal after all, just the
traffic they had measured when they published the papers.

Now ... when every houseplant has its own IPV6 address and sends a
tweet to all 7562 of its followers when it needs watering ... ;-)
--
M. Edward (Ed) Borasky
http://borasky-research.net http://twitter.com/znmeb

Too old to be a futurist and too young to be an historian

Reply all

Reply to author

Forward