RE: [twitter-dev] Upcoming changes to the way status IDs are sequenced


Brian Smith

unread,
Mar 26, 2010, 4:48:17 PM3/26/10
to twitter-deve...@googlegroups.com

Any app that pages through timelines using since_id or max_id depends on responses being ordered by tweet ID. What will be the replacement for since_id and max_id?

 

Taylor Singletary wrote:

We are planning to replace our current sequential tweet ID generation routine with a simpler, more scalable solution. IDs will still be 64-bit unsigned integers. However, this new solution is no longer guaranteed to generate sequential IDs. Instead, IDs will be derived from time: the most significant bits will be sourced from a timestamp and the least significant bits will be effectively random.

 

For the majority of applications we think this scheme switch will be a non-event. Before implementing these changes, we'd like to know if your applications currently depend on the sequential nature of IDs. Do you depend on the density of the tweet sequence being constant? Are you trying to analyze the IDs as anything other than opaque, ordered identifiers? Aside from guaranteed sequential tweet ID ordering, what APIs can we provide you to accomplish your goals?

Message has been deleted

M. Edward (Ed) Borasky

unread,
Mar 26, 2010, 5:28:11 PM3/26/10
to twitter-deve...@googlegroups.com
On 03/26/2010 01:41 PM, Taylor Singletary wrote:
> Hi Developers,

[snip]

> For the majority of applications we think this scheme switch will be a
> non-event. Before implementing these changes, we'd like to know if your
> applications currently depend on the sequential nature of IDs. Do you depend
> on the density of the tweet sequence being constant? Are you trying to
> analyze the IDs as anything other than opaque, ordered identifiers? Aside
> from guaranteed sequential tweet ID ordering, what APIs can we provide you to
> accomplish your goals?

I'm a mathematician. So yes, I *am* trying to analyze the IDs as other
than identifiers. ;-) As long as the status ID generation algorithm is
documented - how many bits are timestamp, how many bits are random, what
the granularity of the timestamp is, how the Spritzer and Gardenhose
sampling is done, etc. - I can do what I want to do without any API
additions.

--
M. Edward (Ed) Borasky
borasky-research.net/m-edward-ed-borasky
"A mathematician is a device for turning coffee into theorems." ~ Paul
Erdős

Message has been deleted
Message has been deleted
Message has been deleted
Message has been deleted

Arnaud Meunier

unread,
Mar 26, 2010, 7:13:44 PM3/26/10
to twitter-ap...@googlegroups.com, twitter-development-talk
Sequential ID generation is quite useful on my side, because I can trust these integers to compare tweet dates (and sort them, for example).

Because of the "random part" you're mentioning, the "bigger ID == newer" rule won't always be true. This can be quite problematic.

All the best,
Arnaud.


On March 26, 2010, at 9:41 PM, Taylor Singletary <taylorsi...@twitter.com> wrote:

Hi Developers,

It's no secret that Twitter is growing exponentially. The tweets keep coming with ever increasing velocity, thanks in large part to your great applications.

Twitter has adapted to the increasing number of tweets in ways that have affected you in the past: We moved from 32-bit unsigned integers to 64-bit unsigned integers for status IDs some time ago. You all weathered that storm with ease. The tweetapocalypse was averted, and the tweets kept flowing.

Now we're reaching the scalability limit of our current tweet ID generation scheme. Unlike the previous tweet ID migrations, the solution to the current issue is significantly different. However, in most cases the new approach we will take will not result in any noticeable differences to you the developer or your users.

We are planning to replace our current sequential tweet ID generation routine with a simpler, more scalable solution. IDs will still be 64-bit unsigned integers. However, this new solution is no longer guaranteed to generate sequential IDs. Instead, IDs will be derived from time: the most significant bits will be sourced from a timestamp and the least significant bits will be effectively random.

Please don't depend on the exact format of the ID. As our infrastructure needs evolve, we might need to tweak the generation algorithm again.

If you've been trying to divine meaning from status IDs aside from their role as a primary key, you won't be able to anymore. Likewise for usage of IDs in mathematical operations -- for instance, subtracting two status IDs to determine the number of tweets in between will no longer be possible.

For the majority of applications we think this scheme switch will be a non-event. Before implementing these changes, we'd like to know if your applications currently depend on the sequential nature of IDs. Do you depend on the density of the tweet sequence being constant? Are you trying to analyze the IDs as anything other than opaque, ordered identifiers? Aside from guaranteed sequential tweet ID ordering, what APIs can we provide you to accomplish your goals?

Taylor Singletary
Developer Advocate, Twitter
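
To make the layout Taylor describes concrete, here is a rough Python sketch of a time-prefixed ID generator. The bit widths, epoch, and random suffix are illustrative assumptions only, not Twitter's actual format (which, per the note above, shouldn't be relied on anyway):

    # Illustrative only: the field widths, epoch, and low-order scheme are
    # assumptions, not Twitter's published layout.
    import random
    import time

    TIMESTAMP_BITS = 42                  # assumed width of the ms timestamp
    SUFFIX_BITS = 64 - TIMESTAMP_BITS    # assumed "effectively random" low bits
    EPOCH_MS = 1262304000000             # hypothetical epoch (2010-01-01 UTC)

    def make_id(now_ms=None):
        # High bits: milliseconds since the epoch; low bits: random filler
        # standing in for whatever worker/sequence scheme is actually used.
        if now_ms is None:
            now_ms = int(time.time() * 1000)
        suffix = random.getrandbits(SUFFIX_BITS)
        return ((now_ms - EPOCH_MS) << SUFFIX_BITS) | suffix

IDs built this way still sort by time at millisecond granularity, but two IDs minted in the same millisecond land in arbitrary order, which is exactly the "no longer guaranteed sequential" caveat.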


M. Edward (Ed) Borasky

unread,
Mar 26, 2010, 8:08:53 PM3/26/10
to Twitter Development Talk
On Mar 26, 4:01 pm, Josh Bleecher Snyder <joshar...@gmail.com> wrote:
> Having a universal counter is untenable, but having occasional,
> undiagnosable, unreproducible glitches also sucks. :) Thinking out
> loud, perhaps there is some middle ground -- a way to have generally
> monotonically increasing ids globally, and guaranteed monotonically
> increasing ids along some useful dimension, such as per user (this
> doesn't play nicely e.g. w/ Cassandra, but it is still reasonably
> scalable by other means). Not sure whether that would help folks or
> not...

I used to work at Goddard Space Flight Center. As you can well
imagine, accurate timekeeping was a hard requirement for many of the
projects and tasks there, though not all of them. The issue is cost.
Truly accurate timekeeping is achievable, but the cost to Twitter must
be passed on to its customers, and the last time I looked, "social
media" was an extremely competitive business. So I think we need to
allow Twitter some leeway here.

Right now, tweets carry a timestamp good to the nearest second. I
haven't looked recently, but the last figure Twitter published implies
that about 600 tweets would share any given timestamp on average. If
you truly need time resolution finer than that, make a business case,
apply for Firehose access, establish a business relationship with
Twitter, invest in the infrastructure on your end for the high-
precision timekeeping hardware and software, etc.

As far as occasional glitches are concerned, we have those now. Every
so often, we still get Fail Whales, 5xx errors, DDoS attacks, etc. My
broadband sometimes doesn't work. Sometimes, we have a windstorm or an
ice storm and I lose electricity for a couple of hours. GMail goes
down sometimes. Amazon goes down sometimes. Water mains break. And
every so often, the astronomers add leap seconds to correct for
hitches in the Earth's gitalong. I think we can live with an
occasional clock error, or gap in the tweet IDs. And if you're
interested, I can point you at the fairly simple math needed to
correct for these glitches.

M. Edward (Ed) Borasky

unread,
Mar 26, 2010, 8:11:09 PM3/26/10
to Twitter Development Talk
That's awesome! How far back does your dataset go? Do you have the
Michael Jackson spike?

On Mar 26, 2:01 pm, jerememonteau <m...@jmoe.com> wrote:
> Whoops, accidentally just replied to author the first time...but...
>
> I built this little site about 9 months ago, depending on the
> monotonically increasing nature of tweet IDs :
>
> http://www.tweelocity.com
>
> This is a fun graph :
>
> http://tweelocity.com/chart/60/300/
>
> So it would be cool if some way were provided for me to gauge tweet
> volumes at regular intervals (currently every 2 minutes).
>
> I also think it's super cool that the twitter team is even giving a
> heads up like this.
>
> On Mar 26, 1:41 pm, Taylor Singletary <taylorsinglet...@twitter.com>

Message has been deleted

Aki

unread,
Apr 1, 2010, 2:01:01 AM4/1/10
to Twitter Development Talk
It actually makes sense to use the tweet ID to sort tweets, because
the timestamp is not a valid source of information for accurate sorting.
It is very common to have multiple tweets posted in the exact same
second, and it is not possible to reproduce the correct ordering
of tweets on the client side. This could be improved by giving the
timestamp better precision (maybe milliseconds), but it is still possible
to get tweets posted at the exact same millisecond (although it is
very rare).

If Twitter really needs to change the tweet ID scheme, I think a better
solution for sorting needs to be provided through the API.

On Mar 27, 7:41 am, Taylor Singletary <taylorsinglet...@twitter.com>

Mark McBride

unread,
Apr 1, 2010, 12:39:12 PM4/1/10
to twitter-deve...@googlegroups.com
Just out of curiosity, what applications are you building that require sub-second sorting resolution for tweets?



Dewald Pretorius

unread,
Apr 1, 2010, 1:47:05 PM4/1/10
to Twitter Development Talk
Mark,

It's extremely important where you have two bots that reply to each
other's tweets. With incorrectly sorted tweets, you get conversations
that look completely unnatural.


M. Edward (Ed) Borasky

unread,
Apr 1, 2010, 3:41:42 PM4/1/10
to Twitter Development Talk
On Apr 1, 10:47 am, Dewald Pretorius <dpr...@gmail.com> wrote:
> Mark,
>
> It's extremely important where you have two bots that reply to each
> others' tweets. With incorrectly sorted tweets, you get conversations
> that look completely unnatural.

Uh ... bots talking to each other on Twitter? Is this something I can
watch today, or something that someone would build if the technology
existed in the API to support it? ;-)

M. Edward (Ed) Borasky

unread,
Apr 1, 2010, 3:43:44 PM4/1/10
to Twitter Development Talk

On Apr 1, 9:39 am, Mark McBride <mmcbr...@twitter.com> wrote:
> Just out of curiosity, what applications are you building that require
> sub-second sorting resolution for tweets?
>
>   ---Mark

Twitter's capacity planning? ;-)

Message has been deleted

Aki

unread,
Apr 1, 2010, 7:34:46 PM4/1/10
to Twitter Development Talk
I'm developing a desktop Twitter client. I think accurate sorting is
needed, because the order of tweets may look different in every
application without it. It's not that it would totally
kill my Twitter client, but I take accurate presentation of tweets
seriously, and I think it would be better to have consistent tweet
ordering across all applications.

If this scheme change is really needed (e.g. required for processing
new tweets simultaneously across multiple servers without
synchronising tweet IDs), I would suggest adding the time in milliseconds
to the tweet information, which would have much better accuracy.

M. Edward (Ed) Borasky

unread,
Apr 2, 2010, 1:27:47 AM4/2/10
to Twitter Development Talk
On Apr 1, 4:34 pm, Aki <yoru.fuku...@gmail.com> wrote:
> I'm developing desktop Twitter client. I think accurate sorting is
> needed, because the order of tweets may look different on every
> application without accurate sorting. It's not that it would totally
> kill my Twitter client, but I take accurate presentation of tweets
> seriously, and I think it would be better to have consistent tweet
> ordering across all applications.
>
> If this scheme change is really needed (e.g. required to processing
> new tweets simultaneously across multiple servers without
> synchronising tweet ID), I would suggest adding time in milliseconds
> to tweet information, which would have much better accuracy.

No matter what the timestamp resolution is, you're still going to have
a non-zero probability of multiple tweets per timestamp. And if you have
an "event" somewhere, like an earthquake or an orca killing his trainer
in a show, you're going to see bursts of tweets from the scene, assuming
the infrastructure survived the event. The probability of multiple
tweets per timestamp will increase dramatically in such a circumstance.

But - I personally don't see how it would hurt Twitter to "publish"
average tweet inter-arrival times or average tweets per second on a web
page for all the world to see. In fact, I'd love to be able to pull up a
map of the world and see tweets-per-second mapped in (near) real time -
say, refreshing every minute or so. Why make the world work to pull this
out of the APIs? ;-)

How hard can it be?

http://earthquake.usgs.gov/earthquakes/recenteqsanim/world/

Waldron Faulkner

unread,
Apr 2, 2010, 8:56:00 AM4/2/10
to Twitter Development Talk
When?

On Mar 26, 4:41 pm, Taylor Singletary <taylorsinglet...@twitter.com>

Message has been deleted

M. Edward (Ed) Borasky

unread,
Apr 5, 2010, 7:18:43 PM4/5/10
to twitter-deve...@googlegroups.com
On 04/05/2010 12:55 AM, Tim Haines wrote:
> This made me laugh. Hard.

>
> On Fri, Apr 2, 2010 at 6:47 AM, Dewald Pretorius <dpr...@gmail.com> wrote:
>
>> Mark,
>>
>> It's extremely important where you have two bots that reply to each
>> others' tweets. With incorrectly sorted tweets, you get conversations
>> that look completely unnatural.
>>
>> On Apr 1, 1:39 pm, Mark McBride <mmcbr...@twitter.com> wrote:
>>> Just out of curiosity, what applications are you building that require
>>> sub-second sorting resolution for tweets?

Yeah - my bot laughed too ;-)

Mark McBride

unread,
Apr 8, 2010, 6:09:30 PM4/8/10
to twitter-deve...@googlegroups.com
Thank you for the feedback.  It's great to hear about the variety of use cases people have for the API, and in particular all the different ways people are using IDs. To alleviate some of the concerns raised in this thread, we thought it would be useful to give more details about how we plan to generate IDs:

1) IDs are still 64-bit integers.  This should minimize any migration pains.
2) You can still sort on ID.  Within a few milliseconds you may get out-of-order results, but for most use cases this shouldn't be an issue.
3) since_id will still work (within the caveats given above).  
4) We will provide a way to backfill from the streaming API.
5) You cannot use the generated ID to reverse engineer tweet velocity.  Note that you can still use the streaming API to determine the rate of public statuses.

Additional items of interest
1) At some point we will likely start using this as an ID for direct messages too
2) We will almost certainly open source the ID generation code, probably before we actually cut over to using it.
3) We STRONGLY suggest that you treat IDs as roughly sorted (roughly being within a few ms buckets), opaque 64-bit integers.  We may need to change the scheme again at some point in the future, and want to minimize migration pains should we need to do this.
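
As a concrete example of point 3, a client can merge and order statuses without ever interpreting the ID; a minimal Python sketch, assuming statuses arrive as parsed JSON dicts with an "id" field:

    # Dedupe by ID and sort newest-first, treating IDs purely as opaque,
    # roughly sorted 64-bit keys. Within a few-millisecond bucket the
    # resulting order is arbitrary, which is fine for display purposes.
    def merge_pages(*pages):
        seen = {}
        for page in pages:
            for status in page:
                seen[status["id"]] = status
        return sorted(seen.values(), key=lambda s: s["id"], reverse=True)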

Hopefully this puts you more at ease with the changes we're making.  If it raises new concerns, please let us know!

  ---Mark

http://twitter.com/mccv

Nick Arnett

unread,
Apr 8, 2010, 6:39:42 PM4/8/10
to twitter-deve...@googlegroups.com
On Thu, Apr 1, 2010 at 10:47 AM, Dewald Pretorius <dpr...@gmail.com> wrote:
Mark,

It's extremely important where you have two bots that reply to each
others' tweets. With incorrectly sorted tweets, you get conversations
that look completely unnatural.

I'd love to see an example of two bots replying to each other and looking entirely natural!

We all knew this sort of thing was going on, removing the pesky humans from the loop, but I always thought it was unintentional.

There's a science fiction story in there somewhere.

Nick

Lil Peck

unread,
Apr 8, 2010, 6:55:11 PM4/8/10
to twitter-deve...@googlegroups.com
On Thu, Apr 8, 2010 at 5:39 PM, Nick Arnett <nick....@gmail.com> wrote:
>
> I'd love to see an example of two bots replying to each other and looking
> entirely natural!
>
> We all knew this sort of thing was going on, removing the pesky humans from
> the loop, but I always thought it was unintentional.
>
> There's a science fiction story in there somewhere.
>
>

Do Twitterbots dream of electric sheep?



Brian Smith

unread,
Apr 8, 2010, 7:01:20 PM4/8/10
to twitter-deve...@googlegroups.com

What does “within the caveats given above” mean? Either since_id will work or it won’t. It seems to me that if IDs are only in a “rough” order, since_id won’t work—in particular, there is a possibility that paging through tweets using since_id will completely skip over some tweets.

 

My concern is that, since tweets will not be serialized at the time they are written, there will be a race condition between me making a request and users posting new statuses. That is, I could get a response whose largest ID is X, generated just before a tweet with a smaller ID (X-1) has been saved in the database; if so, when I issue a request with since_id=X, my program will never see that newer tweet (X-1).

 

Are you going to change the implementation of the timeline methods so that they never return a tweet with ID X until all nodes in the cluster guarantee that they won’t create a new tweet with an ID less than X?

 

I implement the following logic:

 

1. Let LATEST start out as the ID of the earliest tweet available in the user's timeline.

2. Make a request with since_id={LATEST}, which returns a set of tweets T.

3. If T is empty, then stop.

4. Let LATEST = max({ id(t) : t in T }).

5. Go to 2.

 

Will I be guaranteed not to skip over any tweets in the timeline using this logic? If not, what do I need to do to ensure I get them all?
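
(For reference, the same steps 1-5 as a Python sketch; get_timeline here is a hypothetical wrapper around the REST call with a since_id parameter, not a real client method.)

    def fetch_forward(get_timeline, earliest_id):
        latest = earliest_id                        # step 1
        while True:
            batch = get_timeline(since_id=latest)   # step 2
            if not batch:                           # step 3
                return
            for status in batch:
                yield status
            latest = max(s["id"] for s in batch)    # step 4, then repeat (step 5)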

 

Thanks,

Brian

 

 

From: twitter-deve...@googlegroups.com [mailto:twitter-deve...@googlegroups.com] On Behalf Of Mark McBride
Sent: Thursday, April 08, 2010 5:10 PM
To: twitter-deve...@googlegroups.com
Subject: Re: [twitter-dev] Re: Upcoming changes to the way status IDs are sequenced

 

Thank you for the feedback.  It's great to hear about the variety of use cases people have for the API, and in particular all the different ways people are using IDs. To alleviate some of the concerns raised in this thread we thought it would be useful to give more details about how we plan to generate IDs

 

1) IDs are still 64-bit integers.  This should minimize any migration pains.

2) You can still sort on ID.  Within a few milliseconds you may get out-of-order results, but for most use cases this shouldn't be an issue.

3) since_id will still work (within the caveats given above).  

4) We will provide a way to backfill from the streaming API.

5) You cannot use the generated ID to reverse engineer tweet velocity.  Note that you can still use the streaming API to determine the rate of public statuses.

 

Additional items of interest

1) At some point we will likely start using this as an ID for direct messages too

2) We will almost certainly open source the ID generation code, probably before we actually cut over to using it.

3) We STRONGLY suggest that you treat IDs as roughly sorted (roughly being within a few ms buckets), opaque 64-bit integers.  We may need to change the scheme again at some point in the future, and want to minimize migration pains should we need to do this.

 

Hopefully this puts you more at ease with the changes we're making.  If it raises new concerns, please let us know!

 

  ---Mark

http://twitter.com/mccv

 

On Mon, Apr 5, 2010 at 4:18 PM, M. Edward (Ed) Borasky <zn...@comcast.net> wrote:

On 04/05/2010 12:55 AM, Tim Haines wrote:
> This made me laugh.  Hard.
>

> On Fri, Apr 2, 2010 at 6:47 AM, Dewald Pretorius <dpr...@gmail.com> wrote:
>
>> Mark,
>>
>> It's extremely important where you have two bots that reply to each
>> others' tweets. With incorrectly sorted tweets, you get conversations
>> that look completely unnatural.
>>

>> On Apr 1, 1:39 pm, Mark McBride <mmcbr...@twitter.com> wrote:
>>> Just out of curiosity, what applications are you building that require
>>> sub-second sorting resolution for tweets?

Naveen

unread,
Apr 8, 2010, 7:27:15 PM4/8/10
to Twitter Development Talk
This was my initial concern with the randomly generated ids that I
brought up, though I think Brian described it better than I.

It simply seems very likely that, when using since_id to populate newer
tweets for the user, some tweets will never be seen, because the
since_id of the last message received will be larger than an ID
generated 1 ms later.

With the random generation of IDs, I can see two ways to guarantee
delivery of all tweets in a user's timeline:
1. Page forwards and backwards to ensure that no tweet generated at or
near the same time as the newest one received a lower ID. This will be
very expensive for a mobile client, not to mention that it complicates
any refresh algorithm significantly.
2. Given that we know how IDs are generated (i.e. which bits represent
the time), we can simply over-request by decrementing the since_id time
bits by a second or two and filter out duplicates. (Again, not really
ideal for mobile clients where battery life is an issue, plus it then
makes the implementation very dependent on Twitter's ID format
remaining stable.)

Can anyone explain whether Brian and I are misinterpreting this as a
very real possibility of never displaying some tweets in a timeline
without changing how we request data from Twitter (i.e., that since_id
doesn't break)?

--Naveen Ayyagari
@knight9
@SocialScope
> http://twitter.com/mccv
Message has been deleted

Naveen

unread,
Apr 8, 2010, 8:03:29 PM4/8/10
to Twitter Development Talk
Ahh, yes, your workaround is a little better than mine, but it is
still a workaround and requires changes to how since_id is currently
used by what I assume are most applications. I understand the need
for change and am willing to work around it; I can imagine the
scalability issues of trying to use a synchronized ID for all tweets.

However, I wanted to be clear, and feel it should be made obvious, that
with this change there is a possibility that a tweet may not be
delivered to the client if the implementation of how since_id is
currently used is not updated to cover the case. I still envision the
situation as more likely than you seem to believe, and figure that as
tweet velocity increases, the likelihood will also increase; but I
assume you have better data to support your viewpoint than I do, and
shall defer.

--Naveen Ayyagari
@knight9
@SocialScope

On Apr 8, 7:37 pm, Mark McBride <mmcbr...@twitter.com> wrote:
> It's a possibility, but by no means a probability.  Note that you can
> mitigate this by using the newest tweet that is outside your "danger zone".
>  For example in a sequence of tweets t1, t2 ... ti ... tn with creation
> times c1, c2 ... ci ... cn and a comfort threshold e you could use since_id
> from the latest ti such that c1 - ci > e.
>
>   ---Mark
>
> http://twitter.com/mccv
Message has been deleted

Dave Sherohman

unread,
Apr 9, 2010, 4:58:19 AM4/9/10
to twitter-deve...@googlegroups.com
On Thu, Apr 08, 2010 at 05:03:29PM -0700, Naveen wrote:
> However, I wanted to be clear and feel it should be made obvious that
> with this change, there is a possibility that a tweet may not be
> delivered to client if the implementation of how since_id is currently
> used is not updated to cover the case. I still envision the situation
> as more likely than you seem to believe and figure as tweet velocity
> increases, the likelihood will also increase; But I am assuming have
> better data to support your viewpoint than I and shall defer.

Maybe I'm just missing something here, but it seems trivial to fix on
Twitter's side (enough so that I assume it's what they've been planning
from the start to do): Only return tweets from closed buckets.

We are guaranteed that the buckets will be properly ordered. The order
will only be randomized within a bucket. Therefore, by only returning
tweets from buckets which are no longer receiving new tweets, since_id
works and will never miss a tweet.

And, yes, this does mean a slight delay in getting the tweets out
because they have to wait a few milliseconds for their bucket to close
before being exposed to calls which can use since_id, plus maybe a
little longer for the contents of that bucket to be distributed to
multiple servers. That's still going to only take time comparable to
round-trip times for an HTTP request to fetch the data for display to a
user and be far, far less than the average refresh delay required by
those clients which fall under the API rate limit. I submit, therefore,
that any such delay caused by waiting for buckets to close will be
inconsequential.
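
To spell out the idea, a toy sketch of the filter I have in mind; the grace period and the per-status millisecond value are made-up placeholders, since I obviously don't know how Twitter's internals look:

    import time

    BUCKET_GRACE_MS = 50   # assumed delay after which a bucket is "closed"

    def closed_statuses(statuses, now_ms=None):
        # Only hand out statuses whose millisecond bucket can no longer
        # receive new IDs; since_id then never skips anything.
        if now_ms is None:
            now_ms = int(time.time() * 1000)
        cutoff = now_ms - BUCKET_GRACE_MS
        return [s for s in statuses if s["timestamp_ms"] <= cutoff]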

--
Dave Sherohman

John Kalucki

unread,
Apr 9, 2010, 2:20:21 PM4/9/10
to twitter-deve...@googlegroups.com
Folks are making a lot of incorrect assumptions about the Twitter architecture, especially around how we materialize and present timeline vectors and just what QoS we're really offering. This new scheme does not significantly, or perhaps even observably, make the existing issues around since_id any better or any worse. And I'm being very precise here. The since_id situation is such that the few milliseconds skew possible in Snowflake are practically irrelevant and lost in the noise of a 4 to 6 orders-of-magnitude misconception. (That's a very big misconception.)

If you do not know the rough ordering of our event stream as it applies to the materialized timeline vectors, and also the expected rate of change on the timeline in question, you cannot make good choices about making since_id perfect. But neither should you try to make it perfect, nor should you have to worry about this.

If you insist upon worrying about this, here's my slight salting of Mark's advice: In the existing continuously increasing id generation scheme on the Twitter.com API, I'd subtract about 5000 ids from since_id to ensure sufficient overlap in nearly all cases, but even this could be lossy in the face of severe operational issues -- issues of a type that we haven't seen in many many months. The search API has a different K in its rough ordering, so you might need more like 10,000 ids. In the new Snowflake scheme, I'd overlap by about 5000 milliseconds for twitter.com APIs and 10,000 ms for search APIs.
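
(A Python sketch of that overlap, purely as illustration; the constants come from the paragraph above, and the bit layout of the new IDs is an assumption, not a published format.)

    SEQUENTIAL_OVERLAP_IDS = 5000   # twitter.com API; ~10000 for the search API
    SNOWFLAKE_OVERLAP_MS = 5000     # twitter.com API; ~10000 for the search API
    SUFFIX_BITS = 22                # assumed width of the non-timestamp bits

    def backed_off_since_id(last_id, time_based=False):
        if not time_based:
            return max(last_id - SEQUENTIAL_OVERLAP_IDS, 1)
        # Time-based IDs: step the timestamp portion back by the overlap window.
        return max(last_id - (SNOWFLAKE_OVERLAP_MS << SUFFIX_BITS), 1)

    # The overlap means you will re-fetch statuses you already have, so
    # deduplicate on ID afterwards, as noted below.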

Despite all this, things still could go wrong. An engineer here is known for pointing out that even things that almost never ever happen, happen all the time on the Twitter system. Now, just because they are happening, to someone, all the time, doesn't mean that they'll ever ever happen to you or your users in a thousand years -- but someone's getting hit with it, somewhere, a few times a day.

The above schemes no longer treat the id as an opaque unique ordered identifier. And woe lies in wait for you as changes are made to these ids. Woe. You also need to deduplicate. Be very careful and understand fully what you summon by breaking this semantic contract.

In the end, since_id issues go away on the Streaming API, and other than around various start-up discontinuities, you can ignore this issue. I'll be talking about Rough Ordering, among other things Streaming, at the Chirp conference. Come geek out.

-John Kalucki
http://twitter.com/jkalucki
Infrastructure, Twitter Inc.

M. Edward (Ed) Borasky

unread,
Apr 9, 2010, 3:04:11 PM4/9/10
to twitter-deve...@googlegroups.com, John Kalucki
On 04/09/2010 11:20 AM, John Kalucki wrote:

[snip]


>
> In the end, since_id issues go away on the Streaming API, and other than
> around various start-up discontinuities, you can ignore this issue. I'll be
> talking about Rough Ordering, among other things Streaming, at the Chirp
> conference. Come geek out.

Thanks, John - that's the plan. ;-)

Brian Smith

unread,
Apr 9, 2010, 3:20:56 PM4/9/10
to twitter-deve...@googlegroups.com

John,

 

Thank you. That was one of the most informative emails on the Twitter API I have seen on the list.

 

Basically, even now, an application should not use an ID of a tweet for since_id if the tweet is less than 10 seconds old, ignoring service abnormalities. Probably a larger threshold (30 seconds or even a minute) would be better, especially when you take into consideration the likelihood of clock skew between the servers that generate the timestamps.
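
In code, picking a since_id that respects such a threshold might look like this (a sketch; created_at_ts is a stand-in for the tweet's parsed creation time as a Unix timestamp):

    import time

    THRESHOLD_S = 30   # comfort margin; 10s, 30s, or more as discussed above

    def safe_since_id(statuses, now=None):
        # Use only tweets old enough that no lower-ID stragglers should
        # still be in flight; returns None if nothing qualifies yet.
        if now is None:
            now = time.time()
        old_enough = [s for s in statuses
                      if now - s["created_at_ts"] > THRESHOLD_S]
        return max((s["id"] for s in old_enough), default=None)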

 

I think this is information that would be useful to have added to the API documentation, as I know many applications are taking a much more naive approach to pagination.

 

Thanks again,

Brian

 

From: twitter-deve...@googlegroups.com On Behalf Of John Kalucki
Sent: Friday, April 09, 2010 1:20 PM
To: twitter-deve...@googlegroups.com
Subject: Re: [twitter-dev] Re: Upcoming changes to the way status IDs are sequenced

 

Folks are making a lot of incorrect assumptions about the Twitter architecture, especially around how we materialize and present timeline vectors and just what QoS we're really offering. This new scheme does not significantly, or perhaps even observably, make the existing issues around since_id any better or any worse. And I'm being very precise here. The since_id situation is such that the few milliseconds skew possible in Snowflake are practically irrelevant and lost in the noise of a 4 to 6 orders-of-magnitude misconception. (That's a very big misconception.)

John Kalucki

unread,
Apr 9, 2010, 4:30:32 PM4/9/10
to twitter-deve...@googlegroups.com
Your second paragraph doesn't quite make sense. The period between your next poll and the timestamp of the last status is irrelevant. The issue is solely the magnitude of K on the roughly sorted stream of events that are applied to the materialized timeline vector. As K varies, so do the odds, however infinitesimally small, that you will miss a tweet using the last status id returned. The period between your polls of the API does not affect this K.

My recommendation is to ignore this issue in nearly every use case. If you are, however, polling high velocity timelines (including search queries) and attempting to approximate an Exactly Once QoS, you should, basically, stop doing that. You are probably wasting resources and you'll probably never get Exactly Once behavior anyway. Use the Streaming API instead.


-John Kalucki
http://twitter.com/jkalucki
Infrastructure, Twitter Inc.

Brian Smith

unread,
Apr 9, 2010, 5:28:39 PM4/9/10
to twitter-deve...@googlegroups.com

John,

 

I am not polling. I am simply trying to implement a basic “refresh” feature like every desktop/mobile Twitter app has. Basically, I just want to let users scroll through their timelines, and be reasonably sure that I am presenting them with an accurate & complete view of the timeline, while using as little bandwidth as possible.

 

When I said “10 seconds old”/“30 seconds old”/etc., I was referring to the age at the time the page of tweets was generated. So, basically, if the difference between the response's Last-Modified time and the tweet's timestamp is more than 10,000 ms (from what you said below), you are almost definitely getting At Least Once behavior if Twitter is operating normally, and you can use that information to get At Least Once behavior that emulates Exactly Once behavior with little (usually no) overhead. Is that a correct interpretation of what you were saying?

 

Thanks,

Brian

John Kalucki

unread,
Apr 11, 2010, 2:23:42 PM4/11/10
to twitter-deve...@googlegroups.com
If you are writing a general purpose display app, I think (but I am not at all certain) that you can ignore this issue. Reasonable polling frequency on modest-velocity timelines will sometimes, but very rarely, miss a tweet. Also, over time, we're doing things to make this better for everyone. Many of our projects have the side-effect of reducing K, decreasing the already low since_id failure odds even further. Some tweet pipeline changes went live in the last few weeks that dramatically reduce the K distribution for various user types.

Since I don't know how the Last-Modified time exactly works, I'm going to restate your response slightly:

Assuming synchronized clocks (or solely the Twitter clock, if exposed properly via Last-Modified), a poll at time t, a newest status that is at least n seconds old (timestamp at most t - n), and a sufficient n, even a naive since_id algorithm will be effectively Exactly Once, assuming that Twitter is running normally. For a given poll, when the delta between the poll time and the last update time drops below this n-second period, there's a non-zero loss risk.

Just what is n? It is K expressed as time rather than as a discrete count. For some timeline types, with some classes of users, K is as much as perhaps 180 seconds. For others, K is less than 1 second. There's some variability here that we should characterize more carefully internally and then discuss publicly. I suspect there's a lot to be learned from this exercise.

Since_id really runs into trouble when any of the following are too great: the polling frequency, the updating frequency, the roughly-sorted K value. If you are polling often to reduce display latency, use the Streaming API. If the timeline moves too fast to capture it all exactly, you should reconsider your requirements or get a Commercial Data License for the Streaming API. Does the user really need to see every Bieber at 3 Biebers Per Second? How would they ever know if they missed 10^-5 of them in a blur? If you need them all for analysis, consider calculating the confidence interval given a sample proportion of 1 - 10^-6 (six 9s) or so vs. a total enumeration. Indistinguishable. If you need them for some other purpose, say CRM, the Streaming API may be the answer.


-John Kalucki
http://twitter.com/jkalucki
Infrastructure, Twitter Inc.


Josh Bleecher Snyder

unread,
Apr 11, 2010, 7:35:39 PM4/11/10
to twitter-deve...@googlegroups.com
Hi John (et al.),

These emails from you are great -- they are exactly the sort of
thoughtful, detailed, specific, technical emails that I would
personally love to see accompany future announcements. I think they
would prevent a fair amount of FUD. Thank you.

I have one stupid question, if you don't mind, though. You refer in
every email to "K". What, precisely, does K refer to? What are its
units? (I think I know what you mean by it, but I'd be interested
to hear precisely.)

Thanks,
Josh

John Kalucki

unread,
Apr 11, 2010, 8:14:52 PM4/11/10
to twitter-deve...@googlegroups.com
A sequence can be on a continuum from unsorted to partially sorted to roughly sorted to totally sorted. Totally sorted is what we mean when we say "sorted". Partially sorted could mean anything, I suppose, but roughly sorted is a stricter definition than partially sorted. Informally it means that each item is no more than K items out of position. So, to totally sort the sequence, you need only consider K items.

This is useful stuff for dealing with infinite sequences of events -- like, picking a random example, the insertion of new tweets into a materialized timeline (a cache of the timeline vector). The events get slightly jumbled as they go through the Twitter system and this causes confusion for developers who don't understand how we apply the CAP theorem. It's Brewer's world, we just live in it. And we haven't done a good job at explaining our QoS as we've made the CAP trade-offs, or how we've evolved them, etc. etc.

To make things one step more complicated, at Twitter, K is a function of a number of factors, including the timeline, the user tweeting, the phase of the moon, and the general state of the Twitter system. So, we have to think of the distribution of K over time as well.

Crazy. We should just move this all into a single instance of Oracle and go home.

http://twitter.com/jkalucki/statuses/10503736367
A sequence α is k-sorted IFF ∀ i, r, 1 ≤ i ≤ r ≤ n, i ≤ r - k implies aᵢ ≤ aᵣ.
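
(To make that concrete: if no element is more than k positions out of place, a bounded buffer of k+1 items restores total order. A small Python sketch, not anything Twitter ships:)

    import heapq

    def sort_k_sorted(stream, k):
        # Buffer at most k+1 pending items in a min-heap; once k+1 items are
        # buffered, the smallest can no longer be displaced and is emitted.
        heap = []
        for item in stream:
            heapq.heappush(heap, item)
            if len(heap) > k:
                yield heapq.heappop(heap)
        while heap:
            yield heapq.heappop(heap)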


-John Kalucki
http://twitter.com/jkalucki
Infrastructure, Twitter Inc.


Chad Etzel

unread,
Apr 11, 2010, 11:05:26 PM4/11/10
to twitter-deve...@googlegroups.com, twitter-deve...@googlegroups.com

I'd like to see more epsilon-delta proofs on this list personally :)

Chad


Nick Arnett

unread,
Apr 12, 2010, 12:33:55 AM4/12/10
to twitter-deve...@googlegroups.com
On Sun, Apr 11, 2010 at 5:14 PM, John Kalucki <jo...@twitter.com> wrote:

This is useful stuff for dealing with infinite sequences of events -- like, picking a random example, the insertion of new tweets into a materialized timeline (a cache of the timeline vector).

The Twitter stream is an infinite sequence of events... now that's serious optimism about how long Twitter will exist!

Sorry, just had to say it.

Of course, some infinities are bigger than others.

Nick

M. Edward (Ed) Borasky

unread,
Apr 12, 2010, 1:12:59 AM4/12/10
to twitter-deve...@googlegroups.com, Nick Arnett

Ah yes ... and the "tweet rate" is growing "exponentially" ... except
that such growth is economically implausible. Thanks for reminding me -
another Chirp question for Google Moderator. ;-)

--
M. Edward (Ed) Borasky

http://borasky-research.net/m-edward-ed-borasky/ @znmeb

"I've always regarded nature as the clothing of God." ~Alan Hovhaness

Taylor Singletary

unread,
Mar 26, 2010, 4:41:30 PM3/26/10
to twitter-development-talk, twitter-api-announce

Alam Sher

unread,
Mar 26, 2010, 4:51:19 PM3/26/10
to twitter-deve...@googlegroups.com
Yup, I am using since_id as well in my application to perform various sequential tasks. Hopefully the new ID generation scheme will still support this parameter, at least through some alternative.

Alam Sher

On Sat, Mar 27, 2010 at 1:48 AM, Brian Smith <br...@briansmith.org> wrote: