Streaming API Best Practice (Multiple Connections or Single)


Alam Sher

Feb 24, 2010, 1:46:37 PM
to Twitter Development Talk
Hi,

What is the best practice if I have a 20K Twitter user base and I want
to track user-specific keywords via statuses/filter?

Should I distribute the processing across multiple nodes, say by opening
a streaming connection that tracks keywords for 5K users each (on
different IPs, or on the same IP with different authenticating users)?

Or should I just apply for a higher access level and use a single
connection to get the whole thing?

Thanks,
Alam Sher

John Kalucki

Feb 24, 2010, 5:06:59 PM
to twitter-deve...@googlegroups.com
The documentation should be pretty clear on this topic. One main connection, and perhaps an auxiliary connection to manage query velocity. You must not proliferate connections to work around rate limits. Instead, you must apply for higher access. Connections are expensive.

http://apiwiki.twitter.com/Streaming-API-Documentation#UpdatingFilterPredicates


-John Kalucki
http://twitter.com/jkalucki
Infrastructure, Twitter Inc.

Jonathan Strauss

Feb 24, 2010, 5:31:35 PM
to Twitter Development Talk, Michael Orr, Corey Reece, lau...@snowballfactory.com
On Feb 24, 2:06 pm, John Kalucki <j...@twitter.com> wrote:
> The documentation should be pretty clear on this topic. One main connection,
> and perhaps an auxiliary connection to manage query velocity.

Hey John,

Do you recommend this kind of two-connection setup for updating our user
list when using the follow predicate?

We've been trying unsuccessfully to use the count parameter when
reconnecting to add new users to our follow list. I've found several
oblique mentions of the count parameter only working in some cases,
but no specifics on how or why.

We currently have shadow role access for the TweetPo.st app. We're
trying to update our Streaming API connection when new users sign up
for TweetPo.st without losing tweets for existing users during the
reconnect. Any suggestions on the best way to do this would be greatly
appreciated.

Thanks!
-jonathan

=====
Jonathan Strauss, Co-Founder
http://snowballfactory.com

Campaign tracking for social media - http://awe.sm
A smarter way to update Facebook from Twitter - http://tweetpo.st
Sharecount button for Facebook - http://www.fbshare.me

John Kalucki

Feb 24, 2010, 10:44:25 PM
to twitter-deve...@googlegroups.com, Michael Orr, Corey Reece, lau...@snowballfactory.com
This technique works for updating any filter predicate. The count parameter should work on a shadow account. It won't work on a default access account. We have a number of very large integrations using this technique with Birddog access -- it should scale down to Shadow access just fine.

The documentation makes it clear which cases are supported and which ones are not: http://apiwiki.twitter.com/Streaming-API-Documentation#count

The count parameter isn't supported on track streams for computational complexity reasons, and it isn't supported on the default access role for policy reasons.


-John Kalucki
http://twitter.com/jkalucki
Infrastructure, Twitter Inc.



Alam Sher

Feb 25, 2010, 2:57:33 AM
to twitter-deve...@googlegroups.com
So if I have 20K users and I have to track, say, 60K keywords for them and also follow all of them, I should apply for two higher-access accounts: one for the track predicates and the other for the follow predicate. Does this make sense?

Thanks,
Alam Sher
--
_______________
Alam Sher Khan
+92 331 505 5549

John Kalucki

Feb 25, 2010, 9:15:34 AM
to twitter-deve...@googlegroups.com
Yes, this is indeed what you should be doing. If you have a low tolerance for data loss, you will then use a total of four accounts: 2 elevated and 2 default access accounts. If you can tolerate a few missing tweets on each reconnect, you can just use the two elevated accounts.

-John Kalucki

http://twitter.com/jkalucki
Infrastructure, Twitter Inc.


Alam Sher

Feb 25, 2010, 3:25:24 PM
to twitter-deve...@googlegroups.com
Sorry, but this is exactly the portion of the documentation that goes over my head.

Can you please explain a bit more to me how a default access account can be used along with the elevated access account to minimize the data loss?

Thanks,
Alam Sher

John Kalucki

Feb 26, 2010, 4:31:58 PM
to twitter-deve...@googlegroups.com
The elevated access account can reconnect much less frequently by adding new predicates to a default access stream that cycles based on demand. When the default access account cycles, very little data will be lost, as it receives a small fraction of your total feed. Once the default access account is too full, the elevated access account can be restarted with the current predicates.
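
As a rough illustration of the scheme described above, the bookkeeping might look like the following Python sketch. It is not Twitter's code: the class name, the `aux_capacity` threshold, and the return values are all hypothetical, and the point at which the auxiliary stream counts as "too full" is left to the developer.

```python
from dataclasses import dataclass, field


@dataclass
class PredicatePlanner:
    """Decides which connection absorbs a new predicate.

    aux_capacity is a developer-chosen threshold (hypothetical); the
    thread does not state actual per-role predicate limits.
    """
    aux_capacity: int
    main: set = field(default_factory=set)  # predicates on the elevated stream
    aux: set = field(default_factory=set)   # predicates on the default-access stream

    def add(self, predicate: str) -> str:
        """Record a predicate and report which connection must be restarted."""
        if predicate in self.main or predicate in self.aux:
            return "none"  # already covered, no reconnect needed
        if len(self.aux) < self.aux_capacity:
            # Cheap case: only the small auxiliary stream cycles.
            self.aux.add(predicate)
            return "aux"
        # Auxiliary stream is "full": fold everything into the main
        # stream, restart it once, and start the auxiliary set over.
        self.main |= self.aux
        self.main.add(predicate)
        self.aux.clear()
        return "main"
```

With `aux_capacity=2`, the first two new predicates cycle only the auxiliary stream, and the third triggers a single restart of the elevated stream carrying all three.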


-John Kalucki
http://twitter.com/jkalucki
Infrastructure, Twitter Inc.


Alam Sher

Feb 27, 2010, 8:14:24 AM
to twitter-deve...@googlegroups.com
Okay, great.

When we say a default access or elevated access account is "too full", does that mean we have started getting rate limit messages in the stream? Or is it something else?


Thanks,
Alam Sher

John Kalucki

Feb 27, 2010, 11:31:47 PM
to twitter-deve...@googlegroups.com
Each developer will come to understand Fullness in a unique inner-directed manner. One might decide that exhausting the predicate list constitutes adequate Fullness. Another might decide that data loss becomes unacceptable at another point, perhaps due to the rapid cycling. A third might develop another Fullness heuristic. We should not judge their reasons, rather their reasoning and purity of motive. And their careful adherence to the connection guidelines, as offered in the Wiki of Truth.

Jonathan Strauss

Mar 4, 2010, 4:14:25 AM
to Twitter Development Talk, Corey Reece, Michael Orr, Laurie Voss
First of all John, that may be the best Saturday night reply ever :-).

We are trying to use the count parameter with the follow predicate on
an account with shadow access role and have been getting some curious
responses when testing.

Here is a brief description of the testing scenarios:
* following a single Twitter ID
* using a count parameter of -1000
* tweeting from the Twitter ID being followed and then immediately
  starting the connection described

Scenario A - if the connection is started within 1-2 seconds of the
tweet, it will show up in the historical results and then the
connection will hang for ~10 min before disconnecting

Scenario B - if the connection is started >10 seconds after the tweet,
it will not show up in the historical results

Questions:
1) In reading "On filtered streams, the number requested is the number
of statuses that are applied to the filter predicate, and not the
number of statuses returned." from http://apiwiki.twitter.com/Streaming-API-Documentation#count,
are we to understand that the count parameter for the follow predicate
should be keying off the expected volume of *all* tweets in the
reconnect period, not just the ones from users we're following?
2) If that is the case, won't our count parameter always need to be a
function of total Streaming API tweet volume as opposed to anything we
can measure within our app?
3) And finally, what would be the explanation for the hang we see in
testing Scenario A above?

Thanks,
-jonathan

=====
Jonathan Strauss, Co-Founder
http://snowballfactory.com

Campaign tracking for social media - http://awe.sm
A smarter way to update Facebook from Twitter - http://tweetpo.st
Sharecount button for Facebook - http://www.fbshare.me

John Kalucki

Mar 4, 2010, 11:49:08 AM
to twitter-deve...@googlegroups.com, Corey Reece, Michael Orr, Laurie Voss
The FEs keep a circular buffer of the last 150,000 tweets. The count parameter controls how much of that buffer is examined to create the historical dump before transitioning to live streaming. If the current tps rate is, say, 600, then the buffer holds the last 250 seconds worth of tweets. With a count of 150,000, a firehose stream would receive all 150,000 tweets, and then the very next live tweet, effectively masking all disconnects of up to say 249 seconds. A filtered stream gets the exact same coverage, but, instead, you only receive those tweets that match on at least one predicate.
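
The arithmetic above can be checked with a small Python sketch. The 150,000 buffer size comes from this message; the function name is hypothetical, the 600 tps figure is only the example rate given here, and treating a negative count by its magnitude is an assumption based on the description of negative counts later in this message.

```python
BUFFER_SIZE = 150_000  # circular buffer size stated in the thread


def coverage_seconds(count, tweets_per_second, buffer_size=BUFFER_SIZE):
    """Estimate how many seconds of history a given count examines.

    count is the number of statuses applied to the filter, not the
    number returned. Its magnitude is capped at the buffer size.
    """
    if tweets_per_second <= 0:
        raise ValueError("tweets_per_second must be positive")
    examined = min(abs(count), buffer_size)
    return examined / tweets_per_second
```

At the example rate, `coverage_seconds(150_000, 600)` gives 250.0 seconds, matching the figure above, while `coverage_seconds(-1000, 600)` gives under two seconds. If the real rate was anywhere near that example, this would be consistent with a count of -1000 catching a tweet sent 1-2 seconds earlier but not one sent more than 10 seconds earlier.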

Yes, it's hard for a non-firehose consumer to estimate the optimal count size. However, given a highly selective predicate, there's usually little harm in requesting too much. Perhaps just request the full historical result set, and deduplicate the overlap. This request-it-all approach is less practical for firehose and other high-volume streams, as receiving and parsing through the duplicates adds latency to the first non-duplicated tweet.
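
One way to handle the overlap is to remember recently seen status ids and drop repeats. The Python sketch below is only an illustration: the class name and the `max_ids` bound are hypothetical, and it assumes each status carries a unique id that the client extracts before calling `is_new`.

```python
from collections import OrderedDict


class RecentIdFilter:
    """Remembers a bounded number of status ids, oldest evicted first."""

    def __init__(self, max_ids=200_000):
        self._seen = OrderedDict()  # insertion-ordered set of ids
        self._max = max_ids

    def is_new(self, status_id):
        """Return True the first time an id is seen, False for repeats."""
        if status_id in self._seen:
            return False
        self._seen[status_id] = None
        if len(self._seen) > self._max:
            self._seen.popitem(last=False)  # drop the oldest id
        return True
```

A consumer would call `is_new(status_id)` for every status from both the historical backfill and the live stream, and process only those for which it returns True. The bound should be at least as large as the backfill being requested, so that ids are not evicted before the overlap has passed.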

A negative count returns only the historical result set and does not transition to live streaming. In this case, your HTTP client should see a TCP close and exit gracefully immediately after receiving any matching historical tweets. The whole transaction should take perhaps 60ms (west coast + speed of light).

Other than the 10 minute hang (which is probably your client's default TCP socket timeout setting), your scenarios describe the desired behavior. I suspect that your client isn't detecting a TCP close in a timely manner. This flaw will lead to data loss when connections are cycled on our end. I strongly encourage all clients to detect a TCP close and reconnect within a few tens to hundreds of milliseconds.
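
For clients built on raw sockets, a closed connection shows up as a zero-length read. The Python sketch below shows one way to notice that promptly rather than waiting on a long default timeout. The function name and the `idle_timeout` value are hypothetical, and HTTP request handling, parsing, and the reconnect itself are left out.

```python
import socket


def read_until_close(sock, on_chunk, idle_timeout=90.0, bufsize=4096):
    """Read from sock until the peer closes or the stream goes idle.

    Returns "closed" when the server closes the connection (recv
    returns an empty bytes object) and "stalled" when nothing arrives
    within idle_timeout seconds. Either result is the caller's cue to
    reconnect.
    """
    sock.settimeout(idle_timeout)
    while True:
        try:
            chunk = sock.recv(bufsize)
        except socket.timeout:
            return "stalled"
        if not chunk:
            return "closed"
        on_chunk(chunk)
```

The caller would wrap this in a loop that reopens the connection as soon as the function returns, which keeps the gap after a server-side close down to the reconnect time itself.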


-John Kalucki
http://twitter.com/jkalucki
Infrastructure, Twitter Inc.

