~25% loss rate Streaming API vs. Search API


Brian Maso

Jan 7, 2011, 12:42:07 AM
to Twitter Development Talk
Hi All,

Using the Streaming API, I'm noticing about a 25% loss rate when
tracking multiple hashtags vs. using the good old Search API. I'm
finding it hard to believe this is true, so I tested over and over, but
I keep getting the same results. The Streaming API just seems to not
provide a fair number of tweets.

Note that I have the lowest rate limit with the Streaming API --
perhaps higher rate limits have lower loss rates.

Has anyone else noticed this loss rate between the Streaming and
Search APIs? Or am I on crack?

Does the loss rate get lower with the higher Streaming API account
limits?

Brian Maso

Bess

Jan 7, 2011, 8:08:21 PM
to Twitter Development Talk
This is hard to believe. Streaming API is an approved API that should
not have any limit. It should give you everything without any limit.
On the other hand Search API has rate-limitation.

Did you use any filter?

Brian Maso

Jan 9, 2011, 5:29:12 PM
to Twitter Development Talk
What I did was open three separate normal browser tabs in Firefox,
each using the Twitter search web interface to search for one of three
different hashtags ("#ces", "#ces11", and "#nfl" -- examples of three
tags that should have decent ongoing traffic).

At the same time I have an application capturing tweets from the same
three hashtags using the streaming API ("filter.json?
q=#ces,#ces11,#nfl", with appropriate URL encoding).

Regardless of the amount of time, the streaming application captured
about 25% fewer tweets. Detailed analysis of the tweet IDs captured by
the browsers vs. those captured by the standalone application
retrieving tweets via the streaming API verified that there were
tweets delivered through the browsers that did not appear through the
streaming API. There were no tweets delivered through the streaming
API that did not also appear in the set of tweets delivered through
the browsers.
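
The ID comparison described above can be sketched roughly as follows. This is only an illustration, not the code actually used in the test: the function name `compare_captures` and the sample IDs are made up, and it assumes both captures cover the same time window.

```python
def compare_captures(search_ids, stream_ids):
    """Compare two collections of tweet IDs captured over the same window."""
    search_ids, stream_ids = set(search_ids), set(stream_ids)
    # Tweets seen via search but never delivered by the stream
    missing_from_stream = search_ids - stream_ids
    # Tweets delivered by the stream but never seen via search
    missing_from_search = stream_ids - search_ids
    # Fraction of search results that the stream did not deliver
    loss_rate = len(missing_from_stream) / len(search_ids) if search_ids else 0.0
    return {
        "missing_from_stream": missing_from_stream,
        "missing_from_search": missing_from_search,
        "loss_rate": loss_rate,
    }


# Hypothetical IDs, for illustration only
result = compare_captures([101, 102, 103, 104], [101, 102, 104])
print(result["loss_rate"])            # 0.25
print(result["missing_from_stream"])  # {103}
```

An empty `missing_from_search` set alongside a non-empty `missing_from_stream` set corresponds to the pattern reported here: the stream delivered a strict subset of what search returned.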

I would love it if anyone else would try a similar experiment and
report back results. Maybe I'm doing something wrong, or maybe this is
an anomaly, or maybe the streaming API just doesn't capture as much --
impossible for me to say.

I note that the streaming API documentation doesn't claim an intent to
match accuracy with the search API (nor vice versa). At this point I'm
thinking to get the greatest accuracy I should be collecting tweets
from *both* APIs.

Brian Maso

M. Edward (Ed) Borasky

Jan 9, 2011, 5:43:15 PM
to twitter-deve...@googlegroups.com

Did you just recently start running these tests? Specifically, did you
run any tests / notice discrepancies *before* Twitter threw the switches
for the Gnip partnership? This might be an unintended consequence of the
"data plumbing" activities associated with Gnip.

By the way, I've seen tweets returned by search that *don't* appear to
match the search terms! Have you verified that all the tweets Search is
giving you do in fact match?

--
http://twitter.com/znmeb http://borasky-research.net

"A mathematician is a device for turning coffee into theorems." -- Paul
Erdős

Bess

Jan 9, 2011, 7:06:24 PM
to Twitter Development Talk
The Streaming API was built by Twitter, while the Search API was built
by Summize, a startup acquired by Twitter. The Search API is rate-limited.

If you just use the Twitter search feature, you may see everything.
Results that you retrieve and display through the Search API are
limited by your developer API access.

The Streaming API may not show everything b/c its content is
optimized based on its algorithm.

Brian Maso

Jan 10, 2011, 7:48:37 PM
to Twitter Development Talk
Sounds consistent with what I've been seeing. Where did you get your
impression of how the streaming API is "optimized"? I am having a hard
time finding any authoritative documentation describing what the
powers that be at Twitter *intend* to be included in the stream (as
opposed to what they actually *implemented*, which may differ from
intentions for a variety of reasons).

If what you say is true, it kind of limits the use cases of the
streaming API to a far narrower set than what one would think by
reading the Streaming API documentation. There's one section of the
documentation that attempts to describe how to implement a system that
utilizes the streaming API and avoids missing any tweets. Obviously if
the stream of tweets is already a reduced subset, then it doesn't
matter very much if you miss a few.

Brian Maso

Matt Harris

Jan 10, 2011, 8:24:45 PM
to twitter-deve...@googlegroups.com
Hey Brian,

When you use the Streaming API filter method we will stream to you all the Tweets which match your track terms - up to your allowed sample size. 

What this means is over the course of a sampling window we apply your track terms to the full firehose, and then return as many results as your sample rate allows. If you exceed your allowed sample size we will return a 'rate_limited' response containing the total number of matched Tweets missed.

When matching track terms we apply the 'track' keywords to the raw Tweet text. This is different to the Search API which applies the track terms to the raw Tweet text plus the expanded URL. (The Streaming API doesn't expand URLs because it would delay the delivery of the Tweet).
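
To illustrate the difference described above, here is a minimal sketch. It is not Twitter's matching code: `matches_track` is a hypothetical helper, the plain case-insensitive substring test is a simplification of real track matching, and the tweet text and URLs are made up.

```python
def matches_track(term, text, expanded_urls=()):
    """Case-insensitive substring match of one track term (a simplification)."""
    haystack = " ".join([text, *expanded_urls]).lower()
    return term.lower() in haystack


# Hypothetical tweet: the keyword only appears in the expanded URL
tweet_text = "Live coverage here: http://t.example/abc123"
expanded = ["http://example.com/ces11/keynote"]

# Streaming-style match: raw Tweet text only
stream_like = matches_track("ces11", tweet_text)
# Search-style match: raw Tweet text plus expanded URL
search_like = matches_track("ces11", tweet_text, expanded)

print(stream_like, search_like)  # False True
```

Under these assumptions, a Tweet whose only mention of a term sits behind a shortened link would show up in search results but not in the stream.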

The issue you are describing is not caused by sampling limits or reduced subsets, but is instead due to a retweet parsing issue our engineers are looking into. What appears to be happening is the Streaming API is trying to match against the truncated RT version of the Tweet instead of the original Tweet text.
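
A rough sketch of how matching against a truncated retweet could drop a Tweet. The 140-character limit, the "RT @user: " prefix, and the "..." truncation marker are assumptions made for illustration; the exact way retweets are truncated is not stated in this thread.

```python
LIMIT = 140  # assumed maximum Tweet length


def truncated_retweet(user, original_text, limit=LIMIT):
    """Build an old-style retweet and cut it to the limit (assumed format)."""
    rt = f"RT @{user}: {original_text}"
    return rt if len(rt) <= limit else rt[: limit - 3] + "..."


# Hypothetical original Tweet with the hashtag at the very end (134 chars)
original = "word " * 26 + "#nfl"
rt = truncated_retweet("someone", original)

print("#nfl" in original)  # True
print("#nfl" in rt)        # False: the hashtag was cut off by truncation
```

If the track term is applied to `rt` rather than `original`, this retweet would not match even though the original Tweet does.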

If you file this in our issue tracker we can let you know when the issue is resolved. The issue tracker can be found here:
   http://code.google.com/p/twitter-api/issues/list

Best,

@themattharris
Developer Advocate, Twitter
http://twitter.com/themattharris


--
Twitter developer documentation and resources: http://dev.twitter.com/doc
API updates via Twitter: http://twitter.com/twitterapi
Issues/Enhancements Tracker: http://code.google.com/p/twitter-api/issues/list
Change your membership to this group: http://groups.google.com/group/twitter-development-talk

Brian Maso

Jan 12, 2011, 12:38:41 AM
to Twitter Development Talk
Hi Matt,

Thanks for the explanation. I will file the bug report. I'd like to
hear more about the sample size. I've read through the Streaming API
docs a lot, and I haven't come across anything specific about the rate
limits. Where can I read more?

Brian Maso


Matt Harris

Jan 18, 2011, 8:41:57 PM
to twitter-deve...@googlegroups.com
Hi Brian,

The rate limit for the Streaming API restricts the amount of data you can receive, which is defined by the level of access granted to you. When using the Streaming API filter method we will apply your track/follow/geo parameters (predicates) in equal weight to the Tweets captured during that sampling interval. We will then stream all the Tweets that your allowed sample rate permits. If more Tweets match your predicates than your sample rate allows we will stream a limit message to you indicating the number of Tweets you didn't receive.
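
A client can watch for these limit messages to tell rate limiting apart from other causes of missing Tweets. The sketch below is illustrative only: it assumes the stream is newline-delimited JSON, that a limit message looks like `{"limit": {"track": N}}`, and that ordinary Tweets carry an `id` field. `tally_stream` and the sample lines are made up.

```python
import json


def tally_stream(lines):
    """Count delivered Tweets and collect counts reported in limit messages."""
    delivered = 0
    limit_notices = []
    for line in lines:
        line = line.strip()
        if not line:  # skip blank keep-alive lines
            continue
        msg = json.loads(line)
        if "limit" in msg:
            # Assumed shape: {"limit": {"track": <number of Tweets not delivered>}}
            limit_notices.append(msg["limit"].get("track", 0))
        elif "id" in msg:
            delivered += 1
    return delivered, limit_notices


# Hypothetical stream contents
sample = [
    '{"id": 1, "text": "a #ces"}',
    "",
    '{"limit": {"track": 5}}',
    '{"id": 2, "text": "b #nfl"}',
]
delivered, notices = tally_stream(sample)
print(delivered, notices)  # 2 [5]
```

If `notices` stays empty while Tweets are still missing compared with search, the gap is unlikely to come from the sample rate, which matches the retweet parsing explanation given earlier in the thread.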

How we sample, and how the limits are applied to the filter method is described in our Streaming API concepts document:

Best,
@themattharris
Developer Advocate, Twitter
http://twitter.com/themattharris

