Streaming API vs. Search API: no API returns >95% of intended tweets

Karussell

Feb 15, 2011, 6:36:13 AM
to Twitter Development Talk
Hi,

This problem was already posted to the twitter4j mailing list [1]. I'm not
sure whether it is an issue with my code, with twitter4j, or with the API
itself... Other users have reported similar problems in the past [2].

First:

I'm doing a 100-tweet search (without paging) every 5 minutes, e.g.
for 'twitter search'. This gives me a set of tweets A (excluding
duplicates, of course). I get approx. 5 new tweets every 5 minutes,
so a pageSize of 100 should be more than sufficient to get all
tweets.

Second:
When I'm doing a streaming filter request for the same terms 'twitter
search' then I'm getting a set of tweets B.

The problem is: combining A and B (C = A ∪ B) gives me a set C whose
size is more than 10% larger than A or B, which means that with
neither the search nor the streaming API can I catch a nearly complete
set of tweets.

E.g. doing this for 3 hours I'm getting 254 tweets (A) for the search
and 257 tweets (B) for the streaming but the combined set C has 337
tweets!
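
To put numbers on this, the overlap and the share of the combined set that each API delivered can be derived from the three counts alone. The following is a minimal sketch in Python; the function name `coverage` and its keys are made up for illustration and are not part of twitter4j or any Twitter API.

```python
def coverage(count_a: int, count_b: int, count_union: int) -> dict:
    """Derive overlap and per-source coverage from three set sizes."""
    # Inclusion-exclusion: |A ∩ B| = |A| + |B| - |A ∪ B|
    overlap = count_a + count_b - count_union
    return {
        "overlap": overlap,
        "only_a": count_a - overlap,   # found by search only
        "only_b": count_b - overlap,   # found by streaming only
        "coverage_a": count_a / count_union,
        "coverage_b": count_b / count_union,
    }


# Counts from the 3-hour run: search (A), streaming (B), combined (C)
stats = coverage(254, 257, 337)
print(stats["overlap"], stats["only_a"], stats["only_b"])
print(f"{stats['coverage_a']:.1%} {stats['coverage_b']:.1%}")
```

For the run above this gives an overlap of 174 tweets, 80 seen only by search and 83 seen only by streaming, so each API delivered roughly 75% to 76% of the combined set.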

Is this a bug in my code or could this be an API issue?

BTW: I don't assume 100% correctness, I only want something above
90% :) especially for such relatively infrequent terms, where users
can, should and have noticed it.

Regards,
Peter.

[1]
http://groups.google.com/group/twitter4j/msg/d959e6257ceb452f

[2]
http://groups.google.com/group/twitter-development-talk/browse_thread/thread/71ab5cc666113c9e

http://blog.tweetsmarter.com/twitter-downtime/twitters-dirty-secret-they-dont-show-you-all-tweets/

--

http://jetwick.com Twitter Search without Noise

Adam Green

Feb 15, 2011, 10:13:38 AM
to twitter-deve...@googlegroups.com
I don't get that big a discrepancy, but I do get different results
from search and streaming. I use streaming for real-time delivery, and
then either search or user timelines to backfill missing tweets. As
long as the flow makes this possible within rate limits this gets me
the greatest number of results, but still not 100%. I accept that 100%
ain't gonna happen. You should get within your desired 95% though.
That is a realistic goal.
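
The stream-then-backfill approach boils down to merging two result lists on the tweet id. A rough sketch in Python, assuming each tweet is a dict with an integer "id" key; the helper name `merge_by_id` is hypothetical, and letting the streamed copy win on conflict is an arbitrary choice for this sketch.

```python
def merge_by_id(streamed, backfilled):
    """Combine two lists of tweet dicts, keeping one record per tweet id."""
    merged = {t["id"]: t for t in backfilled}
    # On conflict the streamed copy replaces the backfilled one
    # (arbitrary choice; swap the two steps to prefer the backfill).
    merged.update({t["id"]: t for t in streamed})
    # Sorting by id gives a stable order; ids are assumed to be
    # roughly increasing over time.
    return [merged[i] for i in sorted(merged)]


streamed = [{"id": 3, "text": "c"}, {"id": 1, "text": "a"}]
backfilled = [{"id": 2, "text": "b"}, {"id": 3, "text": "c (search copy)"}]
timeline = merge_by_id(streamed, backfilled)
print([t["id"] for t in timeline])
```

Tweet 2 here is the one the stream missed and the backfill recovered.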

--
Adam Green
Twitter API Consultant and Trainer
http://140dev.com
@140dev

John Kalucki

Feb 15, 2011, 12:30:55 PM
to twitter-deve...@googlegroups.com
If you examine set C, does it contain matches on fields other than the Tweet text? To increase recall, search sometimes includes keywords in followed links and other techniques.

Also, are you getting rate limit messages on the Streaming API?

-John Kalucki
Twitter, Inc.


Karussell

Feb 15, 2011, 2:33:44 PM
to Twitter Development Talk
Hi John, hi Adam,

thanks for your responses.

> To increase recall, search sometimes includes keywords in followed links and other techniques

Ah, ok. This would explain the differences between C and B (but not
between C and A). I'll investigate ...

> Also, are you getting rate limit messages on the Streaming API?

No.
I saw track limits (or something similar) when my keyword was 'java' or a
similarly high-frequency term.

Regards,
Peter.

Colin Surprenant

Feb 15, 2011, 3:18:18 PM
to Twitter Development Talk
First, your test set is a bit small. Did you take into account the
extra data you will get in your first search API poll? Typically your
first poll will return 100 items, then subsequent polls will return
only "new" data if you are using since_id and/or deduping.

Make sure both your poller and stream reader start at the same item. A
trick, if you want to grab results that are as similar as possible from
the start, is to request only a single item on the first poll (using
rpp=1), or to use only the most recent item of your result, then use this
item to seed your since_id on the following polls. Another idea might
be to start your stream reader first and use the first item returned
by your reader to seed the since_id in your poller. You can also
simply ignore this during collection, but clean up your data once
you're done collecting and make sure both data sets start and end with
the same item ID.
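
The seeding idea can be sketched roughly as follows in Python. Everything here is a stand-in: `poll_new` is a hypothetical helper, and `fetch(since_id)` represents whatever actually issues the search request (it is not a real client call). The sketch only shows how the seed id and the dedup set interact.

```python
def poll_new(fetch, since_id, seen):
    """Run one poll; return (new_tweets, updated_since_id).

    fetch(since_id) is a hypothetical callable standing in for the real
    search request; it returns tweet dicts with an integer "id".
    """
    fresh = [t for t in fetch(since_id)
             if t["id"] > since_id and t["id"] not in seen]
    seen.update(t["id"] for t in fresh)
    # Keep the old since_id when the poll returned nothing new.
    newest = max((t["id"] for t in fresh), default=since_id)
    return fresh, newest


# Fake data source, for demonstration only.
FAKE_TIMELINE = [{"id": i} for i in (10, 11, 12, 13)]


def fake_fetch(since_id):
    # Deliberately returns the seed item again to exercise the filter.
    return [t for t in FAKE_TIMELINE if t["id"] >= since_id]


seen = set()
seed_id = 11  # id of the single item from the first poll (or first streamed item)
first, since_id = poll_new(fake_fetch, seed_id, seen)
second, since_id = poll_new(fake_fetch, since_id, seen)
print([t["id"] for t in first], [t["id"] for t in second], since_id)
```

With the seed at 11, the first poll yields 12 and 13, and the second poll yields nothing, so the initial batch of older items never enters the data set.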

In any case, if your difference is in fact related to the handling of
the first poll, it will become marginal as your data grow.

I also ran some tests to compare results between both methods using a
single keyword. With result sets of about 15000 IDs, the two sets are
98.3% identical. For testing purposes both my poller and stream
reader only output IDs so I can use cat, sort, uniq, wc and diff to
compare results.
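
For anyone who prefers a script over cat/sort/uniq/diff, here is a rough Python equivalent. It is only a sketch: `compare_ids` is a made-up helper, trimming to the overlapping ID range is one way to make both sets start and end at comparable items, and "agreement" is computed as shared IDs divided by all IDs, which is just one possible definition of "identical at x%".

```python
def compare_ids(search_ids, stream_ids):
    """Trim both ID collections to their common ID range and compare them."""
    a, b = set(search_ids), set(stream_ids)
    if not a or not b:
        raise ValueError("both ID collections must be non-empty")
    # Common window: from the later start to the earlier end.
    lo = max(min(a), min(b))
    hi = min(max(a), max(b))
    a = {i for i in a if lo <= i <= hi}
    b = {i for i in b if lo <= i <= hi}
    union = a | b
    return {
        "only_search": sorted(a - b),
        "only_stream": sorted(b - a),
        "agreement": len(a & b) / len(union) if union else 0.0,
    }


result = compare_ids([1, 2, 3, 4, 5, 6], [3, 4, 6, 7, 8])
print(result)
```

In this toy input only IDs 3 to 6 fall inside the common window, so IDs 1, 2, 7 and 8 are ignored rather than counted as discrepancies.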

Colin

Karussell

Feb 15, 2011, 3:50:06 PM
to Twitter Development Talk
Hi Colin, hi John,

> To increase recall, search sometimes includes keywords in followed links and other techniques.

This is indeed the case, and 'twitter search' appears in a lot of URLs, à la:

http://search.twitter.com/search?q=jetwick

That is where the big differences came from. Can I turn off this
'feature'? Search shouldn't take the URL into account, although the title
of the web site should be taken into account ... like it is done in
jetwick ;)
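
Until there is a switch for this, URL-only matches can be filtered out on the client side. The following Python sketch is an assumption-laden illustration: `matches_outside_urls` is a hypothetical helper, the regex is a simple approximation of a URL, and the check is a plain substring test rather than whatever tokenization Twitter actually uses.

```python
import re

# Rough URL pattern: scheme followed by non-whitespace characters.
URL_RE = re.compile(r"https?://\S+")


def matches_outside_urls(text, terms):
    """True if every term occurs in the tweet text once URLs are removed."""
    stripped = URL_RE.sub(" ", text).lower()
    return all(term.lower() in stripped for term in terms)


terms = ["twitter", "search"]
in_text = "Trying the new twitter search today"
in_url = "have a look: http://search.twitter.com/search?q=jetwick"
print(matches_outside_urls(in_text, terms), matches_outside_urls(in_url, terms))
```

Note that this only sees URLs as they appear in the tweet text. It cannot detect a match that comes from the target of a shortened link.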

I'll investigate for other keywords now.

> Typically your first poll will return 100 items then subsequent polls
> will return only "new" data if using since_id and/or deduping.

I already removed these early tweets, of course ...

> I also ran some tests

Which keywords did you run the tests with?

> For testing purposes both my poller and stream reader only output IDs
> so I can use cat, sort, uniq, wc and diff to compare results.

Yes, I went the same way :)

Regards,
Peter.

John Kalucki

Feb 16, 2011, 12:01:07 AM
to twitter-deve...@googlegroups.com
On every occasion where I've tested the Firehose and track terms from the Streaming API against the Tweet database and against each other, there is no loss -- all the sources match exactly. Unless there's some unusual operational instability, the Streaming API returns 100% of the tweets requested, or sends a limit message to let you know what has been dropped.
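
A consumer can check for such limit messages while reading the stream. The Python sketch below assumes the notice arrives as a JSON object of the form {"limit": {"track": N}} interleaved with the tweets, and that N is a running count of undelivered tweets; both the shape and the helper name `split_stream` should be treated as assumptions to verify against the Streaming API documentation.

```python
import json


def split_stream(lines):
    """Separate tweets from limit notices in raw stream lines."""
    tweets, undelivered = [], 0
    for line in lines:
        line = line.strip()
        if not line:  # blank keep-alive lines carry no data
            continue
        msg = json.loads(line)
        if "limit" in msg:
            # Assumed to be a running total, so keep the largest value seen.
            undelivered = max(undelivered, msg["limit"].get("track", 0))
        elif "id" in msg:
            tweets.append(msg)
    return tweets, undelivered


# Hand-written sample lines, not real stream output.
sample = [
    '{"id": 1, "text": "java"}',
    "",
    '{"limit": {"track": 12}}',
    '{"id": 2, "text": "more java"}',
    '{"limit": {"track": 30}}',
]
tweets, undelivered = split_stream(sample)
print(len(tweets), undelivered)
```

If `undelivered` stays at 0 for the whole run, missing tweets cannot be blamed on track limiting.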

-John Kalucki
Twitter Inc.

M. Edward (Ed) Borasky

Feb 16, 2011, 12:04:40 AM
to twitter-deve...@googlegroups.com
On Tue, 15 Feb 2011 21:01:07 -0800, John Kalucki <jo...@twitter.com>
wrote:

> On every occasion where I've tested the Firehose and track terms from
> the Streaming API against the Tweet database and against each other,
> there is no loss -- all the sources match exactly. Unless there's some
> unusual operational instability, the Streaming API returns 100% of the
> tweets requested, or sends a limit message to let you know what has
> been dropped.

"What" has been dropped, or "how many" have been dropped? ;-)

--
http://twitter.com/znmeb http://borasky-research.net

"A mathematician is a device for turning coffee into theorems." -- Paul
Erdős

Karussell

Feb 16, 2011, 12:13:23 PM
to Twitter Development Talk
Hi John,

Well, for the search term 'java' the async (streaming) API is 'ok', and
the 'only in search' differences are easily explained: the keywords are
in the URL.
But the 'only in async' differences (tweets grabbed only via the
streaming API) are strange to me:

https://github.com/karussell/TestTwitterAPI/blob/master/discrepancy.txt

Why are they lost?

You can build the mini Java program via (or via your favourite IDE):
mvn clean install

and call it via:
./myjava -Dtwitter4j.oauth.consumerKey=key \
  -Dtwitter4j.oauth.consumerSecret=value \
  de.jetwick.tw.NewClass "java" token tokenSecret

to see what I mean ...

Another strange fact is that a lot of long tweets retrieved via the
streaming API have a text which is ~15 characters shorter than the
identical tweet from the search API!

Regards,
Peter.

Matt Harris

Feb 16, 2011, 3:10:20 PM
to twitter-deve...@googlegroups.com, Karussell
Hi Peter,

Can you explain what you mean when you say the differences 'only in async' are strange to you? Examples of specific Tweets and the predicates you used would be particularly helpful.
Also, can you give an example of what you mean by a long Tweet?

Thanks,
@themattharris
Developer Advocate, Twitter
http://twitter.com/themattharris

Karussell

Feb 17, 2011, 6:54:48 AM
to Twitter Development Talk
Hi Matt,

by 'only in async' I mean tweets which were retrieved only via the
streaming API (asynchronous retrieval) but not via the search API.
Examples** with ids are given in the file:
https://github.com/karussell/TestTwitterAPI/blob/master/discrepancy.txt

Do you know what I mean? Why are those tweets missing when I use the
search API?

Regarding the long tweet: it seems to me that this minor issue only
appears for retweets. E.g. the following tweet is from the streaming
API (and shortened):

RT @bcoders: Episode 33 onsite from @JFokus with @neal4d @nicksieger
@brjavaman & Kirk Pepperdine is out! http://bit.ly/eikmux "is
#Java ...

and this tweet is from the search API:

RT @bcoders: Episode 33 onsite from @JFokus with @neal4d @nicksieger
@brjavaman & Kirk Pepperdine is out! http://bit.ly/eikmux "is #Java a
dead-end?"

Kind Regards,
Peter.

**
Wed Feb 16 17:19:50 CET 2011 INFO:### only in async ###
37847925648261120 Quiero el iPad que los de @atSistemas sortean en la
conferencia Spring I/O http://bit.ly/g7jlTP #springio #java #groovy
37848944344039424 a linguagem java é uma liguagem que usa com
paradigzzzzzzzZZZZzzZZ
37849380283355136 Java, JavaScripy, C++ and UnrealScript bring it on!
37849380727816193 Trying to get into Delphi, not as exciting as Java,
but maybe I can find some open course-ware with an eccentric lecturer
to make it better.
37849615566905344 スクリプト言語やめてPure Java化.Java連携が楽になると嬉しい / JavaFX
Roadmap http://t.co/1GcXC14
37849637519884288 กลัวโค้ด ๋๋๋Java >.<

Karussell

Feb 17, 2011, 6:56:31 AM
to Twitter Development Talk
Sorry, once more:

With 'only in async' I meant tweets which were only retrieved via the
streaming API but not via search API

Karussell

Feb 17, 2011, 3:15:49 AM
to Twitter Development Talk
Hi Matt,

sorry for being unspecific. By 'only in async' I meant tweets which
were only found by the streaming API ('asynchronous retrieval') but
were not in the search results **

Why are they missing when using search API?

> Also can you give an example of what you mean by a long Tweet.

I investigated this a bit more and it seems to be intended (?):
only retweets are affected. As an example, here is one truncated
tweet returned from the streaming API:

RT @bcoders: Episode 33 onsite from @JFokus with @neal4d @nicksieger
@brjavaman & Kirk Pepperdine is out! http://bit.ly/eikmux "is
#Java ...

and the same tweet (id == 37959896615886848) was more complete when
returned from the search API:

RT @bcoders: Episode 33 onsite from @JFokus with @neal4d @nicksieger
@brjavaman & Kirk Pepperdine is out! http://bit.ly/eikmux "is #Java a
dead-end?"

So, when I use search API I'll miss tweets and when using streaming
API I'll miss text? Do I need to use both?
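
One possible way around using both, sketched in Python: if the streamed status is a native retweet, and if its payload embeds the original tweet under a "retweeted_status" key with the untruncated text, then the full text can be rebuilt on the client. Both conditions are assumptions to check against the actual stream payload. `full_text` is a hypothetical helper and the sample dict is hand-built from the example above, not captured output.

```python
def full_text(status):
    """Return the untruncated text of a status dict where possible."""
    original = status.get("retweeted_status")
    if original:
        # Rebuild the classic "RT @user: text" form from the embedded original.
        return "RT @{}: {}".format(original["user"]["screen_name"],
                                   original["text"])
    return status["text"]


# Hand-built sample modelled on the tweet quoted above (assumed structure).
sample = {
    "text": "RT @bcoders: Episode 33 onsite from @JFokus with @neal4d "
            "@nicksieger @brjavaman & Kirk Pepperdine is out! "
            "http://bit.ly/eikmux \"is #Java ...",
    "retweeted_status": {
        "user": {"screen_name": "bcoders"},
        "text": "Episode 33 onsite from @JFokus with @neal4d @nicksieger "
                "@brjavaman & Kirk Pepperdine is out! "
                "http://bit.ly/eikmux \"is #Java a dead-end?\"",
    },
}
print(full_text(sample))
```

Statuses without the embedded original are returned unchanged, so the helper is safe to apply to every streamed tweet.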

Regards,
Peter.

**
37952879822110720 Architecte Java J2EE: Priorité sera donnée à un
candidat de la région nantaise. Merci de tran... http://bit.ly/dQhIoK
#freelance #offres
37954149668622336 به روز رسانی: Nimbuzz اکنون با پشتیبانی از اتصال
رسمی API فیس بوک http://t.co/ICgTAXX
37954912847400960 『Java Hangs When Converting 2.2250738585072012e-308』
http://zennin.blog55.fc2.com/blog-entry-2773.html
37956641609621504 Mastering Grails: Grails in the enterprise
https://www.ibm.com/developerworks/java/library/j-grails12168/ #grails
37956994061176832 NEW! FileNet - Java/J2EE Developer - Vigilant
Technologies: ( #Columbus , OH) http://bit.ly/e6ULEw #OpenSource
#Jobs #Job #TweetMyJOBS
37957325557989376 After a day of Java programming in Eclipse, C++
programming in Visual Studio just feels slow and crappy :(

more examples in the given file:
https://github.com/karussell/TestTwitterAPI/blob/master/discrepancy.txt

John Kalucki

Feb 18, 2011, 1:08:11 PM
to twitter-deve...@googlegroups.com
http://dev.twitter.com/pages/streaming_api_concepts#result-quality

Search filters for relevance and is not intended as a source of all tweets. Streaming provides the complete record to allow you to perform whatever post-processing you'd like.

-John Kalucki
Twitter, Inc.

Karussell

Feb 20, 2011, 4:59:14 AM
to Twitter Development Talk
Thanks for the update John!
