Woe is me, I can't seek what I find (or Search is failing me)


@IDisposable

Oct 7, 2010, 5:10:58 PM
to Twitter Development Talk
Over the last couple of months, we've seen some weird behavior in the
responses to search queries. First, I understand the rules about
search being non-covering, and that we are at the mercy of the index.
That said, I've noticed some odd behavior lately. As background
material, we run many searches (and we're white-listed by IP and OAuth
account), but the two I want to reference are the Mentions and the
Location searches.

The Mentions search seems pretty stable and uses this typical search
(and then we exclude a bunch of things like Bay St. Louis, etc.):
http://search.twitter.com/search.atom?rpp=100&q=stl+OR+%23stl+OR+stlouis+OR+%22St+Louis%22+OR+%22St.+Louis%22+OR+%22Saint+Louis%22+OR+SaintLouis&since_id=26682507745

The Location search has been VERY unstable, and uses this typical
search:
http://search.twitter.com/search.atom?rpp=100&geocode=38.627522%2C-90.19841%2C30mi&since_id=26679538876
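
For reference, both URLs above can be rebuilt from plain parameters. This is a minimal Python sketch using only the standard library; `build_search_url` and `SEARCH_ENDPOINT` are hypothetical names for illustration, not part of any Twitter client:

```python
from urllib.parse import urlencode

SEARCH_ENDPOINT = "http://search.twitter.com/search.atom"


def build_search_url(since_id, **params):
    """Build a search URL; since_id is appended last, as in the URLs above."""
    query = dict(params)
    query["since_id"] = since_id
    # urlencode percent-escapes commas, quotes and '#', and turns spaces into '+'.
    return SEARCH_ENDPOINT + "?" + urlencode(query)


mentions_url = build_search_url(
    26682507745,
    rpp=100,
    q='stl OR #stl OR stlouis OR "St Louis" OR "St. Louis" OR "Saint Louis" OR SaintLouis',
)
location_url = build_search_url(
    26679538876,
    rpp=100,
    geocode="38.627522,-90.19841,30mi",
)
```

Keeping the query as structured parameters makes it easier to vary one piece (for example the radius) without hand-editing escaped strings.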

As the day progresses, we move up the high-water mark in the since_id
to track what we've already received so we should be getting minimal
gaps. We almost never see two 100-entry polls in a row, so I think
we're keeping up with whatever coverage the search index is offering.
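
The high-water-mark bookkeeping described above can be sketched roughly as follows. This is an illustrative Python sketch, not the actual stltweets code; the function names and the "two full polls in a row" check are assumptions based on the description:

```python
def next_since_id(current_since_id, tweet_ids):
    """Advance the high-water mark to the largest tweet id seen so far."""
    return max([current_since_id, *tweet_ids])


def falling_behind(recent_poll_sizes, rpp=100):
    """True if any two consecutive polls came back full (rpp entries each).

    Two full polls in a row suggest the poller is not keeping up with
    whatever the search index is offering.
    """
    return any(
        a >= rpp and b >= rpp
        for a, b in zip(recent_poll_sizes, recent_poll_sizes[1:])
    )
```

An empty poll leaves the mark unchanged, so a quiet period never moves since_id backwards.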

I've posted in a Google Spreadsheet a graph of the tweet counts we're
seeing since 7/1/2010 so you can see the trends http://bit.ly/9wnnFM
(sheet two is the graph). Some interesting things to note:

1) The Mentions search is very consistent.
2) The Location search likes to bounce around a bit.
3) In mid-August, we started to have issues with more 403s and errors
about since_id being too old. We were also getting rate-limited in our
calls to get the tweep details (since the ATOM feed is so meager). Due
to a bug, I wasn't committing all the tweets when this happened.
4) On or about Sept 1st, you guys did something that broke our ability
to stay caught up... we started getting almost no tweets and lots of
errors about since_id being too old. I thought this was due to your
"new tweet id" assignment being rolled out.
5) On Sept 5th, I got back from vacation and added logic to understand
and use the "no new tweets, roll the tweet id forward to this" hint,
driven by parsing the <link rel="refresh"> node in the ATOM feed.
6) I also, around this time, added better logic to the tweep-lookup
detail, only asking you for tweeps I don't have at least a minimal row
on. This reduced the number of rate-limiting issues.
7) We were very stable until 9/23, when volume fell off a lot and
never really recovered. I think this is the "new search" engine
rollout.
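
The refresh-link handling from item 5 can be sketched like this. It is a minimal Python sketch that assumes the feed uses the standard Atom namespace and that the refresh link's href carries a since_id query parameter; the sample feed is made up for illustration and `refresh_since_id` is a hypothetical name:

```python
import xml.etree.ElementTree as ET
from urllib.parse import parse_qs, urlparse

ATOM_NS = {"atom": "http://www.w3.org/2005/Atom"}


def refresh_since_id(atom_xml):
    """Return the since_id from the feed's <link rel="refresh">, or None."""
    root = ET.fromstring(atom_xml)
    for link in root.findall("atom:link", ATOM_NS):
        if link.get("rel") == "refresh":
            query = parse_qs(urlparse(link.get("href", "")).query)
            values = query.get("since_id")
            if values:
                return int(values[0])
    return None


# Hypothetical, heavily trimmed feed used only to exercise the function.
SAMPLE_FEED = """<feed xmlns="http://www.w3.org/2005/Atom">
  <link rel="refresh" type="application/atom+xml"
        href="http://search.twitter.com/search.atom?geocode=38.627522%2C-90.19841%2C30mi&amp;since_id=26679538876"/>
</feed>"""

new_mark = refresh_since_id(SAMPLE_FEED)
```

When a poll returns no entries, the value from the refresh link can replace the stored high-water mark so the next request does not trip the "since_id too old" error.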

To research a little more, I tried the Twitter advanced search page.
Asking for the RSS (Atom, really) feed from that page now gives me
this URL:
http://search.twitter.com/search.atom?geocode=38.627522,-90.19841,30.0mi&lang=en&q=+near:38.627522,-90.19841+within:30mi

Which starts off like ours, but adds the (seemingly redundant)
human-readable search criteria "&q=+near:38.627522,-90.19841+within:30mi".

Oddly, if we remove that and do the same search at nearly the same
instant, I DO get vastly different tweet sets... probably due to
volume, possibly just sorting, but I would hope that with the same
since_id value, I would get the same tweets... but I don't.
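
One way to put numbers on the difference between the two queries is to compare the tweet ids each one returns. This is a minimal Python sketch; `compare_result_sets` is a hypothetical helper and the ids in the usage line are invented:

```python
def compare_result_sets(ids_a, ids_b):
    """Split two lists of tweet ids into shared and query-specific ids."""
    a, b = set(ids_a), set(ids_b)
    return {
        "both": a & b,      # returned by both queries
        "only_a": a - b,    # returned only by the first query
        "only_b": b - a,    # returned only by the second query
    }


# Invented ids, purely to show the shape of the result.
diff = compare_result_sets([101, 102, 103], [102, 103, 104, 105])
```

If "both" is nearly empty for two polls made at the same instant with the same since_id, the difference is more than a sorting artifact.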

So, I'm asking... what's going on?
Why are we seeing so much volume fall-off?
What can we do about it?
Should I be running both searches (my current one and one with the
human-readable query) to get better coverage?
Is there any hope/expectation of the volume returning to normal?
Doesn't anyone else care about tweep-location searches?

Now, before you tell me that I should be using Site Streams (which I
want to do), realize that I _NEED_ tweets from people whose profile
location says they are in St. Louis (and similar) like the old Summize
search honored. I can't just get by with the _tweet_ location being
STL.

Marc Brooks
Chief guy getting yelled at,
http://stltweets.com
http://taste.stltweets.com
http://loufest.stltweets.com

Mack D. Male

Oct 7, 2010, 9:26:37 PM
to Twitter Development Talk
I'm seeing this problem too, but it only started today, around five
hours ago. Here's an example search: http://search.twitter.com/search?q=near%3Aedmonton

That's returning a fraction of the tweets it was before. This problem
happens occasionally, but not usually for this long.

@IDisposable

Oct 11, 2010, 1:21:55 PM
to Twitter Development Talk
> The Location search has been VERY unstable, and uses this typical
> search: http://search.twitter.com/search.atom?rpp=100&geocode=38.627522%2C-90...

It's getting worse all the time!

Is this what we can expect going forward? If so, how can I follow all
200000+ people we used to get tweets from on the location search?
I'll happily create an account and manage the lists/follows... but I'm
pretty sure that will get me killed, and it will only be a snapshot
based on current profile location strings that we have...

Sure, I could suck the *-pipe, but without filter criteria, I'm
going to be seeing all tweets from the entire universe, which seems
hella-wasteful to twitter and me...

Day         Mentions  Location  Everything
2010-09-13      4985     46801       53503
2010-09-14      4719     48110       54589
2010-09-15      4779     47599       54209
2010-09-16      5143     47087       54312
2010-09-17      5256     48363       55581
2010-09-18      4888     40943       47237
2010-09-19      5871     46008       53843
2010-09-20      4990     46219       52826
2010-09-21      4444     49274       55933
2010-09-22      5364     51567       58999
2010-09-23      6866     42495       52967
2010-09-24      6191     41107       50679
2010-09-25      5673     36321       43950
2010-09-26      6784     35168       44664
2010-09-27      6346     32580       42192
2010-09-28      5448     32528       41792
2010-09-29      6038     40677       50472
2010-09-30      5964     38116       47713
2010-10-01      6615     38360       48302
2010-10-02      5612     23107       32024
2010-10-03      6728     22802       33328
2010-10-04      5528     23990       33491
2010-10-05      5116     38733       47023
2010-10-06      5427     39041       47856
2010-10-07      5733     30855       40742
2010-10-08      6355      9459       22235
2010-10-09      5894      8352       18691
2010-10-10      7240      8399       20861
2010-10-11      4017      5587       13010
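
To put a number on the fall-off, the relative change between two days can be computed directly from the figures above. This is a minimal Python sketch; the three Location counts are copied from the table, and the choice of 2010-09-22 (the highest day listed) and 2010-10-10 (the last full day listed) as endpoints is an assumption made for illustration:

```python
# Location counts copied from the table above.
location_counts = {
    "2010-09-22": 51567,
    "2010-10-01": 38360,
    "2010-10-10": 8399,
}


def percent_change(before, after):
    """Relative change from `before` to `after`, in percent."""
    return (after - before) / before * 100.0


drop = percent_change(location_counts["2010-09-22"], location_counts["2010-10-10"])
print(f"Location volume change, 09-22 to 10-10: {drop:.1f}%")
```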

themattharris

Oct 11, 2010, 3:43:20 PM
to Twitter Development Talk
Thanks for publishing this information. There is another thread
discussing the issue with the Geocode search not respecting the radius
of a search here:
http://groups.google.com/group/twitter-development-talk/browse_thread/thread/a80db3eff77a88fe

From that thread, ticket 1930 was filed on our issue tracker, which we
will update when a fix is deployed:
http://code.google.com/p/twitter-api/issues/detail?id=1930

I understand your reasons for the location tracking using the Search
API, but wondered if you knew that the mentions search you are doing
can be carried out using the Streaming API filter method. That
should cut down on the number of REST queries you need to make. More
information on that method is here:
http://dev.twitter.com/pages/streaming_api_methods#statuses-filter

Out of curiosity, what does the third column of your figures represent?
It may be possible to track that one using the Streaming API as well.

Best
@themattharris

@IDisposable

Oct 11, 2010, 6:04:36 PM
to Twitter Development Talk
> From that thread ticket 1930 was filed on our issue tracker which we
> will update when a fix is deployed:
>    http://code.google.com/p/twitter-api/issues/detail?id=1930

Excellent, I hope it gets fixed while there is still time to back-fill
some of this data... otherwise we're going to have a silly-looking hole
in the next State of Twitter in St. Louis report :)

> I understand your reasons for the location tracking using the Search
> API, but wondered if you knew that the mentions search you are doing
> can be carried out using the Streaming API filter method. That
> should cut down on the number of REST queries you need to make. More
> information on that method is here:
>    http://dev.twitter.com/pages/streaming_api_methods#statuses-filter

Yes, I really need to switch to streaming for that... I just haven't
had the bandwidth as of yet. We are using a Search (née Summize)
based infrastructure from a long while back, and with me being the
"one guy in the room", I've not had a chance to really skim through
and update our stuff for streaming.

> Out of curiosity, what does the third column of your figures represent?
> It may be possible to track that one using the Streaming API as well.

We do about 68 searches (mostly hashtags, plus a couple of keyword or
user searches for legacy/coverage guarantees) and 64 timeline follows
(mostly lists, one home timeline). Each of these sources applies a
"label" based on the source of incoming data (which search/timeline)
for our various categories (see http://stltweets.com and click the
category menus, e.g. Blues). For ALL of these searches, we also apply
a top-level category (e.g. Sports), and finally ALL of the tweets get a
label of "Everything" for ease of separating various sub-sites. Thus,
the "Everything" column in my numbers is the overall volume of tweets
from all sources.
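
The labeling scheme above can be sketched as follows. This is an illustrative Python sketch, not the real stltweets implementation; the function names are hypothetical, and "Cardinals" is an invented second source label (only "Blues", "Sports" and "Everything" come from the description):

```python
from collections import Counter


def labels_for(source_label, top_level_category):
    """Labels applied to one tweet: its source, its category, and Everything."""
    return [source_label, top_level_category, "Everything"]


def count_by_label(tweets):
    """Tally tweets per label; `tweets` is a list of (source, category) pairs."""
    counts = Counter()
    for source_label, top_level_category in tweets:
        counts.update(labels_for(source_label, top_level_category))
    return counts


# Two tweets from different sources that share the top-level category.
totals = count_by_label([("Blues", "Sports"), ("Cardinals", "Sports")])
```

Because every tweet carries the "Everything" label, that count is simply the total number of tweets across all sources.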

So, am I to assume that the geocode search, once the bug is fixed, will
go back to returning the tweets from people whose _profile location_
reads something like "near St. Louis", as before?

Thanks,
Marc

Nick

Oct 12, 2010, 9:47:20 PM
to Twitter Development Talk
Hi,

I too have looked at the Streaming API for our use, but its
restriction to single keywords has stopped us from implementing it.

We are also seeing the problem discussed in the other thread related
to this issue, with almost zero data for some of our geolocated search
terms.

Nick
