Over the last couple of months, we've seen some weird behavior in the
responses to search queries. First, I understand the rules about
search being non-covering, and that we are at the mercy of the index.
That said, I've noticed some odd behavior lately. As background
material, we run many searches (and we're white-listed by IP and OAuth
account), but the two I want to reference are the Mentions and the
Location searches.
The Mentions search seems pretty stable and uses this typical search
(and then we exclude a bunch of things like Bay St. Louis, etc.):
http://search.twitter.com/search.atom?rpp=100&q=stl+OR+%23stl+OR+stlouis+OR+%22St+Louis%22+OR+%22St.+Louis%22+OR+%22Saint+Louis%22+OR+SaintLouis&since_id=26682507745
The Location search has been VERY unstable, and uses this typical
search:
http://search.twitter.com/search.atom?rpp=100&geocode=38.627522%2C-90.19841%2C30mi&since_id=26679538876
As the day progresses, we move up the high-water mark in the since_id
to track what we've already received so we should be getting minimal
gaps. We almost never see two 100-entry polls in a row, so I think
we're keeping up with whatever coverage the search index is offering.
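To make the polling concrete, here is a minimal Python sketch of the high-water-mark logic described above. It is not our production code: the function names (build_search_url, advance_high_water, page_was_full) are made up for illustration, and it assumes the tweet ids have already been pulled out of the feed as integers.

```python
from urllib.parse import urlencode

SEARCH_URL = "http://search.twitter.com/search.atom"
PAGE_SIZE = 100  # the rpp value used in the searches above


def build_search_url(params, since_id):
    """Build a poll URL that only asks for tweets newer than since_id."""
    query = {"rpp": PAGE_SIZE}
    query.update(params)
    query["since_id"] = since_id
    return SEARCH_URL + "?" + urlencode(query)


def advance_high_water(since_id, tweet_ids):
    """Return the new high-water mark; it never moves backwards."""
    return max([since_id] + list(tweet_ids))


def page_was_full(tweet_ids):
    """A full page hints that more tweets were waiting than one poll returned."""
    return len(tweet_ids) >= PAGE_SIZE


location_params = {"geocode": "38.627522,-90.19841,30mi"}
print(build_search_url(location_params, 26679538876))
```

Two full pages in a row would be the signal that we are falling behind the index, which is the condition we almost never see.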
I've posted a graph of the tweet counts we've seen since 7/1/2010 in a
Google Spreadsheet so you can see the trends:
http://bit.ly/9wnnFM
(sheet two is the graph). Some interesting things to note:
1) The Mentions search is very consistent.
2) The Location search likes to bounce around a bit.
3) In mid August, we started to have issues with more 403s and errors
about since_id being too old. We were also getting rate-limited in our
calls to get the tweep details (since the ATOM feed is so meager). Due
to a bug, I wasn't committing all the tweets when this happened.
4) On or about Sept 1st, you guys did something that broke our ability
to stay caught up... we started getting almost no tweets and lots of
errors about since_id being too old. I thought this was due to your
"new tweet id" assignment being rolled out.
5) On Sept 5th, I got back from vacation and added logic to understand
and use the "no new tweets, roll the tweet id forward to this" hint,
which I get by parsing the <link rel="refresh"> node in the ATOM feed.
6) I also, around this time, added better logic to the tweep-lookup
detail, only asking you for tweeps I don't have at least a minimal row
on. This reduced the number of rate-limiting issues.
7) We were very stable until 9/23, when volume fell off a lot and
never really recovered. I think this is the "new search" engine
rollout.
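Items 5 and 6 boil down to something like the following, again as a hedged Python sketch and not our actual code. It assumes the refresh link sits directly under the feed element and carries since_id as a query parameter in its href. The helper names and the sample feed are hypothetical.

```python
import xml.etree.ElementTree as ET
from urllib.parse import urlparse, parse_qs

ATOM_NS = "{http://www.w3.org/2005/Atom}"


def refresh_since_id(atom_xml):
    """Return the since_id from the feed's <link rel="refresh">, or None."""
    root = ET.fromstring(atom_xml)
    for link in root.findall(ATOM_NS + "link"):
        if link.get("rel") != "refresh":
            continue
        query = parse_qs(urlparse(link.get("href", "")).query)
        values = query.get("since_id")
        if values:
            return int(values[0])
    return None


def tweeps_to_look_up(authors, known_tweeps):
    """Only ask for details on tweeps we have no row for yet (item 6)."""
    return sorted(set(authors) - set(known_tweeps))


# Made-up feed for illustration; a real response has many more elements.
SAMPLE_FEED = """<feed xmlns="http://www.w3.org/2005/Atom">
  <link rel="refresh" type="application/atom+xml"
        href="http://search.twitter.com/search.atom?rpp=100&amp;since_id=26679538999"/>
</feed>"""

print(refresh_since_id(SAMPLE_FEED))
```

When a poll comes back with no entries, the value from the refresh link replaces the stored high-water mark, so the next poll does not trip the "since_id too old" error.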
To research a little more, I tried the Twitter advanced search page.
When I ask for the RSS (Atom, really) feed from that page, I now get
this URL:
http://search.twitter.com/search.atom?geocode=38.627522,-90.19841,30.0mi&lang=en&q=+near:38.627522,-90.19841+within:30mi
This starts off like ours, but adds the (seemingly redundant)
human-readable search criteria
"&q=+near:38.627522,-90.19841+within:30mi".
Oddly, if we remove that and run the same search at nearly the same
instant, I get vastly different tweet sets. That is probably due to
volume, possibly just sorting, but I would hope that with the same
since_id value I would get the same tweets... and I don't.
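One way to put a number on that difference is to fetch both feeds back to back and compare the entry ids. A minimal Python sketch, assuming each Atom entry has an id element that identifies the tweet. The function names are hypothetical, and the ids are treated as opaque strings.

```python
import xml.etree.ElementTree as ET

ATOM_NS = "{http://www.w3.org/2005/Atom}"


def entry_ids(atom_xml):
    """Collect the <id> text of every <entry> in an Atom feed."""
    root = ET.fromstring(atom_xml)
    return {entry.findtext(ATOM_NS + "id")
            for entry in root.findall(ATOM_NS + "entry")}


def compare_feeds(first_xml, second_xml):
    """Count tweets seen by both queries and by only one of them."""
    first, second = entry_ids(first_xml), entry_ids(second_xml)
    return {
        "both": len(first & second),
        "only_first": len(first - second),
        "only_second": len(second - first),
    }
```

If the two queries were only sorted differently, "both" should dominate once a few consecutive polls are merged. Large "only" counts on either side would point at a real coverage difference.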
So, I'm asking... what's going on?
Why are we seeing so much volume fall-off?
What can we do about it?
Should I be running both searches (my current one and one with the
human-readable query) to get better coverage?
Is there any hope/expectation of the volume returning to normal?
Doesn't anyone else care about tweep-location searches?
Now, before you tell me that I should be using Site Streams (which I
want to do), realize that I _NEED_ tweets from people whose profile
location says they are in St. Louis (and similar), as the old Summize
search honored. I can't just get by with the _tweet_ location being
STL.
Marc Brooks
Chief guy getting yelled at,
http://stltweets.com
http://taste.stltweets.com
http://loufest.stltweets.com