For background, here's my blog post from three months ago: http://www.blogarithms.com/index.php/archives/2009/06/26/searchadventures/
The facts are that (a) /search doesn't work particularly well, (b) I'm
*only* supporting it in the API; the site no longer uses it, and (c)
it's very slow and heavy on CPU resources. For example, a first
(non-cached) search for programs with 'china' in the title or
description currently takes about 4.5 seconds, and that is only going
to get worse as the number of programs increases. During those 4.5
seconds, that server is essentially locked up.
I'm actually thinking about a wacky idea: use Google to search, then
screen-scrape the results and use them for the API. I told you:
wacky!
So I need to hear from anyone who is already using the /search method.
How are you using it and how will it affect you if I kill it?
Thanks.
...doug
Doug Kaye, Executive Director
The Conversations Network
A 501(c)(3) Non-Profit
do...@rds.com
v: 415.868.5461
twitter: dougkaye
facebook.com/doug.kaye
Screen-scraping Google is also probably a violation of Google's TOS.
Yahoo! has excellent open search APIs, but they're limited to 5,000
searches per IP address per day.
Don't know what they might charge to lift the limit.
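To make the per-IP limit concrete, here is a minimal sketch of the bookkeeping a server would need if it proxied API searches through a rate-limited upstream. The class and method names are hypothetical, and the only number taken from this thread is the 5,000-per-day figure; how Yahoo! actually counts or resets the quota is an assumption.

```python
from datetime import date


class DailyQuota:
    """Hypothetical counter for a per-day upstream search budget."""

    def __init__(self, limit=5000):
        self.limit = limit  # 5,000/day is the figure quoted above
        self.day = None     # the calendar day the counter applies to
        self.used = 0

    def try_consume(self, today=None):
        """Return True and count one search if budget remains for `today`."""
        today = today or date.today()
        if today != self.day:
            # Assumption: the budget resets on a calendar-day boundary.
            self.day = today
            self.used = 0
        if self.used >= self.limit:
            return False
        self.used += 1
        return True
```

Because every API client's search would draw on the same server IP, the whole API would share this one budget, which is the weakness of proxying searches through our own servers.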
I think our better long-term solution is to use Solr, but I need to
find the budget to add a dedicated server for it and for our sysadmin
to work on it.
...doug
Another approach to API search might be:
1. Mark up Spokenword.org pages using RDFa and/or Microformats.
2. Wait for Yahoo! to crawl those pages, extracting and indexing
the structured data.
3. Recommend Yahoo! Search BOSS to developers instead of trying to
provide a search API:
http://developer.yahoo.com/search/boss/structureddata.html
This is a different Yahoo! API than the search API I mentioned
earlier. I don't know whether it has similar rate limits, but even
if so, they would be less of a constraint, because they are imposed
on each developer's application, not on the servers running
Spokenword.org's API.
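As a rough illustration of step 1, the sketch below renders one program as an RDFa-annotated fragment. It assumes Dublin Core (`dc:title`, `dc:description`) as the vocabulary and a made-up program URL; the function name is hypothetical, and which vocabularies Yahoo! actually indexes would need to be checked against their documentation.

```python
from html import escape

# Dublin Core element namespace; choosing this vocabulary is an assumption.
DC_NS = "http://purl.org/dc/elements/1.1/"


def rdfa_program(url, title, description):
    """Return an XHTML fragment describing one program with RDFa attributes."""
    return (
        '<div xmlns:dc="{ns}" about="{url}">\n'
        '  <h2 property="dc:title">{title}</h2>\n'
        '  <p property="dc:description">{desc}</p>\n'
        "</div>"
    ).format(
        ns=DC_NS,
        url=escape(url, quote=True),
        title=escape(title, quote=True),
        desc=escape(description, quote=True),
    )


# Hypothetical program record, for illustration only.
print(rdfa_program(
    "http://example.org/programs/1",
    "China & the Web",
    "A talk about the internet in China.",
))
```

The point is only that the markup can be emitted by the existing page templates from fields the site already has, so no separate search service is involved.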
We added RDFa to Digg.com's existing Microformats mark-up largely in
response to Yahoo!'s efforts: We were working on a SearchMonkey app
that would enhance Digg.com pages in Yahoo! search results. That
never got deployed, but I still think it's a great idea.
At Digg, we went ahead with our own search API, because we have
developers and admins to support our search infrastructure:
http://apidoc.digg.com/SearchStories
But for Spokenword.org, it may not be feasible to keep putting
resources into developing search or serving search results,
especially if structured mark-up makes it unnecessary.
I apologize for pushing Yahoo! stuff all the time, but I really do
think they've done some impressive work and made it openly
available. And they did it in ways that make sense for the internet
as a whole. Sure, the index behind structured data search needs to
be distributed somehow, but for now, at least Yahoo! is demonstrating
the value of structured data, albeit with a centralized index.
What will happen to Search BOSS with the Yahoo!/Microsoft deal? I
don't know. That risk is why I observe that we need these indexes to
be distributed.
Anyway, that's my pitch for the day. Let me know if you'd like help
with the RDFa mark-up.
Really? Huh. Looking at the API, it appears I could still ask "What programs have titles containing the string 'xyz'", right?
Even with the (presumably) many-to-many joining table, restricting the search to a single feed ought to cut the number of rows whose titles have to be checked so much that it's a pretty easy query compared to the full table scan needed to check every program. I'd be interested in seeing the query plan; I thought MySQL's optimizer was getting better. Are all the relevant keys indexed?
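Here is a small sketch of the kind of feed-restricted query I have in mind, together with a way to look at its plan. I don't know the real schema, so the table and column names are made up, and SQLite stands in for MySQL so the example is self-contained; on the real database the equivalent check would be MySQL's EXPLAIN.

```python
import sqlite3

# Hypothetical schema: programs plus a many-to-many feed/program join table.
conn = sqlite3.connect(":memory:")
conn.executescript("""
    CREATE TABLE programs (id INTEGER PRIMARY KEY, title TEXT);
    CREATE TABLE feed_programs (feed_id INTEGER, program_id INTEGER);
    CREATE INDEX idx_feed_programs ON feed_programs (feed_id, program_id);
""")
conn.executemany(
    "INSERT INTO programs (id, title) VALUES (?, ?)",
    [(1, "China and the web"), (2, "Cooking basics"), (3, "Trade with China")],
)
conn.executemany(
    "INSERT INTO feed_programs (feed_id, program_id) VALUES (?, ?)",
    [(10, 1), (10, 2), (20, 3)],
)

# Only programs linked to one feed are candidates for the substring match.
FEED_TITLE_SQL = """
    SELECT p.title
    FROM feed_programs AS fp
    JOIN programs AS p ON p.id = fp.program_id
    WHERE fp.feed_id = ? AND p.title LIKE ?
    ORDER BY p.id
"""


def titles_in_feed(feed_id, needle):
    """Titles in one feed containing `needle` (LIKE wildcards are not escaped)."""
    params = (feed_id, "%" + needle + "%")
    return [row[0] for row in conn.execute(FEED_TITLE_SQL, params)]


def query_plan(sql, params):
    """Return the human-readable steps of SQLite's plan for `sql`."""
    rows = conn.execute("EXPLAIN QUERY PLAN " + sql, params)
    return [row[-1] for row in rows]  # the last column holds the description


print(titles_in_feed(10, "china"))
for step in query_plan(FEED_TITLE_SQL, (10, "%china%")):
    print(step)
```

The substring match itself still can't use an index, but with an index on the join table's feed column it only has to run against that feed's programs rather than the whole table.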