Search API

Doug Kaye

Sep 21, 2009, 3:42:55 PM
to spokenw...@googlegroups.com
So now that I'm really getting into it, I'm thinking of killing the
whole /search method.

For background, here's my blog post from three months ago: http://www.blogarithms.com/index.php/archives/2009/06/26/searchadventures/

The facts are that (a) /search doesn't work particularly well, (b) I'm
*only* supporting it in the API; the site no longer uses it, and (c)
it's very slow and heavy on CPU resources. For example, a first (non-
cached) search for programs with 'china' in the title or description
currently takes about 4.5 seconds, and that is only going to get
worse as the number of programs increases. During those 4.5 seconds,
that server is essentially locked up.

I'm actually thinking about a whacky idea: To use Google to search,
then screen-scrape the results and use that for the API. I told you:
whacky!

So I need to hear from anyone who is already using the /search method.
How are you using it and how will it affect you if I kill it?

Thanks.

...doug

Doug Kaye, Executive Director
The Conversations Network
A 501(c)(3) Non-Profit
do...@rds.com
v: 415.868.5461
twitter: dougkaye
facebook.com/doug.kaye

Steve Williams

Sep 21, 2009, 4:20:09 PM
to spokenw...@googlegroups.com
At 12:42 PM 9/21/2009, Doug Kaye wrote:
>... use Google to search, then screen-scrape the results and use
>that for the API. I told you: whacky!

And probably a violation of Google TOS.

Yahoo! has excellent open search APIs, but they're limited to 5,000
searches per IP address per day.

http://www.sbw.org/search/

Don't know what they might charge to lift the limit.

Doug Kaye

Sep 21, 2009, 4:24:57 PM
to spokenw...@googlegroups.com
You're probably right about Google's ToS, Steve.

I think our better long-term solution is to use Solr, but I need to
find the budget to add a dedicated server for it and for our sysadmin
to work on it.

...doug

drnull

Sep 22, 2009, 9:52:07 AM
to SpokenWord.org APIs
Ack. The /search method was the one of primary interest to me.

As for the blog post, the main use case I have for searching is to
find feeds. And that's rare. I only listen to a few feeds, and I
might occasionally add a new one.

I understand, though, that the idea you propose is that feeds are just
another type of collection, and that some people may not care about a
particular feed but just want to hear all the programs on a certain
topic, regardless of which feed they come from.

I know everybody is different, but podcast listening to me is a very
passive process. I don't like to go to a site every week and pick and
choose what I'm going to listen to. I set up feeds and expect to have
new episodes available for me to listen to when they come out. I take
it there are people who like picking and choosing and setting up
"custom" feeds every day/week?

I hope that you can keep the /search method for feeds, even if you
implement a more Google/Solr-based solution. I've done a bit of work
with Solr, btw, and it's a great search interface.

On Sep 21, 3:42 pm, Doug Kaye <d...@rds.com> wrote:
> So now that I'm really getting into it, I'm thinking of killing the  
> whole /search method.
>
> For background, here's my blog post from three months ago:http://www.blogarithms.com/index.php/archives/2009/06/26/searchadvent...

Doug Kaye

Sep 22, 2009, 10:06:44 AM
to spokenw...@googlegroups.com
Things are looking a little better this morning. I'm a bit more optimistic. I *think* I can salvage search by eliminating a few filtering options:

&group=1 is already gone. This is a necessary side effect of allowing a program to be part of more than one feed, which I think is more important than the group option. Unfortunately, this will also eliminate grouping for the UI on the site.

&feed=<n> is going away today. You'll still be able to run searches for feeds (/search/feeds/...) but you won't be able to filter a program search by feed. IOW, you won't be able to ask the question "What programs in feed 123 have titles containing the string 'xyz'?" But you'll still be able to ask "What feeds have titles with 'xyz'?"

With more than 350,000 programs in the database, the MySQL overhead of the former type of query is just too great.


    ...doug

Steve Williams

Sep 22, 2009, 12:33:37 PM
to spokenw...@googlegroups.com
Continuing on my one-man Yahoo! bandwagon ...

Another approach to API search might be:

1. Mark up Spokenword.org pages using RDFa and/or Microformats.
2. Wait for Yahoo! to crawl those pages, extracting and indexing
the structured data.
3. Recommend Yahoo! Search BOSS to developers instead of trying to
provide a search API:

http://developer.yahoo.com/search/boss/structureddata.html

This is a different Yahoo! API than the search API I mentioned
earlier. I don't know whether it has similar rate limits, but even
if so, the rate limits wouldn't be as limiting, because they are
imposed on each developer's application, not on the servers running
Spokenword.org's API.

We added RDFa to Digg.com's existing Microformats mark-up largely in
response to Yahoo!'s efforts: We were working on a SearchMonkey app
that would enhance Digg.com pages in Yahoo! search results. That
never got deployed, but I still think it's a great idea.

At Digg, we went ahead with our own search API, because we have
developers and admins to support our search infrastructure:

http://apidoc.digg.com/SearchStories

But for Spokenword.org, it may not be feasible to keep putting
resources into developing search or serving search results,
especially if structured mark-up makes it unnecessary.

I apologize for pushing Yahoo! stuff all the time, but I really do
think they've done some impressive work and made it openly
available. And they did it in ways that make sense for the internet
as a whole. Sure, the index behind structured data search needs to
be distributed somehow, but for now, at least Yahoo! is demonstrating
the value of structured data, albeit with a centralized index.

What will happen to Search BOSS with the Yahoo!/Microsoft deal? I
don't know. That risk is why I observe that we need these indexes to
be distributed.

Anyway, that's my pitch for the day. Let me know if you'd like help
with the RDFa mark-up.

Doug Kaye

Sep 22, 2009, 1:35:58 PM
to spokenw...@googlegroups.com
You continue to provide some superb ideas, Steve. I like the way you think. I'll study the Yahoo! Search BOSS stuff when I get a chance.

Your idea is very much in keeping with what I did for interactive search. I put everything into the pages and generate sitemaps for Google, and that's working really well.

Long-term, I do think we're headed towards a Solr installation. I need to figure out how to pay for it (probably on a dedicated server) and for a developer/sysadmin to get it up and running. Yahoo! Search BOSS could be a good second-step approach, with Solr being the third step. For now (step one) I'm going to do some triage on the existing search. The good news is that the APIs are very lightly used so far, so even those high-overhead queries aren't too bad.

   ...doug

Ken Kennedy

Sep 22, 2009, 7:49:48 PM
to spokenw...@googlegroups.com
Honestly, I'm primarily interested in searching on the filename and/or feed item guid, so that I can walk backwards from an MP3 on my player to the program and rate it. If I can do that, the rest (to me, anyway) is just gravy. The search against filename and/or item guid should be nice and indexable w/o a join.
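
A minimal sketch of that kind of lookup, using Python's built-in sqlite3 rather than the site's MySQL. The table `program_items` and its columns are hypothetical, since the thread does not show how filenames or guids are stored. The point is only that an exact-match lookup on an indexed column needs no join.

```python
import sqlite3

# Hypothetical schema: the real storage of filenames and guids is not shown in the thread.
conn = sqlite3.connect(":memory:")
conn.executescript("""
    CREATE TABLE program_items (
        program_id INTEGER PRIMARY KEY,
        guid       TEXT NOT NULL,   -- feed item guid
        filename   TEXT NOT NULL    -- file name of the downloaded audio
    );
    CREATE INDEX idx_items_guid     ON program_items (guid);
    CREATE INDEX idx_items_filename ON program_items (filename);
""")
conn.executemany(
    "INSERT INTO program_items (program_id, guid, filename) VALUES (?, ?, ?)",
    [
        (101, "urn:example:ep-1", "show-ep1.mp3"),
        (102, "urn:example:ep-2", "show-ep2.mp3"),
    ],
)


def find_program(conn, filename=None, guid=None):
    """Return the program_id for an exact guid or filename match, or None."""
    if guid is not None:
        sql, arg = "SELECT program_id FROM program_items WHERE guid = ?", guid
    elif filename is not None:
        sql, arg = "SELECT program_id FROM program_items WHERE filename = ?", filename
    else:
        raise ValueError("need a filename or a guid")
    row = conn.execute(sql, (arg,)).fetchone()
    return row[0] if row else None


print(find_program(conn, filename="show-ep2.mp3"))
```

Each query touches a single table through a single-column index, which is why this kind of search should stay cheap as the number of programs grows.
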
--
Ken Kennedy
Contact info: http://kenzoid.com/me/contact

Ken Kennedy

Sep 22, 2009, 8:59:55 PM
to spokenw...@googlegroups.com
> IOW, you won't be able to ask the question "What programs in feed 123 have titles containing the string 'xyz'?" But you'll
> still be able to ask "What feeds have titles with 'xyz'?"

> With more than 350,000 programs in the database, the MySQL overhead of the former type of query is just too great.

Really? Huh. Looking at the API, it appears I could still ask "What programs have titles containing the string 'xyz'", right? Even with the (presumably) many-to-many joining table, the limitation of only one feed ought to drop the number of required rows to check the string in so much that it's a pretty easy query compared to the full table scan to check every program. I'd be interested in seeing the query plan...I thought MySQL's optimizer was getting better. Are all the relevant keys indexed?
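
To make the query shape in question concrete, here is a rough sketch using Python's built-in sqlite3 instead of MySQL. The table names `titles_programs` and `program_feed_map` appear elsewhere in this thread, but the column names `program_id` and `feed_id` are assumptions, the `current` flag is left out for brevity, and `LIKE` stands in for MySQL's full-text `MATCH ... AGAINST`. It says nothing about how MySQL would actually plan the query.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
    CREATE TABLE titles_programs (object_id INTEGER, text TEXT);
    -- Column names below are assumed; only the table name is from the thread.
    CREATE TABLE program_feed_map (program_id INTEGER, feed_id INTEGER);
    CREATE INDEX idx_map_feed      ON program_feed_map (feed_id, program_id);
    CREATE INDEX idx_titles_object ON titles_programs (object_id);
""")
conn.executemany("INSERT INTO titles_programs VALUES (?, ?)", [
    (1, "Classic jazz hour"),
    (2, "Modern jazz hour"),
    (3, "Classic novels"),
])
conn.executemany("INSERT INTO program_feed_map VALUES (?, ?)", [
    (1, 123), (2, 123), (3, 456),
])


def titles_in_feed(conn, feed_id, needle):
    """Ids of programs in one feed whose title contains `needle`."""
    sql = """
        SELECT t.object_id
        FROM program_feed_map m
        JOIN titles_programs t ON t.object_id = m.program_id
        WHERE m.feed_id = ?                  -- narrow to one feed first
          AND t.text LIKE '%' || ? || '%'    -- then test only those titles
        ORDER BY t.object_id
    """
    return [row[0] for row in conn.execute(sql, (feed_id, needle))]


print(titles_in_feed(conn, 123, "classic"))
```

The string test only runs against titles that belong to the chosen feed, so the cost depends on how many programs that feed contains.
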

Doug Kaye

Sep 23, 2009, 1:50:38 AM
to spokenw...@googlegroups.com
FYI, here's one of those ugly queries.

SELECT SQL_CALC_FOUND_ROWS DISTINCT o.id
FROM programs o,
((SELECT object_id
   FROM descriptions_programs
   WHERE MATCH (text) AGAINST ('+classic ' IN BOOLEAN MODE)
   AND descriptions_programs.current=1)
UNION ALL
(SELECT object_id
   FROM titles_programs
   WHERE MATCH (text) AGAINST ('+classic ' IN BOOLEAN MODE)
   AND titles_programs.current=1))
t, categories_programs cat 
WHERE t.object_id=o.id AND o.id=cat.object_id
AND MATCH (cat.text) AGAINST('0005 ' IN BOOLEAN MODE)
AND cat.current=1
ORDER BY o.created DESC
LIMIT 0,100

Told ya they were ugly! I'll decipher: it's trying to count all the programs with titles or descriptions containing the word 'classic' that are also in category 0005 (Arts & Literature), and return the 100 most recently added of them.
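
For readers who find the SQL hard to follow, the same logic can be sketched in plain Python over in-memory records. This only illustrates what the query asks for, not how the site implements search. The `Program` class, the sample data, and the whitespace word matching (a simplification of MySQL's boolean full-text mode) are all assumptions.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Program:
    id: int
    created: str            # ISO date, so string order matches time order
    title: str
    description: str
    categories: frozenset   # category codes such as "0005"


def search(programs, word, category, limit=100):
    """Return (total number of matches, ids of the newest `limit` matches)."""
    word = word.lower()
    matches = [
        p for p in programs
        if category in p.categories
        and (word in p.title.lower().split()
             or word in p.description.lower().split())
    ]
    # Each program appears once even if both title and description match,
    # which is what DISTINCT achieves in the SQL.
    matches.sort(key=lambda p: p.created, reverse=True)   # ORDER BY created DESC
    return len(matches), [p.id for p in matches[:limit]]  # count, then LIMIT


programs = [
    Program(1, "2009-09-01", "Classic film scores", "", frozenset({"0005"})),
    Program(2, "2009-09-10", "Tech news", "A classic debate", frozenset({"0005"})),
    Program(3, "2009-09-15", "Classic cars", "", frozenset({"0009"})),
]
print(search(programs, "classic", "0005", limit=1))
```

Only the matching programs are sorted here, and both the total count and the newest page come from that one matched set.
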

On Tue, Sep 22, 2009 at 5:59 PM, Ken Kennedy <ken...@gmail.com> wrote:
> Really? Huh. Looking at the API, it appears I could still ask "What programs have titles containing the string 'xyz'", right?

Correct.

> Even with the (presumably) many-to-many joining table, the limitation of only one feed ought to drop the number of required rows to check the string in so much that it's a pretty easy query compared to the full table scan to check every program. I'd be interested in seeing the query plan...I thought MySQL's optimizer was getting better. Are all the relevant keys indexed?

Unfortunately, there are feeds with many thousands of programs, so the JOIN of the programs table and the program_feed_map table can be quite large.

   ...doug

Doug Kaye

Sep 23, 2009, 2:01:00 AM
to spokenw...@googlegroups.com
I should add that the phrase 'ORDER BY o.created DESC' causes the entire programs table to be used. This query hangs the entire database. More work to do here!

   ...doug

Doug Kaye

Sep 23, 2009, 3:50:27 AM
to spokenw...@googlegroups.com
Unfortunately I've had to disable the /search API method again. It was working reasonably well, but every once in a while we were getting a query that completely hung the database server. These queries are valid. It's my code that isn't.

And in a few hours I'm heading off to a family wedding and won't be able to work on this until Monday 9/28. Sorry to those that are using it, but I can't leave it running as it occasionally hangs the web site.

   ...doug