Further research on advanced search

Rob Keane

unread,

Jul 22, 2010, 11:54:06 AM7/22/10

to Eureka Streams Development

Based on the previous blog post about searching strategies, we did
some further research and ran some performance tests.

The approach we're recommending uses Lucene, Memcached, or both in
different scenarios depending on what makes the most sense.

Lucene is mainly used for keyword searching, while memcached is used
to keep track of individual lists.

A typical scenario would involve searching multiple lists of activity
(for example, 5 different streams) for a keyword.

Our recommended approach would handle this scenario as follows:

* Search all activity for the keyword (a very simple Lucene query)
* Do an OR on the 5 streams (from memcached) to get a combined list of
all activity
* Do an AND on the search results and the combined list

We were most concerned about how the AND between the two lists would
scale, since the search results are not necessarily sorted the same
way as the combined activity depending on the query. However, we were
able to take advantage of the fact that the combined list will always
be sorted in a predictable way.

The critical part of this "list colliding" algorithm is the code used
to search the sorted list. We played around with a couple of
algorithms and ended up settling on Interpolation Search (http://
en.wikipedia.org/wiki/Interpolation_search).

Here are the performance results:
100,000 items collided with 10,000-100,000 items, both lists
randomized: http://i.imgur.com/uNEN7.png
50,000-250,000 items collided with 2 million items, non-overlapping
sets: http://i.imgur.com/XlDdh.png

Using the interpolation search even the worst cases came in below 40ms
on the common dev environment, which is less powerful than the perf
environment or production.

Blake did some research and found that our approach closely mirrored
what Twitter used (at least at one point) based on some slides he
found here: http://www.slideshare.net/mobile/nkallen/q-con-3770885#66
which made us feel more confident that we were going down a good path.

Anthony Romano

unread,

Jul 28, 2010, 3:53:43 PM7/28/10

to Eureka Streams Development

We also need to figure out exactly what gain we have from requesting
more and more information. It might be more efficient to just ask
search for 10,000 things and be done with it. Once that research is
done, it could simplify things a lot.

Anthony Romano

unread,

Jul 30, 2010, 4:48:44 PM7/30/10

to Eureka Streams Development

So I think for now we're going the route of 2x the bath size and
doubling each time. So if we want 10 results, we'll ask for 20, the
40..80....until we have at least 10 results.

Blake Caldwell

unread,

Jul 31, 2010, 1:03:47 PM7/31/10

to Eureka Streams Development

That's what we were doing before. I like that method, and a
configurable starting batch, so we can start with 10,000 if we decide
to. I think we'd be surprised how quickly Lucine can dump out 10,000
longs that it must have already computed since it's returning a sorted
list.

Blake Caldwell

unread,

Aug 3, 2010, 1:14:20 PM8/3/10

to Eureka Streams Development

We should consider some tab-separated log statements containing
diagnostic info on how many lucene queries we're making in a request
so we can pull it out to determine their optimum value.

For example:
[starting index]\t[request count]\t[number of queries]\t[number of
results found]\t[duration in ms]

Reply all

Reply to author

Forward

Message has been deleted