Based on the previous blog post about searching strategies, we did
some further research and ran some performance tests.
The approach we're recommending uses Lucene, Memcached, or both in
different scenarios depending on what makes the most sense.
Lucene is mainly used for keyword searching, while memcached is used
to keep track of individual lists.
A typical scenario would involve searching multiple lists of activity
(for example, 5 different streams) for a keyword.
Our recommended approach would handle this scenario as follows:
* Search all activity for the keyword (a very simple Lucene query)
* Do an OR on the 5 streams (from memcached) to get a combined list of
all activity
* Do an AND on the search results and the combined list
We were most concerned about how the AND between the two lists would
scale, since the search results are not necessarily sorted the same
way as the combined activity depending on the query. However, we were
able to take advantage of the fact that the combined list will always
be sorted in a predictable way.
The critical part of this "list colliding" algorithm is the code used
to search the sorted list. We played around with a couple of
algorithms and ended up settling on Interpolation Search (http://
en.wikipedia.org/wiki/Interpolation_search).
Here are the performance results:
100,000 items collided with 10,000-100,000 items, both lists
randomized:
http://i.imgur.com/uNEN7.png
50,000-250,000 items collided with 2 million items, non-overlapping
sets:
http://i.imgur.com/XlDdh.png
Using the interpolation search even the worst cases came in below 40ms
on the common dev environment, which is less powerful than the perf
environment or production.
Blake did some research and found that our approach closely mirrored
what Twitter used (at least at one point) based on some slides he
found here:
http://www.slideshare.net/mobile/nkallen/q-con-3770885#66
which made us feel more confident that we were going down a good path.