Based on the previous blog post about searching strategies, we did
some further research and ran some performance tests.
The approach we're recommending uses Lucene, Memcached, or both in
different scenarios depending on what makes the most sense.
Lucene is mainly used for keyword searching, while memcached is used
to keep track of individual lists.
A typical scenario would involve searching multiple lists of activity
(for example, 5 different streams) for a keyword.
Our recommended approach would handle this scenario as follows:
* Search all activity for the keyword (a very simple Lucene query)
* Do an OR on the 5 streams (from memcached) to get a combined list of
all activity
* Do an AND on the search results and the combined list
We were most concerned about how the AND between the two lists would
scale, since the search results are not necessarily sorted the same
way as the combined activity depending on the query. However, we were
able to take advantage of the fact that the combined list will always
be sorted in a predictable way.
The critical part of this "list colliding" algorithm is the code used
to search the sorted list. We played around with a couple of
algorithms and ended up settling on Interpolation Search (http://
en.wikipedia.org/wiki/Interpolation_search).
Here are the performance results:
100,000 items collided with 10,000-100,000 items, both lists
randomized: http://i.imgur.com/uNEN7.png 50,000-250,000 items collided with 2 million items, non-overlapping
sets: http://i.imgur.com/XlDdh.png
Using the interpolation search even the worst cases came in below 40ms
on the common dev environment, which is less powerful than the perf
environment or production.
Blake did some research and found that our approach closely mirrored
what Twitter used (at least at one point) based on some slides he
found here: http://www.slideshare.net/mobile/nkallen/q-con-3770885#66 which made us feel more confident that we were going down a good path.
We also need to figure out exactly what gain we have from requesting
more and more information. It might be more efficient to just ask
search for 10,000 things and be done with it. Once that research is
done, it could simplify things a lot.
On Jul 22, 11:54 am, Rob Keane <rob.ke...@gmail.com> wrote:
> Based on the previous blog post about searching strategies, we did
> some further research and ran some performance tests.
> The approach we're recommending uses Lucene, Memcached, or both in
> different scenarios depending on what makes the most sense.
> Lucene is mainly used for keyword searching, while memcached is used
> to keep track of individual lists.
> A typical scenario would involve searching multiple lists of activity
> (for example, 5 different streams) for a keyword.
> Our recommended approach would handle this scenario as follows:
> * Search all activity for the keyword (a very simple Lucene query)
> * Do an OR on the 5 streams (from memcached) to get a combined list of
> all activity
> * Do an AND on the search results and the combined list
> We were most concerned about how the AND between the two lists would
> scale, since the search results are not necessarily sorted the same
> way as the combined activity depending on the query. However, we were
> able to take advantage of the fact that the combined list will always
> be sorted in a predictable way.
> The critical part of this "list colliding" algorithm is the code used
> to search the sorted list. We played around with a couple of
> algorithms and ended up settling on Interpolation Search (http://
> en.wikipedia.org/wiki/Interpolation_search).
> Here are the performance results:
> 100,000 items collided with 10,000-100,000 items, both lists
> randomized:http://i.imgur.com/uNEN7.png > 50,000-250,000 items collided with 2 million items, non-overlapping
> sets:http://i.imgur.com/XlDdh.png
> Using the interpolation search even the worst cases came in below 40ms
> on the common dev environment, which is less powerful than the perf
> environment or production.
> Blake did some research and found that our approach closely mirrored
> what Twitter used (at least at one point) based on some slides he
> found here:http://www.slideshare.net/mobile/nkallen/q-con-3770885#66 > which made us feel more confident that we were going down a good path.
So I think for now we're going the route of 2x the bath size and
doubling each time. So if we want 10 results, we'll ask for 20, the
40..80....until we have at least 10 results.
On Jul 28, 3:53 pm, Anthony Romano <domina...@gmail.com> wrote:
> We also need to figure out exactly what gain we have from requesting
> more and more information. It might be more efficient to just ask
> search for 10,000 things and be done with it. Once that research is
> done, it could simplify things a lot.
> On Jul 22, 11:54 am, Rob Keane <rob.ke...@gmail.com> wrote:
> > Based on the previous blog post about searching strategies, we did
> > some further research and ran some performance tests.
> > The approach we're recommending uses Lucene, Memcached, or both in
> > different scenarios depending on what makes the most sense.
> > Lucene is mainly used for keyword searching, while memcached is used
> > to keep track of individual lists.
> > A typical scenario would involve searching multiple lists of activity
> > (for example, 5 different streams) for a keyword.
> > Our recommended approach would handle this scenario as follows:
> > * Search all activity for the keyword (a very simple Lucene query)
> > * Do an OR on the 5 streams (from memcached) to get a combined list of
> > all activity
> > * Do an AND on the search results and the combined list
> > We were most concerned about how the AND between the two lists would
> > scale, since the search results are not necessarily sorted the same
> > way as the combined activity depending on the query. However, we were
> > able to take advantage of the fact that the combined list will always
> > be sorted in a predictable way.
> > The critical part of this "list colliding" algorithm is the code used
> > to search the sorted list. We played around with a couple of
> > algorithms and ended up settling on Interpolation Search (http://
> > en.wikipedia.org/wiki/Interpolation_search).
> > Here are the performance results:
> > 100,000 items collided with 10,000-100,000 items, both lists
> > randomized:http://i.imgur.com/uNEN7.png > > 50,000-250,000 items collided with 2 million items, non-overlapping
> > sets:http://i.imgur.com/XlDdh.png
> > Using the interpolation search even the worst cases came in below 40ms
> > on the common dev environment, which is less powerful than the perf
> > environment or production.
> > Blake did some research and found that our approach closely mirrored
> > what Twitter used (at least at one point) based on some slides he
> > found here:http://www.slideshare.net/mobile/nkallen/q-con-3770885#66 > > which made us feel more confident that we were going down a good path.
That's what we were doing before. I like that method, and a
configurable starting batch, so we can start with 10,000 if we decide
to. I think we'd be surprised how quickly Lucine can dump out 10,000
longs that it must have already computed since it's returning a sorted
list.
On Jul 30, 4:48 pm, Anthony Romano <domina...@gmail.com> wrote:
> So I think for now we're going the route of 2x the bath size and
> doubling each time. So if we want 10 results, we'll ask for 20, the
> 40..80....until we have at least 10 results.
> On Jul 28, 3:53 pm, Anthony Romano <domina...@gmail.com> wrote:
> > We also need to figure out exactly what gain we have from requesting
> > more and more information. It might be more efficient to just ask
> > search for 10,000 things and be done with it. Once that research is
> > done, it could simplify things a lot.
> > On Jul 22, 11:54 am, Rob Keane <rob.ke...@gmail.com> wrote:
> > > Based on the previous blog post about searching strategies, we did
> > > some further research and ran some performance tests.
> > > The approach we're recommending uses Lucene, Memcached, or both in
> > > different scenarios depending on what makes the most sense.
> > > Lucene is mainly used for keyword searching, while memcached is used
> > > to keep track of individual lists.
> > > A typical scenario would involve searching multiple lists of activity
> > > (for example, 5 different streams) for a keyword.
> > > Our recommended approach would handle this scenario as follows:
> > > * Search all activity for the keyword (a very simple Lucene query)
> > > * Do an OR on the 5 streams (from memcached) to get a combined list of
> > > all activity
> > > * Do an AND on the search results and the combined list
> > > We were most concerned about how the AND between the two lists would
> > > scale, since the search results are not necessarily sorted the same
> > > way as the combined activity depending on the query. However, we were
> > > able to take advantage of the fact that the combined list will always
> > > be sorted in a predictable way.
> > > The critical part of this "list colliding" algorithm is the code used
> > > to search the sorted list. We played around with a couple of
> > > algorithms and ended up settling on Interpolation Search (http://
> > > en.wikipedia.org/wiki/Interpolation_search).
> > > Here are the performance results:
> > > 100,000 items collided with 10,000-100,000 items, both lists
> > > randomized:http://i.imgur.com/uNEN7.png > > > 50,000-250,000 items collided with 2 million items, non-overlapping
> > > sets:http://i.imgur.com/XlDdh.png
> > > Using the interpolation search even the worst cases came in below 40ms
> > > on the common dev environment, which is less powerful than the perf
> > > environment or production.
> > > Blake did some research and found that our approach closely mirrored
> > > what Twitter used (at least at one point) based on some slides he
> > > found here:http://www.slideshare.net/mobile/nkallen/q-con-3770885#66 > > > which made us feel more confident that we were going down a good path.
We should consider some tab-separated log statements containing
diagnostic info on how many lucene queries we're making in a request
so we can pull it out to determine their optimum value.
For example:
[starting index]\t[request count]\t[number of queries]\t[number of
results found]\t[duration in ms]
On Jul 30, 4:48 pm, Anthony Romano <domina...@gmail.com> wrote:
> So I think for now we're going the route of 2x the bath size and
> doubling each time. So if we want 10 results, we'll ask for 20, the
> 40..80....until we have at least 10 results.
> On Jul 28, 3:53 pm, Anthony Romano <domina...@gmail.com> wrote:
> > We also need to figure out exactly what gain we have from requesting
> > more and more information. It might be more efficient to just ask
> > search for 10,000 things and be done with it. Once that research is
> > done, it could simplify things a lot.
> > On Jul 22, 11:54 am, Rob Keane <rob.ke...@gmail.com> wrote:
> > > Based on the previous blog post about searching strategies, we did
> > > some further research and ran some performance tests.
> > > The approach we're recommending uses Lucene, Memcached, or both in
> > > different scenarios depending on what makes the most sense.
> > > Lucene is mainly used for keyword searching, while memcached is used
> > > to keep track of individual lists.
> > > A typical scenario would involve searching multiple lists of activity
> > > (for example, 5 different streams) for a keyword.
> > > Our recommended approach would handle this scenario as follows:
> > > * Search all activity for the keyword (a very simple Lucene query)
> > > * Do an OR on the 5 streams (from memcached) to get a combined list of
> > > all activity
> > > * Do an AND on the search results and the combined list
> > > We were most concerned about how the AND between the two lists would
> > > scale, since the search results are not necessarily sorted the same
> > > way as the combined activity depending on the query. However, we were
> > > able to take advantage of the fact that the combined list will always
> > > be sorted in a predictable way.
> > > The critical part of this "list colliding" algorithm is the code used
> > > to search the sorted list. We played around with a couple of
> > > algorithms and ended up settling on Interpolation Search (http://
> > > en.wikipedia.org/wiki/Interpolation_search).
> > > Here are the performance results:
> > > 100,000 items collided with 10,000-100,000 items, both lists
> > > randomized:http://i.imgur.com/uNEN7.png > > > 50,000-250,000 items collided with 2 million items, non-overlapping
> > > sets:http://i.imgur.com/XlDdh.png
> > > Using the interpolation search even the worst cases came in below 40ms
> > > on the common dev environment, which is less powerful than the perf
> > > environment or production.
> > > Blake did some research and found that our approach closely mirrored
> > > what Twitter used (at least at one point) based on some slides he
> > > found here:http://www.slideshare.net/mobile/nkallen/q-con-3770885#66 > > > which made us feel more confident that we were going down a good path.