Account Options

  1. Sign in
The old Google Groups will be going away soon.
Switch to the new Google Groups.
Google Groups Home
« Groups Home
Further research on advanced search
There are currently too many topics in this group that display first. To make this topic appear first, remove this option from another topic.
There was an error processing your request. Please try again.
flag
  5 messages - Collapse all  -  Translate all to Translated (View all originals)
The group you are posting to is a Usenet group. Messages posted to this group will make your email address visible to anyone on the Internet.
Your reply message has not been sent.
Your post was successful
 
From:
To:
Cc:
Followup To:
Add Cc | Add Followup-to | Edit Subject
Subject:
Validation:
For verification purposes please type the characters you see in the picture below or the numbers you hear by clicking the accessibility icon. Listen and type the numbers you hear
 
Rob Keane  
View profile  
 More options Jul 22 2010, 11:54 am
From: Rob Keane <rob.ke...@gmail.com>
Date: Thu, 22 Jul 2010 08:54:06 -0700 (PDT)
Local: Thurs, Jul 22 2010 11:54 am
Subject: Further research on advanced search
Based on the previous blog post about searching strategies, we did
some further research and ran some performance tests.

The approach we're recommending uses Lucene, Memcached, or both in
different scenarios depending on what makes the most sense.

Lucene is mainly used for keyword searching, while memcached is used
to keep track of individual lists.

A typical scenario would involve searching multiple lists of activity
(for example, 5 different streams) for a keyword.

Our recommended approach would handle this scenario as follows:

* Search all activity for the keyword (a very simple Lucene query)
* Do an OR on the 5 streams (from memcached) to get a combined list of
all activity
* Do an AND on the search results and the combined list

We were most concerned about how the AND between the two lists would
scale, since the search results are not necessarily sorted the same
way as the combined activity depending on the query.  However, we were
able to take advantage of the fact that the combined list will always
be sorted in a predictable way.

The critical part of this "list colliding" algorithm is the code used
to search the sorted list.  We played around with a couple of
algorithms and ended up settling on Interpolation Search (http://
en.wikipedia.org/wiki/Interpolation_search).

Here are the performance results:
100,000 items collided with 10,000-100,000 items, both lists
randomized: http://i.imgur.com/uNEN7.png
50,000-250,000 items collided with 2 million items, non-overlapping
sets: http://i.imgur.com/XlDdh.png

Using the interpolation search even the worst cases came in below 40ms
on the common dev environment, which is less powerful than the perf
environment or production.

Blake did some research and found that our approach closely mirrored
what Twitter used (at least at one point) based on some slides he
found here: http://www.slideshare.net/mobile/nkallen/q-con-3770885#66
which made us feel more confident that we were going down a good path.


 
You must Sign in before you can post messages.
To post a message you must first join this group.
Please update your nickname on the subscription settings page before posting.
You do not have the permission required to post.
Anthony Romano  
View profile  
 More options Jul 28 2010, 3:53 pm
From: Anthony Romano <domina...@gmail.com>
Date: Wed, 28 Jul 2010 12:53:43 -0700 (PDT)
Local: Wed, Jul 28 2010 3:53 pm
Subject: Re: Further research on advanced search
We also need to figure out exactly what gain we have from requesting
more and more information. It might be more efficient to just ask
search for 10,000 things and be done with it. Once that research is
done, it could simplify things a lot.

On Jul 22, 11:54 am, Rob Keane <rob.ke...@gmail.com> wrote:


 
You must Sign in before you can post messages.
To post a message you must first join this group.
Please update your nickname on the subscription settings page before posting.
You do not have the permission required to post.
Anthony Romano  
View profile  
 More options Jul 30 2010, 4:48 pm
From: Anthony Romano <domina...@gmail.com>
Date: Fri, 30 Jul 2010 13:48:44 -0700 (PDT)
Local: Fri, Jul 30 2010 4:48 pm
Subject: Re: Further research on advanced search
So I think for now we're going the route of 2x the bath size and
doubling each time. So if we want 10 results, we'll ask for 20, the
40..80....until we have at least 10 results.

On Jul 28, 3:53 pm, Anthony  Romano <domina...@gmail.com> wrote:


 
You must Sign in before you can post messages.
To post a message you must first join this group.
Please update your nickname on the subscription settings page before posting.
You do not have the permission required to post.
Blake Caldwell  
View profile  
 More options Jul 31 2010, 1:03 pm
From: Blake Caldwell <blakecaldw...@gmail.com>
Date: Sat, 31 Jul 2010 10:03:47 -0700 (PDT)
Local: Sat, Jul 31 2010 1:03 pm
Subject: Re: Further research on advanced search
That's what we were doing before.  I like that method, and a
configurable starting batch, so we can start with 10,000 if we decide
to.  I think we'd be surprised how quickly Lucine can dump out 10,000
longs that it must have already computed since it's returning a sorted
list.

On Jul 30, 4:48 pm, Anthony  Romano <domina...@gmail.com> wrote:


 
You must Sign in before you can post messages.
To post a message you must first join this group.
Please update your nickname on the subscription settings page before posting.
You do not have the permission required to post.
Blake Caldwell  
View profile  
 More options Aug 3 2010, 1:14 pm
From: Blake Caldwell <blakecaldw...@gmail.com>
Date: Tue, 3 Aug 2010 10:14:20 -0700 (PDT)
Local: Tues, Aug 3 2010 1:14 pm
Subject: Re: Further research on advanced search
We should consider some tab-separated log statements containing
diagnostic info on how many lucene queries we're making in a request
so we can pull it out to determine their optimum value.

For example:
[starting index]\t[request count]\t[number of queries]\t[number of
results found]\t[duration in ms]

On Jul 30, 4:48 pm, Anthony  Romano <domina...@gmail.com> wrote:


 
You must Sign in before you can post messages.
To post a message you must first join this group.
Please update your nickname on the subscription settings page before posting.
You do not have the permission required to post.
End of messages
« Back to Discussions « Newer topic     Older topic »