Bloom Filter

223 views
Skip to first unread message

jamra

unread,
Aug 12, 2012, 3:33:27 AM8/12/12
to cleo-ty...@googlegroups.com
I am writing a Golang implementation of your Cleo search.  I was wondering why you need the bloom filter.

I have an inverted index of prefixes pointing to document IDs.  I match the prefix of the query against the inverted index.  Once I get the list of document IDs from the inverted index, why do I need to check if the query is in the bloom filter?  I already verified that the prefix of the query matches the prefix of my terms.  

jamra

unread,
Aug 12, 2012, 11:32:04 PM8/12/12
to cleo-ty...@googlegroups.com
I feel like I need to narrow down my question.  I see why I have to use the bloom filter with the original, full length query.  I don't understand how I will use the bloom filter to compare the query and the potential hit from the inverted index.  

Jingwei

unread,
Aug 13, 2012, 12:25:47 AM8/13/12
to cleo-ty...@googlegroups.com
Hi Jamra,

Bloom filter is originally intended to scale typeahead search over any person's network including 1st and 2nd degree connections. We adapted this implementation for general purposed typeahead when no network information is available.

If you have created a postings list (inverted list) for every possible prefixes, you do NOT need bloom filter at all. You can do joining over multiple inverted lists upon receiving several different prefixes. If you only keep the prefixes up to a specified length (say 4), you will need bloom filter when dealing with terms of lengths greater than the specified length (i.e. 4). If you have multiple terms in your search query, you can pick the shortest inverted list and apply bloom filter to quickly find out what are the potential matches.

Thanks.

Jingwei  
Message has been deleted

Jingwei

unread,
Aug 13, 2012, 12:44:58 AM8/13/12
to cleo-ty...@googlegroups.com
Ok, let's see an example query "goog an". The max length of any prefixes is 3. So there is no inverted list associated with "goog", The inverted list associated with "goo" is shorter than that associated with "an". We pick up the following inverted list. A bloom filter is created based on "goo an" and then compared with those previously computed for docs 5, 7, 13, 19 and 127.  

goo -> 5 (google search) 7 (google android) 13 (goodie anime) 19 (google analytics) 127 (google adwords)

The docs 7, 13 and 19 are not rejected by bloom filter. Within these three candidates, Doc 13 is not a match because there is no prefix "goog" in any of its terms. That is, after bloom filtering, you need consult forward index to find real matches.

Thanks.

Jingwei 



On Sunday, August 12, 2012 8:32:04 PM UTC-7, jamra wrote:
I feel like I need to narrow down my question.  I see why I have to use the bloom filter with the original, full length query.  I don't understand how I will use the bloom filter to compare the query and the potential hit from the inverted index.  

jamra

unread,
Aug 13, 2012, 3:26:19 AM8/13/12
to cleo-ty...@googlegroups.com
I think I understand it now.  Thank you.  

To test the bloom filter, I create another 8 byte filter of the original query and test that against the bloom filter of the candidate.  If the 1s in the first 8 bytes match with the 1s in the main bloom filter's 8 bytes, I can continue checking against the forward index.   

Jacob Amrany

unread,
Aug 19, 2012, 6:50:16 PM8/19/12
to cleo-ty...@googlegroups.com
Can you explain this function:

public static final int computeBloomFilter(String s, int prefixLength) {
    int cnt = Math.min(prefixLength, s.length());
    if (cnt <= 0) return 0;
    
    int filter = 0;
    int bitpos = 0;
    
    long hash = Fnv1Hash32.FNV_BASIS;
    for(int i = 0; i < cnt; i++) {
      char c = s.charAt(i);
      
      hash ^= 0xFF & c;
      hash *= Fnv1Hash32.FNV_PRIME;
      hash &= Fnv1Hash32.BITS_MASK;
      
      hash ^= 0xFF & (c >> 8);
      hash *= Fnv1Hash32.FNV_PRIME;
      hash &= Fnv1Hash32.BITS_MASK;
      
      bitpos = (int)(hash % NUM_BITS);
      if(bitpos < 0) bitpos += NUM_BITS;
      filter |= 1 << bitpos;
    }
      
    return filter;
  }

Why are you using the prefixLen? This goes against what I understood to be how we use the bloom filter after we match from the inverted index.

-- 
Jacob Amrany
Sent with Sparrow

Jacob Amrany

unread,
Aug 27, 2012, 2:35:58 AM8/27/12
to cleo-ty...@googlegroups.com
Can I have your permission to publish my golang implementation of this algorithm?

-- 
Jacob Amrany
Sent with Sparrow

Jingwei

unread,
Aug 27, 2012, 7:12:13 PM8/27/12
to cleo-ty...@googlegroups.com
Hi Jacob,

Please feel free to do so.

Jingwei

Jacob Amrany

unread,
Aug 27, 2012, 7:13:35 PM8/27/12
to cleo-ty...@googlegroups.com
Thank you.  

Jacob Amrany

unread,
Aug 30, 2012, 12:51:36 PM8/30/12
to cleo-ty...@googlegroups.com
github.com/jamra/gocleo

This was my first open source project.  I appreciate your assistance.  

On Mon, Aug 27, 2012 at 4:12 PM, Jingwei <jingw...@gmail.com> wrote:

Jingwei

unread,
Sep 6, 2012, 3:04:23 AM9/6/12
to cleo-ty...@googlegroups.com
You are more than welcome. 

Best,
Reply all
Reply to author
Forward
0 new messages