Recommendation on how to use Bleve for auto-suggest

359 views
Skip to first unread message

Siddharth Saha

unread,
Sep 30, 2020, 7:47:14 AM9/30/20
to bleve
Hi, 

I am trying to write a service that would be return airports matching a search string (even for single letter query strings).

I have a bunch of airports, with each airport having the following fields - 
  • Airport Code
    • Examples
    • LAX
    • SFO
  • Airport Name
    • Examples
    • Los Angles International Airport
    • San Francisco International Airport
  • City
    • Example
    • Los Angeles
    • San Francisco Bay Area
A user can type a string and get results. Some example query strings can be -
  • L (Return LAX)
  • Angel (Returns LAX airport)
  • SF (Returns SFO)
So, it's basically free form fuzzy matching on all the fields simultaneously. 

I loaded such data using vanilla Bleve code (no special analyzers, etc.) and it works for very specific use-cases only.

Any recommendations on how I can get started understanding more on building such a use case? 

Thanks,
Siddharth

Marty Schoch

unread,
Sep 30, 2020, 8:14:20 AM9/30/20
to bl...@googlegroups.com
So, the first concept to understand is that the primary mechanism of search is exact matching of a set of search terms against a set of indexed terms.

Start with a single field, consider a document with name: Los Angeles International Airport.  When we index that, we first analyze it.  The analyzer typically first breaks it up into tokens (tokenization), and then runs those tokens through filters (token filters).  The filters do basic things like lowercasing the letters, but can do more advanced things like stemming or other transformations.

The standard analyzer would produce the following tokens:
- los
- angeles
- international
- airport

Next, when a user runs a search, we take their search input, and run that through an analyzer as well (usually, but not always, the same analyzer used for indexing).  So, if a user searches for "angel", the standard analyzer produces just one token:
- angel

In this case, the search term does not match any of the indexed terms, so the search fails to find a match.

With that as the background, the first approach to consider is a custom analyzer which produces tokens that do match.  One way to do this is to use an n-gram analyzer.  The n-gram analyzer is configured with a minimum and maximum length, and basically computes various substrings of the tokens, and indexes those as well.  Here is an example (I've shortened the input text to just 'Los Angeles' for brevity, but you get the idea):

los
ang
ange
angel
nge
ngel
ngele
gel
gele
geles
ele
eles
les

Now, if the user searches for angel, we actually search for:

ang
ange
angel
nge
ngel
gel

Good news, now some of these tokens match, so the search would succeed in finding your document.  The bad news is that your index is substantially larger, so this approach may or may not work for your use case.  Second, this still wouldn't find a match with a user typing a single letter.  Could you index n-grams with minimum set to 1?  Yes, but for most use cases this will simply create an index that is too large, and returns search results that are undesirable.  There is also an "edge" n-gram variant that roots the tokens generated at one edge (usually front).  That way you can match prefixes only, but again that doesn't seem to match your requirements.

Finally, you mentioned doing fuzzy search.  Fuzzy works different than the technique I described because it attempts to find terms that are close, but not exact matches with terms in the index.  Fuzzy works using a metric called a levenshtein edit distance.  Bleve only supports finding matches with an edit distance of 1 or 2.

In some cases it could be useful to combine these two techniques.

Beyond fuzzy, there are things like wildcard/regexp, but it's important to remember, you're basically doing a brute-force search at that point.  If you try to search for '*a*' to match anything with the letter 'a' in it, you will find the index doesn't really help, you would be better of using grep on a text file.

marty

--
You received this message because you are subscribed to the Google Groups "bleve" group.
To unsubscribe from this group and stop receiving emails from it, send an email to bleve+un...@googlegroups.com.
To view this discussion on the web visit https://groups.google.com/d/msgid/bleve/2ebe6b1e-0b0d-4ccf-b762-445deede580bn%40googlegroups.com.
Reply all
Reply to author
Forward
0 new messages