How to search for a substring?

Philipp Wördehoff

Jul 24, 2019, 12:55:31 PM
to bleve
Hey,

I'm trying to "fuzzy" search the index. I'm not 100% sure that a fuzzy search is what I actually want. Basically, I want "ef" to match "abcdefghi".

I tried a couple of things but didn't get anywhere. I would be very happy about a hint on where to start.

Thanks :)

Abhinav Dangeti

Jul 24, 2019, 1:01:14 PM
to bl...@googlegroups.com
Hey Philipp,

You could set up a custom n-gram token filter with min length 2 and max length 2; then set up a custom analyzer that picks up the n-gram token filter, and have your field mapping use the custom analyzer.

The term "abcdefghi" will match your search for "ef".
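
To see why that works, here is a small stdlib-only sketch of the tokens a 2-gram filter would leave in the index. It only imitates the filter (the helper names `bigrams` and `hasToken` are made up for illustration, and bleve's own filter may differ in token order and positions):

```go
package main

import "fmt"

// bigrams returns every 2-rune substring of term, in order of position.
// It only imitates what a min=2/max=2 n-gram token filter would emit.
func bigrams(term string) []string {
	runes := []rune(term)
	var out []string
	for i := 0; i+2 <= len(runes); i++ {
		out = append(out, string(runes[i:i+2]))
	}
	return out
}

// hasToken reports whether query is one of the indexed tokens.
func hasToken(tokens []string, query string) bool {
	for _, t := range tokens {
		if t == query {
			return true
		}
	}
	return false
}

func main() {
	tokens := bigrams("abcdefghi")
	fmt.Println(tokens)                 // [ab bc cd de ef fg gh hi]
	fmt.Println(hasToken(tokens, "ef")) // true
	fmt.Println(hasToken(tokens, "eg")) // false
}
```

Since "ef" is itself one of the indexed tokens, a term lookup for it finds the document directly.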

Cheers

--
AD
--
You received this message because you are subscribed to the Google Groups "bleve" group.
To unsubscribe from this group and stop receiving emails from it, send an email to bleve+un...@googlegroups.com.
To view this discussion on the web visit https://groups.google.com/d/msgid/bleve/84b7cf17-84f2-4c3d-8eb2-ef8dcec20ad4%40googlegroups.com.

Marty Schoch

Jul 24, 2019, 1:04:48 PM
to bl...@googlegroups.com
Generally, bleve is not well suited to finding arbitrary substrings inside of text. It's not that it can't do it; it's that it is designed to solve search problems differently (by using language-aware techniques such as stemming).

Fuzzy search may appear to work in some simpler examples, but it is the wrong tool for the job.  To get it to match at all you'll need high fuzzy values, which will then also match terms you're not interested in.
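
To put a number on "high fuzzy values", here is a minimal sketch that computes the plain Levenshtein edit distance, which is the usual notion behind fuzziness. It is a textbook implementation written for illustration, not bleve's code, and `editDistance` is a made-up name:

```go
package main

import "fmt"

// editDistance returns the Levenshtein distance between a and b,
// counting single-rune insertions, deletions and substitutions.
func editDistance(a, b string) int {
	ra, rb := []rune(a), []rune(b)
	prev := make([]int, len(rb)+1)
	for j := range prev {
		prev[j] = j
	}
	for i := 1; i <= len(ra); i++ {
		curr := make([]int, len(rb)+1)
		curr[0] = i
		for j := 1; j <= len(rb); j++ {
			cost := 1
			if ra[i-1] == rb[j-1] {
				cost = 0
			}
			curr[j] = minInt(prev[j]+1, minInt(curr[j-1]+1, prev[j-1]+cost))
		}
		prev = curr
	}
	return prev[len(rb)]
}

func minInt(a, b int) int {
	if a < b {
		return a
	}
	return b
}

func main() {
	fmt.Println(editDistance("ef", "abcdefghi")) // 7
	fmt.Println(editDistance("ef", "zzzzzzz"))   // 7
}
```

"ef" is seven edits away from "abcdefghi", and a tolerance of seven edits also accepts every term of up to seven characters, however unrelated.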

We have wildcard searches, which are just a wrapper around regular expressions. Again, we provide this capability, but bleve has limited optimization for this use case. If you find yourself using regular expressions extensively, bleve may be the wrong tool for the job.
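
As a rough illustration of what "a wrapper around regular expressions" means, the sketch below translates a wildcard pattern into an anchored regular expression using only the standard library. This is an assumption-laden sketch and not bleve's actual translation; `wildcardToRegexp` is a hypothetical helper. In bleve itself the query would be built with `bleve.NewWildcardQuery`.

```go
package main

import (
	"fmt"
	"regexp"
	"strings"
)

// wildcardToRegexp translates a wildcard pattern into an anchored regular
// expression: '*' matches any run of characters and '?' exactly one.
// Every other rune is escaped so that it only matches itself.
func wildcardToRegexp(pattern string) (*regexp.Regexp, error) {
	var sb strings.Builder
	sb.WriteString("^")
	for _, r := range pattern {
		switch r {
		case '*':
			sb.WriteString(".*")
		case '?':
			sb.WriteString(".")
		default:
			sb.WriteString(regexp.QuoteMeta(string(r)))
		}
	}
	sb.WriteString("$")
	return regexp.Compile(sb.String())
}

func main() {
	re, err := wildcardToRegexp("*ef*")
	if err != nil {
		panic(err)
	}
	fmt.Println(re.String()) // ^.*ef.*$
	for _, term := range []string{"abcdefghi", "lorem"} {
		fmt.Println(term, re.MatchString(term))
	}
}
```

A pattern that starts with `*` gives the engine no fixed prefix to narrow down the candidate terms, which is one reason this kind of query is hard to optimize.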

Finally, there is the option to index tokens differently, as described by Abhinav, but this too is a very special-purpose suggestion for this specific search. You can, however, extend it by building sets of the n-grams you're looking for. The same idea can be used to optimize regular expression searches (which is what Google Code Search did, indexing trigrams).
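
The trigram idea can be sketched in a few lines of plain Go. This is a toy in-memory index written for illustration only (the type `trigramIndex` and its methods are made up, and it is not how bleve or Google Code Search is implemented): the index narrows the search to documents that contain every trigram of the query, and each candidate is then verified, because sharing all trigrams does not guarantee a contiguous match.

```go
package main

import (
	"fmt"
	"sort"
	"strings"
)

// trigrams returns the distinct 3-rune substrings of s.
func trigrams(s string) []string {
	r := []rune(s)
	seen := map[string]struct{}{}
	var out []string
	for i := 0; i+3 <= len(r); i++ {
		g := string(r[i : i+3])
		if _, dup := seen[g]; !dup {
			seen[g] = struct{}{}
			out = append(out, g)
		}
	}
	return out
}

// trigramIndex maps each trigram to the set of document IDs containing it.
type trigramIndex struct {
	docs     []string
	postings map[string]map[int]struct{}
}

func newTrigramIndex(docs []string) *trigramIndex {
	ix := &trigramIndex{docs: docs, postings: map[string]map[int]struct{}{}}
	for id, doc := range docs {
		for _, g := range trigrams(doc) {
			if ix.postings[g] == nil {
				ix.postings[g] = map[int]struct{}{}
			}
			ix.postings[g][id] = struct{}{}
		}
	}
	return ix
}

// search returns the documents that contain substr, in insertion order.
func (ix *trigramIndex) search(substr string) []string {
	grams := trigrams(substr)
	var candidates []int
	if len(grams) == 0 {
		// Query shorter than three runes: the index cannot help, scan all.
		for id := range ix.docs {
			candidates = append(candidates, id)
		}
	} else {
		// Keep only documents that contain every trigram of the query.
		for id := range ix.postings[grams[0]] {
			ok := true
			for _, g := range grams[1:] {
				if _, found := ix.postings[g][id]; !found {
					ok = false
					break
				}
			}
			if ok {
				candidates = append(candidates, id)
			}
		}
		sort.Ints(candidates)
	}
	var hits []string
	for _, id := range candidates {
		// Verify: shared trigrams are necessary but not sufficient.
		if strings.Contains(ix.docs[id], substr) {
			hits = append(hits, ix.docs[id])
		}
	}
	return hits
}

func main() {
	ix := newTrigramIndex([]string{"abcdefghi", "lorem ipsum", "abcxbcd"})
	fmt.Printf("%q\n", ix.search("abcd")) // ["abcdefghi"]
	fmt.Printf("%q\n", ix.search("psum")) // ["lorem ipsum"]
}
```

In the example, "abcxbcd" contains both "abc" and "bcd" and so survives the trigram filter for "abcd", but the verification step rejects it.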

marty

Philipp Wördehoff

Jul 24, 2019, 1:07:16 PM
to bleve
Hey Abhinav, 

Awesome, I totally get what you are saying! Thanks a lot :)

Have a good time! 

Philipp Wördehoff

Jul 24, 2019, 2:07:24 PM
to bleve
Hey Marty,

We want to retrieve nodes of a graph that models a data schema. Each node comes with a couple of properties, like id, name, edges, and flags. These properties barely contain any natural language.

It's just a couple thousand nodes, so I expect performance to be in an acceptable range.

What I have come up with so far is the following. Unfortunately, it doesn't return any matches yet. Could you please quickly confirm that I'm heading in the right direction?

```golang
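package main

// Import paths are for bleve v1, which was current when this was posted;
// bleve v2 uses github.com/blevesearch/bleve/v2/... instead.
import (
	"github.com/blevesearch/bleve"
	"github.com/blevesearch/bleve/analysis/analyzer/custom"
	"github.com/blevesearch/bleve/analysis/token/edgengram"
	"github.com/blevesearch/bleve/analysis/token/lowercase"
	"github.com/blevesearch/bleve/analysis/tokenizer/unicode"
	"github.com/davecgh/go-spew/spew"
)

func main() {
	mapping := bleve.NewIndexMapping()

	// Register a custom edge n-gram token filter (3 to 25 characters).
	err := mapping.AddCustomTokenFilter("edgeNgram325",
		map[string]interface{}{
			"type": edgengram.Name,
			"min":  3.0,
			"max":  25.0,
		})
	if err != nil {
		panic(err)
	}

	// Register a custom analyzer: unicode tokenizer, lowercase, then the filter above.
	err = mapping.AddCustomAnalyzer("edgeNgram325",
		map[string]interface{}{
			"type":      custom.Name,
			"tokenizer": unicode.Name,
			"token_filters": []string{
				lowercase.Name,
				"edgeNgram325",
			},
		})
	if err != nil {
		panic(err)
	}

	// Use the custom analyzer for every field.
	mapping.DefaultAnalyzer = "edgeNgram325"

	index, err := bleve.NewMemOnly(mapping)
	if err != nil {
		panic(err)
	}

	// Index a single test document.
	err = index.Index("0", map[string]string{
		"name": "lorem ipsum",
	})
	if err != nil {
		panic(err)
	}

	// Search for a substring from the middle of "ipsum".
	qs := bleve.NewQueryStringQuery("psum")
	req := bleve.NewSearchRequest(qs)
	res, err := index.Search(req)
	if err != nil {
		panic(err)
	}

	spew.Dump(res)
	// (*bleve.SearchResult)(0xc00036a460)(No matches)
}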
```

Cheers

Philipp Wördehoff

Jul 24, 2019, 2:42:32 PM
to bleve
Hey Marty, 

I realized that I was using `edgengram.Name` whereas I actually wanted to use `ngram.Name`.

Thanks a lot for your support :)
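
For anyone hitting the same problem, the difference can be shown with a stdlib-only sketch. It assumes edge n-grams are taken from the front of each token, and the two helpers only imitate the filters (bleve's own output order may differ):

```go
package main

import "fmt"

// edgeNgrams returns the prefixes of token with lengths minLen..maxLen.
func edgeNgrams(token string, minLen, maxLen int) []string {
	r := []rune(token)
	var out []string
	for n := minLen; n <= maxLen && n <= len(r); n++ {
		out = append(out, string(r[:n]))
	}
	return out
}

// ngrams returns the substrings of token with lengths minLen..maxLen,
// starting at every position rather than only at the front.
func ngrams(token string, minLen, maxLen int) []string {
	r := []rune(token)
	var out []string
	for i := range r {
		for n := minLen; n <= maxLen && i+n <= len(r); n++ {
			out = append(out, string(r[i:i+n]))
		}
	}
	return out
}

func main() {
	fmt.Println(edgeNgrams("ipsum", 3, 25)) // [ips ipsu ipsum]
	fmt.Println(ngrams("ipsum", 3, 25))     // [ips ipsu ipsum psu psum sum]
}
```

With edge n-grams, "psum" is never produced for "ipsum", which explains the empty result; with plain n-grams it is one of the indexed tokens.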
