A little clarification on keyword analyzer and ~ (fuzzy search)

Osiloke Emoekpere

May 20, 2015, 7:26:33 AM
to bl...@googlegroups.com
Hi,

I am using the boolean query to search for items. I have a search request which returns nothing. The post field is indexed with a keyword analyzer.

$ bleve_query --index=.index/like.index/ +created_by:davinci post:555c3bb06f7a127cda000005
No matches

If I add a fuzzy search option (~; I'm assuming the tilde is the fuzzy search character), I get the result I want:

$ bleve_query --index=.index/like.index/ +created_by:davinci post:555c3bb06f7a127cda000005~3
1 matches, showing 1 through 1, took 524.096µs
    1. 555c40386f7a128421000001 (0.227900)
created_by
davinci
post
555c3bb06f7a127cda000005

I would like to ask if I'm using the query syntax properly.
Thanks

Note:
My index mapping is as follows

{
  "types": {
    "like": {
      "enabled": true,
      "dynamic": true,
      "properties": {
        "post": {
          "enabled": true,
          "dynamic": true,
          "fields": [
            {
              "type": "text",
              "analyzer": "keyword",
              "store": true,
              "index": true,
              "include_term_vectors": true,
              "include_in_all": true
            }
          ],
          "default_analyzer": ""
        }
      },
      "default_analyzer": ""
    }
  },
  "default_mapping": {
    "enabled": true,
    "dynamic": true,
    "default_analyzer": ""
  },
  "type_field": "_type",
  "default_type": "_default",
  "default_analyzer": "standard",
  "default_datetime_parser": "dateTimeOptional",
  "default_field": "_all",
  "byte_array_converter": "json",
  "analysis": {}
}

Marty Schoch

May 20, 2015, 12:01:36 PM
to bl...@googlegroups.com
The syntax looks OK to me, but fuzzy search with an edit distance of 3 or higher will start to perform poorly. Choosing a better solution might depend on the details of your use case, though. Are you simply trying to allow for typos when someone enters a UUID by hand?

marty
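
For readers unfamiliar with the term: the "edit distance" in a query like `post:555c3bb06f7a127cda000005~3` is commonly measured as Levenshtein distance, the number of single-character insertions, deletions, and substitutions separating two strings. The sketch below is a plain textbook implementation for illustration, not bleve's own code; the function names are made up for this example.

```go
package main

import "fmt"

// editDistance returns the Levenshtein distance between a and b: the minimum
// number of single-character insertions, deletions, and substitutions needed
// to turn a into b. It compares runes rather than bytes.
func editDistance(a, b string) int {
	ra, rb := []rune(a), []rune(b)
	prev := make([]int, len(rb)+1)
	curr := make([]int, len(rb)+1)
	for j := range prev {
		prev[j] = j // turning "" into rb[:j] takes j insertions
	}
	for i := 1; i <= len(ra); i++ {
		curr[0] = i // turning ra[:i] into "" takes i deletions
		for j := 1; j <= len(rb); j++ {
			cost := 1
			if ra[i-1] == rb[j-1] {
				cost = 0
			}
			curr[j] = min3(prev[j]+1, curr[j-1]+1, prev[j-1]+cost)
		}
		prev, curr = curr, prev // the row just filled becomes the previous row
	}
	return prev[len(rb)]
}

// min3 returns the smallest of three ints.
func min3(a, b, c int) int {
	if b < a {
		a = b
	}
	if c < a {
		a = c
	}
	return a
}

func main() {
	indexed := "555c3bb06f7a127cda000005"
	// Distance from the indexed term to itself.
	fmt.Println(editDistance(indexed, indexed))
	// Distance to a hypothetical mistyped variant with the last two characters swapped.
	fmt.Println(editDistance(indexed, "555c3bb06f7a127cda000050"))
}
```

An identical term has distance 0, so a hit with `~3` confirms that the term is in the index, but it does not explain why the exact query finds nothing.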


Osiloke Emoekpere

May 20, 2015, 12:04:37 PM
to bl...@googlegroups.com
I actually don't want to use fuzzy search. I wanted to do a keyword search, but it yields no results, whereas a fuzzy search shows that the entry I'm looking for does exist. I only used fuzzy search to verify that it exists.

Marty Schoch

May 20, 2015, 12:07:44 PM
to bl...@googlegroups.com
Oh, I see now.  I was looking at the document IDs and not the values of the "post" field.  They were very similar so I misunderstood.  Let me read everything again.

marty

Marty Schoch

May 20, 2015, 12:12:21 PM
to bl...@googlegroups.com
OK, so the most likely cause is that something is being analyzed at either index or query time. I see that you've configured a document type called "like". Do the documents you're indexing implement the Classifier interface, or have a field "_type" with value "like"? Otherwise bleve won't know to use the "like" document mapping for them.

marty
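
The two options Marty describes can be sketched as follows. This is a simplified illustration, not bleve's code: `Classifier` here is a local stand-in for bleve's interface with the single method `Type() string`, while `Like` and `resolveType` are hypothetical names. The `"_type"` and `"_default"` values are the `type_field` and `default_type` from the mapping posted above, and the fallback order is an assumption based on Marty's description.

```go
package main

import "fmt"

// Classifier is a local stand-in for bleve's interface of the same name,
// which consists of the single method Type() string.
type Classifier interface {
	Type() string
}

// Like is a hypothetical document struct for the thread's "like" type.
type Like struct {
	CreatedBy string
	Post      string
}

// Type makes Like satisfy Classifier, so the "like" mapping can be selected.
func (Like) Type() string { return "like" }

// resolveType picks a document type: a Classifier wins, then a string value
// under typeField in a map document, then defaultType. Only maps are
// inspected for the type field here; this is a simplification.
func resolveType(doc interface{}, typeField, defaultType string) string {
	if c, ok := doc.(Classifier); ok {
		return c.Type()
	}
	if m, ok := doc.(map[string]interface{}); ok {
		if t, ok := m[typeField].(string); ok && t != "" {
			return t
		}
	}
	return defaultType
}

func main() {
	docs := []interface{}{
		Like{CreatedBy: "davinci", Post: "555c3bb06f7a127cda000005"},
		map[string]interface{}{"_type": "like", "post": "555c3bb06f7a127cda000005"},
		map[string]interface{}{"post": "555c3bb06f7a127cda000005"}, // no type information
	}
	for _, d := range docs {
		fmt.Println(resolveType(d, "_type", "_default"))
	}
}
```

A document that resolves to the default type would be handled by `default_mapping`, where the `post` field would presumably fall back to the index-wide "standard" analyzer rather than "keyword".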

Osiloke Emoekpere

May 20, 2015, 12:18:59 PM
to bl...@googlegroups.com
Yes, I do set a field called _type (I learnt my lesson). I'll probably have to dig into the bleve source code and add some log statements to check whether it is actually using the right classifier. If there are other debug options already in place, please let me know.

Thank you

Marty Schoch

May 20, 2015, 12:23:49 PM
to bl...@googlegroups.com

The next step is to use the bleve_dump utility on the index. This should show the terms being indexed, and it should be obvious whether they are the original terms or have been analyzed.

Marty

Osiloke Emoekpere

May 20, 2015, 1:42:04 PM
to bl...@googlegroups.com
I did that, and I do see that the field is indexed, as shown below. Could it be because of the length of the field's content, or the fact that it mixes digits and letters? I have other documents indexed with shorter keys, and I can query their keyword fields easily.

Term: `555c3bb06f7a127cda000005` Field: 5 DocId: `` Frequency: 2 Norm: 0.000000 Vectors: []
Key:   74 05 00 35 35 35 63 33 62 62 30 36 66 37 61 31 32 37 63 64 61 30 30 30 30 30 35 ff                 
Value: 02 00                                                                                               

Term: `555c3bb06f7a127cda000005` Field: 5 DocId: `555c7e536f7a12d766000009` Frequency: 1 Norm: 0.229416 Vectors: [Field: 4 Pos: 1 Start: 0 End 24]
Key:   74 05 00 35 35 35 63 33 62 62 30 36 66 37 61 31 32 37 63 64 61 30 30 30 30 30 35 ff 35 35 35 63 37 65 35 33 36 66 37 61 31 32 64 37 36 36 30 30 30 30 30 39
Value: 01 f5 d7 ab f3 03 04 01 00 18                                                                       

Term: `555c3bb06f7a127cda000005` Field: 5 DocId: `555c7e926f7a12d766000010` Frequency: 1 Norm: 0.229416 Vectors: [Field: 4 Pos: 1 Start: 0 End 24]
Key:   74 05 00 35 35 35 63 33 62 62 30 36 66 37 61 31 32 37 63 64 61 30 30 30 30 30 35 ff 35 35 35 63 37 65 39 32 36 66 37 61 31 32 64 37 36 36 30 30 30 30 31 30
Value: 01 f5 d7 ab f3 03 04 01 00 18
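
The hex `Key:` lines can be decoded mechanically instead of by eye. The sketch below assumes a layout inferred only from the dump itself, not from bleve's documentation: a leading `t` byte (0x74), a two-byte little-endian field number, the term bytes, a 0xff separator, then the document ID. The names `termRow` and `parseTermKey` are hypothetical.

```go
package main

import (
	"bytes"
	"encoding/binary"
	"encoding/hex"
	"errors"
	"fmt"
	"strings"
)

// termRow holds the pieces of one term key from the dump.
type termRow struct {
	Field uint16
	Term  string
	DocID string
}

// sampleKey is the second Key line from the dump above.
const sampleKey = "74 05 00 35 35 35 63 33 62 62 30 36 66 37 61 31 32 37 63 64 61 30 30 30 30 30 35 ff 35 35 35 63 37 65 35 33 36 66 37 61 31 32 64 37 36 36 30 30 30 30 30 39"

// parseTermKey decodes a space-separated hex key under the assumed layout:
// 't', field (uint16, little-endian), term bytes, 0xff, document ID bytes.
func parseTermKey(dump string) (termRow, error) {
	raw, err := hex.DecodeString(strings.Join(strings.Fields(dump), ""))
	if err != nil {
		return termRow{}, err
	}
	if len(raw) < 4 || raw[0] != 't' {
		return termRow{}, errors.New("not a term key")
	}
	field := binary.LittleEndian.Uint16(raw[1:3])
	rest := raw[3:]
	sep := bytes.IndexByte(rest, 0xff)
	if sep < 0 {
		return termRow{}, errors.New("missing 0xff separator")
	}
	return termRow{
		Field: field,
		Term:  string(rest[:sep]),
		DocID: string(rest[sep+1:]),
	}, nil
}

func main() {
	row, err := parseTermKey(sampleKey)
	if err != nil {
		fmt.Println("error:", err)
		return
	}
	// %q makes stray whitespace or case changes in the stored term visible.
	fmt.Printf("field=%d term=%q doc=%q\n", row.Field, row.Term, row.DocID)
}
```

If the decoded term equals the original value byte for byte, as it does for the rows above, the keyword analyzer left the value untouched at index time, which points at the query side instead.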

Osiloke Emoekpere

May 20, 2015, 2:50:25 PM
to bl...@googlegroups.com
I also have entries like this in the dump, with Unicode characters. I would like to know if this is normal.

Backindex DocId: `555cb0576f7a121667000005` Term Entries: [term:"555caf696f7a1215eb000001" field:1  term:"like" field:2  term:"L0:U" field:3  term:"H\006\007*W" field:3  term:"\\\014" field:3  term:"4\003\003U+K\002\\" field:3  term:"8\030\035*\\X\025" field:3  term:"@\014\016U.," field:3  term:"00:U90+@" field:3  term:"D`u*r" field:3  term:"T\030\035" field:3  term:" \001AjUeA.\000\000\000" field:3  term:"$\014\016U.,\np\000\000" field:3  term:",\006\007*W\026\0058\000" field:3  term:"X\001A" field:3  term:"(`u*r`W\000\000" field:3  term:"<\001AjUeA" field:3  term:"P\003\003U" field:3  term:"davinci" field:4  term:"H\006\007*W" field:5  term:"(`u*r`W\000\000" field:5  term:"<\001AjUeA" field:5  term:"00:U90+@" field:5  term:"P\003\003U" field:5  term:" \001AjUeA.\000\000\000" field:5  term:"4\003\003U+K\002\\" field:5  term:"X\001A" field:5  term:"like" field:5  term:"@\014\016U.," field:5  term:"davinci" field:5  term:"\\\014" field:5  term:"$\014\016U.,\np\000\000" field:5  term:"D`u*r" field:5  term:"555caf696f7a1215eb000001" field:5  term:"8\030\035*\\X\025" field:5  term:",\006\007*W\026\0058\000" field:5  term:"T\030\035" field:5  term:"L0:U" field:5 ], Stored Entries: [field:1  field:2  field:3  field:4 ]

Osiloke Emoekpere

May 20, 2015, 3:08:14 PM
to bl...@googlegroups.com
I've narrowed the issue down to a bug in keyword search.
Keywords that start with a number are not matched, while keywords that start with a letter are matched.
I opened a GitHub issue.