Trying to create tokenizer for underscore


Philip O'Toole

May 26, 2015, 1:28:30 AM
to bl...@googlegroups.com
Hello,

I'm using http://analysis.blevesearch.com/analysis, which is very cool, but I'm unable to create a tokenizer that will split, say, "xxx_yyy" on the underscore. The Go code generated by the webpage is below, but the inverse tokenization seems to be happening -- see the attached screenshot. Any ideas?

Thanks,

Philip

func buildIndexMapping() (*bleve.IndexMapping, error) {
    indexMapping := bleve.NewIndexMapping()

    var err error
    err = indexMapping.AddCustomTokenizer("u1t",
        map[string]interface{}{
            "regexp": `_`,
            "type": `regexp`,
        })
    if err != nil {
        return nil, err
    }

    err = indexMapping.AddCustomAnalyzer("u1",
        map[string]interface{}{
            "type": `custom`,
            "char_filters": []interface{}{},
            "tokenizer": `u1t`,
            "token_filters": []interface{}{},
        })
    if err != nil {
        return nil, err
    }

    return indexMapping, nil
}
(attachment: analysis.jpeg)

Marty Schoch

May 26, 2015, 6:38:24 PM
to bl...@googlegroups.com
Somewhat confusingly, the regexp is a pattern that should match the tokens, not the separator characters.

I experimented a bit and came up with this:

"regexp": `[^\W_]+`

Normally `[\w]+` is a pretty simple pattern that matches words, but as you found, this won't split on underscore.  The class `[^\W]` is equivalent to `\w`, but the negated form also lets us exclude the underscore explicitly.

I didn't do extensive testing, but it split "cat_dog" into "cat" and "dog" in the analysis wizard.

marty


Philip O'Toole

May 26, 2015, 6:43:30 PM
to bl...@googlegroups.com
D'oh. I should have paid more attention -- \w matches word characters, so that's why it works. It makes sense to me now why my regex was doing the wrong thing.

OK, let me try that regex -- thanks.

Philip

Philip O'Toole

May 26, 2015, 8:30:12 PM
to bl...@googlegroups.com
Say I want to tokenize on both words and _? In other words, I want to do this:

"foo_a overload" should result in 3 tokens: "foo", "a", and "overload". Is this doable? I've been thinking about it, but the analyzer regex is not immediately apparent to me. Perhaps I need to pre-process the text before indexing, replacing all _ with whitespace. Kind of a hack, but it might work. I was also trying to see if I could get 4 tokens -- "foo", "a", "_", and "overload" -- and then add a token filter to remove _.

Philip

Marty Schoch

May 26, 2015, 9:55:30 PM
to bl...@googlegroups.com
I just tried using the pattern from earlier, `[^\W_]+`, on the input "foo_a overload" and it produces 3 tokens.  The pattern looks for 1 or more characters that are not "not word characters" and not "_" -- in other words, word characters other than "_" (normally _ is a word character).  So in this case, with this example, I think it does work, though it's possible you have other cases in mind that still don't work.

In general you're right: converting "_" to " " using a character filter is another option. I wouldn't say it's a hack; the roles of character filters, tokenizers, and token filters are somewhat flexible. Where to do something is more about making it work than about some rigid rule on the right place.

marty

Philip O'Toole

May 26, 2015, 10:18:26 PM
to bl...@googlegroups.com
OK, thanks Marty. Let me check my work, but I didn't think I saw that. I was trying various different regex patterns, but could have made a mistake.

Philip

Philip O'Toole

May 27, 2015, 1:11:22 PM
to bl...@googlegroups.com
The regex you supplied actually does work -- very well. Thanks again Marty.

Philip