Hi
I wanted to test what the "unicode" tokenizer returns when using it on
Japanese text.
Since the ICU functions seem to depend on the locale, I first generated
a ja_JP.UTF-8 locale, switched to it, and built bleve.
For the actual testing I used
https://github.com/blevesearch/beer-search
and replaced the buildIndexMapping() function in mapping.go with the
one below.
func buildIndexMapping() (*bleve.IndexMapping, error) {
	indexMapping := bleve.NewIndexMapping()
	// Register a custom analyzer: the "unicode" tokenizer followed by
	// the English stop-word token filter.
	err := indexMapping.AddCustomAnalyzer("enWithEdgeNgram325",
		map[string]interface{}{
			"type":      "custom",
			"tokenizer": "unicode",
			"token_filters": []string{
				"stop_en",
			},
		})
	if err != nil {
		fmt.Println(err)
		return nil, err
	}
	return indexMapping, nil
}
When I replaced one of the beer descriptions with a random Japanese text,
I made the following observations.
* Term search only found single characters, e.g. 動 but not 動物園, even
though the latter appeared in the indexed text.
* The same was true for prefix search, even though both searches
worked fine on the English text.
Provided I did not screw something up during my testing, my conclusion is
that the "unicode" tokenizer does not return Japanese compounds, only
single characters. That makes it no more useful than the "whitespace"
tokenizer, which seemed to return the same single characters in my
superficial testing.
Since bleve seems to be modeled after Lucene, I think it would make
sense to add a character-bigram-generating (CJK) token filter as well
as a morphological analyzer for Japanese (GitHub issues already exist
for both). If no one beats me to it, I will give the kagome
morphological analyzer for Japanese a shot.
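To sketch what the bigram approach buys us: Lucene's CJK filter emits
overlapping character bigrams for runs of CJK characters, so a compound
like 動物園 can be matched through its bigrams 動物 and 物園. The function
below is my own minimal illustration of that idea, not bleve's or Lucene's
API:

```go
package main

import (
	"fmt"
	"unicode"
)

// cjkBigrams emits overlapping two-character tokens for adjacent Han
// characters, in the spirit of Lucene's CJK bigram filter. Non-Han
// neighbours are skipped, so only the ideographic runs are bigrammed.
func cjkBigrams(text string) []string {
	runes := []rune(text)
	var out []string
	for i := 0; i+1 < len(runes); i++ {
		if unicode.Is(unicode.Han, runes[i]) && unicode.Is(unicode.Han, runes[i+1]) {
			out = append(out, string(runes[i:i+2]))
		}
	}
	return out
}

func main() {
	fmt.Println(cjkBigrams("動物園")) // [動物 物園]
}
```

With such a filter, a search for 動物園 would be rewritten into the same
bigrams and match the indexed document, at the cost of some false
positives compared to a real morphological analyzer.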
Has anyone tested bleve's "unicode" tokenizer and gotten it to return
compound words on CJK text?
Cheers,
Silvan