Trying ICU unicode tokenization with Japanese text


Silvan Jegen

Sep 9, 2014, 3:47:06 PM
to bl...@googlegroups.com
Hi

I wanted to test what the "unicode" tokenizer returns when using it on
Japanese text.

Since the ICU functions seem to depend on the locale, I first generated
a ja_JP.UTF-8 locale, switched to it, and built bleve.

For the actual testing I used https://github.com/blevesearch/beer-search
replacing the buildIndexMapping() function in mapping.go with the
one below.


func buildIndexMapping() (*bleve.IndexMapping, error) {
  indexMapping := bleve.NewIndexMapping()

  err := indexMapping.AddCustomAnalyzer("enWithEdgeNgram325",
    map[string]interface{}{
      "type":      "custom",
      "tokenizer": "unicode",
      "token_filters": []string{
        "stop_en",
      },
    })
  if err != nil {
    fmt.Println(err)
    return nil, err
  }

  return indexMapping, nil
}


When replacing one of the beer descriptions with a random Japanese text
I made the following observations.

* Term search only found single characters, e.g. 動 but not 動物園, even
  though the latter was in the indexed text.
* The same was true for the prefix search even though the search
worked fine for the English text.

Provided I did not screw something up during my testing, my conclusion is
that the "unicode" tokenizer does not return Japanese compounds but only
single characters. That makes it no more useful than the "whitespace"
tokenizer, which seemed to return the same single characters during my
superficial testing.

Since bleve seems to be modeled after Lucene, I think it would make
sense to add a character-bigram-generating (CJK) token filter as well
as a morphological analyzer for Japanese (for both there already
exists a GitHub issue). If no one beats me to it, I will give the kagome
morphological analyzer for Japanese a shot.
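The character-bigram idea can be sketched outside of bleve. The function below is only an illustration of what such a filter would emit, not bleve's token filter interface or Lucene's actual CJKBigramFilter; it works on runes so multi-byte CJK characters stay intact:

```go
package main

import "fmt"

// bigrams emits overlapping two-character tokens from a run of text.
// Operating on []rune (not bytes) keeps each CJK character whole.
func bigrams(s string) []string {
	runes := []rune(s)
	if len(runes) < 2 {
		// Too short to form a bigram; emit the input as-is.
		return []string{s}
	}
	out := make([]string, 0, len(runes)-1)
	for i := 0; i+1 < len(runes); i++ {
		out = append(out, string(runes[i:i+2]))
	}
	return out
}

func main() {
	fmt.Println(bigrams("動物園")) // [動物 物園]
}
```

Indexing both 動物 and 物園 is what would let a term query for parts of 動物園 match without any real morphological analysis, at the cost of a larger index and some false positives.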

Has anyone tested bleve's "unicode" tokenizer and gotten it to return
compound words on CJK text?


Cheers,

Silvan

Marty Schoch

Sep 9, 2014, 3:58:17 PM
to bl...@googlegroups.com
Thanks for getting this started.  See my responses inline.
On Tue, Sep 9, 2014 at 3:45 PM, Silvan Jegen <s.j...@gmail.com> wrote:
> Hi
>
> I wanted to test what the "unicode" tokenizer returns when using it on
> Japanese text.
>
> Since the ICU functions seem to depend on the locale I first generated
> a ja_JP.UTF-8 locale, switched to it and built bleve.

The ICU tokenizer takes a locale when we instantiate it.  If you look at the separate unicode_th tokenizer, you'll see it explicitly sets the locale to th_TH.

We don't do this for most other languages because it doesn't seem to matter.  This is confirmed by emails I've read about ICU.  However, best practice is still to set the proper locale, as the behavior could change in the future.

In my experience, right now it doesn't matter, as it switches to the dictionary based rules based on the unicode script identified for a particular input.

So, even when leaving it at en_US, it still seems to tokenize Japanese text correctly.
 

> For the actual testing I used https://github.com/blevesearch/beer-search
> replacing the buildIndexMapping() function in mapping.go with the
> one below.
>
>
> func buildIndexMapping() (*bleve.IndexMapping, error) {
>   indexMapping := bleve.NewIndexMapping()
>
>   err := indexMapping.AddCustomAnalyzer("enWithEdgeNgram325",
>     map[string]interface{}{
>       "type":      "custom",
>       "tokenizer": "unicode",
>       "token_filters": []string{
>         "stop_en",
>       },
>     })
>   if err != nil {
>     fmt.Println(err)
>     return nil, err
>   }
>
>   return indexMapping, nil
> }
>
>
> When replacing one of the beer descriptions with a random Japanese text
> I made the following observations.
>
> * Term search only found single characters i.e. 動 but not 動物園 even
>   though the latter was in the indexed text.
> * The same was true for the prefix search even though the search
>   worked fine for the English text.

Can you share with me some of the sample text you indexed, and sample queries you ran?  It will help me reproduce your results.
 
> Provided I did not screw something up during my testing my conclusion is
> that the "unicode" tokenizer does not return Japanese compounds but only
> single characters. That makes it just as useful as the "whitespace"
> tokenizer that seemed to return the same single characters during my
> superficial testing.
>
> Since bleve seems to be modeled after Lucene, I think it would make
> sense to add a character bigram generating (CJK) token filter as well
> as a morphologic analyzer for Japanese to it (for both there already
> exists a Github issue). If no one beats me to it I will give the kagome
> morphologic analyzer for Japanese a shot.
>
> Has anyone tested bleve's "unicode" tokenizer and gotten it to return
> compound words on CJK text?

I need to look more carefully to see where this is going wrong.  But this test case may be relevant:

https://github.com/blevesearch/bleve/blob/master/analysis/tokenizers/unicode_word_boundary/boundary_test.go#L62-L80

It passes the string "こんにちは世界", which I believe is one translation of Hello World; I took it from the Tour of Go example.  The test case shows it getting split into 2 tokens.

I'm working on an interactive utility to let people play around with the text analysis process.  I hope to get it posted tonight.

marty

Silvan Jegen

Sep 9, 2014, 4:15:49 PM
to bl...@googlegroups.com
Sure, please find a patch containing the changes I made to the beer-search
repo attached to this mail.

I simply searched for 動, which was found, and 動物園, which wasn't, using
both the Term and the Prefix search.


> > Provided I did not screw something up during my testing my conclusion is
> > that the "unicode" tokenizer does not return Japanese compounds but only
> > single characters. That makes it just as useful as the "whitespace"
> > tokenizer that seemed to return the same single characters during my
> > superficial testing.
> >
> > Since bleve seems to be modeled after Lucene, I think it would make
> > sense to add a character bigram generating (CJK) token filter as well
> > as a morphologic analyzer for Japanese to it (for both there already
> > exists a Github issue). If no one beats me to it I will give the kagome
> > morphologic analyzer for Japanese a shot.
> >
> > Has anyone tested bleve's "unicode" tokenizer and gotten it to return
> > compound words on CJK text?
> >
> >
> I need to look more carefully to see where this is going wrong. But this
> test case may be relevant:
>
> https://github.com/blevesearch/bleve/blob/master/analysis/tokenizers/unicode_word_boundary/boundary_test.go#L62-L80
>
> It passes the string "こんにちは世界" which I believe is one translation of Hello
> World. I took this from the tour of go example. The test case shows it
> getting split into 2 tokens.

Hm, you are right. When running

go test -tags icu

in the unicode_word_boundary directory, the tests pass for me too. It
seems I must have gotten something wrong.


> I'm working on an interactive utility to let people play around with the
> text analysis process. I hope to get it posted tonight.

Sounds very useful!


Cheers,

Silvan

Attachment: icutesting.patch