Japanese/Chinese/Korean FTS search


Brendan Duddridge

Jun 16, 2017, 8:21:46 PM
to Couchbase Mobile
Has anyone solved the problem with searching Japanese, Chinese, or Korean words with FTS in Couchbase Lite?

This looked like a potential solution:


Has anyone done anything like this with CBL?

Thanks,

Brendan

Jens Alfke

Jun 16, 2017, 8:55:22 PM
to mobile-c...@googlegroups.com

On Jun 16, 2017, at 5:21 PM, Brendan Duddridge <bren...@gmail.com> wrote:

Has anyone solved the problem with searching Japanese, Chinese, or Korean words with FTS in Couchbase Lite?

Apple has some APIs for breaking text apart into words (NSLinguisticTagger) and in iOS 11 / macOS 10.13 they’re extending them — using machine-learning data sets — to do more work like stemming. There was a good session at last week’s WWDC describing these.

Using these in CBL would mean replacing the sqlite3_unicodesn tokenizer with a new tokenizer that calls those APIs. For 2.0, I have an open issue in LiteCore to add a plugin API that would allow replacing the FTS tokenizer.

—Jens

Brendan Duddridge

Jun 16, 2017, 11:48:57 PM
to Couchbase Mobile
Interesting. I never heard of NSLinguisticTagger.

I gave it a try, but it didn't work the way I'd hoped. I was able to get it to split a sentence into words, but it didn't work for Japanese.

I tried this Japanese word:

木箱

According to Google Translate, it means "wooden box".

But NSLinguisticTagger recognized it as a single word rather than two, so searching for just 箱 returned nothing.

My idea was that if I could split the text into words and join them with spaces, I would be able to search for words embedded in unbroken runs of characters in languages like Japanese, Chinese, and Korean.

Here was the code I tried:

// Tag textValue; tokenRanges receives one NSValue-wrapped NSRange per token.
NSArray *tokenRanges = nil;
NSArray *wordTypes = [textValue linguisticTagsInRange:NSMakeRange(0, [textValue length])
                                               scheme:NSLinguisticTagSchemeNameTypeOrLexicalClass
                                              options:NSLinguisticTaggerOmitOther | NSLinguisticTaggerOmitPunctuation | NSLinguisticTaggerOmitWhitespace
                                          orthography:nil
                                          tokenRanges:&tokenRanges];

if (wordTypes.count > 0) {
    NSMutableArray *words = [NSMutableArray array];
    for (NSValue *tokenRange in tokenRanges) {
        NSRange range = [tokenRange rangeValue];
        // The ranges refer to textValue, so take the substring from that same string.
        NSString *word = [textValue substringWithRange:range];
        [words addObject:word];
    }
    // Emit the words as one space-separated string.
    [emitValues addObject:[words componentsJoinedByString:@" "]];
}



So either I'm using it wrong, or it won't be able to split up Asian language words as I hoped.

Thanks,

Brendan

Jens Alfke

Jun 16, 2017, 11:55:05 PM
to mobile-c...@googlegroups.com

On Jun 16, 2017, at 8:48 PM, Brendan Duddridge <bren...@gmail.com> wrote:

I thought I'd give it a try, but it didn't work how I'd hoped it would work.  I was able to get it to split words from a sentence, but it didn't work for Japanese.

Yes, that’s part of what's being added in iOS 11 / macOS 10.13. Watch the session.

—Jens

Brendan Duddridge

Jun 17, 2017, 12:30:27 PM
to Couchbase Mobile

Jens Alfke wrote:

Yes, that’s part of what's being added in iOS 11 / macOS 10.13. Watch the session.

—Jens

I watched the session. Unfortunately, they didn't really touch on Asian languages. However, although my sample above failed to split the two characters apart, a larger example seemed to work.

From the Japanese marketing text for my app:

1つのアプリでどうやってあらゆるものを整理できるのでしょうか? それは、33種類の内蔵テンプレートを使用・カスタマイズできるだけでなく、あらゆる種類の情報を入力するための自分だけの「フォーム」を作成することができるからなのです。写真や落書き、音声の録音、計算、添付ファイル、評価、さらには他のフォームへのリンクであっても、どんな情報でも構いません。あたかも自分専用にカスタマイズされた整理アプリを構築するかのようです。


My original code managed to split the above text into this:

1 アプリ どう やっ あらゆる もの 整理 できる でしょう それ 33 種類 内蔵 テンプレート 使用 カスタマイズ できる だけ なく あらゆる 種類 情報 入力 する ため 自分 だけ フォーム 作成 する こと できる から です 写真 落書き 音声 録音 計算 添付 ファイル 評価 さらに フォーム リンク あっ どんな 情報 構い ませ あたかも 自分 専用 カスタマイズ 整理 アプリ 構築 する よう です


I have no idea whether these word boundaries make sense to Japanese customers, though. I've sent the output above to my Japanese translator to ask him. If they do, I'll just emit the space-separated text using emit(CBLTextKey(valueToEmit), doc[@"form"]); as I'm already doing.

Thanks for pointing me to NSLinguisticTagger. I think it may be a solution that doesn't require modifying CBL or providing a new tokenizer. Just emitting the space-separated words should work well enough for my use case.
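
For anyone following along, here is a minimal sketch of that approach against the Couchbase Lite 1.x Objective-C API. The helper names (SpaceSeparatedWords, InstallWordView, SearchWords), the view name "formText", and the "text" document property are made up for illustration. The calls taken from the existing APIs are the NSString linguistic-tagging category, CBLTextKey, setMapBlock:version:, and fullTextQuery. It uses the same tagging scheme and options as my earlier code, so it has the same limitation: 木箱 stays a single token, and a search for 箱 alone will still miss it.

```objc
#import <Foundation/Foundation.h>
#import <CouchbaseLite/CouchbaseLite.h>

// Hypothetical helper: returns the words of `text` joined by single spaces.
static NSString *SpaceSeparatedWords(NSString *text) {
    NSMutableArray *words = [NSMutableArray array];
    NSLinguisticTaggerOptions options = NSLinguisticTaggerOmitOther
                                      | NSLinguisticTaggerOmitPunctuation
                                      | NSLinguisticTaggerOmitWhitespace;
    // Block-based variant of linguisticTagsInRange:...; it hands us each
    // token range directly, so no parallel arrays are needed.
    [text enumerateLinguisticTagsInRange:NSMakeRange(0, text.length)
                                  scheme:NSLinguisticTagSchemeNameTypeOrLexicalClass
                                 options:options
                             orthography:nil
                              usingBlock:^(NSString *tag, NSRange tokenRange,
                                           NSRange sentenceRange, BOOL *stop) {
        [words addObject:[text substringWithRange:tokenRange]];
    }];
    return [words componentsJoinedByString:@" "];
}

// Hypothetical setup: a view whose full-text key is the pre-split text.
static CBLView *InstallWordView(CBLDatabase *database) {
    CBLView *view = [database viewNamed:@"formText"];      // assumed view name
    [view setMapBlock:^(NSDictionary *doc, CBLMapEmitBlock emit) {
        NSString *text = doc[@"text"];                     // assumed property
        if (![text isKindOfClass:[NSString class]]) {
            return;
        }
        NSString *valueToEmit = SpaceSeparatedWords(text);
        if (valueToEmit.length > 0) {
            emit(CBLTextKey(valueToEmit), doc[@"form"]);
        }
    } version:@"1"];                                       // bump when the map logic changes
    return view;
}

// Hypothetical lookup: logs the IDs of documents whose indexed text matches `term`.
static void SearchWords(CBLView *view, NSString *term) {
    CBLQuery *query = [view createQuery];
    query.fullTextQuery = term;
    NSError *error = nil;
    CBLQueryEnumerator *rows = [query run:&error];
    if (!rows) {
        NSLog(@"Full-text query failed: %@", error);
        return;
    }
    for (CBLQueryRow *row in rows) {
        NSLog(@"Match in document %@", row.documentID);
    }
}
```

With the marketing text above indexed this way, SearchWords(view, @"整理") should find the document, since 整理 came out as its own token. I haven't checked how the stock tokenizer treats the remaining unsplit runs, so treat this as a starting point rather than a tested solution.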

Thanks,

Brendan