FTS substring search


Brendan Duddridge

Apr 20, 2017, 6:14:47 PM
to Couchbase Mobile
I just stumbled across this GitHub project:


It tokenizes every character in the FTS index allowing you to do substring searches. Perfect for languages such as Chinese, Japanese, and Korean where words aren't necessarily bounded by spaces.
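To make the idea concrete, here is a minimal pure-Python sketch of why indexing every character as its own token allows substring matching. It is not the GitHub project's code and does not use SQLite; the names `build_char_index` and `substring_search` and the sample documents are made up for illustration. A substring query becomes a phrase query over consecutive character tokens.

```python
from collections import defaultdict


def build_char_index(docs):
    """Map each character to the set of (doc_id, position) pairs where it occurs."""
    index = defaultdict(set)
    for doc_id, text in docs.items():
        for pos, ch in enumerate(text):
            index[ch].add((doc_id, pos))
    return index


def substring_search(index, query):
    """Return ids of documents that contain `query` as consecutive character tokens."""
    if not query:
        return set()
    # Start from every occurrence of the first character...
    candidates = index.get(query[0], set())
    # ...and keep only those where each later character sits at the expected offset.
    for offset, ch in enumerate(query[1:], start=1):
        postings = index.get(ch, set())
        candidates = {(d, p) for (d, p) in candidates if (d, p + offset) in postings}
    return {d for d, _ in candidates}


# Hypothetical sample documents: two Japanese strings without spaces, one English.
docs = {
    1: "東京都に住んでいます",
    2: "京都は古い町です",
    3: "full text search",
}
index = build_char_index(docs)
print(substring_search(index, "京都"))
print(substring_search(index, "text"))
```

A word-based tokenizer would treat each Japanese string above as one or two long tokens, so a query for 京都 would not match inside 東京都. With one token per character, the match falls out of ordinary phrase matching.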

Where would be the best place to stick this in Couchbase Lite so I can use it? Or perhaps it could be added to the master repo so anyone could use it (optionally)? I realize it would increase the size of the FTS index. But it definitely solves a big problem for those languages.

Thanks,

Brendan

Jens Alfke

Apr 21, 2017, 5:13:27 PM
to mobile-c...@googlegroups.com

On Apr 20, 2017, at 3:14 PM, Brendan Duddridge <bren...@gmail.com> wrote:

> It tokenizes every character in the FTS index allowing you to do substring searches. Perfect for languages such as Chinese, Japanese, and Korean where words aren't necessarily bounded by spaces.

That sounds like it would be space-intensive, since the number of tokens in every string is multiplied by something like five (for English). On the other hand, there are far fewer distinct tokens in the database … but that could mean a lot of false positives when it looks up rows containing tokens.
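The "something like five" estimate is roughly the average English word length, which can be checked with a quick sketch. This is only an illustration in plain Python on a made-up sample sentence; the helpers `word_tokens` and `char_tokens` are hypothetical and do not reflect how SQLite's FTS or the unicodesn tokenizer actually splits text.

```python
import re


def word_tokens(text):
    """One token per run of word characters (a rough stand-in for a word tokenizer)."""
    return re.findall(r"\w+", text.lower())


def char_tokens(text):
    """One token per word character (a rough stand-in for a per-character tokenizer)."""
    return re.findall(r"\w", text.lower())


# Made-up sample sentence, for illustration only.
sample = "Full text search lets you find documents by the words they contain"

words = word_tokens(sample)
chars = char_tokens(sample)
ratio = len(chars) / len(words)

print(len(words), "word tokens")
print(len(chars), "character tokens")
print(f"about {ratio:.1f}x as many tokens per string")
```

The number of distinct character tokens is bounded by the alphabet, so in a large English database each token's posting list gets very long, which is where the false-positive concern comes from. For Chinese or Japanese the character set is much larger, so the posting lists are shorter.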

I don’t know much about the type of index SQLite uses for FTS. Have you tried this out yourself? It might work better with Chinese and Japanese which have a huge number of characters but very short words.

> Where would be the best place to stick this in Couchbase Lite so I can use it?

Our current tokenizer is registered with SQLite in a function called register_unicodesn_tokenizer; you can look at that function and its call site. The name it’s registered as is “unicodesn”, so you can look at the SQL query in CBLSQLiteViewStorage.m which uses it.

(That’s for 1.x. I assume you’re not ready to try this out in 2.0 yet.)

—Jens

Brendan Duddridge

Apr 21, 2017, 6:15:45 PM
to Couchbase Mobile
Hi Jens,

I could try enabling it only for Chinese, Japanese, and Korean and see how it fares.

As for 2.0, not until CBLModel is implemented, and of course there's the replication compatibility issue.

Brendan