FTS3 tokenizer for case insensitive non-ASCII searches?


Brendan Duddridge

Dec 3, 2012, 5:23:53 PM
to sqlc...@googlegroups.com
I have some Russian users of my app who cannot do case-insensitive searches. I use the FTS3 search engine in my SQLCipher build, but it seems to support case-insensitive searches only for ASCII characters.

Does anyone know of a tokenizer that works with SQLCipher in iOS and Mac apps and provides case-insensitive search for non-ASCII character sets?

Thanks,

Brendan

Nick Parker

Dec 4, 2012, 8:46:47 AM
to sqlc...@googlegroups.com
Hi Brendan,

You might consider building SQLCipher with ICU and then specifying an ICU locale identifier when creating your virtual table(s) [1].
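
For anyone following along, here is a minimal sketch of what the ICU suggestion could look like, written against Python's built-in sqlite3 module rather than SQLCipher itself. The `tokenize=icu <locale>` form is the one described in the SQLite FTS3 documentation; the table name `docs_icu`, the column name `body`, and the `ru_RU` locale are assumptions chosen for illustration. Most stock SQLite builds are not compiled with SQLITE_ENABLE_ICU, so the function reports failure instead of raising.

```python
import sqlite3


def create_icu_table(conn, locale="ru_RU"):
    """Try to create an FTS3 table that uses the ICU tokenizer.

    Returns True if the table was created, False if this SQLite build
    has no FTS3 module or no ICU tokenizer (or the statement fails for
    another operational reason, such as the table already existing).
    """
    try:
        # Hypothetical table and column names; the locale follows the
        # tokenizer name, separated by a space.
        conn.execute(
            f"CREATE VIRTUAL TABLE docs_icu USING fts3(body, tokenize=icu {locale})"
        )
        return True
    except sqlite3.OperationalError:
        return False


conn = sqlite3.connect(":memory:")
print("ICU tokenizer available:", create_icu_table(conn))
conn.close()
```

If this prints False, the library in use was most likely built without ICU, which is the build problem discussed further down the thread.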


Nick Parker

Alexey Illarionov

Dec 4, 2012, 1:30:13 PM
to SQLCipher Users
Hi,
I have backported the SQLite 3.7.13 unicode61 tokenizer for my Android
application [1].
I'm not sure about iOS. You can still try ICU or the Snowball tokenizer [2].

1. https://github.com/littlesavage/sqlite3-unicodesn
2. https://bitbucket.org/sevkin/snowball_fts3/

Brendan Duddridge

Dec 4, 2012, 3:31:41 PM
to sqlc...@googlegroups.com
Hi Nick,

I tried to compile ICU into SQLCipher but I wasn't successful. It also looked like it would add about 25 MB to my binary, which wasn't cool. But maybe that's because I was doing it wrong.

If you've compiled ICU into SQLCipher I'd love to see your build configuration.

Thanks,

Brendan

Billy Gray

Dec 6, 2012, 4:29:37 PM
to sqlc...@googlegroups.com
Hi Brendan,

I took a quick look at this because we wanted to provide diacritic-insensitive (or sensitive?) searches (e.g. searching for u should match ü), but I didn't see a particularly straightforward way to get ICU built statically, so I put it aside for the time being.

B
--
Team Zetetic
http://zetetic.net

Brendan Duddridge

Dec 16, 2012, 9:54:17 PM
to sqlc...@googlegroups.com
Hello Billy,

If you ever get a chance to try again, please do let the group know. I know this would be very useful for many users.

Thanks,

Brendan

Billy Gray

Dec 17, 2012, 12:26:13 PM
to sqlc...@googlegroups.com
I won't have the time to look into this for the foreseeable future. Also, I'm not confident that the ICU library and extension to SQLite will give me the search and matching capabilities I'm interested in. If you have the opportunity to look into it yourself, let us know what you find!

Regards,
Billy

Alexey Illarionov

Dec 17, 2012, 12:42:37 PM
to SQLCipher Users

On Dec 5, 00:31, Brendan Duddridge <brend...@gmail.com> wrote:
> Hi Nick,
>
> I tried to compile ICU into SQLCipher but I wasn't successful with it. Plus
> it looked like it would add about 25 MB to my binary which wasn't cool. But
> maybe because I was doing it wrong.
> If you've compiled ICU into SQLCipher I'd love to see your build
> configuration.

unicode61 does not depend on ICU, and it's already in the SQLCipher core.
Try building with SQLITE_ENABLE_FTS4_UNICODE61 defined.
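
As a rough illustration of the difference this makes, the sketch below indexes a Cyrillic phrase and searches for it in a different case, once with the default "simple" tokenizer and once with "unicode61". It uses Python's built-in sqlite3 module as a stand-in for a SQLCipher build; the table name `notes`, the column `body`, and the sample text are made up for the example. Whether FTS3 and unicode61 are available depends on how the underlying SQLite library was compiled, so the helper returns None when they are not.

```python
import sqlite3


def cyrillic_search_matches(tokenizer):
    """Index 'Привет мир' and search for lowercase 'привет'.

    Returns True or False depending on whether the row is found, or
    None if this SQLite build lacks FTS3 or the requested tokenizer.
    """
    conn = sqlite3.connect(":memory:")
    try:
        # Hypothetical table and column names, for illustration only.
        conn.execute(
            f"CREATE VIRTUAL TABLE notes USING fts3(body, tokenize={tokenizer})"
        )
        conn.execute("INSERT INTO notes(body) VALUES (?)", ("Привет мир",))
        rows = conn.execute(
            "SELECT body FROM notes WHERE notes MATCH ?", ("привет",)
        ).fetchall()
        return len(rows) == 1
    except sqlite3.OperationalError:
        return None
    finally:
        conn.close()


for name in ("simple", "unicode61"):
    print(name, cyrillic_search_matches(name))
```

The "simple" tokenizer only folds ASCII letters to lowercase, so the capitalized Cyrillic token does not match the lowercase query there, while unicode61 applies Unicode case folding to both.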

Brendan Duddridge

Dec 17, 2012, 4:12:51 PM
to sqlc...@googlegroups.com
Hello Alexey,

So far this seems to work for searching Cyrillic words in a case-insensitive way. That's great!

Do you know if it works with Chinese and Japanese words too? There are different rules for word boundaries in those languages.

Thanks!

Brendan

Brendan Duddridge

Dec 17, 2012, 4:19:23 PM
to sqlc...@googlegroups.com
Ok, I found the answer here:


"The "unicode61" tokenizer is available beginning with SQLite version 3.7.13. Unicode61 works very much like "simple" except that it does full unicode case folding according to rules in Unicode Version 6.1 and it recognizes unicode space and punctuation characters and uses those to separate tokens."

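So unicode61 only splits tokens on spaces and punctuation. The ICU tokenizer does word boundary matching, which is what Chinese and Japanese text would need.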
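
A small sketch of what that limitation means in practice, again using Python's sqlite3 module as a stand-in for SQLCipher: with unicode61, a run of Chinese characters with no spaces or punctuation is indexed as a single token, so searching for a word inside that run finds nothing. The table name `docs`, the column `body`, and the sample strings are assumptions for illustration, and the helper returns None if FTS3 or unicode61 is missing from the SQLite build.

```python
import sqlite3


def unicode61_matches(body, query):
    """Index `body` with unicode61 and report whether `query` matches.

    Returns True or False, or None if FTS3 or the unicode61 tokenizer
    is not available in this SQLite build.
    """
    conn = sqlite3.connect(":memory:")
    try:
        # Hypothetical table and column names, for illustration only.
        conn.execute(
            "CREATE VIRTUAL TABLE docs USING fts3(body, tokenize=unicode61)"
        )
        conn.execute("INSERT INTO docs(body) VALUES (?)", (body,))
        rows = conn.execute(
            "SELECT body FROM docs WHERE docs MATCH ?", (query,)
        ).fetchall()
        return len(rows) > 0
    except sqlite3.OperationalError:
        return None
    finally:
        conn.close()


# No separators: the whole run of characters becomes one token.
print(unicode61_matches("我喜欢数据库", "数据库"))
# With a space inserted, the word is its own token and can be found.
print(unicode61_matches("我喜欢 数据库", "数据库"))
```

An app that has to search unsegmented Chinese or Japanese text would need a tokenizer that does real word segmentation, such as ICU, or would have to segment the text itself before indexing.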


Brendan

Billy Gray

Dec 17, 2012, 5:07:27 PM
to sqlc...@googlegroups.com
Nice find, guys!

Lubos Staracek

Nov 28, 2013, 11:04:15 AM
to sqlc...@googlegroups.com
Hello Alexey,
Could you please provide a short tutorial on how to load your sqlite3-unicodesn extension on Android? I had no luck with it.
Thanks,
Lubos

On Tuesday, December 4, 2012 19:30:13 UTC+1, Alexey Illarionov wrote: