Fulltext Search Discussion (SQlite FTS)

80 views
Skip to first unread message

Vincent Breitmoser

unread,
Feb 11, 2016, 6:42:37 PM2/11/16
to K-9 Mail Developers List
Before the database switch, fulltext search was possible by way of a simple
LIKE comparison on the text_content column for each mail. Since mails are
now stored in their individual mime parts, this is no longer feasible. On top of
that, LIKE queries are very slow since they require a linear scan of all data,
so it's about time we did this properly.

What we want to do here is make use of the SQLite Fulltext Search (FTS)
extensions which come with sqlite on android to create a reverse index on
the text data to the ids of emails: https://www.sqlite.org/fts3.html

The simplest way to implement this is to use a simple virtual fts table, and
put the fulltext of all mails in there. However, this has the large drawback
that we need to store the full text of all e-mails redundantly:
+ simple to implement
+ allows display of snippets in search results
− all email text is stored twice (in addition to the reverse index)

FTS4 allows creation of "contentless" fts tables. These do not store the content
redundantly, and come in two flavors: true contentless and external content.

True contentless is basically the naked reverse index, which works very well
for documents which are immutable in nature (such as emails), since entries
cannot be updated or deleted. The lack of updates is not a large problem if
we only insert fully downloaded text data into the search index, the lack of
deletion can be mitigated by using an autoincrement primary key on the
messages table, which will leave deleted mails in the index but they aren't
a problem and we can get rid of them by rebuilding the index every once in
a while.
+ no redundant data
− no snippets
− more difficult to implement
− deleted mails leave orphaned data, getting rid of that requires rebuilding the index

For completeness, there are also "external content" fts4 tables, which are
contentless themselves but can look up data from a connected table. If we
wanted to use this directly, we would have to store the part data in a format
which is suitable for tokenization. Custom tokenizers are infeasible, so the
only option would be storing the text as raw UTF-8, which means either
redundancy or we don't store the original mail anymore, which means trouble
for signed data. Another option here might be dropping the preview column
of messages in favor of a text_content one and using that for both indexing
and preview purposes, getting rid of *some* redundancy.

There are some tradeoffs and decisions here which may be difficult to change
down the line, so we should put some thought into this. Despite that, I'd like
to move forward with an implemention soon. If anyone is familiar with sqlite
fts and wants to weigh in or has alternative ideas, feedback would be much
appreciated.

 - V

cketti

unread,
Feb 19, 2016, 10:56:19 PM2/19/16
to k-9...@googlegroups.com
Thanks a lot for the excellent overview, Vincent.

I propose to go ahead with the first option. It sounds like the easiest
to implement.

If the overhead of the search data turns out to be problematic in
practice we can think about optimizations. Maybe that can be combined
with supporting search in encrypted messages. We don't want to add
decrypted text to a plain text search index. One way to work around that
is by first decrypting then searching in the decrypted mail body
ourselves. That code path could optionally be used for all messages. As
option for users who don't use the search functionality very often
and/or don't want to pay the storage cost for a search index.
Reply all
Reply to author
Forward
0 new messages