Before the database switch, fulltext search was possible by way of a simple
LIKE comparison on the text_content column for each mail. Since mails are
now stored in their individual mime parts, this is no longer feasible. On top of
that, LIKE queries are very slow since they require a linear scan of all data,
so it's about time we did this properly.
What we want to do here is make use of the SQLite Fulltext Search (FTS)
extensions which come with sqlite on android to create a reverse index on
The simplest way to implement this is to use a simple virtual fts table, and
put the fulltext of all mails in there. However, this has the large drawback
that we need to store the full text of all e-mails redundantly:
+ simple to implement
+ allows display of snippets in search results
− all email text is stored twice (in addition to the reverse index)
FTS4 allows creation of "contentless" fts tables. These do not store the content
redundantly, and come in two flavors: true contentless and external content.
True contentless is basically the naked reverse index, which works very well
for documents which are immutable in nature (such as emails), since entries
cannot be updated or deleted. The lack of updates is not a large problem if
we only insert fully downloaded text data into the search index, the lack of
deletion can be mitigated by using an autoincrement primary key on the
messages table, which will leave deleted mails in the index but they aren't
a problem and we can get rid of them by rebuilding the index every once in
a while.
+ no redundant data
− no snippets
− more difficult to implement
− deleted mails leave orphaned data, getting rid of that requires rebuilding the index
For completeness, there are also "external content" fts4 tables, which are
contentless themselves but can look up data from a connected table. If we
wanted to use this directly, we would have to store the part data in a format
which is suitable for tokenization. Custom tokenizers are infeasible, so the
only option would be storing the text as raw UTF-8, which means either
redundancy or we don't store the original mail anymore, which means trouble
for signed data. Another option here might be dropping the preview column
of messages in favor of a text_content one and using that for both indexing
and preview purposes, getting rid of *some* redundancy.
There are some tradeoffs and decisions here which may be difficult to change
down the line, so we should put some thought into this. Despite that, I'd like
to move forward with an implemention soon. If anyone is familiar with sqlite
fts and wants to weigh in or has alternative ideas, feedback would be much
appreciated.
- V