Thunderbird 3.0 (RC1 coming soon, but existing nightlies and betas have
it too) has a global database that lives in global-messages-db.sqlite.
It uses SQLite's FTS3 fulltext search engine, which uses an inverted
index internally.
If you have a specific use case in mind, I can provide more input.
Andrew
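For readers unfamiliar with the term, here is a minimal sketch of what an inverted index is: a map from each word to the set of documents that contain it. This is a toy illustration only, not how FTS3 stores its data; the function name and the sample documents are made up for the example.

```python
from collections import defaultdict

def build_inverted_index(docs):
    """Map each lowercased word to the set of document ids that contain it."""
    index = defaultdict(set)
    for doc_id, text in docs.items():
        for word in text.lower().split():
            index[word].add(doc_id)
    return index

# Hypothetical sample "messages", keyed by an arbitrary id.
docs = {
    1: "index rebuild takes time",
    2: "fulltext search uses an index",
    3: "thunderbird stores mail",
}
index = build_inverted_index(docs)
print(sorted(index["index"]))  # [1, 2]
```

Looking up a word is then a single dictionary access instead of a scan over every message.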
Hi Andrew
I just wonder how big this DB might get?
Rolf
Well, my file (global-messages-db.sqlite) has reached 422MiB. ;)
John.
--
John Beranek To generalise is to be an idiot.
http://redux.org.uk/ -- William Blake
It has no inherent sizing constraints.
Andrew
But it seems proportional (in some fashion) to the total quantity of
messages.
On my system, for example, Windows Explorer reports "Size on disk" for the
Mail folder as 3.14 GB, and global-messages-db.sqlite is 937 MB.
Might be interesting to plot a graph to serve as a guide....
-- Russell
Qiyuan
Only Thunderbird 3 has the "gloda" global database with inverted index.
On a reasonably modern system, we index something like 10-25 messages
per second. We are doing more than just building a fulltext index, and
we are trying to avoid making Thunderbird unresponsive, which is why that
number is not higher. Note that we require the messages to be available
offline before we will index their bodies.
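As a rough worked example of what that rate means for a rebuild, the sketch below turns the 10-25 messages per second figure into a time range. The function name and the 5000-message mailbox are hypothetical, and the estimate assumes every message body is already available offline.

```python
def reindex_time_range(message_count, slow_rate=10.0, fast_rate=25.0):
    """Return (best_case_seconds, worst_case_seconds) for indexing message_count messages.

    The default rates are the 10-25 messages/second quoted above.
    """
    return message_count / fast_rate, message_count / slow_rate

# Hypothetical mailbox of 5000 messages.
best, worst = reindex_time_range(5000)
print(f"{best:.0f}-{worst:.0f} seconds")  # 200-500 seconds
```

So a mailbox with a few thousand messages should reindex in minutes, not hours, if the quoted rate holds.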
Thunderbird's nsMsgSearch-style search implementation (which does not
use the SQLite inverted index) supports building complex boolean decision
trees. This presumably includes Thunderbird 2.0.0.x. Unfortunately, there
was a serious bug involving the grouping flags which has only recently
been fixed. So you basically need Thunderbird 3 even for that.
The SQLite FTS3 query parser has two modes of operation. We are using
the traditional mode of operation which has some limitations. There is
documentation on the limitations/features of the various modes in
"ext/fts3/README.syntax" if you check a source distribution of SQLite.
Our fulltext query builder can be found in:
http://mxr.mozilla.org/comm-central/source/mailnews/db/gloda/modules/msg_search.js
It is somewhat limited in nature; it definitely does not try to build a
query as complex as your example. But it would hopefully provide you
with a starting point for anything you might want to do.
Andrew
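For a query of the form (A | A') & (B | B'), one approach that does not depend on which FTS3 parser mode is compiled in is to run one simple OR query per group and combine the groups with SQL INTERSECT. The sketch below uses Python's sqlite3 module and assumes its SQLite build has FTS3 enabled. The table name, the sample rows, and the helper function are all made up; this is not gloda's schema or its query builder. Keywords are assumed to be plain single words.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
# Hypothetical one-column fulltext table; gloda's real schema is different.
conn.execute("CREATE VIRTUAL TABLE msgs USING fts3(body)")
conn.executemany(
    "INSERT INTO msgs(docid, body) VALUES (?, ?)",
    [
        (1, "apple banana"),
        (2, "apricot blueberry"),
        (3, "apple cherry"),
        (4, "banana only"),
    ],
)

def match_both_groups(conn, group_a, group_b):
    """Return docids matching (any word in group_a) AND (any word in group_b)."""
    sql = (
        "SELECT docid FROM msgs WHERE msgs MATCH ? "
        "INTERSECT "
        "SELECT docid FROM msgs WHERE msgs MATCH ? "
        "ORDER BY 1"
    )
    # "x OR y" means the same thing in both FTS3 query syntaxes.
    params = (" OR ".join(group_a), " OR ".join(group_b))
    return [row[0] for row in conn.execute(sql, params)]

print(match_both_groups(conn, ["apple", "apricot"], ["banana", "blueberry"]))  # [1, 2]
```

Keeping each MATCH expression to a flat OR list avoids relying on operator precedence, which differs between the two parser modes.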
Qiyuan
On Nov 7, 5:20 PM, Andrew Sutherland <sombr...@alum.mit.edu> wrote:
> On 11/07/2009 03:57 PM, qiyuan Tang wrote:
>
> > Hi Andrew, Do you mean that Thunderbird 2.0.0.21 doesn't support this
> > sqlite? In Thunderbird 3.0, if I lost the inverted index, how long
> > will it take to rebuild the index if I have thousands of emails?
> > I want to do search like (A | A')& (B | B') . A, A', B, B' are search
> > keywords. As far as I know, Thunderbird doesn't support such query.
> > But if I have the inverted index, this kind of search would be easy.
> > Am I right? Thank you.
As far as I know, FTS3 does not provide an easy way to get at the word
list. You might be able to reach directly into the
"FULLTEXTTABLENAME_segments" table that actually stores the
representations, but I think it uses an optimized encoding that would be
a nightmare to deal with if you're not using the FTS3 code for that
directly.
If you find a way to get at the word list, I would love to know... it
would allow us to do "did you mean..."-style suggestions.
It may also be possible to request that FTS3 expose some of the
functionality if we have a helpful enough use-case that we can justify.
Andrew
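On the word list: later SQLite releases added an "fts4aux" virtual table that exposes the terms stored in an FTS3/FTS4 index, so decoding the segments table by hand is not necessary there. The sketch below assumes Python's sqlite3 module is linked against such a release with FTS3 enabled. Whether the SQLite bundled with a given Thunderbird build has it would need checking. The table names and sample rows are made up.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
# Hypothetical fulltext table standing in for the real message table.
conn.execute("CREATE VIRTUAL TABLE msgs USING fts3(body)")
conn.executemany(
    "INSERT INTO msgs(body) VALUES (?)",
    [("hello world",), ("hello thunderbird",)],
)
conn.commit()  # make sure pending terms are written to the index

# fts4aux takes the name of the FTS table whose index it should expose.
conn.execute("CREATE VIRTUAL TABLE msgs_terms USING fts4aux(msgs)")
rows = conn.execute(
    "SELECT term, documents, occurrences FROM msgs_terms "
    "WHERE col = '*' ORDER BY term"
).fetchall()
for term, documents, occurrences in rows:
    print(term, documents, occurrences)
```

A term list with per-term document counts like this is the raw material a "did you mean..." feature would need.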
And what about latent semantic indexing?
This sounds like a very cool add-on!
Without having seen your extension it's hard to say for sure how
difficult it would be. In general, I think moving to Thunderbird 3 is
probably a good idea and the benefits you gain would likely outweigh the
cost. Having said that, if this is a research project that you are not
hoping to have used in the real world but rather just need a workable
research platform and you already have that with 2.0.0.x, it may be
worth just sticking with it.
A lot of the JavaScript support code related to the thread pane that
lists messages has been refactored. If you had to do any hacky things
to get your results displayed before, they should hopefully no longer be
required. We have abstractions used by the built-in global search to
take its fulltext results and display them, and it should be pretty easy
to leverage.
From an XPCOM interface perspective, the MailNews interfaces and events
remain largely unchanged except for some extra documentation. The
message headers still work like they always did.
For binary components, I think a lot of the string handling may have
changed, but I am not an expert on that. (Although some binary
components may be required if you want to plug into SQLite's FTS3 at a
low level, you hopefully won't need them otherwise.) I would check out
developer.mozilla.org and see what it has to say on the various
subjects; I think there are guides to upgrading code for the Mozilla
platform in general.
In terms of support/feedback, we're much more able to provide
information on how to make things work with 3.0, whereas most of us
probably have no idea how to make things work with 2.0.
Andrew