
Is there index for E-mail Messages in Thunderbird?


qiyuan Tang

Nov 4, 2009, 7:38:08 PM
I'm wondering if Thunderbird builds an index for all the offline messages
in a folder such that each word in a message is bound to some message
IDs. For example, if both message A and message B have the word
"Internet", then the relationship between "Internet" and the messages
containing it (A and B) is established somewhere, so that if we search for
the keyword "Internet", Thunderbird knows which messages correspond to
it. Does Thunderbird do so? If so, where is this inverted index? I can't
find it in the Thunderbird directory. Can someone help me figure this
out? Thank you very much.
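
To make the question concrete, here is a minimal sketch of the kind of inverted index being described, written in Python. The message IDs, bodies, and the function name `build_inverted_index` are hypothetical and only illustrate the idea; this is not how Thunderbird stores anything.

```python
from collections import defaultdict
import re


def build_inverted_index(messages):
    """Map each lowercased word to the set of message IDs containing it."""
    index = defaultdict(set)
    for msg_id, body in messages.items():
        for word in re.findall(r"[a-z0-9']+", body.lower()):
            index[word].add(msg_id)
    return index


# Hypothetical message IDs and bodies, for illustration only.
messages = {
    "A": "The Internet changed how we send mail.",
    "B": "Searching the Internet for Thunderbird tips.",
    "C": "Offline folders keep local copies.",
}
index = build_inverted_index(messages)

# Looking up a keyword is now a single dictionary access.
print(sorted(index["internet"]))
```

Searching for "Internet" then returns the IDs of messages A and B without scanning every message body.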

Andrew Sutherland

Nov 4, 2009, 9:21:57 PM

Thunderbird 3.0 (RC1 coming soon, but existing nightlies and betas have
it too) has a global database that lives in global-messages-db.sqlite.
It uses SQLite's FTS3 fulltext search engine which uses an inverted
index internally.

If you have a specific use case in mind, I can provide more input.

Andrew
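
For readers who have not used FTS3, the following is a rough sketch of a fulltext lookup using Python's standard `sqlite3` module. It assumes the SQLite library behind `sqlite3` was compiled with FTS3 support, which is common but not guaranteed. The table name `messages_text` and its columns are made up for the example and do not reflect the actual schema of global-messages-db.sqlite.

```python
import sqlite3

conn = sqlite3.connect(":memory:")

# Hypothetical table and column names; gloda's real schema differs.
conn.execute("CREATE VIRTUAL TABLE messages_text USING fts3(subject, body)")
conn.executemany(
    "INSERT INTO messages_text(docid, subject, body) VALUES (?, ?, ?)",
    [
        (1, "Networking", "The Internet changed how we send mail."),
        (2, "Tips", "Searching the Internet for Thunderbird tips."),
        (3, "Folders", "Offline folders keep local copies."),
    ],
)
conn.commit()


def search(term):
    """Return the docids of rows whose indexed text contains the term."""
    rows = conn.execute(
        "SELECT docid FROM messages_text "
        "WHERE messages_text MATCH ? ORDER BY docid",
        (term,),
    )
    return [docid for (docid,) in rows]


print(search("internet"))
```

The `MATCH` operator is answered from the inverted index that FTS3 maintains internally, so the caller never touches that index directly.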

Rolf Gloor

Nov 5, 2009, 3:42:35 PM

Hi Andrew

I just wonder, how big this DB might get?

Rolf

John Beranek

Nov 5, 2009, 4:21:03 PM
On 05/11/2009 20:42, Rolf Gloor wrote:
> Am 05.11.2009 03:21, schrieb Andrew Sutherland:
>> On 11/04/2009 07:38 PM, qiyuan Tang wrote:
>>> I'm wondering if Thunderbird builds index for all the offline messages
>>> in an folder such that each word in message is bind with some message
>>> IDs. For example, if both message A and message B have the word
>>> "Internet", then the relationship between "Internet" and the messages
>>> containing it (A&B) is established somewhere so that if we search the
>>> keyword "Internet", the thunderbird can know what are the
>>> corresponding messages. Does Thunderbird do so? If so, where is such
>>> inverted index? I can't find it in Thunderbird directory. Can someone
>>> help me to figure this out? Thank you very much.
>>
>> Thunderbird 3.0 (RC1 coming soon, but existing nightlies and betas have
>> it too) has a global database that lives in global-messages-db.sqlite.
>> It uses SQLite's FTS3 fulltext search engine which uses an inverted
>> index internally.
>>
[snip]

> I just wonder, how big this DB might get?

Well, my file (global-messages-db.sqlite) has reached 422MiB. ;)

John.

--
John Beranek To generalise is to be an idiot.
http://redux.org.uk/ -- William Blake

Andrew Sutherland

Nov 6, 2009, 12:56:55 AM
On 11/05/2009 12:42 PM, Rolf Gloor wrote:
> I just wonder, how big this DB might get?

It has no inherent sizing constraints.

Andrew

Russell East

Nov 6, 2009, 1:58:28 PM

But it seems proportional (in some fashion) to the total quantity of
messages.
On my system, for example, Windows Explorer reports "Size on disk" for the
Mail folder as 3.14 GB, and global-messages-db.sqlite is 937 MB.

Might be interesting to plot a graph to serve as a guide....

-- Russell

qiyuan Tang

Nov 7, 2009, 6:57:18 PM
Hi Andrew, do you mean that Thunderbird 2.0.0.21 doesn't support this
SQLite database? In Thunderbird 3.0, if I lose the inverted index, how
long will it take to rebuild it if I have thousands of emails?
I want to do searches like (A | A') & (B | B'), where A, A', B, and B' are
search keywords. As far as I know, Thunderbird doesn't support such
queries. But if I have the inverted index, this kind of search would be
easy. Am I right? Thank you.

Qiyuan

Andrew Sutherland

Nov 7, 2009, 8:20:45 PM
On 11/07/2009 03:57 PM, qiyuan Tang wrote:
> Hi Andrew, Do you mean that Thunderbird 2.0.0.21 doesn't support this
> sqlite? In Thunderbird 3.0, if I lost the inverted index, how long
> will it take to rebuild the index if I have thousands of emails?
> I want to do search like (A | A')& (B | B') . A, A', B, B' are search
> keywords. As far as I know, Thunderbird doesn't support such query.
> But if I have the inverted index, this kind of search would be easy.
> Am I right? Thank you.

Only Thunderbird 3 has the "gloda" global database with inverted index.
On a reasonably modern system, we index something like 10-25 messages
per second. We are doing more than just building a fulltext index and
are trying to avoid making Thunderbird unresponsive which is why that
number is not higher. Note that we require the messages to be available
offline before we will index their bodies.
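
As a rough worked example of that rate, the sketch below turns 10-25 messages per second into a rebuild-time range. The mailbox size of 5000 messages and the function name `rebuild_time_seconds` are hypothetical, and the estimate assumes the rate stays constant for the whole run.

```python
def rebuild_time_seconds(message_count, slow_rate=10, fast_rate=25):
    """Return (best_case, worst_case) seconds at the quoted 10-25 msgs/sec."""
    return message_count / fast_rate, message_count / slow_rate


# 5000 is a hypothetical mailbox size standing in for "thousands of emails".
best, worst = rebuild_time_seconds(5000)
print(f"{best / 60:.1f} to {worst / 60:.1f} minutes")
```

So a few thousand messages would be on the order of minutes, not hours, under these assumptions.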

Thunderbird's nsMsgSearch-style search implementation (non-SQLite
inverted index) supports building complex boolean decision trees. This
presumably includes Thunderbird 2.0.0.x. Unfortunately, there was a
serious bug involving the grouping flags which has only been recently
fixed. So you basically need Thunderbird 3 even for that.

The SQLite FTS3 query parser has two modes of operation. We are using
the traditional mode of operation which has some limitations. There is
documentation on the limitations/features of the various modes in
"ext/fts3/README.syntax" if you check a source distribution of SQLite.

Our fulltext query builder can be found in:
http://mxr.mozilla.org/comm-central/source/mailnews/db/gloda/modules/msg_search.js

It is somewhat limited in nature; it definitely does not try to build a
query as complex as your example. But it would hopefully provide you
with a starting point for anything you might want to do.

Andrew
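
For illustration, a query shaped like (A | A') & (B | B') reduces to set unions and intersections once an inverted index is available. The sketch below is plain Python over a hand-written index; the function name `search_and_of_ors` and the index contents are hypothetical, and this is not what msg_search.js or nsMsgSearch does.

```python
def search_and_of_ors(index, groups):
    """Evaluate (a1 | a2 | ...) & (b1 | b2 | ...) over an inverted index.

    index maps a word to the set of message IDs containing it;
    groups is a list of keyword lists, one list per OR-group.
    """
    result = None
    for group in groups:
        # OR within a group: union of the posting sets.
        matched = set()
        for word in group:
            matched |= index.get(word.lower(), set())
        # AND across groups: intersect with what has matched so far.
        result = matched if result is None else result & matched
    return result if result is not None else set()


# Hypothetical index contents, for illustration only.
index = {
    "internet": {1, 2},
    "web": {3},
    "search": {2, 3},
    "query": {4},
}
hits = search_and_of_ors(index, [["internet", "web"], ["search", "query"]])
print(sorted(hits))
```

Here the first group matches messages 1, 2, and 3, the second matches 2, 3, and 4, and the intersection leaves 2 and 3.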

qiyuan Tang

Nov 10, 2009, 10:36:48 PM
Thank you very much, Andrew. This inverted index would really help with
complex queries. Is it easy to get the word list (all the indexed words
from messages) from the SQLite full-text index, and if so, how? Is there
an existing interface or function to fetch the word list? Many thanks.

Qiyuan


Andrew Sutherland

Nov 10, 2009, 10:59:13 PM
On 11/10/2009 07:36 PM, qiyuan Tang wrote:
> Thank you very much, Andrew. This inverted index would really help to
> start complex queries. Is it easy to get the word list (all the
> indexed words from messages) from SQLite Full-Text index? and how to?
> Is there existent interface/function to fetch the word list? Many
> thanks.

As far as I know, FTS3 does not provide an easy way to get at the word
list. You might be able to reach directly into the
"FULLTEXTTABLENAME_segments" table that actually stores the
representations, but I think it uses an optimized encoding that would be
a nightmare to deal with if you're not using the FTS3 code for that
directly.

If you find a way to get at the word list, I would love to know... it
would allow us to do "did you mean..."-style suggestions.

It may also be possible to request that FTS3 expose some of the
functionality if we have a helpful enough use-case that we can justify.

Andrew
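
One possible route, sketched under assumptions: SQLite releases newer than this thread ship an `fts4aux` virtual table that exposes the terms of an FTS3/FTS4 index as ordinary rows. Whether the SQLite bundled with a given Thunderbird build includes it, and whether gloda's tables can be read this way at all, is not established here. The table names `docs` and `docs_terms` and both helper functions are hypothetical, and the suggestion step simply uses Python's `difflib`.

```python
import difflib
import sqlite3

conn = sqlite3.connect(":memory:")

# Hypothetical table names; this is not gloda's schema.
conn.execute("CREATE VIRTUAL TABLE docs USING fts3(body)")
conn.executemany(
    "INSERT INTO docs(body) VALUES (?)",
    [("The Internet is a network",), ("Search the internet index",)],
)
conn.commit()

# fts4aux presents the terms stored in the fulltext index of "docs".
conn.execute("CREATE VIRTUAL TABLE docs_terms USING fts4aux(docs)")


def word_list():
    """Return {term: number of documents containing it}."""
    rows = conn.execute(
        "SELECT term, documents FROM docs_terms "
        "WHERE col = '*' ORDER BY term"
    )
    return dict(rows)


def did_you_mean(word, limit=3):
    """Suggest indexed terms close to a possibly misspelled word."""
    return difflib.get_close_matches(word.lower(), list(word_list()), n=limit)


print(word_list()["internet"])
print(did_you_mean("internat"))
```

The `col = '*'` filter selects the per-term totals across all columns. For a large index, the word list would be worth caching instead of being re-read on every suggestion.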


Jens Müller

Nov 11, 2009, 9:57:34 AM
On 11.11.2009 04:59, Andrew Sutherland wrote:
> If you find a way to get at the word list, I would love to know... it
> would allow us to do "did you mean..."-style suggestions.

And what about latent semantic indexing?

qiyuan Tang

Nov 12, 2009, 5:00:39 PM
Thanks, Andrew. "Did you mean..."-style suggestions on search are exactly
what I want to do. I'm in a group working on a Thunderbird add-on project
that tries to do fuzzy search on the offline messages. The add-on works
fine on Thunderbird 2.0.0.21, but we had to build the inverted index
ourselves. Building our index takes a lot of time, and meanwhile
Thunderbird becomes unresponsive (usually for a couple of minutes,
depending on the number of emails) before the user can use it. Since
Thunderbird 3.0 provides such an inverted index, and there may be a way
(though not an easy one, as you said) to fetch the word list, we would
like to move our project to Thunderbird 3.0 so that we can use the
existing inverted index instead of creating a new one. But the question
is: is it easy to move the code to the new version? Are there big changes
to any other interfaces from Thunderbird 2.0.0.21 to 3.0 besides SQLite,
e.g. the tree view structure, XPCOM, or anything else that would require
us to rewrite much of the code? Thank you very much.

Andrew Sutherland

Nov 12, 2009, 10:43:43 PM
On 11/12/2009 02:00 PM, qiyuan Tang wrote:
> Thanks, Andrew. About "did you mean..."-style suggestions on search,
> that's exactly what I want to do. I'm in a group working on an
> Thunderbird add-on project which tries to do fuzzy search on the
> offline messages. The add-on works fine on Thunderbird 2.0.0.21, but
> we had to build the inverted index by ourselves. It took a lot of time
> to build our index and meanwhile thunderbird becomes unresponsive
> (usually couple of minutes according to the number of emails) before
> user can use it. Since Thunderbird 3.0 provides such inverted index,
> and there is a way(though it's not easy as you said...) to fetch the
> word list, we kind of want to move our project to Thunderbird 3.0, so
> that we can exploit the existent inverted index instead of creating a
> new one. But the question is, is it easy to move the code to the new
> version? is there a big change of any other interface from Thunderbird
> 2.0.0.21 to 3.0 besides the sqlite? e.g. the treeview structure, the
> xpcom thing, or other stuff that requires us to rewrite much of the
> code? Thank you very much.

This sounds like a very cool add-on!

Without having seen your extension it's hard to say for sure how
difficult it would be. In general, I think moving to Thunderbird 3 is
probably a good idea and the benefits you gain would likely outweigh the
cost. Having said that, if this is a research project that you are not
hoping to have used in the real world but rather just need a workable
research platform and you already have that with 2.0.0.x, it may be
worth just sticking with it.

A lot of the javascript support code related to the thread pane that
lists messages has been refactored. If you had to do any hacky things
to get your results displayed before, they should hopefully no longer be
required. We have abstractions used by the built-in global search to
take its fulltext results and display them, and it should be pretty easy
to leverage.

From an XPCOM interface perspective, the MailNews interfaces and events
remain largely unchanged except for some extra documentation. The
message headers still work like they always did.

For binary components, I think a lot of the string handling may have
changed, but I am not an expert on that. (Although some binary
components may be required if you want to plug into SQLite's FTS3 at a
low level, you hopefully won't need them otherwise.) I would check out
developer.mozilla.org and see what it has to say on the various
subjects; I think there are guides to upgrading code for the mozilla
platform in general.

In terms of support/feedback, we're much more able to provide
information on how to make things work with 3.0, whereas most of us
probably have no idea how to make things work with 2.0.

Andrew
