Proposal for a new database schema

cketti

unread,

May 27, 2011, 10:56:36 PM5/27/11

to k-9-...@googlegroups.com

Hello developers,

for my GSoC project
(https://github.com/k9mail/k-9/wiki/Overhaul-database-schema-and-abstraction-layer)
I modified our database schema with the goal of enabling us to implement
lots of cool features (PGP/MIME, S/MIME, message threading/conversation
view, fast searches, hierarchical folder view).

Check the attached file to see what I came up with. I don't expect this
to be the final schema as I will surely run into unforeseen problems
when implementing the new design. Maybe you can spot a mistake or have
additional ideas. Please let me (and the list) know.

Cheers,
cketti

database.txt

Jesse Vincent

unread,

May 27, 2011, 11:19:01 PM5/27/11

to k-9-...@googlegroups.com

> -- Goals:
> -- * support folder hierarchy
> -- * support for local folders
> CREATE TABLE folders (
> id INTEGER PRIMARY KEY,
> name TEXT, -- display name
> parent INTEGER, -- id of parent folder

Do we actually want true hierarchical folders or do we want to (like
IMAP) use names and delimiters?

> remote BOOLEAN, -- true, if this folder is linked to a folder on the server

I might invert it to call it local_only, assuming we need it.

> unread_count INTEGER, -- in the long run i want to ...
> flagged_count INTEGER, -- ... get rid of these
> integrate BOOLEAN,
> top_group BOOLEAN,
> display_class INTEGER,
> visible_limit INTEGER
> );
>
>
> -- Goal:
> -- * put store specific stuff in separate, easily extensible table
> CREATE TABLE folder_attributes (
> id INTEGER PRIMARY KEY,
> folder_id INTEGER,
> key TEXT,
> value TEXT
> );
> -- Possible values:
> -- * IMAP: uid_next
> -- * IMAP: uid_validity
> -- * IMAP: raw_name; raw (undecoded) name of folder (incl. prefix and separator)
> -- * push-able: push_state
> -- * push-able: last_pushed
> -- * push-able: push_class
> -- * remote: last_updated
> -- * remote: poll_class
> -- * all: status (TODO: replace with proper error reporting mechanism)

I worry a little bit about decomposing this far. It'll make it more
expensive to load up folders. And since we know what the columns are,
what's the real advantage of using it? I don't feel like we've run into
cases where we've wanted to extend the table and had trouble doing it.

>
> -- Goals:
> -- * be able to recreate byte-identical copies of all downloaded messages
> -- * be able to easily retrieve individual MIME parts of a message
> -- * create an entry for every message/rfc822 message part so we can display them as separate message
> -- * support local-only messages in folders linked to a remote folder
> -- (feature: delete (only) from server, restore message from trash folder for pop3 users [no upload possible])
> CREATE TABLE messages (
> id INTEGER PRIMARY KEY,
> folder_id INTEGER,
> uid TEXT,
> message_part_id INTEGER, -- "root" MIME part
> root INTEGER, -- for message/rfc822: store id of the real message this message/rfc822 is a part of
> parent INTEGER, -- for message/rfc822: easily accessible message hierarchy (maybe not necessary?)
> seq INTEGER, -- order of message/rfc822 parts
> local BOOLEAN, -- true, if this message is not linked to a message on the server
> deleted BOOLEAN, -- if true, the message was only deleted locally (entry must be kept so message isn't redownloaded)
> notified BOOLEAN, -- true, if this message "belongs" to an active notification (this way we can delete notifications if all "new" messages have been read elsewhere)
> date INTEGER, -- date from message header
> internal_date INTEGER -- date from server (if available, otherwise NULL)
> );
>
>
> CREATE TABLE message_attributes (
> id INTEGER PRIMARY KEY,
> message_id INTEGER,
> key TEXT,
> value TEXT
> );
>

Use case for this one?

>
> -- Goal:
> -- * allow searching for flags without string comparisons (... WHERE flags LIKE %SEEN%)
> CREATE TABLE flags (
> id INTEGER PRIMARY KEY,
> message_id INTEGER,
> name TEXT
> );
> -- Possible values:
> -- * seen
> -- * flagged
> -- * answered
> -- * forwarded
> -- * junk

The joins will be expensive relative to just making these booleans on
the messages table and since we know what they are, we should just end
up with them as boolean columns on the message

>
> -- Goals:
> -- * allow speedy searches by sender/recipient
> -- * don't waste time parsing/decoding recipients when creating message objects
> CREATE TABLE addresses (
> id INTEGER PRIMARY KEY,
> message_id INTEGER,
> type INTEGER, -- from, to, cc, bcc, sender, reply-to
> name TEXT,
> email TEXT
> );

Innteresting. Could be useful, definitely.

>
>
> CREATE TABLE message_threading (
> id INTEGER PRIMARY KEY,
> message_id INTEGER,
> root INTEGER,
> parent INTEGER,
> seq INTEGER
> );

Tell me more about how you want this to work?

> CREATE TABLE headers (
> id INTEGER PRIMARY KEY,
> message_part_id INTEGER,
> name TEXT,
> value TEXT, -- store header value in encoded form so original message can be reconstructed
> seq INTEGER
> );

Since we're decomposing the headers we "care" about regularly, what
about just storing the headers for a part as a TEXT/BLOB?

>
> CREATE TABLE message_parts (
> id INTEGER PRIMARY KEY,
> type INTEGER, -- inline (display), attachment (display attachment box), related (e.g. image referenced by html part; don't show attachment box)
> mime_type TEXT,
> root INTEGER,
> parent INTEGER,
> seq INTEGER,
> size INTEGER,
> data_type INTEGER, -- describes contents of 'data': empty/missing, blob, path to (attachment) file
> data TEXT -- original (transport-)encoded data
> );
>
>
> CREATE TABLE message_part_attributes (
> id INTEGER PRIMARY KEY,
> message_part_id INTEGER,
> key TEXT,
> value TEXT
> );
> -- Possible values:
> -- * IMAP: BODYSTRUCTURE part id
>
>
> -- Goal:
> -- * store important values in decoded form for fast display
> CREATE TABLE message_cache (
> id INTEGER PRIMARY KEY,
> message_id INTEGER,
> subject TEXT,
> preview TEXT
> );
>
>
> -- Goal:
> -- * store displayable bodies in decoded form for fast display
> CREATE TABLE message_part_cache (
> id INTEGER PRIMARY KEY,
> message_part_id INTEGER,
> name TEXT, -- part/file name (if available)
> data_type INTEGER, -- describes contents of 'data': decoded blob, path to file which contains the decoded body
> data TEXT
> );
>

It seems like having a "displayable body" text/blob column in the
message_cache table would get us what we want 99.99% of the time and
when a user wants attachments or inline images, the overhead of pulling
them from message_parts is low.

>
> -- no changes here
> CREATE TABLE pending_commands (
> id INTEGER PRIMARY KEY,
> command TEXT,
> arguments TEXT
> );

I guess my general feeling is that you're doing a lot of work to
decompose messages into the various parts, but I'm not actually sure you
have enough to reconstruct byte-identical versions of messages. In
particular, mime part separators don't seem to be anywhere in here.

-jesse

cketti

unread,

May 28, 2011, 12:15:49 AM5/28/11

to k-9-...@googlegroups.com

On Sat, 2011-05-28 at 5:19, Jesse Vincent <jes...@gmail.com> wrote:
>
>
>> -- Goals:
>> -- * support folder hierarchy
>> -- * support for local folders
>> CREATE TABLE folders (
>> id INTEGER PRIMARY KEY,
>> name TEXT, -- display name
>> parent INTEGER, -- id of parent folder
>
> Do we actually want true hierarchical folders or do we want to (like
> IMAP) use names and delimiters?

Hierarchical folder view is an often requested feature (that I'd also
like to see implemented). With this design it's easy to only load the
folders currently visible (without having the DB doing string compares).
Plus there's no decoding (split by delimiter, throw away everything but
the last part) necessary to get the actual folder name.

>> remote BOOLEAN, -- true, if this folder is linked to a folder on the server
>
> I might invert it to call it local_only, assuming we need it.

Done.

>> [...]

>>
>> -- Goal:
>> -- * put store specific stuff in separate, easily extensible table
>> CREATE TABLE folder_attributes (
>> id INTEGER PRIMARY KEY,
>> folder_id INTEGER,
>> key TEXT,
>> value TEXT
>> );
>> -- Possible values:
>> -- * IMAP: uid_next
>> -- * IMAP: uid_validity
>> -- * IMAP: raw_name; raw (undecoded) name of folder (incl. prefix and separator)
>> -- * push-able: push_state
>> -- * push-able: last_pushed
>> -- * push-able: push_class
>> -- * remote: last_updated
>> -- * remote: poll_class
>> -- * all: status (TODO: replace with proper error reporting mechanism)
>
>
> I worry a little bit about decomposing this far. It'll make it more
> expensive to load up folders. And since we know what the columns are,
> what's the real advantage of using it? I don't feel like we've run into
> cases where we've wanted to extend the table and had trouble doing it.

Most of this is store specific. So e.g. the folder list doesn't have to
look at these at all. And the *Store code only has to reload the folders
if something actually changes. I don't think it'll be a problem.

>> [...]

>>
>> CREATE TABLE message_attributes (
>> id INTEGER PRIMARY KEY,
>> message_id INTEGER,
>> key TEXT,
>> value TEXT
>> );
>>
>
> Use case for this one?

Hm... none, with this design. But see below.

>
>>
>> -- Goal:
>> -- * allow searching for flags without string comparisons (... WHERE flags LIKE %SEEN%)
>> CREATE TABLE flags (
>> id INTEGER PRIMARY KEY,
>> message_id INTEGER,
>> name TEXT
>> );
>> -- Possible values:
>> -- * seen
>> -- * flagged
>> -- * answered
>> -- * forwarded
>> -- * junk
>
> The joins will be expensive relative to just making these booleans on
> the messages table and since we know what they are, we should just end
> up with them as boolean columns on the message

You're probably right. I had it that way earlier. But then I remembered
all the internal X_something flags we use. With the proposed design I
can be lazy and keep those as custom flags in the flags table instead of
putting them into message_attributes.
So get rid of the flags table and use message_attributes for internal
X_* flags?

>> -- Goals:
>> -- * allow speedy searches by sender/recipient
>> -- * don't waste time parsing/decoding recipients when creating message objects
>> CREATE TABLE addresses (
>> id INTEGER PRIMARY KEY,
>> message_id INTEGER,
>> type INTEGER, -- from, to, cc, bcc, sender, reply-to
>> name TEXT,
>> email TEXT
>> );
>
> Innteresting. Could be useful, definitely.

Especially for searches like "more from this sender".

>> CREATE TABLE message_threading (
>> id INTEGER PRIMARY KEY,
>> message_id INTEGER,
>> root INTEGER,
>> parent INTEGER,
>> seq INTEGER
>> );
>
> Tell me more about how you want this to work?

That's one of the "we should probably have this, but I don't want to
implement the functionality that makes use of it" tables :)
The idea is to use (probably an adapted version of) Fiouz' threading
implementation to sort the incoming messages into existing threads (if
available). But now that I think about it the "seq" field ("order" is an
SQL keyword and I was too lazy for escaping) is probably a bad idea. The
order of all messages in one level might change if a new message comes
in. So better make neighbors a linked list.

>> CREATE TABLE headers (
>> id INTEGER PRIMARY KEY,
>> message_part_id INTEGER,
>> name TEXT,
>> value TEXT, -- store header value in encoded form so original message can be reconstructed
>> seq INTEGER
>> );
>
> Since we're decomposing the headers we "care" about regularly, what
> about just storing the headers for a part as a TEXT/BLOB?

I also thought about that. The only advantage of this method is that we
could easily search for header names. But I don't see us implementing a
powerful search function anytime soon. And even then I don't know if
users really want to search for/in arbitrary header fields. So if nobody
argues for the design above, I'll change it to what Jesse suggested.

>> [...]

>> -- Goal:
>> -- * store displayable bodies in decoded form for fast display
>> CREATE TABLE message_part_cache (
>> id INTEGER PRIMARY KEY,
>> message_part_id INTEGER,
>> name TEXT, -- part/file name (if available)
>> data_type INTEGER, -- describes contents of 'data': decoded blob, path to file which contains the decoded body
>> data TEXT
>> );
>>
>
> It seems like having a "displayable body" text/blob column in the
> message_cache table would get us what we want 99.99% of the time and
> when a user wants attachments or inline images, the overhead of pulling
> them from message_parts is low.

Changed.

>> [...]

>
> I guess my general feeling is that you're doing a lot of work to
> decompose messages into the various parts, but I'm not actually sure you
> have enough to reconstruct byte-identical versions of messages. In
> particular, mime part separators don't seem to be anywhere in here.

The boundary string is stored in the "Content-Type" header. Can a
boundary contain additional characters so we'd have to store the
original version?
But I did indeed forget preamble and epilogue :(

Thanks for the feedback.

-cketti

John McCabe

unread,

May 28, 2011, 5:02:00 AM5/28/11

to k-9-...@googlegroups.com

cketti <cke...@googlemail.com> wrote:

The flags table would almost certainly be more extensible if, in the future, there was a need or desire to add extra flags. E.g. with reference to the issue I raised yesterday about manual server deletions syncing with confirms perhaps a flag could be added that showed whether a message had been deleted on the server but not on the device. That's just a thought though.

One other thing here that confused me was 'name' being text but the comment being about searching without text comparisons; should that be an integer?

John
--
Sent from my Android phone with K-9 Mail. Please excuse my brevity.

cketti

unread,

May 28, 2011, 10:06:01 AM5/28/11

to k-9-...@googlegroups.com

Since we already support not syncing server deletions (albeit not
manually) there is no need for an extra flag.

> One other thing here that confused me was 'name' being text but the
> comment being about searching without text comparisons; should that
> be an integer?

No, TEXT is correct. Currently we store flags as a comma separated list
("SEEN,FLAGGED,X-SOMETHING"). So searching messages with a specific flag
requires wildcard ("%" in SQL) string searches.
You are of course correct, that the proposed design also has a TEXT
column. But because there should be only very few different flags, an
index on the name column should get rid of the need for excessive string
compares.

But Jesse is right in that it would be cheaper to just keep the few
flags we actually use as BOOLEAN in the messages table. If we need
support for arbitrary flags later, we can change the DB schema then.

I will post my updated working copy of the schema later today.

-cketti

Daniel I. Applebaum

unread,

May 28, 2011, 2:03:49 PM5/28/11

to k-9-...@googlegroups.com, cketti

On 05/28/2011 09:06 AM, cketti wrote:
>
>>> You're probably right. I had it that way earlier. But then I
>>> remembered all the internal X_something flags we use. With the
>>> proposed design I can be lazy and keep those as custom flags in the
>>> flags table instead of putting them into message_attributes. So get
>>> rid of the flags table and use message_attributes for internal X_*
>>> flags?
>>>
>>>

IMAP (RFC-3501) supports user-defined flags (keywords). With the
normalized table, K-9 can, too.

Of course, we can distinguish between the flags K-9 uses for internal
functionality and user-defined flags, and keep the functionally
important flags in the message row. Or did you intend user-defined
flags to go in "message_attributes" ?

By having X_ flags in message_attributes (some of which are functionally
important), IMAP flags in the message row, and user-defined flags in
message_attributes or a flags table, the flags are getting divided up
pretty badly.

Do we want to allow user-defined flags? (I think it would be a good idea.)

I like to the idea of keeping all flags in a normalized table.

Dan.

cketti

unread,

May 28, 2011, 2:20:48 PM5/28/11

to k-9-...@googlegroups.com

On Sat, 2011-05-28 at 20:03, Daniel I. Applebaum <dana...@danapple.com>
wrote:

> On 05/28/2011 09:06 AM, cketti wrote:
>>
>>>> You're probably right. I had it that way earlier. But then I
>>>> remembered all the internal X_something flags we use. With the
>>>> proposed design I can be lazy and keep those as custom flags in the
>>>> flags table instead of putting them into message_attributes. So get
>>>> rid of the flags table and use message_attributes for internal X_*
>>>> flags?
>>>>
>>>>
> IMAP (RFC-3501) supports user-defined flags (keywords). With the
> normalized table, K-9 can, too.
>
> Of course, we can distinguish between the flags K-9 uses for internal
> functionality and user-defined flags, and keep the functionally
> important flags in the message row. Or did you intend user-defined
> flags to go in "message_attributes" ?

No, the intention was to put user-defined flags also in the flags table.
But we have no use case for this yet. So for now the plan is to put the
few flags (also internal ones) we currently use in the messages table
and get rid of both the flags and the message_attributes tables.

> By having X_ flags in message_attributes (some of which are functionally
> important), IMAP flags in the message row, and user-defined flags in
> message_attributes or a flags table, the flags are getting divided up
> pretty badly.
>
> Do we want to allow user-defined flags? (I think it would be a good idea.)

Probably not in the immediate future. But I agree that it'd be nice to
have some day.

> I like to the idea of keeping all flags in a normalized table.

I still think it's a clean design. But we have to take the limits of our
platform into account. So for now it's probably better to put the flags
we use into the messages table.
I guess once the slow/small devices have died out and we get to
implementing some of the features requiring more/other/arbitrary flags,
we'll come back to this design (separate flags table).

-cketti

cketti

unread,

May 28, 2011, 11:03:28 PM5/28/11

to k-9-...@googlegroups.com

Find attached the current draft for the new database schema. I
incorporated the suggested/discussed changes, but I'm sure I forgot
something. Please let me know.

-cketti

k9mail_schema_proposal_002.sql

Jesse Vincent

unread,

May 31, 2011, 9:50:05 PM5/31/11

to k-9-...@googlegroups.com

This looks a lot better. I think I'd still lift the folder_attributes
keys we already know we're going to have up to first-class columns on
the folders table.

> --
> You received this message because you are subscribed to the Google Groups "K-9 Mail" group.
> To post to this group, send email to k-9-...@googlegroups.com.
> To unsubscribe from this group, send email to k-9-mail+u...@googlegroups.com.
> For more options, visit this group at http://groups.google.com/group/k-9-mail?hl=en.
>

> CREATE TABLE folders (
> id INTEGER PRIMARY KEY,
> name TEXT, -- folder name (with full hierarchy path, e.g. bills/2011/amazon)
> local_only BOOLEAN, -- true, if this folder is not linked to a folder on the server
> unread_count INTEGER,
> flagged_count INTEGER,

> integrate BOOLEAN,
> top_group BOOLEAN,
> display_class INTEGER,
> visible_limit INTEGER
> );
>
>

> CREATE TABLE folder_attributes (
> id INTEGER PRIMARY KEY,
> folder_id INTEGER,
> key TEXT,
> value TEXT
> );
> -- Possible values:

> -- * IMAP: uid_next
> -- * IMAP: uid_validity
> -- * IMAP: raw_name; raw (undecoded) name of folder (incl. prefix and separator)
> -- * push-able: push_state
> -- * push-able: last_pushed
> -- * push-able: push_class
> -- * remote: last_updated
> -- * remote: poll_class
> -- * all: status (TODO: replace with proper error reporting mechanism)
>
>

> CREATE TABLE messages (
> id INTEGER PRIMARY KEY,

> folder_id INTEGER,
> uid TEXT,
> message_part_id INTEGER, -- "root" MIME part
> root INTEGER, -- for message/rfc822: store id of the real message this message/rfc822 is a part of
> parent INTEGER, -- for message/rfc822: easily accessible message hierarchy (maybe not necessary?)
> seq INTEGER, -- order of message/rfc822 parts

> local_only BOOLEAN, -- true, if this message is not linked to a message on the server

> deleted BOOLEAN, -- if true, the message was only deleted locally (entry must be kept so message isn't redownloaded)
> notified BOOLEAN, -- true, if this message "belongs" to an active notification (this way we can delete notifications if all "new" messages have been read elsewhere)
> date INTEGER, -- date from message header

> internal_date INTEGER, -- date from server (if available, otherwise NULL)
> -- flags
> seen BOOLEAN,
> flagged BOOLEAN,
> answered BOOLEAN,
> forwarded BOOLEAN,
> --- x-flags
> destroyed BOOLEAN,
> send_failed BOOLEAN,
> send_in_progress BOOLEAN,
> downloaded_full BOOLEAN,
> downloaded_partial BOOLEAN,
> remote_copy_started BOOLEAN,
> got_all_headers BOOLEAN
> );
>
>
> CREATE TABLE addresses (

> id INTEGER PRIMARY KEY,
> message_id INTEGER,

> type INTEGER, -- from, to, cc, bcc, sender, reply-to
> name TEXT,
> email TEXT
> );
>
>

> CREATE TABLE message_parts (
> id INTEGER PRIMARY KEY,

> type INTEGER, -- inline (display), attachment (display attachment box), related (e.g. image referenced by html part; don't show attachment box)
> mime_type TEXT,

> root INTEGER,
> parent INTEGER,

> seq INTEGER,
> size INTEGER,
> data_type INTEGER, -- describes contents of 'data': empty/missing, blob, path to (attachment) file

> data TEXT, -- original (transport-)encoded data
> header TEXT,
> preamble TEXT,
> epilogue TEXT
> );
>
>
> CREATE TABLE message_part_attributes (
> id INTEGER PRIMARY KEY,
> message_part_id INTEGER,
> key TEXT,
> value TEXT
> );
> -- Possible values:
> -- * IMAP: BODYSTRUCTURE part id
>
>
> CREATE TABLE message_cache (

> id INTEGER PRIMARY KEY,
> message_id INTEGER,

> subject TEXT,
> preview TEXT,
> data_type INTEGER, -- describes contents of 'data': displayable body, path to file which contains the displayable body
> data TEXT
> );
>
>
> -- no changes here
> CREATE TABLE pending_commands (
> id INTEGER PRIMARY KEY,

Reply all

Reply to author

Forward