[MORK] some summary file problem statements


Jamie Zawinski

Dec 17, 1998, 3:00:00 AM
For the record, the summary file format used by Netscape 2.0 and 3.0
is documented at
http://cvs-mirror.mozilla.org/webtools/lxr/source/grendel/storage/MailSummaryFileCheddar.java#35

I am still of the opinion that a custom in-memory read-the-whole-file
type of cache is the way to go, and I think that the summary files that
2.0/3.0 used were brilliantly speedy and compact, if I do say so myself.
(I have disk usage statistics on this format that have been known to make
young girls squeal, but I don't have them handy right now.)

When I was reimplementing this stuff for Grendel, and doing the only
sane thing and ignoring the whole 4.x summary file debacle, I thought of
a way to reduce the summary file sizes even more than 2.x/3.x did (by
not storing message IDs, but instead storing MD5 hashes of them); that
new format, which never really left the building, is documented here:
http://cvs-mirror.mozilla.org/webtools/lxr/source/grendel/storage/MailSummaryFileGrendel.java#39

I distrust the idea of using a general-purpose database for mail summary
files. I smell overengineering.

--
Jamie Zawinski
j...@mozilla.org http://www.mozilla.org/ (work)
j...@jwz.org http://www.jwz.org/ (play)

Jamie Zawinski

Dec 17, 1998, 3:00:00 AM
David McCusker wrote:
>
> - charsets
>
> It must be possible to indicate what charset is being used in cell
> values. So file headers will typically indicate the default file
> charset, and individual dicts can indicate the charset for cell
> values tokenized in that particular dict. We don't really need to
> permit a way to declare charsets on an individual basis for cells
> inside rows, since we can do this indirectly through a dict that
> indicates explicit charset encoding.

You are summarizing mail folders. Mail folders are composed of mail
messages. Mail messages have headers which are composed of short-line
8-bit (7-bit, technically) quantities. Therefore, the summary files
need only store data composed of printable ASCII (or perhaps Latin1)
characters. So how do you do i18n stuff? The same way you do it in
mail headers: MIME-2 encoding.
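To illustrate the point (a rough Python sketch using the standard library's RFC 2047 support; the sample header value is invented):

```python
# MIME-2 (RFC 2047) encoded words let pure-ASCII summary data carry i18n
# text, exactly as mail headers already do -- no per-cell charset needed.
from email.header import decode_header

raw = "=?iso-8859-1?q?Andr=E9?= <andre@example.com>"
parts = decode_header(raw)
decoded = "".join(
    s.decode(cs or "ascii") if isinstance(s, bytes) else s
    for s, cs in parts
)
print(decoded)  # the accented name, decoded from its ASCII-safe form
```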

I still smell overengineering.

Message has been deleted

Scott MacGregor

unread,
Dec 17, 1998, 3:00:00 AM12/17/98
to
David McCusker wrote:

> (Note this post assumes Mork syntax slightly different from last report.)
>
> This posting concerns some problems to be solved in summary file formats.
> I won't bother to distinguish problems that are merely nice to solve, as
> opposed to necessary to solve, unless that happens to be interesting for
> some reason (perhaps because of conflicts with something else). As long
> as it is easy to address a problem without getting caught in a kitchen
> sink growth trap, then adding one more nice thing seems mostly harmless.
>
> Here's a list of problems as a table of contents for this posting:
> - magic signature
> - metainformation
> - charsets
> - binary content
> - inter-file references
> - line continuation
> - line endings
> - comments
> - tokenization
> - delimiting content
> - efficient updates
> - transactions
> - avoiding offsets
>
> - magic signature
>
> It's a good idea to have the file format begin with something that
> can be examined to explicitly identify the file format. I plan to
> modify the first draft mdb spec to include the first 512 bytes of
> a file's content to accompany the file name when calling methods
> that query whether a file can be opened. This avoids having every
> factory open and close the file themselves to look for signatures.
> (Yes, there's an assumption here about where to look for such.)
>
> The magic signature being used for mozilla's XPIDL typelib files is
> interesting, since it includes a CRLF as well as a single CR (or LF?)
> in order to catch file transfer tools that munge line endings.
> Perhaps that is a good idea for Mork magic signatures; but in a text
> format, it might be hard to expect users to respect the white space.
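The XPIDL-style trick can be sketched like so (the signature bytes here are made up, not the real Mork magic):

```python
# Embed both a CRLF and a lone LF in the magic bytes, so any line-ending
# "helpfulness" by a transfer tool corrupts the signature and is caught
# up front. The factory is handed the first 512 bytes plus the file name,
# so it never has to open the file itself to sniff the format.
MAGIC = b'// <!-- <mdb:mork:z v="1.1"/> -->\r\n\n'

def looks_like_mork(first_512_bytes: bytes) -> bool:
    return first_512_bytes.startswith(MAGIC)

good = MAGIC + b"more content..."
munged = good.replace(b"\r\n", b"\n")  # a CRLF->LF rewrite broke it
print(looks_like_mork(good), looks_like_mork(munged))  # True False
```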
>
> - metainformation
>
> Files, dicts, and tables typically use metainfo to specify important
> global relationships, so each of these should have some construct
> for specifying attribute value pairs describing the container.
> Files and dicts might declare default charsets. A file might have
> an attribute pointing to the file which it describes (in order to make
> explicit which mbox is indexed by a summary file, for example). A
> table needs to specify the scopes for contained rows and the kind
> of table itself.
>
> Below is an example of how the leading header in a table is used to
> indicate the row and table scopes for the table. Note that attribute
> names and attribute values can both be tokenized. Yes,
> this means that tokenized summary files can be hard to read.
>
> // standard names and values might often be tokenized like this:
> #{ (1=ns:msg:db:table:mail:messages) // 0x1 is messages row scope
> (2=ns:msg:db:table:mail:threads) // 0x2 is threads table scope
> (3=rowScope) (4=tableScope) // we can also tokenize column names
> }#
>
> // (^3 ^1) expands to (rowScope ns:msg:db:table:mail:messages) below
> { 7fd <(^3 ^1)(^4 ^2)> // table 0x7FD is a table of message threads
> r64bc // first row has id 0x64BC (in the separate messages table)
> // ...
> }
>
> - charsets
>
> It must be possible to indicate what charset is being used in cell
> values. So file headers will typically indicate the default file
> charset, and individual dicts can indicate the charset for cell
> values tokenized in that particular dict. We don't really need to
> permit a way to declare charsets on an individual basis for cells
> inside rows, since we can do this indirectly through a dict that
> indicates explicit charset encoding.
>

> // this says token 0x42 equals "the answer" in charset iso-8859-1:
> #{<(charset=iso-8859-1)> (42=the answer)}# // Mork dict syntax
>
> // the same thing with charset stuff tokenized in an earlier dict:
> #{(21=charset) (22=iso-8859-1)}# // charset and iso-8859-1 tokens
> #{<(^21^22)> (42=the answer)}# // Mork dict syntax
>

I wonder, maybe we should look into always storing data in UTF-8 in the
database. The client of the database could then specify the character set
they would like the resulting strings in and we could do a conversion before
returning the string to the client. By doing that, we now have the ability to
support any character set (since you can go from UTF-8 to any character set)
at the display level. And you wouldn't need to keep track of the character
set of each string attribute at the db level. You can serve up the string in
any CSID. Of course one potential downside is that the client needs to keep
track of the CSID they requested the string to be in if they want to do
something else with it later. Another potential problem is that the
conversion requires the client owning their own copy of the string (in the
csid they wanted). I'm not sure how that fits into the memory model the
database had in mind.
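The idea can be sketched like this (hypothetical class and method names; the point is the canonical-UTF-8-inside, convert-on-the-way-out shape, with the caller owning the converted copy):

```python
# Store every cell value as UTF-8 internally; convert to the caller's
# requested charset (CSID) only when handing the string out. The db then
# never tracks per-string charsets, and any display charset is reachable.
class StringStore:
    def __init__(self):
        self._cells = {}  # cell id -> canonical UTF-8 bytes

    def put(self, cell_id, text):
        self._cells[cell_id] = text.encode("utf-8")

    def get(self, cell_id, csid="utf-8"):
        # Decode the canonical form, re-encode for the caller; the
        # returned bytes are the caller's own copy, as noted above.
        return self._cells[cell_id].decode("utf-8").encode(csid)

store = StringStore()
store.put(0x42, "Andr\u00e9")
print(store.get(0x42, "iso-8859-1"))  # b'Andr\xe9'
```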

-Scott

>
> - binary content
>
> The summary file format should allow binary content to appear in
> cell values, although some alphanumeric encoding would be considered
> better style. So no byte value should be forbidden in cell values,
> even though some clients might not be able to handle all byte values.
> In particular, null bytes will cause problems for some clients. The
> address book code in 4.5 is not able to handle unencoded null bytes.
>
> I probably need to add a means to permit base64 encoding for cell
> values in the Mork syntax. But that should only be used for extended
> binary objects with no other useful representation. It is better to let
> users find the readable content somewhere in the summary file when it
> is feasible to do so, although indirection through dicts is awkward.
>
> - inter-file references
>
> Summary files will need to point at content in other files, but we
> don't need special syntax for this in the summary file itself. Still,
> we need standard attributes and value encodings for this. I'll defer
> that to someone else since it doesn't seem very problematic.
>
> - line continuation
>
> It should be possible to break long values to avoid lines longer than
> some reasonable upper limit on the order of 70-80 characters. So we
> use /zm:Continue ::= '\' zm:LineEnd/ in zany Mork to continue a value
> on the next line. The choice of escape character is arbitrary.
>
> - line endings
>
> The format should be line ending agnostic with regard to whether and
> how many CR's and LF's are used. A practical approach is to consume
> one of each in any sequence when they occur to represent a line end.
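The consume-one-of-each rule can be sketched like this (illustrative Python, not a real Mork parser):

```python
# At a line end, consume at most one CR and at most one LF in either
# order, so CR, LF, CRLF, and LFCR each count as exactly one line break.
def split_lines(data: bytes):
    lines, i, start = [], 0, 0
    while i < len(data):
        if data[i] in (0x0D, 0x0A):      # hit a CR or LF
            lines.append(data[start:i])
            first = data[i]
            i += 1
            # eat one opposite ending character if it immediately follows
            if i < len(data) and data[i] in (0x0D, 0x0A) and data[i] != first:
                i += 1
            start = i
        else:
            i += 1
    if start < len(data):
        lines.append(data[start:])
    return lines

print(split_lines(b"a\r\nb\nc\rd\n\re"))  # [b'a', b'b', b'c', b'd', b'e']
```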
>
> - comments
>
> It is nice to allow users or tools to annotate content with comments.
> Since it is easy to excise them from the input stream, there's little
> reason to avoid a comment syntax. I like both C++ and C style comments,
> but I prefer that C style comments nest, since this lets users add C
> comments around any content (except inside a value) to excise content.
> The Mork grammar shows only C++ style comments, but we can add C style.
>
> - tokenization
>
> As a form of compression, it should be possible to store a string value
> once in a dict, and henceforth refer to this string by its id alone.
> When such a string is used more than once, and when the string is
> longer than its hypothetical dict id, then using an id as much as
> possible will generate a smaller file.
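The size tradeoff can be sketched like this (the "^id" reference format and dict-entry syntax here are simplified stand-ins, not exact Mork):

```python
# Tokenize a value only if it occurs more than once AND the references
# plus the one-time dict entry are smaller than repeating the value.
from collections import Counter

def worth_tokenizing(values):
    counts = Counter(values)
    next_id, dict_entries = 0x80, {}
    for value, n in counts.items():
        ref_len = len("^%x" % next_id)                 # cost per reference
        entry_len = len("(%x=%s)" % (next_id, value))  # one-time dict cost
        if n > 1 and n * len(value) > n * ref_len + entry_len:
            dict_entries[value] = next_id
            next_id += 1
    return dict_entries

vals = ["ns:msg:db:row:scope:msgs"] * 50 + ["x"] * 50
print(worth_tokenizing(vals))  # only the long repeated value gets an id
```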
>
> Yes, an application can (and probably should) tokenize both attribute
> names and cell values when a summary file is loaded, but this is a
> separate issue. The speed difference related to tokenization is not
> too important since file i/o performance typically dominates. So we
> are mainly concerned about the size of a summary file, since making
> the file smaller might decrease file i/o time cost.
>
> Note that while the grammar permits tokenization with dicts, this is
> not required, so the decision regarding file readability is a policy
> decision made by an application.
>
> - delimiting content
>
> Content delimiters should be selected to differ from each other,
> and also to have definite left and right orientation so balance of
> delimiters can be easily determined. All the naturally occurring
> left and right delimiters in the Latin1 charset should be used, and
> even so there are not enough to go around.
>
> The zany Mork syntax uses < > to enclose metainfo (composed of
> attribute value cells), { } to enclose tables, [ ] to enclose rows,
> and ( ) to enclose cells or cell value definitions. The values of
> cells are not explicitly enclosed in string quotes since the final
> right paren is adequate to terminate the string content. Since we
> don't have any more very good left and right delimiters, dicts use
> braces enclosed on both ends with # in order to hold dict content.
>
> - efficient updates
>
> Usually all the hard performance work in databases is involved in
> writing quickly and safely. If a summary file is very large, and
> only small portions of content are changed, then rewriting a whole
> file on transaction commits would be inefficient and quite slow.
> So better performance with a text format can be effected by using
> a scheme that works well with incremental writes that update old
> content. So Mork includes an explicit syntax for appended updates.
>
> When a row or table is preceded by '+', this means the row or table
> should have an already used id, and that content should be added to
> the existing row or table. If preceded by '-', this means the row
> or table contains cells or rows that should be removed from the
> existing collection. A preceding '!' means a row or table is being
> completely redefined by the content specified. If you can think of
> other clever notations for variations, please point them out.
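A toy model of the '+', '-', '!' records, applied to in-memory rows (the record shapes are invented for illustration; real Mork syntax differs):

```python
# rows: row id -> dict of cell name -> value
def apply_update(rows, op, row_id, cells):
    if op == "+":    # add/overwrite cells on an already existing row
        rows.setdefault(row_id, {}).update(cells)
    elif op == "-":  # cut the named cells from the existing row
        for name in cells:
            rows.get(row_id, {}).pop(name, None)
    elif op == "!":  # completely redefine the row by the given content
        rows[row_id] = dict(cells)

rows = {0x64BC: {"subject": "hi", "flags": "read"}}
apply_update(rows, "+", 0x64BC, {"flags": "read,replied"})
apply_update(rows, "-", 0x64BC, {"subject": None})
apply_update(rows, "!", 0x64BD, {"subject": "new"})
print(rows)
```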
>
> This update syntax is intended to permit rows and tables to change
> by means of appending updates to a summary file, so that incremental
> changes can be written more efficiently than rewriting entire files.
> This means files will grow longer than necessary, so that later
> rewrites are necessary to compress files into a smaller form. When
> to perform such rewrites is obviously not specified by the grammar.
>
> Folks who object to using the update-by-append mechanism for one
> reason or another can simply not use the update syntax, and handle
> incremental writes in some other fashion. The syntax specifies the
> update grammar so this particular approach is well-defined. Whether
> used or not, this does imply development cost for the file reader.
>
> - transactions
>
> Sometimes invariant relationships are distributed across many kinds
> of collections that must all be updated atomically in order to keep
> a valid set of constraints. So it is very useful for a summary file
> syntax to include a means of grouping content into transaction sets
> that should be applied atomically (either completely or not at all).
>
> The zany Mork syntax begins a transaction with '@$${zm:Id{@', where
> the starting id matches the id appearing in the end notation that
> either commits or aborts the transaction. Since a transaction id
> will never be confused with the id of anything else, one can assign
> transaction ids without regard for collision with actual content ids.
>
> The '$$' was added to the transaction start and end delimiters since
> this byte sequence cannot occur inside cell values, because '$' must
> be escaped when it is not used to lead following hex content. This
> ensures that value content cannot inadvertently be confused with
> either transaction delimiter.
>
> An ended transaction is committed with '@$$}zm:Id}@' or else aborted
> with '@$$}~abort~zm:Id}@', so updates can be appended to a summary
> file inside a transaction without fear that a crash will result in
> a database with inconsistent content. When a transaction commits,
> one can write a trailing '@$$}zm:Id}@' and then flush the file. So
> if the end delimiter is written successfully, then the transaction
> will be committed for the next reader of the file. Anything else
> means the final transaction was aborted. The explicit abort end
> delimiter is provided so a transaction can be voluntarily aborted
> without needing to back up and remove any logged transaction updates.
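Replay-on-open for this append-only log can be sketched as follows (marker spellings simplified to '@$${id{@', '@$$}id}@', and '@$$}~abort~id}@'; a hypothetical illustration, not the real parser):

```python
# Apply only groups closed by a matching commit marker; an abort marker,
# or a missing end marker (a crash mid-write), discards the group.
def replay(log_lines):
    applied, pending, open_id = [], [], None
    for line in log_lines:
        if line.startswith("@$${"):                 # begin group: @$${id{@
            open_id, pending = line[4:-2], []
        elif line == "@$$}%s}@" % open_id:          # commit end marker
            applied.extend(pending)
            open_id = None
        elif line == "@$$}~abort~%s}@" % open_id:   # voluntary abort
            open_id = None
        elif open_id is not None:
            pending.append(line)
    return applied    # a trailing unterminated group is simply dropped

log = ["@$${1{@", "update-a", "@$$}1}@",
       "@$${2{@", "update-b", "@$$}~abort~2}@",
       "@$${3{@", "update-c"]                       # crashed before commit
print(replay(log))  # ['update-a']
```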
>
> - avoiding offsets
>
> Some ways of expressing summary files could use self-reference by
> means of file offsets within the summary file, but such constructs
> would be very fragile in the event a user ever modified a file by
> hand. If such offsets were ever used, they could never be trusted,
> and each individual offset would require verification during the
> parsing of read content, so their presence in a file would be useless.
>
> However, offsets are very useful when they point into another file,
> and this will be a normal design pattern for summary files that
> index another base file. This implies such a summary file must be
> rebuilt when the base file is modified in a way that changes the
> offsets in the base file. So a summary file must use metainfo to
> record when the base file was last modified at the time the summary
> file was last in sync with the base file.
>
> David McCusker, speaking only for myself, mozilla mail/news client eng
> Values have meaning only against the context of a set of relationships.



Jamie Zawinski

Dec 17, 1998, 3:00:00 AM
David McCusker wrote:
>
> I intend that a Mork implementation of the mdb interface be like this.
> We have added another main requirement that we not need to write the
> entire cache in order to commit a small change, though.

Well really the requirement is "small changes should be fast", right?
In 2.x/3.x, we did this by writing out the summary file lazily: when
switching folders, and at other times when file IO would be happening
anyway (never as a result of a fast action like toggling the "read" flag).

In Grendel, we did it from a background thread; when a change was made,
a background thread was created and went to sleep, and after 30 seconds
or so, the batched up changes would be written out. This was even
better, because the UI was still live during I/O.
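The Grendel scheme can be sketched like this (class and method names are made up; the point is the arm-a-timer, flush-off-the-UI-thread batching):

```python
# Mark the summary dirty and arm a timer; further changes ride the
# pending timer, so ~30s of edits collapse into one background write.
import threading
import time

class SummaryFlusher:
    def __init__(self, write_fn, delay=30.0):
        self._write, self._delay = write_fn, delay
        self._timer, self._lock = None, threading.Lock()

    def mark_dirty(self):
        with self._lock:
            if self._timer is None:  # batch: at most one timer pending
                self._timer = threading.Timer(self._delay, self._flush)
                self._timer.daemon = True
                self._timer.start()

    def _flush(self):
        with self._lock:
            self._timer = None
        self._write()                # runs off the UI thread

# tiny demo with a short delay: three rapid changes -> a single flush
writes = []
flusher = SummaryFlusher(lambda: writes.append("flushed"), delay=0.05)
for _ in range(3):
    flusher.mark_dirty()
time.sleep(0.3)
print(writes)  # ['flushed']
```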

> That might have a small effect on what kind of format is the best fit.
> We need a good showing for newsgroup summaries with 10,000 to 20,000
> messages.

Right now I'm looking at a folder in 3.0. It has 15,466 messages
in it. Selecting this folder takes less than a second (it's hard to
eyeball it, but I'd say it takes about 1/2 to 3/4 second from when
I click to when I see the message summary on the screen). The BSD
mbox file is 57.2MB (1.2 million lines) and the summary file is 1.3MB
(2% of the size of the folder.)

This is on a P266 with a local IDE disk (Linux.)

Oh, if I have threading on, it takes about 1.5 seconds to select the
folder. (Note that unlike 4.x, there is no "pre-threading" in the
summary files; this is all done in core, from the first principles of
the Message-ID and References data stored in the summary file.)

> Using MD5 hashes as IDs is a pretty good idea, although the IDs might
> be a bit bigger than those assigned some other way.

An MD5 hash is a 128-bit quantity. In my code, I was only using 64 bits
of that, since I wasn't really worried about it being cryptographically
strong. The result is that I had to store 8 bytes per message ID rather
than the full message ID string, which tended to average somewhere
around 25 or 30 bytes (I don't remember the actual number, but I
collected stats back when I was working on this, and it was a big
number.)
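The trick is tiny in code (a sketch; the sample Message-ID is invented):

```python
# Hash each Message-ID with MD5 and keep only the first 8 of the 16
# digest bytes: a fixed 64-bit key in place of a ~25-30 byte ID string,
# with negligible collision risk at mail-folder scale.
import hashlib

def message_id_key(message_id: str) -> bytes:
    return hashlib.md5(message_id.encode("ascii")).digest()[:8]

key = message_id_key("<36794FC2.237EAC9D@netscape.com>")
print(len(key))  # 8
```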

> Maybe you know the answer to a question I have had for a while
> regarding the statistical distribution of MD5 hashes as opposed to CRC
> hashes with the same number of bits. I haven't studied MD5, but I
> suppose the goal is extremely low probability of collision? Is it much
> better than CRCs?

It's much better than CRC, because MD5 is intended to be
cryptographically strong: as far as I know, nobody has yet come
up with two strings that have the same MD5 hash; RSA's FAQ
(http://www.rsa.com/rsalabs/faq/html/3-6-6.html) says in part:

Van Oorschot and Wiener have considered a brute-force
search for collisions in hash functions, and they estimate
a collision search machine designed specifically for MD5
(costing $10 million in 1994) could find a collision for MD5
in 24 days on average.

More detailed MD5 info:
http://www.rsa.com/rsalabs/pubs/cryptobytes/spring95/md5.htm

I used MD5 because it's a great hash function, and we had the code
handy. It might well be overkill for this purpose.

> Sorry about the smell, it seems to follow me everywhere. There are at
> least two main causes of that smell in this case. One is that the same
> file format must be able to represent address book files, where there is
> no constraint on the permitted attributes, and where users have some
> control over charsets in use, either globally or at smaller granularity.

Now wait, I see two different problems: you need a mail summary file;
and you need a representation for address books. It's not obvious
to me that those are anywhere *near* the same problem, or that one
should try to conflate them and solve them simultaneously. The
requirements of the two problems seem pretty dramatically different
to me.

On the one hand, it's a drag to do two different implementations,
but on the other hand, it's a drag to have one of the two problems
be solved badly because of baggage from the other problem; or to have
the all-singing-all-dancing solution take longer than two independent
solutions together.

I see mail summary files as a very specialized need for a very fast
cache for very specific kinds of data and access patterns. I can
see how an address book would need something more general than that
(and would also have less severe speed and storage requirements.)
And I can also see how a general data store for RDF-ish things would
be really, really useful. But I'd hate to see either of those latter
two desires make mail summary files be bloated, slow, or late.


Jamie Zawinski

Dec 17, 1998, 3:00:00 AM
David McCusker wrote:
>
> I'd be more interested in the time it takes to add ten new messages
> or to cut ten old messages from a newsgroup that large. This should
> happen close to instantly if a resort is not needed (though I think
> it would require cleverness to figure out the intersection with the view).

I think the answer is that it's the time needed for the I/O required
to do two things: read the whole summary file for the source folder into
memory; plus the time needed to fseek() in the source file and copy the
message body bytes from one file to the other.

But note that in most cases, the summary data would already have been
loaded, so really the cost would be purely the message-body I/O. If
you're dragging messages from one folder to another, then you're
gesturing at the source folder, so the source folder is on the screen,
which means its summary data is in memory (or you wouldn't be seeing it
at all.)

In more detail: in the model used by 2.x/3.x and Grendel, there was an
in-memory object representing a mail folder; and the two disk-oriented
operations you could perform on it were "read the whole thing from scratch" or
"write the whole thing." (Actually there was a 3rd one, a sort of
recovery-mode for when file dates or sizes had changed on us
unexpectedly, but pay no attention to the man behind the curtain.)

So the act of copying several messages from folder A to folder B would
be:

* read the summary file for folder A -- load the whole thing into
memory (but they're small, so this is fast);
* open the mbox file for folder B for append;
* based on summary info, fseek() to the start of the first message
to copy from file A;
* copy the proper number of bytes from file A to file B (the summary
file gave us start-byte and length for each message);
* repeat for the other messages;
* close A and B;
* overwrite a handful of bytes near the front of B's summary file
(so that the summary is in sync with the mbox file -- this is
part of our outside-change-detection logic).
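The core of the copy loop can be sketched like this (file handling simplified for illustration; the offsets and lengths come from folder A's summary data, as described above):

```python
# Moving mail is a seek plus a raw byte copy -- no message parsing --
# because the summary supplies (start_byte, length) for each message.
import os
import tempfile

def copy_messages(src_mbox, dst_mbox, spans):
    with open(src_mbox, "rb") as src, open(dst_mbox, "ab") as dst:
        for start, length in spans:   # from the source folder's summary
            src.seek(start)
            dst.write(src.read(length))

# tiny demo: message two starts at byte 9 and is 18 bytes long
a = os.path.join(tempfile.mkdtemp(), "A")
b = a + ".B"
with open(a, "wb") as f:
    f.write(b"From one\n" b"From two\nBody two\n" b"From three\n")
copy_messages(a, b, [(9, 18)])
with open(b, "rb") as f:
    print(f.read())  # b'From two\nBody two\n'
```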

Now, the current state is that B's summary file indicates in its header,
"I am a summary for N messages through byte position X of a file of
length Y with date Z."

The next time folder B is selected, we do the error checking -- does the
file length and date match? If not, we go into "recovery mode", which
I'll continue to handwave over.

If it does match, but the summarized portion is smaller than the length
of the file, then the act of creating the in-memory summary object has
two passes: first the data is parsed from the summary file; then more
data is appended to the in-memory summary object by parsing the file
itself, starting at position X (say, 10 messages from the end of the
file.)

At this point we have an in-memory summary object for the folder, and if
needed, it is marked as dirty. When the time comes to write out dirty
summary data, the whole summary file is rewritten from the beginning
(which is fast, because summary files are small.)

The act of marking a message deleted in a folder is pretty simple, too;
the "deleted" flag is set on the in-memory summary object; and the
summary object is marked as dirty. At some point in the future, the
summary file will be flushed to disk; and also, we will fseek() through
the mbox file itself and overwrite the X-Mozilla-Status headers if they
exist. In this way, we maintain the (important) goal that the summary
file is a cache -- a redundant store for information that can be
recomputed from the mbox file itself.
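The in-place flag rewrite works because X-Mozilla-Status carries a fixed-width hex value, so updating it never changes the mbox's length. A sketch (the 0x0001/0x0008 flag values follow the Mozilla convention, but the offsets here are a contrived demo):

```python
# Seek to the header position recorded in the summary and overwrite
# exactly four hex digits -- same byte count, so the file never resizes.
import os
import tempfile

def set_status(mbox_path, header_offset, flags):
    prefix = b"X-Mozilla-Status: "
    with open(mbox_path, "r+b") as f:
        f.seek(header_offset)
        assert f.read(len(prefix)) == prefix  # summary-offset sanity check
        f.write(b"%04x" % flags)              # overwrite the hex field

path = os.path.join(tempfile.mkdtemp(), "mbox")
with open(path, "wb") as f:
    f.write(b"From x\nX-Mozilla-Status: 0001\n\nbody\n")
set_status(path, 7, 0x0009)                   # read (0x1) | deleted (0x8)
with open(path, "rb") as f:
    print(f.read())  # b'From x\nX-Mozilla-Status: 0009\n\nbody\n'
```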

This only breaks down if the mbox file doesn't contain space for
X-Mozilla-Status headers; in that case, there can be situations where
crucial information exists only in the summary file. However, this is
corrected when the folder is compacted (when Trash is emptied) by
rewriting the mbox so that all messages have X-Mozilla-Status.

> There's also another relevant way to look at this issue. We intend to
> use the same abstract db API (the mdb interface) for both mail/news
> and address books, but they need not be the same file format, since we
> could use different implementations for different purposes. So if you
> implement a better mail/news summary file, then it would be feasible
> to use the Mork format only for address books, and yours for mail/news.

I suppose, but in 3.x and Grendel, the summary files weren't really
thought of as a "database" at all; they were more like a disk
representation of a set of objects. Sometimes you'd write those
objects to disk, or read them back, and the layer that dealt with
the file format had the mission of doing this I/O quickly and using
minimal disk space.

It was really more of a "serialization" exercise than a "database"
exercise.
