
Raw MIME API proposal


Joshua Cranmer

Jul 28, 2011, 8:57:27 PM
I had a more thorough discussion, but that message appears to have been
lost into the aether. So here I am rewriting it, and will hopefully be
more brief this time.

Right now, libmime serves three purposes:
1. Parse the message
2. Provide a view of body and attachments
3. Drive the UI

Getting at number 1 is surprisingly difficult, especially if you want
the result to be fully lossless. What I want to do is propose a new API
which does this, particularly making this easy to do from JS and with an
eye to eventually rearchitecting libmime to use it. Key requirements
include:
* The core parser is implemented as an nsIStreamListener and emits data
via an nsIMimeEmitter-like interface
* The core parsing routine is asynchronous
* The API must be accessible to JS and should act JS-ish (e.g., using an
{ param: value } configuration syntax)
* We should be able to feed the message using at least a message URI,
necko URL, or a nsIMsgDBHdr
* If feasible, we should have a synchronous decoding API
* The output should look similar to mimemsg.js's current syntax
* How much to be decoded should be selectable:
- Headers: raw data, CFWS collapsing, or full RFC 2047 decoding
- Bodies: no body, raw body, decode base64/quoted printable, non-MIME
decapsulation
* Ability to retrieve data for a specific part (including fake MIME
decapsulated parts)
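To make the { param: value } style concrete, here is a minimal sketch (all names here are mine, not an existing API): a header parser whose options object selects between raw and unfolded header values, roughly the first of the selectable decoding levels above.

```javascript
// Hypothetical sketch of the options-object style proposed above; the
// function and option names are illustrative, not a real API.
// parseHeaders() splits a raw header block into a Map of lower-cased
// header names to arrays of values. With { headers: "unfold" } (the
// default), RFC 5322 folded lines are collapsed; "raw" keeps them.
function parseHeaders(raw, { headers = "unfold" } = {}) {
  const entries = [];
  let current = null;
  for (const line of raw.split(/\r?\n/)) {
    if (line === "") break;                  // blank line ends the headers
    if (/^[ \t]/.test(line) && current) {    // folded continuation line
      current.value += headers === "unfold" ? " " + line.trim() : "\n" + line;
      continue;
    }
    const colon = line.indexOf(":");
    if (colon === -1) continue;              // be liberal: skip junk lines
    current = { name: line.slice(0, colon).trim().toLowerCase(),
                value: line.slice(colon + 1).trim() };
    entries.push(current);
  }
  const map = new Map();
  for (const e of entries) {
    if (!map.has(e.name)) map.set(e.name, []);
    map.get(e.name).push(e.value);
  }
  return map;
}

const h = parseHeaders("Subject: a folded\r\n subject\r\nTo: x@example.com\r\n\r\nbody");
// h.get("subject") → ["a folded subject"]
```

The same options object could later grow the body-decoding switches without breaking callers.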

Open issues:
* Part numbering -- Libmime numbers parts differently from IMAP. In
particular, libmime requires a part number of `1' to get to the body of
a message, so 1.4 in libmime is really IMAP part 4, and (if IMAP part 3
were a message/rfc822) 1.3.1.2 would be IMAP part 3.2. Numbering
non-MIME decapsulated parts would require a different separator, e.g.,
3-1.3-3 if we chose to paint the bikeshed `-'.
* Non-MIME decapsulation -- It's not quite non-MIME, but we sometimes
need to represent synthesized MIME trees that are not literally present
in the message. The cases I know of:
- multipart/encrypted (both S/MIME and PGP, although I think PGP
doesn't actually use this Content-Type?)
- message/external-body
- uuencode and yenc
- TNEF
- mbox syntax (since we often need to parse headers when parsing a
mailbox, we might as well parse the mailbox at the same time... should
also save on buffer rereading and copying).
* Be liberal in what you accept -- The only significant case I am aware
of off the top of my head where the RFC is violated in practice is
spaces in RFC 2047-encoded words. Violations in end-of-line
standards are possibly an issue as well.
* How to implement -- I know a lot of people would like to see this in
JS, but this is a case where we are potentially dealing with very large
strings of possibly-non-ASCII data that I don't want to spend a lot of
time converting between "ASCII" and UTF-16. In my experience, dealing
with binary I/O in JS is still too difficult. Finally, a C++
implementation gives us better leeway in dealing with multithreaded APIs.
* How extensible should it be -- What hooks do extensions need to be
able to implement in this process, and how?
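To make the part-numbering rule concrete, here is a sketch of the libmime-to-IMAP translation the examples above describe (the function name and the messageParts argument are mine; a real implementation would consult the parsed tree to learn which IMAP parts are message/rfc822):

```javascript
// Illustrative only: convert a libmime part number to the IMAP one.
// `messageParts` is the set of IMAP part numbers that are
// message/rfc822, since libmime adds an extra "1" level under those.
function libmimeToImap(libmimePart, messageParts = new Set()) {
  const segs = libmimePart.split(".");
  // libmime always addresses the message body as part "1" first.
  if (segs[0] !== "1") throw new Error("libmime paths start at body part 1");
  const out = [];
  let i = 1;
  while (i < segs.length) {
    out.push(segs[i++]);
    // Skip the wrapper "1" libmime inserts for an embedded message's body.
    if (messageParts.has(out.join(".")) && segs[i] === "1") i++;
  }
  return out.join(".");
}

// Matching the examples above: libmime 1.4 is IMAP part 4, and if IMAP
// part 3 is a message/rfc822, libmime 1.3.1.2 is IMAP part 3.2.
```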

Example use cases:
* imapd.js, for FETCH and BODYSTRUCTURE (I see undefineds occasionally
sneak in there)
* nntpd.js for when I need to parse message headers for XOVER, XHDR, and
HEAD
* mimemsg.js and whoever uses it (gloda, Thunderbird Conversations)
* People who parse mailboxes (nsMailboxParser), if we add in mbox support
* nsMsgBodyHandler (full body text search)

How to get it into the tree:
I'll admit, this would need to be a reimplementation from scratch. As a
result, blocking on the patch until it is fully feature-compatible with
current libmime would result in this patch not getting in for years.
Instead, I would rather see this get inserted piece by piece and made
available to the extension authors or core people who really desire it.
My proposal for the pieces of the patch:
1. Basic implementation, that supports getting headers and bodies, with
base64/quoted-printable support
2. Full gamut of headers support (i.e., RFC 2047 decoding)
3. Grabbing specific part[s]
4. Moderately complete test suite
5. Use as gloda's mimemsg.js backend and elsewhere where useful
6. Decode multipart/encrypted, message/external-body trees
7. Decode uuencode, yenc, mbox, TNEF
8. Comprehensive test suite, complete with a fuzzer
9. Use as backend for libmime
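To give a feel for step 2, here is a simplified sketch of RFC 2047 decoding. It assumes Node's Buffer for the base64 and charset work; a real implementation would call the platform's Unicode decoders, and only UTF-8/Latin-1 are handled here.

```javascript
// Simplified RFC 2047 decoder -- a sketch only. Assumes Node's Buffer
// and supports just the UTF-8 / ISO-8859-1 / US-ASCII charsets.
function decodeRfc2047(header) {
  // Per the RFC, whitespace between two adjacent encoded words is ignored.
  header = header.replace(/(\?=)[ \t]+(=\?)/g, "$1$2");
  return header.replace(/=\?([^?]+)\?([bq])\?([^?]*)\?=/gi,
                        (match, charset, enc, text) => {
    let bytes;
    if (enc.toLowerCase() === "b") {
      bytes = Buffer.from(text, "base64");
    } else {
      // Q encoding: "_" is a space, =XX is a hex-encoded byte.
      const qp = text.replace(/_/g, " ").replace(/=([0-9A-Fa-f]{2})/g,
        (_, hex) => String.fromCharCode(parseInt(hex, 16)));
      bytes = Buffer.from(qp, "latin1");
    }
    const cs = charset.toLowerCase();
    if (cs === "utf-8") return bytes.toString("utf8");
    if (cs === "iso-8859-1" || cs === "us-ascii") return bytes.toString("latin1");
    return match;  // unknown charset: leave the encoded word untouched
  });
}
```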

Work on this would probably need to be spread out over several releases;
in particular, I would like to see part 5 (where it would be used widely
in practice by extensions) bake in comm-release for one or two versions
before committing part 9 to comm-central. Everything beyond part 5 is
either new functionality or changes to prevent regression of current
functionality. Some of the work could probably be shared with libmime
before then, though (see bug 77811 and bug 446321 as the bugs for TNEF
support).

I don't think I myself have time to do more than a prototype over the
next few weeks, so if anyone else would be itching to implement this,
feel free to go ahead.

If you have any amendments, bikesheds, comments, concerns, or questions,
please feel free to comment right now.

Appendix A: IRC logs where I discussed other prototypes (forthcoming)
Appendix B: Current mime coverage levels:
<http://www.tjhsst.edu/~jcranmer/c-ccov/src/trunk/mailnews/mime/src/index.html>
(note that line 215 of mimebuf.cpp is only executed 3,390,308 times!).

Robert Kaiser

Jul 29, 2011, 11:42:58 AM
Joshua Cranmer schrieb:

> * How to implement -- I know a lot of people would like to see this in
> JS, but this is a case where we are potentially dealing with very large
> strings of possibly-non-ASCII data that I don't want to spend a lot of
> time converting between "ASCII" and UTF-16. In my experience, dealing
> with binary I/O in JS is still too difficult. Finally, a C++
> implementation gives us better leeway in dealing with multithreaded APIs.

Would it be possible to talk to that with js-ctypes then instead of
going through XPCOM?

Thanks for bringing this into the discussion. I strongly think that a
number of pieces of Mozilla MailNews code really deserve complete
reimplementation even though it's a daunting task. And libmime surely
sounds like one piece that needs this. I also really like going with a
step-by-step plan that constructs the new code next to slowly phasing
the old code out, that should surely have better chances of success than
an all-or-nothing game.

Robert Kaiser


--
Note that any statements of mine - no matter how passionate - are never
meant to be offensive but very often as food for thought or possible
arguments that we as a community should think about. And most of the
time, I even appreciate irony and fun! :)

Joshua Cranmer

Jul 29, 2011, 1:21:28 PM
On 7/29/2011 8:42 AM, Robert Kaiser wrote:
> Joshua Cranmer schrieb:
>> * How to implement -- I know a lot of people would like to see this in
>> JS, but this is a case where we are potentially dealing with very large
>> strings of possibly-non-ASCII data that I don't want to spend a lot of
>> time converting between "ASCII" and UTF-16. In my experience, dealing
>> with binary I/O in JS is still too difficult. Finally, a C++
>> implementation gives us better leeway in dealing with multithreaded
>> APIs.
>
> Would it be possible to talk to that with js-ctypes then instead of
> going through XPCOM?

I would prefer that, but that requires either fixing bug 593484 or
making the MIME library into a separate library. Fixing bug 593484 is
easy on Linux, possible on Windows, and possibly impossible on OS X.

Jim

Jul 31, 2011, 12:29:24 AM
On 07/28/2011 07:57 PM, Joshua Cranmer wrote:
> I don't think I myself have time to do more than a prototype over the
> next few weeks, so if anyone else would be itching to implement this,
> feel free to go ahead.

I don't think I'd want to write this entirely by myself, but given a
prototype (and ideally, a branch to work on), I'd certainly hack on
this. Having added various bits and pieces to libmime, I know all too
well its limitations, and I'd love to see a better MIME parser.

- Jim

Andrew Sutherland

Jul 31, 2011, 3:58:45 AM
On 07/28/2011 05:57 PM, Joshua Cranmer wrote:
> * How to implement -- I know a lot of people would like to see this in
> JS, but this is a case where we are potentially dealing with very large
> strings of possibly-non-ASCII data that I don't want to spend a lot of
> time converting between "ASCII" and UTF-16. In my experience, dealing
> with binary I/O in JS is still too difficult.

I think you may be overstating the difficulty of dealing with encoded
strings as byte-arrays. While I agree that trying to play the "JS
strings hold both proper unicode and weird encodings" game is
unpleasant, we're not living in a stock JS 1.5 world and so we don't
have to do that. If you are asserting efficiency concerns, I don't
believe it would be a problem given the type-stability of string
processing, the efficacy of our JITs on string types, and easy ability
to expose native code to JS without going through the XPCOM layer.

The JS engine has available/can have exposed to it typed arrays,
https://developer.mozilla.org/en/JavaScript_typed_arrays

node.js also has a pretty good example of a byte-buffer abstraction that
can do string encoding tricks. While MIME parsing involves more
encodings than node knows about, it's not a hard set of tricks to teach
it. It would not be hard to mimic in SpiderMonkey even without the
typed array stuff; the v8-based implementation is just a clever re-use
of the built in string type and most of the logic is pure JS.

node Buffer docs:
http://nodejs.org/docs/v0.5.2/api/buffers.html

And v8monkey/spidernode already knows how to look like a v8 string...
https://github.com/zpao/v8monkey/blob/master/js/src/v8api/string.cpp
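As an illustration of the byte-array style (names are mine, and this is a sketch, not the node Buffer API): a quoted-printable decoder that works purely on Uint8Arrays, so nothing is ever round-tripped through UTF-16.

```javascript
// A sketch of keeping MIME body decoding entirely in byte-array land:
// quoted-printable decoding over a Uint8Array, with no conversion to
// UTF-16 strings anywhere in the loop.
function decodeQP(bytes) {
  const out = new Uint8Array(bytes.length); // decoding never grows the data
  let o = 0;
  const hexVal = b =>
    b >= 0x30 && b <= 0x39 ? b - 0x30 :          // '0'..'9'
    b >= 0x41 && b <= 0x46 ? b - 0x41 + 10 :     // 'A'..'F'
    b >= 0x61 && b <= 0x66 ? b - 0x61 + 10 : -1; // 'a'..'f'
  for (let i = 0; i < bytes.length; i++) {
    if (bytes[i] !== 0x3d) { out[o++] = bytes[i]; continue; }  // not '='
    if (bytes[i + 1] === 0x0d && bytes[i + 2] === 0x0a) { i += 2; continue; } // soft break
    if (bytes[i + 1] === 0x0a) { i += 1; continue; } // be liberal: bare LF
    const hi = hexVal(bytes[i + 1]), lo = hexVal(bytes[i + 2]);
    if (hi >= 0 && lo >= 0) { out[o++] = (hi << 4) | lo; i += 2; }
    else out[o++] = bytes[i];                 // malformed: keep '=' verbatim
  }
  return out.subarray(0, o);
}
```

The decoded bytes would only be handed to a charset decoder at the point where something actually needs a string.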


> Finally, a C++
> implementation gives us better leeway in dealing with multithreaded APIs.

I presume the benefit in this case would be that if we parse on one
thread and consume on another thread, then copying costs could be
avoided through use of reference-counted global-heap-managed objects?

If so, it seems like the benefit would be moot in the cases where we
process the MIME tree on the same thread that we parse it. And in the
different thread cases, it seems like either a) the amount of data would
be small, or b) the amount of data would be large consisting of large
blocks of memory that can be efficiently bulk copied.

A relevant question would also be whether the (XPCOM) cycle collector is
multi-thread-aware? While the MIME hierarchy is indeed a tree and so
can be represented without cycles, that kind of invariant is easy to
screw up; especially if one of the goals is to let third-party
extensions augment the tree, we don't want them to be able to regress
memory behavior.

And along those lines, one might argue that it's much easier for
third-parties to write and ship pure JS code than build and link it on
all supported platforms given the release cadence. I'll cite rkent's
unfortunate build problems he has been asking about to that end :(

Andrew

Joshua Cranmer

Jul 31, 2011, 3:14:48 PM
On 07/31/2011 03:58 AM, Andrew Sutherland wrote:
> The JS engine has available/can have exposed to it typed arrays,
> https://developer.mozilla.org/en/JavaScript_typed_arrays
Typed arrays are nice, but our I/O layer can't spit out a typed array
yet, to my knowledge, so we at least have to put a shell around it in C++.

>> Finally, a C++
>> implementation gives us better leeway in dealing with multithreaded
>> APIs.
>
> I presume the benefit in this case would be that if we parse on one
> thread and consume on another thread, then copying costs could be
> avoided through use of reference-counted global-heap-managed objects?
Well, this is my knowledge of the current state of multithreading in JS:
1. JS XPCOM components cannot be accessed off the main thread
2. If you want to use multiple threads, you have to use workers
3. In worker threads, you can only access interfaces that are explicitly
marked as threadsafe via nsIClassInfo
4. Necko interfaces claim to only work on the main thread

In other words, if it were implemented in JS, to my knowledge, you could
only access the MIME tree via the main thread. I don't particularly care
about parsing and consuming on different threads, but I do want to make
sure that the MIME tree is not inaccessible to other threads.


> A relevant question would also be whether the (XPCOM) cycle collector
> is multi-thread-aware? While the MIME hierarchy is indeed a tree and
> so can be represented without cycles, that kind of invariant is easy
> to screw up, especially if one of the goals is to let third-party
> extensions augment the tree, we don't want them to be able to regress
> memory behavior.

The cycle collector is not really thread-aware: it can only handle
refcount changes on the main thread (or the cycle collector thread).


> And along those lines, one might argue that it's much easier for
> third-parties to write and ship pure JS code than build and link it on
> all supported platforms given the release cadence. I'll cite rkent's
> unfortunate build problems he has been asking about to that end :(

I think the most appropriate answer here is jsctypes, particularly if
bug 593484 can be fixed.

Even if that's not the case, I still wonder how extensible raw MIME
parsing needs to be. It seems sufficiently rare to me that I think we
can handle the pain of binary issues until jsctypes or similar give us a
better story, especially if we expose it via a raw C ABI (which we need
to do for jsctypes anyways).

Andrew Sutherland

Jul 31, 2011, 5:08:00 PM
On 07/31/2011 12:14 PM, Joshua Cranmer wrote:
> Well, this is my knowledge of the current state of multithreading in JS:
> 1. JS XPCOM components cannot be accessed off the main thread
> 2. If you want to use multiple threads, you have to use workers
> 3. In worker threads, you can only access interfaces that are explicitly
> marked as threadsafe via nsIClassInfo
> 4. Necko interfaces claim to only work on the main thread
>
> In other words, if it were implemented in JS, to my knowledge, you could
> only access the MIME tree via the main thread. I don't particularly care
> about parsing and consuming on different threads, but I do want to make
> sure that the MIME is not inaccessible to other threads.

Yes, if exposed exclusively via XPConnect to C++ code, that could be
troublesome. But it's not that hard to spin up a JS runtime on another
thread and expose a lightweight C++ wrapper class that is based on using
the JS API to traverse the JS MIME object hierarchy.

Note that my value of "that hard" is of course a relative thing given
that you are already talking about reimplementing the parser in C++ in
an XPCOM/multi-threaded happy way. If the comparison were using an
existing and known/proven already very working with active users (I
believe Jonathan Kamens is interested in such a strategy) via jsctypes,
that is indeed a different issue.


> Even if that's not the case, I still wonder how extensible raw MIME
> parsing needs to be. It seems sufficiently rare to me that I think we
> can handle the pain of binary issues until jsctypes or similar give us a
> better story, especially if we expose it via a raw C ABI (which we need
> to do for jsctypes anyways).

I agree that it does not seem like the type of thing that needs to be
endlessly extensible. If we don't start using any such rewrite until we
have implemented PGP support and TNEF support in the core, then it
becomes much less of a concern.

However, my concern is that we would be unlikely to block on that, and
the pain is not experienced by the Thunderbird Core but instead by
extension developers. It's understandable why the current situation is
so painful for extension developers trying to replace or extend core C++
functionality that was not really designed for extensions. It would be
less understandable why a ground-up rewrite would leave them in almost
exactly the same situation (one which might still require some degree of
rewrite on their part).


Andrew

Andrew Sutherland

Jul 31, 2011, 5:11:08 PM
On 07/31/2011 02:08 PM, Andrew Sutherland wrote:
> If the comparison were using an
> existing and known/proven already very working with active users (I
^ C library

> believe Jonathan Kamens is interested in such a strategy) via jsctypes,
> that is indeed a different issue.

(left out a word)

Andrew

Jonathan Protzenko

Jul 31, 2011, 7:30:00 PM
to Joshua Cranmer, dev-apps-t...@lists.mozilla.org
This might sound naïve, but why not use coroutines in JS to write an
async libmime parser? The load would still be on the main thread but at
least it would be asynchronous and frankly speaking, with all the effort
that's been put into JS performance lately (as Andrew pointed out),
performance might not be that much of a concern anymore. libmime was
great 20 years ago when we couldn't afford to hold a message all at once
in memory; I tend to think that the constraints have relaxed since then.

jonathan

Jonathan Protzenko

Jul 31, 2011, 7:32:06 PM
to Joshua Cranmer, dev-apps-t...@lists.mozilla.org
Of course, what I have in mind for point 1. you mentioned on IRC
(message display) is Conversations, which could (on a first approach)
use a pure JS implementation of libmime without having to ensure that it
is accessible for C++. I'm talking about message display here, not
downloading attachments and stuff.

jonathan

Robert Kaiser

Aug 1, 2011, 10:09:20 AM
Andrew Sutherland schrieb:

> On 07/28/2011 05:57 PM, Joshua Cranmer wrote:
>> Finally, a C++
>> implementation gives us better leeway in dealing with multithreaded APIs.
>
> I presume the benefit in this case would be that if we parse on one
> thread and consume on another thread, then copying costs could be
> avoided through use of reference-counted global-heap-managed objects?

Could chrome workers help us there if we would be completely in JS land?

Joshua Cranmer

Aug 2, 2011, 2:39:02 AM

JS doesn't allow practical coroutines if you're also playing around with
recursion (<https://bugzilla.mozilla.org/show_bug.cgi?id=666396> would
fix a lot of the issues).

I don't doubt that JS is fast enough to implement a MIME parser, but
that wasn't necessarily my concern. My concern with a JS implementation
is the following:

1. I want to allow access from C++. This implies an XPCOM JS component,
which is inaccessible from non-main threads, or techniques that have us
create our own JS runtimes, which I'm hesitant to do noting the severe
churn SpiderMonkey has in their APIs.
2. We need this to be accessible from JS workers (the current preferred
model for multithreaded JS, it appears). This implies that we can't
implement as a JS component and instead as a script which can work in
either context (which means I lose a lot of XPCOM--anything that isn't
explicitly thread-safe, and that includes most of necko).
3. We need to deal with potentially binary data (i.e., EAI, inane bad
charsets, or 8-bit MIME). JavaScript lacks a lot of effective APIs in
this regard. Uint8Array or ctypes.char.array are about the closest, but
these have issues (in particular, I highly doubt that String-like APIs
work on them). We also need to be able to call Unicode decoders for RFC
2047 support (although that currently being an auxiliary C++ API means
it's not imperative right now).
4. We need a way to asynchronously deliver data to the parser. XHR is
out because it can't deliver on the "asynchronous" part (I see nothing
in XHR nor XHR2). Necko is out because it's inaccessible from JS chrome
workers (nsIOService fails to implement nsIClassInfo, let alone claim to
be threadsafe). That leaves either creating our own thunking
implementation or doing magic calls to JS, neither of which feel
particularly good to me.

asuth recently argued to me over IRC that it may be better to prototype
this in JS with a hacky layer for now and then migrate it to better APIs
as they are added. While I see the goal of that philosophy, I am
concerned that the platform community will not be forthcoming with the
necessary APIs and we will be forced to maintain a layer of hacky APIs.

In short, what it would take to convince me that a JS implementation is the
best way forward:
1. An assurance that an asynchronous, binary I/O API for JS that is
usable from multiple threads is coming.
2. An assurance that there will be a suitable API that allows JS
implementations to both be usable from multiple threads in JS and in C++.
3. Preliminary guidelines for these APIs to minimize churn for when we
remove any necessary preliminary hacky APIs.

Jean-Marc Desperrier

Aug 2, 2011, 5:20:58 AM
Jonathan Protzenko wrote:
> with all the effort that's been put into JS performance lately (as
> Andrew pointed out), performance might not be that much of a concern
> anymore.
> libmime was great 20 years ago when we couldn't afford to hold a message
> all at once in memory;

In the context of newsgroups/mailing lists, there can be tens of
thousands of messages; the performance of the raw decoding part still
needs to be top-notch.

Ludovic Hirlimann

Aug 2, 2011, 6:59:41 AM
On 29/07/11 02:57, Joshua Cranmer wrote:

> Open issues:
> * Part numbering -- Libmime numbers parts differently from IMAP. In
> particular, libmime requires a part number of `1' to get to the body of
> a message, so 1.4 in libmime is really IMAP part 4, and (if IMAP part 3
> were a message/rfc822) 1.3.1.2 would be IMAP part 3.2. Numbering
> non-MIME decapsulated parts would require a different separator, e.g.,
> 3-1.3-3 if we chose to paint the bikeshed `-'.
> * Non-MIME decapsulation -- It's not quite non-MIME, but we sometimes
> need to represent synthesized MIME trees that are not literally present
> in the message. The cases I know of:
> - multipart/encrypted (both S/MIME and PGP, although I think PGP doesn't
> actually use this Content-Type?)
> - message/external-body
> - uuencode and yenc
> - TNEF

The partly good news is that we have plenty of test cases for the MIME
formats we don't support properly, so it would be quite easy to have a
large number of tests available and maybe use an extreme programming
approach to make sure that this new implementation wouldn't regress.

Ludo
--
Ludovic Hirlimann MozillaMessaging QA lead
https://wiki.mozilla.org/Thunderbird:Testing
http://www.spreadthunderbird.com/aff/79/2

Jonathan Protzenko

Aug 2, 2011, 9:32:01 AM
to Jean-Marc Desperrier, dev-apps-t...@lists.mozilla.org
Well let's write it in assembly then! And let's leave it to the
programmers to deal with readability and maintainability, they should
know how to do that.

More seriously, we're not parsing thousands of messages *at once*. The
only thing I can think of that parses lots of messages is the indexing,
and it's asynchronous, in the background, and can be easily configured
to not chew all the CPU.

jonathan


Jonathan Protzenko

Aug 2, 2011, 9:40:10 AM
to Joshua Cranmer, dev-apps-t...@lists.mozilla.org

On 08/01/2011 11:39 PM, Joshua Cranmer wrote:
> On 07/31/2011 07:30 PM, Jonathan Protzenko wrote:
>> This might sound naïve, but why not use coroutines in JS to write an
>> async libmime parser? The load would still be on the main thread but
>> at least it would be asynchronous and frankly speaking, with all the
>> effort that's been put into JS performance lately (as Andrew pointed
>> out), performance might not be that much of a concern anymore.
>> libmime was great 20 years ago when we couldn't afford to hold a
>> message all at once in memory; I tend to think that the constraints
>> have relaxed since then.
>
> JS doesn't allow practical coroutines if you're also playing around
> with recursion (<https://bugzilla.mozilla.org/show_bug.cgi?id=666396>
> would fix a lot of the issues).

I was thinking more of the TCO bug being fixed
https://bugzilla.mozilla.org/show_bug.cgi?id=445363, but well... (TCO is
what prevents my coroutines from not growing the stack in
https://github.com/protz/thunderbird-stdlib/blob/master/tests/test_SimpleStorage.js).


>
> I don't doubt that JS is fast enough to implement a MIME parser, but
> that wasn't necessarily my concern. My concern with a JS
> implementation is the following:
>
> 1. I want to allow access from C++. This implies an XPCOM JS
> component, which is inaccessible from non-main threads, or techniques
> that have us create our own JS runtimes, which I'm hesitant to do
> noting the severe churn SpiderMonkey has in their APIs.

Why? I'm sorry but I'm not getting that part at all. Could you be very
clear in how and why you need this to be done? How is that not solved
with a component on the main thread written in JS that delegates to some
other thread, if you really insist on having this multithreaded?

Cheers,

jonathan

Jean-Marc Desperrier

Aug 2, 2011, 12:45:35 PM
Jonathan Protzenko wrote:
> Well let's write it in assembly then! And let's leave it to the
> programmers to deal with readability and maintainability, they should
> know how to do that.
>
> More seriously, we're not parsing thousands of messages *at once*.

What I mean is that if you start by saying we don't need to care about
performance, performance drops a lot (and a lot more than what cleaner
code actually requires), and there are some use cases where it is
critical.

I'm convinced you can do almost all the processing in js, and maybe even
*all* the processing in js, and still have good performance for those
specific cases, but for that to happen you need to check and test these
kinds of critical cases from the start.

> The only thing I can think of that parses lots of messages is the indexing

mozilla.support.firefox contains around 52,000 messages. That's not that
much of an anomaly for a newsgroup folder.
Switching the view to ordered by "From" takes around 6 seconds on my
Core2 Duo E7500 @3GHz, almost the same as going from threaded to
unthreaded, but unthreaded to threaded takes only about 2 seconds.

I know how deeply inefficient the code behind it is; in fact, this
threaded/unthreaded asymmetry cries out "hey, look at how inefficient I
am, I can even run the easier case 3 times slower", so I wouldn't be
that surprised if a full-js but well-thought-out, smartly written
implementation were to end up being faster. But I'm also convinced that
code written with the assumption that nothing in it needs to be
optimized or well-written algorithmically will be slower.

So, I don't believe for one second you need to go assembly language; you
just need to check the algorithmic complexity and make sure your inner
loops are very efficient (and with type inference, etc., that could
still be full js).

Joshua Cranmer

Aug 2, 2011, 12:51:55 PM
On 8/2/2011 9:45 AM, Jean-Marc Desperrier wrote:
> mozilla.support.firefox contains around 52 000 message. It's not that
> much of an anomaly for a newsgroup folder.
> Switching the view to ordered by "From" takes around 6 seconds on my
> Core2 Duo E7500 @3GHz, almost the same as going from threaded to
> unthreaded, but unthreaded to threaded take only about 2 seconds.
>
> I know how deeply inefficient the code behind is, in fact this
> threaded/unthreaded asymmetry cries out "hey look at how inefficient I
> am, I can even run the easier case 3 time slower", so I wouldn't be
> that surprised if a full js but well thought and smartly written code
> were to end up being faster. But I'm also convinced a code written
> with the assumption that there is nothing there that needs to be
> optimized/well-written algorithmically will be slower.

That's not about MIME parsing, that's about our database's efficiency,
or lack thereof. And possibly nsITreeView overhead.

Andrew Sutherland

Aug 2, 2011, 1:06:47 PM
On 08/01/2011 11:39 PM, Joshua Cranmer wrote:
> In short, what it would take to convince me that a JS implementation is the
> best way forward:

It sounds very much like you are very determined to go the C++ route.
As long as you are willing to do the work and see it through to
completion, that's fantastic, and I will be happy to see the backside of
libmime.

If anyone else is looking to take this on, I suggest using JS and I
would be happy to provide pointers on what would be required to expose
the JS implementation to C++ without using XPCOM/XPConnect.

Andrew

Jean-Marc Desperrier

Aug 2, 2011, 1:28:01 PM
Andrew Sutherland wrote:
> node.js also has a pretty good example of a byte-buffer abstraction that
> can do string encoding tricks.

In the MIME case, I think the best option is indeed to keep the data as
a raw byte buffer, and convert it to what will be displayed on the fly.

And if it's included in a new message, copy the raw byte version, not
the interpreted one. You might need a cache of the sanitized version
when your comparisons are based on it, and not just the display.

It would be a great step forward if two Thunderbird clients responding
to each other were *guaranteed* to keep the content of the subject the
same (when the user did not manually edit it), whatever their charset
decoding is set to, or that if the subject is incorrect it doesn't get
more and more broken each time.
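A minimal sketch of that idea (class and property names are mine): the raw bytes stay the source of truth, the decoded form is only a lazily built display cache, and replying hands back the untouched bytes.

```javascript
// Sketch only: raw header bytes are authoritative; decoding is a
// display-side cache. Assumes UTF-8 for the lazy decode, which a real
// implementation would replace with proper charset handling.
class RawHeader {
  constructor(rawBytes) {
    this.raw = rawBytes;   // Uint8Array, never rewritten
    this._display = null;  // cached decoded form, built on demand
  }
  get display() {
    if (this._display === null)
      this._display = new TextDecoder("utf-8").decode(this.raw);
    return this._display;
  }
  // Quoting into a new message copies the raw bytes, not the decoded text.
  forReply() { return this.raw; }
}
```

Two clients exchanging such headers verbatim could never compound each other's decoding mistakes.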

> While MIME parsing involves more encodings than node knows about,
> it's not a hard set of tricks to teach it.

You make me slightly worried that you underestimate how hard it is to
correctly handle the edge cases of MIME encoding (decoding is a little
bit easier, but not much).

99% of the libraries around get it wrong.
I remember some specific cases where Mozilla's MIME code had it wrong
for a very long time, resulting, for example, in the incorrect insertion
of tab characters inside the subject.

I didn't look at every detail, but there's a related bug that's still
open and active (created ten years ago, in 2001):
"inconsistent display of TAB characters in subjects & thread pane"
https://bugzilla.mozilla.org/show_bug.cgi?id=64948

Joshua Cranmer

Aug 2, 2011, 1:53:26 PM
On 8/2/2011 10:06 AM, Andrew Sutherland wrote:
> On 08/01/2011 11:39 PM, Joshua Cranmer wrote:
>> In short, what it would take to convince me that a JS implementation is the
>> best way forward:
>
> It sounds very much like you are very determined to go the C++ route.
> As long as you are willing to do the work and see it through to
> completion, that's fantastic and I will happy to see the backside of
> libmime.

Not so much "determined" as I am "resigned". I've also had, through a
few of my other projects, experience with keeping something in sync with
SpiderMonkey code; that experience has left a bad taste in my mouth, as
it seems practically every new release modifies the APIs needed to run a
script. The other concern I have is maintaining the globals that people
expect.

After thinking about this some more, I realized what you probably meant
earlier about driving the JS runtime ourselves, after independently
coming up with something similar. Some poking around xpconnect leaves me
assured that the globals issue can be solved
(nsIXPConnect::InitClassesWithNewWrappedGlobal), but I'd still rather
see a method which says "give me a global object and context for this
script".

Andrew Sutherland

Aug 2, 2011, 2:28:31 PM
On 08/02/2011 10:53 AM, Joshua Cranmer wrote:
> Not so much "determined" as I am "resigned". I've also had, through a
> few of my other projects, experience with keeping something in sync with
> SpiderMonkey code; that experience has left a bad taste in my mouth, as
> it seems practically every new release modifies the APIs needed to run a
> script. The other concern I have is maintaining the globals that people
> expect.

Even if they do change the API to run a script every release, we are
talking about two functions with simple signatures, and honestly, these
look pretty familiar to me from many revisions ago:

JSObject * JS_CompileFile(JSContext *cx, JSObject *obj, const char
*filename);
JSBool JS_ExecuteScript(JSContext *cx, JSObject *obj, JSObject
*scriptObj, jsval *rval);

https://developer.mozilla.org/en/SpiderMonkey/JSAPI_Reference/JS_CompileFile
https://developer.mozilla.org/en/SpiderMonkey/JSAPI_Reference/JS_ExecuteScript
https://developer.mozilla.org/En/SpiderMonkey/JSAPI_User_Guide#Compiled_scripts


If you are referring to the bit-rot JSHydra experienced, I don't think
it's surprising that the parser internals would change. (And now there
is an explicit AST intended for consumption! :)


> After thinking about this some more, I realized what you probably meant
> earlier about driving the JS runtime ourselves, after independently
> coming up with something similar. Some poking around xpconnect leaves me
> assured that the globals issue can be solved
> (nsIXPConnect::InitClassesWithNewWrappedGlobal), but I'd still rather
> see a method which says "give me a global object and context for this
> script".

How about:
https://developer.mozilla.org/en/SpiderMonkey/JSAPI_Reference/JS_NewContext
followed by:
https://developer.mozilla.org/en/SpiderMonkey/JSAPI_Reference/JS_NewCompartmentAndGlobalObject

Andrew

David Bienvenu

unread,
Aug 2, 2011, 4:27:13 PM8/2/11
to
On 8/2/2011 10:06 AM, Andrew Sutherland wrote:
>
> If anyone else is looking to take this on, I suggest using JS and I would be happy to provide pointers on what would be required to expose the JS implementation to C++
> without using XPCOM/XPConnect.
Roughly what does that entail?

- David

Andrew Sutherland

unread,
Aug 2, 2011, 9:14:34 PM8/2/11
to

The control flow could vary a lot depending on the needs of the C++ code.
In general, the three big things that would need to happen are:
1) Spin up a JS runtime on the thread and load the JS mime parsing code
into it.
2) Cause I/O to be fed to the JS mime parsing code, producing a JS
object representation as a byproduct.
3) Traverse/consume the resulting JS object hierarchy.

For #1, see this for a good idea of what the boilerplate looks like.
https://developer.mozilla.org/En/SpiderMonkey/JSAPI_User_Guide#A_minimal_example
It may also be possible to piggyback on the web/chrome workers
mechanism, depending.


For #2, there are a variety of ways this could happen, such as:
- Expose the I/O mechanisms that web/chrome workers can use into the JS
runtime, tell the JS code a mailnews URI, and let it use those I/O
mechanisms to get the data. For example, an initial hack could just use
XHR1.

- Expose a custom C++ object to JS to provide I/O services; it would
look something like this:
https://developer.mozilla.org/En/SpiderMonkey/JSAPI_User_Guide#Defining_objects_and_properties

- Use a C++ I/O mechanism to get the data to our thread, then feed it
into the JS code either by direct invocation or by posting an event.
Direct calls look like the following, and posted events would look
similar; it's just a question of how much complexity is on the stack
when you call into JS space, and how much of a chance you give the
trace JIT to get into a groove by providing it with lots of data and
longer loops that avoid returning into C++ space too often:
https://developer.mozilla.org/En/SpiderMonkey/JSAPI_User_Guide#Calling_functions


For #3, you basically translate how you would traverse the object from
JS into C++:
https://developer.mozilla.org/En/SpiderMonkey/JSAPI_Phrasebook#Object_properties


It's worth noting that there's a very limited set of things that the
current C++ back-end needs from the parser *if we assume that most of
the display bits will be done in JS*. Specifically, at the bottom-most
layer, I suspect the mbox parsing logic only needs a very simple
understanding of message parsing. Then when we go up a layer to the
things nsIMsgDBHdrs need to know about, they have a constrained set of
data that could easily be postMessaged across from a worker thread and
then set on an nsIMsgDBHdr via traditional XPCOM. Of course, you
obviously know much more about the needs of the existing C++ code than I!

Andrew

Jonathan Protzenko

unread,
Aug 3, 2011, 10:34:44 AM8/3/11
to Andrew Sutherland, dev-apps-t...@lists.mozilla.org
If that's the route we decide to go, maybe
https://bugzilla.mozilla.org/show_bug.cgi?id=649537 would be a source
of inspiration (esp. given the "Lose XPConnect part").

jonathan

Joshua Cranmer

unread,
Aug 6, 2011, 12:40:08 AM8/6/11
to
On 7/31/2011 2:08 PM, Andrew Sutherland wrote:
> On 07/31/2011 12:14 PM, Joshua Cranmer wrote:
>> Well, this is my knowledge of the current state of multithreading in JS:
>> 1. JS XPCOM components cannot be accessed off the main thread
>> 2. If you want to use multiple threads, you have to use workers
>> 3. In worker threads, you can only access interfaces that are explicitly
>> marked as threadsafe via nsIClassInfo
>> 4. Necko interfaces claim to only work on the main thread
>>
>> In other words, if it were implemented in JS, to my knowledge, you could
>> only access the MIME tree via the main thread. I don't particularly care
>> about parsing and consuming on different threads, but I do want to make
>> sure that the MIME is not inaccessible to other threads.
>
> Yes, if exposed exclusively via XPConnect to C++ code, that could be
> troublesome. But it's not that hard to spin up a JS runtime on
> another thread and expose a lightweight C++ wrapper class that is
> based on using the JS API to traverse the JS MIME object hierarchy.

I coded up a basic prototype that parses an RFC 822 envelope into
headers and body and ran it on a MIME torture test I have lying around
(I don't recall the original place I got it from, but I'm pretty sure it
was related to the IMAP unofficial mailing list). My first cut
implementation was a standard, simple recursive-descent parser which
requires the entire message to be in the buffer (i.e., no progressive
parsing).

I then tried an approach which is a partial hack around the proposed
yield* operator while still trying to keep the code looking clean... The
result was a regression best summarized as "slow enough that my computer
died", or at least a 20x slowdown.

I have a few ideas for alternative implementations:
1. Turn it into a more classical LR-ish generator. Non-recursive
solutions are more amenable to a generator implementation but they are
less readable, less maintainable, and much more likely to drive the
trace engine bonkers (giant switch statements kill our tracers, the last
I heard).
2. Try some sort of JS hack that allows us to "pause" the execution and
then "resume" it when we get more data. This would imply that we need a
C++ driver (unless you can guarantee the complete buffer of data or are
willing to block synchronously), is possibly impossible, and would in
any case require some hacks on the JS side.
2a. An alternative approach is to cause the native bufferer to
synchronously download the data by spinning the event loop until we get
the notification we need. This implies, probably, that we need to do
some more runnable proxies to forestall any unexpected reentrancy issues.
3. Give up and write it in C++.
4. Give up and require you to hold a buffer of the entire contents in
memory.

I think 2a is the best option at this point, although I'm nervous about
potential catastrophic failures.

Andrew Sutherland

unread,
Aug 6, 2011, 2:10:38 AM8/6/11
to
On 08/05/2011 09:40 PM, Joshua Cranmer wrote:
> I have a few ideas for alternative implementations:

What about something more along the lines of the existing libmime
processing structure, but avoiding integrating display logic into the
parsing logic?

MIME parsing is reasonably straightforward until we get to the nesting.
You have a simple state machine that needs to deal with line-oriented
data, handling headers and then switching to body parts. The complexity
comes from nesting and the fact that the parent containers are
responsible for detecting the end of the child and potentially need to
deal with encoding/decoding/decryption, etc.
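
As a rough sketch of that pre-nesting state machine (all names here are
hypothetical, not existing libmime/mailnews API):

```javascript
// Minimal line-oriented parser: accumulate headers until the blank
// line, then treat everything after it as body. Names are illustrative.
function makeSimpleMessageParser() {
  var state = "headers";
  var headers = [];
  var bodyLines = [];
  return {
    onLine: function (line) {
      if (state === "headers") {
        if (line === "")          // blank line: switch to body
          state = "body";
        else
          headers.push(line);
      } else {
        bodyLines.push(line);
      }
    },
    result: function () {
      return { headers: headers, body: bodyLines.join("\n") };
    }
  };
}
```

Nesting is then "just" a matter of also recognizing boundary delimiters
and Content-Type while in the headers state.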

libmime's stream-processing structure handles this quite well; it's
just obfuscated somewhat by the custom class system, a lot of verbose
low-level C book-keeping, and all the display logic/streaming output
logic.

The one improvement on libmime that jumps out at me: rather than
having parent containers either pass data through directly to children
via calls or play games with the sorta-virtual-function-table,
consider having an explicit driver class that has a stack of active
parts, active delimiters that should indicate the end of one of the
current parts, and active parse logic. My hope is that this might
simplify the mental model of the implementation, as well as make error
handling and debugging easier.
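
A sketch of what such a driver might look like, assuming a
line-at-a-time feed. The part shape, method names, and the simplified
boundary/header handling (no folded headers, naive boundary matching)
are all illustrative, not existing code:

```javascript
// Hypothetical driver: one object owns the stack of active parts and
// checks each line against the boundaries of the enclosing multiparts.
function makeDriver() {
  function newPart() {
    return { headers: {}, boundary: null, subParts: [], bodyLines: [] };
  }
  var root = newPart();
  var stack = [root];      // stack[stack.length - 1] is the active part
  var inHeaders = true;

  function onLine(line) {
    // A boundary line closes every part nested below its multipart.
    for (var i = stack.length - 1; i >= 0; i--) {
      var b = stack[i].boundary;
      if (b && (line === "--" + b || line === "--" + b + "--")) {
        stack.length = i + 1;            // pop the closed subtree
        if (line === "--" + b) {         // next child part begins
          var child = newPart();
          stack[i].subParts.push(child);
          stack.push(child);
          inHeaders = true;
        }
        return;
      }
    }
    var part = stack[stack.length - 1];
    if (inHeaders) {
      if (line === "") {
        inHeaders = false;
      } else {
        var m = /^([^:]+):\s*(.*)$/.exec(line); // no folded-header support
        if (m) {
          part.headers[m[1].toLowerCase()] = m[2];
          var bm = /boundary="?([^";]+)"?/.exec(m[2]);
          if (bm && /^content-type$/i.test(m[1]))
            part.boundary = bm[1];
        }
      }
    } else if (!part.boundary) {
      part.bodyLines.push(line);         // leaf body line
    }
    // (a multipart's own body lines, i.e. preamble/epilogue, are dropped)
  }
  return { root: root, onLine: onLine };
}
```

The point is that no part object ever forwards data; the driver alone
decides which part a line belongs to.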


Is there anything in particular you were trying to address with using
generators/yield? Since they are not yet a standard and it would be
great to be able to use this code in other engines like V8 on node.js,
it would be nice to avoid using them. (It is also my expectation that
they are unlikely to be extensively optimized until they become a
standard and become one of the benchmarks to optimize for.)

Andrew

Joshua Cranmer

unread,
Aug 6, 2011, 2:59:27 AM8/6/11
to
On 8/5/2011 11:10 PM, Andrew Sutherland wrote:
> On 08/05/2011 09:40 PM, Joshua Cranmer wrote:
>> I have a few ideas for alternative implementations:
>
> What about something more along the lines of the existing libmime
> processing structure, but avoiding integrating display logic into the
> parsing logic?
>
> MIME parsing is reasonably straightforward until we get to the
> nesting. You have a simple state machine that needs to deal with
> line-oriented data, handling headers and then switching to body
> parts. The complexity comes from nesting and the fact that the parent
> containers are responsible for detecting the end of the child and
> potentially need to deal with encoding/decoding/decryption, etc.

There are one or two cases that we have right now where being unable to
look ahead across a newline causes issues (IIRC, image < > links in
plain text that cross multiple lines); attempts to gracefully handle
malformed data broken across multiple lines are also an issue that
concerns me. Instead, I wanted to write the parser in a more
synchronous-expecting manner, which allows cases where newlines are in
the middle of productions (particularly reinterpreted body parts or
folded whitespace in headers) to look ahead to decide whether they need
to eat the newline or not.
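
Folded whitespace is the clearest example of wanting that one-line
lookahead. A sketch (hypothetical helper; note that whether to keep the
leading WSP verbatim or collapse it to a space is exactly the kind of
edge case behind the TAB-in-subject bug mentioned earlier in the
thread):

```javascript
// Unfold RFC 2822-style folded headers: if the next line starts with
// whitespace, eat the newline and treat the line as a continuation.
// Here the leading WSP is collapsed to a single space -- one of several
// defensible choices.
function unfoldHeaders(lines) {
  var unfolded = [];
  for (var i = 0; i < lines.length; i++) {
    var line = lines[i];
    // Lookahead: consume continuation lines belonging to this header.
    while (i + 1 < lines.length && /^[ \t]/.test(lines[i + 1])) {
      line += " " + lines[++i].replace(/^[ \t]+/, "");
    }
    unfolded.push(line);
  }
  return unfolded;
}
```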

> Is there anything in particular you were trying to address with using
> generators/yield? Since they are not yet a standard and it would be
> great to be able to use this code in other engines like V8 on node.js,
> it would be nice to avoid using them. (It is also my expectation that
> they are unlikely to be extensively optimized until they become a
> standard and become one of the benchmarks to optimize for.)

Mainly, I was trying to turn a parser which expects a blocking input
stream into an asynchronous input stream mechanism using generators...
to say that it blew up in my face is a totally true statement.
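
For the record, the shape of that trick looks something like the
following (written with ES6 function*/yield, which postdates the
SpiderMonkey of this thread; all names are hypothetical). The parser
reads as if input were blocking, while the driver resumes it whenever
data arrives:

```javascript
// A "blocking-style" parser that suspends at each read via yield.
function* blockingStyleParser(result) {
  var headers = [];
  var line;
  while ((line = yield) !== "")     // read lines until the blank line
    headers.push(line);
  var body = [];
  while ((line = yield) !== null)   // null signals EOF
    body.push(line);
  result.headers = headers;
  result.body = body.join("\n");
}

// The asynchronous driver: resume the parser once per available line.
function feed(gen, lines) {
  for (var i = 0; i < lines.length; i++)
    gen.next(lines[i]);
}

var result = {};
var gen = blockingStyleParser(result);
gen.next();  // advance to the first yield
feed(gen, ["Subject: hi", "", "body line one", "body line two", null]);
```

The cost presumably comes from suspending and resuming the generator on
every single read, which is consistent with the slowdown described above.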

Andrew Sutherland

unread,
Aug 6, 2011, 2:07:51 PM8/6/11
to
On 08/05/2011 11:59 PM, Joshua Cranmer wrote:
> There are one or two cases that we have right now where being unable to
> look ahead across a newline causes issues (IIRC, image < > links in
> plain text that cross multiple lines); attempts to gracefully handle
> malformed data broken across multiple lines are also an issue that
> concerns me. Instead, I wanted to write the parser in a more
> synchronous-expecting manner, which allows cases where newlines are in
> the middle of productions (particularly reinterpreted body parts or
> folded whitespace in headers) to look ahead to decide whether they need
> to eat the newline or not.

Assuming our stream model is "your handler is invoked with an array of
lines, and EOF is conveyed by a null", it seems reasonably
straightforward to let the processing code handle both the "have
another line" case and the "need to save our state and wait for more
lines to show up" case. Specifically, a variable such as lineSoFar
could be closed over or accessed via 'this'.

To clarify, the expectation is that disk/network I/O would come to us in
blocks of data much larger than a single line, allowing us to process
the data in fairly large batches of lines at a time. This should
method-JIT and trace-JIT fairly well. Idiomatically, this could look
like:

function makeParserUsingClosures() {
  var re_ws = /[ \t]/,
      closedOver_lineSoFar = null;

  return {
    processLines: function (lines) {
      for (var iChunkLine = 0; iChunkLine < lines.length; iChunkLine++) {
        var haveNextLine = (iChunkLine + 1 < lines.length),
            line = lines[iChunkLine];

        // (assume we are in header parsing)
        if ((line.length === 0) ||
            (closedOver_lineSoFar && !re_ws.test(line[0]))) {
          // flush and parse closedOver_lineSoFar...
        }
        else {
          // concatenate the line to line so far
        }
      }
    }
  };
}

I believe that definitely works for the multi-line headers.

I'm not sure exactly what you are referring to with reinterpreted body
parts. Do you mean something like multipart/related with a text/html
part that gets buffered and later has URLs transformed? Since that's
something we potentially need to entirely buffer anyway, there's no
harm in performing a synchronous transformation step later on.

Andrew
