Heads-up: Upcoming m-c change in character encoding conversion infrastructure

Henri Sivonen

Apr 28, 2017, 7:09:58 AM

to dev-apps-t...@lists.mozilla.org

I'm working on replacing uconv with encoding_rs:
https://bugzilla.mozilla.org/show_bug.cgi?id=encoding_rs

encoding_rs is a newly-written implementation of the Encoding Standard
that is more correct than uconv, performs better (if compiled with
nightly Rust), supports both UTF-16 and UTF-8 as the in-memory Unicode
representation (without data table duplication) and supports both Rust
and C++ callers.

Since it appears that mailnews components aren't being written in
Rust, the C++ case is the one that is of relevance to mailnews. The
current state of the API can be seen in
https://hg.mozilla.org/try/file/0dfc31834877/intl/Encoding.h . (It's
best to scroll to mozilla::Encoding and read it first before reading
mozilla::Decoder and mozilla::Encoder.)

The change isn't imminent, because the new code is still orange on
try, is unreviewed and requires nightly Rust in order not to regress
performance. (The last point is the main source of schedule
uncertainty.) Still, I think it would make sense for Thunderbird
developers to assess the impact on mailnews code sooner rather than
later, so that the change doesn't come as a surprise.

Particular things of note are:

* The set of converters will no longer be extensible via XPCOM.
Therefore, mailnews will no longer be able to register UTF-7 via XPCOM
and expect to be able to instantiate a UTF-7 decoder using
mozilla::dom::EncodingUtils::DecoderForEncoding(). Instead, UTF-7
handling will need to happen as a one-off special case outside the
converter framework.

* nsIUnicodeDecoder and nsIUnicodeEncoder will be replaced by
mozilla::Decoder and mozilla::Encoder, which have a similar but subtly
different API. The API changes address design flaws in the previous
API.

* When holding something that designates an encoding, it will be
preferable to hold const mozilla::Encoding* instead of holding an
nsACString containing the name of the encoding. (All instances of
mozilla::Encoding are static, so there's no need to refcount the
pointer.) If you need to designate a particular encoding at compile
time, there are constants of the form UTF_8_ENCODING for referring to
the encodings directly (as opposed to having to resolve a name or a
label at run time).

* When you have the entire input in nsAString or nsACString,
mozilla::Encoding provides non-streaming conversion methods that hide
the complexity of using mozilla::Decoder and mozilla::Encoder (and
also avoid having to malloc the converter by doing a stack allocation
in Rust instead).

* Since conversion to and from UTF-8 is supported, there is no need to
pivot through UTF-16 when converting from an arbitrary encoding to
UTF-8. It appears that mailnews currently has code that wants to
convert stuff to UTF-8 and is forced to pivot through UTF-16.
Replacing this code with direct conversions to UTF-8 should give
mailnews a nice performance boost.

* The ISO-2022-JP decoder (for consistency with other browsers)
doesn't support ISO-2022-JP-2. While it's unlikely that new email will
arrive as ISO-2022-JP-2, it may exist in archive mailboxes as having
been sent by Apple Mail before it switched to always sending UTF-8.

--
Henri Sivonen
hsiv...@hsivonen.fi
https://hsivonen.fi/

Henri Sivonen

Apr 28, 2017, 9:14:03 AM

to dev-apps-t...@lists.mozilla.org

On Apr 28, 2017 14:09, "Henri Sivonen" <hsiv...@hsivonen.fi> wrote:

> * The ISO-2022-JP decoder (for consistency with other browsers)
> doesn't support ISO-2022-JP-2. While it's unlikely that new email will
> arrive as ISO-2022-JP-2, it may exist in archive mailboxes as having
> been sent by Apple Mail before it switched to always sending UTF-8.

Previous intent to unship:
https://groups.google.com/d/msg/mozilla.dev.apps.thunderbird/pP-MvtsbesU/rfwVOFW2BQAJ

Joshua Cranmer 🐧

Apr 28, 2017, 10:13:02 AM

to

On 4/28/17 6:09 AM, Henri Sivonen wrote:
> The change isn't imminent, because the new code is still orange on
> try, is unreviewed and requires nightly Rust in order not to regress
> performance. (The last point is the main source of schedule
> uncertainty.) Still, I think it would make sense for Thunderbird
> developers to assess the impact on mailnews code sooner rather than
> later, so that the change doesn't come as a surprise.

Thanks for the heads-up.

> * The set of converters will no longer be extensible via XPCOM.
> Therefore, mailnews will no longer be able to register UTF-7 via XPCOM
> and expect to be able to instantiate a UTF-7 decoder using
> mozilla::dom::EncodingUtils::DecoderForEncoding(). Instead, UTF-7
> handling will need to happen as a one-off special case outside the
> converter framework.

This is probably the thing that's going to bite the hardest. We need to
use UTF-7 from both JS (where I have something that reuses the old
nsIDecoder framework to polyfill TextDecoder with more charsets) as well
as from C++ code.

> * Since conversion to and from UTF-8 is supported, there is no need to
> pivot through UTF-16 when converting from an arbitrary encoding to
> UTF-8. It appears that mailnews currently has code that wants to
> convert stuff to UTF-8 and is forced to pivot through UTF-16.
> Replacing this code with direct conversions to UTF-8 should give
> mailnews a nice performance boost.

Honestly, most of the time where we expect non-ASCII (e.g., folder
names), we use UTF-16 internally to store decoded stuff anyways.

--
Joshua Cranmer
Thunderbird and DXR developer
Source code archæologist

Henri Sivonen

Apr 28, 2017, 2:49:48 PM

to Joshua Cranmer 🐧, dev-apps-t...@lists.mozilla.org

On Apr 28, 2017 5:12 PM, "Joshua Cranmer 🐧" <Pidg...@gmail.com> wrote:

> On 4/28/17 6:09 AM, Henri Sivonen wrote:
>
>> * The set of converters will no longer be extensible via XPCOM.
>> Therefore, mailnews will no longer be able to register UTF-7 via XPCOM
>> and expect to be able to instantiate a UTF-7 decoder using
>> mozilla::dom::EncodingUtils::DecoderForEncoding(). Instead, UTF-7
>> handling will need to happen as a one-off special case outside the
>> converter framework.
>
> This is probably the thing that's going to bite the hardest. We need to use
> UTF-7 from both JS (where I have something that reuses the old nsIDecoder
> framework to polyfill TextDecoder with more charsets) as well as from C++
> code.

I suggest introducing a wrapper type mozilla::EmailEncoding that wraps a
pointer to mozilla::Encoding or nullptr. If nullptr, the type represents
UTF-7, otherwise it represents a Web encoding and delegates to the wrapped
encoding. Then I'd introduce a chrome-only TextDecoder-like (duck types to
the same API in JS) WebIDL interface to the UTF-7 decoder and a JS-based
factory that instantiates the UTF-7 decoder for "UTF-7" and TextDecoder
otherwise.
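
To make the shape of that suggestion concrete, here is a minimal sketch,
written in Python rather than C++/WebIDL/JS. Everything in it is
hypothetical: the class name is borrowed from the paragraph above, the
methods for_label, name and decode are invented for illustration, and
Python's codec registry merely stands in for the set of Web encodings.
(Python happens to ship a UTF-7 codec in its registry, so the special
case is purely illustrative here.)

```python
import codecs
from typing import Optional


class EmailEncoding:
    """Hypothetical analogue of the proposed wrapper type.

    Wraps either a regular codec or None. None stands for UTF-7, mirroring
    the "pointer to mozilla::Encoding or nullptr" idea.
    """

    def __init__(self, wrapped: Optional[codecs.CodecInfo]) -> None:
        self._wrapped = wrapped

    @classmethod
    def for_label(cls, label: str) -> "EmailEncoding":
        """Factory: special-case "UTF-7", delegate everything else."""
        normalized = label.strip().lower()
        if normalized == "utf-7":
            return cls(None)
        # Raises LookupError for labels the registry does not know.
        return cls(codecs.lookup(normalized))

    @property
    def name(self) -> str:
        return "utf-7" if self._wrapped is None else self._wrapped.name

    def decode(self, data: bytes) -> str:
        """Non-streaming decode; malformed input becomes U+FFFD."""
        if self._wrapped is None:
            # The one-off UTF-7 path, outside the regular "framework".
            return data.decode("utf-7", errors="replace")
        text, _consumed = self._wrapped.decode(data, "replace")
        return text


# Callers never need to know which path was taken.
smiley = EmailEncoding.for_label("UTF-7").decode(b"Hi Mom +Jjo-!")
finnish = EmailEncoding.for_label("utf-8").decode("sähkö".encode("utf-8"))
```

The point of the wrapper is that call sites hold one type and call one
decode method, while the UTF-7 branch stays isolated in a single place.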

>> * Since conversion to and from UTF-8 is supported, there is no need to
>> pivot through UTF-16 when converting from an arbitrary encoding to
>> UTF-8. It appears that mailnews currently has code that wants to
>> convert stuff to UTF-8 and is forced to pivot through UTF-16.
>> Replacing this code with direct conversions to UTF-8 should give
>> mailnews a nice performance boost.
>
> Honestly, most of the time where we expect non-ASCII (e.g., folder names),
> we use UTF-16 internally to store decoded stuff anyways.

https://searchfox.org/comm-central/source/mailnews/mime/src/mimemoz2.cpp#811
appears to be doing a conversion that pivots through UTF-16 and a
superficial look suggested the caller uses UTF-8 as the target.

ishikawa

Apr 30, 2017, 10:20:41 PM

to dev-apps-t...@lists.mozilla.org, Henri Sivonen

I have not fully digested the e-mail and the follow-up, but does the
removal in mozilla-central mean that Thunderbird is likely to become
unable to read old e-mail archives that contain ISO-2022-JP-2-encoded
messages? That would be tough.

If so, I need a solution that may be specific to the C-C source tree :-(

TIA

PS: I understand that ISO-2022-JP-2 is meant for e-mail encoding and not
for web pages.

Henri Sivonen

May 1, 2017, 4:37:06 AM

to ishikawa, dev-apps-t...@lists.mozilla.org

On May 1, 2017 5:20 AM, "ishikawa" <ishi...@yk.rim.or.jp> wrote:

> On 2017-04-28 22:13, Henri Sivonen wrote:
>
> I have not digested the e-mail and followup very well, but
> does the removal in mozilla-central mean
> that Thunderbird is likely not to be able to
> read old e-mail archives that have ISO-2022-JP-2 encoded e-mail messages?
> That is tough.
>
> If so, I need a solution that may be specific to C-C source tree :-(

How frequent are ISO-2022-JP-2 emails in your archive? What email client
has generated them? What pushes them to -2? JIS X 0212 Japanese characters
or non-Japanese (Chinese, Korean, Greek or Latin-1) characters?

Note that if you map the ISO-2022-JP-2 label to ISO-2022-JP, you will
still be able to read the JIS X 0208 parts.
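
As a rough illustration of that label mapping, here is a minimal Python
sketch. It uses Python's own iso2022_jp codec as a stand-in for the new
decoder, so how the unsupported escape sequences come out may well differ
from encoding_rs. The override table and the function name
decode_with_fallback are made up for this example.

```python
# Hypothetical override table: route the unsupported label to the
# ISO-2022-JP decoder, as suggested above.
LABEL_OVERRIDES = {"iso-2022-jp-2": "iso2022_jp"}


def decode_with_fallback(raw: bytes, label: str) -> str:
    """Decode raw bytes, remapping labels listed in LABEL_OVERRIDES."""
    key = label.strip().lower()
    codec = LABEL_OVERRIDES.get(key, key)
    # errors="replace" keeps going past escape sequences that the
    # ISO-2022-JP decoder does not recognize.
    return raw.decode(codec, errors="replace")


# A message labeled ISO-2022-JP-2 that only uses JIS X 0208 characters.
plain = "日本語のメール".encode("iso2022_jp")

# A message whose second part is Korean, which needs an escape sequence
# that exists only in ISO-2022-JP-2.
mixed = "日本語".encode("iso2022_jp") + "한국어".encode("iso2022_jp_2")

plain_text = decode_with_fallback(plain, "ISO-2022-JP-2")
mixed_text = decode_with_fallback(mixed, "ISO-2022-JP-2")
```

In this sketch plain_text comes back intact, while mixed_text keeps the
JIS X 0208 part and loses the Korean part.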

ishikawa

May 7, 2017, 10:59:36 PM

to Henri Sivonen, dev-apps-t...@lists.mozilla.org

Thank you for the tips.

I checked my archive on one PC. (My archive is now spread across three PCs;
the one I checked contains the work-related e-mails from the office for the
last dozen years.)

Luckily, there are not that many e-mails in ISO-2022-JP-2, but they come
from an important contact. Murphy's law struck.
I found that the mailer on Mac OS X seemed to generate ISO-2022-JP-2 e-mails
at some point. I am not sure what "pushes them to -2". I suspect some special
characters, but I could not easily find the culprit.

Also, it turns out that there are attachments "forwarded" to me by someone,
and those attachments are encoded in iso-2022-jp-2. In those attachment
cases, it is not clear which mailer originally created such encodings, but I
suspect the Mac mailer at some point (2013 - 2014?). I also found a few
e-mails from 2012 3Q.

This is my preliminary finding.

I also found that some university computer support desks in Japan have
published warnings about ISO-2022-JP-2. One of them specifically mentions
the Mac mailer, which is consistent with my finding that a contact who sends
me e-mails from the Mac mailer has sent some of them in ISO-2022-JP-2. So at
some point, some e-mail clients began sending out ISO-2022-JP-2-encoded
e-mails.

(In Japanese)

About the mail daemon at the University of Electro-Communications failing to
receive ISO-2022-JP-2-encoded mail.
https://www.cc.uec.ac.jp/blogs/news/2014/10/20141022iso2022jp2mail.html

About Mojibake at Sophia University.
http://ccweb.cc.sophia.ac.jp/joznu8ku8-26/

The number of such e-mails and attachments is small considering the large
archives I have, but of all my correspondents, the contact whose e-mails
include many iso-2022-jp-2-encoded ones is my boss :-(

Tough luck.

TIA

ishikawa

May 7, 2017, 11:17:09 PM

to Henri Sivonen, dev-apps-t...@lists.mozilla.org

Another blog I found mentioned that Mac Mail produced iso-2022-jp-2-encoded
e-mail when the client responded to a message encoded in iso-2022-jp or
iso-2022-jp-2 and the original message contained Windows-specific
character(s). [The author of the blog mentioned the Zenkaku tilde ＾ and the
wavy dash 〜. (I am not sure if the characters I typed on this line are the
characters the blog author had in mind.)]

http://www.gallery-ryna.net/?dat=20130107

Anyway, either Apple fixed the mailer, or Japanese Mac users installed a
plug-in to avoid such encoding when a message is sent, or, in the drastic
case, filtering is done to convert the encoding to ISO-2022-JP even if some
characters are not converted correctly (!), as in the UEC (University of
Electro-Communications) case mentioned in my previous post. As a result, the
number of iso-2022-jp-2-encoded e-mails was limited to about 100. Not bad
for a large archive.
But I am a little worried about my other archives at home...

However, I suspect that if UEC could get away with the lossy conversion,
then a well-crafted bullet-proof Perl script or something like that can
convert a mail archive that contains iso-2022-jp-2 encoded e-mails into one
with only iso-2022-jp encoding, assuming some characters may become
garbled (mojibake) in the process, without any serious damage to the content.
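
For what it is worth, the core of such a conversion might look like the
following Python sketch (Python rather than Perl, only because its
standard library ships both codecs). The function name is made up, and
the sketch only handles an already-extracted message body: MIME parsing,
base64 or quoted-printable transfer encodings, rewriting the charset
parameter, and encoded header words are all left out. Characters that
ISO-2022-JP cannot represent come out as "?" here rather than as mojibake.

```python
def downgrade_iso2022jp2(payload: bytes) -> bytes:
    """Lossily re-encode an ISO-2022-JP-2 body as plain ISO-2022-JP.

    Hypothetical helper for illustration only.
    """
    # Decode with the full ISO-2022-JP-2 codec so nothing is misread.
    text = payload.decode("iso2022_jp_2", errors="replace")
    # Re-encode with the smaller repertoire; unencodable characters
    # are replaced instead of raising UnicodeEncodeError.
    return text.encode("iso2022_jp", errors="replace")


# A body that mixes JIS X 0208 text with Korean, which only -2 can carry.
original = "日本語 한국어".encode("iso2022_jp_2")
converted = downgrade_iso2022jp2(original)

# The result decodes strictly as ISO-2022-JP; the Japanese text survives.
roundtrip = converted.decode("iso2022_jp")
```

A real archive converter would also have to keep a backup of every
message it touches, since the conversion cannot be undone.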

TIA

Henri Sivonen

May 8, 2017, 7:14:16 AM

to ishikawa, dev-apps-t...@lists.mozilla.org

On Mon, May 8, 2017 at 6:16 AM, ishikawa <ishi...@yk.rim.or.jp> wrote:
> On 2017-05-08 11:59, ishikawa wrote:

The first piece of good news is that you didn't discover other email
clients sending ISO-2022-JP-2 than the one that was known at the time
of the Intent to Unship: old versions of Apple Mail. (Newer versions
of Apple Mail send UTF-8. Apple Mail with the LetterFix plug-in sends
ISO-2022-JP, not ISO-2022-JP-2.)

The second piece of good news is that you didn't find evidence of JIS
X 0212 Kanji being the issue. If the main culprits are isolated
punctuation-ish symbols, it shouldn't be too terrible if they become
mojibake when reading old archives. Of course, for that to work, the
ISO-2022-JP-2 label would have to map to the ISO-2022-JP decoder. (It
is unsurprising that JIS X 0212 Kanji isn't the issue. After all,
Shift_JIS doesn't have the JIS X 0212 bits, either, so Japanese
communication in Shift_JIS being possible at all is indicative of how
rare the JIS X 0212 Kanji are.)

(Curiously, the Android screenshot doesn't appear to make sense as an
ASCII-ification of any of the escape sequences that ISO-2022-JP-2
added.)

I think the main lesson from the old Apple Mail is that email client
developers should make the software they write send UTF-8-encoded
email (like newer Apple Mail) instead of trying to send something
other than UTF-8 on the theory of all pre-UTF-8 encodings being
somehow better supported by the receivers than UTF-8.

ISHIKAWA, Chiaki

May 12, 2017, 7:59:04 PM

to Henri Sivonen, dev-apps-t...@lists.mozilla.org

On 2017/05/08 20:13, Henri Sivonen wrote:
> so Japanese
> communication in Shift_JIS being possible at all is indicative of how
> rare the JIS X 0212 Kanji are.)

Yes, modern communication is possible, to be exact.

I found that the JIS X 0212 Kanji include many characters that can be found
in books written more than 80-90 years ago.
So for literary work (archiving, research, etc.), those "rare (today)"
kanji do need to be supported, and that is why the JIS X 0212 Kanji were
standardized in Japan to begin with.
Language is an issue on which mere technologists can't have the last say
in any large society.

I was not aware that Apple Mail used a non-UTF encoding for a while.
There have been some cases where the file name encoding of an attachment
sent by the Mac mailer was garbled. This could also be due to the
encoding issue, but I have no details at hand.