Since the current ESR, there have been various changes to character
encoding converters that the Encoding Standard has rendered obsolete
for the purposes of the Web and, by extension, Firefox and
mozilla-central. (Note that the changes to HZ-GB-2312 and
GB2312 described below are on mozilla-inbound and have not merged to
m-c yet. For the rest of the email, I'll pretend they've been merged
already.)
These changes require action in comm-central. Bugs are on file. Please
see
https://bugzilla.mozilla.org/showdependencytree.cgi?id=1054354&hide_resolved=1
.
For Gecko purposes, both labels and encodings are, unfortunately,
ASCII strings. This lack of type-safety causes some confusion. For
example, the string "csisolatin2" is a label but the string
"ISO-8859-2" is both a label and an encoding. Since we don't have a
distinctive datatype for encodings, I'll call the latter kind of names
Gecko-canonical names. The difference is that you can concatenate a
Gecko-canonical name to NS_UNICODEENCODER_CONTRACTID_BASE or
NS_UNICODEDECODER_CONTRACTID_BASE to obtain a contract ID for
instantiating an encoder or a decoder, but you can't do that with
labels that are not also Gecko-canonical names. Instead, resolving the
label into a Gecko-canonical name has to be performed first. It is
possible for comm-central to support encodings, i.e. Gecko-canonical
names and associated contract IDs, that are not in mozilla-central.
For example, this is the case for UTF-7. mailnews has its own system
(nsCharsetAlias and charsetalias.properties) for mapping labels to
encodings. The Web mappings that m-c uses are in
labelsencodings.properties.
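The two-step lookup described above (resolve the label first, then
concatenate) can be sketched roughly as follows. This is a Python
illustration, not Gecko code: the LABELS table holds only the two
examples from this email, the function names are made up, and the
string used for DECODER_CONTRACTID_BASE is an assumed stand-in for
NS_UNICODEDECODER_CONTRACTID_BASE.

```python
# Assumed stand-in for NS_UNICODEDECODER_CONTRACTID_BASE; check the Gecko
# headers for the real value before relying on it.
DECODER_CONTRACTID_BASE = "@mozilla.org/intl/unicode/decoder;1?charset="

# Tiny illustrative table: label (lowercased) -> Gecko-canonical name.
# "ISO-8859-2" is both a label and a Gecko-canonical name;
# "csisolatin2" is only a label.
LABELS = {
    "csisolatin2": "ISO-8859-2",
    "iso-8859-2": "ISO-8859-2",
}


def resolve_label(label):
    """Return the Gecko-canonical name for a label, or None if unknown."""
    return LABELS.get(label.strip().lower())


def decoder_contract_id(label):
    """Resolve the label first, then build the decoder contract ID."""
    name = resolve_label(label)
    if name is None:
        raise LookupError("unknown encoding label: %r" % label)
    return DECODER_CONTRACTID_BASE + name
```

Concatenating "csisolatin2" directly to the base would yield a
contract ID that nothing is registered under, which is why the
resolution step has to come first.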
Therefore, when something goes away from mozilla-central, comm-central
needs to decide whether to just adjust the label mappings accordingly
or whether to import the removed encodings into comm-central and keep
supporting them there for mail and news. Note that it's possible to
import only a decoder and mark the encoding "notForOutgoing".
I've now reached the end of my patch queue for changing this stuff in
mozilla-central. Now would be a good time for comm-central to react
before the next ESR comes around, since *comm-central is right now in
a broken state* due to the m-c changes.
Before the current ESR, I added telemetry for some of the encodings
(DECODER_INSTANTIATED_*). However, I don't know how to actually see
the telemetry results from the current Thunderbird release. I
encourage Thunderbird developers to work with the Metrics team to find
a way to see the telemetry results ASAP.
The following encodings have been removed from mozilla-central completely:
T.61-8bit
x-johab
x-euc-tw
IBM850
IBM852
IBM855
IBM857
IBM862
IBM864
armscii-8
ISO-IR-111
VISCII
x-viet-tcvn5712
x-viet-vps
VISCII and armscii-8 are special in the sense that, for a long time,
Thunderbird itself (misguidedly) provided these encodings in the user
interface for the choice of outgoing character encoding when composing
a message. Therefore, it is possible that there exists a
Thunderbird-created legacy of VISCII and armscii-8 email and Usenet
posts. If telemetry shows that decoder instantiations for these two
encodings are not insignificant in Thunderbird, I suggest importing
only the decoders for these two encodings into comm-central and
marking them as notForOutgoing. Other than that, I recommend not
importing encodings on the above list to comm-central. Note that the
LDAP code in c-c has its own T.61 conversion code, so you don't need
to import T.61-8bit for LDAP to work.
The following encodings have been removed from mozilla-central, but
knowledge of the labels has been kept and the labels are mapped to
the replacement encoding in order to protect Web sites against XSS:
HZ-GB-2312
ISO-2022-CN
ISO-2022-KR
My recommendation is that Thunderbird developers evaluate telemetry
data to see if it's worthwhile to import the decoders for these
encodings into comm-central (and mark the encodings as
notForOutgoing). I gather that HZ-GB-2312 was originally created at
Stanford for the purpose of writing Chinese on Usenet, but to my
knowledge it hasn't actually been popular in China. ISO-2022-CN was
added to Gecko to be able to read email sent from the Sun CDE email
client. I'm not aware of non-XSS uses of ISO-2022-KR. Since these are
multi-byte encodings whose decoders have a history of security bugs,
it's probably a bad idea to import these unless telemetry shows a
compelling reason to.
The encoders for the following encodings have been removed (the
decoders remain in m-c in order to be able to decode the names of
legacy Mac fonts!):
x-mac-ce
x-mac-turkish
x-mac-greek
x-mac-icelandic
x-mac-croatian
x-mac-romanian
x-mac-hebrew
x-mac-arabic
x-mac-farsi
x-mac-devanagari
x-mac-gujarati
x-mac-gurmukhi
My recommendation is to mark these as notForOutgoing. It makes sense
to leave Thunderbird able to decode email in these encodings for as
long as m-c keeps the decoders around for fonts, because Thunderbird
has (misguidedly) made it possible for the user to manually configure
these encodings for outgoing email or for outgoing Usenet posts. At
least in the case of x-mac-croatian, there's known to be a (tiny)
self-inflicted Usenet legacy from this misguided UI! Additionally, I
recommend marking the two remaining Mac encodings, macintosh (i.e.
MacRoman) and x-mac-cyrillic as notForOutgoing.
The following encodings have been removed, because what were
previously Gecko-canonical names have become mere labels for other
encodings:
us-ascii
ISO-8859-6-I
ISO-8859-6-E
ISO-8859-8-E
ISO-8859-9
ISO-8859-11
TIS-620
GB2312
x-mac-ukrainian
Additionally, for the time being, ISO-8859-1 is in the code base as a
Gecko-canonical name, but it, too, is expected to go away.
us-ascii and ISO-8859-1 are now labels for windows-1252. ISO-8859-9 is
now a label for windows-1254. ISO-8859-11 and TIS-620 are now labels
of windows-874. GB2312 is now a label of gbk. (gbk itself has changed
so that there is no longer a distinct gbk decoder and the gbk
decoder's contract ID points to the gb18030 decoder, which is a
superset of the old gbk decoder. The gbk encoding is being kept around
to avoid submitting 4-byte sequences to sites that aren't prepared to
handle the non-gbk parts of gb18030.) ISO-8859-6-I and -E are now
labels of ISO-8859-6. ISO-8859-8-E is now a label of ISO-8859-8.
x-mac-ukrainian is now a label of x-mac-cyrillic.
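For reference, the mappings in the preceding paragraph can be written
down as a single table. The sketch below is Python rather than the
actual charsetalias.properties syntax, and the names
NEW_LABEL_TARGETS and canonicalize are made up for illustration.

```python
# Former Gecko-canonical names that are now labels, taken from the
# paragraph above. Keys are lowercased labels; values are the
# Gecko-canonical names they now resolve to.
NEW_LABEL_TARGETS = {
    "us-ascii": "windows-1252",
    # ISO-8859-1 is still a Gecko-canonical name in the code base for the
    # time being; it is listed here because it is expected to go away.
    "iso-8859-1": "windows-1252",
    "iso-8859-9": "windows-1254",
    "iso-8859-11": "windows-874",
    "tis-620": "windows-874",
    "gb2312": "gbk",
    "iso-8859-6-i": "ISO-8859-6",
    "iso-8859-6-e": "ISO-8859-6",
    "iso-8859-8-e": "ISO-8859-8",
    "x-mac-ukrainian": "x-mac-cyrillic",
}


def canonicalize(name):
    """Map a demoted name to its new target; pass other names through."""
    return NEW_LABEL_TARGETS.get(name.strip().lower(), name)
```

None of the targets is itself a key in the table, so a single lookup
is enough; there is no need to follow chains of aliases.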
Currently, Thunderbird has special handling for ISO-8859-1: The
Gecko-canonical name that travels in the app internals is ISO-8859-1,
but when it comes time to encode something, the windows-1252 encoder
is instantiated. The result is labeled as ISO-8859-1 in the outgoing
email. The same is not done for TIS-620 and ISO-8859-9. (ISO-8859-11
is not IANA-registered; TIS-620 is the IANA-preferred name.)
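The ISO-8859-1 behavior described above amounts to "encode as
windows-1252, label as ISO-8859-1". A rough Python sketch of that
idea follows; it is not the Thunderbird implementation, the function
name is made up, and "cp1252" is Python's name for its windows-1252
codec.

```python
def encode_latin1_outgoing(text):
    """Encode with the windows-1252 encoder but report ISO-8859-1 as the label.

    Returns (encoded_bytes, charset_label_for_the_wire).
    """
    body = text.encode("cp1252")  # windows-1252 encoder does the actual work
    return body, "ISO-8859-1"     # label used in the outgoing message
```

For text limited to U+00A0..U+00FF plus ASCII the two encodings
agree, so the label is accurate. For a character such as the euro
sign, the output contains byte 0x80, which is a C1 control in
ISO-8859-1 proper, so the label is then only correct for receivers
that treat ISO-8859-1 as windows-1252.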
You could choose to simply make the same alias mappings as in m-c. Or
you can do something more complicated to still use the old labels on
the wire. I think you shouldn't try to use the old labels on the wire
unless you know that doing so is required for compatibility. It looks
like simply adjusting the alias mapping is the approach being pursued
by mkmelin, which is nice.
Finally, as always, the issue of how to label outgoing windows-1252,
windows-1254 or windows-874 would be moot if you started just using
UTF-8 for outgoing email. To the extent that's not yet feasible for
Japan, I think the best solution to the problem would be:
1) Remove all current UI for controlling outgoing encoding.
2) Add a boolean pref, defaulting to on, for "Use ISO-2022-JP for
Japanese email"
3) When sending email, implement the following logic: IF the above
pref is set AND the email contains a character that's between U+3040
and U+30FF (inclusive; that's Hiragana and Katakana) AND all the
characters of the email are encodable as ISO-2022-JP THEN encode as
ISO-2022-JP ELSE encode as UTF-8.
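The logic in step 3 can be sketched as follows. This is a Python
illustration under stated assumptions, not proposed Thunderbird code:
the function and parameter names are made up, and Python's
"iso2022_jp" codec stands in for the Gecko ISO-2022-JP encoder, whose
encodable repertoire may differ in detail.

```python
def choose_outgoing_charset(text, use_iso2022jp_for_japanese=True):
    """Pick the charset for an outgoing message per the three-part test.

    ISO-2022-JP is chosen only if the pref is on, the text contains at
    least one Hiragana/Katakana character (U+3040..U+30FF inclusive),
    and the whole text is encodable as ISO-2022-JP. Otherwise UTF-8.
    """
    if not use_iso2022jp_for_japanese:
        return "UTF-8"

    has_kana = any("\u3040" <= ch <= "\u30ff" for ch in text)
    if not has_kana:
        return "UTF-8"

    try:
        # Trial encode; the result is discarded, only success matters.
        text.encode("iso2022_jp")
    except UnicodeEncodeError:
        return "UTF-8"
    return "ISO-2022-JP"
```

The kana test is what keeps kanji-only text (which could equally be
Chinese) and plain ASCII on UTF-8, while the trial encode catches
Japanese messages that also contain characters outside the
ISO-2022-JP repertoire.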
P.S. So are the changes in this area in m-c now "done"? No, ISO-8859-1
remains to be removed and big5 remains to be rewritten after which
big5-hkscs can become a label of big5. However, c-c shouldn't wait for
these. It makes sense to prepare for the ISO-8859-1 removal before it
happens. The big5 rewrite probably won't happen before the next ESR,
whereas everything described above needs to be addressed before the
next ESR.
--
Henri Sivonen
hsiv...@hsivonen.fi
https://hsivonen.fi/