Character encoding changes in m-c require c-c action


Henri Sivonen

Sep 26, 2014, 8:19:26 AM
to dev-apps-t...@lists.mozilla.org, Anne van Kesteren
After the current ESR, there have been various changes to the
character encoding converters that have been rendered obsolete by the
Encoding Standard for the purposes of the Web and, by extension,
Firefox and mozilla-central. (Note that changes to HZ-GB-2312 and
GB2312 described below are on mozilla-inbound and have not merged to
m-c yet. For the rest of the email, I'll pretend they've been merged
already.)

These changes require action in comm-central. Bugs are on file. Please
see https://bugzilla.mozilla.org/showdependencytree.cgi?id=1054354&hide_resolved=1
.

For Gecko purposes, both labels and encodings are, unfortunately,
ASCII strings. This lack of type-safety causes some confusion. For
example, the string "csisolatin2" is a label but the string
"ISO-8859-2" is both a label and an encoding. Since we don't have a
distinctive datatype for encodings, I'll call the latter kind of names
Gecko-canonical names. The difference is that you can concatenate a
Gecko-canonical name to NS_UNICODEENCODER_CONTRACTID_BASE or
NS_UNICODEDECODER_CONTRACTID_BASE to obtain a contract ID for
instantiating an encoder or a decoder, but you can't do that with
labels that are not also Gecko-canonical names. Instead, resolving the
label into a Gecko-canonical name has to be performed first. It is
possible for comm-central to support encodings, i.e. Gecko-canonical
names and associated contract IDs, that are not in mozilla-central.
For example, this is the case for UTF-7. mailnews has its own system
(nsCharsetAlias and charsetalias.properties) for mapping labels to
encodings. The Web mappings that m-c uses are in
labelsencodings.properties.
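
To make the label/canonical-name distinction concrete, here is a sketch in Python (purely illustrative; the real code is C++ using XPCOM). The contract-ID base string and the two-entry label table are assumptions for illustration, not the actual Gecko data:

```python
# Illustrative sketch: only Gecko-canonical names may be appended to the
# contract-ID base; a mere label must be resolved to a canonical name first.

# Assumed value of NS_UNICODEDECODER_CONTRACTID_BASE (check the Gecko headers).
DECODER_CONTRACTID_BASE = "@mozilla.org/intl/unicode/decoder;1?charset="

# Tiny hypothetical excerpt of the label -> Gecko-canonical-name table
# (the real Web mappings live in labelsencodings.properties).
LABEL_TO_CANONICAL = {
    "csisolatin2": "ISO-8859-2",   # a label only
    "iso-8859-2": "ISO-8859-2",    # both a label and an encoding
}

def decoder_contract_id(name_or_label):
    # Resolve the label first; concatenating a bare label like
    # "csisolatin2" would yield a contract ID that nothing implements.
    canonical = LABEL_TO_CANONICAL[name_or_label.lower()]
    return DECODER_CONTRACTID_BASE + canonical
```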

Therefore, when something goes away from mozilla-central, comm-central
needs to decide whether to just adjust the label mappings accordingly
or whether to import the removed encodings into comm-central and keep
supporting them there for mail and news. Note that it's possible to
import only a decoder and mark an encoding as "notForOutgoing".

I've now reached the end of my patch queue for changing this stuff in
mozilla-central. Now would be a good time for comm-central to react
before the next ESR comes around, since *comm-central is right now in
a broken state* due to the m-c changes.

Before the current ESR, I added telemetry for some of the encodings
(DECODER_INSTANTIATED_*). However, I don't know how to actually see
the telemetry results from the current Thunderbird release. I
encourage Thunderbird developers to work with the Metrics team to find
a way to see the telemetry results ASAP.

The following encodings have been removed from mozilla-central completely:
T.61-8bit
x-johab
x-euc-tw
IBM850
IBM852
IBM855
IBM857
IBM862
IBM864
armscii-8
ISO-IR-111
VISCII
x-viet-tcvn5712
x-viet-vps

VISCII and armscii-8 are special in the sense that, for a long time,
Thunderbird itself (misguidedly) provided these encodings in the user
interface for the choice of outgoing character encoding when composing
a message. Therefore, it is possible that there exists a
Thunderbird-created legacy of VISCII and armscii-8 email and Usenet
posts. If telemetry shows that decoder instantiations for these two
encodings are not insignificant in Thunderbird, I suggest importing
only the decoders for these two encodings into comm-central and
marking them as notForOutgoing. Other than that, I recommend not
importing encodings on the above list into comm-central. Note that the
LDAP code in c-c has its own T.61 conversion code, so you don't need
to import T.61-8bit for LDAP to work.

The following encodings have been removed from mozilla-central, but
knowledge of the labels has been kept and the labels are mapped to
the replacement encoding in order to protect Web sites against XSS:
HZ-GB-2312
ISO-2022-CN
ISO-2022-KR

My recommendation is that Thunderbird developers evaluate telemetry
data to see if it's worthwhile to import the decoders for these
encodings into comm-central (and mark the encodings as
notForOutgoing). I gather that HZ-GB-2312 was originally created at
Stanford for the purpose of writing Chinese on Usenet, but to my
knowledge it hasn't actually been popular in China. ISO-2022-CN was
added to Gecko to be able to read email sent from the Sun CDE email
client. I'm not aware of non-XSS uses of ISO-2022-KR. Since these are
multi-byte encodings whose decoders have a history of security bugs,
it's probably a bad idea to import these unless telemetry shows a
compelling reason to.

The encoders for the following encodings have been removed (the
decoders remain in m-c in order to be able to decode the names of
legacy Mac fonts!):
x-mac-ce
x-mac-turkish
x-mac-greek
x-mac-icelandic
x-mac-croatian
x-mac-romanian
x-mac-hebrew
x-mac-arabic
x-mac-farsi
x-mac-devanagari
x-mac-gujarati
x-mac-gurmukhi

My recommendation is to mark these as notForOutgoing. It makes sense
to leave Thunderbird able to decode email in these encodings for as
long as m-c keeps the decoders around for fonts, because Thunderbird
has (misguidedly) made it possible for the user to manually configure
these encodings for outgoing email or for outgoing Usenet posts. At
least in the case of x-mac-croatian, there's known to be a (tiny)
self-inflicted Usenet legacy from this misguided UI! Additionally, I
recommend marking the two remaining Mac encodings, macintosh (i.e.
MacRoman) and x-mac-cyrillic as notForOutgoing.

The following encodings have been removed, because what were
previously Gecko-canonical names have become mere labels for other
encodings:
us-ascii
ISO-8859-6-I
ISO-8859-6-E
ISO-8859-8-E
ISO-8859-9
ISO-8859-11
TIS-620
GB2312
x-mac-ukrainian

Additionally, for the time being, ISO-8859-1 is in the code base as a
Gecko-canonical name, but it, too, is expected to go away.

us-ascii and ISO-8859-1 are now labels for windows-1252. ISO-8859-9 is
now a label for windows-1254. ISO-8859-11 and TIS-620 are now labels
of windows-874. GB2312 is now a label of gbk. (gbk itself has changed
so that there is no longer a distinct gbk decoder and the gbk
decoder's contract ID points to the gb18030 decoder, which is a
superset of the old gbk decoder. The gbk encoding is being kept around
to avoid submitting 4-byte sequences to sites that aren't prepared to
handle the non-gbk parts of gb18030.) ISO-8859-6-I and -E are now
labels of ISO-8859-6. ISO-8859-8-E is now a label of ISO-8859-8.
x-mac-ukrainian is now a label of x-mac-cyrillic.
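
The remappings in the last two paragraphs can be summarized as a table. Here is that table as a small Python sketch (an illustration of this email's list only, not the actual labelsencodings.properties data):

```python
# Former Gecko-canonical names that are now mere labels, and the
# encodings they now resolve to (per the paragraphs above).
NOW_JUST_LABELS = {
    "us-ascii": "windows-1252",
    "iso-8859-1": "windows-1252",   # expected to follow eventually
    "iso-8859-9": "windows-1254",
    "iso-8859-11": "windows-874",
    "tis-620": "windows-874",
    "gb2312": "gbk",
    "iso-8859-6-i": "iso-8859-6",
    "iso-8859-6-e": "iso-8859-6",
    "iso-8859-8-e": "iso-8859-8",
    "x-mac-ukrainian": "x-mac-cyrillic",
}

def resolve(label):
    # Case-insensitive lookup; labels not in this excerpt pass through.
    return NOW_JUST_LABELS.get(label.strip().lower(), label)
```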

Currently, Thunderbird has special handling for ISO-8859-1: The
Gecko-canonical name that travels in the app internals is ISO-8859-1,
but when it comes time to encode something, the windows-1252 encoder
is instantiated. The result is labeled as ISO-8859-1 in the outgoing
email. The same is not done for TIS-620 and ISO-8859-9. (ISO-8859-11
is not IANA-registered; TIS-620 is the IANA-preferred name.)
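
A minimal sketch of that special case (hypothetical function, not actual Thunderbird code; Python's windows-1252 codec stands in for Gecko's windows-1252 encoder):

```python
def encode_outgoing(text, internal_name):
    """Encode text per Thunderbird's ISO-8859-1 special case: the
    windows-1252 encoder does the work, but the outgoing message is
    still labeled ISO-8859-1 on the wire."""
    encoder = internal_name
    if internal_name.lower() == "iso-8859-1":
        encoder = "windows-1252"  # superset of ISO-8859-1
    # Return the encoded bytes plus the label to put on the wire.
    return text.encode(encoder), internal_name
```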

You could choose to simply make the same alias mappings as in m-c. Or
you can do something more complicated to still use the old labels on
the wire. I think you shouldn't try to use the old labels on the wire
unless you have knowledge that is required for compatibility. It looks
like simply adjusting the alias mapping is the approach being pursued
by mkmelin, which is nice.

Finally, as always, the issue of how to label outgoing windows-1252,
windows-1254 or windows-874 would be moot if you started just using
UTF-8 for outgoing email. To the extent that's not feasible for Japan,
yet, I think the best solution to the problem would be:
1) Remove all current UI for controlling outgoing encoding.
2) Add a boolean pref, defaulting to on, for "Use ISO-2022-JP for
Japanese email"
3) When sending email, implement the following logic: IF the above
pref is set AND the email contains a character that's between U+3040
and U+30FF (inclusive; that's Hiragana and Katakana) AND all the
characters of the email are encodable as ISO-2022-JP THEN encode as
ISO-2022-JP ELSE encode as UTF-8.
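
That proposed logic is simple enough to sketch in Python (the pref name is hypothetical, and Python's iso2022_jp codec stands in for the real ISO-2022-JP encodability check):

```python
def contains_kana(text):
    # Hiragana and Katakana occupy U+3040..U+30FF inclusive.
    return any(0x3040 <= ord(ch) <= 0x30FF for ch in text)

def choose_outgoing_encoding(text, use_iso_2022_jp=True):
    """Step 3 of the proposal: IF the pref is set AND the email contains
    kana AND everything is encodable as ISO-2022-JP, THEN ISO-2022-JP,
    ELSE UTF-8."""
    if use_iso_2022_jp and contains_kana(text):
        try:
            text.encode("iso2022_jp")
            return "ISO-2022-JP"
        except UnicodeEncodeError:
            pass
    return "UTF-8"
```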

P.S. So are the changes in this area in m-c now "done"? No, ISO-8859-1
remains to be removed and big5 remains to be rewritten after which
big5-hkscs can become a label of big5. However, c-c shouldn't wait for
these. It makes sense to prepare for the ISO-8859-1 removal before it
happens and the big5 rewrite probably won't happen before the next
ESR, and all the stuff indicated above needs to be addressed before
the next ESR.

--
Henri Sivonen
hsiv...@hsivonen.fi
https://hsivonen.fi/

Joshua Cranmer 🐧

Sep 26, 2014, 10:48:53 AM
On 9/26/2014 7:19 AM, Henri Sivonen wrote:
> I've now reached the end of my patch queue for changing this stuff in
> mozilla-central. Now would be a good time for comm-central to react
> before the next ESR comes around, since *comm-central is right now in
> a broken state* due to the m-c changes.

The only m-c change that's really breaking us is bug 1071497. :-P


> Before the current ESR, I added telemetry for some of the encodings
> (DECODER_INSTANTIATED_*). However, I don't know how to actually see
> the telemetry results from the current Thunderbird release. I
> encourage Thunderbird developers to work with the Metrics team to find
> a way to see the telemetry results ASAP.

I don't know how to make sense of Telemetry even for Firefox data...
> My recommendation is to mark these as notForOutgoing. It makes sense
> to leave Thunderbird able to decode email in these encodings for as
> long as m-c keeps the decoders around for fonts, because Thunderbird
> has (misguidedly) made it possible for the user to manually configure
> these encodings for outgoing email or for outgoing Usenet posts. At
> least in the case of x-mac-croatian, there's known to be a (tiny)
> self-inflicted Usenet legacy from this misguided UI! Additionally, I
> recommend marking the two remaining Mac encodings, macintosh (i.e.
> MacRoman) and x-mac-cyrillic as notForOutgoing.

The only non-Encoding-Standard-encoding whose removal I suspect (in lieu
of good information) would cause c-c some pain is x-mac-croatian.
Mostly, it's because I use that one in tests for deprecated encoding
conversion support. Somewhere in 32 or 33 (I don't remember when,
exactly), I wrote some code to forcibly reset these settings back to
UTF-8, which means removing it before the next ESR would really be
problematic.

The compose window logic (with tests!) forcibly restricts encoding to
a hard-coded list of possible charsets. Our backend code
doesn't have these same checks, but our upgrade-to-UTF-8 code is
hopefully capable of falling back to UTF-8 if the charset encode doesn't
exist. In any case, removing encodings (except for ones in that
hard-coded list) shouldn't break anything unless you're using add-ons.

> You could choose to simply make the same alias mappings as in m-c. Or
> you can do something more complicated to still use the old labels on
> the wire. I think you shouldn't try to use the old labels on the wire
> unless you have knowledge that is required for compatibility. It looks
> like simply adjusting the alias mapping is the approach being pursued
> by mkmelin, which is nice.

I'm trying to use the Encoding Standard label mappings as much as
possible. The primary reason mailnews/intl exists is because we need
that logic to be able to hook up the UTF-7 decoder/encoder--I'm trying
to make it go away in my slow-going work on deleting libmime.

> Finally, as always, the issue of how to label outgoing windows-1252,
> windows-1254 or windows-874 would be moot if you started just using
> UTF-8 for outgoing email. To the extent that's not feasible for Japan,
> yet, I think the best solution to the problem would be:

Oh, the problem of Japanese phones not supporting UTF-8 is no longer an
issue, I think. But the JP locale still heavily resists Unicode, to the
point where they would still rather garble their emails than upgrade
ISO-2022-JP to UTF-8 (yes, I've complained about this before). So as
long as we have to support that messed-up hack, the marginal cost of
supporting other encoders we currently have is rather nil.

That said, we can probably make a bit more effort to convince the other
localizations that switching their defaults to UTF-8 is worthwhile.
Czech, Argentina, Basque, Frisian, Hungarian, Italian, Dutch, and
Swedish almost certainly can default to UTF-8; Korean and Chinese may or
may not be able to as easily.

--
Joshua Cranmer
Thunderbird and DXR developer
Source code archæologist

ISHIKAWA, Chiaki

Sep 27, 2014, 2:17:19 PM
The Japanese situation is hopeless. I often receive e-mails written in
Japanese, with attachments [whose file names are in Japanese] created
on Apple Macs, in TB running under Linux. When I try to forward such an
e-mail, most of the time it works, but sometimes I get a completely
garbled main body.
I have absolutely no idea why or where the problem occurs.
After many such incidents, I now suspect that the encoding of the
filename and the encoding of the main mail body may not match when
such garbled forwarding occurs.
(I have no idea whether MY own e-mails with attachments whose file
names are in Japanese cause some people to end up with garbled
forwarded e-mails.)

Oh well. This observation comes from someone who has several different
Linux installations that share remote file systems: some file archives
were created in the EUC days [ISO-2022-JP, basically], and some even
have incorrectly encoded file names from misguided Mac and Windows PCs.
Some of the Linux installations I ran used EUC as their main locale,
and recently I finally bit the bullet and began installing Linux with
UTF-8 as the main locale. So I see Japanese file system listings
sometimes rendered neatly (matching locales) and sometimes garbled
(mismatched locales between a particular file system and the local
Linux installation), and I use "ls -l | nkf -w" or "ls -l | nkf -e" to
combat the situation. nkf is an intelligent code converter, and most of
the time it successfully outputs the UTF-8 (-w) or EUC (-e) equivalent
of its input.

Legacy issues can't be solved by technical arguments alone :-(
We need a soft-landing strategy for people with legacy issues.



> That said, we can probably make a bit more effort to convince the other
> localizations that switching their defaults to UTF-8 is worthwhile.
> Czech, Argentina, Basque, Frisian, Hungarian, Italian, Dutch, and
> Swedish almost certainly can default to UTF-8; Korean and Chinese may or
> may not be able to as easily.
>

In another 20-30 years, most people with an active need for
EUC/ISO-2022-JP will have retired ;-)
Remember EBCDIC?



ISHIKAWA, Chiaki

Sep 27, 2014, 10:28:57 PM
On 2014/09/26 21:19, Henri Sivonen wrote:
[...
> Finally, as always, the issue of how to label outgoing windows-1252,
> windows-1254 or windows-874 would be moot if you started just using
> UTF-8 for outgoing email. To the extent that's not feasible for Japan,
> yet, I think the best solution to the problem would be:
> 1) Remove all current UI for controlling outgoing encoding.
> 2) Add a boolean pref, defaulting to on, for "Use ISO-2022-JP for
> Japanese email"
> 3) When sending email, implement the following logic: IF the above
> pref is set AND the email contains a character that's between U+3040
> and U+30FF (inclusive; that's Hiragana and Katakana) AND all the
> characters of the email are encodable as ISO-2022-JP THEN encode as
> ISO-2022-JP ELSE encode as UTF-8.

I cannot offer an authoritative opinion on this since I am just a user
of TB in Japan, albeit one with about 10 years' worth of e-mails in TB
(and, prior to it, the Netscape suite). Older e-mails were in Emacs's
RMAIL archive format. The archive's characteristics may be quite skewed
compared to the average Japanese user's.

I am sure you will hear from people who have more experience as
*developers* of mail clients in Japan and who need to deal with many
Japanese users. They can offer more balanced opinions (though there
will always be vocal opinions from non-average users).

That said, I appreciate that you have considered a way out of the
complex (or hopeless) Japanese encoding situation.

(I meant to include something similar to the above in my reply to
Joshua Cranmer's initial followup. Reading my post today, I realized I
must have omitted it by mistake during copy&paste. So here it is.)

Henri Sivonen

Sep 29, 2014, 6:44:39 AM
to Joshua Cranmer 🐧, VYV0...@nifty.ne.jp, Anne van Kesteren, dev-apps-t...@lists.mozilla.org, Zephyrus C
FWIW, the GB2312 and HZ-GB-2312 changes got backed out, because I
accidentally broke Mac font loading, because I had only used linux64
try runs to save resources. I expect to re-push in the near future,
though.

On Fri, Sep 26, 2014 at 5:48 PM, Joshua Cranmer 🐧 <Pidg...@gmail.com> wrote:
> On 9/26/2014 7:19 AM, Henri Sivonen wrote:
>>
>> I've now reached the end of my patch queue for changing this stuff in
>> mozilla-central. Now would be a good time for comm-central to react
>> before the next ESR comes around, since *comm-central is right now in
>> a broken state* due to the m-c changes.
>
> The only m-c change that's really breaking us is bug 1071497. :-P

OK. So encoding-unrelated.

>> Before the current ESR, I added telemetry for some of the encodings
>> (DECODER_INSTANTIATED_*). However, I don't know how to actually see
>> the telemetry results from the current Thunderbird release. I
>> encourage Thunderbird developers to work with the Metrics team to find
>> a way to see the telemetry results ASAP.
>
> I don't know how to make sense of Telemetry even for Firefox data...

Do you mean you don't know what constitutes a reasonable threshold of
little enough use for a feature to be removed?

If TB devs don't actually look at telemetry data, I guess I shouldn't
bother to put in probes for c-c's benefit in the future. :-(

>> My recommendation is to mark these as notForOutgoing. It makes sense
>> to leave Thunderbird able to decode email in these encodings for as
>> long as m-c keeps the decoders around for fonts, because Thunderbird
>> has (misguidedly) made it possible for the user to manually configure
>> these encodings for outgoing email or for outgoing Usenet posts. At
>> least in the case of x-mac-croatian, there's known to be a (tiny)
>> self-inflicted Usenet legacy from this misguided UI! Additionally, I
>> recommend marking the two remaining Mac encodings, macintosh (i.e.
>> MacRoman) and x-mac-cyrillic as notForOutgoing.
>
>
> The only non-Encoding-Standard-encoding whose removal I suspect (in lieu of
> good information) would cause c-c some pain is x-mac-croatian.

Only the encoder for x-mac-croatian got removed. It would probably
make sense to go ahead and remove the decoder, too, though, because
gfxFontUtils doesn't happen to use that particular Mac encoding.

> Mostly, it's
> because I use that one in tests for deprecated encoding conversion support.
> Somewhere in 32 or 33 (I don't remember when, exactly), I wrote some code to
> forcibly reset these settings back to UTF-8, which means removing it before
> the next ESR would really be problematic.

If just removing the x-mac-croatian *encoder* makes it unsuited for
the test, could the test be changed to use x-mac-cyrillic or one of
the ISO-8859-* encodings that are in the Encoding Standard but not in
the menu: ISO-8859-3, ISO-8859-10, ISO-8859-13, ISO-8859-14,
ISO-8859-15 or ISO-8859-16?

> The compose window logic (with tests!) forcibly makes it possible to only
> encode to a hard-coded list of possible charsets.

This list is the list of menu items, right?

>> You could choose to simply make the same alias mappings as in m-c. Or
>> you can do something more complicated to still use the old labels on
>> the wire. I think you shouldn't try to use the old labels on the wire
>> unless you have knowledge that is required for compatibility. It looks
>> like simply adjusting the alias mapping is the approach being pursued
>> by mkmelin, which is nice.
>
>
> I'm trying to use the Encoding Standard label mappings as much as possible.
> The primary reason mailnews/intl exists is because we need that logic to be
> able to hook up the UTF-7 decoder/encoder--I'm trying to make it go away in
> my slow-going work on deleting libmime.

I filed https://bugzilla.mozilla.org/show_bug.cgi?id=1074125 .

>> Finally, as always, the issue of how to label outgoing windows-1252,
>> windows-1254 or windows-874 would be moot if you started just using
>> UTF-8 for outgoing email. To the extent that's not feasible for Japan,
>> yet, I think the best solution to the problem would be:
>
>
> Oh, the problem of Japanese phones not supporting UTF-8 is no longer an
> issue, I think.

I don't have enough data to judge whether Japanese phones not
supporting UTF-8 is still an issue. However, emk (CCed) pointed out
the LetterFix (http://sourceforge.jp/projects/letter-fix/) hack in
https://bugzilla.mozilla.org/show_bug.cgi?id=1003716#c7 . Apparently,
Apple Mail going UTF-8-only caused someone to consider it worthwhile
to hack it to send email encoded in ISO-2022-JP. :-(

> But the JP locale still heavily resists Unicode, to the
> point where they would still rather garble their emails than upgrade
> ISO-2022-JP to UTF-8 (yes, I've complained about this before). So as long as
> we have to support that messed-up hack, the marginal cost of supporting
> other encoders we currently have is rather nil.

Surely, having the UI for choosing the outgoing encoding in the
message compose window and in the preferences has some cost. If the
Japanese locale blocks, or is perceived to block, Thunderbird from
going UTF-8-only for outgoing email, I think having only two outgoing
encodings: UTF-8 and ISO-2022-JP and a heuristic for choosing the
latter in some cases so that there is less need for UI would be a win
compared to having the current UI around.

> That said, we can probably make a bit more effort to convince the other
> localizations that switching their defaults to UTF-8 is worthwhile. Czech,
> Argentina, Basque, Frisian, Hungarian, Italian, Dutch, and Swedish almost
> certainly can default to UTF-8;

Why bother doing convincing when you could make the default moot by
reducing configurability? My experience from the browser side is that
it doesn't make sense to try to convince every localizer. To get
results, it makes sense to move the encoding decisions out of the
localizations into the core engine.

> Korean and Chinese may or may not be able to as easily.

Why not?

ISHIKAWA, Chiaki wrote:
> I often receive e-mails written in Japanese with attachment [with the file
> name in Japanese] created on Apple Macs to TB running under linux.
> When I try to forward such an e-mail, most of the times, it works, but
> sometimes, I get a completely garbled main body.

Is there a report in Bugzilla for this?

> I can not offer authoritative opinion on this since I am just a user of TB
> in Japan albeit I have about 10 years worth of e-mails
> using TB (and prior to it, netscape suite.)

Unfortunately, an email archive doesn't show if the other email
clients, despite maybe sending ISO-2022-JP, could actually receive
UTF-8. :-(

ISHIKAWA, Chiaki

Sep 29, 2014, 8:50:42 AM
On 2014/09/29 19:44, Henri Sivonen wrote:
> ISHIKAWA, Chiaki wrote:
>> >I often receive e-mails written in Japanese with attachment [with the file
>> >name in Japanese] created on Apple Macs to TB running under linux.
>> >When I try to forward such an e-mail, most of the times, it works, but
>> >sometimes, I get a completely garbled main body.
> Is there a report in Bugzilla for this?

I thought I reported this around the time I reported some related
issues, like the incorrectly encoded/decoded filename of an attachment.

But from a quick search I could only find some very indirectly related
entries, and not necessarily my original posts: I chimed in mid-stream.

In the following, I am talking about my posts concerning garbled text
attachments.
Bug 241821 - Mozilla gives dubious mime-type "text/plain" when I
attach a file to outgoing e-mail

Bug 238152 - when attaching text/* type file, 'charset' parameter is
not added

Bug 244829 - utf-16 text file attached incorrectly decoded for in-line
display


I thought I had reported the incorrect filename encoding/decoding, and
mentioned a garbled-main-text issue in passing in one bug report, but I
could not find it easily. (Could it be that it was against the Netscape
suite mailer?!)

IMHO, it may be better to file a new bug. The above Bugzilla entries
date back to 2004, and I experienced the garbled text just last week
when I tried to forward an e-mail with an attachment.

The version information in the old Bugzilla entries would no longer be
useful even if I could locate the right one.



>> >I can not offer authoritative opinion on this since I am just a user of TB
>> >in Japan albeit I have about 10 years worth of e-mails
>> >using TB (and prior to it, netscape suite.)
> Unfortunately, an email archive doesn't show if the other email
> clients, despite maybe sending ISO-2022-JP, could actually receive
> UTF-8.:-(

Right on :-(


Regarding telemetry:
>
> Do you mean you don't know what constitutes a reasonable threshold of
> little enough use for a feature to be removed?
>
> If TB devs don't actually look at telemetry data, I guess I shouldn't
> bother to put in probes for c-c's benefit in the future.

Not so fast, Henri.

Some of the programmer types who have posted patches for TB in the
last 24 months or so are USERS (like me) who were FORCED to find the
buggy places and post fixes, because a few serious bugs (or several
dozen, depending on one's point of view) have surfaced that affect
their workflows in TB.

If there had not been bugs, or if the reported bugs had been fixed
quickly enough, I would not have touched the Mozilla code base at all :-)

When I began tinkering with the TB code base, I was surprised at its
size. TB is a huge code base, obviously larger than Firefox: it can now
open its own web browser tabs (e.g. after a Google search), so
basically, on top of the web features, it has the mailer (POP3, SMTP,
IMAP, etc.). I doubt the merit of such additional features as opposed
to the core mailer functionality, but that is the direction some
"developers" have taken. I have no idea whether they wanted to code, or
were, like me, forced to code because of the bugs.

Anyway, for someone like me, who is basically a user and was forced to
look into the bugs, the learning is very slow.

To be honest, I won't be able to figure out telemetry in my spare time
alone.

That, I think, is how TB is barely being maintained: by the community
AND core Mozilla people acting as patch shepherds and, occasionally,
active debuggers.

I don't know how many people (what percentage, say) of those who have
contributed fixes to TB understand "telemetry" beyond the superficial
"Oh, it collects user information, etc.". How does it collect
information, in what format, what are the methods for collecting a
particular piece of information, is it scriptable, and so on?

Unfortunately, inquiring minds often don't have the time, IMHO.

You said "put in probes", but I have no idea what form a probe takes.

The following comment is not directed at this topic alone. Oftentimes,
a seasoned developer leaves a cryptic comment for a newbie debugger and
then disappears from sight, being busy themselves. Such a cryptic
message may not be enough to solve the issue at hand. Or, more often,
the message is generally understandable but lacks the details that the
newbie finds very difficult to fill in technically, and getting the
generally-suggested patch accepted in the end may be another hurdle.

That is barrier enough for spare-time patch posters, IMHO.

Take, for example, the bustage bug:
> The only m-c change that's really breaking us is bug 1071497.

I could figure out that the API had changed, but how do I patch the
resulting failure? I had no clue. Good thing the Bugzilla entry had a
temporary workaround (at least TB compiles and builds successfully,
although whether the patch is semantically correct is beyond me). But
all I wanted to do at the moment was to test some fixes for the failure
to report filtering failures (my CURRENT CONCERN), and, to be honest, I
did not have the time to investigate other aspects of Mozilla
development. It would be nice to learn many topics in depth: telemetry,
performance tweaking, etc.

Just my two cents worth.

TIA

Joshua Cranmer 🐧

Sep 29, 2014, 9:47:17 AM
On 9/29/2014 5:44 AM, Henri Sivonen wrote:
> Do you mean you don't know what constitutes a reasonable threshold of
> little enough use for a feature to be removed?

No, it means I don't understand the Telemetry UI well enough to
interpret what it's showing me.
> If TB devs don't actually look at telemetry data, I guess I shouldn't
> bother to put in probes for c-c's benefit in the future. :-(
I'd love to look at the data, and I suspect others would too, but it's
really hard when the UI is as unintuitive and opaque as it is, let
alone the fact that apparently no one knows how to view
Thunderbird-specific data points.

> If just removing the x-mac-croation *encoder* makes it unsuited for
> the test, could the test be changed to use x-mac-cyrillic or one of
> the ISO-8859-* encodings that are in the Encoding Standard but not in
> the menu: ISO-8859-3, ISO-8859-10, ISO-8859-13, ISO-8859-14,
> ISO-8859-15 or ISO-8859-16?

The test needs to use a non-Encoding Standard decoder.

> This list is the list of menu items, right?

Yes.
> I don't have enough data to judge whether Japanese phones not
> supporting UTF-8 is still an issue. However, emk (CCed) pointed out
> the LetterFix (http://sourceforge.jp/projects/letter-fix/) hack in
> https://bugzilla.mozilla.org/show_bug.cgi?id=1003716#c7 . Apparently,
> Apple Mail going UTF-8-only caused someone to consider it worthwhile
> to hack it to send email encoded in ISO-2022-JP. :-(

That's the data point I'm getting my knowledge from.

> Surely, having the UI for choosing the outgoing encoding in the
> message compose window and in the preferences has some cost. If the
> Japanese locale blocks, or is perceived to block, Thunderbird from
> going UTF-8-only for outgoing email, I think having only two outgoing
> encodings: UTF-8 and ISO-2022-JP and a heuristic for choosing the
> latter in some cases so that there is less need for UI would be a win
> compared to having the current UI around.

Bug 410333 introduced code that made emails not representable in the
currently selected charset silently fall back to UTF-8 rather than ask
the user whether they wanted to change charsets, because, well, asking
is pointless and the fallback only makes sense, right? Well, Bug 448842
was filed by the JP locale because "my email is no longer ISO-2022-JP
because of a single character" was too problematic for users. The logic
that the JP locale wants is incompatible with good logic for the rest
of the world.

>> That said, we can probably make a bit more effort to convince the other
>> localizations that switching their defaults to UTF-8 is worthwhile. Czech,
>> Argentina, Basque, Frisian, Hungarian, Italian, Dutch, and Swedish almost
>> certainly can default to UTF-8;
> Why bother doing convincing when you could make the default moot by
> reducing configurability? My experience from the browser side is that
> it doesn't make sense to try to convince every localizer. To get
> results, it makes sense to move the encoding decisions out of the
> localizations into the core engine.

As I said above, the configurability of that option is hard to remove as
long as resistance in Japan is so entrenched. But more importantly, I
want to know if any other locales are similarly entrenched.

Henri Sivonen

Sep 30, 2014, 7:26:42 AM
to Joshua Cranmer 🐧, dev-apps-t...@lists.mozilla.org
On Mon, Sep 29, 2014 at 4:47 PM, Joshua Cranmer 🐧 <Pidg...@gmail.com> wrote:
> On 9/29/2014 5:44 AM, Henri Sivonen wrote:
>>
>> Do you mean you don't know what constitutes a reasonable threshold of
>> little enough use for a feature to be removed?
>
> No, it means I don't understand the Telemetry UI to be able to interpret
> what it's showing me.

E.g. http://telemetry.mozilla.org/#filter=release%2F32%2FDECODER_INSTANTIATED_HZ&aggregates=multiselect-all!Submissions&evoOver=Builds&locked=true&sanitize=true&renderhistogram=Table
says that during the observed date range in Firefox 32 on the release
channel, there were 113.11 million sessions in which the HZ-GB-2312
decoder was not instantiated and 83 sessions in which it was.

>> If TB devs don't actually look at telemetry data, I guess I shouldn't
>> bother to put in probes for c-c's benefit in the future. :-(
>
> I'd love to look at the data, and I suspect others would too, but it's really
> hard when the UI is as unintuitive and opaque as it is.

The DECODER_INSTANTIATED_* stuff is pretty simple:
you pick a release, the system gives you some random date range, and
then you get two numbers: the number of *sessions* in which the
decoder wasn't instantiated at all and the number of *sessions* in
which the decoder was instantiated at least once. If the latter number
is tiny, getting rid of the encoding shouldn't be a problem.
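Turning those two session counts into a usage share is simple arithmetic; here is a small self-contained sketch (the function name is illustrative, not part of the telemetry front end):

```cpp
#include <cstdint>

// Given the two numbers a DECODER_INSTANTIATED_* flag histogram reports --
// sessions where the decoder was never instantiated and sessions where it
// was instantiated at least once -- return the instantiating sessions as a
// fraction of all sessions.
double InstantiationShare(uint64_t aSessionsWithout, uint64_t aSessionsWith) {
  uint64_t total = aSessionsWithout + aSessionsWith;
  return total == 0 ? 0.0 : static_cast<double>(aSessionsWith) / total;
}
```

With the HZ-GB-2312 numbers quoted earlier (113.11 million sessions without, 83 with), the share comes out to well under one session in a million.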

> Let alone the fact
> that apparently no one knows how to view Thunderbird-specific data points.

https://bugzilla.mozilla.org/show_bug.cgi?id=956101

>> If just removing the x-mac-croation *encoder* makes it unsuited for
>> the test, could the test be changed to use x-mac-cyrillic or one of
>> the ISO-8859-* encodings that are in the Encoding Standard but not in
>> the menu: ISO-8859-3, ISO-8859-10, ISO-8859-13, ISO-8859-14,
>> ISO-8859-15 or ISO-8859-16?
>
> The test needs to use a non-Encoding Standard decoder.

I suggest then using an x-mac-* encoding that gfx uses and for which
telemetry shows non-zero Firefox usage. (gfx does not use x-mac-croatian.)


Henri Sivonen

unread,
Sep 30, 2014, 8:18:46 AM9/30/14
to ISHIKAWA, Chiaki, dev-apps-t...@lists.mozilla.org
On Mon, Sep 29, 2014 at 3:50 PM, ISHIKAWA, Chiaki <ishi...@yk.rim.or.jp> wrote:
> The above bugzilla entries
> date back to 2004, and I just experienced the garbled text when I try to
> forward an e-mail with an attachment the week before.

Not exactly fast bug fixing. :-(

> You said "put in probes", but I have no idea what form it is like, etc.

The probes are just calls to Telemetry::Accumulate. Like this:
http://mxr.mozilla.org/mozilla-central/source/intl/uconv/ucvlatin/nsCP866ToUnicode.cpp#23

The types of probes are defined in Histograms.json:
http://mxr.mozilla.org/mozilla-central/source/toolkit/components/telemetry/Histograms.json#4023

The "flag" type of probe reports the value 1 if Telemetry::Accumulate
has been called at least once for it during the session and 0 if
Telemetry::Accumulate has not been called at all for it during the
session.
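The "flag" semantics can be modeled in a few lines. This is a standalone sketch, not the actual Gecko Telemetry machinery: the class and method names are invented for illustration.

```cpp
// Standalone model of a telemetry "flag" probe: the session reports 1 if
// Accumulate was called at least once for the probe during the session,
// and 0 if it was never called.
class FlagProbe {
 public:
  void Accumulate() { mFired = true; }
  int SessionValue() const { return mFired ? 1 : 0; }

 private:
  bool mFired = false;
};
```

In the real code, the call site is a one-liner in the decoder's instantiation path, as in the nsCP866ToUnicode.cpp link above; calling it more than once per session changes nothing.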

Kent James

unread,
Sep 30, 2014, 1:27:01 PM9/30/14
to
On 9/30/2014 4:26 AM, Henri Sivonen wrote:
> On Mon, Sep 29, 2014 at 4:47 PM, Joshua Cranmer 🐧 <Pidg...@gmail.com> wrote:
...
>> Let alone the fact
>> that apparently no one knows how to view Thunderbird-specific data points.
>
> https://bugzilla.mozilla.org/show_bug.cgi?id=956101
>

Telemetry could be useful in Thunderbird, but at least this dev knows
nothing about it. If anything, I was under the impression that this is
FF data.

Is telemetry available that would let us evaluate the usage of character
sets in TB, or is this something that could be easily added?

:rkent

ishikawa

unread,
Oct 1, 2014, 6:48:05 AM10/1/14
to
Re the bug reporting issue:

On 2014年09月30日 21:18, Henri Sivonen wrote:
> On Mon, Sep 29, 2014 at 3:50 PM, ISHIKAWA, Chiaki <ishi...@yk.rim.or.jp> wrote:
>> The above bugzilla entries
>> date back to 2004, and I just experienced the garbled text when I try to
>> forward an e-mail with an attachment the week before.
>
> Not exactly fast bug fixing. :-(
>

I have created a bugzilla entry

https://bugzilla.mozilla.org/show_bug.cgi?id=1075436

Trying to forward a Japanese e-mail with certain attachment files cause a
garbled main text in mail composite window

Based on the new knowledge that attaching a single plain-text
(Shift_JIS-encoded) Japanese file to an outgoing ISO-2022-JP mail can
cause this behavior, I may need to tweak the title a little bit.

But the information there (sans screen shots: I will upload something later)
should be enough to recreate the problem.

TIA

ishikawa

unread,
Oct 2, 2014, 12:50:43 AM10/2/14
to
> I have created a bugzilla entry
>
> https://bugzilla.mozilla.org/show_bug.cgi?id=1075436
>
> Trying to forward a Japanese e-mail with certain attachment files cause a
> garbled main text in mail composite window
>
...
> But the information there (sans screen shots: I will upload something later)
> should be enough to recreate the problem.

I have uploaded the screen shots that explain the behavior.

TIA

ishikawa

unread,
Oct 2, 2014, 1:20:21 AM10/2/14
to
On 2014年09月30日 20:26, Henri Sivonen wrote:

> E.g. http://telemetry.mozilla.org/#filter=release%2F32%2FDECODER_INSTANTIATED_HZ&aggregates=multiselect-all!Submissions&evoOver=Builds&locked=true&sanitize=true&renderhistogram=Table
> says that during the observed date range in Firefox 32 on the release
> channel, there were 113.11 million sessions in which the HZ-GB-2312
> decoder was not instantiated and 83 sessions in which it was.
>

This is a great pointer.
I looked at ISO2022JP info.

http://telemetry.mozilla.org/#filter=release%2F32%2FDECODER_INSTANTIATED_ISO2022JP&aggregates=multiselect-all!Submissions&evoOver=Builds&locked=true&sanitize=true&renderhistogram=Table

The number 61K+ sounds too small to me.

It is either

- that corporate users (if any) don't allow telemetry data to be sent
out [and they are the ones most likely forced to stick to ISO-2022-JP
because their valued customers may complain, or already have, if UTF-8
is used. I have stuck to ISO-2022-JP for similar reasons: I had
problems exchanging e-mails with an art design studio and reverted to
ISO-2022-JP],

- or that the numbers may not reflect the real usage due to the
recent bustage noted in the following bug.


>> Let alone the fact
>> that apparently no one knows how to view Thunderbird-specific data points.
>
> https://bugzilla.mozilla.org/show_bug.cgi?id=956101

Or, maybe corporate users have abandoned TB for something else, like a
custom-made e-mail client with all the bells and whistles for corporate
governance: verifying the To: address (blacklist/whitelist), content
censorship for proper communication/corporate governance, automatic
routing of an e-mail to one's superior for signing off, etc.
This is for real. People don't talk about this openly, but every time I
see such a special mail client used in big corporations mentioned in TV
news segments (on today's corporate business work), the usage pattern
suggests something that cannot be created by a simple add-on on top of
an existing mail client. [I tend to think such checking is done BOTH
on the client PC and the server. Server-only checking would put too
much workload on it during peak hours.]

Once bug 956101 is fixed and the correct data begins accumulating,
I would be interested in learning how popular UTF-8 is in China.
Is this something possible to learn (maybe by seeing how often the GB
encodings et al. are used)?
I doubt there is a way to correlate the data with the geographical area
(maybe the timezone?) of the client.
Oh well, I am treading into a privacy issue here, I suspect.

The reason I am asking is that, until the recent past, many Japanese
phone makers subcontracted the internal design work to China (the
Chinese subcontractors understand Kanji and East Asian character issues
better than, say, Indian companies). If UTF-8 is not widely used in
China, I can certainly understand why UTF-8 was not well supported in
Japanese mobile phones: most of the time, the spec for the internal
software was done by the Chinese subcontractors. They wrote the
specification documents in Japanese, but the content reflects their
experience of using the legitimate but legacy standard codes vs. UTF-8
in Japan and in China, maybe.
I have begun to feel that the way the phone software treated UTF-8
unfavorably may have something to do with the subcontracting to China,
where the essential design was done.

Thank you for the various tips regarding telemetry; these certainly help.
(Too bad about bug 956101.)

Henri Sivonen

unread,
Oct 6, 2014, 3:16:24 AM10/6/14
to ishikawa, dev-apps-t...@lists.mozilla.org
On Thu, Oct 2, 2014 at 8:20 AM, ishikawa <ishi...@yk.rim.or.jp> wrote:
> On 2014年09月30日 20:26, Henri Sivonen wrote:
>
>> E.g. http://telemetry.mozilla.org/#filter=release%2F32%2FDECODER_INSTANTIATED_HZ&aggregates=multiselect-all!Submissions&evoOver=Builds&locked=true&sanitize=true&renderhistogram=Table
>> says that during the observed date range in Firefox 32 on the release
>> channel, there were 113.11 million sessions in which the HZ-GB-2312
>> decoder was not instantiated and 83 sessions in which it was.
>>
>
> This is a great pointer.
> I looked at ISO2022JP info.
>
> http://telemetry.mozilla.org/#filter=release%2F32%2FDECODER_INSTANTIATED_ISO2022JP&aggregates=multiselect-all!Submissions&evoOver=Builds&locked=true&sanitize=true&renderhistogram=Table
>
> The number 61K+ sounds too small to me.

Note that the above telemetry numbers are for Firefox--not Thunderbird.

> It is either
>
> - the corporate users (if any) don't allow telemetry data to be sent out
> [and they are the ones who are very likely forced to stick to ISO-2022-JP
> because
> their valued customers may complain or already did if UTF-8 is used.
> I have stuck to ISO-2022-JP for similar reasons. I got problems with art
> design studio
> to exchange e-mails, and reverted back to ISO-2022-JP.],
>
> - or the numbers may not reflect the real usage due to the
> recent bustage noted in the following bug.

The numbers are for Firefox and not for Thunderbird due to the bug
that results in Thunderbird data not showing up at all. Apparently
ISO-2022-JP is not really that popular an encoding on the Web where
Shift_JIS is the dominant legacy encoding for Japanese text. It would
be interesting to know why anyone would use ISO-2022-JP on the Web.
Pass-through mail archives are an obvious use case. Are there other
reasons?

> Once the correct data begins accumulating, once bug 956101 is fixed,
> I would be interested in learning how popular
> UTF-8 is in China.
> Is this something possible to learn (maybe by learning how often GB encoding
> et al is used?).

I'm not aware of telemetry probes in place to gauge UTF-8 usage in
Thunderbird. (There could be a probe without me knowing about it,
though.)

> I doubt if there is a way to correlate the data with the geographical area
> (maybe the timezone?) of the client.
> Oh well, I am treading into a privacy issue here, I suspect.

It's not available on the public front end, but the back end data
store knows about the localization of the app that submitted telemetry
data. It is, therefore, possible for Mozilla staff who have non-public
access to the analysis machinery to analyze telemetry data partitioned
by the user interface language by running a custom map-reduce job.
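The kind of custom analysis job described above can be sketched in miniature. This is purely illustrative: the `Submission` fields are invented for the sketch and do not reflect the real back-end schema.

```cpp
#include <map>
#include <string>
#include <vector>

// Illustrative record for one telemetry submission: the localization of the
// submitting app plus the value of one DECODER_INSTANTIATED_* flag probe.
struct Submission {
  std::string locale;        // e.g. "ja", "en-US" (hypothetical field)
  bool decoderInstantiated;  // flag probe value for the session
};

// Count, per locale, the sessions in which the decoder was instantiated --
// the reduce step of the hypothetical map-reduce job described above.
std::map<std::string, int> CountByLocale(const std::vector<Submission>& aSubs) {
  std::map<std::string, int> counts;
  for (const Submission& s : aSubs) {
    if (s.decoderInstantiated) {
      ++counts[s.locale];
    }
  }
  return counts;
}
```

A partition like this would, for example, show whether ISO-2022-JP decoder instantiations cluster in the ja localization, without exposing any geographic data beyond the UI language.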