
TB Message encoding question(s)

Alan Lord (News)

Nov 29, 2010, 5:32:01 PM
I have an extension that, amongst other things, allows the TB user to
save messages to a CRM.

Recently we noticed a bug and it's rather beyond my area of expertise...

The user who reported it is in Hungary and his TB is set up to use
UTF-8. He had received an email in German using ISO-8859-1 encoding.

I *think* we send/store the messages in the CRM in UTF-8 but in this
message the encoding got corrupted (I hope these work in my message):

An interesting bit of message as seen (correctly) in TB:

one Pröbs
hestraße

Message content as seen in the CRM after saving:

one PrĂśbs
hestra�e

I'm not even sure I know the right questions to ask about this problem
;-) But I will try... Is this an issue others have experienced, and can
anyone suggest the *right* way to deal with text encoding?

A selection of code and bits are below for completeness.

TIA

Al


The message contents are first loaded into a hidden iFrame (apparently
the recommended way). I have tried to do a "detect and convert to UTF-8"
on the contents, but this caused corruption similar to the above.

var iframe = document.getElementById('iframeValue');
iframe.webNavigation.loadURI(url + "?header=quotebody",
    iframe.webNavigation.LOAD_FLAGS_IS_LINK, null, null, null);
iframe.addEventListener('load', function() {
    document.getElementById("TextAreaValue").value =
        iframe.contentDocument.body.innerHTML;
}, true);

if (charset != "UTF-8") {
    document.getElementById("TextAreaValue").value = convertToUnicode(
        charset, document.getElementById("TextAreaValue").value);
}

The unicode conversion function is fairly straightforward:

function convertToUnicode(aCharset, aSrc) {
    try {
        var unicodeConverter =
            Components.classes["@mozilla.org/intl/scriptableunicodeconverter"]
                .createInstance(Components.interfaces.nsIScriptableUnicodeConverter);
        unicodeConverter.charset = aCharset;
        aSrc = unicodeConverter.ConvertToUnicode(aSrc);
    } catch (e) {
        alertMessage(walk(e));
    }
    return aSrc;
}

And I'm attempting to detect the charset of the message by one of two
methods:

if (!msgWindow.mailCharacterSet) {
    var charset = loadedMessageHdr.Charset;
} else {
    var charset = msgWindow.mailCharacterSet;
}

The mailCharacterSet seemed to be a more reliable and consistent
approach but now I am not so sure...

Neil

Nov 29, 2010, 7:00:38 PM
Alan Lord (News) wrote:

> The message contents are first loaded into a hidden iFrame (as is the
> recommended way apparently) and I have tried to do a "detect and
> convert to UTF-8" on the contents but this caused similar corruption
> to above.
>
> var iframe = document.getElementById('iframeValue');
> iframe.webNavigation.loadURI(url+"?header=quotebody",
> iframe.webNavigation.LOAD_FLAGS_IS_LINK, null, null, null);
> iframe.addEventListener('load', function() {
> document.getElementById("TextAreaValue").value =
> iframe.contentDocument.body.innerHTML;}, true);

As far as I know, innerHTML should always be in "Unicode" (actually
UTF-16), assuming the message's character set was correct in the first
place. So the only conversion you have to worry about is the UTF-8 output.
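
A minimal sketch of that one remaining conversion, assuming the text has already been read from innerHTML as an ordinary JavaScript string. The function name toUtf8ByteString is hypothetical and not from the thread. It relies only on the standard encodeURIComponent and unescape globals, and returns a "byte string" in which each character holds one UTF-8 byte:

```javascript
// Hypothetical helper (not from the thread): encode a JavaScript (UTF-16)
// string as UTF-8, returning a string whose char codes are the UTF-8 bytes.
function toUtf8ByteString(text) {
  // encodeURIComponent percent-encodes the UTF-8 bytes of every non-ASCII
  // character; unescape then turns each %XX back into a single char code.
  return unescape(encodeURIComponent(text));
}

// Example: the two German fragments from the original report.
var sample = toUtf8ByteString("Pröbs");
var street = toUtf8ByteString("hestraße");
```

Each of "ö" and "ß" becomes two bytes, so the results are one character longer than the inputs. No detection of the original message charset is involved, which is the point being made above.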

--
Warning: May contain traces of nuts.

Alan Lord (News)

Nov 30, 2010, 4:14:22 AM
On 30/11/10 08:28, Jonathan Protzenko wrote:
> You might want to have a look here
> https://github.com/protz/GMail-Conversation-View/blob/master/modules/message.js#L891

Many thanks for this. I have added the code to set the contentViewer on
my iFrame as in your example and asked my colleague in Hungary to do
some tests.

> As a side note, I just changed this code to take into account bug 594646
> on 3.3. I previously used cv.hintCharacterSetSource = 10 on 3.1, ymmv.

I can't find any description of what these properties

cv.hintCharacterSet = "UTF-8";
cv.hintCharacterSetSource = kCharsetFromMetaTag;

are supposed to indicate (I can sort of guess what the first is for) but
for my tests I have just set hintCharacterSetSource = 11; and we'll see.
Message handling on my own system seemed unaffected by this, but then I
am only really using UTF-8/standard Latin anyway...

If you have spare time, what does the kCharsetFromMetaTag number signify?

Many thanks again.

Alan

Jonathan Protzenko

Nov 30, 2010, 4:56:04 AM
to Alan Lord (News), dev-ext...@lists.mozilla.org
I only figured out these issues because I'm French and I regularly deal
with email containing non-ASCII characters. I'm basically doing the
message display myself as well, for the Thunderbird Conversations
extension, and I ran into these encoding issues.

I've spent some time digging into the arcane depths of nsMessenger.cpp
and the following lines shed some light on the matter:
http://mxr.mozilla.org/comm-central/source/mailnews/base/src/nsMessenger.cpp#392
http://mxr.mozilla.org/comm-central/source/mailnews/base/src/nsMessenger.cpp#571

Basically, the trick is that libmime (the component responsible for
parsing and decoding messages) already takes care of the encoding
issues. It decodes base64, quoted-printable, and all sorts of encodings,
and *always* outputs UTF-8. So when streaming a message to your iframe,
the message goes through libmime, which outputs UTF-8 already, regardless
of the original encoding of your HTML message.

Now the thing is, you must tell the iframe to *ignore* the <meta> tag in
the message body, as it's no longer relevant, and just assume that the
hint you gave is the right value. I've found that the combination of
these two lines (hintCharacterSet and hintCharacterSetSource) solves the
display issues. I just hope it does for you as well!
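
A rough sketch of how those two lines might be applied to an iframe, written under several assumptions. The function name hintUtf8Charset and its parameters are hypothetical. Reaching the viewer through iframe.docShell.contentViewer and querying it for nsIMarkupDocumentViewer reflect Thunderbird 3.x-era code as far as I recall, and should be checked against the linked message.js. The numeric value of kCharsetFromMetaTag differs between versions (10 and 11 are both mentioned in this thread), so it is passed in rather than hard-coded:

```javascript
// Hypothetical helper: tell the iframe's content viewer that the document is
// UTF-8, with a hint source strong enough that the <meta> tag in the message
// body does not override it.
//   iframe            - the hidden iframe element the message is streamed into
//   markupViewerIface - assumed to be Components.interfaces.nsIMarkupDocumentViewer
//   charsetSource     - the kCharsetFromMetaTag value for the running version
function hintUtf8Charset(iframe, markupViewerIface, charsetSource) {
  var cv = iframe.docShell.contentViewer.QueryInterface(markupViewerIface);
  cv.hintCharacterSet = "UTF-8";              // libmime always outputs UTF-8
  cv.hintCharacterSetSource = charsetSource;  // how strongly to trust the hint
  return cv;
}

// Intended use inside the extension (assumption), before loading the message:
//   hintUtf8Charset(document.getElementById('iframeValue'),
//                   Components.interfaces.nsIMarkupDocumentViewer,
//                   kCharsetFromMetaTag);
```

With the hint in place, the string later read from innerHTML should already be correct, so no manual charset conversion is needed afterwards.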

jonathan


Alan Lord (News)

Nov 30, 2010, 7:12:06 AM
to Jonathan Protzenko
On 30/11/10 09:14, Alan Lord (News) wrote:
> On 30/11/10 08:28, Jonathan Protzenko wrote:
>> You might want to have a look here
>> https://github.com/protz/GMail-Conversation-View/blob/master/modules/message.js#L891
>
> Many thanks for this. I have added the code to set the contentViewer on
> my iFrame as in your example and asked my colleague in Hungary to do
> some tests.

Jonathan, many thanks.

My colleague in Hungary has tested and is very happy :-)

Just adding the contentViewer code to the iFrame object works a treat. I
don't need to do any manual charset conversion.

Much appreciated.

Alan

Neil

Nov 30, 2010, 9:40:05 AM
Alan Lord (News) wrote:

So, is the problem that the MIME channel doesn't itself specify that
it's really UTF-8, which is why the workaround is needed?
