
Javascript string encoding and channels


Brian Fernandes

Dec 22, 2007, 12:27:59 AM
Background: I have implemented my own protocol handler. The protocol
handler returns a new custom channel on request, which happens when you
type a URL using that protocol into the address bar. The channel's
asyncOpen method fetches a String (which is an HTML document,
beginning at <html>) from some Java libraries based on the URL
entered. This string is placed in an nsIStringInputStream and is sent
on to the browser, where it is displayed in the current tab.

The code for the channel's asyncOpen method is something like this:
asyncOpen: function(observer, context) {
    var html = getStringFromJava(this.URI);
    var sc = Components.classes["@mozilla.org/io/string-input-stream;1"];
    var ist = sc.createInstance(Components.interfaces.nsIStringInputStream);
    observer.onStartRequest(this, context);
    var len = html.length;
    ist.setData(html, len);
    observer.onDataAvailable(this, context, ist, 0, len);
    ist.close();
    observer.onStopRequest(this, context, Components.results.NS_OK);
}

Problem: The above works well for regular ASCII text, but it fails
miserably if the HTML string returned contains characters outside
ASCII (e.g. Chinese characters); I just get mangled text in my
browser. If I log the HTML string to the Javascript console or display
it in an alert box, it looks just fine, with all the characters
preserved. I have tried setting the contentCharset property on the
channel to UTF-8, UTF-16, UCS-2 and ISO-10646 (Unicode?) with no luck.
AFAIK, even if you do not set the charset explicitly on the channel,
you can change it on the fly in FF using View > Character Encoding;
this did not help either. Note that the HTML returned does not declare
any encoding; it contains just a style and hyperlinked text.

Notes:
1) Instead of using the channel, if I set the contents of the current
browser's document to the string returned, it looks fine:
    gBrowser.selectedBrowser.contentDocument.childNodes[1].innerHTML = html;
This was actually my initial implementation before I ran into the
encoding problem, but it really is a hack and I would like to use the
channel instead.

2) Instead of HTML, I can return an array composed of the raw string
bytes (using UTF-8 encoding). In such a case I have a loop which
reconstructs the string like so:

var html = "";
for (var i = 0; i < bytes.length; i++) {
    html += String.fromCharCode(bytes[i]);
}
If I use this string (html) with the contentCharset set to UTF-8, it
works and I get the expected text in my browser. This is my current
solution, but it is an obviously inefficient and unwanted O(n)
routine. On my machine it takes nearly 3 seconds for 5 pages' worth of
text, though most regular queries have smaller results and take only a
quarter of a second.

3) The JS file which contains this code is encoded using UTF-8, so I
can hardcode a Javascript string to include some Chinese characters.
These look fine in my editor, but when I display this string using the
console or an alert, it looks mangled. However, if I use normal
channel code (without any of these mods), it looks just fine in the
browser if the contentCharset is set to UTF-8.
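
For the loop in note 2, one possible way to cut the per-byte concatenation cost is to convert the array in chunks with String.fromCharCode.apply. This is only a sketch: bytesToByteString and its chunkSize parameter are hypothetical names, and it assumes bytes is a plain JavaScript array of values in the range 0..255 (a Java array handed over by a bridge may need copying into one first). It is still linear in the input size; whether it is actually faster would need measuring.

```javascript
// Hypothetical helper: build a "byte string" (one character per byte)
// from an array of byte values, one chunk at a time.
function bytesToByteString(bytes, chunkSize) {
    // Assumed default; keeps each apply() argument list reasonably small.
    chunkSize = chunkSize || 8192;
    var parts = [];
    for (var i = 0; i < bytes.length; i += chunkSize) {
        // One fromCharCode call converts a whole chunk of byte values.
        parts.push(String.fromCharCode.apply(null, bytes.slice(i, i + chunkSize)));
    }
    return parts.join("");
}

// The UTF-8 bytes of U+4E2D become a three-character byte string.
var html = bytesToByteString([0xe4, 0xb8, 0xad]);
```

The result has the same content as the string built by the original loop, so it can be used the same way.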

I don't think the Java source is the problem because Javascript is
capable of displaying the string correctly in the console / alert box.
It seems to merely be a question of finding the right encoding to use
but most of the common options have led to disappointing results. Can
I use a different input stream? Any suggestions? What am I missing /
doing wrong?

Thanks!
Brian.

Mook

Dec 22, 2007, 1:35:14 AM
Brian Fernandes wrote:
> Background: I have implemented my own protocol handler. The protocol
> handler returns a new custom channel on request, this happens when you
> type in a URL using that protocol in the address bar. The channel's
> asyncOpen method fetches a String (which is an HTML document,
> beginning at <html>) from some Java libraries based on the URL
> entered. This string is placed in a nsIStringInputStream and is sent
> on to the browser where it is displayed in the current tab.
>
> The code for the channel's asyncOpen method is something like this:
> asyncOpen: function(observer, context) {
>     var html = getStringFromJava(this.URI);
>     var sc = Components.classes["@mozilla.org/io/string-input-stream;1"];
>     var ist = sc.createInstance(Components.interfaces.nsIStringInputStream);
>     observer.onStartRequest(this, context);
>     var len = html.length;
>     ist.setData(html, len);
>     observer.onDataAvailable(this, context, ist, 0, len);
>     ist.close();
>     observer.onStopRequest(this, context, Components.results.NS_OK);
> }

<snip>

http://lxr.mozilla.org/mozilla/source/xpcom/io/nsIStringStream.idl#55

setData's first parameter is "string", which is XPIDL/XPConnect for
"random 8-bit data that got that way by taking UCS2 and chopping off the
high byte". You want to first encode your string into bytes, then send
it over. nsIScriptableUnicodeConverter (declared in
nsIScriptableUConv.idl) is probably useful for you.

Basically, just send the *raw* bytes over (this means showing it in JS
won't show the right characters, but that's fine because decoding is
just done later).
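
The effect can be illustrated outside XPCOM with a small sketch. It assumes the truncation behaves as described above (only the low byte of each 16-bit code unit is kept). The helper names truncateToLowBytes and toUtf8ByteString are hypothetical, and unescape(encodeURIComponent(...)) stands in here for a real converter purely for illustration.

```javascript
// Hypothetical helper: mimic passing a JS string as an XPIDL "string"
// by keeping only the low byte of each UTF-16 code unit.
function truncateToLowBytes(str) {
    var bytes = [];
    for (var i = 0; i < str.length; i++) {
        bytes.push(str.charCodeAt(i) & 0xff);
    }
    return bytes;
}

// Hypothetical helper: encode to UTF-8 first, giving a "byte string" in
// which every character code is <= 0xFF, so truncation loses nothing.
function toUtf8ByteString(str) {
    return unescape(encodeURIComponent(str));
}

var text = "\u4e2d"; // a single Chinese character, U+4E2D

// Passing the string directly: the high byte 0x4E is dropped.
var lossy = truncateToLowBytes(text);                   // [0x2d]

// Encoding first: the three UTF-8 bytes survive the truncation.
var safe = truncateToLowBytes(toUtf8ByteString(text));  // [0xe4, 0xb8, 0xad]
```

The byte string itself looks wrong if displayed from JS, which matches the note above: it only becomes readable again once the receiver decodes it as UTF-8.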


HTH
--
Mook

Brian Fernandes

Dec 22, 2007, 3:26:45 PM
On Dec 22, 11:35 am, Mook <mook.moz+nntp.news.mozilla....@gmail.com>
wrote:
Mook,

That was the ticket. Here is the revised asyncOpen method which works
like a charm.

asyncOpen: function(observer, context) {
    var html = getStringFromJava(this.URI);

    observer.onStartRequest(this, context);
    var converter = Components.classes["@mozilla.org/intl/scriptableunicodeconverter"]
        .createInstance(Components.interfaces.nsIScriptableUnicodeConverter);
    converter.charset = "UTF-8";
    // Encode the string as UTF-8 and wrap the bytes in an input stream.
    var ist = converter.convertToInputStream(html);
    // Number of encoded bytes in the stream, not html.length.
    var len = ist.available();

    observer.onDataAvailable(this, context, ist, 0, len);

    observer.onStopRequest(this, context, Components.results.NS_OK);
}

Thanks! I appreciate the quick response.
Brian.
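
One detail in the revised method is that len comes from ist.available() rather than from html.length. A rough sketch of why the two can differ, runnable outside XPCOM: utf8ByteLength is a hypothetical helper, and the sample document is made up for illustration.

```javascript
// Hypothetical helper: number of bytes the string occupies once encoded
// as UTF-8 (each character of the intermediate string is one byte).
function utf8ByteLength(str) {
    return unescape(encodeURIComponent(str)).length;
}

// Made-up sample: two Chinese characters inside minimal markup.
var html = "<html>\u4e2d\u6587</html>";

var charCount = html.length;           // 15 UTF-16 code units
var byteCount = utf8ByteLength(html);  // 19 bytes: each Chinese character takes 3
```

For pure ASCII content the two counts are equal, which would explain why the original version only misbehaved on non-ASCII text.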
