<script> charset attribute ignored

Jan Odvarko

Dec 22, 2009, 4:26:49 AM

An issue related to charset was reported in Firebug. It occurs when an
HTML file is encoded in, e.g., UTF-8 and includes a script file encoded
with a different charset. The file is included through a <script>
element with a charset attribute.

The content of that JS file is not converted properly.

Online test case:
http://getfirebug.com/tests/issues/429/index.xhtml

I checked the channel object (related to the
dirrerentlyEncodedScriptFile.js request) and its contentCharset is
not set.

According to the docs, contentCharset uses the charset given in
contentType (after ';'), but if the contentType doesn't specify any
charset, the "charset" attribute from the <script> element should be
respected, correct?
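
To make the "after ';'" part concrete, here is a minimal sketch in plain JavaScript of pulling the charset parameter out of a Content-Type value. The helper name charsetFromContentType is hypothetical and is not part of any Mozilla API; it only illustrates what a channel's contentCharset would be derived from when the server does send a charset.

```javascript
// Hypothetical helper (not a Mozilla API): extract the charset parameter
// from a Content-Type header value, or return null when none is given.
function charsetFromContentType(contentType) {
  if (!contentType) {
    return null;
  }
  // Parameters follow the media type, separated by ';'.
  const params = contentType.split(";").slice(1);
  for (const param of params) {
    const eq = param.indexOf("=");
    if (eq === -1) {
      continue;
    }
    const name = param.slice(0, eq).trim().toLowerCase();
    if (name !== "charset") {
      continue;
    }
    // Strip optional surrounding double quotes from the value.
    const value = param.slice(eq + 1).trim().replace(/^"(.*)"$/, "$1");
    return value === "" ? null : value;
  }
  return null;
}

console.log(charsetFromContentType("text/javascript; charset=UTF-16")); // prints: UTF-16
console.log(charsetFromContentType("text/javascript"));                 // prints: null
```

When this returns null, the response itself carries no charset information, which is the situation the question above is about.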

I'll file a new bug if this is expected behavior.

Honza

Boris Zbarsky

Dec 22, 2009, 12:24:08 PM

On 12/22/09 4:26 AM, Jan Odvarko wrote:
> Online test case:
> http://getfirebug.com/tests/issues/429/index.xhtml

This testcase worksforme (in the sense that in a build without Firebug
installed it throws because |console| is undefined, and in a build with
Firebug installed it logs "got something:foo" to the console).

> According to the docs, contentCharset uses the charset given in
> contentType (after ';'), but if the contentType doesn't specify any
> charset, the "charset" attribute from the <script> element should be
> respected, correct?

Yes, and it is.

-Boris

John J. Barton

Dec 22, 2009, 12:48:44 PM

Boris Zbarsky wrote:
> On 12/22/09 4:26 AM, Jan Odvarko wrote:
>> Online test case:
>> http://getfirebug.com/tests/issues/429/index.xhtml
>
> This testcase worksforme (in the sense that in a build without Firebug
> installed it throws because |console| is undefined, and in a build with
> Firebug installed it logs "got something:foo" to the console).

The bug here is that the .js files are not readable in the Script panel.

jjb

Gijs Kruitbosch

Dec 22, 2009, 1:14:17 PM

How are you getting the source text? If you refetch the JS and don't specify
the charset bits yourself, that'd be the same as requesting the code
yourself in the web browser, which has it in UTF-16. That would explain why
it's not readable...

~ Gijs

Boris Zbarsky

Dec 22, 2009, 1:16:33 PM

On 12/22/09 12:48 PM, John J. Barton wrote:
> The bug here is that the Script panel is not readable for the .js files.

Is the Script panel using the <script> node's charset attribute?

In fact, how is the Script panel getting the text it's getting?

-Boris

Boris Zbarsky

Dec 22, 2009, 1:22:33 PM

On 12/22/09 4:26 AM, Jan Odvarko wrote:

> According to the docs, contentCharset uses the charset given in
> contentType (after ';'), but if the contentType doesn't specify any
> charset, the "charset" attribute from the <script> element should be
> respected, correct?

Oh, I see what you're asking. You want to know whether the charset
reported by the _channel_ should take into account the @charset
attribute of the <script>?

There's no guarantee of that whatsoever. The consumer can use @charset
as a charset hint (and set it on the channel), or it can just detect
when the channel didn't set a charset and look at @charset at that
point. Right now it does the latter. I suppose we could switch to
doing the former; that might help you if you're relying solely on the
channel to get the encoding information.
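
The "detect when the channel didn't set a charset and look at @charset" approach can be sketched roughly as follows. This is plain JavaScript, not the actual script loader code; the function name pickScriptCharset and its parameter names are hypothetical, and the final fallback to the document charset is an assumption based on the "document character set" remark in this thread. BOM handling is deliberately left out here.

```javascript
// Hypothetical sketch of the consumer-side fallback: prefer the charset
// reported by the channel, then the <script> element's charset attribute,
// then the including document's character set.
function pickScriptCharset({ channelCharset, scriptCharsetAttr, documentCharset }) {
  const candidates = [channelCharset, scriptCharsetAttr, documentCharset];
  for (const candidate of candidates) {
    if (typeof candidate === "string" && candidate.trim() !== "") {
      return candidate.trim();
    }
  }
  // Nothing usable was supplied; the caller has to pick a default.
  return null;
}

// The channel reports nothing, so the attribute wins over the page charset.
console.log(pickScriptCharset({
  channelCharset: "",
  scriptCharsetAttr: "UTF-16",
  documentCharset: "UTF-8",
})); // prints: UTF-16
```

A tool that only sees the channel has just the first of these three inputs, which is why it ends up with an empty charset in the reported case.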

Of course you'll still have to do the BOM and document character set
stuff yourself. There really has to be a better way to do this.
Perhaps we should have a way for a debugger to be notified with the
Unicode script text instead of having to sniff channels for data? Or
something?
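
For the BOM part, a minimal sketch of sniffing a byte order mark at the start of the raw response bytes might look like this. The helper name sniffBom and its return shape are hypothetical, and it only recognizes the UTF-8 and UTF-16 marks; it is an illustration, not the code the browser itself runs.

```javascript
// Hypothetical helper: inspect the first bytes of a response for a byte
// order mark. Returns the implied charset and the BOM length in bytes,
// or null when no UTF-8/UTF-16 BOM is present.
function sniffBom(bytes) {
  if (bytes.length >= 3 &&
      bytes[0] === 0xef && bytes[1] === 0xbb && bytes[2] === 0xbf) {
    return { charset: "UTF-8", bomLength: 3 };
  }
  if (bytes.length >= 2 && bytes[0] === 0xfe && bytes[1] === 0xff) {
    return { charset: "UTF-16BE", bomLength: 2 };
  }
  if (bytes.length >= 2 && bytes[0] === 0xff && bytes[1] === 0xfe) {
    return { charset: "UTF-16LE", bomLength: 2 };
  }
  return null;
}

const withBom = new Uint8Array([0xff, 0xfe, 0x66, 0x00]);
const bom = sniffBom(withBom);
console.log(bom.charset, bom.bomLength);                   // prints: UTF-16LE 2
console.log(sniffBom(new Uint8Array([0x66, 0x6f, 0x6f]))); // prints: null
```

A caller would skip bomLength bytes before decoding and fall back to the other charset sources only when the result is null.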

-Boris

Boris Zbarsky

Dec 22, 2009, 1:24:57 PM

On 12/22/09 1:22 PM, Boris Zbarsky wrote:
> There's no guarantee of that whatsoever. The consumer can use @charset
> as a charset hint (and set it on the channel)

To be clear, this would only work with HTTP. Other channel impls do not
have this behavior, and the API doesn't describe it as an option (unlike
with content-type). So this is not in fact something the consumer can do.

-Boris

Jan Odvarko

Dec 23, 2009, 4:43:31 AM

Firebug currently uses nsITraceableChannel to register a stream tee
listener and get an HTTP response from the tee using pipe. All
responses are cached (to avoid any refetch from the server) and used
later by the Script panel (and e.g. by the Net panel to show
responses).

Honza

Jan Odvarko

Dec 23, 2009, 5:13:19 AM

> Oh, I see what you're asking. You want to know whether the charset
> reported by the _channel_ should take into account the @charset
> attribute of the <script>?

Yes, every tracked (text) response is converted to Unicode before
being put into the Firebug cache (using nsIScriptableUnicodeConverter).

In this specific case, the JS script included in the page uses a
different charset than the page itself, so instead of:

var conv = Components.classes["@mozilla.org/intl/scriptableunicodeconverter"]
    .createInstance(Components.interfaces.nsIScriptableUnicodeConverter);
conv.charset = document.characterSet;
return conv.ConvertToUnicode(text);

there should be something like this (where request is an nsIHttpChannel):
conv.charset = request.contentCharset;
return conv.ConvertToUnicode(text);

So, the conversion uses the character set of the request instead of the
'global' page character set.
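
The effect of choosing the wrong charset can be reproduced outside the browser. The sketch below uses the standard TextDecoder as a stand-in for nsIScriptableUnicodeConverter, with illustrative bytes rather than the real test file; the helper name decodeWith and the variable names are hypothetical.

```javascript
// Illustrative bytes: the text "foo" encoded as UTF-16LE, without a BOM.
const scriptBytes = new Uint8Array([0x66, 0x00, 0x6f, 0x00, 0x6f, 0x00]);

// Hypothetical helper: decode raw bytes using the given charset label.
function decodeWith(charset, bytes) {
  return new TextDecoder(charset).decode(bytes);
}

// Decoding with the page's charset yields NUL characters between letters.
const asPageCharset = decodeWith("utf-8", scriptBytes);
// Decoding with the script's own charset recovers the text.
const asRequestCharset = decodeWith("utf-16le", scriptBytes);

console.log(JSON.stringify(asPageCharset));    // prints: "f\u0000o\u0000o\u0000"
console.log(JSON.stringify(asRequestCharset)); // prints: "foo"
```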

> There's no guarantee of that whatsoever. The consumer can use @charset
> as a charset hint (and set it on the channel), or it can just detect
> when the channel didn't set a charset and look at @charset at that
> point.

Is there any way to get the <script> element, and consequently the
@charset attribute, from the _channel_ object? The code tracking HTTP
responses currently has no clue about the UI. It starts in the
http-on-examine-response event. This could actually be useful for
other things too, but I doubt it's possible.

> Right now it does the latter. I suppose we could switch to
> doing the former; that might help you if you're relying solely on the
> channel to get the encoding information.

Yes, this would make more sense.
I have filed a bug for this.
https://bugzilla.mozilla.org/show_bug.cgi?id=536529
Of course the charset from the response Content-Type header (if any)
can take higher priority.

> Of course you'll still have to do the BOM and document character set
> stuff yourself. There really has to be a better way to do this.
> Perhaps we should have a way for a debugger to be notified with the
> Unicode script text instead of having to sniff channels for data? Or
> something?

Where can I get more info about how to do the BOM stuff? Should I also
report a bug for this? Not sure what options we have here...

Honza

Boris Zbarsky

Dec 23, 2009, 10:36:59 AM

On 12/23/09 4:43 AM, Jan Odvarko wrote:
> Firebug currently uses nsITraceableChannel to register a stream tee
> listener and get an HTTP response from the tee using pipe. All
> responses are cached (to avoid any refetch from the server) and used
> later by the Script panel (and e.g. by the Net panel to show
> responses).

nsITraceableChannel gives you bytes. How do you convert the bytes to
characters?

-Boris

Boris Zbarsky

Dec 23, 2009, 10:39:50 AM

On 12/23/09 5:13 AM, Jan Odvarko wrote:
> Is there any way to get the <script> element, and consequently the
> @charset attribute, from the _channel_ object?

No.

> I have filed a bug for this.
> https://bugzilla.mozilla.org/show_bug.cgi?id=536529

And I just marked it invalid per my later post in this thread....

>> Of course you'll still have to do the BOM and document character set
>> stuff yourself. There really has to be a better way to do this.
>> Perhaps we should have a way for a debugger to be notified with the
>> Unicode script text instead of having to sniff channels for data? Or
>> something?
> Where can I get more info about how to do the BOM stuff? Should I also
> report a bug for this? Not sure what options we have here...

Didn't I just mention an option above? The end of the paragraph you quoted.

-Boris