I want IE6 to handle ISO-8859-1 characters in an ajax response. (I
would of course prefer to use all UTF-8, but that is not possible in
this case.)
Details:
- Server is correctly sending content-type and charset as ISO-8859-1
for all responses
- Page is correctly rendered in IE6 in ISO-8859-1
- Request is made to server via Microsoft.XMLHTTP with:
xmlhttp.setRequestHeader("Content-Type", "application/x-www-form-urlencoded;charset=ISO-8859-1");
- I even tried setting the charset attribute in the <script> tag
I am simply trying to alert() the responseText or populate a DIV with
the response, but it will not correctly interpret the content returned
by the server.
In other browsers, the characters display just fine. Using the
"DebugBar" IE plugin to view the response from the server, it shows
correctly. But once it hits javascript, I lose the encoding.
How can I get ISO-8859-1-encoded content via ajax in IE6, and have it
handled correctly?
Must I manually encode it on the server and decode it in the browser?
Thanks!
Matt Kruse
Do you have a URL to check?
> - Page is correctly rendered in IE6 in ISO-8859-1
> - Request is made to server via Microsoft.XMLHTTP with:
> xmlhttp.setRequestHeader("Content-Type", "application/x-www-form-urlencoded;charset=ISO-8859-1");
The Content-Type request header is meaningful for the request body in
case you make a HTTP POST request (i.e. pass in something as the
argument of the send method of XMLHTTP). It does not have any meaning
for the response body you receive.
> - I even tried setting the charset attribute in the <script> tag
The charset attribute of the script element indicates the encoding of an
external script file referenced in the src attribute. It too does not
have any meaning for the response body you receive with XMLHTTP.
> I am simply trying to alert() the responseText or populate a DIV with
> the response, but it will not correctly interpret the content returned
> by the server.
>
> In other browsers, the characters display just fine. Using the
> "DebugBar" IE plugin to view the response from the server, it shows
> correctly. But once it hits javascript, I lose the encoding.
>
> How can I get ISO-8859-1-encoded content via ajax in IE6, and have it
> handled correctly?
I don't have access to an IE 6 at the moment but a script in IE 8 using
new ActiveXObject('Microsoft.XMLHTTP') properly decodes an ISO-8859-1
encoded response body.
--
Martin Honnen
http://msmvps.com/blogs/martin_honnen/
I have overridden the encoding of an AJAX response by using the
XMLHttpRequest object in standards-compliant browsers.
I don’t recall the method right now though...
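One method that fits this description is overrideMimeType(), which native XMLHttpRequest implementations provide but the Microsoft.XMLHTTP ActiveX object does not. Whether it is the method meant above is an assumption. A minimal sketch, where fetchWithCharset and its parameters are hypothetical names:

```javascript
// Hypothetical helper: ask the XHR object to decode the response body with
// the given charset, regardless of what the server's Content-Type says.
function fetchWithCharset(xhr, url, charset, onDone) {
  xhr.open('GET', url, true);
  // overrideMimeType must be called before send(). It is missing from the
  // IE ActiveX objects, so feature-test before calling it.
  if (typeof xhr.overrideMimeType === 'function') {
    xhr.overrideMimeType('text/plain; charset=' + charset);
  }
  xhr.onreadystatechange = function () {
    if (xhr.readyState === 4) {
      onDone(xhr.responseText);
    }
  };
  xhr.send(null);
}
```

It would be called with something like fetchWithCharset(new XMLHttpRequest(), url, 'windows-1252', callback). In IE6 the feature test fails and the response is decoded as the server declared it.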
I made one that demonstrates the problem:
http://www.javascripttoolbox.com/temp/charset3.php
PHP Source: http://www.javascripttoolbox.com/temp/charset3.txt
Works in FF, fails in IE6
> I don't have access to an IE 6 at the moment but a script in IE 8 using
> new ActiveXObject('Microsoft.XMLHTTP') properly decodes an ISO-8859-1
> encoded response body.
I haven't been able to test the url above in IE8 yet. Will do so when
I get the chance.
Matt Kruse
It doesn't work in IE8 either, but...
...what exactly are you testing here? Your test $string contains the
characters 0x92 0x94 0x95 0x96 0x97 0x98 0x99 0x9a 0x9b - all of these
are defined in ISO-8859-1 as obscure control characters like "Private
Use 2", "Cancel Character", etc. Firefox, Opera and others probably
replace these code points with characters from a different character
set, most likely Windows-1252.
If you limit yourself to less unusual code points like 0xE4 (ä) in your
test string, the Ajax response will display correctly in both IE6 and IE8.
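The difference between the two readings of the same bytes can be sketched in a few lines. This is only an illustration of the code tables, not what any browser runs internally; decodeLatin1, decodeWindows1252 and CP1252_HIGH are hypothetical names. Positions that Windows-1252 leaves undefined are passed through as C1 controls here, which is an assumption.

```javascript
// Windows-1252 code points for bytes 0x80..0x9F. The five positions that
// Windows-1252 leaves undefined (0x81, 0x8D, 0x8F, 0x90, 0x9D) are mapped
// to the C1 control with the same number.
var CP1252_HIGH = [
  0x20AC, 0x0081, 0x201A, 0x0192, 0x201E, 0x2026, 0x2020, 0x2021,
  0x02C6, 0x2030, 0x0160, 0x2039, 0x0152, 0x008D, 0x017D, 0x008F,
  0x0090, 0x2018, 0x2019, 0x201C, 0x201D, 0x2022, 0x2013, 0x2014,
  0x02DC, 0x2122, 0x0161, 0x203A, 0x0153, 0x009D, 0x017E, 0x0178
];

// Strict ISO-8859-1: every byte maps to the code point with the same value,
// so 0x80..0x9F come out as invisible C1 control characters.
function decodeLatin1(bytes) {
  var out = '';
  for (var i = 0; i < bytes.length; i++) {
    out += String.fromCharCode(bytes[i]);
  }
  return out;
}

// Windows-1252: identical to ISO-8859-1 except for 0x80..0x9F.
function decodeWindows1252(bytes) {
  var out = '';
  for (var i = 0; i < bytes.length; i++) {
    var b = bytes[i];
    var inHighRange = b >= 0x80 && b <= 0x9F;
    out += String.fromCharCode(inHighRange ? CP1252_HIGH[b - 0x80] : b);
  }
  return out;
}
```

Byte 0x92 becomes U+2019 (right single quotation mark) under decodeWindows1252 but the control character U+0092 under decodeLatin1, while 0xE4 is ä in both.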
Anyway, the XMLHttpRequest object was always a bit problematic with
encodings other than UTF-8. For example, (IIRC) the send() method will
always encode its data payload as UTF-8, regardless of the document
encoding.
By the way, your example page also fails in FF 3.0, which requires an
argument for xhr.send(): 'uncaught exception: [Exception... "Not enough
arguments [nsIXMLHttpRequest.send]"'
--
stefan
>I made one that demonstrates the problem:
>http://www.javascripttoolbox.com/temp/charset3.php
>PHP Source: http://www.javascripttoolbox.com/temp/charset3.txt
Error also occurs in IE7.
I copied/pasted into an editor and tried to save it, but I get this
error (editor is Geany, on Linux):
An error occurred while converting the file from UTF-8 in "ISO-8859-1".
The file remains unsaved.
Error message: Invalid byte sequence in conversion input
The error occurred at "'" (line: 3, column: 15).
^ actual character is a smart quote
I don't get that problem trying to save as windows-1252, so I guess you
just learnt that ISO-8859-1 != windows-1252. Changing your Content-type
character encoding to windows-1252 fixes the problem in IE.
This is actually something I see quite a lot. Many websites incorrectly
identify their encoding as ISO-8859-1 when they are actually
windows-1252. Because I'm browsing from Firefox in Linux, which
"believes" what it is told about the character encoding, I often see
these character encoding problems as inverse-colour question marks
(whereas IE on Windows seems to sniff the data and correct the
identified encoding). The sooner everyone dumps both encodings in favour
of UTF-8, the better.
--
Ross McKay, Toronto, NSW Australia
"Let the laddie play wi the knife - he'll learn"
- The Wee Book of Calvin
Wow, this is great to know. I did some reading and learned about the
ISO-8859-1/Windows-1252 confusion, and that definitely seems to apply
here.
When I switch my PHP header to deliver a windows-1252 charset, IE6
shows it correctly! Fantastic!
So, is IE6 just so ancient that it does not do the automatic
conversion from iso-8859-1 to assuming it's windows-1252?
> Anyway, the XMLHttpRequest object was always a bit problematic with
> encodings other than UTF-8. For example, (IIRC) the send() method will
> always encode its data payload as UTF-8, regardless of the document
> encoding.
I am looking into this as the next step. I submit a lot of forms via
ajax, and I need to know how to correctly force the encoding to be
correct or at least handle the utf-8 on the server side, if that's all
IE6 will ever send.
Thanks so much for your post about this, it definitely led me in the
right direction!
Matt
You are absolutely correct. Thank you very much for this bit of
wisdom.
> This is actually something I see quite a lot. Many websites incorrectly
> identify their encoding as ISO-8859-1 when they are actually
> windows-1252.
It seems that newer standards actually demand that browsers treat
ISO-8859-1 as windows-1252.
> The sooner everyone dumps both encodings in favour of UTF-8, the better.
Agreed!
Matt
> Ross McKay wrote:
>> This is actually something I see quite a lot. Many websites incorrectly
>> identify their encoding as ISO-8859-1 when they are actually
>> windows-1252.
Comparing the tables, it appears that ISO-8859-1 misses a few rare
signs (which would be a relatively minor issue IMO) but also the Euro-
sign (which would be a very major issue IMO). I would think that the
Euro-symbol had put ISO-8859-1 in an untenable position. Both tables
also hold various representations for the control characters, which
probably explains your test results.
> It seems that newer standards actually demand that browsers treat
> ISO-8859-1 as windows-1252.
Yep - it looks like HTML5 will even *require* ISO-8859-1 to be parsed
as Windows-1252.
http://dev.w3.org/html5/spec/Overview.html#character-encodings-0
> So, is IE6 just so ancient that it does not do the automatic
> conversion from iso-8859-1 to assuming it's windows-1252?
That would be my conclusion too, but a bit strange though (shouldn't
IE be the first to support its own charset?); maybe a misplaced
attempt by Microsoft to follow W3C, a wrong choice during the early
UTF days, or a bug:
http://support.microsoft.com/kb/315699/en/
From http://en.wikipedia.org/wiki/ISO/IEC_8859-1 :
| It is very common to mislabel text data with the charset
| label ISO-8859-1, even though the data is really Windows-1252
| encoded. Many web browsers and e-mail clients will interpret
| ISO-8859-1 control codes as Windows-1252 characters in order
| to accommodate such mislabeling but it is not standard behaviour
| and care should be taken to avoid generating these characters
| in ISO-8859-1 labeled content.
--
Bart
> Stefan Weiss wrote:
>> ...what exactly are you testing here? Your test $string contains the
>> characters 0x92 0x94 0x95 0x96 0x97 0x98 0x99 0x9a 0x9b - all of these
>> are defined in ISO-8859-1 as obscure control characters like "Private
>> Use 2", "Cancel Character", etc. Firefox, Opera and others probably
>> replace these code points with characters from a different character
>> set, most likely Windows-1252.
>
> Wow, this is great to know. I did some reading and learned about the
> ISO-8859-1/Windows-1252 confusion, and that definitely seems to apply
> here. When I switch my PHP header to deliver a windows-1252 charset, IE6
> shows it correctly! Fantastic!
>
> So, is IE6 just so ancient that it does not do the automatic
> conversion from iso-8859-1 to assuming it's windows-1252?
No. It is much more likely that you expect the browser to do something it
is not supposed to do, and that you have not been the only person having it
backwards. Which led Microsoft to believe it would be a good idea to
accommodate sloppy developers yet again, and decreased IE 7+'s efficiency by
including the "guess-my-encoding" code.
PointedEars
--
var bugRiddenCrashPronePieceOfJunk = (
navigator.userAgent.indexOf('MSIE 5') != -1
&& navigator.userAgent.indexOf('Mac') != -1
) // Plone, register_function.js:16
> Matt Kruse wrote:
>> Ross McKay wrote:
>>> This is actually something I see quite a lot. Many websites incorrectly
>>> identify their encoding as ISO-8859-1 when they are actually
>>> windows-1252.
>
> Comparing the tables, it appears that ISO-8859-1 misses a few rare
> signs (which would be a relatively minor issue IMO) but also the Euro-
> sign (which would be a very major issue IMO). I would think that the
> Euro-symbol had put ISO-8859-1 in an untenable position. Both tables
> also hold various representations for the control characters, which
> probably explains your test results.
ISO-8859-9 has the EURO SIGN character. It has never been a major issue for
ISO-8859-1 due to the market domination of Microsoft Windows, where in
Windows-125x code point 0x80 has been used. (IIUC, this is also the cause
of a Usenet jargon to write a `?' or `FRZ' [short for "Fragezeichen" =
"question mark"] when the EURO SIGN character is meant, at least in de.ALL,
since you could not type that character in Windows [with the AltGr+E
shortcut] so that it was displayed correctly everywhere. In particular, a
lot of people sent 0x80 on Windows' Outlook Express but declared
ISO-8859-1. It looked OK to them but not to those with standards-compliant
newsreaders, since there is no character at 0x80 in ISO-8859-1.)
>> It seems that newer standards actually demand that browsers treat
>> ISO-8859-1 as windows-1252.
>
> Yep - it looks like HTML5 will even *require* ISO-8859-1 to be parsed
> as Windows-1252.
> http://dev.w3.org/html5/spec/Overview.html#character-encodings-0
Thankfully it is only a working draft, since they must have gone completely
nuts.
PointedEars
--
Anyone who slaps a 'this page is best viewed with Browser X' label on
a Web page appears to be yearning for the bad old days, before the Web,
when you had very little chance of reading a document written on another
computer, another word processor, or another network. -- Tim Berners-Lee
> ISO-8859-9 has the EURO SIGN character.
ISO-8859-*15*.
PointedEars
--
realism: HTML 4.01 Strict
evangelism: XHTML 1.0 Strict
madness: XHTML 1.1 as application/xhtml+xml
-- Bjoern Hoehrmann
Re-reading this, I wonder why I didn't just say that your documents are
probably encoded as Windows-1252... oh well, it was late. Thankfully,
Ross McKay found a better way to explain it ;)
> Wow, this is great to know. I did some reading and learned about the
> ISO-8859-1/Windows-1252 confusion, and that definitely seems to apply
> here.
> When I switch my PHP header to deliver a windows-1252 charset, IE6
> shows it correctly! Fantastic!
>
> So, is IE6 just so ancient that it does not do the automatic
> conversion from iso-8859-1 to assuming it's windows-1252?
Not quite. IE6 has no problems displaying HTML documents encoded as
Windows-1252 but sent as Latin1. The problem only manifests with Ajax
requests (in all current IE releases). I think it's more likely that
Microsoft's implementation(s) of the XMLHttpRequest object handle
encodings differently than the browser itself. Maybe one of the many
available XHR ActiveXObject incarnations does support it, but I kind of
doubt it (only tested a few of them). It's funny when you consider that
Microsoft invented that object... sometimes the knockoffs are superior
to the original.
IMO, it actually makes a lot of sense to always interpret ISO-8859-1 as
Windows-1252 in text documents. I've never seen a text format which had
any use for the C0/C1 control characters, but I've seen a _lot_ of
Windows-1252 files sent as ISO-8859-1. If the HTML5 draft makes this
"error correction" official (which was news to me, by the way - thanks),
I have no complaints about that.
One thing to watch out for is that this auto-conversion (or alias, or
whatever) from ISO-8859-1 to Windows-1252 does _not_ apply to the
ISO-8859-15 character set (a.k.a. Latin-9). This character set is almost
identical to Latin-1; it only introduces the Euro sign and adds a few
other characters which were "forgotten" in Latin-1 (like œ). I live in
the Euro-zone, so I've long ago switched from Latin-1 to Latin-9 when
UTF-8 was not an option. I've never had any problems with it, but it's
good to know that Windows-1252 documents will not be displayed correctly
when they are sent as Latin-9.
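To make the Latin-9 point concrete, here is a small sketch of a Latin-9 decoder built from the eight positions where ISO-8859-15 differs from ISO-8859-1. LATIN9_DIFF and decodeLatin9 are hypothetical names used only for illustration.

```javascript
// The eight byte values where ISO-8859-15 (Latin-9) differs from
// ISO-8859-1, mapped to their Unicode code points.
var LATIN9_DIFF = {
  0xA4: 0x20AC, // EURO SIGN (replaces the generic currency sign)
  0xA6: 0x0160, // capital S with caron
  0xA8: 0x0161, // small s with caron
  0xB4: 0x017D, // capital Z with caron
  0xB8: 0x017E, // small z with caron
  0xBC: 0x0152, // capital ligature OE
  0xBD: 0x0153, // small ligature oe
  0xBE: 0x0178  // capital Y with diaeresis
};

// Every other byte decodes exactly as in ISO-8859-1, including the
// C1 control range 0x80..0x9F.
function decodeLatin9(bytes) {
  var out = '';
  for (var i = 0; i < bytes.length; i++) {
    var b = bytes[i];
    var mapped = LATIN9_DIFF.hasOwnProperty(b) ? LATIN9_DIFF[b] : b;
    out += String.fromCharCode(mapped);
  }
  return out;
}
```

The Euro sign lives at 0xA4 in Latin-9, whereas Windows-1252 puts it at 0x80. A Windows-1252 document labelled as Latin-9 therefore has its 0x80 byte decoded as a control character rather than a Euro sign.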
>> Anyway, the XMLHttpRequest object was always a bit problematic with
>> encodings other than UTF-8. For example, (IIRC) the send() method will
>> always encode its data payload as UTF-8, regardless of the document
>> encoding.
>
> I am looking into this as the next step. I submit a lot of forms via
> ajax, and I need to know how to correctly force the encoding to be
> correct or at least handle the utf-8 on the server side, if that's all
> IE6 will ever send.
AFAIK, it's not possible to change the encoding used by send(). You can
use setRequestHeader("Content-Type", "what/ever; charset=ISO-8859-1"), and
the header will be added to the request, but that's all. Any 8-bit
characters will still be sent as UTF-8.
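For form data, the difference shows up in how a single character is percent-encoded. The sketch below contrasts the built-in encodeURIComponent, which always produces UTF-8 byte escapes, with a hypothetical helper, encodeLatin1Component, that emits ISO-8859-1 bytes instead. A body encoded this way is pure ASCII, so send() has no 8-bit characters left to convert. Whether a given server accepts it was not tested in this thread and is an assumption.

```javascript
// Hypothetical helper: percent-encode a string using ISO-8859-1 byte values.
// Unreserved ASCII characters are kept as-is; everything else up to U+00FF
// becomes a single %XX escape.
function encodeLatin1Component(text) {
  var out = '';
  for (var i = 0; i < text.length; i++) {
    var ch = text.charAt(i);
    var code = text.charCodeAt(i);
    if (/[A-Za-z0-9\-_.~]/.test(ch)) {
      out += ch;
    } else if (code <= 0xFF) {
      out += '%' + (code < 16 ? '0' : '') + code.toString(16).toUpperCase();
    } else {
      // No ISO-8859-1 byte exists for this character.
      throw new Error('Not representable in ISO-8859-1: U+' + code.toString(16));
    }
  }
  return out;
}

// The same character, two different byte sequences on the wire:
var asUtf8 = encodeURIComponent('\u00E4');      // "%C3%A4" (two bytes)
var asLatin1 = encodeLatin1Component('\u00E4'); // "%E4"    (one byte)
```

Throwing on characters above U+00FF is a design choice for the sketch; a real application would have to decide whether to substitute, drop, or reject them.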
I encountered this problem in a web application which used Latin-1
everywhere. The customer wanted enhancements for the old user interface,
with all kinds of ajaxy magick. To make it short, we couldn't find a way
to tell the XHR object to use our encoding, so the server side code had
to be patched in a few places. In addition to that, JSON doesn't allow
any non-Unicode encoding (UTF-8 is the default). (There's no technical
reason for this
restriction; I think it was more a question of politics/advocacy. While
I agree with Crockford's stance in principle, it makes using JSON with
legacy systems difficult.) Since my client declined an offer to convert
the whole application to Unicode, we were forced to use a bastardized
non-standard Latin-1 version of JSON :/
In short, using XHR and JSON in a non-Unicode environment can become a
little frustrating.
--
stefan
>[...]
>Yep - it looks like HTML5 will even *require* ISO-8859-1 to be parsed
>as Windows-1252.
>http://dev.w3.org/html5/spec/Overview.html#character-encodings-0
Very interesting! A dirty hack, but probably quite necessary given the
prevalence of borked character encoding declarations. It's essentially
codifying what some browsers are having to do now anyway, so that newer
browsers know to do it too.
Maybe that's why I haven't noticed so many character encoding problems
this year -- a quick test confirms that a document encoded with
windows-1252 but sent with headers saying ISO-8859-1 is displayed
correctly in Firefox 3.6.12, Opera 10.63, Opera Mobile 10, Google Chrome
7.0.517.44, and the Android browser. Nice.
Not that it helps Matt of course, since he needs to change the declared
encoding type of his AJAX response to correctly match the response
encoding, given that the older XMLHttpRequest object doesn't do the
switch for him.
cheers,
Ross
--
Ross McKay, Toronto, NSW Australia
"Pay no attention to that man behind the curtain" - Wizard of Oz
> Bart Van der Donck wrote:
>
>> Yep - it looks like HTML5 will even *require* ISO-8859-1 to be parsed
>> as Windows-1252.
>> http://dev.w3.org/html5/spec/Overview.html#character-encodings-0
>
> Thankfully it is only a working draft, since they must have gone completely
> nuts.
I would not agree. It simply depends on what you find more imperative.
When you think normatively and technically, then it's absolutely not
done to silently modify one encoding into another [*]. When your value
set is more practical and user-oriented, you can only conclude that
the crooked Euro-situation in those days (only look at things like the
FRZ-madness) needed to be solved urgently at a high price. It must
have been a tough choice for W3C (and it took them a lot of time). But
I would support their proposal.
[*] Good that such forced overrides are relatively rare. I remember a
similar case where MySQL silently converts a 'CHAR' to 'VARCHAR'
column (even when the developer specified 'CHAR') when both 'CHAR' and
'VARCHAR' are used in a same table.
https://groups.google.com/group/mailing.database.mysql/browse_frm/thread/af1e82e6faa85838/
--
Bart
> Thomas 'PointedEars' Lahn wrote:
>> Bart Van der Donck wrote:
>>> Yep - it looks like HTML5 will even *require* ISO-8859-1 to be parsed
>>> as Windows-1252.
>>> http://dev.w3.org/html5/spec/Overview.html#character-encodings-0
>> Thankfully it is only a working draft, since they must have gone
>> completely nuts.
>
> I would not agree. It simply depends on what you find more imperative.
> When you think normatively and technically, then it's absolutely not
> done to silently modify one encoding into another [*]. When your value
> set is more practical and user-oriented, you can only conclude that
> the crooked Euro-situation in those days (only look at things like the
> FRZ-madness) needed to be solved urgently at a high price. It must
> have been a tough choice for W3C (and it took them a lot of time). But
> I would support their proposal.
One problem of this is that it is _not_ the W3C's proposal. What you are
seeing at dev.w3.org/html5/ is not being produced originally by the HTML WG,
but by the WHATWG. HTML5 is the latter's idea; the W3C (which apparently
failed on this because of its organization) had no choice but to adopt it
eventually.
I seriously doubt the Euro sign character is worth all this trouble; most
people would write it as `&euro;' or `EUR' on the Web, just to be sure it is
displayed properly everywhere. And while I agree that many C1 control
characters (ISO-8859-1 0x80 to 0x9F) are historical and obsolete, there are
exceptions. I also cannot say I am pleased to see a standard character set
being virtually replaced by a proprietary one just because of *current*
market domination of the latter.
> [*] Good that such forced overrides are relatively rare. I remember a
> similar case where MySQL silently converts a 'CHAR' to 'VARCHAR'
> column (even when the developer specified 'CHAR') when both 'CHAR' and
> 'VARCHAR' are used in a same table.
> https://groups.google.com/group/mailing.database.mysql/browse_frm/thread/af1e82e6faa85838/
I am currently seeing this with BOOL(EAN) and TINYINT(1), having several
flags in our MySQL database. The unfortunate thing is that MySQL Workbench
even stores and exports the database schema with BOOL(EAN), but imports it
as TINYINT(1) upon reengineering.
And yet, this works:
http://javascripttoolbox.com/temp/charset4.php
If the content type is set to windows-1252, IE6 will correctly
interpret the special characters sent via ajax and display them.
So the key seems to be that when sent via ajax, IE will _not_ auto-
convert ISO-8859-1 to windows-1252.
But if the response has the charset specified as windows-1252, IE will
display it correctly.
So I just need to make sure that the proper encoding is sent from the
server for all ajax requests.
Unfortunately, now I'm faced with a different problem - not js
related. In Java, when I do
response.setCharacterEncoding("windows-1252");
the server seems to be trying to convert my already-windows-1252-
encoded content into windows-1252 (I assume because it thinks it is
unicode?), and the conversion is failing, resulting in some characters
being sent as ? question marks. So then the browser doesn't even get
to decide how to handle it at all. That sucks. I can't figure out how
to get Tomcat to play nicely.
> In short, using XHR and JSON in a non-Unicode environment can become a
> little frustrating.
So I've learned. And burned many hours in the process. From now on,
I'm demanding UTF-8 from bottom to top in all new webapps.
Matt Kruse
Well, I ended up figuring out this piece.
The only thing remaining is the sending of data via ajax. As noted, it
seems that IE will submit content as UTF-8 no matter what you do. So I
wrote a filter on the server side that sets the request character set
to be UTF-8 if it is coming in via ajax (identified with a header) and
if it is a POST request. Otherwise, a windows-1252 charset is forced.
This seems to work well.
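The decision the filter makes can be sketched as follows. The real filter runs inside the Java webapp and its source is not shown here, so this JavaScript version only illustrates the logic. The function name chooseRequestCharset and the header name X-Requested-With are assumptions; all that is stated above is that ajax requests are identified with a header.

```javascript
// Hypothetical: pick the charset used to decode an incoming request body.
// `headers` is a plain object whose keys are lower-cased header names.
function chooseRequestCharset(method, headers) {
  // Assumed marker header; the actual header used is not named in the post.
  var isAjax = headers['x-requested-with'] === 'XMLHttpRequest';
  var isPost = String(method).toUpperCase() === 'POST';
  // Ajax POST bodies arrive as UTF-8; everything else stays windows-1252.
  return (isAjax && isPost) ? 'UTF-8' : 'windows-1252';
}
```

On the client side, the matching header would have to be set with setRequestHeader before send() is called, so that the server can tell ajax posts apart from ordinary form submissions.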
It would be great to live in an all-UTF-8 world, but until then, this
seems like a compromise that works with the current environment.
Matt Kruse
FYI, I've written up a summary of observations and information about
this topic here:
http://mattkruse.com/2010/11/10/understanding-character-encoding-in-java-webapps-with-ie/
Matt Kruse