17.5.2016, 5:20, JJ wrote:
> On Mon, 16 May 2016 19:10:07 +0000 (UTC),
> hel...@asclothestro.multivax.de (Phillip Helbig) wrote:
>> HTML 5 requires documents which claim to be ISO-8859-1 to be parsed as
>> Windows-1252, apparently because some websites claiming to be ISO-8859-1
>> are actually Windows-1252.
>
> In HTML5 W3C Recommendation 28 October 2014, I don't find anything that
> mentions it.
This part of “HTML5” (a very vague concept) was moved out of the W3C
specification for HTML5, into the “Encoding” document, which is cited in
that spec and has the normative status of Candidate Recommendation. But
normative statuses have lost importance in the “HTML5” world.
The “Encoding” document lists several encoding names that shall be
interpreted as windows-1252, including iso-8859-1:
https://www.w3.org/TR/encoding/#legacy-single-byte-encodings
>> (Why in the world? More web sites advertise
>> themselves as ISO-8859-1 than as Windows-1252 and even if that were not
>> the case, standards shouldn't standardize wrong behaviour.)
>
> I agree. It made them lower than standard.
The reason is simple enough: There is no use for C1 Controls (code points
from 80 to 9F in hexadecimal) in HTML documents, and it is virtually
certain that if an HTML document declared to be ISO-8859-1 encoded
contains such code points, then the document is in fact windows-1252
encoded and the code points should be interpreted as graphic characters
according to windows-1252. Well, in rare cases it might be a matter of
data error, but even then, windows-1252 interpretation makes more sense
than pretending that we are interpreting the data as iso-8859-1.
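To illustrate the difference, here is a minimal Python sketch. It assumes
Python's built-in "latin-1" and "cp1252" codecs as stand-ins for iso-8859-1
and windows-1252, and the sample bytes are made up for the illustration.
Python's cp1252 leaves five bytes (81, 8D, 8F, 90, 9D) undefined, whereas
the “Encoding” document maps them to the corresponding C1 Controls, so the
sample avoids those bytes.

```python
# Made-up sample: bytes as they might appear in a page that is declared
# as ISO-8859-1 but was actually saved as windows-1252.
raw = b"\x93caf\xe9\x94 \x96 5\x80"

as_latin1 = raw.decode("latin-1")  # literal ISO-8859-1 reading
as_cp1252 = raw.decode("cp1252")   # windows-1252 reading


def count_c1(text):
    """Count C1 Controls (U+0080..U+009F) in a decoded string."""
    return sum(1 for ch in text if 0x80 <= ord(ch) <= 0x9F)


# ascii() shows non-ASCII characters as escapes, so the output does not
# depend on the terminal's own encoding.
print(ascii(as_latin1), count_c1(as_latin1))
print(ascii(as_cp1252), count_c1(as_cp1252))
```

The iso-8859-1 reading turns bytes 93, 94, 96 and 80 into invisible
control characters; the windows-1252 reading turns them into curly quotes,
an en dash and a euro sign, which is almost certainly what the author of
such a page meant.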
>> What does the standard say about a website advertising itself as
>> ISO-8859-15? Should it be parsed as ISO-8859-15?
>
> ISO-8859-15 itself is more like a small universal character set that covers
> several languages. It's not meant for a specific language or specific
> language group which most character sets are for - including ISO-8859 own
> character sets.
I think you have misunderstood what ISO-8859-15 was meant for. It was
introduced in order to include the euro sign, “€”; the extra letters
that were added due to the needs of Finnish and French were much less
important. ISO-8859-15 soon became obsolete, or was born obsolete, since
anyone who needs “€” in text can use windows-1252 or utf-8.
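A small Python sketch of where “€” lives in each of these encodings; the
codec names are Python's own, used here as an assumption that they match
the encodings under discussion.

```python
EURO = "\u20ac"  # the euro sign

# Show the byte sequence for the euro sign in each encoding that has one.
for codec in ("iso-8859-15", "cp1252", "utf-8"):
    print(codec, EURO.encode(codec).hex())

# ISO-8859-1 has no euro sign at all, so encoding fails there.
try:
    EURO.encode("latin-1")
    latin1_has_euro = True
except UnicodeEncodeError:
    latin1_has_euro = False

print("latin-1 has euro:", latin1_has_euro)
```

The euro sign occupies a different byte in iso-8859-15 than in
windows-1252, so the two are not interchangeable even for this one
character.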
The “Encoding” document lists iso-8859-15 as a separate encoding, with a
few names in addition to the reference name. It is included as one of
the set of encodings for which support is required:
https://www.w3.org/TR/encoding/#names-and-labels
So the answer is that a document advertised as ISO-8859-15 shall be
parsed as ISO-8859-15.
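The label lookup can be sketched in Python roughly as follows. The table is
only a short excerpt of the labels in the “Encoding” document, and the
function name resolve_label is my own invention for the illustration, not
anything defined by the spec. Matching is done after trimming whitespace
and ignoring case, which is how I read the spec's lookup rule.

```python
# Excerpt only; the full label list is in the "Encoding" document.
LABELS = {
    "iso-8859-1": "windows-1252",
    "latin1": "windows-1252",
    "l1": "windows-1252",
    "us-ascii": "windows-1252",
    "windows-1252": "windows-1252",
    "iso-8859-15": "ISO-8859-15",
    "l9": "ISO-8859-15",
}


def resolve_label(label):
    """Hypothetical helper: map a declared label to an encoding name.

    Returns None when the label is not in the (partial) table.
    """
    key = label.strip("\t\n\f\r ").lower()
    return LABELS.get(key)


print(resolve_label("ISO-8859-1"))
print(resolve_label(" iso-8859-15 "))
print(resolve_label("no-such-label"))
```

So a declaration of ISO-8859-1 ends up as windows-1252, while ISO-8859-15
stays what it claims to be.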
> Considering that HTML5 mentions only the suggested character sets
> based on a list of languages
What you are referring to is a description of how the character encoding
is determined when it is not declared properly. That is, “guessing” or
“sniffing” the encoding—based on the language of the user’s environment!
The “sniffing” algorithm never results in using iso-8859-15 simply
because it is not commonly used anywhere.
--
Yucca,
http://www.cs.tut.fi/~jkorpela/