2014-12-10, 10:05, Helmut Richter wrote:
> On 10.12.2014 05:21, Jukka K. Korpela wrote:
[...]
>> And even this isn’t really relevant most of the time. If you need
>> characters that appear in Windows-1252 but not in ISO-8859-1, you *can*
>> use the Windows-1252 encoding (and declare it Windows-1252 or
>> ISO-8859-1)
>
> IMHO in this case you *should* declare it Windows-1252 for the simple
> reason that it is true, instead of relying on the wrong declaration
> will work all the same. If you will: for documentation purposes.

This is a moot point. What HTML5 says is really just one opinion. And if
I intentionally use only ISO-8859-1 characters in my HTML document,
encoded in ISO-8859-1, it *is* ISO-8859-1 for all intents and purposes.
If it accidentally contains bytes in the C1 Controls range (0x80 to
0x9F), then browsers in fact treat them according to Windows-1252. Fact
of life. But this does not mean that it would be wrong to declare an
ISO-8859-1 datastream as ISO-8859-1 encoded, any more than it is wrong
to declare a US-ASCII datastream as US-ASCII encoded, even though we
know that browsers will treat bytes with the high bit set according to
Windows-1252 if they find them in data declared to be US-ASCII.

Specifically for documentation purposes, it is meaningful to declare
ISO-8859-1 if that is what you intend to use.
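
To make the difference concrete, here is a minimal Python sketch (my
illustration, not anything a browser actually runs). The helper name
browser_like_decode and its set of labels are assumptions made up for
this example; real browsers follow a much longer label table.

```python
# One byte from the C1 range 0x80-0x9F, decoded two ways.
raw = bytes([0x96])

as_latin1 = raw.decode("iso-8859-1")    # U+0096, a C1 control character
as_cp1252 = raw.decode("windows-1252")  # U+2013, EN DASH


def browser_like_decode(data: bytes, declared: str) -> str:
    """Decode roughly the way browsers do: ISO-8859-1 and US-ASCII
    declarations are treated as Windows-1252 (hypothetical helper)."""
    label = declared.strip().lower()
    if label in {"iso-8859-1", "latin1", "us-ascii", "ascii"}:
        label = "windows-1252"
    # Python's cp1252 codec leaves a few bytes (e.g. 0x81) undefined,
    # unlike browsers, so undecodable bytes are replaced, not raised.
    return data.decode(label, errors="replace")


print(repr(as_latin1), repr(as_cp1252))
print(browser_like_decode(b"caf\xe9 \x96", "ISO-8859-1"))
```

A document that really contains only ISO-8859-1 characters decodes
identically both ways; the two interpretations differ only for bytes in
the C1 range.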

> While we are talking about confusion: I find there are two things that
> are *really* confusing until you get used to them:
>
> 1. For an HTML page served by a web server (as distinct from a local
> file which the browser finds without involving a web server), there can
> be up to *two* code declarations: one by the web server using the HTTP
> protocol and one in the HTML text, and the *former* takes precedence.

Well, it can be confusing indeed. And it is a *real* problem, unlike
some attempts at confusing us in matters where no confusion otherwise
exists. More widespread use of UTF-8, when carried out the wrong way,
has made the problem worse. For example, many web servers declare UTF-8
in HTTP headers, no matter what authors say. This means that authors who
need to use ISO-8859-1 or Windows-1252, for some reason, are in trouble.
So are authors who could well use UTF-8 but don’t know about the issue
and don’t realize what the server is doing.

There is further confusion, also real, caused by the newer idea
(specified in HTML5, implemented by many browsers but not all) that the
presence of a BOM, Byte Order Mark, implies UTF-8, overriding even HTTP
headers. Of course here “BOM” means “three bytes that constitute the BOM
if interpreted according to UTF-8”, that is, the bytes EF BB BF. This
can be useful at times, but it can also mess things up.
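
The order of precedence described above can be sketched in Python as
follows. This is a simplified illustration, not the HTML5 algorithm
itself: the function name effective_encoding is made up, only the UTF-8
BOM is checked, and the final Windows-1252 fallback is my assumption
(real browsers use locale-dependent defaults and further sniffing).

```python
import codecs
import re
from typing import Optional


def effective_encoding(body: bytes,
                       content_type: Optional[str],
                       meta_charset: Optional[str]) -> str:
    """Return the encoding a BOM-aware browser would roughly pick."""
    # 1. A UTF-8 BOM (EF BB BF) wins, even over the HTTP header.
    if body.startswith(codecs.BOM_UTF8):
        return "utf-8"
    # 2. Otherwise the charset parameter of the HTTP Content-Type header.
    if content_type:
        match = re.search(r'charset\s*=\s*"?([^";\s]+)',
                          content_type, re.IGNORECASE)
        if match:
            return match.group(1).lower()
    # 3. Otherwise the declaration inside the HTML document.
    if meta_charset:
        return meta_charset.strip().lower()
    # 4. Assumed fallback for this sketch only.
    return "windows-1252"


# The server says UTF-8, the author says ISO-8859-1: the server wins.
print(effective_encoding(b"<p>", "text/html; charset=UTF-8", "iso-8859-1"))
```

The last line shows the trouble case: the author's own declaration is
silently ignored because the server has already declared UTF-8.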

> 2. In the standards documents you find the term "document character set"
> with the explanation that it is Unicode.

It used to be confusing, but I think it’s water under the bridge now.
It’s what HTML specifications used to say when HTML was nominally
SGML-based, though never actually implemented that way. The statement
used the term “document character set” in the SGML sense, which has a
special meaning: it defines how numbers in character references like
&#123; are to be interpreted.

In XHTML and in HTML5, such a concept is not used, since they have
broken the connection with SGML. Instead, they define directly how those
references are interpreted.

> I have still to find a context where an ordinary web author has to think
> about the document character set of his document.

They don’t need the term, but they need the information on how character
references are interpreted. They may need to know that &#200; is
interpreted as referring to the character with Unicode code point 200
(decimal), quite independently of what byte 200 (decimal) might mean in
the character encoding of the document.
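
A small Python sketch of this distinction, using html.unescape from the
standard library to resolve the reference. The choice of Windows-1251 as
the contrasting encoding is just my example.

```python
import html

# The character reference always denotes Unicode code point 200.
from_reference = html.unescape("&#200;")

# The byte 200, by contrast, depends on the document's encoding.
byte_200 = bytes([200])
in_latin1 = byte_200.decode("iso-8859-1")  # LATIN CAPITAL LETTER E WITH GRAVE
in_cp1251 = byte_200.decode("cp1251")      # CYRILLIC CAPITAL LETTER I

print(f"U+{ord(from_reference):04X}")  # code point of the reference
print(f"U+{ord(in_latin1):04X}")       # byte 200 read as ISO-8859-1
print(f"U+{ord(in_cp1251):04X}")       # byte 200 read as Windows-1251
```

In a document encoded in Windows-1251, the reference still yields
U+00C8 even though the raw byte 200 would be a Cyrillic letter.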

(Since many people have not known this and have used character
references like &#150;, intending them to be interpreted so that the
number is the Windows-1252 code, browsers have adapted to this, and in
HTML5, even the specification was written to accommodate this mess. This
means that legacy code containing such constructs need not be corrected
in this respect, even though &#150; was technically undefined in HTML 4.01.)
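
Python's html.unescape follows the HTML5 rules here, so it can serve as
a quick check of the accommodation described above. This is only an
illustration of the behavior, not of how any browser implements it.

```python
import html

# Taken literally, 150 is the code point of a C1 control character.
literal = chr(150)

# Under HTML5 rules, &#150; is remapped through Windows-1252 instead.
legacy = html.unescape("&#150;")

# The reference that was always well defined for the same character.
proper = html.unescape("&#8211;")

print(f"U+{ord(literal):04X}", f"U+{ord(legacy):04X}", f"U+{ord(proper):04X}")
```

Both references end up as the same character, which is why legacy
markup using &#150; keeps working.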
--
Yucca,
http://www.cs.tut.fi/~jkorpela/