Helmut Richter wrote:
> On 18.12.2016 at 16:36, Jukka K. Korpela wrote:
>> 18.12.2016, 15:16, Stan Brown wrote:
>>> I've been validating some pages at validator.w3.org. It had a problem
>>> with … for the ellipsis character, and I can understand that
>>> since that's Windows character set, not iso-8859-1.
>>>
>>> So I changed them all to ….
>> That was a correct move in principle
>
> And demanded by W3C papers such as
> https://www.w3.org/TR/WD-html40-970708/charset.html.
That is not a “W3C paper” but a W3C Working Draft (WD), and for that reason
it is irrelevant as evidence, and it is a fallacy to cite it as a “demand”:
,-<https://www.w3.org/TR/WD-html40-970708/cover.html>
|
| Status of this document
|
| This is a W3C Working Draft for review by W3C members and other interested
| parties. It is a draft document and may be updated, replaced or obsoleted
| by other documents at any time. It is inappropriate to use W3C Working
|                                 ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
| Drafts as reference material or to cite them as other than "work in
| ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
| progress". This is work in progress and does not imply endorsement by, or
| ^^^^^^^^^
| the consensus of, either W3C or members of the HTML working group.
As it is and has always been. (IIRC the two of us have discussed this
before. Is there a bug in the Matrix?)
The “other document” that replaced it is
<https://www.w3.org/TR/1999/REC-html401-19991224/>.
The URI of the corresponding section is
<https://www.w3.org/TR/1999/REC-html401-19991224/charset.html>.
And those are W3C _Recommendations_, not “demands”.
> Long before Unicode was generally used by many people, it was defined to
> be the document character set of HTML, irrespective of which encoding was
> actually used in a file with HTML text.
No, the document character set of HTML before version 5 is the _Universal
(Coded) Character Set_ (UCS; ISO/IEC 10646) which is only said in HTML 4.01
(REC) to be “character-by-character equivalent to Unicode”. And that is
only referring to *the Unicode version at the time* (1999 CE). In the case
of HTML 4.01, that was Unicode 3.0. [1]
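The difference between the document character set and the encoding of a file can be illustrated with a minimal Python sketch. It assumes that the character under discussion is the horizontal ellipsis (U+2026) and that the encodings are windows-1252 and ISO-8859-1, as in the quoted text. The numeric character reference it builds is my own example, not necessarily the one used on the original pages.

```python
import html

ELLIPSIS = "\u2026"  # HORIZONTAL ELLIPSIS

# The code point in the document character set does not depend on the file.
code_point = ord(ELLIPSIS)  # 0x2026, decimal 8230

# The byte value, by contrast, depends on the encoding of the file.
cp1252_bytes = ELLIPSIS.encode("cp1252")
utf8_bytes = ELLIPSIS.encode("utf-8")

# ISO-8859-1 has no ellipsis at all, so encoding must fail there.
try:
    ELLIPSIS.encode("iso-8859-1")
    latin1_ok = True
except UnicodeEncodeError:
    latin1_ok = False

# A numeric character reference names the code point, not a byte value.
ncr = "&#{};".format(code_point)
decoded = html.unescape(ncr)

print(hex(code_point), cp1252_bytes, utf8_bytes, latin1_ok, ncr, decoded == ELLIPSIS)
```

The reference built from the code point decodes to the same character regardless of which of these encodings the file itself uses.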
However, as you can see in the changelog of Unicode 4.0 [2], among other
things it added support for Linear B. By coincidence, I recently used the
first character of the Linear B Unicode range to demonstrate a problem with
characters beyond the Basic Multilingual Plane (BMP) of Unicode (in MySQL).
Therefore I know by heart that the code point of that character is U+10000,
the first code point beyond the BMP (U+0000 to U+FFFF). Linear B arrived
only with Unicode 4.0, and the first characters beyond the BMP were assigned
only with Unicode 3.1; Unicode 3.0 and earlier versions assigned no
characters beyond the BMP.
Therefore HTML 4.01, which refers to Unicode 3.0 as an equivalent to the
Universal Character Set, is only specified to support characters within the
BMP.
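The position of U+10000 relative to the BMP can be checked with a short Python sketch. It only inspects the Unicode data bundled with the Python 3 interpreter that runs it (an assumption; that database is much newer than Unicode 4.0), so it says nothing about what HTML 4.01 specifies.

```python
import unicodedata

BMP_LAST = 0xFFFF      # last code point of the Basic Multilingual Plane
ch = chr(0x10000)      # first code point beyond the BMP

is_beyond_bmp = ord(ch) > BMP_LAST
name = unicodedata.name(ch)

# In UTF-16, a character beyond the BMP needs a surrogate pair.
utf16 = ch.encode("utf-16-be")
high = int.from_bytes(utf16[:2], "big")
low = int.from_bytes(utf16[2:], "big")

# In UTF-8, it needs four bytes.
utf8 = ch.encode("utf-8")

print(name, is_beyond_bmp, hex(high), hex(low), len(utf8))
```

Software that handles at most two bytes per UTF-16 unit or three bytes per UTF-8 character without further provisions will trip over exactly this character.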
This has changed with HTML5, which now uses “the Unicode character set” as
the one “used to represent textual data”, by which the most recent version
of Unicode is meant (as the version number was omitted from the reference):
<https://www.w3.org/TR/2014/REC-html5-20141028/infrastructure.html#dependencies>
PointedEars
___________
[1] <https://www.w3.org/TR/1999/REC-html401-19991224/references.html#ref-UNICODE>
[2] <http://www.unicode.org/versions/Unicode4.0.0/>
--
Anyone who slaps a 'this page is best viewed with Browser X' label on
a Web page appears to be yearning for the bad old days, before the Web,
when you had very little chance of reading a document written on another
computer, another word processor, or another network. -- Tim Berners-Lee