On Fri, 16 Oct 2020, Stan Brown wrote:
> > 4.2.5.4 Specifying the document's character encoding
> ...
> >
> > The Encoding standard requires use of the UTF-8 character encoding and
> > requires use of the "utf-8" encoding label to identify it.
> ...
> > To enforce the above rules, authoring tools must default to using UTF-8
> > for newly-created documents.
This *did* surprise me. I had thought that "<meta charset=...>" would have a
meaning beyond acknowledging that one has no choice. Well, I switched to UTF-8
before I switched to HTML5, so I did not notice this as a problem. After all,
UTF-8 has existed for more than 25 years now. And my native tongue requires many
more non-ASCII characters than English does, so there was more to change.
> Well, heck! It seems unfortunate that they would retroactively change
> the HTML 4.01 standard, which I am 100% certain allowed other
> charsets for quite a few years.
You should notice that
<meta http-equiv="Content-Type" content="text/html; charset=any-code">
(HTML before HTML5 as well)
and
<meta charset="utf-8"> (HTML5 only, only utf-8 allowed)
have different meanings. <meta http-equiv> is a hint to the web server to
declare the content type and encoding via the HTTP protocol. This hint is
actually ignored by the web server you use; I see only "Content-Type:
text/html" appearing¹). Many browsers also interpret it as if it were a
declaration of the encoding used in the document – this is why it works and
will probably keep working as long as HTML4 documents exist and are interpreted
by browsers. But strictly speaking, that browser behaviour is not anything
well-defined in HTML4. – <meta charset>, on the other hand, really is a
declaration of the encoding used in the document, albeit a meaningless one as
there is no choice.
¹) The full answer of the web server to the browser's request for
https://brownmath.com/Charsets/charset_utf-8_html4.htm was:
HTTP/1.1 200 OK
Server: nginx
Date: Fri, 16 Oct 2020 16:12:13 GMT
Content-Type: text/html
Content-Length: 798
Connection: keep-alive
Last-Modified: Fri, 16 Oct 2020 13:43:53 GMT
ETag: "31e-5b1c9f48d5840"
alt-svc: quic=":443"; ma=86400; v="43,39"
Host-Header: 5d77dd967d63c3104bced1db0cace49c
X-Proxy-Cache: MISS
Accept-Ranges: bytes
So you need not be in a hurry to change anything, but you should have a plan
for the future. You can even validate your non-UTF-8 HTML files:
* Declare them as HTML4; otherwise the validator will complain that only UTF-8 is allowed.
* Before starting the validator, check "More Options" and fill in the correct encoding.
I tried it out with
https://brownmath.com/Charsets/charset_utf-8_html4.htm, and it worked.
I consider the behaviour of the validator extremely user-unfriendly. When people
follow habits that were not only tolerated but even recommended in the past, it
could give a hint that – and why – they are no longer supported, and what to do
instead.
> It seems like my only options are to completely redesign how I
> produce Web pages, or to declare utf-8, but only use characters 000-
> 127 and use numeric references for everything >=160, which will bloat
> my documents.
I am not sure it requires a complete redesign. When I changed to UTF-8, I only
had to tell my editor that it should encode in UTF-8 instead of ISO-8859-1.
Well, I work on a Unix system, and my editor is emacs, which has such an
option. Windows has the problem that it sometimes changes the encoding without
any notice to the user. When I do have to use Windows, I use Notepad++, which
also has an option to control the encoding to be used. (People always working
on Windows will perhaps have better recommendations; I just needed *anything*
capable of reliably producing UTF-8 output.)
For recoding the existing web pages, I had a little script.
I warn you against building up a legacy workplace consisting of more and more
legacy work-arounds. It is less work to switch to UTF-8, but there is no need
to do it all in one night.
--
Helmut Richter