On Fri, 16 Oct 2020, Stan Brown wrote:
> > 4.2.5.4 Specifying the document's character encoding
> ...
> >
> > The Encoding standard requires use of the UTF-8 character encoding and
> > requires use of the "utf-8" encoding label to identify it.
> ...
> > To enforce the above rules, authoring tools must default to using UTF-8
> > for newly-created documents.
This *did* surprise me. I had thought that "<meta charset=...>" would have a
meaning beyond acknowledging that one has no choice. Well, I switched to UTF-8
before I switched to HTML5, so I did not notice this as a problem. After all,
UTF-8 has existed for more than 25 years now. And my native tongue requires many
more non-ASCII characters than English does, so there was more to change.
> Well, heck! It seems unfortunate that they would retroactively change
> the HTML 4.01 standard, which I am 100% certain allowed other
> charsets for quite a few years.
You should notice that
<meta http-equiv="Content-Type" content="text/html; charset=any-code">
(HTML before HTML5 as well)
and
<meta charset="utf-8"> (HTML5 only, only utf-8 allowed)
have different meanings. <meta http-equiv> is a hint to the web server to
declare the content type and encoding via the HTTP protocol. This hint is
actually ignored by the web server you use; I see only "Content-Type:
text/html" appearing¹). Many browsers also interpret it as if it were a
declaration of the encoding used in the document – this is why it works and
will probably keep working as long as HTML4 documents exist and are interpreted
by browsers. But strictly speaking, that browser behaviour is not anything
well-defined in HTML4. – <meta charset>, on the other hand, really is a
declaration of the encoding used in the document, albeit a meaningless one as
there is no choice.
¹) The full answer of the web server to the browser's request for
https://brownmath.com/Charsets/charset_utf-8_html4.htm was:
HTTP/1.1 200 OK
Server: nginx
Date: Fri, 16 Oct 2020 16:12:13 GMT
Content-Type: text/html
Content-Length: 798
Connection: keep-alive
Last-Modified: Fri, 16 Oct 2020 13:43:53 GMT
ETag: "31e-5b1c9f48d5840"
alt-svc: quic=":443"; ma=86400; v="43,39"
Host-Header: 5d77dd967d63c3104bced1db0cace49c
X-Proxy-Cache: MISS
Accept-Ranges: bytes
So you need not be in a hurry to change anything, but you should have a plan
for the future. You can even validate your non-UTF-8 HTML files:
* Declare them as HTML4; otherwise the validator will complain that only UTF-8 is allowed.
* Before starting the validator, check "More Options" and fill in the correct encoding.
I tried it out with
https://brownmath.com/Charsets/charset_utf-8_html4.htm, and it worked.
I consider the behaviour of the validator extremely user-unfriendly. When people
follow habits that were not only tolerated but even recommended in the past, it
could give a hint that – and why – they are no longer supported, and what to do
instead.
> It seems like my only options are to completely redesign how I
> produce Web pages, or to declare utf-8, but only use characters 000-
> 127 and use numeric references for everything >=160, which will bloat
> my documents.
I am not sure it requires a complete redesign. When I changed to UTF-8, I only
had to tell my editor that it should encode in UTF-8 instead of ISO-8859-1.
Well, I work on a Unix system, and my editor is emacs, which has such an
option. Windows has the problem that it sometimes changes the encoding without
any notice to the user. When I do have to use Windows, I use Notepad++, which
also has an option to control the encoding to be used. (People always working
on Windows will perhaps have better recommendations; I just needed *anything*
capable of reliably producing UTF-8 output.)
For recoding the existing web pages, I had a little script.
I warn you against building up a legacy workplace consisting of more and more
legacy work-arounds. It is less work to switch to UTF-8, but there is no need
to do it all in one night.
--
Helmut Richter