
Apparent contradiction of interpretation by W3C Validator


Robin Thornton

Dec 2, 2014, 9:19:11 AM
I should be grateful if someone could explain why one of these markups is
valid and the other is not according to the W3C HTML5 Markup Validation
Service:

<!DOCTYPE html>
<html lang="en">
<head>
<meta charset="utf-8" />
<title>Page01-Accueil</title>
<link rel="stylesheet" href="page01-style.css" type="text/css" />
</head>
<body>
<div class="header">
<h1>Saint Jean de la Blaquière</h1>
<h6>Soyez les bien venus chez le site non-officiel de la commune de
Saint Jean de la Blaquière</h6>
</div>
.
.
.
is valid - despite the inclusion of an 'è' in lines 10 and 11.

<!DOCTYPE html>
<html lang="en">
<head>
<meta charset="utf-8" />
<title>Page 2 Plan d'accès </title>
<link rel="stylesheet" href="page02-style.css" type="text/css" />
</head>
<body>
<div class="header">
<h1>Saint Jean de la Blaquière</h1>
<h4>Comment nous trouver</h4>
</div>
.
.
.
is not valid because of the inclusion of an 'è' in line 5. I replaced the
offending 'è' with '&egrave;' and resubmitted the markup. The validator then
found the 'è' on line 10 and rejected the markup as invalid.
I am completely mystified.
In view of the number of characters with accents in the French language I am
anxious to resolve this problem as early as possible.
Thank you in advance.
Robin Thornton

Thomas Mlynarczyk

Dec 2, 2014, 9:52:18 AM
Robin Thornton wrote:
My guess is that in the first example, the è was correctly UTF-8
encoded, while in the second example it was not.
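The difference shows up directly in the bytes; a quick sketch in Python (the byte values below are simply the standard UTF-8 and ISO-8859-1 encodings of 'è'):

```python
# 'è' is two bytes in UTF-8 but a single byte in ISO-8859-1 (Latin-1):
assert "è".encode("utf-8") == b"\xc3\xa8"
assert "è".encode("latin-1") == b"\xe8"

# A lone 0xE8 byte in a file declared as UTF-8 is a decoding error,
# which is the kind of thing a validator would flag:
try:
    b"\xe8".decode("utf-8")
except UnicodeDecodeError as e:
    print("invalid UTF-8:", e.reason)
```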

Greetings,
Thomas

--
Just because there are many of them who are wrong does not mean they are right!
(Coluche)

Jukka K. Korpela

Dec 2, 2014, 10:50:27 AM
2014-12-02 16:52, Thomas Mlynarczyk wrote:

> My guess is that in the first example, the è was correctly UTF-8
> encoded, while in the second example it was not.

Maybe. It would have helped to make sure of this if the OP had included
the exact error message and the URL.

It is also possible that the second example was correctly UTF-8 encoded
but the server claims it to be ISO-8859-1 encoded.

--
Yucca, http://www.cs.tut.fi/~jkorpela/

David E. Ross

Dec 2, 2014, 11:23:28 AM
The text is obviously French, but you indicate <html lang="en">. That
is wrong even if that is not the source of your problem.

--
David E. Ross

The Crimea is Putin's Sudetenland.
The Ukraine will be Putin's Czechoslovakia.
See <http://www.rossde.com/editorials/edtl_PutinUkraine.html>.

Helmut Richter

Dec 2, 2014, 12:03:48 PM
I think it would be very helpful if the OP could show us the
mystical effect with one or two *publicly accessible* web pages. Then we
all could see whether the problem is reproducible for us and, more
importantly, we could see the actual encoding (as distinct from the
intended encoding) and the HTTP headers, which may tell something
different from the HTML text.

--
Helmut Richter

tlvp

Dec 2, 2014, 5:15:21 PM
As for the "'è' in line 5", perhaps the TITLE element is more finicky?
As for the disparate treatments of the 'è' in line 10, I draw a blank.

Sorry, not much help here, was I? Cheers, -- tlvp
--
Before replying, please throw out the trash.

dorayme

Dec 2, 2014, 6:02:32 PM
In article <547dca5e$0$12751$426a...@news.free.fr>,
Not according to the W3C Markup Validation Service, both came through
with "This document was successfully checked as HTML5!".

--
dorayme

Robin Thornton

Dec 5, 2014, 8:28:10 AM
Thank you Thomas, you have hit the nail on the head! I am ashamed to admit
that I have not paid attention to encoding when creating source. I have been
in the habit of using the basic Notepad editor, which does not allow
control over the encoding method. When I looked at my pages with Notepad++ I
was horrified to see that some were in ASCII and some in UTF-8.
I am now presented with the problem of converting several hundred accented
characters from ASCII to UTF-8 (sans BOM). Any ideas?

Once again many thanks to you and the others who took the trouble to offer
their help.
Robin Thornton

"Thomas Mlynarczyk" a écrit dans le message de groupe de discussion :
m5kjn1$cac$1...@news.albasani.net...

Christoph M. Becker

Dec 5, 2014, 9:05:50 AM
Robin Thornton wrote:

> Thank you Thomas, you have hit the nail on the head! I am ashamed to
> admit that I have not paid attention to encoding when creating source. I
> have been in the habit of using the basic Notepad editor, which does
> not allow control over the encoding method. When I looked at my pages
> with Notepad++ I was horrified to see that some were in ASCII and some in
> UTF-8.
> I am now presented with the problem of converting several hundred
> accented characters from ASCII to UTF-8 (sans BOM). Any ideas?

Well, ASCII is a proper subset of UTF-8, so there is no need for the
conversion. However, ASCII has no accented characters -- do you
actually mean ISO-8859-1 (aka ISO Latin 1) or something like that?

--
Christoph M. Becker

Manuel Collado

Dec 5, 2014, 10:05:35 AM
On 05/12/2014 14:27, Robin Thornton wrote:
> ... When I looked at my pages
> with Notepad++ I was horrified to see that some were in ASCII and some in
> UTF-8.
> I am now presented with the problem of converting several hundred
> accented characters from ASCII to UTF-8 (sans BOM). Any ideas?

Several hundred characters of several hundred files?

Notepad++ can convert the encoding of the file (one file at a time):
"Encoding" -> "Convert to utf-8 without BOM"

For several hundred files you could use a Windows port of an adequate
utility, like "iconv" or "recode", and write a batch script that loops
over the file set to be converted.
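Alternatively, the loop can be done in a few lines of Python; a sketch assuming the files live under a hypothetical `site/` directory and really are Latin-1 encoded (check with Notepad++ first, and work on a copy of the files):

```python
from pathlib import Path

# Re-encode every .html file under site/ from Latin-1 to UTF-8.
# Path("site") and the source encoding are assumptions; adjust as needed.
for path in Path("site").rglob("*.html"):
    text = path.read_text(encoding="latin-1")
    path.write_text(text, encoding="utf-8")  # write_text adds no BOM
```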

--
Manuel Collado - http://lml.ls.fi.upm.es/~mcollado

Thomas 'PointedEars' Lahn

Dec 6, 2014, 2:49:00 PM
Christoph M. Becker wrote:

> Robin Thornton wrote:
>> Thank you Thomas, you have hit the nail on the head! I am ashamed to
>> admit that I have not paid attention to encoding when creating source. I
>> have been in the habit of using the basic Notepad editor, which does
>> not allow control over the encoding method. When I looked at my pages
>> with Notepad++ I was horrified to see that some were in ASCII and some in
>> UTF-8.
>> I am now presented with the problem of converting several hundred
>> accented characters from ASCII to UTF-8 (sans BOM). Any ideas?
>
> Well, ASCII is a proper subset of UTF-8, so there is no need for the
> conversion.

Correct. More precisely, the _8-bit variant of US-ASCII without extensions_
is a proper subset of UTF-8 (US-ASCII was originally a 7-bit encoding). [0]

> However, ASCII has no accented characters -- do you actually mean
> ISO-8859-1 (aka ISO Latin 1) or something like that?

“ISO-8859-1” is _not_ ISO Latin 1; the latter would be ISO/IEC 8859-1,
which, by contrast to the former, does not assign characters/meaning to the
C0 and C1 control codes from ISO/IEC 6429.

What is called “ISO-8859-1” is byte-by-byte equivalent to Windows-1252
except for the C1 control codes, which are replaced in the latter by
additional characters. Because of that, several Web clients process source
code labeled as encoded in “ISO-8859-1” as if it were declared
Windows-1252. [1] The Encoding Specification (CR), which is referred to by
HTML5 (REC) [2], makes this behavior a Web standard. [3]

For the moment at least, the affected codes should not be used, and UTF-8-
encoded Unicode characters should be used instead for the whole content in
that case.
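The C1 difference is easy to demonstrate; a sketch in Python, whose 'latin-1' codec implements the IANA ISO-8859-1 mapping and whose 'cp1252' codec implements Windows-1252:

```python
# Byte 0x96 is an unassigned C1 control code in ISO-8859-1,
# but the en dash in Windows-1252:
assert b"\x96".decode("latin-1") == "\x96"    # C1 control U+0096
assert b"\x96".decode("cp1252") == "\u2013"   # EN DASH '–'

# Outside the C1 range the two encodings agree byte for byte:
assert b"\xe8".decode("latin-1") == b"\xe8".decode("cp1252") == "è"
```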

As for the question: Yes. [4]


PointedEars

___________
[0] <https://en.wikipedia.org/wiki/ASCII#8-bit>
[1] <https://en.wikipedia.org/wiki/ISO/IEC_8859-1>
[2] <http://www.w3.org/TR/2014/REC-html5-20141028/infrastructure.html#encoding-terminology>
[3] <http://www.w3.org/TR/2014/CR-encoding-20140916/>
[4] <http://www.catb.org/~esr/faqs/smart-questions.html>
--
> If you get a bunch of authors […] that state the same "best practices"
> in any programming language, then you can bet who is wrong or right...
Not with javascript. Nonsense propagates like wildfire in this field.
-- Richard Cornford, comp.lang.javascript, 2011-11-14

Christoph M. Becker

Dec 9, 2014, 7:22:18 PM
Thomas 'PointedEars' Lahn wrote:

> Christoph M. Becker wrote:
>
>> Robin Thornton wrote:

>> However, ASCII has no accented characters -- do you actually mean
>> ISO-8859-1 (aka ISO Latin 1) or something like that?
>
> “ISO-8859-1” is _not_ ISO Latin 1; the latter would be ISO/IEC 8859-1,
> which, by contrast to the former, does not assign characters/meaning to the
> C0 and C1 control codes from ISO/IEC 6429.

Thanks for the correction. I was confused by the IANA-defined character
set name[1] "ISO-8859-1", which seems to actually denote
ISO/IEC 8859-1.

> What is called “ISO-8859-1” is byte-by-byte equivalent to Windows-1252
> except for the C1 control codes, which are replaced in the latter by
> additional characters. Because of that, several Web clients process source
> code labeled as encoded in “ISO-8859-1” as if it were declared
> Windows-1252. [1] The Encoding Specification (CR), which is referred to by
> HTML5 (REC) [2], makes this behavior a Web standard. [3]

Good to know -- unfortunately, it's hard to remember, because it is so
confusing.

[1] <http://www.iana.org/assignments/character-sets/character-sets.xhtml>

--
Christoph M. Becker

Jukka K. Korpela

Dec 9, 2014, 11:21:47 PM
2014-12-10, 2:22, Christoph M. Becker wrote:

> Thomas 'PointedEars' Lahn wrote:
>
>> Christoph M. Becker wrote:
>>
>>> Robin Thornton wrote:
>
>>> However, ASCII has no accented characters -- do you actually mean
>>> ISO-8859-1 (aka ISO Latin 1) or something like that?
>>
>> “ISO-8859-1” is _not_ ISO Latin 1; the latter would be ISO/IEC 8859-1,
>> which, by contrast to the former, does not assign characters/meaning to the
>> C0 and C1 control codes from ISO/IEC 6429.
>
> Thanks for the correction.

It wasn’t a correction, it was pointless nitpicking that just confuses
people. There is no point in treating ISO-8859-1 and ISO 8859-1 as two
different things, and ISO Latin 1 is an informal name. The relevant
thing is that the preferred IANA name, to be used e.g. in Content-Type
headers, is ISO-8859-1 (case-insensitively). When writing about the
standard or the encoding, any of these is OK; ignore any whining about
it as trolling.

>> What is called “ISO-8859-1” is byte-by-byte equivalent to Windows-1252
>> except for the C1 control codes, which are replaced in the latter by
>> additional characters. Because of that, several Web clients process source
>> code labeled as encoded in “ISO-8859-1” as if it were declared
>> Windows-1252. [1] The Encoding Specification (CR), which is referred to by
>> HTML5 (REC) [2], makes this behavior a Web standard. [3]
>
> Good to know -- unfortunately, it's hard to remember, because it is so
> confusing.

HTML5 (though with good intentions) indeed makes it confusing, and
trolls (with no good intentions) try to confuse people even more.

The only thing that really matters (except for HTML5 validation, which
may pointlessly whine about the use of ISO-8859-1) is that browsers
interpret a declaration of ISO-8859-1 as Windows-1252, because that’s
really the only reasonable thing to do. There is no use for C1 Controls
in HTML, and any byte in the C1 Controls range in purported ISO-8859-1
data is almost certainly meant to be interpreted as in Windows-1252.

And even this isn’t really relevant most of the time. If you need
characters that appear in Windows-1252 but not in ISO-8859-1, you *can*
use the Windows-1252 encoding (and declare it Windows-1252 or
ISO-8859-1), but it is normally more sensible to use UTF-8.

--
Yucca, http://www.cs.tut.fi/~jkorpela/

Helmut Richter

Dec 10, 2014, 3:05:50 AM
On 10.12.2014 05:21, Jukka K. Korpela wrote:

> HTML5 (though with good intentions) indeed makes it confusing, and
> trolls (with no good intentions) try to confuse people even more.
>
> The only thing that really matters (except for HTML5 validation, which
> may pointlessly whine about the use of ISO-8859-1) is that browsers
> interpret a declaration of ISO-8859-1 as Windows-1252, because that’s
> really the only reasonable thing to do. There is no use for C1 Controls
> in HTML, and any byte in the C1 Controls range in purported ISO-8859-1
> data is almost certainly meant to be interpreted as in Windows-1252.
>
> And even this isn’t really relevant most of the time. If you need
> characters that appear in Windows-1252 but not in ISO-8859-1, you *can*
> use the Windows-1252 encoding (and declare it Windows-1252 or
> ISO-8859-1)

IMHO in this case you *should* declare it Windows-1252, for the simple
reason that it is true, instead of relying on the wrong declaration
working all the same. If you will: for documentation purposes.

> but it is normally more sensible to use UTF-8.

Certainly so; otherwise the whole encoding issue comes up again when you
have the first non-Latin character in it, e.g. a mathematical symbol or
a somewhat exotic punctuation mark.

While we are talking about confusion: I find there are two things that
are *really* confusing until you get used to them:

1. For an HTML page served by a web server (as distinct from a local
file which the browser finds without involving a web server), there can
be up to *two* code declarations: one by the web server using the HTTP
protocol and one in the HTML text, and the *former* takes precedence.
IMHO best practice is to use the specification in the HTML text but to
be aware that it has no effect when the HTTP header tells otherwise --
which is bad practice, since the web server should not specify anything
that is unknown to it.

2. In the standards documents you find the term "document character set"
with the explanation that it is Unicode. This has *nothing* to do with
the question how your document is actually encoded. If it is encoded in
any code, it will be converted to Unicode (or treated as if it were). I
have still to find a context where an ordinary web author has to think
about the document character set of his document.
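Point 1 above can be sketched in Python; the header parsing uses the standard library, and `effective_charset` is a hypothetical helper just to illustrate the precedence rule:

```python
from email.message import Message

# Parse the charset parameter out of an HTTP Content-Type header value:
msg = Message()
msg["Content-Type"] = "text/html; charset=ISO-8859-1"
http_charset = msg.get_content_charset()
print(http_charset)  # iso-8859-1

# Hypothetical helper: per point 1, the HTTP declaration wins over
# the declaration inside the HTML text when both are present.
def effective_charset(http_charset, meta_charset):
    return http_charset or meta_charset

print(effective_charset("utf-8", "iso-8859-1"))  # utf-8
print(effective_charset(None, "iso-8859-1"))     # iso-8859-1
```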

--
Helmut Richter

Jukka K. Korpela

Dec 10, 2014, 3:48:31 AM
2014-12-10, 10:05, Helmut Richter wrote:

> Am 10.12.2014 05:21, schrieb Jukka K. Korpela:
[...]
>> And even this isn’t really relevant most of the time. If you need
>> characters that appear in Windows-1252 but not in ISO-8859-1, you *can*
>> use the Windows-1252 encoding (and declare it Windows-1252 or
>> ISO-8859-1)
>
> IMHO in this case you *should* declare it Windows-1252 for the simple
> reason that it is true, instead of relying that the wrong declaration
> will work all the same. If you will: for documentation purposes.

This is a moot point. What HTML5 says is really just one opinion. And if
I intentionally use only ISO-8859-1 characters in my HTML document,
encoded in ISO-8859-1, it *is* ISO-8859-1 for all intents and purposes. If
it accidentally contains bytes in the C1 Controls range, then browsers
in fact treat them according to Windows-1252. Fact of life. But this
does not mean that it would be wrong to declare an ISO-8859-1 datastream
as ISO-8859-1 encoded, any more than it is wrong to declare a US-ASCII
datastream as US-ASCII encoded, even though we know that browsers will
treat bytes with the high bit set according to Windows-1252 if they find
them in data declared to be US-ASCII.

Specifically for documentation purposes, it is meaningful to declare
ISO-8859-1 if that is what you intend to use.

> While we are talking about confusion: I find there are two things that
> are *really* confusing until you get used to them:
>
> 1. For an HTML page served by a web server (as distinct from a local
> file which the browser finds without worrying a web server), there can
> be up to *two* code declarations: one by the web server using the HTTP
> protocol and one in the HTML text, and the *former* takes precedence.

Well, it can be confusing indeed. And it is a *real* problem, unlike
some attempts at confusing us in matters where no confusion otherwise
exists. More widespread use of UTF-8, when carried out a wrong way, has
made the problem worse. For example, many web servers declare UTF-8 in
HTTP headers, no matter what authors say. This means that authors who
need to use ISO-8859-1 or Windows-1252, for some reason, are in trouble.
So are authors who could well use UTF-8 but don’t know about the issue
and don’t realize what the server is doing.

There is further confusion, also real, caused by the newer idea
(specified in HTML5, implemented by many browsers but not all) that the
presence of BOM, Byte Order Mark, implies UTF-8, overriding even HTTP
headers. Of course here “BOM” means “three bytes that constitute the BOM
if interpreted according to UTF-8”. This can be useful at times, but it
can also mess things up.
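What that sniffing looks like in practice; a Python sketch, using the standard codecs constant for the UTF-8 BOM bytes:

```python
import codecs

data = "\ufeffè".encode("utf-8")  # UTF-8 text that starts with a BOM

# The "BOM" here is just the three bytes EF BB BF:
assert codecs.BOM_UTF8 == b"\xef\xbb\xbf"

# Per the HTML5 rule described above, finding these bytes at the start
# implies UTF-8, overriding other declarations (in browsers that
# implement the rule):
if data.startswith(codecs.BOM_UTF8):
    text = data[len(codecs.BOM_UTF8):].decode("utf-8")
    print(text)  # è
```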

> 2. In the standards documents you find the term "document character set"
> with the explanation that it is Unicode.

It used to be confusing, but I think it’s water under the bridge now.
It’s what HTML specifications used to say when HTML was nominally
SGML-based, though never actually implemented that way. The statement
used the term “document character set” in the SGML sense, which has a
special meaning: it defines how numbers in character references like
&#123; are to be interpreted.

In XHTML and in HTML5, such a concept is not used, since they have
broken connection with SGML. Instead, they define directly how those
references are interpreted.

> I have still to find a context where an ordinary web author has to think
> about the document character set of his document.

They don’t need the term, but they need to know how character
references are interpreted. They may need to know that &#200; is
interpreted as referring to the character with Unicode code point 200
(decimal), quite independently of what byte 200 (decimal) might mean in
the character encoding of the document.

(Since many people have not known this and have used character
references like &#150;, intending them to be interpreted so that the
number is the Windows-1252 code, browsers have adapted to this, and in
HTML5, even the specification was written to accommodate this mess. This
means that legacy code containing such constructs need not be corrected
in this respect, even though &#150; was technically undefined in HTML 4.01.)
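Both behaviors can be checked with Python's html module, which implements the HTML5 rules for numeric character references:

```python
import html

# &#200; refers to Unicode code point 200 (È), regardless of the
# document's byte encoding:
assert html.unescape("&#200;") == "\u00c8"  # 'È'

# &#150; is technically a C1 control code, but per the HTML5 legacy
# rule it is mapped to the Windows-1252 character at byte 150:
assert html.unescape("&#150;") == "\u2013"  # EN DASH '–'
```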

--
Yucca, http://www.cs.tut.fi/~jkorpela/

Christoph M. Becker

Dec 10, 2014, 6:10:52 PM
Jukka K. Korpela wrote:

> 2014-12-10, 2:22, Christoph M. Becker wrote:
>
>> Thomas 'PointedEars' Lahn wrote:
>>
>>> Christoph M. Becker wrote:
>>>
>>>> Robin Thornton wrote:
>>
>>>> However, ASCII has no accented characters -- do you actually mean
>>>> ISO-8859-1 (aka ISO Latin 1) or something like that?
>>>
>>> “ISO-8859-1” is _not_ ISO Latin 1; the latter would be ISO/IEC 8859-1,
>>> which, by contrast to the former, does not assign characters/meaning
>>> to the
>>> C0 and C1 control codes from ISO/IEC 6429.
>>
>> Thanks for the correction.
>
> It wasn’t a correction, it was pointless nitpicking that just confuses
> people. There is no point in treating ISO-8859-1 and ISO 8859-1 as two
> different things, and ISO Latin 1 is an informal name.

You may consider reading more carefully what has been written before
claiming it to be "pointless nitpicking". Thomas did not distinguish
between ISO-8859-1 and ISO 8859-1, but between ISO-8859-1 and ISO_/IEC_
8859-1, and he pointed out this very difference.

This difference is obviously irrelevant for HTML, and as such might be
ignored for the purposes of this newsgroup, but it is good to know that
there may be a relevant difference in other contexts.

--
Christoph M. Becker

Jukka K. Korpela

Dec 10, 2014, 6:21:56 PM
2014-12-11, 1:10, Christoph M. Becker wrote:

> You may consider to read more carefully what has been written before
> claiming it to be "pointless nitpicking".

There is little reason to read what the Lahn troll writes, but I
actually took a quick look this time, just to confirm that he has the
same agenda as before.

> Thomas did not distinguish
> between ISO-8859-1 and ISO 8859-1, but between ISO-8859-1 and ISO_/IEC_
> 8859-1, and he pointed out this very difference.

Are you trying to outnitpick him? It is absolutely irrelevant to
distinguish between ISO and ISO/IEC here when the two standards
organizations have issued a joint standard.

> This difference is obviously irrelevant for HTML

As is virtually always the case when Lahn the troll writes here, except
when he writes nonsense about HTML, packaged into off-topic nitpicking
just to confuse people.

--
Yucca, http://www.cs.tut.fi/~jkorpela/

Thomas 'PointedEars' Lahn

Dec 17, 2014, 2:40:08 PM
Helmut Richter wrote:

> While we are talking about confusion: I find there are two things that
> are *really* confusing until you get used to them:
>
> 1. For an HTML page served by a web server (as distinct from a local
> file which the browser finds without worrying a web server), there can
> be up to *two* code declarations: one by the web server using the HTTP

s/code/encoding/

> protocol and one in the HTML text, and the *former* takes precedence.
> IMHO best practice is to use the specification in the HTML text but to
> be aware that it has no effect when the HTTP header tells otherwise --

Correct. The value of the “charset” attribute of a “meta” element
(HTML5) –

<meta charset="…">

–, or the “charset” parameter in the value of the “content” attribute of
a “meta” element with “http-equiv” attribute value "Content-Type"
(HTML 4.01) –

<meta http-equiv="Content-Type" content="text/html; charset=…">

–, should be specified only to facilitate displaying the document when no
HTTP server is present (in the local filesystem). If specified, it should
match the “charset” parameter of the “Content-Type” HTTP header field value:

Content-Type: text/html; charset=…

But actually, there can be at least three: the third one is an XML
declaration –

<?xml … encoding="…"?>.

– before the DOCTYPE (declaration). This applies to other XML-based
document types as well. I am presently not certain as to which declaration
takes precedence then.

<http://www.w3.org/TR/1999/REC-html401-19991224/charset.html#h-5.2.2>
<http://www.w3.org/TR/2014/REC-html5-20141028/infrastructure.html#extracting-character-encodings-from-meta-elements>
<http://www.w3.org/TR/2008/REC-xml-20081126/#sec-prolog-dtd>

> which is bad practice since the web server should not specify anything
> that is unknown to it.

It is bad practice for the Web server to send a “Content-Type” header
field with a default value for the “charset” parameter in the response. As
a result, that feature has been disabled by default since Apache 2.0.
However, there are several ways for the author to specify the value of that
parameter, and they should know which encoding they are using.

<https://issues.apache.org/bugzilla/show_bug.cgi?id=23421>
<http://httpd.apache.org/docs/2.2/en/mod/core.html#adddefaultcharset>

> 2. In the standards documents you find the term "document character set"
> with the explanation that it is Unicode. This has *nothing* to do with
> the question how your document is actually encoded. If it is encoded in
> any code, it will be converted to Unicode (or treated as if it were). I
> have still to find a context where an ordinary web author has to think
> about the document character set of his document.

“Unicode” as used there refers to the character set specified by the Unicode
standard, not the character encodings specified there (which are the Unicode
Transformation Formats: UTF-8, UTF-16, and UTF-32). The document is _not_
“converted to Unicode”; “treated as if it were” is probably close enough a
description.

<http://unicode.org/faq/>


PointedEars
--
Anyone who slaps a 'this page is best viewed with Browser X' label on
a Web page appears to be yearning for the bad old days, before the Web,
when you had very little chance of reading a document written on another
computer, another word processor, or another network. -- Tim Berners-Lee

Jukka K. Korpela

Dec 17, 2014, 3:08:29 PM
2014-12-17, 21:39, Lahn wrote:

> I am presently not certain as to which declaration
> takes precedence then.

Admitting ignorance would be a good thing, if it could be taken
seriously. But why lecture pointlessly on things already covered in
better responses then?

--
Yucca, http://www.cs.tut.fi/~jkorpela/