Re: HTML5 examples where one needs to "escape" characters.

James Moe

Feb 17, 2018, 3:31:35 PM

On 02/17/2018 12:44 PM, Stefan Ram wrote:

> 2&3<4
> into the HTML5 source code, a recent browser will display this
> paragraph as
> 2&3<4
>
Don't do that.
The "recent browser" is doing some fancy error recovery to decide what
you might have meant. Such situations vary with browsers and versions. YMMV.
You apparently know about character entities, those characters which
must be "escaped" to display properly. Those are the ones to "escape."

--
James Moe
jmm-list at sohnen-moe dot com
Think.

Sam Hill

Feb 17, 2018, 4:22:22 PM

On Sat, 17 Feb 2018 19:44:26 +0000, Stefan Ram wrote:

> When I write
>
> 2&3<4

Write this instead:

2&amp;3&lt;4

When using "pairs" such as < and > together in the content, use &gt; for
the >:

2&amp;3&lt;4 and 5&amp;6&gt;4
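
When the text comes from a program rather than from hand-written markup, this kind of replacement can be automated. A minimal sketch, assuming Python and its standard `html` module (neither is mentioned in this thread; the variable names are illustrative only):

```python
import html

raw = "2&3<4 and 5&6>4"

# quote=False escapes only &, < and >, which is enough for element content.
escaped = html.escape(raw, quote=False)
print(escaped)  # 2&amp;3&lt;4 and 5&amp;6&gt;4

# Decoding the character references gives back the original text.
assert html.unescape(escaped) == raw
```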

Here's one result a search turned up:
https://www.freeformatter.com/html-entities.html

Jukka K. Korpela

Feb 18, 2018, 7:14:36 AM

Stefan Ram wrote:

> r...@zedat.fu-berlin.de (Stefan Ram) writes:
>> Can you show me other situations where escaping is required?
>
> Ok, when I want to show
>
> <abc
>
> literally in the browser, and when I want to show a
> numerical entity
>
> &#38;
>
> or a named entity
>
> &amp;
>
> literally in the browser, I would have to escape "<"
> and "&" in the source code, respectively.

Yes, because otherwise the data is parsed as a start tag, as a numerical
character reference, or as a named entity reference, respectively.

More exactly, “<” needs to be escaped when it is immediately followed by
a letter, also in e.g. “x<y”, even though there is (currently) no HTML
tag with a name starting with “y”. The construct “<y” and some
characters after it would be parsed as an element with an unknown name.

And “&” needs to be escaped when
a) immediately followed by “#” or “#x” and a digit
or
b) immediately followed by a letter sequence that is defined in HTML (as
interpreted by the browser) as an entity name.

It is actually simpler to always escape “<” and “&” than to understand
and remember the rules above. In particular, there is a very large
number of named entities in HTML5, and this set might be expanded at any
time, or browsers might have their own extensions to the list.
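
To see which “&” sequences a parser actually treats as references, one can experiment outside the browser. The sketch below assumes Python's standard `html.unescape`, which follows the HTML5 rules for character references; it is not a browser, so take it as an approximation of the rules above rather than a guarantee about any particular browser. The sample strings are made up for illustration.

```python
import html

# Each key is source text; each value is what html.unescape turns it into.
samples = {
    "2&3<4": "2&3<4",             # "&" followed by a digit: left alone
    "AT&T": "AT&T",               # "T" is not an entity name: left alone
    "&#38;": "&",                 # decimal character reference
    "&#x26;": "&",                # hexadecimal character reference
    "&amp;": "&",                 # named reference
    "&copy 2018": "\u00a9 2018",  # legacy name, decoded even without ";"
}

for source, shown in samples.items():
    assert html.unescape(source) == shown

print(html.unescape("&copy 2018"))
```

The last sample is the kind of surprise that makes "always escape" the simpler policy: an unescaped “&” in front of a word that happens to be an entity name silently changes the displayed text.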

> Any other situations?

No, except within an attribute value, where the character used as the
value delimiter, namely Ascii quotation mark (") or Ascii apostrophe
('), needs to be escaped. For example, title="Say "Hello!"" won’t work,
so you would need to write e.g.
"Say &quot;Hello!&quot;"
or
'Say "Hello!"'
if the value needs to contain Ascii quotation marks; but
"Say “Hello!”"
does not need escaping.
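
For generated markup, the delimiter rule can be left to a library. A minimal sketch, again assuming Python's standard `html.escape` (an assumption for illustration, not something used in this thread); the `title` and `markup` names are hypothetical:

```python
import html

title = 'Say "Hello!"'

# quote=True (the default) also escapes " and ', so the result is safe
# inside either kind of attribute value delimiter.
attr = html.escape(title, quote=True)
markup = f'<p title="{attr}">...</p>'
print(markup)  # <p title="Say &quot;Hello!&quot;">...</p>

# Typographic quotation marks are not delimiters and pass through unchanged.
assert html.escape("Say \u201cHello!\u201d") == "Say \u201cHello!\u201d"
# The ASCII apostrophe becomes a numeric character reference.
assert html.escape("it's") == "it&#x27;s"
```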

--
Yucca, http://jkorpela.fi

JJ

Feb 18, 2018, 1:11:10 PM

On 17 Feb 2018 19:44:26 GMT, Stefan Ram wrote:
> When I write
>
> 2&3<4
>

> into the HTML5 source code, a recent browser will display this
> paragraph as
>
> 2&3<4
>

> . I would like to have some examples, where "escaping" is still
> needed. I found that
>
> 2&3<a
>
> ("a" instead of "4") will need the "<" to be escaped, i.e.,
>
> 2&3&lt;a
>
> . Can you show me other situations where escaping is required?

I think the reason a "<" character is displayed as is lies in how the
browser interprets HTML code. For example:

<0
<#
<"
<@

They are displayed as is because an HTML tag name cannot start with a
number or any other invalid character. IIRC, an HTML tag name can only start
with a character from "A"/"a" to "Z"/"z".

The code below, however, will not be displayed as is:

<!
<?

Because the "!" and "?" characters are special characters used for HTML
comments, document type declarations, and processing instructions (or the
prolog). There may be other special characters used in HTML tags, but I only
know these two. You might want to check the HTML specifications to find out.

The same goes for the "&" character. It is displayed as is if that
character and the characters following it don't form valid HTML entity
syntax, which can be one of (no quotes):

- "&A;". Where "A" is a valid HTML entity name. e.g. "quot", "amp", etc.

- "&#N;". Where "N" is a decimal number for a Unicode character code.

- "&#xH;". Where "H" is a hexadecimal number for a Unicode character code.

While you can exploit this, there's a possibility that strict HTML parsing
will be enforced in the future, where web browsers reject HTML code that
has invalid syntax, similar to how browsers reject JavaScript code that
contains invalid syntax.

Jukka K. Korpela

Feb 18, 2018, 1:58:53 PM

JJ wrote:

> Below code however, will not be displayed as is:
>
> <!
> <?
>
> Because the "!" and "?" characters are special characters which are used for
> HTML comment, document type, and processing (or prolog).

Right. I forgot that “<” also needs to be encoded if it is immediately
followed by “!” or “?”.

And also when immediately followed by “/”, since “</” is parsed as the
start of an end tag.
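
Taken together, the rule for “<” as stated in this thread (escape it before an ASCII letter, “!”, “?” or “/”) can be written down as a small check. This is only a sketch of that stated rule in Python, not the HTML specification's tokenizer, and `lt_needs_escape` is a hypothetical helper name:

```python
import string


def lt_needs_escape(text: str, i: int) -> bool:
    """Return True if the "<" at text[i] would start markup per the rule above."""
    if text[i] != "<":
        raise ValueError("text[i] must be '<'")
    nxt = text[i + 1 : i + 2]  # empty string when "<" is the last character
    return nxt != "" and (nxt in string.ascii_letters or nxt in "!?/")


for sample in ["<0", "<#", '<"', "<@", "<a", "<!", "<?", "</"]:
    print(repr(sample), lt_needs_escape(sample, 0))
```

In practice it remains simpler to escape every “<” than to apply such a check.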

--
Yucca, http://jkorpela.fi

Jukka K. Korpela

Feb 21, 2018, 5:49:55 AM

Stefan Ram wrote:

> Am I correct, when I presume that in XHTML5 and therefore
> also in polyglot HTML5, "<", ">" and "&" always have to be
> escaped when they are meant to be part of the text or
> attribute content?

Only “<” and “&”. The “>” character need not be escaped, except in the
rather rare special case when it appears as part of the string “]]>”.

This follows from general XML rules.
https://www.w3.org/TR/xml/#syntax
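
These XML rules can be checked with any conforming XML parser. A minimal sketch, assuming Python's standard `xml.etree.ElementTree` (which uses the expat parser); `is_well_formed` is a hypothetical helper and the fragments are made-up test inputs:

```python
import xml.etree.ElementTree as ET


def is_well_formed(fragment: str) -> bool:
    """Return True if the parser accepts the fragment as well-formed XML."""
    try:
        ET.fromstring(fragment)
        return True
    except ET.ParseError:
        return False


print(is_well_formed("<p>1 > 0</p>"))        # bare ">" in text
print(is_well_formed("<p>1 < 2</p>"))        # bare "<" in text
print(is_well_formed("<p>a & b</p>"))        # bare "&" in text
print(is_well_formed("<p>a ]]> b</p>"))      # "]]>" in text
print(is_well_formed("<p>a ]]&gt; b</p>"))   # "]]>" with ">" escaped
```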

> I took this for granted, but then I got confused because of
> one XHTML5 validator, which did not confirm this. (But I was
> in a hurry then, and maybe I operated it incorrectly.)

I can’t comment on that without knowing what test revealed that and what
validator (or “validator”) was used.

--
Yucca, http://jkorpela.fi

Thomas 'PointedEars' Lahn

Feb 21, 2018, 7:37:25 AM

Jukka K. Korpela wrote:

> Stefan Ram wrote:
>> Am I correct, when I presume that in XHTML5 and therefore
>> also in polyglot HTML5, "<", ">" and "&" always have to be
>> escaped when they are meant to be part of the text or
>> attribute content?
>
> Only “<” and “&”. The “>” character need not be escaped, except in the
> rather rare special case when it appears as part of the string “]]>”.
>
> This follows from general XML rules.
> https://www.w3.org/TR/xml/#syntax

That is correct _for XHTML5_.

Among other undefined (ad-hoc invented) terms here, it is unclear what is
meant by “polyglot HTML5”. But in HTML 4 (all subversions and variants)
and HTML5 (in the _HTML_ syntax of all subversions so far), *standalone* “&”
characters (those followed by whitespace or other characters that cannot
make up a character reference) do not need to be replaced with a character
reference (“escaped”). This follows from the SGML rules for HTML 4.01 and
from the parser algorithms for HTML5:

<http://xml.coverpages.org/sgmlsyn/sgmlsyn.htm#P62>

<https://www.w3.org/TR/2017/REC-html52-20171214/syntax.html#character-reference-state>

However, for consistency and maintainability, and maybe even
interoperability, it is recommended.

PointedEars
--
var bugRiddenCrashPronePieceOfJunk = (
navigator.userAgent.indexOf('MSIE 5') != -1
&& navigator.userAgent.indexOf('Mac') != -1
) // Plone, register_function.js:16