On Thu, Sep 29, 2016 at 6:32 AM, 'Klaas Berger' via OWASP Java HTML
Sanitizer Support <owasp-java-html-...@googlegroups.com> wrote:
> Hello,
>
> I tried to sanitize HTML which happens to contain a misplaced title tag:
>
> <html>
> <body>
> <div>div1</div>
> <title>title</title>
> <div>div2</div>
> </body>
> </html>
>
> I got the following output with PolicyFactory.sanitize(String html) :
>
> Title was not allowed:
>
> <html><body>
> <div>div1</div>
> </body></html>
> <div>div2</div>
>
> Title was allowed:
>
> <html><body>
> <div>div1</div>
> </body></html><title>title</title>
> <div>div2</div>
>
> Is this considered the desired outcome?
https://www.w3.org/TR/html5/syntax.html#parsing-main-inbody explains
that when a <title> is seen inside the <body> it is hoisted out into
the head.
"""
A start tag whose tag name is one of: "base", "basefont", "bgsound",
"link", "meta", "noframes", "script", "style", "template", "title"
An end tag whose tag name is "template"
    Process the token using the rules for the "in head" insertion mode.
"""
The other main constraint is
https://www.w3.org/TR/html-markup/title.html#title-context which says
that <title> is only allowed as a child of the <head>.
The sanitizer doesn't try to do the hoisting because, for efficiency
reasons, we build the output left to right and don't go back to
rewrite what has already been emitted.
What you're seeing seems to be the interaction of three things:
1. an efficient heuristic that almost always works: the <body> and
<html> elements are closed because we've seen a tag that they can't
contain, so we pop the element stack, closing elements for as long as
we're in an incompatible context
2. a bug: the <title> is emitted when there is nothing on the element
stack, so it isn't obviously somewhere other than <head>
3. an optimistic assumption: we assume that we're sanitizing a
fragment of HTML, not a whole document, so we allow the <div> after
the <title> even when there is no <body> on the stack.
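
To make the interaction of (1) to (3) concrete, here is a toy Java
sketch of the pop-until-compatible idea. It is not the sanitizer's
actual code: the class name StackPopSketch, the ALLOWED_CHILDREN
table, and the pre-split token list are all made up for illustration,
and the containment table only covers the handful of elements in your
example.

```java
import java.util.ArrayDeque;
import java.util.Deque;
import java.util.List;
import java.util.Map;
import java.util.Set;

final class StackPopSketch {
    // Hypothetical containment table (an assumption, far smaller than
    // any real one): which child elements each parent may hold.
    private static final Map<String, Set<String>> ALLOWED_CHILDREN = Map.of(
        "html", Set.of("head", "body"),
        "head", Set.of("title"),
        "body", Set.of("div"),
        "div", Set.of("div"),
        "title", Set.of());

    /** Each token is "<name>", "</name>", or plain text. */
    static String render(List<String> tokens) {
        StringBuilder out = new StringBuilder();
        Deque<String> open = new ArrayDeque<>();  // innermost element first
        for (String token : tokens) {
            if (token.startsWith("</")) {
                String name = token.substring(2, token.length() - 1);
                if (!open.contains(name)) {
                    continue;  // element was already closed: drop the tag
                }
                String popped;
                do {  // close everything up to and including `name`
                    popped = open.pop();
                    out.append("</").append(popped).append('>');
                } while (!popped.equals(name));
            } else if (token.startsWith("<")) {
                String name = token.substring(1, token.length() - 1);
                // Point 1: pop while the innermost open element can't
                // contain the new tag.
                while (!open.isEmpty()
                        && !ALLOWED_CHILDREN
                            .getOrDefault(open.peek(), Set.of())
                            .contains(name)) {
                    out.append("</").append(open.pop()).append('>');
                }
                // Points 2 and 3: with an empty stack, anything goes.
                open.push(name);
                out.append(token);
            } else {
                out.append(token);  // text is copied through
            }
        }
        while (!open.isEmpty()) {  // close whatever is still open
            out.append("</").append(open.pop()).append('>');
        }
        return out.toString();
    }

    public static void main(String[] args) {
        System.out.println(render(List.of(
            "<html>", "<body>", "<div>", "div1", "</div>",
            "<title>", "title", "</title>",
            "<div>", "div2", "</div>", "</body>", "</html>")));
        // prints <html><body><div>div1</div></body></html><title>title</title><div>div2</div>
    }
}
```

Feeding it your input (minus whitespace) gives the same shape as your
"Title was allowed" output: <body> and <html> are closed as soon as
the <title> shows up, and the trailing </body> and </html> are dropped
because nothing by that name is open any more.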
Very few policies allow tags like <html>, <head>, <body> since the
output is usually embedded in a larger page, so your configuration has
probably not been tested as thoroughly as others.
I'd be interested to hear why you want to preserve <body> and <html> tags.
If you mostly want to preserve the <title> content, you can always use
a custom element policy
(http://javadoc.io/doc/com.googlecode.owasp-java-html-sanitizer/owasp-java-html-sanitizer/20160924.1):
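    // Needs org.owasp.html.ElementPolicy and java.util.List.
    myPolicyBuilder
        .allowElements(
            new ElementPolicy() {
              @Override
              public String apply(String elementName, List<String> attrs) {
                // attrs is a flat list of alternating names and values.
                attrs.clear();                 // drop the original attributes
                attrs.add("class");            // attribute name
                attrs.add("sanitized-title");  // attribute value
                return "h1";                   // emit <h1> instead of <title>
              }
            },
            "title")
        .allowElements("h1")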
which should replace all
<title>foo</title>
with
<h1 class="sanitized-title">foo</h1>
which you might be able to visually hoist to the top of the containing
<section> with some style-fu.