body/html gets prematurely closed after misplaced title tag

Klaas Berger

Sep 29, 2016, 11:10:06 AM
to OWASP Java HTML Sanitizer Support
Hello,

I tried to sanitize HTML which happens to contain a misplaced title tag:

<html>
<body>
    <div>div1</div>
    <title>title</title>
    <div>div2</div>
</body>
</html>

I got the following output with PolicyFactory.sanitize(String html) :

Title was not allowed:

<html><body>
    <div>div1</div>
    </body></html>
    <div>div2</div>

Title was allowed:

<html><body>
    <div>div1</div>
    </body></html><title>title</title>
    <div>div2</div>


Is this considered the desired outcome?
Is there any trick to not get the body/html tags closed prematurely?

Regards,
Klaas







Mike Samuel

Sep 29, 2016, 12:47:35 PM
to OWASP Java HTML Sanitizer Support
On Thu, Sep 29, 2016 at 6:32 AM, 'Klaas Berger' via OWASP Java HTML
Sanitizer Support <owasp-java-html-...@googlegroups.com>
wrote:
> Hello,
>
> I tried to sanitize HTML which happens to contain a misplaced title tag:
>
> <html>
> <body>
> <div>div1</div>
> <title>title</title>
> <div>div2</div>
> </body>
> </html>
>
> I got the following output with PolicyFactory.sanitize(String html) :
>
> Title was not allowed:
>
> <html><body>
> <div>div1</div>
> </body></html>
> <div>div2</div>
>
> Title was allowed:
>
> <html><body>
> <div>div1</div>
> </body></html><title>title</title>
> <div>div2</div>
>
> Is this considered the desired outcome?

https://www.w3.org/TR/html5/syntax.html#parsing-main-inbody explains
that when a <title> is seen inside the <body> it is hoisted out into
the head.

"""
A start tag whose tag name is one of: "base", "basefont", "bgsound",
"link", "meta", "noframes", "script", "style", "template", "title"

An end tag whose tag name is "template"

Process the token using the rules for the "in head" insertion mode.
"""

The other main constraint is
https://www.w3.org/TR/html-markup/title.html#title-context which says
that <title> is only allowed as a child of the <head>.



The sanitizer doesn't try to do the hoisting since, for efficiency
reasons, we're building an output left to right.

What happens seems to be the intersection of:
1. an efficient heuristic that almost always works: the <body> and
<html> elements are closed because we've seen a tag that they can't
contain, so we pop the element stack, closing elements as long as we're
in an incompatible context
2. a bug: the <title> is emitted when there is nothing on the element
stack, so it's not obviously in a place other than <head>
3. an optimistic assumption: we assume that we're sanitizing a
fragment of HTML, not a whole document, so we allow the <div> after the
<title> even when there is no <body> on the stack.
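
To make the three points above concrete, here is a toy model of a left-to-right emitter with an element stack. It is only a sketch and not the sanitizer's actual code: the class name StackEmitterSketch, the pre-tokenized input, the tiny containment table, and the rule that a disallowed <title> loses its text are all assumptions chosen to mirror the outputs reported in this thread.

```java
import java.util.ArrayDeque;
import java.util.Deque;
import java.util.List;
import java.util.Set;

// Hypothetical toy model; none of these names come from the real library.
final class StackEmitterSketch {
    // Tags that a <body> or <div> may not contain in this toy model.
    private static final Set<String> STRUCTURAL = Set.of("html", "head", "body", "title");
    // Elements whose text is dropped when the element itself is disallowed.
    private static final Set<String> TEXT_ONLY = Set.of("title");

    // The input from the original post, already split into tokens.
    static final List<String> INPUT = List.of(
        "<html>", "<body>", "<div>", "div1", "</div>",
        "<title>", "title", "</title>",
        "<div>", "div2", "</div>", "</body>", "</html>");

    static boolean canContain(String parent, String child) {
        switch (parent) {
            case "html": return child.equals("head") || child.equals("body");
            case "head": return child.equals("title");
            case "title": return false; // text only
            default: return !STRUCTURAL.contains(child); // body, div
        }
    }

    static String sanitize(List<String> tokens, Set<String> allowed) {
        StringBuilder out = new StringBuilder();
        Deque<String> open = new ArrayDeque<>(); // stack of emitted open elements
        String skipping = null; // disallowed text-only element being dropped
        for (String tok : tokens) {
            if (skipping != null) {
                if (tok.equals("</" + skipping + ">")) skipping = null;
                continue;
            }
            if (tok.startsWith("</")) {
                String name = tok.substring(2, tok.length() - 1);
                if (open.contains(name)) { // ignore end tags with no open match
                    String top;
                    do {
                        top = open.pop();
                        out.append("</").append(top).append('>');
                    } while (!top.equals(name));
                }
            } else if (tok.startsWith("<")) {
                String name = tok.substring(1, tok.length() - 1);
                // Point 1: pop while the incoming tag is incompatible with the
                // open element. This happens before the policy is consulted.
                while (!open.isEmpty() && !canContain(open.peek(), name)) {
                    out.append("</").append(open.pop()).append('>');
                }
                if (allowed.contains(name)) {
                    // Points 2 and 3: with an empty stack, anything is accepted.
                    out.append(tok);
                    open.push(name);
                } else if (TEXT_ONLY.contains(name)) {
                    skipping = name;
                }
            } else {
                out.append(tok); // plain text
            }
        }
        while (!open.isEmpty()) out.append("</").append(open.pop()).append('>');
        return out.toString();
    }

    public static void main(String[] args) {
        System.out.println(sanitize(INPUT, Set.of("html", "body", "div", "title")));
        System.out.println(sanitize(INPUT, Set.of("html", "body", "div")));
    }
}
```

Because the stack is popped based on the incoming tag before the policy decides whether to keep it, <body> and <html> get closed even when the <title> is later dropped. With only "div" allowed, the same toy never pushes <html> or <body>, so nothing is closed early.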


Very few policies allow tags like <html>, <head>, <body> since the
output is usually embedded in a larger page, so your configuration has
probably not been tested as thoroughly as others.


I'd be interested to hear why you want to preserve <body> and <html> tags.


If you mostly want to preserve the <title> content, you can always use
a custom element policy
( http://javadoc.io/doc/com.googlecode.owasp-java-html-sanitizer/owasp-java-html-sanitizer/20160924.1
)

myPolicyBuilder
    .allowElements(
        new ElementPolicy() {
          @Override public String apply(String elementName,
                                        List<String> attrs) {
            // attrs holds alternating attribute names and values.
            attrs.clear();
            attrs.add("class");
            attrs.add("sanitized-title");
            return "h1";  // emit <h1> in place of <title>
          }
        },
        "title")
    .allowElements("h1")

which should replace all
<title>foo</title>
with
<h1 class="sanitized-title">foo</h1>
which you might be able to visually hoist to the top of the containing
<section> with some style-fu.

Jim Manico

Sep 29, 2016, 3:51:04 PM
to owasp-java-html-...@googlegroups.com

I want to echo Mike's comments here.

On the latest build (20160924.1) I tried the following policy.

org.owasp.html.PolicyFactory sanitizer = new HtmlPolicyBuilder()
    .allowElements("div")
    .toFactory();

I used the following input:


<html>
<body>
    <div>div1</div>
    <title>title</title>
    <div>div2</div>
</body>
</html>

And got the following output:

    <div>div1</div>
   
    <div>div2</div>

This is a bit closer to the expected use case for this tool.

Aloha, Jim

Klaas Berger

Sep 30, 2016, 9:38:13 AM
to OWASP Java HTML Sanitizer Support, mikes...@gmail.com
Hi,
thanks for your detailed answer!


On Thursday, September 29, 2016 at 6:47:35 PM UTC+2, Mike Samuel wrote:
> I'd be interested to hear why you want to preserve <body> and <html> tags.

We want to display the HTML content in a sandboxed iframe as a further layer
of restrictions. Hope that makes sense.

OK, I guess it would be better to throw away the html/head/body tags in any case,
as in the examples you gave, and just wrap the result in html/body for display in
the iframe. A declaration of the content encoding could be important, but I'll have
to check what can happen there.
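
A minimal sketch of that wrapping step, assuming the fragment has already been sanitized with a policy that does not allow html/head/body, and assuming the iframe document is served as UTF-8. The class and method names (IframeWrapperSketch, wrapForIframe) are made up for illustration.

```java
// Hypothetical helper, not part of the sanitizer library.
final class IframeWrapperSketch {
    // Wraps an already-sanitized fragment in a fixed document shell.
    // The charset declaration is our own trusted markup, so the untrusted
    // input never controls <html>, <head>, or <body>.
    static String wrapForIframe(String sanitizedFragment) {
        return "<!DOCTYPE html><html><head><meta charset=\"utf-8\"></head><body>"
            + sanitizedFragment
            + "</body></html>";
    }

    public static void main(String[] args) {
        System.out.println(wrapForIframe("<div>div1</div><div>div2</div>"));
    }
}
```

The shell is constant, so the only variable part of the iframe document is the sanitizer's output.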

So I think I can go with that, thank you. Out of curiosity: why does the "closing
elements as long as we're in an incompatible context" logic apply to the incoming
HTML and not to the outgoing HTML? The latter would avoid closing body/html
when the title tag is disallowed or converted.

Regards,
Klaas