Changes to Sanitizer output and HTML5 support in OWASP Sanitizer


Mike Samuel

Jan 26, 2017, 11:15:04 AM
to OWASP Java HTML Sanitizer Announce
TL;DR: the next version of the OWASP HTML Sanitizer will include better
HTML 5 support but will also include some cosmetic changes to the way
tags are nested, which may break tests that assert the exact text of
a string that includes sanitizer output.

----

MOTIVATION
https://github.com/OWASP/java-html-sanitizer/labels/html5

The initial design of the sanitizer had as a goal producing output that

1. when parsed as a document fragment by either a browser's
   HTML or XHTML parser, included only whitelisted tags and attributes
2. where possible, did not trigger error paths in a validating parser
3. when embedded in a larger document, did not break lexical confinement
4. when embedded with common CSS style rules, did not break visual confinement


I tried to meet these goals by closely reading the HTML specifications
and doing experiments by hand in the browser.  That seems to have
worked pretty well.

While it's part of my job to track changes to the HTML 5 and other
specification documents, I haven't managed to keep the sanitizer's data
tables up to date with those changes, and a lot of bugs have been filed
asking for better support of newer HTML elements.

I don't expect to have more time or help to keep the tables current.

My solution is to stop using the HTML 5 specification and hand-crafted
experiments as the basis for HTML element metadata.

Instead I plan to interrogate browsers to infer tag relationships.  [1]
does a series of pairwise experiments of the form
    HTML_ELEMENTS.forEach( (a) =>
      HTML_ELEMENTS.forEach( (b) => {
        iframe.contentDocument.open();
        iframe.contentDocument.write(experimentHtml(a, b));
        iframe.contentDocument.close();
      }));
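
As a rough sketch of what a single experiment might involve: the helper
names and markup below are hypothetical and are not taken from [1], whose
real experimentHtml may differ. The idea is to write an outer tag followed
by an inner tag, let the browser parse it, and then check whether the
parser left the inner element as a child of the outer one.

```javascript
// Hypothetical helper, not the one in [1]: builds a tiny document that
// opens <a> and then opens <b> inside it, tagging both with ids.
function experimentHtml(a, b) {
  return '<!DOCTYPE html><body>' +
      '<' + a + ' id="outer">' +
      '<' + b + ' id="inner">';
}

// After the browser has parsed the experiment, report whether the inner
// element ended up as a direct child of the outer element.
// `doc` is anything with getElementById, e.g. iframe.contentDocument.
function wasNested(doc) {
  const outer = doc.getElementById('outer');
  const inner = doc.getElementById('inner');
  // Either element may be missing if the parser dropped the tag entirely.
  return outer !== null && inner !== null && inner.parentNode === outer;
}
```

In a browser, wasNested(iframe.contentDocument) would be called after
iframe.contentDocument.close(), once per (a, b) pair.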


USER VISIBLE CHANGES

There are no changes to the meaning of a policy.

There are no changes to the API used to define a policy or to use
the policy to sanitize HTML.

Changes to the tag balancer might lead to cosmetic changes to some
HTML which might break overly constrained unit tests.
For example, the input

    <p><table><tr><td>foo</td></tr></table></p>

used to produce the same string as output but now produces

    <p></p><table><tr><td>foo</td></tr></table>

because the latter is what browsers produce (modulo implied <tbody>),
and nesting <table> inside <p> was something I (probably mistakenly)
concluded the spec documents allowed.
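
One way to make such a test less brittle is to assert properties of the
output rather than its exact text. The sketch below is only an
illustration: the function names are made up, and the regular
expressions are meant for checking already-sanitized output in a test,
not for sanitizing anything.

```javascript
// Collect the lower-cased names of all start tags in `html`.
// End tags such as </p> are skipped because '/' is not a letter.
function startTagNames(html) {
  const names = [];
  const re = /<([a-zA-Z][a-zA-Z0-9]*)/g;
  let m;
  while ((m = re.exec(html)) !== null) {
    names.push(m[1].toLowerCase());
  }
  return names;
}

// Strip tags to get the text a reader would see (entities left as-is).
function visibleText(html) {
  return html.replace(/<[^>]*>/g, '');
}

// True when every start tag is in `allowedTags` (a Set of names) and the
// visible text matches, regardless of how the tags are nested.
function checkOutput(html, allowedTags, expectedText) {
  return startTagNames(html).every((t) => allowedTags.has(t)) &&
      visibleText(html) === expectedText;
}
```

With allowedTags containing p, table, tr and td, and expectedText 'foo',
checkOutput accepts both the old and the new output shown above.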


CAVEATS

There are a few things that pairwise testing will not reveal.
(1) It's hard to examine worst-case assumptions about <noscript>
     when your experimental framework depends on JavaScript
     running in a browser.
(2) Scoping relationships [2] and element transparency [3] are
     often only apparent with deeper element stacks than pairwise
     experiments can deduce.


IMPLEMENTATION CHANGES

[1] is an HTML page that uses JavaScript to perform pairwise
experiments on HTML tags, and the data from that has been integrated
into the tag balancer [4] in the master branch.
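
To illustrate how pairwise containment data can drive a balancer, here is
a deliberately simplified sketch. It is not the sanitizer's actual
algorithm: the table contents, token shape and function names are all
assumptions, and it ignores scoping, transparency and implied <tbody>.

```javascript
// Hypothetical containment table: parent name -> set of allowed children.
// The real data is generated from browser experiments and is much larger.
const CAN_CONTAIN = new Map([
  ['p', new Set()],               // in this sketch <p> holds none of these
  ['table', new Set(['tr'])],
  ['tr', new Set(['td'])],
  ['td', new Set(['p', 'table'])],
]);

// Parents missing from the table are assumed to allow any child.
function canContain(parent, child) {
  const allowed = CAN_CONTAIN.get(parent);
  return allowed === undefined || allowed.has(child);
}

// tokens: [{type: 'open'|'close', name}, {type: 'text', value}]
// Text values are assumed to be HTML-escaped already.
function balance(tokens) {
  const open = [];  // stack of currently open element names
  let out = '';
  for (const tok of tokens) {
    if (tok.type === 'text') {
      out += tok.value;
    } else if (tok.type === 'open') {
      // Close any open elements that cannot contain the new tag.
      while (open.length > 0 &&
             !canContain(open[open.length - 1], tok.name)) {
        out += '</' + open.pop() + '>';
      }
      open.push(tok.name);
      out += '<' + tok.name + '>';
    } else if (tok.type === 'close') {
      const idx = open.lastIndexOf(tok.name);
      if (idx < 0) continue;  // stray end tag: drop it
      // Close everything down to and including the matching element.
      while (open.length > idx) out += '</' + open.pop() + '>';
    }
  }
  // Close whatever is still open at end of input.
  while (open.length > 0) out += '</' + open.pop() + '>';
  return out;
}
```

Fed the tokens for <p><table><tr><td>foo</td></tr></table></p>, this
sketch closes the <p> before opening the <table> and drops the now-stray
</p>, giving <p></p><table><tr><td>foo</td></tr></table>.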

That tag balancer still uses scoping and element transparency tables derived
from the spec and manual experimentation.  Scoping is also used so that
we can make worst-case assumptions about <noscript> et al -- that the
<noXYZ> element contains tags and attributes that need to be filtered, and
that any such tags that survive filtering should lexically be contained by
the <noXYZ> element's start and end tags.


[1] https://github.com/OWASP/java-html-sanitizer/blob/master/empiricism/html-containment.html
[2] http://w3c.github.io/html/single-page.html#as-that-element-in-the-specific-scope
[3] http://w3c.github.io/html/single-page.html#transparent
[4] https://github.com/OWASP/java-html-sanitizer/commit/be8e547daf595b30da5e1986a41d55b1a4cd77f4#diff-feb951c9f2968d74b7ff0d66f90bf514