Changes to Sanitizer output and HTML5 support in OWASP Sanitizer
Mike Samuel
Jan 26, 2017, 11:15:04 AM
to OWASP Java HTML Sanitizer Announce
TL;DR: The next version of the OWASP HTML Sanitizer will include better HTML 5 support. It will also make some cosmetic changes to the way tags are nested, which may break tests that assert the exact text of a string that includes sanitizer output.
The initial design of the sanitizer had as a goal producing output that
1. when parsed as a document fragment by either a browser's HTML or XHTML parser, included only whitelisted tags and attributes
2. where possible, did not trigger error paths in a validating parser
3. when embedded in a larger document, did not break lexical confinement
4. when embedded with common CSS style rules, did not break visual confinement
I tried to meet these goals by closely reading the HTML specifications and doing experiments by hand in the browser. That seems to have worked pretty well.
While it's part of my job to track changes to the HTML 5 and other specification documents, I haven't managed to keep the sanitizer's data tables up to date with those changes, and a lot of bugs have been filed asking for better support of newer HTML elements.
I don't expect to have more time or help to do so.
My solution is to stop using the HTML 5 specification and hand-crafted experiments as the basis for HTML element metadata.
Instead I plan to interrogate browsers to infer tag relationships. [1] does a series of pairwise experiments of the form

    HTML_ELEMENTS.forEach(
        (a) => HTML_ELEMENTS.forEach(
            (b) => {
              iframe.contentDocument.open();
              iframe.contentDocument.write(experimentHtml(a, b));
              iframe.contentDocument.close();
            }));
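To make the shape of those experiments concrete, here is a minimal sketch of how the pairs might be enumerated and how a result might be read back. It is not the code from [1]: the element list is a small illustrative subset, and the body of `experimentHtml`, the `id` attributes, and the `innerIsNested` helper are all assumptions made for this example.

```javascript
// Illustrative subset only; the real experiment covers far more elements.
const HTML_ELEMENTS = ['p', 'table', 'tr', 'td', 'b'];

// Hypothetical experiment body: open <a>, then open <b> inside it.
// Neither tag is closed, so the parser has to decide how they nest.
function experimentHtml(a, b) {
  return `<${a} id="outer"><${b} id="inner">x`;
}

// Enumerate every ordered pair (a, b), including a === b.
function allExperiments(elements) {
  const experiments = [];
  for (const a of elements) {
    for (const b of elements) {
      experiments.push({ outer: a, inner: b, html: experimentHtml(a, b) });
    }
  }
  return experiments;
}

// Given a parsed document (for example iframe.contentDocument after the
// write), report whether the inner element ended up inside the outer one.
function innerIsNested(doc) {
  const outer = doc.getElementById('outer');
  const inner = doc.getElementById('inner');
  return Boolean(outer && inner && outer.contains(inner));
}
```

With n elements this produces n * n experiments, so the cost grows quadratically with the size of the element list.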
USER VISIBLE CHANGES
There are no changes to the meaning of a policy.
There are no changes to the API used to define a policy or to use the policy to sanitize HTML.
Changes to the tag balancer might lead to cosmetic changes to some HTML which might break overly constrained unit tests. For example the input
<p><table><tr><td>foo</td></tr></table></p>
used to produce the same string as output but now produces
<p></p><table><tr><td>foo</td></tr></table>
because the latter is what browsers produce (modulo an implied <tbody>). I had (probably mistakenly) concluded from the spec documents that a <table> inside a <p> was allowable.
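A test that only cares about which tags and text survive sanitization can be written so that it does not depend on exact nesting. The sketch below is one possible approach, not something the sanitizer provides: `openTagNames` and `textOnly` are hypothetical helpers, and the regular expressions are only adequate for simple, already-sanitized output like the strings above, not for arbitrary HTML.

```javascript
// Hypothetical helper: names of opening tags, in order, lowercased.
// Closing tags such as </p> do not match because of the leading "/".
function openTagNames(html) {
  return Array.from(
    html.matchAll(/<([a-zA-Z][a-zA-Z0-9]*)\b[^>]*>/g),
    (m) => m[1].toLowerCase());
}

// Hypothetical helper: the text left after dropping every tag.
function textOnly(html) {
  return html.replace(/<[^>]*>/g, '');
}

const oldOutput = '<p><table><tr><td>foo</td></tr></table></p>';
const newOutput = '<p></p><table><tr><td>foo</td></tr></table>';

// Both outputs contain the same opening tags and the same text,
// even though the strings themselves differ.
const sameTags =
  openTagNames(oldOutput).join(',') === openTagNames(newOutput).join(',');
const sameText = textOnly(oldOutput) === textOnly(newOutput);
```

An assertion on `sameTags` and `sameText` passes for both the old and the new balancer output, whereas an exact string comparison passes for only one of them.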
CAVEATS
There are a few things that pairwise testing will not reveal.
(1) It's hard to examine worst-case assumptions about <noscript> when your experimental framework depends on JavaScript running in a browser.
(2) Scoping relationships [2] and element transparency [3] are often only apparent with deeper element stacks than pairwise experiments can deduce.
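As an illustration of caveat (2), the sketch below is a simplified version of the HTML parsing algorithm's "has an element in scope" check; it is not the sanitizer's code, and the marker set is an abbreviated, assumed subset of the "button scope" list. A <p> start tag closes an open <p> only if that <p> is in button scope, so the outcome depends on what sits between the two on the stack.

```javascript
// Abbreviated, illustrative subset of the elements that bound "button scope".
const BUTTON_SCOPE_MARKERS = new Set(
  ['html', 'table', 'td', 'th', 'caption', 'button']);

// Walk the open-element stack from the innermost element outwards.
// Finding the target first means it is in scope; hitting a scope
// marker first means it is not.
function hasElementInScope(stack, target, markers) {
  for (let i = stack.length - 1; i >= 0; i--) {
    if (stack[i] === target) return true;
    if (markers.has(stack[i])) return false;
  }
  return false;
}

// Two deep: an open <p> is in scope, so a new <p> closes it.
const pairCase = hasElementInScope(
  ['html', 'body', 'p'], 'p', BUTTON_SCOPE_MARKERS);

// Three deep: <button> hides the outer <p>, so a new <p> nests instead.
const tripleCase = hasElementInScope(
  ['html', 'body', 'p', 'button'], 'p', BUTTON_SCOPE_MARKERS);
```

Here `pairCase` is true and `tripleCase` is false. None of the pairs (p, p), (p, button), or (button, p) on its own shows that <p><button><p> nests, which is why a stack of depth two is not enough to infer scoping.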
IMPLEMENTATION CHANGES
[1] is an HTML page that uses JavaScript to perform pairwise experiments on HTML tags. The data from those experiments has been integrated into the tag balancer [4] in the master branch.
That tag balancer still uses scoping and element transparency tables derived from the spec and manual experimentation. Scoping is also used so that we can make worst-case assumptions about <noscript> et al -- that the <noXYZ> element contains tags and attributes that need to be filtered, and that any such tags that survive filtering should lexically be contained by the <noXYZ> element's start and end tags.