Documentation on Sanitization Algorithm


Sresan Thevarajah

Jun 15, 2018, 2:33:47 PM
to OWASP Java HTML Sanitizer Support
Hey

I was wondering whether there is any documentation on how the sanitizer works, so that I don't have to get into the code. I am specifically interested in how the algorithm determines what an HTML element is. Can it identify poorly written HTML elements as well? (I.e., what happens if the HTML being sanitized is poorly written and might not render in all browsers but might in some?)


Mike Samuel

Jun 15, 2018, 2:55:51 PM
to OWASP Java HTML Sanitizer Support
https://github.com/OWASP/java-html-sanitizer/blob/master/src/main/java/org/owasp/html/HtmlLexer.java is responsible for breaking an input into tags. Start tags correspond to elements, though there is a tag balancer that might introduce implied elements. The only other sources of elements are a few quirks for HTML5 compatibility, like treating </br ...> as equivalent to <br ...>.

Can you provide examples of poorly written HTML elements?



The sanitizer strives to produce an easily parsed subset of HTML as output regardless of the input, so messy input should not cause different browsers to conclude different things about the DOM structure of the sanitized output. 
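
To make the lexing and balancing steps described above a bit more concrete, here is a minimal sketch in plain Java. It is a toy written only for illustration and is not the sanitizer's actual HtmlLexer or tag balancer: the class name ToyTagBalancer, its methods, and the tiny void-element list are all assumptions. It drops attributes, does not understand comments or quoted attribute values, and is not safe to use on untrusted input.

```java
import java.util.ArrayDeque;
import java.util.ArrayList;
import java.util.Deque;
import java.util.List;
import java.util.Locale;
import java.util.Set;

/** Toy illustration only; NOT the OWASP sanitizer's real lexer or balancer. */
final class ToyTagBalancer {

    // Hypothetical, deliberately tiny list of elements that never take a close tag.
    private static final Set<String> VOID_ELEMENTS = Set.of("br", "hr", "img");

    /** Splits the input into tag tokens ("<...>") and text tokens. */
    static List<String> lex(String html) {
        List<String> tokens = new ArrayList<>();
        int i = 0;
        while (i < html.length()) {
            int lt = html.indexOf('<', i);
            if (lt < 0) {                        // no more tags: the rest is text
                tokens.add(html.substring(i));
                break;
            }
            if (lt > i) {
                tokens.add(html.substring(i, lt));
            }
            int gt = html.indexOf('>', lt);
            if (gt < 0) {                        // unterminated tag: keep as text
                tokens.add(html.substring(lt));
                break;
            }
            tokens.add(html.substring(lt, gt + 1));
            i = gt + 1;
        }
        return tokens;
    }

    /** Returns the lower-cased tag name, or "" if the token is not a tag. */
    static String tagName(String token) {
        if (token.length() < 3 || token.charAt(0) != '<'
                || token.charAt(token.length() - 1) != '>') {
            return "";
        }
        int start = token.charAt(1) == '/' ? 2 : 1;
        int end = start;
        while (end < token.length() && Character.isLetterOrDigit(token.charAt(end))) {
            end++;
        }
        return token.substring(start, end).toLowerCase(Locale.ROOT);
    }

    /** Escapes angle brackets so leftover text cannot be read as markup. */
    static String escape(String text) {
        return text.replace("<", "&lt;").replace(">", "&gt;");
    }

    /** Re-emits the input with every element closed and stray close tags dropped. */
    static String balance(String html) {
        StringBuilder out = new StringBuilder();
        Deque<String> open = new ArrayDeque<>();  // stack of currently open elements
        for (String token : lex(html)) {
            String name = tagName(token);
            if (name.isEmpty()) {                 // text, or something tag-like we reject
                out.append(escape(token));
                continue;
            }
            boolean isClose = token.startsWith("</");
            if (VOID_ELEMENTS.contains(name)) {
                // The </br> quirk: a br close tag is emitted as a <br> start tag.
                if (!isClose || name.equals("br")) {
                    out.append('<').append(name).append('>');
                }
            } else if (!isClose) {
                open.push(name);
                out.append('<').append(name).append('>');
            } else if (open.contains(name)) {
                String top;
                do {                              // close anything left open inside it
                    top = open.pop();
                    out.append("</").append(top).append('>');
                } while (!top.equals(name));
            }                                     // otherwise: stray close tag, dropped
        }
        while (!open.isEmpty()) {                 // close whatever is still open
            out.append("</").append(open.pop()).append('>');
        }
        return out.toString();
    }

    public static void main(String[] args) {
        System.out.println(balance("<p>Hello <b>world"));  // <p>Hello <b>world</b></p>
        System.out.println(balance("a</br>b"));            // a<br>b
    }
}
```

The point of the sketch is the shape of the output: whatever the input looks like, every element that is opened is also closed, in properly nested order, and anything that does not lex as a tag is escaped as text. The real library does considerably more (attribute handling, element policies, HTML5 nesting rules), so the source linked above remains the reference.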
