Default policy unescaping HTML entities?


brad...@gmail.com

Mar 25, 2014, 2:30:56 PM
to owasp-java-html-...@googlegroups.com

I'm new to the OWASP sanitizer, and I'd like to know why some HTML entities are being unescaped and how to avoid it, if possible.

For example, it seems no matter what policy I pass in to the sanitizer the following string gets converted from this:

@ test &#33

into this:

@ test !

The difference is that the "&#33" text has been converted into its character, "!". I don't see how to configure this on the sanitizer, and I want the sanitizer's output to match what my users typed as closely as possible.

Thanks!

PS: Here's my sample code, which is a unit test to verify behavior I was expecting, and of course is failing right now.

package com.my.company.test;

import static org.junit.Assert.assertEquals;

import org.junit.Test;
import org.owasp.html.PolicyFactory;
import org.owasp.html.Sanitizers;

public class OwaspSanitizerTest {

  public static final PolicyFactory POLICY = Sanitizers.IMAGES;

  // JUnit 4 style: test methods are public, non-static instance methods.
  @Test
  public void testTextFilter() throws Exception {

      String data = "@ test &#33";
      String result = POLICY.sanitize(data);

      System.out.println(result);

      assertEquals("@ test !", result);     
  }
}

Mike Samuel

Mar 25, 2014, 4:38:21 PM
to owasp-java-html-...@googlegroups.com
There is no such option, and I'm loath to add one.

The more options, the harder it is to maintain a test suite that
checks all the interactions between options.
The more tightly users specify non-semantic aspects of the output, the
harder it becomes to respond to changes to HTML/CSS/etc. and to test
and release updates to deal with newly discovered attack vectors.

Why do you care how '!' is encoded?

brad...@gmail.com

Mar 26, 2014, 8:30:09 AM
to owasp-java-html-...@googlegroups.com, mikes...@gmail.com
The only reason I care is that I want what the user inputs to be the same as what we output as often as possible.

The company I work for made up some sample test cases for user inputs and expected user outputs, and this came up as a failed test.

This makes me wonder the following:

- Why is the "@" treated differently? I know it's the @ sign used in emails, but is there another reason as well?

- It seems that when the input and output differ only by entity encoding, we could just return the raw, unsanitized original. This would let me default to what the user actually typed in this case... but this is probably dangerous in some way, correct?

i.e., in pseudocode:

   var inputEqualsOutput = (input === html_decode(output));
   var result = inputEqualsOutput ? input : output;




Jim Manico

Mar 26, 2014, 9:28:57 AM
to owasp-java-html-...@googlegroups.com, mikes...@gmail.com
I'm a bit confused by this. Is your input really...

@ test &#33

If so, why is &#33 being decoded? Without the trailing semicolon it does not conform to the HTML entity encoding standard, although most browsers still decode it for backward compatibility. My instinct is that it should not be decoded.

However, if your test code was:

@ test &#33;

...and the sanitizer decoded it to reduce the output size, I think that's OK, but we should talk about the fact that some input is "compressed" for performance reasons.

My 1 rupee (hello from India),
Jim

brad...@gmail.com

Mar 26, 2014, 10:03:28 AM
to owasp-java-html-...@googlegroups.com, mikes...@gmail.com, j...@manico.net
I didn't realize that the example I posted was missing the ; on the &#33, but it doesn't matter, as the output is exactly the same in both cases: "&#33" is converted into an exclamation mark, "!".

I also realized that in my pseudocode example I should be HTML-decoding both the input and the output for my comparison:
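
A minimal sketch of why both spellings come out the same, assuming a lenient decoder that treats the trailing semicolon of a decimal character reference as optional. `LenientNumericRefDecoder` is a hypothetical helper written for this illustration; it is not the sanitizer's or ESAPI's actual decoding logic.

```java
import java.util.regex.Matcher;
import java.util.regex.Pattern;

// Hypothetical helper for illustration only; not part of the OWASP sanitizer or ESAPI.
final class LenientNumericRefDecoder {
  // Decimal numeric character reference. The trailing ';' is optional,
  // mirroring the lenient behavior discussed in the thread.
  private static final Pattern DECIMAL_REF = Pattern.compile("&#([0-9]{1,7});?");

  static String decode(String html) {
    Matcher m = DECIMAL_REF.matcher(html);
    StringBuilder out = new StringBuilder();
    int last = 0;
    while (m.find()) {
      // Copy the text between the previous match and this one unchanged.
      out.append(html, last, m.start());
      int codePoint = Integer.parseInt(m.group(1));
      if (Character.isValidCodePoint(codePoint)) {
        out.appendCodePoint(codePoint);
      } else {
        out.append(m.group()); // leave out-of-range references untouched
      }
      last = m.end();
    }
    out.append(html, last, html.length());
    return out.toString();
  }

  public static void main(String[] args) {
    System.out.println(decode("@ test &#33"));   // no semicolon
    System.out.println(decode("@ test &#33;"));  // with semicolon
  }
}
```

Under that assumption, both calls print the same text, `@ test !`, which matches what was observed from the sanitizer for the exclamation mark.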

   var inputEqualsOutput = (html_decode(input) === html_decode(output));
   var result = inputEqualsOutput ? input : output;

So if my decoded input and decoded output are equal, is it safe to return the input instead?

Jim Manico

Mar 26, 2014, 10:10:41 AM
to brad...@gmail.com, owasp-java-html-...@googlegroups.com, mikes...@gmail.com
I'll leave Mike to answer that question, but I think the answer is no :(

- Jim

brad...@gmail.com

Mar 27, 2014, 8:09:57 AM
to owasp-java-html-...@googlegroups.com, brad...@gmail.com, mikes...@gmail.com, j...@manico.net
Thanks guys... and to be clear, by "output" in my previous example I meant the result from the sanitizer. So perhaps my pseudocode would have been better written as:

function sanitizeHtml(input)
{
   var output = owasp_sanitizer(input);
   var inputEqualsOutput = (html_decode(input) === html_decode(output));
   var result = inputEqualsOutput ? input : output;

   return result;
}

The intent would be to sanitize the input and, if the input differed from the output only by entity encodings, to return the input instead. This would allow my users' inputs to be displayed "as they were" most of the time. Mike mentioned elsewhere that this could still be unsafe when inputs can be concatenated together. That isn't the case for me, but I'll definitely keep it in mind moving forward...

> Maybe it is safe individually for
>     input1 = "<"
>     input2 = "img onload=alert(1337)>"
> but it is not the case that
>     html_decode(input1 + input2) == html_decode(sanitize(input1 + input2))
> tests safety.
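
A toy model of the concatenation problem quoted above, sketched under loud assumptions: `escapeText` and `decodeEntities` are hypothetical stand-ins that handle only `&`, `<` and `>`, not the OWASP sanitizer or ESAPI's `decodeForHTML`. The point is only that each piece passes the decode-and-compare check on its own, while the joined result is an active tag.

```java
// Toy model only: escapeText and decodeEntities are hypothetical stand-ins,
// not the OWASP sanitizer or ESAPI's decodeForHTML.
final class PerPieceCheckDemo {
  // Stand-in "sanitizer": treats the whole input as text and escapes it.
  static String escapeText(String s) {
    return s.replace("&", "&amp;").replace("<", "&lt;").replace(">", "&gt;");
  }

  // Stand-in entity decoder; "&amp;" is decoded last to avoid double decoding.
  static String decodeEntities(String s) {
    return s.replace("&lt;", "<").replace("&gt;", ">").replace("&amp;", "&");
  }

  // The wrapper proposed in the thread: keep the raw input when it differs
  // from the sanitized output only by entity encoding.
  static String sanitizeKeepingInput(String input) {
    String output = escapeText(input);
    boolean same = decodeEntities(input).equals(decodeEntities(output));
    return same ? input : output;
  }

  public static void main(String[] args) {
    String part1 = sanitizeKeepingInput("<");
    String part2 = sanitizeKeepingInput("img onload=alert(1337)>");
    // Each part was returned raw, so the concatenation is live markup.
    System.out.println(part1 + part2);
  }
}
```

In this model the program prints `<img onload=alert(1337)>`: both pieces are returned unescaped, so a check that looks harmless per input says nothing about the page assembled from those inputs.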



Mike Samuel

Mar 27, 2014, 8:25:59 AM
to brad...@gmail.com, owasp-java-html-...@googlegroups.com, James Manico
What does html_decode do?

brad...@gmail.com

Mar 27, 2014, 9:58:30 AM
to owasp-java-html-...@googlegroups.com, brad...@gmail.com, James Manico, mikes...@gmail.com
It'd be this owasp decodeForHtml function.... does that make sense?

https://code.google.com/p/owasp-esapi-java/source/browse/trunk/src/main/java/org/owasp/esapi/Encoder.java#275

Mike Samuel

Mar 27, 2014, 10:39:43 AM
to Brad Parks, owasp-java-html-...@googlegroups.com, James Manico
2014-03-27 9:58 GMT-04:00 <brad...@gmail.com>:
> It'd be this owasp decodeForHtml function.... does that make sense?
>
> https://code.google.com/p/owasp-esapi-java/source/browse/trunk/src/main/java/org/owasp/esapi/Encoder.java#275

That makes sense. The answer then is no, it's not safe.

Some browsers drop NULs, so

"<\x00script>"

is treated by some browsers the same as

"<script>"

because they want to avoid differences between strlen(s.c_str()) and s.length().

The sanitizer, like HTML5
(http://www.w3.org/TR/html5/syntax.html#tag-open-state), treats NUL as
a character, so for

input = "<\x00script>alert(1337)<\x00/script>"
output = "&lt;&#0;script&gt;alert(1337)&lt;/script&gt;"

If the only thing done is decoding entities, then

html_decode(input) == input

and

html_decode(output) == input

so the comparison passes. But in browsers that drop NULs,

input.replace("\x00", "") == "<script>alert(1337)</script>"

so it is not safe to use input.
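
The NUL argument can be traced with a small sketch. Everything here is an assumption made for illustration: `decodeEntities` is a hypothetical stand-in that understands only the entities in the example, `dropNuls` models a browser that silently discards NUL characters, and the `sanitized` string is hand-written rather than captured sanitizer output.

```java
// Illustration only; none of these helpers come from the OWASP sanitizer or ESAPI.
final class NulStrippingDemo {
  // Hypothetical entity decoder covering just the entities used below.
  static String decodeEntities(String s) {
    return s.replace("&#0;", "\u0000")
        .replace("&lt;", "<")
        .replace("&gt;", ">")
        .replace("&amp;", "&");
  }

  // Models a browser that drops NUL characters before parsing.
  static String dropNuls(String s) {
    return s.replace("\u0000", "");
  }

  // The decode-and-compare check from the thread.
  static boolean passesDecodeCheck(String input, String sanitized) {
    return decodeEntities(input).equals(decodeEntities(sanitized));
  }

  public static void main(String[] args) {
    String input = "<\u0000script>alert(1337)<\u0000/script>";
    // Assumed encoded form with every NUL written as &#0;.
    String sanitized = "&lt;&#0;script&gt;alert(1337)&lt;&#0;/script&gt;";

    System.out.println(passesDecodeCheck(input, sanitized));
    System.out.println(dropNuls(input));
  }
}
```

With these stand-ins the check reports `true`, yet the NUL-stripped input is `<script>alert(1337)</script>`, which is why passing the comparison does not make the raw input safe to emit.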

brad...@gmail.com

Mar 27, 2014, 11:15:26 AM
to owasp-java-html-...@googlegroups.com, Brad Parks, James Manico, mikes...@gmail.com
Hmmm! Interesting... it's this type of info that's great to know. So is that the only attack that affects this approach? There are probably others, right? And I guess even if there aren't more now, there may be more in the future. So it's probably best to just return the sanitized output, as I expected from the start ;-(

Thanks a ton for your help... it really cleared this up for me, and is much appreciated!

Jim Manico

Mar 27, 2014, 2:29:27 PM
to brad...@gmail.com, owasp-java-html-...@googlegroups.com, mikes...@gmail.com
This is what Mike was trying to say, interpreted in the form of an internet meme.

http://www.memecreator.org/static/images/memes/2552028.jpg

Aloha!
- Jim