Issue 30 in owasp-java-html-sanitizer: "&nbsp;" returns space after sanitize instead of returning same "&nbsp;"


owasp-java-h...@googlecode.com

May 23, 2014, 9:17:20 AM
to owasp-java-html-...@googlegroups.com
Status: New
Owner: ----
Labels: Type-Defect Priority-Medium

New issue 30 by urvishmp...@gmail.com: "&nbsp;" returns space after
sanitize instead of returning same "&nbsp;"
http://code.google.com/p/owasp-java-html-sanitizer/issues/detail?id=30

I cannot say whether this is a bug; maybe the policy we configured is wrong.

If the string contains the HTML entity "&nbsp;", then after sanitizing it
comes back as an (empty) space rather than "&nbsp;", while other entities
such as "&lt;" and "&gt;" appear correctly after sanitizing.

Example:

final String test = "&nbsp;&gt;";

final PolicyFactory policy = Sanitizers.FORMATTING.and(
    Sanitizers.BLOCKS).and(Sanitizers.STYLES);
final String safeHTML = policy.sanitize(test);

System.out.println("Before:" + test);
System.out.println("After:" + safeHTML);

Result:
-------
Before:&nbsp;&gt;
After: &gt;

We actually need "&nbsp;" after sanitizing, so can you provide guidance on
how to achieve this?

Thx in advance!

Kr,
Urvish


owasp-java-h...@googlecode.com

Jun 24, 2014, 12:33:07 PM
to owasp-java-html-...@googlegroups.com

Comment #1 on issue 30 by sproket...@gmail.com: "&nbsp;" returns space
after sanitize instead of returning same "&nbsp;"
http://code.google.com/p/owasp-java-html-sanitizer/issues/detail?id=30

This is a bug for sure. The sanitizer should not replace entities, as they
are the general way of expressing extended characters regardless of
encoding. They are perfectly valid and safe in HTML.

owasp-java-h...@googlecode.com

Jun 24, 2014, 5:10:43 PM
to owasp-java-html-...@googlegroups.com

Comment #2 on issue 30 by mikes...@gmail.com: "&nbsp;" returns space
after sanitize instead of returning same "&nbsp;"
http://code.google.com/p/owasp-java-html-sanitizer/issues/detail?id=30

Are you not seeing codepoint U+A0 in the output?
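
A regular space (U+0020) and a no-break space (U+00A0) look identical when
printed, so the question above is easiest to answer by dumping code points.
A minimal sketch using only the JDK; the class name CodePointDump and the
sample string are illustrative assumptions, with the string standing in for
whatever policy.sanitize(...) returns:

```java
import java.util.stream.Collectors;

final class CodePointDump {
    /** Formats each code point of s as U+XXXX, separated by single spaces. */
    static String describe(String s) {
        return s.codePoints()
                .mapToObj(cp -> String.format("U+%04X", cp))
                .collect(Collectors.joining(" "));
    }

    public static void main(String[] args) {
        // Assumed stand-in for sanitizer output: a no-break space followed by "&gt;".
        String output = "\u00a0&gt;";
        // Prints: U+00A0 U+0026 U+0067 U+0074 U+003B
        System.out.println(describe(output));
        // A plain space would show up as U+0020 instead.
        System.out.println(describe(" "));
    }
}
```

If the first code point is U+00A0, the character was decoded rather than
dropped; it only looks like an ordinary space on the console.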

owasp-java-h...@googlecode.com

Jun 25, 2014, 6:09:26 AM
to owasp-java-html-...@googlegroups.com

Comment #3 on issue 30 by sproket...@gmail.com: "&nbsp;" returns space
after sanitize instead of returning same "&nbsp;"
http://code.google.com/p/owasp-java-html-sanitizer/issues/detail?id=30

I'm seeing "?" in the output. We use CKEditor, which has a feature for
inserting extended characters; it inserts them as HTML entities.

You can see the ckeditor demo here:
http://ckeditor.com/demo

If the user does this, the result after sanitizing replaces these entities
with "?". This occurs whether I use the entity name or number. E.g.

<p>blah blah blah &diams;</p>
<p>blah blah blah &#9830;!</p>
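
One common way for a "?" to appear, offered only as a possibility since the
thread does not establish the cause here: the sanitized string contains the
literal character U+2666, and it is later encoded with a charset that cannot
represent it. A minimal JDK-only sketch; the class name UnmappableDemo and
the choice of ISO-8859-1 are assumptions made for illustration:

```java
import java.nio.charset.Charset;
import java.nio.charset.StandardCharsets;

final class UnmappableDemo {
    /** Encodes s with the given charset, then decodes the bytes with the same charset. */
    static String roundTrip(String s, Charset cs) {
        // String.getBytes(Charset) substitutes the charset's replacement byte
        // for characters it cannot encode; for ISO-8859-1 that byte is '?'.
        return new String(s.getBytes(cs), cs);
    }

    public static void main(String[] args) {
        // Assumed sanitizer output: the entity has been decoded to U+2666.
        String sanitized = "<p>blah blah blah \u2666</p>";
        // Prints: <p>blah blah blah ?</p>
        System.out.println(roundTrip(sanitized, StandardCharsets.ISO_8859_1));
        // UTF-8 can represent the character, so the string survives unchanged.
        System.out.println(roundTrip(sanitized, StandardCharsets.UTF_8).equals(sanitized));
    }
}
```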

owasp-java-h...@googlecode.com

Jun 25, 2014, 9:09:33 AM
to owasp-java-html-...@googlegroups.com

Comment #4 on issue 30 by mikes...@gmail.com: "&nbsp;" returns space
after sanitize instead of returning same "&nbsp;"
http://code.google.com/p/owasp-java-html-sanitizer/issues/detail?id=30

I don't know what CKEditor has to do with this bug, and you talk about '?'
in the output but to understand the output, I'd have to actually see the
headers and meta-content of the response you're serving.

I added a test (see patch below) and entities seem to work fine. The HTML
sanitizer decodes entities just fine and normalizes them in the output.

As long as your content-type header's charset matches the charset you used
to encode your string, then the content should reach the browser just fine.

Index: src/tests/org/owasp/html/HtmlPolicyBuilderTest.java
===================================================================
--- src/tests/org/owasp/html/HtmlPolicyBuilderTest.java (revision 235)
+++ src/tests/org/owasp/html/HtmlPolicyBuilderTest.java (working copy)
@@ -282,6 +282,17 @@
             "<select>\n <option>1</option>\n <option>2</option>\n</select>"));
   }
 
+  @Test
+  public static final void testEntities() throws Exception {
+    assertEquals(
+        "(Foo)\u00a0(Bar)\u2666\u2666\u2666\u2666(Baz)"
+        + "&#x14834;&#x14834;&#x14834;(Boo)",
+        apply(
+            new HtmlPolicyBuilder(),
+            "(Foo)&nbsp;(Bar)&diams;&#9830;&#x2666;&#X2666;(Baz)"
+            + "\ud812\udc34&#x14834;&#x014834;(Boo)"));
+  }
+
   private static String apply(HtmlPolicyBuilder b) throws Exception {
     return apply(b, EXAMPLE);

owasp-java-h...@googlecode.com

Jun 25, 2014, 11:10:16 AM
to owasp-java-html-...@googlegroups.com

Comment #5 on issue 30 by sproket...@gmail.com: "&nbsp;" returns space
after sanitize instead of returning same "&nbsp;"
http://code.google.com/p/owasp-java-html-sanitizer/issues/detail?id=30

I added a test to HtmlSanitizerTest:

@Test
public static final void testDiamond() throws Exception {
  assertEquals(
      "<p>blah blah blah &diams;</p>",
      sanitize("<p>blah blah blah &diams;</p>"));
}

I get:

junit.framework.ComparisonFailure: null
Expected :<p>blah blah blah &diams;</p>
Actual :<p>blah blah blah ♦</p>

owasp-java-h...@googlecode.com

Jul 5, 2014, 4:28:19 PM
to owasp-java-html-...@googlegroups.com

Comment #6 on issue 30 by mikes...@gmail.com: "&nbsp;" returns space
after sanitize instead of returning same "&nbsp;"
http://code.google.com/p/owasp-java-html-sanitizer/issues/detail?id=30

And what I don't understand is

> Expected :<p>blah blah blah &diams;</p>
> Actual :<p>blah blah blah ♦</p>

Why is the actual result a problem?

owasp-java-h...@googlecode.com

Jul 6, 2014, 6:53:48 AM
to owasp-java-html-...@googlegroups.com

Comment #7 on issue 30 by sproket...@gmail.com: "&nbsp;" returns space
after sanitize instead of returning same "&nbsp;"
http://code.google.com/p/owasp-java-html-sanitizer/issues/detail?id=30

The issue is text portability. Named entities keep the text as plain ASCII,
which is universal. In our situation we send various HTML content to
third-party systems, so we can't guarantee a common Unicode encoding for
special characters. For now we have added Jericho as a dependency to reparse
the output and convert these characters back to entities.

Unless you think there's some security issue with entities, I think they
should be preserved. The best option would probably be a setting for this in
HtmlPolicyBuilder.

Thanks
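
The reparse step described above can be approximated without an extra
dependency when numeric references are acceptable in place of named ones. A
minimal sketch, assuming it is applied to output that has already been
sanitized; the class AsciiEntities and the method escapeNonAscii are
hypothetical names and are not part of the sanitizer or of Jericho:

```java
final class AsciiEntities {
    /**
     * Replaces every code point above U+007F with a hexadecimal numeric
     * character reference, leaving ASCII (including markup) untouched.
     */
    static String escapeNonAscii(String html) {
        StringBuilder out = new StringBuilder(html.length());
        html.codePoints().forEach(cp -> {
            if (cp > 0x7F) {
                out.append("&#x").append(Integer.toHexString(cp)).append(';');
            } else {
                out.appendCodePoint(cp);
            }
        });
        return out.toString();
    }

    public static void main(String[] args) {
        // Assumed sanitizer output containing a literal U+2666.
        String sanitized = "<p>blah blah blah \u2666</p>";
        // Prints: <p>blah blah blah &#x2666;</p>
        System.out.println(escapeNonAscii(sanitized));
    }
}
```

The result is pure ASCII, so it no longer depends on the receiving system's
charset. Producing named entities such as &diams; instead would need a
lookup table from code point to entity name, which this sketch omits.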

owasp-java-h...@googlecode.com

Oct 11, 2014, 1:42:41 PM
to owasp-java-html-...@googlegroups.com

Comment #8 on issue 30 by elenn...@gmail.com: "&nbsp;" returns space after
sanitize instead of returning same "&nbsp;"
https://code.google.com/p/owasp-java-html-sanitizer/issues/detail?id=30

I'm having the same issue: &apos; is replaced with its code &#39;, and @
with &#64;.
Version: r239
What should I do to avoid this behaviour?

owasp-java-h...@googlecode.com

Oct 11, 2014, 2:29:07 PM
to owasp-java-html-...@googlegroups.com

Comment #9 on issue 30 by elenn...@gmail.com: "&nbsp;" returns space after
sanitize instead of returning same "&nbsp;"

The adventures of &apos; from the input:

Version: r239
org.owasp.html.Encoding
Lines 55-60:
it calls the method HtmlEntities.decodeEntityAt(s, amp, n);
it computes code point 39 for &apos;, so it puts the ' character into the
StringBuilder instead of &apos;...
After that, in the encodeHtmlOnto method, lines 168-176,
it replaces ' again, this time with its code from String[] REPLACEMENTS
(which is &#39;).
But I still don't understand why, or how to easily add &apos; to the allowed
symbols.
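
The decode-then-re-encode sequence traced above can be modelled in a few
lines. This is a simplified sketch for illustration, not the library's code:
the class DecodeEncodeModel, the NAMED table and the set of characters
escaped in encode are all assumptions that stand in for HtmlEntities and
REPLACEMENTS.

```java
import java.util.Map;

final class DecodeEncodeModel {
    // Hypothetical subset of a named-entity table (name -> code point).
    private static final Map<String, Integer> NAMED = Map.of(
            "apos", 39, "quot", 34, "amp", 38, "lt", 60, "gt", 62,
            "nbsp", 0xA0, "diams", 0x2666);

    /** Resolves "apos", "#39" or "#x27" to a code point, or null if unknown. */
    private static Integer lookup(String name) {
        if (!name.startsWith("#")) {
            return NAMED.get(name);
        }
        try {
            boolean hex = name.length() > 1
                    && (name.charAt(1) == 'x' || name.charAt(1) == 'X');
            int cp = hex ? Integer.parseInt(name.substring(2), 16)
                         : Integer.parseInt(name.substring(1));
            if (Character.isValidCodePoint(cp)) {
                return cp;
            }
            return null;
        } catch (NumberFormatException e) {
            return null;
        }
    }

    /** Step 1: turn every recognised entity into the character it denotes. */
    static String decode(String s) {
        StringBuilder out = new StringBuilder();
        int i = 0;
        while (i < s.length()) {
            char c = s.charAt(i);
            int semi = (c == '&') ? s.indexOf(';', i) : -1;
            Integer cp = (semi > i) ? lookup(s.substring(i + 1, semi)) : null;
            if (cp != null) {
                out.appendCodePoint(cp);
                i = semi + 1;
            } else {
                out.append(c);
                i++;
            }
        }
        return out.toString();
    }

    /** Step 2: escape only the characters in a fixed replacement table. */
    static String encode(String s) {
        StringBuilder out = new StringBuilder();
        for (int i = 0; i < s.length(); i++) {
            char c = s.charAt(i);
            switch (c) {
                case '&': out.append("&amp;"); break;
                case '<': out.append("&lt;"); break;
                case '>': out.append("&gt;"); break;
                case '"': out.append("&#34;"); break;
                case '\'': out.append("&#39;"); break;
                case '@': out.append("&#64;"); break;
                default: out.append(c);
            }
        }
        return out.toString();
    }

    static String roundTrip(String s) {
        return encode(decode(s));
    }

    public static void main(String[] args) {
        // Prints: it&#39;s
        System.out.println(roundTrip("it&apos;s"));
        // Prints: user&#64;example.com
        System.out.println(roundTrip("user@example.com"));
    }
}
```

Because the original spelling of an entity is discarded in the first step,
the second step cannot know that &apos; was written rather than &#39; or a
literal apostrophe. In this model, preserving the input spelling would mean
carrying that information through both steps.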

owasp-java-h...@googlecode.com

Oct 13, 2014, 6:23:43 PM
to owasp-java-html-...@googlegroups.com

Comment #10 on issue 30 by mikes...@gmail.com: "&nbsp;" returns space
after sanitize instead of returning same "&nbsp;"
https://code.google.com/p/owasp-java-html-sanitizer/issues/detail?id=30

Why are these replacements problems?

owasp-java-h...@googlecode.com

Oct 13, 2014, 7:11:06 PM
to owasp-java-html-...@googlegroups.com

Comment #11 on issue 30 by sproket...@gmail.com: "&nbsp;" returns space
after sanitize instead of returning same "&nbsp;"
https://code.google.com/p/owasp-java-html-sanitizer/issues/detail?id=30

Numerical entities would be interpreted differently depending on unicode.
Named entities should really be preserved for text portability.

https://en.wikipedia.org/wiki/List_of_XML_and_HTML_character_entity_references

owasp-java-h...@googlecode.com

Oct 14, 2014, 4:31:15 AM
to owasp-java-html-...@googlegroups.com

Comment #12 on issue 30 by elenn...@gmail.com: "&nbsp;" returns space after
sanitize instead of returning same "&nbsp;"

I agree with sproket...@gmail.com.
Character entity references in HTML are already symbol references, and they
are widely used.
It would be great if they were simply left untouched, perhaps via an optional
setting in HtmlEntities.
Text processing would be more predictable and reversible in that case.

owasp-java-h...@googlecode.com

Jan 28, 2015, 11:41:19 AM
to owasp-java-html-...@googlegroups.com

Comment #13 on issue 30 by Fox...@gmail.com: "&nbsp;" returns space after
sanitize instead of returning same "&nbsp;"

Replacing &nbsp; with a Unicode no-break space causes problems for me because
I use UTF-8 encoding, and when I UTF-8 encode a Unicode no-break space it
results in Â. If I pass in "&nbsp;" then I want to get "&nbsp;" back. Why
bother replacing it?
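
A stray "Â" next to a no-break space is the usual symptom of UTF-8 bytes
being read back as a single-byte charset. The sketch below reproduces that
effect with the JDK alone; the class name NbspMojibake and the choice of
ISO-8859-1 as the mismatched charset are assumptions, since the comment does
not say how the bytes were decoded.

```java
import java.nio.charset.StandardCharsets;

final class NbspMojibake {
    /** Renders bytes as upper-case hex pairs separated by spaces. */
    static String hex(byte[] bytes) {
        StringBuilder sb = new StringBuilder();
        for (byte b : bytes) {
            if (sb.length() > 0) {
                sb.append(' ');
            }
            sb.append(String.format("%02X", b & 0xFF));
        }
        return sb.toString();
    }

    /** Encodes s as UTF-8, then (wrongly) decodes those bytes as ISO-8859-1. */
    static String misread(String s) {
        return new String(s.getBytes(StandardCharsets.UTF_8), StandardCharsets.ISO_8859_1);
    }

    public static void main(String[] args) {
        String nbsp = "\u00a0";
        // U+00A0 takes two bytes in UTF-8. Prints: C2 A0
        System.out.println(hex(nbsp.getBytes(StandardCharsets.UTF_8)));
        // Read as ISO-8859-1, byte C2 becomes "Â" and byte A0 becomes a no-break space.
        System.out.println(misread(nbsp).equals("\u00c2\u00a0"));
        // Decoding with the charset that was used for encoding restores the original.
        byte[] utf8 = nbsp.getBytes(StandardCharsets.UTF_8);
        System.out.println(new String(utf8, StandardCharsets.UTF_8).equals(nbsp));
    }
}
```

This matches the earlier advice in the thread that the declared charset has
to match the charset actually used to encode the string.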