Strange behavior.

146 views
Skip to first unread message

Johannes Lichtenberger

unread,
Jun 25, 2014, 4:15:32 AM6/25/14
to owasp-java-html-...@googlegroups.com
Hello,

I'm sanitizing the following String (with the standard policies and a policy which allows tables and text inside table-elements[1]) : "<table>Hallo\r\n<script>SCRIPT</script>\nEnde\n\r"

Expected output is:"<table>Hallo\r\n\nEnde\n\r"

There probably is something wrong in the Encoding-class:

  static void encodeHtmlOnto(String plainText, Appendable output)
      throws IOException {
    int n = plainText.length();
    int pos = 0;
    for (int i = 0; i < n; ++i) {
      char ch = plainText.charAt(i);
      if (ch < REPLACEMENTS.length) {
        String repl = REPLACEMENTS[ch];
        if (repl != null) {

It's just replacing the characters from the String "Hello" with null it seems.

kind regards
Johannes

[1] http://pastebin.com/kSYasg1U

Johannes Lichtenberger

unread,
Jun 25, 2014, 4:47:59 AM6/25/14
to owasp-java-html-...@googlegroups.com
I think \r and \n is not permitted, that is the expected output probably should be "<table>HalloEnde" or it might even close the table-Element _after_ the character content?

Jim Manico

unread,
Jun 25, 2014, 5:08:15 AM6/25/14
to owasp-java-html-...@googlegroups.com
Johannes,

Thank you for all of your astute feedback. I'll triage these into bug reports as soon as I can. I'm certain Mr. Samuel will respond as soon as he returns from leave.

I'll give you an update in a week at the latest.

Aloha,
--
Jim Manico
@Manicode
--
You received this message because you are subscribed to the Google Groups "OWASP Java HTML Sanitizer Support" group.
To unsubscribe from this group and stop receiving emails from it, send an email to owasp-java-html-saniti...@googlegroups.com.
For more options, visit https://groups.google.com/d/optout.

Johannes Lichtenberger

unread,
Jun 25, 2014, 5:38:46 AM6/25/14
to owasp-java-html-...@googlegroups.com
Thank you, BTW the problem seems to be that the table-element is closed before the characters are written even if the input contains a table end-tag. That's probably strange.

kind regards
Johannes
To unsubscribe from this group and stop receiving emails from it, send an email to owasp-java-html-sanitizer-support+unsubscribe@googlegroups.com.

Jim Manico

unread,
Jun 25, 2014, 8:47:12 AM6/25/14
to owasp-java-html-...@googlegroups.com
Can you kindly send us a copy of the policy that you are using Johannes?

Aloha,
Jim
To unsubscribe from this group and stop receiving emails from it, send an email to owasp-java-html-saniti...@googlegroups.com.

Johannes Lichtenberger

unread,
Jun 25, 2014, 9:16:39 AM6/25/14
to owasp-java-html-...@googlegroups.com, j...@manico.net
Strange, it must be the prepareForContent(ElementContainmentInfo)-Method in TagBalancingHtmlStreamEventReceiver or that this method is called at all. There the opened table-element is closed before the text is written.

      if (ch > 0x20 || (HTML_SPACE_CHAR_BITMASK & (1 << ch)) == 0) {
        prepareForContent(ElementContainmentRelationships.CHARACTER_DATA_ONLY);
        break;
      }

It's the Java-class I've added and then simply

Sanitizers.BLOCKS.and(Sanitizers.FORMATTING)
                                           .and(Sanitizers.LINKS)
                                           .and(Sanitizers.STYLES)
                                           .and(Sanitizers.IMAGES)
                                           .and(XssProtectionRendererPolicies.TABLES).sanitize("<table>Hallo\r\n<script>SCRIPT</script>\nEnde\n\r");
XssProtectionRendererPolicies.java

Mike Samuel

unread,
Jun 25, 2014, 9:38:01 AM6/25/14
to owasp-java-html-...@googlegroups.com
<table> is an odd beast in that non-table-content nodes inside it are
hoisted out of it. I'll look into it.

2014-06-25 5:38 GMT-04:00 Johannes Lichtenberger
<lichtenberg...@gmail.com>:
>> email to owasp-java-html-saniti...@googlegroups.com.
>> For more options, visit https://groups.google.com/d/optout.
>
> --
> You received this message because you are subscribed to the Google Groups
> "OWASP Java HTML Sanitizer Support" group.
> To unsubscribe from this group and stop receiving emails from it, send an
> email to owasp-java-html-saniti...@googlegroups.com.

David Kitchen

unread,
Jun 25, 2014, 9:38:48 AM6/25/14
to owasp-java-html-...@googlegroups.com
If you close the table tag the sanitizer will not attempt to close it early.

It expects well-formatted input, but does it's best to clean it up if the input isn't well-formatted.

Input: "<table>Hallo\n<script>SCRIPT</script>\nEnde</table>\n"

Will produce an output closer to your expectation.
To unsubscribe from this group and stop receiving emails from it, send an email to owasp-java-html-saniti...@googlegroups.com.

Johannes Lichtenberger

unread,
Jun 25, 2014, 9:52:51 AM6/25/14
to owasp-java-html-...@googlegroups.com
David, are you sure? My unit test still produces (with a sysout), the two newlines:

"<table></table>Hallo

Ende"

I'm using r239 from the maven repository.

kind regards
Johannes

David Kitchen

unread,
Jun 26, 2014, 7:44:01 AM6/26/14
to owasp-java-html-...@googlegroups.com
Johannes,

That was my error, I sloppily typed in a short version in the email but I had tested a properly nested table with closed tags.

Tables are closing in a valid (HTML) way, and a text node and SCRIPT are both invalid in the context of table, so the safe way to close the unclosed table tag is to do it immediately, which is what the sanitizer is doing.


On 25 June 2014 14:52, Johannes Lichtenberger <lichtenberg...@gmail.com> wrote:
David, are you sure? My unit test still produces (with a sysout), the two newlines:

"<table></table>Hallo

Ende"

I'm using r239 from the maven repository.

kind regards
Johannes

Am Mittwoch, 25. Juni 2014 15:38:48 UTC+2 schrieb David Kitchen:
If you close the table tag the sanitizer will not attempt to close it early.

It expects well-formatted input, but does it's best to clean it up if the input isn't well-formatted.

Input: "<table>Hallo\n<script>SCRIPT</script>\nEnde</table>\n"

Will produce an output closer to your expectation.
To unsubscribe from this group and stop receiving emails from it, send an email to owasp-java-html-saniti...@googlegroups.com.

Mike Samuel

unread,
Jun 26, 2014, 9:09:42 AM6/26/14
to owasp-java-html-...@googlegroups.com
Johannes, I suspect something widgy is happening with the code that
suppresses text output when inside an element like <script>/<style>,
but I'm having trouble repeating the bug.

@Test
public static final void testScriptInTable() {
String input = "<table>Hallo\r\n<script>SCRIPT</script>\nEnde\n\r";
PolicyFactory pf = Sanitizers.BLOCKS.and(Sanitizers.FORMATTING)
.and(Sanitizers.LINKS)
.and(Sanitizers.STYLES)
.and(Sanitizers.IMAGES)
.and(Sanitizers.TABLES);
assertEquals("<table></table>Hallo\r\n\nEnde\n\r", pf.sanitize(input));
}

passes for me where Sanitizers.TABLES is defines thus:

/**
* Allows common table elements.
*/
public static final PolicyFactory TABLES = new HtmlPolicyBuilder()
.allowStandardUrlProtocols()
.allowElements(
"table", "tr", "td", "th",
"colgroup", "caption", "col",
"thead", "tbody", "tfoot")
.allowAttributes("summary").onElements("table")
.allowAttributes("align", "valign")
.onElements("table", "tr", "td", "th",
"colgroup", "col",
"thead", "tbody", "tfoot")
.toFactory();


Can you see where I might be doing something materially different?


2014-06-25 9:16 GMT-04:00 Johannes Lichtenberger
<lichtenberg...@gmail.com>:
>>> email to owasp-java-html-saniti...@googlegroups.com.
>>> For more options, visit https://groups.google.com/d/optout.
>>
>> --
>> You received this message because you are subscribed to the Google Groups
>> "OWASP Java HTML Sanitizer Support" group.
>> To unsubscribe from this group and stop receiving emails from it, send an
>> email to owasp-java-html-saniti...@googlegroups.com.
>> For more options, visit https://groups.google.com/d/optout.
>>
>>
> --
> You received this message because you are subscribed to the Google Groups
> "OWASP Java HTML Sanitizer Support" group.
> To unsubscribe from this group and stop receiving emails from it, send an
> email to owasp-java-html-saniti...@googlegroups.com.

Johannes Lichtenberger

unread,
Jun 27, 2014, 2:37:43 AM6/27/14
to owasp-java-html-...@googlegroups.com, mikes...@gmail.com
Hi Mike,

that's the problem I guess. Even if the input is closed by a table end-tag (<table>Hallo\r\n...</table>) and character content is allowed inside the table-element the table-element is closed immediately after the table start-tag. At least with the input:

<table>Hallo<script>SCRIPT</script>Ende</table>

and my table-policy with character content allowed (.allowTextIn("table")) I would expect the output:

<table>HalloEnde</table>

But if I'm not really mistaken it's still

<table></table>HalloEnde

kind regards
Johannes
>>> For more options, visit https://groups.google.com/d/optout.
>>
>> --
>> You received this message because you are subscribed to the Google Groups
>> "OWASP Java HTML Sanitizer Support" group.
>> To unsubscribe from this group and stop receiving emails from it, send an
>> For more options, visit https://groups.google.com/d/optout.
>>
>>
> --
> You received this message because you are subscribed to the Google Groups
> "OWASP Java HTML Sanitizer Support" group.
> To unsubscribe from this group and stop receiving emails from it, send an

Jim Manico

unread,
Jun 27, 2014, 3:14:57 AM6/27/14
to owasp-java-html-...@googlegroups.com, mikes...@gmail.com
Johannes.

I'm not an expert on the exact standard, but in 15 years I've never put content between table tags, only cells <td>or<th>. It looks very strange to me to see non markup content between two table tags like you are suggesting. Heck, I'd even balk at putting content between row tags <tr>.

Regards,

--
Jim Manico
@Manicode
>>> For more options, visit https://groups.google.com/d/optout.
>>
>> --
>> You received this message because you are subscribed to the Google Groups
>> "OWASP Java HTML Sanitizer Support" group.
>> To unsubscribe from this group and stop receiving emails from it, send an
>> For more options, visit https://groups.google.com/d/optout.
>>
>>
> --
> You received this message because you are subscribed to the Google Groups
> "OWASP Java HTML Sanitizer Support" group.
> To unsubscribe from this group and stop receiving emails from it, send an
> For more options, visit https://groups.google.com/d/optout.

--
You received this message because you are subscribed to the Google Groups "OWASP Java HTML Sanitizer Support" group.
To unsubscribe from this group and stop receiving emails from it, send an email to owasp-java-html-saniti...@googlegroups.com.

Johannes Lichtenberger

unread,
Jun 27, 2014, 3:40:47 AM6/27/14
to owasp-java-html-...@googlegroups.com, mikes...@gmail.com
Yes, me too, but we have a strange test case, which just tests this String, and we've used a whitelist based approach before. I'm not sure why this test case exists, but probably there's a reason and I just want the test to pass again ;-)

Though I don't know why, but it must be possible to allow such things. I think allowTextIn(String) must be used to allow this kind of "strange" markup, but it somehow seems it doesn't work.


      if (ch > 0x20 || (HTML_SPACE_CHAR_BITMASK & (1 << ch)) == 0) {
        prepareForContent(
ElementContainmentRelationships.CHARACTER_DATA_ONLY);
        break;
      }

The prepareForContent(...)-method closes the table-element which is on the stack.

kind regards
Johannes
>>> For more options, visit https://groups.google.com/d/optout.
>>
>> --
>> You received this message because you are subscribed to the Google Groups
>> "OWASP Java HTML Sanitizer Support" group.
>> To unsubscribe from this group and stop receiving emails from it, send an
>> For more options, visit https://groups.google.com/d/optout.
>>
>>
> --
> You received this message because you are subscribed to the Google Groups
> "OWASP Java HTML Sanitizer Support" group.
> To unsubscribe from this group and stop receiving emails from it, send an
> For more options, visit https://groups.google.com/d/optout.

--
You received this message because you are subscribed to the Google Groups "OWASP Java HTML Sanitizer Support" group.
To unsubscribe from this group and stop receiving emails from it, send an email to owasp-java-html-sanitizer-support+unsubscribe@googlegroups.com.

Jim Manico

unread,
Jun 27, 2014, 5:12:48 AM6/27/14
to owasp-java-html-...@googlegroups.com, mikes...@gmail.com
Maybe the test case is wrong?

<table>content<table> does not validate as good HTML. In firefox Gecko provides an error, "Misplaced non-space characters inside a table".

In the w3c html verification engine, I get "character data is not allowed here".

Aloha,
--
Jim Manico
@Manicode
To unsubscribe from this group and stop receiving emails from it, send an email to owasp-java-html-saniti...@googlegroups.com.

Johannes Lichtenberger

unread,
Jul 2, 2014, 3:04:32 AM7/2/14
to owasp-java-html-...@googlegroups.com, mikes...@gmail.com, j...@manico.net
Maybe, but nontheless it should be possible to whitelist character content in the table-element and it is possible to whitelist character content in other elements with allowTextIn(String), for instance in the style-element.

kind regards
Johannes

Jim Manico

unread,
Jul 4, 2014, 11:25:44 AM7/4/14
to owasp-java-html-...@googlegroups.com, mikes...@gmail.com, j...@manico.net
Maybe, but nontheless it should be possible to whitelist character content in the table-element and it is possible to whitelist character content in other elements with allowTextIn(String), for instance in the style-element.

I fully agree with that.

Can we also agree that <table>content<table> is fully invalid markup? I want to make sure that HTML Sanitizer ONLY returns valid markup regardless of the input.

Aloha,
Jim
To unsubscribe from this group and stop receiving emails from it, send an email to owasp-java-html-saniti...@googlegroups.com.

Johannes Lichtenberger

unread,
Jul 7, 2014, 7:56:23 AM7/7/14
to owasp-java-html-...@googlegroups.com, mikes...@gmail.com, j...@manico.net
Am Freitag, 4. Juli 2014 17:25:44 UTC+2 schrieb Jim Manico:
Maybe, but nontheless it should be possible to whitelist character content in the table-element and it is possible to whitelist character content in other elements with allowTextIn(String), for instance in the style-element.

Okay, I've also deactivated the test case (at least for a certain time) and I have no clue why it even exists, but probably I'll just remove the test case as it makes no sense.

kind regards
Johannes
Reply all
Reply to author
Forward
0 new messages