href is removed from <a> tag if it contains a newline character


rasm...@gmail.com

Dec 21, 2017, 8:12:09 AM
to OWASP Java HTML Sanitizer Support
Hi,
href is removed from the <a> tag on sanitization.
Input:
"<a style=\"text-decoration: none\"" + "target=\"_blank\"" + "\\r\\n" + "href=\"http://abc.com\""+ ">" + "<span style=\"font-size: medium;\">xxxxxxxxx</span></a>",
Output:
<a style="text-decoration:none" target="_blank"><span style="font-size:medium">xxxxxxxxx</span></a>

If there is no "\\r\\n" in the input string, the sanitizer doesn't remove the href, as in the example below.
Input:
"<a style=\"text-decoration: none\"" + "target=\"_blank\" " + "href=\"http://abc.com\""+ ">" + "<span style=\"font-size: medium;\">XXXX</span></a>",
Output:
<a style="text-decoration:none" target="_blank" href="http://abc.com" rel="noopener noreferrer"><span style="font-size:medium">XXXX</span></a>

Please let me know whether this works as designed or it's a bug in the library.

Thanks


Mike Samuel

Dec 21, 2017, 8:15:23 AM
to OWASP Java HTML Sanitizer Support
"\\r\\n" encodes the four-character string "\r\n", not CRLF.

IIUC, the string seen by the parser is
<a style="text-decoration: none"
target="_blank"\r\nhref="http://abc.com"><span style="font-size:
medium;">xxxxxxxxx</span></a>
so the a tag has 3 attributes
1. style
2. target
3. \r\nhref

The last gets stripped because it has not been whitelisted.
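
As a minimal illustration of the distinction above, the following Java sketch compares the two literals. The class and constant names are hypothetical and are not part of the sanitizer.

```java
// Hypothetical demo class; not part of the OWASP sanitizer.
final class EscapeDemo {
    // Four characters: backslash, 'r', backslash, 'n'.
    static final String ESCAPED = "\\r\\n";
    // Two characters: carriage return, line feed.
    static final String CRLF = "\r\n";

    public static void main(String[] args) {
        System.out.println(ESCAPED.length()); // prints 4
        System.out.println(CRLF.length());    // prints 2
        // The escaped form starts with a literal backslash, which is not whitespace.
        System.out.println(Character.isWhitespace(ESCAPED.charAt(0))); // prints false
        // A real carriage return is whitespace, so it can separate attributes.
        System.out.println(Character.isWhitespace(CRLF.charAt(0)));    // prints true
    }
}
```

Only the second form contains characters that an HTML parser treats as whitespace between attributes.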

Rasmita Mahapatra

Jan 4, 2018, 7:14:54 AM
to OWASP Java HTML Sanitizer Support
Hi,

Thanks for the response. 

I tried adding "\r\nhref@a" and "\\r\\nhref@a" to the whitelist, but the sanitizer still strips out the href.
Input string:
"<a style=\"text-decoration: none\" target=\"_blank\" " + "\\r\\n" + "href=\"http://abc.com\""+ "><span style=\"font-size: medium;\">XXXX</span></a>"

Assertion result:
expected [<a style="text-decoration: none" target="_blank" href="http://abc.com"><span style="font-size: medium;">XXXX</span></a>] but found [<a style="text-decoration:none" target="_blank"><span style="font-size:medium">XXXX</span></a>]

Please help.
Thanks
Rasmita

Mike Samuel

Jan 4, 2018, 8:20:06 AM
to owasp-java-html-...@googlegroups.com


On Jan 4, 2018 7:14 AM, "Rasmita Mahapatra" <rasm...@gmail.com> wrote:
> I tried adding "\r\nhref@a" and "\\r\\nhref@a" to the whitelist, but still
> the sanitizer is stripping out the href

Why would you add those to a whitelist?

Rasmita Mahapatra

Jan 8, 2018, 10:53:05 AM
to OWASP Java HTML Sanitizer Support

Let me put it this way. I have HTML like the following:
<p>Link <a 
href="http://www.unimi.it">UniMi</a><br>
</p>
There is a newline between the <a tag and href, and when CRLF is encountered it is replaced with "\r\n" before it is passed to the sanitizer. The sanitizer then removes the href attribute completely. After the CRLF is replaced, the HTML becomes:
<p>Link <a  \r\nhref="http://www.unimi.it">UniMi</a><br>
</p>
Please tell me how I can instruct the sanitizer not to remove href when it is preceded by "\r\n".
Thanks
Rasmita

Mike Samuel

Jan 8, 2018, 10:58:31 AM
to OWASP Java HTML Sanitizer Support
On Mon, Jan 8, 2018 at 6:33 AM, Rasmita Mahapatra <rasm...@gmail.com> wrote:
>
> when CRLF is encountered it is replaced with "\r\n" before it is passed to the sanitizer.

IIUC, something else is replacing "\r\n" with "\\r\\n".
This seems like the problem you need to solve.

If you want the behavior you expect of the sanitizer, you need to give it
the kind of input it expects.
The sanitizer takes a string of HTML and produces a string of sanitized HTML.
It doesn't take a string containing a C-style string literal token that
embeds HTML.
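
If the upstream step that escapes the line breaks cannot be removed, one stopgap is to undo the escaping before sanitizing. The sketch below is an assumption about how that could look; `CrlfUnescaper` and `unescapeCrlf` are hypothetical names, not sanitizer APIs. Note that it would also rewrite any legitimate text that happens to contain a literal backslash-r-backslash-n sequence, so fixing the upstream step is preferable.

```java
// Hypothetical helper; not part of the OWASP sanitizer.
final class CrlfUnescaper {
    // Turns the literal four-character sequence \r\n back into a real CRLF.
    static String unescapeCrlf(String html) {
        return html.replace("\\r\\n", "\r\n");
    }

    public static void main(String[] args) {
        // The HTML as it arrives after the upstream step has escaped the line break.
        String mangled = "<p>Link <a  \\r\\nhref=\"http://www.unimi.it\">UniMi</a><br>\n</p>";
        String html = unescapeCrlf(mangled);
        // html now contains a real line break before href and can be passed to the sanitizer.
        System.out.println(html);
    }
}
```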

Rasmita Mahapatra

Jan 11, 2018, 6:49:28 AM
to OWASP Java HTML Sanitizer Support
When there is a space after the "\r\n", as in "\r\n href", the sanitizer does not strip the href. When there is no space between the control characters and href, as in "\r\nhref", the sanitizer strips it. Basically, the sanitizer does not recognize "\r\n" unless there is a space before and after it. This looks like a bug to me.
Thanks
Rasmita

Jim Manico

Jan 11, 2018, 10:03:24 AM
to owasp-java-html-...@googlegroups.com, Rasmita Mahapatra

Can you try replacing \r\n with an empty string "" before sending the input into the sanitizer, as a quick fix while Mike reviews this more?

Aloha, Jim


Mike Samuel

Jan 11, 2018, 10:05:15 AM
to OWASP Java HTML Sanitizer Support
On Thu, Jan 11, 2018 at 6:49 AM, Rasmita Mahapatra <rasm...@gmail.com> wrote:
> When there is a space after the "\r\n", as in "\r\n href", the sanitizer
> does not strip the href. When there is no space between the control
> characters and href, as in "\r\nhref", the sanitizer strips it. Basically,
> the sanitizer does not recognize "\r\n" unless there is a space before and
> after it. This looks like a bug to me.

When I view the following HTML in my browser, the first is not a link
and the rest are.

<style>li { white-space: pre }</style>
<ul>
<li><a \r\nhref="#">a \r\nhref</a></li>
<li><a \r\n href="#">a \r\n href</a></li>
<li><a
href="#">a
href</a></li>
</ul>

If I understand correctly, you want the first to display as a link.
But that's not the way browsers do it.
Do not convert CRLF to C escape sequences before passing it to the sanitizer.
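
The three cases above can be approximated with a deliberately simplified sketch of how attribute names are separated inside a start tag. This is not the sanitizer's parser or a full HTML tokenizer: `AttrNameSketch` and `attributeNames` are hypothetical, and the code assumes attribute values contain no whitespace.

```java
import java.util.ArrayList;
import java.util.List;

// Hypothetical, simplified illustration; not the sanitizer's real parser.
final class AttrNameSketch {
    // Splits the attribute portion of a start tag on HTML whitespace
    // (space, tab, LF, FF, CR) and returns the text before '=' in each piece.
    static List<String> attributeNames(String attrs) {
        List<String> names = new ArrayList<>();
        for (String token : attrs.split("[ \\t\\n\\f\\r]+")) {
            if (token.isEmpty()) {
                continue; // leading whitespace produces an empty first piece
            }
            int eq = token.indexOf('=');
            names.add(eq < 0 ? token : token.substring(0, eq));
        }
        return names;
    }

    public static void main(String[] args) {
        // Literal backslash sequence glued to href: one attribute with an odd name.
        System.out.println(attributeNames("\\r\\nhref=\"#\""));  // prints [\r\nhref]
        // Literal backslash sequence followed by a space: two attributes.
        System.out.println(attributeNames("\\r\\n href=\"#\"")); // prints [\r\n, href]
        // A real CRLF is whitespace, so only href remains.
        System.out.println(attributeNames("\r\nhref=\"#\""));    // prints [href]
    }
}
```

Only in the first case is there no attribute named href, which matches the link behavior described above.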