URLs in href get encoded


Salman Khan

Jun 2, 2016, 8:21:02 AM
to OWASP Java HTML Sanitizer Support
URLs get encoded when passed through the OWASP HTML Sanitizer. A sample policy (the eBay policy) was used.
Input string <a href="mailto:x...@company.com" > Email </a> results in <a href="mailto:xxx&#64;company.com"> Email </a>

If there is whitespace inside the quotes around the href value, the attribute is dropped. For example:
Input string <a href = " http://www.google.com "> GOOGLE </a> results in <a> GOOGLE </a>

Thanks,
Salman

Jim Manico

Jun 2, 2016, 1:37:26 PM
to owasp-java-html-...@googlegroups.com

I can't reproduce this. What version are you using?

- Jim


Mike Samuel

Jun 2, 2016, 2:29:10 PM
to OWASP Java HTML Sanitizer Support
On Thu, Jun 2, 2016 at 8:21 AM, Salman Khan <sakha...@gmail.com> wrote:
> URLs get encoded when passed through the OWASP HTML Sanitizer. A sample policy
> (the eBay policy) was used.
> Input String <a href="mailto:x...@company.com" > Email </a> results in <a
> href="mailto:xxx&#64;company.com"> Email </a>

How is this causing problems? This seems like a reasonable output for
that input.

> If there is whitespace inside the quotes around the href value, the attribute
> is dropped. For example:
> Input string <a href = " http://www.google.com "> GOOGLE </a> results in <a>
> GOOGLE </a>

This looks like a bug.
https://www.w3.org/TR/html5/links.html#attr-hyperlink-href says

"""The href attribute on a and area elements must have a value that is
a valid URL potentially surrounded by spaces."""

https://www.w3.org/TR/html5/infrastructure.html#strip-leading-and-trailing-whitespace
explains what space means

"""When a user agent is to strip leading and trailing whitespace from
a string, the user agent must remove all space characters that are at
the start or end of the string."""

and https://www.w3.org/TR/html5/infrastructure.html#space-character says

"""The space characters, for the purposes of this specification, are
U+0020 SPACE, "tab" (U+0009), "LF" (U+000A), "FF" (U+000C), and "CR"
(U+000D)."""

so I can try to get the URL sanitizing code to strip leading and
trailing space first.
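
The stripping step described above could be sketched as follows. This is only an illustration of the HTML5 definition quoted in this message, not the sanitizer's actual fix; the class and method names are hypothetical. Note that Java's String.trim() is not equivalent, since it removes every character up to U+0020 rather than just the five HTML space characters.

```java
final class HrefWhitespace {
  /** True for the five HTML5 "space characters": space, tab, LF, FF, CR. */
  static boolean isHtmlSpace(char c) {
    return c == ' ' || c == '\t' || c == '\n' || c == '\f' || c == '\r';
  }

  /** Removes HTML space characters from both ends of an attribute value. */
  static String stripHtmlSpaces(String s) {
    int start = 0;
    int end = s.length();
    while (start < end && isHtmlSpace(s.charAt(start))) {
      start++;
    }
    while (end > start && isHtmlSpace(s.charAt(end - 1))) {
      end--;
    }
    return s.substring(start, end);
  }

  public static void main(String[] args) {
    // The href value from the report, with its surrounding spaces.
    System.out.println("[" + stripHtmlSpaces(" http://www.google.com ") + "]");
    // prints "[http://www.google.com]"
  }
}
```

A URL check that runs on the stripped value would then see "http://www.google.com" and could keep the attribute. Spaces inside the URL are left alone.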

Jim Manico

Jun 2, 2016, 3:10:55 PM
to owasp-java-html-...@googlegroups.com
Comments inline...
> On Thu, Jun 2, 2016 at 8:21 AM, Salman Khan <sakha...@gmail.com> wrote:
>> URLs get encoded when passed through the OWASP HTML Sanitizer. A sample policy
>> (the eBay policy) was used.
>> Input String <a href="mailto:x...@company.com" > Email </a> results in <a
>> href="mailto:xxx&#64;company.com"> Email </a>
> How is this causing problems? This seems like a reasonable output for
> that input.

Yes, HTML-entity-encoded URIs are safe for rendering and work just
fine. This is a good thing that does not harm functionality.
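
To illustrate why the encoded form still works: an HTML parser decodes character references in attribute values before the URL is used, so &#64; turns back into @. Below is a minimal sketch that handles only decimal numeric references such as &#64;. It is not a full HTML entity decoder and not the sanitizer's own code, and the class and method names are hypothetical.

```java
import java.util.regex.Matcher;
import java.util.regex.Pattern;

final class NumericEntityDemo {
  // Matches decimal numeric character references such as "&#64;".
  private static final Pattern DECIMAL_REF = Pattern.compile("&#([0-9]{1,7});");

  /** Replaces each decimal numeric character reference with its character. */
  static String decodeDecimalRefs(String s) {
    Matcher m = DECIMAL_REF.matcher(s);
    StringBuffer out = new StringBuffer();
    while (m.find()) {
      int codePoint = Integer.parseInt(m.group(1));
      String replacement = Character.isValidCodePoint(codePoint)
          ? new String(Character.toChars(codePoint))
          : m.group(); // leave out-of-range references untouched
      m.appendReplacement(out, Matcher.quoteReplacement(replacement));
    }
    m.appendTail(out);
    return out.toString();
  }

  public static void main(String[] args) {
    // The sanitized href value from the report.
    System.out.println(decodeDecimalRefs("mailto:xxx&#64;company.com"));
    // prints "mailto:xxx@company.com"
  }
}
```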

>> If there is whitespace inside the quotes around the href value, the attribute
>> is dropped. For example:
>> Input string <a href = " http://www.google.com "> GOOGLE </a> results in <a>
>> GOOGLE </a>
> This looks like a bug.
> https://www.w3.org/TR/html5/links.html#attr-hyperlink-href says
>
> """The href attribute on a and area elements must have a value that is
> a valid URL potentially surrounded by spaces."""
>
> https://www.w3.org/TR/html5/infrastructure.html#strip-leading-and-trailing-whitespace
> explains what space means
>
> """When a user agent is to strip leading and trailing whitespace from
> a string, the user agent must remove all space characters that are at
> the start or end of the string."""
>
> and https://www.w3.org/TR/html5/infrastructure.html#space-character says
>
> """The space characters, for the purposes of this specification, are
> U+0020 SPACE, "tab" (U+0009), "LF" (U+000A), "FF" (U+000C), and "CR"
> (U+000D)."""
>
> so I can try to get the URL sanitizing code to strip leading and
> trailing space first.
Salman,

We've had a LOT of people review and use this code and you are the first
to point out this bug. Nice work! :)

Aloha,
Jim

