safe characters used in iri_to_uri (#12445)

Gary Wilson Jr.

Dec 27, 2009, 1:12:39 AM
to django-d...@googlegroups.com
http://code.djangoproject.com/ticket/12445

RFC 3986 [1] defines the following as "reserved" and "unreserved" characters:

reserved = gen-delims / sub-delims
gen-delims = ":" / "/" / "?" / "#" / "[" / "]" / "@"
sub-delims = "!" / "$" / "&" / "'" / "(" / ")" / "*" / "+" / "," / ";" / "="
unreserved = ALPHA / DIGIT / "-" / "." / "_" / "~"

...and states that:
* if data for a URI component would conflict with a reserved
character's purpose as a delimiter, then the conflicting data must be
percent-encoded before the URI is formed.
* unreserved characters should not be percent-encoded

So, a couple of issues here:

1) From my understanding of this, it seems that we need to add the
characters: ' (single quote), @ (at sign), and ~ (tilde) to the list
of safe characters used in iri_to_uri and drop the % (percent)
character. The ' and @ are reserved characters that are currently
missing from our list of safe characters, and the ~ is the only
unreserved character that urllib.quote doesn't already consider safe
[2].

Does this sound right, or am I misinterpreting the RFC?
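
One way to check the claim in point 1 is to compare the current safe list against the RFC 3986 character classes with plain set arithmetic. This is only a sketch: the constant names are made up for illustration, and it checks the reserved set only, since the unreserved characters are handled by the quoting function's own defaults rather than by the `safe` argument.

```python
import string

# RFC 3986 character classes, as quoted above.
GEN_DELIMS = ":/?#[]@"
SUB_DELIMS = "!$&'()*+,;="
RESERVED = set(GEN_DELIMS + SUB_DELIMS)
UNRESERVED = set(string.ascii_letters + string.digits + "-._~")

# The safe list currently passed to urllib.quote in iri_to_uri.
CURRENT_SAFE = set("/#%[]=:;$&()+,!?*")

# Reserved characters that the current safe list does not cover.
missing_reserved = sorted(RESERVED - CURRENT_SAFE)

# Characters in the safe list that are neither reserved nor unreserved.
extra = sorted(CURRENT_SAFE - RESERVED - UNRESERVED)

print(missing_reserved)  # ["'", '@']
print(extra)             # ['%']
```

The result agrees with the post: ' and @ are the reserved characters missing from the list, and % is the one entry that belongs to neither class.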

2) The % character is neither a reserved nor an unreserved character,
but the end of section 3.1 of RFC 3987 [3] (which the source
references) states that % must not be converted:

"Systems accepting IRIs MAY also deal with the printable characters in
US-ASCII that are not allowed in URIs, namely "<", ">", '"', space,
"{", "}", "|", "\", "^", and "`", in step 2 above. If these
characters are found but are not converted, then the conversion SHOULD
fail. Please note that the number sign ("#"), the percent sign ("%"),
and the square bracket characters ("[", "]") are not part of the above
list and MUST NOT be converted."

Which I interpret as: even though % is not a reserved or unreserved
character, it should not be percent-encoded. So we keep % in our list
of safe characters, sound right?
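
A practical consequence of keeping % safe can be seen with input that is already percent-encoded. The sketch below assumes Python 3's `urllib.parse.quote` rather than the Python 2 `urllib.quote` discussed in this thread, and the example path is made up.

```python
from urllib.parse import quote

# A path whose non-ASCII character has already been percent-encoded.
already_encoded = "/caf%C3%A9"

# With % in the safe list, existing escapes pass through untouched.
kept = quote(already_encoded, safe="/%")

# Without it, each % is itself escaped as %25 (double encoding).
doubled = quote(already_encoded, safe="/")

print(kept)     # /caf%C3%A9
print(doubled)  # /caf%25C3%25A9
```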

Gary

[1] http://www.ietf.org/rfc/rfc3986.txt
[2] http://docs.python.org/library/urllib.html#urllib.quote
[3] http://www.ietf.org/rfc/rfc3987.txt

Luke Plant

Dec 28, 2009, 5:43:10 PM
to django-d...@googlegroups.com
Hi Gary,

I agree with your proposals. I've got a few concerns with the
current code:

In django/utils/encoding.py: function iri_to_uri
<<<
    # The list of safe characters here is constructed from the
    # printable ASCII characters that are not explicitly excluded
    # by the list at the end of section 3.1 of RFC 3987.
    if iri is None:
        return iri
    return urllib.quote(smart_str(iri), safe='/#%[]=:;$&()+,!?*')
>>>

First, I can't find any list at the end of section 3.1 of RFC 3987,
unless it's talking about the paragraph starting "Systems accepting
IRIs MAY also deal with..." which only lists a few characters which
should not be converted, and not the list given in the code.

Second, the algorithm given in that section is described very
differently from what the code does, and it's very hard to see that
the two are doing the same thing. Hence this bug. However, I can't
actually come up with a nicer solution, and one that is equally fast
is probably even harder.

So, +1 to changing this, as well as some fixes to the comments in the
code.
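
For illustration only, here is roughly what the function might look like with ' , @ and ~ added to the safe list and % retained. This is a sketch, not the committed fix: it uses Python 3's `urllib.parse.quote` in place of `urllib.quote` and `smart_str`, and the names `PROPOSED_SAFE` and `iri_to_uri_sketch` are hypothetical.

```python
from urllib.parse import quote

# Current safe list plus the three characters proposed in this thread.
# Keeping % follows point 2 of the original post (an assumption here).
PROPOSED_SAFE = "/#%[]=:;$&()+,!?*@'~"


def iri_to_uri_sketch(iri):
    """Percent-encode an IRI, leaving reserved and unreserved chars alone."""
    if iri is None:
        return iri
    # quote() encodes str input as UTF-8 before escaping.
    return quote(str(iri), safe=PROPOSED_SAFE)


print(iri_to_uri_sketch("/blog/é/?q=o'neil@x~y"))  # /blog/%C3%A9/?q=o'neil@x~y
print(iri_to_uri_sketch("/a b/%7E"))               # /a%20b/%7E
```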

Luke

--
I teleported home one night
With Ron and Sid and Meg,
Ron stole Meggie's heart away
And I got Sidney's leg
(THHGTTG)

Luke Plant || http://lukeplant.me.uk/
