QueryDict and unicode

Alexey Lozickiy

Oct 2, 2017, 4:54:38 PM
to Django users
Hi all,

Why does QueryDict handle the input query string differently on PY3 than on PY2? Here is the relevant part of QueryDict.__init__ from Django 1.11.5:

if six.PY3:
    if isinstance(query_string, bytes):
        # query_string normally contains URL-encoded data, a subset of ASCII.
        try:
            query_string = query_string.decode(encoding)
        except UnicodeDecodeError:
            # ... but some user agents are misbehaving :-(
            query_string = query_string.decode('iso-8859-1')
    for key, value in limited_parse_qsl(query_string, **parse_qsl_kwargs):
        self.appendlist(key, value)
else:
    for key, value in limited_parse_qsl(query_string, **parse_qsl_kwargs):
        try:
            value = value.decode(encoding)
        except UnicodeDecodeError:
            value = value.decode('iso-8859-1')
        self.appendlist(force_text(key, encoding, errors='replace'),
                        value)

First, on PY3 decoding is done only once, for the entire query string, while on PY2 the query is parsed first and then each value is decoded separately.
Second, on PY3 query_string is decoded only if it is of type bytes. Why is there no such check on PY2? Why not decode only if it is not already unicode?
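
To make the first point concrete, here is a rough Python 3 sketch of the two strategies. It is not Django's code: decode_whole and decode_per_value are names made up for this illustration, and the standard library's urllib.parse.parse_qsl stands in for limited_parse_qsl.

```python
from urllib.parse import parse_qsl, unquote_to_bytes


def decode_whole(raw, encoding="utf-8"):
    """PY3-style: decode the whole query string once, then parse text."""
    try:
        text = raw.decode(encoding)
    except UnicodeDecodeError:
        text = raw.decode("iso-8859-1")
    # parse_qsl percent-decodes with errors="replace" by default.
    return parse_qsl(text, keep_blank_values=True, encoding=encoding)


def decode_per_value(raw, encoding="utf-8"):
    """PY2-style: split and percent-decode bytes, then decode each value."""
    pairs = []
    for field in raw.split(b"&"):
        if not field:
            continue
        key, _, value = field.partition(b"=")
        key = unquote_to_bytes(key.replace(b"+", b" "))
        value = unquote_to_bytes(value.replace(b"+", b" "))
        try:
            value = value.decode(encoding)
        except UnicodeDecodeError:
            # Same fallback as the PY2 branch quoted above.
            value = value.decode("iso-8859-1")
        pairs.append((key.decode(encoding, errors="replace"), value))
    return pairs


raw = b"a=%C3%A9&b=%E9"  # %E9 is Latin-1 "é", not valid UTF-8 on its own
whole = decode_whole(raw)
per_value = decode_per_value(raw)
```

In this sketch both functions agree on "a", but they differ on "b": decoding once and then parsing yields the replacement character U+FFFD, while the per-value approach falls back to iso-8859-1 and yields "é".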

With this implementation it is not possible to pass a unicode object that contains non-ASCII characters to QueryDict.

Can somebody give me a hint as to why things were done this way?

Thanks,
Alexey.

James Schneider

Oct 4, 2017, 4:43:43 AM
to django...@googlegroups.com


On Oct 2, 2017 1:53 PM, "Alexey Lozickiy" <wrestlin...@gmail.com> wrote:
First, on PY3 decoding is done only once, for the entire query string, while on PY2 the query is parsed first and then each value is decoded separately.
Second, on PY3 query_string is decoded only if it is of type bytes. Why is there no such check on PY2? Why not decode only if it is not already unicode?


I'm probably unqualified to answer this, but I'll try anyway. 

The difference likely comes down to the change in string handling in Python 3. Py3 makes a distinction between character strings and byte strings.

limited_parse_qsl() likely can/will only handle percent-encoded (URL-encoded) character strings on Py3, whereas on Py2 it transparently handles byte strings (which decode to URL-encoded text). I'm guessing that the implicit conversion between bytes and characters no longer happens on Py3 (for good and well-documented reasons), so care must be taken to do the decoding manually. You'll notice that the values are not decoded a second time on Py3.
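
As a small sketch of that manual decoding step on Python 3 (to_text is a hypothetical helper name, not something from Django; it just mirrors the bytes check and iso-8859-1 fallback from the quoted snippet):

```python
def to_text(query_string, encoding="utf-8"):
    """Return query_string as str, decoding bytes explicitly."""
    if isinstance(query_string, bytes):
        try:
            return query_string.decode(encoding)
        except UnicodeDecodeError:
            # Fallback for misbehaving user agents; iso-8859-1 maps
            # every byte to a character, so this decode cannot fail.
            return query_string.decode("iso-8859-1")
    # Already a character string: nothing to decode.
    return query_string


# Python 3 refuses to mix bytes and str implicitly.
try:
    b"q=" + "caf\u00e9"
    mixed_ok = True
except TypeError:
    mixed_ok = False
```

A str passes through untouched, which would explain why only the bytes case gets special treatment on Py3.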


With this implementation it is not possible to pass a unicode object that contains non-ASCII characters to QueryDict.

Given the first comment in the code, if the data is not properly URL-encoded to begin with, I would expect the parsing function to explode on the values, meaning that you can't pass a true Unicode string with characters beyond the ASCII range, because it isn't expected at this stage. To me, that's expected and desired behavior, since a QueryDict expects to be given a properly formatted, URL-encoded query.

The fix would be to URL-encode your true Unicode string before passing it to a QueryDict. That should allow Unicode characters with higher code points to be supported.
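
Something along these lines, as a minimal Python 3 sketch using only the standard library (the parameter names and values are made up for illustration, and parse_qsl is used here just to show the round trip without needing a configured Django project):

```python
from urllib.parse import parse_qsl, urlencode

# Hypothetical parameters containing non-ASCII characters.
params = {"name": "Zoë", "city": "Köln"}

# urlencode percent-encodes the UTF-8 bytes of each key and value,
# so the resulting query string is plain ASCII.
encoded = urlencode(params)

# Parsing the encoded string recovers the original Unicode text.
decoded = parse_qsl(encoded)
```

The encoded string is what I would then hand to QueryDict, which should only ever see ASCII input at that point.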

Basically, the Internet revolves around ASCII being the lowest common denominator.

Someone please correct me if I'm wrong.

-James
