It doesn't work with python2 and django.
In python2, QueryDict returns an `unicode` string for values, which makes
it impossible to represent binary data.
Worse: it returns an `unicode` string whose unicode code points are the
raw expected bytes (caused by lax behavior of `str.decode('iso-8859-1')`)
so it silently fails.
Example:
{{{#!python
q = QueryDict('foo=%00%7f%80%ff')
q['foo']
# actual:
# u'\x00\x7f\x80\xff'
# expected:
# '\x00\x7f\x80\xff'
}}}
Relevant code:
https://github.com/django/django/blob/f89b11b879b83aa505dc8231da5f06ca4b1b062e/django/http/request.py#L397-L401
This was introduced by
https://github.com/django/django/commit/fa02120d360387bebbbe735e86686bb4c7c43db2
while trying to accept broken user agents that send non-ascii as
urlencoded raw values:
- the python 3 version seems fine: it tries to decode '''before''' calling
the query string parser (from `urllib`)
- the python 2 version made a mistake: it decodes '''after''' calling the
query string parser (and only on the value, not the key strangely)
I'm not sure how to fix this, as there are many locations where the key
and value are re-encoded (`appendlist` does it too..)
I have not (yet) tested with python3.
--
Ticket URL: <https://code.djangoproject.com/ticket/29323>
Django <https://code.djangoproject.com/>
The Web framework for perfectionists with deadlines.
Comment (by Simon Charette):
At this point the only version of Django still supporting Python 2 is only
receiving security fixes so unless you can also reproduce on Python 3 I'm
afraid this will not get backported.
--
Ticket URL: <https://code.djangoproject.com/ticket/29323#comment:1>
* status: new => closed
* resolution: => wontfix
* type: Uncategorized => Bug
--
Ticket URL: <https://code.djangoproject.com/ticket/29323#comment:2>
* status: closed => new
* resolution: wontfix =>
Old description:
New description:
Relevant code:
https://github.com/django/django/blob/f89b11b879b83aa505dc8231da5f06ca4b1b062e/django/http/request.py#L397-L401
In python 3 it fails too, but probably only because of the forced post-
query-string-parser decoding into unicode string: to support binary values
we don't want decoding at this point.
A test for python3:
{{{#!python
def test_binary_input(self):
"""
QueryDicts must be able to handle binary.
"""
q = QueryDict(b'foo=%00%7f%80%ff')
self.assertListEqual(list(map(ord, q['foo'])), [0x00, 0x7F, 0x80,
0xFF])
}}}
{{{
AssertionError: Lists differ: [0, 127, 65533, 65533] != [0, 127, 128, 255]
First differing element 2:
65533
128
- [0, 127, 65533, 65533]
+ [0, 127, 128, 255]
}}}
--
Comment:
OK.
I tested with python3 and it breaks too, but only because the code
explicitly tries to generates an unicode string (`bytes_to_text` in
`appendlist`).
I updated the description to also add the python3 test.
To support binary we would need to add an explicit option to get the
`bytes` (maybe an encoding that asks to keep `bytes`?); not sure how to
expose it in `HTTPRequest`.
--
Ticket URL: <https://code.djangoproject.com/ticket/29323#comment:3>
Comment (by Tim Graham):
Can you explain the use case for encoding bytes this way?
--
Ticket URL: <https://code.djangoproject.com/ticket/29323#comment:4>
Comment (by Simon Charette):
Is there a reason you can't base64 encode your binary content instead? It
should be easy to turn it into bytes when necessary after that.
--
Ticket URL: <https://code.djangoproject.com/ticket/29323#comment:5>
* status: new => closed
* resolution: => wontfix
--
Ticket URL: <https://code.djangoproject.com/ticket/29323#comment:6>
Comment (by Thomas Riccardi):
(sorry I did not receive a mail notification for comments, just for issue
status update...)
Replying to [comment:4 Tim Graham]:
> Can you explain the use case for encoding bytes this way?
Replying to [comment:5 Simon Charette]:
> Is there a reason you can't base64 encode your binary content instead?
It should be easy to turn it into bytes when necessary after that.
Indeed base64 seems better fitted (and is more efficient), but today our
API accept both raw bytes and base64. It seems allowed by the various
RFCs, and it works with django 1 and python 2 (with the addition of
`str.decode('iso-8859-1')`).
It breaks with python 3: we currently cannot upgrade to python 3 or django
2 without breaking our API.
--
Ticket URL: <https://code.djangoproject.com/ticket/29323#comment:7>
Comment (by Tim Graham):
Could you cite your source? [https://www.w3.org/TR/2003/REC-
xforms-20031014/slice11.html#serialize-urlencode I found]:
[`application/x-www-form-urlencoded`] is not suitable for the persistence
of binary content. Therefore, it is recommended that forms capable of
containing binary content use another serialization method.
[https://developer.mozilla.org/en-US/docs/Web/HTTP/Methods/POST and]:
application/x-www-form-urlencoded: the values are encoded in key-value
tuples separated by '&', with a '=' between the key and the value. Non-
alphanumeric characters are percent encoded: this is the reason why this
type is not suitable to use with binary data (use multipart/form-data
instead)
Maybe a custom middleware could allow your use case.
If you have a fix to offer, we could evaluate it.
--
Ticket URL: <https://code.djangoproject.com/ticket/29323#comment:8>
Comment (by Thomas Riccardi):
Replying to [comment:8 Tim Graham]:
> Could you cite your source? [https://www.w3.org/TR/2003/REC-
xforms-20031014/slice11.html#serialize-urlencode I found]:
>
> [`application/x-www-form-urlencoded`] is not suitable for the
persistence of binary content. Therefore, it is recommended that forms
capable of containing binary content use another serialization method.
I believe this is a spec related to XHTML. Firefox [deprecated it in
firefox 19](https://developer.mozilla.org/en-US/docs/Archive/Web/XForms),
XForms 2 is targeted for XML languages (see abstract:
https://www.w3.org/TR/xforms20/ ).
Anyway, it is just a recommendation: the format technically allows
serializing binary: at worst each byte is percent-encoded.
The html5 spec explains the [application/x-www-form-urlencoded byte
serializer](https://url.spec.whatwg.org/#concept-urlencoded-byte-
serializer): it takes a bytes array as input, and returns a string. A
Previous step documents how to generate the bytes array via encoding:
https://url.spec.whatwg.org/#concept-urlencoded-serializer.
So this may be a stretch to serialize raw binary and not a string that we
first encode, but the format would be the same. Only the user API would
change: bytes instead of string.
> [https://developer.mozilla.org/en-US/docs/Web/HTTP/Methods/POST and]:
>
> application/x-www-form-urlencoded: the values are encoded in key-value
tuples separated by '&', with a '=' between the key and the value. Non-
alphanumeric characters are percent encoded: this is the reason why this
type is not suitable to use with binary data (use multipart/form-data
instead)
Either "not suitable" means it won't work, which is wrong, or that it's
not recommended, but still would work.
> Maybe a custom middleware could allow your use case.
>
> If you have a fix to offer, we could evaluate it.
I'm OK to write a fix (I initially started to write a simple fix and unit
tests, but I encountered more issues, as previously explained), but I
would require some guidance. The open questions I have are:
- `QueryDict` output type: it should return `bytes` instead of `string` in
this case. How do we expose that? is it OK to have variable types?
- How to ask for bytes? Do we cheat with the `encoding` parameter? Not
sure who is controlling it: the request sender or the developer?
Thanks,
Thomas
--
Ticket URL: <https://code.djangoproject.com/ticket/29323#comment:9>
* cc: Claude Paroz (added)
Comment:
Claude, do you have any expertise to lend here? If not, perhaps we can ask
for advice on the DevelopersMailingList.
--
Ticket URL: <https://code.djangoproject.com/ticket/29323#comment:10>
Comment (by Claude Paroz):
It would be possible to let raw bytes in `QueryDict` when a for value
cannot be properly decoded (I'll attach a patch doing that). This would
however change the (tested in
`handlers.tests.HandlerTests.test_non_ascii_query_string`) behavior when
Django receive a badly-encoded query string like `url?want=caf%E9`
(latin-1 encoded instead of utf-8 encoded).
With the current code, `request.POST['want']` contains `'caf�'` (i.e.
`'caf\ufffd'`). If we let bytes be bytes, `request.POST['want']` would
result in the binary `b'caf\xe9'` instead.
My main concern would be that developers should then add much defensive
coding every time they access `request.POST` values. Typically, the
currently valid `request.POST.get('want', '').startswith('caf')` test in a
view could easily crash by posting specially-crafted values.
--
Ticket URL: <https://code.djangoproject.com/ticket/29323#comment:11>
* Attachment "29323.diff" added.
Let bytes be bytes in QueryDict
Comment (by Claude Paroz):
An alternative would be to let undecodable quoted values as is in
`request.POST`: `request.POST['want']` -> `'caf%E9'`.
Then if the app knows some posted content will contain such a value, it
can take care of unquoting the value itself. I think it would add less
backwards incompatibility.
--
Ticket URL: <https://code.djangoproject.com/ticket/29323#comment:12>