[Django] #29323: HTTPRequest QueryDict wrongly decodes binary encoded values in python2

13 views
Skip to first unread message

Django

unread,
Apr 13, 2018, 1:38:16 PM4/13/18
to django-...@googlegroups.com
#29323: HTTPRequest QueryDict wrongly decodes binary encoded values in python2
-------------------------------------------+------------------------
Reporter: Thomas Riccardi | Owner: nobody
Type: Uncategorized | Status: new
Component: HTTP handling | Version: 1.11
Severity: Normal | Keywords:
Triage Stage: Unreviewed | Has patch: 0
Needs documentation: 0 | Needs tests: 0
Patch needs improvement: 0 | Easy pickings: 0
UI/UX: 0 |
-------------------------------------------+------------------------
`application/x-www-form-urlencoded` bodies allow binary values: just
percent-escape (most of) the bytes.

It doesn't work with python2 and django.

In python2, QueryDict returns an `unicode` string for values, which makes
it impossible to represent binary data.
Worse: it returns an `unicode` string whose unicode code points are the
raw expected bytes (caused by lax behavior of `str.decode('iso-8859-1')`)
so it silently fails.

Example:
{{{#!python
q = QueryDict('foo=%00%7f%80%ff')
q['foo']
# actual:
# u'\x00\x7f\x80\xff'
# expected:
# '\x00\x7f\x80\xff'
}}}

Relevant code:
https://github.com/django/django/blob/f89b11b879b83aa505dc8231da5f06ca4b1b062e/django/http/request.py#L397-L401

This was introduced by
https://github.com/django/django/commit/fa02120d360387bebbbe735e86686bb4c7c43db2
while trying to accept broken user agents that send non-ascii as
urlencoded raw values:
- the python 3 version seems fine: it tries to decode '''before''' calling
the query string parser (from `urllib`)
- the python 2 version made a mistake: it decodes '''after''' calling the
query string parser (and only on the value, not the key strangely)


I'm not sure how to fix this, as there are many locations where the key
and value are re-encoded (`appendlist` does it too..)


I have not (yet) tested with python3.

--
Ticket URL: <https://code.djangoproject.com/ticket/29323>
Django <https://code.djangoproject.com/>
The Web framework for perfectionists with deadlines.

Django

unread,
Apr 13, 2018, 6:26:24 PM4/13/18
to django-...@googlegroups.com
#29323: HTTPRequest QueryDict wrongly decodes binary encoded values in python2
---------------------------------+--------------------------------------

Reporter: Thomas Riccardi | Owner: nobody
Type: Uncategorized | Status: new
Component: HTTP handling | Version: 1.11
Severity: Normal | Resolution:

Keywords: | Triage Stage: Unreviewed
Has patch: 0 | Needs documentation: 0
Needs tests: 0 | Patch needs improvement: 0
Easy pickings: 0 | UI/UX: 0
---------------------------------+--------------------------------------

Comment (by Simon Charette):

At this point the only version of Django still supporting Python 2 is only
receiving security fixes so unless you can also reproduce on Python 3 I'm
afraid this will not get backported.

--
Ticket URL: <https://code.djangoproject.com/ticket/29323#comment:1>

Django

unread,
Apr 15, 2018, 9:04:52 PM4/15/18
to django-...@googlegroups.com
#29323: HTTPRequest QueryDict wrongly decodes binary encoded values in python2
---------------------------------+--------------------------------------

Reporter: Thomas Riccardi | Owner: nobody
Type: Bug | Status: closed

Component: HTTP handling | Version: 1.11
Severity: Normal | Resolution: wontfix

Keywords: | Triage Stage: Unreviewed
Has patch: 0 | Needs documentation: 0
Needs tests: 0 | Patch needs improvement: 0
Easy pickings: 0 | UI/UX: 0
---------------------------------+--------------------------------------
Changes (by Tim Graham):

* status: new => closed
* resolution: => wontfix
* type: Uncategorized => Bug


--
Ticket URL: <https://code.djangoproject.com/ticket/29323#comment:2>

Django

unread,
Apr 16, 2018, 9:02:34 AM4/16/18
to django-...@googlegroups.com
#29323: HTTPRequest QueryDict wrongly decodes binary encoded values
---------------------------------+--------------------------------------

Reporter: Thomas Riccardi | Owner: nobody
Type: Bug | Status: new

Component: HTTP handling | Version: 1.11
Severity: Normal | Resolution:

Keywords: | Triage Stage: Unreviewed
Has patch: 0 | Needs documentation: 0
Needs tests: 0 | Patch needs improvement: 0
Easy pickings: 0 | UI/UX: 0
---------------------------------+--------------------------------------
Changes (by Thomas Riccardi):

* status: closed => new
* resolution: wontfix =>


Old description:

New description:

Relevant code:
https://github.com/django/django/blob/f89b11b879b83aa505dc8231da5f06ca4b1b062e/django/http/request.py#L397-L401


In python 3 it fails too, but probably only because of the forced post-
query-string-parser decoding into unicode string: to support binary values
we don't want decoding at this point.

A test for python3:
{{{#!python
def test_binary_input(self):
"""
QueryDicts must be able to handle binary.
"""
q = QueryDict(b'foo=%00%7f%80%ff')
self.assertListEqual(list(map(ord, q['foo'])), [0x00, 0x7F, 0x80,
0xFF])
}}}
{{{
AssertionError: Lists differ: [0, 127, 65533, 65533] != [0, 127, 128, 255]

First differing element 2:
65533
128

- [0, 127, 65533, 65533]
+ [0, 127, 128, 255]
}}}

--

Comment:

OK.

I tested with python3 and it breaks too, but only because the code
explicitly tries to generates an unicode string (`bytes_to_text` in
`appendlist`).

I updated the description to also add the python3 test.

To support binary we would need to add an explicit option to get the
`bytes` (maybe an encoding that asks to keep `bytes`?); not sure how to
expose it in `HTTPRequest`.

--
Ticket URL: <https://code.djangoproject.com/ticket/29323#comment:3>

Django

unread,
Apr 16, 2018, 10:30:06 AM4/16/18
to django-...@googlegroups.com
#29323: HTTPRequest QueryDict wrongly decodes binary encoded values
---------------------------------+--------------------------------------

Reporter: Thomas Riccardi | Owner: nobody
Type: Bug | Status: new

Component: HTTP handling | Version: 1.11
Severity: Normal | Resolution:

Keywords: | Triage Stage: Unreviewed
Has patch: 0 | Needs documentation: 0
Needs tests: 0 | Patch needs improvement: 0
Easy pickings: 0 | UI/UX: 0
---------------------------------+--------------------------------------

Comment (by Tim Graham):

Can you explain the use case for encoding bytes this way?

--
Ticket URL: <https://code.djangoproject.com/ticket/29323#comment:4>

Django

unread,
Apr 16, 2018, 11:27:01 AM4/16/18
to django-...@googlegroups.com
#29323: HTTPRequest QueryDict wrongly decodes binary encoded values
---------------------------------+--------------------------------------

Reporter: Thomas Riccardi | Owner: nobody
Type: Bug | Status: new

Component: HTTP handling | Version: 1.11
Severity: Normal | Resolution:

Keywords: | Triage Stage: Unreviewed
Has patch: 0 | Needs documentation: 0
Needs tests: 0 | Patch needs improvement: 0
Easy pickings: 0 | UI/UX: 0
---------------------------------+--------------------------------------

Comment (by Simon Charette):

Is there a reason you can't base64 encode your binary content instead? It
should be easy to turn it into bytes when necessary after that.

--
Ticket URL: <https://code.djangoproject.com/ticket/29323#comment:5>

Django

unread,
Apr 23, 2018, 9:13:24 AM4/23/18
to django-...@googlegroups.com
#29323: HTTPRequest QueryDict wrongly decodes binary encoded values
---------------------------------+--------------------------------------

Reporter: Thomas Riccardi | Owner: nobody
Type: Bug | Status: closed

Component: HTTP handling | Version: 1.11
Severity: Normal | Resolution: wontfix

Keywords: | Triage Stage: Unreviewed
Has patch: 0 | Needs documentation: 0
Needs tests: 0 | Patch needs improvement: 0
Easy pickings: 0 | UI/UX: 0
---------------------------------+--------------------------------------
Changes (by Tim Graham):

* status: new => closed
* resolution: => wontfix


--
Ticket URL: <https://code.djangoproject.com/ticket/29323#comment:6>

Django

unread,
Apr 23, 2018, 9:50:18 AM4/23/18
to django-...@googlegroups.com
#29323: HTTPRequest QueryDict wrongly decodes binary encoded values
---------------------------------+--------------------------------------

Reporter: Thomas Riccardi | Owner: nobody
Type: Bug | Status: closed

Component: HTTP handling | Version: 1.11
Severity: Normal | Resolution: wontfix

Keywords: | Triage Stage: Unreviewed
Has patch: 0 | Needs documentation: 0
Needs tests: 0 | Patch needs improvement: 0
Easy pickings: 0 | UI/UX: 0
---------------------------------+--------------------------------------

Comment (by Thomas Riccardi):

(sorry I did not receive a mail notification for comments, just for issue
status update...)

Replying to [comment:4 Tim Graham]:


> Can you explain the use case for encoding bytes this way?

Replying to [comment:5 Simon Charette]:


> Is there a reason you can't base64 encode your binary content instead?
It should be easy to turn it into bytes when necessary after that.

Indeed base64 seems better fitted (and is more efficient), but today our
API accept both raw bytes and base64. It seems allowed by the various
RFCs, and it works with django 1 and python 2 (with the addition of
`str.decode('iso-8859-1')`).
It breaks with python 3: we currently cannot upgrade to python 3 or django
2 without breaking our API.

--
Ticket URL: <https://code.djangoproject.com/ticket/29323#comment:7>

Django

unread,
Apr 23, 2018, 12:18:55 PM4/23/18
to django-...@googlegroups.com
#29323: HTTPRequest QueryDict wrongly decodes binary encoded values
---------------------------------+--------------------------------------

Reporter: Thomas Riccardi | Owner: nobody
Type: Bug | Status: closed

Component: HTTP handling | Version: 1.11
Severity: Normal | Resolution: wontfix

Keywords: | Triage Stage: Unreviewed
Has patch: 0 | Needs documentation: 0
Needs tests: 0 | Patch needs improvement: 0
Easy pickings: 0 | UI/UX: 0
---------------------------------+--------------------------------------

Comment (by Tim Graham):

Could you cite your source? [https://www.w3.org/TR/2003/REC-
xforms-20031014/slice11.html#serialize-urlencode I found]:

[`application/x-www-form-urlencoded`] is not suitable for the persistence
of binary content. Therefore, it is recommended that forms capable of
containing binary content use another serialization method.

[https://developer.mozilla.org/en-US/docs/Web/HTTP/Methods/POST and]:

application/x-www-form-urlencoded: the values are encoded in key-value
tuples separated by '&', with a '=' between the key and the value. Non-
alphanumeric characters are percent encoded: this is the reason why this
type is not suitable to use with binary data (use multipart/form-data
instead)

Maybe a custom middleware could allow your use case.

If you have a fix to offer, we could evaluate it.

--
Ticket URL: <https://code.djangoproject.com/ticket/29323#comment:8>

Django

unread,
Apr 23, 2018, 12:45:25 PM4/23/18
to django-...@googlegroups.com
#29323: HTTPRequest QueryDict wrongly decodes binary encoded values
---------------------------------+--------------------------------------

Reporter: Thomas Riccardi | Owner: nobody
Type: Bug | Status: closed

Component: HTTP handling | Version: 1.11
Severity: Normal | Resolution: wontfix

Keywords: | Triage Stage: Unreviewed
Has patch: 0 | Needs documentation: 0
Needs tests: 0 | Patch needs improvement: 0
Easy pickings: 0 | UI/UX: 0
---------------------------------+--------------------------------------

Comment (by Thomas Riccardi):

Replying to [comment:8 Tim Graham]:


> Could you cite your source? [https://www.w3.org/TR/2003/REC-
xforms-20031014/slice11.html#serialize-urlencode I found]:
>
> [`application/x-www-form-urlencoded`] is not suitable for the
persistence of binary content. Therefore, it is recommended that forms
capable of containing binary content use another serialization method.

I believe this is a spec related to XHTML. Firefox [deprecated it in
firefox 19](https://developer.mozilla.org/en-US/docs/Archive/Web/XForms),
XForms 2 is targeted for XML languages (see abstract:
https://www.w3.org/TR/xforms20/ ).

Anyway, it is just a recommendation: the format technically allows
serializing binary: at worst each byte is percent-encoded.

The html5 spec explains the [application/x-www-form-urlencoded byte
serializer](https://url.spec.whatwg.org/#concept-urlencoded-byte-
serializer): it takes a bytes array as input, and returns a string. A
Previous step documents how to generate the bytes array via encoding:
https://url.spec.whatwg.org/#concept-urlencoded-serializer.

So this may be a stretch to serialize raw binary and not a string that we
first encode, but the format would be the same. Only the user API would
change: bytes instead of string.

> [https://developer.mozilla.org/en-US/docs/Web/HTTP/Methods/POST and]:
>
> application/x-www-form-urlencoded: the values are encoded in key-value
tuples separated by '&', with a '=' between the key and the value. Non-
alphanumeric characters are percent encoded: this is the reason why this
type is not suitable to use with binary data (use multipart/form-data
instead)

Either "not suitable" means it won't work, which is wrong, or that it's
not recommended, but still would work.

> Maybe a custom middleware could allow your use case.
>
> If you have a fix to offer, we could evaluate it.

I'm OK to write a fix (I initially started to write a simple fix and unit
tests, but I encountered more issues, as previously explained), but I
would require some guidance. The open questions I have are:
- `QueryDict` output type: it should return `bytes` instead of `string` in
this case. How do we expose that? is it OK to have variable types?
- How to ask for bytes? Do we cheat with the `encoding` parameter? Not
sure who is controlling it: the request sender or the developer?


Thanks,
Thomas

--
Ticket URL: <https://code.djangoproject.com/ticket/29323#comment:9>

Django

unread,
Apr 23, 2018, 12:54:58 PM4/23/18
to django-...@googlegroups.com
#29323: HTTPRequest QueryDict wrongly decodes binary encoded values
---------------------------------+--------------------------------------

Reporter: Thomas Riccardi | Owner: nobody
Type: Bug | Status: closed

Component: HTTP handling | Version: 1.11
Severity: Normal | Resolution: wontfix

Keywords: | Triage Stage: Unreviewed
Has patch: 0 | Needs documentation: 0
Needs tests: 0 | Patch needs improvement: 0
Easy pickings: 0 | UI/UX: 0
---------------------------------+--------------------------------------
Changes (by Tim Graham):

* cc: Claude Paroz (added)


Comment:

Claude, do you have any expertise to lend here? If not, perhaps we can ask
for advice on the DevelopersMailingList.

--
Ticket URL: <https://code.djangoproject.com/ticket/29323#comment:10>

Django

unread,
Apr 24, 2018, 5:18:31 AM4/24/18
to django-...@googlegroups.com
#29323: HTTPRequest QueryDict wrongly decodes binary encoded values
---------------------------------+--------------------------------------

Reporter: Thomas Riccardi | Owner: nobody
Type: Bug | Status: closed

Component: HTTP handling | Version: 1.11
Severity: Normal | Resolution: wontfix

Keywords: | Triage Stage: Unreviewed
Has patch: 0 | Needs documentation: 0
Needs tests: 0 | Patch needs improvement: 0
Easy pickings: 0 | UI/UX: 0
---------------------------------+--------------------------------------

Comment (by Claude Paroz):

It would be possible to let raw bytes in `QueryDict` when a for value
cannot be properly decoded (I'll attach a patch doing that). This would
however change the (tested in
`handlers.tests.HandlerTests.test_non_ascii_query_string`) behavior when
Django receive a badly-encoded query string like `url?want=caf%E9`
(latin-1 encoded instead of utf-8 encoded).
With the current code, `request.POST['want']` contains `'caf�'` (i.e.
`'caf\ufffd'`). If we let bytes be bytes, `request.POST['want']` would
result in the binary `b'caf\xe9'` instead.

My main concern would be that developers should then add much defensive
coding every time they access `request.POST` values. Typically, the
currently valid `request.POST.get('want', '').startswith('caf')` test in a
view could easily crash by posting specially-crafted values.

--
Ticket URL: <https://code.djangoproject.com/ticket/29323#comment:11>

Django

unread,
Apr 24, 2018, 5:19:20 AM4/24/18
to django-...@googlegroups.com
#29323: HTTPRequest QueryDict wrongly decodes binary encoded values
---------------------------------+--------------------------------------

Reporter: Thomas Riccardi | Owner: nobody
Type: Bug | Status: closed

Component: HTTP handling | Version: 1.11
Severity: Normal | Resolution: wontfix

Keywords: | Triage Stage: Unreviewed
Has patch: 0 | Needs documentation: 0
Needs tests: 0 | Patch needs improvement: 0
Easy pickings: 0 | UI/UX: 0
---------------------------------+--------------------------------------
Changes (by Claude Paroz):

* Attachment "29323.diff" added.

Let bytes be bytes in QueryDict

Django

unread,
Apr 24, 2018, 5:28:21 AM4/24/18
to django-...@googlegroups.com
#29323: HTTPRequest QueryDict wrongly decodes binary encoded values
---------------------------------+--------------------------------------

Reporter: Thomas Riccardi | Owner: nobody
Type: Bug | Status: closed

Component: HTTP handling | Version: 1.11
Severity: Normal | Resolution: wontfix

Keywords: | Triage Stage: Unreviewed
Has patch: 0 | Needs documentation: 0
Needs tests: 0 | Patch needs improvement: 0
Easy pickings: 0 | UI/UX: 0
---------------------------------+--------------------------------------

Comment (by Claude Paroz):

An alternative would be to let undecodable quoted values as is in
`request.POST`: `request.POST['want']` -> `'caf%E9'`.
Then if the app knows some posted content will contain such a value, it
can take care of unquoting the value itself. I think it would add less
backwards incompatibility.

--
Ticket URL: <https://code.djangoproject.com/ticket/29323#comment:12>

Reply all
Reply to author
Forward
0 new messages