#36520: Performance Regression in parse_header_params
-------------------------------------+-------------------------------------
Reporter: David Smith | Owner: Natalia
| Bidart
Type: Bug | Status: assigned
Component: HTTP handling | Version: dev
Severity: Release blocker | Resolution:
Keywords: | Triage Stage: Accepted
Has patch: 0 | Needs documentation: 0
Needs tests: 0 | Patch needs improvement: 0
Easy pickings: 0 | UI/UX: 0
-------------------------------------+-------------------------------------
Comment (by Natalia Bidart):
I started looking into this AGAIN to see how the branch using bytes would
look like. In my head, "passing bytes" to `Message.get_params` would just
be:
{{{#!diff
diff --git a/django/utils/http.py b/django/utils/http.py
index 504f28c678..54dc85aead 100644
--- a/django/utils/http.py
+++ b/django/utils/http.py
@@ -327,7 +327,7 @@ def parse_header_parameters(line,
max_length=MAX_HEADER_LENGTH):
raise ValueError("Unable to parse header parameters (value too
long).")
m = Message()
- m["content-type"] = line
+ m["content-type"] = line.encode("utf-8")
params = m.get_params()
pdict = {}
}}}
But this change generates all sort of test failures: some are "expected"
because we'd need to `decode` the results into what's actually expected as
a `str` result, but other show jusy plain wrong results. For example, for
a `rfc2231` like this:
{{{
Content-Type: application/x-stuff; title*=us-ascii'en-
us'This%20is%20%2A%2A%2Afun%2A%2A%2A
}}}
Passing str to `Message.get_params` results in the correct:
{{{
[('Content-Type: application/x-stuff', ''), ('title', ('us-ascii', 'en-
us', 'This is ***fun***'))]
}}}
But passing bytes yields:
{{{
[('b"content-type: application/x-stuff; title*', 'us-ascii\'en-
us\'This%20is%20%2A%2A%2Afun%2A%2A%2A"')]
}}}
I further debugged and ended up in a rabbit hole in the Python code, to
realize that `get_params` is not, at least to my exhausted eyes, expecting
bytes in any way! I posted an update in the Python Discourse, but I would
truly appreciate your opinion on this since at this point my head is very
confused.
I dug into `Message.get_params()`, and from the call chain it seems clear
that everything is built around `str` headers. For example,
`_parseparam()` (the call chain is `get_params` -> `_get_params_preserve`
-> `_parseparam`) starts with these two lines
(
https://github.com/python/cpython/blob/main/Lib/email/message.py#L73):
{{{#!python
def _parseparam(s):
# RDM This might be a Header, so for now stringify it.
s = ';' + str(s)
}}}
It then string-splits, string-strips, and string-compares all the way
down. If I assign bytes to the header, the parsing doesnβt work as
expected: you basically just get the `repr()` of the bytes object. So
while I understand the idea of avoiding decoding overhead, I don't see a
way to feed bytes into `Message` and get equivalent results. The API
really assumes `str` throughout.
=== Side note===
There is ALSO the question of whether the header name should or should not
be passed to `parse_header_parameters`. On a second look in our code, I
think that `line` is expected to have the header name in it. Tests are
super confusing in this regard, see how `test_basic` does not pass it but
the rest of test cases do!
https://github.com/django/django/blob/main/tests/utils_tests/test_http.py
--
Ticket URL: <
https://code.djangoproject.com/ticket/36520#comment:21>