#36897: Improve repercent_broken_unicode() performance
-------------------------------------+-------------------------------------
Reporter: Tarek Nakkouch | Owner:
Type: | beestarkdev
Cleanup/optimization | Status: assigned
Component: Utilities | Version: 6.0
Severity: Normal | Resolution:
Keywords: | Triage Stage: Accepted
Has patch: 1 | Needs documentation: 0
Needs tests: 0 | Patch needs improvement: 0
Easy pickings: 0 | UI/UX: 0
-------------------------------------+-------------------------------------
Comment (by David Smith):
The last couple of times this function has been touched is to address CVEs
(see 1 and 2 below). So anything here needs to be treated with caution, i
feel.
I confirmed that a long string of invalid characters has much slower
performance than a similarly sized string which is valid. I prepared a
benchmark at [
https://github.com/smithdc1/django-asv/tree/uri_to_iri asv
benchmark]. The patch propsed above has a ~15% improvement for the worse
case scenario but is a more complex patch.
Anthropic's Claude pointed out to me that you can register custom error
handlers for codecs, see
[
https://docs.python.org/3/library/codecs.html#codecs.register_error
python docs]. This approach has a ~4x increase in performance for the
scenario where there is a long string of invalid inputs to decode.
I would suggest next steps would be to investigte if this approach has any
security issues and to be as certain as we can be that we don't re-
introduce any issues that have previosuly been addressed.
{{{#!diff
diff --git a/django/utils/encoding.py b/django/utils/encoding.py
index e57e2a2ba1..c65db586cb 100644
--- a/django/utils/encoding.py
+++ b/django/utils/encoding.py
@@ -210,24 +210,22 @@ def punycode(domain):
return domain.encode("idna").decode("ascii")
+def _repercent_error_handler(exc):
+ invalid_input = exc.object[exc.start : exc.end]
+ repercent = quote(invalid_input, safe=b"/#%[]=:;$&()+,!?*@'~")
+ return repercent, exc.end
+
+
+codecs.register_error("django_repercent_error_handler",
_repercent_error_handler)
+
+
def repercent_broken_unicode(path):
"""
As per RFC 3987 Section 3.2, step three of converting a URI into an
IRI,
repercent-encode any octet produced that is not part of a strictly
legal
UTF-8 octet sequence.
"""
- changed_parts = []
- while True:
- try:
- path.decode()
- except UnicodeDecodeError as e:
- # CVE-2019-14235: A recursion shouldn't be used since the
exception
- # handling uses massive amounts of memory
- repercent = quote(path[e.start : e.end],
safe=b"/#%[]=:;$&()+,!?*@'~")
- changed_parts.append(path[: e.start] + repercent.encode())
- path = path[e.end :]
- else:
- return b"".join(changed_parts) + path
+ return path.decode("utf-8",
"django_repercent_error_handler").encode("utf-8")
def filepath_to_uri(path):
diff --git a/tests/utils_tests/test_encoding.py
b/tests/utils_tests/test_encoding.py
index e0ee190431..01bddfeeff 100644
--- a/tests/utils_tests/test_encoding.py
+++ b/tests/utils_tests/test_encoding.py
@@ -128,17 +128,19 @@ class TestEncodingUtils(SimpleTestCase):
decoded_paths = []
def mock_quote(*args, **kwargs):
- # The second frame is the call to repercent_broken_unicode().
-
decoded_paths.append(inspect.currentframe().f_back.f_locals["path"])
+ # The second frame is the call from _repercent_encode.
+ decoded_paths.append(
+ inspect.currentframe().f_back.f_locals["invalid_input"]
+ )
return quote(*args, **kwargs)
with mock.patch("django.utils.encoding.quote", mock_quote):
self.assertEqual(repercent_broken_unicode(data),
b"test%FCtest%FCtest%FC")
- # decode() is called on smaller fragment of the path each time.
+ # decode() is called on each invalid fragment
self.assertEqual(
decoded_paths,
- [b"test\xfctest\xfctest\xfc", b"test\xfctest\xfc",
b"test\xfc"],
+ [b"\xfc", b"\xfc", b"\xfc"],
)
}}}
1 - 76ed1c49f804d409cfc2911a890c78584db3c76e
2 - 3f41d6d62929dfe53eda8109b3b836f26645bdce
--
Ticket URL: <
https://code.djangoproject.com/ticket/36897#comment:8>