Re: [Django] #36897: Improve repercent_broken_unicode() performance

2 views
Skip to first unread message

Django

unread,
Feb 1, 2026, 12:24:53 AMFeb 1
to django-...@googlegroups.com
#36897: Improve repercent_broken_unicode() performance
-------------------------------------+-------------------------------------
Reporter: Tarek Nakkouch | Owner: Harsh007
Type: | Status: assigned
Cleanup/optimization |
Component: Utilities | Version: 6.0
Severity: Normal | Resolution:
Keywords: | Triage Stage:
| Unreviewed
Has patch: 0 | Needs documentation: 0
Needs tests: 0 | Patch needs improvement: 0
Easy pickings: 0 | UI/UX: 0
-------------------------------------+-------------------------------------
Changes (by Harsh007):

* cc: Harsh007 (added)
* owner: (none) => Harsh007
* status: new => assigned

--
Ticket URL: <https://code.djangoproject.com/ticket/36897#comment:2>
Django <https://code.djangoproject.com/>
The Web framework for perfectionists with deadlines.

Django

unread,
Feb 1, 2026, 9:26:28 AMFeb 1
to django-...@googlegroups.com
#36897: Improve repercent_broken_unicode() performance
-------------------------------------+-------------------------------------
Reporter: Tarek Nakkouch | Owner: (none)
Type: | Status: new
Cleanup/optimization |
Component: Utilities | Version: 6.0
Severity: Normal | Resolution:
Keywords: | Triage Stage:
| Unreviewed
Has patch: 0 | Needs documentation: 0
Needs tests: 0 | Patch needs improvement: 0
Easy pickings: 0 | UI/UX: 0
-------------------------------------+-------------------------------------
Changes (by Harsh007):

* owner: Harsh007 => (none)
* status: assigned => new

--
Ticket URL: <https://code.djangoproject.com/ticket/36897#comment:3>

Django

unread,
Feb 1, 2026, 11:43:04 AMFeb 1
to django-...@googlegroups.com
#36897: Improve repercent_broken_unicode() performance
-------------------------------------+-------------------------------------
Reporter: Tarek Nakkouch | Owner:
Type: | beestarkdev
Cleanup/optimization | Status: assigned
Component: Utilities | Version: 6.0
Severity: Normal | Resolution:
Keywords: | Triage Stage:
| Unreviewed
Has patch: 0 | Needs documentation: 0
Needs tests: 0 | Patch needs improvement: 0
Easy pickings: 0 | UI/UX: 0
-------------------------------------+-------------------------------------
Changes (by beestarkdev):

* owner: (none) => beestarkdev
* status: new => assigned

--
Ticket URL: <https://code.djangoproject.com/ticket/36897#comment:4>

Django

unread,
Feb 1, 2026, 2:34:49 PMFeb 1
to django-...@googlegroups.com
#36897: Improve repercent_broken_unicode() performance
-------------------------------------+-------------------------------------
Reporter: Tarek Nakkouch | Owner:
Type: | beestarkdev
Cleanup/optimization | Status: assigned
Component: Utilities | Version: 6.0
Severity: Normal | Resolution:
Keywords: | Triage Stage: Accepted
Has patch: 0 | Needs documentation: 0
Needs tests: 0 | Patch needs improvement: 0
Easy pickings: 0 | UI/UX: 0
-------------------------------------+-------------------------------------
Changes (by beestarkdev):

* stage: Unreviewed => Accepted

--
Ticket URL: <https://code.djangoproject.com/ticket/36897#comment:5>

Django

unread,
Feb 1, 2026, 2:39:09 PMFeb 1
to django-...@googlegroups.com
#36897: Improve repercent_broken_unicode() performance
-------------------------------------+-------------------------------------
Reporter: Tarek Nakkouch | Owner:
Type: | beestarkdev
Cleanup/optimization | Status: assigned
Component: Utilities | Version: 6.0
Severity: Normal | Resolution:
Keywords: | Triage Stage: Accepted
Has patch: 1 | Needs documentation: 0
Needs tests: 0 | Patch needs improvement: 0
Easy pickings: 0 | UI/UX: 0
-------------------------------------+-------------------------------------
Changes (by beestarkdev):

* has_patch: 0 => 1

--
Ticket URL: <https://code.djangoproject.com/ticket/36897#comment:6>

Django

unread,
Feb 1, 2026, 2:44:26 PMFeb 1
to django-...@googlegroups.com
#36897: Improve repercent_broken_unicode() performance
-------------------------------------+-------------------------------------
Reporter: Tarek Nakkouch | Owner:
Type: | beestarkdev
Cleanup/optimization | Status: assigned
Component: Utilities | Version: 6.0
Severity: Normal | Resolution:
Keywords: | Triage Stage: Accepted
Has patch: 1 | Needs documentation: 0
Needs tests: 0 | Patch needs improvement: 0
Easy pickings: 0 | UI/UX: 0
-------------------------------------+-------------------------------------
Comment (by beestarkdev):

Hi, here is the pull request: https://github.com/django/django/pull/20626

This is my first time ever contributing to open source so please feel free
to give me feedback if there's anything I can improve on.

I have attempted to add another optimization to this function in addition
to the recommendation here. I will post the testing/benchmarking
methodologies as well in the pull request for full transparency. Thank
you!
--
Ticket URL: <https://code.djangoproject.com/ticket/36897#comment:7>

Django

unread,
Jul 2, 2026, 3:16:20 AM (2 days ago) Jul 2
to django-...@googlegroups.com
#36897: Improve repercent_broken_unicode() performance
-------------------------------------+-------------------------------------
Reporter: Tarek Nakkouch | Owner:
Type: | beestarkdev
Cleanup/optimization | Status: assigned
Component: Utilities | Version: 6.0
Severity: Normal | Resolution:
Keywords: | Triage Stage: Accepted
Has patch: 1 | Needs documentation: 0
Needs tests: 0 | Patch needs improvement: 0
Easy pickings: 0 | UI/UX: 0
-------------------------------------+-------------------------------------
Comment (by David Smith):

The last couple of times this function has been touched is to address CVEs
(see 1 and 2 below). So anything here needs to be treated with caution, i
feel.

I confirmed that a long string of invalid characters has much slower
performance than a similarly sized string which is valid. I prepared a
benchmark at [https://github.com/smithdc1/django-asv/tree/uri_to_iri asv
benchmark]. The patch propsed above has a ~15% improvement for the worse
case scenario but is a more complex patch.

Anthropic's Claude pointed out to me that you can register custom error
handlers for codecs, see
[https://docs.python.org/3/library/codecs.html#codecs.register_error
python docs]. This approach has a ~4x increase in performance for the
scenario where there is a long string of invalid inputs to decode.

I would suggest next steps would be to investigte if this approach has any
security issues and to be as certain as we can be that we don't re-
introduce any issues that have previosuly been addressed.

{{{#!diff
diff --git a/django/utils/encoding.py b/django/utils/encoding.py
index e57e2a2ba1..c65db586cb 100644
--- a/django/utils/encoding.py
+++ b/django/utils/encoding.py
@@ -210,24 +210,22 @@ def punycode(domain):
return domain.encode("idna").decode("ascii")


+def _repercent_error_handler(exc):
+ invalid_input = exc.object[exc.start : exc.end]
+ repercent = quote(invalid_input, safe=b"/#%[]=:;$&()+,!?*@'~")
+ return repercent, exc.end
+
+
+codecs.register_error("django_repercent_error_handler",
_repercent_error_handler)
+
+
def repercent_broken_unicode(path):
"""
As per RFC 3987 Section 3.2, step three of converting a URI into an
IRI,
repercent-encode any octet produced that is not part of a strictly
legal
UTF-8 octet sequence.
"""
- changed_parts = []
- while True:
- try:
- path.decode()
- except UnicodeDecodeError as e:
- # CVE-2019-14235: A recursion shouldn't be used since the
exception
- # handling uses massive amounts of memory
- repercent = quote(path[e.start : e.end],
safe=b"/#%[]=:;$&()+,!?*@'~")
- changed_parts.append(path[: e.start] + repercent.encode())
- path = path[e.end :]
- else:
- return b"".join(changed_parts) + path
+ return path.decode("utf-8",
"django_repercent_error_handler").encode("utf-8")


def filepath_to_uri(path):
diff --git a/tests/utils_tests/test_encoding.py
b/tests/utils_tests/test_encoding.py
index e0ee190431..01bddfeeff 100644
--- a/tests/utils_tests/test_encoding.py
+++ b/tests/utils_tests/test_encoding.py
@@ -128,17 +128,19 @@ class TestEncodingUtils(SimpleTestCase):
decoded_paths = []

def mock_quote(*args, **kwargs):
- # The second frame is the call to repercent_broken_unicode().
-
decoded_paths.append(inspect.currentframe().f_back.f_locals["path"])
+ # The second frame is the call from _repercent_encode.
+ decoded_paths.append(
+ inspect.currentframe().f_back.f_locals["invalid_input"]
+ )
return quote(*args, **kwargs)

with mock.patch("django.utils.encoding.quote", mock_quote):
self.assertEqual(repercent_broken_unicode(data),
b"test%FCtest%FCtest%FC")

- # decode() is called on smaller fragment of the path each time.
+ # decode() is called on each invalid fragment
self.assertEqual(
decoded_paths,
- [b"test\xfctest\xfctest\xfc", b"test\xfctest\xfc",
b"test\xfc"],
+ [b"\xfc", b"\xfc", b"\xfc"],
)
}}}

1 - 76ed1c49f804d409cfc2911a890c78584db3c76e
2 - 3f41d6d62929dfe53eda8109b3b836f26645bdce
--
Ticket URL: <https://code.djangoproject.com/ticket/36897#comment:8>

Django

unread,
Jul 2, 2026, 3:20:10 AM (2 days ago) Jul 2
to django-...@googlegroups.com
#36897: Improve repercent_broken_unicode() performance
-------------------------------------+-------------------------------------
Reporter: Tarek Nakkouch | Owner:
Type: | beestarkdev
Cleanup/optimization | Status: assigned
Component: Utilities | Version: 6.0
Severity: Normal | Resolution:
Keywords: | Triage Stage: Accepted
Has patch: 1 | Needs documentation: 0
Needs tests: 0 | Patch needs improvement: 1
Easy pickings: 0 | UI/UX: 0
-------------------------------------+-------------------------------------
Changes (by David Smith):

* needs_better_patch: 0 => 1

--
Ticket URL: <https://code.djangoproject.com/ticket/36897#comment:9>
Reply all
Reply to author
Forward
0 new messages