[Django] #36897: Optimize repercent_broken_unicode() performance

Django

Jan 31, 2026, 10:09:37 AM

to django-...@googlegroups.com

#36897: Optimize repercent_broken_unicode() performance
-------------------------------------+-------------------------------------
Reporter: Tarek Nakkouch             | Type: Cleanup/optimization
Status: new                          | Component: Utilities
Version: 6.0                         | Severity: Normal
Keywords:                            | Triage Stage: Unreviewed
Has patch: 0                         | Needs documentation: 0
Needs tests: 0                       | Patch needs improvement: 0
Easy pickings: 0                     | UI/UX: 0
-------------------------------------+-------------------------------------
The `repercent_broken_unicode()` function in `django/utils/encoding.py`
is slow when processing URLs with many consecutive invalid UTF-8 bytes.
The bottleneck has two causes: a `UnicodeDecodeError` is raised for each
invalid byte, and each iteration creates an intermediate bytes object
through concatenation.

{{{#!python
changed_parts = []
while True:
    try:
        path.decode()
    except UnicodeDecodeError as e:
        repercent = quote(path[e.start : e.end],
                          safe=b"/#%[]=:;$&()+,!?*@'~")
        # creates new bytes object
        changed_parts.append(path[: e.start] + repercent.encode())
        path = path[e.end :]
    else:
        return b"".join(changed_parts) + path
}}}

== Suggested optimization ==

The simplest solution is to append the byte parts to the list separately
instead of concatenating them with the `+` operator, which avoids creating
intermediate bytes objects. This provides a ~40% improvement while keeping
the same exception-based approach:

{{{#!python
changed_parts = []
while True:
    try:
        path.decode()
    except UnicodeDecodeError as e:
        repercent = quote(path[e.start : e.end],
                          safe=b"/#%[]=:;$&()+,!?*@'~")
        changed_parts.append(path[: e.start])
        changed_parts.append(repercent.encode())
        path = path[e.end :]
    else:
        changed_parts.append(path)
        return b"".join(changed_parts)
}}}
}}}
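
One way to check the difference between the two variants locally is a small `timeit` harness such as the sketch below. The function names `repercent_concat` and `repercent_append`, the sample input, and the repeat count are illustrative assumptions, not part of Django; the measured ratio will depend on the input and the Python version, so no particular result is implied here.

```python
import timeit
from urllib.parse import quote

SAFE = b"/#%[]=:;$&()+,!?*@'~"


def repercent_concat(path):
    """Existing loop: concatenates the prefix and the quoted bytes."""
    changed_parts = []
    while True:
        try:
            path.decode()
        except UnicodeDecodeError as e:
            repercent = quote(path[e.start : e.end], safe=SAFE)
            changed_parts.append(path[: e.start] + repercent.encode())
            path = path[e.end :]
        else:
            return b"".join(changed_parts) + path


def repercent_append(path):
    """Suggested loop: appends the prefix and the quoted bytes separately."""
    changed_parts = []
    while True:
        try:
            path.decode()
        except UnicodeDecodeError as e:
            repercent = quote(path[e.start : e.end], safe=SAFE)
            changed_parts.append(path[: e.start])
            changed_parts.append(repercent.encode())
            path = path[e.end :]
        else:
            changed_parts.append(path)
            return b"".join(changed_parts)


# Hypothetical worst-case input: many consecutive invalid bytes.
sample = b"/path/" + b"\xff" * 2000

# Both variants must produce identical output before timing them.
assert repercent_concat(sample) == repercent_append(sample)

for fn in (repercent_concat, repercent_append):
    seconds = timeit.timeit(lambda: fn(sample), number=20)
    print(f"{fn.__name__}: {seconds:.4f}s for 20 runs")
```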

Alternatively, a manual UTF-8 validation approach could eliminate the
exception overhead entirely by scanning byte by byte and checking UTF-8
patterns to identify invalid sequences without raising exceptions. This
would reduce processing time by ~80%, though the implementation is more
complex.
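
As a rough illustration of what such a scan might look like, the sketch below walks the input once, copies every well-formed UTF-8 sequence unchanged, and percent-encodes any byte that does not start one. The names `repercent_broken_unicode_scan` and `_utf8_sequence_length` are hypothetical, and this is not the implementation proposed for Django. It relies on the assumption that every invalid byte is non-ASCII, so it can be emitted directly as `%XX` without consulting the `safe` set. The sketch has not been benchmarked and shows only the validation logic.

```python
def _utf8_sequence_length(data, i):
    """Return the length of the well-formed UTF-8 sequence at data[i], or 0."""
    b0 = data[i]
    if b0 < 0x80:
        return 1
    # (continuation count, allowed range of the first continuation byte)
    if 0xC2 <= b0 <= 0xDF:
        need, lo, hi = 1, 0x80, 0xBF
    elif b0 == 0xE0:
        need, lo, hi = 2, 0xA0, 0xBF  # reject overlong 3-byte forms
    elif b0 == 0xED:
        need, lo, hi = 2, 0x80, 0x9F  # reject UTF-16 surrogates
    elif 0xE1 <= b0 <= 0xEF:
        need, lo, hi = 2, 0x80, 0xBF
    elif b0 == 0xF0:
        need, lo, hi = 3, 0x90, 0xBF  # reject overlong 4-byte forms
    elif b0 == 0xF4:
        need, lo, hi = 3, 0x80, 0x8F  # reject code points above U+10FFFF
    elif 0xF1 <= b0 <= 0xF3:
        need, lo, hi = 3, 0x80, 0xBF
    else:
        return 0  # stray continuation byte or invalid lead byte
    if i + need >= len(data):
        return 0  # sequence is truncated at the end of the input
    if not lo <= data[i + 1] <= hi:
        return 0
    for k in range(2, need + 1):
        if not 0x80 <= data[i + k] <= 0xBF:
            return 0
    return need + 1


def repercent_broken_unicode_scan(path):
    """Percent-encode invalid UTF-8 bytes in one pass, without exceptions."""
    out = bytearray()
    i, n = 0, len(path)
    while i < n:
        length = _utf8_sequence_length(path, i)
        if length:
            out += path[i : i + length]
            i += length
        else:
            out += b"%%%02X" % path[i]
            i += 1
    return bytes(out)
```

Any real patch along these lines would need to be checked against the existing exception-based function on a broad set of inputs, since the two must agree on exactly which bytes count as invalid.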
--
Ticket URL: <https://code.djangoproject.com/ticket/36897>
Django <https://code.djangoproject.com/>
The Web framework for perfectionists with deadlines.
