#36897: Optimize repercent_broken_unicode() performance
-------------------------------------+-------------------------------------
Reporter: Tarek Nakkouch | Type: Cleanup/optimization
Status: new | Component: Utilities
Version: 6.0 | Severity: Normal
Keywords: | Triage Stage: Unreviewed
Has patch: 0 | Needs documentation: 0
Needs tests: 0 | Patch needs improvement: 0
Easy pickings: 0 | UI/UX: 0
-------------------------------------+-------------------------------------
The `repercent_broken_unicode()` function in `django/utils/encoding.py`
is slow when processing URLs with many consecutive invalid UTF-8 bytes.
The bottleneck is twofold: an exception is raised and caught for each
invalid byte, and each iteration creates an intermediate bytes object
through concatenation. The current loop:
{{{#!python
changed_parts = []
while True:
    try:
        path.decode()
    except UnicodeDecodeError as e:
        repercent = quote(path[e.start : e.end],
                          safe=b"/#%[]=:;$&()+,!?*@'~")
        # creates new bytes object
        changed_parts.append(path[: e.start] + repercent.encode())
        path = path[e.end :]
    else:
        return b"".join(changed_parts) + path
}}}
== Suggested optimization ==
The simplest solution is to append byte parts separately to the list
instead of concatenating them with the `+` operator, avoiding creation of
intermediate bytes objects. This provides ~40% improvement while keeping
the same exception-based approach:
{{{#!python
changed_parts = []
while True:
    try:
        path.decode()
    except UnicodeDecodeError as e:
        repercent = quote(path[e.start : e.end],
                          safe=b"/#%[]=:;$&()+,!?*@'~")
        changed_parts.append(path[: e.start])
        changed_parts.append(repercent.encode())
        path = path[e.end :]
    else:
        changed_parts.append(path)
        return b"".join(changed_parts)
}}}
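To check locally that the change preserves behaviour, and to time it, the two loops above can be wrapped as standalone functions. This is only a sketch: the names `repercent_concat` and `repercent_append`, the `SAFE` constant, and the sample input are hypothetical and not part of Django, and the measured ratio will depend on the input and the Python version.

```python
import timeit
from urllib.parse import quote

# Same safe set as in the snippets above.
SAFE = b"/#%[]=:;$&()+,!?*@'~"


def repercent_concat(path):
    """Current loop: prefix and replacement are joined with `+`."""
    changed_parts = []
    while True:
        try:
            path.decode()
        except UnicodeDecodeError as e:
            repercent = quote(path[e.start : e.end], safe=SAFE)
            changed_parts.append(path[: e.start] + repercent.encode())
            path = path[e.end :]
        else:
            return b"".join(changed_parts) + path


def repercent_append(path):
    """Suggested loop: each part is appended separately."""
    changed_parts = []
    while True:
        try:
            path.decode()
        except UnicodeDecodeError as e:
            repercent = quote(path[e.start : e.end], safe=SAFE)
            changed_parts.append(path[: e.start])
            changed_parts.append(repercent.encode())
            path = path[e.end :]
        else:
            changed_parts.append(path)
            return b"".join(changed_parts)


if __name__ == "__main__":
    # Hypothetical worst-case style input: a long run of invalid bytes.
    sample = b"/path/" + b"\xff" * 2000 + b"/caf\xc3\xa9"
    assert repercent_concat(sample) == repercent_append(sample)
    for fn in (repercent_concat, repercent_append):
        seconds = timeit.timeit(lambda: fn(sample), number=10)
        print(f"{fn.__name__}: {seconds:.4f}s")
```

Both functions still raise one exception per invalid sequence, so any difference measured here comes only from how the parts are collected.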
Alternatively, a manual UTF-8 validation approach could eliminate
exception overhead entirely by scanning byte-by-byte and checking UTF-8
patterns to identify invalid sequences without raising exceptions. This
would reduce processing time by ~80%, though the implementation is more
complex.
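One possible shape for such a scanner is sketched below. It is not the reporter's implementation and has not been benchmarked here; the names `repercent_scan`, `repercent_reference`, `_lead_info`, and `SAFE` are hypothetical. It assumes the well-formed byte ranges from the Unicode standard (Table 3-7) and treats a lead byte plus any valid continuation bytes seen so far as one invalid run, which is intended to match the `e.start`/`e.end` ranges reported by CPython's decoder.

```python
from urllib.parse import quote

SAFE = b"/#%[]=:;$&()+,!?*@'~"


def _lead_info(byte):
    """Return (continuation_count, low, high) for a non-ASCII byte.

    low/high bound the first continuation byte only. A count of 0
    means the byte cannot start a well-formed sequence.
    """
    if 0xC2 <= byte <= 0xDF:
        return 1, 0x80, 0xBF
    if byte == 0xE0:
        return 2, 0xA0, 0xBF  # excludes overlong 3-byte forms
    if byte == 0xED:
        return 2, 0x80, 0x9F  # excludes UTF-16 surrogates
    if 0xE1 <= byte <= 0xEF:
        return 2, 0x80, 0xBF
    if byte == 0xF0:
        return 3, 0x90, 0xBF  # excludes overlong 4-byte forms
    if 0xF1 <= byte <= 0xF3:
        return 3, 0x80, 0xBF
    if byte == 0xF4:
        return 3, 0x80, 0x8F  # excludes code points above U+10FFFF
    return 0, 0, 0


def repercent_scan(path):
    """Percent-encode invalid UTF-8 bytes without raising exceptions."""
    parts = []
    n = len(path)
    i = clean_start = 0
    while i < n:
        byte = path[i]
        if byte < 0x80:  # ASCII is always valid
            i += 1
            continue
        need, low, high = _lead_info(byte)
        j, found = i + 1, 0
        while found < need and j < n and low <= path[j] <= high:
            j += 1
            found += 1
            low, high = 0x80, 0xBF  # later continuations use the full range
        if need and found == need:  # complete, well-formed sequence
            i = j
            continue
        # path[i:j] is invalid: flush the clean prefix, then the escape.
        parts.append(path[clean_start:i])
        parts.append(quote(path[i:j], safe=SAFE).encode())
        i = clean_start = j
    parts.append(path[clean_start:])
    return b"".join(parts)


def repercent_reference(path):
    """Exception-based loop from the ticket, kept for comparison."""
    changed_parts = []
    while True:
        try:
            path.decode()
        except UnicodeDecodeError as e:
            repercent = quote(path[e.start : e.end], safe=SAFE)
            changed_parts.append(path[: e.start])
            changed_parts.append(repercent.encode())
            path = path[e.end :]
        else:
            changed_parts.append(path)
            return b"".join(changed_parts)
```

Any real patch along these lines would need to be compared against the exception-based loop on edge cases such as truncated sequences, overlong encodings, and surrogates, since a pure-Python scan may also be slower than `bytes.decode()` on mostly valid input.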
--
Ticket URL: <https://code.djangoproject.com/ticket/36897>
Django <https://code.djangoproject.com/>
The Web framework for perfectionists with deadlines.