Outside of looking at the pstats output, the only proper benchmarking
I've done is with timeit ... on a laptop. And unfortunately the bits
being discussed are all quick enough that the differences may well be
within the margin of error, especially given the results can be affected
by CPU throttling, or other tasks running concurrently (though I've
tried to avoid that as much as possible). It's entirely possible that I
accidentally got the baseline results during a throttled period, etc. So
I'm going to -- possibly wrongly -- lean on the pstats output. But by
all accounts the theory should be sound, because it's "do less work"...
Note that `DEBUG=False` for all of these, but it's irrelevant because I'm
not going through the template loaders, so there's no cached Loader in
play (and I'm not getting the DebugLexer as a result). All of the work is
based on instantiating a `Template` from a string, never getting as far
as rendering it.
== Baseline
To get a baseline of where things are, I've read the contents of
`admin/change_list.html` into a variable, which gives the following:
{{{
In [3]: %timeit -r5 -n5000 Template(changelist)
2.08 ms ± 29.1 µs per loop (mean ± std. dev. of 5 runs, 5000 loops each)
}}}
To try and boost the signal-to-noise ratio over millisecond fluctuations, we'll also do:
{{{
def runtemplate(source=changelist, n=1):
    for x in range(n):
        t = Template(source)
}}}
and get the baseline for that:
{{{
In [2]: %timeit -r5 -n1 runtemplate(n=5000)
10.5 s ± 335 ms per loop (mean ± std. dev. of 5 runs, 1 loop each)
}}}
From there, I've profiled `runtemplate(n=1000)` (cProfile-style output,
though it's yappi actually) to amortize any other setup/import/machinery
costs and push the lexing & parsing parts into the spotlight. I've
cherry-picked the entries relevant to the changes I'm putting forth, and
for the purpose of separating out the distinct `get_nodes_by_type`
methods being run, copy-pasted the default `Node.get_nodes_by_type`
(sigh) into each subclass.
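For anyone wanting to reproduce, the harness looked roughly like this.
It's a sketch rather than the exact script: the settings, file path, and
the choice of CPU clock are my assumptions here, not verbatim.
{{{
# Rough profiling harness (sketch; settings/paths are assumptions).
import django
import yappi
from django.conf import settings

settings.configure(
    INSTALLED_APPS=[
        "django.contrib.admin",
        "django.contrib.auth",
        "django.contrib.contenttypes",
    ],
    TEMPLATES=[{
        "BACKEND": "django.template.backends.django.DjangoTemplates",
        "APP_DIRS": True,
    }],
)
django.setup()

from django.template import Template

with open("django/contrib/admin/templates/admin/change_list.html") as fh:
    changelist = fh.read()

# Same helper as above.
def runtemplate(source=changelist, n=1):
    for x in range(n):
        t = Template(source)

yappi.set_clock_type("cpu")  # CPU time, to dampen scheduling noise
yappi.start()
runtemplate(n=1000)
yappi.stop()
# Save in pstats format so it can be diffed against later runs.
yappi.get_func_stats().save("django_baseline.pstat", type="pstat")
}}}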
I've added a "percentage of time" at the end of each line, calculated
as `(tottime / overall time) * 100` (eg: `Lexer.create_token` is `(1.418 /
30.018) * 100`) ... these should ideally go down in the general case, but
with each subsequent change they may go back up because something else
changed position, so they're a guide more than anything.
{{{
 6776637 function calls (6946213 primitive calls) in 30.018 seconds

 Ordered by: internal time

 ncalls          tottime  percall  cumtime  percall  filename:lineno(function)
 1000/31000        2.410    0.002   24.801    0.001  django/template/base.py:445(Parser.parse) (8%)
 160000            1.418    0.000    2.436    0.000  django/template/base.py:367(Lexer.create_token) (4.7%)
 687159            1.292    0.000    1.293    0.000  builtins:0(isinstance) (4.3%)
 2000/56000        1.199    0.001    5.095    0.000  django/template/base.py:962(NodeList.get_nodes_by_type) (4%)
 131000            1.100    0.000    1.835    0.000  django/template/base.py:512(Parser.extend_nodelist) (3.7%)
 1000              1.047    0.001    4.206    0.004  django/template/base.py:353(Lexer.tokenize) (3.5%)
 59000             0.911    0.000    3.220    0.000  django/template/base.py:768(Variable.__init__) (3%)
 423338            0.853    0.000    0.853    0.000  __builtin__:0(<method 'startswith' of 'str' objects>) (2.8%)
 140000            0.820    0.000    1.345    0.000  django/template/base.py:972(TextNode.get_nodes_by_type) (2.7%)
 371343/371373     0.770    0.000    1.235    0.000  builtins:0(getattr) (2.6%)
 24000             0.141    0.000    0.231    0.000  django/template/base.py:1024(VariableNode.get_nodes_by_type) (0.5%)
 10000             0.059    0.000    0.097    0.000  django/templatetags/i18n.py:72(TranslateNode.get_nodes_by_type) (0.2%)
 6000              0.036    0.000    0.059    0.000  django/template/defaulttags.py:434(URLNode.get_nodes_by_type) (0.1%)
 4000              0.024    0.000    0.039    0.000  django/templatetags/static.py:95(StaticNode.get_nodes_by_type) (0.1%)
 2000              0.012    0.000    0.020    0.000  django/template/defaulttags.py:361(LoadNode.get_nodes_by_type) (0%)
 2000              0.012    0.000    0.020    0.000  django/template/library.py:186(SimpleNode.get_nodes_by_type) (0%)
 2000              0.012    0.000    0.020    0.000  django/template/defaulttags.py:53(CsrfTokenNode.get_nodes_by_type) (0%)
}}}
== After changes
So that you don't need to read all the rest of this just to find out the
difference, I'll front-load the final/ideal results. There's an
easier-to-read summary of the changes at the end, too.
{{{
 5682638 function calls (5852221 primitive calls) in 25.186 seconds

 Ordered by: internal time

 ncalls          tottime  percall  cumtime  percall  filename:lineno(function)
 1000/31000        2.000    0.002   21.065    0.001  django/template/base.py:446(Parser.parse) (7.9%)
 2000/56000        1.160    0.001    3.711    0.000  django/template/base.py:966(NodeList.get_nodes_by_type) (4.6%)
 556159            1.014    0.000    1.015    0.000  builtins:0(isinstance) (4%)
 1000              1.007    0.001    3.088    0.003  django/template/base.py:353(Lexer.tokenize) (4%)
 160000            0.892    0.000    1.387    0.000  django/template/base.py:367(Lexer.create_token) (3.5%)
 131000            0.815    0.000    1.292    0.000  django/template/base.py:514(Parser.extend_nodelist) (3.2%)
 59000             0.748    0.000    3.081    0.000  django/template/base.py:770(Variable.__init__) (2.9%)
 140000            0.528    0.000    0.778    0.000  django/template/base.py:976(TextNode.get_nodes_by_type) (2.1%)
 181343/181373     0.342    0.000    0.343    0.000  builtins:0(getattr) (1.4%)
 107338            0.222    0.000    0.222    0.000  __builtin__:0(<method 'startswith' of 'str' objects>) (0.9%)
 14000             0.142    0.000    0.243    0.000  django/template/defaulttags.py:280(IfNode.__init__) (0.5%)
 51000             0.101    0.000    0.101    0.000  django/template/defaulttags.py:288(IfNode.__iter__) (0.4%)
 24000             0.091    0.000    0.134    0.000  django/template/base.py:1028(VariableNode.get_nodes_by_type) (0.4%)
 10000             0.038    0.000    0.056    0.000  django/templatetags/i18n.py:72(TranslateNode.get_nodes_by_type) (0.2%)
 ... other get_nodes_by_type are all the same miniscule parts ...
}}}
and the timeit output:
{{{
In [3]: %timeit -r5 -n5000 Template(changelist)
1.82 ms ± 20 µs per loop (mean ± std. dev. of 5 runs, 5000 loops each)
In [4]: %timeit -r5 -n1 runtemplate(n=5000)
9.38 s ± 186 ms per loop (mean ± std. dev. of 5 runs, 1 loop each)
}}}
== Proposed changes
And now I'm going to describe the various bits, hopefully (maybe,
possibly) convince you with stats, and ideally avoid adding much in the
way of extra complexity.
From here on out, the percentages listed are the difference from the
''previous'' execution, rather than the baseline.
=== Optimising Lexer.create_token
Currently the lexer makes repeated calls to
`token_string.startswith(...)` to determine what kind of token to make.
Hoisting the first 2 chars as `token_start = token_string[0:2]` and then
doing `token_start == ...` is notionally faster, and it adds up because
this is the hot loop for lexing:
{{{
 ncalls   tottime  percall  cumtime  percall  filename:lineno(function)
 160000     0.892    0.000    1.380    0.000  django/template/base.py:367(Lexer.create_token) (4.7% -> 3.2%)
 165338     0.331    0.000    0.331    0.000  __builtin__:0(<method 'startswith' of 'str' objects>) (2.8% -> 1.2%)
}}}
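In sketch form, the change amounts to the following (a paraphrase of the
`create_token` branching, not the literal diff):
{{{
# Paraphrased sketch of the create_token() change; not the exact diff.
BLOCK_TAG_START, VARIABLE_TAG_START, COMMENT_TAG_START = "{%", "{{", "{#"

def kind_before(token_string):
    # Old shape: up to three startswith() method calls per token.
    if token_string.startswith(VARIABLE_TAG_START):
        return "var"
    elif token_string.startswith(BLOCK_TAG_START):
        return "block"
    elif token_string.startswith(COMMENT_TAG_START):
        return "comment"
    return "text"

def kind_after(token_string):
    # New shape: slice the first two characters once, then use plain
    # equality in each branch.
    token_start = token_string[0:2]
    if token_start == VARIABLE_TAG_START:
        return "var"
    elif token_start == BLOCK_TAG_START:
        return "block"
    elif token_start == COMMENT_TAG_START:
        return "comment"
    return "text"

assert kind_before("{{ pk }}") == kind_after("{{ pk }}") == "var"
}}}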
Obviously indexing into a string isn't free, it's just not trackable as a
method call; and `startswith` has the extra ergonomics of accepting a
`tuple` of values to check, which we don't need here. The lexer already
nods to micro-optimisation being desirable, with the comment `We could do
len(BLOCK_TAG_START) to be more "correct", but we've hard-coded the 2s
here for performance. And it's not like the TAG_START values are going to
change anytime, anyway.` way back in
234b7e372000381657250fe133cbcfa90ffbb2a9, so I think this is a safe
proposal for the same reasons.
=== Optimising Parser.parse
The method already carries the comment `Use the raw values here for
TokenType.* for a tiny performance boost.` We can micro-optimise it
further, because for each of the 3 `if` branches the opcodes are:
{{{
34 LOAD_FAST 3 (token)
36 LOAD_ATTR 3 (token_type)
38 LOAD_ATTR 4 (value)
40 LOAD_CONST 2 (0 or 1 or 2)
42 COMPARE_OP 2 (==)
}}}
By setting `token_type = token.token_type.value` and then re-using that
variable, it becomes
{{{
34 LOAD_FAST 3 (token)
36 LOAD_ATTR 3 (token_type)
38 LOAD_ATTR 4 (value)
40 STORE_FAST 4 (token_type)
}}}
followed by, for each comparison:
{{{
42 LOAD_FAST 4 (token_type)
44 LOAD_CONST 2 (0 or 1 or 2)
46 COMPARE_OP 2 (==)
}}}
which makes a small difference, at least when instrumented, because again
it's the hot loop:
{{{
 ncalls       tottime  percall  cumtime  percall  filename:lineno(function)
 1000/31000     1.894    0.002   22.146    0.001  django/template/base.py:446(Parser.parse) (8% -> 7.2%)
}}}
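Put as simplified, self-contained code, the shape is below; the raw
values follow `TokenType` in `django/template/base.py` (TEXT=0, VAR=1,
BLOCK=2), though this counting helper is mine, not the actual
`Parser.parse` body:
{{{
# Simplified sketch of the Parser.parse() shape; not the actual method.
from django.template.base import Lexer

def count_kinds(source):
    text = var = block = 0
    for token in Lexer(source).tokenize():
        # Hoist the attribute chain once per token; each branch then
        # compares a local int instead of re-reading token.token_type.
        token_type = token.token_type.value
        if token_type == 0:    # TokenType.TEXT
            text += 1
        elif token_type == 1:  # TokenType.VAR
            var += 1
        elif token_type == 2:  # TokenType.BLOCK
            block += 1
    return text, var, block

assert count_kinds("Hi {{ name }}{% if x %}!{% endif %}") == (2, 1, 2)
}}}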
=== Removing isinstance from Parser.extend_nodelist
Currently the method includes the line `if isinstance(nodelist, NodeList)
and not isinstance(node, TextNode):`, but everywhere the method is used
it's handed a freshly created `nodelist = NodeList()`, and I'm fairly
certain that a 3rd party calling it with something which isn't a
`NodeList` would error further on due to the missing methods. So the
first check is ''always'' true.
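In sketch form (the guard paraphrased, trimmed of the `must_be_first`
handling the real method also does):
{{{
from django.template.base import NodeList, TextNode

def extend_before(nodelist, node):
    # Old guard: pays for two isinstance() calls per node.
    if isinstance(nodelist, NodeList) and not isinstance(node, TextNode):
        nodelist.contains_nontext = True
    nodelist.append(node)

def extend_after(nodelist, node):
    # nodelist is always a NodeList in practice, so only the TextNode
    # check is kept.
    if not isinstance(node, TextNode):
        nodelist.contains_nontext = True
    nodelist.append(node)

nl = NodeList()
extend_after(nl, TextNode("hello"))
assert nl.contains_nontext is False
}}}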
{{{
 django3.pstat
 ncalls   tottime  percall  cumtime  percall  filename:lineno(function)
 556159     0.997    0.000    0.998    0.000  builtins:0(isinstance) (4.7% -> 3.9%)
 131000     0.793    0.000    1.263    0.000  django/template/base.py:514(Parser.extend_nodelist) (4.0% -> 3.1%)
}}}
=== Opting nodes out of having a nodelist
I'm slightly less sure of this one; it's sound in terms of passing the
test suite, and makes sense from what I know about Nodes/NodeLists, their
nature, and how/when they're shared...
All nodes currently inherit `child_nodelists = ('nodelist',)` if none is
given, but AFAICT a bunch of nodes can't & don't actually have that
attribute. This default definition causes `get_nodes_by_type` to run and
execute the following:
{{{
nodelist = getattr(self, attr, None)
if nodelist: ...
}}}
By declaring `child_nodelists = ()` for those nodes where children
aren't possible, those lines can be avoided. The `Node` subclasses I've
identified and adapted then behave like this:
{{{
 django4.pstat
 ncalls          tottime  percall  cumtime  percall  filename:lineno(function)
 2000/56000        1.139    0.001    4.148    0.000  django/template/base.py:964(NodeList.get_nodes_by_type) (4.5%, different implementation)
 12000/70000       0.633    0.000    3.870    0.000  django/template/base.py:938(BlockNode.get_nodes_by_type) (2.5%, unaffected/untouched)
 181343/181373     0.391    0.000    0.837    0.000  builtins:0(getattr) (2.8% -> 1.5%)
 140000            0.519    0.000    0.770    0.000  django/template/base.py:974(TextNode.get_nodes_by_type) (3.0% -> 2.1%)
 24000             0.089    0.000    0.132    0.000  django/template/base.py:1026(VariableNode.get_nodes_by_type) (0.5% -> 0.4%)
 10000             0.038    0.000    0.056    0.000  django/templatetags/i18n.py:72(TranslateNode.get_nodes_by_type) (0.2%)
 ... others are miniscule or same ...
}}}
Assuming my understanding is correct (it may not be!), there are almost
certainly other nodes which can omit the implicit child nodelist; these
are just the ones I have accounted for, but they're also probably the
lion's share of the cost anyway (with `TextNode` being the highest).
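A minimal sketch of the opt-out, using a hypothetical leaf node rather
than one of the actual Django classes:
{{{
from django.template.base import Node

class LeafNode(Node):
    # Opt out of the inherited child_nodelists = ('nodelist',), so
    # get_nodes_by_type() skips the getattr() probe entirely.
    child_nodelists = ()

    def render(self, context):
        return "static output"

leaf = LeafNode()
assert leaf.get_nodes_by_type(LeafNode) == [leaf]
}}}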
=== Optimising Variable.__init__
As with the micro-optimisations to `Lexer.create_token`, by hoisting the
first and last characters of the given string into variables and
avoiding `startswith`, we can see some small improvements after the
`Variable.__init__` changes:
{{{
 django5.pstat
 ncalls   tottime  percall  cumtime  percall  filename:lineno(function)
 59000      0.773    0.000    3.013    0.000  django/template/base.py:770(Variable.__init__) (3.6% -> 2.9%)
 107338     0.228    0.000    0.228    0.000  __builtin__:0(<method 'startswith' of 'str' objects>) (1.3% -> 0.9%)
}}}
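The idea in sketch form (a standalone paraphrase of the relevant checks,
assuming a non-empty `var` as `Variable.__init__` effectively does; not
the literal diff):
{{{
def classify_before(var):
    # Old shape: startswith()/endswith() method calls.
    if var.startswith("_(") and var.endswith(")"):
        return "translated"
    if var.startswith(('"', "'")) and var[0] == var[-1]:
        return "literal"
    return "lookup"

def classify_after(var):
    # New shape: slice/index once, compare with ==.
    if var[0:2] == "_(" and var[-1] == ")":
        return "translated"
    first = var[0]
    if first in ('"', "'") and first == var[-1]:
        return "literal"
    return "lookup"

for v in ('_("x")', "'quoted'", "user.pk"):
    assert classify_before(v) == classify_after(v)
}}}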
== Summary of profiled changes
Remember these timings are intentionally inflated by compiling the same
`str` into a `Template` 1000 times...
First up, Yappi, which is what I was using throughout, precisely because
it's ''slower'' and more accurate WRT asyncio etc.
=== Yappi percentage changes
Based solely on the work done, rather than secondary effects:
{{{
Parser.parse: 8% -> 7.9%
Lexer.create_token: 4.7% -> 3.5%
isinstance: 4.3% -> 4%
Parser.extend_nodelist: 3.7% -> 3.2%
Variable.__init__: 3% -> 2.9%
str.startswith: 2.8% -> 0.9%
getattr: 2.6% -> 1.4%
Node.get_nodes_by_type: 6.1% -> 5.4%
}}}
=== Yappi timing changes
More important than the swapping around of percentages, these are based
on the own time (tottime):
{{{
Parser.parse: 2.41s -> 2s
Lexer.create_token: 1.41s -> 0.9s
isinstance: 1.29s -> 1s
Parser.extend_nodelist: 1.1s -> 0.81s
Variable.__init__: 0.9s -> 0.75s
str.startswith: 0.85s -> 0.22s
getattr: 0.77s -> 0.34s
get_nodes_by_type: 1.80s -> 1.35s
}}}
=== cProfile percentage changes
{{{
Parser.parse: 9% -> 8.5%
Lexer.create_token: 4.1% -> 3.8%
isinstance: 1.9% -> 1.8%
Parser.extend_nodelist: 2.8% -> 2.5%
Variable.__init__: 3% -> 2.9%
str.startswith: 2.1% -> 0.8%
getattr: 1.3% -> 0.7%
Node.get_nodes_by_type: 3.2% -> 2.8%
}}}
=== cProfile timing changes
{{{
Total time: 4.148s -> 3.833s
Parser.parse: 370ms -> 320ms
Lexer.create_token: 170ms -> 140ms
isinstance: 77ms -> 69ms
Parser.extend_nodelist: 118ms -> 97ms
Variable.__init__: 240ms -> 230ms
str.startswith: 86ms -> 29ms
getattr: 52ms -> 27ms
get_nodes_by_type: 131ms -> 107ms
}}}
Numbers should all be correct, though I'll admit I've re-written and
tried to edit this for clarity a couple of times, and it got long enough
that I could've pasted a wrong number somewhere. I've also rebased from
`main` once previously, so ... yeah. Scientific method may be out of the
window.
--
Ticket URL: <https://code.djangoproject.com/ticket/32919>
* owner: nobody => Keryn Knight
* needs_better_patch: 0 => 1
* has_patch: 0 => 1
* status: new => assigned
Comment:
PR for any pending discussion, if this is deemed appropriate, is:
https://github.com/django/django/pull/14627
--
Ticket URL: <https://code.djangoproject.com/ticket/32919#comment:1>
* cc: Nick Pope (added)
* stage: Unreviewed => Accepted
Comment:
Hey Keryn. Thanks for this.
Really nice analysis, and the PR looks reasonable in terms of
comprehensibility/maintainability, so a speed-up to the DTL is +1.
I'll cc Nick, as it strikes me as the kind of thing he likes :)
--
Ticket URL: <https://code.djangoproject.com/ticket/32919#comment:2>
* needs_better_patch: 1 => 0
* stage: Accepted => Ready for checkin
--
Ticket URL: <https://code.djangoproject.com/ticket/32919#comment:3>
* status: assigned => closed
* resolution: => fixed
Comment:
In [changeset:"edf184dd067e676962e8e1fc1a50397b810461d6" edf184d]:
{{{
#!CommitTicketReference repository=""
revision="edf184dd067e676962e8e1fc1a50397b810461d6"
Fixed #32919 -- Optimized lexing & parsing of templates.
This optimizes:
- Lexer.create_token() by avoiding startswith() calls,
- Parser.parse() by re-using the token type enum's value,
- Parser.extend_nodelist() by removing unnecessary isinstance() check,
- some Node subclasses by removing the implicit "nodelist" from
"child_nodelists",
- Variable.__init__() by avoiding startswith() calls.
}}}
--
Ticket URL: <https://code.djangoproject.com/ticket/32919#comment:4>
Comment (by Chris Jerdonek):
I noticed that some additional, related improvements can be made to
`Lexer.create_token()`. For example, `in_tag` is currently checked twice,
when it can be checked once. Also, `token_start` only needs to be defined
when `in_tag` is true. (Both seen here:
https://github.com/django/django/blob/edf184dd067e676962e8e1fc1a50397b810461d6/django/template/base.py#L373-L382
)
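In sketch form, the reorganization would be something like this (stub
bodies and return values, purely illustrative, not the actual method):
{{{
BLOCK_TAG_START = "{%"

def create_token_shape(token_string, in_tag, verbatim=None):
    if in_tag:
        # in_tag is tested exactly once, and the slice only happens
        # inside this branch.
        token_start = token_string[0:2]
        if token_start == BLOCK_TAG_START and verbatim:
            pass  # verbatim bookkeeping would live here
        if not verbatim:
            return ("tag", token_string)
    return ("text", token_string)

assert create_token_shape("{% if %}", in_tag=True) == ("tag", "{% if %}")
assert create_token_shape("plain", in_tag=False) == ("text", "plain")
}}}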
--
Ticket URL: <https://code.djangoproject.com/ticket/32919#comment:5>
Comment (by Keryn Knight):
Hey Chris, there's a PR https://github.com/django/django/pull/14686 for
ticket #23356 which looks like it's similarly playing around with `in_tag`
... maybe there's enough overlap with that (which regrettably now needs
rebasing because of me!) to deal with? I've only had a cursory glance at
the diff though.
--
Ticket URL: <https://code.djangoproject.com/ticket/32919#comment:6>
Comment (by Chris Jerdonek):
Thanks for the heads up, Keryn. That PR is a lot more expansive than what
I had in mind. I was considering changing only `create_token()`. I think
it could help if the smaller changes I had in mind were considered
separately. It would decouple the changes and make the other PR easier to
evaluate as it would be starting from a better state.
--
Ticket URL: <https://code.djangoproject.com/ticket/32919#comment:7>
Comment (by Chris Jerdonek):
I just added a PR that builds on Keryn's changes, along the lines of my
comments above:
https://github.com/django/django/pull/14739
--
Ticket URL: <https://code.djangoproject.com/ticket/32919#comment:8>
Comment (by Chris Jerdonek):
If you're curious, I just filed a ticket (#32986) regarding something I
noticed about `Lexer.create_token()` when working on the above PR. It
might lead to another small optimization or fix.
--
Ticket URL: <https://code.djangoproject.com/ticket/32919#comment:9>
Comment (by Mariusz Felisiak <felisiak.mariusz@…>):
In [changeset:"55cf9e93b5e5bdf19bedeb1d900ee8a83f8489fb" 55cf9e9]:
{{{
#!CommitTicketReference repository=""
revision="55cf9e93b5e5bdf19bedeb1d900ee8a83f8489fb"
Refs #32919 -- Simplified Lexer.create_token() by reorganizing blocks.
}}}
--
Ticket URL: <https://code.djangoproject.com/ticket/32919#comment:11>
Comment (by Mariusz Felisiak <felisiak.mariusz@…>):
In [changeset:"7ff72b5909894504b2343419630ceb5a4a9cd421" 7ff72b59]:
{{{
#!CommitTicketReference repository=""
revision="7ff72b5909894504b2343419630ceb5a4a9cd421"
Refs #32919 -- Added assertion for token start in Lexer.create_token().
This adds an assertion in the code path where the method would otherwise
return None, which isn't allowed.
}}}
--
Ticket URL: <https://code.djangoproject.com/ticket/32919#comment:10>