I have done some testing and it shows that of the 70s, around 50-60s
should be avoidable.
Some options:
- Somehow use absolute paths for the fixtures, maybe by using
`path(__file__) + '/fixtures/' + fixture_name`. A variant of this is to
always use, say, gz-compressed files, which removes the need to check for
suffix combinations.
- Limit fixture searching to current app (if requested): maybe by
something like './somefixture' -> search only from current app.
'somefixture' -> search everywhere.
- Cache the existing fixtures. That is, go only once through the
directories, check which files exist and cache that information, then
check against the cached information instead of trying to open each
possible combo in each possible directory.
- Do not use fixtures for Django's test data.
Of the above I like the last option as I find reading models created in
Python code much easier than definitions in fixtures (especially which
relations a given instance has). Unfortunately converting the fixtures
takes time. In addition, I am sure some developers favour fixtures instead
of models created in Python code.
Marking as accepted as in "let's see what can be done here".
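
As a rough illustration of the second option above (limiting the search to
the current app on request), the label could be interpreted as in the sketch
below. The function name `fixture_search_dirs` and its parameters are
hypothetical and not part of Django; the sketch also assumes that each app
keeps its fixtures in a `fixtures` subdirectory.

```python
import os


def fixture_search_dirs(label, current_app_dir, all_app_dirs):
    """Return (directories to search, bare fixture name) for a label.

    Hypothetical convention taken from the option described above:
      './somefixture' -> search only the current app's fixtures directory
      'somefixture'   -> search the fixtures directory of every app
    """
    if label.startswith('./'):
        name = label[2:]
        dirs = [os.path.join(current_app_dir, 'fixtures')]
    else:
        name = label
        dirs = [os.path.join(app_dir, 'fixtures') for app_dir in all_app_dirs]
    return dirs, name
```

With two apps installed, `'./somefixture'` would yield a single directory to
probe instead of two, and the saving grows with the number of apps.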
--
Ticket URL: <https://code.djangoproject.com/ticket/20485>
Django <https://code.djangoproject.com/>
The Web framework for perfectionists with deadlines.
Comment (by akaariai):
I created some proof-of-concept code for this, see:
https://github.com/akaariai/django/commit/448b5d76d30b5af4682702ec21c8fe85cc1bb7d3
(applies over isolated_apps patch).
Runtime of Django's test suite is reduced from ~210s to ~160s. So, around
50 seconds gone.
The code isn't commit quality. It uses a syntax of `fixtures = [(__file__,
'fixturename.json')]` to indicate that fixtures from a directory relative
to current test class should be loaded. This causes a couple of problems,
but is enough to demonstrate the speed gain. Better ideas &
implementations welcome.
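
To make the `(__file__, 'fixturename.json')` idea concrete, here is a minimal
sketch of how such an entry might be resolved. It is not the code from the
linked commit: the helper name `resolve_fixture` is hypothetical, and placing
the files in a `fixtures` directory next to the test module is an assumption.

```python
import os


def resolve_fixture(spec):
    """Turn a fixture spec into an absolute path or a plain label.

    A (module_file, name) tuple is resolved against a 'fixtures' directory
    next to module_file (an assumed layout). A plain string is returned
    unchanged so that the normal fixture search still applies to it.
    """
    if isinstance(spec, tuple):
        module_file, name = spec
        base_dir = os.path.dirname(os.path.abspath(module_file))
        return os.path.join(base_dir, 'fixtures', name)
    return spec
```

A resolved tuple names exactly one file, so no directory or suffix search is
needed for it.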
I am not sure if my initial analysis of all the suffix combinations
causing the problem is correct. While going through the tests it became
evident that there are multiple fixtures with the same name in different
test apps. If I am not mistaken, any time 'testdata.json' is defined as a
fixture for a test class, then all different testdata.json files will be
loaded. Another similar issue is that there are a couple of
'initial_data.json' files in the test suite, and so every flush will load
these files.
--
Ticket URL: <https://code.djangoproject.com/ticket/20485#comment:1>
Comment (by aaugustin):
Would it be faster to call `os.listdir` on the `fixtures` directory of
each application, and then do everything in Python instead of hitting the
filesystem repeatedly?
--
Ticket URL: <https://code.djangoproject.com/ticket/20485#comment:2>
Comment (by aaugustin):
I'm attaching a patch that caches the contents of fixture directories in
memory and checks this cache before attempting to hit the filesystem.
This patch makes the test suite 10% faster (540s vs. 600s) on my laptop
with a SSD.
I didn't include the current directory in the list of directories whose
content is cached, because it could contain a lot of files, and the cache
would then use lots of memory for no good reason. A regression ensues:
fixtures cannot be found in the current directory. (Two tests fail because
of this.)
This could be fixed by adding an explicit search inside the current
directory if no fixture is found... but then the patch starts looking bad,
and I didn't go further.
----
I think I have a better idea to make fixture search decently fast.
Here's how I would load a given fixture label:
1. Take the basename of the label and build all the combinations of
   format, compression, etc. These are the targets we're looking for.
2. Build a list of prefixes:
   - If label is a relative path, [<app>/fixtures/<label> for each app] +
     [<fixture_dir>/<label> for each fixture dir] + [<label>]
   - If label is an absolute path, [<label>]
3. For each prefix, run glob.glob(<prefix>*) to obtain a list of
   candidates, and split them into dirname + basename. The basenames are
   the candidates.
4. Check the intersection of the set of targets and the set of candidates;
   these are the fixtures that must be loaded.
5. Once this is done for all prefixes, cache the list of all fixtures
   found for this label.
This minimizes filesystem hits by making the search for a single label
efficient, and by performing it only once.
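
The five steps above can be sketched roughly as follows. This is an
illustration under stated assumptions, not the code from any patch: the
function names, the two suffix lists and the cache dictionary are all
hypothetical, and the sketch assumes the label carries no format suffix of
its own (so `'testdata'`, not `'testdata.json'`).

```python
import glob
import itertools
import os

# Assumed suffix lists for illustration; the real sets depend on the
# serializers and compression formats available to loaddata.
SERIALIZATION_FORMATS = ['json', 'xml', 'yaml']
COMPRESSION_FORMATS = [None, 'gz', 'zip', 'bz2']

# Maps (label, app dirs, fixture dirs) to the list of files found.
_label_cache = {}


def build_targets(label):
    """Step 1: every file name the label could stand for."""
    base = os.path.basename(label)
    targets = set()
    for ser, comp in itertools.product(SERIALIZATION_FORMATS, COMPRESSION_FORMATS):
        parts = [base, ser] + ([comp] if comp else [])
        targets.add('.'.join(parts))
    return targets


def build_prefixes(label, app_dirs, fixture_dirs):
    """Step 2: path prefixes under which the fixture may live."""
    if os.path.isabs(label):
        return [label]
    return (
        [os.path.join(app_dir, 'fixtures', label) for app_dir in app_dirs]
        + [os.path.join(fixture_dir, label) for fixture_dir in fixture_dirs]
        + [label]
    )


def find_fixtures(label, app_dirs, fixture_dirs):
    """Steps 3-5: one glob per prefix, keep the targets, cache the result."""
    key = (label, tuple(app_dirs), tuple(fixture_dirs))
    if key in _label_cache:
        return _label_cache[key]
    targets = build_targets(label)
    found = []
    for prefix in build_prefixes(label, app_dirs, fixture_dirs):
        # glob.escape keeps special characters in the prefix literal.
        for candidate in sorted(glob.glob(glob.escape(prefix) + '*')):
            if os.path.basename(candidate) in targets:
                found.append(candidate)
    _label_cache[key] = found
    return found
```

The filesystem is touched once per prefix (the `glob.glob` call) instead of
once per prefix and suffix combination, and a repeated lookup of the same
label is answered from the cache.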
--
Ticket URL: <https://code.djangoproject.com/ticket/20485#comment:3>
* status: new => assigned
* owner: nobody => aaugustin
* has_patch: 0 => 1
* stage: Accepted => Ready for checkin
Comment:
I created a pull request with the technique described in my previous
comment. The speedup is still around 10% as expected.
https://github.com/django/django/pull/1237
Feedback welcome! Yes, it's a major refactoring of loaddata, I know it's
quite hard to review...
--
Ticket URL: <https://code.djangoproject.com/ticket/20485#comment:4>
* status: assigned => closed
* resolution: => fixed
Comment:
In [changeset:"51aa000378c00a442273f01142acdebc94dec68d"]:
{{{
#!CommitTicketReference repository=""
revision="51aa000378c00a442273f01142acdebc94dec68d"
Fixed #20485 -- Refactored loaddata for speed.
Thanks Anssi for reporting this performance bottleneck.
}}}
--
Ticket URL: <https://code.djangoproject.com/ticket/20485#comment:5>
Comment (by Aymeric Augustin <aymeric.augustin@…>):
In [changeset:"6900cb79dcbc502e0b18476e53ad493b7069ffb8"]:
{{{
#!CommitTicketReference repository=""
revision="6900cb79dcbc502e0b18476e53ad493b7069ffb8"
Fixed small regression from 51aa000378.
A test failed if the path to the Django checkout contained a dot.
Refs #20485.
}}}
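
The commit message does not say what caused the failure. As an assumption
only, one common way for this class of bug to arise is parsing format
suffixes from the whole path instead of from the file name. The two helper
functions below are hypothetical and only illustrate that failure mode.

```python
import os


def split_suffixes_naive(path):
    """Buggy variant: a dot in a parent directory is treated as a suffix."""
    return path.split('.')[1:]


def split_suffixes(path):
    """Split only the file name, ignoring dots in parent directories."""
    return os.path.basename(path).split('.')[1:]


path = '/home/me/django-1.6/tests/fixtures/testdata.json'
naive = split_suffixes_naive(path)   # ['6/tests/fixtures/testdata', 'json']
correct = split_suffixes(path)       # ['json']
```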
--
Ticket URL: <https://code.djangoproject.com/ticket/20485#comment:6>