[Django] #20485: Find faster ways for test fixture loading

Django

May 23, 2013, 6:59:10 AM

to django-...@googlegroups.com

#20485: Find faster ways for test fixture loading
------------------------------------------------+------------------------
Reporter: akaariai | Owner: nobody
Type: Cleanup/optimization | Status: new
Component: Testing framework | Version: master
Severity: Normal | Keywords:
Triage Stage: Accepted | Has patch: 0
Needs documentation: 0 | Needs tests: 0
Patch needs improvement: 0 | Easy pickings: 0
UI/UX: 0 |
------------------------------------------------+------------------------
Currently Django's test suite spends about 70s on fixture loading when
running the full test suite on in-memory SQLite. Considering that after
#20483 the test suite takes around 210s in total, this means 1/3 of the
time is used for fixture loading. Most of the fixture loading time is spent
finding the fixture file. Django effectively needs to do this for each
file, for each test:
{{{
for each app in INSTALLED_APPS:
    for each suffix in combo(('json', 'yaml', 'xml') X ('bz2', 'gz', 'zip')):
        try to open file() - if succeeds add to fixture files, else skip
}}}
Unsurprisingly, this is expensive. In addition, fixture loading after the
file has been found isn't that fast either: it is actually somewhat faster
to create the model instances with SomeModel.objects.create() than to load
them from fixtures.
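
To make the cost concrete, here is a minimal sketch of that search loop in plain Python. It is not Django's loaddata code: `candidate_names` and `naive_find` are hypothetical names, and the empty string in `COMPRESSIONS` (an uncompressed file) is an assumption added to the suffix list above.

```python
import itertools
import os

FORMATS = ("json", "yaml", "xml")
# Assumption: "" stands for an uncompressed file; the ticket only lists
# the compressed suffixes.
COMPRESSIONS = ("", ".bz2", ".gz", ".zip")


def candidate_names(label):
    """Yield every file name the naive search would try for one label."""
    for fmt, comp in itertools.product(FORMATS, COMPRESSIONS):
        yield f"{label}.{fmt}{comp}"


def naive_find(label, fixture_dirs):
    """Probe each directory for each candidate and count open() attempts."""
    found, probes = [], 0
    for directory in fixture_dirs:
        for name in candidate_names(label):
            probes += 1
            path = os.path.join(directory, name)
            try:
                with open(path, "rb"):
                    found.append(path)
            except OSError:
                pass  # no such file: skip it
    return found, probes
```

With 3 directories and the 12 suffix combinations above, this makes 36 open attempts to locate a single fixture, and the whole loop is repeated for every fixture of every test.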

I have done some testing and it shows that of the 70s, around 50-60s
should be avoidable.

Some options:
- Somehow use absolute paths for the fixtures. Maybe by using
`path(__file__) + '/fixtures/' + fixture_name`. A variant of this is to
always use, say, gz-compressed files, which removes the need to check every
combination.
- Limit fixture searching to the current app (if requested): maybe by
something like './somefixture' -> search only in the current app,
'somefixture' -> search everywhere.
- Cache the existing fixtures. That is, go through the directories only
once, check which files exist and cache that information, then check
against the cached information instead of trying to open each possible
combination in each possible directory.
- Do not use fixtures for Django's test data.

Of the above I like the last option, as I find models created in Python
code much easier to read than definitions in fixtures (especially which
relations a given instance has). Unfortunately, converting the fixtures
takes time. In addition, I am sure some developers prefer fixtures over
models created in Python code.

Marking as accepted as in "let's see what can be done here".

--
Ticket URL: <https://code.djangoproject.com/ticket/20485>
Django <https://code.djangoproject.com/>
The Web framework for perfectionists with deadlines.

Django

May 23, 2013, 10:14:35 AM

Comment (by akaariai):

I created some proof-of-concept code for this, see:
https://github.com/akaariai/django/commit/448b5d76d30b5af4682702ec21c8fe85cc1bb7d3
(applies over isolated_apps patch).

The runtime of Django's test suite is reduced from ~210s to ~160s, so
around 50 seconds are gone.

The code isn't commit quality. It uses a syntax of `fixtures = [(__file__,
'fixturename.json')]` to indicate that fixtures from a directory relative
to current test class should be loaded. This causes a couple of problems,
but is enough to demonstrate the speed gain. Better ideas &
implementations welcome.
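
For illustration only, such a tuple entry could be resolved to an absolute path roughly as follows. This is not the code from the linked commit: `resolve_fixture` is a hypothetical helper, and the assumption that the file lives in a `fixtures` directory next to the test module is taken from the `path(__file__) + '/fixtures/'` idea in the ticket description.

```python
import os


def resolve_fixture(entry):
    """Return a path for a fixture entry.

    Hypothetical helper: a (module_file, name) tuple is resolved against a
    'fixtures' directory next to module_file (an assumption); a plain
    string is returned unchanged and would go through the normal search.
    """
    if isinstance(entry, tuple):
        module_file, name = entry
        base = os.path.dirname(os.path.abspath(module_file))
        return os.path.join(base, "fixtures", name)
    return entry
```

An absolute path like this needs a single open() call instead of a search across every installed app.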

I am not sure if my initial analysis, that all the suffix combinations
cause the problem, is correct. While going through the tests it became
evident that there are multiple fixtures with the same name in different
test apps. If I am not mistaken, any time 'testdata.json' is defined as a
fixture for a test class, all the different testdata.json files will be
loaded. Another similar issue is that there are a couple of
'initial_data.json' files in the test suite, and so every flush will load
these files.

--
Ticket URL: <https://code.djangoproject.com/ticket/20485#comment:1>

Django

May 23, 2013, 11:02:55 AM

Comment (by aaugustin):

Would it be faster to call `os.listdir` on the `fixtures` directory of
each application, and then do everything in Python instead of hitting the
filesystem repeatedly?
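
A minimal sketch of that idea, assuming the candidate file names have already been built: each directory is listed once with `os.listdir` and later lookups are plain set operations. The names `cached_listing` and `find_cached` are hypothetical and not part of Django.

```python
import os

# Directory path -> frozenset of the file names it contained when first listed.
_listing_cache = {}


def cached_listing(directory):
    """Return the file names in `directory`, hitting the filesystem once."""
    if directory not in _listing_cache:
        try:
            names = frozenset(os.listdir(directory))
        except OSError:
            names = frozenset()  # missing or unreadable directory
        _listing_cache[directory] = names
    return _listing_cache[directory]


def find_cached(candidates, fixture_dirs):
    """Return paths of the candidate names present in the cached listings."""
    wanted = set(candidates)
    return [
        os.path.join(directory, name)
        for directory in fixture_dirs
        for name in sorted(wanted & cached_listing(directory))
    ]
```

The trade-off is staleness: a file created after a directory was first listed is not seen until the cache is cleared.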

--
Ticket URL: <https://code.djangoproject.com/ticket/20485#comment:2>

Django

Jun 1, 2013, 4:25:59 PM

Comment (by aaugustin):

I'm attaching a patch that caches the contents of fixture directories in
memory and checks this cache before attempting to hit the filesystem.

This patch makes the test suite 10% faster (540s vs. 600s) on my laptop
with an SSD.

I didn't include the current directory in the list of directories whose
content is cached, because it could contain a lot of files, and the cache
would then use lots of memory for no good reason. A regression ensues:
fixtures cannot be found in the current directory. (Two tests fail because
of this.)

This could be fixed by adding an explicit search inside the current
directory if no fixture is found... but then the patch starts looking bad,
and I didn't go further.

----

I think I have a better idea to make fixture search decently fast.

Here's how I would load a given fixture label:

1. Take the basename of the label and build all the combinations of
   format, compression, etc. These are the targets we're looking for.
2. Build a list of prefixes:
   - If label is a relative path, [<app>/fixtures/<label> for each app] +
     [<fixture_dir>/<label> for each fixture dir] + [<label>]
   - If label is an absolute path, [<label>]
3. For each prefix, run glob.glob(<prefix>*) to obtain a list of
   candidates, and split them into dirname + basename. The basenames are
   the candidates.
4. Check the intersection of the set of targets and the set of candidates;
   these are the fixtures that must be loaded.
5. Once this is done for all prefixes, cache the list of all fixtures
   found for this label.

This minimizes filesystem hits by making the search for a single label
efficient, and by performing it only once.
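
The five steps can be sketched in standalone Python as below. This is an illustration under stated assumptions, not the refactored loaddata: `targets_for`, `prefixes_for` and `find_fixtures` are hypothetical names, the label is assumed to carry no format suffix, the format and compression lists are assumptions, and `functools.lru_cache` stands in for the per-label cache of step 5 (so the directory arguments must be tuples).

```python
import functools
import glob
import itertools
import os

FORMATS = ("json", "xml", "yaml")            # assumed serialization formats
COMPRESSIONS = ("", ".bz2", ".gz", ".zip")   # "" means uncompressed


def targets_for(label):
    """Step 1: every file name that would count as this fixture."""
    base = os.path.basename(label)
    return {
        f"{base}.{fmt}{comp}"
        for fmt, comp in itertools.product(FORMATS, COMPRESSIONS)
    }


def prefixes_for(label, app_dirs, fixture_dirs):
    """Step 2: path prefixes under which the fixture may live."""
    if os.path.isabs(label):
        return [label]
    return (
        [os.path.join(app, "fixtures", label) for app in app_dirs]
        + [os.path.join(directory, label) for directory in fixture_dirs]
        + [label]
    )


@functools.lru_cache(maxsize=None)  # step 5: search each label only once
def find_fixtures(label, app_dirs=(), fixture_dirs=()):
    """Steps 3 and 4: one glob per prefix, then keep basenames in targets."""
    targets = targets_for(label)
    found = []
    for prefix in prefixes_for(label, app_dirs, fixture_dirs):
        # glob.escape keeps special characters in the path literal.
        for candidate in sorted(glob.glob(glob.escape(prefix) + "*")):
            if os.path.basename(candidate) in targets:
                found.append(candidate)
    return tuple(found)
```

Each prefix costs one directory scan instead of one open() per suffix combination, and a repeated label is answered from the cache without touching the filesystem.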

--
Ticket URL: <https://code.djangoproject.com/ticket/20485#comment:3>

Django

Jun 2, 2013, 4:43:25 AM

Changes (by aaugustin):

* status: new => assigned
* owner: nobody => aaugustin
* has_patch: 0 => 1
* stage: Accepted => Ready for checkin

Comment:

I created a pull request with the technique described in my previous
comment. The speedup is still around 10% as expected.

https://github.com/django/django/pull/1237

Feedback welcome! Yes, it's a major refactoring of loaddata, I know it's
quite hard to review...

--
Ticket URL: <https://code.djangoproject.com/ticket/20485#comment:4>

Django

Jun 2, 2013, 3:02:52 PM

-------------------------------------+-------------------------------------
Reporter: akaariai | Owner: aaugustin
Type: Cleanup/optimization | Status: closed
Component: Testing framework | Version: master
Severity: Normal | Resolution: fixed
Keywords: | Triage Stage: Ready for checkin
Has patch: 1 | Needs documentation: 0
Needs tests: 0 | Patch needs improvement: 0
Easy pickings: 0 | UI/UX: 0
-------------------------------------+-------------------------------------
Changes (by Aymeric Augustin <aymeric.augustin@…>):

* status: assigned => closed
* resolution: => fixed

Comment:

In [changeset:"51aa000378c00a442273f01142acdebc94dec68d"]:
{{{
#!CommitTicketReference repository=""
revision="51aa000378c00a442273f01142acdebc94dec68d"
Fixed #20485 -- Refactored loaddata for speed.

Thanks Anssi for reporting this performance bottleneck.
}}}

--
Ticket URL: <https://code.djangoproject.com/ticket/20485#comment:5>

Django

Jun 3, 2013, 5:49:00 AM

Comment (by Aymeric Augustin <aymeric.augustin@…>):

In [changeset:"6900cb79dcbc502e0b18476e53ad493b7069ffb8"]:
{{{
#!CommitTicketReference repository=""
revision="6900cb79dcbc502e0b18476e53ad493b7069ffb8"
Fixed small regression from 51aa000378.