Improving django test startup performance.

95 views
Skip to first unread message

Andrew Farrell

unread,
Mar 5, 2014, 3:52:44 PM3/5/14
to django...@googlegroups.com
Привет Djanonaughts!

tl;dr: django with buildout spends a lot of time looking for files that aren't there when trying to do imports. How can I reduce this?



I've long been frustrated with the time it takes my django installation to run tests. Now, I'm aware that some ways to think about to how to write fast and efficient tests in django include:
- putting things in setUpClass rather than setUp
- treating the database as hot lava (at least, when you don't have ManyToMany relations)
- using sqlite3 in-memory
- using mock to emulate instances
- using nose to only run the tests you need
But I want to talk about another component of test performance: Disk I/O syscalls

My company (the startup accelerator MassChallenge) uses buildout to manage dependencies and the several apps modules within our appication. So I run tests from the root of the buildout by calling `bin/django test`, a script that looks like the following: (note the datetime printing)
```
#!/home/afarrell/projects/masschallenge/venv/bin/python
from datetime import datetime
print datetime.now().isoformat()
print "before test"

import sys
sys.path[0:0] = [
  '/home/afarrell/projects/masschallenge/mc2013/eggs/MySQL_python-1.2.3-py2.7-linux-i686.egg',
  ... 38 other paths to various eggs...
  '/home/afarrell/projects/masschallenge/venv/lib/python2.7/site-packages',
  '/home/afarrell/projects/masschallenge/mc2013', #the root of our app
  ]

import djangorecipe.manage

if __name__ == '__main__':
    sys.exit(djangorecipe.manage.main('mcproject.develop_settings'))
```

I am in the habit of writing a single test as I am writing a method. To do this, I use nose's attrib tag in a test file that looks like this:
```
from nose.plugins.attrib import attr
import nose.tools as nt
from django.test import TestCase
from datetime import datetime

class TestExample(Testcase):
    @attr('now')
    def test_trivial_case(self):
        print datetime.now().isoformat()
        print "test has begun"
        nt.assert_equal(1 + 1, 2)
```

I then run tests with `bin/django test mentor_match -a'now' -s` and this prints out the following:
```
2014-03-05T13:46:02.774142
before test
nosetests --verbosity 1 mentor_match -anow -s
To reuse old database "test_02_06_18_29" for speed, set env var REUSE_DB=1.
Creating test database for alias 'default'...
2014-03-05T13:46:28.612895
test has begun
```
As you can see, there are about 26 seconds from the test command is issued and when the first test actually runs. This delay seems small but it is noticeable and kinda annoying when that is the only test. You could load, present, and fire a musket in that time. There are two possible causes that I can think of:
1) The test database takes a long time to set up
2) The test command takes a long time to import all required modules

However the if I `export REUSE_DB=1` and then run `bin/django test mentor_match -a'now' -s`, I see
2014-03-05T15:22:51.743338 
before test 
nosetests --verbosity 1 mc_allocate -s -anow 
2014-03-05T15:23:15.813821
test has begun
which is still a difference of 24 seconds. Working on a linode, I often get notifications about the high disk i/o rate, so it makes sense that the cause is the latter. So, I turn to the sysadmin's microsope: strace.

With REUSE_DB not set, I run `strace -f bin/django test mentor_match -s -a'now' 2> tracelog`. I can now see all of the system calls made by the command, 67626. `cat tracelog | grep "open(" | wc -l` shows that 37635 of those are open commands.

total:   67626  
open:    37635  56.7% 
stat64:  15401  22.8%
read:     5362   7.9%
fstat64:  2619   3.9%
write:    2316   3.4%
poll:     2268   3.4%
close:    1754   2.6%
mmap2:     949   1.4%
munmap:    847   1.3%

Since there are far more open calls than there are read calls, and
there are 45847 lines containing "ENOENT (No such file or directory)" (86.4% of the open or stat64 calls), I think that django spends a lot of time looking for files in the wrong location.


I plan to write a blog post about this, but before I dig in, does anyone have any comments on the following approaches I'm considering:
- reordering the paths assigned to sys.path
- removing items from sys.path when I am running tests
- caching the locations where modules were previously found and looking there first

Has anyone looked at doing anything like this?
-- 
Andrew Farrell
Software Engineer | MassChallenge, Inc.
ONE Marina Park Drive | Boston, MA 02210

Free m

entorship, resources, coworking space. No equity taken.


Applications now open: 2014 MassChallenge accelerator
Apply or refer great startups today!

Russell Keith-Magee

unread,
Mar 5, 2014, 7:47:51 PM3/5/14
to Django Users
Not to my knowledge, but based on your data, it certainly looks like it would be worthwhile doing so.

I have to say that the data you've presented here surprises me. I can't think of any obvious reasons why Django would be spending that much time trying to open non-existent files - at least, not to the extent that you would observe 24 seconds of startup time - unless your disk performance was *really* bad. However, I know from experience isn't generally a problem on Linode.

The only "speculative opening" behaviour I can think of would be the attempted imports that are performed as part of identifying INSTALLED_APPS and management commands - that is, looking for each models.py in your INSTALLED_APPS, and looking for the management/commands folder in each installed app. However, this should be a fairly constrained activity; startup will only look in each app once, for a single folder that (in most cases) won't exist. We're talking O(100) operations in the worst case for a Django project - not the sort of thing I would have expected to cause 24 seconds of slowdown.

So - I'm very much interested in what you discover. Good hunting! (I understand your weapon of choice is a musket, so you'll need the luck :-)

Also - you might want to bump this thread over to django-developers; you're more likely to find a receptive audience for this sort of analysis.

Yours,
Russ Magee %-)


Tom Evans

unread,
Mar 5, 2014, 8:45:45 PM3/5/14
to django...@googlegroups.com
On Wed, Mar 5, 2014 at 8:52 PM, Andrew Farrell <armors...@gmail.com> wrote:
> Привет Djanonaughts!
>
> tl;dr: django with buildout spends a lot of time looking for files that
> aren't there when trying to do imports. How can I reduce this?

(I did read :)

The first thing that springs to mind is test discovery. I may be way
off track, an easy way to test (other than read all the code/know
already) would be a simple django management command that spits out
the time. Comparing the time difference for that operation compared to
when running a test should size the costs of starting django
environment versus starting django environment and finding a specific
test.

Cheers

Tom

Andrew Farrell

unread,
Mar 5, 2014, 10:37:03 PM3/5/14
to django...@googlegroups.com
Hmm. That is a good point. If there is little difference, then I suppose I'll blame buildout.

$ bin/django print_time
2014-03-05T22:07:33.051634
before test
2014-03-05T22:07:40.202247
command has been run

That is A difference of 7 seconds, significantly better.

$ strace bin/django print_time 2> cmdtrace
cat cmdtrace | wc -l
54361
$ cat cmdtrace | grep 'open(' | wc -l
34233
$ cat cmdtrace | grep 'stat64(' | wc -l
13625
$ cat cmdtrace | grep 'read(' | wc -l
2706
$ cat cmdtrace | grep 'fstat64(' | wc -l
2377
$ cat cmdtrace | grep 'close(' | wc -l
1563
$ cat cmdtrace | grep 'ENOENT (No such file or directory)' | wc -l
41561

There are still almost as many files being sought for and not opened however. I suppose that means that failing to find a file on disc is actually not that costly.
The difference between running a command and running a test also suggests that either:
1) Linode disc performance varies throughout the day and is better outside buisiness hours? Nope.
2014-03-05T22:20:33.604757
before test
nosetests --verbosity 1 match_mentors -anow -s
To reuse old database "test_02_06_18_29" for speed, set env var REUSE_DB=1.
Creating test database for alias 'default'...
2014-03-05T22:21:03.106680
test has begun

2) A significant portion of the time is spent in the other imports that I'm doing in that test file. I should have broken it out and ran that test. I'll do that now.
2014-03-05T22:23:05.650883
before test
nosetests --verbosity 1 match_mentors -anow -s
To reuse old database "test_02_06_18_29" for speed, set env var REUSE_DB=1.
Creating test database for alias 'default'...
2014-03-05T22:23:42.127545
test has begun
That is still 37 seconds, so it is almost certainly something in the test discovery process.
I suppose this means that digging up and mapping out the piping that runs through nose, django-nose, the django test-discovery code, and unittest.py is in order. I suppose that will build character.
 
-- 
Andrew Farrell
Software Engineer | MassChallenge, Inc.
ONE Marina Park Drive | Boston, MA 02210

Free m

entorship, resources, coworking space. No equity taken.

 

Applications now open: 2014 MassChallenge accelerator
Apply or refer great startups today!


--
You received this message because you are subscribed to the Google Groups "Django users" group.
To unsubscribe from this group and stop receiving emails from it, send an email to django-users...@googlegroups.com.
To post to this group, send email to django...@googlegroups.com.
Visit this group at http://groups.google.com/group/django-users.
To view this discussion on the web visit https://groups.google.com/d/msgid/django-users/CAFHbX1%2B6qoCp%3DprVoCZ5X-CGX%2BxBZ%3DZembJZ937Qjydd9JA92g%40mail.gmail.com.

Reply all
Reply to author
Forward
0 new messages