[Numpy-discussion] Followup on Python+MPI import performance

Asher Langton

Mar 5, 2012, 1:17:59 PM

to numpy-di...@scipy.org

This is a followup to my post from January
(http://mail.scipy.org/pipermail/numpy-discussion/2012-January/059801.html)
and the panel discussion at PyData this weekend. As a few people have
suggested, a better approach than the MPI-broadcasted lookups is to
cache the locations of all the modules found in sys.path.

I previously claimed that the PEP 302 finders/loaders wouldn't work
here because the finder is selected by a module's path and filename,
at which point the damage is already done. At the PyData panel, Guido
countered that PEP 302 does indeed provide the necessary machinery for
implementing the 'right' solution. The trick is to use sys.meta_path.
(Thanks to Travis for pointing me in the direction of sys.meta_path,
and Dag for helping me work through the details.)

Here's an example demonstrating the use of sys.meta_path:

import os

# Simple finder/loader that pretends to load module 'bar'
class foo(object):
    def find_module(self, fullname, path=None):
        # Claim responsibility only for the module named 'bar'
        if fullname == "bar":
            return self
        return None

    def load_module(self, fullname):
        # Hand back the os module in place of a real 'bar'
        if fullname == "bar":
            return os
        raise ImportError("This shouldn't happen!")

if __name__ == "__main__":
    import sys
    sys.meta_path.append(foo())
    import bar  # actually the os module
    print(bar.getcwd())

To eliminate the import bottleneck, the finder/loader just needs to
traverse sys.path, make a dict mapping modules to their location in
the filesystem, and 'claim responsibility' for those modules in
find_module(). Building (and maintaining, when sys.path changes) this
dict, even if each process does it independently, shouldn't be much
worse than the traversal required by a single import statement. We
could even subclass the finder/loader so that the dict construction is
done by only one process and the result broadcast over MPI, though
that probably isn't necessary.
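
A minimal sketch of such a caching finder, to make the idea above
concrete. This is not the implementation promised below: it assumes
Python 3's importlib (find_spec, which superseded the PEP 302
find_module/load_module pair), the class name CachedFinder is
hypothetical, and it only handles top-level single-file modules. It
ignores packages, and within one directory it does not reproduce the
standard precedence between extension, source, and bytecode files.

```python
import importlib.machinery
import importlib.util
import os
import sys


class CachedFinder(object):
    """Meta-path finder that answers lookups from a prebuilt dict.

    Hypothetical sketch: maps top-level module names to file paths by
    listing each directory on the search path exactly once.
    """

    def __init__(self, search_path=None):
        self.cache = {}
        # Longest extension suffixes come before '.so', so the first
        # match strips the full suffix.
        suffixes = importlib.machinery.all_suffixes()
        for directory in (sys.path if search_path is None else search_path):
            directory = directory or os.curdir  # '' on sys.path means cwd
            try:
                entries = os.listdir(directory)
            except OSError:
                continue  # missing or unreadable entry (or a zip file)
            for entry in entries:
                for suffix in suffixes:
                    if entry.endswith(suffix):
                        name = entry[:-len(suffix)]
                        if name.isidentifier():
                            # setdefault keeps the first hit, so earlier
                            # sys.path entries win over later ones.
                            self.cache.setdefault(
                                name, os.path.join(directory, entry))
                        break

    def find_spec(self, fullname, path=None, target=None):
        # 'Claim responsibility' only for modules found in the cache;
        # returning None lets the normal import machinery take over.
        location = self.cache.get(fullname)
        if location is None:
            return None
        return importlib.util.spec_from_file_location(fullname, location)
```

To try it, something like sys.meta_path.insert(0, CachedFinder())
before the bulk of the imports should route cached names through the
dict instead of a per-import directory search. The cache would need to
be rebuilt whenever sys.path changes.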

I'll put an initial implementation of this importer on github sometime
this week, and I'll follow up this post with some performance numbers
when I have them.

-Asher
_______________________________________________
NumPy-Discussion mailing list
NumPy-Di...@scipy.org
http://mail.scipy.org/mailman/listinfo/numpy-discussion

Asher Langton

Mar 15, 2012, 1:38:14 PM

to numpy-di...@scipy.org

On Mon, Mar 5, 2012 at 10:17 AM, Asher Langton <lan...@gmail.com> wrote:
> This is a followup to my post from January
> (http://mail.scipy.org/pipermail/numpy-discussion/2012-January/059801.html)
> and the panel discussion at PyData this weekend. As a few people have
> suggested, a better approach than the MPI-broadcasted lookups is to
> cache the locations of all the modules found in sys.path.

> [...]

> I'll put an initial implementation of this importer on github sometime
> this week, and I'll follow up this post with some performance numbers
> when I have them.

Here are some numbers for the PEP302-based cached importer on an IBM
BlueGene/P machine. Times are wall-clock measurements from the time
utility, in minutes:seconds, one run per test (not an average),
with no attempt to take into account other activity on the system or
fileservers. (With that said, I ran a variety of other tests, and the
results have been consistent.) I still need to run some larger tests,
particularly in the 16k-64k range, where Python imports start to scale
very poorly on this machine.

The tests use the code currently at github.com/langton/MPI_Import with
a script that simply imports 100 small C-extension modules.

With 1k cores/MPI processes:
cached_import.finder: 14:19.98
- skip actual import [1]: 13:37.77
- with checks [2]: 27:09.60
- w/checks, no import: 26:23.63

cached_import.mpi4py_finder [3]: 2:32.51
- skip actual import: 1:42.55
- with checks: 2:32.38
- w/checks, no import: 1:42.94

MPI_Import [4]: 2:22.20

standard import: 15:43.63
- skip actual imports [5]: 0:56.59

With 4k cores/MPI processes:
cached_import.finder: 27:34.45
- skip actual import: 27:40.58
- with checks: 52:14.83
- w/checks, no import: 50:04.73

cached_import.mpi4py_finder: 4:03.02
- skip actual import: 3:12.75
- with checks: 4:13.65
- w/checks, no import: 3:18.46

MPI_Import: 4:02.76

standard import: 35:24.77
- skip actual imports: 1:56.36
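
For reading the tables, the minutes:seconds values can be converted to
seconds and compared directly. The snippet below is only a convenience
sketch: the helper name to_seconds is hypothetical, and the four
timings are copied from the tables above (standard import versus
cached_import.mpi4py_finder).

```python
def to_seconds(stamp):
    """Convert a 'minutes:seconds' string such as '15:43.63' to seconds."""
    minutes, seconds = stamp.split(":")
    return int(minutes) * 60 + float(seconds)


# (standard import, cached_import.mpi4py_finder), copied from the tables
timings = {
    "1k": ("15:43.63", "2:32.51"),
    "4k": ("35:24.77", "4:03.02"),
}

speedups = {}
for cores, (standard, cached) in sorted(timings.items()):
    speedups[cores] = to_seconds(standard) / to_seconds(cached)
    print("%s cores: %.1fx faster than standard import"
          % (cores, speedups[cores]))
```

On these single runs that works out to about 6.2x at 1k cores and
about 8.7x at 4k cores.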

Notes:
[1] Builds the cache, but omits the actual imports.
[2] Check whether modules in sys.path are readable while building the
cache. Because filesystem operations are expensive, these checks are
off by default.
[3] Only the rank 0 process builds the initial cache, which is then
broadcast over MPI.
[4] The earlier import replacement, based on MPI-broadcast lookups.
[5] This is roughly the interpreter startup/initialization time.
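
The pattern in note [3] can be sketched as follows. This is an
illustration rather than the code in the repository: the function name
build_cache_on_root and the build_cache callable are hypothetical, and
comm is assumed to be an mpi4py-style communicator providing Get_rank()
and the pickle-based bcast().

```python
def build_cache_on_root(comm, build_cache):
    """Build the module cache on rank 0 and broadcast it to all ranks.

    comm        -- communicator with mpi4py-style Get_rank() and bcast()
    build_cache -- hypothetical zero-argument callable returning the dict
    """
    if comm.Get_rank() == 0:
        cache = build_cache()  # only rank 0 touches the filesystem
    else:
        cache = None           # placeholder, replaced by the broadcast
    # bcast sends the root's object and returns a copy on every rank.
    return comm.bcast(cache, root=0)
```

With mpi4py installed, this would be called as
build_cache_on_root(MPI.COMM_WORLD, some_builder) after
"from mpi4py import MPI", so the sys.path traversal happens once
instead of once per process.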
