Python libraries in map/reduce functions


Wes Chow

Sep 12, 2008, 2:21:32 PM
to disc...@googlegroups.com

What is the recommended way to import external Python libraries into my
map and reduce functions?

For example, my fun_map needs to use a function bar that I've defined in
my module foo. I would like to have something like:

import foo

def fun_map(e, params):
    return [ foo.bar(k) for k in e.split() ]


However, as I understand it, Disco only transmits fun_map to the remote
node, so on that node fun_map does not have access to the foo namespace.
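
A minimal sketch of that behaviour, written for Python 3 and not taken from Disco itself. It assumes the function travels as a marshalled code object and is rebuilt on the worker with fresh globals; the helper ship and the stand-in module foo are hypothetical and exist only so the example runs on its own.

```python
import builtins
import marshal
import sys
import types

# Stand-in for the user's module foo (hypothetical: bar just upper-cases).
foo = types.ModuleType("foo")
foo.bar = lambda k: k.upper()
sys.modules["foo"] = foo  # lets "import foo" succeed in this process


def fun_map_global(e, params):
    # Relies on a module-level name "foo" being present where it runs.
    return [foo.bar(k) for k in e.split()]


def fun_map_local(e, params):
    # Carries its own import, so it does not depend on outer globals.
    import foo
    return [foo.bar(k) for k in e.split()]


def ship(func):
    """Serialize only the code object and rebuild it with empty globals."""
    payload = marshal.dumps(func.__code__)
    fresh_globals = {"__builtins__": builtins}
    return types.FunctionType(marshal.loads(payload), fresh_globals, func.__name__)


try:
    ship(fun_map_global)("a b", None)
    global_outcome = "ok"
except NameError:
    # The rebuilt function has no "foo" in its globals.
    global_outcome = "NameError"

local_result = ship(fun_map_local)("a b", None)
```

Only the code object crosses the wire in this sketch, so a name bound by a module-level import is missing on the other side, while an import inside the function body is re-executed wherever the function runs.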


Wes

tuulos

Sep 12, 2008, 2:30:48 PM
to Disco-development

On Sep 12, 11:21 am, Wes Chow <wes.c...@s7labs.com> wrote:
> What is the recommended way to import external Python libraries into my
> map and reduce functions?

Currently the only approach is:

def fun_map(e, params):
    import foo
    return [ foo.bar(k) for k in e.split() ]

It wouldn't be too difficult to add a new argument to disco.core.Job
that specifies a list of required modules to be imported.


Ville

Wes Chow

Sep 12, 2008, 2:45:15 PM
to disc...@googlegroups.com

> On Sep 12, 11:21 am, Wes Chow <wes.c...@s7labs.com> wrote:
>> What is the recommended way to import external Python libraries into my
>> map and reduce functions?
>
> Currently the only approach is:
>
> def fun_map(e, params):
>     import foo
>     return [ foo.bar(k) for k in e.split() ]

I was afraid of that :)

> It wouldn't be too difficult to add a new argument to disco.core.Job
> that specifies a list of required modules to be imported.

I'll look into this if you haven't already.


Wes

Valentino Volonghi

Sep 12, 2008, 2:53:58 PM
to disc...@googlegroups.com

On Sep 12, 2008, at 11:30 AM, tuulos wrote:

> Currently the only approach is:
>
> def fun_map(e, params):
>     import foo
>     return [ foo.bar(k) for k in e.split() ]
>
> It wouldn't be too difficult to add a new argument to disco.core.Job
> that specifies a list of required modules to be imported.


Incidentally this would also remove the need for marshalling.

There's no need for marshalling if you require certain modules to be
installed: you could ship your own module as just another dependency.

Hadoop sends around jar files; Disco should work by sending
around egg files, verifying them, etc.

--
Valentino Volonghi aka Dialtone
Now running MacOS X 10.5
Home Page: http://www.twisted.it
http://www.adroll.com


Wes Chow

Sep 12, 2008, 4:58:11 PM
to disc...@googlegroups.com

>> It wouldn't be too difficult to add a new argument to disco.core.Job
>> that specifies a list of required modules to be imported.

I've attached a patch for rudimentary module loading support via a
"required_modules" argument to new_job. It takes a list of strings,
where each string is a module name. The modules are imported in list
order before map, reduce, combine, etc. are executed.

It has no support for the "from <package> import <modules>" syntax. Like
I said, this is rudimentary...

In fact, I'm not even convinced that you gain anything at all by passing
in this argument rather than directly importing things inside the
map/reduce functions. It was primarily an exercise to learn a bit more
about Disco (and Python too, it turns out).
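
Roughly what such a loader could look like, as a sketch in Python 3 rather than the attached patch. The helper name load_required_modules and the way the namespace is handed to the function are assumptions made for illustration; like the patch, it handles plain module names only and has no "from ... import ..." support.

```python
import builtins
import importlib
import types


def load_required_modules(required_modules, namespace=None):
    """Import each named module, in order, and bind it in a namespace."""
    if namespace is None:
        namespace = {"__builtins__": builtins}
    for name in required_modules:
        importlib.import_module(name)
        # Like a plain "import a.b", bind only the top-level name "a".
        top_level = name.split(".")[0]
        namespace[top_level] = importlib.import_module(top_level)
    return namespace


def fun_map(e, params):
    # math is deliberately not imported in this file; it is supplied
    # through the namespace built from the required module list.
    return [math.sqrt(float(k)) for k in e.split()]


# Hypothetical worker side: build globals first, then attach the function.
worker_globals = load_required_modules(["math"])
worker_map = types.FunctionType(fun_map.__code__, worker_globals, "fun_map")
result = worker_map("4 9", None)
```

Compared with importing inside the function body, the main difference in this sketch is that the imports happen once, before the first call, and the job's dependencies are declared in one place.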


Wes

required_module.patch

tuulos

Sep 12, 2008, 5:11:04 PM
to Disco-development

> Incidentally this would also remove the need for marshalling.
>
> There's no need for marshalling if you require certain modules to be
> installed, you could use your own module as just another dependency.
>
> hadoop sends around jar files, disco should work by sending
> around egg files, verifying them etc. etc.

Sure. That actually works already indirectly through the external
interface. It lets you specify a task as an executable together with
required libraries. We could make this approach more straightforward
for Python, maybe using eggs.

However, I think that this approach should be complementary to
marshalling. Disco should make creating simple jobs simple and complex
jobs possible. I doubt that you can express the example from the
http://discoproject.org front page with as few lines of code using
eggs as with marshalling (without making Disco overly complex). If you
can, I'd be happy to get rid of marshalling, which is problematic to
use with mismatching Python versions.
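
One way to make a version mismatch fail loudly, sketched in Python 3 under the assumption that functions are shipped as marshalled code objects. Marshalled code is not portable between Python versions, so this tags each payload with the sender's version and checks it before unmarshalling. The helpers pack_code and unpack_code and the two-byte header are hypothetical, not part of Disco.

```python
import marshal
import sys


def pack_code(func, version=None):
    """Prefix the marshalled code object with (major, minor) of the sender.

    The version argument exists only so a mismatch can be simulated.
    """
    major, minor = version or sys.version_info[:2]
    return bytes([major, minor]) + marshal.dumps(func.__code__)


def unpack_code(payload):
    """Check the version header before touching the marshalled bytes."""
    sender = (payload[0], payload[1])
    local = tuple(sys.version_info[:2])
    if sender != local:
        raise RuntimeError(
            "code was marshalled by Python %d.%d, this worker runs %d.%d"
            % (sender + local)
        )
    return marshal.loads(payload[2:])


def fun_map(e, params):
    return e.split()


# Same version on both ends: the code object round-trips.
code = unpack_code(pack_code(fun_map))

# Simulated sender on a different minor version: rejected up front.
other = (sys.version_info[0], sys.version_info[1] + 1)
try:
    unpack_code(pack_code(fun_map, version=other))
    mismatch_outcome = "accepted"
except RuntimeError:
    mismatch_outcome = "rejected"
```

The header sits outside the marshalled bytes on purpose, so the check can run without asking marshal to parse data it may not understand.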


Ville

Ville Tuulos

Oct 5, 2008, 4:52:31 AM
to Disco-development

> I've attached a patch for rudimentary module loading support via a
> "required_modules" argument to new_job. It takes a list of strings,
> where each string is a module name to be imported, in that order, before
> executing map, reduce, combine, etc...

Thanks Wes! This patch is now added to Disco.

Ville