Is it possible for me to add a destructor method to a mapper class
that will reliably be called once no further input data will be fed to
a particular mapper instance? I could work with the __del__() method,
but the Python documentation makes that sound "precarious":
http://docs.python.org/reference/datamodel.html#object.__del__
My mapper in question needs to run a thread pool, and I would like to
launch the thread pool in the mapper's __init__ method, leave the
thread pool running throughout the sequence of calls to __call__, and
then safely shut down the thread pool when no further data will be
delivered to the mapper. Is there a simple way to make Dumbo do this?
If not, suppose I use the alternative interface for mappers and reducers:
http://dumbotics.com/2009/03/31/mapper-and-reducer-interfaces/
In that case, will the __call__ method of the mapper class only be
called once, guaranteed, no matter how Hadoop is configured? Then I
could fold the __init__ and destructor methods into the beginning and
end of the __call__ method.
Any help would be much appreciated.
Thanks,
Zak
You could add a suitable "close" method. For mapper and reducer classes,
Dumbo will call the "configure" and "close" methods automatically (if they
exist) before and after doing the actual work, respectively.
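A minimal sketch of the pattern described above, applied to the thread-pool case from the question. The names "PooledMapper" and "run_locally" are made up for illustration, and "run_locally" is only a stand-in for the framework, not Dumbo's actual code. It assumes that "close" is called after every output has been produced.

```python
from multiprocessing.pool import ThreadPool


class PooledMapper:
    def __init__(self):
        # Start the worker threads once, when the mapper object is created.
        self.pool = ThreadPool(4)
        self.closed = False

    def __call__(self, key, value):
        # Hand the per-record work to the pool and wait for its result.
        yield key, self.pool.apply(len, (value,))

    def close(self):
        # Shut the pool down once no more records will arrive.
        self.pool.close()
        self.pool.join()
        self.closed = True


def run_locally(mapper, records):
    # Hypothetical stand-in for the framework (NOT Dumbo's real code):
    # drain every output first, then call close() if the mapper has one.
    outputs = []
    for key, value in records:
        outputs.extend(mapper(key, value))
    if hasattr(mapper, "close"):
        mapper.close()
    return outputs
```

Calling run_locally(PooledMapper(), [("a", "hello")]) processes the record through the pool and then shuts the pool down.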
> If not, suppose I use the alternative interface for mappers and reducers:
>
> http://dumbotics.com/2009/03/31/mapper-and-reducer-interfaces/
>
> In that case, will the __call__ method of the mapper class only be
> called once, guaranteed, no matter how Hadoop is configured? Then I
> could fold the __init__ and destructor methods into the beginning and
> end of the __call__ method.
That should also work, yeah. The "__call__" method should indeed only be
called once (per map task) if you use the alternative interface.
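A sketch of folding setup and teardown into a single "__call__", assuming the alternative interface passes one iterator of (key, value) pairs as described in the linked post. The class name "StreamMapper" and the "pool_closed" flag are illustrative only.

```python
from multiprocessing.pool import ThreadPool


class StreamMapper:
    def __init__(self):
        self.pool_closed = False

    def __call__(self, data):
        # Setup happens once, when the first output is requested.
        pool = ThreadPool(4)
        try:
            for key, value in data:
                yield key, pool.apply(len, (value,))
        finally:
            # Teardown runs once the input iterator is exhausted.
            pool.close()
            pool.join()
            self.pool_closed = True
```

Because "__call__" is a generator here, the "finally" block runs when the consumer has drained all outputs. If the consumer stops early, it runs only when the generator is closed or garbage-collected.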
-Klaas
Thanks, Klaas. The "close" method sounds like just what I need. What
is the difference between __init__ and "configure", though? Is
"configure" called before each __call__, or is it called only once,
right after __init__?
Zak
It's called only once. In practice there usually is no difference
between using "__init__" or "configure" (although the former could be
considered preferable since any Python programmer should immediately
be able to understand what it does). For some very specific cases,
however, "configure" can be the only option. You could have a look at
the code for "MultiMapper" if you want to see an example of such a
specific case:
http://github.com/klbostee/dumbo/blob/026dd1fd7e8baf6d68843d09040b85a52f405155/dumbo/lib.py
Essentially, the difference is that "__init__" coincides with object
creation, whereas the object already exists when "configure" is
called. The following examples might help to understand this subtle
difference:
$ cat hello1.py
import sys, dumbo
class Mapper:
    def __init__(self):
        print >>sys.stderr, "Hello there!"
    def __call__(self, key, value):
        yield key, value
if __name__ == "__main__":
    dumbo.run(Mapper) # class
$ cat hello2.py
import sys, dumbo
class Mapper:
    def __init__(self):
        print >>sys.stderr, "Hello there!"
    def __call__(self, key, value):
        yield key, value
if __name__ == "__main__":
    dumbo.run(Mapper()) # instance
$ cat hello3.py
import sys, dumbo
class Mapper:
    def configure(self):
        print >>sys.stderr, "Hello there!"
    def __call__(self, key, value):
        yield key, value
if __name__ == "__main__":
    dumbo.run(Mapper()) # instance
$ dumbo start hello1.py -hadoop [...]
EXEC: [...]
[..]
$ dumbo start hello2.py -hadoop [...]
Hello there!
EXEC: [...]
[..]
$ dumbo start hello3.py -hadoop [...]
EXEC: [...]
[..]
So in the case of "hello2.py" the "Hello there!" (also) gets printed on
the machine from which the program is started, whereas you'll only
find it in the Hadoop logs in the case of "hello1.py" and "hello3.py".
I realize this might all be kind of confusing, but as I said, the
difference doesn't really matter most of the time...
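The timing difference can be sketched without Hadoop. The "start_task" function below is a hypothetical stand-in for what happens when a task starts; it is not Dumbo's code, and the "events" list only records the order of calls.

```python
import inspect

events = []


class Mapper:
    def __init__(self):
        events.append("init")

    def configure(self):
        events.append("configure")

    def __call__(self, key, value):
        yield key, value


def start_task(mapper):
    # Hypothetical stand-in for task start-up (NOT Dumbo's real code).
    # A class is instantiated here, so __init__ runs at task start;
    # an instance already exists, so only configure() runs here.
    if inspect.isclass(mapper):
        mapper = mapper()
    if hasattr(mapper, "configure"):
        mapper.configure()
    return mapper
```

Passing the class defers "__init__" until start_task runs, whereas building Mapper() up front runs "__init__" immediately and leaves only "configure" for later.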
-Klaas
Unfortunately, it appears that the close method is actually called
_before_ the work begins at present. Consider this example:
mapclose.py
-------------------------------------------------------------------
import sys
class TestMap:
    def __init__(self):
        sys.stderr.write("Executing __init__\n")
    def __call__(self, key, value):
        sys.stderr.write("Executing __call__\n")
        yield key, value
    def close(self):
        sys.stderr.write("Executing close method\n")
if __name__ == "__main__":
    import dumbo
    dumbo.run(TestMap)
-------------------------------------------------------------------
sample_data.txt
-------------------------------------------------------------------
key value
-------------------------------------------------------------------
The command to run:
dumbo start mapclose.py -input sample_data.txt -output out
When I run the command above, I get the following output, which
indicates that the close method is called before the __call__ method:
EXEC: PYTHONPATH="/usr/local/lib/python2.6/dist-packages/dumbo-0.21.22-py2.6.egg:$PYTHONPATH"
python -m dumbo.cmd encodepipe -file sample_data.txt |
PYTHONPATH="/usr/local/lib/python2.6/dist-packages/dumbo-0.21.22-py2.6.egg:$PYTHONPATH"
python -m mapclose map 0 262144000 > 'out'
Executing __init__
Executing close method
Executing __call__
Here is my guess as to why this happens. From what I can tell, the
mapper output is collected in core.py in a generator called 'outputs',
but items aren't actually requested from outputs until after the close
method (called 'mapclose' internally) is called. Because the generator
is lazy, the cleanup happens before the actual work!
If this is the correct explanation, could the mapclose and redclose
calls be moved somehow in core.py so as to run after all of the output
has been generated?
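The suspected ordering problem can be reproduced in isolation. This is a sketch of the guess above, not an excerpt from core.py: "run_buggy" and "run_fixed" are made-up names, and the "log" list stands in for the stderr messages.

```python
class TestMap:
    def __init__(self, log):
        self.log = log

    def __call__(self, key, value):
        self.log.append("call")
        yield key, value

    def close(self):
        self.log.append("close")


def run_buggy(mapper, records):
    # Building the generator runs no mapper code yet.
    outputs = (pair for k, v in records for pair in mapper(k, v))
    mapper.close()           # cleanup runs before any record is processed
    return list(outputs)     # the actual work only happens here


def run_fixed(mapper, records):
    outputs = (pair for k, v in records for pair in mapper(k, v))
    results = list(outputs)  # drain the generator first
    mapper.close()           # cleanup now follows the work
    return results
```

With one input record, "run_buggy" logs "close" before "call", which matches the output shown above, while "run_fixed" logs them in the intended order. Draining into a list keeps all outputs in memory; a real fix would more likely write each output as it is produced and call "close" afterwards.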
Thanks,
Zak