how best to add a destructor to a mapper class?

Zak Stone

Jul 3, 2009, 12:01:32 AM
to dumbo...@googlegroups.com
Hi everyone,

Is it possible for me to add a destructor method to a mapper class
that will reliably be called once no further input data will be fed to
a particular mapper instance? I could work with the __del__() method,
but the Python documentation makes that sound "precarious":

http://docs.python.org/reference/datamodel.html#object.__del__

My mapper in question needs to run a thread pool, and I would like to
launch the thread pool in the mapper's __init__ method, leave the
thread pool running throughout the sequence of calls to __call__, and
then safely shut down the thread pool when no further data will be
delivered to the mapper. Is there a simple way to make Dumbo do this?

If not, suppose I use the alternative interface for mappers and reducers:

http://dumbotics.com/2009/03/31/mapper-and-reducer-interfaces/

In that case, will the __call__ method of the mapper class only be
called once, guaranteed, no matter how Hadoop is configured? Then I
could fold the __init__ and destructor methods into the beginning and
end of the __call__ method.

Any help would be much appreciated.

Thanks,
Zak

Klaas Bosteels

Jul 3, 2009, 4:09:04 AM
to dumbo...@googlegroups.com
On Fri, Jul 3, 2009 at 6:01 AM, Zak Stone<zst...@gmail.com> wrote:
>
> Hi everyone,
>
> Is it possible for me to add a destructor method to a mapper class
> that will reliably be called once no further input data will be fed to
> a particular mapper instance? I could work with the __del__() method,
> but the Python documentation makes that sound "precarious":
>
> http://docs.python.org/reference/datamodel.html#object.__del__
>
> My mapper in question needs to run a thread pool, and I would like to
> launch the thread pool in the mapper's __init__ method, leave the
> thread pool running throughout the sequence of calls to __call__, and
> then safely shut down the thread pool when no further data will be
> delivered to the mapper. Is there a simple way to make Dumbo do this?

You could add a suitable "close" method. For mapper and reducer classes,
Dumbo will call the "configure" and "close" methods automatically (if they
exist) before and after doing the actual work, respectively.
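
As a rough illustration of that pattern, here is a minimal sketch of a mapper that starts a thread pool in "__init__" and shuts it down in "close". It is written for current Python 3 rather than the Python 2.6 used elsewhere in this thread. The names PooledMapper and run_locally, the pool size, and the use of str.upper as the per-record work are all assumptions made for the example; run_locally is only a local stand-in for the framework's driver loop and is not Dumbo's actual code.

```python
from concurrent.futures import ThreadPoolExecutor


class PooledMapper:
    """Mapper that keeps one thread pool alive across all __call__ invocations."""

    def __init__(self):
        # Started once per mapper instance; max_workers=4 is an arbitrary choice.
        self.pool = ThreadPoolExecutor(max_workers=4)
        self.closed = False

    def __call__(self, key, value):
        # Hand the (hypothetical) per-record work to the pool and wait for it.
        future = self.pool.submit(str.upper, value)
        yield key, future.result()

    def close(self):
        # Blocks until queued work has finished, then releases the threads.
        self.pool.shutdown(wait=True)
        self.closed = True


def run_locally(mapper_class, records):
    """Local stand-in for a driver loop; an assumption, not Dumbo's real code."""
    mapper = mapper_class()
    # Drain every record first so that close() really runs after the work.
    outputs = [pair for key, value in records for pair in mapper(key, value)]
    if hasattr(mapper, "close"):
        mapper.close()
    return mapper, outputs
```

With Dumbo itself you would presumably pass the class to dumbo.run, as the examples later in this thread do, and leave the calls to "close" to the framework.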


> If not, suppose I use the alternative interface for mappers and reducers:
>
> http://dumbotics.com/2009/03/31/mapper-and-reducer-interfaces/
>
> In that case, will the __call__ method of the mapper class only be
> called once, guaranteed, no matter how Hadoop is configured? Then I
> could fold the __init__ and destructor methods into the beginning and
> end of the __call__ method.

That should also work, yeah. The "__call__" method should indeed only be
called once (per map task) if you use the alternative interface.
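
A minimal sketch of that approach, assuming (as described in the dumbotics post linked above) that an alternative-interface mapper receives a single iterator of (key, value) pairs. The class name WholeTaskMapper and the events list are made up for illustration; the list only records the order in which things happen.

```python
class WholeTaskMapper:
    """Mapper whose __call__ handles the whole input of one task."""

    def __init__(self):
        self.events = []

    def __call__(self, data):
        # Setup that would otherwise live in __init__ or configure.
        self.events.append("setup")
        try:
            for key, value in data:
                self.events.append("record")
                yield key, value
        finally:
            # Cleanup that would otherwise live in close or a destructor.
            self.events.append("teardown")
```

Because "__call__" is a generator here, the setup line only runs once the caller starts iterating, and the finally block runs when the input is exhausted or the generator is closed early.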


-Klaas

Zak Stone

Jul 3, 2009, 11:31:08 AM
to dumbo...@googlegroups.com
> You could add a suitable "close" method. For mapper and reducer classes,
> Dumbo will call the "configure" and "close" methods automatically (if they
> exist) before and after doing the actual work, respectively.

Thanks, Klaas. The "close" method sounds like just what I need. What
is the difference between __init__ and "configure", though? Is
"configure" called before each __call__, or is it called only once,
right after __init__?

Zak

Klaas Bosteels

Jul 3, 2009, 12:21:11 PM
to dumbo...@googlegroups.com

It's called only once. In practice there usually is no difference
between using "__init__" or "configure" (although the former could be
considered preferable since any Python programmer should immediately
be able to understand what it does). For some very specific cases,
however, "configure" can be the only option. You could have a look at
the code for "MultiMapper" if you want to see an example of such a
specific case:

http://github.com/klbostee/dumbo/blob/026dd1fd7e8baf6d68843d09040b85a52f405155/dumbo/lib.py

Essentially, the difference is that "__init__" coincides with object
creation, whereas the object already exists when "configure" is
called. The following examples might help to understand this subtle
difference:

$ cat hello1.py
import sys, dumbo

class Mapper:
    def __init__(self):
        print >>sys.stderr, "Hello there!"
    def __call__(self, key, value):
        yield key, value

if __name__ == "__main__":
    dumbo.run(Mapper) # class

$ cat hello2.py
import sys, dumbo

class Mapper:
    def __init__(self):
        print >>sys.stderr, "Hello there!"
    def __call__(self, key, value):
        yield key, value

if __name__ == "__main__":
    dumbo.run(Mapper()) # instance

$ cat hello3.py
import sys, dumbo

class Mapper:
    def configure(self):
        print >>sys.stderr, "Hello there!"
    def __call__(self, key, value):
        yield key, value

if __name__ == "__main__":
    dumbo.run(Mapper()) # instance

$ dumbo start hello1.py -hadoop [...]
EXEC: [...]
[..]

$ dumbo start hello2.py -hadoop [...]
Hello there!
EXEC: [...]
[..]

$ dumbo start hello3.py -hadoop [...]
EXEC: [...]
[..]

So in the case of "hello2.py" the "Hello there!" (also) gets printed on
the machine from which the program is started, whereas you'll only
find it in the Hadoop logs in the case of "hello1.py" and "hello3.py".

I realize this might all be kind of confusing, but as I said, the
difference doesn't really matter most of the time...

-Klaas

Zak Stone

Jul 3, 2009, 3:15:43 PM
to dumbo...@googlegroups.com
Perfect -- thanks again.

Zak

Zak Stone

Jul 4, 2009, 5:52:00 PM
to dumbo...@googlegroups.com
> You could add a suitable "close" method. For mapper and reducer classes,
> Dumbo will call the "configure" and "close" methods automatically (if they
> exist) before and after doing the actual work, respectively.

Unfortunately, it appears that the close method is actually called
_before_ the work begins at present. Consider this example:

mapclose.py
-------------------------------------------------------------------
import sys

class TestMap:
    def __init__(self):
        sys.stderr.write("Executing __init__\n")

    def __call__(self, key, value):
        sys.stderr.write("Executing __call__\n")
        yield key, value

    def close(self):
        sys.stderr.write("Executing close method\n")

if __name__ == "__main__":
    import dumbo
    dumbo.run(TestMap)
-------------------------------------------------------------------

sample_data.txt
-------------------------------------------------------------------
key value
-------------------------------------------------------------------

The command to run:
dumbo start mapclose.py -input sample_data.txt -output out

When I run the command above, I get the following output, which
indicates that the close method is called before the __call__ method:

EXEC: PYTHONPATH="/usr/local/lib/python2.6/dist-packages/dumbo-0.21.22-py2.6.egg:$PYTHONPATH"
python -m dumbo.cmd encodepipe -file sample_data.txt |
PYTHONPATH="/usr/local/lib/python2.6/dist-packages/dumbo-0.21.22-py2.6.egg:$PYTHONPATH"
python -m mapclose map 0 262144000 > 'out'
Executing __init__
Executing close method
Executing __call__

Here is my guess as to why this happens. From what I can tell, the
mapper output is collected in core.py in a generator called 'outputs',
but items aren't actually requested from outputs until after the close
method (called 'mapclose' internally) is called. Because the generator
is lazy, the cleanup happens before the actual work!

If this is the correct explanation, could the mapclose and redclose
calls be moved somehow in core.py so as to run after all of the output
has been generated?
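
To make the suspected ordering problem concrete, here is a small self-contained sketch that does not use Dumbo at all. The function run_mapper and the drain_first flag are hypothetical stand-ins for whatever core.py really does, and TracingMapper just records the order of calls. The only point is that a lazy generator postpones the mapper calls until something iterates over it.

```python
class TracingMapper:
    """Records the order in which __call__ and close are executed."""

    def __init__(self):
        self.log = []

    def __call__(self, key, value):
        self.log.append("call")
        yield key, value

    def close(self):
        self.log.append("close")


def run_mapper(mapper, records, drain_first):
    """Hypothetical driver; drain_first selects the suspected or the fixed order."""
    # Generator expression: no mapper code runs until it is iterated.
    outputs = (pair for key, value in records for pair in mapper(key, value))
    if drain_first:
        # Force all the work to happen before the cleanup.
        outputs = list(outputs)
    mapper.close()
    return list(outputs)


lazy = TracingMapper()
run_mapper(lazy, [("key", "value")], drain_first=False)
print(lazy.log)   # ['close', 'call']

eager = TracingMapper()
run_mapper(eager, [("key", "value")], drain_first=True)
print(eager.log)  # ['call', 'close']
```

Draining into a list is the simplest way to show the effect, but it holds all output in memory; calling close after the loop that writes the output would achieve the same ordering while staying lazy.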

Thanks,
Zak

JinYeong Bak

Oct 9, 2012, 2:28:30 AM
to dumbo...@googlegroups.com
Hi,

I was also looking for a way to run some code after the reducer's work
is done, and a search led me to this thread.

Zak reported that his example code did not behave correctly. That
message was written in 2009, though, and Dumbo has been updated since
then, so I tested the example again with the current master branch:

https://github.com/klbostee/dumbo

https://github.com/klbostee/dumbo/commit/c57cad01fa68f894963064a5850c7b7fbae40874

I ran the test on my CentOS 5 machine:


[hadoop_usr@ test_dumbo]$ dumbo start mapclose.py -input sample_data.txt -output out -python /usr/bin/python26
EXEC: PYTHONPATH="/usr/lib/python2.6/site-packages/dumbo-0.21.36-py2.6.egg:$PYTHONPATH" /usr/bin/python26 -m dumbo.cmd encodepipe -file sample_data.txt | PYTHONPATH="/usr/lib/python2.6/site-packages/dumbo-0.21.36-py2.6.egg:$PYTHONPATH" dumbo_mrbase_class='dumbo.backends.common.MapRedBase' dumbo_jk_class='dumbo.backends.common.JoinKey' dumbo_runinfo_class='dumbo.backends.common.RunInfo' /usr/bin/python26 -m mapclose map 0 262144000  > 'out'
Executing __init__
Executing __call__
Executing close method

As you can see, it now works correctly: the close method runs after
__call__. I think this problem has been fixed.

Thanks for starting this thread and for the suggestion.