Keeping temporary results on disk when running multiple iterations

Jørn Schou-Rode

Dec 8, 2009, 12:13:59 PM

to dumbo-user

When running a job with multiple iterations, can I somehow instruct
Dumbo/Hadoop to keep (some of) the intermediate results on disk, in
addition to the final result?

I am quite new to Dumbo (and Hadoop for that matter), so please
don't hurt me if there is some really obvious answer that I should
have known about :-)

Thanks in advance.

/Jørn

Erik Forsberg

Dec 8, 2009, 1:19:47 PM

to dumbo...@googlegroups.com

On Tue, 8 Dec 2009 09:13:59 -0800 (PST)
Jørn Schou-Rode <j...@malamute.dk> wrote:

> When running a job with multiple iterations, can I somehow instruct
> Dumbo/Hadoop to keep (some of) the intermediate results on disk, in
> addition to the final result?

I think that if you add '-preoutputs yes' to the command line, the
intermediate inputs will be kept.

Regards,
\EF
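
For context, here is a minimal sketch of what a two-iteration Dumbo job and the suggested flag might look like. The word-count logic, the function names and the file name wordcount.py are made up for illustration; only '-preoutputs yes' comes from this thread. The guarded import is there so the plain functions can be tried without Dumbo installed.

```python
# Hypothetical two-iteration job, assumed to be saved as wordcount.py.
# Assumed invocation (paths are placeholders):
#   dumbo start wordcount.py -input in.txt -output out -preoutputs yes

def count_mapper(key, value):
    # Iteration 1, map: emit (word, 1) for every word in the input line.
    for word in value.split():
        yield word, 1

def count_reducer(key, values):
    # Iteration 1, reduce: sum the counts for each word.
    yield key, sum(values)

def swap_mapper(key, value):
    # Iteration 2, map: turn (word, count) into (count, word).
    yield value, key

def group_reducer(key, values):
    # Iteration 2, reduce: collect all words that share a count.
    yield key, sorted(values)

def runner(job):
    # Each additer() call registers one map/reduce iteration; the output
    # of the first iteration is the intermediate result in question.
    job.additer(count_mapper, count_reducer)
    job.additer(swap_mapper, group_reducer)

if __name__ == "__main__":
    try:
        import dumbo
    except ImportError:
        dumbo = None  # Dumbo not installed; nothing to run
    if dumbo is not None:
        dumbo.main(runner)
```

With the flag set, the output of the first iteration should remain on disk next to the final output instead of being cleaned up.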

Jørn Schou-Rode

Dec 8, 2009, 2:34:25 PM

to dumbo...@googlegroups.com

On Tue, 08 Dec 2009 19:19 +0100, "Erik Forsberg" <fors...@opera.com>
wrote:

> I think that if you add '-preoutputs yes' to the command line, the
> intermediate inputs will be kept.
>
> Regards,
> \EF
>

It works, thanks!

Looking at the output, I realize that it would be an extra bonus to have
control of how the intermediate results are serialized. Is there any
alternative to -preoutputs that allows me to specify some kind of
formatting function?

In my specific case, I only need to report the keys emitted by each
iteration, while the associated values are used internally for the next
iteration.

Thanks again!

/Jørn
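
Absent a built-in formatting option, one possible workaround is a small key-only pass over the intermediate output that '-preoutputs yes' leaves on disk. The sketch below shows just the idea; keys_only and report_keys are hypothetical names, and running keys_only as a separate map-only Dumbo job over the kept directory is an assumption, not something confirmed in this thread.

```python
def keys_only(key, value):
    # Hypothetical mapper: keep the key, replace the value with an
    # empty string so only the key carries information.
    yield key, ""

def report_keys(records):
    # Local stand-in for running keys_only over (key, value) pairs,
    # e.g. the records of one kept intermediate result.
    return [k for key, value in records for k, _ in keys_only(key, value)]

if __name__ == "__main__":
    sample = [("alpha", 3), ("beta", [1, 2])]
    for key in report_keys(sample):
        print(key)
```

The values stay untouched in the intermediate files themselves, so the next iteration can still use them; only the report drops them.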
