Processing files with disco


tomah...@gmail.com

Jun 24, 2015, 10:02:06 AM6/24/15
to disc...@googlegroups.com
Hi,

I'm currently in the initial phase of evaluating Disco for analysing large volumes of binary data, and I'm hoping someone can provide some advice.

We have a lot of small files (~20KB each) and need to run several Disco jobs over the data. What would be the best way to format this for DDFS?

I currently dump all of the files into a "big file" containing key-value pairs of id and base64-encoded data, separated by a tab:

<id>    <base64>
<id>    <base64>
<id>    <base64>
<id>    <base64>

This single file is then written to DDFS as chunked data in 64MB chunks.  An initial Disco job decodes the base64 and writes the data to tmp on the node, yielding an (id, filename) key-value pair; subsequent jobs then attempt to analyse the files.
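For concreteness, my ingest step looks roughly like this (directory and tag names are made up; this assumes the Python DDFS API's chunk() call, which splits its input into 64MB chunks by default):

import base64
import os

from disco.ddfs import DDFS

def build_big_file(src_dir, out_path):
    # concatenate the small files into one 'id<TAB>base64' line each
    with open(out_path, 'w') as out:
        for name in sorted(os.listdir(src_dir)):
            with open(os.path.join(src_dir, name), 'rb') as f:
                out.write('%s\t%s\n' % (name, base64.b64encode(f.read())))

build_big_file('./small_files', './bigfile.tsv')

# store the big file in DDFS as chunks under a single tag
DDFS().chunk('data:binaries', ['./bigfile.tsv'])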

How can I make sure that a specific Disco node always processes the files that it has written to tmp, and not files that were written to tmp on another node?  Would using a pipeline with node grouping do this?
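My understanding of the pipeline model is that a stage declared with the "group_node" grouping has its inputs grouped per node and its tasks scheduled on that node, so a decode stage followed by a group_node analyse stage might keep everything local. A rough sketch of what I have in mind (the process functions are just placeholders):

import base64
import os
import tempfile

from disco.core import Job
from disco.worker.pipeline.worker import Worker, Stage

def decode(interface, state, label, inp):
    # write each decoded payload to local tmp and emit (id, path)
    out = interface.output(0)
    for line in inp:
        file_id, _, payload = line.rstrip('\n').partition('\t')
        fd, path = tempfile.mkstemp()
        with os.fdopen(fd, 'wb') as f:
            f.write(base64.b64decode(payload))
        out.add(file_id, path)

def analyse(interface, state, label, inp):
    # group_node should place this task on the node that wrote the paths
    out = interface.output(0)
    for file_id, path in inp:
        out.add(file_id, os.path.getsize(path))  # placeholder analysis

class BinaryJob(Job):
    worker = Worker()
    pipeline = [("split", Stage("decode", process=decode)),
                ("group_node", Stage("analyse", process=analyse))]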

Should I instead write all of the small individual files to DDFS under a single tag and work on them that way?
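That is, pushing each file as its own blob under one tag, something like (tag name made up again):

import os

from disco.ddfs import DDFS

paths = [os.path.join('./small_files', name)
         for name in os.listdir('./small_files')]
# each path becomes a separate replicated blob under the tag
DDFS().push('data:smallfiles', paths)

My worry with this is that DDFS replicates every blob individually, so a very large number of ~20KB blobs would mean a lot of per-blob overhead compared to 64MB chunks.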

Thanks

Tom

Erik Dubbelboer

Jun 28, 2015, 7:17:08 AM6/28/15
to disc...@googlegroups.com
Why exactly do you need to decode the base64 and write it to a file? Why not decode it and then process it immediately? For example, you can do the decode in the map_reader stage and then just pass the data to map for processing.
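Roughly like this with the classic worker (names are made up, and the analysis in map is a placeholder; chain_reader unpacks Disco's internal chunk format, so each record is one of your 'id<TAB>base64' lines; depending on your Disco version chain_reader may live in disco.worker.task_io instead):

import base64

from disco.core import Job, result_iterator
from disco.worker.classic.func import chain_reader

def b64_reader(stream, size, url, params):
    # decode each record on the fly; nothing is written to tmp
    for line in chain_reader(stream, size, url):
        file_id, _, payload = line.rstrip('\n').partition('\t')
        yield file_id, base64.b64decode(payload)

def analyse(record, params):
    file_id, raw = record
    yield file_id, len(raw)  # placeholder: report the decoded size

job = Job().run(input=['tag://data:binaries'],
                map_reader=b64_reader,
                map=analyse)

for file_id, size in result_iterator(job.wait(show=True)):
    print('%s: %d bytes' % (file_id, size))

This way the decoded bytes never touch disk, and the question of which node wrote which tmp file disappears entirely.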