Hi,
I'm in the early stages of evaluating Disco for analysing large volumes of binary data, and I'm hoping someone can offer some advice.
We have a large number of small files (~20 KB each) and need to run several Disco jobs over the data. What would be the best way to format this for DDFS?
At the moment I dump all of the files into one "big file" containing key-value pairs of id and base64-encoded data, separated by a tab:
<id> <base64>
<id> <base64>
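To make that concrete, the packing step is roughly the following (just a sketch; the directory and file names are made up):

    import base64
    import os

    input_dir = "raw_files"     # hypothetical directory of ~20 KB binary files
    output_path = "packed.tsv"  # hypothetical "big file" to be chunked into DDFS

    # One tab-separated line per input file: <id> \t <base64-encoded contents>.
    # The file name doubles as the id in this sketch.
    with open(output_path, "w") as out:
        for name in sorted(os.listdir(input_dir)):
            with open(os.path.join(input_dir, name), "rb") as f:
                encoded = base64.b64encode(f.read()).decode("ascii")
            out.write("%s\t%s\n" % (name, encoded))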
This single file is then written to DDFS as chunked data in 64 MB chunks. An initial Disco job decodes the base64 and writes the data to tmp on the node, yielding an (id, filename) key-value pair; subsequent jobs then attempt to analyse the file.
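In code, the chunk-and-decode step looks roughly like this. The tag name and the /tmp path are placeholders, and I'm using the classic map interface and DDFS client as I understand them from the tutorial, so the details may be slightly off:

    from disco.core import Job, result_iterator
    from disco.ddfs import DDFS

    def decode_map(line, params):
        import base64, os
        if isinstance(line, bytes):      # chunked records may arrive as bytes
            line = line.decode("utf-8")
        file_id, encoded = line.split("\t", 1)
        tmp_dir = "/tmp/decoded"         # placeholder; local to whichever node runs the task
        try:
            os.makedirs(tmp_dir)
        except OSError:
            pass
        path = os.path.join(tmp_dir, file_id)
        with open(path, "wb") as f:
            f.write(base64.b64decode(encoded))
        yield file_id, path

    if __name__ == "__main__":
        # Equivalent to: ddfs chunk data:packed ./packed.tsv
        DDFS().chunk("data:packed", ["./packed.tsv"])
        job = Job().run(input=["tag://data:packed"], map=decode_map)
        for file_id, path in result_iterator(job.wait(show=True)):
            print("%s\t%s" % (file_id, path))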
How can I make sure that a specific Disco node always processes the files that it has written to tmp, and not files that were written to tmp on another node? Would using a pipeline with node grouping do this?
Should I instead write all of the small individual files to DDFS under a single tag and work on them that way?
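In other words, something along these lines, assuming I've understood the DDFS push API correctly (the tag and directory names are again just placeholders):

    import os
    from disco.ddfs import DDFS

    input_dir = "raw_files"   # same hypothetical directory as above
    paths = [os.path.join(input_dir, name) for name in sorted(os.listdir(input_dir))]

    # Equivalent to: ddfs push data:smallfiles raw_files/*
    DDFS().push("data:smallfiles", paths)

    # Jobs would then take input=["tag://data:smallfiles"]; if I understand
    # correctly, each ~20 KB file becomes its own blob (and its own map input),
    # which is part of what I'm unsure about with this approach.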
Thanks
Tom