On Thursday, July 9, 2015 at 3:05:25 PM UTC+3, Evgeny Shulman wrote:
> Hi
>
> We are limited by the fact that LocalTarget/S3Target were designed to represent a single file. There are many cases where we need task.output() to be a folder.
> Some examples:
> Hadoop local runs (using Pig or any other wrapper): outputs are <TASK_OUTPUT>/part-1.gz, part-2.gz
> Downloading a folder from other sources
> Local tools that write multiple files
> Currently the atomic_file/AtomicS3File objects work great if the user wants to write one file. However, if we could take the .tmp_path of the returned object and use it as is, that would solve the problem. This cannot be done today because the file on disk is created at the moment AtomicLocalFile is created, so we cannot treat the path as a directory.
> Moreover, to make folder usage compatible with the current implementation, we would like to change the read behaviour of *Target objects as well: if the user has marked a target as a directory, it could be read by reading all the files inside it.
>
>
>
> Implementation details:
> Add an is_dir flag to *Target objects and propagate it to atomic_file on write. If is_dir is set, atomic_file does not create an IOFile but fakes it with a NonWritableFile object (which raises an exception on write). This way we get an object that has .tmp_path, and that path is the target directory.
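To make the quoted idea concrete, here is a minimal standalone sketch using only the Python standard library. It is not the Luigi code or the POC: AtomicDir is a hypothetical name, NonWritableFile is modelled only on the description above, and creating the temporary directory up front is an assumption made to keep the example short.

```python
import os
import shutil
import tempfile


class NonWritableFile:
    """Hypothetical stand-in for a file handle: exposes tmp_path, refuses writes."""

    def __init__(self, tmp_path):
        self.tmp_path = tmp_path

    def write(self, data):
        raise IOError("target is a directory; create files under tmp_path instead")


class AtomicDir:
    """Hypothetical context manager: fill tmp_path, then move it into place."""

    def __init__(self, path):
        self.path = path
        self.handle = None

    def __enter__(self):
        parent = os.path.dirname(os.path.abspath(self.path))
        os.makedirs(parent, exist_ok=True)
        # Keep the temporary directory next to the final path so that the
        # rename stays on one filesystem.
        tmp_path = tempfile.mkdtemp(
            prefix=os.path.basename(self.path) + "-tmp", dir=parent)
        self.handle = NonWritableFile(tmp_path)
        return self.handle

    def __exit__(self, exc_type, exc, tb):
        if exc_type is None:
            # Success: publish the whole directory in one step.
            os.rename(self.handle.tmp_path, self.path)
        else:
            # Failure: leave no partial output behind.
            shutil.rmtree(self.handle.tmp_path, ignore_errors=True)
        return False  # never swallow the caller's exception
```

Inside the `with` block a task would hand `handle.tmp_path` to whatever tool writes the part files; the final path only appears once the block exits without an error.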
We are taking this implementation one step further.
Now that we have atomic_file with is_dir support, we would like to implement a DirectoryFormat that supports reading from and writing to folders as well as files:
class DirectoryFormat(Format):
    input = 'bytes'
    output = 'dir'
The pipe reader will wrap the input_pipe with FileWrapper if it is a file; otherwise it will run `cat <input_path>/<prefix>*<suffix>`.
The pipe writer will create a single file if max_part_size is 0; otherwise it will split the input stream into files using `split ..`.
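For anyone who wants to see the reader/writer behaviour spelled out, here is a rough pure-Python equivalent. It does not shell out to `cat` or `split` as the POC does, and read_target, write_target and the part-NNNNN naming are hypothetical names chosen only for this sketch.

```python
import glob
import os


def read_target(path, prefix="", suffix=""):
    """Yield bytes from a single file, or from every matching part of a directory."""
    if os.path.isfile(path):
        paths = [path]
    else:
        # Same selection as `cat <path>/<prefix>*<suffix>`; sorted for a stable order.
        pattern = os.path.join(glob.escape(path), prefix + "*" + suffix)
        paths = sorted(glob.glob(pattern))
    for part in paths:
        with open(part, "rb") as fh:
            yield from iter(lambda: fh.read(65536), b"")


def write_target(path, chunks, max_part_size=0):
    """Write chunks to one file if max_part_size is 0, else to parts in a directory."""
    if max_part_size == 0:
        with open(path, "wb") as fh:
            for chunk in chunks:
                fh.write(chunk)
        return
    os.makedirs(path, exist_ok=True)
    index, out, written = 0, None, 0
    try:
        for chunk in chunks:
            while chunk:
                if out is None:
                    # Start the next part file once the previous one is full.
                    out = open(os.path.join(path, "part-%05d" % index), "wb")
                    index += 1
                    written = 0
                piece = chunk[:max_part_size - written]
                out.write(piece)
                written += len(piece)
                chunk = chunk[len(piece):]
                if written == max_part_size:
                    out.close()
                    out = None
    finally:
        if out is not None:
            out.close()
```

With max_part_size=4, the stream b"abcdefg", b"hij" ends up as three parts (4, 4 and 2 bytes), and read_target on the directory gives back the original ten bytes in order.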
This format is really useful when you work with Hadoop outputs locally (targetdir.gz/part1.gz, part2.gz) or when you generate inputs for a Hadoop process in a locally running task (so you want multi-file output).
We have a POC that works; the question is whether anybody is interested in such an implementation. (I'll generate a pull request.)