LocalTarget enhancement - support for directories

Evgeny Shulman

unread,

Jul 9, 2015, 8:05:25 AM7/9/15

to luigi...@googlegroups.com

Hi

We are facing some limitation by the fact that LocalTarget/S3Target were designed as File representation. There are a lot of cases, when we need task.output() to be the folder.

Some examples:

Hadoop local runs ( using pig or any other wrapper) - outputs are <TASK_OUTPUT/part-1.gz,part-2.gz
Downloading folder from other sources.
Local tools that writes multiple files

Currently atomic_file/AtomicS3File objects works great if user want to write one file. However, if we could use .tmp_path of the return object and use it as it is, that would solve the problem. Currently that can not be done, as at the moment AtomicLocalFile is created,the file on disk is created as well. Thus we can not treat it as Directory.

Moreover, to make folder usage compatible with current implementation, we would like to change read behaviour of *Target objects as well. if user has marked target as dir, it could be read by reading all files inside it.

Implementation details:

add is_dir flag to *Target objects,propagate it to atomic_file on write.
atomic file with is_dir , if set, does not create IOFile, but fakes it with NonWritableFile object (raise exception on write). This way we can get object that has .tmp_path. That is the target directory.
for open('r') mode return file object using fileinput.input (https://docs.python.org/2/library/fileinput.html) - reads all files inside the directory.

We would like to create a pull request for the following changes.

What do you think?

Evgeny

(crosswise.com)

Arash Rouhani

unread,

Jul 9, 2015, 8:42:49 AM7/9/15

to Evgeny Shulman, luigi...@googlegroups.com

This seems sensible. Just to be sure we're on the same page. You're aware of that you can use LocalTarget even on directories, but you have an issue with that, since it doesn't mix well with AtomicLocalTarget, as it can't do the move-on-close operation for folders.

Have you seen the workaround in case of Hadoop job tasks? http://luigi.readthedocs.org/en/latest/api/luigi.contrib.hadoop_jar.html?highlight=atomic_output#luigi.contrib.hadoop_jar.HadoopJarJobTask.atomic_output

--
You received this message because you are subscribed to the Google Groups "Luigi" group.
To unsubscribe from this group and stop receiving emails from it, send an email to luigi-user+...@googlegroups.com.
For more options, visit https://groups.google.com/d/optout.

Evgeny Shulman

unread,

Jul 9, 2015, 8:48:54 AM7/9/15

to Arash Rouhani, luigi...@googlegroups.com

move_on_close works, the problem is earlier, the file is created on disk the moment we initialize the atomic_file object ( by IOFile('w') ). we can not use it as directory anymore

regarding the workaround: we have similar code, that did the magic for us, but it's task specific and we want to simplify our environment/enhance luigi.

Arash Rouhani

unread,

Jul 9, 2015, 8:54:32 AM7/9/15

to Evgeny Shulman, luigi...@googlegroups.com

Ah I see. So you want to have something like AtomicLocalTarget('/my/folder', is_dir=True)? Makes sense for me to have. :)

Evgeny Shulman

unread,

Jul 9, 2015, 11:13:02 AM7/9/15

to Arash Rouhani, luigi...@googlegroups.com

exactly.

LocalTarget('my/folder', is_dir = True) even better

self.output().open('r') - same behavior as regular file

self.output().open('w') - a little bit different:

with self.output().open('w') as dp:

do something with already created dir at dp.tmp_path

assume that if there is no exceptions, AtomicFile will rename dp.tmp_path to '/my/folder').

if you try to dp.write() - throw exception.

We are going to implement this next week.

Thanks!

Arash Rouhani

unread,

Jul 9, 2015, 12:17:36 PM7/9/15

to Evgeny Shulman, luigi...@googlegroups.com

Cool :)

Alexander Krasnukhin

unread,

Jul 10, 2015, 3:22:24 AM7/10/15

to Arash Rouhani, Evgeny Shulman, luigi...@googlegroups.com

Looks interesting!

--

Alexander

aka Six-Hat-Thinker

shu...@crosswiselabs.com

unread,

Jul 12, 2015, 9:09:16 AM7/12/15

to luigi...@googlegroups.com

On Thursday, July 9, 2015 at 3:05:25 PM UTC+3, Evgeny Shulman wrote:
> Hi
>
> We are facing some limitation by the fact that LocalTarget/S3Target were designed as File representation. There are a lot of cases, when we need task.output() to be the folder.
> Some examples:
> Hadoop local runs ( using pig or any other wrapper) - outputs are <TASK_OUTPUT/part-1.gz,part-2.gz

> Downloading folder from other sources.Local tools that writes multiple files

> Currently atomic_file/AtomicS3File objects works great if user want to write one file. However, if we could use .tmp_path of the return object and use it as it is, that would solve the problem. Currently that can not be done, as at the moment AtomicLocalFile is created,the file on disk is created as well. Thus we can not treat it as Directory.
> Moreover, to make folder usage compatible with current implementation, we would like to change read behaviour of *Target objects as well. if user has marked target as dir, it could be read by reading all files inside it.
>
>
>
> Implementation details:

> add is_dir flag to *Target objects,propagate it to atomic_file on write.atomic file with is_dir , if set, does not create IOFile, but fakes it with NonWritableFile object (raise exception on write). This way we can get object that has .tmp_path. That is the target directory.

> for open('r') mode return file object using fileinput.input (https://docs.python.org/2/library/fileinput.html) - reads all files inside the directory.
>
> We would like to create a pull request for the following changes.
> What do you think?
>
>
>
>
> Evgeny
> (crosswise.com)

Pull Request: https://github.com/spotify/luigi/pull/1066

We didn't implement read functionality, as simple use case is mostly useless. More complicated scenarios like `read hadoop output directory with omitting schema and _SUCCESS files` requires proper design like the Target(format=something) feature

shu...@crosswiselabs.com

unread,

Jul 13, 2015, 11:25:35 AM7/13/15

to luigi...@googlegroups.com

On Thursday, July 9, 2015 at 3:05:25 PM UTC+3, Evgeny Shulman wrote:

> Hi
>
> We are facing some limitation by the fact that LocalTarget/S3Target were designed as File representation. There are a lot of cases, when we need task.output() to be the folder.
> Some examples:
> Hadoop local runs ( using pig or any other wrapper) - outputs are <TASK_OUTPUT/part-1.gz,part-2.gz

> Downloading folder from other sources.Local tools that writes multiple files

> Currently atomic_file/AtomicS3File objects works great if user want to write one file. However, if we could use .tmp_path of the return object and use it as it is, that would solve the problem. Currently that can not be done, as at the moment AtomicLocalFile is created,the file on disk is created as well. Thus we can not treat it as Directory.
> Moreover, to make folder usage compatible with current implementation, we would like to change read behaviour of *Target objects as well. if user has marked target as dir, it could be read by reading all files inside it.
>
>
>
> Implementation details:

> add is_dir flag to *Target objects,propagate it to atomic_file on write.atomic file with is_dir , if set, does not create IOFile, but fakes it with NonWritableFile object (raise exception on write). This way we can get object that has .tmp_path. That is the target directory.

> for open('r') mode return file object using fileinput.input (https://docs.python.org/2/library/fileinput.html) - reads all files inside the directory.
>
> We would like to create a pull request for the following changes.
> What do you think?
>
>
>
>
> Evgeny
> (crosswise.com)

We are taking this implementation one step farther.
Now when we have atomic_file with is_dir support, we would like to implement DirectoryFormat that will support read/write from folders/files

class DirectoryFormat(Format):
input = 'bytes'
output = 'dir'

the pipe reader will wrap the input_pipe if it file with FileWrapper, otherwise it will `cat <input_path>/<prefix>*<suffix>

the pipe writer will create simple file if max_part_size is 0, otherwise it will split the input stream into files using `split ..`

this format is really useful when you work with hadoop outputs locally ( targetdir.gz/part1.gz,part2.gz) or when you are generate some inputs for hadoop process in your local running task ( so you want to have multiple files output)

we have POC that works, the question is if anybody interesting in such implementation. ( I'll generate pull request)

Alexander Krasnukhin

unread,

Jul 14, 2015, 3:47:54 AM7/14/15

to shu...@crosswiselabs.com, luigi...@googlegroups.com

So now LocalTarget could be a directory. So much for a single target concept.

Any reason you don't create a separate LocalDirectory class instead?

--
Alexander
aka Six-Hat-Thinker

Alexander Krasnukhin

unread,

Jul 14, 2015, 4:45:07 AM7/14/15

to Evgeny Shulman, luigi...@googlegroups.com

I know that it was named File (or something similar before) and moved towards more generic Target.

As I understand all mentioned operations were supported implicitly. Now it explicitly supports directories. Personally I believe it overloads target concept without a real need. It could be tricky to read code keeping in mind that it could be either a directory or a file. I'm all for having directory support, just don't sure about having all code in a single class. But sure, this could be only me.

On Tue, Jul 14, 2015 at 10:19 AM, Evgeny Shulman <shu...@crosswise.com> wrote:

I agree that this will be nice to separate.
however, till now we have treated the LocalTarget as "data container" on local disk. It's not called LocalFileTarget. I would say it should be transparent to user if it's directory/file.
Current LocalTarget implementation supports all operations with both files and directories, except LocalTarget.open.

regarding the LocalTarget .open, we are going to pull new format called DirectoryFormat, that gives full support for read/write from/into directory. on the same time it supports working with files as well. By default if you don't use the format and try to open - the implementation will fail. But this is true for all other formats, if you are reading gzip, you have to specify gzip format.

I am going to PR the code in few hours, let discuss the implementation afterthat.