Example of multiple outputs


timsm...@hotmail.com

Feb 14, 2014, 12:37:09 AM
to luigi...@googlegroups.com
Would anyone happen to have a simple example of a Task that has multiple outputs? I have a situation where I am polling for files every 15 minutes for each tenant (client), and in some cases there may be more than one file for a given time and tenant (the log rolls more than once between polls). So I have a task that accepts a tenant id and a date but may encounter more than one file. These files would be passed as parameters to a child task that explicitly retrieves each file. Thank you.

ro...@edx.org

Feb 14, 2014, 12:07:21 PM
to luigi...@googlegroups.com
I have a similar question. Basically, we have a large number of large files that contain some type of events. We would like to parse them and sort them into multiple files, one per day. We can either generate multiple tasks that each create one file per day, or make one task that creates all the different days. The second alternative seems to use Hadoop better, by creating one large map-reduce job instead of many tiny ones as in the first case.
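For reference, the second alternative amounts to one pass that partitions events into per-day output files. Here is a minimal sketch in plain Python (function and file names are invented for illustration; in Luigi the per-day paths would come from output() given a date-range parameter):

```python
import os
import tempfile
from collections import defaultdict

def partition_by_day(events, outdir):
    """events: iterable of (day, line) pairs; writes one file per day
    and returns a dict mapping each day to its output path."""
    buckets = defaultdict(list)
    for day, line in events:
        buckets[day].append(line)
    paths = {}
    for day, lines in buckets.items():
        path = os.path.join(outdir, "events-%s.log" % day)
        with open(path, "w") as f:
            f.write("\n".join(lines) + "\n")
        paths[day] = path
    return paths

outdir = tempfile.mkdtemp()
paths = partition_by_day(
    [("2014-02-13", "a"), ("2014-02-14", "b"), ("2014-02-13", "c")],
    outdir)
assert sorted(paths) == ["2014-02-13", "2014-02-14"]
```

The key property for Luigi is that if the date range is a task parameter, the full set of per-day targets is known before the job runs, so a single multi-output task stays schedulable.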

Thoughts? What is the recommended best practice?

Thanks,

--
Carlos

Erik Bernhardsson

Feb 14, 2014, 1:02:36 PM
to timsm...@hotmail.com, luigi...@googlegroups.com
We have support for multiple outputs, but the set of outputs is determined at scheduling time rather than at runtime.

I think what you describe would be supported by this PR: https://github.com/spotify/luigi/pull/255, but it probably won't be merged in the next few months because it's a pretty large change.
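"Determined at scheduling time" means output() must be computable from the task's parameters alone. A minimal sketch of that model, using stand-in classes rather than luigi's real API (the class and path names here are invented):

```python
import os
import tempfile

class LocalTarget:
    """Stand-in for a luigi Target: a path plus an existence check."""
    def __init__(self, path):
        self.path = path

    def exists(self):
        return os.path.exists(self.path)

class HourlyLogs:
    """Multiple outputs, all derivable from parameters before run()."""
    def __init__(self, client, date, workdir):
        self.client, self.date, self.workdir = client, date, workdir

    def output(self):
        # One target per hour, computed from parameters only, so the
        # scheduler can evaluate this without ever calling run().
        return [LocalTarget(os.path.join(
                    self.workdir,
                    "%s-%s-%02d.log" % (self.client, self.date, h)))
                for h in range(24)]

    def complete(self):
        # This mirrors how Luigi decides whether a task already ran:
        # the task is done when every output target exists.
        return all(t.exists() for t in self.output())

    def run(self):
        for t in self.output():
            with open(t.path, "w") as f:
                f.write("data\n")

workdir = tempfile.mkdtemp()
task = HourlyLogs("acme", "2014-02-14", workdir)
assert not task.complete()  # scheduler would decide to run it
task.run()
assert task.complete()      # all 24 targets now exist
```

The original question breaks this model because the file list is only discovered inside run(), which the scheduler never calls when checking completeness.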







--
Erik Bernhardsson
Engineering Manager, Spotify, New York

timsm...@hotmail.com

Feb 14, 2014, 1:26:06 PM
to luigi...@googlegroups.com, timsm...@hotmail.com
OK, maybe I have the relationship between requires() and output() backwards. If I want to pull log files for a specific client and date, and there may be more than one file matching those criteria, can I not build a list of these files and return it from the task's output()? Something along the lines of this:

class ClientFileList(luigi.Task):
    client = luigi.Parameter()
    date = luigi.DateParameter()

    def output(self):
        # Note: myfiles is only assigned in run(), but Luigi calls
        # output() during scheduling, before run() ever executes.
        outputs = set()
        for myfile in myfiles:
            outputs.add(luigi.hdfs.HdfsTarget(myfile))
        return outputs

    def run(self):
        curl = exec_sh('curl -s -l <host_info_missing> /logs/{client}/{date}/'.format(
            client=self.client,
            date=self.date))
        myfiles = set(curl.splitlines())

Erik Bernhardsson

Feb 16, 2014, 10:16:52 AM
to timsm...@hotmail.com, luigi...@googlegroups.com
So, as I replied in the other thread: the problem is that output() is part of the scheduling rather than the execution. This is because Luigi needs to check whether the task has already run, and the way this works is that it checks whether all the output() targets exist.

So output() is actually run before run().

This is kind of a requirement for being able to run tasks across multiple processes, since otherwise there would have to be communication back to the server about which targets were produced at runtime. This is doable, but would require a lot of refactoring.

Hope this makes sense.
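A common way around this constraint is to give the task a single deterministic output, a manifest file listing whatever files were discovered at runtime, so downstream tasks read the manifest to find the real files. The sketch below uses invented names and plain Python stand-ins rather than luigi's API; in Luigi, output() would return one Target for the manifest path:

```python
import os
import tempfile

class ClientManifest:
    """One deterministic output (a manifest) that records the
    runtime-discovered file list."""
    def __init__(self, client, date, workdir):
        self.client, self.date, self.workdir = client, date, workdir

    def output_path(self):
        # Known at scheduling time: depends only on parameters.
        return os.path.join(self.workdir,
                            "%s-%s.manifest" % (self.client, self.date))

    def complete(self):
        return os.path.exists(self.output_path())

    def run(self):
        # Pretend this listing came back from the remote server at
        # runtime (in the original question, via curl).
        discovered = ["/logs/%s/%s/part-%d.log" % (self.client, self.date, i)
                      for i in range(3)]
        with open(self.output_path(), "w") as f:
            f.write("\n".join(discovered))

workdir = tempfile.mkdtemp()
task = ClientManifest("acme", "2014-02-14", workdir)
task.run()
with open(task.output_path()) as f:
    files = f.read().splitlines()
assert task.complete()
assert len(files) == 3
```

Because the manifest path is fixed by the parameters, completeness checks stay cheap and deterministic, while the actual (variable) set of fetched files lives inside the manifest.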