luigi usage and web interface


utkar...@gmail.com

Jul 2, 2013, 8:20:42 PM
to luigi...@googlegroups.com
Hello,

I am evaluating azkaban for workflow management.

Some background:
 - We have a bunch of pig scripts which are triggered via cron. Some hourly, some weekly etc.
 - I don't have a job dependency problem yet, but azkaban and luigi both offer dependency management anyway.
 - I need regular retries and manual retries via a web interface. For example, "my_pig_job -date=20130101" failed on the 1st, so I want to rerun this job on the 2nd, but the argument should still be "20130101". Doing this in azkaban is tricky because a manual rerun spawns a new job instance and does not pick up the old arguments passed to it.
 - Job management via a web interface: Add, remove and update tasks via a web interface.

I looked at Airbnb's chronos, but it was way too complicated for a simple use case (added complexity of managing mesos) and looked buggy.

So, does luigi intend to be an azkaban or chronos replacement? How do you guys use it at 4sq?

Thanks,
-Utkarsh

utkar...@gmail.com

Jul 2, 2013, 8:51:21 PM
to luigi...@googlegroups.com
Also, I cannot find any documentation about how to run remote commands.

For example, I have multiple machines:
machine1. Triggers pig jobs.
machine2. Non-hadoop-related data processing.
machine3. Command which triggers a SQL query and rsyncs the data to another machine.

So, can I run commands on multiple machines and manage them from one place?

Thanks,
-Utkarsh

Erik Bernhardsson

Jul 2, 2013, 10:25:42 PM
to utkar...@gmail.com, luigi...@googlegroups.com
Hi,

I think for the case when you start to have more and more Pig scripts triggered from cron, Luigi is a great fit. 

Doing retries is pretty easy: just retry until it succeeds. You still need to cron it up, though. But if you have even pretty simple dependencies, Luigi helps by making sure things are run in order, as well as making sure all file system operations are atomic, etc.
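
To illustrate what "atomic" means here, a minimal sketch in plain Python (this is not Luigi's own code, and `atomic_write` is a hypothetical helper name): write to a temporary file in the same directory, then rename it into place, so a retried or concurrent job never sees a half-written output.

```python
import os
import tempfile


def atomic_write(path, data):
    """Write data to path so that readers never see a partial file."""
    directory = os.path.dirname(os.path.abspath(path))
    # Create the temp file next to the target so the final rename
    # stays on one filesystem.
    fd, tmp_path = tempfile.mkstemp(dir=directory, suffix=".tmp")
    try:
        with os.fdopen(fd, "w") as handle:
            handle.write(data)
        # Swap the finished file into place in a single step.
        os.replace(tmp_path, path)
    except BaseException:
        # Clean up the temp file if anything failed before the rename.
        if os.path.exists(tmp_path):
            os.remove(tmp_path)
        raise
```

If the job dies midway, the target path is simply absent, so the next cron run can safely try again.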

We don't have any capability to trigger jobs from the web service yet, though.


--
You received this message because you are subscribed to the Google Groups "Luigi" group.
To unsubscribe from this group and stop receiving emails from it, send an email to luigi-user+...@googlegroups.com.
For more options, visit https://groups.google.com/groups/opt_out.
 
 



--
Erik Bernhardsson
Engineering Manager, Spotify, New York

Erik Bernhardsson

Jul 2, 2013, 10:27:29 PM
to utkar...@gmail.com, luigi...@googlegroups.com
Running remote commands is something that Luigi leaves up to the developer to implement, but what I would do is just trigger the commands over ssh on the other machine.

Alternatively, you could trigger two Luigi workflows separately, one on each machine, where the second step kicks in once the data is transferred from the first one.
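
A rough sketch of the ssh option in plain Python, assuming key-based login is already set up. The helper names `ssh_argv` and `run_remote` and the host name `machine1` are made up for illustration; none of this is part of Luigi.

```python
import shlex
import subprocess


def ssh_argv(host, command):
    """Build the argv list for running `command` on `host` over ssh."""
    # BatchMode=yes makes ssh fail instead of prompting for a password,
    # which suits unattended cron jobs.
    return ["ssh", "-o", "BatchMode=yes", host, command]


def run_remote(host, command):
    """Run the command remotely; raises CalledProcessError on a non-zero exit."""
    return subprocess.run(ssh_argv(host, command), check=True)


# Hypothetical host; the job string is the one from the original question.
argv = ssh_argv("machine1", "my_pig_job -date=20130101")
print(" ".join(shlex.quote(part) for part in argv))
```

Calling `run_remote` from inside a task's run step would let a single workflow on one machine drive commands on the others, and a failed remote command surfaces as an exception the workflow can retry.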



Joe Crobak

Jul 3, 2013, 10:08:04 AM
to Erik Bernhardsson, utkar...@gmail.com, luigi...@googlegroups.com
I echo everything that Erik has said, but also wanted to elaborate on a few things:

> - Job management via a web interface: Add, remove and update tasks via a web interface.

I've worked with job submission via a web UI before, and it can become troublesome quickly. Some issues: 1) there's no good way to 'version' tasks or even figure out what version is running. 2) if you need to make a mass change, e.g. to add support for a new log field that affects 10 different jobs, then you need to resubmit 10 jobs. In contrast, luigi is much more heroku-like: your workflow definition is code, and you push your code out to all the luigi workers. Pushing the entire codebase to all workers solves both the versioning and the mass-update issues.

> So, does luigi intend to be an azkaban or chronos replacement?

Definitely. I haven't used Chronos, but I saw a presentation on it. I also used Azkaban 1.0 in the past. Luigi solves the same problems, and in my experience it does so in a simpler and more scalable way (although some of the scalability issues have been addressed in azkaban 2.0).

In addition, luigi has a really important concept that other workflow systems do not have: idempotency. Each task defines its output, and Luigi is smart enough to only schedule tasks if their output is not yet available. This means that reruns are easy -- just walk the entire dependency graph and run the tasks that are missing output. In my experience with Oozie and Azkaban, reruns are much more complicated.
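
To make the rerun idea concrete, here is a toy sketch in plain Python. It is deliberately not Luigi's API: `FileTask` and `build` are hypothetical names, and the task body just touches a file. The date is baked into each output name, so rerunning the 20130101 job on a later day is simply a matter of asking for that output again.

```python
import os
import tempfile


class FileTask:
    """Toy task: complete if and only if its output file exists."""

    def __init__(self, path, deps=()):
        self.path = path
        self.deps = list(deps)

    def complete(self):
        return os.path.exists(self.path)

    def run(self):
        with open(self.path, "w") as handle:
            handle.write("done\n")


def build(task, ran):
    """Depth-first walk: run dependencies first, skip anything complete."""
    if task.complete():
        return
    for dep in task.deps:
        build(dep, ran)
    task.run()
    ran.append(os.path.basename(task.path))


workdir = tempfile.mkdtemp()
extract = FileTask(os.path.join(workdir, "extract_20130101"))
report = FileTask(os.path.join(workdir, "report_20130101"), deps=[extract])

first, second = [], []
build(report, first)   # both outputs are missing, so both tasks run
build(report, second)  # outputs now exist, so nothing runs
print(first, second)
```

If only the report output is deleted, a third `build(report, ...)` call reruns just the report and leaves the extract step alone.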

> How do you guys use it at 4sq?

Happy to answer this (Erik can speak about spotify's usage, too) -- we have moved from Oozie to Luigi and are running ~100 scala mapreduce jobs and hive queries through Luigi. We have ~15 luigi workers in total (planning to scale this back), and we run continuous integration via hudson, with builds continuously deployed to all of the luigi workers.


One last thing -- I heard a rumor that the spotify folks had some pig stuff internally. Would love to see that contributed back to the github project -- although it also shouldn't be too difficult to write a base pig task from scratch.

Erik Bernhardsson

Jul 3, 2013, 10:11:48 AM
to Joe Crobak, utkar...@gmail.com, luigi...@googlegroups.com

> One last thing -- I heard a rumor that the spotify folks had some pig stuff internally. Would love to see that contributed back to the github project -- although it also shouldn't be too difficult to write a base pig task from scratch.

 
We have some experimental code around. Maybe we should just commit it so at least there's something to use. It's not super smart though. Something I really want to do is to have a @udf decorator that automatically pulls out code blocks and compiles them using Jython.

Ken Zheng

Jan 2, 2014, 3:33:27 PM
to luigi...@googlegroups.com, Joe Crobak, utkar...@gmail.com
Hi Erik,

I am looking for a workflow framework to manage our tasks running on Hadoop. Luigi looks great.
Luigi supports Hive. Does it have built-in tasks to support Pig / Scalding / Cascalog? If not, what's the best workaround to execute these types of tasks?

Thank you

Ken Zheng

Erik Bernhardsson

Jan 2, 2014, 3:36:43 PM
to Ken Zheng, luigi...@googlegroups.com, Joe Crobak, utkar...@gmail.com
There's built-in support for Hive and Scalding, but not Pig and Cascalog. Should be easy to add by just subclassing luigi.Task. If you do that, feel free to submit a pull request and we can merge it back into Luigi!



Ken Zheng

Jan 3, 2014, 10:48:37 AM
to luigi...@googlegroups.com, Ken Zheng, Joe Crobak, utkar...@gmail.com
Erik,

Thank you.

I would love to make some contributions down the road. By the way, I didn't find any detailed documentation or tutorial other than README.md. I assume I will have to read through the code to understand the full functionality. Please let me know if I missed any resources.

Best.

Ken

Erik Bernhardsson

Jan 6, 2014, 4:12:07 AM
to Ken Zheng, luigi...@googlegroups.com, Joe Crobak, utkar...@gmail.com
Unfortunately the README.md is pretty much the sole documentation. Feel free to contribute to it, though!