every day (tasks are independent of each other) and I have a cluster with some machines. Luigi doesn't do resource management itself, but we already have YARN installed to run MapReduce jobs (and the Luigi outputs live in locations shared by all the machines in the cluster, so if machine A has finished task T, machine B can see its output). In theory it should be possible to launch a YARN container for every Luigi task (and let YARN figure out where to run it), right? Has anybody tried this already? I am also curious which technologies Spotify itself uses to scale up Luigi.
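For what it's worth, the per-task-container idea could be prototyped by wrapping each task launch in a YARN submission, for example via the distributed-shell example application that ships with YARN. Everything concrete below is an assumption on my part, not a tested recipe: the jar path, the `mytasks` module name, and the resource settings would all need adjusting for a real cluster.

```python
import subprocess

# Path to YARN's distributed-shell example app; an assumption, adjust per cluster.
SHELL_JAR = "/usr/lib/hadoop-yarn/hadoop-yarn-applications-distributedshell.jar"

def yarn_launch_command(task_family, param_args, shell_jar=SHELL_JAR):
    """Build a `yarn jar` invocation that runs one Luigi worker for one task
    inside a YARN container. Because task outputs live on storage shared by
    the whole cluster, it doesn't matter which node YARN picks."""
    luigi_cmd = " ".join(["luigi", "--module", "mytasks", task_family] + param_args)
    return [
        "yarn", "jar", shell_jar,
        "-jar", shell_jar,              # the app master ships the same jar
        "-shell_command", luigi_cmd,    # what to run inside the container
        "-num_containers", "1",
        "-container_memory", "2048",
    ]

def submit(task_family, param_args):
    # Fire-and-forget submission; YARN decides container placement.
    subprocess.check_call(yarn_launch_command(task_family, param_args))
```

The central scheduler would still deduplicate work: if two containers end up running workers for the same task, only one of them actually executes it.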
Thanks
That's a very interesting idea; you could also use something like Mesos (http://mesos.apache.org/) to distribute the workload across a Mesos cluster. I have never used YARN or Mesos and I am not sure how they compare, but I have seen a lot of people adopting Mesos lately.
There is a simpler solution that I have been using with some success. I have multiple boxes, each running a simple daemon that reads from a RabbitMQ queue and starts a Luigi worker when it receives a message. The Luigi worker connects to the Luigi central scheduler, which tells it whether another worker is already executing the task, so work is not done twice.
I hacked the solution together in a couple of days, including a "master" node that is in charge of creating the RabbitMQ messages in the correct format.
I have used it with more than 20 machines and it's been working fine. The amount of messaging is minimal, so I think it should scale reasonably well to more boxes.
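The queue-to-worker bridge described above can be sketched in a few lines. The message format, queue name, and `mytasks` module are my assumptions (the actual code isn't shown in this thread), and the `pika` part naturally requires a running RabbitMQ broker:

```python
import json
import subprocess

def build_worker_command(message):
    """Translate a queue message into a `luigi` CLI invocation.

    Hypothetical message format (the thread doesn't specify one):
    {"task": "MyDailyTask", "params": {"date": "2014-09-02"}}
    """
    payload = json.loads(message)
    cmd = ["luigi", "--module", "mytasks", payload["task"]]
    for name, value in payload["params"].items():
        cmd.append("--%s" % name.replace("_", "-"))
        cmd.append(str(value))
    return cmd

def on_message(ch, method, properties, body):
    # Start a worker; it registers with the central scheduler,
    # which prevents two workers from running the same task twice.
    subprocess.call(build_worker_command(body))
    ch.basic_ack(delivery_tag=method.delivery_tag)

def main():
    import pika  # needs a reachable RabbitMQ broker
    connection = pika.BlockingConnection(pika.ConnectionParameters("localhost"))
    channel = connection.channel()
    channel.queue_declare(queue="luigi_tasks", durable=True)
    channel.basic_consume(queue="luigi_tasks", on_message_callback=on_message)
    channel.start_consuming()

if __name__ == "__main__":
    main()
```

The "master" node then only has to publish one JSON message per task run; everything else (scheduling, deduplication, retries) stays inside Luigi.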
My next plan is to do smarter routing in RabbitMQ so that specific tasks can be executed on specific boxes; right now there is only one type of worker.
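One way to get that routing with a RabbitMQ topic exchange: publish each task with a routing key that encodes the worker type, and have each box bind its queue to the patterns it can serve. The key scheme and exchange name below are purely a suggestion, not anything implemented in this thread:

```python
def routing_key_for(task_family, worker_type="default"):
    """Hypothetical key scheme: 'tasks.<worker_type>.<TaskFamily>'."""
    return "tasks.%s.%s" % (worker_type, task_family)

def binding_pattern(worker_type):
    """Pattern a worker box binds with; '#' matches any task family."""
    return "tasks.%s.#" % worker_type

def bind_worker_queue(channel, queue_name, worker_type):
    """Wire a worker's queue to a topic exchange (a pika channel is assumed)."""
    channel.exchange_declare(exchange="luigi_tasks", exchange_type="topic")
    channel.queue_declare(queue=queue_name, durable=True)
    channel.queue_bind(exchange="luigi_tasks", queue=queue_name,
                       routing_key=binding_pattern(worker_type))
```

A box with lots of RAM would bind with `binding_pattern("highmem")`, and the master would publish a memory-hungry task with `routing_key_for("BigJoinTask", "highmem")`; general-purpose boxes keep binding to `tasks.default.#`.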
I would like to see more ideas on how people have been "scaling" Luigi to multiple machines. I noticed that Spotify has another open source project called scalegrease (https://github.com/spotify/scalegrease), but the documentation is very minimal, so I don't know how it works.
--
You received this message because you are subscribed to the Google Groups "Luigi" group.
On Tue, Sep 2, 2014 at 7:13 PM, Daniel Rodriguez <df.rodr...@gmail.com> wrote:
That idea of Luigi "slaves" is exactly what I implemented using RabbitMQ. I did think of hacking the Luigi worker code, but I didn't much like the idea of polling the scheduler for tasks, and I already had RabbitMQ running, so it was natural for me to use it. Finally, with RabbitMQ it's possible to do some complex routing, like having specific workers for specific tasks. I haven't implemented that part yet, though, mainly because I am not sure what the syntax should be.

I do assume that the Luigi tasks are importable on the slave nodes; I use Salt for config management. I have one git repo with only Luigi tasks, so when the developers push code to that repo, Jenkins triggers a Salt state and the new code becomes available on all the slave boxes.
If you have code (even if it's ugly), I would love to take a look at this.
It would be nice for Luigi to have a built-in distributor of tasks, or something like that; it would make it compete more directly with other frameworks like Chronos. I have to admit that I have trouble convincing people to use Luigi. They always ask, "How do I do parallel tasks?", and when I say there is nothing built in, it's almost a lost battle from that point.
When you say "parallel", you mean parallel as in horizontal scaling, right? That is, being able to run things on multiple machines. There are clearly ways of doing so (see Dan Frank's answer), but they are not super beautiful imho.
My rationale is something like: I am willing to write the code for scaling Luigi as long as I can maintain the clear syntax for tasks, targets, parameters, and so on. Most people disagree with that position, I am afraid.
Which is understandable :) Luigi was always built with clarity and simplicity as the main features, over things like scalability. I think it's easier to go in the direction of adding scalability once you nail simplicity, but not the other way around. Again, I'd love to take a look at your code if you feel comfortable sharing it.
Would you consider this a main priority? How often is this type of question asked? What I mean is that it's easy to write Luigi tasks for Hadoop or Spark and let those tools do the heavy lifting of "scaling" a specific task.
Luigi isn't necessarily driven by "priorities", since it's not the day job of any of the authors to add new features to Luigi. So Luigi has had a tendency to grow very organically and incrementally. This is great for stability, but I don't think the current model is great for big features. I hope to have more time in the future to put together scaffolds for more complicated features like scalability.

What do you mean with Hadoop/Spark? Those are examples where you get scalability because it's handled by something else. But we're at the point where some people (including Spotify) run 10k-100k tasks per day, and we also need to scale the launching of those tasks :)