Luigi workflow questions

365 views
Skip to first unread message

Y_G

unread,
Jun 16, 2017, 5:40:22 PM6/16/17
to Luigi
Hi Guys,

I have started playing around with Luigi and have few basic questions. Will really appreciate if someone can help me out here.

1. Can we have a graph as input ( A --> B --> C --> A --> B .... and the loop between A and C stops on a condition or the loops runs for lets say 10 iterations)? If Yes, is there a hack to do it because simply following examples I wasn't able to design this. 

2. Let's say I have a DAG with 100 tasks (nodes). To be precise it looks like this - A1 --> A2 -->....A100 and I have 10 (M1, ... ,M10) machines to run this. My understanding from the documents is: On each machine M1,M2...M10 I will have trigger the DAG. There will be one central scheduler running on a machine say S. Now the central scheduler will decide which task to run on which machine and will tell the machines M1,...M10 to execute those tasks. Please correct me if I my understanding is right here. If it is - then this is what I am trying to understand:
(i) scheduler will get 10 DAGs from 10 machines (M1, ... , M10) and will analyze and figure out that oh they are same and I should not give the same task to any 2 machines. What is the motivation behind this design?
(ii) What happens if I run the same DAG for 10 different jobs. By 10 different jobs, I mean that output of each step will be in that job related folder (job1/, job2/, etc.)?

3. Conditional Branching (A --> B (on some condition in B) either trigger C OR trigger D. Can we do this in Luigi? 

Thanks a lot!

Dave Buchfuhrer

unread,
Jun 17, 2017, 4:46:06 AM6/17/17
to Y_G, Luigi
I don't understand questions 1 and 3, but I suspect my answer to question 2 will help with them. You shouldn't think in terms of DAGs, but rather in terms of tasks. If you're trying to run a task A1 on a node, it will send that task to the scheduler plus schedule the things it depends on as necessary. If another machine schedules the same task A1, the scheduler will see that it's the same task and will update it if there is any new information but otherwise not change anything. The same thing will happen when you schedule A2 and the rest of the dependencies.

When a worker asks for work from the scheduler, the scheduler will send it a task that is pending and has all of its dependencies done. This is because we only want to run a task once its input is ready. We don't have multiple machines run the same task at the same time because we expect that tasks only need to be done once. If you need to do the same task on every machine, you either need to add a parameter to indicate which machine the task is for or run separate schedulers for each machine.

I don't understand your question 2.ii either. If you still have that question, could you try to explain in more detail what you mean?

--
You received this message because you are subscribed to the Google Groups "Luigi" group.
To unsubscribe from this group and stop receiving emails from it, send an email to luigi-user+unsubscribe@googlegroups.com.
For more options, visit https://groups.google.com/d/optout.

Y_G

unread,
Jun 17, 2017, 11:31:41 AM6/17/17
to Luigi, yashg...@gmail.com
Thank you so much Dave for getting back to me and your explanation for question 2 helps me understand Luigi a little more. I will try to explain my other questions with a scenario to be more concrete and sorry for pounding you with questions but I really want to understand this. 

Let's say we have 2 Worker machines (M1 and M2) and we have a DAG like this:

Scenario 1:
A --> (B, C) --> D (Diamond kind of a DAG). Output of A is feed into B and C (B and C can run in parallel) and then output from B and C is used to execute D. 

1. If I need to run this DAG - I will have to trigger task D (because that will take care of executing dependencies) from both the workers M1 and M2. Is this understanding correct that both the workers will submit the same DAG to scheduler?
2. Scheduler will know that B and C can run in parallel and will let M1 and M2 run this in parallel. (o/p will be on HDFS, etc.) This seems conceptually correct but i am just clarifying :). 

I understand that the tasks need to be run once. What I am trying to understand is the motivation behind the +design of triggering the same DAG from all worker nodes and scheduler deciding what to run where Vs. + the design where you submit DAG from one m/c and let scheduler know that you have n worker nodes to execute this task. Scheduler based on dependencies decide what to run where? Hope this helps. 

Scenario 2:
1) This is related to my questions 1 and 3. Generally machine learning tasks have some kind of iteration or loops (for example active learning converging to a random forest model in 30 iterations). So, if I have a DAG like this: A --> B --> C --> A --> B ---> so on.. A is the starting point and A has dependencies on C whereas C has dependencies on B and B has on A (circular fashion). My starting point is A but since it depends on C, I am not sure how Luigi will handle this or how to encode circular dependencies is something I am not clear.

2) Many times based on task execution / output you want to decide what next to execute. Let's say we have a DAG (for examples sake i have very easy tasks) like this: 
A(add 2 numbers) --> (if sum greater than 50) B --> C --> F
                              --> (if sum <= 50)                D --> E --> F
If addition of 2 numbers is >50 I want to execute B and C otherwise I want to execute D and E and finally i want to execute F. In Luigi, if I trigger this task F, how and where in code can I tell Luigi that don't execute B, C if sum <=50? Basically, at A I am doing conditional branching. 

Hope this helps. Thanks a lot again and your inputs will be really appreciated. 
To unsubscribe from this group and stop receiving emails from it, send an email to luigi-user+...@googlegroups.com.

Dave Buchfuhrer

unread,
Jun 17, 2017, 1:16:11 PM6/17/17
to Y_G, Luigi
Ok I think I understand your issues now. I actually run Luigi on a setup where I have one worker that schedules the task and several machines that run tasks without scheduling them. It's your choice whether you want to do that or have every machine schedule tasks. Just run a worker with --assistant if you want it to run tasks it didn't schedule. For technical reasons, you still have to schedule a task, but you can run it like luigi Task --assistant and it'll run the dummy task and then ask for and run jobs from the scheduler without having to schedule them.

If you want to run a machine learning loop, that should be a single task in Luigi. Tasks in Luigi should be large. The dependency graph exists to help you run the the tasks you need in the right order, not to be part of your algorithm. Your machine learning task should just depend on the initial input and then run the learning loop until you reach the terminating condition.

To unsubscribe from this group and stop receiving emails from it, send an email to luigi-user+unsubscribe@googlegroups.com.
Message has been deleted

Gaurav Gupta

unread,
Nov 7, 2017, 6:41:37 AM11/7/17
to Luigi
Hi, did you figure out how to implement the conditional branching usecase in luigi?
Reply all
Reply to author
Forward
0 new messages