Thank you so much, Dave, for getting back to me. Your explanation for question 2 helps me understand Luigi a little more. I will try to explain my other questions with scenarios to make them more concrete. Sorry for pounding you with questions, but I really want to understand this.
Let's say we have 2 Worker machines (M1 and M2) and we have a DAG like this:
Scenario 1:
A --> (B, C) --> D (a diamond-shaped DAG). The output of A is fed into B and C (B and C can run in parallel), and then the output of B and C is used to execute D.
1. If I need to run this DAG, I will have to trigger task D (because that will take care of executing the dependencies) from both workers, M1 and M2. Is my understanding correct that both workers will submit the same DAG to the scheduler?
2. The scheduler will know that B and C can run in parallel and will let M1 and M2 run them in parallel (output will be on HDFS, etc.). This seems conceptually correct, but I am just clarifying :).
I understand that each task needs to run only once. What I am trying to understand is the motivation behind the design of triggering the same DAG from all worker nodes and letting the scheduler decide what to run where, versus a design where you submit the DAG from one machine, tell the scheduler that you have n worker nodes available, and let the scheduler decide, based on the dependencies, what to run where. Hope this helps.
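To make Scenario 1 concrete, here is a minimal sketch of the behaviour I am picturing. It uses Python's standard `graphlib` module (Python 3.9+), not Luigi, and the `parallel_batches` helper is a name I made up for illustration. I am not claiming this is how the Luigi scheduler works internally; it only shows which tasks I expect to become runnable at the same time.

```python
from graphlib import TopologicalSorter  # standard library, Python 3.9+

# Scenario 1 diamond: each key maps to the set of tasks it depends on.
diamond = {
    "A": set(),
    "B": {"A"},
    "C": {"A"},
    "D": {"B", "C"},
}

def parallel_batches(graph):
    """Hypothetical helper: group tasks into batches whose dependencies
    are all satisfied, so each batch could be spread across workers."""
    sorter = TopologicalSorter(graph)
    sorter.prepare()
    batches = []
    while sorter.is_active():
        ready = sorted(sorter.get_ready())  # sorted only for stable output
        batches.append(ready)
        sorter.done(*ready)                 # mark the whole batch finished
    return batches

batches = parallel_batches(diamond)
print(batches)  # [['A'], ['B', 'C'], ['D']]
```

In this picture B and C land in the same batch, which is the point where I would expect M1 and M2 to each pick up one of them.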
Scenario 2:
1) This is related to my questions 1 and 3. Machine learning tasks generally involve some kind of iteration or loop (for example, active learning converging to a random forest model in 30 iterations). So suppose I have a DAG like this: A --> B --> C --> A --> B --> and so on. A is the starting point, but A depends on C, C depends on B, and B depends on A (in a circular fashion). Since my starting point A itself depends on C, I am not sure how Luigi will handle this, and it is not clear to me how to encode circular dependencies.
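To show what I mean by the circular case, here is a small sketch, again in plain Python with the standard `graphlib` module rather than Luigi. The helper names `is_acyclic` and `unrolled` are made up for illustration. The first graph is the loop as I described it; the second is my own guess at a rewrite, with one copy of A, B and C per iteration. I do not know whether that is the intended approach in Luigi.

```python
from graphlib import TopologicalSorter, CycleError  # Python 3.9+

# The loop as described: A depends on C, C depends on B, B depends on A.
cyclic = {"A": {"C"}, "C": {"B"}, "B": {"A"}}

def is_acyclic(graph):
    """Return True if the graph can be ordered, False if it has a cycle."""
    try:
        list(TopologicalSorter(graph).static_order())
        return True
    except CycleError:
        return False

def unrolled(iterations):
    """Hypothetical rewrite: A0 -> B0 -> C0 -> A1 -> B1 -> C1 -> ...
    Each node depends only on the node produced just before it."""
    graph = {}
    previous = None
    for i in range(iterations):
        for name in ("A", "B", "C"):
            node = f"{name}{i}"
            graph[node] = {previous} if previous else set()
            previous = node
    return graph

print(is_acyclic(cyclic))        # False
print(is_acyclic(unrolled(30)))  # True
```

With 30 iterations the unrolled graph has 90 distinct nodes and no cycle, but it only works if the number of iterations is known up front, which is not always true for a converging loop.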
2) Many times you want to decide what to execute next based on a task's execution or output. Let's say we have a DAG like this (for example's sake I have chosen very easy tasks):
A (add 2 numbers) --> (if sum > 50)  B --> C --> F
                  --> (if sum <= 50) D --> E --> F
If the sum of the 2 numbers is > 50, I want to execute B and C; otherwise I want to execute D and E. In both cases I finally want to execute F. In Luigi, if I trigger task F, how and where in the code can I tell Luigi not to execute B and C if the sum is <= 50? Basically, at A I am doing conditional branching.
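The behaviour I am after in question 2) can be written down in a few lines of plain Python. This is not Luigi code, and `plan` is a hypothetical function name used only to pin down the expected result; where this decision would live in a Luigi task is exactly what I am asking.

```python
def plan(x, y, threshold=50):
    """Return the task names I would expect to run, in order,
    for the two input numbers x and y."""
    total = x + y  # task A: add the two numbers
    # Branch on A's output: B, C when the sum exceeds the threshold,
    # otherwise D, E. F runs last in both cases.
    branch = ["B", "C"] if total > threshold else ["D", "E"]
    return ["A", *branch, "F"]

print(plan(40, 20))  # ['A', 'B', 'C', 'F']
print(plan(10, 20))  # ['A', 'D', 'E', 'F']
```

Note that the branch can only be chosen after A has actually produced its output, so the set of tasks F depends on is not known when F is first triggered.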
Hope this helps. Thanks a lot again; your input will be really appreciated.