Hi, thank you for your reply.
> Yeah for those you'd typically want to write a BashOperator,
This approach always confuses me: it looks like you are writing a command-line text generator in Python. Unfortunately, it's a widespread practice. I don't believe in bash; it's hard to test and maintain.
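Something like this is what I mean (the jar name, repository URL, and main class are made up for the example); the Python really just assembles a shell string:

```python
from datetime import datetime

from airflow import DAG
from airflow.operators.bash_operator import BashOperator

APP_VERSION = "1.2.3"                                # assumed, would come from some config
JAR = "/tmp/artifacts/my-app-%s.jar" % APP_VERSION   # local path the worker must provide

dag = DAG("run_my_jar", start_date=datetime(2015, 1, 1), schedule_interval="@daily")

# The whole task boils down to concatenating a shell command string.
run_jar = BashOperator(
    task_id="run_jar",
    bash_command=(
        "wget -q http://artifactory.example.com/my-app-%s.jar -O %s && "
        "hadoop jar %s com.example.Main --date {{ ds }}"
    ) % (APP_VERSION, JAR, JAR),
    dag=dag,
)
```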
>That knows where to download the right version of a jar and fire it up.
I need some persistent storage for the downloaded artifacts. Why do I need it if I already have HDFS? Oozie uses only HDFS; really, Oozie doesn't need to store anything on the local FS except server logs.
>- On our side the Airflow chef recipe inherits some of our Cloudera recipe,
One more thing to learn, support, and maintain. Sure, you should have Puppet/Chef/Ansible to manage clusters anyway, but is it a good thing to couple it with the workflow management tool?
>Not sure what you mean by "multiple cluster config problems"
Each Airflow executor needs the Hadoop conf sitting next to it. We could have several clusters, and Airflow would need the conf for each of them, and I would have to keep those confs up to date. I don't want to bring Airflow onto the cluster; I want to run Airflow on dedicated machines/Docker containers/whatever.
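To make that concrete (cluster name and paths are hypothetical): any task that talks to a given cluster depends on a conf directory that has to exist, and stay current, on every worker that might execute it:

```python
from datetime import datetime

from airflow import DAG
from airflow.operators.bash_operator import BashOperator

dag = DAG("cross_cluster_example", start_date=datetime(2015, 1, 1))

# Hypothetical: /etc/hadoop/conf.cluster-a has to be present and up to date
# on every machine that might pick this task up.
count_rows = BashOperator(
    task_id="count_rows_on_cluster_a",
    bash_command=(
        "HADOOP_CONF_DIR=/etc/hadoop/conf.cluster-a "
        "hdfs dfs -count /data/events"
    ),
    dag=dag,
)
```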
>Soon we'll have a YarnExecutor so that it plays nicely with Hadoop.
That would be great; I don't believe the Hadoop ecosystem will decouple itself from YARN any time soon. It's good to get rid of Celery, RabbitMQ, etc., since they are more moving parts to maintain, upgrade, monitor, and alert on.
>You'll need to sync a filesystem on many machines.
>cron abstraction pulling from a central git repo, ...
What if cron fails on some machine? We get inconsistent code across the workers. One more thing to maintain, upgrade, monitor, and alert on.
>User commit DAG to repository,
We have an in-house tool right now that uses the same approach, and it doesn't work in the long run. The reviewer and the contributor can have a 12-hour time difference, so the committer may wait a day until the change is merged; the feedback loop becomes enormously long. The other thing: I don't want to be the Python code QA. It's the user's responsibility to write the Python code; my responsibility is to provide the platform for running it.
I have a different deployment scheme in mind.
Airflow itself is installed as a pip package.
Our in-house extensions are installed on top of it, also via pip.
Airflow provides bootstrap functionality for the user: the user runs a built-in workflow with a single operator that installs/updates the user-provided pip module. Once that module is installed, the user can run their own workflows.
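Roughly something like this (the package name and index URL are invented for the example):

```python
from datetime import datetime

from airflow import DAG
from airflow.operators.bash_operator import BashOperator

USER_PACKAGE = "team_analytics_dags"   # hypothetical user-provided pip module

dag = DAG(
    "bootstrap_user_package",
    start_date=datetime(2015, 1, 1),
    schedule_interval=None,            # triggered by the user, not scheduled
)

# Single operator that installs or upgrades the user's module
# from our internal index (URL is made up).
install = BashOperator(
    task_id="install_or_update",
    bash_command=(
        "pip install --upgrade "
        "--index-url http://pypi.example.com/simple " + USER_PACKAGE
    ),
    dag=dag,
)
```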
We are all completely independent. But the synchronization problem still exists...
What if some of the executors are down/unresponsive/etc.?