Airflow concepts, dag_id, task_id


Serega Sheypak

Feb 8, 2016, 8:44:58 AM
to Airflow
Hi, I'm a little confused by the dag_id and task_id concepts.
Documentation says:
"We’ll need a DAG object to nest our tasks into. Here we pass a string that defines the dag_id, which serves as a unique identifier for your DAG"

Does it mean that I have to manually assign unique id values to my DAGs and tasks?

I've used Oozie for quite a long time. Let's talk about Oozie workflows and omit coordinators. An Oozie workflow can be compared to an Airflow DAG. You write your workflow in XML and then submit it to the Oozie server.
When Oozie accepts the workflow definition and instantiates the workflow, it assigns a unique id to the workflow instance. You can easily run the same workflow in parallel and get a workflow instance's status using the assigned id. Each workflow submission ends up with a unique tracking id.

Am I right that Airflow uses a different concept? I have to manually assign DAG/task ids, and I can't run the same DAG in parallel. Do I have to dynamically generate a unique dag_id if I want to run the same DAG in parallel?

Maxime Beauchemin

Feb 8, 2016, 11:32:15 AM
to Airflow
Welcome Serega, glad you're considering Airflow. Once you get a clear understanding of Airflow and some experience using it, I'd love to hear your thoughts comparing both frameworks.

The `dag_id` is a unique identifier for your DAG, and the `task_id` is a unique name (within the DAG's namespace) for a task. It's nice if these names are meaningful to you and briefly describe your workflow and individual tasks. When browsing the UI, you'll recognize your dag_id and task_id that were assigned in your code.

In Airflow, tasks get instantiated and given a meaningful `execution_date`, usually related to the schedule if the DAG is scheduled, or to the start_date when DAGs are instantiated on demand. A task instance's primary key in the database is (dag_id, task_id, execution_date), and its state is tracked against that key.

Many instances of a DAG and/or of a task can run in parallel, within whatever constraints you specify. Such constraints might be tasks that you set to `depends_on_past=True`, the task concurrency setting of a specific DAG object (each DAG has a concurrency limit, default 16), the maximum number of active DAG runs (the number of DAG schedules that get evaluated by the scheduler), pool concurrency settings if any, and of course the total number of slots in the cluster.
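
To make this concrete, here is a minimal sketch of a DAG file (the dag_id, task_ids, schedule and commands are made up for illustration):

from datetime import datetime
from airflow import DAG
from airflow.operators.bash_operator import BashOperator

# dag_id must be unique across the whole Airflow installation
dag = DAG(
    dag_id="example_daily_etl",
    start_date=datetime(2016, 1, 1),
    schedule_interval="@daily",
    concurrency=16,  # max running task instances for this DAG
)

# task_id only needs to be unique within this DAG
extract = BashOperator(task_id="extract", bash_command="echo extract", dag=dag)
load = BashOperator(
    task_id="load",
    bash_command="echo load",
    depends_on_past=True,  # wait for this task's success on the previous execution_date
    dag=dag,
)
load.set_upstream(extract)

# The scheduler creates task instances keyed by (dag_id, task_id, execution_date),
# e.g. ("example_daily_etl", "load", 2016-02-08) for the Feb 8 run.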

Let me know if you have more questions. I'd love to have someone write a blog post "Understand Airflow when coming from Oozie" or something like that eventually.

Max

Serega Sheypak

Feb 8, 2016, 3:53:36 PM
to Airflow
Okay, got it. Thanks for the explanation.
I've already started a workflow with a custom operator, and it runs fine!

> I'd love to have someone write a blog post "Understand Airflow when coming from Oozie" or something like that eventually.

My current concerns are: 
- Airflow can't submit MapReduce/Scalding/Spark jobs out of the box; I have to build that on my own.
- Airflow needs the Hadoop libs near itself. Right now I'm trying to build a Docker image with Apache Hadoop + Java + Airflow on board in order to run my Airflow test-drive flow. I can't just go to a Hadoop cluster and install/start Airflow there; nobody will allow me to do that. I will run Airflow in Docker with an external database and keep all the Hadoop libs and Java inside the image. It would be a really heavy image.
- Then I have to solve the multiple-cluster configuration problem: I have to provide cluster configurations to the Docker image and keep them maintained.
- Airflow forces me to set up one more distributed system (Celery). I already have several distributed systems; I'm not sure my devops team would like to maintain (monitor/alert/upgrade/collect logs for...) one more.
- I didn't get the idea of DAG deployment. Do I have to put all my DAGs here:
[core]
dags_folder=/airflow/dags_folder

What if a user wants to add his own DAG? Should I give the user RW access to /app/dags_folder?

I hope that the Airflow programming model will be more productive than declarative XML spaghetti, and that I can recoup the initial down payment later. My main concern is that I'm deploying and configuring one more distributed system around Hadoop just to run sequences of jobs on Hadoop.

I'll definitely write about the results :)
Thanks again, and sorry if I said something wrong; I didn't mean to say anything bad about Airflow, it's just one more way to solve workflow problems. I'm just evaluating it and comparing it with my experience.

Maxime Beauchemin

Feb 9, 2016, 3:39:27 AM
to Airflow
Thanks for all the feedback, here are some comments/ answers:

- Yeah, for those you'd typically want to write a BashOperator. We have a SparkOperator and a CascadingOperator internally at Airbnb, but there's really not much to them, and some of it is specific to our env. It's basically a simple Python operator that knows where to download the right version of a jar and fire it up (see the sketch after this list).
- On our side the Airflow chef recipe inherits some of our Cloudera recipe, but we're actually decoupling them soon since Airflow mostly talks to services.
- Not sure what you mean by "multiple cluster config problems"
- Celery is just a python library, but it does rely on a message queue like Redis or RabbitMQ. You may or may not have one of these up already. Soon we'll have a YarnExecutor so that it plays nicely with Hadoop.
- You'll need to sync a filesystem across many machines. These files should be in source control and synced onto boxes however you usually sync your repositories to boxes. For us that's Chef; for others it may be a mount point, or some cron abstraction pulling from a central git repo, ...
- Users commit DAGs to a repository; after a DAG is reviewed and merged, it gets synced to all nodes and gets scheduled automagically.
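
The SparkOperator / CascadingOperator mentioned above are Airbnb-internal, but as a rough sketch of the idea (the operator name, paths and spark-submit invocation below are hypothetical, not our actual code), such an operator could look like this:

import subprocess

from airflow.models import BaseOperator


class SparkJarOperator(BaseOperator):
    """Hypothetical operator: fetch a jar from HDFS, spark-submit it, clean up."""

    def __init__(self, hdfs_jar_path, main_class, *args, **kwargs):
        super(SparkJarOperator, self).__init__(*args, **kwargs)
        self.hdfs_jar_path = hdfs_jar_path
        self.main_class = main_class

    def execute(self, context):
        local_jar = "/tmp/%s.jar" % self.task_id
        # download the right version of the jar, fire it up, then clean up after ourselves
        subprocess.check_call(["hdfs", "dfs", "-get", self.hdfs_jar_path, local_jar])
        try:
            subprocess.check_call(["spark-submit", "--class", self.main_class, local_jar])
        finally:
            subprocess.check_call(["rm", "-f", local_jar])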


Serega Sheypak

Feb 9, 2016, 6:22:28 AM
to Airflow
Hi, thank you for your reply.
> Yeah for those you'd typically want to write a BashOperator, 
This approach always confuses me; it looks like you are writing a command-line text generator in Python. Unfortunately it's a widespread practice. I don't believe in bash; it's hard to test and maintain.

>That knows where to download the right version of a jar and fire it up. 
I need some persistent storage for downloaded artifacts. Why do I need it if I already have HDFS? Oozie uses only HDFS; it really doesn't need to store anything on the local FS except server logs.

>- On our side the Airflow chef recipe inherits some of our Cloudera recipe,
One more thing to learn, support, and maintain. Anyway, you need puppet/chef/ansible to manage clusters, but is it a good thing to couple it with a workflow management tool?

>Not sure what you mean by "multiple cluster config problems"
Each Airflow executor has to have the Hadoop conf near itself. We could have several clusters, and Airflow has to know the conf for each of them; I have to keep these confs up to date. I don't want to bring Airflow onto the cluster; I want to run Airflow on dedicated machines / Docker containers / whatever.

>Soon we'll have a YarnExecutor so that it plays nicely with Hadoop.
That would be great; I don't believe the Hadoop ecosystem will decouple itself from YARN soon. It would be good to get rid of Celery, Rabbit, etc., since they add more moving parts to maintain, upgrade, monitor, and alert on.

>You'll need to sync a filesystem on many machines. 
:(

>cron abstraction pulling from a central git repo, ...
What if cron fails? We get inconsistent code across workers. One more thing to maintain, upgrade, monitor, alert.

>User commit DAG to repository,
We have an in-house tool right now that uses the same approach; it doesn't work in the long run. The reviewer and the contributor can be 12 hours apart, so the committer has to wait a day until the change is merged. The feedback loop becomes enormously long. The other thing: I don't want to be the Python code QA. It's the user's responsibility to write the Python code; my responsibility is to provide a platform for running it.

I plan a different deployment scheme:
Airflow is installed via pip.
Airflow gets in-house extensions installed via pip.
Airflow provides bootstrap functionality for the user: the user runs a built-in workflow with a single operator that installs/updates a user-provided pip module. Once the user's pip module is installed, the user can run his workflow.

That way we are all completely independent. But the synchronization problem still exists...
What if some of the executors are down/unresponsive/etc.?




Maxime Beauchemin

Feb 9, 2016, 12:21:28 PM
to Airflow
>> Yeah for those you'd typically want to write a BashOperator, 
>This approach always confuses me; it looks like you are writing a command-line text generator in Python. Unfortunately it's a widespread practice. I don't believe in bash; it's hard to test and maintain.
I also stay away from bash as much as possible. It's pretty easy to write an operator and tests for it. Use the PythonOperator or write your own operator.
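
For example (hypothetical code, not one of our internal operators), the callable behind a PythonOperator is plain Python, so it can be unit tested without running a scheduler at all:

from airflow.operators.python_operator import PythonOperator

def load_partition(ds, **kwargs):
    # plain Python logic, easy to test in isolation
    return "loading partition %s" % ds

load = PythonOperator(
    task_id="load_partition",
    python_callable=load_partition,
    provide_context=True,
    dag=dag,  # assumes a DAG object like the one earlier in the thread
)

def test_load_partition():
    assert load_partition(ds="2016-02-08") == "loading partition 2016-02-08"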


>>That knows where to download the right version of a jar and fire it up. 
>I need some persistent storage for downloaded artifacts. Why do I need it if I already have HDFS? Oozie uses only HDFS; it really doesn't need to store anything on the local FS except server logs.
Our SparkOp and CascadingOp operators get the jar from HDFS, fire it up, and then clean up after themselves. This approach would work with S3 as well. Either you pre-deploy the jar and have your operator assume it's there, or you fetch it and clean up. There's no magic here; you can't make an omelette without eggs.

>>- On our side the Airflow chef recipe inherits some of our Cloudera recipe,
> One more thing to learn, support, and maintain. Anyway, you need puppet/chef/ansible to manage clusters, but is it a good thing to couple it with a workflow management tool?
I'm not sure there is any way around this. You can't get a whole new set of features without having to keep a system up. In this microservice / SOA world you need to be good at running many services. Maybe your ideal world is to have Airflow run as a YARN app?

>> Not sure what you mean by "multiple cluster config problems"
> Each Airflow executor has to have the Hadoop conf near itself. We could have several clusters, and Airflow has to know the conf for each of them; I have to keep these confs up to date. I don't want to bring Airflow onto the cluster; I want to run Airflow on dedicated machines / Docker containers / whatever.
Yeah, Airflow needs to know how to connect to your other systems. Maintaining the connection metadata (in the Airflow DB), even in our complex, fast-moving environment, has been very lightweight. Are you suggesting we should find and read the Hadoop XML files from somewhere? That seems messy; what if they aren't local? There's no magic here either: ETL tools need metadata about how to connect to other systems.
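
For illustration (the conn_id below is made up; you'd create it through the Airflow UI or CLI), an operator or hook looks the connection up from the Airflow DB instead of reading cluster XML files on the worker:

from airflow.hooks.base_hook import BaseHook

# "prod_cluster" is a hypothetical conn_id stored in Airflow's connection table
conn = BaseHook.get_connection("prod_cluster")
print(conn.host, conn.port, conn.login)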

>> Soon we'll have a YarnExecutor so that it plays nicely with Hadoop.
> That would be great; I don't believe the Hadoop ecosystem will decouple itself from YARN soon. It would be good to get rid of Celery, Rabbit, etc., since they add more moving parts to maintain, upgrade, monitor, and alert on.
Remember that not everyone has Hadoop / YARN. If these things are first-class citizens in your env, then great, YarnExecutor FTW. If Mesos is the way you do things, MesosExecutor! And if neither, you can get a huge amount of mileage out of Celery. We have run millions of tasks with it; it works great. Redis (at Airflow scale) is the easiest thing in the world to maintain.
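
For reference, the Celery setup is mostly configuration. Roughly, in airflow.cfg (the broker and result backend URLs below are placeholders):

[core]
executor = CeleryExecutor

[celery]
broker_url = redis://my-redis-host:6379/0
celery_result_backend = db+mysql://airflow:airflow@my-db-host/airflow

Each worker box then runs `airflow worker`, and the scheduler distributes task instances to them through the broker.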

>>You'll need to sync a filesystem on many machines. 
> :(
I figured most companies have some standard way to do this. There's a PR out to sync DAGs to workers using S3; it may or may not work for you.

>> cron abstraction pulling from a central git repo, ...
> What if cron fails? We get inconsistent code across workers. One more thing to maintain, upgrade, monitor, alert.
It's pretty vital to have a reliable way to sync a code repo to nodes. We're looking into serializing the DAG objects, but that solution has shortcomings; I won't get into the details here.

>>User commit DAG to repository,
>We have an in-house tool right now that uses the same approach; it doesn't work in the long run. The reviewer and the contributor can be 12 hours apart, so the committer has to wait a day until the change is merged. The feedback loop becomes enormously long. The other thing: I don't want to be the Python code QA. It's the user's responsibility to write the Python code; my responsibility is to provide a platform for running it.
Well if you guys don't code review that's fine I guess, YOLO. People can merge away and hope for the best.

> That way we are all completely independent. But the synchronization problem still exists...
> What if some of the executors are down/unresponsive/etc.?
What if someone shuts down your network? Nuclear meltdown? Airflow (like all of your other systems) is down until the infrastructure is restored.

Serega Sheypak

Feb 9, 2016, 12:37:22 PM
to Airflow
Thanks, awesome points!

> What if someone shuts down your network? Nuclear meltdown?
I can go home early that day :) Nothing to fix.