Airflow + Docker DAG execution

Chris Riccomini

Nov 17, 2015, 12:42:40 PM
to Airflow
Hey all,

I'm still poking around with Airflow, so forgive me if some of this has been answered.

If I have a DAG that has a Python task in it, and the Python task includes a bunch of dependencies (e.g. NLTK, scipy, etc), do I have to actually install these packages on all of my worker nodes? This seems kind of excessive, since the nodes will end up hosting the union of all possible packages for every language that people are writing tasks in (Python, Java, Go, JavaScript, etc etc). From an ops perspective, this is a maintenance nightmare (especially when different orgs start demanding different versions of the same package).

One solution, which I've had imposed on me in the past, is to just dictate very strictly which things are supported, and install them on all the worker nodes. This would work, but is somewhat limiting. We used to do it with our old Hadoop clusters.

Another solution that I'm curious about is supporting containers. It seems like this is the exact use case that containers are trying to solve. Is anyone running Airflow with the DAG execution being done inside a docker image, such that developers can pip install/apt-get install/gem install/whatever into their docker image, and the people operating the Airflow workers don't have to worry about any of this stuff?

I'm wondering if there's some integration where the tasks can `docker run` (didn't find any Docker operator, or anything), and have the worker suck down the proper Docker image(s), and execute it.

How does Airbnb/blueapron/others solve this?

Cheers,
Chris

Arthur Wiedmer

Nov 17, 2015, 1:40:24 PM
to Airflow
Hi Chris,

We are still trying to solve this exact problem to some extent. We do fix a set of dependencies for most of the workers and have not yet encountered too many issues related to different versions of packages.

We would be really excited about supporting containers. Right now we are trying to scope how to integrate with YARN to provide a YARN executor, but this is not the only possibility. Contributors have written a Mesos Executor which provides container support, but unfortunately we are moving away from Mesos as a platform and thus cannot drive further development on it.

I know some people have been experimenting with Airflow Docker images, but I do not know whether they use them for more than deploying Airflow itself on a container-based architecture, as opposed to using them as per-task execution environments as you suggest.

Best,
Arthur

Chris Riccomini

Nov 17, 2015, 2:12:32 PM
to Airflow
Hey Arthur,

YARN is something that I've also got experience on from a past life. YARN will help execute in a distributed fashion, much like Celery, but it's not going to help a ton with packaging. Two thoughts:

1) YARN does support the ability to move resources. This puts you in the land of Azkaban, where you have ZIP files or tarballs that YARN can ship around for you. This doesn't really help with package dependencies, though.
2) Last I checked, YARN had moved to support Docker images as the binary that it deploys using the DockerContainerExecutor. I'm not too sure what the current status of this feature is, though. Unlike (1), using YARN to deploy Docker images *would* get you to a point where individual devs would be able to install their own dependencies in their images.

A note on (2), though, is that YARN and Docker images are somewhat orthogonal. Using YARN vs. Celery vs. Mesos doesn't make much difference. What matters is deploying a proper container that allows devs to install all of their dependencies without impacting the entire cluster.

A second note is that moving towards Docker images as a deployment mechanism helps to decouple the DAG definition from the artifacts that are deployed and executed. While it's convenient at first to embed scripts and whatnot along with your config, decoupling config from binaries has some nice benefits when operating at scale. I definitely pick up a bit of a mixed message in the airflow docs, where you're saying, "DAGs are configs," but then in other places, I can see the tendency to stuff Python callbacks and bash commands inside the DAG script (or in adjacent .py files that are deployed simultaneously with the config). This is something that I'd never do with config. I'd keep config and binary artifact publication, revisioning, etc separate. Having the DAG clearly defined as the config that defines the workflow, and the docker image clearly defined as the deployable artifact that contains the workflow binaries makes this really clear.

Cheers,
Chris

Chris Riccomini

Nov 17, 2015, 2:14:05 PM
to Airflow
(Note: I don't mean to come off as a Docker fanboy--I promise I'm not. I'm just pretty nervous about the deployment mechanism that Airflow offers right now, and I see containers as a nice way to bundle things in this use case).

Chris Riccomini

Nov 17, 2015, 2:43:14 PM
to Airflow
Hey all,

I took a brief look at the YARN DockerContainerExecutor and its docs:


Rather than bundling this support as part of the executor in Airflow, which it would be if we used YARN's Docker support, what do you think about just implementing a DockerOperator that does the same thing? In my mind, the operator would:

1. Download the image to the local Docker daemon on the worker.
2. Docker run it.

This would require running the Docker daemon on every worker (as the YARN implementation does). The advantage of this is that the DockerOperator should be able to work on any executor, provided that the Docker daemon is running on the workers (Mesos slaves, YARN NodeManagers, Celery workers, etc).
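A minimal sketch of what those two steps could look like on a worker, driving the local Docker daemon through subprocess. The helper names and the env-flag handling here are my own illustration, not an existing Airflow API:

```python
import subprocess

def build_docker_run_command(image, command, env=None):
    """Assemble the `docker run` argument list, one -e flag per env variable."""
    args = ["docker", "run", "--rm"]
    for key, value in sorted((env or {}).items()):
        args += ["-e", "%s=%s" % (key, value)]
    args.append(image)
    args += command
    return args

def run_task_in_container(image, command, env=None):
    """Step 1: pull the image onto the worker's local daemon.
    Step 2: run the task command inside it, failing loudly on a non-zero exit."""
    subprocess.check_call(["docker", "pull", image])
    subprocess.check_call(build_docker_run_command(image, command, env))
```

Since it shells out to the `docker` CLI, the only thing the operator needs from the host is a running daemon, which keeps it executor-agnostic.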

Does anyone see anything obviously wrong about this, or have any other ideas?

Cheers,
Chris

Arthur Wiedmer

Nov 17, 2015, 2:45:03 PM
to Airflow


On Tuesday, November 17, 2015 at 11:12:32 AM UTC-8, Chris Riccomini wrote:
> Hey Arthur,
>
> YARN is something that I've also got experience on from a past life. YARN will help execute in a distributed fashion, much like Celery, but it's not going to help a ton with packaging. Two thoughts:
>
> 1) YARN does support the ability to move resources. This puts you in the land of Azkaban, where you have ZIP files or tarballs that YARN can ship around for you. This doesn't really help with package dependencies, though.

Agreed, the main reason we are considering this is for resource managing. I was kind of hoping that there was more possibilities to use LXC and cgroups there, but the docker executor you mention might be the way to go.
 
> 2) Last I checked, YARN had moved to support Docker images as the binary that it deploys using the DockerContainerExecutor. I'm not too sure what the current status of this feature is, though. Unlike (1), using YARN to deploy Docker images *would* get you to a point where individual devs would be able to install their own dependencies in their images.

> A note on (2), though, is that YARN and Docker images are somewhat orthogonal. Using YARN vs. Celery vs. Mesos doesn't make much difference. What matters is deploying a proper container that allows devs to install all of their dependencies without impacting the entire cluster.

Actually, since Mesos uses Linux containers just like Docker, you can ship different dependencies that will be cleaned up after execution. In this sense it is better than YARN for this, and different.

> A second note is that moving towards Docker images as a deployment mechanism helps to decouple the DAG definition from the artifacts that are deployed and executed. While it's convenient at first to embed scripts and whatnot along with your config, decoupling config from binaries has some nice benefits when operating at scale. I definitely pick up a bit of a mixed message in the airflow docs, where you're saying, "DAGs are configs," but then in other places, I can see the tendency to stuff Python callbacks and bash commands inside the DAG script (or in adjacent .py files that are deployed simultaneously with the config). This is something that I'd never do with config. I'd keep config and binary artifact publication, revisioning, etc separate. Having the DAG clearly defined as the config that defines the workflow, and the docker image clearly defined as the deployable artifact that contains the workflow binaries makes this really clear.

I completely agree about the decoupling of config deployment vs binaries deployment. 

Note that, as of yet, almost nothing we use in our workflows is a binary. We do ETL using code that runs on frameworks already deployed on our cluster, like Hive, Python, MySQL, etc. The code we run is interpreted, and as such can be deployed just fine in the same way as the workflow definition, given a framework version. Our deployment of required frameworks is done via Chef on our infra.
 
Cheers,
Arthur

Arthur Wiedmer

Nov 17, 2015, 2:46:06 PM
to Airflow
And I should say that I have nothing against docker, and would love to pursue a Docker Executor, or a way to deploy containerized runs :)

Best,
Arthur

Chris Riccomini

Nov 17, 2015, 3:07:50 PM
to Airflow
Hey Arthur,

> I was kind of hoping that there was more possibilities to use LXC and cgroups there, but the docker executor you mention might be the way to go.

They have LXC and cgroups support, AFAIK. Again, my experience is from a year ago, but at the time, I was able to get CGroup CPU isolation up and running in YARN. Just a data point.

> Our deployment of required frameworks is done via Chef on our infra.

Are you using Chef to pull in DAGs? Is there one central DAG repo in VCS, or many? Curious how you guys are managing this.

> we are moving away from Mesos as a platform and thus cannot drive further development on it.

Curious about why, and learnings. Especially given LXC and dependency cleanup that Mesos gives you. Seems very complementary to Airflow.

Cheers,
Chris

Chris Riccomini

Nov 17, 2015, 3:11:32 PM
to Airflow
> They have LXC and cgroups support

Correction, they have CGroup support, though it's pretty limited. LXC, I'm not so sure. I see a lot of JIRAs trying to replicate it by just using DockerContainerExecutor.

Will Norman

Nov 17, 2015, 4:40:19 PM
to Airflow
We're currently doing something similar to this, just using the bash operator:



run_some_docker_job = BashOperator(
    task_id='run_some_docker_job',
    env={'RUN_AS': run_as_user, 'AIRFLOW_ENV': environment},
    bash_command='sudo docker run --rm -e "ENV=${AIRFLOW_ENV}" intentmedia/aggregations:green <job> <task> --runDate {{ ds }} --user ${RUN_AS}',
    dag=dag)

We install docker on our job server, and have a task at the head of our DAG that pulls in the latest green container.

It's worked out pretty well, and allows for running different types of tasks without worrying about installing different libraries on the box.

You probably wouldn't have to pull first if you use a set version tag.  It could be passed in as part of the env map, like we're doing with RUN_AS and AIRFLOW_ENV above.  Those are just variables in our DAG that are read in from config in our case.

It probably wouldn't be too hard to write a DockerOperator that does something similar using docker's python client API.
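A rough sketch of that idea. In real Airflow this would subclass BaseOperator and take a docker-py Client; here it's a plain class with an injected client so the pull/create/start/wait flow stays visible (the class shape is my own, not an existing operator):

```python
class DockerOperator(object):
    """Sketch only: pull an image, run a command in it, surface the exit code.

    `client` is assumed to follow docker-py's Client interface
    (pull / create_container / start / wait)."""

    def __init__(self, image, command, client):
        self.image = image
        self.command = command
        self.client = client

    def execute(self):
        self.client.pull(self.image)  # fetch the image onto the local daemon
        container = self.client.create_container(image=self.image,
                                                 command=self.command)
        self.client.start(container)
        exit_code = self.client.wait(container)  # block until the task finishes
        if exit_code != 0:
            raise RuntimeError("container exited with code %d" % exit_code)
        return exit_code
```

Injecting the client also makes the operator easy to test without a daemon around.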


Arthur Wiedmer

Nov 17, 2015, 4:52:59 PM
to Airflow
We have DAGs and related interpreted code in VCS which is deployed to machines every 10/15 minutes. We have one central dag repo, but there is nothing preventing you from having separate repos as subfolders of your DAGS_FOLDER.

As for Mesos, people on our infra team are probably better suited to answer this, but I can provide partial answers:
- The main issue Data Engineering had was around investigating issues from jobs when they arose. Logs in containers were disappearing quickly, sometimes too quickly for us to be able to investigate, and we did not have a good system to keep the containers around on error to investigate them later.
- Turns out that we were running custom forks of Mesos and Chronos a little too far from vanilla, and the people maintaining them left the company (before my time, mostly).

Best,
Arthur

Maxime Beauchemin

Nov 17, 2015, 7:18:56 PM
to Airflow
Hi,

Very interesting discussion. I agree about callbacks and other Python hooks (PythonOperator!); we may want to backtrack on attaching them physically to the DAG in favor of a string reference to the module/callable. Code logic in the DAG brings issues around serialization of bytecode.
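
That string-reference idea could be as simple as resolving a dotted path at run time. The 'module:callable' format below is just one possible convention, not anything Airflow defines today:

```python
import importlib

def resolve_callable(reference):
    """Turn a 'package.module:callable' string into the callable it names,
    so the DAG file carries only a reference, never the function's bytecode."""
    module_name, _, attr = reference.partition(":")
    module = importlib.import_module(module_name)
    return getattr(module, attr)
```

For example, `resolve_callable("math:sqrt")` hands back the real `math.sqrt`, while the DAG itself only ever stored a plain string.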

Now about docker: it's extremely useful, and it would be nice if it was Executor agnostic. I haven't used docker, so I don't know what is possible or not, but what if we added a reference to a docker container in BaseOperator, allowing you to specify at the task level which docker container to run the airflow task in? Then it would be a matter of having the worker fetch the container and run the `airflow run` command inside of it in a subprocess. I don't think there's any reason why it wouldn't work on just any executor.

A place to start would be to evaluate how hard it is to retrofit `airflow run` to accept a "--docker-container-id" parameter. A new [docker] section in airflow.cfg would reference where to get the container from, ...
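
Concretely, a worker that received such a parameter might just re-wrap its own invocation. The flag and the helper below are hypothetical, sketching Max's suggestion rather than any existing CLI:

```python
def wrap_airflow_run(container_id, dag_id, task_id, execution_date):
    """Build the command a worker would exec when --docker-container-id is set:
    the usual `airflow run` invocation, nested inside `docker run`."""
    return ["docker", "run", "--rm", container_id,
            "airflow", "run", dag_id, task_id, execution_date]
```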

Max

C Mcc

Nov 23, 2015, 4:50:57 PM
to Airflow
Will,
Could you expand a little on your docker solution?
In the context of the original question relating to a list of python package dependencies, do you have various docker images that include a different mix of these... or is the approach to somehow seed on-demand images with dynamic dockerfiles that build containers as required? Is each worker then spawned as a container? Or is there a requirement for the docker daemon to be running on each worker first?

Thanks for any further colour you can offer.
Colum

Will Norman

Nov 30, 2015, 4:39:21 PM
to Airflow
We have the docker daemon running on our worker (just one right now).

Our docker images contain the application that the job flow is running, a mix of scala and clojure in our case, but it could easily be a python application.

The first step of our DAG pulls in the latest docker image for the application(s) being run in that dag.


We only have a couple DAGs so far, but this approach has worked out pretty well.  