Hey all,
I'm still poking around with Airflow, so forgive me if some of this has been answered.
If I have a DAG with a Python task in it, and that task pulls in a bunch of dependencies (e.g. NLTK, scipy, etc.), do I actually have to install those packages on all of my worker nodes? That seems excessive, since every node would end up hosting the union of all packages, across all the languages people are writing tasks in (Python, Java, Go, JavaScript, etc.). From an ops perspective, it's a maintenance nightmare, especially once different orgs start demanding different versions of the same package.
One solution, which I've had imposed on me in the past, is to dictate very strictly which packages are supported and install exactly those on all the worker nodes. This would work, but it's somewhat limiting. We used to do it with our old Hadoop clusters.
Another solution I'm curious about is supporting containers. This seems like exactly the use case containers are trying to solve. Is anyone running Airflow with DAG execution happening inside a Docker image, so that developers can pip install/apt-get install/gem install/whatever into their image, and the people operating the Airflow workers don't have to worry about any of this?
I'm wondering if there's some integration where a task can `docker run` (I didn't find a Docker operator, or anything like one), with the worker pulling down the proper Docker image(s) and executing them.
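For what it's worth, even without a dedicated Docker operator, it seems like you could approximate this today by shelling out to `docker run` from a BashOperator. A minimal sketch of the idea below; the image name, script, and helper function are all hypothetical, and this is just how I'd imagine wiring it up, not an existing Airflow API:

```python
from shlex import quote


def docker_run_command(image, command=""):
    """Build the shell string a BashOperator could execute for one task.

    Pulls the image first so the worker always runs the version the
    developers pushed, then runs it in a throwaway container (--rm).
    `image` and `command` here are made-up examples.
    """
    parts = ["docker pull {0} &&".format(quote(image)),
             "docker run --rm", quote(image)]
    if command:
        parts.append(command)
    return " ".join(parts)


# Hypothetical usage inside a DAG definition, e.g.:
#
#   run_nltk = BashOperator(
#       task_id="nltk_job",
#       bash_command=docker_run_command("my-team/nltk-job:latest",
#                                       "python run_nltk.py"),
#       dag=dag,
#   )
```

The appeal is that the only thing ops has to guarantee on the workers is a working Docker daemon; everything else (pip/apt/gem dependencies) lives in the image the task's developers own.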
How do Airbnb/blueapron/others solve this?
Cheers,
Chris