Scheduler Integration with s3


Jesse Edwards

Sep 1, 2015, 7:06:52 PM
to Airflow
Hi there-

I'm currently working on getting Airflow set up in a distributed, highly available configuration on AWS and have run into a few challenges.

Is it possible to tell airflow to use an s3 bucket for its DAG directory and the log directory?

Things I've encountered:

1. Running more than one scheduler introduces race conditions where each scheduler attempts to grab the same task. This is stopped by unique constraints in the DB, but it produces a nice flurry of errors.
2. If I were to have multiple Airflow web servers or schedulers (active/passive), or just stand up a new VM, I'd want to preserve my output logs and keep them accessible.

Thanks,
Jesse


Maxime Beauchemin

Sep 2, 2015, 10:48:04 AM
to Airflow
For now the DAGS_FOLDER has to be a locally mounted path. You may be able to mount an s3 bucket and point to that. We use the Chef git resource to sync the repository to all workers; your environment/infrastructure should have some way of keeping git repos in sync across multiple machines.
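A minimal sketch of that sync step, assuming a cron job (or config-management hook) on each worker; the paths, repo URL, and function name here are illustrative, not Airflow configuration:

```python
# Keep a local DAGS_FOLDER checkout up to date with a central git repo.
# Run periodically on every worker/scheduler/web server machine.
import os
import subprocess

def sync_cmd(dags_folder, repo_url):
    """Return the git command that brings dags_folder up to date:
    a fast-forward pull if a checkout already exists, a fresh clone otherwise."""
    if os.path.isdir(os.path.join(dags_folder, ".git")):
        return ["git", "-C", dags_folder, "pull", "--ff-only"]
    return ["git", "clone", repo_url, dags_folder]

# e.g. from cron on each machine:
# subprocess.check_call(sync_cmd("/var/lib/airflow/dags",
#                                "https://example.com/dags.git"))
```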

As you discovered, at the moment you should run only one scheduler at a time. The scheduler code could easily be adapted to be distributed by taking locks on individual DAGs as it processes them. I have written some code in that direction but haven't pushed further on it, since one scheduler can schedule 10k+ tasks a day with <60 seconds of lag. As we scale up I'll most likely have to allow multiple schedulers to run, maybe a few months from now or whenever someone from the community needs it.
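The per-DAG locking idea can be sketched like this, not as Airflow's actual implementation but as one common pattern: each scheduler claims a DAG by inserting a uniquely-constrained row, so the database arbitrates which scheduler wins (sqlite and all names here are illustrative):

```python
# Hypothetical sketch: schedulers coordinate through a dag_lock table
# whose primary key on dag_id guarantees only one owner per DAG.
import sqlite3

def claim_dag(conn, dag_id, scheduler_id):
    """Return True if this scheduler won the lock for dag_id."""
    try:
        with conn:  # commit on success, rollback on error
            conn.execute(
                "INSERT INTO dag_lock (dag_id, owner) VALUES (?, ?)",
                (dag_id, scheduler_id),
            )
        return True
    except sqlite3.IntegrityError:  # another scheduler already holds it
        return False

def release_dag(conn, dag_id, scheduler_id):
    """Drop the lock so another scheduler can pick the DAG up."""
    with conn:
        conn.execute(
            "DELETE FROM dag_lock WHERE dag_id = ? AND owner = ?",
            (dag_id, scheduler_id),
        )

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE dag_lock (dag_id TEXT PRIMARY KEY, owner TEXT)")
```

A scheduler loop would attempt `claim_dag` for each DAG, process only the ones it wins, and release them afterward.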

We run multiple web servers (each Airflow worker at Airbnb also acts as a web server). The logs are captured by runit (managed via a Chef resource); your environment should have some standard way to wrap the Airflow executables and capture their logs.
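As a rough stand-in for what a runit-style supervisor does, here is a minimal wrapper that launches a process and appends its combined stdout/stderr to a log file; the function name and paths are illustrative, and a real setup would use runit, systemd, or similar rather than this:

```python
# Hypothetical wrapper: run a command (e.g. an Airflow executable)
# and append its combined output to a per-service log file.
import subprocess

def run_with_log(cmd, log_path):
    """Run cmd, appending stdout+stderr to log_path; return its exit code."""
    with open(log_path, "ab") as log:
        return subprocess.call(cmd, stdout=log, stderr=subprocess.STDOUT)

# e.g.: run_with_log(["airflow", "webserver"], "/var/log/airflow/webserver.log")
```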

Max