Redundancy & Failover

Andrew O'Brien

Apr 26, 2016, 11:01:53 AM
to Airflow
Hi everyone,

I just started to take a look at Airflow and before I went too far, I wanted to find out about redundancy and failover. What are the options like for running Airflow on multiple machines in multiple data centers? Is all state kept in the database or is there other storage I'd need to synchronize? (The Celery queues maybe?) Anything on the filesystem? Anyone tried to do this yet?

Thanks.

Maxime Beauchemin

Apr 26, 2016, 11:22:01 AM
to Airflow
State is all in the database. For now you have to provide boxes that have synced git repositories with your DAG definitions. We're working on a proposal/prototype of a service that will retrieve a specific version of the DAG definitions at task run time.

The message queue (Redis / RabbitMQ / ...) is pretty low bandwidth, so I assume you could have workers in multiple data centers. I'm not aware of anyone who does that, but it shouldn't be an issue.

Max

Andrew O'Brien

Apr 26, 2016, 11:47:16 AM
to Airflow
Thanks, Max. Sounds pretty straightforward.

So if I understand correctly, I can just run multiple webservers (behind a load-balancer) and multiple workers (and they'll just read work from the queue).

What about schedulers? If you run multiple schedulers, do they each schedule DAGs or can they do some kind of leader election?

Maxime Beauchemin

Apr 26, 2016, 1:42:12 PM
to Airflow
> So if I understand correctly, I can just run multiple webservers (behind a load-balancer) and multiple workers (and they'll just read work from the queue).
right

> What about schedulers?
We run a single scheduler at Airbnb. @bolke recently added multiprocessing to it. Other installations run multiple schedulers, but that's not officially supported. We're adding support to cleanly lock DAGs while scheduling them so that multiple schedulers don't step on each other's toes. Each scheduler will figure out which DAG is furthest from its most recent schedule run, lock it, schedule it, and release the lock.
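
For readers following along, here is a minimal sketch of the lock, schedule, release cycle described above. It is not Airflow's actual implementation: `DagLockTable`, `acquire_stalest`, `release`, and `scheduler_pass` are hypothetical names, and an in-memory table guarded by a `threading.Lock` stands in for lock rows that would really live in the shared metadata database.

```python
import threading
from datetime import datetime, timedelta


class DagLockTable:
    """Hypothetical in-memory stand-in for per-DAG lock state in a shared database."""

    def __init__(self, last_runs):
        self._guard = threading.Lock()      # makes pick-and-lock atomic
        self.last_runs = dict(last_runs)    # dag_id -> datetime of last schedule run
        self.locked = set()                 # dag_ids currently held by a scheduler

    def acquire_stalest(self):
        """Lock and return the unlocked DAG with the oldest schedule run, or None."""
        with self._guard:
            free = [d for d in self.last_runs if d not in self.locked]
            if not free:
                return None
            dag_id = min(free, key=lambda d: self.last_runs[d])
            self.locked.add(dag_id)
            return dag_id

    def release(self, dag_id, ran_at):
        """Record the new schedule run time and release the lock."""
        with self._guard:
            self.last_runs[dag_id] = ran_at
            self.locked.discard(dag_id)


def scheduler_pass(table, scheduler_name, log, now):
    """One iteration of a scheduler: lock a DAG, schedule it, release the lock."""
    dag_id = table.acquire_stalest()
    if dag_id is None:
        return None
    try:
        # Placeholder for the real scheduling work (assumption for illustration).
        log.append((scheduler_name, dag_id))
    finally:
        table.release(dag_id, now)
    return dag_id
```

Because picking and locking happen in one atomic step, two schedulers calling `acquire_stalest` at the same time get different DAGs. In a real deployment that atomicity would have to come from the database (for example a row-level lock), which this sketch only imitates.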