Interesting blog post: "Airflow: Tips, Tricks, and Pitfalls"


Maxime Beauchemin

Dec 7, 2015, 12:40:27 PM
to Airflow
Hi,

Marcin Tustin at Handy.com posted a very informative blog post titled "Airflow: Tips, Tricks, and Pitfalls" 
https://medium.com/handy-tech/airflow-tips-tricks-and-pitfalls-9ba53fba14eb#.650vzj43p

I encourage you to take the time to read and integrate it. I also encourage you to write about Airflow, help us improve the docs, and spread the knowledge!

Max

Sergei Iakhnin

Dec 8, 2015, 5:48:54 AM
to Airflow
Hi Maxime,

Thanks for sharing this article. The issue of certain tasks not being scheduled unless you reboot the scheduler is something that I also encounter, and is quite worrisome to me. Do you have any insight into what is causing this and are you trying to identify the root cause? I think rebooting the service is an ok band-aid in the short term, but can't be a valid long term strategy.

Thanks,

Sergei.

Chris Riccomini

Dec 8, 2015, 12:07:16 PM
to Airflow
"The issue of certain tasks not being scheduled unless you reboot the scheduler is something that I also encounter, and is quite worrisome to me"

Yeah, this is alarming. Having to bounce a scheduler doesn't instill much confidence. Any idea what's going on?

Maxime Beauchemin

Dec 8, 2015, 1:53:16 PM
to Airflow
Yes, restarting the scheduler every N runs has worked a little too well for us, so we haven't had as much incentive to fix this just yet. It's an important set of issues (especially for folks running a LocalExecutor) at the heart of the platform, and it needs to be addressed.

The issues I've seen are around keeping the DagBag up to date while the code is changing underneath. Here are a few things that may or may not create problems:
* A dag_id moves from one file to another
* Your pipeline imports a module whose content can alter the DAG, and nothing calls `reload(my_module)`. These imports are handled by Python and won't reload unless instructed to, or unless you mess with `sys.modules`, which seems dangerous. Maybe use a subprocess?
* I haven't monitored the code for memory leaks, but what if someone's pipeline has some sort of leak? If there's something of that nature, restarting the process solves it
* I don't think that removing DAG files currently removes them from the DagBag, but that should be trivial to fix
* Race conditions between the scheduler, workers and web servers, where the scheduler can get ahead of the workers for instance

There are many paths to solve this. The most obvious is to iterate on DagBag to iron these out one at a time. 

My favorite and more ambitious idea is to have a process that serializes all DAG objects to the DB while monitoring them for changes and versioning them, and to have the workers and webserver get DAG definitions by deserializing them, meaning they don't need the code locally or their own DagBag. That solves issues around versioning and around conflicting or uneven DagBags in production. One super important blocker there is that the jinja templates aren't serializable. We also need to enforce that DAG objects are serializable. Currently some callbacks or PythonOperators might not be, so we need to change these to use namespace references instead of actual Python object references. That's probably going to take place in Q1.

I might do a call for help on the jinja template pickling (serialization). If someone from the community could either find a hack or alter the jinja project to make templates serializable that would help a lot.
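
The "namespace references" idea above can be illustrated with a minimal sketch. This is not Airflow code: the `"module:attribute"` string format and the `resolve_reference` helper are assumptions made up for the example. It only shows why a spec holding a live callable (here a lambda) fails to pickle, while a spec holding a string reference round-trips and can be resolved after deserialization.

```python
import importlib
import pickle


def resolve_reference(reference):
    """Turn a hypothetical 'module:attribute' string back into an object."""
    module_name, _, attribute = reference.partition(":")
    module = importlib.import_module(module_name)
    return getattr(module, attribute)


def is_picklable(obj):
    """Return True if pickle can serialize obj, False otherwise."""
    try:
        pickle.dumps(obj)
        return True
    except (pickle.PicklingError, AttributeError, TypeError):
        return False


# A task spec that holds an actual Python object reference (a lambda).
spec_with_object = {"task_id": "root", "python_callable": lambda x: x ** 0.5}

# The same spec, holding only a string that names where the callable lives.
spec_with_reference = {"task_id": "root", "python_callable_ref": "math:sqrt"}

object_ok = is_picklable(spec_with_object)
reference_ok = is_picklable(spec_with_reference)

# Round-trip the reference-based spec and resolve the callable afterwards.
restored = pickle.loads(pickle.dumps(spec_with_reference))
result = resolve_reference(restored["python_callable_ref"])(9.0)
```

The trade-off is that whoever deserializes the spec must still be able to import the named module, so this moves the problem from "is the object picklable" to "is the module importable here".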

r0ger

Apr 27, 2016, 3:25:56 PM
to Airflow
Sorry for being late to the party.

We have recently switched our pipelines to Airflow (we love it).
While reading about this, I also came across https://github.com/airbnb/airflow/issues/698 in which @mistercrunch says:
"We restart it every 5 runs using --num_runs mostly just to make sure we get a 100% refreshed dagbag. Keep in mind that this strategy doesn't work with LocalExecutor."


Currently, we are using LocalExecutor. That statement worries me, mostly because I don't know what the problem is.
In general, can someone please provide more detailed info on this? It would be really helpful for understanding what is going on.

Maxime Beauchemin

Apr 27, 2016, 5:05:44 PM
to Airflow
Addressing related problems is at the top of the roadmap this quarter, and we have lots of bright minds attacking this issue.

Short story is that "configuration as code" is hard for many reasons, and we need a more systematic approach to ensure that what is in memory stays in line with the code, as people would expect. One basic problem is that if your `mydag.py` imports a `config.py`, and that `config.py` changes, the DagBag will never know about it. Airflow can force reload `mydag.py` since it knows it has found a DAG in there earlier, but it really doesn't know about `config.py`. Even if we flush the DagBag and rebuild it from scratch, `config.py` has been cached by Python in `sys.modules`. And we don't want to mess with `sys.modules`, that would just "void the warranty".
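
The caching behavior described above can be reproduced outside Airflow with a small sketch. The module name `demo_dag_config` and its `SCHEDULE` variable are hypothetical stand-ins for `config.py`; `importlib.reload` is the Python 3 spelling of the `reload()` builtin mentioned in this thread. Bytecode writing is disabled so a cached `.pyc` cannot hide the file change.

```python
import importlib
import os
import sys
import tempfile

# Do not write .pyc files, so reload always recompiles from the source file.
sys.dont_write_bytecode = True

workdir = tempfile.mkdtemp()
config_path = os.path.join(workdir, "demo_dag_config.py")


def write_config(schedule):
    """Overwrite the hypothetical config module with a new SCHEDULE value."""
    with open(config_path, "w") as handle:
        handle.write("SCHEDULE = %r\n" % schedule)


write_config("daily")
sys.path.insert(0, workdir)
importlib.invalidate_caches()

# First import: Python executes the file and caches the module in sys.modules.
config = importlib.import_module("demo_dag_config")
first_value = config.SCHEDULE

# The file changes on disk while the process keeps running.
write_config("hourly")

# Importing again is served from sys.modules, so the change is not seen.
config_again = importlib.import_module("demo_dag_config")
stale_value = config_again.SCHEDULE

# Only an explicit reload re-executes the file.
importlib.reload(config)
fresh_value = config.SCHEDULE
```

Here `stale_value` still holds the old setting even though the file has changed, which is the situation a long-running scheduler ends up in for every module its DAG files import.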

The new approach we're rolling with is to never parse the DagBag in a long-running process, always in subprocesses: short-lived DagBags. A subprocess might load the whole DagBag, figure out where all DAGs are located (file entry points for each DAG), and put that in the database. The main scheduler process will then read that manifest and start subprocesses that each do a single "schedule cycle", loading the DagBag from that entry point only, and then die.
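
As a rough illustration of why short-lived subprocesses sidestep the stale-import problem, here is a minimal sketch. It is not the scheduler's implementation: `pipeline_settings`, `OWNER`, and `load_in_subprocess` are hypothetical names, and the "parse" step is reduced to importing one module and printing JSON. Each call starts a fresh interpreter, so nothing survives in `sys.modules` between cycles.

```python
import json
import os
import subprocess
import sys
import tempfile

workdir = tempfile.mkdtemp()
settings_path = os.path.join(workdir, "pipeline_settings.py")

# Code run by each short-lived child: import the module, report what it saw.
CHILD_CODE = (
    "import json, pipeline_settings; "
    "print(json.dumps({'owner': pipeline_settings.OWNER}))"
)


def write_settings(owner):
    """Overwrite the hypothetical settings module with a new OWNER value."""
    with open(settings_path, "w") as handle:
        handle.write("OWNER = %r\n" % owner)


def load_in_subprocess():
    """Import the settings in a fresh interpreter and return its report."""
    env = dict(os.environ, PYTHONPATH=workdir)
    # -B keeps the child from writing .pyc files into the work directory.
    output = subprocess.check_output(
        [sys.executable, "-B", "-c", CHILD_CODE], env=env
    )
    return json.loads(output.decode("utf-8"))


write_settings("alice")
first_cycle = load_in_subprocess()

# The file changes between cycles; the parent process never imports it.
write_settings("bob")
second_cycle = load_in_subprocess()
```

The cost is one interpreter start-up per cycle, in exchange for never having to reason about what the parent process has cached.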

In the meantime, restarting the scheduler every 5 runs was just a simple way for us to ensure a 100% fresh dagbag. An alternative would be to systematically `reload(module)` on external dependencies that you know might change, but that's just not right either.

This shouldn't be an issue a month or so from now, bear with us :). Also note that all of us got a lot of mileage out of the current situation.

Max

r0ger

Apr 28, 2016, 2:20:33 AM
to Airflow
got it!!!
Thanks Maxime. 