catchup = False is not working as expected


John Cheng

May 8, 2018, 2:24:02 AM
to cloud-composer-discuss
Hi all,

In my DAG, I set catchup=False. However, once I unpause the DAG, it runs immediately, even though the scheduled time has not been reached.

Here is my DAG

from airflow.models import DAG
from airflow.operators.dummy_operator import DummyOperator
from datetime import datetime

args = {
    'owner': 'airflow',
    'depends_on_past': False
}


dag = DAG(
    dag_id='dummy',
    default_args=args,
    catchup=False,
    start_date=datetime(2018, 5, 1, 0, 0),
    schedule_interval='0 0 * * *'
)

t1 = DummyOperator(
    task_id='dummy',
    dag=dag
)

I unpaused the DAG at 2018/5/8 06:20. I expected the DAG to be triggered at 2018/5/9 00:00.
However, it ran at 2018/5/8 06:20, immediately after I unpaused it.

Regards,
John

ri...@clarivoy.com

May 9, 2018, 9:43:16 AM
to cloud-composer-discuss
This has been unexpected behavior in Airflow for a long time. I was hoping the Google team would finally fix it as well.

The other setting Airflow often ignores is 'max_active_runs': 1 in the DAG's default_args. So if you unpause your job shortly before the next scheduled run starts, you'll often get two runs going at once.

Both seem like fundamental features of any job scheduler, but they have been broken in Airflow for a couple of years now.

John Cheng

May 9, 2018, 10:06:37 AM
to cloud-composer-discuss
I tried with the command
echo "airflow unpause mydag" | at 00:00
and it triggered the DAG twice.


On Wednesday, May 9, 2018 at 9:43:16 PM UTC+8, ri...@clarivoy.com wrote:

Wilson Lian

May 11, 2018, 9:03:26 PM
to John Cheng, cloud-composer-discuss
The catchup behavior is documented in the Airflow documentation: https://airflow.apache.org/scheduler.html?highlight=backfill#backfill-and-catchup
Having been designed for ETL, Airflow's objective is to ensure that the data generated in every completed interval since the start_date is processed by the DAG. When you unpause a DAG, the scheduler sets about achieving this objective in one of two ways:

- catchup=True: Airflow generates a DAG run for every completed interval between the start_date and the time you unpause your DAG. It will ignore intervals that have already been processed (as evidenced by a DAG run in the Airflow metadata DB). The DAG is expected to only process data for a single interval, hence the need to spawn a DAG run for every interval.
- catchup=False: Airflow generates a single DAG run for the most-recently completed interval. The DAG is assumed to have backfill logic integrated into it, meaning that it can potentially process more than one interval's worth of data.
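To make the two modes concrete, here is a small illustrative sketch (not Airflow's actual scheduler code, just a model of the behavior described above) showing which execution dates get DAG runs when a daily DAG is unpaused, using John's dates:

```python
from datetime import datetime, timedelta

def runs_created_on_unpause(start_date, interval, unpause_time, catchup):
    """Sketch of which execution dates get DAG runs at unpause time.

    A run for the interval [T, T + interval) is only created once the
    interval has completed, i.e. once T + interval has passed.
    """
    completed = []
    t = start_date
    while t + interval <= unpause_time:
        completed.append(t)
        t += interval
    if not completed:
        return []  # no interval has completed yet; nothing to run
    # catchup=True: one run per completed interval.
    # catchup=False: only the most recently completed interval.
    return completed if catchup else completed[-1:]

day = timedelta(days=1)
start = datetime(2018, 5, 1)
unpaused = datetime(2018, 5, 8, 6, 20)

# catchup=True: seven runs, execution dates May 1 through May 7.
print(runs_created_on_unpause(start, day, unpaused, catchup=True))
# catchup=False: a single run for the May 7 interval, created immediately
# at unpause time (06:20), which is why the DAG appeared to fire early.
print(runs_created_on_unpause(start, day, unpaused, catchup=False))
```

This models why John saw a run at 06:20: the May 7 interval had already completed at May 8 00:00, so unpausing immediately created a run for it.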

If you want to delay the first run of your DAG, you'll need to set the start_date in the future. Note, however, that changing the start_date of an existing DAG will confuse Airflow; you'll need to assign a different DAG ID. (source)
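A minimal sketch of that workaround, adapting John's DAG (the new dag_id and the future date here are made up for illustration):

```python
from datetime import datetime
from airflow.models import DAG
from airflow.operators.dummy_operator import DummyOperator

# Hypothetical example: a new DAG ID plus a start_date in the future.
# With a daily schedule, the first run (for the 2018-06-01 interval) is
# created only after that interval completes at 2018-06-02 00:00, so
# unpausing early no longer triggers an immediate run.
dag = DAG(
    dag_id='dummy_v2',  # new ID, since changing start_date on 'dummy' would confuse Airflow
    start_date=datetime(2018, 6, 1),
    schedule_interval='0 0 * * *',
    catchup=False,
)

DummyOperator(task_id='dummy', dag=dag)
```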

The issue of max_active_runs was brought up recently on the Airflow dev mailing list. The parameter should be passed directly to the DAG constructor rather than being embedded in default_args. https://lists.apache.org/thread.html/890907ce8ddaf4e0f19ef6380dafd4fc77498a82e48733ff1a62223d@%3Cdev.airflow.apache.org%3E
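Applied to John's DAG, that would look like the following sketch: max_active_runs goes on the DAG constructor itself, because default_args only feeds parameters to tasks (operators), not to the DAG:

```python
from datetime import datetime
from airflow.models import DAG

dag = DAG(
    dag_id='dummy',
    start_date=datetime(2018, 5, 1),
    schedule_interval='0 0 * * *',
    catchup=False,
    max_active_runs=1,  # DAG-level setting: at most one active run at a time
)
```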

best,
Wilson

