interplay of backfill start_date and dag file start_date


Daniel Kranowski

Oct 12, 2015, 5:17:04 PM
to Airflow
Using airflow release 1.5.1.
I have a dag whose start_date is set in the far future, meaning "don't run yet":

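from datetime import datetime, timedelta
from airflow import DAG

default_args = {
    'start_date': datetime(2025, 10, 1),  # far future, meaning "don't run yet"
    'schedule_interval': timedelta(days=1)
}
dag = DAG('my_dag', default_args=default_args)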

Then on Oct 9 I manually started the dag on the command-line:

airflow backfill my_dag -s 2015-10-08 -x


It ran once for ds=Oct 8, then to my surprise it ran again for Oct 9, Oct 10, and Oct 11.  A comment in cli.py for def backfill says "If only one date is passed, using same as start and end", so I'd expect the backfill to run for ds=Oct 8 and then end.  However, it appears that the backfill command has set up a permanent new schedule that is not shown anywhere: after I typed this command, the scheduler plans to keep running the dag, but this is not reflected in the dag python file or anywhere in the webserver ui.

So I have these questions:

1) Is this apparent lack of end_date a bug?

2) Is there any way to tell (via webserver ui, or command-line) the start/end_date window in force for a dag based on a prior backfill command?

3) Can you cancel an existing active backfill command, reverting the scheduler to honor only the start/end_dates specified in the dag python file?

Maxime Beauchemin

Oct 13, 2015, 12:03:58 AM
to Airflow
Actually the start_date is only used on the first run. Once a task has run once (through any means), the scheduler adds the schedule_interval to that latest date and moves forward. In your case I believe the scheduler, not the backfill itself, must have triggered the dates after 2015-10-08.
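The rule described above can be sketched in plain Python. This is a simplified illustration under stated assumptions, not Airflow's scheduler code: `next_execution_date` is a hypothetical helper, and the "nothing has run yet" case is reduced to returning start_date.

```python
from datetime import datetime, timedelta


def next_execution_date(start_date, schedule_interval, latest_execution_date=None):
    """Hypothetical helper: pick the next execution date as described in this thread."""
    if latest_execution_date is None:
        # No task instance has ever run, so start_date is the only reference point.
        return start_date
    # After any run (scheduled or backfill), start_date is no longer consulted.
    return latest_execution_date + schedule_interval


start_date = datetime(2025, 10, 1)
interval = timedelta(days=1)

# Before the backfill: nothing has run, so the far-future start_date applies.
first = next_execution_date(start_date, interval)

# After `airflow backfill my_dag -s 2015-10-08`: the latest run takes over.
after_backfill = next_execution_date(start_date, interval, datetime(2015, 10, 8))

print(first)
print(after_backfill)
```

Under these assumptions, the single backfilled run for 2015-10-08 is enough to move the next candidate date from 2025 back to 2015-10-09, which matches the behavior reported in the first post.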

Other ways to prevent the scheduler from running tasks include the `adhoc=True` task parameter (set a task to adhoc and the scheduler won't trigger it), pausing the DAG, or setting the `end_date` task param to a date in the past.

Max

Daniel Kranowski

Oct 13, 2015, 3:22:26 PM
to Airflow
Hi Max,

Thanks for the replies to all my posts yesterday.  If I understand what you are saying, the 'start_date': datetime(2025, 10, 1) parameter in my_dag.py no longer has any relevance after having run the command-line backfill.

However, why did the scheduler decide to start subsequent executions at 00:00:00 (midnight) instead of exactly 1 day (the schedule_interval) after the first run?  Here is a summary of my Task Instances:
  1. Start 10-09T23:55:02, End 10-10T00:05:22, Execution 10-08T00:00:00, from airflow backfill my_dag -s 2015-10-08 -x.
  2. Start 10-10T00:05:15, Execution 10-09T00:00:00, seemed to start automatically.
  3. Start 10-11T00:00:05, Execution 10-10T00:00:00.
  4. Start 10-12T00:00:05, Execution 10-11T00:00:00.  I trashed this one manually.
  5. Start 10-12T18:16:03, Execution 10-11T00:00:00, from airflow backfill my_dag -s 2015-10-11 -x.
  6. Start 10-13T00:00:03, Execution 10-12T00:00:00.
I guess #2 started as soon as the scheduler could become aware that a run was due for Oct 10, but why did numbers 3, 4, and 6 start nearly at 00:00:00 instead of exactly 1 day after the first run?


Maxime Beauchemin

Oct 15, 2015, 12:59:25 PM
to Airflow
schedule_interval is added to execution_date, not start_date.
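To make the arithmetic concrete, here is a rough sketch that checks it against task instances 3, 4, and 6 from the summary earlier in the thread. It assumes a run becomes due once execution_date plus schedule_interval has passed; `earliest_start` is a hypothetical helper, not an Airflow function.

```python
from datetime import datetime, timedelta

interval = timedelta(days=1)


def earliest_start(execution_date, schedule_interval):
    """Hypothetical helper: the wall-clock time at which a run becomes due."""
    return execution_date + schedule_interval


# (execution_date, observed start) for task instances 3, 4, and 6 in the thread.
observed = [
    (datetime(2015, 10, 10), datetime(2015, 10, 11, 0, 0, 5)),
    (datetime(2015, 10, 11), datetime(2015, 10, 12, 0, 0, 5)),
    (datetime(2015, 10, 12), datetime(2015, 10, 13, 0, 0, 3)),
]

# Delay between the moment each run became due and the moment it actually started.
delays = [start - earliest_start(execution, interval) for execution, start in observed]

for (execution, start), delay in zip(observed, delays):
    print(execution.date(), "due", earliest_start(execution, interval), "started", start, "delay", delay)
```

Because every execution_date here sits at midnight, adding one day lands on midnight again, so each run starts a few seconds after 00:00:00 regardless of when the original backfill was typed.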

Let me know if something specific is unclear in this part of the documentation:

Daniel Kranowski

Oct 15, 2015, 3:08:50 PM
to Airflow
Thanks, that explains it.

The Scheduler doc page does not mention that bit of schedule_interval arithmetic, or the other point above about how the start_date dag parameter is ignored after the first run.  I could take a crack at adding that to the scheduler.rst file.

In general I was confused by the date terminology.  It seems to me that there are two classes of dates in airflow:
  1. Actual execution time window
    1. The time when a dag or task is actually executing, typically a random-looking HH:MM:SS time.
    2. Corresponds to 'Start Date', 'End Date' in the webserver ui page 'Task Instances'; and 'Dttm' in the ui page 'Logs'
  2. Effective date, or Allowed date
    1. The date when a dag is allowed to start running, typically aligned to a day granularity with HH:MM:SS=00:00:00.
    2. Corresponds to 'Execution Date' in the Task Instances page; the 'ds' template variable; and the '-s' / '-e' command-line args for Start and End of the 'run' and 'backfill' window.  This disparity of names is the most confusing to me.
    3. schedule_interval is added to this date.
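The two classes of dates in the list above can be put side by side in a small sketch. The record type and its field names are hypothetical and chosen only for illustration (they are not Airflow's model); the values are those of task instance 3 from earlier in the thread.

```python
from collections import namedtuple
from datetime import datetime

# Hypothetical record; field names are illustrative, not Airflow's own model.
TaskInstanceDates = namedtuple("TaskInstanceDates", ["execution_date", "actual_start"])

ti = TaskInstanceDates(
    execution_date=datetime(2015, 10, 10),         # 'Execution Date' in the ui
    actual_start=datetime(2015, 10, 11, 0, 0, 5),  # 'Start Date' in the ui
)

# The 'ds' template variable is the execution date formatted as YYYY-MM-DD.
ds = ti.execution_date.strftime("%Y-%m-%d")

print(ds, ti.actual_start)
```

The execution date is the aligned, "effective" date that schedule_interval is added to, while the actual start is simply when the work happened to begin.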

