Hi,
I noticed today that when I run a backfill, I see some very strange behavior.
This is the command I ran:
airflow backfill test_pipeline -s 2016-04-20 -e 2016-04-21
This is how my default_args look:
default_args = {
    'owner': 'airflow',
    'depends_on_past': False,
    'start_date': datetime_obj.now() - datetime.timedelta(hours=1),
    'email_on_failure': False,
    'email_on_retry': False,
    'retries': 2,
    'retry_delay': datetime.timedelta(minutes=5),
    # 'queue': 'bash_queue',
    # 'pool': 'backfill',
    # 'priority_weight': 10,
    # 'end_date': datetime(2016, 1, 1),
}
This is the DAG definition:
dag = DAG(
    'test_pipeline',
    default_args=default_args,
    schedule_interval=datetime.timedelta(minutes=60))
I have 2 tasks, t0 and t1 (t0 -> t1).
As you can see above, my schedule interval is hourly.
Confusion/Issue 1:
When I started the backfill, I expected the pipeline to begin with the first hour of 20 April (00), then continue in order through hours 01, 02, 03, 04 and so on up to hour 23 of 20 April, and stop there.
Instead, the first job that ran had execution_date 04-21T00:00:00 (the 21st? Why?), then 04-20T10:00:00, then 04-20T05:00:00, and so on in what looks like a random order. I cannot explain this ordering, so I am here to seek help in understanding it.
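To make my expectation concrete, here is a small plain-Python sketch of the order I thought I would see. It does not use Airflow at all, and expected_execution_dates is a hypothetical helper I wrote for illustration. It assumes the -e end date is inclusive, which is only my guess based on seeing a 04-21T00:00:00 run.

```python
import datetime


def expected_execution_dates(start, end, interval):
    """Yield every schedule point from start to end (inclusive), oldest first."""
    current = start
    while current <= end:
        yield current
        current += interval


# Same range and interval as in my backfill command and DAG definition.
start = datetime.datetime(2016, 4, 20)
end = datetime.datetime(2016, 4, 21)
interval = datetime.timedelta(minutes=60)

dates = list(expected_execution_dates(start, end, interval))

# The order I expected: strictly ascending, one hour apart.
for execution_date in dates:
    print(execution_date.isoformat())
```

If the end date really is inclusive, that would explain why a 04-21T00:00:00 run exists, but not why it ran first.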
Confusion/Issue 2:
As mentioned, I have 2 tasks, t0 and t1 (t0 -> t1).
When I started my backfill, 25 t0 tasks were launched first (in the random order I described above).
Then the t1 tasks were launched (again in a random order, but not the same random order as the t0 tasks :) )
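To pin down what I mean, here is a small plain-Python sketch comparing the order I expected with roughly the order I observed. It does not use Airflow, and respects_dependencies is a hypothetical helper. My assumption (please correct me if it is wrong) is that with depends_on_past=False the only hard constraint is that t0 runs before t1 for the same execution_date.

```python
import datetime


def respects_dependencies(order):
    """Return True if t0 appears before t1 for every execution date."""
    seen_t0 = set()
    for execution_date, task_id in order:
        if task_id == "t0":
            seen_t0.add(execution_date)
        elif execution_date not in seen_t0:
            # A t1 ran before the t0 of the same execution date.
            return False
    return True


first_hour = datetime.datetime(2016, 4, 20)
hours = [first_hour + datetime.timedelta(hours=h) for h in range(25)]

# What I expected: finish t0 and t1 for one hour, then move to the next hour.
date_major = [(d, t) for d in hours for t in ("t0", "t1")]

# Roughly what I observed: every t0 first, then every t1.
task_major = [(d, t) for t in ("t0", "t1") for d in hours]

# An order that would clearly be wrong: t1 before its own t0.
broken = [(first_hour, "t1"), (first_hour, "t0")]

print(respects_dependencies(date_major))
print(respects_dependencies(task_major))
print(respects_dependencies(broken))
```

Under that assumption both the expected and the observed orders satisfy t0 -> t1, so maybe what I am seeing is allowed and simply not what I expected.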
Sorry for complaining; I really don't mean to.
I am just trying to understand how backfill runs.
Confusion/Issue 3:
Also, can I run a backfill for just 1 hour? This is one of the features I was hoping to get when we started using Airflow.
Thanks