backfill not behaving correctly?


r0ger

Apr 20, 2016, 9:27:21 PM4/20/16
to Airflow
Hi,

I noticed today that when I run a backfill, I see some very strange behavior.

This is the command I was running:
airflow backfill test_pipeline -s 2016-04-20 -e 2016-04-21

This is how my default_args look:

default_args = {
    'owner': 'airflow',
    'depends_on_past': False,
    'start_date': datetime.datetime.now() - datetime.timedelta(hours=1),
    'email': ['air...@airflow.com'],
    'email_on_failure': False,
    'email_on_retry': False,
    'retries': 2,
    'retry_delay': datetime.timedelta(minutes=5),
    # 'queue': 'bash_queue',
    # 'pool': 'backfill',
    # 'priority_weight': 10,
    # 'end_date': datetime(2016, 1, 1),
}

This is the DAG definition:
dag = DAG(
    'test_pipeline', 
    default_args=default_args, 
    schedule_interval=datetime.timedelta(minutes=60))
 
And I have 2 tasks, t0 and t1 (t0 -> t1).

So as you can see above, my schedule interval is HOURLY.

Confusion/Issue 1:
Now when I start the backfill, I was expecting the pipeline to start running from April 20th hour 00, then go to hour 01, 02, 03, 04 ... through April 20th hour 23, and stop after that.

But for some reason, the first job that ran had execution_date 04-21T00:00:00 (the 21st?? WHY???), then 04-20T10:00:00, then 04-20T05:00:00, and so on in some random ordering (I cannot explain this ordering, hence I'm here seeking help in understanding it).

Confusion/Issue 2:
Also, I have 2 tasks, t0 and t1 (t0 -> t1).
When I started my backfill, 25 t0 tasks were launched first (in the random order I described above).
Then the t1 tasks were launched (again in a random order, but not the same random order as t0's :) )

Sorry for complaining, I really don't mean to.
I am just trying to understand how backfill runs.

Confusion/Issue 3:
Also, can I schedule a backfill job for just one hour? This is one of the features I was hoping to get when we started using Airflow.


Thanks



r0ger

Apr 21, 2016, 12:25:16 PM4/21/16
to Airflow
Can anyone help me out here? I would really appreciate it.
(Sorry if the question sounds silly.)

Maxime Beauchemin

Apr 21, 2016, 3:15:08 PM4/21/16
to Airflow
Airflow assumes idempotent tasks that operate on immutable data chunks. It also assumes that all task instances (each task for each schedule) need to run.

If your tasks need to be executed sequentially, you need to tell Airflow: use the depends_on_past=True flag on the tasks that require sequential execution.
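
For example, a minimal sketch with the two tasks from this thread (assuming BashOperator with placeholder echo commands; only the task ids t0/t1 and the dag object come from the post above, so adapt to whatever operators you actually use):

from airflow.operators.bash_operator import BashOperator

t0 = BashOperator(
    task_id='t0',
    bash_command='echo extract',  # placeholder command, just for illustration
    depends_on_past=True,         # hour N only runs after hour N-1 of t0 succeeded
    dag=dag)

t1 = BashOperator(
    task_id='t1',
    bash_command='echo load',     # placeholder command, just for illustration
    depends_on_past=True,
    dag=dag)

t1.set_upstream(t0)  # t0 -> t1

With depends_on_past=True on both tasks, a backfill walks through the schedules in order instead of launching every ready task instance at once.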

3) The -s (--start_date) and -e (--end_date) backfill CLI params accept ISO-parseable dates, so you can pass specific timestamps there (`2016-01-01T00:03:00`).
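
For instance, to backfill a single hour of the hourly test_pipeline above, something like this should work (example timestamps only):

airflow backfill test_pipeline -s 2016-04-20T03:00:00 -e 2016-04-20T03:00:00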

Max

r0ger

Apr 21, 2016, 9:14:02 PM4/21/16
to Airflow
hmm.. depends_on_past=True did work for sequential execution.

The other issue that I am seeing is, if I do:
 airflow backfill test_pipeline -s 2016-01-01T00:03:00 -e 2016-01-01T00:03:00


The job runs successfully. 


Now if I try to run the same backfill job again, it doesn't run again.
So is backfill only for running the jobs that were never run in the past? Shouldn't I be able to re-run my old jobs between any start and end date?


Maxime Beauchemin

Apr 22, 2016, 12:01:28 AM4/22/16
to Airflow
`backfill` only fills in the blanks, so that an interrupted backfill or run can be completed. Check out `airflow clear` in the CLI (or clearing in the UI) to selectively define what should re-run. You can pick date ranges, specify a task_id regex, include upstream/downstream, limit the clearing to tasks in a certain state, ...
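
For example, to re-run a single hour of test_pipeline (flag names as in the 1.x CLI; check `airflow clear --help` for your version, and add -t with a task_id regex to target specific tasks):

airflow clear test_pipeline -s 2016-04-20T03:00:00 -e 2016-04-20T03:00:00
airflow backfill test_pipeline -s 2016-04-20T03:00:00 -e 2016-04-20T03:00:00

The clear resets the task instances in that range, and the backfill then sees them as blanks and runs them again.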

r0ger

Apr 22, 2016, 2:57:28 AM4/22/16
to Airflow
That works!
I feel Airflow has a lot of features that people can use but are not aware of. It can be a beast, kinda difficult to get a handle on initially. Thanks for your help, Maxime.