This is what my pipeline looks like:
import datetime
from airflow import DAG
from airflow.operators import BashOperator, DummyOperator

# scheduling_start_date and default_retries_delay are defined earlier in the file
default_args = {
    'owner': 'airflow',
    'depends_on_past': False,
    'start_date': scheduling_start_date,
    'email': ['air...@airflow.com'],
    'email_on_failure': False,
    'email_on_retry': False,
    'retries': 2,
    'retry_delay': default_retries_delay,
}
for i in range(1, 100):
    dag = DAG("Pipeline_" + str(i), default_args=default_args,
              schedule_interval=datetime.timedelta(minutes=60))

    # actual bash_command values omitted; t0-t5 are BashOperators, t6 is a DummyOperator
    t0 = BashOperator(task_id='t0', bash_command='...', dag=dag)
    t1 = BashOperator(task_id='t1', bash_command='...', dag=dag)
    t2 = BashOperator(task_id='t2', bash_command='...', dag=dag)
    t3 = BashOperator(task_id='t3', bash_command='...', dag=dag)
    t4 = BashOperator(task_id='t4', bash_command='...', depends_on_past=True, dag=dag)
    t5 = BashOperator(task_id='t5', bash_command='...', depends_on_past=True, dag=dag)
    t6 = DummyOperator(task_id='t6', dag=dag)

    # t0 fans out into three branches (t1 -> t2 -> t3, t4, t5) that all join at t6
    t0.set_downstream([t1, t4, t5])
    t1.set_downstream(t2)
    t2.set_downstream(t3)
    t3.set_downstream(t6)
    t4.set_downstream(t6)
    t5.set_downstream(t6)
Now when I do a backfill for a particular DAG:
airflow backfill -s 2016-04-21T12:00:00 -e 2016-04-21T23:00:00 Pipeline_1
A bunch of t0s start simultaneously.
A bunch of t4s and t5s go to the QUEUED state.
Then about 16 tasks (among t0, t1, t2, t3) start running. At any point the count of running tasks is 16 or fewer (sometimes 10 or 12). I understand this limit can be controlled from airflow.cfg.
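If I am reading the docs right, the relevant knobs live in the [core] section of airflow.cfg; the snippet below shows the stock defaults rather than my exact values:

[core]
# max task instances running concurrently across the whole installation
parallelism = 32
# max task instances running concurrently within a single DAG
dag_concurrency = 16

The default dag_concurrency of 16 would line up with the ~16 running tasks I observe.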
But after a few runs, my pipeline just won't progress:
[2016-04-29 06:56:43,013] {jobs.py:813} INFO - [backfill progress] waiting: 104 | succeeded: 112 | kicked_off: 100 | failed: 0 | wont_run: 0
[2016-04-29 06:56:48,010] {jobs.py:813} INFO - [backfill progress] waiting: 104 | succeeded: 112 | kicked_off: 100 | failed: 0 | wont_run: 0
[2016-04-29 06:56:53,014] {jobs.py:813} INFO - [backfill progress] waiting: 104 | succeeded: 112 | kicked_off: 100 | failed: 0 | wont_run: 0
[2016-04-29 06:56:58,010] {jobs.py:813} INFO - [backfill progress] waiting: 104 | succeeded: 112 | kicked_off: 100 | failed: 0 | wont_run: 0
I saw this error in the logs:
Traceback (most recent call last):
  File "/usr/local/bin/airflow", line 15, in <module>
    args.func(args)
  File "/usr/local/lib/python2.7/dist-packages/airflow/bin/cli.py", line 167, in run
    raise AirflowException(msg)
airflow.utils.AirflowException: DAG [Pipeline_DEVTEST_CDB_DEVTEST_00_B10C8DBE1CFA89C1F274B] could not be found in /usr/local/airflow/dags/pipeline.py
[2016-04-29 06:28:03,017] {jobs.py:799} ERROR - The airflow run command failed at reporting an error. This should not occur in normal circumstances. Task state is 'None',reported state is 'success'. TI is <TaskInstance: Pipeline_DEVTEST_CDB_DEVTEST_00_B10C8DBE1CFA89C1F274B.gbm-runner-PTypeConn 2016-04-22 00:00:00 [None]>
I don't know what is happening.
My backfill just gets stuck, spitting out the same log line over and over: [backfill progress] waiting: 104.
Also, my main pipeline, which is supposed to run hourly, is stalled as well.
Please help. I really want to understand how serious a problem this is.
Is this related to the scheduler restarting after 5 runs (the --num_runs option)?
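For context, I believe the scheduler is launched with something along these lines (the exact wrapper script differs; the --num_runs value is the part I am asking about), which makes it exit and get restarted after five scheduling loops:

airflow scheduler --num_runs 5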