Final task in a backfill is run over and over - caught in a weird failure loop


Lance N.

Mar 28, 2016, 2:40:47 PM
to Airflow
Here is a sample snippet from the end of the logs of a large backfill. It was 80 days with 30+ tasks per day. The final task failed with a weird error (Java stuff) and then the backfill command would restart it. I eventually marked the task 'failed' just to make this stop.

"ERROR - The airflow run command failed at reporting an error. This should not occur in normal circumstances."

Cheers,

Lance Norskog


[2016-03-28 08:09:20,006] {jobs.py:813} INFO - [backfill progress] waiting: 1 | succeeded: 2685 | kicked_off: 2686 | failed: 0 | wont_run: 0
[2016-03-28 08:09:25,007] {jobs.py:813} INFO - [backfill progress] waiting: 1 | succeeded: 2685 | kicked_off: 2686 | failed: 0 | wont_run: 0
[2016-03-28 08:09:30,005] {jobs.py:813} INFO - [backfill progress] waiting: 1 | succeeded: 2685 | kicked_off: 2686 | failed: 0 | wont_run: 0
[2016-03-28 08:09:35,006] {jobs.py:813} INFO - [backfill progress] waiting: 1 | succeeded: 2685 | kicked_off: 2686 | failed: 0 | wont_run: 0

[2016-03-28 08:09:35,017] {base_executor.py:34} INFO - Adding to queue: airflow run embulk_daily fatbird_deliveries 2016-01-28T00:00:00 --pickle 5 --local --pool db_fatbird-adhoc_host -s 2016-01-01T00:00:00
[2016-03-28 08:09:40,005] {celery_executor.py:62} INFO - [celery] queuing ('embulk_daily', 'fatbird_deliveries', datetime.datetime(2016, 1, 28, 0, 0)) through celery, queue=default
[2016-03-28 08:09:40,006] {jobs.py:813} INFO - [backfill progress] waiting: 1 | succeeded: 2685 | kicked_off: 2686 | failed: 0 | wont_run: 0
[2016-03-28 08:09:45,020] {jobs.py:813} INFO - [backfill progress] waiting: 1 | succeeded: 2685 | kicked_off: 2686 | failed: 0 | wont_run: 0
[2016-03-28 08:09:50,007] {jobs.py:799} ERROR - The airflow run command failed at reporting an error. This should not occur in normal circumstances. Task state is 'running',reported state is 'success'. TI is <TaskInstance: embulk_daily.fatbird_deliveries 2016-01-28 00:00:00 [running]>
[2016-03-28 08:09:50,007] {jobs.py:813} INFO - [backfill progress] waiting: 1 | succeeded: 2685 | kicked_off: 2686 | failed: 0 | wont_run: 0
[2016-03-28 08:09:55,007] {jobs.py:813} INFO - [backfill progress] waiting: 1 | succeeded: 2685 | kicked_off: 2686 | failed: 0 | wont_run: 0
[2016-03-28 08:10:00,007] {jobs.py:813} INFO - [backfill progress] waiting: 1 | succeeded: 2685 | kicked_off: 2686 | failed: 0 | wont_run: 0

r0ger

Apr 28, 2016, 8:35:47 PM
to Airflow

Similar error:

[2016-04-28 23:33:42,720] {jobs.py:799} ERROR - The airflow run command failed at reporting an error. This should not occur in normal circumstances. Task state is 'None',reported state is 'success'. TI is




My pipeline is now stuck. It doesn't run anymore.


r0ger

Apr 29, 2016, 3:00:19 AM
to Airflow
This is what my pipeline looks like:

import datetime

from airflow import DAG
from airflow.operators import BashOperator, DummyOperator

default_args = {
    'owner': 'airflow',
    'depends_on_past': False,
    'start_date': scheduling_start_date,    # defined elsewhere in the file
    'email': ['air...@airflow.com'],
    'email_on_failure': False,
    'email_on_retry': False,
    'retries': 2,
    'retry_delay': default_retries_delay,   # defined elsewhere in the file
}

for i in range(1, 100):
    dag = DAG("Pipeline_" + str(i), default_args=default_args,
              schedule_interval=datetime.timedelta(minutes=60))

    t0 = BashOperator(task_id='t0', bash_command='...', dag=dag)
    t1 = BashOperator(task_id='t1', bash_command='...', dag=dag)
    t2 = BashOperator(task_id='t2', bash_command='...', dag=dag)
    t3 = BashOperator(task_id='t3', bash_command='...', dag=dag)
    t4 = BashOperator(task_id='t4', bash_command='...',
                      depends_on_past=True, dag=dag)
    t5 = BashOperator(task_id='t5', bash_command='...',
                      depends_on_past=True, dag=dag)
    t6 = DummyOperator(task_id='t6', dag=dag)   # no-op join task

    # t0 -> t1 -> t2 -> t3 -> t6;  t0 -> t4 -> t6;  t0 -> t5 -> t6
    t0.set_downstream(t1)
    t1.set_downstream(t2)
    t2.set_downstream(t3)
    t3.set_downstream(t6)
    t0.set_downstream(t4)
    t4.set_downstream(t6)
    t0.set_downstream(t5)
    t5.set_downstream(t6)

Now when I do a backfill for a particular DAG:
airflow backfill  -s 2016-04-21T12:00:00 -e 2016-04-21T23:00:00 Pipeline_1

A bunch of t0s start simultaneously.

A bunch of t4s and t5s go to the QUEUED state.


Then about 16 tasks (among t0, t1, t2, t3) start running. At any point the count of running tasks is 16 or fewer (sometimes 10 or 12). I understand this limit can be controlled from airflow.cfg.
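(For reference, a cap of about 16 concurrent tasks is consistent with the default concurrency settings of Airflow from that era. A sketch of the relevant airflow.cfg entries — key names assumed from the 1.7-era default config, so double-check against your own file:)

```ini
[core]
# Global cap on task instances running at once across the whole installation.
parallelism = 32

# Cap on concurrent task instances per DAG (default 16, matching the
# observed ceiling above).
dag_concurrency = 16

[celery]
# Number of worker processes each Celery worker runs.
celeryd_concurrency = 16
```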


But after a few runs, my pipeline just won't progress:

[2016-04-29 06:56:43,013] {jobs.py:813} INFO - [backfill progress] waiting: 104 | succeeded: 112 | kicked_off: 100 | failed: 0 | wont_run: 0
[2016-04-29 06:56:48,010] {jobs.py:813} INFO - [backfill progress] waiting: 104 | succeeded: 112 | kicked_off: 100 | failed: 0 | wont_run: 0
[2016-04-29 06:56:53,014] {jobs.py:813} INFO - [backfill progress] waiting: 104 | succeeded: 112 | kicked_off: 100 | failed: 0 | wont_run: 0
[2016-04-29 06:56:58,010] {jobs.py:813} INFO - [backfill progress] waiting: 104 | succeeded: 112 | kicked_off: 100 | failed: 0 | wont_run: 0

I saw this error in the logs:

Traceback (most recent call last):
  File "/usr/local/bin/airflow", line 15, in <module>
    args.func(args)
  File "/usr/local/lib/python2.7/dist-packages/airflow/bin/cli.py", line 167, in run
    raise AirflowException(msg)
airflow.utils.AirflowException: DAG [Pipeline_DEVTEST_CDB_DEVTEST_00_B10C8DBE1CFA89C1F274B] could not be found in /usr/local/airflow/dags/pipeline.py
[2016-04-29 06:28:03,017] {jobs.py:799} ERROR - The airflow run command failed at reporting an error. This should not occur in normal circumstances. Task state is 'None',reported state is 'success'. TI is <TaskInstance: Pipeline_DEVTEST_CDB_DEVTEST_00_B10C8DBE1CFA89C1F274B.gbm-runner-PTypeConn 2016-04-22 00:00:00 [None]>
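(Side note on the "could not be found" error above: Airflow's DagBag discovers DAGs by scanning a DAG file's top-level globals, so DAGs generated in a loop must each be bound to a module-level name. A minimal sketch of that registration pattern, using a stand-in DAG class rather than the real airflow.models.DAG:)

```python
# Sketch: register loop-generated DAGs in the module's globals so that a
# scan of top-level names can find every one of them.
# `DAG` here is a placeholder class, NOT the real airflow.models.DAG.

class DAG:
    """Placeholder for airflow.models.DAG."""
    def __init__(self, dag_id):
        self.dag_id = dag_id

def build_dags(n):
    """Create n DAGs and bind each to a unique module-level name."""
    for i in range(1, n + 1):
        dag = DAG("Pipeline_%d" % i)
        # Without this line, each iteration rebinds the same local name
        # and only the last DAG stays reachable from module scope.
        globals()[dag.dag_id] = dag

build_dags(3)
print(sorted(name for name in globals() if name.startswith("Pipeline_")))
# prints ['Pipeline_1', 'Pipeline_2', 'Pipeline_3']
```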

I don't know what is happening.
My backfill just gets stuck, spitting out the same log line over and over -> [backfill progress] waiting: 104

Also, my main pipeline, which is supposed to run hourly, is stalled as well.

Please help. I really want to understand how serious a problem this is.

Is this related to the scheduler restarting after 5 runs (--num_runs)?

r0ger

Apr 29, 2016, 1:59:03 PM
to Airflow, d...@airflow.incubator.apache.org

d...@airflow.incubator.apache.org

Need your help to debug this ^

Chris Riccomini

Apr 29, 2016, 2:58:00 PM
to Airflow, d...@airflow.incubator.apache.org
Please subscribe to the dev@ mailing list. Sorry to make you jump through hoops--I know it's annoying--but it's for a good cause. ;)

This looks like a bug. I'm wondering if it's related to https://issues.apache.org/jira/browse/AIRFLOW-20. Perhaps the backfill is causing a mis-alignment between the dag runs, and depends_on_past logic isn't seeing the prior execution?

Chris Riccomini

Apr 29, 2016, 3:11:58 PM
to Airflow, d...@airflow.incubator.apache.org
Comments are happening on the dev@ mailing list. Suggest you switch over to that.