Tasks are staying in scheduled forever and not running, and pods keep crashing


eric....@farfetch.com

Dec 26, 2018, 4:08:35 PM
to cloud-composer-discuss

I'm trying to test out Cloud Composer for my organization, but none of the DAGs or the tasks I've written will run at all.


I have written one extremely simple DAG, which I am sharing below:


from datetime import datetime, timedelta
import random

from airflow import DAG
from airflow.operators.python_operator import PythonOperator

default_args = {
    'owner': 'airflow',
    'depends_on_past': False,
    'email_on_failure': False,
    'email_on_retry': False,
    'queue': 'airflow',
    'start_date': datetime.today() - timedelta(days=2),
    'schedule_interval': None,
    'retries': 2,
    'retry_delay': timedelta(seconds=15),
    'priority_weight': 10,
}


example_dag = DAG(
    'example_dag',
    default_args=default_args,
    schedule_interval=timedelta(days=1)
)


def always_succeed():
    pass


always_succeed_operator = PythonOperator(
    dag=example_dag,
    python_callable=always_succeed,
    task_id='always_succeed'
)


def might_fail():
    return 1 / random.randint(0, 1)


might_fail_operator = PythonOperator(
    dag=example_dag, python_callable=might_fail, task_id='might_fail'
)


might_fail_operator.set_upstream(always_succeed_operator)

This DAG should take well under a minute to run, but every instance, whether automatically scheduled or manually triggered from the web server, has been stuck in the `scheduled` status for over an hour. Screenshot below:

Screenshot from 2018-12-26 15-47-01.png



I've noticed that none of these tasks has a hostname listed in the table, but when I run the same DAG on my local machine, they do (it's the name of the docker image I'm running them in). This looks like a clue to what's wrong, but I don't know where to look further on that.

Task instance info for each of the stuck tasks (all of the tasks) lists this as the reason:

All dependencies are met but the task instance is not running. In most cases this just means that the task will probably be scheduled soon unless:
- The scheduler is down or under heavy load
- The following configuration values may be limiting the number of queueable processes: parallelism, dag_concurrency, max_active_dag_runs_per_dag, non_pooled_task_slot_count

If this task instance does not start soon please contact your Airflow administrator for assistance.
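
For readers who want to check the four limits named in that message, here is a minimal sketch that reads them from an Airflow-style config fragment using only the standard library. It rests on assumptions: the keys are taken to live in the `[core]` section as in an Airflow 1.10-style `airflow.cfg`, the run limit is assumed to be spelled `max_active_runs_per_dag` in the file (slightly different from the wording of the message), and every value shown is a placeholder rather than anything from this environment. The helper name `read_limits` is hypothetical. On Cloud Composer these settings are normally managed through the environment's configuration overrides rather than by editing the file directly.

```python
import configparser

# Placeholder fragment: section and key names are assumptions based on an
# Airflow 1.10-style airflow.cfg; the numbers are illustrative only.
SAMPLE_CFG = """
[core]
parallelism = 32
dag_concurrency = 16
max_active_runs_per_dag = 16
non_pooled_task_slot_count = 128
"""

# The four settings the scheduler message points at.
LIMIT_KEYS = [
    "parallelism",
    "dag_concurrency",
    "max_active_runs_per_dag",
    "non_pooled_task_slot_count",
]


def read_limits(cfg_text):
    """Parse the config text and return the four limits as integers."""
    parser = configparser.ConfigParser()
    parser.read_string(cfg_text)
    return {key: parser.getint("core", key) for key in LIMIT_KEYS}


limits = read_limits(SAMPLE_CFG)
for name, value in limits.items():
    print(f"{name} = {value}")
```

A value of 0 or a very small number for any of these would be worth a closer look, since it would limit how many task instances can be queued at once.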


Also, I am seeing the Kubernetes pods for the scheduler and worker failing constantly. They had failed 20 times today at last check, and a couple more times since I took these screenshots. There is no apparent problem in the logs:


Screenshot from 2018-12-26 15-47-27.png


Screenshot from 2018-12-26 15-46-43.png


Thanks for your help.
- Eric

Feng Lu

Jan 1, 2019, 1:29:13 PM
to eric....@farfetch.com, cloud-composer-discuss
It looks like the Airflow workers aren't starting successfully. Could you please check that:
- your GKE cluster is live and healthy;
- the service account used to run the Composer environment has the right IAM roles (e.g., composer.worker).

Feng  


eric....@farfetch.com

Jan 10, 2019, 4:40:33 PM
to cloud-composer-discuss
It looks like the issue was that the `start_date` was being computed dynamically in the `default_args` dict (`datetime.today() - timedelta(days=2)`). We removed that and the issue went away.
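
A minimal sketch of that kind of change, assuming a fixed date in the past is acceptable for this DAG. The specific date, the constant name `STATIC_START_DATE`, and the helper `dynamic_start_date` are illustrative and not from the original post. The second half only demonstrates why the original expression is fragile: the DAG file is re-parsed repeatedly, so a `start_date` derived from the current time is different on every parse, which the Airflow FAQ advises against.

```python
from datetime import datetime, timedelta

# Assumption: any fixed past date works here; this one is arbitrary.
STATIC_START_DATE = datetime(2018, 12, 24)

default_args = {
    'owner': 'airflow',
    'depends_on_past': False,
    'start_date': STATIC_START_DATE,  # same value on every parse of the file
    'retries': 2,
    'retry_delay': timedelta(seconds=15),
    # ... remaining keys as in the original post
}


def dynamic_start_date(now):
    """Value the original expression yields when the file is parsed at `now`."""
    return now - timedelta(days=2)


# Two parses of the same file, one hour apart (illustrative timestamps).
parse_1 = datetime(2018, 12, 26, 15, 0, 0)
parse_2 = datetime(2018, 12, 26, 16, 0, 0)

drift = dynamic_start_date(parse_2) - dynamic_start_date(parse_1)
print("dynamic start_date moved by:", drift)
print("static start_date stays at:", default_args['start_date'])
```

The resulting `default_args` dict would be passed to `DAG(...)` exactly as in the original file.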

BTW - it seems like Google Groups is bugged on Firefox for Ubuntu. When typing in the textarea, some keys are interpreted as actions outside the textarea. For example, when I type a "t", a new tab is opened as if I hit Ctrl+T. But not all letters work like this; "s" works fine and doesn't bring up the save page window.