Airflow DAG job in running state but idle for a long time


jay...@oath.com

May 7, 2018, 12:39:08 PM
to cloud-composer-discuss
Hi,

I see an issue where our DAG is scheduled to run but is sitting idle. I know for sure that the job I am running won't take more than 5 minutes. Here are the logs that I see:

--------------------------------------------------------------------------------
Starting attempt 1 of 3
--------------------------------------------------------------------------------

[2018-05-07 10:49:45,442] {models.py:1427} INFO - Executing <Task(BigQueryOperator): validate_delete_14> on 2018-05-06 00:00:00
[2018-05-07 10:49:45,445] {base_task_runner.py:115} INFO - Running: ['bash', '-c', u'airflow run gdpr_delete_lite_video_hourly_dag_prod validate_delete_14 2018-05-06T00:00:00 --job_id 4256 --raw -sd DAGS_FOLDER/gdpr_delete_lite_video_hourly_dag_prod.py']


It's been 7 hours now and there is no progress after the above statement. This is happening to multiple jobs.


Thanks,
Jayesh

Wilson Lian

May 7, 2018, 7:20:11 PM
to Jayesh Shah, cloud-composer-discuss
Hi Jayesh,

Can you please follow the instructions below to connect to Flower and check whether the stuck tasks are arriving in the Celery queue?
  1. Install flower: gcloud beta composer environments update ENVIRONMENT_NAME --update-pypi-package "flower"
  2. Determine the Cloud Composer environment's Kubernetes Engine cluster
  3. Connect to the Kubernetes Engine cluster
  4. Select a worker (or scheduler) pod (matches regex "airflow-(worker|scheduler)-[-a-f0-9]+"): kubectl get pods
  5. Run flower on the selected worker pod: kubectl exec -it POD_NAME_FROM_ABOVE -c airflow-worker -- flower --broker=redis://airflow-redis-service:6379/0 --port=5555
  6. In a separate, parallel session, forward local port to flower: kubectl port-forward POD_NAME_FROM_ABOVE 5555
  7. Visit http://localhost:5555/ in your local browser.
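
For step 4, a small Python sketch of how the quoted regex could be used to pick candidate pods out of a list of names. The pod names here are hypothetical, and anchoring the pattern only at the start of the name (re.match) is an assumption about how the regex is meant to be applied:

```python
import re

# Pattern quoted in step 4 of the instructions above.
POD_PATTERN = re.compile(r"airflow-(worker|scheduler)-[-a-f0-9]+")


def select_airflow_pods(pod_names):
    """Return the names that start with a worker/scheduler pod prefix."""
    # re.match anchors at the start of the string only (an assumption).
    return [name for name in pod_names if POD_PATTERN.match(name)]


# Hypothetical names, as they might appear in the NAME column of `kubectl get pods`.
sample_names = [
    "airflow-worker-7c9d4f6b8-2fa4e",
    "airflow-scheduler-5f6d7c8b9-0a1b2",
    "airflow-redis-0",
    "airflow-sqlproxy-1234",
]
candidates = select_airflow_pods(sample_names)
```

Any name in candidates could then be used as POD_NAME_FROM_ABOVE in steps 5 and 6.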
Best,
Wilson

--
You received this message because you are subscribed to the Google Groups "cloud-composer-discuss" group.
To unsubscribe from this group and stop receiving emails from it, send an email to cloud-composer-discuss+unsubscri...@googlegroups.com.
To post to this group, send email to cloud-composer-discuss@googlegroups.com.
To view this discussion on the web visit https://groups.google.com/d/msgid/cloud-composer-discuss/c6dc5699-83b8-482e-bcc0-454852619835%40googlegroups.com.
For more options, visit https://groups.google.com/d/optout.

kevi...@umbocv.com

May 28, 2018, 4:07:57 AM
to cloud-composer-discuss
Trying to install flower and got an ERROR: (gcloud.beta.composer.environments.update) Error updating [projects/********/locations/us-central1/environments/********]: Operation [projects/********/locations/us-central1/operations/26429fcf-a9df-48b2-8bf6-c0fa8415e337] failed: BAD REQUEST

Any ideas?

Thanks



Wilson Lian

May 31, 2018, 12:14:43 AM
to kevi...@umbocv.com, cloud-composer-discuss
Hi Kevin,

I looked into the failed operation, and it seems like the image failed to build. I'd check 2 things:
1) Ensure that your default compute engine service account PROJECT_NUM...@developer.gserviceaccount.com has either the project Editor or project-level roles/cloudbuild.builds.builder role.
2) Check the Kubernetes Engine workloads tab (https://console.cloud.google.com/kubernetes/workload?project=PROJECT_ID) for a workload named composer-agent-26429fcf-a9df-48b2-8bf6-c0fa8415e337 and check its logs.

best,
Wilson


we...@betabrand.com

Jul 8, 2018, 10:53:45 PM
to cloud-composer-discuss
Hey Wilson,

I'm having the exact same problem as described by Wilson. When looking into Flower, it doesn't appear that my stuck tasks are appearing in the celery queue. How should I proceed next to resolve this issue?

Wells

we...@betabrand.com

Jul 8, 2018, 10:55:01 PM
to cloud-composer-discuss
I meant as described by Jayesh :)

Feng Lu

Jul 9, 2018, 3:00:45 AM
to we...@betabrand.com, cloud-composer-discuss
Hi Wells,

The problem is caused by abrupt termination or restart of an Airflow worker pod. Here is the detailed error sequence:
1. The Celery backend hands out a task (say foo) to one of the Celery workers inside an Airflow worker pod for execution.
2. The Celery worker executes task foo as a LocalTaskJob.
3. The LocalTaskJob, which periodically sends heartbeat pings to the database, spawns another process to run the actual operator (e.g., the foo operator).
4. Because a resource limit is exceeded (possibly OOM), this particular Airflow pod gets restarted at this point. That is also why the task log terminates prematurely.

Task foo is now in a very unfortunate state: Celery thinks that the Airflow worker pod is still executing foo, but the pod has just been restarted and has lost all process state.
Therefore one has to wait for the "visibility timeout" (which defaults to 6 hours in Airflow) to expire before task foo gets reassigned to another worker.
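
To make the timing concrete, here is a toy in-memory model of a broker with a visibility timeout. It is only a sketch of the behaviour described above, not Celery's or Redis's actual implementation; the ToyBroker class and its method names are made up for illustration:

```python
class ToyBroker:
    """Toy model of a message broker with a visibility timeout (illustrative only)."""

    def __init__(self, visibility_timeout):
        self.visibility_timeout = visibility_timeout  # seconds
        self.pending = []   # messages waiting to be handed to a worker
        self.unacked = {}   # message -> time (seconds) at which it was delivered

    def publish(self, message):
        self.pending.append(message)

    def deliver(self, now):
        """Hand out the next message, first restoring any whose timeout has expired."""
        for message, delivered_at in list(self.unacked.items()):
            if now - delivered_at >= self.visibility_timeout:
                del self.unacked[message]
                self.pending.append(message)
        if not self.pending:
            return None
        message = self.pending.pop(0)
        self.unacked[message] = now
        return message

    def ack(self, message):
        """Called by a worker that finished the message; a restarted pod never gets here."""
        self.unacked.pop(message, None)


SIX_HOURS = 6 * 60 * 60
broker = ToyBroker(visibility_timeout=SIX_HOURS)
broker.publish("foo")

first = broker.deliver(now=0)          # a worker receives foo, then its pod restarts
during = broker.deliver(now=5 * 60)    # five minutes later: nothing is handed out
after = broker.deliver(now=SIX_HOURS)  # once the timeout expires, foo is handed out again
```

In this model foo stays invisible to every other worker between the first delivery and the expiry of the timeout, which corresponds to the multi-hour idle period reported at the top of the thread.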

Possible solutions:
- minimize worker pod restart events by not exceeding resource limits (e.g., use more powerful VMs, reduce parallelism).
- we are working on a patch that detects this bad state and sends the task to celeryexecutor again for execution.
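
As a rough illustration of what detecting this bad state could look like (this is not the patch mentioned above, just a hypothetical sketch), one could flag tasks that are still marked as running but whose last heartbeat is older than some threshold. The function name, the input shape, and the threshold are all assumptions:

```python
from datetime import datetime, timedelta


def find_stale_tasks(running_tasks, now, max_heartbeat_age):
    """Return ids of running tasks whose latest heartbeat is older than max_heartbeat_age.

    running_tasks maps a task id to the datetime of its most recent heartbeat.
    """
    return sorted(
        task_id
        for task_id, last_heartbeat in running_tasks.items()
        if now - last_heartbeat > max_heartbeat_age
    )


# Hypothetical snapshot: one task stopped heartbeating hours ago, one is healthy.
now = datetime(2018, 5, 7, 17, 50, 0)
running = {
    "validate_delete_14": datetime(2018, 5, 7, 10, 49, 45),
    "healthy_task": datetime(2018, 5, 7, 17, 49, 30),
}
stale = find_stale_tasks(running, now, max_heartbeat_age=timedelta(minutes=5))
```

Tasks returned by such a check would be candidates for being sent to the executor again instead of waiting out the visibility timeout.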

Feng 


Jon Dugan

Jul 11, 2018, 5:12:19 PM
to cloud-composer-discuss


On Monday, July 9, 2018 at 2:00:45 AM UTC-5, Feng Lu wrote:
Possible solutions:
- minimize worker pod restart events by not exceeding resource limits (e.g., use more powerful VMs, reduce parallelism).
- we are working on a patch that detects this bad state and sends the task to celeryexecutor again for execution.

Hi Feng,

I'm really glad that there is work underway to address this issue!

It took me several hours to track this down when I ran into it, but I learned a lot!

Anyway, glad to hear this issue is being worked.

Thanks,

Jon 

Terrence Szymanski

Jul 11, 2018, 9:32:12 PM
to cloud-composer-discuss
We've also been struggling with this and it's hard to debug because of the limited log output. Definitely looking forward to that patch.
Terry

ala...@umusic.com

Jul 17, 2018, 12:33:29 AM
to cloud-composer-discuss
Thanks, Feng, for confirming the issue is being worked on. We ran into the same problem with parallel tasks hanging. For now, we have bumped our Composer cluster machine type to keep going. Looking forward to the patch rolling out soon!

Conrad Lee

Oct 3, 2018, 8:46:30 AM
to cloud-compo...@googlegroups.com
Does anyone know how to configure Celery's "visibility timeout" in Cloud Composer? I searched the web for this but didn't find anything.


rajashekar...@figmd.com

Feb 18, 2019, 1:43:48 AM
to cloud-composer-discuss
Hello all,

I am facing the same problem. Has it been resolved?

If it has, could you explain how to resolve it?

Skyler Slade

Feb 18, 2019, 8:58:38 AM
to rajashekar...@figmd.com, cloud-composer-discuss
This happened to us. We couldn’t figure out how to fix it and we were up against a deadline, so we just bailed on Cloud Composer and ran Airflow ourselves. We haven’t had any problems since then.

We suspect the problem was caused by Composer using Redis as its Celery result backend. If you look at the Composer logs, you’ll see it logging warnings about this choice. On our self-hosted Airflow, we use PostgreSQL for our result backend.

Skyler Slade

Feb 18, 2019, 11:15:19 AM
to Raj Shekar, cloud-composer-discuss
I don't believe so. This setting is controlled via the celery celery_result_backend config option which appears to be one of the blocked config options: https://cloud.google.com/composer/docs/concepts/airflow-configurations.

I'm not sure that this was the problem, but regardless, it's definitely not the recommended way to run Airflow or Celery. The Airflow docs specifically say "Make sure to use a database backed result backend" -- https://airflow.readthedocs.io/en/stable/howto/executor/use-celery.html.

On Mon, Feb 18, 2019 at 9:41 AM Raj Shekar <sheka...@gmail.com> wrote:
Hi Skyler,

Thanks for your response...

"On our self-hosted Airflow, we use PostgreSQL for our result backend": does that mean the problem was solved once you installed Airflow on your own server?

Is there any way to change Celery to use Postgres as the result backend?

Thanks,
Rajashekar.




--
Skyler Slade
Director of Site Reliability
SharpSpring Marketing Automation
T: +1 352-317-2537 E: sky...@sharpspring.com W: www.sharpspring.com
  

Raj Shekar

Feb 19, 2019, 4:19:55 AM
to cloud-composer-discuss
Thanks Skyler,

May I know how you solved the issue? Did you change the result backend to Postgres?

I am also thinking of changing the result backend to Postgres instead of Redis.
            broker_url = redis://airflow-redis-service:xxxx/0
            result_backend = redis://airflow-redis-service:xxxx/0

Sorry if this is a basic question; I started using Airflow only recently.

Thanks,
Rajashekar.

Skyler Slade

Feb 19, 2019, 9:22:10 AM
to cloud-composer-discuss
Raj,

My config is:

broker_url = redis://redis-host:6379/1
result_backend = db+postgresql://my-db-user:my-db-pass@postgres-host/airflow
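
For anyone comparing their own settings with the config above, here is a minimal Python sketch that checks whether a result backend URL is database-backed. It relies on the "db+" prefix that Celery uses for SQLAlchemy result backends; the helper name is made up for illustration:

```python
from urllib.parse import urlsplit


def is_database_result_backend(url):
    """Return True when the URL scheme carries the 'db+' (SQLAlchemy) prefix."""
    return urlsplit(url).scheme.startswith("db+")


# The PostgreSQL value is the one quoted above; the Redis value mirrors the
# broker-style URL given earlier in the thread for Composer.
postgres_backed = is_database_result_backend(
    "db+postgresql://my-db-user:my-db-pass@postgres-host/airflow"
)
redis_backed = is_database_result_backend("redis://airflow-redis-service:6379/0")
```

Here postgres_backed is True and redis_backed is False, which matches the difference between the self-hosted setup and the Composer defaults discussed in this thread.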

Feng Lu

Feb 20, 2019, 4:05:26 AM
to rajashekar...@figmd.com, cloud-composer-discuss
Hi Rajashekar,

It shouldn't be a problem anymore unless your task indeed runs for a long time. Could you please open a support case so we can inspect your environment details and task logs? 

Feng 


Feng Lu

Feb 20, 2019, 4:14:00 AM
to Skyler Slade, cloud-composer-discuss
Thank you Skyler for sharing your experience. Composer deploys redis as a StatefulSet app that preserves all key entries in persistent disk for durability. 

We haven't seen an Airflow DAG with idle task for a long time. Raj, if you could open a GCP support case, we would love to take a look.
Please feel free to PM me your support case number. 

Feng 


Skyler Slade

Feb 20, 2019, 9:00:30 AM
to Feng Lu, cloud-composer-discuss
> with idle task for a long time

This happened to me approximately 3-4 weeks ago, for about 200 DAGs. I wish I had more info to share, sorry. I have since destroyed the environment.

Raj Shekar

Feb 27, 2019, 1:14:30 AM
to Skyler Slade, Feng Lu, cloud-composer-discuss
Hi ALL,

Thanks, Feng, for your help and suggestions!

After I upgraded my machine type (increasing the RAM size), the problem was solved.


Thanks,
Rajashekar.
