Task fails without any logs -> "AirflowException: Celery command failed" displayed in the Flower dashboard


thibault clement

Jul 18, 2018, 10:27:13 PM
to cloud-composer-discuss
Hi,

Sometimes, at random, one of my tasks fails without any logs.

Even though my task is configured with 3 retries, it seems that no retry was attempted.
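
For reference, the retries are configured on the operator itself, roughly like this (the DAG and task ids are illustrative, not my real ones):

    from datetime import datetime, timedelta

    from airflow import DAG
    from airflow.operators.bash_operator import BashOperator  # Airflow 1.x import path

    dag = DAG(
        "my_dag",  # illustrative DAG id
        start_date=datetime(2018, 7, 1),
        schedule_interval="@hourly",
    )

    task = BashOperator(
        task_id="my_task",  # illustrative task id
        bash_command="echo run",
        retries=3,  # 3 retries configured on the task
        retry_delay=timedelta(minutes=5),
        dag=dag,
    )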

Opening the Flower web dashboard and picking one of the failed tasks, I was able to see the following exception:

AirflowException: Celery command failed

Also, all these failures seem to occur on the same worker.

The issue with this exception is that it seems to kill the task definitively, and no retry seems to be performed.

Do you have any idea how I can avoid this? It makes my DAGs very unstable.

Thanks,

Thibault

jeremy.w...@everoad.com

Jul 20, 2018, 8:25:56 AM
to cloud-composer-discuss
Hi,

Same issue on my side.

Also, with this kind of failure you are not notified by mail (because the task did not fail, it simply never ran!), which makes you pretty blind when you are launching scripts late in your DAG.
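
For context, our mail alerting relies on the standard failure hooks, roughly like this (the address, ids, and callback are illustrative); these hooks only fire when a task instance actually runs and ends up failed, which is why a task that never starts produces no mail:

    from datetime import datetime

    from airflow import DAG
    from airflow.operators.bash_operator import BashOperator


    def notify_failure(context):
        # Illustrative custom hook: only invoked when a task instance
        # actually runs and is marked failed, never for a task that
        # was killed before it started.
        print("Task failed: %s" % context["task_instance"])


    default_args = {
        "start_date": datetime(2018, 7, 1),
        "email": ["alerts@example.com"],  # illustrative address
        "email_on_failure": True,
        "on_failure_callback": notify_failure,
    }

    dag = DAG("my_dag", default_args=default_args, schedule_interval="@daily")

    task = BashOperator(task_id="late_script", bash_command="echo run", dag=dag)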

I already tried getting the Kubernetes credentials, customizing the configmap, and deleting the pods several times (not while DAGs were running), but nothing seems to really work; I always have at least one task (and all of its linked downstream tasks) each hour which does not run properly.

If anyone has any tips, I will be glad to hear about them.

Jeremy

Feng Lu

Jul 31, 2018, 2:00:02 AM
to jeremy.w...@everoad.com, cloud-composer-discuss
This is the case where the Celery worker fails to execute the task. If you have configured task-level retries, it will very likely succeed the next time.
I would also recommend you try out our GA release, which includes a lot of stability-related fixes.


thibault clement

Jul 31, 2018, 2:15:56 AM
to fen...@google.com, jeremy.w...@everoad.com, cloud-compo...@googlegroups.com
Hi Feng,
Yes, I clearly notice more stability since moving to 1.0.0.
Thanks,
Thibault



jeremy.w...@everoad.com

Jul 31, 2018, 3:33:26 AM
to cloud-composer-discuss
Hi,

Same on my side with 1.0.0; this version is way more stable. However, I still have some failures, so I'll try configuring several retries.

Thanks.



thibault clement

Oct 29, 2018, 9:36:01 PM
to cloud-composer-discuss
Hi guys,

I still have the issue with a new DAG where a lot of tasks randomly fail without any logs and without any retries.
I say randomly because when I clear the failed tasks from the UI, they run again and some succeed while others fail (I can repeat this until no task fails...).

Looking at Stackdriver, I got these logs:

textPayload: "[2018-10-30 01:09:47,884] {jobs.py:1439} ERROR - Cannot load the dag bag to handle failure for <TaskInstance: TASK_NAME 2018-10-29 03:16:11.865389 [queued]>. Setting task to FAILED without callbacks or retries. Do you have enough resources?"

For information, I'm using the latest Composer version, composer-1.2.0-airflow-1.9.0, and my machine type is n1-highmem-2. Looking at the activity of my worker machines, it was clearly not an issue of resources; the machines were underutilized.

For information, retries are set at the DAG level but also at the task level, as sketched below. It doesn't change anything.
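
Concretely, it looks roughly like this (ids and commands are illustrative):

    from datetime import datetime, timedelta

    from airflow import DAG
    from airflow.operators.bash_operator import BashOperator

    # Retries at the DAG level: inherited by every task via default_args.
    default_args = {
        "start_date": datetime(2018, 10, 1),
        "retries": 3,
        "retry_delay": timedelta(minutes=5),
    }

    dag = DAG("my_dag", default_args=default_args, schedule_interval="@daily")

    # Retries repeated at the task level, overriding the inherited default.
    task = BashOperator(
        task_id="my_task",
        bash_command="echo run",
        retries=3,
        dag=dag,
    )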

Does anyone have any idea how to fix this?

Thanks,

Thibault

jeremy.w...@everoad.com

Nov 6, 2018, 9:01:01 AM
to cloud-composer-discuss
Same problem on my side; I just launched a new cluster updated to the latest release, composer-1.3.0-airflow-1.9.0.

I'm running several tests this afternoon, but for now my only solution is to follow each DAG and clear all the tasks in this state.

In my case, the only thing which could explain it is the launch of several tasks in parallel (between 10 and 12 at the same time), even though I still have a lot of resources on the cluster (watching kubectl top nodes during the DAG run). However, once the first wave has been launched, I have a sync task (waiting for all of these famous 10-12 first tasks), and then I relaunch 10-12 tasks, and those ones do not fail...
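
To make the shape concrete, the DAG looks roughly like this (ids and commands are illustrative):

    from datetime import datetime

    from airflow import DAG
    from airflow.operators.bash_operator import BashOperator
    from airflow.operators.dummy_operator import DummyOperator

    dag = DAG("my_dag", start_date=datetime(2018, 11, 1), schedule_interval="@hourly")

    # Sync point: runs only once every task in the first wave is done.
    sync = DummyOperator(task_id="sync", dag=dag)

    for i in range(12):
        # First wave: ~12 tasks launched in parallel; these are the
        # ones that sometimes fail without logs.
        first = BashOperator(task_id="first_%d" % i, bash_command="echo first", dag=dag)
        first.set_downstream(sync)

        # Second wave: ~12 more parallel tasks after the sync; these do not fail.
        second = BashOperator(task_id="second_%d" % i, bash_command="echo second", dag=dag)
        sync.set_downstream(second)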

I also face a new case where tasks do not have any logs but are now in the state 'up_for_retry', which is significantly better for my use case. But as this is not always what happens, and they can still end up just 'failed' without any logs, I need to follow each DAG run...

I would be glad to have some support :-) I can provide any data which could be relevant to solving this issue...

jeremy.w...@everoad.com

Nov 6, 2018, 9:05:03 AM
to cloud-composer-discuss
Sorry for double posting; just to say that I tried to override these values, but still nothing better:

    [core]
    dag_concurrency = 30
    parallelism = 45

    [scheduler]
    max_threads = 4

jeremy.w...@everoad.com

Nov 8, 2018, 3:29:09 AM
to cloud-composer-discuss
Just to let you know, it really seems to be a problem of resources. I tried to install a new cluster with n1-standard-4 machines (x3) instead of n1-standard-2, and it worked without any failure.

I think that when you are around 50-70% CPU and you trigger several DAGs which have several tasks in parallel at the same time, it fails.

Well, this is solved for me, but it's definitely a much higher cost than expected...

Good luck and regards,

Jeremy

thibault clement

Nov 12, 2018, 10:47:49 PM
to jeremy.w...@everoad.com, cloud-compo...@googlegroups.com
Thanks, Jeremy, for your tests and for letting us know that you managed to solve it by upgrading the worker machines.

I will try from my side.

Thibault


Bob Muscovite

Sep 25, 2020, 9:07:31 AM
to cloud-composer-discuss
This seems to occur in our case also, seemingly when jobs are scheduled to run on a Celery worker that is being terminated.