Incorrect task termination after hostname warning


wou...@poki.com

Mar 25, 2019, 10:52:22 AM
to cloud-composer-discuss
Hi,

Since the beginning of this month we have been experiencing issues in Cloud Composer where tasks are incorrectly marked as failed, even though the command the task is executing does not actually fail.

I noticed in the logs (just before a task is terminated) the following log lines:

WARNING - The recorded hostname airflow-worker-5d74ddb497-ck2wr does not match this instance's hostname airflow-worker-5d74ddb497-lvqt8

Immediately after this, Airflow terminates our task (I replaced the dag-name and task-name with placeholders):

INFO - Terminating descendant processes of ['/usr/bin/python /usr/local/bin/airflow run ##dag-name## ##task-name## 2019-03-22T02:00:00+00:00 --job_id 494 --raw -sd DAGS_FOLDER/poki-production-build.zip --cfg_path /tmp/tmpx6s5f4jf'] PID: 15223

The warning about the hostname can be found in this heartbeat function in Airflow.

The warning seems to indicate that (for some reason) the worker running the task switches, and the task then fails because the hostname of the new worker does not match the hostname of the worker that was originally assigned the task.
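
For context, my rough understanding of that check (paraphrased from the Airflow 1.10.x heartbeat callback, not the actual source) is something like:

# Rough paraphrase of the hostname check in the LocalTaskJob heartbeat
# callback in Airflow 1.10.x -- illustrative only, not the actual source.
from airflow.exceptions import AirflowException
from airflow.utils.net import get_hostname
from airflow.utils.state import State

def heartbeat_callback(task_instance, log):
    # Re-read the task instance from the metadata database.
    task_instance.refresh_from_db()

    current_hostname = get_hostname()
    if task_instance.state == State.RUNNING:
        if task_instance.hostname != current_hostname:
            # This is the warning we see in the logs; the job then raises,
            # and the descendant task processes get terminated.
            log.warning(
                "The recorded hostname %s does not match this instance's "
                "hostname %s", task_instance.hostname, current_hostname,
            )
            raise AirflowException("Hostname of job runner does not match")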

Has anyone else experienced something like this?

We are experiencing this on the latest image composer-1.5.2-airflow-1.10.1, but we also experienced this issue on earlier images (we upgraded to see if that fixed the issue).

Thanks,
Wouter

Shehzan Devani

Mar 26, 2019, 2:18:57 PM
to cloud-composer-discuss
Hi Wouter,

How often are you running into this? I wonder if it coincides with any pod evictions on your Kubernetes cluster.
Another thought is that the scheduler might be scheduling multiple task instances. As you can see here, when the task instance gets refreshed, the worker just looks up the first matching task instance, which may belong to another worker.

wou...@poki.com

Mar 26, 2019, 5:07:55 PM
to cloud-composer-discuss
Hi Shehzan,

I don't see any indication that pods are being evicted. The issue happens on almost every DAG run, and it started at the beginning of March. If it helps, I can provide more specific logs or additional data to give you better insight.

Best,
Wouter

Shehzan Devani

Mar 26, 2019, 6:03:07 PM
to wou...@poki.com, cloud-composer-discuss
You could check on the theory of the scheduler scheduling multiple task instances for the same task. Mind finding one of the failed tasks and running an ad hoc query to see if you find duplicates?
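
Something along these lines should work (a rough sketch; the dag_id and task_id are placeholders you would substitute with one of the failed tasks), run from somewhere that has the Airflow package and the metadata DB connection available:

# Hypothetical ad hoc check for duplicate task instance rows (Airflow 1.10.x).
# The dag_id / task_id values are placeholders -- substitute a failed task.
from airflow import settings
from airflow.models import TaskInstance

session = settings.Session()
rows = (
    session.query(TaskInstance)
    .filter(
        TaskInstance.dag_id == "my-dag",    # placeholder
        TaskInstance.task_id == "my-task",  # placeholder
    )
    .order_by(TaskInstance.execution_date)
    .all()
)
# More than one row for the same execution_date would indicate duplicates.
for ti in rows:
    print(ti.execution_date, ti.state, ti.hostname, ti.job_id, ti.pid)
session.close()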

If there aren't duplicates then maybe scheduler + worker logs will be useful. (Make sure there's no sensitive info in the logs :) )



--
Shehzan Devani
Software Engineer @ Google

wou...@poki.com

Mar 27, 2019, 7:43:28 AM
to cloud-composer-discuss
Hi Shehzan,

I queried the Airflow DB for one of the failing tasks, but could not find any duplicate entries for that task_id/execution_date combination.

I have exported a slice of the logs to Google Sheets here and highlighted the log line that refers to the task termination; the hostname warning is two lines above it. Any insight would be helpful.

Thanks again,
Wouter

wou...@poki.com

Mar 27, 2019, 9:26:35 AM
to cloud-composer-discuss
I investigated the multiple-task-instances angle further. In my previous post I mentioned that I queried the DB (specifically the task_instance table) and didn't see any duplicate entries. However, when I check the job table for entries related to our tasks around the timestamp when the task failed, I see these rows:

+------+-----------------+---------+--------------+----------------------------+----------------------------+----------------------------+----------------+------------------------------------+----------+
| id   | dag_id          | state   | job_type     | start_date                 | end_date                   | latest_heartbeat           | executor_class | hostname                           | unixname |
+------+-----------------+---------+--------------+----------------------------+----------------------------+----------------------------+----------------+------------------------------------+----------+
| 1403 | production-poki | success | LocalTaskJob | 2019-03-27 05:16:25.286102 | 2019-03-27 05:35:42.650791 | 2019-03-27 05:16:25.286120 | CeleryExecutor | airflow-worker-5d74ddb497-29f4c    | airflow  |
| 1402 | production-poki | success | LocalTaskJob | 2019-03-27 05:12:46.252704 | 2019-03-27 06:21:22.380016 | 2019-03-27 05:12:46.252720 | CeleryExecutor | airflow-worker-5d74ddb497-ck2wr    | airflow  |
| 1401 | production-poki | success | LocalTaskJob | 2019-03-27 05:10:08.055841 | 2019-03-27 05:28:15.305873 | 2019-03-27 05:10:08.055857 | CeleryExecutor | airflow-worker-5d74ddb497-ck2wr    | airflow  |

Even though it doesn't mention the task_id, it might be interesting, as there were not many tasks running or scheduled at that moment.
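
One way to tie these job ids back to specific tasks (a rough sketch, assuming the task_instance table carries a job_id column in this Airflow version) would be:

# Hypothetical lookup tying the job rows above back to tasks via
# TaskInstance.job_id (Airflow 1.10.x); the job ids are the ones listed above.
from airflow import settings
from airflow.models import TaskInstance

session = settings.Session()
for ti in (
    session.query(TaskInstance)
    .filter(TaskInstance.job_id.in_([1401, 1402, 1403]))
    .all()
):
    print(ti.job_id, ti.dag_id, ti.task_id, ti.execution_date, ti.hostname, ti.state)
session.close()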

Additionally, I see three scheduler processes running at the moment:

$ ps aux | grep scheduler
airflow        1  0.0  0.0  19780  3176 ?        Ss   13:09   0:00 /bin/bash /var/local/airflow.sh scheduler
airflow       35  6.0  1.8 225152 71332 ?        S    13:09   0:04 /usr/bin/python /usr/local/bin/airflow scheduler -r 600
airflow      257 18.0  1.8 239492 68592 ?        R    13:10   0:00 /usr/bin/python /usr/local/bin/airflow scheduler -r 600
airflow      258 18.0  1.8 239492 68400 ?        R    13:10   0:00 /usr/bin/python /usr/local/bin/airflow scheduler -r 600

Is this to be expected?

Best,
Wouter

Shehzan Devani

Mar 28, 2019, 12:23:09 PM
to wou...@poki.com, cloud-composer-discuss
Hmm, that's interesting. Having multiple schedulers can definitely cause this issue. One hypothesis is that multiple schedulers scheduled this exact task with the same (dag_id, task_id, execution_date) and selected different workers to run it. The workers started the task, but when they refreshed their task instance from the database, they likely selected the wrong one (since the query is not strict enough). This caused the exception to be raised.

This leaves a couple of lingering questions:
  1. Does this exception cause the task to be marked as failed? Perhaps you can look at Airflow's task instance log to see if it indicates this or not. If this is the case, I wonder if this is the correct behavior or if we're just in an undefined state by having multiple schedulers. If this isn't the case, something else is causing the task failure.
  2. Why would there be multiple schedulers running? Can you try running kubectl get pods to make sure there's only one scheduler pod running?
Just to rule out the possibility that something is wrong with the DAG: does this happen consistently to one task, or does it sporadically hit multiple tasks?



--
Shehzan Devani
Software Engineer @ Google


wou...@poki.com

Mar 29, 2019, 6:52:27 AM
to cloud-composer-discuss
To answer your questions:
  1. The task is marked as failed and retried until the maximum number of retries is reached, after which it is permanently failed.
  2. I see that there is a single scheduler process that runs continuously with a fixed PID, and that the other two scheduler processes continuously restart with different PIDs. Perhaps this is the main scheduler process launching subprocesses? I'm not sure if this is expected behavior.
It does not happen consistently with one particular task, but long-running tasks seem more prone to hitting this issue. The failing task is a BashOperator task that can take a few hours.
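
For reference, a minimal sketch of a task like this (all ids, the command, the schedule, and the retry settings below are placeholders, not our actual DAG):

# Minimal illustrative sketch of a long-running BashOperator task with retries
# (Airflow 1.10.x). All ids, the command, and the settings are placeholders.
from datetime import datetime, timedelta

from airflow import DAG
from airflow.operators.bash_operator import BashOperator

dag = DAG(
    dag_id="my-dag",                      # placeholder
    start_date=datetime(2019, 3, 1),
    schedule_interval="0 2 * * *",        # placeholder daily schedule
)

long_running_task = BashOperator(
    task_id="my-long-task",               # placeholder
    bash_command="sleep 14400",           # stand-in for a build that runs for hours
    retries=2,                            # retried until retries are exhausted
    retry_delay=timedelta(minutes=5),
    dag=dag,
)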