AWX job terminated unexpectedly


Elia Curti
Oct 20, 2022, 9:41:43 AM
to AWX Project
Hello,
I'm having an issue with a job template on AWX that terminates unexpectedly.
AKS version: 1.23.12
AWX version: 21.6.0

It's a long automation, but it fails every time after about 31 minutes and roughly 7,000 executed tasks, as you can see from the attached picture.

[screenshot attachment: AWX_err.PNG]

No errors are displayed in the stdout pane of the AWX UI.

Running: 
kubectl logs awx-c7647dbf5-2mt88 -n awx -c awx-task | grep error

I see this error:
2022-10-20 11:24:53,613 WARNING  [a5fc6bd17e2146a883b741b0bd8a3bbd] awx.main.dispatch job 142359 (error) encountered an
error (rc=None), please see task stdout for details.


I've tried to get logs directly from automation pod:
kubectl logs automation-job-142359-nxxpb -n awx -f

The automation stops without any error output, and the automation pod is destroyed by AKS.
This is the last line before the crash:
{"uuid": "5fce5d17-9f63-4705-bf05-adf9b4b2a73b", "counter": 26487, "stdout": "", "start_line": 24454, "end_line": 24454, "runner_ident": "142359", "event": "runner_on_start", "job_id": 142359, "pid": 21, "created": "2022-10-20T11:19:39.658349", "parent_uuid": "02901256-661c-2a55-e472-000000001bc8", "event_data": {"playbook": "log_query_for_all_cids_from_appservers.yml", "playbook_uuid": "ef3e2675-daf8-4ac4-84db-9b797d3ffc6e", "play": "log query on tse db for all customers", "play_uuid": "02901256-661c-2a55-e472-000000000066", "play_pattern": "res_alyante_srv_grp_prod", "task": "set facts", "task_uuid": "02901256-661c-2a55-e472-000000001bc8", "task_action": "set_fact", "task_args": "", "task_path": "/runner/project/roles/common/exec_query_with_logs_on_win_vm/tasks/log_query.yml:2", "role": "exec_query_with_logs_on_win_vm", "host": "aly110-srv", "uuid": "5fce5d17-9f63-4705-bf05-adf9b4b2a73b"}}

What could the problem be? Why is the automation-job pod destroyed unexpectedly? Could it be something like a timeout after about 31 minutes?
Which component issues the command to delete the automation-job pod?

Thank you for your help.

Elia Curti
Oct 20, 2022, 11:14:05 AM
to AWX Project
I've done another test with a simple role like this:
- name: loop date for 60 min
  shell: |
    date
  register: result
  # result.stdout will never equal "LOOP", so the task keeps retrying
  # once per minute for the full 60 minutes
  until: result.stdout == "LOOP"
  retries: 60
  delay: 60

So it's not a timeout issue... the job ran fine for 60 minutes.
So what is the problem? The number of tasks?

Thank you for your help.

kurokobo
Oct 20, 2022, 2:46:32 PM
to awx-p...@googlegroups.com
Hi,

There seem to be huge logs, so I guess your issue might be caused by https://github.com/ansible/awx/issues/11338
Try increasing "containerLogMaxSizeMB" for your AKS nodes (default: 10 MB).
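On AKS this is part of the custom kubelet configuration, which (as far as I know) can only be applied when a node pool is created. A rough sketch with the Azure CLI — the resource group, cluster, and pool names below are placeholders:

```shell
# containerLogMaxSizeMB caps a container's log file size before kubelet
# rotates it; containerLogMaxFiles caps how many rotated files are kept.
cat > linuxkubeletconfig.json <<'EOF'
{
  "containerLogMaxSizeMB": 100,
  "containerLogMaxFiles": 5
}
EOF

# Kubelet config is applied at node pool creation time
az aks nodepool add \
  --resource-group myResourceGroup \
  --cluster-name myAKSCluster \
  --name awxjobs \
  --kubelet-config ./linuxkubeletconfig.json
```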

Regards,

You received this message because you are subscribed to the Google Groups "AWX Project" group.
To unsubscribe from this group and stop receiving emails from it, send an email to awx-project...@googlegroups.com.
To view this discussion on the web visit https://groups.google.com/d/msgid/awx-project/40b0903b-b78e-49c7-af61-b14e3b166fffn%40googlegroups.com.

Elia Curti
Oct 21, 2022, 6:17:31 AM
to AWX Project
You're right!!! I don't know how much time I wasted looking for a solution :(

I've increased it from 50 MB (the default in my AKS node pool) to 100 MB and the job template completed without crashing.

To verify the AKS node pool configuration:
kubectl proxy
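For reference, once `kubectl proxy` is running, the kubelet's effective configuration can be read from the node's `configz` endpoint — a sketch against a live cluster, with the node name picked arbitrarily:

```shell
# Terminal 1: proxy the Kubernetes API to localhost
kubectl proxy --port=8001 &

# Terminal 2: dump the effective kubelet config for the first node
# and filter for the log rotation setting
NODE=$(kubectl get nodes -o jsonpath='{.items[0].metadata.name}')
curl -s "http://localhost:8001/api/v1/nodes/${NODE}/proxy/configz" \
  | grep -o '"containerLogMaxSize[^,]*'
```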
The question now is... which is the right value for log rotation? 100, 200, 500?

Is there any chance this will be fixed in a future AWX release? This is only a workaround... it works, but it's not a stable solution.

Thank you for your help!

kurokobo
Oct 21, 2022, 6:40:46 PM
to awx-p...@googlegroups.com
Congrats :)

> The question now is... which is the right value for log rotation? 100, 200, 500?

I think there is no single right value. That setting affects not only the logs of the EE but the logs of every pod on the cluster, so it depends entirely on the size of your job's logs and the capacity of the node.

> Is there any chance this will be fixed in a future AWX release? This is only a workaround... it works, but it's not a stable solution.

The root cause is not in the AWX implementation, but in the behavior of the Kubernetes API.
I agree with you that this should be fixed, but sadly it seems it will take a long time, since that issue has been open for five years...

Regards,

Elia Curti
Oct 24, 2022, 4:32:40 AM
to AWX Project
I have an AKS cluster with 3 nodes dedicated to AWX, so node capacity should not be a problem. Maybe I will try 200 MB or 500 MB and hope that will be enough.

The very strange thing is that I'm migrating my automation from a very old AWX version (3.0.1.0) and the same templates worked without errors there. Is AWX 3.0.1.0 not affected by this AKS cluster issue?

Regards,
Elia

Elia Curti
Oct 25, 2022, 6:32:06 AM
to AWX Project
For a specific template with about 100,000 log lines I have increased the log rotation limit to 500 MB and now it's working.
How is it possible that the older AWX 3.0.1.0 doesn't have any issue with log rotation on AKS?

Thank you for your help.


kurokobo
Oct 25, 2022, 7:31:32 PM
to awx-p...@googlegroups.com
Hi,

> How is it possible that the older AWX 3.0.1.0 doesn't have any issue with log rotation on AKS?

The root cause of the issue is in the Kubernetes API, and recent AWX (18+) is designed to use that API to gather logs from the EE.
In my understanding, the older AWX 3.x requires no Kubernetes API to run a job: there is no EE, and your playbook runs directly under a virtualenv in the AWX container.

I think that's why this issue affects recent AWX only.

Regards,


Elia Curti
Oct 26, 2022, 4:59:00 AM
to AWX Project
Hi,
what you say makes sense.

Thanks for the explanation!

Elia Curti
Dec 2, 2022, 6:23:23 AM
to AWX Project
Hello,
I'm having a similar issue with a job terminating unexpectedly, but this time it is not due to the log rotation problem I solved last time on my AKS cluster.

In AWX I see the job (id: 144294) failed, but I can't view the job details; the AWX UI tries to load the log without success:

[screenshot attachment: 1.PNG]

So I checked directly in the awx-task container:

2022-12-02 01:15:09,647 INFO     [544931daa2964f35a9686d3004f8628b] awx.analytics.job_lifecycle job-144294 pre run
2022-12-02 01:15:09,861 INFO     [544931daa2964f35a9686d3004f8628b] awx.analytics.job_lifecycle job-144294 preparing playbook
2022-12-02 01:15:09,996 INFO     [544931daa2964f35a9686d3004f8628b] awx.analytics.job_lifecycle job-144294 running playbook
2022-12-02 01:15:10,332 INFO     [544931daa2964f35a9686d3004f8628b] awx.analytics.job_lifecycle job-144294 work unit id received
2022-12-02 01:15:10,370 INFO     [544931daa2964f35a9686d3004f8628b] awx.analytics.job_lifecycle job-144294 work unit id assigned
2022-12-02 02:01:11,660 INFO     [544931daa2964f35a9686d3004f8628b] awx.main.commands.run_callback_receiver Starting EOF event processing for Job 144294
2022-12-02 02:06:17,396 INFO     [544931daa2964f35a9686d3004f8628b] awx.analytics.job_lifecycle job-144294 post run
2022-12-02 02:06:55,686 INFO     [544931daa2964f35a9686d3004f8628b] awx.analytics.job_lifecycle job-144294 finalize run
2022-12-02 02:06:55,756 WARNING  [544931daa2964f35a9686d3004f8628b] awx.main.dispatch job 144294 (error) encountered an error (rc=None), please see task stdout for details.


in awx-ee container
ERROR 2022/12/02 02:06:20 Exceeded retries for reading stdout /tmp/receptor/awx-c7647dbf5-85g85/FF2xD2Oc/stdout

I've also checked stdout directly in the AWX DB, in the "awx_jobevent" table, but there are no errors for this job id... probably the pod was terminated, so I only found the last command executed.

Any ideas?
Thank you very much.

Elia

AWX Project
Dec 2, 2022, 1:17:16 PM
to AWX Project
Hi, is this a consistent failure you are seeing in your deployment? Based on the logs, it seems the job runs for about 45 minutes to completion.

For debugging purposes, you could set RECEPTOR_RELEASE_WORK in the Django settings (see below); then, after encountering this issue, check the receptor work unit by exec-ing into the awx pod's awx-ee container and cat-ing the /tmp/receptor/awx-c7647dbf5-85g85/FF2xD2Oc/status file.

```
extra_settings:
  - setting: RECEPTOR_RELEASE_WORK
    value: "False"
```
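For example, reusing the pod name and work unit directory from the log path above (they will differ in your deployment):

```shell
# Dump the receptor work unit status from the awx-ee container
# (pod name and work unit ID here are the ones from the logs in this thread)
kubectl exec -n awx awx-c7647dbf5-85g85 -c awx-ee -- \
  cat /tmp/receptor/awx-c7647dbf5-85g85/FF2xD2Oc/status
```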

If you provide that info we can help further, thanks!

AWX Team

ponnurangam h
Jul 14, 2023, 11:08:33 AM
to AWX Project
Hi, we had the same issue with EKS and fixed it by increasing the log size, but I want to know how you did it at the template level, so that we can apply it only to a particular template.

thanks in advance

AWX Project
Jul 14, 2023, 2:18:28 PM
to AWX Project
Increasing the log size is a k8s cluster-level setting (it will apply to all job pods launched) and cannot be set per AWX template. What reason did you have for wanting this at the template level?

AWX Team

ponnurangam h
Jul 14, 2023, 2:23:03 PM
to awx-p...@googlegroups.com
Thanks for the clarity,

That is what we did for the target node using toleration and affinity rules, but our dev environment is rebooted overnight due to budget constraints, and this creates configuration inconsistencies.

This is the case with EKS.


Wei-Yen Tan
Jul 14, 2023, 3:36:05 PM
to awx-p...@googlegroups.com
I'm interested in this, and forgive my ignorance, but how did you increase the k8s log size?


ponnurangam h
Jul 14, 2023, 11:35:35 PM
to awx-p...@googlegroups.com
We modified the node-level kubelet configuration and raised containerLogMaxSize to 100 MiB from the default 10.

In addition, we made sure the automation pods only get created on that node, using toleration and node affinity rules in our deployment.

But the same setup in a different EKS region was fine, so we're still not sure why this is happening.


Elia Curti
Jul 17, 2023, 3:18:55 AM
to awx-p...@googlegroups.com
We also modified "containerLogMaxSize" from 10 to 500 MB in AKS 1.25.

The problem occurs less often but is not solved.

I'm confused about the RECEPTOR_KUBE_SUPPORT_RECONNECT flag in the awx spec... What value should be set in this case to prevent the issue? Enabled or disabled?

Thank you very much for help.

Elia



ponnurangam h
Jul 17, 2023, 3:25:17 AM
to awx-p...@googlegroups.com
We set it as disabled


AWX Project
Jul 19, 2023, 2:45:52 PM
to AWX Project
RECEPTOR_KUBE_SUPPORT_RECONNECT should be enabled if your k8s kubelet version supports it. See https://github.com/ansible/receptor/pull/683 for details on the supported versions.
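To my understanding this flag is passed to the EE container as an environment variable in the AWX resource spec, roughly like this (a sketch; verify the field name against your operator version):

```yaml
spec:
  ee_extra_env: |
    - name: RECEPTOR_KUBE_SUPPORT_RECONNECT
      value: enabled
```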

AWX Team

Elia Curti
Jul 19, 2023, 3:10:53 PM
to awx-p...@googlegroups.com
We are on AKS 1.25.6, so it should be supported.

By the way, enabling this flag didn't solve our issue with unexpected job terminations.

Elia


AWX Project
Jul 26, 2023, 2:52:58 PM
to AWX Project
Hello,
Could you provide us with any error messages you are seeing in your AWX EE container for the job that fails? We would also like to know approximately how many jobs out of 100 are failing this way. Thank you for the additional information!

-AWX Team

Elia Curti
Jul 27, 2023, 3:22:23 AM
to awx-p...@googlegroups.com
Hello,
I'll try to check the error messages the next time the issue happens.
The issue doesn't occur very often, maybe once every 2-3 months. But when it happens it's very critical for us, because jobs terminate unexpectedly and our automations stop at a random point, so it's difficult to recover the previous state.

Thank you very much for your help.

Elia

Elia Curti
Sep 25, 2023, 4:25:10 AM
to AWX Project
Hello,
it happened again, but this time I captured the errors from the awx-ee container, in case they help:

DEBUG 2023/09/22 23:29:40 Client connected to control service @
DEBUG 2023/09/22 23:29:41 [qhGyttod] RECEPTOR_KUBE_CLIENTSET_QPS:
DEBUG 2023/09/22 23:29:41 [qhGyttod] RECEPTOR_KUBE_CLIENTSET_BURST:
DEBUG 2023/09/22 23:29:41 [qhGyttod] Initializing Kubernetes clientset
DEBUG 2023/09/22 23:29:41 [qhGyttod] QPS: 100.000000, Burst: 1000
DEBUG 2023/09/22 23:29:41 Control service closed
DEBUG 2023/09/22 23:29:41 Client disconnected from control service @
DEBUG 2023/09/22 23:29:41 Client connected to control service @
ERROR 2023/09/22 23:29:42 Exceeded retries for reading stdout /tmp/receptor/awx-6497dd64b6-gxfsw/BJC2anMq/stdout
WARNING 2023/09/22 23:29:42 Could not read in control service: read unix /var/run/receptor/receptor.sock->@: use of closed network connection
DEBUG 2023/09/22 23:29:42 Client disconnected from control service @
WARNING 2023/09/22 23:29:42 Could not close connection: close unix /var/run/receptor/receptor.sock->@: use of closed network connection
DEBUG 2023/09/22 23:29:43 [qhGyttod] streaming stdout with reconnect support
DEBUG 2023/09/22 23:30:06 Sending service advertisement: &{awx-6497dd64b6-gxfsw control 2023-09-22 23:30:06.753464322 +0000 UTC m=+8587955.906785691 1 map[type:Control Service] [{local false} {kubernetes-runtime-auth false} {kubernetes-incluster-auth false}]}


Any ideas? Thank you very much for your help.

Elia