AWX crashes when launching 7 concurrent jobs


Gregory Machin

Mar 3, 2023, 10:11:47 PM
to AWX Project
Hi,

I have AWX running on a VM with the following spec:
- 4 vCPU (reported as AMD EPYC 7402P 24-Core Processor)
- 16 GB of RAM
- 28 GB virtual disk (18 GB used, 8.7 GB free)

OS - Ubuntu 22.04.1 LTS
AWX - the About page reports the version as 21.4.0

When a large number of concurrent jobs are started at the same time, AWX crashes, with 404 or 502 errors in the browser. Sometimes it recovers and I can log in, but the jobs will have failed with "Task was marked as running but was not present in the job queue, so it has been marked as failed." Other times it doesn't respond at all and I reboot the server.

It feels like a resource issue, but I'm not sure where to look, as K3s is not an area I have much knowledge in.
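For reference, the only checks I have run so far are basic host-level ones on the VM itself (nothing k3s-specific):

```shell
# Basic resource checks on the VM hosting k3s
free -m     # memory and swap usage (the VM has 16 GB)
df -h /     # disk usage; k3s keeps images and container state under /var/lib/rancher by default
uptime      # load average relative to the 4 vCPUs
```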

What is the likely cause?

Gregory Machin

Mar 4, 2023, 12:01:34 AM
to AWX Project
Looks like the Redis container is having issues. I was following the logs; when I started the workflow template, the connection to the container was lost:

1:M 04 Mar 2023 04:40:29.124 * Background saving terminated with success
1:signal-handler (1677905080) Received SIGTERM scheduling shutdown...
1:M 04 Mar 2023 04:44:40.786 # User requested shutdown...
1:M 04 Mar 2023 04:44:40.786 * Saving the final RDB snapshot before exiting.
1:M 04 Mar 2023 04:44:40.795 * DB saved on disk
1:M 04 Mar 2023 04:44:40.795 * Removing the unix socket file.
1:M 04 Mar 2023 04:44:40.795 # Redis is now ready to exit, bye bye...
rpc error: code = NotFound desc = an error occurred when try to find container "76b43903e9dabc1f72e0f70b07e07e34ebc18520bf5f23ebdd0535a1d19b8f3a": not found
root@server:~# kubectl -n awx logs pod/awx-788749fb7f-vc9w5 -f
Defaulted container "redis" out of: redis, awx-web, awx-task, awx-ee, init (init)
unable to retrieve container logs for containerd://8ae010256adb598c6f842f821c4a960809a9c2e8dae37edde6d73e0e68f94cbd
root@server:~# kubectl -n awx logs pod/awx-788749fb7f-vc9w5 -f
Defaulted container "redis" out of: redis, awx-web, awx-task, awx-ee, init (init)
unable to retrieve container logs for containerd://8ae010256adb598c6f842f821c4a960809a9c2e8dae37edde6d73e0e68f94cbd
root@server:~# kubectl -n awx logs pod/awx-788749fb7f-z558t -f
Defaulted container "redis" out of: redis, awx-web, awx-task, awx-ee, init (init)
1:C 04 Mar 2023 04:49:55.581 # oO0OoO0OoO0Oo Redis is starting oO0OoO0OoO0Oo
1:C 04 Mar 2023 04:49:55.581 # Redis version=7.0.9, bits=64, commit=00000000, modified=0, pid=1, just started
1:C 04 Mar 2023 04:49:55.581 # Configuration loaded
1:M 04 Mar 2023 04:49:55.582 * monotonic clock: POSIX clock_gettime
1:M 04 Mar 2023 04:49:55.582 * Running mode=standalone, port=0.
1:M 04 Mar 2023 04:49:55.582 # Server initialized
1:M 04 Mar 2023 04:49:55.583 * The server is now ready to accept connections at /var/run/redis/redis.sock
1:signal-handler (1677905534) Received SIGTERM scheduling shutdown...
1:M 04 Mar 2023 04:52:14.604 # User requested shutdown...
1:M 04 Mar 2023 04:52:14.604 * Saving the final RDB snapshot before exiting.
1:M 04 Mar 2023 04:52:14.617 * DB saved on disk
1:M 04 Mar 2023 04:52:14.618 * Removing the unix socket file.
1:M 04 Mar 2023 04:52:14.619 # Redis is now ready to exit, bye bye...



root@server# kubectl -n awx get all
NAME                                                   READY   STATUS                   RESTARTS       AGE
pod/awx-788749fb7f-gvtv9                               0/4     ContainerStatusUnknown   45 (30d ago)   166d
pod/awx-788749fb7f-4l99n                               0/4     ContainerStatusUnknown   4              13d
pod/awx-788749fb7f-hmqdg                               0/4     ContainerStatusUnknown   2              3h44m
pod/awx-788749fb7f-f4xmj                               0/4     ContainerStatusUnknown   3              3h10m
pod/awx-788749fb7f-s8rr5                               0/4     ContainerStatusUnknown   3              103m
pod/awx-postgres-13-0                                  1/1     Running                  14 (26m ago)   166d
pod/awx-operator-controller-manager-7f89bd5797-lwjpx   2/2     Running                  23 (26m ago)   138d
pod/awx-788749fb7f-vc9w5                               0/4     ContainerStatusUnknown   5 (26m ago)    46m
pod/awx-788749fb7f-z558t                               0/4     ContainerStatusUnknown   4              12m
pod/awx-788749fb7f-qzzbg                               0/4     Pending                  0              4m46s

NAME                                                      TYPE        CLUSTER-IP      EXTERNAL-IP   PORT(S)    AGE
service/awx-operator-controller-manager-metrics-service   ClusterIP   10.43.88.44     <none>        8443/TCP   203d
service/awx-postgres-13                                   ClusterIP   None            <none>        5432/TCP   203d
service/awx-service                                       ClusterIP   10.43.137.182   <none>        80/TCP     203d

NAME                                              READY   UP-TO-DATE   AVAILABLE   AGE
deployment.apps/awx-operator-controller-manager   1/1     1            1           203d
deployment.apps/awx                               0/1     1            0           203d

NAME                                                         DESIRED   CURRENT   READY   AGE
replicaset.apps/awx-5d7b85bc77                               0         0         0       203d
replicaset.apps/awx-operator-controller-manager-7f89bd5797   1         1         1       203d
replicaset.apps/awx-788749fb7f                               1         1         0       200d

NAME                               READY   AGE
statefulset.apps/awx-postgres-13   1/1     203d
root@server:~#

I saw there was a new running instance, and that it then shut down ("Received SIGTERM scheduling shutdown...") and became "ContainerStatusUnknown".

Why would the Redis container shut down without any visible errors?

AWX Project

Mar 8, 2023, 2:50:41 PM
to AWX Project
What does kubectl describe on one of the ContainerStatusUnknown pods report?

Verify that the nodes these job pods were assigned to are healthy by running "kubectl get node".

We suspect the underlying nodes are unhealthy (memory issues, maybe) and that this is causing the pods to crash.

AWX Team

Gregory Machin

Mar 9, 2023, 7:55:12 PM
to awx-p...@googlegroups.com
Thanks for getting back to me,


I ran "kubectl -n awx delete deployment awx", which has cleared them from the list.

I then started 2 new jobs, each of which copies files to 2 servers. I have lost access to AWX, getting a gateway error, and am now getting "not found" on the jobs page.

ansible:~/awx-on-k3s/base# kubectl get node
NAME          STATUS   ROLES                  AGE    VERSION
gglvansible   Ready    control-plane,master   209d   v1.25.6+k3s1
ansible:~/awx-on-k3s/base#

root@ansible:~/awx-on-k3s/base# kubectl -n awx get all
NAME                                                   READY   STATUS                   RESTARTS       AGE
pod/awx-postgres-13-0                                  1/1     Running                  21 (85m ago)   171d
pod/awx-operator-controller-manager-68d6f576b4-7672r   2/2     Running                  0              79m
pod/automation-job-1157-x7rb4                          1/1     Running                  0              7m38s
pod/automation-job-1156-tvhjj                          1/1     Running                  0              7m40s
pod/awx-9668dcb98-nzg5q                                0/4     ContainerStatusUnknown   3              46m
pod/awx-9668dcb98-dh56c                                4/4     Running                  0              6m5s

NAME                                                      TYPE        CLUSTER-IP      EXTERNAL-IP   PORT(S)    AGE
service/awx-operator-controller-manager-metrics-service   ClusterIP   10.43.88.44     <none>        8443/TCP   209d
service/awx-postgres-13                                   ClusterIP   None            <none>        5432/TCP   209d
service/awx-service                                       ClusterIP   10.43.137.182   <none>        80/TCP     209d

NAME                                              READY   UP-TO-DATE   AVAILABLE   AGE
deployment.apps/awx-operator-controller-manager   1/1     1            1           209d
deployment.apps/awx                               1/1     1            1           46m

NAME                                                         DESIRED   CURRENT   READY   AGE
replicaset.apps/awx-operator-controller-manager-68d6f576b4   1         1         1       79m
replicaset.apps/awx-operator-controller-manager-7f89bd5797   0         0         0       209d
replicaset.apps/awx-9668dcb98                                1         1         1       46m

NAME                               READY   AGE
statefulset.apps/awx-postgres-13   1/1     209d
root@ansible:~/awx-on-k3s/base#


root@ansible:~/awx-on-k3s/base# kubectl -n awx describe pod awx-9668dcb98-nzg5q
Name:             awx-9668dcb98-nzg5q
Namespace:        awx
Priority:         0
Service Account:  awx
Node:             ansible/10.20.7.10
Start Time:       Fri, 10 Mar 2023 10:47:16 +1300
Labels:           app.kubernetes.io/component=awx
                  app.kubernetes.io/name=awx
                  app.kubernetes.io/operator-version=1.3.0
                  app.kubernetes.io/part-of=awx
                  app.kubernetes.io/version=21.13.0
                  pod-template-hash=9668dcb98
Annotations:      checksum-configmaps-config: f561cc65d89b4e3678076eccafe63ac9
                  checksum-configmaps-pre_stop_scripts: 68b329da9893e34099c7d8ad5cb9c940
                  checksum-secret-bundle_cacert: 276fa68835904533a2a8b68b5a128047
                  checksum-secret-ldap_cacert: 276fa68835904533a2a8b68b5a128047
                  checksum-secret-receptor_ca: 4ee07b571170b38048a66949f955f0dc
                  checksum-secret-receptor_work_signing: 796f98b768de8340c4167ba74a0b0094
                  checksum-secret-route_tls: d41d8cd98f00b204e9800998ecf8427e
                  checksum-secret-secret_key: 37ec43cc1be555e4ba78f4425301865f
                  checksum-secrets-app_credentials: 1754fa7c60d3bf69b54d2ffcc10bca10
                  checksum-storage-persistent: 68b329da9893e34099c7d8ad5cb9c940
Status:           Failed
Reason:           Evicted
Message:          The node was low on resource: ephemeral-storage. Container awx-ee was using 1008644Ki, which exceeds its request of 0. Container redis was using 36Ki, which exceeds its request of 0. Container awx-task was using 873704Ki, which exceeds its request of 0. Container awx-web was using 360Ki, which exceeds its request of 0.
IP:               10.42.0.12
IPs:
  IP:           10.42.0.12
Controlled By:  ReplicaSet/awx-9668dcb98
Init Containers:
  init:
    Container ID:  containerd://bb5e6a3ea197a81cc1fd1446b8e63435a75bffe67affcd7f16268f951a46f41f
    Image:         quay.io/ansible/awx-ee:latest
    Port:          <none>
    Host Port:     <none>
    Command:
      /bin/sh
      -c
      hostname=$MY_POD_NAME
      receptor --cert-makereq bits=2048 commonname=$hostname dnsname=$hostname nodeid=$hostname outreq=/etc/receptor/tls/receptor.req outkey=/etc/receptor/tls/receptor.key
      receptor --cert-signreq req=/etc/receptor/tls/receptor.req cacert=/etc/receptor/tls/ca/receptor-ca.crt cakey=/etc/receptor/tls/ca/receptor-ca.key outcert=/etc/receptor/tls/receptor.crt verify=yes
      mkdir -p /etc/pki/ca-trust/extracted/{java,pem,openssl,edk2}
      update-ca-trust

    State:          Terminated
      Reason:       Completed
      Exit Code:    0
      Started:      Fri, 10 Mar 2023 10:47:17 +1300
      Finished:     Fri, 10 Mar 2023 10:47:18 +1300
    Ready:          True
    Restart Count:  0
    Environment:
      MY_POD_NAME:  awx-9668dcb98-nzg5q (v1:metadata.name)
    Mounts:
      /etc/pki/ca-trust/extracted from ca-trust-extracted (rw)
      /etc/pki/ca-trust/source/anchors/bundle-ca.crt from awx-bundle-cacert (ro,path="bundle-ca.crt")
      /etc/receptor/tls/ from awx-receptor-tls (rw)
      /etc/receptor/tls/ca/receptor-ca.crt from awx-receptor-ca (ro,path="tls.crt")
      /etc/receptor/tls/ca/receptor-ca.key from awx-receptor-ca (ro,path="tls.key")
      /var/run/secrets/kubernetes.io/serviceaccount from kube-api-access-f96x7 (ro)
  init-projects:
    Container ID:  containerd://3e7d78e9a26e26b0fe717e169c00db72fb8ae7350b0fd3721b68bd02551aae7e
    Image:         quay.io/centos/centos:stream9
    Port:          <none>
    Host Port:     <none>
    Command:
      /bin/sh
      -c
      chmod 775 /var/lib/awx/projects
      chgrp 1000 /var/lib/awx/projects

    State:          Terminated
      Reason:       Completed
      Exit Code:    0
      Started:      Fri, 10 Mar 2023 10:47:18 +1300
      Finished:     Fri, 10 Mar 2023 10:47:18 +1300
    Ready:          True
    Restart Count:  0
    Environment:
      MY_POD_NAME:  awx-9668dcb98-nzg5q (v1:metadata.name)
    Mounts:
      /var/lib/awx/projects from awx-projects (rw)
      /var/run/secrets/kubernetes.io/serviceaccount from kube-api-access-f96x7 (ro)
Containers:
  redis:
    Container ID:
    Image:         docker.io/redis:7
    Image ID:
    Port:          <none>
    Host Port:     <none>
    Args:
      redis-server
      /etc/redis.conf
    State:          Terminated
      Reason:       ContainerStatusUnknown
      Message:      The container could not be located when the pod was terminated
      Exit Code:    137
      Started:      Mon, 01 Jan 0001 00:00:00 +0000
      Finished:     Mon, 01 Jan 0001 00:00:00 +0000
    Last State:     Terminated
      Reason:       ContainerStatusUnknown
      Message:      The container could not be located when the pod was deleted.  The container used to be Running
      Exit Code:    137
      Started:      Mon, 01 Jan 0001 00:00:00 +0000
      Finished:     Mon, 01 Jan 0001 00:00:00 +0000
    Ready:          False
    Restart Count:  1
    Requests:
      cpu:        50m
      memory:     64Mi
    Environment:  <none>
    Mounts:
      /data from awx-redis-data (rw)
      /etc/redis.conf from awx-redis-config (ro,path="redis.conf")
      /var/run/redis from awx-redis-socket (rw)
      /var/run/secrets/kubernetes.io/serviceaccount from kube-api-access-f96x7 (ro)
  awx-web:
    Container ID:
    Image:         quay.io/ansible/awx:21.13.0
    Image ID:
    Port:          8052/TCP
    Host Port:     0/TCP
    Args:
      /usr/bin/launch_awx.sh
    State:          Terminated
      Reason:       ContainerStatusUnknown
      Message:      The container could not be located when the pod was terminated
      Exit Code:    137
      Started:      Mon, 01 Jan 0001 00:00:00 +0000
      Finished:     Mon, 01 Jan 0001 00:00:00 +0000
    Last State:     Terminated
      Reason:       ContainerStatusUnknown
      Message:      The container could not be located when the pod was deleted.  The container used to be Running
      Exit Code:    137
      Started:      Mon, 01 Jan 0001 00:00:00 +0000
      Finished:     Mon, 01 Jan 0001 00:00:00 +0000
    Ready:          False
    Restart Count:  1
    Environment:
      MY_POD_NAMESPACE:  awx (v1:metadata.namespace)
      UWSGI_MOUNT_PATH:  /
    Mounts:
      /etc/nginx/nginx.conf from awx-nginx-conf (ro,path="nginx.conf")
      /etc/openldap/certs/ldap-ca.crt from awx-ldap-cacert (ro,path="ldap-ca.crt")
      /etc/pki/ca-trust/extracted from ca-trust-extracted (rw)
      /etc/pki/ca-trust/source/anchors/bundle-ca.crt from awx-bundle-cacert (ro,path="bundle-ca.crt")
      /etc/receptor/signing/work-public-key.pem from awx-receptor-work-signing (ro,path="work-public-key.pem")
      /etc/receptor/tls/ca/receptor-ca.crt from awx-receptor-ca (ro,path="tls.crt")
      /etc/receptor/tls/ca/receptor-ca.key from awx-receptor-ca (ro,path="tls.key")
      /etc/tower/SECRET_KEY from awx-secret-key (ro,path="SECRET_KEY")
      /etc/tower/conf.d/credentials.py from awx-application-credentials (ro,path="credentials.py")
      /etc/tower/conf.d/execution_environments.py from awx-application-credentials (ro,path="execution_environments.py")
      /etc/tower/conf.d/ldap.py from awx-application-credentials (ro,path="ldap.py")
      /etc/tower/settings.py from awx-settings (ro,path="settings.py")
      /var/lib/awx/projects from awx-projects (rw)
      /var/lib/awx/rsyslog from rsyslog-dir (rw)
      /var/run/awx-rsyslog from rsyslog-socket (rw)
      /var/run/redis from awx-redis-socket (rw)
      /var/run/secrets/kubernetes.io/serviceaccount from kube-api-access-f96x7 (ro)
      /var/run/supervisor from supervisor-socket (rw)
  awx-task:
    Container ID:  containerd://7de76df9cdc0621fd1acf2a73f80a59fb3eb9a2007142a34a88c430afc06bce9
    Image:         quay.io/ansible/awx:21.13.0
    Port:          <none>
    Host Port:     <none>
    Args:
      /usr/bin/launch_awx_task.sh
    State:          Terminated
      Reason:       Completed
      Exit Code:    0
      Started:      Fri, 10 Mar 2023 10:47:19 +1300
      Finished:     Fri, 10 Mar 2023 11:28:03 +1300
    Ready:          False
    Restart Count:  0
    Environment:
      SUPERVISOR_WEB_CONFIG_PATH:  /etc/supervisord.conf
      AWX_SKIP_MIGRATIONS:         1
      MY_POD_UID:                   (v1:metadata.uid)
      MY_POD_IP:                    (v1:status.podIP)
      MY_POD_NAMESPACE:            awx (v1:metadata.namespace)
    Mounts:
      /etc/pki/ca-trust/extracted from ca-trust-extracted (rw)
      /etc/pki/ca-trust/source/anchors/bundle-ca.crt from awx-bundle-cacert (ro,path="bundle-ca.crt")
      /etc/receptor/ from awx-receptor-config (rw)
      /etc/receptor/signing/work-private-key.pem from awx-receptor-work-signing (ro,path="work-private-key.pem")
      /etc/tower/SECRET_KEY from awx-secret-key (ro,path="SECRET_KEY")
      /etc/tower/conf.d/credentials.py from awx-application-credentials (ro,path="credentials.py")
      /etc/tower/conf.d/execution_environments.py from awx-application-credentials (ro,path="execution_environments.py")
      /etc/tower/conf.d/ldap.py from awx-application-credentials (ro,path="ldap.py")
      /etc/tower/settings.py from awx-settings (ro,path="settings.py")
      /var/lib/awx/projects from awx-projects (rw)
      /var/lib/awx/rsyslog from rsyslog-dir (rw)
      /var/run/awx-rsyslog from rsyslog-socket (rw)
      /var/run/receptor from receptor-socket (rw)
      /var/run/redis from awx-redis-socket (rw)
      /var/run/secrets/kubernetes.io/serviceaccount from kube-api-access-f96x7 (ro)
      /var/run/supervisor from supervisor-socket (rw)
  awx-ee:
    Container ID:
    Image:         quay.io/ansible/awx-ee:latest
    Image ID:
    Port:          <none>
    Host Port:     <none>
    Args:
      /bin/sh
      -c
      if [ ! -f /etc/receptor/receptor.conf ]; then
        cp /etc/receptor/receptor-default.conf /etc/receptor/receptor.conf
        sed -i "s/HOSTNAME/$HOSTNAME/g" /etc/receptor/receptor.conf
      fi
      exec receptor --config /etc/receptor/receptor.conf

    State:          Terminated
      Reason:       ContainerStatusUnknown
      Message:      The container could not be located when the pod was terminated
      Exit Code:    137
      Started:      Mon, 01 Jan 0001 00:00:00 +0000
      Finished:     Mon, 01 Jan 0001 00:00:00 +0000
    Last State:     Terminated
      Reason:       ContainerStatusUnknown
      Message:      The container could not be located when the pod was deleted.  The container used to be Running
      Exit Code:    137
      Started:      Mon, 01 Jan 0001 00:00:00 +0000
      Finished:     Mon, 01 Jan 0001 00:00:00 +0000
    Ready:          False
    Restart Count:  1
    Environment:    <none>
    Mounts:
      /etc/pki/ca-trust/extracted from ca-trust-extracted (rw)
      /etc/pki/ca-trust/source/anchors/bundle-ca.crt from awx-bundle-cacert (ro,path="bundle-ca.crt")
      /etc/receptor/ from awx-receptor-config (rw)
      /etc/receptor/receptor-default.conf from awx-default-receptor-config (rw,path="receptor.conf")
      /etc/receptor/signing/work-private-key.pem from awx-receptor-work-signing (ro,path="work-private-key.pem")
      /etc/receptor/tls/ from awx-receptor-tls (rw)
      /etc/receptor/tls/ca/receptor-ca.crt from awx-receptor-ca (ro,path="tls.crt")
      /var/lib/awx/projects from awx-projects (rw)
      /var/run/receptor from receptor-socket (rw)
      /var/run/secrets/kubernetes.io/serviceaccount from kube-api-access-f96x7 (ro)
Conditions:
  Type              Status
  Initialized       True
  Ready             False
  ContainersReady   False
  PodScheduled      True
Volumes:
  ca-trust-extracted:
    Type:       EmptyDir (a temporary directory that shares a pod's lifetime)
    Medium:
    SizeLimit:  <unset>
  awx-bundle-cacert:
    Type:        Secret (a volume populated by a Secret)
    SecretName:  awx-custom-certs
    Optional:    false
  awx-ldap-cacert:
    Type:        Secret (a volume populated by a Secret)
    SecretName:  awx-custom-certs
    Optional:    false
  awx-application-credentials:
    Type:        Secret (a volume populated by a Secret)
    SecretName:  awx-app-credentials
    Optional:    false
  awx-receptor-tls:
    Type:       EmptyDir (a temporary directory that shares a pod's lifetime)
    Medium:
    SizeLimit:  <unset>
  awx-receptor-ca:
    Type:        Secret (a volume populated by a Secret)
    SecretName:  awx-receptor-ca
    Optional:    false
  awx-receptor-work-signing:
    Type:        Secret (a volume populated by a Secret)
    SecretName:  awx-receptor-work-signing
    Optional:    false
  awx-secret-key:
    Type:        Secret (a volume populated by a Secret)
    SecretName:  awx-secret-key
    Optional:    false
  awx-settings:
    Type:      ConfigMap (a volume populated by a ConfigMap)
    Name:      awx-awx-configmap
    Optional:  false
  awx-nginx-conf:
    Type:      ConfigMap (a volume populated by a ConfigMap)
    Name:      awx-awx-configmap
    Optional:  false
  awx-redis-config:
    Type:      ConfigMap (a volume populated by a ConfigMap)
    Name:      awx-awx-configmap
    Optional:  false
  awx-redis-socket:
    Type:       EmptyDir (a temporary directory that shares a pod's lifetime)
    Medium:
    SizeLimit:  <unset>
  awx-redis-data:
    Type:       EmptyDir (a temporary directory that shares a pod's lifetime)
    Medium:
    SizeLimit:  <unset>
  supervisor-socket:
    Type:       EmptyDir (a temporary directory that shares a pod's lifetime)
    Medium:
    SizeLimit:  <unset>
  rsyslog-socket:
    Type:       EmptyDir (a temporary directory that shares a pod's lifetime)
    Medium:
    SizeLimit:  <unset>
  receptor-socket:
    Type:       EmptyDir (a temporary directory that shares a pod's lifetime)
    Medium:
    SizeLimit:  <unset>
  rsyslog-dir:
    Type:       EmptyDir (a temporary directory that shares a pod's lifetime)
    Medium:
    SizeLimit:  <unset>
  awx-receptor-config:
    Type:       EmptyDir (a temporary directory that shares a pod's lifetime)
    Medium:
    SizeLimit:  <unset>
  awx-default-receptor-config:
    Type:      ConfigMap (a volume populated by a ConfigMap)
    Name:      awx-awx-configmap
    Optional:  false
  awx-projects:
    Type:       PersistentVolumeClaim (a reference to a PersistentVolumeClaim in the same namespace)
    ClaimName:  awx-projects-claim
    ReadOnly:   false
  kube-api-access-f96x7:
    Type:                    Projected (a volume that contains injected data from multiple sources)
    TokenExpirationSeconds:  3607
    ConfigMapName:           kube-root-ca.crt
    ConfigMapOptional:       <nil>
    DownwardAPI:             true
QoS Class:                   Burstable
Node-Selectors:              <none>
Tolerations:                 node.kubernetes.io/not-ready:NoExecute op=Exists for 300s
                             node.kubernetes.io/unreachable:NoExecute op=Exists for 300s
Events:
  Type     Reason               Age   From               Message
  ----     ------               ----  ----               -------
  Normal   Scheduled            53m   default-scheduler  Successfully assigned awx/awx-9668dcb98-nzg5q to ansible
  Normal   Pulled               53m   kubelet            Container image "quay.io/ansible/awx-ee:latest" already present on machine
  Normal   Created              53m   kubelet            Created container init
  Normal   Started              53m   kubelet            Started container init
  Normal   Pulled               52m   kubelet            Container image "quay.io/centos/centos:stream9" already present on machine
  Normal   Created              52m   kubelet            Created container init-projects
  Normal   Started              52m   kubelet            Started container init-projects
  Normal   Pulled               52m   kubelet            Container image "docker.io/redis:7" already present on machine
  Normal   Created              52m   kubelet            Created container redis
  Normal   Started              52m   kubelet            Started container redis
  Normal   Pulled               52m   kubelet            Container image "quay.io/ansible/awx:21.13.0" already present on machine
  Normal   Created              52m   kubelet            Created container awx-web
  Normal   Started              52m   kubelet            Started container awx-web
  Normal   Pulled               52m   kubelet            Container image "quay.io/ansible/awx:21.13.0" already present on machine
  Normal   Created              52m   kubelet            Created container awx-task
  Normal   Started              52m   kubelet            Started container awx-task
  Normal   Pulled               52m   kubelet            Container image "quay.io/ansible/awx-ee:latest" already present on machine
  Normal   Created              52m   kubelet            Created container awx-ee
  Normal   Started              52m   kubelet            Started container awx-ee
  Warning  Evicted              12m   kubelet            The node was low on resource: ephemeral-storage. Container awx-ee was using 1008644Ki, which exceeds its request of 0. Container redis was using 36Ki, which exceeds its request of 0. Container awx-task was using 873704Ki, which exceeds its request of 0. Container awx-web was using 360Ki, which exceeds its request of 0.
  Normal   Killing              12m   kubelet            Stopping container redis
  Normal   Killing              12m   kubelet            Stopping container awx-ee
  Normal   Killing              12m   kubelet            Stopping container awx-task
  Normal   Killing              12m   kubelet            Stopping container awx-web
  Warning  ExceededGracePeriod  12m   kubelet            Container runtime did not kill the pod within specified grace period.
root@ansible:~/awx-on-k3s/base#



Last time I tried this there were no events.

This looks like a possible issue:

  Warning  Evicted              12m   kubelet            The node was low on resource: ephemeral-storage. Container awx-ee was using 1008644Ki, which exceeds its request of 0. Container redis was using 36Ki, which exceeds its request of 0. Container awx-task was using 873704Ki, which exceeds its request of 0. Container awx-web was using 360Ki, which exceeds its request of 0.
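Adding up the figures in that eviction message (the Ki values are KiB), this one pod alone accounts for roughly 1.8 GiB of ephemeral storage:

```shell
# Sum the per-container ephemeral-storage usage reported in the eviction message (KiB)
total_ki=$((1008644 + 36 + 873704 + 360))
echo "$((total_ki / 1024)) MiB"   # ~1838 MiB, i.e. roughly 1.8 GiB for a single AWX pod
```

Against the 8.7 GB free on the 28 GB virtual disk, a few replacement pods plus job pods could plausibly exhaust the node's ephemeral storage.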

And I see that there are no resources configured in awx.yaml:


---
kind: AWX
metadata:
  name: awx
spec:
  # These parameters are designed for use with:
  # - AWX Operator: 0.26.0
  # - AWX: 21.4.0

  admin_user: admin
  admin_password_secret: awx-admin-password

  ingress_type: ingress
  ingress_tls_secret: awx-secret-tls
  hostname: awx.gallagher.local

  postgres_configuration_secret: awx-postgres-configuration

  postgres_storage_class: awx-postgres-volume
  postgres_storage_requirements:
    requests:
      storage: 8Gi

  projects_persistence: true
  projects_existing_claim: awx-projects-claim

  postgres_init_container_resource_requirements: {}
  postgres_resource_requirements: {}
  web_resource_requirements: {}
  task_resource_requirements: {}
  ee_resource_requirements: {}

  ldap_cacert_secret: awx-custom-certs
  bundle_cacert_secret: awx-custom-certs

  # Uncomment to reveal "censored" logs
  #no_log: "false"


I configured the resources as follows, based on the example:
  web_resource_requirements:
    requests:
      cpu: 250m
      memory: 2Gi
    limits:
      cpu: 1000m
      memory: 4Gi

  task_resource_requirements:
    requests:
      cpu: 250m
      memory: 1Gi
    limits:
      cpu: 2000m
      memory: 2Gi

  ee_resource_requirements:
    requests:
      cpu: 250m
      memory: 100Mi
    limits:
      cpu: 500m
      memory: 2Gi

This seems to have helped when running 2 jobs with 2 concurrent hosts each, but when I then ran 4 jobs, each with a single host, it crashed again.

These are the events for the crashed pod:

Events:
  Type     Reason               Age   From               Message
  ----     ------               ----  ----               -------
  Normal   Scheduled            26m   default-scheduler  Successfully assigned awx/awx-6b55565fcd-q74wv to gglvansible
  Normal   Pulled               26m   kubelet            Container image "quay.io/ansible/awx-ee:latest" already present on machine
  Normal   Created              26m   kubelet            Created container init
  Normal   Started              26m   kubelet            Started container init
  Normal   Pulled               26m   kubelet            Container image "quay.io/centos/centos:stream9" already present on machine
  Normal   Created              26m   kubelet            Created container init-projects
  Normal   Started              26m   kubelet            Started container init-projects
  Normal   Pulled               26m   kubelet            Container image "docker.io/redis:7" already present on machine
  Normal   Created              26m   kubelet            Created container redis
  Normal   Started              26m   kubelet            Started container redis
  Normal   Pulled               26m   kubelet            Container image "quay.io/ansible/awx:21.13.0" already present on machine
  Normal   Created              26m   kubelet            Created container awx-web
  Normal   Started              26m   kubelet            Started container awx-web
  Normal   Pulled               26m   kubelet            Container image "quay.io/ansible/awx:21.13.0" already present on machine
  Normal   Created              26m   kubelet            Created container awx-task
  Normal   Started              26m   kubelet            Started container awx-task
  Normal   Pulled               26m   kubelet            Container image "quay.io/ansible/awx-ee:latest" already present on machine
  Normal   Created              26m   kubelet            Created container awx-ee
  Normal   Started              26m   kubelet            Started container awx-ee
  Warning  Evicted              12m   kubelet            The node was low on resource: ephemeral-storage. Container redis was using 32Ki, which exceeds its request of 0. Container awx-task was using 3260068Ki, which exceeds its request of 0. Container awx-ee was using 917416Ki, which exceeds its request of 0. Container awx-web was using 632Ki, which exceeds its request of 0.
  Normal   Killing              12m   kubelet            Stopping container redis
  Normal   Killing              12m   kubelet            Stopping container awx-task
  Normal   Killing              12m   kubelet            Stopping container awx-ee
  Normal   Killing              12m   kubelet            Stopping container awx-web
  Warning  ExceededGracePeriod  12m   kubelet            Container runtime did not kill the pod within specified grace period.
root@gglvansible:~/awx-on-k3s/base#

This is a similar error to the previous example.
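One thing I notice: both evictions cite ephemeral-storage, but the requests I configured only cover cpu and memory. Kubernetes also accepts ephemeral-storage under requests/limits, so if the operator passes these dicts straight through to the pod spec, presumably I could request it too. A sketch (the sizes are guesses based on the observed usage, not tested values):

```yaml
  ee_resource_requirements:
    requests:
      cpu: 250m
      memory: 100Mi
      ephemeral-storage: 2Gi   # guess: awx-ee was observed using ~1 GiB
    limits:
      cpu: 500m
      memory: 2Gi
      ephemeral-storage: 5Gi

  task_resource_requirements:
    requests:
      cpu: 250m
      memory: 1Gi
      ephemeral-storage: 2Gi   # guess: awx-task was observed using ~0.9-3.1 GiB
    limits:
      cpu: 2000m
      memory: 2Gi
      ephemeral-storage: 5Gi
```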

Any thoughts?



AWX Project

Mar 10, 2023, 1:41:00 PM
to AWX Project
Yeah, the pods are being evicted due to low resources. It might be worth playing around with setting the request values for the web, ee, and awx containers.


AWX Team

Gregory Machin

Mar 13, 2023, 5:06:17 AM
to awx-p...@googlegroups.com
Thank you for the assistance.

I configured the resource limits, and after that I discovered that the system was running out of disk space, causing it to fall over.
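For anyone who hits the same thing, this is roughly how I checked where the disk had gone (a sketch; /var/lib/rancher and /var/lib/kubelet are the default k3s locations):

```shell
# See where the disk went on the k3s host (paths are the k3s defaults)
df -h /
for d in /var/lib/rancher /var/lib/kubelet /var/log; do
  if [ -d "$d" ]; then du -sh "$d" 2>/dev/null || true; fi
done
```

Old container images are often the biggest consumer; on k3s they can reportedly be pruned with "k3s crictl rmi --prune" (worth verifying against your crictl version first).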
