AWX crashes when launching 7 concurrent jobs


Gregory Machin

Mar 3, 2023, 10:11:47 PM
to AWX Project
Hi,

I have AWX running on a VM with the following spec:
- 4 vCPU (reported as AMD EPYC 7402P 24-Core Processor)
- 16 GB of RAM
- 28 GB virtual disk (18 GB used, 8.7 GB free)

OS - Ubuntu 22.04.1 LTS
AWX - the About page reports the version as 21.4.0

When a large number of concurrent jobs are started at the same time, AWX crashes, with 404 or 502 errors in the browser. Sometimes it recovers and I can log in, but the jobs will have failed with "Task was marked as running but was not present in the job queue, so it has been marked as failed." Other times it doesn't respond at all and I reboot the server.

It feels like a resource issue, but I'm not sure where to look, as K3s is not an area I have much knowledge in.
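For reference, the only checks I have run so far are basic host-level ones on the VM itself (nothing k3s-specific):

```shell
# Basic resource checks on the VM hosting k3s
free -m     # memory and swap usage (the VM has 16 GB)
df -h /     # disk usage; k3s keeps images and container state under /var/lib/rancher by default
uptime      # load average relative to the 4 vCPUs
```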

What is the likely cause?

Gregory Machin

Mar 4, 2023, 12:01:34 AM
to AWX Project
Looks like the Redis container is having issues. I was following the logs; when I started the workflow template, the connection to the container was lost:

1:M 04 Mar 2023 04:40:29.124 * Background saving terminated with success
1:signal-handler (1677905080) Received SIGTERM scheduling shutdown...
1:M 04 Mar 2023 04:44:40.786 # User requested shutdown...
1:M 04 Mar 2023 04:44:40.786 * Saving the final RDB snapshot before exiting.
1:M 04 Mar 2023 04:44:40.795 * DB saved on disk
1:M 04 Mar 2023 04:44:40.795 * Removing the unix socket file.
1:M 04 Mar 2023 04:44:40.795 # Redis is now ready to exit, bye bye...
rpc error: code = NotFound desc = an error occurred when try to find container "76b43903e9dabc1f72e0f70b07e07e34ebc18520bf5f23ebdd0535a1d19b8f3a": not found
root@server:~# kubectl -n awx logs pod/awx-788749fb7f-vc9w5 -f
Defaulted container "redis" out of: redis, awx-web, awx-task, awx-ee, init (init)
unable to retrieve container logs for containerd://8ae010256adb598c6f842f821c4a960809a9c2e8dae37edde6d73e0e68f94cbd
root@server:~# kubectl -n awx logs pod/awx-788749fb7f-vc9w5 -f
Defaulted container "redis" out of: redis, awx-web, awx-task, awx-ee, init (init)
unable to retrieve container logs for containerd://8ae010256adb598c6f842f821c4a960809a9c2e8dae37edde6d73e0e68f94cbd
root@server:~# kubectl -n awx logs pod/awx-788749fb7f-z558t -f
Defaulted container "redis" out of: redis, awx-web, awx-task, awx-ee, init (init)
1:C 04 Mar 2023 04:49:55.581 # oO0OoO0OoO0Oo Redis is starting oO0OoO0OoO0Oo
1:C 04 Mar 2023 04:49:55.581 # Redis version=7.0.9, bits=64, commit=00000000, modified=0, pid=1, just started
1:C 04 Mar 2023 04:49:55.581 # Configuration loaded
1:M 04 Mar 2023 04:49:55.582 * monotonic clock: POSIX clock_gettime
1:M 04 Mar 2023 04:49:55.582 * Running mode=standalone, port=0.
1:M 04 Mar 2023 04:49:55.582 # Server initialized
1:M 04 Mar 2023 04:49:55.583 * The server is now ready to accept connections at /var/run/redis/redis.sock
1:signal-handler (1677905534) Received SIGTERM scheduling shutdown...
1:M 04 Mar 2023 04:52:14.604 # User requested shutdown...
1:M 04 Mar 2023 04:52:14.604 * Saving the final RDB snapshot before exiting.
1:M 04 Mar 2023 04:52:14.617 * DB saved on disk
1:M 04 Mar 2023 04:52:14.618 * Removing the unix socket file.
1:M 04 Mar 2023 04:52:14.619 # Redis is now ready to exit, bye bye...



root@server# kubectl -n awx get all
NAME                                                   READY   STATUS                   RESTARTS       AGE
pod/awx-788749fb7f-gvtv9                               0/4     ContainerStatusUnknown   45 (30d ago)   166d
pod/awx-788749fb7f-4l99n                               0/4     ContainerStatusUnknown   4              13d
pod/awx-788749fb7f-hmqdg                               0/4     ContainerStatusUnknown   2              3h44m
pod/awx-788749fb7f-f4xmj                               0/4     ContainerStatusUnknown   3              3h10m
pod/awx-788749fb7f-s8rr5                               0/4     ContainerStatusUnknown   3              103m
pod/awx-postgres-13-0                                  1/1     Running                  14 (26m ago)   166d
pod/awx-operator-controller-manager-7f89bd5797-lwjpx   2/2     Running                  23 (26m ago)   138d
pod/awx-788749fb7f-vc9w5                               0/4     ContainerStatusUnknown   5 (26m ago)    46m
pod/awx-788749fb7f-z558t                               0/4     ContainerStatusUnknown   4              12m
pod/awx-788749fb7f-qzzbg                               0/4     Pending                  0              4m46s

NAME                                                      TYPE        CLUSTER-IP      EXTERNAL-IP   PORT(S)    AGE
service/awx-operator-controller-manager-metrics-service   ClusterIP   10.43.88.44     <none>        8443/TCP   203d
service/awx-postgres-13                                   ClusterIP   None            <none>        5432/TCP   203d
service/awx-service                                       ClusterIP   10.43.137.182   <none>        80/TCP     203d

NAME                                              READY   UP-TO-DATE   AVAILABLE   AGE
deployment.apps/awx-operator-controller-manager   1/1     1            1           203d
deployment.apps/awx                               0/1     1            0           203d

NAME                                                         DESIRED   CURRENT   READY   AGE
replicaset.apps/awx-5d7b85bc77                               0         0         0       203d
replicaset.apps/awx-operator-controller-manager-7f89bd5797   1         1         1       203d
replicaset.apps/awx-788749fb7f                               1         1         0       200d

NAME                               READY   AGE
statefulset.apps/awx-postgres-13   1/1     203d
root@server:~#

I saw there was a new running instance, and that it then shut down ("Received SIGTERM scheduling shutdown...") and became "ContainerStatusUnknown".

Why would the Redis container shut down without any visible errors?

AWX Project

Mar 8, 2023, 2:50:41 PM
to AWX Project
What does kubectl describe on one of the ContainerStatusUnknown pods report?

Verify that the nodes these job pods were assigned to are healthy by running "kubectl get node".

We suspect the underlying nodes are unhealthy (memory issues, maybe) and that this is causing the pods to crash.

AWX Team

Gregory Machin

Mar 9, 2023, 7:55:12 PM
to awx-p...@googlegroups.com
Thanks for getting back to me,


I ran "kubectl -n awx delete deployment awx", which has cleared them from the list.

I then started 2 new jobs, each of which copies files to 2 servers. I have lost access to AWX, getting a gateway error, and am now getting "not found" on the jobs page.

ansible:~/awx-on-k3s/base# kubectl get node
NAME          STATUS   ROLES                  AGE    VERSION
gglvansible   Ready    control-plane,master   209d   v1.25.6+k3s1
ansible:~/awx-on-k3s/base#

root@ansible:~/awx-on-k3s/base# kubectl -n awx get all
NAME                                                   READY   STATUS                   RESTARTS       AGE
pod/awx-postgres-13-0                                  1/1     Running                  21 (85m ago)   171d
pod/awx-operator-controller-manager-68d6f576b4-7672r   2/2     Running                  0              79m
pod/automation-job-1157-x7rb4                          1/1     Running                  0              7m38s
pod/automation-job-1156-tvhjj                          1/1     Running                  0              7m40s
pod/awx-9668dcb98-nzg5q                                0/4     ContainerStatusUnknown   3              46m
pod/awx-9668dcb98-dh56c                                4/4     Running                  0              6m5s

NAME                                                      TYPE        CLUSTER-IP      EXTERNAL-IP   PORT(S)    AGE
service/awx-operator-controller-manager-metrics-service   ClusterIP   10.43.88.44     <none>        8443/TCP   209d
service/awx-postgres-13                                   ClusterIP   None            <none>        5432/TCP   209d
service/awx-service                                       ClusterIP   10.43.137.182   <none>        80/TCP     209d

NAME                                              READY   UP-TO-DATE   AVAILABLE   AGE
deployment.apps/awx-operator-controller-manager   1/1     1            1           209d
deployment.apps/awx                               1/1     1            1           46m

NAME                                                         DESIRED   CURRENT   READY   AGE
replicaset.apps/awx-operator-controller-manager-68d6f576b4   1         1         1       79m
replicaset.apps/awx-operator-controller-manager-7f89bd5797   0         0         0       209d
replicaset.apps/awx-9668dcb98                                1         1         1       46m

NAME                               READY   AGE
statefulset.apps/awx-postgres-13   1/1     209d
root@ansible:~/awx-on-k3s/base#


root@ansible:~/awx-on-k3s/base# kubectl -n awx describe pod awx-9668dcb98-nzg5q
Name:             awx-9668dcb98-nzg5q
Namespace:        awx
Priority:         0
Service Account:  awx
Node:             ansible/10.20.7.10
Start Time:       Fri, 10 Mar 2023 10:47:16 +1300
Labels:           app.kubernetes.io/component=awx
                  app.kubernetes.io/name=awx
                  app.kubernetes.io/operator-version=1.3.0
                  app.kubernetes.io/part-of=awx
                  app.kubernetes.io/version=21.13.0
                  pod-template-hash=9668dcb98
Annotations:      checksum-configmaps-config: f561cc65d89b4e3678076eccafe63ac9
                  checksum-configmaps-pre_stop_scripts: 68b329da9893e34099c7d8ad5cb9c940
                  checksum-secret-bundle_cacert: 276fa68835904533a2a8b68b5a128047
                  checksum-secret-ldap_cacert: 276fa68835904533a2a8b68b5a128047
                  checksum-secret-receptor_ca: 4ee07b571170b38048a66949f955f0dc
                  checksum-secret-receptor_work_signing: 796f98b768de8340c4167ba74a0b0094
                  checksum-secret-route_tls: d41d8cd98f00b204e9800998ecf8427e
                  checksum-secret-secret_key: 37ec43cc1be555e4ba78f4425301865f
                  checksum-secrets-app_credentials: 1754fa7c60d3bf69b54d2ffcc10bca10
                  checksum-storage-persistent: 68b329da9893e34099c7d8ad5cb9c940
Status:           Failed
Reason:           Evicted
Message:          The node was low on resource: ephemeral-storage. Container awx-ee was using 1008644Ki, which exceeds its request of 0. Container redis was using 36Ki, which exceeds its request of 0. Container awx-task was using 873704Ki, which exceeds its request of 0. Container awx-web was using 360Ki, which exceeds its request of 0.
IP:               10.42.0.12
IPs:
  IP:           10.42.0.12
Controlled By:  ReplicaSet/awx-9668dcb98
Init Containers:
  init:
    Container ID:  containerd://bb5e6a3ea197a81cc1fd1446b8e63435a75bffe67affcd7f16268f951a46f41f
    Image:         quay.io/ansible/awx-ee:latest
    Port:          <none>
    Host Port:     <none>
    Command:
      /bin/sh
      -c
      hostname=$MY_POD_NAME
      receptor --cert-makereq bits=2048 commonname=$hostname dnsname=$hostname nodeid=$hostname outreq=/etc/receptor/tls/receptor.req outkey=/etc/receptor/tls/receptor.key
      receptor --cert-signreq req=/etc/receptor/tls/receptor.req cacert=/etc/receptor/tls/ca/receptor-ca.crt cakey=/etc/receptor/tls/ca/receptor-ca.key outcert=/etc/receptor/tls/receptor.crt verify=yes
      mkdir -p /etc/pki/ca-trust/extracted/{java,pem,openssl,edk2}
      update-ca-trust

    State:          Terminated
      Reason:       Completed
      Exit Code:    0
      Started:      Fri, 10 Mar 2023 10:47:17 +1300
      Finished:     Fri, 10 Mar 2023 10:47:18 +1300
    Ready:          True
    Restart Count:  0
    Environment:
      MY_POD_NAME:  awx-9668dcb98-nzg5q (v1:metadata.name)
    Mounts:
      /etc/pki/ca-trust/extracted from ca-trust-extracted (rw)
      /etc/pki/ca-trust/source/anchors/bundle-ca.crt from awx-bundle-cacert (ro,path="bundle-ca.crt")
      /etc/receptor/tls/ from awx-receptor-tls (rw)
      /etc/receptor/tls/ca/receptor-ca.crt from awx-receptor-ca (ro,path="tls.crt")
      /etc/receptor/tls/ca/receptor-ca.key from awx-receptor-ca (ro,path="tls.key")
      /var/run/secrets/kubernetes.io/serviceaccount from kube-api-access-f96x7 (ro)
  init-projects:
    Container ID:  containerd://3e7d78e9a26e26b0fe717e169c00db72fb8ae7350b0fd3721b68bd02551aae7e
    Image:         quay.io/centos/centos:stream9
    Port:          <none>
    Host Port:     <none>
    Command:
      /bin/sh
      -c
      chmod 775 /var/lib/awx/projects
      chgrp 1000 /var/lib/awx/projects

    State:          Terminated
      Reason:       Completed
      Exit Code:    0
      Started:      Fri, 10 Mar 2023 10:47:18 +1300
      Finished:     Fri, 10 Mar 2023 10:47:18 +1300
    Ready:          True
    Restart Count:  0
    Environment:
      MY_POD_NAME:  awx-9668dcb98-nzg5q (v1:metadata.name)
    Mounts:
      /var/lib/awx/projects from awx-projects (rw)
      /var/run/secrets/kubernetes.io/serviceaccount from kube-api-access-f96x7 (ro)
Containers:
  redis:
    Container ID:
    Image:         docker.io/redis:7
    Image ID:
    Port:          <none>
    Host Port:     <none>
    Args:
      redis-server
      /etc/redis.conf
    State:          Terminated
      Reason:       ContainerStatusUnknown
      Message:      The container could not be located when the pod was terminated
      Exit Code:    137
      Started:      Mon, 01 Jan 0001 00:00:00 +0000
      Finished:     Mon, 01 Jan 0001 00:00:00 +0000
    Last State:     Terminated
      Reason:       ContainerStatusUnknown
      Message:      The container could not be located when the pod was deleted.  The container used to be Running
      Exit Code:    137
      Started:      Mon, 01 Jan 0001 00:00:00 +0000
      Finished:     Mon, 01 Jan 0001 00:00:00 +0000
    Ready:          False
    Restart Count:  1
    Requests:
      cpu:        50m
      memory:     64Mi
    Environment:  <none>
    Mounts:
      /data from awx-redis-data (rw)
      /etc/redis.conf from awx-redis-config (ro,path="redis.conf")
      /var/run/redis from awx-redis-socket (rw)
      /var/run/secrets/kubernetes.io/serviceaccount from kube-api-access-f96x7 (ro)
  awx-web:
    Container ID:
    Image:         quay.io/ansible/awx:21.13.0
    Image ID:
    Port:          8052/TCP
    Host Port:     0/TCP
    Args:
      /usr/bin/launch_awx.sh
    State:          Terminated
      Reason:       ContainerStatusUnknown
      Message:      The container could not be located when the pod was terminated
      Exit Code:    137
      Started:      Mon, 01 Jan 0001 00:00:00 +0000
      Finished:     Mon, 01 Jan 0001 00:00:00 +0000
    Last State:     Terminated
      Reason:       ContainerStatusUnknown
      Message:      The container could not be located when the pod was deleted.  The container used to be Running
      Exit Code:    137
      Started:      Mon, 01 Jan 0001 00:00:00 +0000
      Finished:     Mon, 01 Jan 0001 00:00:00 +0000
    Ready:          False
    Restart Count:  1
    Environment:
      MY_POD_NAMESPACE:  awx (v1:metadata.namespace)
      UWSGI_MOUNT_PATH:  /
    Mounts:
      /etc/nginx/nginx.conf from awx-nginx-conf (ro,path="nginx.conf")
      /etc/openldap/certs/ldap-ca.crt from awx-ldap-cacert (ro,path="ldap-ca.crt")
      /etc/pki/ca-trust/extracted from ca-trust-extracted (rw)
      /etc/pki/ca-trust/source/anchors/bundle-ca.crt from awx-bundle-cacert (ro,path="bundle-ca.crt")
      /etc/receptor/signing/work-public-key.pem from awx-receptor-work-signing (ro,path="work-public-key.pem")
      /etc/receptor/tls/ca/receptor-ca.crt from awx-receptor-ca (ro,path="tls.crt")
      /etc/receptor/tls/ca/receptor-ca.key from awx-receptor-ca (ro,path="tls.key")
      /etc/tower/SECRET_KEY from awx-secret-key (ro,path="SECRET_KEY")
      /etc/tower/conf.d/credentials.py from awx-application-credentials (ro,path="credentials.py")
      /etc/tower/conf.d/execution_environments.py from awx-application-credentials (ro,path="execution_environments.py")
      /etc/tower/conf.d/ldap.py from awx-application-credentials (ro,path="ldap.py")
      /etc/tower/settings.py from awx-settings (ro,path="settings.py")
      /var/lib/awx/projects from awx-projects (rw)
      /var/lib/awx/rsyslog from rsyslog-dir (rw)
      /var/run/awx-rsyslog from rsyslog-socket (rw)
      /var/run/redis from awx-redis-socket (rw)
      /var/run/secrets/kubernetes.io/serviceaccount from kube-api-access-f96x7 (ro)
      /var/run/supervisor from supervisor-socket (rw)
  awx-task:
    Container ID:  containerd://7de76df9cdc0621fd1acf2a73f80a59fb3eb9a2007142a34a88c430afc06bce9
    Image:         quay.io/ansible/awx:21.13.0
    Port:          <none>
    Host Port:     <none>
    Args:
      /usr/bin/launch_awx_task.sh
    State:          Terminated
      Reason:       Completed
      Exit Code:    0
      Started:      Fri, 10 Mar 2023 10:47:19 +1300
      Finished:     Fri, 10 Mar 2023 11:28:03 +1300
    Ready:          False
    Restart Count:  0
    Environment:
      SUPERVISOR_WEB_CONFIG_PATH:  /etc/supervisord.conf
      AWX_SKIP_MIGRATIONS:         1
      MY_POD_UID:                   (v1:metadata.uid)
      MY_POD_IP:                    (v1:status.podIP)
      MY_POD_NAMESPACE:            awx (v1:metadata.namespace)
    Mounts:
      /etc/pki/ca-trust/extracted from ca-trust-extracted (rw)
      /etc/pki/ca-trust/source/anchors/bundle-ca.crt from awx-bundle-cacert (ro,path="bundle-ca.crt")
      /etc/receptor/ from awx-receptor-config (rw)
      /etc/receptor/signing/work-private-key.pem from awx-receptor-work-signing (ro,path="work-private-key.pem")
      /etc/tower/SECRET_KEY from awx-secret-key (ro,path="SECRET_KEY")
      /etc/tower/conf.d/credentials.py from awx-application-credentials (ro,path="credentials.py")
      /etc/tower/conf.d/execution_environments.py from awx-application-credentials (ro,path="execution_environments.py")
      /etc/tower/conf.d/ldap.py from awx-application-credentials (ro,path="ldap.py")
      /etc/tower/settings.py from awx-settings (ro,path="settings.py")
      /var/lib/awx/projects from awx-projects (rw)
      /var/lib/awx/rsyslog from rsyslog-dir (rw)
      /var/run/awx-rsyslog from rsyslog-socket (rw)
      /var/run/receptor from receptor-socket (rw)
      /var/run/redis from awx-redis-socket (rw)
      /var/run/secrets/kubernetes.io/serviceaccount from kube-api-access-f96x7 (ro)
      /var/run/supervisor from supervisor-socket (rw)
  awx-ee:
    Container ID:
    Image:         quay.io/ansible/awx-ee:latest
    Image ID:
    Port:          <none>
    Host Port:     <none>
    Args:
      /bin/sh
      -c
      if [ ! -f /etc/receptor/receptor.conf ]; then
        cp /etc/receptor/receptor-default.conf /etc/receptor/receptor.conf
        sed -i "s/HOSTNAME/$HOSTNAME/g" /etc/receptor/receptor.conf
      fi
      exec receptor --config /etc/receptor/receptor.conf

    State:          Terminated
      Reason:       ContainerStatusUnknown
      Message:      The container could not be located when the pod was terminated
      Exit Code:    137
      Started:      Mon, 01 Jan 0001 00:00:00 +0000
      Finished:     Mon, 01 Jan 0001 00:00:00 +0000
    Last State:     Terminated
      Reason:       ContainerStatusUnknown
      Message:      The container could not be located when the pod was deleted.  The container used to be Running
      Exit Code:    137
      Started:      Mon, 01 Jan 0001 00:00:00 +0000
      Finished:     Mon, 01 Jan 0001 00:00:00 +0000
    Ready:          False
    Restart Count:  1
    Environment:    <none>
    Mounts:
      /etc/pki/ca-trust/extracted from ca-trust-extracted (rw)
      /etc/pki/ca-trust/source/anchors/bundle-ca.crt from awx-bundle-cacert (ro,path="bundle-ca.crt")
      /etc/receptor/ from awx-receptor-config (rw)
      /etc/receptor/receptor-default.conf from awx-default-receptor-config (rw,path="receptor.conf")
      /etc/receptor/signing/work-private-key.pem from awx-receptor-work-signing (ro,path="work-private-key.pem")
      /etc/receptor/tls/ from awx-receptor-tls (rw)
      /etc/receptor/tls/ca/receptor-ca.crt from awx-receptor-ca (ro,path="tls.crt")
      /var/lib/awx/projects from awx-projects (rw)
      /var/run/receptor from receptor-socket (rw)
      /var/run/secrets/kubernetes.io/serviceaccount from kube-api-access-f96x7 (ro)
Conditions:
  Type              Status
  Initialized       True
  Ready             False
  ContainersReady   False
  PodScheduled      True
Volumes:
  ca-trust-extracted:
    Type:       EmptyDir (a temporary directory that shares a pod's lifetime)
    Medium:
    SizeLimit:  <unset>
  awx-bundle-cacert:
    Type:        Secret (a volume populated by a Secret)
    SecretName:  awx-custom-certs
    Optional:    false
  awx-ldap-cacert:
    Type:        Secret (a volume populated by a Secret)
    SecretName:  awx-custom-certs
    Optional:    false
  awx-application-credentials:
    Type:        Secret (a volume populated by a Secret)
    SecretName:  awx-app-credentials
    Optional:    false
  awx-receptor-tls:
    Type:       EmptyDir (a temporary directory that shares a pod's lifetime)
    Medium:
    SizeLimit:  <unset>
  awx-receptor-ca:
    Type:        Secret (a volume populated by a Secret)
    SecretName:  awx-receptor-ca
    Optional:    false
  awx-receptor-work-signing:
    Type:        Secret (a volume populated by a Secret)
    SecretName:  awx-receptor-work-signing
    Optional:    false
  awx-secret-key:
    Type:        Secret (a volume populated by a Secret)
    SecretName:  awx-secret-key
    Optional:    false
  awx-settings:
    Type:      ConfigMap (a volume populated by a ConfigMap)
    Name:      awx-awx-configmap
    Optional:  false
  awx-nginx-conf:
    Type:      ConfigMap (a volume populated by a ConfigMap)
    Name:      awx-awx-configmap
    Optional:  false
  awx-redis-config:
    Type:      ConfigMap (a volume populated by a ConfigMap)
    Name:      awx-awx-configmap
    Optional:  false
  awx-redis-socket:
    Type:       EmptyDir (a temporary directory that shares a pod's lifetime)
    Medium:
    SizeLimit:  <unset>
  awx-redis-data:
    Type:       EmptyDir (a temporary directory that shares a pod's lifetime)
    Medium:
    SizeLimit:  <unset>
  supervisor-socket:
    Type:       EmptyDir (a temporary directory that shares a pod's lifetime)
    Medium:
    SizeLimit:  <unset>
  rsyslog-socket:
    Type:       EmptyDir (a temporary directory that shares a pod's lifetime)
    Medium:
    SizeLimit:  <unset>
  receptor-socket:
    Type:       EmptyDir (a temporary directory that shares a pod's lifetime)
    Medium:
    SizeLimit:  <unset>
  rsyslog-dir:
    Type:       EmptyDir (a temporary directory that shares a pod's lifetime)
    Medium:
    SizeLimit:  <unset>
  awx-receptor-config:
    Type:       EmptyDir (a temporary directory that shares a pod's lifetime)
    Medium:
    SizeLimit:  <unset>
  awx-default-receptor-config:
    Type:      ConfigMap (a volume populated by a ConfigMap)
    Name:      awx-awx-configmap
    Optional:  false
  awx-projects:
    Type:       PersistentVolumeClaim (a reference to a PersistentVolumeClaim in the same namespace)
    ClaimName:  awx-projects-claim
    ReadOnly:   false
  kube-api-access-f96x7:
    Type:                    Projected (a volume that contains injected data from multiple sources)
    TokenExpirationSeconds:  3607
    ConfigMapName:           kube-root-ca.crt
    ConfigMapOptional:       <nil>
    DownwardAPI:             true
QoS Class:                   Burstable
Node-Selectors:              <none>
Tolerations:                 node.kubernetes.io/not-ready:NoExecute op=Exists for 300s
                             node.kubernetes.io/unreachable:NoExecute op=Exists for 300s
Events:
  Type     Reason               Age   From               Message
  ----     ------               ----  ----               -------
  Normal   Scheduled            53m   default-scheduler  Successfully assigned awx/awx-9668dcb98-nzg5q to ansible
  Normal   Pulled               53m   kubelet            Container image "quay.io/ansible/awx-ee:latest" already present on machine
  Normal   Created              53m   kubelet            Created container init
  Normal   Started              53m   kubelet            Started container init
  Normal   Pulled               52m   kubelet            Container image "quay.io/centos/centos:stream9" already present on machine
  Normal   Created              52m   kubelet            Created container init-projects
  Normal   Started              52m   kubelet            Started container init-projects
  Normal   Pulled               52m   kubelet            Container image "docker.io/redis:7" already present on machine
  Normal   Created              52m   kubelet            Created container redis
  Normal   Started              52m   kubelet            Started container redis
  Normal   Pulled               52m   kubelet            Container image "quay.io/ansible/awx:21.13.0" already present on machine
  Normal   Created              52m   kubelet            Created container awx-web
  Normal   Started              52m   kubelet            Started container awx-web
  Normal   Pulled               52m   kubelet            Container image "quay.io/ansible/awx:21.13.0" already present on machine
  Normal   Created              52m   kubelet            Created container awx-task
  Normal   Started              52m   kubelet            Started container awx-task
  Normal   Pulled               52m   kubelet            Container image "quay.io/ansible/awx-ee:latest" already present on machine
  Normal   Created              52m   kubelet            Created container awx-ee
  Normal   Started              52m   kubelet            Started container awx-ee
  Warning  Evicted              12m   kubelet            The node was low on resource: ephemeral-storage. Container awx-ee was using 1008644Ki, which exceeds its request of 0. Container redis was using 36Ki, which exceeds its request of 0. Container awx-task was using 873704Ki, which exceeds its request of 0. Container awx-web was using 360Ki, which exceeds its request of 0.
  Normal   Killing              12m   kubelet            Stopping container redis
  Normal   Killing              12m   kubelet            Stopping container awx-ee
  Normal   Killing              12m   kubelet            Stopping container awx-task
  Normal   Killing              12m   kubelet            Stopping container awx-web
  Warning  ExceededGracePeriod  12m   kubelet            Container runtime did not kill the pod within specified grace period.
root@ansible:~/awx-on-k3s/base#



Last time I tried this there were no events.

This looks like a possible issue:

  Warning  Evicted              12m   kubelet            The node was low on resource: ephemeral-storage. Container awx-ee was using 1008644Ki, which exceeds its request of 0. Container redis was using 36Ki, which exceeds its request of 0. Container awx-task was using 873704Ki, which exceeds its request of 0. Container awx-web was using 360Ki, which exceeds its request of 0.
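Adding up the figures in that eviction message (the Ki values are KiB), this one pod alone accounts for roughly 1.8 GiB of ephemeral storage:

```shell
# Sum the per-container ephemeral-storage usage reported in the eviction message (KiB)
total_ki=$((1008644 + 36 + 873704 + 360))
echo "$((total_ki / 1024)) MiB"   # ~1838 MiB, i.e. roughly 1.8 GiB for a single AWX pod
```

Against the 8.7 GB free on the 28 GB virtual disk, a few replacement pods plus job pods could plausibly exhaust the node's ephemeral storage.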

And I see that there are no resources configured in awx.yaml:


---
kind: AWX
metadata:
  name: awx
spec:
  # These parameters are designed for use with:
  # - AWX Operator: 0.26.0
  # - AWX: 21.4.0

  admin_user: admin
  admin_password_secret: awx-admin-password

  ingress_type: ingress
  ingress_tls_secret: awx-secret-tls
  hostname: awx.gallagher.local

  postgres_configuration_secret: awx-postgres-configuration

  postgres_storage_class: awx-postgres-volume
  postgres_storage_requirements:
    requests:
      storage: 8Gi

  projects_persistence: true
  projects_existing_claim: awx-projects-claim

  postgres_init_container_resource_requirements: {}
  postgres_resource_requirements: {}
  web_resource_requirements: {}
  task_resource_requirements: {}
  ee_resource_requirements: {}

  ldap_cacert_secret: awx-custom-certs
  bundle_cacert_secret: awx-custom-certs

  # Uncomment to reveal "censored" logs
  #no_log: "false"


I configured the resources as follows, based on the example:
  web_resource_requirements:
    requests:
      cpu: 250m
      memory: 2Gi
    limits:
      cpu: 1000m
      memory: 4Gi

  task_resource_requirements:
    requests:
      cpu: 250m
      memory: 1Gi
    limits:
      cpu: 2000m
      memory: 2Gi

  ee_resource_requirements:
    requests:
      cpu: 250m
      memory: 100Mi
    limits:
      cpu: 500m
      memory: 2Gi

This seems to have helped when running 2 jobs with 2 concurrent hosts each, but when I then ran 4 jobs, each with a single host, it crashed again.

These are the events for the crashed pod:

Events:
  Type     Reason               Age   From               Message
  ----     ------               ----  ----               -------
  Normal   Scheduled            26m   default-scheduler  Successfully assigned awx/awx-6b55565fcd-q74wv to gglvansible
  Normal   Pulled               26m   kubelet            Container image "quay.io/ansible/awx-ee:latest" already present on machine
  Normal   Created              26m   kubelet            Created container init
  Normal   Started              26m   kubelet            Started container init
  Normal   Pulled               26m   kubelet            Container image "quay.io/centos/centos:stream9" already present on machine
  Normal   Created              26m   kubelet            Created container init-projects
  Normal   Started              26m   kubelet            Started container init-projects
  Normal   Pulled               26m   kubelet            Container image "docker.io/redis:7" already present on machine
  Normal   Created              26m   kubelet            Created container redis
  Normal   Started              26m   kubelet            Started container redis
  Normal   Pulled               26m   kubelet            Container image "quay.io/ansible/awx:21.13.0" already present on machine
  Normal   Created              26m   kubelet            Created container awx-web
  Normal   Started              26m   kubelet            Started container awx-web
  Normal   Pulled               26m   kubelet            Container image "quay.io/ansible/awx:21.13.0" already present on machine
  Normal   Created              26m   kubelet            Created container awx-task
  Normal   Started              26m   kubelet            Started container awx-task
  Normal   Pulled               26m   kubelet            Container image "quay.io/ansible/awx-ee:latest" already present on machine
  Normal   Created              26m   kubelet            Created container awx-ee
  Normal   Started              26m   kubelet            Started container awx-ee
  Warning  Evicted              12m   kubelet            The node was low on resource: ephemeral-storage. Container redis was using 32Ki, which exceeds its request of 0. Container awx-task was using 3260068Ki, which exceeds its request of 0. Container awx-ee was using 917416Ki, which exceeds its request of 0. Container awx-web was using 632Ki, which exceeds its request of 0.
  Normal   Killing              12m   kubelet            Stopping container redis
  Normal   Killing              12m   kubelet            Stopping container awx-task
  Normal   Killing              12m   kubelet            Stopping container awx-ee
  Normal   Killing              12m   kubelet            Stopping container awx-web
  Warning  ExceededGracePeriod  12m   kubelet            Container runtime did not kill the pod within specified grace period.
root@gglvansible:~/awx-on-k3s/base#

This is a similar error to the previous example.
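One thing I notice: both evictions cite ephemeral-storage, but the requests I configured only cover cpu and memory. Kubernetes also accepts ephemeral-storage under requests/limits, so if the operator passes these dicts straight through to the pod spec, presumably I could request it too. A sketch (the sizes are guesses based on the observed usage, not tested values):

```yaml
  ee_resource_requirements:
    requests:
      cpu: 250m
      memory: 100Mi
      ephemeral-storage: 2Gi   # guess: awx-ee was observed using ~1 GiB
    limits:
      cpu: 500m
      memory: 2Gi
      ephemeral-storage: 5Gi

  task_resource_requirements:
    requests:
      cpu: 250m
      memory: 1Gi
      ephemeral-storage: 2Gi   # guess: awx-task was observed using ~0.9-3.1 GiB
    limits:
      cpu: 2000m
      memory: 2Gi
      ephemeral-storage: 5Gi
```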

Any thoughts?



AWX Project

Mar 10, 2023, 1:41:00 PM
to AWX Project
Yeah, the pods are being evicted due to low resources. It might be worth playing around with setting the request values for the web, ee, and awx containers.


AWX Team

Gregory Machin

Mar 13, 2023, 5:06:17 AM
to awx-p...@googlegroups.com
Thank you for the assistance.

I configured the resource limits, and after that I discovered that the system was running out of disk space, causing it to fall over.
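For anyone who hits the same thing, this is roughly how I checked where the disk had gone (a sketch; /var/lib/rancher and /var/lib/kubelet are the default k3s locations):

```shell
# See where the disk went on the k3s host (paths are the k3s defaults)
df -h /
for d in /var/lib/rancher /var/lib/kubelet /var/log; do
  if [ -d "$d" ]; then du -sh "$d" 2>/dev/null || true; fi
done
```

Old container images are often the biggest consumer; on k3s they can reportedly be pruned with "k3s crictl rmi --prune" (worth verifying against your crictl version first).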
