Re: [awx-project] Controller can no longer create automation nodes

118 views
Skip to first unread message

Christian Adams

unread,
Jul 6, 2023, 5:08:13 PM7/6/23
to awx-p...@googlegroups.com
A multi-attach error usually means you have the PVC mounted in multiple containers, but the storageClass that created the PVC does not support RWX. I would check to see if any other pods are mounting the same Postgres pvc. As a first step.

AWX Team

On Thu, Jul 6, 2023 at 5:03 PM Kevin Knox <kevin.k...@gmail.com> wrote:
At 3:00 this afternoon, for no apparent reason, my Dev AWX instance just quit being able to create new automation nodes. It had been running for more than a month and processed 130k jobs successfully. The Controller knows what jobs should be running, and it attempts to start them, but the automation pod never instantiates, not even to "container creating."

I have this in the task node logs:
[wait-for-migrations] Waiting for database migrations... [wait-for-migrations] Attempt 1 of 30 Instance Group already registered controlplane Instance Group already registered default Successfully registered instance None (changed: True) 2023-07-06 20:03:18,436 INFO RPC interface 'supervisor' initialized 2023-07-06 20:03:18,436 INFO RPC interface 'supervisor' initialized 2023-07-06 20:03:18,436 CRIT Server 'unix_http_server' running without any HTTP authentication checking 2023-07-06 20:03:18,436 CRIT Server 'unix_http_server' running without any HTTP authentication checking 2023-07-06 20:03:18,436 INFO supervisord started with pid 7 2023-07-06 20:03:18,436 INFO supervisord started with pid 7 2023-07-06 20:03:19,439 INFO spawned: 'superwatcher' with pid 28 2023-07-06 20:03:19,439 INFO spawned: 'superwatcher' with pid 28 2023-07-06 20:03:19,441 INFO spawned: 'dispatcher' with pid 29 2023-07-06 20:03:19,441 INFO spawned: 'dispatcher' with pid 29 2023-07-06 20:03:19,444 INFO spawned: 'callback-receiver' with pid 30 2023-07-06 20:03:19,444 INFO spawned: 'callback-receiver' with pid 30 READY 2023-07-06 20:03:20,445 INFO success: superwatcher entered RUNNING state, process has stayed up for > than 1 seconds (startsecs) 2023-07-06 20:03:20,445 INFO success: superwatcher entered RUNNING state, process has stayed up for > than 1 seconds (startsecs) 2023-07-06 20:03:21,682 WARNING [-] awx.main.dispatch.periodic periodic beat started 2023-07-06 20:03:21,711 INFO [-] awx.main.dispatch Running worker dispatcher listening to queues ['tower_broadcast_all', 'it-mass-awx-75665894f8-tn5hd'] 2023-07-06 20:03:49,740 INFO success: dispatcher entered RUNNING state, process has stayed up for > than 30 seconds (startsecs) 2023-07-06 20:03:49,740 INFO success: dispatcher entered RUNNING state, process has stayed up for > than 30 seconds (startsecs) 2023-07-06 20:03:49,740 INFO success: callback-receiver entered RUNNING state, process has stayed up for > than 30 seconds (startsecs) 2023-07-06 20:03:49,740 INFO success: callback-receiver entered RUNNING state, process has stayed up for > than 30 seconds (startsecs) 2023-07-06 20:03:53,060 INFO [f372fdfaf38b4e1eaa01d064781c67a6] awx.main.dispatch task 725de955-a55a-419e-a05a-1df5e660726d starting awx.main.scheduler.tasks.task_manager(*[]) took 11.3476 to ack, 0.0066 in local dispatcher 2023-07-06 20:03:53,070 INFO [f372fdfaf38b4e1eaa01d064781c67a6] awx.main.dispatch task fce9b15c-da78-4abe-a147-5e7a87455e27 starting awx.main.scheduler.tasks.dependency_manager(*[]) took 11.3492 to ack, 0.0021 in local dispatcher 2023-07-06 20:03:53,070 INFO [f372fdfaf38b4e1eaa01d064781c67a6] awx.main.dispatch task 6ba68fd4-b0bc-41e0-880f-905231102968 starting awx.main.analytics.analytics_tasks.send_subsystem_metrics(*[]) took 11.3491 to ack, 0.0016 in local dispatcher 2023-07-06 20:03:53,266 INFO [f372fdfaf38b4e1eaa01d064781c67a6] awx.analytics.job_lifecycle job-135181 waiting 2023-07-06 20:05:23,213 INFO [1b4bf41d3da14c55a374fe65ca76fd99] awx.analytics.job_lifecycle job-135182 waiting 2023-07-06 20:05:27,382 ERROR [4dd2dae62f04481fb4fb50e22dba78b9] awx.main.dispatch job 135181 (failed) is no longer waiting; reaping 2023-07-06 20:05:27,517 INFO [4dd2dae62f04481fb4fb50e22dba78b9] awx.analytics.job_lifecycle job-135181 notifications sent 2023-07-06 20:05:47,475 INFO [c1ee5d7108e946c79df34a3aebf13ef7] awx.analytics.job_lifecycle job-135183 waiting 2023-07-06 20:05:47,688 INFO [c1ee5d7108e946c79df34a3aebf13ef7] awx.analytics.job_lifecycle job-135183 pre run 2023-07-06 20:05:47,716 INFO [c1ee5d7108e946c79df34a3aebf13ef7] awx.analytics.job_lifecycle projectupdate-135184 created 2023-07-06 20:05:47,717 INFO [c1ee5d7108e946c79df34a3aebf13ef7] awx.analytics.job_lifecycle projectupdate-135184 controller node chosen 2023-07-06 20:05:47,717 INFO [c1ee5d7108e946c79df34a3aebf13ef7] awx.analytics.job_lifecycle projectupdate-135184 execution node chosen 2023-07-06 20:05:47,815 INFO [c1ee5d7108e946c79df34a3aebf13ef7] awx.analytics.job_lifecycle projectupdate-135184 pre run 2023-07-06 20:05:47,821 INFO [c1ee5d7108e946c79df34a3aebf13ef7] awx.analytics.job_lifecycle projectupdate-135184 preparing playbook 2023-07-06 20:05:47,864 INFO [c1ee5d7108e946c79df34a3aebf13ef7] awx.analytics.job_lifecycle projectupdate-135184 running playbook 2023-07-06 20:05:47,895 INFO [c1ee5d7108e946c79df34a3aebf13ef7] awx.analytics.job_lifecycle projectupdate-135184 work unit id received 2023-07-06 20:05:47,907 INFO [c1ee5d7108e946c79df34a3aebf13ef7] awx.analytics.job_lifecycle projectupdate-135184 work unit id assigned 2023-07-06 20:05:50,217 INFO [-] awx.analytics.job_lifecycle projectupdate-135184 stats wrapup finished 2023-07-06 20:05:50,417 INFO [c1ee5d7108e946c79df34a3aebf13ef7] awx.main.commands.run_callback_receiver Starting EOF event processing for Job 135184 2023-07-06 20:05:50,418 INFO [c1ee5d7108e946c79df34a3aebf13ef7] awx.analytics.job_lifecycle projectupdate-135184 post run 2023-07-06 20:05:50,455 INFO [c1ee5d7108e946c79df34a3aebf13ef7] awx.analytics.job_lifecycle projectupdate-135184 finalize run 2023-07-06 20:05:50,531 INFO [c1ee5d7108e946c79df34a3aebf13ef7] awx.analytics.job_lifecycle job-135183 post run 2023-07-06 20:05:55,909 INFO [c1ee5d7108e946c79df34a3aebf13ef7] awx.analytics.job_lifecycle job-135183 finalize run 2023-07-06 20:05:55,913 WARNING [c1ee5d7108e946c79df34a3aebf13ef7] awx.main.dispatch job 135183 (error) encountered an error (rc=None), please see task stdout for details. 2023-07-06 20:05:56,014 INFO [c1ee5d7108e946c79df34a3aebf13ef7] awx.analytics.job_lifecycle job-135183 notifications sent
I have this in the OpenShift logs:
Warning   FailedAttachVolume       pod/it-mass-awx-postgres-13-0                           Multi-Attach error for volume "pvc-a87c67f2-025e-4ce1-90fe-e43320d153dc" Volume is already exclusively attached to one node and can't be attached to another

Any hints where I can dig? 

Thank you,

Kevin
(I posted about 30 minutes ago and saw my post in the list, but I don't see it now after multiple refreshes, so I have reposted.)

--
You received this message because you are subscribed to the Google Groups "AWX Project" group.
To unsubscribe from this group and stop receiving emails from it, send an email to awx-project...@googlegroups.com.
To view this discussion on the web visit https://groups.google.com/d/msgid/awx-project/d36cae95-18c5-4b12-830e-dd6174c65956n%40googlegroups.com.
--
__________________________________________________
Christian M. Adams
Software Engineer at Ansible - Red Hat
cha...@redhat.com  |  (919) 218-5080  |  GitHub: rooftopcellist

Kevin Knox

unread,
Jul 6, 2023, 5:15:09 PM7/6/23
to AWX Project
Thank you! Will do.

Kevin

Kevin Knox

unread,
Jul 7, 2023, 9:23:41 AM7/7/23
to AWX Project
I'm almost sure the PVC thing is a red herring. First, the GUI is working fine. I have all my data, all my templates, all my inventories, and the job records exist. Second, I have this yaml for the PVC:
```
apiVersion: v1
kind: PersistentVolumeClaim
metadata:
  annotations:
    pv.kubernetes.io/bind-completed: "yes"
    pv.kubernetes.io/bound-by-controller: "yes"
    volume.beta.kubernetes.io/storage-provisioner: kubernetes.io/vsphere-volume
  creationTimestamp: "2023-01-16T20:17:46Z"
  finalizers:
  - kubernetes.io/pvc-protection
  labels:
    app.kubernetes.io/component: database
    app.kubernetes.io/instance: postgres-13-it-mass-awx
    app.kubernetes.io/managed-by: awx-operator
    app.kubernetes.io/name: postgres-13
```

This smells like something is wrong in the working pod. The controller can tell it to spin up a new automation pod, but it cannot do it. I don't believe this is likely a database error. 

Kevin Knox

unread,
Jul 7, 2023, 9:58:50 AM7/7/23
to AWX Project
So, here is the log of a failure trying to update a github project and then a success from an identical system. What I see is the EOF event happens before the post run event on the failure. That's really thin gruel for analysis and probably means nothing. I'm told to look at stdout for details. Any insights into how I might do that? Anywhere else I might be able to look for clues?

```
2023-07-07 13:51:02,277 INFO     [6592890e3fcd4ba39b4605a69fde8808] awx.analytics.job_lifecycle projectupdate-139415 controller node chosen
2023-07-07 13:51:02,278 INFO     [6592890e3fcd4ba39b4605a69fde8808] awx.analytics.job_lifecycle projectupdate-139415 execution node chosen
2023-07-07 13:51:02,328 INFO     [6592890e3fcd4ba39b4605a69fde8808] awx.analytics.job_lifecycle projectupdate-139415 waiting
2023-07-07 13:51:02,456 INFO     [6592890e3fcd4ba39b4605a69fde8808] awx.analytics.job_lifecycle projectupdate-139415 pre run
2023-07-07 13:51:02,461 INFO     [6592890e3fcd4ba39b4605a69fde8808] awx.analytics.job_lifecycle projectupdate-139415 preparing playbook
2023-07-07 13:51:02,498 INFO     [6592890e3fcd4ba39b4605a69fde8808] awx.analytics.job_lifecycle projectupdate-139415 running playbook
2023-07-07 13:51:02,531 INFO     [6592890e3fcd4ba39b4605a69fde8808] awx.analytics.job_lifecycle projectupdate-139415 work unit id received
2023-07-07 13:51:02,546 INFO     [6592890e3fcd4ba39b4605a69fde8808] awx.analytics.job_lifecycle projectupdate-139415 work unit id assigned
2023-07-07 13:51:04,849 INFO     [-] awx.analytics.job_lifecycle projectupdate-139415 stats wrapup finished
2023-07-07 13:51:05,056 INFO     [6592890e3fcd4ba39b4605a69fde8808] awx.main.commands.run_callback_receiver Starting EOF event processing for Job 139415
2023-07-07 13:51:05,058 INFO     [6592890e3fcd4ba39b4605a69fde8808] awx.analytics.job_lifecycle projectupdate-139415 post run
2023-07-07 13:51:05,146 INFO     [6592890e3fcd4ba39b4605a69fde8808] awx.analytics.job_lifecycle projectupdate-139415 finalize run
2023-07-07 13:51:05,158 WARNING  [6592890e3fcd4ba39b4605a69fde8808] awx.main.dispatch project_update 139415 (failed) encountered an error (rc=None), please see task stdout for details.
```
success:
```
2023-07-07 13:52:25,843 INFO [927983d6cb734c91babf825790086fb3] awx.analytics.job_lifecycle projectupdate-1366 controller node chosen
2023-07-07 13:52:25,843 INFO [927983d6cb734c91babf825790086fb3] awx.analytics.job_lifecycle projectupdate-1366 execution node chosen
2023-07-07 13:52:25,901 INFO [927983d6cb734c91babf825790086fb3] awx.analytics.job_lifecycle projectupdate-1366 waiting
2023-07-07 13:52:26,327 INFO [927983d6cb734c91babf825790086fb3] awx.analytics.job_lifecycle projectupdate-1366 pre run
2023-07-07 13:52:26,354 INFO [927983d6cb734c91babf825790086fb3] awx.analytics.job_lifecycle projectupdate-1366 preparing playbook
2023-07-07 13:52:26,393 INFO [927983d6cb734c91babf825790086fb3] awx.analytics.job_lifecycle projectupdate-1366 running playbook
2023-07-07 13:52:26,414 INFO [927983d6cb734c91babf825790086fb3] awx.analytics.job_lifecycle projectupdate-1366 work unit id received
2023-07-07 13:52:26,430 INFO [927983d6cb734c91babf825790086fb3] awx.analytics.job_lifecycle projectupdate-1366 work unit id assigned
2023-07-07 13:52:52,571 INFO [-] awx.analytics.job_lifecycle projectupdate-1366 stats wrapup finished
2023-07-07 13:52:52,763 INFO [927983d6cb734c91babf825790086fb3] awx.analytics.job_lifecycle projectupdate-1366 post run
2023-07-07 13:52:52,763 INFO [927983d6cb734c91babf825790086fb3] awx.main.commands.run_callback_receiver Starting EOF event processing for Job 1366
2023-07-07 13:52:53,503 INFO [927983d6cb734c91babf825790086fb3] awx.analytics.job_lifecycle projectupdate-1366 finalize run
                                                                                                                                                                                        
```

Kevin Knox

unread,
Jul 7, 2023, 10:20:39 AM7/7/23
to AWX Project
Another instance of the same event on the failing system with DEBUG turned on. I see nothing helpful. 
```
2023-07-07 14:14:41,880 DEBUG    [2c324d9e6a9c4277acd4b0a9fb9ef45a] awx.main.dispatch task 05abd5d7-869c-4501-9d39-74b36f900a93 starting awx.main.scheduler.tasks.task_manager(*[])
2023-07-07 14:14:41,883 DEBUG    [2c324d9e6a9c4277acd4b0a9fb9ef45a] awx.main.dispatch scaling down worker pid:13924
2023-07-07 14:14:41,884 DEBUG    [2c324d9e6a9c4277acd4b0a9fb9ef45a] awx.main.dispatch scaling down worker pid:13930
2023-07-07 14:14:41,885 DEBUG    [2c324d9e6a9c4277acd4b0a9fb9ef45a] awx.main.dispatch scaling down worker pid:13932
2023-07-07 14:14:41,885 DEBUG    [56b5ae63bf2446ad9a23f8cd27f487ec] awx.main.dispatch worker exiting gracefully pid:13924
2023-07-07 14:14:41,886 DEBUG    [56b5ae63bf2446ad9a23f8cd27f487ec] awx.main.dispatch worker exiting gracefully pid:13930
2023-07-07 14:14:41,887 DEBUG    [2c324d9e6a9c4277acd4b0a9fb9ef45a] awx.main.dispatch task 959118df-2b92-4340-a4e7-dd1550ad1356 starting awx.main.scheduler.tasks.dependency_manager(*[])
2023-07-07 14:14:41,887 DEBUG    [2c324d9e6a9c4277acd4b0a9fb9ef45a] awx.main.dispatch task ab9c6c4c-48b9-431c-830a-47c35b22b495 starting awx.main.analytics.analytics_tasks.send_subsystem_metrics(*[])
2023-07-07 14:14:41,889 DEBUG    [56b5ae63bf2446ad9a23f8cd27f487ec] awx.main.dispatch worker exiting gracefully pid:13932
2023-07-07 14:14:42,869 DEBUG    [9d34897d960f4a909943fc691cc5de35] awx.main.dispatch task 29851216-1182-48ee-a77f-24699201625a starting awx.main.scheduler.tasks.task_manager(*[])
2023-07-07 14:14:42,911 INFO     [9d34897d960f4a909943fc691cc5de35] awx.analytics.job_lifecycle projectupdate-139419 controller node chosen
2023-07-07 14:14:42,911 INFO     [9d34897d960f4a909943fc691cc5de35] awx.analytics.job_lifecycle projectupdate-139419 execution node chosen
2023-07-07 14:14:42,965 INFO     [9d34897d960f4a909943fc691cc5de35] awx.analytics.job_lifecycle projectupdate-139419 waiting
2023-07-07 14:14:42,977 DEBUG    [9d34897d960f4a909943fc691cc5de35] awx.main.dispatch task edd2d16b-f1a3-47c9-9b24-5859d5724961 starting awx.main.tasks.jobs.RunProjectUpdate(*[139419])
2023-07-07 14:14:43,173 INFO     [9d34897d960f4a909943fc691cc5de35] awx.analytics.job_lifecycle projectupdate-139419 pre run
2023-07-07 14:14:43,205 INFO     [9d34897d960f4a909943fc691cc5de35] awx.analytics.job_lifecycle projectupdate-139419 preparing playbook
2023-07-07 14:14:43,242 INFO     [9d34897d960f4a909943fc691cc5de35] awx.analytics.job_lifecycle projectupdate-139419 running playbook
2023-07-07 14:14:43,264 INFO     [9d34897d960f4a909943fc691cc5de35] awx.analytics.job_lifecycle projectupdate-139419 work unit id received
2023-07-07 14:14:43,277 INFO     [9d34897d960f4a909943fc691cc5de35] awx.analytics.job_lifecycle projectupdate-139419 work unit id assigned
2023-07-07 14:14:44,786 DEBUG    [-] awx.main.commands.run_callback_receiver ProjectUpdateEvent.objects.bulk_create(1)
2023-07-07 14:14:44,786 DEBUG    [-] awx.main.commands.run_callback_receiver ProjectUpdateEvent.objects.bulk_create(1)
2023-07-07 14:14:44,787 DEBUG    [-] awx.main.commands.run_callback_receiver ProjectUpdateEvent.objects.bulk_create(1)
2023-07-07 14:14:44,794 DEBUG    [-] awx.main.commands.run_callback_receiver ProjectUpdateEvent.objects.bulk_create(1)
2023-07-07 14:14:46,040 DEBUG    [-] awx.main.commands.run_callback_receiver ProjectUpdateEvent.objects.bulk_create(1)
2023-07-07 14:14:46,043 DEBUG    [-] awx.main.commands.run_callback_receiver ProjectUpdateEvent.objects.bulk_create(1)
2023-07-07 14:14:46,093 INFO     [-] awx.analytics.job_lifecycle projectupdate-139419 stats wrapup finished
2023-07-07 14:14:46,290 INFO     [9d34897d960f4a909943fc691cc5de35] awx.main.commands.run_callback_receiver Starting EOF event processing for Job 139419
2023-07-07 14:14:46,291 INFO     [9d34897d960f4a909943fc691cc5de35] awx.analytics.job_lifecycle projectupdate-139419 post run
2023-07-07 14:14:46,368 INFO     [9d34897d960f4a909943fc691cc5de35] awx.analytics.job_lifecycle projectupdate-139419 finalize run
2023-07-07 14:14:46,374 WARNING  [9d34897d960f4a909943fc691cc5de35] awx.main.dispatch project_update 139419 (failed) encountered an error (rc=None), please see task stdout for details.
2023-07-07 14:14:46,374 DEBUG    [9d34897d960f4a909943fc691cc5de35] awx.main.dispatch task edd2d16b-f1a3-47c9-9b24-5859d5724961 starting awx.main.tasks.system.handle_work_error(*[])
2023-07-07 14:14:46,382 DEBUG    [9d34897d960f4a909943fc691cc5de35] awx.main.dispatch task edd2d16b-f1a3-47c9-9b24-5859d5724961 starting awx.main.tasks.system.handle_work_success(*[])

```

Kevin Knox

unread,
Jul 7, 2023, 10:28:55 AM7/7/23
to AWX Project
OK. This may have been due to an expired Project credential. The error does not look like I expect, but who am I to say. I reinstalled the yaml for the whole project then got a credential error. I fixed the credential and things are much happier now. So, I don't know why reinstalling the project got me to my error, but I'll take it. 
Reply all
Reply to author
Forward
0 new messages