At 3:00 this afternoon, for no apparent reason, my Dev AWX instance stopped being able to create new automation nodes. It had been running for more than a month and had processed 130k jobs successfully. The Controller knows which jobs should be running and attempts to start them, but the automation pod never instantiates, not even reaching "container creating."
I have this in the task node logs:
[wait-for-migrations] Waiting for database migrations...
[wait-for-migrations] Attempt 1 of 30
Instance Group already registered controlplane
Instance Group already registered default
Successfully registered instance None
(changed: True)
2023-07-06 20:03:18,436 INFO RPC interface 'supervisor' initialized
2023-07-06 20:03:18,436 INFO RPC interface 'supervisor' initialized
2023-07-06 20:03:18,436 CRIT Server 'unix_http_server' running without any HTTP authentication checking
2023-07-06 20:03:18,436 CRIT Server 'unix_http_server' running without any HTTP authentication checking
2023-07-06 20:03:18,436 INFO supervisord started with pid 7
2023-07-06 20:03:18,436 INFO supervisord started with pid 7
2023-07-06 20:03:19,439 INFO spawned: 'superwatcher' with pid 28
2023-07-06 20:03:19,439 INFO spawned: 'superwatcher' with pid 28
2023-07-06 20:03:19,441 INFO spawned: 'dispatcher' with pid 29
2023-07-06 20:03:19,441 INFO spawned: 'dispatcher' with pid 29
2023-07-06 20:03:19,444 INFO spawned: 'callback-receiver' with pid 30
2023-07-06 20:03:19,444 INFO spawned: 'callback-receiver' with pid 30
READY
2023-07-06 20:03:20,445 INFO success: superwatcher entered RUNNING state, process has stayed up for > than 1 seconds (startsecs)
2023-07-06 20:03:20,445 INFO success: superwatcher entered RUNNING state, process has stayed up for > than 1 seconds (startsecs)
2023-07-06 20:03:21,682 WARNING [-] awx.main.dispatch.periodic periodic beat started
2023-07-06 20:03:21,711 INFO [-] awx.main.dispatch Running worker dispatcher listening to queues ['tower_broadcast_all', 'it-mass-awx-75665894f8-tn5hd']
2023-07-06 20:03:49,740 INFO success: dispatcher entered RUNNING state, process has stayed up for > than 30 seconds (startsecs)
2023-07-06 20:03:49,740 INFO success: dispatcher entered RUNNING state, process has stayed up for > than 30 seconds (startsecs)
2023-07-06 20:03:49,740 INFO success: callback-receiver entered RUNNING state, process has stayed up for > than 30 seconds (startsecs)
2023-07-06 20:03:49,740 INFO success: callback-receiver entered RUNNING state, process has stayed up for > than 30 seconds (startsecs)
2023-07-06 20:03:53,060 INFO [f372fdfaf38b4e1eaa01d064781c67a6] awx.main.dispatch task 725de955-a55a-419e-a05a-1df5e660726d starting awx.main.scheduler.tasks.task_manager(*[]) took 11.3476 to ack, 0.0066 in local dispatcher
2023-07-06 20:03:53,070 INFO [f372fdfaf38b4e1eaa01d064781c67a6] awx.main.dispatch task fce9b15c-da78-4abe-a147-5e7a87455e27 starting awx.main.scheduler.tasks.dependency_manager(*[]) took 11.3492 to ack, 0.0021 in local dispatcher
2023-07-06 20:03:53,070 INFO [f372fdfaf38b4e1eaa01d064781c67a6] awx.main.dispatch task 6ba68fd4-b0bc-41e0-880f-905231102968 starting awx.main.analytics.analytics_tasks.send_subsystem_metrics(*[]) took 11.3491 to ack, 0.0016 in local dispatcher
2023-07-06 20:03:53,266 INFO [f372fdfaf38b4e1eaa01d064781c67a6] awx.analytics.job_lifecycle job-135181 waiting
2023-07-06 20:05:23,213 INFO [1b4bf41d3da14c55a374fe65ca76fd99] awx.analytics.job_lifecycle job-135182 waiting
2023-07-06 20:05:27,382 ERROR [4dd2dae62f04481fb4fb50e22dba78b9] awx.main.dispatch job 135181 (failed) is no longer waiting; reaping
2023-07-06 20:05:27,517 INFO [4dd2dae62f04481fb4fb50e22dba78b9] awx.analytics.job_lifecycle job-135181 notifications sent
2023-07-06 20:05:47,475 INFO [c1ee5d7108e946c79df34a3aebf13ef7] awx.analytics.job_lifecycle job-135183 waiting
2023-07-06 20:05:47,688 INFO [c1ee5d7108e946c79df34a3aebf13ef7] awx.analytics.job_lifecycle job-135183 pre run
2023-07-06 20:05:47,716 INFO [c1ee5d7108e946c79df34a3aebf13ef7] awx.analytics.job_lifecycle projectupdate-135184 created
2023-07-06 20:05:47,717 INFO [c1ee5d7108e946c79df34a3aebf13ef7] awx.analytics.job_lifecycle projectupdate-135184 controller node chosen
2023-07-06 20:05:47,717 INFO [c1ee5d7108e946c79df34a3aebf13ef7] awx.analytics.job_lifecycle projectupdate-135184 execution node chosen
2023-07-06 20:05:47,815 INFO [c1ee5d7108e946c79df34a3aebf13ef7] awx.analytics.job_lifecycle projectupdate-135184 pre run
2023-07-06 20:05:47,821 INFO [c1ee5d7108e946c79df34a3aebf13ef7] awx.analytics.job_lifecycle projectupdate-135184 preparing playbook
2023-07-06 20:05:47,864 INFO [c1ee5d7108e946c79df34a3aebf13ef7] awx.analytics.job_lifecycle projectupdate-135184 running playbook
2023-07-06 20:05:47,895 INFO [c1ee5d7108e946c79df34a3aebf13ef7] awx.analytics.job_lifecycle projectupdate-135184 work unit id received
2023-07-06 20:05:47,907 INFO [c1ee5d7108e946c79df34a3aebf13ef7] awx.analytics.job_lifecycle projectupdate-135184 work unit id assigned
2023-07-06 20:05:50,217 INFO [-] awx.analytics.job_lifecycle projectupdate-135184 stats wrapup finished
2023-07-06 20:05:50,417 INFO [c1ee5d7108e946c79df34a3aebf13ef7] awx.main.commands.run_callback_receiver Starting EOF event processing for Job 135184
2023-07-06 20:05:50,418 INFO [c1ee5d7108e946c79df34a3aebf13ef7] awx.analytics.job_lifecycle projectupdate-135184 post run
2023-07-06 20:05:50,455 INFO [c1ee5d7108e946c79df34a3aebf13ef7] awx.analytics.job_lifecycle projectupdate-135184 finalize run
2023-07-06 20:05:50,531 INFO [c1ee5d7108e946c79df34a3aebf13ef7] awx.analytics.job_lifecycle job-135183 post run
2023-07-06 20:05:55,909 INFO [c1ee5d7108e946c79df34a3aebf13ef7] awx.analytics.job_lifecycle job-135183 finalize run
2023-07-06 20:05:55,913 WARNING [c1ee5d7108e946c79df34a3aebf13ef7] awx.main.dispatch job 135183 (error) encountered an error (rc=None), please see task stdout for details.
2023-07-06 20:05:56,014 INFO [c1ee5d7108e946c79df34a3aebf13ef7] awx.analytics.job_lifecycle job-135183 notifications sent
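To make the log above easier to scan, here is a minimal Python sketch that groups the `awx.analytics.job_lifecycle` lines by job and lists the states each one passed through. It is not part of AWX; the regular expression and the `lifecycle_by_unit` helper are my own assumptions, based only on the line format shown in this post, and the sample lines are copied from the log above.

```python
import re
from collections import defaultdict

# Assumed line shape, taken from the task log in this post:
# <date> <time>,<ms> <LEVEL> [<guid or ->] awx.analytics.job_lifecycle <unit> <state>
LIFECYCLE = re.compile(
    r"^(?P<ts>\d{4}-\d{2}-\d{2} \d{2}:\d{2}:\d{2}),\d+ \w+ \[[^\]]*\] "
    r"awx\.analytics\.job_lifecycle (?P<unit>[a-z]+-\d+) (?P<state>.+)$"
)


def lifecycle_by_unit(lines):
    """Return {unit: [(timestamp, state), ...]} in log order (hypothetical helper)."""
    events = defaultdict(list)
    for line in lines:
        match = LIFECYCLE.match(line.strip())
        if match:  # lines from other loggers (e.g. awx.main.dispatch) are skipped
            events[match.group("unit")].append(
                (match.group("ts"), match.group("state"))
            )
    return dict(events)


# Sample lines copied from the task log above.
sample = [
    "2023-07-06 20:03:53,266 INFO [f372fdfaf38b4e1eaa01d064781c67a6] awx.analytics.job_lifecycle job-135181 waiting",
    "2023-07-06 20:05:27,382 ERROR [4dd2dae62f04481fb4fb50e22dba78b9] awx.main.dispatch job 135181 (failed) is no longer waiting; reaping",
    "2023-07-06 20:05:27,517 INFO [4dd2dae62f04481fb4fb50e22dba78b9] awx.analytics.job_lifecycle job-135181 notifications sent",
    "2023-07-06 20:05:47,475 INFO [c1ee5d7108e946c79df34a3aebf13ef7] awx.analytics.job_lifecycle job-135183 waiting",
    "2023-07-06 20:05:47,688 INFO [c1ee5d7108e946c79df34a3aebf13ef7] awx.analytics.job_lifecycle job-135183 pre run",
    "2023-07-06 20:05:50,217 INFO [-] awx.analytics.job_lifecycle projectupdate-135184 stats wrapup finished",
]

if __name__ == "__main__":
    for unit, events in lifecycle_by_unit(sample).items():
        # One line per job: the sequence of lifecycle states it logged.
        print(unit, " -> ".join(state for _, state in events))
```

Grouped this way, job-135181 goes from "waiting" straight to "notifications sent" with no "pre run" in between, while job-135183 does reach "pre run" before it errors out.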
I have this in the OpenShift logs:
Warning FailedAttachVolume pod/it-mass-awx-postgres-13-0 Multi-Attach error for volume "pvc-a87c67f2-025e-4ce1-90fe-e43320d153dc" Volume is already exclusively attached to one node and can't be attached to another
Any hints on where I can dig?
Thank you,
Kevin
(I posted about 30 minutes ago and saw my post in the list, but I don't see it now after multiple refreshes, so I have reposted.)