Multiple instances for AWX installation (HA)


Oğuz Yarımtepe

Jan 3, 2018, 3:39:03 AM
to AWX Project
How can I install a highly available version of AWX? I want to test a failover scenario in which the current installation server goes down but I can still reach another instance.

Any tips?

Matthew Jones

Jan 8, 2018, 11:21:36 AM
to Oğuz Yarımtepe, AWX Project
I have been working on an OpenShift/Kubernetes-based scalable system in a branch called scalable_clusters on the AWX GitHub repo.

For traditional redundancy, I'm afraid that's not a focus at the moment.

--
You received this message because you are subscribed to the Google Groups "AWX Project" group.
To unsubscribe from this group and stop receiving emails from it, send an email to awx-project+unsubscribe@googlegroups.com.
To post to this group, send email to awx-p...@googlegroups.com.
To view this discussion on the web visit https://groups.google.com/d/msgid/awx-project/65554be7-f42a-4f56-ad58-76cfc6e369f7%40googlegroups.com.
For more options, visit https://groups.google.com/d/optout.



--
Matt Jones
Principal Software Engineer
Ansible Tower

Bruno Casano

Jan 26, 2018, 12:45:19 PM
to AWX Project
Is it possible to install two versions and point them to the same DB instance?

Bill Nottingham

Jan 26, 2018, 12:49:29 PM
to Bruno Casano, AWX Project
Bruno Casano (bru...@gmail.com) said:
> Is it possible to install two versions and point them to the same DB
> instance?

Same database *server*? Yes. Same AWX database *on* the server? Absolutely not.

Bill

Bruno Casano

Jan 26, 2018, 3:09:48 PM
to Bill Nottingham, AWX Project
Thanks Bill! 

Oğuz Yarımtepe

Jan 29, 2018, 1:29:17 AM
to Matthew Jones, AWX Project
Is it stable, and can it be tested now?

--
Oğuz Yarımtepe
http://about.me/oguzy

Philipp Wiesner

Jan 30, 2018, 4:54:58 AM
to AWX Project
We are running AWX in a clustered HA environment, but this required some manual adjustments to the installation roles. You also need to create a RabbitMQ cluster. For this we disabled the RabbitMQ containers in the installation and set RabbitMQ up beforehand on all the nodes. Once the RabbitMQ cluster was running, we changed the RabbitMQ connection details in the roles.

The following files were changed:

image_build/files/launch_awx_task.sh
awx-manage provision_instance --hostname=$CLUSTER_NODE
awx-manage register_queue --queuename=tower --hostnames=$CLUSTER_NODE



image_build/files/settings.py
CLUSTER_HOST_ID = os.getenv("CLUSTER_NODE", "awx")


image_build/files/supervisor_task.conf
command = /var/lib/awx/venv/awx/bin/celery worker -A awx -l ERROR --autoscale=50,4 -Ofair -Q tower_scheduler,tower_broadcast_all,tower,%(ENV_CLUSTER_NODE)s -n celery@%(ENV_CLUSTER_NODE)s

local_docker/tasks/main.yml
comment out every rabbitmq container reference
- name: Activate AWX Web Container
    ...
    env:
      CLUSTER_NODE: "{{ cluster_node | default('localhost') }}"
      ...
      RABBITMQ_USER: "awx"
      RABBITMQ_PASSWORD: "<password>"
      RABBITMQ_HOST: "{{ cluster_node | default('localhost')}}"
      RABBITMQ_PORT: "5672"
      RABBITMQ_VHOST: "awx"

The same was changed for the AWX Task container.

HAProxy runs in front of the nodes with round-robin load balancing.
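
To illustrate the settings.py change in the post above: the node identity is read from the CLUSTER_NODE environment variable and falls back to "awx" when that variable does not reach the container. The following is only a minimal sketch of that lookup; the helper name cluster_host_id is hypothetical and not part of AWX.

```python
import os


def cluster_host_id(environ=None):
    """Return the node name that would be registered for this container."""
    if environ is None:
        environ = os.environ
    # Same lookup as the modified settings.py line:
    # use CLUSTER_NODE if present, otherwise fall back to "awx".
    return environ.get("CLUSTER_NODE", "awx")


if __name__ == "__main__":
    # Hypothetical host name, for illustration only.
    print(cluster_host_id({"CLUSTER_NODE": "node1.example.com"}))
    print(cluster_host_id({}))
```

If every node ends up registered as "awx", that suggests the variable was not set in the container environment.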

Dong Yang

Feb 7, 2018, 8:17:02 PM
to AWX Project
Hi Philipp, I'm currently looking into your solution of separating the RabbitMQ container out of the installer. How do you handle the Postgres DB? Do you have it installed separately as well, or within a container?

dnc92301

Feb 9, 2018, 1:24:46 PM
to AWX Project
Philipp, can you comment on where you specify the CLUSTER_NODE name? I'm getting errors connecting to localhost: amqp://awx:**@127.0.0.1:5672/awx

Jay Kumar

Feb 13, 2018, 3:54:55 AM
to AWX Project
My HA solution for AWX may be extreme, but here is what I am doing.

1. HAProxy load balancing RabbitMQ, Memcached and PostgreSQL
2. RabbitMQ cluster on 3 nodes
3. Memcached on 3 nodes, configured active/passive in HAProxy
4. PostgreSQL HA (master/slave) using Patroni, active/passive with automatic failover configured in HAProxy
5. AWX task/web Docker instances

I am using only a single instance each of the awx-task and awx-web containers, and everything seems to be working fine.

I would like to test multiple load-balanced awx-task/awx-web containers.





Philipp Wiesner

Feb 13, 2018, 6:25:54 AM
to AWX Project
dnc92301:

Postgres is running on a separate node; we set the Postgres DB connection details in the installation inventory for the application nodes. On those nodes only awx-task, awx-web and memcached run in containers.
We added the CLUSTER_NODE variable to the installation inventory ourselves; it is set to the hostname of the machine on which the playbook is executed. We copy the modified AWX installation directory to a new node, set the correct CLUSTER_NODE value and run the installation playbook, which then sets up a new cluster member.

dnc92301

Feb 25, 2018, 12:26:45 AM
to AWX Project

Hi Philipp,

I still couldn't get it to work. It would be great if we could work on resolving this separately. I have RabbitMQ installed separately, as well as Postgres. I'm getting a bunch of messages like these:

INFO success: awx-celeryd-beat entered RUNNING state, process has stayed up for > than 1 seconds
INFO exited: channels-worker (exit status 1; not expected)

I think the problem has to do with celeryd:

ps -ef | grep celery shows celery@localhost, where it should be mapped to celery@hostname (as on a working Tower installation).

Thanks again

dnc92301

Feb 25, 2018, 6:29:19 PM
to AWX Project
The earlier issue was due to a misconfiguration on the external database server, which has been fixed.

I've got 2 AWX instances running but am getting connection refused: Consumer: cannot connect to amqp://awx:**@127.0.0.1:5672/awx: [Errno 111] Connection refused.

Philipp Wiesner

Mar 6, 2018, 3:52:01 AM
to AWX Project
Hi dnc92301,

sorry for my late response; somehow the notifications are not working properly. The issue is probably that the RABBITMQ_HOST environment variable inside the container is not set properly. At the moment it tries to connect within the container's local network. As RabbitMQ has been moved out of the container context, you need to set the RABBITMQ_HOST environment variable to the host where RabbitMQ is running. We have set it to the FQDN of the RabbitMQ host.

In the local_docker role (local_docker/tasks/main.yml) you can set those environment variables like this:


    env:
      CLUSTER_NODE: "{{ cluster_node | default('localhost') }}"
      ...
      RABBITMQ_USER: "awx"
      RABBITMQ_PASSWORD: "<password>"
      RABBITMQ_HOST: "{{ cluster_node | default('localhost')}}"
      RABBITMQ_PORT: "5672"
      RABBITMQ_VHOST: "awx"

You have to set them for both the awx_web and awx_task containers. If you have any further issues, let me know.
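
To see why a missing or defaulted RABBITMQ_HOST leads to a localhost broker address, here is a small sketch of how an AMQP URL could be assembled from those environment variables. The function name broker_url and the fallback values are assumptions for illustration; the actual AWX settings code may differ.

```python
from urllib.parse import quote


def broker_url(environ):
    """Build an AMQP URL from RABBITMQ_* variables (fallbacks are assumptions)."""
    user = environ.get("RABBITMQ_USER", "guest")
    password = environ.get("RABBITMQ_PASSWORD", "guest")
    host = environ.get("RABBITMQ_HOST", "localhost")
    port = environ.get("RABBITMQ_PORT", "5672")
    vhost = environ.get("RABBITMQ_VHOST", "awx")
    # Percent-encode credentials and vhost so special characters stay valid.
    return "amqp://{}:{}@{}:{}/{}".format(
        quote(user, safe=""),
        quote(password, safe=""),
        host,
        port,
        quote(vhost, safe=""),
    )


if __name__ == "__main__":
    # Hypothetical values; the host should be the FQDN of the RabbitMQ node.
    env = {
        "RABBITMQ_USER": "awx",
        "RABBITMQ_PASSWORD": "s3cret",
        "RABBITMQ_HOST": "node1.example.com",
    }
    print(broker_url(env))
```

With RABBITMQ_HOST unset, this sketch resolves to localhost, which inside a container is the container itself and not the external RabbitMQ node.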

dnc92301

Mar 6, 2018, 9:19:29 PM
to AWX Project

Thanks Philipp,

I see 2 references to cluster_node within env. Do both need to be set via "cluster_node = hostname.fqdn", and does this need to be set in the inventory file?

Philipp Wiesner

Mar 7, 2018, 4:47:22 AM
to AWX Project
Yes, we have set this in the inventory file.

dnc92301

Mar 7, 2018, 9:27:48 PM
to AWX Project
Hi Philipp, can you provide the specific tag where you had HA working? Are you using the latest 1.0.4.*, or a previous release?

Thanks again.

dnc92301

Mar 7, 2018, 10:50:59 PM
to AWX Project
Here's the error message I've been getting:

[2018-03-08 03:17:34,613: ERROR/MainProcess] Unrecoverable error: AccessRefused(403, u"ACCESS_REFUSED - access to exchange 'celeryev' in vhost 'awx' refused for user 'awx'", (40, 10), 'Exchange.declare')

Philipp Wiesner

Mar 8, 2018, 4:14:00 AM
to AWX Project
Hi dnc92301,

we are currently using release 1.0.2, but I think it should also work with later releases on a fresh installation. The error you got looks like a problem connecting to RabbitMQ. Have you set up the user awx in your RabbitMQ cluster?

We set up the RabbitMQ cluster with the following commands:

[root@host rabbitmq]# rabbitmqctl delete_user guest
[root@host rabbitmq]# rabbitmqctl add_vhost awx
[root@host rabbitmq]# rabbitmqctl add_user awx <password>
[root@host rabbitmq]# rabbitmqctl set_permissions -p awx awx ".*" ".*" ".*"
[root@host rabbitmq]# rabbitmqctl set_policy -p awx ha-all ".*" '{"ha-mode":"all","ha-sync-mode":"automatic"}'
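
If the same setup has to be repeated on several clusters, the commands above could be generated from a script. This is only a sketch: the function name rabbitmq_setup_commands is hypothetical, and it merely builds the argument lists shown in the post without running anything.

```python
import json
import shlex


def rabbitmq_setup_commands(vhost, user, password):
    """Return the rabbitmqctl invocations from the post as argument lists."""
    # Mirror all queues across all nodes, as in the ha-all policy above.
    policy = json.dumps({"ha-mode": "all", "ha-sync-mode": "automatic"})
    return [
        ["rabbitmqctl", "delete_user", "guest"],
        ["rabbitmqctl", "add_vhost", vhost],
        ["rabbitmqctl", "add_user", user, password],
        ["rabbitmqctl", "set_permissions", "-p", vhost, user, ".*", ".*", ".*"],
        ["rabbitmqctl", "set_policy", "-p", vhost, "ha-all", ".*", policy],
    ]


if __name__ == "__main__":
    for cmd in rabbitmq_setup_commands("awx", "awx", "<password>"):
        # Print a shell-quoted form; to execute, pass the list to
        # subprocess.run(cmd, check=True) on a RabbitMQ node.
        print(shlex.join(cmd))
```

Keeping the commands as argument lists avoids shell quoting problems with the ".*" patterns and the JSON policy.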

dnc92301

Mar 8, 2018, 1:15:21 PM
to AWX Project
Hi Philipp,

Yes, it looks like it's working somewhat now after setting the proper permissions for the awx user. However, how do you ensure the cluster is configured correctly? Under the instance group, do you see both nodes within the cluster? Right now I only see 1 node. I've tried kicking off a job and then rebooting the server in the middle of the run; does the job fail over automatically to the other node within the cluster?

Thanks

Philipp Wiesner

Mar 8, 2018, 1:36:05 PM
to AWX Project
Hi Dnc92301,

you should see both of your nodes under the AWX instance group.


The image shows that RabbitMQ runs on each node and provides the cluster functionality. With AWX, each node then runs Docker containers with the awx_web, awx_task and memcache images. On each node, settings.py is configured so that rabbitmq_host points to that node's own machine. That is, with two nodes, node1 and node2, the settings.py rabbitmq_host on node1 points to node1 and on node2 it points to node2. The same goes for the celery worker. That is why we added the cluster_node inventory variable. We keep the installation directory on each separate node, set cluster_node to the machine name and run the installation playbook on each node.

The cluster works like this: when the capacity on one node is full, the next triggered job is started on the next node in the cluster with free capacity. Failover of a running job, as in your example, will not work, because the playbook run is triggered by one worker process on one node. When you restart that node, the worker process is lost.

The page I provided gives you a good overview of what the cluster functionality of Ansible Tower/AWX provides and how it works. For our setup I basically re-engineered that design in AWX.
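
The capacity-based placement described in this post can be pictured with a toy model. This is a deliberately simplified sketch, not AWX's actual scheduler; the function name pick_node and the dictionary keys "name", "capacity" and "used" are hypothetical.

```python
def pick_node(nodes, job_cost):
    """Return the name of the first node with enough free capacity, or None.

    `nodes` is an ordered list of dicts with the hypothetical keys
    "name", "capacity" and "used".
    """
    for node in nodes:
        free = node["capacity"] - node["used"]
        if free >= job_cost:
            return node["name"]
    # Every node is full: the job has to wait.
    return None


if __name__ == "__main__":
    cluster = [
        {"name": "node1", "capacity": 10, "used": 10},  # full
        {"name": "node2", "capacity": 10, "used": 3},
    ]
    print(pick_node(cluster, job_cost=1))
```

The model only decides where a new job starts. A job that is already running stays tied to the worker process on its node, which is why rebooting that node loses the job instead of moving it.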

Matthew Jones

Mar 8, 2018, 2:26:28 PM
to Philipp Wiesner, AWX Project
It's probably worth pointing out at this point that clustered AWX is now supported on OpenShift and Kubernetes without needing to hand-roll your own solution.


dnc92301

Mar 8, 2018, 3:01:53 PM
to AWX Project
I followed your recommendation and updated all the config files, settings.py and the others.

I'm looking at the Tower HA configuration. /api/v2/ping/ shows both nodes within the instance group. In AWX it lists awx within instance_group, and ps -ef | grep celery shows celery@awx (on both nodes). I think a working setup should show celery@hostname.

Thanks for all your help.

Philipp Wiesner

Mar 8, 2018, 3:20:40 PM
to AWX Project
I think you are on the right path if the API already shows both nodes within the instance group.
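
One way to check what the API reports is to summarize the ping response per instance group. The sketch below assumes the response contains an "instance_groups" list whose entries have "name" and "instances" fields; those field names, the helper names and the example host are assumptions to verify against your own /api/v2/ping/ output.

```python
import json
import urllib.request


def fetch_ping(base_url):
    """GET <base_url>/api/v2/ping/ and return the decoded JSON body."""
    url = base_url.rstrip("/") + "/api/v2/ping/"
    with urllib.request.urlopen(url) as resp:
        return json.load(resp)


def instances_by_group(ping):
    """Map each instance-group name to the sorted list of its node names."""
    groups = {}
    for group in ping.get("instance_groups", []):
        groups[group["name"]] = sorted(group.get("instances", []))
    return groups


if __name__ == "__main__":
    # Sample payload in the assumed shape; replace it with
    # fetch_ping("http://awx.example.com") against a real installation.
    sample = {
        "ha": True,
        "instance_groups": [
            {"name": "tower", "capacity": 100, "instances": ["node2", "node1"]}
        ],
    }
    print(instances_by_group(sample))
```

A group that lists only one node, or a node named "awx" or "localhost", points to the CLUSTER_NODE value not reaching that container.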

There is this file in the installation's image_build role:
image_build/files/supervisor_task.conf


If you adjust the celery worker command to:
command = /var/lib/awx/venv/awx/bin/celery worker -A awx -l DEBUG --autoscale=50,4 -Ofair -Q tower_scheduler,tower_broadcast_all,tower,%(ENV_CLUSTER_NODE)s -n celery@%(ENV_CLUSTER_NODE)s

then ps -ef | grep celery should also show the correct output. Note the ENV_ prefix in front of the environment variable name; supervisord needs it to read environment variables.
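
The %(ENV_CLUSTER_NODE)s placeholder uses Python-style string expansion, with each environment variable NAME exposed as ENV_NAME. The sketch below imitates that behaviour in plain Python to show what the celery arguments expand to; the function name expand_supervisor_value is hypothetical and this is not supervisord's own code.

```python
def expand_supervisor_value(template, environ):
    """Expand %(ENV_NAME)s placeholders using the given environment mapping."""
    # Each variable NAME becomes available to the template as ENV_NAME.
    mapping = {"ENV_" + name: value for name, value in environ.items()}
    # Raises KeyError if the template references a variable that is not set.
    return template % mapping


if __name__ == "__main__":
    args = "-Q tower,%(ENV_CLUSTER_NODE)s -n celery@%(ENV_CLUSTER_NODE)s"
    # Hypothetical node name, for illustration only.
    print(expand_supervisor_value(args, {"CLUSTER_NODE": "node1"}))
```

If the worker shows up as celery@localhost, the placeholder was expanded with the value "localhost", which would be consistent with the default('localhost') fallback in the env block being used because cluster_node was not defined for that run.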

dnc92301

Mar 8, 2018, 3:38:11 PM
to AWX Project
It seems the variable is not being passed correctly in the supervisor_task.conf file: ...tower_broadcast_all,tower,%(ENV_CLUSTER_NODE)s -n celery@%(ENV_CLUSTER_NODE)s

celeryd starts up as ,tower_broadcast_all,tower,awx -n celery@localhost

In Tower, it is *tower_broadcast_all,tower,hostname -n celery@hostname

In the inventory I tried cluster_node with and without the FQDN, and I also tried both cluster_node and CLUSTER_NODE.

It's odd.

Philipp Wiesner

Mar 9, 2018, 5:30:37 AM
to AWX Project
Are you also setting the CLUSTER_NODE variable in your env?

Here is an excerpt from our local_docker/tasks/main.yml file:


    env:
      CLUSTER_NODE: "{{ cluster_node | default('localhost') }}"

This was set for the awx task and web containers, passing the cluster_node variable into the environment so that supervisor_task.conf can read it inside the container.

dnc92301

Mar 17, 2018, 8:52:08 AM
to AWX Project
Hi Philipp, can you tell me what version/tag you used for your setup? I wanted to see if I can reproduce it. I made exactly the changes you described: I separated out RabbitMQ (2 nodes) and installed the containers (awx/task/memcache) with the modified configuration on the same hosts that run RabbitMQ. PostgreSQL is on a different node. So, in all, a 3-server configuration. Thanks

Abhishek.J Gowda

Jun 7, 2018, 3:13:25 AM
to AWX Project
Has anyone successfully set up a highly available AWX installation that can scale? I need some suggestions, thanks.

dnc92301

Jun 14, 2018, 10:23:26 PM
to AWX Project
Docker Swarm might be the way forward.

Abhishek.J Gowda

Jun 21, 2018, 11:52:23 PM
to AWX Project
Hi Matt Jones,

can you please point me to the documentation for clustered AWX on OpenShift?

Thanks



Matthew Jones

Jun 27, 2018, 2:23:19 PM
to abhish...@gmail.com, AWX Project

Vandana Thakur

Sep 3, 2018, 2:46:43 AM
to AWX Project
Hello, did you manage to find a solution for this issue?

Thanks.

Vandana Thakur

Sep 4, 2018, 11:22:28 PM
to AWX Project
Hello,

Will this help?

Allow Containers to be installed to a stack in Docker Swarm #1287


BR//
Vandana

CV

Feb 16, 2019, 3:48:40 PM
to AWX Project
Hi Phil,

I am having trouble finding the parts of the files you edited to accomplish a multiple-instance (HA) installation. The current version is 3.0.1.

Which command in supervisor_task.conf did you replace with "command = /var/lib/awx/venv/awx/bin/celery worker -A awx -l ERROR --autoscale=50,4 -Ofair -Q tower_scheduler,tower_broadcast_all,tower,%(ENV_CLUSTER_NODE)s -n celery@%(ENV_CLUSTER_NODE)s"?

In addition, I do not see any references to rabbitmq in main.yml or set_image.yml (the only task file imported in main.yml) in local_docker/tasks/.


Sujith A R

Mar 7, 2019, 7:58:51 PM
to CV, AWX Project
This particular celery-related change isn't needed for the latest AWX version, 3.0.1, as it automatically picks up and schedules jobs on the instance group we have defined.




Sujith A R

Apr 6, 2019, 12:25:45 AM
to AWX Project
An update on enabling HA with the standalone Docker method: https://github.com/ansible/awx/issues/3627