Problem when starting up an agency cluster


Vincent Lecrubier

Jan 10, 2017, 12:12:51 PM
to ArangoDB
Hi, 

I have a script that deploys an ArangoDB cluster on a Docker swarm. It used to work perfectly with ArangoDB 3.0.x, but it has been failing since 3.1 with the following error:

2017-01-10T16:51:26Z [1] INFO using endpoint 'http+tcp://0.0.0.0:5001' for non-encrypted requests
2017-01-10T16:51:26Z [1] INFO ArangoDB (version 3.1.7 [linux]) is ready for business. Have fun!
2017-01-10T16:51:27Z [1] INFO {agency} Entering gossip phase ...
2017-01-10T16:51:29Z [1] INFO {agency} Adding 0be3be42-10d6-4d88-a032-2c9f52c8e241(tcp://localhost:5003) to agent pool
2017-01-10T16:51:30Z [1] INFO {agency} Adding 2e90441b-8bc3-4df4-876c-2dececa0e052(tcp://localhost:5002) to agent pool
2017-01-10T16:51:31Z [1] ERROR {cluster} cannot create connection to server '' at endpoint 'tcp://localhost:5003'
2017-01-10T16:51:31Z [1] ERROR {cluster} cannot create connection to server '' at endpoint 'tcp://localhost:5002'
2017-01-10T16:51:35Z [1] ERROR {cluster} cannot create connection to server '' at endpoint 'tcp://localhost:5003'
2017-01-10T16:51:35Z [1] ERROR {cluster} cannot create connection to server '' at endpoint 'tcp://localhost:5002'

Here is the script that is broken since 3.1:

# Agent 1
docker service create --name arango-agent-1 --restart-max-attempts ${ARANGODB_RETRIES} --env ARANGO_ROOT_PASSWORD=${ARANGODB_PASSWORD} --network arango-network --constraint 'node.hostname==node1.local.lan' --replicas 1 arangodb:3.1.7 arangod --server.endpoint tcp://0.0.0.0:5001 --server.authentication ${ARANGODB_AUTH} --server.jwt-secret ${ARANGO_JWT_SECRET} --agency.size 3 --agency.supervision true --agency.activate true
# Agent 2
docker service create --name arango-agent-2 --restart-max-attempts ${ARANGODB_RETRIES} --env ARANGO_ROOT_PASSWORD=${ARANGODB_PASSWORD} --network arango-network --constraint 'node.hostname==node1.local.lan' --replicas 1 arangodb:3.1.7 arangod --server.endpoint tcp://0.0.0.0:5002 --server.authentication ${ARANGODB_AUTH} --server.jwt-secret ${ARANGO_JWT_SECRET} --agency.size 3 --agency.supervision true --agency.activate true
# Agent 3
docker service create --name arango-agent-3 --restart-max-attempts ${ARANGODB_RETRIES} --env ARANGO_ROOT_PASSWORD=${ARANGODB_PASSWORD} --network arango-network --constraint 'node.hostname==node1.local.lan' --replicas 1 arangodb:3.1.7 arangod --server.endpoint tcp://0.0.0.0:5003 --server.authentication ${ARANGODB_AUTH} --server.jwt-secret ${ARANGO_JWT_SECRET} --agency.size 3 --agency.supervision true --agency.activate true --agency.endpoint tcp://arango-agent-1:5001 --agency.endpoint tcp://arango-agent-2:5002 --agency.endpoint tcp://arango-agent-3:5003

Any idea what causes this? I think it might be related to the Docker swarm network layers, but it used to work on 3.0...

Also, is there a way to use ZooKeeper instead of an ArangoDB agency? This would be helpful, since my ZooKeeper cluster already works just fine.

Thank you very much

Kaveh Vahedipour

Jan 11, 2017, 3:37:26 AM
to ArangoDB
Let me start off by saying that you cannot use ZooKeeper as an agency, simply because of syntactic differences between the ZooKeeper and ArangoDB agency APIs.

Now to your problem. I tried your startup, and the only thing missing is "--agency.my-address" for every agency instance. As the log messages from agent 1 indicate, after the handshake the connections are not attempted to "arango-agent-2:5002" and "arango-agent-3:5003" but to "localhost:5002" and "localhost:5003" respectively. So every agent should carry an additional parameter "--agency.my-address tcp://arango-agent-<X>:500<X>", where <X> is that agent's own number. Because of differences and ambiguities between frameworks, we had to introduce this additional command line argument.
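To spot this symptom in an agent's log, a small filter like the one below may help. It is only a sketch: the function name `check_advertised_endpoints` is made up for illustration, and it assumes the log lines look like the "Adding ... to agent pool" lines quoted earlier in this thread.

```shell
# Read agent log text on stdin, pull out every endpoint announced in an
# "Adding <uuid>(<endpoint>) to agent pool" line, and flag endpoints that
# point at localhost (unreachable from the other containers).
check_advertised_endpoints() {
  sed -n 's|.*{agency} Adding [^(]*(\(tcp://[^)]*\)) to agent pool.*|\1|p' |
  while read -r endpoint; do
    case "$endpoint" in
      tcp://localhost:*|tcp://127.0.0.1:*) echo "BAD $endpoint" ;;
      *) echo "OK $endpoint" ;;
    esac
  done
}
```

It could be fed from something like `docker logs <container> 2>&1 | check_advertised_endpoints`, where `<container>` is whichever container runs the agent.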

Your agency should start up just fine as follows:

# Agent 1
docker service create --name arango-agent-1 --restart-max-attempts ${ARANGODB_RETRIES} --env ARANGO_ROOT_PASSWORD=${ARANGODB_PASSWORD} --network arango-network --constraint 'node.hostname==node1.local.lan' --replicas 1 arangodb:3.1.7 arangod --server.endpoint tcp://0.0.0.0:5001 --server.authentication ${ARANGODB_AUTH} --server.jwt-secret ${ARANGO_JWT_SECRET} --agency.size 3 --agency.supervision true --agency.activate true --agency.my-address tcp://arango-agent-1:5001
# Agent 2
docker service create --name arango-agent-2 --restart-max-attempts ${ARANGODB_RETRIES} --env ARANGO_ROOT_PASSWORD=${ARANGODB_PASSWORD} --network arango-network --constraint 'node.hostname==node1.local.lan' --replicas 1 arangodb:3.1.7 arangod --server.endpoint tcp://0.0.0.0:5002 --server.authentication ${ARANGODB_AUTH} --server.jwt-secret ${ARANGO_JWT_SECRET} --agency.size 3 --agency.supervision true --agency.activate true --agency.my-address tcp://arango-agent-2:5002
# Agent 3
docker service create --name arango-agent-3 --restart-max-attempts ${ARANGODB_RETRIES} --env ARANGO_ROOT_PASSWORD=${ARANGODB_PASSWORD} --network arango-network --constraint 'node.hostname==node1.local.lan' --replicas 1 arangodb:3.1.7 arangod --server.endpoint tcp://0.0.0.0:5003 --server.authentication ${ARANGODB_AUTH} --server.jwt-secret ${ARANGO_JWT_SECRET} --agency.size 3 --agency.supervision true --agency.activate true --agency.my-address tcp://arango-agent-3:5003 --agency.endpoint tcp://arango-agent-1:5001 --agency.endpoint tcp://arango-agent-2:5002
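Since the three commands differ only in the agent number, the agent-specific flags could be generated instead of typed by hand. The helper below is a sketch, not part of the original script: `agent_flags` is a hypothetical name, and it mirrors the layout above, where only the last agent receives `--agency.endpoint` entries for the others.

```shell
# Print the agent-specific arangod flags for agent number $1 in a pool of
# $2 agents. Naming scheme (arango-agent-N, port 500N) follows the thread.
agent_flags() {
  n=$1
  size=$2
  flags="--server.endpoint tcp://0.0.0.0:500${n} --agency.my-address tcp://arango-agent-${n}:500${n}"
  # As in the commands above, only the last agent lists its peers.
  if [ "$n" -eq "$size" ]; then
    i=1
    while [ "$i" -lt "$size" ]; do
      flags="$flags --agency.endpoint tcp://arango-agent-${i}:500${i}"
      i=$((i + 1))
    done
  fi
  echo "$flags"
}
```

The output of `agent_flags 3 3` would then be appended to the `arangod` part of the third `docker service create` command, next to the flags that are identical for all agents.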


m...@arangodb.com

Jan 11, 2017, 4:55:15 AM
to ArangoDB
Let me add to my colleague Kaveh's answer: Recently we have created a simplified method to start up an ArangoDB cluster, which might be useful for Docker swarm as well. See here for details

Vincent Lecrubier

Jan 16, 2017, 12:47:14 PM
to ArangoDB
OK, perfect, thank you very much, this worked.

Once my agency is started the right way, everything works very well.

However, I now have a new problem related to failure modes: when I kill an agent, wait, and start another agent to replace it, I get the following error:

2017-01-16T17:45:31Z [1] INFO ArangoDB (version 3.1.8 [linux]) is ready for business. Have fun!
2017-01-16T17:45:31Z [1] INFO {agency} Entering gossip phase ...
2017-01-16T17:45:31Z [1] INFO {agency} Adding 1d148ff1-e923-4f0e-9921-6818c2387822(tcp://arango-agent-2:5002) to agent pool
2017-01-16T17:45:31Z [1] INFO {agency} Adding 72d76ea1-a67e-4cbe-add4-240bbba0ba93(tcp://arango-agent-3:5003) to agent pool
2017-01-16T17:45:31Z [1] INFO {agency} Adding c63e1c35-1349-480d-bb1f-7dfce40aa2af(tcp://arango-agent-1:5001) to agent pool
2017-01-16T17:45:31Z [1] FATAL {agency} Too many peers in pool: 4>3


It looks like the agency does not detect that an agent was killed, so it refuses the new agent. If anyone has an idea of what I can do about that, it would be helpful! Thank you

Kaveh Vahedipour

Jan 16, 2017, 12:54:28 PM
to aran...@googlegroups.com
OK. This is one of the absolute no-gos of RAFT. If you’d like to, say, replace an agent’s hardware or move it elsewhere, you must transfer the persisted data. With the persisted data, the new agent will identify itself with the proper UUID and be welcomed into the agency.

If you need more details here, I’d refer you to the RAFT paper, which you will find here: https://raft.github.io/
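
In a Docker swarm setup, one way to keep the persisted data with the agent is to give each agent service its own named volume, so that a replacement container started on the same node finds the old data and UUID. The sketch below only builds the `--mount` flag; `agent_mount_flag` and the volume naming are made up for illustration, and the target path `/var/lib/arangodb3` is an assumption about where the image stores its data, not something confirmed in this thread.

```shell
# Print a --mount flag that attaches a per-agent named volume.
# ASSUMPTION: the agent persists its data under /var/lib/arangodb3.
agent_mount_flag() {
  echo "--mount type=volume,source=arango-agent-${1}-data,target=/var/lib/arangodb3"
}
```

The result of `agent_mount_flag 1` would go among the `docker service create` options for agent 1, before the image name. Note that a named volume with the default driver lives on a single node, so this presumably only helps as long as the service stays pinned to that node, as the `--constraint` in the commands above already does.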

Vincent Lecrubier

Jan 16, 2017, 1:35:30 PM
to ArangoDB
OK, thank you very much, I thought it might be something like that... This could use a little more documentation for people who are not RAFT experts (I understand that it might be part of the paid support scheme, though).

Then, to make my setup work, I need to confirm the path where the agents persist their data on disk. I expect my Docker-based agents to store the data in their "/var/lib/arangodb3" directory, like database nodes. Let me know if I am wrong.

Thanks a lot again.

Kaveh Vahedipour

Jan 16, 2017, 1:46:08 PM
to aran...@googlegroups.com
You’re welcome.

You should back up your agents the same way as you would back up any other persisted data store. We are going to work on a fix that allows replacing an agent, possibly without any backup, but that would need to be proven to work.
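
For agents whose data sits in a named Docker volume, such a backup could be taken by archiving the volume from a throwaway container. The helper below is a sketch that only prints the command (a dry run) rather than executing it; `backup_volume_cmd`, the volume name, and the backup directory are hypothetical, and it is probably safest to run the resulting command while the agent is stopped.

```shell
# Print a docker command that archives the named volume $1 into
# $2/<volume>.tar.gz, mounting the volume read-only.
backup_volume_cmd() {
  volume=$1
  backup_dir=$2
  echo "docker run --rm -v ${volume}:/data:ro -v ${backup_dir}:/backup alpine tar czf /backup/${volume}.tar.gz -C /data ."
}
```

For example, `backup_volume_cmd arango-agent-1-data /srv/backup` prints the command for agent 1's volume; restoring would be the reverse, unpacking the archive into a fresh volume before the replacement agent starts.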