ArangoDB 3 clustering not working

940 views
Skip to first unread message

Bart DS

unread,
Jun 26, 2016, 12:36:46 PM6/26/16
to ArangoDB
First of all congratulations on the new 3.0 release!
Much improved clustering with auto-failover, Velocypack, persistent indexes, etc are all great new features.

Unfortunately I'm having several issues getting the new clustering working.
As discussed in an earlier post, I'm interested in setting up an ArangoDB cluster with docker images, not relying on Mesos at all (https://groups.google.com/forum/#!topic/arangodb/WHedmT2mkoE)
Max promised to publish a blog post on this shortly after the release, so I hope it is coming soon ;-)

The documentation already has a (small) chapter on setting up an ArangoDB 3 cluster, so I decided to try it out. (https://docs.arangodb.com/3.0/Manual/Deployment/Local.html)
However, I'm having several issues getting it working.

When I start multiple agents, they are randomly crashing (they exit without any errors)
After restarting the crashed agent(s) until they all stay up (don't crash anymore), I'm starting the first DBServer

However, it hangs on the following line:
arangod --server.authentication=false --server.endpoint tcp://0.0.0.0:8529 --cluster.my-address tcp://127.0.0.1:8529 --cluster.my-local-info db1 --cluster.my-role PRIMARY --cluster.agency-endpoint tcp://127.0.0.1:5001 primary1
2016-06-26T15:37:06Z [31640] INFO using SSL options: SSL_OP_CIPHER_SERVER_PREFERENCE, SSL_OP_TLS_ROLLBACK_BUG

Any ideas what could be wrong here?
Why are the agents crashing and why is the primary server hanging during startup?

I tried on several Linux based systems, running Linux Mint 17.3 x64 with 8 GB RAM

Bart

Frank Celler

unread,
Jun 27, 2016, 2:29:23 AM6/27/16
to ArangoDB
We have to debug, why the agents are crashing. I've alerted my colleague who is responsible for the agency. Do have have any core dumps by chance?

Bart DS

unread,
Jun 27, 2016, 3:05:49 AM6/27/16
to ArangoDB


On Monday, June 27, 2016 at 8:29:23 AM UTC+2, Frank Celler wrote:
We have to debug, why the agents are crashing. I've alerted my colleague who is responsible for the agency. Do have have any core dumps by chance?

Maybe crashing isn't the correct wording here, they simply exit without any error messages.

Here is a sample output of an agent that started correctly, but then exits when another agent is started:

bart@laptop ~ $ arangod --server.endpoint tcp://127.0.0.1:5001 --server.authentication false --agency.id 0 --agency.size 3 agency1
2016-06-26T15:33:38Z [30589] INFO using SSL options: SSL_OP_CIPHER_SERVER_PREFERENCE, SSL_OP_TLS_ROLLBACK_BUG
2016-06-26T15:33:38Z [30589] INFO file-descriptors (nofiles) hard limit is 4096, soft limit is 1024
2016-06-26T15:33:38Z [30589] INFO ArangoDB 3.0.0 [linux] 64bit, using VPack 0.1.30, ICU 54.1, V8 5.0.71.39, OpenSSL 1.0.2g-fips  1 Mar 2016
2016-06-26T15:33:38Z [30589] INFO loaded database '_system' from 'agency1/databases/database-1'
2016-06-26T15:33:38Z [30589] INFO the server has 8 (hyper) cores, using 2 scheduler thread(s), 8 dispatcher thread(s)
2016-06-26T15:33:38Z [30589] INFO running WAL recovery (1 logfiles)
2016-06-26T15:33:38Z [30589] INFO replaying WAL logfile 'agency1/journals/logfile-2.db' (1 of 1)
2016-06-26T15:33:38Z [30589] INFO WAL recovery finished successfully
2016-06-26T15:33:38Z [30589] INFO JavaScript using startup '/usr/share/arangodb3/js', application '/var/lib/arangodb3-apps'
2016-06-26T15:33:43Z [30589] INFO In database '_system': Found 14 defined task(s), 1 task(s) to run
2016-06-26T15:33:43Z [30589] INFO In database '_system': state standalone/existing, tasks updateUserModels
2016-06-26T15:33:43Z [30589] INFO In database '_system': existing cleanup successfully finished
2016-06-26T15:33:44Z [30589] INFO using endpoint 'http+tcp://127.0.0.1:5001' for non-encrypted requests
2016-06-26T15:33:44Z [30589] INFO Authentication is turned off
2016-06-26T15:33:44Z [30589] INFO ArangoDB (version 3.0.0 [linux]) is ready for business. Have fun!
2016-06-26T15:33:45Z [30589] WARNING {agency} I have a higher term than RPC caller.
2016-06-26T15:33:49Z [30589] WARNING {agency} I have a higher term than RPC caller.
bart@laptop
~ $

As you can see, it simply exits after printing 2 warnings regarding "higher term than RPC caller"

Bart

Kaveh Vahedipour

unread,
Jun 27, 2016, 3:07:40 AM6/27/16
to ArangoDB
Hello Bart,

Could you please provide us with some details like OS and docker versions your are running. The startup process like rough timings and such. In short I'd be extremely grateful with any kind of details you could provide us with really. I just went through the process again myself on OSX 10.11 and Ubuntu 16.04 and cannot create any unstable, let alone crashing, startups. 

KInd regards,
Kaveh

Kaveh Vahedipour

unread,
Jun 27, 2016, 3:30:33 AM6/27/16
to ArangoDB
This is from the inside of one of your docker images, right?

Bart DS

unread,
Jun 27, 2016, 3:32:04 AM6/27/16
to ArangoDB
Hi Kaveh,


On Monday, June 27, 2016 at 9:07:40 AM UTC+2, Kaveh Vahedipour wrote:
Hello Bart,

Could you please provide us with some details like OS and docker versions your are running. The startup process like rough timings and such. In short I'd be extremely grateful with any kind of details you could provide us with really. I just went through the process again myself on OSX 10.11 and Ubuntu 16.04 and cannot create any unstable, let alone crashing, startups. 


As mentioned in my initial post, I'm testing on a Linux Mint 17.3 x64 system with 8GB RAM and 256GB SSD
In the end I want to get it running inside docker indeed, but for now I'm testing it natively because the referred to documentation is for running it natively and it should also be easier than within docker I guess (cfr port mappings and such)

The startup process is exactly as explained in the refered documentation:
Start agent1 and wait until it's ready:

arangod --server.endpoint tcp://127.0.0.1:5001 --server.authentication false --agency.id 0 --agency.size 3 agency1

Start agent2 and wait until it's ready:

arangod --server.endpoint tcp://127.0.0.1:5002 --server.authentication false --agency.id 1 --agency.size 3 agency2

Start agent3 and wait until it's ready:

arangod --server.endpoint tcp://127.0.0.1:5003 --server.authentication false --agency.id 2 --agency.size 3 --agency.endpoint tcp://127.0.0.1:5001 --agency.endpoint tcp://127.0.0.1:5002 --agency.endpoint tcp://127.0.0.1:5003 --agency.notify true agency3

At this point I get following output for agent3:

2016-06-26T15:33:19Z [30429] INFO using SSL options: SSL_OP_CIPHER_SERVER_PREFERENCE, SSL_OP_TLS_ROLLBACK_BUG
2016-06-26T15:33:19Z [30429] INFO file-descriptors (nofiles) hard limit is 4096, soft limit is 1024
2016-06-26T15:33:19Z [30429] INFO ArangoDB 3.0.0 [linux] 64bit, using VPack 0.1.30, ICU 54.1, V8 5.0.71.39, OpenSSL 1.0.2g-fips  1 Mar 2016
2016-06-26T15:33:19Z [30429] INFO loaded database '_system' from 'agency3/databases/database-1'
2016-06-26T15:33:19Z [30429] INFO the server has 8 (hyper) cores, using 2 scheduler thread(s), 8 dispatcher thread(s)
2016-06-26T15:33:19Z [30429] INFO running WAL recovery (1 logfiles)
2016-06-26T15:33:19Z [30429] INFO replaying WAL logfile 'agency3/journals/logfile-300418.db' (1 of 1)
2016-06-26T15:33:19Z [30429] INFO WAL recovery finished successfully
2016-06-26T15:33:19Z [30429] INFO JavaScript using startup '/usr/share/arangodb3/js', application '/var/lib/arangodb3-apps'
2016-06-26T15:33:24Z [30429] INFO In database '_system': Database is up-to-date (30000/standalone/existing)
2016-06-26T15:33:24Z [30429] INFO using endpoint 'http+tcp://127.0.0.1:5003' for non-encrypted requests
2016-06-26T15:33:24Z [30429] INFO cannot create connection to server '' at endpoint 'tcp://127.0.0.1:5001'
2016-06-26T15:33:24Z [30429] INFO Authentication is turned off
2016-06-26T15:33:24Z [30429] INFO cannot create connection to server '' at endpoint 'tcp://127.0.0.1:5001'
2016-06-26T15:33:24Z [30429] INFO cannot create connection to server '' at endpoint 'tcp://127.0.0.1:5001'
2016-06-26T15:33:24Z [30429] INFO cannot create connection to server '' at endpoint 'tcp://127.0.0.1:5001'
2016-06-26T15:33:24Z [30429] INFO cannot create connection to server '' at endpoint 'tcp://127.0.0.1:5001'
2016-06-26T15:33:24Z [30429] INFO cannot create connection to server '' at endpoint 'tcp://127.0.0.1:5001'
2016-06-26T15:33:24Z [30429] INFO cannot create connection to server '' at endpoint 'tcp://127.0.0.1:5001'
2016-06-26T15:33:24Z [30429] INFO cannot create connection to server '' at endpoint 'tcp://127.0.0.1:5001'
2016-06-26T15:33:24Z [30429] INFO cannot create connection to server '' at endpoint 'tcp://127.0.0.1:5001'
2016-06-26T15:33:24Z [30429] INFO cannot create connection to server '' at endpoint 'tcp://127.0.0.1:5001'
2016-06-26T15:33:24Z [30429] INFO cannot create connection to server '' at endpoint 'tcp://127.0.0.1:5001'
2016-06-26T15:33:24Z [30429] INFO cannot create connection to server '' at endpoint 'tcp://127.0.0.1:5001'
2016-06-26T15:33:24Z [30429] INFO ArangoDB (version 3.0.0 [linux]) is ready for business. Have fun!
2016-06-26T15:33:24Z [30429] INFO cannot create connection to server '' at endpoint 'tcp://127.0.0.1:5001'
2016-06-26T15:33:24Z [30429] INFO cannot create connection to server '' at endpoint 'tcp://127.0.0.1:5001'
2016-06-26T15:33:24Z [30429] INFO cannot create connection to server '' at endpoint 'tcp://127.0.0.1:5001'
2016-06-26T15:33:24Z [30429] INFO cannot create connection to server '' at endpoint 'tcp://127.0.0.1:5001'
2016-06-26T15:33:24Z [30429] INFO cannot create connection to server '' at endpoint 'tcp://127.0.0.1:5001'
2016-06-26T15:33:24Z [30429] INFO cannot create connection to server '' at endpoint 'tcp://127.0.0.1:5001'
2016-06-26T15:33:25Z [30429] INFO cannot create connection to server '' at endpoint 'tcp://127.0.0.1:5001'
2016-06-26T15:33:25Z [30429] INFO cannot create connection to server '' at endpoint 'tcp://127.0.0.1:5001'
2016-06-26T15:33:25Z [30429] INFO cannot create connection to server '' at endpoint 'tcp://127.0.0.1:5001'
2016-06-26T15:33:25Z [30429] INFO cannot create connection to server '' at endpoint 'tcp://127.0.0.1:5001'
2016-06-26T15:33:25Z [30429] INFO cannot create connection to server '' at endpoint 'tcp://127.0.0.1:5001'
2016-06-26T15:33:25Z [30429] INFO cannot create connection to server '' at endpoint 'tcp://127.0.0.1:5001'
2016-06-26T15:33:25Z [30429] INFO cannot create connection to server '' at endpoint 'tcp://127.0.0.1:5001'
2016-06-26T15:33:25Z [30429] INFO cannot create connection to server '' at endpoint 'tcp://127.0.0.1:5001'
2016-06-26T15:33:25Z [30429] INFO cannot create connection to server '' at endpoint 'tcp://127.0.0.1:5001'
2016-06-26T15:33:25Z [30429] INFO cannot create connection to server '' at endpoint 'tcp://127.0.0.1:5001'
2016-06-26T15:33:25Z [30429] INFO cannot create connection to server '' at endpoint 'tcp://127.0.0.1:5001'
2016-06-26T15:33:25Z [30429] INFO cannot create connection to server '' at endpoint 'tcp://127.0.0.1:5001'
2016-06-26T15:33:25Z [30429] INFO cannot create connection to server '' at endpoint 'tcp://127.0.0.1:5001'
2016-06-26T15:33:25Z [30429] INFO cannot create connection to server '' at endpoint 'tcp://127.0.0.1:5001'
2016-06-26T15:33:25Z [30429] INFO cannot create connection to server '' at endpoint 'tcp://127.0.0.1:5001'
2016-06-26T15:33:25Z [30429] INFO cannot create connection to server '' at endpoint 'tcp://127.0.0.1:5001'
2016-06-26T15:33:25Z [30429] INFO cannot create connection to server '' at endpoint 'tcp://127.0.0.1:5001'
2016-06-26T15:33:25Z [30429] INFO cannot create connection to server '' at endpoint 'tcp://127.0.0.1:5001'
2016-06-26T15:33:25Z [30429] INFO cannot create connection to server '' at endpoint 'tcp://127.0.0.1:5001'
2016-06-26T15:33:25Z [30429] INFO cannot create connection to server '' at endpoint 'tcp://127.0.0.1:5001'
2016-06-26T15:33:26Z [30429] INFO cannot create connection to server '' at endpoint 'tcp://127.0.0.1:5001'
2016-06-26T15:33:26Z [30429] INFO cannot create connection to server '' at endpoint 'tcp://127.0.0.1:5001'
2016-06-26T15:33:26Z [30429] INFO cannot create connection to server '' at endpoint 'tcp://127.0.0.1:5001'
2016-06-26T15:33:26Z [30429] INFO cannot create connection to server '' at endpoint 'tcp://127.0.0.1:5001'
2016-06-26T15:33:26Z [30429] INFO cannot create connection to server '' at endpoint 'tcp://127.0.0.1:5001'
2016-06-26T15:33:26Z [30429] WARNING {agency} Unable to find matching entry to previous entry (index,term) = (0,0)
2016-06-26T15:33:26Z [30429] INFO cannot create connection to server '' at endpoint 'tcp://127.0.0.1:5001'
2016-06-26T15:33:26Z [30429] INFO cannot create connection to server '' at endpoint 'tcp://127.0.0.1:5001'
2016-06-26T15:33:26Z [30429] INFO cannot create connection to server '' at endpoint 'tcp://127.0.0.1:5001'
2016-06-26T15:33:26Z [30429] INFO cannot create connection to server '' at endpoint 'tcp://127.0.0.1:5001'
2016-06-26T15:33:26Z [30429] INFO cannot create connection to server '' at endpoint 'tcp://127.0.0.1:5001'
2016-06-26T15:33:26Z [30429] WARNING {agency} I have a higher term than RPC caller.
2016-06-26T15:33:26Z [30429] WARNING {agency} I have a higher term than RPC caller.
2016-06-26T15:33:26Z [30429] INFO cannot create connection to server '' at endpoint 'tcp://127.0.0.1:5001'
2016-06-26T15:33:26Z [30429] INFO cannot create connection to server '' at endpoint 'tcp://127.0.0.1:5001'
2016-06-26T15:33:26Z [30429] INFO cannot create connection to server '' at endpoint 'tcp://127.0.0.1:5001'

And the agents start logging as well:

Agent1:

bart@laptop ~ $ arangod --server.endpoint tcp://127.0.0.1:5001 --server.authentication false --agency.id 0 --agency.size 3 agency1
2016-06-26T15:33:38Z [30589] INFO using SSL options: SSL_OP_CIPHER_SERVER_PREFERENCE, SSL_OP_TLS_ROLLBACK_BUG
2016-06-26T15:33:38Z [30589] INFO file-descriptors (nofiles) hard limit is 4096, soft limit is 1024
2016-06-26T15:33:38Z [30589] INFO ArangoDB 3.0.0 [linux] 64bit, using VPack 0.1.30, ICU 54.1, V8 5.0.71.39, OpenSSL 1.0.2g-fips  1 Mar 2016
2016-06-26T15:33:38Z [30589] INFO loaded database '_system' from 'agency1/databases/database-1'
2016-06-26T15:33:38Z [30589] INFO the server has 8 (hyper) cores, using 2 scheduler thread(s), 8 dispatcher thread(s)
2016-06-26T15:33:38Z [30589] INFO running WAL recovery (1 logfiles)
2016-06-26T15:33:38Z [30589] INFO replaying WAL logfile 'agency1/journals/logfile-2.db' (1 of 1)
2016-06-26T15:33:38Z [30589] INFO WAL recovery finished successfully
2016-06-26T15:33:38Z [30589] INFO JavaScript using startup '/usr/share/arangodb3/js', application '/var/lib/arangodb3-apps'
2016-06-26T15:33:43Z [30589] INFO In database '_system': Found 14 defined task(s), 1 task(s) to run
2016-06-26T15:33:43Z [30589] INFO In database '_system': state standalone/existing, tasks updateUserModels
2016-06-26T15:33:43Z [30589] INFO In database '_system': existing cleanup successfully finished
2016-06-26T15:33:44Z [30589] INFO using endpoint 'http+tcp://127.0.0.1:5001' for non-encrypted requests
2016-06-26T15:33:44Z [30589] INFO Authentication is turned off
2016-06-26T15:33:44Z [30589] INFO ArangoDB (version 3.0.0 [linux]) is ready for business. Have fun!
2016-06-26T15:33:45Z [30589] WARNING {agency} I have a higher term than RPC caller.
2016-06-26T15:33:49Z [30589] WARNING {agency} I have a higher term than RPC caller.
bart@laptop ~ $

Agent2:

bart@laptop ~ $ arangod --server.endpoint tcp://127.0.0.1:5002 --server.authentication false --agency.id 1 --agency.size 3 agency2
2016-06-26T15:30:03Z [29228] INFO using SSL options: SSL_OP_CIPHER_SERVER_PREFERENCE, SSL_OP_TLS_ROLLBACK_BUG
2016-06-26T15:30:03Z [29228] INFO file-descriptors (nofiles) hard limit is 4096, soft limit is 1024
2016-06-26T15:30:03Z [29228] INFO created database directory 'agency2'.
2016-06-26T15:30:03Z [29228] INFO WAL directory 'agency2/journals' does not exist. creating it...
2016-06-26T15:30:03Z [29228] INFO ArangoDB 3.0.0 [linux] 64bit, using VPack 0.1.30, ICU 54.1, V8 5.0.71.39, OpenSSL 1.0.2g-fips  1 Mar 2016
2016-06-26T15:30:03Z [29228] INFO loaded database '_system' from 'agency2/databases/database-1'
2016-06-26T15:30:03Z [29228] INFO the server has 8 (hyper) cores, using 2 scheduler thread(s), 8 dispatcher thread(s)
2016-06-26T15:30:04Z [29228] INFO JavaScript using startup '/usr/share/arangodb3/js', application '/var/lib/arangodb3-apps'
2016-06-26T15:30:06Z [29228] INFO In database '_system': No version information file found in database directory.
2016-06-26T15:30:06Z [29228] INFO In database '_system': Found 14 defined task(s), 12 task(s) to run
2016-06-26T15:30:06Z [29228] INFO In database '_system': state standalone/init, tasks setupGraphs, setupUsers, createUsersIndex, addDefaultUserSystem, createModules, createRouting, insertRedirectionsAll, setupAqlFunctions, createStatistics, createFrontend, setupQueues, setupJobs
2016-06-26T15:30:06Z [29228] INFO In database '_system': init successfully finished
2016-06-26T15:30:06Z [29228] INFO using endpoint 'http+tcp://127.0.0.1:5002' for non-encrypted requests
2016-06-26T15:30:06Z [29228] INFO Authentication is turned off
2016-06-26T15:30:07Z [29228] INFO ArangoDB (version 3.0.0 [linux]) is ready for business. Have fun!
2016-06-26T15:33:01Z [29228] WARNING {agency} I have a higher term than RPC caller.
2016-06-26T15:33:07Z [29228] WARNING {agency} I have a higher term than RPC caller.
2016-06-26T15:33:07Z [29228] WARNING {agency} I have a higher term than RPC caller.
2016-06-26T15:33:07Z [29228] WARNING {agency} I have a higher term than RPC caller.
2016-06-26T15:33:07Z [29228] WARNING {agency} I have a higher term than RPC caller.
2016-06-26T15:33:07Z [29228] INFO cannot create connection to server '' at endpoint 'tcp://127.0.0.1:5003'
2016-06-26T15:33:08Z [29228] INFO cannot create connection to server '' at endpoint 'tcp://127.0.0.1:5003'
2016-06-26T15:33:08Z [29228] INFO cannot create connection to server '' at endpoint 'tcp://127.0.0.1:5001'
2016-06-26T15:33:08Z [29228] INFO cannot create connection to server '' at endpoint 'tcp://127.0.0.1:5003'
2016-06-26T15:33:09Z [29228] INFO cannot create connection to server '' at endpoint 'tcp://127.0.0.1:5001'
2016-06-26T15:33:09Z [29228] INFO cannot create connection to server '' at endpoint 'tcp://127.0.0.1:5003'
2016-06-26T15:33:10Z [29228] INFO cannot create connection to server '' at endpoint 'tcp://127.0.0.1:5001'
2016-06-26T15:33:10Z [29228] INFO cannot create connection to server '' at endpoint 'tcp://127.0.0.1:5003'
2016-06-26T15:33:11Z [29228] INFO cannot create connection to server '' at endpoint 'tcp://127.0.0.1:5001'
2016-06-26T15:33:11Z [29228] INFO cannot create connection to server '' at endpoint 'tcp://127.0.0.1:5003'
2016-06-26T15:33:12Z [29228] INFO cannot create connection to server '' at endpoint 'tcp://127.0.0.1:5001'
2016-06-26T15:33:12Z [29228] INFO cannot create connection to server '' at endpoint 'tcp://127.0.0.1:5003'
2016-06-26T15:33:12Z [29228] INFO cannot create connection to server '' at endpoint 'tcp://127.0.0.1:5001'
2016-06-26T15:33:12Z [29228] INFO cannot create connection to server '' at endpoint 'tcp://127.0.0.1:5003'
2016-06-26T15:33:13Z [29228] INFO cannot create connection to server '' at endpoint 'tcp://127.0.0.1:5001'
2016-06-26T15:33:13Z [29228] INFO cannot create connection to server '' at endpoint 'tcp://127.0.0.1:5003'
2016-06-26T15:33:14Z [29228] INFO cannot create connection to server '' at endpoint 'tcp://127.0.0.1:5001'
2016-06-26T15:33:14Z [29228] INFO cannot create connection to server '' at endpoint 'tcp://127.0.0.1:5003'
2016-06-26T15:33:15Z [29228] INFO cannot create connection to server '' at endpoint 'tcp://127.0.0.1:5001'
2016-06-26T15:33:15Z [29228] INFO cannot create connection to server '' at endpoint 'tcp://127.0.0.1:5003'
2016-06-26T15:33:17Z [29228] WARNING {agency} Unable to find matching entry to previous entry (index,term) = (6,1171)
2016-06-26T15:33:17Z [29228] WARNING {agency} Unable to find matching entry to previous entry (index,term) = (6,1171)
[...]


As you can see, this time it's agent1 that exits without any error messages.
As mentioned in my last post, it's probably not really crashing but rather simply exiting (I can't find any core dumps or other information regarding a possible crash)

If I then start Agent1 again, it can be that Agent2 or Agent3 exits or it can be that everything stays up as it should be.

In the latter case, if I start the first DBServer, it hangs like this:

bart@laptop ~ $ arangod --server.authentication=false --server.endpoint tcp://0.0.0.0:8529 --cluster.my-address tcp://127.0.0.1:8529 --cluster.my-local-info db1 --cluster.my-role PRIMARY --cluster.agency-endpoint tcp://127.0.0.1:5001 primary1
2016-06-26T15:37:06Z [31640] INFO using SSL options: SSL_OP_CIPHER_SERVER_PREFERENCE, SSL_OP_TLS_ROLLBACK_BUG

I think that's all info I can give you for now.

regards,

Bart


 
KInd regards,
Kaveh

Frank Celler

unread,
Jun 27, 2016, 7:48:19 AM6/27/16
to ArangoDB
Hello Bart,

thanks a lot for the details. They helped us to track down the issue. We will try to create a fix asap.

Best
  Frank

Bart DS

unread,
Jun 27, 2016, 8:02:26 AM6/27/16
to ArangoDB
Hi Frank,

I'm glad I could help in tracking down the problem.
Thanks for the quick responses!

regards,

Bart

Kaveh Vahedipour

unread,
Jun 27, 2016, 10:32:18 AM6/27/16
to ArangoDB
Hi Bart,

we feel fairly confident that we have spotted the problem. I am currently getting your Linux flavour up to check that we actually can start the cluster according to the documentation here. Thanks again for reporting the problem. 

There are 3 ways of going on from here: 

1. We are putting the fixes in a bugfix release, 3.0.1, by the end of the week, which will include this fix. 
2. You might want to get your hands dirty and build arangodb yourself from source. You'd need to install some packages by running the following command in a terminal window: apt-get install build-essential cmake git libssl-dev
3. However, if you like, we could let you have an intermediate personal build just 4 u.

Let us know, which route you're taking.

Kind regards
Kaveh

Kaveh Vahedipour

unread,
Jun 27, 2016, 10:47:33 AM6/27/16
to ArangoDB
Addendum:

For building your own arangodb you need a little more software and instructions:

sudo add-apt-repository ppa:ubuntu-toolchain-r/test

sudo apt-get install gcc-5 g++-5


git clone https://github.com/arangodb/arangodb 

cd arangodb 

mkdir build

cd build

cmake ..

make


Kind regards,

Kaveh


Bart DS

unread,
Jun 27, 2016, 11:46:01 AM6/27/16
to ArangoDB
Hi Kaveh,

I tried building it myself using your instructions, but the build fails with an error regarding regex support:

bart@laptop ~/arangodb/build $ cmake ..
-- The CXX compiler identification is GNU 4.8.5
-- The C compiler identification is GNU 4.8.5
-- The ASM compiler identification is GNU
-- Found assembler: /usr/bin/cc
-- Check for working CXX compiler: /usr/bin/c++
-- Check for working CXX compiler: /usr/bin/c++ -- works
-- Detecting CXX compiler ABI info
-- Detecting CXX compiler ABI info - done
-- Check for working C compiler: /usr/bin/cc
-- Check for working C compiler: /usr/bin/cc -- works
-- Detecting C compiler ABI info
-- Detecting C compiler ABI info - done
-- Found PythonInterp: /usr/bin/python2 (found suitable version "2.7.6", minimum required is "2") 
-- Performing Test _HAS_CXX11_FLAG
-- Performing Test _HAS_CXX11_FLAG - Success
-- Checking C++11 support for "__func__"
-- Checking C++11 support for "__func__": works
-- Checking C++11 support for "auto"
-- Checking C++11 support for "auto": works
-- Checking C++11 support for "auto_ret_type"
-- Checking C++11 support for "auto_ret_type": works
-- Checking C++11 support for "class_override_final"
-- Checking C++11 support for "class_override_final": works
-- Checking C++11 support for "constexpr"
-- Checking C++11 support for "constexpr": works
-- Checking C++11 support for "conditional"
-- Checking C++11 support for "conditional": works
-- Checking C++11 support for "decltype"
-- Checking C++11 support for "decltype": works
-- Checking C++11 support for "initializer_list"
-- Checking C++11 support for "initializer_list": works
-- Checking C++11 support for "lambda"
-- Checking C++11 support for "lambda": works
-- Checking C++11 support for "range_based_for_loop"
-- Checking C++11 support for "range_based_for_loop": works
-- Checking C++11 support for "nullptr"
-- Checking C++11 support for "nullptr": works
-- Checking C++11 support for "tuple"
-- Checking C++11 support for "tuple": works
-- Checking C++11 support for "regex"
CMake Error at cmake/CheckCXX11Features.cmake:129 (message):
  Checking C++11 support for "regex": not supported
Call Stack (most recent call first):
  cmake/CheckCXX11Features.cmake:150 (cxx11_check_feature)
  CMakeLists.txt:265 (include)


-- Configuring incomplete, errors occurred!



Maybe it's easier to go for option 3 and let me have an intermediate personal build if that's not too much work on your site?
Of course I can wait until the end of the week as well for v3.0.1 to be released, but maybe it's better I can do some testing to confirm the issue is indeed resolved?

regards,

Bart

Wilfried Gösgens

unread,
Jun 27, 2016, 11:53:23 AM6/27/16
to ArangoDB
Hi Bart,
you need at least GCC 4.9 in order to build ArangoDB 3.0
It was the first version to bring stable support for C++11 Regex.

Cheers,
Willi

Kaveh Vahedipour

unread,
Jun 27, 2016, 2:41:38 PM6/27/16
to aran...@googlegroups.com
Exactly. That’s why I added the addendum. Also please use the following command for cmake to actually use the correct compiler:

CXX=g++-5 CC=gcc-5 cmake ..

Cheers,
Kaveh
> --
> You received this message because you are subscribed to a topic in the Google Groups "ArangoDB" group.
> To unsubscribe from this topic, visit https://groups.google.com/d/topic/arangodb/p30FN_TXp30/unsubscribe.
> To unsubscribe from this group and all its topics, send an email to arangodb+u...@googlegroups.com.
> For more options, visit https://groups.google.com/d/optout.

Bart DS

unread,
Jun 27, 2016, 3:29:32 PM6/27/16
to ArangoDB


On Monday, June 27, 2016 at 8:41:38 PM UTC+2, Kaveh Vahedipour wrote:
Exactly. That’s why I added the addendum. Also please use the following command for cmake to actually use the correct compiler:

CXX=g++-5 CC=gcc-5 cmake ..

Cheers,
Kaveh


Thanks Kaveh, those env variables did the trick.
It's now compiling and it seems it will take a while, so I'll report back once I got a chance to test it out.

Bart

Kaveh Vahedipour

unread,
Jun 27, 2016, 3:30:51 PM6/27/16
to aran...@googlegroups.com
yes, it’ll take a little while :) i’m kind of preoccupied probably, but i think that it’s worth while.

cheers,
kaveh

Bart DS

unread,
Jun 28, 2016, 3:38:26 AM6/28/16
to ArangoDB
Ok, I did some tests and the agents are running fine now.
At least, initially...

When I start my first server via the following command:

./build/bin/arangod --server.authentication=false --server.endpoint tcp://0.0.0.0:8529 --cluster.my-address tcp://127.0.0.1:8529 --cluster.my-local-info db1 --cluster.my-role PRIMARY --cluster.agency-endpoint tcp://127.0.0.1:5001 primary1

I get following output:

bart@laptop ~/test/arangodb $ ./build/bin/arangod --server.authentication=false --server.endpoint tcp://0.0.0.0:8529 --cluster.my-address tcp://127.0.0.1:8529 --cluster.my-local-info db1 --cluster.my-role PRIMARY --cluster.agency-endpoint tcp://127.0.0.1:5001 primary1
2016-06-28T07:06:38Z [24113] INFO using SSL options: SSL_OP_CIPHER_SERVER_PREFERENCE, SSL_OP_TLS_ROLLBACK_BUG
2016-06-28T07:06:38Z [24113] INFO file-descriptors (nofiles) hard limit is 4096, soft limit is 1024
2016-06-28T07:06:38Z [24113] INFO created database directory 'primary1'.
2016-06-28T07:06:38Z [24113] INFO WAL directory 'primary1/journals' does not exist. creating it...
2016-06-28T07:06:38Z [24113] INFO ArangoDB 3.0.x-devel [linux] 64bit, using VPack 0.1.30, ICU 54.1, V8 5.0.71.39, OpenSSL 1.0.1f 6 Jan 2014
2016-06-28T07:06:38Z [24113] INFO loaded database '_system' from 'primary1/databases/database-1'
2016-06-28T07:06:38Z [24113] INFO the server has 4 (hyper) cores, using 1 scheduler thread(s), 4 dispatcher thread(s)
2016-06-28T07:06:39Z [24113] INFO JavaScript using startup './js', application './js/apps'
2016-06-28T07:06:39Z [24113] INFO changing state of PRIMARY server from UNDEFINED to STARTUP
2016-06-28T07:06:39Z [24113] INFO Cluster feature is turned on. Agency version: , Agency endpoints: http+tcp://127.0.0.1:5002, http+tcp://127.0.0.1:5002, http+tcp://127.0.0.1:5002, http+tcp://127.0.0.1:5001, http+tcp://127.0.0.1:5001, http+tcp://127.0.0.1:5001, server id: 'DBServer001', internal address: tcp://127.0.0.1:8529, role: PRIMARY
2016-06-28T07:06:39Z [24113] INFO using heartbeat interval value '1000 ms' from agency
2016-06-28T07:06:39Z [24113] INFO changing state of PRIMARY server from STARTUP to SERVING
2016-06-28T07:06:39Z [24113] INFO In database '_system': No version information file found in database directory.
2016-06-28T07:06:39Z [24113] INFO In database '_system': Database is up-to-date (30000/cluster-local/init)
2016-06-28T07:06:40Z [24113] INFO using endpoint 'http+tcp://0.0.0.0:8529' for non-encrypted requests
2016-06-28T07:06:40Z [24113] INFO Authentication is turned off
2016-06-28T07:06:40Z [24113] INFO bootstraped DB server DBServer001
2016-06-28T07:06:40Z [24113] INFO bootstraped DB server DBServer001
2016-06-28T07:06:40Z [24113] INFO bootstraped DB server DBServer001
2016-06-28T07:06:40Z [24113] INFO In database '_system': Database is up-to-date (-/db-server-local/init)
2016-06-28T07:06:40Z [24113] INFO bootstraped DB server DBServer001
2016-06-28T07:06:40Z [24113] INFO ArangoDB (version 3.0.x-devel [linux]) is ready for business. Have fun!
2016-06-28T07:06:41Z [24113] ERROR {heartbeat} Could not read Current/Version from agency.
2016-06-28T07:06:51Z [24113] ERROR error details: {"code":307,"errorNum":0,"errorMessage":"Temporary Redirect (Temporary Redirect)","error":true}
2016-06-28T07:06:51Z [24113] ERROR error stack: ArangoError: Temporary Redirect (Temporary Redirect)\n    at Error (native)\n    at writeLocked (/home/bart/test/arangodb/js/server/modules/@arangodb/cluster.js:1663:41)\n    at createLocalDatabases (/home/bart/test/arangodb/js/server/modules/@arangodb/cluster.js:348:9)\n    at handleDatabaseChanges (/home/bart/test/arangodb/js/server/modules/@arangodb/cluster.js:461:3)\n    at handleChanges (/home/bart/test/arangodb/js/server/modules/@arangodb/cluster.js:1460:3)\n    at handlePlanChange (/home/bart/test/arangodb/js/server/modules/@arangodb/cluster.js:1675:24)
2016-06-28T07:06:51Z [24113] ERROR plan change handling failed
2016-06-28T07:06:52Z [24113] ERROR {heartbeat} Could not read Current/Version from agency.
2016-06-28T07:06:53Z [24113] ERROR {heartbeat} Could not read Current/Version from agency.
2016-06-28T07:06:54Z [24113] ERROR {heartbeat} Could not read Current/Version from agency.
2016-06-28T07:06:55Z [24113] ERROR {heartbeat} Could not read Current/Version from agency.
2016-06-28T07:06:56Z [24113] WARNING {heartbeat} heartbeat could not be sent to agency endpoints (http+tcp://127.0.0.1:5002, http+tcp://127.0.0.1:5002, http+tcp://127.0.0.1:5003, http+tcp://127.0.0.1:5003, http+tcp://127.0.0.1:5003, http+tcp://127.0.0.1:5002, http+tcp://127.0.0.1:5001, http+tcp://127.0.0.1:5001, http+tcp://127.0.0.1:5001): http code: 307, body: 
2016-06-28T07:06:56Z [24113] ERROR {heartbeat} Could not read Current/Version from agency.
2016-06-28T07:06:57Z [24113] ERROR {heartbeat} Could not read Current/Version from agency.
2016-06-28T07:06:58Z [24113] ERROR {heartbeat} Could not read Current/Version from agency.
2016-06-28T07:06:59Z [24113] ERROR {heartbeat} Could not read Current/Version from agency.
2016-06-28T07:07:00Z [24113] ERROR {heartbeat} Could not read Current/Version from agency.
2016-06-28T07:07:01Z [24113] ERROR {heartbeat} Could not read Current/Version from agency.
2016-06-28T07:07:02Z [24113] ERROR {heartbeat} Could not read Current/Version from agency.
2016-06-28T07:07:03Z [24113] ERROR {heartbeat} Could not read Current/Version from agency.
2016-06-28T07:07:04Z [24113] ERROR {heartbeat} Could not read Current/Version from agency.
2016-06-28T07:07:05Z [24113] WARNING {heartbeat} heartbeat could not be sent to agency endpoints (http+tcp://127.0.0.1:5002, http+tcp://127.0.0.1:5003, http+tcp://127.0.0.1:5003, http+tcp://127.0.0.1:5003, http+tcp://127.0.0.1:5002, http+tcp://127.0.0.1:5002, http+tcp://127.0.0.1:5001, http+tcp://127.0.0.1:5001, http+tcp://127.0.0.1:5001): http code: 307, body: 
2016-06-28T07:07:05Z [24113] ERROR {heartbeat} Could not read Current/Version from agency.
2016-06-28T07:07:06Z [24113] ERROR {heartbeat} Could not read Current/Version from agency.
2016-06-28T07:07:07Z [24113] ERROR {heartbeat} Could not read Current/Version from agency.
2016-06-28T07:07:09Z [24113] ERROR {heartbeat} Could not read Current/Version from agency.
2016-06-28T07:07:10Z [24113] ERROR {heartbeat} Could not read Current/Version from agency.
2016-06-28T07:07:13Z [24113] ERROR {heartbeat} Could not read Current/Version from agency.
2016-06-28T07:07:13Z [24113] ERROR {heartbeat} Could not read Current/Version from agency.
2016-06-28T07:07:28Z [24113] INFO plan change handling successful


And the agents start using all cpu resources until the system becomes almost unresponsive.

At that time, the agents are logging following messages:

2016-06-28T07:06:52Z [23997] WARNING {agency} I have a higher term than RPC caller.
2016-06-28T07:06:52Z [23997] WARNING {agency} I have a higher term than RPC caller.
2016-06-28T07:06:54Z [23997] WARNING {agency} I have a higher term than RPC caller.
2016-06-28T07:06:57Z [23997] WARNING {agency} I have a higher term than RPC caller.
2016-06-28T07:06:58Z [23997] WARNING {agency} I have a higher term than RPC caller.
2016-06-28T07:06:58Z [23997] WARNING {agency} I have a higher term than RPC caller.
2016-06-28T07:07:02Z [23997] WARNING {agency} I have a higher term than RPC caller.
2016-06-28T07:07:02Z [23997] WARNING {agency} I have a higher term than RPC caller.
2016-06-28T07:07:10Z [23997] WARNING {agency} I have a higher term than RPC caller.
2016-06-28T07:07:13Z [23997] WARNING {agency} I have a higher term than RPC caller.
2016-06-28T07:07:25Z [23997] WARNING {agency} I have a higher term than RPC caller.
2016-06-28T07:07:30Z [23997] WARNING {agency} I have a higher term than RPC caller.
2016-06-28T07:07:30Z [23997] WARNING {agency} I have a higher term than RPC caller.
2016-06-28T07:07:33Z [23997] WARNING {agency} I have a higher term than RPC caller.
2016-06-28T07:07:43Z [23997] WARNING {agency} I have a higher term than RPC caller.
2016-06-28T07:08:24Z [23997] WARNING {agency} I have a higher term than RPC caller.
2016-06-28T07:08:34Z [23997] WARNING {agency} I have a higher term than RPC caller.
2016-06-28T07:08:54Z [23997] WARNING {agency} I have a higher term than RPC caller.
2016-06-28T07:08:55Z [23997] WARNING {agency} I have a higher term than RPC caller.
2016-06-28T07:09:14Z [23997] WARNING {agency} I have a higher term than RPC caller.
2016-06-28T07:09:25Z [23997] WARNING {agency} I have a higher term than RPC caller.


When I stop the primary server again via ctrl-c, after some time the cpu usage starts to decrease but the memory is increasing rapidly, finally taking up all system memory and swap, in the end freezing the whole system.
At this point the agents start to log following messages:

2016-06-28T07:12:34Z [23478] WARNING {queries} slow query: 'FOR s in @@collection FILTER s.time >= @start SORT s.time desc LIMIT 1 RETURN s', took 26.345204
2016-06-28T07:12:43Z [23997] WARNING {agency} I have a higher term than RPC caller.
2016-06-28T07:13:05Z [23997] WARNING {agency} I have a higher term than RPC caller.
2016-06-28T07:13:10Z [23997] WARNING {agency} I have a higher term than RPC caller.
2016-06-28T07:13:20Z [23997] WARNING {agency} I have a higher term than RPC caller.
2016-06-28T07:13:23Z [23997] WARNING {agency} I have a higher term than RPC caller.
2016-06-28T07:13:39Z [23997] WARNING {agency} I have a higher term than RPC caller.
2016-06-28T07:15:23Z [23478] WARNING {queries} slow query: 'FOR s in @@collection FILTER s.time >= @start SORT s.time desc LIMIT 1 RETURN s', took 22.091859
2016-06-28T07:21:06Z [23478] WARNING {queries} slow query: 'FOR s in @@collection FILTER s.time >= @start SORT s.time desc LIMIT 1 RETURN s', took 17.283082
2016-06-28T07:22:26Z [23478] WARNING {queries} slow query: 'FOR s in @@collection FILTER s.time >= @start SORT s.time desc LIMIT 1 RETURN s', took 19.252867
2016-06-28T07:22:26Z [23478] WARNING {queries} slow query: 'FOR s in @@collection FILTER s.time >= @start SORT s.time desc LIMIT 1 RETURN s', took 40.744055
[...]


So the initial issue with the agents seems to be resolved, but it's definitely still not working correctly

Bart


Kaveh Vahedipour

unread,
Jun 28, 2016, 3:55:58 AM6/28/16
to aran...@googlegroups.com
Hi Bart,

Thank for keeping the spirit up. An hour ago I spotted the same problem while testing. I should have a fix for that in a little bit. I’ll be in touch.

Kind regards,
Kaveh.
signature.asc

Kaveh Vahedipour

unread,
Jun 28, 2016, 4:55:48 AM6/28/16
to aran...@googlegroups.com
Hi again,

I think we’re there. I’ve tested in Linux Mint in a VM and everything looks really happy.

So, cd into the directory where you cloned arangodb from github:

cd <arangodb-dir>
git pull
cd build
make

And you will have the build with the fixes. No worries, unless you have not deleted the build directory, it’s gonna be a considerably shorted build time compared to last night.

I also don’t know how much experience you have with building projects of this size. But if you have a couple of processors and don’t need the box you are building on for any other purposes for the compile time, you might want to consider to replace the last of the above lines by 'make -j<ncpu>', where <ncpu> is the number of hyperthreaded cores you have available. I generally build with make -j8 for example.

Anyway, many thanks for hanging in there. Contributions like yours advance the development considerably.

Keep us posted, please.

Kind regards,
Kaveh.

> On 28 Jun 2016, at 09:38, Bart DS <bart...@gmail.com> wrote:
>
signature.asc

Bart DS

unread,
Jun 28, 2016, 5:40:07 AM6/28/16
to ArangoDB
Hi Kaveh,

No worries, I'm glad I can help tracking down those issues.

I think we're almost there ;-)
All 3 agents and the first primary DBServer now start correctly, but I'm not able to start an additional primary DBServer or Coordinator:

Second DBServer:

bart@laptop ~/test2/arangodb $ ./build/bin/arangod --server.authentication=false --server.endpoint tcp://0.0.0.0:8530 --cluster.my-address tcp://127.0.0.1:8530 --cluster.my-local-info db2 --cluster.my-role PRIMARY --cluster.agency-endpoint tcp://127.0.0.1:5001 primary2
2016-06-28T09:25:02Z [2076] INFO using SSL options: SSL_OP_CIPHER_SERVER_PREFERENCE, SSL_OP_TLS_ROLLBACK_BUG
2016-06-28T09:25:02Z [2076] FATAL unable to determine unambiguous role for server ''. No role configured in agency (http+tcp://127.0.0.1:5001, http+tcp://127.0.0.1:5001, http+tcp://127.0.0.1:5001)

Coordinator:

bart@laptop ~/test2/arangodb $ ./build/bin/arangod --server.authentication=false --server.endpoint tcp://0.0.0.0:8531 --cluster.my-address tcp://127.0.0.1:8531 --cluster.my-local-info coord1 --cluster.my-role COORDINATOR --cluster.agency-endpoint tcp://127.0.0.1:5001 coordinator
2016-06-28T09:27:53Z [2276] INFO using SSL options: SSL_OP_CIPHER_SERVER_PREFERENCE, SSL_OP_TLS_ROLLBACK_BUG
2016-06-28T09:27:53Z [2276] FATAL unable to determine unambiguous role for server ''. No role configured in agency (http+tcp://127.0.0.1:5001, http+tcp://127.0.0.1:5001, http+tcp://127.0.0.1:5001)

Bart

Kaveh Vahedipour

unread,
Jun 28, 2016, 5:54:29 AM6/28/16
to aran...@googlegroups.com
ok. i should have mentioned to go back and remove the broken databases before restarting your cluster.
i.e. rm -rf agency* primary*

and then restart.


signature.asc

Bart DS

unread,
Jun 28, 2016, 8:08:25 AM6/28/16
to ArangoDB
Doh, I should have tried that before reporting back...

Anyway, I removed the old databases and restarted everything (3 agents, 2 DBServers and 1 coordinator)
Several minutes after starting the coordinator (a lot of messages appear regarding synchronization of shards) everything seems to run and I can access the web interface on http://localhost:8531

However, as soon as I started the coordinator, I started to see several warnings and errors on the DBServers and Coordinators:

DBServer1:

2016-06-28T10:03:54Z [4940] INFO ArangoDB (version 3.0.x-devel [linux]) is ready for business. Have fun!
2016-06-28T10:03:54Z [4940] INFO plan change handling successful
2016-06-28T10:03:55Z [4940] ERROR {heartbeat} Could not read Current/Version from agency.
2016-06-28T10:04:01Z [4940] INFO plan change handling successful
2016-06-28T10:04:06Z [4940] INFO plan change handling successful
2016-06-28T10:04:12Z [4940] INFO creating local shard '_system/s2' for central '_system/1'
2016-06-28T10:04:12Z [4940] INFO scheduleOneShardSynchronization: have launched job { "database" : "_system", "shard" : "s2", "planId" : "1", "leader" : "DBServer002" }
2016-06-28T10:04:12Z [4940] INFO plan change handling successful
2016-06-28T10:04:12Z [4940] INFO synchronizeOneShard: waiting for leader, _system/s2, _system/1
2016-06-28T10:04:13Z [4940] INFO synchronizeOneShard: waiting for leader, _system/s2, _system/1
2016-06-28T10:04:14Z [4940] INFO synchronizeOneShard: waiting for leader, _system/s2, _system/1
2016-06-28T10:04:15Z [4940] INFO synchronizeOneShard: waiting for leader, _system/s2, _system/1
2016-06-28T10:04:16Z [4940] INFO synchronizeOneShard: waiting for leader, _system/s2, _system/1
2016-06-28T10:04:17Z [4940] INFO synchronizeOneShard: waiting for leader, _system/s2, _system/1
2016-06-28T10:04:18Z [4940] INFO plan change handling successful
2016-06-28T10:04:18Z [4940] INFO synchronizeOneShard: trying to synchronize local shard '%s/%s' for central '%s/%s' _system s2 _system 1
2016-06-28T10:04:18Z [4940] INFO {replication} connected to master at tcp://127.0.0.1:8530, id 257501827428435, version 3.0, last log tick 27
2016-06-28T10:04:19Z [4940] INFO synchronizeOneShard: synchronization worked for shard s2
2016-06-28T10:04:19Z [4940] INFO synchronizeOneShard: done, _system/s2, _system/1
2016-06-28T10:04:22Z [4940] INFO creating local shard '_system/s4' for central '_system/3'
2016-06-28T10:04:23Z [4940] INFO scheduleOneShardSynchronization: have launched job { "database" : "_system", "shard" : "s2", "planId" : "1", "leader" : "DBServer002" }
2016-06-28T10:04:23Z [4940] INFO plan change handling successful

[...]

2016-06-28T10:05:57Z [4940] INFO plan change handling successful
2016-06-28T10:05:57Z [4940] INFO Applying chunk:
2016-06-28T10:05:57Z [4940] ERROR syncCollectionFinalize: ensureIndex { "tick" : "437", "type" : 2100, "database" : "1", "cid" : "398", "cname" : "s20", "data" : { "fields" : [ "time" ], "id" : "21", "sparse" : false, "type" : "skiplist", "unique" : false } } {"errorNum":10,"errorMessage":"usage: ensureIndex(<description>)"}
2016-06-28T10:05:57Z [4940] ERROR synchronizeOneShard: Could not finalize shard synchronization s20 { "error" : true, "errorMessage" : "could not replicate body ops", "response" : { "statusCode" : 200, "status" : 200, "message" : "OK", "headers" : { "content-type" : "application/x-arango-dump; charset=utf-8", "x-arango-replication-lasttick" : "437", "x-arango-replication-active" : "true", "server" : "ArangoDB", "http/1.1" : "200 OK", "content-length" : "153", "connection" : "Keep-Alive", "x-arango-replication-frompresent" : "true", "x-arango-replication-checkmore" : "false", "x-arango-replication-lastincluded" : "437" }, "rawBody" : [Object { "length" : 153 }], "body" : "{\"tick\":\"437\",\"type\":2100,\"database\":\"1\",\"cid\":\"398\",\"cname\":\"s20\",\"data\":{\"fields\":[\"time\"],\"id\":\"21\",\"sparse\":false,\"type\":\"skiplist\",\"unique\":false}}\n" }, "exception" : [Object { "errorNum" : 10, "errorMessage" : "usage: ensureIndex(<description>)" }] }
2016-06-28T10:05:57Z [4940] ERROR synchronization of local shard '_system/s20' for central '_system/19' failed: "Did not work for shard s20."
2016-06-28T10:05:57Z [4940] INFO scheduleOneShardSynchronization: have launched job { "database" : "_system", "shard" : "s10", "planId" : "9", "leader" : "DBServer002" }
2016-06-28T10:05:57Z [4940] INFO synchronizeOneShard: done, _system/s20, _system/19
2016-06-28T10:05:57Z [4940] INFO scheduleOneShardSynchronization: have launched job { "database" : "_system", "shard" : "s15", "planId" : "14", "leader" : "DBServer002" }
2016-06-28T10:05:57Z [4940] INFO synchronizeOneShard: already done, _system/s10, _system/9

[...]

2016-06-28T10:07:08Z [4940] INFO synchronizeOneShard: trying to synchronize local shard '%s/%s' for central '%s/%s' _system s32 _system 31
2016-06-28T10:07:08Z [4940] INFO {replication} connected to master at tcp://127.0.0.1:8530, id 257501827428435, version 3.0, last log tick 788
2016-06-28T10:07:09Z [4940] INFO Applying chunk:
2016-06-28T10:07:09Z [4940] ERROR syncCollectionFinalize: ensureIndex { "tick" : "823", "type" : 2100, "database" : "1", "cid" : "768", "cname" : "s32", "data" : { "fields" : [ "mount" ], "id" : "33", "selectivityEstimate" : 1, "sparse" : false, "type" : "hash", "unique" : true } } {"errorNum":10,"errorMessage":"usage: ensureIndex(<description>)"}
2016-06-28T10:07:09Z [4940] ERROR synchronizeOneShard: Could not finalize shard synchronization s32 { "error" : true, "errorMessage" : "could not replicate body ops", "response" : { "statusCode" : 200, "status" : 200, "message" : "OK", "headers" : { "content-type" : "application/x-arango-dump; charset=utf-8", "x-arango-replication-lasttick" : "823", "x-arango-replication-active" : "true", "server" : "ArangoDB", "http/1.1" : "200 OK", "content-length" : "173", "connection" : "Keep-Alive", "x-arango-replication-frompresent" : "true", "x-arango-replication-checkmore" : "false", "x-arango-replication-lastincluded" : "823" }, "rawBody" : [Object { "length" : 173 }], "body" : "{\"tick\":\"823\",\"type\":2100,\"database\":\"1\",\"cid\":\"768\",\"cname\":\"s32\",\"data\":{\"fields\":[\"mount\"],\"id\":\"33\",\"selectivityEstimate\":1,\"sparse\":false,\"type\":\"hash\",\"unique\":true}}\n" }, "exception" : [Object { "errorNum" : 10, "errorMessage" : "usage: ensureIndex(<description>)" }] }
2016-06-28T10:07:09Z [4940] ERROR synchronization of local shard '_system/s32' for central '_system/31' failed: "Did not work for shard s32."
2016-06-28T10:07:09Z [4940] INFO synchronizeOneShard: done, _system/s32, _system/31
2016-06-28T10:07:09Z [4940] INFO scheduleOneShardSynchronization: have launched job { "database" : "_system", "shard" : "s2", "planId" : "1", "leader" : "DBServer002" }
2016-06-28T10:07:09Z [4940] INFO plan change handling successful
2016-06-28T10:07:09Z [4940] INFO scheduleOneShardSynchronization: have launched job { "database" : "_system", "shard" : "s10", "planId" : "9", "leader" : "DBServer002" }
2016-06-28T10:07:09Z [4940] INFO synchronizeOneShard: already done, _system/s2, _system/1
2016-06-28T10:07:09Z [4940] INFO scheduleOneShardSynchronization: have launched job { "database" : "_system", "shard" : "s15", "planId" : "14", "leader" : "DBServer002" }
2016-06-28T10:07:09Z [4940] INFO synchronizeOneShard: already done, _system/s10, _system/9

[...]

DBServer2:

2016-06-28T10:03:53Z [5029] INFO using SSL options: SSL_OP_CIPHER_SERVER_PREFERENCE, SSL_OP_TLS_ROLLBACK_BUG
2016-06-28T10:03:53Z [5029] INFO file-descriptors (nofiles) hard limit is 4096, soft limit is 1024
2016-06-28T10:03:53Z [5029] INFO created database directory 'primary2'.
2016-06-28T10:03:53Z [5029] INFO WAL directory 'primary2/journals' does not exist. creating it...
2016-06-28T10:03:53Z [5029] INFO ArangoDB 3.0.x-devel [linux] 64bit, using VPack 0.1.30, ICU 54.1, V8 5.0.71.39, OpenSSL 1.0.1f 6 Jan 2014
2016-06-28T10:03:53Z [5029] INFO loaded database '_system' from 'primary2/databases/database-1'
2016-06-28T10:03:53Z [5029] INFO the server has 4 (hyper) cores, using 1 scheduler thread(s), 4 dispatcher thread(s)
2016-06-28T10:03:56Z [5029] INFO JavaScript using startup './js', application './js/apps'
2016-06-28T10:03:56Z [5029] INFO changing state of PRIMARY server from UNDEFINED to STARTUP
2016-06-28T10:03:57Z [5029] INFO Cluster feature is turned on. Agency version: , Agency endpoints: http+tcp://127.0.0.1:5003, http+tcp://127.0.0.1:5003, http+tcp://127.0.0.1:5003, http+tcp://127.0.0.1:5002, http+tcp://127.0.0.1:5002, http+tcp://127.0.0.1:5001, http+tcp://127.0.0.1:5001, http+tcp://127.0.0.1:5001, http+tcp://127.0.0.1:5002, server id: 'DBServer002', internal address: tcp://127.0.0.1:8530, role: PRIMARY
2016-06-28T10:03:57Z [5029] INFO using heartbeat interval value '1000 ms' from agency
2016-06-28T10:03:57Z [5029] INFO changing state of PRIMARY server from STARTUP to SERVING
2016-06-28T10:03:57Z [5029] INFO In database '_system': No version information file found in database directory.
2016-06-28T10:03:57Z [5029] INFO In database '_system': Database is up-to-date (30000/cluster-local/init)
2016-06-28T10:03:57Z [5029] INFO using endpoint 'http+tcp://0.0.0.0:8530' for non-encrypted requests
2016-06-28T10:03:57Z [5029] INFO Authentication is turned off
2016-06-28T10:03:57Z [5029] INFO bootstraped DB server DBServer002
2016-06-28T10:03:57Z [5029] INFO bootstraped DB server DBServer002
2016-06-28T10:03:57Z [5029] INFO bootstraped DB server DBServer002
2016-06-28T10:03:57Z [5029] INFO In database '_system': Database is up-to-date (-/db-server-local/init)
2016-06-28T10:03:57Z [5029] INFO bootstraped DB server DBServer002
2016-06-28T10:03:57Z [5029] INFO ArangoDB (version 3.0.x-devel [linux]) is ready for business. Have fun!

[...]

2016-06-28T10:07:54Z [5029] INFO plan change handling successful
2016-06-28T10:07:56Z [5029] INFO plan change handling successful
2016-06-28T10:08:00Z [5029] INFO plan change handling successful
2016-06-28T10:08:00Z [5029] WARNING FollowerInfo::add, could not cas key Current/Collections/_system/16/s17
2016-06-28T10:08:01Z [5029] WARNING FollowerInfo::add, could not cas key Current/Collections/_system/16/s17
2016-06-28T10:08:01Z [5029] WARNING FollowerInfo::add, could not cas key Current/Collections/_system/16/s17
2016-06-28T10:08:02Z [5029] ERROR {replication} insertLocal: dropping follower DBServer001 for shard s17
2016-06-28T10:08:02Z [5029] ERROR {replication} insertLocal: dropping follower DBServer001 for shard s20
2016-06-28T10:08:03Z [5029] INFO plan change handling successful
2016-06-28T10:08:08Z [5029] INFO plan change handling successful
2016-06-28T10:08:15Z [5029] INFO plan change handling successful
2016-06-28T10:08:16Z [5029] WARNING FollowerInfo::add, could not cas key Current/Collections/_system/19/s20
2016-06-28T10:08:17Z [5029] WARNING FollowerInfo::add, could not cas key Current/Collections/_system/19/s20
2016-06-28T10:08:17Z [5029] WARNING FollowerInfo::add, could not cas key Current/Collections/_system/19/s20
2016-06-28T10:08:18Z [5029] WARNING FollowerInfo::add, could not cas key Current/Collections/_system/19/s20
2016-06-28T10:08:18Z [5029] WARNING FollowerInfo::add, could not cas key Current/Collections/_system/19/s20
2016-06-28T10:08:19Z [5029] WARNING FollowerInfo::add, could not cas key Current/Collections/_system/19/s20
2016-06-28T10:08:19Z [5029] WARNING FollowerInfo::add, could not cas key Current/Collections/_system/19/s20
2016-06-28T10:08:20Z [5029] WARNING FollowerInfo::add, could not cas key Current/Collections/_system/19/s20

[...]

2016-06-28T10:08:45Z [5029] WARNING FollowerInfo::add, could not cas key Current/Collections/_system/19/s20
2016-06-28T10:08:45Z [5029] WARNING FollowerInfo::add, could not cas key Current/Collections/_system/19/s20
2016-06-28T10:08:46Z [5029] WARNING FollowerInfo::add, could not cas key Current/Collections/_system/19/s20
2016-06-28T10:08:46Z [5029] ERROR FollowerInfo::add, timeout in agency operation for key Current/Collections/_system/19/s20
2016-06-28T10:08:46Z [5029] WARNING FollowerInfo::remove, could not cas key Current/Collections/_system/19/s20
2016-06-28T10:08:47Z [5029] WARNING FollowerInfo::remove, could not cas key Current/Collections/_system/19/s20
2016-06-28T10:08:47Z [5029] WARNING FollowerInfo::remove, could not cas key Current/Collections/_system/19/s20
2016-06-28T10:08:48Z [5029] WARNING FollowerInfo::remove, could not cas key Current/Collections/_system/19/s20
2016-06-28T10:08:48Z [5029] WARNING FollowerInfo::remove, could not cas key Current/Collections/_system/19/s20
2016-06-28T10:08:49Z [5029] WARNING FollowerInfo::remove, could not cas key Current/Collections/_system/19/s20
2016-06-28T10:08:49Z [5029] WARNING FollowerInfo::remove, could not cas key Current/Collections/_system/19/s20

[...]

2016-06-28T10:09:16Z [5029] WARNING FollowerInfo::remove, could not cas key Current/Collections/_system/19/s20
2016-06-28T10:09:16Z [5029] ERROR FollowerInfo::remove, timeout in agency operation for key Current/Collections/_system/19/s20
2016-06-28T10:09:16Z [5029] ERROR {replication} insertLocal: dropping follower DBServer001 for shard s20
2016-06-28T10:15:20Z [5029] ERROR {replication} insertLocal: dropping follower DBServer001 for shard s23
2016-06-28T10:15:20Z [5029] INFO plan change handling successful
2016-06-28T10:15:22Z [5029] WARNING FollowerInfo::add, could not cas key Current/Collections/_system/19/s20
2016-06-28T10:15:22Z [5029] WARNING FollowerInfo::add, could not cas key Current/Collections/_system/19/s20
2016-06-28T10:15:23Z [5029] WARNING FollowerInfo::add, could not cas key Current/Collections/_system/19/s20
2016-06-28T10:15:23Z [5029] WARNING FollowerInfo::add, could not cas key Current/Collections/_system/19/s20

[...]

2016-06-28T10:16:53Z [5029] WARNING FollowerInfo::remove, could not cas key Current/Collections/_system/19/s20
2016-06-28T10:16:54Z [5029] WARNING FollowerInfo::remove, could not cas key Current/Collections/_system/19/s20
2016-06-28T10:16:54Z [5029] ERROR FollowerInfo::remove, timeout in agency operation for key Current/Collections/_system/19/s20
2016-06-28T10:16:54Z [5029] ERROR {replication} insertLocal: dropping follower DBServer001 for shard s20


Coordinator1:

2016-06-28T10:04:04Z [5109] INFO using SSL options: SSL_OP_CIPHER_SERVER_PREFERENCE, SSL_OP_TLS_ROLLBACK_BUG
2016-06-28T10:04:04Z [5109] INFO Waiting for a DBserver to show up...
2016-06-28T10:04:04Z [5109] INFO Found 2 DBservers.
2016-06-28T10:04:04Z [5109] INFO file-descriptors (nofiles) hard limit is 4096, soft limit is 1024
2016-06-28T10:04:04Z [5109] INFO created database directory 'coordinator'.
2016-06-28T10:04:04Z [5109] INFO WAL directory 'coordinator/journals' does not exist. creating it...
2016-06-28T10:04:04Z [5109] INFO ArangoDB 3.0.x-devel [linux] 64bit, using VPack 0.1.30, ICU 54.1, V8 5.0.71.39, OpenSSL 1.0.1f 6 Jan 2014
2016-06-28T10:04:04Z [5109] INFO loaded database '_system' from 'coordinator/databases/database-1'
2016-06-28T10:04:04Z [5109] INFO the server has 4 (hyper) cores, using 1 scheduler thread(s), 4 dispatcher thread(s)
2016-06-28T10:04:05Z [5109] INFO JavaScript using startup './js', application './js/apps'
2016-06-28T10:04:05Z [5109] INFO changing state of COORDINATOR server from UNDEFINED to STARTUP
2016-06-28T10:04:05Z [5109] INFO Cluster feature is turned on. Agency version: , Agency endpoints: http+tcp://127.0.0.1:5003, http+tcp://127.0.0.1:5003, http+tcp://127.0.0.1:5003, http+tcp://127.0.0.1:5001, http+tcp://127.0.0.1:5001, http+tcp://127.0.0.1:5001, server id: 'Coordinator001', internal address: tcp://127.0.0.1:8531, role: COORDINATOR
2016-06-28T10:04:05Z [5109] INFO using heartbeat interval value '1000 ms' from agency
2016-06-28T10:04:05Z [5109] INFO changing state of COORDINATOR server from STARTUP to SERVING
2016-06-28T10:04:05Z [5109] INFO In database '_system': No version information file found in database directory.
2016-06-28T10:04:05Z [5109] INFO In database '_system': Database is up-to-date (30000/cluster-local/init)
2016-06-28T10:04:06Z [5109] INFO using endpoint 'http+tcp://0.0.0.0:8531' for non-encrypted requests
2016-06-28T10:04:06Z [5109] INFO Authentication is turned off
2016-06-28T10:04:06Z [5109] INFO In database '_system': Found 14 defined task(s), 12 task(s) to run
2016-06-28T10:04:06Z [5109] INFO In database '_system': state coordinator-global/init, tasks setupGraphs, setupUsers, createUsersIndex, addDefaultUserSystem, createModules, createRouting, insertRedirectionsAll, setupAqlFunctions, createStatistics, createFrontend, setupQueues, setupJobs
2016-06-28T10:06:52Z [5109] INFO In database '_system': init successfully finished
2016-06-28T10:07:11Z [5109] WARNING catch error in historian: ArangoError: internal error\n    at Error (native)\n    at Object.exports.historian (/home/bart/test2/arangodb/js/server/modules/@arangodb/statistics.js:490:8)\n    at eval (eval at undefined, <anonymous>:3:54)\n    at eval (eval at undefined, <anonymous>:3:70)
2016-06-28T10:07:11Z [5109] INFO Waiting for synchronous replication of system collections...
2016-06-28T10:07:22Z [5109] WARNING catch error in historian: ArangoError: internal error\n    at Error (native)\n    at Object.exports.historian (/home/bart/test2/arangodb/js/server/modules/@arangodb/statistics.js:501:12)\n    at eval (eval at undefined, <anonymous>:3:54)\n    at eval (eval at undefined, <anonymous>:3:70)
2016-06-28T10:07:47Z [5109] INFO bootstraped coordinator Coordinator001
2016-06-28T10:07:47Z [5109] INFO ArangoDB (version 3.0.x-devel [linux]) is ready for business. Have fun!
2016-06-28T10:09:16Z [5109] WARNING catch error in historian: ArangoError: internal error\n    at Error (native)\n    at Object.exports.historian (/home/bart/test2/arangodb/js/server/modules/@arangodb/statistics.js:501:12)\n    at eval (eval at undefined, <anonymous>:3:54)\n    at eval (eval at undefined, <anonymous>:3:70)
2016-06-28T10:09:16Z [5109] WARNING catch error in historian: ArangoError: internal error\n    at Error (native)\n    at Object.exports.historian (/home/bart/test2/arangodb/js/server/modules/@arangodb/statistics.js:501:12)\n    at eval (eval at undefined, <anonymous>:3:54)\n    at eval (eval at undefined, <anonymous>:3:70)
2016-06-28T10:16:54Z [5109] WARNING catch error in historian: ArangoError: internal error\n    at Error (native)\n    at Object.exports.historian (/home/bart/test2/arangodb/js/server/modules/@arangodb/statistics.js:501:12)\n    at eval (eval at undefined, <anonymous>:3:54)\n    at eval (eval at undefined, <anonymous>:3:70)
2016-06-28T10:16:54Z [5109] WARNING catch error in historian: ArangoError: internal error\n    at Error (native)\n    at Object.exports.historian (/home/bart/test2/arangodb/js/server/modules/@arangodb/statistics.js:501:12)\n    at eval (eval at undefined, <anonymous>:3:54)\n    at eval (eval at undefined, <anonymous>:3:70)


The web interface doesn't show any Coordinators or DBServers in the dashboard (Coordinators: 0, DBServers: 0, RAM: 3.3%, Connections: 7)
and when I go to the "Nodes" section, the Overview tab is empty.


Bart

Kaveh Vahedipour

unread,
Jun 28, 2016, 8:12:01 AM6/28/16
to aran...@googlegroups.com
Could you please try something else for me. This is just that I know, if we are just messing up the startup. We’ll know more from there hopefully:

Go into the directory, where you cloned arangodb:

cd arangodb
./scripts/startLocalCluster 3 2 1

The log and databases end up in cluster. Could you please share those logs with me?

Kind regards,
Kaveh
signature.asc

Bart DS

unread,
Jun 28, 2016, 9:32:23 AM6/28/16
to ArangoDB
Hi Kaveh,

With this script, the cluster is starting fine.
No errors or warnings and Coordinator and DBServers are visible in the web interface.

So I did a new test:
- cleaned up all my databases again
- started 3 agents, 2 dbservers and 1 coordinator, but this time with different names and different ports

Result: no warnings or errors anymore on the console.
However, the web interface still shows 0 Coordinators and 0 DBServers and the Nodes tab is also still empty

I tried to attach the logs from the script, but I keep getting an "upload error" from google

Bart

Kaveh Vahedipour

unread,
Jun 28, 2016, 10:03:14 AM6/28/16
to aran...@googlegroups.com
Hi Bart,

> With this script, the cluster is starting fine.
> No errors or warnings and Coordinator and DBServers are visible in the web interface.
>

We’re fine if you have no warnings and no errors. In that case I don’t really need the output. So don’t worry about the logs.

> So I did a new test:
> - cleaned up all my databases again
> - started 3 agents, 2 dbservers and 1 coordinator, but this time with different names and different ports
>
> Result: no warnings or errors anymore on the console.
> However, the web interface still shows 0 Coordinators and 0 DBServers and the Nodes tab is also still empty

This is easily explained. If you look inside the the script you will appreciate the startup parameters in the agency regarding the supervision in the agency, which serves as a sanity checking and in context of a cluster management system like mesosphere, cluster resizing, shard moving, failover etc facility for a running cluster. This facility populates the information behind this page. If it is not running (default), this page remains empty. There will be a fix for this bug in 3.0.1.

Bart, again, we greatly appreciate the time you invested. We don’t take this kind of involvement for granted.

Kind regards,
Kaveh
signature.asc

Bart DS

unread,
Jun 28, 2016, 10:11:49 AM6/28/16
to ArangoDB
Ok, so if I understand correctly I shouldn't worry about the missing data in the web interface for now as a fix for it will be available in 3.0.1?
Is that correct?

regards,

Bart

Kaveh Vahedipour

unread,
Jun 28, 2016, 10:15:02 AM6/28/16
to aran...@googlegroups.com
yes, exactly.

cheers,
kaveh
signature.asc

Bart DS

unread,
Jun 28, 2016, 10:30:16 AM6/28/16
to ArangoDB


On Tuesday, June 28, 2016 at 4:15:02 PM UTC+2, Kaveh Vahedipour wrote:
yes, exactly.

cheers,
kaveh


Ok, that's great!

And thank you very much for the great support, I really appreciate that!

regards,

Bart
Reply all
Reply to author
Forward
0 new messages