Observed issues with running a cluster in Windows Azure

Amir Khawaja

Mar 24, 2015, 12:19:05 PM

to orient-...@googlegroups.com

Greetings, everyone. Has anyone had much success running an OrientDB 2.0.5 cluster in Azure? I created a cluster in Windows Azure with 4 nodes using CentOS 7 and OrientDB Community 2.0.4 -- 2 nodes in US East2 and 2 nodes in US West. There is a Site-to-Site VPN connection between the two regions in Azure and data is flowing between machines across the network. I currently have three databases deployed and am testing them. I find that many times the synchronization between databases does not occur. For instance, if I start up the first node in US East2 and, once that comes online, fire up the second node in US West, the US West node will not come online telling me that the database is not yet online. At that point, I kill the process and then eventually the database comes online. I even have to go so far as to delete the databases in the database path folder. I do this a few times and eventually the server may start up. Sometimes, I will have three of the four nodes working and the fourth just refuses to come online.

The VM size selected for each node in the cluster is a D4 (4 cores, 28GB RAM). This should be more than sufficient to handle most loads. Surely, I must be missing something as this is not acceptable production behavior. For reference, I am pasting the hazelcast.xml and default-distributed-db-config.json files here in hopes that someone has some pointers for me.

*** hazelcast.xml ***

<?xml version="1.0" encoding="UTF-8"?>
<!-- ~ Licensed under the Apache License, Version 2.0 (the "License"); ~ you may
  not use this file except in compliance with the License. ~ You may obtain
  a copy of the License at ~ ~ http://www.apache.org/licenses/LICENSE-2.0 ~
  ~ Unless required by applicable law or agreed to in writing, software ~ distributed
  under the License is distributed on an "AS IS" BASIS, ~ WITHOUT WARRANTIES
  OR CONDITIONS OF ANY KIND, either express or implied. ~ See the License for
  the specific language governing permissions and ~ limitations under the License. -->
<hazelcast
  xsi:schemaLocation="http://www.hazelcast.com/schema/config hazelcast-config-3.0.xsd"
  xmlns="http://www.hazelcast.com/schema/config" xmlns:xsi="http://www.w3.org/2001/XMLSchema-instance">
  <group>
    <password>[password]</password>
  </group>
  <network>
    <join>
      <multicast>
        <multicast-group>235.1.1.1</multicast-group>
        <multicast-port>2434</multicast-port>
      </multicast>
      <tcp-ip enabled="true">
      </tcp-ip>
    </join>
  </network>
  <executor-service>
    <pool-size>16</pool-size>
  </executor-service>
</hazelcast>

*** default-distributed-db-config.json ***

{
  "autoDeploy": true,
  "hotAlignment": true,
  "executionMode": "synchronous",
  "readQuorum": 1,
  "writeQuorum": 3,
  "failureAvailableNodesLessQuorum": false,
  "readYourWrites": true,
  "clusters": {
    "internal": {
    },
    "index": {
    },
    "*": {
      "servers" : [ "<NEW_NODE>" ]
    }
  }
}

Thank you for any assistance you can offer.

Amir.

Colin

Mar 24, 2015, 12:25:16 PM

to orient-...@googlegroups.com

Hi Amir,

Is it consistently a problem between the same machines not seeing each other?

I'm a little confused as you say "the US West node will not come online telling me that the database is not yet online. At that point, I kill the process and then eventually the database comes online."

Do you mean you kill the database process and then restart it and then it starts communicating?

In your distributed json file, try setting "hotAlignment" to false.

Can you see on each machine when Hazelcast 'sees' all the members? Are all the members showing up?

-Colin

Orient Technologies

The Company behind OrientDB

Amir Khawaja

Mar 24, 2015, 12:32:21 PM

to orient-...@googlegroups.com

Hi Colin,

Thank you for the prompt response.

> I'm a little confused as you say "the US West node will not come online telling me that the database is not yet online. At that point, I kill the process and then eventually the database comes online."
>
> Do you mean you kill the database process and then restart it and then it starts communicating?

Yes. I kill the database process on the cluster node where OrientDB is not coming online.

> Can you see on each machine when Hazelcast 'sees' all the members? Are all the members showing up?

Yes. I can see that the servers are talking to each other, as the IP addresses of the other nodes show up in the log of each database server.

I will try setting hotAlignment to false and report my results on this thread.

Amir.

Colin

Mar 24, 2015, 12:49:25 PM

to orient-...@googlegroups.com

Hi Amir,

You might also do a ping and a traceroute between the machines and see what kind of latency you're getting, just in case it's a timeout issue with Hazelcast.
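For a rough check on consistency rather than just the average, something like the Python sketch below times repeated TCP connects to a peer. The port 2434 used in the usage note is an assumption taken from the multicast-port in the posted hazelcast.xml, and the helper names are made up for illustration; ping and traceroute remain the simpler first step.

```python
import socket
import statistics
import time


def tcp_connect_latency_ms(host, port, attempts=5, timeout=3.0):
    """Return a list of TCP connect times (milliseconds) to host:port."""
    samples = []
    for _ in range(attempts):
        start = time.perf_counter()
        # create_connection raises OSError if the peer refuses or times out
        with socket.create_connection((host, port), timeout=timeout):
            samples.append((time.perf_counter() - start) * 1000.0)
    return samples


def summarize(samples):
    """Return (mean, worst, spread) so jitter is visible, not only the average."""
    return statistics.mean(samples), max(samples), max(samples) - min(samples)
```

Calling `summarize(tcp_connect_latency_ms("<peer address>", 2434))` from each node toward every other node gives a mean, a worst case and a spread. A large spread would point at jitter even when the mean looks fine.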

-Colin

Amir Khawaja

Mar 24, 2015, 12:52:58 PM

to orient-...@googlegroups.com

Hi Colin,

I checked the latency prior to posting and between regions it is about 65ms on average. What should I set the latency to for Hazelcast?

Amir.

Colin

Mar 24, 2015, 1:20:37 PM

to orient-...@googlegroups.com

That latency should be fine so long as it's consistent.

-Colin

Amir Khawaja

Mar 24, 2015, 1:40:15 PM

to orient-...@googlegroups.com

The cluster is now online in US East2 and US West. I did the following:

- Changed the default-distributed-db-config.json to:

{
  "replication": true,
  "autoDeploy": true,
  "hotAlignment": false,
  "resyncEvery": 15,
  "clusters": {
    "internal": {
      "replication": false
    },
    "index": {
      "replication": false
    },
    "*": {
      "replication": true,
      "readQuorum": 1,
      "writeQuorum": 1,
      "failureAvailableNodesLessQuorum": false,
      "readYourWrites": true,
      "partitioning": {
        "strategy": "round-robin",
        "default": 0,
        "partitions": [
          [ "<NEW_NODE>" ]
        ]
      }
    }
  }
}

- Deleted the distributed-config.json file from each database folder and restarted each node in the cluster.

Now, when I connect to one of the nodes and try to delete a vertex, I receive the following error:

com.orientechnologies.orient.server.distributed.ODistributedException: Error on executing distributed request (id=141 from=odb02uw task=command_sql(delete vertex #42:2) userName=) against database 'vis.[]' to nodes [odb02ue2, odb02uw, odb01uw, odb01ue2]
--> com.orientechnologies.orient.server.distributed.ODistributedException: Quorum 4 not reached for request (id=141 from=odb02uw task=command_sql(delete vertex #42:2) userName=). Timeout=407ms
Servers in timeout/conflict are:
- odb02ue2: com.orientechnologies.orient.core.exception.OCommandExecutionException: Error on execution of command: sql.delete vertex #42:2
- odb01ue2: com.orientechnologies.orient.core.exception.OCommandExecutionException: Error on execution of command: sql.delete vertex #42:2
- odb01uw: com.orientechnologies.orient.core.exception.OCommandExecutionException: Error on execution of command: sql.delete vertex #42:2
Received: {
    odb02uw=com.orientechnologies.orient.core.exception.OCommandExecutionException: Error on execution of command: sql.delete vertex #42:2,
    odb01uw=com.orientechnologies.orient.core.exception.OCommandExecutionException: Error on execution of command: sql.delete vertex #42:2,
    odb02ue2=com.orientechnologies.orient.core.exception.OCommandExecutionException: Error on execution of command: sql.delete vertex #42:2,
    odb01ue2=com.orientechnologies.orient.core.exception.OCommandExecutionException: Error on execution of command: sql.delete vertex #42:2}

Why am I not able to delete a vertex?

Amir.

Colin

Mar 24, 2015, 2:02:46 PM

to orient-...@googlegroups.com

For some reason it's trying to reach a quorum of 4.

Could you paste your database's distributed-config.json file please?
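As a generic illustration of what a quorum check does (this is not OrientDB's actual code, and `quorum_reached` is a made-up helper), a request only counts as successful when at least `quorum` nodes answer without an error. The sample data mirrors the "Received" part of the error above, where all four nodes answered with an OCommandExecutionException.

```python
def quorum_reached(responses, quorum):
    """Return True when at least `quorum` nodes answered without an error.

    `responses` maps node name -> result; an Exception instance marks a failure.
    """
    successes = [node for node, result in responses.items()
                 if not isinstance(result, Exception)]
    return len(successes) >= quorum


# Stand-in for the failed delete: every node reported an execution error.
failed_delete = {
    "odb02ue2": RuntimeError("Error on execution of command"),
    "odb02uw": RuntimeError("Error on execution of command"),
    "odb01uw": RuntimeError("Error on execution of command"),
    "odb01ue2": RuntimeError("Error on execution of command"),
}

print(quorum_reached(failed_delete, quorum=4))  # False
print(quorum_reached(failed_delete, quorum=1))  # False: no node succeeded at all
```

Under that simplified model, the quorum value is only part of the picture: if the command itself fails on every node, no quorum setting lets it through, so the per-node error is worth looking at as well.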

-Colin

Amir Khawaja

Mar 24, 2015, 2:13:31 PM

to orient-...@googlegroups.com

Please find the contents of the distributed-config.json file below:

Amir.

Amir Khawaja

Mar 24, 2015, 5:21:25 PM

to orient-...@googlegroups.com

Continuing with this thread. I ended up just deleting the database and recreating it and the problem went away. Not sure why it went away. Nevertheless, I am now using the following default-distributed-db-config.json:

{
  "replication": true,
  "autoDeploy": true,
  "hotAlignment": false,
  "resyncEvery": 15,
  "clusters": {
    "internal": {
      "replication": false
    },
    "index": {
      "replication": false
    },
    "*": {
      "replication": true,
      "readQuorum": 1,
      "writeQuorum": 1,
      "failureAvailableNodesLessQuorum": false,
      "readYourWrites": true,
      "partitioning": {
        "strategy": "round-robin",
        "default": 0,
        "partitions": [
          [ "<NEW_NODE>" ]
        ]
      }
    }
  }
}

However, I noticed that now the following warning appears in the logs on each cluster node:

WARNING readQuorum setting not found for cluster=[class name]_[node name] in distributed-config.json

Why would this warning appear? Is it something that will eventually compromise data integrity? Does anyone have any ideas about this? Thanks.
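To narrow this down, a small Python sketch like the one below can report where a quorum key is physically present in a distributed config. The warning names a per-class cluster, while the config above defines readQuorum only under "*". Whether the server falls back to the "*" entry or to the top level is not confirmed anywhere in this thread, so the sketch makes no claim about that. The function name and the "person_odb01uw" entry are made up for illustration.

```python
import json


def quorum_key_locations(config, key="readQuorum"):
    """For each cluster entry, report where `key` is defined.

    Returns {cluster_name: "cluster" | "top-level" | "missing"}.
    """
    top_level = key in config
    report = {}
    for name, settings in config.get("clusters", {}).items():
        if key in settings:
            report[name] = "cluster"      # defined on the cluster entry itself
        elif top_level:
            report[name] = "top-level"    # only defined at the root of the file
        else:
            report[name] = "missing"      # not defined in either place
    return report


# Shape follows the config posted above; "person_odb01uw" is a hypothetical
# per-class cluster entry of the kind the warning mentions.
sample = json.loads("""
{
  "hotAlignment": false,
  "clusters": {
    "internal": {"replication": false},
    "*": {"readQuorum": 1, "writeQuorum": 1},
    "person_odb01uw": {"servers": ["odb01uw", "<NEW_NODE>"]}
  }
}
""")

for cluster, where in quorum_key_locations(sample).items():
    print(cluster, where)
```

To run it against a real file, load the database's own distributed-config.json with `json.load` instead of the inline sample.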

Amir.

Luca Garulli

Mar 24, 2015, 5:26:06 PM

to orient-database

Hi,

How many nodes do you have? I saw 4 nodes in the JSON file. Are there really 4, or were you playing with tests?

Lvc@

--

---
You received this message because you are subscribed to the Google Groups "OrientDB" group.
To unsubscribe from this group and stop receiving emails from it, send an email to orient-databa...@googlegroups.com.
For more options, visit https://groups.google.com/d/optout.

Amir Khawaja

Mar 24, 2015, 5:30:35 PM

to orient-database

I have 4 nodes running. Internally, we have developed 3 apps that make use of OrientDB. I have these app databases deployed to this 4 node cluster.

Amir.

NODE>"]},"studio_odb02uw":{"@type":"d","@version":0,"servers":["odb02uw","odb01ue2","odb02ue2","odb01uw","<NEW



Amir Khawaja

Mar 24, 2015, 5:33:06 PM

to orient-...@googlegroups.com

Just tried to add a vertex to one of the nodes in the cluster and the data replicated to 3 of the 4 nodes with one of the nodes logging the following error stacktrace:

2015-03-24 21:12:59:814 INFO [odb01uw]<-[odb02uw] received updated status odb02uw.vis=ONLINE [OHazelcastPlugin]
[odb01uw] error on executing distributed request -1: -
com.hazelcast.nio.serialization.HazelcastSerializationException: Problem while reading Externalizable class : com.orientechnologies.orient.server.hazelcast.OHazelcastDistributedRequest, exception: java.io.InvalidClassException: com.orientechnologies.orient.server.distributed.task.OAbstractRecordReplicatedTask; local class incompatible: stream classdesc serialVersionUID = -6130455906873421137, local class serialVersionUID = -1423725696856068566
    at com.hazelcast.nio.serialization.DefaultSerializers$Externalizer.read(DefaultSerializers.java:150)
    at com.hazelcast.nio.serialization.DefaultSerializers$Externalizer.read(DefaultSerializers.java:124)
    at com.hazelcast.nio.serialization.StreamSerializerAdapter.read(StreamSerializerAdapter.java:63)
    at com.hazelcast.nio.serialization.SerializationServiceImpl.readObject(SerializationServiceImpl.java:285)
    at com.hazelcast.nio.serialization.SerializationServiceImpl.toObject(SerializationServiceImpl.java:262)
    at com.hazelcast.spi.impl.NodeEngineImpl.toObject(NodeEngineImpl.java:186)
    at com.hazelcast.spi.impl.BasicInvocationFuture.resolveApplicationResponse(BasicInvocationFuture.java:330)
    at com.hazelcast.spi.impl.BasicInvocationFuture.resolveApplicationResponseOrThrowException(BasicInvocationFuture.java:289)
    at com.hazelcast.spi.impl.BasicInvocationFuture.get(BasicInvocationFuture.java:181)
    at com.hazelcast.spi.impl.BasicInvocationFuture.get(BasicInvocationFuture.java:160)
    at com.hazelcast.queue.impl.proxy.QueueProxySupport.invokeAndGet(QueueProxySupport.java:173)
    at com.hazelcast.queue.impl.proxy.QueueProxySupport.pollInternal(QueueProxySupport.java:120)
    at com.hazelcast.queue.impl.proxy.QueueProxyImpl.poll(QueueProxyImpl.java:87)
    at com.hazelcast.queue.impl.proxy.QueueProxyImpl.take(QueueProxyImpl.java:81)
    at com.orientechnologies.orient.server.hazelcast.ODistributedWorker.nextMessage(ODistributedWorker.java:254)
    at com.orientechnologies.orient.server.hazelcast.ODistributedWorker.readRequest(ODistributedWorker.java:218)
    at com.orientechnologies.orient.server.hazelcast.ODistributedWorker.run(ODistributedWorker.java:110)
[odb01uw] error on executing distributed request -1: -
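An InvalidClassException reporting two different serialVersionUID values is what Java serialization raises when the sending and receiving JVMs load different builds of the same class. Since the first post mentions both 2.0.5 and 2.0.4, one thing worth ruling out is that the nodes are not all running identical jars. A minimal Python sketch for that check follows; the function names are made up for illustration, and the lib directory location is an assumption that depends on where OrientDB is installed on each node.

```python
import hashlib
from pathlib import Path


def jar_fingerprints(lib_dir):
    """Return {jar file name: sha256 hex digest} for every .jar directly in lib_dir."""
    fingerprints = {}
    for jar in sorted(Path(lib_dir).glob("*.jar")):
        fingerprints[jar.name] = hashlib.sha256(jar.read_bytes()).hexdigest()
    return fingerprints


def diff_fingerprints(a, b):
    """Return the sorted jar names that are missing on one side or differ in content."""
    return sorted(name for name in set(a) | set(b) if a.get(name) != b.get(name))
```

Running `jar_fingerprints` against the lib directory on each node and feeding two of the results to `diff_fingerprints` lists any jar that differs between them; an empty list means the two nodes carry the same files.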