Error 8500: Database connection failed. (5 Node CoprHD Cluster)


Jim DeWaard

Mar 31, 2016, 2:46:59 PM
to coprHD
Hello,
I stood up a 5-node CoprHD cluster which appears to be healthy.  However, if I take down any one node, I get the following message through the portal:

Application Error

Error 8500: Database connection failed. Please check the database services and network connectivity on all nodes


The cluster was deployed via importing OVF files into vSphere.  Attached is an example of one of the "ovfenv.properties" files and the output of the logs.  Please let me know if additional information is needed to troubleshoot this issue.  It seems the Cassandra cluster is up and functional, but perhaps the portal service is not configured properly.  In any case, I have not been able to track it down yet.


If anyone has any ideas on where I should look next, please let me know.  Thanks!


Jim D.

logs.zip
ovfenv.properties

Jim DeWaard

Mar 31, 2016, 3:30:49 PM
to coprHD
Looking at this a bit further, I'm seeing the following in the authsvc.log:

com.netflix.astyanax.connectionpool.exceptions.TokenRangeOfflineException: TokenRangeOfflineException: [host=10.66.139.49(10.66.139.49):9260, latency=1(1), attempts=1]UnavailableException()

This seems to point to an issue with the replication factor. But the keyspace seems to indicate the RF is 5:

CREATE KEYSPACE "StorageOS" WITH replication = {
  'class': 'SimpleStrategy',
  'replication_factor': '5'
};

I'm not well versed in Cassandra, so maybe this is expected behavior.
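For context on why I'd expect RF=5 to tolerate a single failure: assuming the CoprHD services read and write at consistency level QUORUM (an assumption on my part), a keyspace needs floor(RF/2)+1 live replicas per operation, so RF=5 should survive two down nodes. A quick sketch of the arithmetic:

```shell
# Quorum arithmetic for a replication factor of 5 (sketch; assumes the
# client reads/writes at consistency level QUORUM).
rf=5
quorum=$(( rf / 2 + 1 ))        # live replicas required per operation
tolerated=$(( rf - quorum ))    # node failures the keyspace can absorb
echo "RF=$rf quorum=$quorum tolerated_failures=$tolerated"
```

By that math, an UnavailableException with only one node down would suggest some keyspace is effectively running with fewer than 5 replicas.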

Thanks in advance.

Rodrigo Oshiro

Apr 1, 2016, 10:30:42 AM
to coprHD
Just wondering, if you take down 2 and leave an odd number of nodes will you still get this error?

Jim DeWaard

Apr 1, 2016, 10:58:36 AM
to coprHD
I'll give it a try.  I'm trying to run a repair on each node at the moment via "/opt/storageos/nodetool repair".  The first node came online and was available in a 1+0 configuration before the remaining 4 nodes were deployed, so my theory is the RF was changed to 5 on the keyspace, but the nodes still need the repair command executed to appropriately distribute data across the new nodes.  I'm on my last node now, so I'll update the group when it's finished.  I'll also try your suggestion.
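For reference, the per-node pass I'm doing by hand could be scripted as a dry run that just prints each command (the hostnames below are placeholders, not my actual node names):

```shell
# Dry run: print the repair command to run on each node (hostnames are
# placeholders; in practice run the printed command on every node via SSH).
for node in node1 node2 node3 node4 node5; do
  echo "ssh svcuser@${node} \"sudo /opt/storageos/nodetool repair\""
done
```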

Thanks,
Jim DeWaard

Jim DeWaard

Apr 1, 2016, 8:52:01 PM
to coprHD
No luck with either attempt.  But it seems to be tied to the individual nodes.  For instance, the Portal will remain available if I take down nodes 3 and/or 5, but will go down if I take down 1, 2 or 4.  I've been scouring configuration files to find a difference between these nodes, but haven't found anything yet.  I also saw a post suggesting that I change the Cassandra consistency level to LOCAL_QUORUM, but I can't seem to find this option in any of the config files (I believe this is a client-side setting).

If I don't make any progress tonight, I may try to redeploy a 2+1 cluster to see if the same issue exists.  Let me know if you have any additional suggestions.

Thanks again.
Jim D

Rodrigo Oshiro

Apr 1, 2016, 9:59:35 PM
to coprHD
Hm, this tool here is not installed by the RPM, but it's in source control:

You ran nodetool directly, right? There are some other options in the script.

Jim DeWaard

Apr 4, 2016, 1:39:33 PM
to coprHD
Hello Rodrigo,
Before I got a chance to run this utility, I had to take all nodes in the cluster offline for an unrelated issue.  When the cluster came back online, I did some additional testing and HA is functioning properly now.  Since I originally deployed the cluster, I hadn't taken all nodes down at one time.   So perhaps there is a cluster-wide configuration which cannot be updated while nodes are online.

This is an unfortunate situation as I wasn't able to determine actual root cause.  The only change I made before halting the nodes was to modify the RF on the "GeoStorageOS" keyspace to 5, where it was previously not defined.  I don't know if this had any impact, but it's worth mentioning.

Please let me know if you have any thoughts.

Thanks,
Jim DeWaard

Jim DeWaard

Apr 14, 2016, 4:23:17 PM
to coprHD
I finally had a chance to deploy a second 3 node cluster and found the solution to this problem.  When deploying a 3 or 5 node cluster from the CoprHD OVA, it's not enough to just modify the /etc/ovfenv.properties file to configure CoprHD for high availability.  You also need to alter the GeoStorageOS keyspace.

Here's how you do it:
1.  Login to one of the CoprHD nodes via SSH as 'svcuser'.

2.  Connect to the GeoStorageOS instance via cqlsh.

sudo /opt/storageos/bin/cqlsh localhost 9260

3.  In the cqlsh shell, output the schema for the GeoStorageOS keyspace.  This will output a ton of stuff, but focus on the first 5 lines or so.  Notice the 'vdc1' property is set to 1.  With NetworkTopologyStrategy, each such property names a Cassandra datacenter, and its value is the number of replicas to keep in that datacenter.  Looking at the output, we have just 1 replica in the 'vdc1' datacenter, which does not allow for a highly available configuration.

cqlsh> DESCRIBE KEYSPACE "GeoStorageOS";

CREATE KEYSPACE "GeoStorageOS" WITH REPLICATION = {
  'class': 'NetworkTopologyStrategy',
  'vdc1': '1'
};

4.  Alter the 'GeoStorageOS' keyspace and set the 'vdc1' property to 3 or 5 (depending on your cluster size).

cqlsh> ALTER KEYSPACE "GeoStorageOS" WITH REPLICATION = {
  'class' : 'NetworkTopologyStrategy',
  'vdc1' : 3
};

5.  Replicas are automatically created on the remaining nodes after a few minutes, so no additional action is required.
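As a sanity check on the values in step 4: the per-datacenter number is the replica count kept in that datacenter, and (assuming QUORUM reads/writes, which need floor(replicas/2)+1 live copies) setting it to the full cluster size gives the expected failure tolerance:

```shell
# For a single-DC cluster of N nodes, set 'vdc1' to N; QUORUM operations
# then survive N - (N/2 + 1) node failures (integer arithmetic).
for nodes in 3 5; do
  echo "cluster of $nodes -> 'vdc1': $nodes, survives $(( nodes - (nodes / 2 + 1) )) down node(s)"
done
```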

I see this as a bug which will need to be fixed, but I wanted to get the solution documented for anyone else testing with HA configurations.  Please let me know if you have any feedback.

Thanks,
Jim D

Jim DeWaard

Apr 14, 2016, 9:12:56 PM
to coprHD
Just a quick correction.

"Looking at the output, we have just 1 replica in the 'vdc1' datacenter which does NOT allow for a highly available configuration."

John Mark Walker

Apr 14, 2016, 10:44:17 PM
to Jim DeWaard, coprHD

Thanks for looking into this - this sounds like a good topic for a blog post. Let me know if you'd like access to the coprhd blog.


Salman Riaz

Jul 28, 2016, 10:32:15 AM
to coprHD, jim.d...@gmail.com
I'm facing the same issue in a 3-node cluster configuration.

I have tried to re-install these machines but did not succeed. I have created three VMs (openSUSE 13.2) on Citrix XenServer 6.5. diagtool shows me the following output. Can you please guide me in this regard.

coprhd1:~ # /etc/diagtool -v
* Network interface: [OK]
         network_ipaddr=10.10.50.3
         network_netmask=255.255.255.248
         network_ipaddr6=
         network_rx_packets=102080
         network_tx_packets=53627
         number_of_errors=0
         network_status=RUNNING
* IP uniqueness: [OK]
         network_vip=10.10.50.2,[OK]
         network_1_ipaddr=10.10.50.3,[OK]
         network_2_ipaddr=10.10.50.4,[OK]
         network_3_ipaddr=10.10.50.5,[OK]
* Network routing: [OK]
         network_gw6=UNCONFIGURED
         network_gw=10.10.50.1,REACHABLE
* Nodes connectivity: [REACHABLE]
         vipr1=10.10.50.3,REACHABLE
         vipr2=10.10.50.4,REACHABLE
         vipr3=10.10.50.5,REACHABLE
* Network VIP: [IPV4_ONLY, REACHABLE]
         ipv4_vip=10.10.50.2
         ipv4_vip_status=REACHABLE
* VDC Status: [REACHABLE]
* Peer synchronization: [No peer exists]
* IP subnets: [SAME]
         network_1_ipaddr=10.10.50.3, subnet is 10.10.50.0
         network_2_ipaddr=10.10.50.4, subnet is 10.10.50.0
         network_3_ipaddr=10.10.50.5, subnet is 10.10.50.0
         gateway=10.10.50.1, subnet is 10.10.50.0
* Db connection: [UNREACHABLE]
* ZK connection: [OK]
* Firewall: [CONFIGURED, ACTIVE]
* DNS: [OK]
         network_nameserver=8.8.8.8 [OK]
         network_nameserver=8.8.4.4. [OK]
* NTP: [CONFIGURED, DEGRADED]
         network_ntpserver=0.opensuse.pool.ntp.org [OK]
         network_ntpserver=1.opensuse.pool.ntp.org [OK]
         network_ntpserver=3.opensuse.pool.ntp.org [OK]
         network_ntpserver=2.opensuse.pool.ntp.org [CONFIGURED, UNREACHABLE]
* EMC upgrade repository: [OK]
         system_update_repo=https://colu.emc.com/soap/rpc
* connectEMC: [OK]
         FTPS server=corpusfep3.emc.com, REACHABLE
         SMTP server=NOT SPECIFIED
/etc/diagtool: line 489: let: avail_percentage=100 - : syntax error: operand expected (error token is "- ")
/etc/diagtool: line 491: [: -ge: unary operator expected
/etc/diagtool: line 494: [: -le: unary operator expected
/etc/diagtool: line 897: 10*/100: syntax error: operand expected (error token is "/100")
* Memory usage: [OK]
         total=11781M
         used=3079M
         free=8702M
         shared=1M
         buffers=55M
         cached=851M
* CPU usage: [OK]
         cpu_util_percentage=2.4%
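Side note on the four "/etc/diagtool: line ..." errors above: they look like a bug in the diagtool script itself, where an arithmetic operand (a disk-usage figure, judging by the variable name) came back empty. A defensive sketch of the pattern that would avoid it (this is an assumption about the script's internals, not its actual code):

```shell
# Sketch: guard shell arithmetic against an empty operand (assumption:
# diagtool computed avail_percentage from a field that was empty).
used_percentage=""                       # empty, as the errors suggest
used_percentage=${used_percentage:-0}    # default before doing arithmetic
avail_percentage=$(( 100 - used_percentage ))
echo "avail_percentage=$avail_percentage"
```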


One more thing: the "diagtool -v" output shows everything is okay, but when I run "diagtool -r" I get the following output on each node:


coprhd1:~ # /etc/diagtool -r
* Resource allocation: [OK]
         Resources details for vipr1(localhost) is:
         memory size: 11781M
         disk size: 125110568M
         processor count: 4
         total cpu frequency: 9178.46MHz

         vipr2 is down


         vipr3 is down

coprhd2:~ # /etc/diagtool -r
* Resource allocation: [OK]
         Resources details for vipr2(localhost) is:
         memory size: 11781M
         disk size: 125110560M
         processor count: 4
         total cpu frequency: 9179.48MHz

         vipr1 is down


         vipr3 is down

coprhd3:~ # /etc/diagtool -r
* Resource allocation: [OK]
         Resources details for vipr3(localhost) is:
         memory size: 11781M
         disk size: 125110576M
         processor count: 4
         total cpu frequency: 9179.1MHz

         vipr1 is down


         vipr2 is down


I am facing this error in apisvc.log:

2016-07-28 19:07:05,099 [main]  WARN  HostSupplierImpl.java (line 73) hostsupplier is empty. May be dbsvc hasn't started yet. waiting for 10000 msec
2016-07-28 19:07:15,100 [main]  WARN  HostSupplierImpl.java (line 107) no dbsvc instance running. Coordinator exception message: The coordinator cannot locate any service with path /sites/59597760-5493-11e6-8828-cbd944af77a6/service/dbsvc/3.5
2016-07-28 19:07:15,101 [main]  WARN  HostSupplierImpl.java (line 73) hostsupplier is empty. May be dbsvc hasn't started yet. waiting for 10000 msec


And receiving this in dbsvc.log:
2016-07-28 19:01:46,906 [main]  INFO  SchemaUtil.java (line 315) try scan and setup db ...
2016-07-28 19:01:46,906 [main]  INFO  SchemaUtil.java (line 335) keyspace exist already
2016-07-28 19:01:46,923 [main]  INFO  SchemaUtil.java (line 554) Current strategyOptions={vdc1=1}
2016-07-28 19:01:46,926 [main]  INFO  DrUtil.java (line 492) Cassandra DC Name is vdc1
2016-07-28 19:01:46,926 [main]  INFO  DrUtil.java (line 492) Cassandra DC Name is vdc1
2016-07-28 19:01:46,926 [main]  INFO  DrUtil.java (line 492) Cassandra DC Name is vdc1
2016-07-28 19:01:46,929 [main]  INFO  SchemaUtil.java (line 348) Current db schema version 3.5
2016-07-28 19:01:46,950 [main]  INFO  SchemaUtil.java (line 351) scan and setup db schema succeed
2016-07-28 19:01:46,950 [main]  INFO  StartupMode.java (line 116) DB schema validated
2016-07-28 19:01:46,956 [main]  INFO  DbServiceStatusChecker.java (line 132) Waiting for all cluster nodes to become state: joined : Timed Out

And the page is redirected to https://VIP/maintenance?targetUrl=%2F, which returns this XML error:
<error>
<code>6503</code>
<description>
Unable to connect to the service. The service is unavailable, try again later.
</description>
<details>
The service is currently unavailable because a connection failed to a core component. Please contact an administrator or try again later.
</details>
<retryable>true</retryable>
</error>



Kindly guide me in this regard.