Error 8500: Database connection failed. (5 Node CoprHD Cluster)


Jim DeWaard

Mar 31, 2016, 2:46:59 PM
to coprHD
Hello,
I stood up a 5-node CoprHD cluster which appears to be healthy.  However, if I take down any one node, I get the following message through the portal:

Application Error

Error 8500: Database connection failed. Please check the database services and network connectivity on all nodes


The cluster was deployed via importing OVF files into vSphere.  Attached is an example of one of the "ovfenv.properties" files and the output of the logs.  Please let me know if additional information is needed to troubleshoot this issue.  It seems the Cassandra cluster is up and functional, but perhaps the portal service is not configured properly.  In any case, I have not been able to track it down yet.


If anyone has any ideas on where I should look next, please let me know.  Thanks!


Jim D.

logs.zip
ovfenv.properties

Jim DeWaard

Mar 31, 2016, 3:30:49 PM
to coprHD
Looking at this a bit further, I'm seeing the following in the authsvc.log:

com.netflix.astyanax.connectionpool.exceptions.TokenRangeOfflineException: TokenRangeOfflineException: [host=10.66.139.49(10.66.139.49):9260, latency=1(1), attempts=1]UnavailableException()

This seems to point to an issue with the replication factor. But the keyspace seems to indicate the RF is 5:

CREATE KEYSPACE "StorageOS" WITH replication = {
  'class': 'SimpleStrategy',
  'replication_factor': '5'
};

I'm not well versed in Cassandra, so maybe this is expected behavior.
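For context on why I'd expect RF=5 to tolerate a single failure: assuming the CoprHD services read and write at consistency level QUORUM (an assumption on my part), a keyspace needs floor(RF/2)+1 live replicas per operation, so RF=5 should survive two down nodes. A quick sketch of the arithmetic:

```shell
# Quorum arithmetic for a replication factor of 5 (sketch; assumes the
# client reads/writes at consistency level QUORUM).
rf=5
quorum=$(( rf / 2 + 1 ))        # live replicas required per operation
tolerated=$(( rf - quorum ))    # node failures the keyspace can absorb
echo "RF=$rf quorum=$quorum tolerated_failures=$tolerated"
```

By that math, an UnavailableException with only one node down would suggest some keyspace is effectively running with fewer than 5 replicas.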

Thanks in advance.

Rodrigo Oshiro

Apr 1, 2016, 10:30:42 AM
to coprHD
Just wondering, if you take down 2 and leave an odd number of nodes will you still get this error?

Jim DeWaard

Apr 1, 2016, 10:58:36 AM
to coprHD
I'll give it a try.  I'm trying to run a repair on each node at the moment via "/opt/storageos/nodetool repair".  The first node came online and was available in a 1+0 configuration before the remaining 4 nodes were deployed, so my theory is the RF was changed to 5 on the keyspace, but the nodes still need the repair command executed to appropriately distribute data across the new nodes.  I'm on my last node now, so I'll update the group when it's finished.  I'll also try your suggestion.
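For reference, the per-node pass I'm doing by hand could be scripted as a dry run that just prints each command (the hostnames below are placeholders, not my actual node names):

```shell
# Dry run: print the repair command to run on each node (hostnames are
# placeholders; in practice run the printed command on every node via SSH).
for node in node1 node2 node3 node4 node5; do
  echo "ssh svcuser@${node} \"sudo /opt/storageos/nodetool repair\""
done
```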

Thanks,
Jim DeWaard

Jim DeWaard

Apr 1, 2016, 8:52:01 PM
to coprHD
No luck with either attempt.  But it seems to be tied to the individual nodes.  For instance, the Portal will remain available if I take down nodes 3 and/or 5, but will go down if I take down 1, 2 or 4.  I've been scouring configuration files to find a difference between these nodes, but haven't found anything yet.  I also saw a post suggesting that I change the Cassandra consistency level to LOCAL_QUORUM, but I can't seem to find this option in any of the config files (I believe this is a client-side setting).

If I don't make any progress tonight, I may try to redeploy a 2+1 cluster to see if the same issue exists.  Let me know if you have any additional suggestions.

Thanks again.
Jim D

Rodrigo Oshiro

Apr 1, 2016, 9:59:35 PM
to coprHD
Hm, this tool here is not installed by the RPM, but it's in source control:

You ran nodetool directly, right? There are some other options in the script.

Jim DeWaard

Apr 4, 2016, 1:39:33 PM
to coprHD
Hello Rodrigo,
Before I got a chance to run this utility, I had to take all nodes in the cluster offline for an unrelated issue.  When the cluster came back online, I did some additional testing and HA is functioning properly now.  Since I originally deployed the cluster, I hadn't taken all nodes down at one time.   So perhaps there is a cluster-wide configuration which cannot be updated while nodes are online.

This is an unfortunate situation as I wasn't able to determine actual root cause.  The only change I made before halting the nodes was to modify the RF on the "GeoStorageOS" keyspace to 5, where it was previously not defined.  I don't know if this had any impact, but it's worth mentioning.

Please let me know if you have any thoughts.

Thanks,
Jim DeWaard

Jim DeWaard

Apr 14, 2016, 4:23:17 PM
to coprHD
I finally had a chance to deploy a second 3 node cluster and found the solution to this problem.  When deploying a 3 or 5 node cluster from the CoprHD OVA, it's not enough to just modify the /etc/ovfenv.properties file to configure CoprHD for high availability.  You also need to alter the GeoStorageOS keyspace.

Here's how you do it:
1.  Login to one of the CoprHD nodes via SSH as 'svcuser'.

2.  Connect to the GeoStorageOS instance via cqlsh.

sudo /opt/storageos/bin/cqlsh localhost 9260

3.  In the cqlsh shell, output the schema for the GeoStorageOS keyspace.  This will output a ton of stuff, but focus on the first 5 lines or so.  Notice the 'vdc1' property is set to 1.  With NetworkTopologyStrategy, each such property names a Cassandra datacenter, and its value is the number of replicas to keep in that datacenter.  Looking at the output, we have just 1 replica in the 'vdc1' datacenter, which does not allow for a highly available configuration.

cqlsh> DESCRIBE KEYSPACE "GeoStorageOS";

CREATE KEYSPACE "GeoStorageOS" WITH REPLICATION = {
  'class': 'NetworkTopologyStrategy',
  'vdc1': '1'
};

4.  Alter the 'GeoStorageOS' keyspace and set the 'vdc1' property to 3 or 5 (depending on your cluster size).

cqlsh> ALTER KEYSPACE "GeoStorageOS" WITH REPLICATION = {
  'class' : 'NetworkTopologyStrategy',
  'vdc1' : 3
};

5.  Replicas are automatically created on the remaining nodes after a few minutes, so no additional action is required.
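As a sanity check on the values in step 4: the per-datacenter number is the replica count kept in that datacenter, and (assuming QUORUM reads/writes, which need floor(replicas/2)+1 live copies) setting it to the full cluster size gives the expected failure tolerance:

```shell
# For a single-DC cluster of N nodes, set 'vdc1' to N; QUORUM operations
# then survive N - (N/2 + 1) node failures (integer arithmetic).
for nodes in 3 5; do
  echo "cluster of $nodes -> 'vdc1': $nodes, survives $(( nodes - (nodes / 2 + 1) )) down node(s)"
done
```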

I see this as a bug which will need to be fixed, but I wanted to get the solution documented for anyone else testing with HA configurations.  Please let me know if you have any feedback.

Thanks,
Jim D

Jim DeWaard

Apr 14, 2016, 9:12:56 PM
to coprHD
Just a quick correction.

"Looking at the output, we have just 1 replica in the 'vdc1' datacenter which does NOT allow for a highly available configuration."

John Mark Walker

Apr 14, 2016, 10:44:17 PM
to Jim DeWaard, coprHD

Thanks for looking into this - this sounds like a good topic for a blog post. Let me know if you'd like access to the coprhd blog.


Salman Riaz

Jul 28, 2016, 10:32:15 AM
to coprHD, jim.d...@gmail.com
I'm facing the same issue in a 3-node cluster configuration.

I have tried to re-install these machines but did not succeed. I have created three VMs (openSUSE 13.2) on Citrix XenServer 6.5. diagtool shows me the following output. Can you please guide me in this regard.

coprhd1:~ # /etc/diagtool -v
* Network interface: [OK]
         network_ipaddr=10.10.50.3
         network_netmask=255.255.255.248
         network_ipaddr6=
         network_rx_packets=102080
         network_tx_packets=53627
         number_of_errors=0
         network_status=RUNNING
* IP uniqueness: [OK]
         network_vip=10.10.50.2,[OK]
         network_1_ipaddr=10.10.50.3,[OK]
         network_2_ipaddr=10.10.50.4,[OK]
         network_3_ipaddr=10.10.50.5,[OK]
* Network routing: [OK]
         network_gw6=UNCONFIGURED
         network_gw=10.10.50.1,REACHABLE
* Nodes connectivity: [REACHABLE]
         vipr1=10.10.50.3,REACHABLE
         vipr2=10.10.50.4,REACHABLE
         vipr3=10.10.50.5,REACHABLE
* Network VIP: [IPV4_ONLY, REACHABLE]
         ipv4_vip=10.10.50.2
         ipv4_vip_status=REACHABLE
* VDC Status: [REACHABLE]
* Peer synchronization: [No peer exists]
* IP subnets: [SAME]
         network_1_ipaddr=10.10.50.3, subnet is 10.10.50.0
         network_2_ipaddr=10.10.50.4, subnet is 10.10.50.0
         network_3_ipaddr=10.10.50.5, subnet is 10.10.50.0
         gateway=10.10.50.1, subnet is 10.10.50.0
* Db connection: [UNREACHABLE]
* ZK connection: [OK]
* Firewall: [CONFIGURED, ACTIVE]
* DNS: [OK]
         network_nameserver=8.8.8.8 [OK]
         network_nameserver=8.8.4.4. [OK]
* NTP: [CONFIGURED, DEGRADED]
         network_ntpserver=0.opensuse.pool.ntp.org [OK]
         network_ntpserver=1.opensuse.pool.ntp.org [OK]
         network_ntpserver=3.opensuse.pool.ntp.org [OK]
         network_ntpserver=2.opensuse.pool.ntp.org [CONFIGURED, UNREACHABLE]
* EMC upgrade repository: [OK]
         system_update_repo=https://colu.emc.com/soap/rpc
* connectEMC: [OK]
         FTPS server=corpusfep3.emc.com, REACHABLE
         SMTP server=NOT SPECIFIED
/etc/diagtool: line 489: let: avail_percentage=100 - : syntax error: operand expected (error token is "- ")
/etc/diagtool: line 491: [: -ge: unary operator expected
/etc/diagtool: line 494: [: -le: unary operator expected
/etc/diagtool: line 897: 10*/100: syntax error: operand expected (error token is "/100")
* Memory usage: [OK]
         total=11781M
         used=3079M
         free=8702M
         shared=1M
         buffers=55M
         cached=851M
* CPU usage: [OK]
         cpu_util_percentage=2.4%
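Side note on the four "/etc/diagtool: line ..." errors above: they look like a bug in the diagtool script itself, where an arithmetic operand (a disk-usage figure, judging by the variable name) came back empty. A defensive sketch of the pattern that would avoid it (this is an assumption about the script's internals, not its actual code):

```shell
# Sketch: guard shell arithmetic against an empty operand (assumption:
# diagtool computed avail_percentage from a field that was empty).
used_percentage=""                       # empty, as the errors suggest
used_percentage=${used_percentage:-0}    # default before doing arithmetic
avail_percentage=$(( 100 - used_percentage ))
echo "avail_percentage=$avail_percentage"
```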


One more thing: the "diagtool -v" output shows everything is okay, but when I run "diagtool -r" I get the following output on each node:


coprhd1:~ # /etc/diagtool -r
* Resource allocation: [OK]
         Resources details for vipr1(localhost) is:
         memory size: 11781M
         disk size: 125110568M
         processor count: 4
         total cpu frequency: 9178.46MHz

         vipr2 is down


         vipr3 is down

coprhd2:~ # /etc/diagtool -r
* Resource allocation: [OK]
         Resources details for vipr2(localhost) is:
         memory size: 11781M
         disk size: 125110560M
         processor count: 4
         total cpu frequency: 9179.48MHz

         vipr1 is down


         vipr3 is down

coprhd3:~ # /etc/diagtool -r
* Resource allocation: [OK]
         Resources details for vipr3(localhost) is:
         memory size: 11781M
         disk size: 125110576M
         processor count: 4
         total cpu frequency: 9179.1MHz

         vipr1 is down


         vipr2 is down


I am facing this error in apisvc.log:

2016-07-28 19:07:05,099 [main]  WARN  HostSupplierImpl.java (line 73) hostsupplier is empty. May be dbsvc hasn't started yet. waiting for 10000 msec
2016-07-28 19:07:15,100 [main]  WARN  HostSupplierImpl.java (line 107) no dbsvc instance running. Coordinator exception message: The coordinator cannot locate any service with path /sites/59597760-5493-11e6-8828-cbd944af77a6/service/dbsvc/3.5
2016-07-28 19:07:15,101 [main]  WARN  HostSupplierImpl.java (line 73) hostsupplier is empty. May be dbsvc hasn't started yet. waiting for 10000 msec


And receiving this in dbsvc.log:
2016-07-28 19:01:46,906 [main]  INFO  SchemaUtil.java (line 315) try scan and setup db ...
2016-07-28 19:01:46,906 [main]  INFO  SchemaUtil.java (line 335) keyspace exist already
2016-07-28 19:01:46,923 [main]  INFO  SchemaUtil.java (line 554) Current strategyOptions={vdc1=1}
2016-07-28 19:01:46,926 [main]  INFO  DrUtil.java (line 492) Cassandra DC Name is vdc1
2016-07-28 19:01:46,926 [main]  INFO  DrUtil.java (line 492) Cassandra DC Name is vdc1
2016-07-28 19:01:46,926 [main]  INFO  DrUtil.java (line 492) Cassandra DC Name is vdc1
2016-07-28 19:01:46,929 [main]  INFO  SchemaUtil.java (line 348) Current db schema version 3.5
2016-07-28 19:01:46,950 [main]  INFO  SchemaUtil.java (line 351) scan and setup db schema succeed
2016-07-28 19:01:46,950 [main]  INFO  StartupMode.java (line 116) DB schema validated
2016-07-28 19:01:46,956 [main]  INFO  DbServiceStatusChecker.java (line 132) Waiting for all cluster nodes to become state: joined : Timed Out

And the page is redirected to https://VIP/maintenance?targetUrl=%2F, which returns this XML error:
<error>
<code>6503</code>
<description>
Unable to connect to the service. The service is unavailable, try again later.
</description>
<details>
The service is currently unavailable because a connection failed to a core component. Please contact an administrator or try again later.
</details>
<retryable>true</retryable>
</error>



Kindly guide me in this regard.