Node in Rehab, DNS Lookup Bug?

Johannes Rudolph

Jun 23, 2019, 5:13:06 AM
to RavenDB - 2nd generation document database
We moved our databases to a new VM using dump export/import. The new VM has a fresh installation of RavenDB and a new IP address, but was assigned the same DNS name. Now our single-node cluster appears to be in a wonky state.

The Cluster Observer Log has these entries:

2019-06-18T07:47:49.7093395Z,meshcloud-dev.kraken-api,3008,Node A is currently not responding (with status: None) and moved to rehab
2019-06-18T07:47:35.6646570Z,meshcloud-dev.kraken-worker,2980,Node A is currently not responding (with status: None) and moved to rehab

And the Studio shows this for all our databases on that node: 

[Screenshot: Screenshot 2019-06-23 at 11.02.51.png]

However, our application appears to process reads/writes just fine against the DBs. 

I can visit https://rvn-0.dev.meshcloud.io/info/tcp?tag=Supervisor just fine using my browser + client cert. So it appears that the DBs are online and serve read/write requests from clients. Subscriptions, however, claim to have no supervisor node and appear not to be running.

The admin log lists the error below for the node. Could there be some sort of DNS lookup bug? I'm asking because RavenDB appears to still be contacting itself using a cached, stale DNS record pointing to the old IP address, which would explain this error message. The DNS record is an A record and should have had a max TTL of 5 minutes. The record was changed more than 48 hours ago, so the change should long since have propagated through all caches.

```text
2019-06-23T08:55:54.7512197Z, 22, Information, Heartbeats supervisor from A to A in term 4, Raven.Server.ServerWide.Maintenance.ClusterMaintenanceSupervisor+ClusterNode, Exception was thrown while collecting info from A, EXCEPTION: Raven.Client.Exceptions.RavenException: An exception occurred while contacting https://rvn-0.dev.meshcloud.io/info/tcp?tag=Supervisor.
System.Net.Http.HttpRequestException: Connection refused ---> System.Net.Sockets.SocketException: Connection refused
   at System.Net.Http.ConnectHelper.ConnectAsync(String host, Int32 port, CancellationToken cancellationToken)
   --- End of inner exception stack trace ---
   at System.Net.Http.ConnectHelper.ConnectAsync(String host, Int32 port, CancellationToken cancellationToken)
   at System.Threading.Tasks.ValueTask`1.get_Result()
   at System.Net.Http.HttpConnectionPool.CreateConnectionAsync(HttpRequestMessage request, CancellationToken cancellationToken)
   at System.Threading.Tasks.ValueTask`1.get_Result()
   at System.Net.Http.HttpConnectionPool.WaitForCreatedConnectionAsync(ValueTask`1 creationTask)
   at System.Threading.Tasks.ValueTask`1.get_Result()
   at System.Net.Http.HttpConnectionPool.SendWithRetryAsync(HttpRequestMessage request, Boolean doRequestAuth, CancellationToken cancellationToken)
   at System.Net.Http.RedirectHandler.SendAsync(HttpRequestMessage request, CancellationToken cancellationToken)
   at System.Net.Http.DecompressionHandler.SendAsync(HttpRequestMessage request, CancellationToken cancellationToken)
   at System.Net.Http.HttpClient.FinishSendAsyncUnbuffered(Task`1 sendTask, HttpRequestMessage request, CancellationTokenSource cts, Boolean disposeCts)
   at Raven.Client.Http.RequestExecutor.ExecuteAsync[TResult](ServerNode chosenNode, Nullable`1 nodeIndex, JsonOperationContext context, RavenCommand`1 command, Boolean shouldRetry, SessionInfo sessionInfo, CancellationToken token) in C:\Builds\RavenDB-Stable-4.2\42009\src\Raven.Client\Http\RequestExecutor.cs:line 763.
The server at https://rvn-0.dev.meshcloud.io/info/tcp?tag=Supervisor responded with status code: ServiceUnavailable. ---> System.Net.Http.HttpRequestException: Connection refused ---> System.Net.Sockets.SocketException: Connection refused
   at System.Net.Http.ConnectHelper.ConnectAsync(String host, Int32 port, CancellationToken cancellationToken)
   --- End of inner exception stack trace ---
   at System.Net.Http.ConnectHelper.ConnectAsync(String host, Int32 port, CancellationToken cancellationToken)
   at System.Threading.Tasks.ValueTask`1.get_Result()
   at System.Net.Http.HttpConnectionPool.CreateConnectionAsync(HttpRequestMessage request, CancellationToken cancellationToken)
   at System.Threading.Tasks.ValueTask`1.get_Result()
   at System.Net.Http.HttpConnectionPool.WaitForCreatedConnectionAsync(ValueTask`1 creationTask)
   at System.Threading.Tasks.ValueTask`1.get_Result()
   at System.Net.Http.HttpConnectionPool.SendWithRetryAsync(HttpRequestMessage request, Boolean doRequestAuth, CancellationToken cancellationToken)
   at System.Net.Http.RedirectHandler.SendAsync(HttpRequestMessage request, CancellationToken cancellationToken)
   at System.Net.Http.DecompressionHandler.SendAsync(HttpRequestMessage request, CancellationToken cancellationToken)
   at System.Net.Http.HttpClient.FinishSendAsyncUnbuffered(Task`1 sendTask, HttpRequestMessage request, CancellationTokenSource cts, Boolean disposeCts)
   at Raven.Client.Http.RequestExecutor.ExecuteAsync[TResult](ServerNode chosenNode, Nullable`1 nodeIndex, JsonOperationContext context, RavenCommand`1 command, Boolean shouldRetry, SessionInfo sessionInfo, CancellationToken token) in C:\Builds\RavenDB-Stable-4.2\42009\src\Raven.Client\Http\RequestExecutor.cs:line 763
   --- End of inner exception stack trace ---
   at Raven.Client.Http.RequestExecutor.ThrowFailedToContactAllNodes[TResult](RavenCommand`1 command, HttpRequestMessage request) in C:\Builds\RavenDB-Stable-4.2\42009\src\Raven.Client\Http\RequestExecutor.cs:line 875
   at Raven.Client.Http.RequestExecutor.ExecuteAsync[TResult](ServerNode chosenNode, Nullable`1 nodeIndex, JsonOperationContext context, RavenCommand`1 command, Boolean shouldRetry, SessionInfo sessionInfo, CancellationToken token) in C:\Builds\RavenDB-Stable-4.2\42009\src\Raven.Client\Http\RequestExecutor.cs:line 784
   at Raven.Server.Utils.ReplicationUtils.GetTcpInfoAsync(String url, String databaseName, String tag, X509Certificate2 certificate, CancellationToken token) in C:\Builds\RavenDB-Stable-4.2\42009\src\Raven.Server\Utils\ReplicationUtils.cs:line 29
   at Raven.Client.Util.AsyncHelpers.RunSync[T](Func`1 task)
   at Raven.Server.Utils.ReplicationUtils.GetTcpInfo(String url, String databaseName, String tag, X509Certificate2 certificate, CancellationToken token) in C:\Builds\RavenDB-Stable-4.2\42009\src\Raven.Server\Utils\ReplicationUtils.cs:line 18
   at Raven.Server.ServerWide.Maintenance.ClusterMaintenanceSupervisor.ClusterNode.ListenToMaintenanceWorker() in C:\Builds\RavenDB-Stable-4.2\42009\src\Raven.Server\ServerWide\Maintenance\ClusterMaintenanceSupervisor.cs:line 171
```
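
One way to test the stale-DNS theory is to ask the system resolver, on the machine that runs RavenDB, what the name maps to. The sketch below assumes Python is available on the host or in the container; `resolve_all` is a hypothetical helper name, not part of RavenDB. It uses `getaddrinfo`, which consults /etc/hosts as well as DNS, so it can differ from what a DNS-only tool such as nslookup reports.

```python
import socket

def resolve_all(hostname, port=443):
    """Return the sorted, de-duplicated IPs the system resolver gives for hostname."""
    infos = socket.getaddrinfo(hostname, port, type=socket.SOCK_STREAM)
    return sorted({info[4][0] for info in infos})

# Example (hostname taken from the log above; run it where RavenDB runs):
#   print(resolve_all("rvn-0.dev.meshcloud.io"))
```

If the result is neither the old nor the new IP address, a stale DNS cache is probably not the cause and a local override is worth checking.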

Johannes Rudolph

Jun 23, 2019, 8:29:20 AM
to RavenDB - 2nd generation document database

I think I found out what’s going on. Posting it here might save others from going through the same trouble. I also have some recommendations
for the RavenDB team at the end to improve operator experience and documentation.

root@rvn-0:/opt/RavenDB/Server# cat /opt/RavenDB/config/settings.json 
{
    "ServerUrl": "https://0.0.0.0:8443",
    "ServerUrl.Tcp": "tcp://0.0.0.0:38888",
    "PublicServerUrl": "https://rvn-0.dev.meshcloud.io",
    "PublicServerUrl.Tcp": "tcp://rvn-0.dev.meshcloud.io:38888",

Inside the Docker container (docker exec -it ravendb bash), the world looks like this:

# docker internal IP
root@rvn-0:/opt/RavenDB/Server# ifconfig
eth0: flags=4163<UP,BROADCAST,RUNNING,MULTICAST>  mtu 1500
        inet 172.17.0.2  netmask 255.255.0.0  broadcast 172.17.255.255

# nslookup, has split horizon DNS pointing to private IP (not the elastic IP)
root@rvn-0:/opt/RavenDB/Server# nslookup rvn-0.dev.meshcloud.io
Server:        10.0.0.2
Address:    10.0.0.2#53

Non-authoritative answer:
rvn-0.dev.meshcloud.io    canonical name = ec2-35-158-134-4.eu-central-1.compute.amazonaws.com.
Name:    ec2-35-158-134-4.eu-central-1.compute.amazonaws.com
Address: 10.0.11.236

# curl stays host-local 
root@rvn-0:/opt/RavenDB/Server# curl -kv https://rvn-0.dev.meshcloud.io/info/tcp?tag=Supervisor
*   Trying 172.17.0.2...
* TCP_NODELAY set
* connect to 172.17.0.2 port 443 failed: Connection refused

# because... hosts file
root@rvn-0:/opt/RavenDB/Server# cat /etc/hosts 
127.0.0.1    localhost
172.17.0.2    rvn-0.dev.meshcloud.io rvn-0

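The hosts file entry is there because Docker adds the container's hostname to /etc/hosts, mapped to the container's internal IP: https://docs.docker.com/v17.09/engine/userguide/networking/default_network/configure-dns/
It turns out our Ansible deployment sets the container hostname flag by default. So inside the container the public name resolves to the container itself, where RavenDB listens on 8443 rather than 443, and the connection is refused.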
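
The shadowing seen in the hosts file above can be checked mechanically. This is a minimal sketch, assuming Python is at hand; `hosts_entries_for` is a hypothetical helper, and the sample text is the /etc/hosts content shown above.

```python
def hosts_entries_for(hostname, hosts_text):
    """Return the IPs that a hosts-file text maps to hostname (case-insensitive)."""
    matches = []
    for line in hosts_text.splitlines():
        line = line.split("#", 1)[0].strip()  # drop comments and blank lines
        if not line:
            continue
        ip, *names = line.split()
        if hostname.lower() in (n.lower() for n in names):
            matches.append(ip)
    return matches

sample = """127.0.0.1    localhost
172.17.0.2    rvn-0.dev.meshcloud.io rvn-0
"""
print(hosts_entries_for("rvn-0.dev.meshcloud.io", sample))  # ['172.17.0.2']
```

On a real system the text would come from reading /etc/hosts inside the container. Any match for the public server name means lookups made through the system resolver will typically never reach DNS for that name.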

Here are a few things that I believe would make this better:

  • If the exception raised by the cluster observer contained the resolved IP as well as the hostname, it would have
    sped up RCA/troubleshooting. A hint would also help, e.g. “This may be a configuration issue. Verify network connectivity and wait for the cluster observer to retry (current observer interval: 60s)”.
  • Maybe add a screen to the Studio that makes GET /admin/cluster/maintenance-stats accessible. I only found out about this endpoint after fixing our issue.
  • I couldn’t find a manual way to “trigger” the cluster observer to check the node. I tried rvn admin-channel timer fire,
    but that didn’t create any new log messages in the cluster observer log. This might be a documentation issue, or I was not clever enough.
  • The hostname behavior in Docker appears to be a gotcha in this setup, which might be worth pointing out in the docs.
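
To illustrate the first suggestion, here is a rough sketch of a connectivity check whose report names the resolved IP next to the hostname. It is plain Python, not RavenDB code; `check_endpoint` and its report format are made up for illustration.

```python
import socket

def check_endpoint(host, port, timeout=3.0):
    """Try a TCP connect to every address host resolves to; one report line per address."""
    try:
        infos = socket.getaddrinfo(host, port, type=socket.SOCK_STREAM)
    except socket.gaierror as exc:
        return [f"{host}:{port} did not resolve: {exc}"]
    reports = []
    for family, socktype, proto, _canon, sockaddr in infos:
        ip = sockaddr[0]
        try:
            with socket.socket(family, socktype, proto) as s:
                s.settimeout(timeout)
                s.connect(sockaddr)
            reports.append(f"{host} -> {ip}:{port} OK")
        except OSError as exc:
            # Include the resolved IP so a wrong mapping is visible at a glance.
            reports.append(f"{host} -> {ip}:{port} FAILED ({exc})")
    return reports
```

Run inside the container against the public name on port 443, a check like this would have pointed at 172.17.0.2 directly instead of leaving the IP to guesswork.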

Egor Shamanaev

Jul 2, 2019, 10:58:18 AM
to rav...@googlegroups.com
Hi,
thanks for your feedback. We will consider your suggestions with our team.
