Hi,
Apologies in advance for a somewhat long mail due to the included logs.
I have a 3-node Cassandra cluster. When I restart one of the nodes (sudo service cassandra restart), the connection to that node gets dropped by the clients (as is expected). I have specified an ExponentialReconnectionPolicy, which kicks in after the connections drop. However, the reconnection attempts keep failing: with connection refused while the node is down, and with a timeout even after it is back up. If I restart the whole client process, connections are made successfully.
For these tests, I ran the application with a single uwsgi process and no threads, for simplicity. I've configured the IPs of all three nodes in the contact point list given to Cluster.
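For reference, the client is set up roughly like the sketch below (the contact points, credentials and reconnection policy parameters are placeholders here, but the structure is the same); the DEBUG logs quoted in this mail come simply from turning up the driver's 'cassandra' loggers:

import logging

from cassandra.cluster import Cluster
from cassandra.auth import PlainTextAuthProvider
from cassandra.policies import ExponentialReconnectionPolicy

# Driver DEBUG logging; this is where the log excerpts below come from.
logging.basicConfig(format='%(asctime)s - %(name)s - %(levelname)s - %(message)s')
logging.getLogger('cassandra').setLevel(logging.DEBUG)

# Addresses, credentials and policy parameters below are illustrative placeholders.
cluster = Cluster(
    contact_points=['10.0.10.16', '10.0.10.17', '10.0.10.18'],
    auth_provider=PlainTextAuthProvider(username='app_user', password='...'),
    reconnection_policy=ExponentialReconnectionPolicy(base_delay=2.0, max_delay=30.0),
)
session = cluster.connect()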
When I run sudo service cassandra restart on one of the nodes, after a couple of seconds I get:
2014-10-07 10:23:32,501 - cassandra.connection - DEBUG - Message pushed from server: <EventMessage(stream_id=-1, event_type=u'STATUS_CHANGE', event_args={'change_type': u'DOWN', 'address': ('10.0.10.17', 9042)}, trace_id=None)>
2014-10-07 10:23:32,501 - cassandra.cluster - WARNING - Host 10.0.10.17 has been marked down
2014-10-07 10:23:32,501 - cassandra.cluster - DEBUG - Removed connection pool for <Host: 10.0.10.17 DC1>
2014-10-07 10:23:32,502 - cassandra.cluster - DEBUG - Starting reconnector for host 10.0.10.17
2014-10-07 10:23:32,502 - cassandra.io.asyncorereactor - DEBUG - Closing connection (31103376) to 10.0.10.17
2014-10-07 10:23:32,502 - cassandra.io.asyncorereactor - DEBUG - Closed socket to 10.0.10.17
2014-10-07 10:23:32,503 - cassandra.io.asyncorereactor - DEBUG - Closing connection (31764752) to 10.0.10.17
2014-10-07 10:23:32,503 - cassandra.io.asyncorereactor - DEBUG - Closed socket to 10.0.10.17
Any idea why the connection/socket is closed twice (or at least reported twice)?
After 10 seconds, it gives connection refused, as you would expect, but I noticed that it doesn't actually log any details about the reconnection attempt (only the result):
2014-10-07 10:23:42,500 - cassandra.pool - WARNING - Error attempting to reconnect to 10.0.10.17, scheduling retry in 4.0 seconds: [Errno 111] Tried connecting to [('10.0.10.17', 9042)]. Last error: Connection refused
2014-10-07 10:23:42,501 - cassandra.pool - DEBUG - Reconnection error details
Traceback (most recent call last):
File "/usr/local/lib/python2.7/dist-packages/cassandra/pool.py", line 175, in run
conn = self.try_reconnect()
File "/usr/local/lib/python2.7/dist-packages/cassandra/pool.py", line 246, in try_reconnect
return self.connection_factory()
File "/usr/local/lib/python2.7/dist-packages/cassandra/io/asyncorereactor.py", line 162, in factory
conn = cls(*args, **kwargs)
File "/usr/local/lib/python2.7/dist-packages/cassandra/io/asyncorereactor.py", line 195, in __init__
raise socket.error(sockerr.errno, "Tried connecting to %s. Last error: %s" % ([a[4] for a in addresses], sockerr.strerror))
error: [Errno 111] Tried connecting to [('10.0.10.17', 9042)]. Last error: Connection refused
After this, I get similar connection-refused reconnection attempts (as you would expect), but the timestamps are odd:
10:23:52 (scheduling retry in 8.0 seconds)
10:24:02 (scheduling retry in 16.0 seconds)
10:24:22 (scheduling retry in 30.0 seconds)
The timestamps don't quite follow the reconnection schedule (even if you account for the 10 s timeouts); why is that?
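For comparison, the delays the policy itself generates can be listed directly; with e.g. base_delay=2.0 and max_delay=30.0 (illustrative values, which would match the 4/8/16/30 second delays above) the schedule looks like this:

from itertools import islice
from cassandra.policies import ExponentialReconnectionPolicy

# Illustrative values: new_schedule() yields base_delay * 2**attempt, capped at max_delay.
policy = ExponentialReconnectionPolicy(base_delay=2.0, max_delay=30.0)
print(list(islice(policy.new_schedule(), 6)))
# [2.0, 4.0, 8.0, 16.0, 30.0, 30.0]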
When the service finally comes back up, I see:
2014-10-07 10:24:42,500 - cassandra.connection - DEBUG - Message pushed from server: <EventMessage(stream_id=-1, event_type=u'STATUS_CHANGE', event_args={'change_type': u'DOWN', 'address': ('10.0.10.17', 9042)}, trace_id=None)>
2014-10-07 10:24:42,500 - cassandra.connection - DEBUG - Message pushed from server: <EventMessage(stream_id=-1, event_type=u'STATUS_CHANGE', event_args={'change_type': u'UP', 'address': ('10.0.10.17', 9042)}, trace_id=None)>
2014-10-07 10:24:42,500 - cassandra.connection - DEBUG - Message pushed from server: <EventMessage(stream_id=-1, event_type=u'STATUS_CHANGE', event_args={'change_type': u'UP', 'address': ('10.0.10.17', 9042)}, trace_id=None)>
2014-10-07 10:24:42,501 - cassandra.connection - DEBUG - Message pushed from server: <EventMessage(stream_id=-1, event_type=u'STATUS_CHANGE', event_args={'change_type': u'UP', 'address': ('10.0.10.17', 9042)}, trace_id=None)>
2014-10-07 10:24:42,501 - cassandra.connection - DEBUG - Message pushed from server: <EventMessage(stream_id=-1, event_type=u'STATUS_CHANGE', event_args={'change_type': u'UP', 'address': ('10.0.10.17', 9042)}, trace_id=None)>
2014-10-07 10:24:42,501 - cassandra.connection - DEBUG - Message pushed from server: <EventMessage(stream_id=-1, event_type=u'STATUS_CHANGE', event_args={'change_type': u'UP', 'address': ('10.0.10.17', 9042)}, trace_id=None)>
2014-10-07 10:24:42,502 - cassandra.connection - DEBUG - Message pushed from server: <EventMessage(stream_id=-1, event_type=u'STATUS_CHANGE', event_args={'change_type': u'UP', 'address': ('10.0.10.17', 9042)}, trace_id=None)>
Any idea why it is announced multiple times?
Anyway, 10 seconds (why 10 seconds?) after receiving that announcement, a new connection attempt begins:
2014-10-07 10:24:52,500 - cassandra.cluster - DEBUG - Waiting to acquire lock for handling up status of node 10.0.10.17
2014-10-07 10:24:52,500 - cassandra.cluster - DEBUG - Starting to handle up status of node 10.0.10.17
2014-10-07 10:24:52,500 - cassandra.cluster - INFO - Host 10.0.10.17 may be up; will prepare queries and open connection pool
2014-10-07 10:24:52,501 - cassandra.cluster - DEBUG - Now that host 10.0.10.17 is up, cancelling the reconnection handler
2014-10-07 10:24:52,501 - cassandra.cluster - DEBUG - Done preparing all queries for host 10.0.10.17,
2014-10-07 10:24:52,501 - cassandra.cluster - DEBUG - Signalling to load balancing policy that host 10.0.10.17 is up
2014-10-07 10:24:52,501 - cassandra.cluster - DEBUG - Signalling to control connection that host 10.0.10.17 is up
2014-10-07 10:24:52,501 - cassandra.cluster - DEBUG - Attempting to open new connection pools for host 10.0.10.17
2014-10-07 10:24:52,502 - cassandra.cluster - DEBUG - Waiting to acquire lock for handling up status of node 10.0.10.17
2014-10-07 10:24:52,500 - cassandra.cluster - DEBUG - Waiting to acquire lock for handling up status of node 10.0.10.17
2014-10-07 10:24:52,502 - cassandra.cluster - DEBUG - Another thread is already handling up status of node 10.0.10.17
2014-10-07 10:24:52,502 - cassandra.cluster - DEBUG - Waiting to acquire lock for handling up status of node 10.0.10.17
2014-10-07 10:24:52,502 - cassandra.cluster - DEBUG - Another thread is already handling up status of node 10.0.10.17
2014-10-07 10:24:52,502 - cassandra.cluster - DEBUG - Waiting to acquire lock for handling up status of node 10.0.10.17
2014-10-07 10:24:52,502 - cassandra.cluster - DEBUG - Another thread is already handling up status of node 10.0.10.17
2014-10-07 10:24:52,503 - cassandra.cluster - DEBUG - Waiting to acquire lock for handling up status of node 10.0.10.17
2014-10-07 10:24:52,503 - cassandra.cluster - DEBUG - Another thread is already handling up status of node 10.0.10.17
Any idea why it attempts to grab the lock so many times in such quick succession? Also, as I'm running only a single process (without threads enabled in uwsgi), I don't know what else could be holding the lock...
Anyway, the connection actually starts immediately now (this was not reported earlier when it tried to reconnect through my ReconnectionPolicy):
2014-10-07 10:24:52,503 - cassandra.pool - DEBUG - Initializing new connection pool for host 10.0.10.17
2014-10-07 10:24:52,503 - cassandra.cluster - DEBUG - Another thread is already handling up status of node 10.0.10.17
2014-10-07 10:24:52,504 - cassandra.connection - DEBUG - Not sending options message for new connection(32560016) to 10.0.10.17 because compression is disabled and a cql version was not specified
2014-10-07 10:24:52,504 - cassandra.connection - DEBUG - Sending StartupMessage on <AsyncoreConnection(32560016) 10.0.10.17:9042>
2014-10-07 10:24:52,504 - cassandra.connection - DEBUG - Sent StartupMessage on <AsyncoreConnection(32560016) 10.0.10.17:9042>
2014-10-07 10:24:52,505 - cassandra.connection - DEBUG - Got AuthenticateMessage on new connection (32560016) from 10.0.10.17: org.apache.cassandra.auth.PasswordAuthenticator
2014-10-07 10:24:52,506 - cassandra.connection - DEBUG - Sending SASL-based auth response on <AsyncoreConnection(32560016) 10.0.10.17:9042>
But after 10 seconds, the connection is closed due to a timeout:
2014-10-07 10:25:02,500 - cassandra.io.asyncorereactor - DEBUG - Closing connection (32560016) to 10.0.10.17
2014-10-07 10:25:02,501 - cassandra.io.asyncorereactor - DEBUG - Closed socket to 10.0.10.17
2014-10-07 10:25:02,501 - cassandra.connection - DEBUG - Connection to 10.0.10.17 was closed during the authentication process
2014-10-07 10:25:02,501 - cassandra.cluster - WARNING - Failed to create connection pool for new host 10.0.10.17: Traceback (most recent call last):
File "/usr/local/lib/python2.7/dist-packages/cassandra/cluster.py", line 1493, in run_add_or_renew_pool
new_pool = HostConnectionPool(host, distance, self)
File "/usr/local/lib/python2.7/dist-packages/cassandra/pool.py", line 397, in __init__
for i in range(core_conns)]
File "/usr/local/lib/python2.7/dist-packages/cassandra/cluster.py", line 632, in connection_factory
return self.connection_class.factory(address, *args, **kwargs)
File "/usr/local/lib/python2.7/dist-packages/cassandra/io/asyncorereactor.py", line 168, in factory
raise OperationTimedOut("Timed out creating connection")
OperationTimedOut: errors=Timed out creating connection, last_host=None
2014-10-07 10:25:02,502 - cassandra.cluster - WARNING - Host 10.0.10.17 has been marked down
2014-10-07 10:25:02,502 - cassandra.cluster - DEBUG - Starting reconnector for host 10.0.10.17
2014-10-07 10:25:02,502 - cassandra.cluster - DEBUG - Connection pool could not be created, not marking node 10.0.10.17 up
2014-10-07 10:25:02,503 - cassandra.cluster - DEBUG - Old host reconnector found for 10.0.10.17, cancelling
2014-10-07 10:25:02,503 - cassandra.cluster - DEBUG - Starting reconnector for host 10.0.10.17
So even though the node was announced to be available, I got a timeout.
10 seconds later (why 10 seconds, is it just hardcoded?) it actually starts to reconnect again:
2014-10-07 10:25:12,501 - cassandra.connection - DEBUG - Not sending options message for new connection(30358288) to 10.0.10.17 because compression is disabled and a cql version was not specified
2014-10-07 10:25:12,501 - cassandra.connection - DEBUG - Sending StartupMessage on <AsyncoreConnection(30358288) 10.0.10.17:9042>
2014-10-07 10:25:12,501 - cassandra.connection - DEBUG - Sent StartupMessage on <AsyncoreConnection(30358288) 10.0.10.17:9042>
2014-10-07 10:25:12,503 - cassandra.connection - DEBUG - Got AuthenticateMessage on new connection (30358288) from 10.0.10.17: org.apache.cassandra.auth.PasswordAuthenticator
2014-10-07 10:25:12,503 - cassandra.connection - DEBUG - Sending SASL-based auth response on <AsyncoreConnection(30358288) 10.0.10.17:9042>
And after 10 seconds, the connection attempt is again closed due to a timeout ("Timed out creating connection, last_host=None").
This loops until I get:
2014-10-07 11:06:32,499 - cassandra.pool - WARNING - Will not continue to retry reconnection attempts due to an exhausted retry schedule
As said, if I restart the client process (at any point after the node is back up), connections are made successfully without timeouts. All in all, connectivity itself is fine (verified also with cqlsh and even telnet).
Any pointers?
Tuukka