HectorTransportException thrown when retry expected

Lorrin

Oct 12, 2011, 4:08:22 PM

to hector-users

I'm a little confused about what the expected failover behavior is.

I have a 3 node cluster. I have not set a failover policy, so
ON_FAIL_TRY_ALL_AVAILABLE should be in effect. I have reduced
cassandraThriftSocketTimeout to 3000 ms. I would imagine that if I
drop a node out of the cluster, then client requests may be delayed by
up to 3000 ms while Hector times out on the unavailable node and then
tries a different node. Concurrent requests might all see this delay,
but subsequent requests should just use the good nodes immediately.
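
A minimal sketch of the behavior described above, written as a plain-Java model rather than as Hector code. The `Host` interface, the `tryAllAvailable` method, and the host names are hypothetical and exist only for illustration; the point is that a failure on one host should be absorbed and the next host tried, with an exception reaching the caller only if every host fails.

```java
import java.net.SocketTimeoutException;
import java.util.ArrayList;
import java.util.List;

class FailoverSketch {
    /** Hypothetical stand-in for one Cassandra node; not a Hector type. */
    interface Host {
        String name();
        String execute(String request) throws Exception;
    }

    /** Tries each host in order and returns the first successful result. */
    static String tryAllAvailable(List<Host> hosts, String request, List<String> skipped) {
        Exception last = null;
        for (Host h : hosts) {
            try {
                return h.execute(request);
            } catch (Exception e) {
                // A failure on one host is recorded and the next host is tried;
                // nothing is surfaced to the caller at this point.
                skipped.add(h.name());
                last = e;
            }
        }
        // Only when every host has failed does the caller see an exception.
        throw new IllegalStateException("all hosts failed", last);
    }

    /** Builds a fake host that either answers or times out. */
    static Host host(final String name, final boolean up) {
        return new Host() {
            public String name() { return name; }
            public String execute(String request) throws Exception {
                if (!up) {
                    throw new SocketTimeoutException("Read timed out");
                }
                return name + " handled " + request;
            }
        };
    }

    public static void main(String[] args) {
        List<Host> hosts = new ArrayList<Host>();
        hosts.add(host("dropped-node", false)); // the node taken out of the cluster
        hosts.add(host("good-node-1", true));
        hosts.add(host("good-node-2", true));
        List<String> skipped = new ArrayList<String>();
        System.out.println(tryAllAvailable(hosts, "batch_mutate", skipped));
        System.out.println("skipped: " + skipped);
    }
}
```

In this model the caller pays for the failed attempt (the timeout) but never sees the exception, which is the expectation the rest of this post compares against.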

Instead I always see an exception bubbling up to the application at
least once and sometimes they occur sporadically for longer. Here's an
example where the exception bubbled out but did not recur. I haven't
yet been able to get a case of sporadic recurrences without having
multiple readers and writers running at the same time, which makes it
more difficult to get a clear picture in the log files. Will post more
once I've got one.

logged:
[WARN ] HConnectionManager : Could not fullfill request on this
host CassandraClient<10.0.20.1:9160-43> -- 10:41:49,315
[399470108@qtp-2009579234-53]
[WARN ] HConnectionManager : Exception: -- 10:41:49,315
[399470108@qtp-2009579234-53]
me.prettyprint.hector.api.exceptions.HTimedOutException:
org.apache.thrift.transport.TTransportException:
java.net.SocketTimeoutException: Read timed out
at
me.prettyprint.cassandra.service.ExceptionsTranslatorImpl.translate(ExceptionsTranslatorImpl.java:
35)
at me.prettyprint.cassandra.service.KeyspaceServiceImpl
$1.execute(KeyspaceServiceImpl.java:97)
at me.prettyprint.cassandra.service.KeyspaceServiceImpl
$1.execute(KeyspaceServiceImpl.java:90)
at
me.prettyprint.cassandra.service.Operation.executeAndSetResult(Operation.java:
101)
at
me.prettyprint.cassandra.connection.HConnectionManager.operateWithFailover(HConnectionManager.java:
232)
at
me.prettyprint.cassandra.service.KeyspaceServiceImpl.operateWithFailover(KeyspaceServiceImpl.java:
131)
at
me.prettyprint.cassandra.service.KeyspaceServiceImpl.batchMutate(KeyspaceServiceImpl.java:
102)
at
me.prettyprint.cassandra.service.KeyspaceServiceImpl.batchMutate(KeyspaceServiceImpl.java:
108)
at me.prettyprint.cassandra.model.MutatorImpl
$3.doInKeyspace(MutatorImpl.java:222)
at me.prettyprint.cassandra.model.MutatorImpl
$3.doInKeyspace(MutatorImpl.java:219)
at
me.prettyprint.cassandra.model.KeyspaceOperationCallback.doInKeyspaceAndMeasure(KeyspaceOperationCallback.java:
20)
at
me.prettyprint.cassandra.model.ExecutingKeyspace.doExecute(ExecutingKeyspace.java:
85)
at
me.prettyprint.cassandra.model.MutatorImpl.execute(MutatorImpl.java:
219)
at
me.prettyprint.cassandra.service.template.AbstractColumnFamilyTemplate.executeBatch(AbstractColumnFamilyTemplate.java:
127)
at
me.prettyprint.cassandra.service.template.AbstractColumnFamilyTemplate.executeIfNotBatched(AbstractColumnFamilyTemplate.java:
162)
at
me.prettyprint.cassandra.service.template.ColumnFamilyTemplate.update(ColumnFamilyTemplate.java:
85)
<my app>
Caused by: org.apache.thrift.transport.TTransportException:
java.net.SocketTimeoutException: Read timed out
at
org.apache.thrift.transport.TIOStreamTransport.read(TIOStreamTransport.java:
129)
at org.apache.thrift.transport.TTransport.readAll(TTransport.java:84)
at
org.apache.thrift.transport.TFramedTransport.readFrame(TFramedTransport.java:
129)
at
org.apache.thrift.transport.TFramedTransport.read(TFramedTransport.java:
101)
at org.apache.thrift.transport.TTransport.readAll(TTransport.java:84)
at
org.apache.thrift.protocol.TBinaryProtocol.readAll(TBinaryProtocol.java:
378)
at
org.apache.thrift.protocol.TBinaryProtocol.readI32(TBinaryProtocol.java:
297)
at
org.apache.thrift.protocol.TBinaryProtocol.readMessageBegin(TBinaryProtocol.java:
204)
at org.apache.cassandra.thrift.Cassandra
$Client.recv_batch_mutate(Cassandra.java:1025)
at org.apache.cassandra.thrift.Cassandra
$Client.batch_mutate(Cassandra.java:1009)
at me.prettyprint.cassandra.service.KeyspaceServiceImpl
$1.execute(KeyspaceServiceImpl.java:95)
... 54 more
Caused by: java.net.SocketTimeoutException: Read timed out
at java.net.SocketInputStream.socketRead0(Native Method)
at java.net.SocketInputStream.read(SocketInputStream.java:129)
at
org.apache.thrift.transport.TIOStreamTransport.read(TIOStreamTransport.java:
127)
... 64 more

But then the following is thrown up to my app:
me.prettyprint.hector.api.exceptions.HectorTransportException: Unable
to open transport to 10.0.20.1(10.0.20.1):9160 ,
java.net.SocketTimeoutException: connect timed out
at
me.prettyprint.cassandra.connection.HThriftClient.open(HThriftClient.java:
129)
at
me.prettyprint.cassandra.connection.ConcurrentHClientPool.releaseClient(ConcurrentHClientPool.java:
221)
at
me.prettyprint.cassandra.connection.HConnectionManager.releaseClient(HConnectionManager.java:
347)
at
me.prettyprint.cassandra.connection.HConnectionManager.operateWithFailover(HConnectionManager.java:
289)
at
me.prettyprint.cassandra.service.KeyspaceServiceImpl.operateWithFailover(KeyspaceServiceImpl.java:
131)
at
me.prettyprint.cassandra.service.KeyspaceServiceImpl.batchMutate(KeyspaceServiceImpl.java:
102)
at
me.prettyprint.cassandra.service.KeyspaceServiceImpl.batchMutate(KeyspaceServiceImpl.java:
108)
at me.prettyprint.cassandra.model.MutatorImpl
$3.doInKeyspace(MutatorImpl.java:222)
at me.prettyprint.cassandra.model.MutatorImpl
$3.doInKeyspace(MutatorImpl.java:219)
at
me.prettyprint.cassandra.model.KeyspaceOperationCallback.doInKeyspaceAndMeasure(KeyspaceOperationCallback.java:
20)
at
me.prettyprint.cassandra.model.ExecutingKeyspace.doExecute(ExecutingKeyspace.java:
85)
at
me.prettyprint.cassandra.model.MutatorImpl.execute(MutatorImpl.java:
219)
at
me.prettyprint.cassandra.service.template.AbstractColumnFamilyTemplate.executeBatch(AbstractColumnFamilyTemplate.java:
127)
at
me.prettyprint.cassandra.service.template.AbstractColumnFamilyTemplate.executeIfNotBatched(AbstractColumnFamilyTemplate.java:
162)
at
me.prettyprint.cassandra.service.template.ColumnFamilyTemplate.update(ColumnFamilyTemplate.java:
85)
<my app>
Caused by: org.apache.thrift.transport.TTransportException:
java.net.SocketTimeoutException: connect timed out
at org.apache.thrift.transport.TSocket.open(TSocket.java:183)
at
org.apache.thrift.transport.TFramedTransport.open(TFramedTransport.java:
81)
at
me.prettyprint.cassandra.connection.HThriftClient.open(HThriftClient.java:
123)
... 54 more
Caused by: java.net.SocketTimeoutException: connect timed out
at java.net.PlainSocketImpl.socketConnect(Native Method)
at java.net.PlainSocketImpl.doConnect(PlainSocketImpl.java:351)
at java.net.PlainSocketImpl.connectToAddress(PlainSocketImpl.java:
213)
at java.net.PlainSocketImpl.connect(PlainSocketImpl.java:200)
at java.net.SocksSocketImpl.connect(SocksSocketImpl.java:432)
at java.net.Socket.connect(Socket.java:529)
at org.apache.thrift.transport.TSocket.open(TSocket.java:178)
... 56 more

Then it logs:
[ERROR] HThriftClient : Could not flush transport (to be
expected if the pool is shutting down) in close for client:
CassandraClient<10.0.20.1:9160-45> -- 10:41:56,393
[399470108@qtp-2009579234-53]
org.apache.thrift.transport.TTransportException:
java.net.SocketException: Host is down
at
org.apache.thrift.transport.TIOStreamTransport.write(TIOStreamTransport.java:
147)
at
org.apache.thrift.transport.TFramedTransport.flush(TFramedTransport.java:
156)
at
me.prettyprint.cassandra.connection.HThriftClient.close(HThriftClient.java:
85)
at
me.prettyprint.cassandra.connection.HConnectionManager.operateWithFailover(HConnectionManager.java:
245)
at
me.prettyprint.cassandra.service.KeyspaceServiceImpl.operateWithFailover(KeyspaceServiceImpl.java:
131)
at
me.prettyprint.cassandra.service.KeyspaceServiceImpl.batchMutate(KeyspaceServiceImpl.java:
102)
at
me.prettyprint.cassandra.service.KeyspaceServiceImpl.batchMutate(KeyspaceServiceImpl.java:
108)
at me.prettyprint.cassandra.model.MutatorImpl
$3.doInKeyspace(MutatorImpl.java:222)
at me.prettyprint.cassandra.model.MutatorImpl
$3.doInKeyspace(MutatorImpl.java:219)
at
me.prettyprint.cassandra.model.KeyspaceOperationCallback.doInKeyspaceAndMeasure(KeyspaceOperationCallback.java:
20)
at
me.prettyprint.cassandra.model.ExecutingKeyspace.doExecute(ExecutingKeyspace.java:
85)
at
me.prettyprint.cassandra.model.MutatorImpl.execute(MutatorImpl.java:
219)
at
me.prettyprint.cassandra.service.template.AbstractColumnFamilyTemplate.executeBatch(AbstractColumnFamilyTemplate.java:
127)
at
me.prettyprint.cassandra.service.template.AbstractColumnFamilyTemplate.executeIfNotBatched(AbstractColumnFamilyTemplate.java:
162)
at
me.prettyprint.cassandra.service.template.ColumnFamilyTemplate.update(ColumnFamilyTemplate.java:
85)
<my app>
(AbstractResourceMethodDispatchProvider.java:205)
at
com.sun.jersey.server.impl.model.method.dispatch.ResourceJavaMethodDispatcher.dispatch(ResourceJavaMethodDispatcher.java:
75)
at
com.sun.jersey.server.impl.uri.rules.HttpMethodRule.accept(HttpMethodRule.java:
288)
at
com.sun.jersey.server.impl.uri.rules.RightHandPathRule.accept(RightHandPathRule.java:
147)
at
com.sun.jersey.server.impl.uri.rules.ResourceClassRule.accept(ResourceClassRule.java:
108)
at
com.sun.jersey.server.impl.uri.rules.RightHandPathRule.accept(RightHandPathRule.java:
147)
at
com.sun.jersey.server.impl.uri.rules.RootResourceClassesRule.accept(RootResourceClassesRule.java:
84)
at
com.sun.jersey.server.impl.application.WebApplicationImpl._handleRequest(WebApplicationImpl.java:
1469)
at
com.sun.jersey.server.impl.application.WebApplicationImpl._handleRequest(WebApplicationImpl.java:
1400)
at
com.sun.jersey.server.impl.application.WebApplicationImpl.handleRequest(WebApplicationImpl.java:
1349)
at
com.sun.jersey.server.impl.application.WebApplicationImpl.handleRequest(WebApplicationImpl.java:
1339)
at
com.sun.jersey.spi.container.servlet.WebComponent.service(WebComponent.java:
416)
at
com.sun.jersey.spi.container.servlet.ServletContainer.service(ServletContainer.java:
537)
at
com.sun.jersey.spi.container.servlet.ServletContainer.doFilter(ServletContainer.java:
895)
at
com.sun.jersey.spi.container.servlet.ServletContainer.doFilter(ServletContainer.java:
843)
at
com.sun.jersey.spi.container.servlet.ServletContainer.doFilter(ServletContainer.java:
804)
at org.mortbay.jetty.servlet.ServletHandler
$CachedChain.doFilter(ServletHandler.java:1148)
at
org.mortbay.jetty.servlet.ServletHandler.handle(ServletHandler.java:
387)
at
org.mortbay.jetty.security.SecurityHandler.handle(SecurityHandler.java:
216)
at
org.mortbay.jetty.servlet.SessionHandler.handle(SessionHandler.java:
181)
at
org.mortbay.jetty.handler.ContextHandler.handle(ContextHandler.java:
765)
at org.mortbay.jetty.webapp.WebAppContext.handle(WebAppContext.java:
417)
at
org.mortbay.jetty.handler.ContextHandlerCollection.handle(ContextHandlerCollection.java:
230)
at
org.mortbay.jetty.handler.HandlerCollection.handle(HandlerCollection.java:
114)
at
org.mortbay.jetty.handler.HandlerWrapper.handle(HandlerWrapper.java:
152)
at org.mortbay.jetty.Server.handle(Server.java:324)
at org.mortbay.jetty.HttpConnection.handleRequest(HttpConnection.java:
535)
at org.mortbay.jetty.HttpConnection
$RequestHandler.content(HttpConnection.java:880)
at org.mortbay.jetty.HttpParser.parseNext(HttpParser.java:747)
at org.mortbay.jetty.HttpParser.parseAvailable(HttpParser.java:218)
at org.mortbay.jetty.HttpConnection.handle(HttpConnection.java:404)
at
org.mortbay.io.nio.SelectChannelEndPoint.run(SelectChannelEndPoint.java:
409)
at org.mortbay.thread.QueuedThreadPool
$PoolThread.run(QueuedThreadPool.java:520)
Caused by: java.net.SocketException: Host is down
at java.net.SocketOutputStream.socketWrite0(Native Method)
at java.net.SocketOutputStream.socketWrite(SocketOutputStream.java:
92)
at java.net.SocketOutputStream.write(SocketOutputStream.java:136)
at
org.apache.thrift.transport.TIOStreamTransport.write(TIOStreamTransport.java:
145)
... 54 more

and detects the host down:
[ERROR] HConnectionManager : MARK HOST AS DOWN TRIGGERED for host
10.0.20.1(10.0.20.1):9160 -- 10:41:56,395
[399470108@qtp-2009579234-53]
[ERROR] HConnectionManager : Pool state on shutdown:
<ConcurrentCassandraClientPoolByHost>:{10.0.20.1(10.0.20.1):9160};
IsActive?: true; Active: 1; Blocked: 0; Idle: 14; NumBeforeExhausted:
49 -- 10:41:56,395 [399470108@qtp-2009579234-53]
[INFO ] ConcurrentHClientPool : Shutdown triggered on
<ConcurrentCassandraClientPoolByHost>:{10.0.20.1(10.0.20.1):9160} --
10:41:56,395 [399470108@qtp-2009579234-53]

After that it logs a large number of could-not-flush-transport errors.
(Is all this normal??) I've snipped the stack traces from these.

[ERROR] HThriftClient : Could not flush transport (to be
expected if the pool is shutting down) in close for client:
CassandraClient<10.0.20.1:9160-55> -- 10:41:56,396
[399470108@qtp-2009579234-53]
org.apache.thrift.transport.TTransportException:
java.net.SocketException: Host is down
[ERROR] HThriftClient : Could not flush transport (to be
expected if the pool is shutting down) in close for client:
CassandraClient<10.0.20.1:9160-52> -- 10:41:56,398
[399470108@qtp-2009579234-53]
org.apache.thrift.transport.TTransportException:
java.net.SocketException: Host is down
[ERROR] HThriftClient : Could not flush transport (to be
expected if the pool is shutting down) in close for client:
CassandraClient<10.0.20.1:9160-42> -- 10:41:56,399
[399470108@qtp-2009579234-53]
org.apache.thrift.transport.TTransportException:
java.net.SocketException: Host is down
[ERROR] HThriftClient : Could not flush transport (to be
expected if the pool is shutting down) in close for client:
CassandraClient<10.0.20.1:9160-47> -- 10:41:56,400
[399470108@qtp-2009579234-53]
org.apache.thrift.transport.TTransportException:
java.net.SocketException: Host is down
[ERROR] HThriftClient : Could not flush transport (to be
expected if the pool is shutting down) in close for client:
CassandraClient<10.0.20.1:9160-48> -- 10:41:56,401
[399470108@qtp-2009579234-53]
org.apache.thrift.transport.TTransportException:
java.net.SocketException: Host is down
[ERROR] HThriftClient : Could not flush transport (to be
expected if the pool is shutting down) in close for client:
CassandraClient<10.0.20.1:9160-46> -- 10:41:56,402
[399470108@qtp-2009579234-53]
org.apache.thrift.transport.TTransportException:
java.net.SocketException: Host is down
[ERROR] HThriftClient : Could not flush transport (to be
expected if the pool is shutting down) in close for client:
CassandraClient<10.0.20.1:9160-51> -- 10:41:56,404
[399470108@qtp-2009579234-53]
org.apache.thrift.transport.TTransportException:
java.net.SocketException: Host is down
[ERROR] HThriftClient : Could not flush transport (to be
expected if the pool is shutting down) in close for client:
CassandraClient<10.0.20.1:9160-53> -- 10:41:56,405
[399470108@qtp-2009579234-53]
org.apache.thrift.transport.TTransportException:
java.net.SocketException: Host is down
[ERROR] HThriftClient : Could not flush transport (to be
expected if the pool is shutting down) in close for client:
CassandraClient<10.0.20.1:9160-41> -- 10:41:56,406
[399470108@qtp-2009579234-53]
org.apache.thrift.transport.TTransportException:
java.net.SocketException: Host is down
[ERROR] HThriftClient : Could not flush transport (to be
expected if the pool is shutting down) in close for client:
CassandraClient<10.0.20.1:9160-61> -- 10:41:56,407
[399470108@qtp-2009579234-53]
org.apache.thrift.transport.TTransportException:
java.net.SocketException: Host is down
[ERROR] HThriftClient : Could not flush transport (to be
expected if the pool is shutting down) in close for client:
CassandraClient<10.0.20.1:9160-44> -- 10:41:56,409
[399470108@qtp-2009579234-53]
org.apache.thrift.transport.TTransportException:
java.net.SocketException: Host is down
[ERROR] HThriftClient : Could not flush transport (to be
expected if the pool is shutting down) in close for client:
CassandraClient<10.0.20.1:9160-49> -- 10:41:56,410
[399470108@qtp-2009579234-53]
org.apache.thrift.transport.TTransportException:
java.net.SocketException: Host is down
[ERROR] HThriftClient : Could not flush transport (to be
expected if the pool is shutting down) in close for client:
CassandraClient<10.0.20.1:9160-54> -- 10:41:56,411
[399470108@qtp-2009579234-53]
org.apache.thrift.transport.TTransportException:
java.net.SocketException: Host is down
[ERROR] HThriftClient : Could not flush transport (to be
expected if the pool is shutting down) in close for client:
CassandraClient<10.0.20.1:9160-40> -- 10:41:56,412
[399470108@qtp-2009579234-53]
org.apache.thrift.transport.TTransportException:
java.net.SocketException: Host is down

...and, finally:
[INFO ] ConcurrentHClientPool : Shutdown complete on
<ConcurrentCassandraClientPoolByHost>:{10.0.20.1(10.0.20.1):9160} --
10:41:56,413 [399470108@qtp-2009579234-53]
[INFO ] CassandraHostRetryService: Host detected as down was added to
retry queue: 10.0.20.1(10.0.20.1):9160 -- 10:41:56,413
[399470108@qtp-2009579234-53]

After that, it does what I would have expected in the first place.
Here it's logging a failure but then no exception reaches the
application:

[WARN ] HConnectionManager : Could not fullfill request on this
host CassandraClient<10.0.20.1:9160-45> -- 10:41:56,413
[399470108@qtp-2009579234-53]
[WARN ] HConnectionManager : Exception: -- 10:41:56,413
[399470108@qtp-2009579234-53]
me.prettyprint.hector.api.exceptions.HectorTransportException:
org.apache.thrift.transport.TTransportException:
java.net.SocketException: Host is down
at
me.prettyprint.cassandra.service.ExceptionsTranslatorImpl.translate(ExceptionsTranslatorImpl.java:
37)
at me.prettyprint.cassandra.service.KeyspaceServiceImpl
$1.execute(KeyspaceServiceImpl.java:97)
at me.prettyprint.cassandra.service.KeyspaceServiceImpl
$1.execute(KeyspaceServiceImpl.java:90)
at
me.prettyprint.cassandra.service.Operation.executeAndSetResult(Operation.java:
101)
at
me.prettyprint.cassandra.connection.HConnectionManager.operateWithFailover(HConnectionManager.java:
232)
at
me.prettyprint.cassandra.service.KeyspaceServiceImpl.operateWithFailover(KeyspaceServiceImpl.java:
131)
at
me.prettyprint.cassandra.service.KeyspaceServiceImpl.batchMutate(KeyspaceServiceImpl.java:
102)
at
me.prettyprint.cassandra.service.KeyspaceServiceImpl.batchMutate(KeyspaceServiceImpl.java:
108)
at me.prettyprint.cassandra.model.MutatorImpl
$3.doInKeyspace(MutatorImpl.java:222)
at me.prettyprint.cassandra.model.MutatorImpl
$3.doInKeyspace(MutatorImpl.java:219)
at
me.prettyprint.cassandra.model.KeyspaceOperationCallback.doInKeyspaceAndMeasure(KeyspaceOperationCallback.java:
20)
at
me.prettyprint.cassandra.model.ExecutingKeyspace.doExecute(ExecutingKeyspace.java:
85)
at
me.prettyprint.cassandra.model.MutatorImpl.execute(MutatorImpl.java:
219)
at
me.prettyprint.cassandra.service.template.AbstractColumnFamilyTemplate.executeBatch(AbstractColumnFamilyTemplate.java:
127)
at
me.prettyprint.cassandra.service.template.AbstractColumnFamilyTemplate.executeIfNotBatched(AbstractColumnFamilyTemplate.java:
162)
at
me.prettyprint.cassandra.service.template.ColumnFamilyTemplate.update(ColumnFamilyTemplate.java:
85)
<my app>
Caused by: org.apache.thrift.transport.TTransportException:
java.net.SocketException: Host is down
at
org.apache.thrift.transport.TIOStreamTransport.write(TIOStreamTransport.java:
147)
at
org.apache.thrift.transport.TFramedTransport.flush(TFramedTransport.java:
156)
at org.apache.cassandra.thrift.Cassandra
$Client.send_batch_mutate(Cassandra.java:1020)
at org.apache.cassandra.thrift.Cassandra
$Client.batch_mutate(Cassandra.java:1008)
at me.prettyprint.cassandra.service.KeyspaceServiceImpl
$1.execute(KeyspaceServiceImpl.java:95)
... 54 more
Caused by: java.net.SocketException: Host is down
at java.net.SocketOutputStream.socketWrite0(Native Method)
at java.net.SocketOutputStream.socketWrite(SocketOutputStream.java:
92)
at java.net.SocketOutputStream.write(SocketOutputStream.java:136)
at
org.apache.thrift.transport.TIOStreamTransport.write(TIOStreamTransport.java:
145)
... 58 more
[WARN ] CassandraHostRetryService: Downed 10.0.20.1(10.0.20.1):9160
host still appears to be down: Unable to open transport to
10.0.20.1(10.0.20.1):9160 , java.net.ConnectException: Host is down --
10:41:56,413 [pool-1-thread-1]
[INFO ] HConnectionManager : Client
CassandraClient<10.0.20.1:9160-45> released to inactive or dead pool.
Closing. -- 10:41:56,415 [399470108@qtp-2009579234-53]

On a somewhat related note, has there been any discussion of toning
down the Hector logging? NodeAutoDiscoverService INFO is nice for
changes in the known set of hosts but excessive for the frequently
repeating "using existing hosts" message. The 14 HThriftClient Could
not flush transport ERRORs with stack trace are a bit much. When the
CassandraHostRetryService checks a host it not only logs its own WARN
(fine) but causes HConnectionManager to log the following ERROR with
stacktrace, which seems excessive.

[ERROR] HConnectionManager : Transport exception host to
HConnectionManager: 10.0.20.1(10.0.20.1):9160 -- 10:45:40,512 [pool-3-
thread-1]
me.prettyprint.hector.api.exceptions.HectorTransportException: Unable
to open transport to 10.0.20.1(10.0.20.1):9160 ,
java.net.SocketTimeoutException: connect timed out
at
me.prettyprint.cassandra.connection.HThriftClient.open(HThriftClient.java:
129)
at
me.prettyprint.cassandra.connection.ConcurrentHClientPool.<init>(ConcurrentHClientPool.java:
43)
at
me.prettyprint.cassandra.connection.RoundRobinBalancingPolicy.createConnection(RoundRobinBalancingPolicy.java:
68)
at
me.prettyprint.cassandra.connection.HConnectionManager.addCassandraHost(HConnectionManager.java:
103)
at
me.prettyprint.cassandra.connection.NodeAutoDiscoverService.doAddNodes(NodeAutoDiscoverService.java:
68)
at me.prettyprint.cassandra.connection.NodeAutoDiscoverService
$QueryRing.run(NodeAutoDiscoverService.java:53)
at java.util.concurrent.Executors$RunnableAdapter.call(Executors.java:
441)
at java.util.concurrent.FutureTask
$Sync.innerRunAndReset(FutureTask.java:317)
at java.util.concurrent.FutureTask.runAndReset(FutureTask.java:150)
at java.util.concurrent.ScheduledThreadPoolExecutor
$ScheduledFutureTask.access$101(ScheduledThreadPoolExecutor.java:98)
at java.util.concurrent.ScheduledThreadPoolExecutor
$ScheduledFutureTask.runPeriodic(ScheduledThreadPoolExecutor.java:180)
at java.util.concurrent.ScheduledThreadPoolExecutor
$ScheduledFutureTask.run(ScheduledThreadPoolExecutor.java:204)
at java.util.concurrent.ThreadPoolExecutor
$Worker.runTask(ThreadPoolExecutor.java:886)
at java.util.concurrent.ThreadPoolExecutor
$Worker.run(ThreadPoolExecutor.java:908)
at java.lang.Thread.run(Thread.java:680)
Caused by: org.apache.thrift.transport.TTransportException:
java.net.SocketTimeoutException: connect timed out
at org.apache.thrift.transport.TSocket.open(TSocket.java:183)
at
org.apache.thrift.transport.TFramedTransport.open(TFramedTransport.java:
81)
at
me.prettyprint.cassandra.connection.HThriftClient.open(HThriftClient.java:
123)
... 14 more
Caused by: java.net.SocketTimeoutException: connect timed out
at java.net.PlainSocketImpl.socketConnect(Native Method)
at java.net.PlainSocketImpl.doConnect(PlainSocketImpl.java:351)
at java.net.PlainSocketImpl.connectToAddress(PlainSocketImpl.java:
213)
at java.net.PlainSocketImpl.connect(PlainSocketImpl.java:200)
at java.net.SocksSocketImpl.connect(SocksSocketImpl.java:432)
at java.net.Socket.connect(Socket.java:529)
at org.apache.thrift.transport.TSocket.open(TSocket.java:178)
... 16 more
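
To make the "toning down" idea concrete, here is one possible approach, sketched in plain Java. It is not something Hector does, and the class name `QuietErrorLog`, its methods, and the choice of key are all hypothetical: the full stack trace is emitted the first time a given key (for example a host) fails, and repeats are reduced to a single line until the key is reset.

```java
import java.net.SocketException;
import java.util.concurrent.ConcurrentHashMap;
import java.util.concurrent.ConcurrentMap;

class QuietErrorLog {
    // Keys (e.g. "host:port") whose stack trace has already been printed once.
    private final ConcurrentMap<String, Boolean> seen = new ConcurrentHashMap<String, Boolean>();

    /** Full stack trace the first time a key is seen, a one-line summary afterwards. */
    String format(String key, String message, Throwable t) {
        boolean first = seen.putIfAbsent(key, Boolean.TRUE) == null;
        StringBuilder sb = new StringBuilder(message).append(": ").append(t);
        if (first) {
            for (StackTraceElement e : t.getStackTrace()) {
                sb.append(System.lineSeparator()).append("\tat ").append(e);
            }
        }
        return sb.toString();
    }

    /** Forgets a key (e.g. when the host comes back) so the next failure is logged in full. */
    void reset(String key) {
        seen.remove(key);
    }

    public static void main(String[] args) {
        QuietErrorLog log = new QuietErrorLog();
        Exception e = new SocketException("Host is down");
        // First entry carries the stack trace, the second is a single line.
        System.out.println(log.format("10.0.20.1:9160", "Could not flush transport", e));
        System.out.println(log.format("10.0.20.1:9160", "Could not flush transport", e));
    }
}
```

Keying on the host keeps one complete trace per outage for diagnosis while avoiding a trace per pooled client.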

Cheers!
-Lorrin

Lorrin

Oct 12, 2011, 4:11:23 PM

to hector-users

p.s. Hector 0.8.0-2, Cassandra 0.8.6

Nate McCall

Oct 12, 2011, 4:37:54 PM

to hector...@googlegroups.com

Some of this has been toned down a bit in 0.8.0 tip and master. If you
could grab the latest source, build it, and give this a try, that would
be great.

In general, we went for verbosity because it is difficult to tell which
layer (Thrift, coordinator node, raw socket) an error came from. This
has been cleaned up a good deal on the Cassandra side, so Hector's
logging could certainly do with a dedicated pass to tone it down.

If you want to open a GitHub issue with specifics, that would be all
the more helpful.

I'm open to other folks' suggestions here as well. It's a known wart,
basically.

Patricio Echagüe

Oct 12, 2011, 4:46:15 PM

to hector...@googlegroups.com

I fixed that issue. It seems to be the exception thrown while releasing the client. I also see a SocketTimeoutException, which is probably unrelated.

Lorrin

Oct 12, 2011, 6:43:11 PM

to hector-users

Ah, great, that patch looks relevant. Glad to hear there have been
some logging tweaks. Is there a maven repo with nightlies anywhere?
-Lorrin

Patricio Echagüe

Oct 12, 2011, 6:57:57 PM

to hector...@googlegroups.com

yes.

http://rantav.github.com/hector/build/html/index.html -> (Cloudbees maven repo with nightly snapshots)

Lorrin

Oct 12, 2011, 8:30:35 PM

to hector-users

Sweet. I think things are indeed a little better with the
0.8.0-3-SNAPSHOT!

In the log, it seems like NodeAutoDiscoverService and
CassandraHostRetryService are fighting each other. I filed
https://github.com/rantav/hector/issues/301

I do still see sporadic recurring exceptions surfaced to my app. My
app is periodically given clients for a host that Hector should already
know is down. Eventually I see the following and then everything calms
down:

HConnectionManager 2011-10-13 00:11:49,344 -- ERROR -- MARK HOST AS
DOWN TRIGGERED for host 10.183.1.254(10.183.1.254):9160
HConnectionManager 2011-10-13 00:11:49,344 -- ERROR -- Pool state on
shutdown: <ConcurrentCassandraClientPoolByHost>:
{10.183.1.254(10.183.1.254):9160}; IsActive?: true; Active: 1;
Blocked: 0; Idle: 0; NumBeforeExhausted: 49
ConcurrentHClientPool 2011-10-13 00:11:49,344 -- INFO -- Shutdown
triggered on <ConcurrentCassandraClientPoolByHost>:
{10.183.1.254(10.183.1.254):9160}
ConcurrentHClientPool 2011-10-13 00:11:49,344 -- INFO -- Shutdown
complete on <ConcurrentCassandraClientPoolByHost>:
{10.183.1.254(10.183.1.254):9160}

I'd rather get that within a couple seconds than a couple minutes. Is
there a knob I can twiddle in CassandraHostConfigurator?
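
To put a number on "a couple minutes", the gap can be computed from two timestamps in the logs quoted in this message: the first "Could not fullfill request" warning for 10.183.1.254 in the stripped excerpt further down (00:09:09,224) and the MARK HOST AS DOWN line above (00:11:49,344). This is only a sketch; the class and method names are made up, and the excerpt may not contain the very first failure, so the real delay could be longer.

```java
import java.time.Duration;
import java.time.LocalDateTime;
import java.time.format.DateTimeFormatter;

class DetectionDelay {
    // Matches the timestamp layout in the log lines, e.g. "2011-10-13 00:11:49,344".
    static final DateTimeFormatter LOG_TS =
            DateTimeFormatter.ofPattern("yyyy-MM-dd HH:mm:ss,SSS");

    /** Elapsed time between two log timestamps. */
    static Duration between(String first, String second) {
        return Duration.between(
                LocalDateTime.parse(first, LOG_TS),
                LocalDateTime.parse(second, LOG_TS));
    }

    public static void main(String[] args) {
        Duration d = between("2011-10-13 00:09:09,224", "2011-10-13 00:11:49,344");
        System.out.println(d.toMillis() + " ms"); // prints "160120 ms"
    }
}
```

That is roughly 2 minutes 40 seconds between the first logged failure on that host and the pool being shut down.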

Briefly looking at the code, here's a hypothesis:
* HostTimeoutTracker is not relevant because in my simulation the node
isn't timing out.
* The right thing happens when HConnectionManager gets a
HectorTransportException. (Wave hands re: why this eventually happens)
* Most of the time I receive java.net.NoRouteToHostException,
which ExceptionsTranslatorImpl does not (but should?) map to
HectorTransportException.
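
The last point can be illustrated without touching Hector at all. The sketch below is not ExceptionsTranslatorImpl and makes no claim about how it works; `isTransportFailure` is a hypothetical helper. It only shows that in the JDK both java.net.NoRouteToHostException and java.net.ConnectException extend java.net.SocketException, so a check on the cause chain could treat them the same way as other socket-level failures.

```java
import java.io.IOException;
import java.net.NoRouteToHostException;
import java.net.SocketException;
import java.net.SocketTimeoutException;

class TransportFailureCheck {
    /**
     * Returns true if any exception in the cause chain is a socket-level failure.
     * NoRouteToHostException and ConnectException are subclasses of SocketException,
     * so the first instanceof check covers them; SocketTimeoutException is not a
     * SocketException and is therefore checked separately.
     */
    static boolean isTransportFailure(Throwable t) {
        int depth = 0;
        // Bound the walk so an unusually deep or cyclic cause chain cannot loop forever.
        while (t != null && depth++ < 32) {
            if (t instanceof SocketException || t instanceof SocketTimeoutException) {
                return true;
            }
            t = t.getCause();
        }
        return false;
    }

    public static void main(String[] args) {
        // Hypothetical wrapper imitating "Unable to open transport ... NoRouteToHostException".
        Exception wrapped = new RuntimeException("Unable to open transport",
                new IOException(new NoRouteToHostException("Network is unreachable")));
        System.out.println(isTransportFailure(wrapped)); // prints "true"
        System.out.println(isTransportFailure(new IllegalArgumentException("bad column name"))); // prints "false"
    }
}
```

Whether a check like this belongs in the translator or in the connection manager is a separate question; the sketch only supports the claim that NoRouteToHostException is recognizable as a transport-level error.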

What do you think? Here's the log with the stacktraces stripped. I've
also stripped some of the NodeAutoDiscover lines for brevity. I'm
confused by the contradictory suspend/unsuspend messages.

HConnectionManager 2011-10-13 00:09:07,442 -- WARN -- Could not
fullfill request on this host CassandraClient<10.183.2.0:9160-247>
HConnectionManager 2011-10-13 00:09:09,223 -- INFO -- Suspend
operation status was true for CassandraHost 10.183.1.254(10.183.1.254):
9160
HConnectionManager 2011-10-13 00:09:09,223 -- INFO -- Suspend
operation status was false for CassandraHost
10.183.1.254(10.183.1.254):9160
HConnectionManager 2011-10-13 00:09:09,224 -- WARN -- Could not
fullfill request on this host CassandraClient<10.183.1.254:9160-347>
--->Exception thrown to app: Unable to open transport to
10.183.1.254(10.183.1.254):9160 , java.net.NoRouteToHostException:
Network is unreachable
HConnectionManager 2011-10-13 00:09:10,447 -- WARN -- Could not
fullfill request on this host CassandraClient<10.183.2.0:9160-30>
HConnectionManager 2011-10-13 00:09:14,491 -- WARN -- Could not
fullfill request on this host CassandraClient<10.183.1.253:9160-11>
NodeAutoDiscoverService 2011-10-13 00:09:17,594 -- INFO -- Addding
found host 10.183.1.254(10.183.1.254):9160 to pool
HConnectionManager 2011-10-13 00:09:17,594 -- ERROR -- Transport
exception host to HConnectionManager: 10.183.1.254(10.183.1.254):9160
HConnectionManager 2011-10-13 00:09:27,779 -- INFO -- UN-Suspend
operation status was true for CassandraHost 10.183.1.254(10.183.1.254):
9160
HConnectionManager 2011-10-13 00:09:30,837 -- INFO -- Suspend
operation status was true for CassandraHost 10.183.1.254(10.183.1.254):
9160
HConnectionManager 2011-10-13 00:09:30,837 -- INFO -- Suspend
operation status was false for CassandraHost
10.183.1.254(10.183.1.254):9160
HConnectionManager 2011-10-13 00:09:30,838 -- WARN -- Could not
fullfill request on this host CassandraClient<10.183.1.254:9160-348>
HConnectionManager 2011-10-13 00:09:31,817 -- INFO -- Suspend
operation status was false for CassandraHost
10.183.1.254(10.183.1.254):9160
HConnectionManager 2011-10-13 00:09:31,817 -- INFO -- Suspend
operation status was false for CassandraHost
10.183.1.254(10.183.1.254):9160
HConnectionManager 2011-10-13 00:09:31,818 -- WARN -- Could not
fullfill request on this host CassandraClient<10.183.1.254:9160-349>
--->Exception thrown to app: Unable to open transport to
10.183.1.254(10.183.1.254):9160 , java.net.NoRouteToHostException:
Network is unreachable
NodeAutoDiscoverService 2011-10-13 00:09:47,597 -- INFO -- Addding
found host 10.183.1.254(10.183.1.254):9160 to pool
HConnectionManager 2011-10-13 00:09:47,597 -- ERROR -- Transport
exception host to HConnectionManager: 10.183.1.254(10.183.1.254):9160
HConnectionManager 2011-10-13 00:09:47,780 -- INFO -- UN-Suspend
operation status was true for CassandraHost 10.183.1.254(10.183.1.254):
9160
HConnectionManager 2011-10-13 00:09:51,165 -- INFO -- Suspend
operation status was true for CassandraHost 10.183.1.254(10.183.1.254):
9160
HConnectionManager 2011-10-13 00:09:51,165 -- INFO -- Suspend
operation status was false for CassandraHost
10.183.1.254(10.183.1.254):9160
HConnectionManager 2011-10-13 00:09:51,166 -- WARN -- Could not
fullfill request on this host CassandraClient<10.183.1.254:9160-350>
HConnectionManager 2011-10-13 00:09:52,000 -- INFO -- Suspend
operation status was false for CassandraHost
10.183.1.254(10.183.1.254):9160
HConnectionManager 2011-10-13 00:09:52,000 -- INFO -- Suspend
operation status was false for CassandraHost
10.183.1.254(10.183.1.254):9160
HConnectionManager 2011-10-13 00:09:52,001 -- WARN -- Could not
fullfill request on this host CassandraClient<10.183.1.254:9160-351>
--->Exception thrown to app: Unable to open transport to
10.183.1.254(10.183.1.254):9160 , java.net.NoRouteToHostException:
Network is unreachable
HConnectionManager 2011-10-13 00:10:07,780 -- INFO -- UN-Suspend
operation status was true for CassandraHost 10.183.1.254(10.183.1.254):
9160
HConnectionManager 2011-10-13 00:10:10,935 -- INFO -- Suspend
operation status was true for CassandraHost 10.183.1.254(10.183.1.254):
9160
HConnectionManager 2011-10-13 00:10:10,935 -- INFO -- Suspend
operation status was false for CassandraHost
10.183.1.254(10.183.1.254):9160
HConnectionManager 2011-10-13 00:10:10,936 -- WARN -- Could not
fullfill request on this host CassandraClient<10.183.1.254:9160-336>
HConnectionManager 2011-10-13 00:10:12,184 -- INFO -- Suspend
operation status was false for CassandraHost
10.183.1.254(10.183.1.254):9160
HConnectionManager 2011-10-13 00:10:12,184 -- INFO -- Suspend
operation status was false for CassandraHost
10.183.1.254(10.183.1.254):9160
HConnectionManager 2011-10-13 00:10:12,185 -- WARN -- Could not
fullfill request on this host CassandraClient<10.183.1.254:9160-337>
--->Exception thrown to app: Unable to open transport to
10.183.1.254(10.183.1.254):9160 , java.net.NoRouteToHostException:
Network is unreachable
NodeAutoDiscoverService 2011-10-13 00:10:17,600 -- INFO -- Addding
found host 10.183.1.254(10.183.1.254):9160 to pool
HConnectionManager 2011-10-13 00:10:17,600 -- ERROR -- Transport
exception host to HConnectionManager: 10.183.1.254(10.183.1.254):9160
Gossiper 2011-10-13 00:10:21,135 -- INFO -- InetAddress /10.183.1.254
is now dead.
HConnectionManager 2011-10-13 00:10:27,781 -- INFO -- UN-Suspend
operation status was true for CassandraHost 10.183.1.254(10.183.1.254):
9160
HConnectionManager 2011-10-13 00:10:30,895 -- INFO -- Suspend
operation status was true for CassandraHost 10.183.1.254(10.183.1.254):
9160
HConnectionManager 2011-10-13 00:10:30,895 -- INFO -- Suspend
operation status was false for CassandraHost
10.183.1.254(10.183.1.254):9160
HConnectionManager 2011-10-13 00:10:30,896 -- WARN -- Could not
fullfill request on this host CassandraClient<10.183.1.254:9160-338>
HConnectionManager 2011-10-13 00:10:32,372 -- INFO -- Suspend
operation status was false for CassandraHost
10.183.1.254(10.183.1.254):9160
HConnectionManager 2011-10-13 00:10:32,372 -- INFO -- Suspend
operation status was false for CassandraHost
10.183.1.254(10.183.1.254):9160
HConnectionManager 2011-10-13 00:10:32,373 -- WARN -- Could not
fullfill request on this host CassandraClient<10.183.1.254:9160-339>
--->Exception thrown to app: Unable to open transport to
10.183.1.254(10.183.1.254):9160 , java.net.NoRouteToHostException:
Network is unreachable
NodeAutoDiscoverService 2011-10-13 00:10:47,604 -- INFO -- Addding
found host 10.183.1.254(10.183.1.254):9160 to pool
HConnectionManager 2011-10-13 00:10:47,604 -- ERROR -- Transport
exception host to HConnectionManager: 10.183.1.254(10.183.1.254):9160
HConnectionManager 2011-10-13 00:10:47,781 -- INFO -- UN-Suspend
operation status was true for CassandraHost 10.183.1.254(10.183.1.254):
9160
HConnectionManager 2011-10-13 00:10:50,893 -- INFO -- Suspend
operation status was true for CassandraHost 10.183.1.254(10.183.1.254):
9160
HConnectionManager 2011-10-13 00:10:50,893 -- INFO -- Suspend
operation status was false for CassandraHost
10.183.1.254(10.183.1.254):9160
HConnectionManager 2011-10-13 00:10:50,894 -- WARN -- Could not
fullfill request on this host CassandraClient<10.183.1.254:9160-340>
HConnectionManager 2011-10-13 00:10:51,809 -- INFO -- Suspend
operation status was false for CassandraHost
10.183.1.254(10.183.1.254):9160
HConnectionManager 2011-10-13 00:10:51,809 -- INFO -- Suspend
operation status was false for CassandraHost
10.183.1.254(10.183.1.254):9160
HConnectionManager 2011-10-13 00:10:51,810 -- WARN -- Could not
fullfill request on this host CassandraClient<10.183.1.254:9160-341>
--->Exception thrown to app: Unable to open transport to
10.183.1.254(10.183.1.254):9160 , java.net.NoRouteToHostException:
Network is unreachable
HConnectionManager 2011-10-13 00:11:07,782 -- INFO -- UN-Suspend
operation status was true for CassandraHost 10.183.1.254(10.183.1.254):
9160
HConnectionManager 2011-10-13 00:11:10,846 -- INFO -- Suspend
operation status was true for CassandraHost 10.183.1.254(10.183.1.254):
9160
HConnectionManager 2011-10-13 00:11:10,846 -- INFO -- Suspend
operation status was false for CassandraHost
10.183.1.254(10.183.1.254):9160
HConnectionManager 2011-10-13 00:11:10,847 -- WARN -- Could not
fullfill request on this host CassandraClient<10.183.1.254:9160-342>
HConnectionManager 2011-10-13 00:11:11,972 -- INFO -- Suspend
operation status was false for CassandraHost
10.183.1.254(10.183.1.254):9160
HConnectionManager 2011-10-13 00:11:11,973 -- INFO -- Suspend
operation status was false for CassandraHost
10.183.1.254(10.183.1.254):9160
HConnectionManager 2011-10-13 00:11:11,974 -- WARN -- Could not
fullfill request on this host CassandraClient<10.183.1.254:9160-343>
--->Exception thrown to app: Unable to open transport to
10.183.1.254(10.183.1.254):9160 , java.net.NoRouteToHostException:
Network is unreachable
NodeAutoDiscoverService 2011-10-13 00:11:17,607 -- INFO -- Addding
found host 10.183.1.254(10.183.1.254):9160 to pool
HConnectionManager 2011-10-13 00:11:17,607 -- ERROR -- Transport
exception host to HConnectionManager: 10.183.1.254(10.183.1.254):9160
HConnectionManager 2011-10-13 00:11:27,782 -- INFO -- UN-Suspend
operation status was true for CassandraHost 10.183.1.254(10.183.1.254):
9160
HConnectionManager 2011-10-13 00:11:31,314 -- INFO -- Suspend
operation status was true for CassandraHost 10.183.1.254(10.183.1.254):
9160
HConnectionManager 2011-10-13 00:11:31,315 -- INFO -- Suspend
operation status was false for CassandraHost
10.183.1.254(10.183.1.254):9160
HConnectionManager 2011-10-13 00:11:31,316 -- WARN -- Could not
fullfill request on this host CassandraClient<10.183.1.254:9160-344>
--->Exception thrown to app: Unable to open transport to
10.183.1.254(10.183.1.254):9160 , java.net.NoRouteToHostException:
Network is unreachable
HConnectionManager 2011-10-13 00:11:31,735 -- INFO -- Suspend
operation status was false for CassandraHost
10.183.1.254(10.183.1.254):9160
HConnectionManager 2011-10-13 00:11:31,735 -- INFO -- Suspend
operation status was false for CassandraHost
10.183.1.254(10.183.1.254):9160
HConnectionManager 2011-10-13 00:11:31,736 -- WARN -- Could not
fullfill request on this host CassandraClient<10.183.1.254:9160-345>
NodeAutoDiscoverService 2011-10-13 00:11:47,611 -- INFO -- Addding
found host 10.183.1.254(10.183.1.254):9160 to pool
HConnectionManager 2011-10-13 00:11:47,611 -- ERROR -- Transport
exception host to HConnectionManager: 10.183.1.254(10.183.1.254):9160
HConnectionManager 2011-10-13 00:11:47,782 -- INFO -- UN-Suspend
operation status was true for CassandraHost 10.183.1.254(10.183.1.254):
9160
HConnectionManager 2011-10-13 00:11:49,344 -- ERROR -- MARK HOST AS
DOWN TRIGGERED for host 10.183.1.254(10.183.1.254):9160
HConnectionManager 2011-10-13 00:11:49,344 -- ERROR -- Pool state on
shutdown: <ConcurrentCassandraClientPoolByHost>:
{10.183.1.254(10.183.1.254):9160}; IsActive?: true; Active: 1;
Blocked: 0; Idle: 0; NumBeforeExhausted: 49
ConcurrentHClientPool 2011-10-13 00:11:49,344 -- INFO -- Shutdown
triggered on <ConcurrentCassandraClientPoolByHost>:
{10.183.1.254(10.183.1.254):9160}
ConcurrentHClientPool 2011-10-13 00:11:49,344 -- INFO -- Shutdown
complete on <ConcurrentCassandraClientPoolByHost>:
{10.183.1.254(10.183.1.254):9160}
CassandraHostRetryService 2011-10-13 00:11:49,344 -- INFO -- Host
detected as down was added to retry queue: 10.183.1.254(10.183.1.254):
9160
CassandraHostRetryService 2011-10-13 00:11:49,345 -- WARN -- Downed
10.183.1.254(10.183.1.254):9160 host still appears to be down: Unable
to open transport to 10.183.1.254(10.183.1.254):9160 ,
java.net.NoRouteToHostException: Network is unreachable
HConnectionManager 2011-10-13 00:11:49,345 -- WARN -- Could not
fullfill request on this host null
HConnectionManager 2011-10-13 00:11:50,800 -- INFO -- Suspend
operation status was false for CassandraHost
10.183.1.254(10.183.1.254):9160
HConnectionManager 2011-10-13 00:11:50,800 -- INFO -- Suspend
operation status was false for CassandraHost
10.183.1.254(10.183.1.254):9160
HConnectionManager 2011-10-13 00:11:50,801 -- WARN -- Could not
fullfill request on this host CassandraClient<10.183.1.254:9160-346>
HConnectionManager 2011-10-13 00:11:50,802 -- INFO -- Client
CassandraClient<10.183.1.254:9160-346> released to inactive or dead
pool. Closing.
CassandraHostRetryService 2011-10-13 00:11:50,841 -- WARN -- Downed
10.183.1.254(10.183.1.254):9160 host still appears to be down: Unable
to open transport to 10.183.1.254(10.183.1.254):9160 ,
java.net.NoRouteToHostException: Network is unreachable
CassandraHostRetryService 2011-10-13 00:11:50,841 -- INFO -- Downed
Host retry status false with host: 10.183.1.254(10.183.1.254):9160
CassandraHostRetryService 2011-10-13 00:12:00,842 -- WARN -- Downed
10.183.1.254(10.183.1.254):9160 host still appears to be down: Unable
to open transport to 10.183.1.254(10.183.1.254):9160 ,
java.net.NoRouteToHostException: Network is unreachable
CassandraHostRetryService 2011-10-13 00:12:00,842 -- INFO -- Downed
Host retry status false with host: 10.183.1.254(10.183.1.254):9160
HConnectionManager 2011-10-13 00:12:07,783 -- INFO -- UN-Suspend
operation status was false for CassandraHost
10.183.1.254(10.183.1.254):9160
CassandraHostRetryService 2011-10-13 00:12:10,842 -- WARN -- Downed
10.183.1.254(10.183.1.254):9160 host still appears to be down: Unable
to open transport to 10.183.1.254(10.183.1.254):9160 ,
java.net.NoRouteToHostException: Network is unreachable
CassandraHostRetryService 2011-10-13 00:12:10,842 -- INFO -- Downed
Host retry status false with host: 10.183.1.254(10.183.1.254):9160
NodeAutoDiscoverService 2011-10-13 00:12:17,614 -- INFO -- Addding
found host 10.183.1.254(10.183.1.254):9160 to pool
HConnectionManager 2011-10-13 00:12:17,614 -- ERROR -- Transport
exception host to HConnectionManager: 10.183.1.254(10.183.1.254):9160
CassandraHostRetryService 2011-10-13 00:12:20,843 -- WARN -- Downed
10.183.1.254(10.183.1.254):9160 host still appears to be down: Unable
to open transport to 10.183.1.254(10.183.1.254):9160 ,
java.net.NoRouteToHostException: Network is unreachable
CassandraHostRetryService 2011-10-13 00:12:20,843 -- INFO -- Downed
Host retry status false with host: 10.183.1.254(10.183.1.254):9160

On Oct 12, 3:57 pm, Patricio Echagüe <patric...@gmail.com> wrote:
> yes.
>
> http://rantav.github.com/hector/build/html/index.html-> (Cloudbees maven
> repo with nightly
> snapshots<https://repository-hector-dev.forge.cloudbees.com/snapshot>
> )

Patricio Echagüe

Oct 12, 2011, 9:35:51 PM

to hector...@googlegroups.com

HTransportException is for TTransportException.

NoRouteToHostException, according to the Java API:

Signals that an error occurred while attempting to connect a socket to a remote address and port. Typically, the remote host cannot be reached because of an intervening firewall, or if an intermediate router is down.

Can you check that your servers are reachable? Or does it happen when you shut down that node?
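
One way to check reachability from the client machine, independent of Hector, is to open a plain socket to the Thrift port with a bounded connect timeout. The sketch below is only an illustration: `ReachabilityProbe` and `probe` are made-up names, and the 3000 ms timeout simply mirrors the socket timeout mentioned earlier in the thread. It uses nothing beyond `java.net` from the JDK.

```java
import java.io.IOException;
import java.net.InetSocketAddress;
import java.net.Socket;

// Hypothetical helper, not part of Hector or Thrift.
final class ReachabilityProbe {

    // Tries to connect to host:port, waiting at most timeoutMs milliseconds.
    // Returns "reachable" on success, otherwise the exception type and message.
    static String probe(String host, int port, int timeoutMs) {
        try (Socket socket = new Socket()) {
            socket.connect(new InetSocketAddress(host, port), timeoutMs);
            return "reachable";
        } catch (IOException e) {
            // NoRouteToHostException, ConnectException and SocketTimeoutException
            // are all IOException subclasses, so they all end up here.
            return e.getClass().getSimpleName() + ": " + e.getMessage();
        }
    }

    public static void main(String[] args) {
        // Host is taken from the command line; 9160 is the Thrift port seen in the logs.
        String host = args.length > 0 ? args[0] : "127.0.0.1";
        System.out.println(host + ":9160 -> " + probe(host, 9160, 3000));
    }
}
```

For a host like 10.183.1.254 in the log above, one would expect this to report the same NoRouteToHostException if the OS has no route to it, and a timeout if packets are silently dropped, though the exact behavior depends on the platform.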

Nate McCall

Oct 12, 2011, 9:49:28 PM

to hector...@googlegroups.com

Thrift should be wrapping up all the IOException derivatives into
appropriately typed TException impls. It should not be bubbling out
that far.

Can you tell me more about the network topology and the conditions
that generate this message? From GH #301, it looks like you are
shutting down the Thrift interface (the network, in this case) and
leaving gossip up?
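
To illustrate the wrapping described above, here is a minimal sketch of turning an IOException derivative into a typed exception and then checking the cause chain for a routing failure. `TransportFailure` and `WrapSketch` are hypothetical stand-ins invented for this example; they are not the real Thrift or Hector classes, and the real libraries may do this differently.

```java
import java.io.IOException;
import java.net.NoRouteToHostException;

// Hypothetical stand-in for a typed transport exception; NOT the real Thrift or Hector class.
class TransportFailure extends RuntimeException {
    TransportFailure(String message, IOException cause) {
        super(message, cause);
    }
}

final class WrapSketch {

    // Wraps any IOException derivative so callers only see the typed exception.
    static TransportFailure wrap(String host, IOException e) {
        return new TransportFailure("Unable to open transport to " + host, e);
    }

    // Walks the cause chain to see whether the underlying problem was a routing failure.
    static boolean causedByNoRoute(Throwable t) {
        for (Throwable cur = t; cur != null; cur = cur.getCause()) {
            if (cur instanceof NoRouteToHostException) {
                return true;
            }
        }
        return false;
    }

    public static void main(String[] args) {
        TransportFailure failure = wrap("10.183.1.254:9160",
                new NoRouteToHostException("Network is unreachable"));
        System.out.println(failure.getMessage() + " / no route: " + causedByNoRoute(failure));
    }
}
```

Since NoRouteToHostException extends SocketException, which extends IOException, a single wrapping point that catches IOException would cover it along with timeouts and refused connections.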

l...@lorrin.org

Oct 13, 2011, 12:34:12 AM

to hector...@googlegroups.com

Hi --

I'm trying to simulate a node failure. I'm open to suggestions of how to
best do this. (Of course, ideally Cassandra and Hector would handle
all varieties of network weirdness!) My topology is: 3 Linux nodes in a
data center. Each node has a public and a private IP. Machines should
only use the private IPs for Cassandra. Initially everyone can talk to
everyone and things work smoothly. Then I simulate a node outage with:

on each good node:
ip route add blackhole <"bad" node>

on "bad" node, for each good node:
ip route add blackhole <good node>

I would expect that to take out Thrift and Gossip. Hmm.

-Lorrin

Patricio Echagüe

Oct 13, 2011, 12:37:57 AM

to hector...@googlegroups.com

It should.

Are the 3 nodes on the same machine?

Can you just bring one down?
