I'm not sure exactly why, but it seems that the seed node assigns itself a host ID and some tokens; then, when a new node attempts to join the cluster, that same host ID is assigned to it, all of the tokens change ownership, and a java.io.FileNotFoundException is thrown when the CassandraDaemon (on the seed node, I assume) attempts to read a system-local-ka-*-Data.db file. If I use ./nodetool status to inspect the ring on the seed node, I find its daemon has gone down; running the same command on the newer node shows it holding the host ID that both nodes attempted to claim.
Elasticsearch appeared to have the same issue under one particular configuration, but I believe that was an error in how I had set things up - otherwise it seems to form a cluster just fine using the same hostnames. The gremlin-server and cassandra logs seem to capture the essence of the issue:
gremlin-server.log ---
1780 [main] INFO com.thinkaurelius.titan.graphdb.configuration.GraphDatabaseConfiguration - Set default timestamp provider MICRO
1785 [main] INFO com.thinkaurelius.titan.diskstorage.cassandra.thrift.CassandraThriftStoreManager - Closed Thrift connection pooler.
1790 [main] INFO com.thinkaurelius.titan.graphdb.configuration.GraphDatabaseConfiguration - Generated unique-instance-id=ac1e30208123-uhn-compute321
1806 [main] INFO com.thinkaurelius.titan.diskstorage.Backend - Configuring index [search]
1907 [main] INFO org.elasticsearch.plugins - [Magneto] loaded [], sites []
2463 [main] INFO com.thinkaurelius.titan.diskstorage.es.ElasticSearchIndex - Configured remote host: uhn-compute31 : 9300
2575 [main] INFO com.thinkaurelius.titan.diskstorage.es.ElasticSearchIndex - Configured remote host: uhn-compute32 : 9300
3619 [main] INFO com.thinkaurelius.titan.diskstorage.Backend - Initiated backend operations thread pool of size 24
3639 [main] WARN org.apache.tinkerpop.gremlin.server.GremlinServer - Graph [graph] configured at [conf/gremlin-server/titan-cassandra-es-server.properties] could not be instantiated and will not be available in Gremlin Server. GraphFactory message: GraphFactory could not instantiate this Graph implementation [class com.thinkaurelius.titan.core.TitanFactory]
java.lang.RuntimeException: GraphFactory could not instantiate this Graph implementation [class com.thinkaurelius.titan.core.TitanFactory]
at org.apache.tinkerpop.gremlin.structure.util.GraphFactory.open(GraphFactory.java:82)
at org.apache.tinkerpop.gremlin.structure.util.GraphFactory.open(GraphFactory.java:70)
at org.apache.tinkerpop.gremlin.structure.util.GraphFactory.open(GraphFactory.java:104)
at org.apache.tinkerpop.gremlin.server.GraphManager.lambda$new$27(GraphManager.java:50)
at org.apache.tinkerpop.gremlin.server.GraphManager$$Lambda$9/2025269734.accept(Unknown Source)
at java.util.LinkedHashMap$LinkedEntrySet.forEach(LinkedHashMap.java:663)
at org.apache.tinkerpop.gremlin.server.GraphManager.<init>(GraphManager.java:48)
at org.apache.tinkerpop.gremlin.server.util.ServerGremlinExecutor.<init>(ServerGremlinExecutor.java:94)
at org.apache.tinkerpop.gremlin.server.GremlinServer.<init>(GremlinServer.java:89)
at org.apache.tinkerpop.gremlin.server.GremlinServer.main(GremlinServer.java:301)
Caused by: java.lang.reflect.InvocationTargetException
at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method)
at sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:62)
at sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43)
at java.lang.reflect.Method.invoke(Method.java:497)
at org.apache.tinkerpop.gremlin.structure.util.GraphFactory.open(GraphFactory.java:78)
... 9 more
Caused by: com.thinkaurelius.titan.core.TitanException: Could not initialize backend
at com.thinkaurelius.titan.diskstorage.Backend.initialize(Backend.java:301)
at com.thinkaurelius.titan.graphdb.configuration.GraphDatabaseConfiguration.getBackend(GraphDatabaseConfiguration.java:1806)
at com.thinkaurelius.titan.graphdb.database.StandardTitanGraph.<init>(StandardTitanGraph.java:123)
at com.thinkaurelius.titan.core.TitanFactory.open(TitanFactory.java:94)
at com.thinkaurelius.titan.core.TitanFactory.open(TitanFactory.java:74)
... 14 more
Caused by: com.thinkaurelius.titan.diskstorage.PermanentBackendException: Permanent failure in storage backend
at com.thinkaurelius.titan.diskstorage.cassandra.thrift.CassandraThriftStoreManager.ensureColumnFamilyExists(CassandraThriftStoreManager.java:528)
at com.thinkaurelius.titan.diskstorage.cassandra.thrift.CassandraThriftStoreManager.ensureColumnFamilyExists(CassandraThriftStoreManager.java:500)
at com.thinkaurelius.titan.diskstorage.cassandra.thrift.CassandraThriftStoreManager.openDatabase(CassandraThriftStoreManager.java:317)
at com.thinkaurelius.titan.diskstorage.cassandra.thrift.CassandraThriftStoreManager.openDatabase(CassandraThriftStoreManager.java:55)
at com.thinkaurelius.titan.diskstorage.keycolumnvalue.KeyColumnValueStoreManager.openDatabase(KeyColumnValueStoreManager.java:29)
at com.thinkaurelius.titan.diskstorage.Backend.initialize(Backend.java:225)
... 18 more
Caused by: InvalidRequestException(why:Keyspace 'titan' does not exist)
at org.apache.cassandra.thrift.Cassandra$set_keyspace_result$set_keyspace_resultStandardScheme.read(Cassandra.java:9076)
at org.apache.cassandra.thrift.Cassandra$set_keyspace_result$set_keyspace_resultStandardScheme.read(Cassandra.java:9062)
at org.apache.cassandra.thrift.Cassandra$set_keyspace_result.read(Cassandra.java:9012)
at org.apache.thrift.TServiceClient.receiveBase(TServiceClient.java:78)
at org.apache.cassandra.thrift.Cassandra$Client.recv_set_keyspace(Cassandra.java:608)
at org.apache.cassandra.thrift.Cassandra$Client.set_keyspace(Cassandra.java:595)
at com.thinkaurelius.titan.diskstorage.cassandra.thrift.thriftpool.CTConnectionFactory.makeObject(CTConnectionFactory.java:54)
at com.thinkaurelius.titan.diskstorage.cassandra.thrift.thriftpool.CTConnectionFactory.makeObject(CTConnectionFactory.java:21)
at org.apache.commons.pool.impl.GenericKeyedObjectPool.borrowObject(GenericKeyedObjectPool.java:1220)
cassandra.log ---
INFO [main] 2015-10-10 02:37:26,215 YamlConfigurationLoader.java (line 92) Loading settings from file:/fs1/home/dbethune/lib/titan/conf/cassandra/cassandra.yaml
........................
(loads config)
........................
WARN [main] 2015-10-10 02:37:26,229 SystemKeyspace.java (line 745) No host ID found, created 1fdd92d0-fcf3-4d00-8e77-ef49a29f693d (Note: This should happen exactly once per node).
INFO [main] 2015-10-10 02:37:26,231 StorageService.java (line 796) Starting up server gossip
------------------------------------------------- (seems to startup just fine) -----------------------------------------------------
INFO [main] 2015-10-10 02:37:27,083 StorageService.java (line 1665) Node /172.30.48.31 state jump to normal
INFO [main] 2015-10-10 02:37:27,091 CassandraDaemon.java (line 584) Waiting for gossip to settle before accepting client requests...
INFO [CompactionExecutor:1] 2015-10-10 02:37:27,214 CompactionTask.java (line 274) Compacted 4 sstables to [db/cassandra/data/system/local-7ad54392bcdd35a684174e047860b377/system-local-ka-5,]. 831 bytes to 640 (~77% of original) in 133ms = 0.004589MB/s. 4 total partitions merged to 1. Partition merge counts were {4:1, }
INFO [main] 2015-10-10 02:37:35,092 CassandraDaemon.java (line 616) No gossip backlog; proceeding
INFO [main] 2015-10-10 02:37:35,233 Server.java (line 151) Netty using native Epoll event loop
INFO [main] 2015-10-10 02:37:35,279 Server.java (line 181) Using Netty Version: [netty-buffer=netty-buffer-4.0.25.Final.087db82, netty-codec=netty-codec-4.0.25.Final.087db82, netty-codec-http=netty-codec-http-4.0.25.Final.087db82, netty-codec-socks=netty-codec-socks-4.0.25.Final.087db82, netty-common=netty-common-4.0.25.Final.087db82, netty-handler=netty-handler-4.0.25.Final.087db82, netty-tcnative=netty-tcnative-1.1.32.Fork1.9e735d2, netty-transport=netty-transport-4.0.25.Final.087db82, netty-transport-native-epoll=netty-transport-native-epoll-4.0.25.Final.087db82, netty-transport-rxtx=netty-transport-rxtx-4.0.25.Final.087db82, netty-transport-sctp=netty-transport-sctp-4.0.25.Final.087db82, netty-transport-udt=netty-transport-udt-4.0.25.Final.087db82]
INFO [main] 2015-10-10 02:37:35,279 Server.java (line 182) Starting listening for CQL clients on /172.30.48.31:9042...
INFO [main] 2015-10-10 02:37:35,357 ThriftServer.java (line 119) Binding thrift service to /172.30.48.31:9160
INFO [Thread-5] 2015-10-10 02:37:35,363 ThriftServer.java (line 136) Listening for thrift clients...
INFO [Service Thread] 2015-10-10 02:37:36,850 GCInspector.java (line 252) ConcurrentMarkSweep GC in 326ms. Code Cache: 6784064 -> 7127808; Compressed Class Space: 2883832 -> 3268432; Metaspace: 24192032 -> 27467232; Par Eden Space: 1006632960 -> 229366608; Par Survivor Space: 0 -> 22429344
..........................................................
(now the other node starts, and assassinates the seed node)
..........................................................
INFO [main] 2015-10-10 02:42:16,337 StorageService.java (line 963) Using saved tokens [-9032811332102042200, 6301385909530277003, 6785165862433782584, 9100944392500201247]
INFO [main] 2015-10-10 02:42:16,338 ColumnFamilyStore.java (line 939) Enqueuing flush of local: 716 (0%) on-heap, 0 (0%) off-heap
INFO [MemtableFlushWriter:1] 2015-10-10 02:42:16,338 Memtable.java (line 347) Writing Memtable-local@1345153288(0.109KiB serialized bytes, 4 ops, 0%/0% of on/off-heap limit)
INFO [MemtableFlushWriter:1] 2015-10-10 02:42:16,352 Memtable.java (line 382) Completed flushing db/cassandra/data/system/local-7ad54392bcdd35a684174e047860b377/system-local-tmp-ka-8-Data.db (0.000KiB) for commitlog position ReplayPosition(segmentId=1444459335272, position=93400)
INFO [main] 2015-10-10 02:42:16,420 ColumnFamilyStore.java (line 939) Enqueuing flush of local: 1087 (0%) on-heap, 0 (0%) off-heap
INFO [MemtableFlushWriter:2] 2015-10-10 02:42:16,421 Memtable.java (line 347) Writing Memtable-local@1294716931(0.146KiB serialized bytes, 7 ops, 0%/0% of on/off-heap limit)
INFO [MemtableFlushWriter:2] 2015-10-10 02:42:16,422 Memtable.java (line 382) Completed flushing db/cassandra/data/system/local-7ad54392bcdd35a684174e047860b377/system-local-tmp-ka-9-Data.db (0.000KiB) for commitlog position ReplayPosition(segmentId=1444459335272, position=93697)
INFO [CompactionExecutor:2] 2015-10-10 02:42:16,429 CompactionTask.java (line 141) Compacting [SSTableReader(path='/fs1/home/dbethune/lib/titan/db/cassandra/data/system/local-7ad54392bcdd35a684174e047860b377/system-local-ka-8-Data.db'), SSTableReader(path='/fs1/home/dbethune/lib/titan/db/cassandra/data/system/local-7ad54392bcdd35a684174e047860b377/system-local-ka-7-Data.db'), SSTableReader(path='/fs1/home/dbethune/lib/titan/db/cassandra/data/system/local-7ad54392bcdd35a684174e047860b377/system-local-ka-5-Data.db'), SSTableReader(path='/fs1/home/dbethune/lib/titan/db/cassandra/data/system/local-7ad54392bcdd35a684174e047860b377/system-local-ka-6-Data.db')]
INFO [main] 2015-10-10 02:42:16,503 StorageService.java (line 1665) Node /172.30.48.32 state jump to normal
INFO [main] 2015-10-10 02:42:16,512 CassandraDaemon.java (line 584) Waiting for gossip to settle before accepting client requests...
INFO [CompactionExecutor:2] 2015-10-10 02:42:16,580 CompactionTask.java (line 274) Compacted 4 sstables to [db/cassandra/data/system/local-7ad54392bcdd35a684174e047860b377/system-local-ka-10,]. 1,205 bytes to 641 (~53% of original) in 148ms = 0.004130MB/s. 4 total partitions merged to 1. Partition merge counts were {4:1, }
INFO [GossipTasks:1] 2015-10-10 02:42:17,295 OutboundTcpConnection.java (line 97) OutboundTcpConnection using coalescing strategy DISABLED
INFO [HANDSHAKE-uhn-compute31/172.30.48.31] 2015-10-10 02:42:17,306 OutboundTcpConnection.java (line 485) Handshaking version with uhn-compute31/172.30.48.31
INFO [GossipStage:1] 2015-10-10 02:42:17,321 OutboundTcpConnection.java (line 97) OutboundTcpConnection using coalescing strategy DISABLED
INFO [HANDSHAKE-/172.30.48.32] 2015-10-10 02:42:17,331 OutboundTcpConnection.java (line 485) Handshaking version with /172.30.48.32
INFO [GossipStage:1] 2015-10-10 02:42:17,338 Gossiper.java (line 1019) Node /172.30.48.32 is now part of the cluster
INFO [SharedPool-Worker-1] 2015-10-10 02:42:17,349 Gossiper.java (line 984) InetAddress /172.30.48.32 is now UP
WARN [GossipStage:1] 2015-10-10 02:42:17,350 StorageService.java (line 1681) Not updating host ID 1fdd92d0-fcf3-4d00-8e77-ef49a29f693d for /172.30.48.32 because it's mine
INFO [GossipStage:1] 2015-10-10 02:42:17,351 StorageService.java (line 1735) Nodes /172.30.48.32 and /172.30.48.31 have the same token -9032811332102042200. /172.30.48.32 is the new owner
INFO [GossipStage:1] 2015-10-10 02:42:17,351 StorageService.java (line 1735) Nodes /172.30.48.32 and /172.30.48.31 have the same token 6301385909530277003. /172.30.48.32 is the new owner
INFO [GossipStage:1] 2015-10-10 02:42:17,351 StorageService.java (line 1735) Nodes /172.30.48.32 and /172.30.48.31 have the same token 6785165862433782584. /172.30.48.32 is the new owner
INFO [GossipStage:1] 2015-10-10 02:42:17,352 StorageService.java (line 1735) Nodes /172.30.48.32 and /172.30.48.31 have the same token 9100944392500201247. /172.30.48.32 is the new owner
WARN [GossipStage:1] 2015-10-10 02:42:17,354 TokenMetadata.java (line 198) Token 9100944392500201247 changing ownership from /172.30.48.31 to /172.30.48.32
WARN [GossipStage:1] 2015-10-10 02:42:17,354 TokenMetadata.java (line 198) Token -9032811332102042200 changing ownership from /172.30.48.31 to /172.30.48.32
WARN [GossipStage:1] 2015-10-10 02:42:17,354 TokenMetadata.java (line 198) Token 6301385909530277003 changing ownership from /172.30.48.31 to /172.30.48.32
WARN [GossipStage:1] 2015-10-10 02:42:17,354 TokenMetadata.java (line 198) Token 6785165862433782584 changing ownership from /172.30.48.31 to /172.30.48.32
INFO [HANDSHAKE-/172.30.48.32] 2015-10-10 02:42:18,304 OutboundTcpConnection.java (line 485) Handshaking version with /172.30.48.32
WARN [GossipTasks:1] 2015-10-10 02:42:18,369 FailureDetector.java (line 249) Not marking nodes down due to local pause of 292355616685 > 5000000000
INFO [main] 2015-10-10 02:42:24,512 CassandraDaemon.java (line 616) No gossip backlog; proceeding
....................................................................................
(bunch of replays and compactions, then trouble)
....................................................................................
INFO [MigrationStage:1] 2015-10-10 02:42:32,960 ColumnFamilyStore.java (line 363) Initializing titan.system_properties
WARN [Thrift:2] 2015-10-10 02:42:33,084 FBUtilities.java (line 308) Trigger directory doesn't exist, please create it and try again.
INFO [BatchlogTasks:1] 2015-10-10 02:42:46,325 ColumnFamilyStore.java (line 939) Enqueuing flush of batchlog: 7273 (0%) on-heap, 0 (0%) off-heap
INFO [MemtableFlushWriter:1] 2015-10-10 02:42:46,326 Memtable.java (line 347) Writing Memtable-batchlog@2075607740(1.548KiB serialized bytes, 45 ops, 0%/0% of on/off-heap limit)
INFO [MemtableFlushWriter:1] 2015-10-10 02:42:46,328 Memtable.java (line 393) Completed flushing db/cassandra/data/system/batchlog-0290003c977e397cac3efdfdc01d626b/system-batchlog-tmp-ka-1-Data.db; nothing needed to be retained. Commitlog position was ReplayPosition(segmentId=1444459335272, position=102659)
ERROR [OptionalTasks:1] 2015-10-10 02:43:05,101 CassandraDaemon.java (line 223) Exception in thread Thread[OptionalTasks:1,5,main]
java.lang.RuntimeException: java.io.FileNotFoundException: /fs1/home/dbethune/lib/titan/db/cassandra/data/system/local-7ad54392bcdd35a684174e047860b377/system-local-ka-5-Data.db (No such file or directory)
at org.apache.cassandra.io.compress.CompressedRandomAccessReader.open(CompressedRandomAccessReader.java:52)
at org.apache.cassandra.io.util.CompressedPoolingSegmentedFile.createPooledReader(CompressedPoolingSegmentedFile.java:95)
at org.apache.cassandra.io.util.PoolingSegmentedFile.getSegment(PoolingSegmentedFile.java:62)
at org.apache.cassandra.io.sstable.SSTableReader.getFileDataInput(SSTableReader.java:1890)
at org.apache.cassandra.db.columniterator.SimpleSliceReader.<init>(SimpleSliceReader.java:57)
at org.apache.cassandra.db.columniterator.SSTableSliceIterator.createReader(SSTableSliceIterator.java:65)
at org.apache.cassandra.db.columniterator.SSTableSliceIterator.<init>(SSTableSliceIterator.java:42)
at org.apache.cassandra.db.filter.SliceQueryFilter.getSSTableColumnIterator(SliceQueryFilter.java:246)
at org.apache.cassandra.db.filter.QueryFilter.getSSTableColumnIterator(QueryFilter.java:62)
at org.apache.cassandra.db.CollationController.collectAllData(CollationController.java:270)
at org.apache.cassandra.db.CollationController.getTopLevelColumns(CollationController.java:65)
at org.apache.cassandra.db.ColumnFamilyStore.getTopLevelColumns(ColumnFamilyStore.java:2033)
at org.apache.cassandra.db.ColumnFamilyStore.getColumnFamily(ColumnFamilyStore.java:1876)
at org.apache.cassandra.db.Keyspace.getRow(Keyspace.java:357)
at org.apache.cassandra.db.SliceFromReadCommand.getRow(SliceFromReadCommand.java:85)
at org.apache.cassandra.cql3.statements.SelectStatement.readLocally(SelectStatement.java:309)
at org.apache.cassandra.cql3.statements.SelectStatement.executeInternal(SelectStatement.java:328)
at org.apache.cassandra.cql3.statements.SelectStatement.executeInternal(SelectStatement.java:67)
at org.apache.cassandra.cql3.QueryProcessor.executeInternal(QueryProcessor.java:317)
at org.apache.cassandra.db.SystemKeyspace.getSavedTokens(SystemKeyspace.java:634)
at org.apache.cassandra.service.StorageService.getLocalTokens(StorageService.java:2210)
at org.apache.cassandra.db.SizeEstimatesRecorder.run(SizeEstimatesRecorder.java:67)
at org.apache.cassandra.concurrent.DebuggableScheduledThreadPoolExecutor$UncomplainingRunnable.run(DebuggableScheduledThreadPoolExecutor.java:118)
at java.util.concurrent.Executors$RunnableAdapter.call(Executors.java:511)
at java.util.concurrent.FutureTask.runAndReset(FutureTask.java:308)
at java.util.concurrent.ScheduledThreadPoolExecutor$ScheduledFutureTask.access$301(ScheduledThreadPoolExecutor.java:180)
at java.util.concurrent.ScheduledThreadPoolExecutor$ScheduledFutureTask.run(ScheduledThreadPoolExecutor.java:294)
at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1142)
at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:617)
at java.lang.Thread.run(Thread.java:745)
Caused by: java.io.FileNotFoundException: /fs1/home/dbethune/lib/titan/db/cassandra/data/system/local-7ad54392bcdd35a684174e047860b377/system-local-ka-5-Data.db (No such file or directory)
at java.io.RandomAccessFile.open0(Native Method)
at java.io.RandomAccessFile.open(RandomAccessFile.java:316)
at java.io.RandomAccessFile.<init>(RandomAccessFile.java:243)
at org.apache.cassandra.io.util.RandomAccessReader.<init>(RandomAccessReader.java:64)
at org.apache.cassandra.io.compress.CompressedRandomAccessReader.<init>(CompressedRandomAccessReader.java:70)
at org.apache.cassandra.io.compress.CompressedRandomAccessReader.open(CompressedRandomAccessReader.java:48)
... 29 more
Sorry if the formatting is wacky - this is my first post of this nature. In some earlier attempts with DEBUG logging enabled for Cassandra/ES/Gremlin Server, multiget_slice appeared to be the Cassandra call that breaks things. I'm not sure what's going on - based on nothing in particular, I suspect an issue with Thrift, so my next step is to switch the storage backend from cassandrathrift to plain cassandra and set the requisite "don't use thrift" settings in cassandra.yaml. I haven't found an exact match for this bug anywhere yet, but the link below seems quite close to my case - except that I can't even get one node to join the cluster yet :(
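For reference, the backend switch I'm planning looks roughly like this in conf/gremlin-server/titan-cassandra-es-server.properties. This is a sketch only - the hostnames and keyspace are from my setup, and I'm assuming "cassandra" here selects the Astyanax-based store manager rather than the Thrift one:

```properties
# Sketch: swap the Thrift store manager for the "cassandra" (Astyanax) backend.
# Hostnames/keyspace are mine; all other properties left as shipped.
storage.backend=cassandra
storage.hostname=uhn-compute31,uhn-compute32
storage.cassandra.keyspace=titan

index.search.backend=elasticsearch
index.search.hostname=uhn-compute31,uhn-compute32
```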
In the post above, one workable solution mentioned is to "Fix the host id in the peers column family in system keyspace. Generate a new UUID and update the row for newly added peer. This need to be done on all existing Cassandra nodes." I will try that too and see whether the same thing happens. Attached are the config files and relevant logs - not at DEBUG level, though; I will enable that for the next steps described above. uhn-compute31.log is the seed node's bin/titan.sh -v start output, and uhn-compute32.log is the joining node's titan.sh -v start output.
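My reading of that workaround in cqlsh terms is roughly the following - untested on my part; the UUID below is a placeholder I'd generate fresh (e.g. with uuidgen), and the peer address is my joining node:

```sql
-- On each existing node, inspect what it has recorded for its peers:
SELECT peer, host_id FROM system.peers;

-- If the new peer's row carries the colliding host ID, overwrite it with a
-- freshly generated UUID (placeholder value below):
UPDATE system.peers
   SET host_id = 13814000-1dd2-11b2-0000-000000000000
 WHERE peer = '172.30.48.32';
```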
If you guys have any advice on how to get this working, I would be very much indebted - my goal is to get Titan running distributed across our cluster's nodes for lab members. If a different configuration will give me less of a headache, I'm willing to drop the bin/titan.sh C* + ES + Gremlin Server setup and use that instead.
Thanks!