Having issues getting Cassandra to run on more than one node


Dylan Bethune-Waddell

Oct 10, 2015, 10:41:38 AM
to Aurelius
Not sure exactly why - it seems that the seed node assigns itself a host ID and some tokens; then, when a new node attempts to join the cluster, it is assigned the same host ID, all the tokens change ownership, and a java.io.FileNotFoundException is thrown when the CassandraDaemon (on the seed node, I assume) attempts to read a system-local-ka-*-Data.db file. When I inspect the ring with ./nodetool status on the seed node, the daemon has gone down; running the same command on the newer node shows it holding the host ID that both nodes attempted to claim.

Elasticsearch appeared to have the same issue in one particular setup, but I believe that was an error in how I had set up the configuration - otherwise, it seems to form a cluster just fine using the same hostnames. The gremlin-server and cassandra logs seem to capture the essence of the issue:

gremlin-server.log ---
1780 [main] INFO  com.thinkaurelius.titan.graphdb.configuration.GraphDatabaseConfiguration  - Set default timestamp provider MICRO
1785 [main] INFO  com.thinkaurelius.titan.diskstorage.cassandra.thrift.CassandraThriftStoreManager  - Closed Thrift connection pooler.
1790 [main] INFO  com.thinkaurelius.titan.graphdb.configuration.GraphDatabaseConfiguration  - Generated unique-instance-id=ac1e30208123-uhn-compute321
1806 [main] INFO  com.thinkaurelius.titan.diskstorage.Backend  - Configuring index [search]
1907 [main] INFO  org.elasticsearch.plugins  - [Magneto] loaded [], sites []
2463 [main] INFO  com.thinkaurelius.titan.diskstorage.es.ElasticSearchIndex  - Configured remote host: uhn-compute31 : 9300
2575 [main] INFO  com.thinkaurelius.titan.diskstorage.es.ElasticSearchIndex  - Configured remote host: uhn-compute32 : 9300
3619 [main] INFO  com.thinkaurelius.titan.diskstorage.Backend  - Initiated backend operations thread pool of size 24
3639 [main] WARN  org.apache.tinkerpop.gremlin.server.GremlinServer  - Graph [graph] configured at [conf/gremlin-server/titan-cassandra-es-server.properties] could not be instantiated and will not be available in Gremlin Server.  GraphFactory message: GraphFactory could not instantiate this Graph implementation [class com.thinkaurelius.titan.core.TitanFactory]
java.lang.RuntimeException: GraphFactory could not instantiate this Graph implementation [class com.thinkaurelius.titan.core.TitanFactory]
        at org.apache.tinkerpop.gremlin.structure.util.GraphFactory.open(GraphFactory.java:82)
        at org.apache.tinkerpop.gremlin.structure.util.GraphFactory.open(GraphFactory.java:70)
        at org.apache.tinkerpop.gremlin.structure.util.GraphFactory.open(GraphFactory.java:104)
        at org.apache.tinkerpop.gremlin.server.GraphManager.lambda$new$27(GraphManager.java:50)
        at org.apache.tinkerpop.gremlin.server.GraphManager$$Lambda$9/2025269734.accept(Unknown Source)
        at java.util.LinkedHashMap$LinkedEntrySet.forEach(LinkedHashMap.java:663)
        at org.apache.tinkerpop.gremlin.server.GraphManager.<init>(GraphManager.java:48)
        at org.apache.tinkerpop.gremlin.server.util.ServerGremlinExecutor.<init>(ServerGremlinExecutor.java:94)
        at org.apache.tinkerpop.gremlin.server.GremlinServer.<init>(GremlinServer.java:89)
        at org.apache.tinkerpop.gremlin.server.GremlinServer.main(GremlinServer.java:301)
Caused by: java.lang.reflect.InvocationTargetException
        at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method)
        at sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:62)
        at sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43)
        at java.lang.reflect.Method.invoke(Method.java:497)
        at org.apache.tinkerpop.gremlin.structure.util.GraphFactory.open(GraphFactory.java:78)
        ... 9 more
Caused by: com.thinkaurelius.titan.core.TitanException: Could not initialize backend
        at com.thinkaurelius.titan.diskstorage.Backend.initialize(Backend.java:301)
        at com.thinkaurelius.titan.graphdb.configuration.GraphDatabaseConfiguration.getBackend(GraphDatabaseConfiguration.java:1806)
        at com.thinkaurelius.titan.graphdb.database.StandardTitanGraph.<init>(StandardTitanGraph.java:123)
        at com.thinkaurelius.titan.core.TitanFactory.open(TitanFactory.java:94)
        at com.thinkaurelius.titan.core.TitanFactory.open(TitanFactory.java:74)
        ... 14 more
Caused by: com.thinkaurelius.titan.diskstorage.PermanentBackendException: Permanent failure in storage backend
        at com.thinkaurelius.titan.diskstorage.cassandra.thrift.CassandraThriftStoreManager.ensureColumnFamilyExists(CassandraThriftStoreManager.java:528)
        at com.thinkaurelius.titan.diskstorage.cassandra.thrift.CassandraThriftStoreManager.ensureColumnFamilyExists(CassandraThriftStoreManager.java:500)
        at com.thinkaurelius.titan.diskstorage.cassandra.thrift.CassandraThriftStoreManager.openDatabase(CassandraThriftStoreManager.java:317)
        at com.thinkaurelius.titan.diskstorage.cassandra.thrift.CassandraThriftStoreManager.openDatabase(CassandraThriftStoreManager.java:55)
        at com.thinkaurelius.titan.diskstorage.keycolumnvalue.KeyColumnValueStoreManager.openDatabase(KeyColumnValueStoreManager.java:29)
        at com.thinkaurelius.titan.diskstorage.Backend.initialize(Backend.java:225)
        ... 18 more
Caused by: InvalidRequestException(why:Keyspace 'titan' does not exist)
        at org.apache.cassandra.thrift.Cassandra$set_keyspace_result$set_keyspace_resultStandardScheme.read(Cassandra.java:9076)
        at org.apache.cassandra.thrift.Cassandra$set_keyspace_result$set_keyspace_resultStandardScheme.read(Cassandra.java:9062)
        at org.apache.cassandra.thrift.Cassandra$set_keyspace_result.read(Cassandra.java:9012)
        at org.apache.thrift.TServiceClient.receiveBase(TServiceClient.java:78)
        at org.apache.cassandra.thrift.Cassandra$Client.recv_set_keyspace(Cassandra.java:608)
        at org.apache.cassandra.thrift.Cassandra$Client.set_keyspace(Cassandra.java:595)
        at com.thinkaurelius.titan.diskstorage.cassandra.thrift.thriftpool.CTConnectionFactory.makeObject(CTConnectionFactory.java:54)
        at com.thinkaurelius.titan.diskstorage.cassandra.thrift.thriftpool.CTConnectionFactory.makeObject(CTConnectionFactory.java:21)
        at org.apache.commons.pool.impl.GenericKeyedObjectPool.borrowObject(GenericKeyedObjectPool.java:1220)

cassandra.log ---
INFO [main] 2015-10-10 02:37:26,215 YamlConfigurationLoader.java (line 92) Loading settings from file:/fs1/home/dbethune/lib/titan/conf/cassandra/cassandra.yaml
........................
(loads config)
........................
 WARN [main] 2015-10-10 02:37:26,229 SystemKeyspace.java (line 745) No host ID found, created 1fdd92d0-fcf3-4d00-8e77-ef49a29f693d (Note: This should happen exactly once per node).
 INFO [main] 2015-10-10 02:37:26,231 StorageService.java (line 796) Starting up server gossip
------------------------------------------------- (seems to startup just fine) -----------------------------------------------------
 INFO [main] 2015-10-10 02:37:27,083 StorageService.java (line 1665) Node /172.30.48.31 state jump to normal
 INFO [main] 2015-10-10 02:37:27,091 CassandraDaemon.java (line 584) Waiting for gossip to settle before accepting client requests...
 INFO [CompactionExecutor:1] 2015-10-10 02:37:27,214 CompactionTask.java (line 274) Compacted 4 sstables to [db/cassandra/data/system/local-7ad54392bcdd35a684174e047860b377/system-local-ka-5,].  831 bytes to 640 (~77% of original) in 133ms = 0.004589MB/s.  4 total partitions merged to 1.  Partition merge counts were {4:1, }
 INFO [main] 2015-10-10 02:37:35,092 CassandraDaemon.java (line 616) No gossip backlog; proceeding
 INFO [main] 2015-10-10 02:37:35,233 Server.java (line 151) Netty using native Epoll event loop
 INFO [main] 2015-10-10 02:37:35,279 Server.java (line 181) Using Netty Version: [netty-buffer=netty-buffer-4.0.25.Final.087db82, netty-codec=netty-codec-4.0.25.Final.087db82, netty-codec-http=netty-codec-http-4.0.25.Final.087db82, netty-codec-socks=netty-codec-socks-4.0.25.Final.087db82, netty-common=netty-common-4.0.25.Final.087db82, netty-handler=netty-handler-4.0.25.Final.087db82, netty-tcnative=netty-tcnative-1.1.32.Fork1.9e735d2, netty-transport=netty-transport-4.0.25.Final.087db82, netty-transport-native-epoll=netty-transport-native-epoll-4.0.25.Final.087db82, netty-transport-rxtx=netty-transport-rxtx-4.0.25.Final.087db82, netty-transport-sctp=netty-transport-sctp-4.0.25.Final.087db82, netty-transport-udt=netty-transport-udt-4.0.25.Final.087db82]
 INFO [main] 2015-10-10 02:37:35,279 Server.java (line 182) Starting listening for CQL clients on /172.30.48.31:9042...
 INFO [main] 2015-10-10 02:37:35,357 ThriftServer.java (line 119) Binding thrift service to /172.30.48.31:9160
 INFO [Thread-5] 2015-10-10 02:37:35,363 ThriftServer.java (line 136) Listening for thrift clients...
 INFO [Service Thread] 2015-10-10 02:37:36,850 GCInspector.java (line 252) ConcurrentMarkSweep GC in 326ms.  Code Cache: 6784064 -> 7127808; Compressed Class Space: 2883832 -> 3268432; Metaspace: 24192032 -> 27467232; Par Eden Space: 1006632960 -> 229366608; Par Survivor Space: 0 -> 22429344
.......................................................... 
(now the other node starts, and assassinates the seed node) 
..........................................................
 INFO [main] 2015-10-10 02:42:16,337 StorageService.java (line 963) Using saved tokens [-9032811332102042200, 6301385909530277003, 6785165862433782584, 9100944392500201247]
 INFO [main] 2015-10-10 02:42:16,338 ColumnFamilyStore.java (line 939) Enqueuing flush of local: 716 (0%) on-heap, 0 (0%) off-heap
 INFO [MemtableFlushWriter:1] 2015-10-10 02:42:16,338 Memtable.java (line 347) Writing Memtable-local@1345153288(0.109KiB serialized bytes, 4 ops, 0%/0% of on/off-heap limit)
 INFO [MemtableFlushWriter:1] 2015-10-10 02:42:16,352 Memtable.java (line 382) Completed flushing db/cassandra/data/system/local-7ad54392bcdd35a684174e047860b377/system-local-tmp-ka-8-Data.db (0.000KiB) for commitlog position ReplayPosition(segmentId=1444459335272, position=93400)
 INFO [main] 2015-10-10 02:42:16,420 ColumnFamilyStore.java (line 939) Enqueuing flush of local: 1087 (0%) on-heap, 0 (0%) off-heap
 INFO [MemtableFlushWriter:2] 2015-10-10 02:42:16,421 Memtable.java (line 347) Writing Memtable-local@1294716931(0.146KiB serialized bytes, 7 ops, 0%/0% of on/off-heap limit)
 INFO [MemtableFlushWriter:2] 2015-10-10 02:42:16,422 Memtable.java (line 382) Completed flushing db/cassandra/data/system/local-7ad54392bcdd35a684174e047860b377/system-local-tmp-ka-9-Data.db (0.000KiB) for commitlog position ReplayPosition(segmentId=1444459335272, position=93697)
 INFO [CompactionExecutor:2] 2015-10-10 02:42:16,429 CompactionTask.java (line 141) Compacting [SSTableReader(path='/fs1/home/dbethune/lib/titan/db/cassandra/data/system/local-7ad54392bcdd35a684174e047860b377/system-local-ka-8-Data.db'), SSTableReader(path='/fs1/home/dbethune/lib/titan/db/cassandra/data/system/local-7ad54392bcdd35a684174e047860b377/system-local-ka-7-Data.db'), SSTableReader(path='/fs1/home/dbethune/lib/titan/db/cassandra/data/system/local-7ad54392bcdd35a684174e047860b377/system-local-ka-5-Data.db'), SSTableReader(path='/fs1/home/dbethune/lib/titan/db/cassandra/data/system/local-7ad54392bcdd35a684174e047860b377/system-local-ka-6-Data.db')]
 INFO [main] 2015-10-10 02:42:16,503 StorageService.java (line 1665) Node /172.30.48.32 state jump to normal
 INFO [main] 2015-10-10 02:42:16,512 CassandraDaemon.java (line 584) Waiting for gossip to settle before accepting client requests...
 INFO [CompactionExecutor:2] 2015-10-10 02:42:16,580 CompactionTask.java (line 274) Compacted 4 sstables to [db/cassandra/data/system/local-7ad54392bcdd35a684174e047860b377/system-local-ka-10,].  1,205 bytes to 641 (~53% of original) in 148ms = 0.004130MB/s.  4 total partitions merged to 1.  Partition merge counts were {4:1, }
 INFO [GossipTasks:1] 2015-10-10 02:42:17,295 OutboundTcpConnection.java (line 97) OutboundTcpConnection using coalescing strategy DISABLED
 INFO [HANDSHAKE-uhn-compute31/172.30.48.31] 2015-10-10 02:42:17,306 OutboundTcpConnection.java (line 485) Handshaking version with uhn-compute31/172.30.48.31
 INFO [GossipStage:1] 2015-10-10 02:42:17,321 OutboundTcpConnection.java (line 97) OutboundTcpConnection using coalescing strategy DISABLED
 INFO [HANDSHAKE-/172.30.48.32] 2015-10-10 02:42:17,331 OutboundTcpConnection.java (line 485) Handshaking version with /172.30.48.32
 INFO [GossipStage:1] 2015-10-10 02:42:17,338 Gossiper.java (line 1019) Node /172.30.48.32 is now part of the cluster
 INFO [SharedPool-Worker-1] 2015-10-10 02:42:17,349 Gossiper.java (line 984) InetAddress /172.30.48.32 is now UP
 WARN [GossipStage:1] 2015-10-10 02:42:17,350 StorageService.java (line 1681) Not updating host ID 1fdd92d0-fcf3-4d00-8e77-ef49a29f693d for /172.30.48.32 because it's mine
 INFO [GossipStage:1] 2015-10-10 02:42:17,351 StorageService.java (line 1735) Nodes /172.30.48.32 and /172.30.48.31 have the same token -9032811332102042200.  /172.30.48.32 is the new owner
 INFO [GossipStage:1] 2015-10-10 02:42:17,351 StorageService.java (line 1735) Nodes /172.30.48.32 and /172.30.48.31 have the same token 6301385909530277003.  /172.30.48.32 is the new owner
 INFO [GossipStage:1] 2015-10-10 02:42:17,351 StorageService.java (line 1735) Nodes /172.30.48.32 and /172.30.48.31 have the same token 6785165862433782584.  /172.30.48.32 is the new owner
 INFO [GossipStage:1] 2015-10-10 02:42:17,352 StorageService.java (line 1735) Nodes /172.30.48.32 and /172.30.48.31 have the same token 9100944392500201247.  /172.30.48.32 is the new owner
 WARN [GossipStage:1] 2015-10-10 02:42:17,354 TokenMetadata.java (line 198) Token 9100944392500201247 changing ownership from /172.30.48.31 to /172.30.48.32
 WARN [GossipStage:1] 2015-10-10 02:42:17,354 TokenMetadata.java (line 198) Token -9032811332102042200 changing ownership from /172.30.48.31 to /172.30.48.32
 WARN [GossipStage:1] 2015-10-10 02:42:17,354 TokenMetadata.java (line 198) Token 6301385909530277003 changing ownership from /172.30.48.31 to /172.30.48.32
 WARN [GossipStage:1] 2015-10-10 02:42:17,354 TokenMetadata.java (line 198) Token 6785165862433782584 changing ownership from /172.30.48.31 to /172.30.48.32
 INFO [HANDSHAKE-/172.30.48.32] 2015-10-10 02:42:18,304 OutboundTcpConnection.java (line 485) Handshaking version with /172.30.48.32
 WARN [GossipTasks:1] 2015-10-10 02:42:18,369 FailureDetector.java (line 249) Not marking nodes down due to local pause of 292355616685 > 5000000000
 INFO [main] 2015-10-10 02:42:24,512 CassandraDaemon.java (line 616) No gossip backlog; proceeding
....................................................................................
(bunch of replays and compactions, then trouble)
....................................................................................
INFO [MigrationStage:1] 2015-10-10 02:42:32,960 ColumnFamilyStore.java (line 363) Initializing titan.system_properties
 WARN [Thrift:2] 2015-10-10 02:42:33,084 FBUtilities.java (line 308) Trigger directory doesn't exist, please create it and try again.
 INFO [BatchlogTasks:1] 2015-10-10 02:42:46,325 ColumnFamilyStore.java (line 939) Enqueuing flush of batchlog: 7273 (0%) on-heap, 0 (0%) off-heap
 INFO [MemtableFlushWriter:1] 2015-10-10 02:42:46,326 Memtable.java (line 347) Writing Memtable-batchlog@2075607740(1.548KiB serialized bytes, 45 ops, 0%/0% of on/off-heap limit)
 INFO [MemtableFlushWriter:1] 2015-10-10 02:42:46,328 Memtable.java (line 393) Completed flushing db/cassandra/data/system/batchlog-0290003c977e397cac3efdfdc01d626b/system-batchlog-tmp-ka-1-Data.db; nothing needed to be retained.  Commitlog position was ReplayPosition(segmentId=1444459335272, position=102659)
ERROR [OptionalTasks:1] 2015-10-10 02:43:05,101 CassandraDaemon.java (line 223) Exception in thread Thread[OptionalTasks:1,5,main]
java.lang.RuntimeException: java.io.FileNotFoundException: /fs1/home/dbethune/lib/titan/db/cassandra/data/system/local-7ad54392bcdd35a684174e047860b377/system-local-ka-5-Data.db (No such file or directory)
        at org.apache.cassandra.io.compress.CompressedRandomAccessReader.open(CompressedRandomAccessReader.java:52)
        at org.apache.cassandra.io.util.CompressedPoolingSegmentedFile.createPooledReader(CompressedPoolingSegmentedFile.java:95)
        at org.apache.cassandra.io.util.PoolingSegmentedFile.getSegment(PoolingSegmentedFile.java:62)
        at org.apache.cassandra.io.sstable.SSTableReader.getFileDataInput(SSTableReader.java:1890)
        at org.apache.cassandra.db.columniterator.SimpleSliceReader.<init>(SimpleSliceReader.java:57)
        at org.apache.cassandra.db.columniterator.SSTableSliceIterator.createReader(SSTableSliceIterator.java:65)
        at org.apache.cassandra.db.columniterator.SSTableSliceIterator.<init>(SSTableSliceIterator.java:42)
        at org.apache.cassandra.db.filter.SliceQueryFilter.getSSTableColumnIterator(SliceQueryFilter.java:246)
        at org.apache.cassandra.db.filter.QueryFilter.getSSTableColumnIterator(QueryFilter.java:62)
        at org.apache.cassandra.db.CollationController.collectAllData(CollationController.java:270)
        at org.apache.cassandra.db.CollationController.getTopLevelColumns(CollationController.java:65)
        at org.apache.cassandra.db.ColumnFamilyStore.getTopLevelColumns(ColumnFamilyStore.java:2033)
        at org.apache.cassandra.db.ColumnFamilyStore.getColumnFamily(ColumnFamilyStore.java:1876)
        at org.apache.cassandra.db.Keyspace.getRow(Keyspace.java:357)
        at org.apache.cassandra.db.SliceFromReadCommand.getRow(SliceFromReadCommand.java:85)
        at org.apache.cassandra.cql3.statements.SelectStatement.readLocally(SelectStatement.java:309)
        at org.apache.cassandra.cql3.statements.SelectStatement.executeInternal(SelectStatement.java:328)
        at org.apache.cassandra.cql3.statements.SelectStatement.executeInternal(SelectStatement.java:67)
        at org.apache.cassandra.cql3.QueryProcessor.executeInternal(QueryProcessor.java:317)
        at org.apache.cassandra.db.SystemKeyspace.getSavedTokens(SystemKeyspace.java:634)
        at org.apache.cassandra.service.StorageService.getLocalTokens(StorageService.java:2210)
        at org.apache.cassandra.db.SizeEstimatesRecorder.run(SizeEstimatesRecorder.java:67)
        at org.apache.cassandra.concurrent.DebuggableScheduledThreadPoolExecutor$UncomplainingRunnable.run(DebuggableScheduledThreadPoolExecutor.java:118)
        at java.util.concurrent.Executors$RunnableAdapter.call(Executors.java:511)
        at java.util.concurrent.FutureTask.runAndReset(FutureTask.java:308)
        at java.util.concurrent.ScheduledThreadPoolExecutor$ScheduledFutureTask.access$301(ScheduledThreadPoolExecutor.java:180)
        at java.util.concurrent.ScheduledThreadPoolExecutor$ScheduledFutureTask.run(ScheduledThreadPoolExecutor.java:294)
        at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1142)
        at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:617)
        at java.lang.Thread.run(Thread.java:745)
Caused by: java.io.FileNotFoundException: /fs1/home/dbethune/lib/titan/db/cassandra/data/system/local-7ad54392bcdd35a684174e047860b377/system-local-ka-5-Data.db (No such file or directory)
        at java.io.RandomAccessFile.open0(Native Method)
        at java.io.RandomAccessFile.open(RandomAccessFile.java:316)
        at java.io.RandomAccessFile.<init>(RandomAccessFile.java:243)
        at org.apache.cassandra.io.util.RandomAccessReader.<init>(RandomAccessReader.java:64)
        at org.apache.cassandra.io.compress.CompressedRandomAccessReader.<init>(CompressedRandomAccessReader.java:70)
        at org.apache.cassandra.io.compress.CompressedRandomAccessReader.open(CompressedRandomAccessReader.java:48)
        ... 29 more


Sorry if the formatting is whacky - first post of this nature. In some earlier efforts with DEBUG logging levels passed to Cassandra/ES/Gremlin Server, it seems that multiget_slice is the Cassandra call that breaks things. I am not sure what's going on - based on nothing in particular, I assume it is an issue with Thrift, so I am going to try the cassandra backend instead of cassandrathrift, setting the requisite "don't use thrift" settings in cassandra.yaml. I have not found an exact match for this bug anywhere yet, but the link below seems to be quite close to my case - except that I can't even get one damn node to join the cluster yet :(

In the post linked above, it is mentioned that one workable solution is to "Fix the host id in the peers column family in system keyspace. Generate a new UUID and update the row for newly added peer. This need to be done on all existing Cassandra nodes." - so I will try that too and see if I get the same thing happening. Attached are config files and relevant logs - not at DEBUG level, though; I will enable that for the next steps described above. uhn-compute31.log is the output of bin/titan.sh -v start on the seed node, and uhn-compute32.log is the same for the joining node.
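For reference, the workaround quoted above could be sketched roughly as follows. This is a hypothetical sketch, not a tested procedure: the peer IP is this thread's joining node, and the resulting statement would need to be run via cqlsh against every existing node.

```shell
# Hypothetical sketch of the quoted workaround: mint a fresh UUID and
# update the joining peer's row in the system keyspace.
# The IP below is this thread's joining node; adjust for your cluster.
NEW_ID=$(uuidgen)
PEER_IP="172.30.48.32"
CQL="UPDATE system.peers SET host_id = ${NEW_ID} WHERE peer = '${PEER_IP}';"
echo "$CQL"
# Then, on each existing node:  cqlsh <node-address> -e "$CQL"
```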

If you guys have any advice on how to get this working I would be very, very much indebted - my goal is to get Titan running distributed across our cluster's nodes for lab members - if a different configuration will give me less of a headache, I'm willing to use it and drop the bin/titan.sh C* + ES + GremlinServer setup.

Thanks!
 
cassandra.log
uhn-compute32.log
cassandra.yaml
cassandra-env.sh
elasticsearch.log
elasticsearch.yml
gremlin-server.log
gremlin-server.yaml
titan-cassandra-es-server.properties
uhn-compute31.log

Jason Plurad

Oct 10, 2015, 2:36:30 PM
to Aurelius
You might want to consider running Cassandra and Elasticsearch as individual clusters, instead of embedded in Titan.
See this related thread I just answered https://groups.google.com/d/msg/aureliusgraphs/wdLPf789ouA/FWE8vzQIAwAJ

Dylan Bethune-Waddell

Oct 10, 2015, 4:16:46 PM
to Aurelius
Hi Jason,

I will give a completely separate C* download/cluster setup a shot - but isn't the titan.sh script already firing up separate C* and ES processes, or did you mean using ES/C* distributions that don't ship with Titan? There are definitely three java processes running on each of the nodes - one gremlin-server, the other two C* and ES respectively as far as I can tell - and ES seems to form a cluster on its own just fine according to the logs, which gremlin-server recognizes. I am about to try editing the C* system keyspace as suggested here:

https://groups.google.com/forum/#!search/Cassandra$20"because$20it$27s$20mine"/nosql-databases/rDgXHgt03QE/X0FEmpg8OMoJ

When I execute titan.sh on a new node - after changing the {cassandra,elasticsearch,gremlin-server}.yaml/yml files to reflect the hostname of the node I am ssh'd into, or setting listen/rpc_interface in cassandra.yaml to either our InfiniBand or Ethernet interface - C* barfs. It thinks the system keyspace host ID claimed by the seed node (booted first) should actually belong to the node that is joining the cluster, causing:
(1) All the tokens that were auto-created for the seed node to change ownership;
(2) The new node to claim the seed node's internal C* host ID, dropping the seed node out of the ring on nodetool inspection even though the daemon is still running on the seed's host;
(3) multiget_slice calls and compactions that expect certain files to exist to no longer find them.
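A quick way to confirm the host ID collision would be to compare the ID each node reports. This is a hypothetical sketch assuming the hostnames from this thread and that `nodetool info` prints an ID line; the ssh lines are left commented since they need a live cluster.

```shell
# Hypothetical check for the host ID collision described above: compare
# the host ID each node reports via "nodetool info".
same_host_id() {
    # succeeds (exit 0) only when both IDs are non-empty and identical
    [ -n "$1" ] && [ "$1" = "$2" ]
}
# ID1=$(ssh uhn-compute31 'nodetool info' | awk -F: '/ID/ {gsub(/ /,"",$2); print $2}')
# ID2=$(ssh uhn-compute32 'nodetool info' | awk -F: '/ID/ {gsub(/ /,"",$2); print $2}')
# same_host_id "$ID1" "$ID2" && echo "host ID collision!"
```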

I'm quite stoked on Titan and TinkerPop3 - and trying to integrate with the Spark cluster I just set up, which has no issues connecting the master to nodes - but I just can't get Titan running distributed with what ships out of the box. Any guidance you guys can provide from your experience setting up Titan to run distributed instances in whatever configuration would be greatly appreciated.

I am running RHEL6/CentOS6 on a bunch of Xeons, if that means anything - I will keep posting here as I get info back from trying out different configs and solutions (if the first post is a big deterrent with all the unnecessary log output, maybe I should start a new thread?).

Thanks,
Dylan

Dylan Bethune-Waddell

Oct 12, 2015, 1:44:28 PM
to Aurelius
Fixed - the issue seems to be running on NAS/NFS storage, which is an anti-pattern for Cassandra's storage model. Changing the data/commitlog/saved-caches directories in cassandra.yaml from ~/<path-to-titan>/db (on NFS) to /tmp (local disk) got things to work. This page suggested it:


This was a RTFM thing - but what about adding to the docs that the $TITAN_HOME/db/* directory is not appropriate if Titan is installed on your cluster's NAS/NFS filesystem, and that if you are getting host ID collisions in the Cassandra system keyspace then sed -i -r 's/db\//tmp\//g' cassandra.yaml could solve it?
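Concretely, the fix amounts to pointing Cassandra's three on-disk directories at node-local storage in cassandra.yaml. A sketch with example paths (a dedicated node-local scratch disk would be safer than /tmp, which may be cleared on reboot):

```yaml
# Point all three directories at node-local disk instead of the
# NFS-mounted Titan install tree; the paths here are examples.
data_file_directories:
    - /tmp/cassandra/data
commitlog_directory: /tmp/cassandra/commitlog
saved_caches_directory: /tmp/cassandra/saved_caches
```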

...

David

Oct 15, 2015, 6:01:34 PM
to Aurelius
>> This was a RTFM thing - but what about adding to the docs that the $TITAN_HOME/db/* directory is not appropriate if Titan is installed on your cluster's NAS/NFS filesystem, and that if you are getting host ID collisions in the Cassandra system keyspace then sed -i -r 's/db\//tmp\//g' cassandra.yaml could solve it?

Sounds like a pull request...although I agree with Jason that you should consider setting up the ES and C* installations separately.
...