We are deploying Titan 1.0 on a Cassandra cluster (DSE 4.8.6) that is currently idle, and we are seeing vertex and edge addition times of up to a second, with a significant proportion in the 200 ms range. We had just completed a large test data load, during which performance was decent. We then dropped the keyspace (via cqlsh) and rebuilt the keyspace and schema. Performance has been horrendous since.
The cluster is spec'd as follows:
Cassandra Cluster: 2 x DC, 3 node / DC, SSD, 64GB
cassandra service: 16g
gremlin-server: 6g co-hosted on each node
The following configs are being used for the gremlin-server processes:
gremlin-server.yaml
host: 0.0.0.0
port: 8182
threadPoolWorker: 16
gremlinPool: 64
scriptEvaluationTimeout: 30000
serializedResponseTimeout: 30000
channelizer: org.apache.tinkerpop.gremlin.server.channel.WebSocketChannelizer
graphs: {
graph: conf/titan.properties}
plugins:
- aurelius.titan
scriptEngines: {
gremlin-groovy: {
imports: [java.lang.Math],
staticImports: [java.lang.Math.PI],
scripts: [scripts/empty-sample.groovy]},
nashorn: {
imports: [java.lang.Math],
staticImports: [java.lang.Math.PI]}}
serializers:
- { className: org.apache.tinkerpop.gremlin.driver.ser.GryoMessageSerializerV1d0, config: { useMapperFromGraph: graph, bufferSize: 8192000 }}
- { className: org.apache.tinkerpop.gremlin.driver.ser.GryoMessageSerializerV1d0, config: { serializeResultToString: true, bufferSize: 81920 }}
- { className: org.apache.tinkerpop.gremlin.driver.ser.GraphSONMessageSerializerGremlinV1d0, config: { useMapperFromGraph: graph }}
- { className: org.apache.tinkerpop.gremlin.driver.ser.GraphSONMessageSerializerV1d0, config: { useMapperFromGraph: graph }}
processors:
- { className: org.apache.tinkerpop.gremlin.server.op.session.SessionOpProcessor, config: { sessionTimeout: 28800000 }}
metrics: {
graphiteReporter: {enabled: false, interval: 180000}}
threadPoolBoss: 1
maxInitialLineLength: 4096
maxHeaderSize: 8192
maxChunkSize: 8192
maxContentLength: 65536
maxAccumulationBufferComponents: 1024
resultIterationBatchSize: 20000
writeBufferLowWaterMark: 32768
writeBufferHighWaterMark: 65536
ssl: {
enabled: false}
with the following Titan properties:
gremlin.graph=com.thinkaurelius.titan.core.TitanFactory
storage.backend=cassandra
storage.cassandra.keyspace=my_keyspace
storage.hostname=10.1.2.3
storage.username=user
storage.password=password
storage.cassandra.astyanax.local-datacenter=DC2
storage.cassandra.read-consistency-level=LOCAL_QUORUM
storage.cassandra.write-consistency-level=LOCAL_QUORUM
ids.block-size=200000
storage.buffer-size=102400
query.fast-property=true
cache.db-cache=true
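
As a point of comparison, two of those values are well above Titan's shipped defaults, if I'm reading the configuration reference correctly; it may be worth retesting with the defaults to rule them out. The values below are the documented defaults as I understand them, not something verified on this cluster:

# Titan 1.0 defaults per the configuration reference (worth double-checking)
ids.block-size=10000
storage.buffer-size=1024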
The keyspace is defined as:
CREATE KEYSPACE my_keyspace WITH replication = {'class': 'NetworkTopologyStrategy', 'DC1': '3', 'DC2': '3'} AND durable_writes = true;
At the gremlin console, local to the gremlin-server, the following actions each take a noticeable amount of time:
gremlin> :> graph.addVertex('a','b')
==>v[12472]
gremlin> :> graph.addVertex('a','c')
==>v[8232]
gremlin> :> g.V(12472).next().addEdge('e',g.V(8232).next())
==>e[2s7-9mg-27th-6co][12472-e->8232]
As you can see from the ids, this is on an empty graph.
We are using only composite indexes: one unique, the others not. The indexes are built like this:
...
if (mgmt.getPropertyKey("a") == null) {
    name = mgmt.makePropertyKey("a").dataType(String.class).make();
    namei = mgmt.buildIndex("a", Vertex.class).addKey(name).unique().buildCompositeIndex();
    mgmt.setConsistency(namei, ConsistencyModifier.DEFAULT);
}
...
mgmt.commit();
While building the schema we did see the following warning:
7353 [main] WARN com.thinkaurelius.titan.diskstorage.locking.consistentkey.ConsistentKeyLocker - Lock write succeeded but took too long: duration PT0.13S exceeded limit PT0.1S
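
The 0.1 s limit in that warning matches Titan's default lock wait time (storage.lock.wait-time, 100 ms). Since a unique composite index forces a consistent-key lock on every write that touches it, lock latency feeds directly into insert times. As a diagnostic only (the value below is an arbitrary assumption, not a recommendation), the wait time can be raised in titan.properties:

# hedged sketch: lengthen the consistent-key lock wait beyond the 100 ms default
storage.lock.wait-time=300

If the warnings stop but inserts stay slow, locking is probably not the dominant cost.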
Removing replication to the remote DC didn't make a difference, so it appears to be a local issue; yet we see the same behavior in each DC.
Any help, or pointers to how to debug the issue would be greatly appreciated.
Kieran.