system_properties table understanding


Antriksh Shah

Apr 4, 2018, 6:36:23 AM
to JanusGraph users
Hello everyone,

We wanted to understand the system_properties table entries.

We are currently facing storage backend failures after about 2 hours of real-time data ingestion. We have Cassandra 1.5.1 as the backend, we are connecting over Thrift, and we are on JanusGraph v0.1.1.
In our Cassandra-backed graph, we are seeing a spike in read/write request timeouts.

One thing we have observed is that resetting the system_properties table makes the system work again for some time.

When we converted the hex entries of the system_properties table to ASCII, we see a ton of entries of the form
key:"system-registration.0ae08f8c16712-xxxxxx.startup-time".
value: Á€ZÄ€ 3 3 (Non readable)€

We wanted to know: are these system-registration entries needed? If these entries accumulate over time, would they slow down the system?
Is there any workaround to increase the read/write request timeout value from the JanusGraph config?

David Pitera

Apr 4, 2018, 11:57:39 AM
to JanusGraph users
Hey,

> When we converted the hex entries of the system_properties table to ASCII, we see a ton of entries of the form
> key: "system-registration.0ae08f8c16712-xxxxxx.startup-time".

This means that you have a lot of "graphs" connected to the same Cassandra keyspace. I tend to think of "graphs" as databases that would naturally have their own keyspace, so in this sense you really have a lot of "instances" of the same graph (because they are all connected to the same keyspace). I like to think of an instance as an instance running on a JanusGraph node (since JanusGraph is horizontally scalable).

So to clarify, we have JanusGraph nodes, and we can open different graphs on different JanusGraph instances. This is important because it seems you are opening the same graph (let's just say denoted by the keyspace name) on a lot of different instances.

There is a possibility that you actually have "a ton" of JanusGraph nodes in your cluster, in which case all of these instances are valid. I suspect, however, that this is not the case (that your JanusGraph cluster is not that large), and that you are seeing what I call "phantom instances". Whenever a graph on a given JanusGraph node is closed, we remove its instance from the table above: https://github.com/JanusGraph/janusgraph/blob/v0.1.1/janusgraph-core/src/main/java/org/janusgraph/graphdb/database/StandardJanusGraph.java#L202

However, if the graph or server or instance is not shut down properly (say, for example, the server is restarted for operational reasons using a kill -9), then it will never get removed from that list, and you will have "phantom instances" hanging around. That would explain why the list is super long, and it can be detrimental to your read/write latencies because every time a graph is committed, it reads that list: https://github.com/JanusGraph/janusgraph/blob/v0.1.1/janusgraph-core/src/main/java/org/janusgraph/graphdb/database/management/ManagementSystem.java#L239 (getOpenInstancesInternal() will read that list).

I have written a bit about phantom instances before because they can also get in the way of index management: https://stackoverflow.com/questions/40585417/titan-db-ignoring-index/40591478#40591478 (#5 in my answer). Note that I have pushed changes to JanusGraph that should help deal with phantom instances: https://github.com/JanusGraph/janusgraph/pull/937/files . This alone will not fix the problem, because JanusGraph cannot decide for itself which instances are "valid", so you can write a daemon living outside the server that keeps track of valid instances and ensures the list stays correct.
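Something along these lines is the core of what that daemon would do; a rough sketch in Java (it assumes the ids you evict really belong to dead nodes, since forceCloseInstance removes them unconditionally, and the properties file path is just an example):

import org.janusgraph.core.JanusGraph;
import org.janusgraph.core.JanusGraphFactory;
import org.janusgraph.core.schema.JanusGraphManagement;

// Sketch only: list the registered instance ids and evict everything except
// the instance doing the cleanup (reported with a "(current)" suffix).
JanusGraph graph = JanusGraphFactory.open("conf/janusgraph-cassandra.properties"); // example path
JanusGraphManagement mgmt = graph.openManagement();
for (String instanceId : mgmt.getOpenInstances()) {
    if (!instanceId.contains("(current)")) {
        mgmt.forceCloseInstance(instanceId);   // removes the stale registration row
    }
}
mgmt.commit();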


The other possibility is that you do indeed have "a ton" of JanusGraph nodes in your cluster and all of these instances are valid, in which case you could be slowing down Cassandra by 1. issuing that read on every commit, as above, and 2. eating up resources in Cassandra by keeping "a ton" of valid open connections to it at the same time.

Antriksh Shah

Apr 4, 2018, 12:37:15 PM
to JanusGraph users
Hey thanks a ton for the prompt reply.

I am trying to summarise the problem with the info you sent above; please correct me if I am wrong.

I have 50 executors, each using a common JanusGraph object, opening a JanusGraph transaction and committing it in batches of, say, 5 minutes, and I am running the job for, say, 30 minutes. In all there would be 50*6 transactions opened, but at any point in time I should only have 50 connection entries in the system_properties table. If I see the entries accumulating, then I have a bug that I need to fix.

Also, is there a way through Gremlin that I can figure out the number of connections (phantom or valid) present on JanusGraph at a given moment?

David Pitera

Apr 4, 2018, 12:43:47 PM
to Antriksh Shah, JanusGraph users
> I have 50 executors, each using a common JanusGraph object

Based on this description, I _believe_ that you should only have _one_ entry in that table. AFAIK the list is only added to when a StandardJanusGraph is instantiated, not when a transaction is instantiated, but you are welcome to read the relevant code and check for yourself :)
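Roughly, the distinction looks like this (a sketch; the properties file path is just an example):

import org.janusgraph.core.JanusGraph;
import org.janusgraph.core.JanusGraphFactory;
import org.janusgraph.core.JanusGraphTransaction;

// Sketch: opening the graph is what writes a system-registration entry;
// opening transactions against it does not add further entries.
JanusGraph graph = JanusGraphFactory.open("conf/janusgraph-cassandra.properties"); // one registration
JanusGraphTransaction tx = graph.newTransaction();  // no new registration
tx.commit();
graph.close();  // a clean close removes the registration again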

> Also, is there a way through Gremlin that I can figure out the number of connections (phantom or valid) present on JanusGraph at a given moment?


Antriksh Shah

Apr 5, 2018, 9:27:22 AM
to JanusGraph users
Hey, I tried to drill down into the issue. Your answer helped me understand the error.

We have 50 executors, each having its own Janus object. (My bad when I said earlier that they share the same Janus object; Janus objects are not serialisable, hence they cannot be shared across executors.) The reason we were getting a ton of connection entries was that whenever two Janus connections attempted to commit at the exact same second, both got a storage backend error. That error was not being handled, which is why the phantom instances were adding up.
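A rough sketch in Java of the kind of guard that was missing, assuming each executor opens its own graph object as described above (the path is just a placeholder):

import org.janusgraph.core.JanusGraph;
import org.janusgraph.core.JanusGraphFactory;
import org.janusgraph.core.JanusGraphTransaction;

// Sketch only: make sure the graph is always closed, even when a commit fails,
// so its system-registration entry is removed instead of lingering as a phantom instance.
JanusGraph graph = JanusGraphFactory.open("conf/janusgraph-cassandra.properties"); // example path
try {
    JanusGraphTransaction tx = graph.newTransaction();
    try {
        // ... add vertices/edges for this partition ...
        tx.commit();
    } catch (Exception e) {
        tx.rollback();            // give up on (or retry) this batch
        throw e;
    }
} finally {
    graph.close();                // a clean close removes the instance registration
}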

Now I want to resolve why two Janus objects, each having its own transaction, cannot write to Cassandra at the exact same second. If you have any input on this, please do share.

Thanks again for your help.

David Pitera

Apr 5, 2018, 9:32:11 AM
to Antriksh Shah, JanusGraph users
What is the full error? My guess is that you are modifying the schema concurrently and that won't leave Cassandra happy. This would happen if you create elements with propertyKeys that have not been created and committed in previous transactions.
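In other words, the schema should be defined and committed once, up front, before the executors start writing; a rough sketch in Java (the property key name and path are just placeholders):

import org.janusgraph.core.JanusGraph;
import org.janusgraph.core.JanusGraphFactory;
import org.janusgraph.core.schema.JanusGraphManagement;

// Sketch only: create property keys ahead of time so the executors never
// trigger implicit (and concurrent) schema creation during the bulk load.
JanusGraph graph = JanusGraphFactory.open("conf/janusgraph-cassandra.properties"); // example path
JanusGraphManagement mgmt = graph.openManagement();
if (mgmt.getPropertyKey("itemId") == null) {          // "itemId" is a placeholder name
    mgmt.makePropertyKey("itemId").dataType(String.class).make();
}
mgmt.commit();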


Antriksh Shah

Apr 5, 2018, 12:05:43 PM
to JanusGraph users


Below is the error stack trace. What we are doing here is:

1. Create a general schema for Janus.
2. Bulk upload data: distribute the data to executors, and each executor initiates transactions to push data into Cassandra via Janus.
org.janusgraph.core.JanusGraphException: Could not execute operation due to backend exception
	at org.janusgraph.diskstorage.util.BackendOperation.execute(BackendOperation.java:57)
	at org.janusgraph.diskstorage.util.BackendOperation.execute(BackendOperation.java:159)
	at org.janusgraph.diskstorage.configuration.backend.KCVSConfiguration.set(KCVSConfiguration.java:153)
	at org.janusgraph.diskstorage.configuration.backend.KCVSConfiguration.set(KCVSConfiguration.java:130)
	at org.janusgraph.diskstorage.configuration.ModifiableConfiguration.set(ModifiableConfiguration.java:40)
	at org.janusgraph.graphdb.database.StandardJanusGraph.<init>(StandardJanusGraph.java:159)
	at org.janusgraph.core.JanusGraphFactory.open(JanusGraphFactory.java:107)
	at org.janusgraph.core.JanusGraphFactory.open(JanusGraphFactory.java:97)
	at org.janusgraph.core.JanusGraphFactory$Builder.open(JanusGraphFactory.java:152)
	at com.walmart.cpid.graph.UploadVertex$1.call(UploadVertex.java:102)
	at com.walmart.cpid.graph.UploadVertex$1.call(UploadVertex.java:98)
	at org.apache.spark.api.java.JavaRDDLike$$anonfun$foreachPartition$1.apply(JavaRDDLike.scala:225)
	at org.apache.spark.api.java.JavaRDDLike$$anonfun$foreachPartition$1.apply(JavaRDDLike.scala:225)
	at org.apache.spark.rdd.RDD$$anonfun$foreachPartition$1$$anonfun$apply$35.apply(RDD.scala:927)
	at org.apache.spark.rdd.RDD$$anonfun$foreachPartition$1$$anonfun$apply$35.apply(RDD.scala:927)
	at org.apache.spark.SparkContext$$anonfun$runJob$5.apply(SparkContext.scala:1857)
	at org.apache.spark.SparkContext$$anonfun$runJob$5.apply(SparkContext.scala:1857)
	at org.apache.spark.scheduler.ResultTask.runTask(ResultTask.scala:66)
	at org.apache.spark.scheduler.Task.run(Task.scala:89)
	at org.apache.spark.executor.Executor$TaskRunner.run(Executor.scala:227)
	at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1142)
	at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:617)
	at java.lang.Thread.run(Thread.java:745)
Caused by: org.janusgraph.diskstorage.PermanentBackendException: Permanent failure in storage backend
	at org.janusgraph.diskstorage.cassandra.thrift.CassandraThriftKeyColumnValueStore.convertException(CassandraThriftKeyColumnValueStore.java:263)
	at org.janusgraph.diskstorage.cassandra.thrift.CassandraThriftStoreManager.mutateMany(CassandraThriftStoreManager.java:315)
	at org.janusgraph.diskstorage.cassandra.thrift.CassandraThriftKeyColumnValueStore.mutateMany(CassandraThriftKeyColumnValueStore.java:258)
	at org.janusgraph.diskstorage.cassandra.thrift.CassandraThriftKeyColumnValueStore.mutate(CassandraThriftKeyColumnValueStore.java:254)
	at org.janusgraph.diskstorage.locking.consistentkey.ExpectedValueCheckingStore.mutate(ExpectedValueCheckingStore.java:79)
	at org.janusgraph.diskstorage.configuration.backend.KCVSConfiguration$2.call(KCVSConfiguration.java:158)
	at org.janusgraph.diskstorage.configuration.backend.KCVSConfiguration$2.call(KCVSConfiguration.java:153)
	at org.janusgraph.diskstorage.util.BackendOperation.execute(BackendOperation.java:148)
	at org.janusgraph.diskstorage.util.BackendOperation$1.call(BackendOperation.java:162)
	at org.janusgraph.diskstorage.util.BackendOperation.executeDirect(BackendOperation.java:69)
	at org.janusgraph.diskstorage.util.BackendOperation.execute(BackendOperation.java:55)
	... 22 more
Caused by: TimedOutException(acknowledged_by:1, acknowledged_by_batchlog:false)
	at org.apache.cassandra.thrift.Cassandra$atomic_batch_mutate_result$atomic_batch_mutate_resultStandardScheme.read(Cassandra.java:29624)
	at org.apache.cassandra.thrift.Cassandra$atomic_batch_mutate_result$atomic_batch_mutate_resultStandardScheme.read(Cassandra.java:29592)
	at org.apache.cassandra.thrift.Cassandra$atomic_batch_mutate_result.read(Cassandra.java:29526)
	at org.apache.thrift.TServiceClient.receiveBase(TServiceClient.java:78)
	at org.apache.cassandra.thrift.Cassandra$Client.recv_atomic_batch_mutate(Cassandra.java:1108)
	at org.apache.cassandra.thrift.Cassandra$Client.atomic_batch_mutate(Cassandra.java:1094)
	at org.janusgraph.diskstorage.cassandra.thrift.CassandraThriftStoreManager.mutateMany(CassandraThriftStoreManager.java:310) 
... 31 more 

David Pitera

Apr 5, 2018, 12:09:34 PM
to Antriksh Shah, JanusGraph users
Looks like a timeout exception at the Cassandra level; please ensure Cassandra is functioning properly (check CPU/memory utilization and GC patterns) during your work, and perhaps look into increasing timeouts / tuning configuration options if necessary.


Antriksh Shah

Apr 5, 2018, 12:34:10 PM
to JanusGraph users
We are using JG 0.1.1 and Cassandra 1.5.1 with Thrift. On the server side our read/write request timeout is 2ms. Since it is a managed service, we are unable to increase that value. From the client side (through Janus) we could not find any configuration that allows us to override that value. We tried going into the code hoping to hardcode the value in the JG code and recompile it, but we were not able to find the correct place inside the Thrift code.

Do you have any suggestions on how we can increase the read/write request timeout?