ERROR org.janusgraph.diskstorage.log.kcvs.KCVSLog - Reader thread pool for KCVSLog systemlog ...


tschu...@gmail.com

May 6, 2018, 8:39:35 AM
to JanusGraph users
Greetings all!

Currently running JanusGraph 0.2.0 on top of ScyllaDB + Elastic (without issue for quite some time) on a single EC2 instance.  

However, recently I've been noticing Gremlin Server crashing with the stacktrace below. It first appeared while I was loading a large set of edges. This time, it occurred some time after I issued a query that had to iterate over the entire graph: "g.V().project('t').by('type').collect().groupCount()"

Any ideas as to what could be causing this? Which configuration options should I be looking at? Or am I just pushing the limits of my current single-node setup? There are only about 50,000 nodes currently.

Many thanks!

08:10:16 ERROR org.janusgraph.diskstorage.log.kcvs.KCVSLog  - Reader thread pool for KCVSLog systemlog did not shut down in time - could not clean up or set read markers
08:10:18 WARN  org.janusgraph.diskstorage.log.kcvs.KCVSLog  - Could not read messages for timestamp [2018-05-06T12:05:43.250Z] (this read will be retried)
org.janusgraph.core.JanusGraphException: Could not execute operation due to backend exception
        at org.janusgraph.diskstorage.util.BackendOperation.execute(BackendOperation.java:57)
        at org.janusgraph.diskstorage.util.BackendOperation.execute(BackendOperation.java:159)
        at org.janusgraph.diskstorage.log.kcvs.KCVSLog$MessagePuller.run(KCVSLog.java:717)
        at java.util.concurrent.Executors$RunnableAdapter.call(Executors.java:511)
        at java.util.concurrent.FutureTask.runAndReset(FutureTask.java:308)
        at java.util.concurrent.ScheduledThreadPoolExecutor$ScheduledFutureTask.access$301(ScheduledThreadPoolExecutor.java:180)
        at java.util.concurrent.ScheduledThreadPoolExecutor$ScheduledFutureTask.run(ScheduledThreadPoolExecutor.java:294)
        at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1149)
        at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:624)
        at java.lang.Thread.run(Thread.java:748)
Caused by: org.janusgraph.diskstorage.PermanentBackendException: Permanent failure in storage backend
        at org.janusgraph.diskstorage.cassandra.thrift.CassandraThriftKeyColumnValueStore.convertException(CassandraThriftKeyColumnValueStore.java:263)
        at org.janusgraph.diskstorage.cassandra.thrift.CassandraThriftKeyColumnValueStore.getNamesSlice(CassandraThriftKeyColumnValueStore.java:162)
        at org.janusgraph.diskstorage.cassandra.thrift.CassandraThriftKeyColumnValueStore.getNamesSlice(CassandraThriftKeyColumnValueStore.java:105)
        at org.janusgraph.diskstorage.cassandra.thrift.CassandraThriftKeyColumnValueStore.getSlice(CassandraThriftKeyColumnValueStore.java:94)
        at org.janusgraph.diskstorage.log.kcvs.KCVSLog$MessagePuller$1.call(KCVSLog.java:783)
        at org.janusgraph.diskstorage.log.kcvs.KCVSLog$MessagePuller$1.call(KCVSLog.java:780)
        at org.janusgraph.diskstorage.util.BackendOperation.execute(BackendOperation.java:148)
        at org.janusgraph.diskstorage.util.BackendOperation$1.call(BackendOperation.java:162)
        at org.janusgraph.diskstorage.util.BackendOperation.executeDirect(BackendOperation.java:69)
        at org.janusgraph.diskstorage.util.BackendOperation.execute(BackendOperation.java:55)
        ... 9 more
Caused by: org.apache.thrift.transport.TTransportException: java.net.SocketTimeoutException: Read timed out
        at org.apache.thrift.transport.TIOStreamTransport.read(TIOStreamTransport.java:129)
        at org.apache.thrift.transport.TTransport.readAll(TTransport.java:86)
        at org.apache.thrift.transport.TFramedTransport.readFrame(TFramedTransport.java:129)
        at org.apache.thrift.transport.TFramedTransport.read(TFramedTransport.java:101)
        at org.apache.thrift.transport.TTransport.readAll(TTransport.java:86)
        at org.apache.thrift.protocol.TBinaryProtocol.readAll(TBinaryProtocol.java:429)
        at org.apache.thrift.protocol.TBinaryProtocol.readI32(TBinaryProtocol.java:318)
        at org.apache.thrift.protocol.TBinaryProtocol.readMessageBegin(TBinaryProtocol.java:219)
        at org.apache.thrift.TServiceClient.receiveBase(TServiceClient.java:69)
        at org.apache.cassandra.thrift.Cassandra$Client.recv_set_keyspace(Cassandra.java:608)
        at org.apache.cassandra.thrift.Cassandra$Client.set_keyspace(Cassandra.java:595)
        at org.janusgraph.diskstorage.cassandra.thrift.thriftpool.CTConnectionFactory.makeObject(CTConnectionFactory.java:76)
        at org.janusgraph.diskstorage.cassandra.thrift.thriftpool.CTConnectionFactory.makeObject(CTConnectionFactory.java:43)
        at org.apache.commons.pool.impl.GenericKeyedObjectPool.borrowObject(GenericKeyedObjectPool.java:1220)
        at org.janusgraph.diskstorage.cassandra.thrift.CassandraThriftKeyColumnValueStore.getNamesSlice(CassandraThriftKeyColumnValueStore.java:141)
        ... 17 more
Caused by: java.net.SocketTimeoutException: Read timed out
        at java.net.SocketInputStream.socketRead0(Native Method)
        at java.net.SocketInputStream.socketRead(SocketInputStream.java:116)
        at java.net.SocketInputStream.read(SocketInputStream.java:171)
        at java.net.SocketInputStream.read(SocketInputStream.java:141)
        at java.io.BufferedInputStream.fill(BufferedInputStream.java:246)
        at java.io.BufferedInputStream.read1(BufferedInputStream.java:286)
        at java.io.BufferedInputStream.read(BufferedInputStream.java:345)
        at org.apache.thrift.transport.TIOStreamTransport.read(TIOStreamTransport.java:127)
        ... 31 more
08:10:27 WARN  org.janusgraph.util.system.IOUtils  - Failed closing org.janusgraph.diskstorage.Backend@79df028c
org.janusgraph.core.JanusGraphException: Could not execute operation due to backend exception
        at org.janusgraph.diskstorage.util.BackendOperation.execute(BackendOperation.java:57)
        at org.janusgraph.diskstorage.util.BackendOperation.execute(BackendOperation.java:159)
        at org.janusgraph.diskstorage.log.kcvs.KCVSLog.writeSetting(KCVSLog.java:838)
        at org.janusgraph.diskstorage.log.kcvs.KCVSLog.close(KCVSLog.java:307)
        at org.janusgraph.diskstorage.log.kcvs.KCVSLogManager.close(KCVSLogManager.java:246)
        at org.janusgraph.diskstorage.Backend.close(Backend.java:548)
        at org.janusgraph.util.system.IOUtils.closeQuietly(IOUtils.java:63)
        at org.janusgraph.graphdb.database.StandardJanusGraph.closeInternal(StandardJanusGraph.java:232)
        at org.janusgraph.graphdb.database.StandardJanusGraph.access$600(StandardJanusGraph.java:89)
        at org.janusgraph.graphdb.database.StandardJanusGraph$ShutdownThread.start(StandardJanusGraph.java:822)
        at java.lang.ApplicationShutdownHooks.runHooks(ApplicationShutdownHooks.java:102)
        at java.lang.ApplicationShutdownHooks$1.run(ApplicationShutdownHooks.java:46)
        at java.lang.Shutdown.runHooks(Shutdown.java:123)
        at java.lang.Shutdown.sequence(Shutdown.java:167)
        at java.lang.Shutdown.exit(Shutdown.java:212)
        at java.lang.Terminator$1.handle(Terminator.java:52)
        at sun.misc.Signal$1.run(Signal.java:212)
        at java.lang.Thread.run(Thread.java:748)
Caused by: org.janusgraph.diskstorage.PermanentBackendException: Permanent failure in storage backend
        at org.janusgraph.diskstorage.cassandra.thrift.CassandraThriftKeyColumnValueStore.convertException(CassandraThriftKeyColumnValueStore.java:263)
        at org.janusgraph.diskstorage.cassandra.thrift.CassandraThriftStoreManager.mutateMany(CassandraThriftStoreManager.java:315)
        at org.janusgraph.diskstorage.cassandra.thrift.CassandraThriftKeyColumnValueStore.mutateMany(CassandraThriftKeyColumnValueStore.java:258)
        at org.janusgraph.diskstorage.cassandra.thrift.CassandraThriftKeyColumnValueStore.mutate(CassandraThriftKeyColumnValueStore.java:254)
        at org.janusgraph.diskstorage.log.kcvs.KCVSLog$4.call(KCVSLog.java:841)
        at org.janusgraph.diskstorage.log.kcvs.KCVSLog$4.call(KCVSLog.java:838)
        at org.janusgraph.diskstorage.util.BackendOperation.execute(BackendOperation.java:148)
        at org.janusgraph.diskstorage.util.BackendOperation$1.call(BackendOperation.java:162)
        at org.janusgraph.diskstorage.util.BackendOperation.executeDirect(BackendOperation.java:69)
        at org.janusgraph.diskstorage.util.BackendOperation.execute(BackendOperation.java:55)
        ... 17 more
Caused by: org.apache.thrift.transport.TTransportException: java.net.SocketTimeoutException: Read timed out
        at org.apache.thrift.transport.TIOStreamTransport.read(TIOStreamTransport.java:129)
        at org.apache.thrift.transport.TTransport.readAll(TTransport.java:86)
        at org.apache.thrift.transport.TFramedTransport.readFrame(TFramedTransport.java:129)
        at org.apache.thrift.transport.TFramedTransport.read(TFramedTransport.java:101)
        at org.apache.thrift.transport.TTransport.readAll(TTransport.java:86)
        at org.apache.thrift.protocol.TBinaryProtocol.readAll(TBinaryProtocol.java:429)
        at org.apache.thrift.protocol.TBinaryProtocol.readI32(TBinaryProtocol.java:318)
        at org.apache.thrift.protocol.TBinaryProtocol.readMessageBegin(TBinaryProtocol.java:219)
        at org.apache.thrift.TServiceClient.receiveBase(TServiceClient.java:69)
        at org.apache.cassandra.thrift.Cassandra$Client.recv_set_keyspace(Cassandra.java:608)
        at org.apache.cassandra.thrift.Cassandra$Client.set_keyspace(Cassandra.java:595)
        at org.janusgraph.diskstorage.cassandra.thrift.thriftpool.CTConnectionFactory.makeObject(CTConnectionFactory.java:76)
        at org.janusgraph.diskstorage.cassandra.thrift.thriftpool.CTConnectionFactory.makeObject(CTConnectionFactory.java:43)
        at org.apache.commons.pool.impl.GenericKeyedObjectPool.borrowObject(GenericKeyedObjectPool.java:1220)
        at org.janusgraph.diskstorage.cassandra.thrift.CassandraThriftStoreManager.mutateMany(CassandraThriftStoreManager.java:307)
        ... 25 more
Caused by: java.net.SocketTimeoutException: Read timed out
        at java.net.SocketInputStream.socketRead0(Native Method)
        at java.net.SocketInputStream.socketRead(SocketInputStream.java:116)
        at java.net.SocketInputStream.read(SocketInputStream.java:171)
        at java.net.SocketInputStream.read(SocketInputStream.java:141)
        at java.io.BufferedInputStream.fill(BufferedInputStream.java:246)
        at java.io.BufferedInputStream.read1(BufferedInputStream.java:286)
        at java.io.BufferedInputStream.read(BufferedInputStream.java:345)
        at org.apache.thrift.transport.TIOStreamTransport.read(TIOStreamTransport.java:127)
        ... 39 more

HadoopMarc

May 7, 2018, 3:30:39 PM
to JanusGraph users
Hi Tim,

Your query "g.V().project('t').by('type').collect().groupCount()" is not valid Gremlin. See my notes below on the modern graph:

gremlin> g = TinkerFactory.createModern().traversal()
==>graphtraversalsource[tinkergraph[vertices:6 edges:6], standard]
gremlin> g.V().project('t').by('name')
==>[t:marko]
==>[t:vadas]
==>[t:lop]
==>[t:josh]
==>[t:ripple]
==>[t:peter]

gremlin> g.V().project('t').by('name').collect().groupCount()
No signature of method: java.util.ArrayList.groupCount() is applicable for argument types: () values: []
Type ':help' or ':h' for help.
Display stack trace? [yN]n

gremlin> g.V().project('t').by('name').groupCount()
==>[[t:ripple]:1,[t:peter]:1,[t:vadas]:1,[t:lop]:1,[t:marko]:1,[t:josh]:1]

gremlin> g.V().project('t').by('name').getClass()
==>class org.apache.tinkerpop.gremlin.process.traversal.dsl.graph.DefaultGraphTraversal

gremlin> g.V().project('t').by('name').collect().getClass()
==>class java.util.ArrayList
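
For what it's worth, if the intent was simply to count vertices by a property, the collect() isn't needed at all; a by() modulator on groupCount() does it in one step. A sketch (assuming 'type' is a property key present on your vertices):

gremlin> g.V().groupCount().by('type')

This keeps everything as a single traversal, so it can be executed remotely and optimized by the provider, unlike collect(), which pulls results into a plain java.util.ArrayList on the client side.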

HTH,    Marc



On Sunday, May 6, 2018 at 14:39:35 UTC+2, Tim Schultz wrote:

tschu...@gmail.com

May 21, 2018, 3:00:27 PM
to JanusGraph users
Hi Marc,

Thank you kindly for the response.  Yes, I was transcribing the query from memory and mixed it up with another I was testing at the time.

I'm still running into this issue, though, and I don't think it's any one specific query causing it.

I was hoping someone had some insights into:

1.) What exactly is being stored in the systemlog table?
2.) Why would such an exception result in a complete and catastrophic failure of the backend? When I run into this error, the server instance locks up completely - I can't even SSH into it.
3.) Whether anyone has experienced similar issues with ScyllaDB and JanusGraph on Amazon EC2 before.

I'm using the default ScyllaDB setup, so maybe something needs to be tweaked on that end.

Thanks!

Regards,

Tim

Ted Wilmes

May 23, 2018, 5:11:42 PM
to JanusGraph users
Hi Tim,
JanusGraph instances use the systemlog table to coordinate index state transitions between each other and evictions of graphs from the configured graph factory cache. Every 5 seconds, a thread polls the systemlog for updates from other nodes. JanusGraph instances only communicate to each other through the storage backend, not directly, and this is an example of that.
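
As an aside, you can see which JanusGraph instances are currently registered against the backend (the instance coordination this messaging supports) through the management API. A quick console check, sketched here assuming a graph reference named `graph`:

gremlin> mgmt = graph.openManagement()
gremlin> mgmt.getOpenInstances()
gremlin> mgmt.rollback()

If a crashed server leaves a stale entry behind, mgmt.forceCloseInstance(id) can clear it, but be careful only to do that for instances that are genuinely dead.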

This is only a guess, but the fact that you're seeing an error related to the systemlog is really just because it happens to be attempting a read while the single Scylla instance is overloaded. As far as the server instance locking up, I know Scylla generally pegs the CPUs, but if it's overloaded enough that it isn't responding to read requests, perhaps that is the source of the lockup, or perhaps the JanusGraph JVM is also pegging the CPU with a never-ending GC.

When you say that you are still running into this issue but it isn't any specific query, is it happening only on long-running queries, or on small ones as well?

Here are a few other things to try. Scylla will use all of your processors unless you tell it not to, which could lead to issues when you're running two other applications (ES and Janus) on the same box. See the CPU section on this page for how to change that behavior. Alternatively, you could look at moving Janus and ES off onto one or two other instances of their own.

--Ted

tschu...@gmail.com

Jun 4, 2018, 9:09:02 AM
to JanusGraph users
Hi Ted,

Thank you for the reply. This has some good ideas about where to look for the source of the behavior. It's a bit hard to diagnose since the entire system becomes unresponsive, but this gives me some good direction.

I also forwarded this to the ScyllaDB team to have a look.

I was using an older version of the ScyllaDB Docker image - I pulled the latest version and built JanusGraph 0.3.0. So far, I'm unable to replicate this behavior with the new stack.

Thanks again!

Regards,

Tim