Batch querying with Gremlin graph traversals

738 views
Skip to first unread message

Corey Nolet

unread,
Feb 1, 2014, 8:53:07 PM2/1/14
to gremli...@googlegroups.com
Hello,

I've written a sharded entity/triple-store in Accumulo and I'm adding graph traversal componentry to it currently. The column-oriented design of Accumulo seems to align very well with the property graph and edge traversal model. I'm most interested in running path traversal algorithms where I'm starting a node and finding all possible paths that exist to another node. The interfaces for blueprints are very clean and well thought out. Before I can start using this to solve real problems, there are two things of which I am currently concerned. I'm using tinkerpop 2.5.0 so it'd be great if the new blueprints3 had fixes for this (I'd also be willing to work on fixes for the baseline if necessary).

1. Unless I'm missing something, My intuition is telling me that gremlin crawls the graph from isolated node to isolated node instead of batching level to level. In order words, I notice that gremlin will visit a node, find all the edges that match some filter, then visit each node on the other end of the edges and repeat the process given the pipeline conditions that have been specified. If I'm correct that it works this way, that means I'm making a lot of small calls down to Accumulo batch scanners. I've sharded my data so the scans are very fast, but they are also resource intensive and making a lot of small scans isn't as optimal as making a couple large ones. Does gremlin have any possibility of doing batching of edges/vertices throughout the propagation of levels in the graphs? It would be really easy to implement in Accumulo the logic where I could start at a vertex, find all the edges, grab all the vertices on the other side of the edges in one large batch (or partition and batch in the case where # of edges may be an unscalable sie), then grab all the matching edges for those vertices in a batch and repeat. I could easily implement this logic on my own for more custom analytics but i like the ease of use that gremlin brings.

2. Accumulo's base client for performing scans needs to be closed when it is done being used. Specifically in the case where I'm trying to talk to many tablets in parallel, it can be resource intensive and can very quickly overload a client if the Scanners aren't being closed when they are done. I've written an AutoCloseIterable that I can decorate the scanners with that will close them upon the last item being iterated, but I'm not sure if gremlin guarantees that every Iterable<Edge> and Iterable<Vertex> traversed will be iterated fully. In fact,

Thanks alot!

Marko Rodriguez

unread,
Feb 2, 2014, 10:25:54 AM2/2/14
to gremli...@googlegroups.com
Hi,


I've written a sharded entity/triple-store in Accumulo and I'm adding graph traversal componentry to it currently. The column-oriented design of Accumulo seems to align very well with the property graph and edge traversal model.

Great. Have you seen, by chance, http://titan.thinkaurelius.com ?

I'm most interested in running path traversal algorithms where I'm starting a node and finding all possible paths that exist to another node. The interfaces for blueprints are very clean and well thought out.

Thank you.

Before I can start using this to solve real problems, there are two things of which I am currently concerned. I'm using tinkerpop 2.5.0 so it'd be great if the new blueprints3 had fixes for this (I'd also be willing to work on fixes for the baseline if necessary).

1. Unless I'm missing something, My intuition is telling me that gremlin crawls the graph from isolated node to isolated node instead of batching level to level. In order words, I notice that gremlin will visit a node, find all the edges that match some filter, then visit each node on the other end of the edges and repeat the process given the pipeline conditions that have been specified. If I'm correct that it works this way, that means I'm making a lot of small calls down to Accumulo batch scanners. I've sharded my data so the scans are very fast, but they are also resource intensive and making a lot of small scans isn't as optimal as making a couple large ones. Does gremlin have any possibility of doing batching of edges/vertices throughout the propagation of levels in the graphs? It would be really easy to implement in Accumulo the logic where I could start at a vertex, find all the edges, grab all the vertices on the other side of the edges in one large batch (or partition and batch in the case where # of edges may be an unscalable sie), then grab all the matching edges for those vertices in a batch and repeat. I could easily implement this logic on my own for more custom analytics but i like the ease of use that gremlin brings.

Please see http://faunus.thinkaurelius.com which will have most of its functionality merged into TinkerPop3 as Gremlin OLAP. You can read more about Gremlin OLAP on this list here:

2. Accumulo's base client for performing scans needs to be closed when it is done being used. Specifically in the case where I'm trying to talk to many tablets in parallel, it can be resource intensive and can very quickly overload a client if the Scanners aren't being closed when they are done. I've written an AutoCloseIterable that I can decorate the scanners with that will close them upon the last item being iterated, but I'm not sure if gremlin guarantees that every Iterable<Edge> and Iterable<Vertex> traversed will be iterated fully. In fact,

It does not. We have CloseableIterable in Blueprints2, but not in Blueprints3… One way people have gotten around this is that they make their iterator implementation check if there is hasNext() after the next() and if not, close().

Thanks for your contributions to the graph space Corey,
Marko.

Corey Nolet

unread,
Feb 2, 2014, 9:44:49 PM2/2/14
to gremli...@googlegroups.com
Marko,

Awesome on Faunus and the Gremlin OLAP! I believe I saw you speaking about this on an interview you did for YOW!2012. Backing the Gremlin OLAP / Blueprints3 graph with Accumulo would be perfect for my needs- if it could be possible. What do you think?

I've checked out the HBase/Accumulo back-ends for Titan and my main complaint thus far seems to be that Accumulo's back-end was a direct copy from the HBase back-end (which seems to be lacking the ability to take advantage of Accumulo's added features). Specifically, I take advantage here of Accumulo's ability to run unions/intersections of properties on the server-side via an iterator stack that allows custom code to seek through the files on each tablet at their whim. This significantly speeds up query time and reduces the need for any joins or post-process filtering to occur on the client. 

Is there any intention for Blueprints3 to support CloseableIterable in the future? My AutoCloseIterable does just what you've mentioned but resources can still leak if I can't guarantee full iteration in every case. I'll use Blueprints2's CloseableIterable for now.


Thanks again!

Marko Rodriguez

unread,
Feb 3, 2014, 10:40:03 AM2/3/14
to gremli...@googlegroups.com
Hi,

Awesome on Faunus and the Gremlin OLAP! I believe I saw you speaking about this on an interview you did for YOW!2012. Backing the Gremlin OLAP / Blueprints3 graph with Accumulo would be perfect for my needs- if it could be possible. What do you think?

Yea, you could definitely do that. TinkerPop3 will be coming out Summer 2014 so please keep an eye on its developments. We issue numerous RFCs so people can help shape the APIs, functionality, etc. 

I've checked out the HBase/Accumulo back-ends for Titan and my main complaint thus far seems to be that Accumulo's back-end was a direct copy from the HBase back-end (which seems to be lacking the ability to take advantage of Accumulo's added features). Specifically, I take advantage here of Accumulo's ability to run unions/intersections of properties on the server-side via an iterator stack that allows custom code to seek through the files on each tablet at their whim. This significantly speeds up query time and reduces the need for any joins or post-process filtering to occur on the client. 

You can bring that up on the Aurelius mailing list at this location:


Is there any intention for Blueprints3 to support CloseableIterable in the future? My AutoCloseIterable does just what you've mentioned but resources can still leak if I can't guarantee full iteration in every case. I'll use Blueprints2's CloseableIterable for now.

No. In Gremlin, people usually do not reference their iterators (Gremlin is one big iterator). They typically iterate their data and move onto the next query (asking for a close() is too much to ask). In situations where they only iterated a subset of the data, they would need reference to the GraphQuery/VertexQuery iterator to close it and these objects are typically not "seen" when working with Gremlin. Is there a way that you could solve this without having explicit iterator shutdown at the Blueprints level? Thoughts from others?

Marko.




Thanks again!

On Sunday, February 2, 2014 10:25:54 AM UTC-5, Marko A. Rodriguez wrote:
Hi,


I've written a sharded entity/triple-store in Accumulo and I'm adding graph traversal componentry to it currently. The column-oriented design of Accumulo seems to align very well with the property graph and edge traversal model.

Great. Have you seen, by chance, http://titan.thinkaurelius.com ?

I'm most interested in running path traversal algorithms where I'm starting a node and finding all possible paths that exist to another node. The interfaces for blueprints are very clean and well thought out.

Thank you.

Before I can start using this to solve real problems, there are two things of which I am currently concerned. I'm using tinkerpop 2.5.0 so it'd be great if the new blueprints3 had fixes for this (I'd also be willing to work on fixes for the baseline if necessary).

1. Unless I'm missing something, My intuition is telling me that gremlin crawls the graph from isolated node to isolated node instead of batching level to level. In order words, I notice that gremlin will visit a node, find all the edges that match some filter, then visit each node on the other end of the edges and repeat the process given the pipeline conditions that have been specified. If I'm correct that it works this way, that means I'm making a lot of small calls down to Accumulo batch scanners. I've sharded my data so the scans are very fast, but they are also resource intensive and making a lot of small scans isn't as optimal as making a couple large ones. Does gremlin have any possibility of doing batching of edges/vertices throughout the propagation of levels in the graphs? It would be really easy to implement in Accumulo the logic where I could start at a vertex, find all the edges, grab all the vertices on the other side of the edges in one large batch (or partition and batch in the case where # of edges may be an unscalable sie), then grab all the matching edges for those vertices in a batch and repeat. I could easily implement this logic on my own for more custom analytics but i like the ease of use that gremlin brings.

Please see http://faunus.thinkaurelius.com which will have most of its functionality merged into TinkerPop3 as Gremlin OLAP. You can read more about Gremlin OLAP on this list here:

2. Accumulo's base client for performing scans needs to be closed when it is done being used. Specifically in the case where I'm trying to talk to many tablets in parallel, it can be resource intensive and can very quickly overload a client if the Scanners aren't being closed when they are done. I've written an AutoCloseIterable that I can decorate the scanners with that will close them upon the last item being iterated, but I'm not sure if gremlin guarantees that every Iterable<Edge> and Iterable<Vertex> traversed will be iterated fully. In fact,

It does not. We have CloseableIterable in Blueprints2, but not in Blueprints3… One way people have gotten around this is that they make their iterator implementation check if there is hasNext() after the next() and if not, close().

Thanks for your contributions to the graph space Corey,
Marko.


--
You received this message because you are subscribed to the Google Groups "Gremlin-users" group.
To unsubscribe from this group and stop receiving emails from it, send an email to gremlin-user...@googlegroups.com.
For more options, visit https://groups.google.com/groups/opt_out.

Matthias Broecheler

unread,
Feb 4, 2014, 1:15:30 AM2/4/14
to gremli...@googlegroups.com

No. In Gremlin, people usually do not reference their iterators (Gremlin is one big iterator). They typically iterate their data and move onto the next query (asking for a close() is too much to ask). In situations where they only iterated a subset of the data, they would need reference to the GraphQuery/VertexQuery iterator to close it and these objects are typically not "seen" when working with Gremlin. Is there a way that you could solve this without having explicit iterator shutdown at the Blueprints level? Thoughts from others?


The way Titan does this is registering the iterator with the transactional context and closing all open ones when the transaction is committed or rolled back.

Cheers,
Matthias



--
Matthias Broecheler
http://www.matthiasb.com

Corey Nolet

unread,
Feb 4, 2014, 7:54:37 AM2/4/14
to gremli...@googlegroups.com
I was also considering using finalize() in the iterable to close it if it hasn't already been closed. That just seems so hacky to me, though.


--
You received this message because you are subscribed to a topic in the Google Groups "Gremlin-users" group.
To unsubscribe from this topic, visit https://groups.google.com/d/topic/gremlin-users/tOfjbwOnky8/unsubscribe.
To unsubscribe from this group and all its topics, send an email to gremlin-user...@googlegroups.com.

Marko Rodriguez

unread,
Feb 4, 2014, 3:25:03 PM2/4/14
to gremli...@googlegroups.com
Hi Matthias,

The way Titan does this is registering the iterator with the transactional context and closing all open ones when the transaction is committed or rolled back.

Ah! Thats smart. We should do the same for the Neo4j implementation.

Question --- this seems like a general solution to this problem for all graph databases. Is this correct? For non-transactional databases, is there ever a time when closing an iterator would be necessary? (TinkerGraph doesn't require it as the GC will clean up dangling Iterators).

Thoughts?,
Marko.

Matthias Broecheler

unread,
Feb 8, 2014, 9:01:07 PM2/8/14
to gremli...@googlegroups.com
You would only need to explicitly close an iterator if the iterator holds on to some sort of resource which in the case of a graph database will mostly likely be I/O related. However, I think this is orthogonal from the issue of transactions. TinkerGraph doesn't need it because there is no I/O involved.

Corey Nolet

unread,
Feb 9, 2014, 10:41:53 AM2/9/14
to gremli...@googlegroups.com
Agreed,

In my case, for each query, I'm initiating a batch queueing client that will be pulling data from multiple tablet servers at once into a local queue. The client has a thread pool used to maintain the sockets and it needs to be closed when it is done. My graph is read-only currently so there's no need for write transactions at the graph layer.




You received this message because you are subscribed to a topic in the Google Groups "Gremlin-users" group.
To unsubscribe from this topic, visit https://groups.google.com/d/topic/gremlin-users/tOfjbwOnky8/unsubscribe.
To unsubscribe from this group and all its topics, send an email to gremlin-user...@googlegroups.com.

Corey Nolet

unread,
Feb 9, 2014, 8:43:36 PM2/9/14
to gremli...@googlegroups.com

I just wanted a confirmation that iterators being controlled @ the gremlin layer would be exhausted completely. I'm okay telling clients to cast their iterables into closeable, when they are returned from gremlin, and reclaiming the resources when they are done. However, if I'm running a complex traversal in gremlin that requires using several different iterables "behind the scenes", the clients won't even be exposed to the iterables to be able to close them. That's the case I'm worried about.

More specifically, I think you've pretty much answered my question that there is no guarantee that gremlin will be closing all the iterables when pipes like the loop() closure determine a stopping point before a specific iterable is exhausted, right? In that case, i'll need to determine how I know that gremlin no longer needs the iterable so that I can tell it to reclaim the resources that it's holding- maybe it's a simple finalize() method on the iterable that'll call close() upon the next garbage collection. Or, if possible, maybe I could attach couple of lines to gremlin underneath to do an "instanceof" and close the iterable when it knows it's done.


Reply all
Reply to author
Forward
0 new messages