How To : Build Implied Edges : SQL Join

83 views
Skip to first unread message

Fred Eisele

unread,
Mar 13, 2019, 1:25:38 PM3/13/19
to Gremlin-users
I have a set of Package nodes (i.e. their label is "Package") each having properties "id" and "parent".
The "parent" property value references the "id" of parent node.

In SQL parlance the "Package:parent" is a foreign-key to the "Package:id".

SELECT * FROM Package AS p, Package AS c WHERE p.id = c.parent;

I wish to construct edges, labeled "has_parent", from the child nodes to their parents.

I have tried the following:

gts
 
.V().hasLabel("Package").has("package_id").as("parent")
 
.V().hasLabel("Package").has("parent_id").as("child")
 
.where("parent", eq("child")).by("package_id").by("parent_id")
 
.addE("has_parent").from("child").to("parent")
 
.iterate();

...but it does not work.


Fred Eisele

unread,
Mar 13, 2019, 2:10:34 PM3/13/19
to Gremlin-users

...I lied it does work but it is slower than I would like.
Any advice on how to improve the performance?

Daniel Kuppitz

unread,
Mar 13, 2019, 6:20:26 PM3/13/19
to gremli...@googlegroups.com
There's probably no faster way to do it after the data has been imported. If the data is coming from a SQL database, then I would consider to reimport everything with pre-determined edges.

Cheers,
Daniel


On Wed, Mar 13, 2019 at 11:10 AM Fred Eisele <fredric...@gmail.com> wrote:

...I lied it does work but it is slower than I would like.
Any advice on how to improve the performance?

--
You received this message because you are subscribed to the Google Groups "Gremlin-users" group.
To unsubscribe from this group and stop receiving emails from it, send an email to gremlin-user...@googlegroups.com.
To view this discussion on the web visit https://groups.google.com/d/msgid/gremlin-users/0b54cc7e-04e9-4006-b8d1-4cbd2cec7d1d%40googlegroups.com.
For more options, visit https://groups.google.com/d/optout.

Josh Perryman

unread,
Mar 14, 2019, 3:48:59 PM3/14/19
to Gremlin-users
A SQL-style of join is a very poor use of a graph db engine. Like Daniel suggests joins should be pre-computed and the edges added at write-time / data I jest time.

This is by necessity and design. Edges are basically materialized joins. Graph databases are optimized for them, a disk or cache read operation. Relational databases are optimized for joins, a query-time computer operation.

It is usually significantly cheaper to pre-compute the edges in a separate engine before loading data than to do so after data is loaded into the graph. The exception to this is when an edge is determined based on a multi-hop path through the graph. For that use case a graph dB is best.

Stephen Mallette

unread,
Mar 15, 2019, 7:42:00 AM3/15/19
to gremli...@googlegroups.com
I think that this issue comes up enough that it should be documented as an anti-pattern in the Gremlin Recipes:


Josh, I hope you don't mind if we copy/paste a bit of what you said into our documentation but I think that your comments provide the basis for most of what needs to be said on the matter. If you'd like to submit a pull request, please let us know, otherwise I can just get it in there for the 3.3.6/3.4.1 release for next week.

Thanks

--
You received this message because you are subscribed to the Google Groups "Gremlin-users" group.
To unsubscribe from this group and stop receiving emails from it, send an email to gremlin-user...@googlegroups.com.

Fred Eisele

unread,
Mar 15, 2019, 9:02:13 AM3/15/19
to Gremlin-users
So, the recommendation is to use the SQL engine to perform the join.
I see that this helps a bit as one of the vertices, call it 'parent', can be created at the same time as the edge 'has_parent'.
Is there a way to keep the other vertex 'child' handy?

It seem like the graph database should perform about as well as the SQL database if there are indices on the fields being joined?

pieter martin

unread,
Mar 15, 2019, 9:42:59 AM3/15/19
to gremli...@googlegroups.com
I don't agree that it is a anti pattern. All that is happening here is the creating of edges. Saying it needs to be pre-calculated on import is saying the graph is a data warehouse.

In the system I work on it is a frequent requirement to association and unassociate disparate vertices, runtime.

It is however a performance killer as both the vertices have to be loaded into memory before the edge can be created. If some bulk feature is not supplied for doing this the latency can kill the performance and loading all of vertices can be heavy on memory.

Not sure how you are connecting to the graph but perhaps try loading the vertices into memory first and then add the edge, using some bulk load feature if possible. It could be that your gremlin query itself is unoptimized. For Sqlg at least the suggested manner would be a order of magnitude faster than the gremlin query as you can use custom bulk features to deal with this scenario.

If its embedded Neo4j is generally super fast when all you are doing is adding edges.

Cheers
Pieter

Stephen Mallette

unread,
Mar 15, 2019, 9:57:02 AM3/15/19
to gremli...@googlegroups.com
I think of it as an anti-pattern in the sense that folks tend to expect the join operation to occur as fast as a SQL-join and it never does. Sure, if you have a TinkerGraph with a few thousand edges all sitting in memory you can execute the join in Gremlin and get a result but that's almost never the case people are looking for. Typically, it involves graphs with millions or billions of edges and to attempt that with an OLTP traversal just never satisfies. Many times the need to do this sort of operation on graphs of this size are expressed to be part of the data loading phase where the raw data exists in a joinable state in a relational database and yet the attempt is still made to do an edgeless join in the graph. My mind goes to anti-pattern for that sort of use case. 

There are other use cases, like a local edge inference, that might be attempted in OLTP style with some Gremlin, but the case needs to be outlined clearly so that instead of having this conversation repeatedly we can just copy/paste a link. If folks think "anti-pattern" is too strong a term then perhaps it's just a new recipe that has some things to watch out for listed in it. 

--
You received this message because you are subscribed to the Google Groups "Gremlin-users" group.
To unsubscribe from this group and stop receiving emails from it, send an email to gremlin-user...@googlegroups.com.

Josh Perryman

unread,
Mar 15, 2019, 10:35:27 AM3/15/19
to Gremlin-users
A lot of this is specific to the db engine that is being used. Different implementations have different capabilities.

To that point, most of the graph databases that I've worked with don't do in-memory joins anywhere near as efficiently as the relational databases I've worked with. (Pieter's point about Neo4j being the main exception.) The relational database join operation isn't just a matter of having an index on the columns joined, there's a lot more involved and the implementations are specific to how the database handles the on-disk persistence, caching, join strategies, etc. 

Without knowing which graph database you are using, it is hard to express more than generalities. In general, graph database engines are not optimized for in-memory joins in the same way that relational database engines are. Therefore, a helpful optimization when bulk importing data is to extract it from the source in such a way that the obvious edges are pre-computed. 

This assumes that the source is a relational database where the expense of such join operations is very small. If, however, the source were a document database with no join functionality, then forming the edges in the graph database may be the better option. (That isn't to say that document databases can't have join functionality, only that some don't.)

I think that those are the safest general guidelines. Building a transactional pipeline may be different and the cost of the join operation within the graph may fall within the target performance window. The particular mix of data engines in the persistence layer, and also general staff capabilities, or even access to other systems could impact my opinions on the matter. 

I recently rebuilt a set of data-import CSVs using command line join in a bash shell because: 1) I changed the partitioning scheme 2) I didn't have access to the source systems 3) setting up a local relational db in that env was more of a hassle than using the command line.  I would never recommend that approach as a first option, and in fact more recently I had to do the same type of operation, but had Postgres already in place so used it. 

Hope that helps.

-Josh

Josh Perryman

unread,
Mar 15, 2019, 10:42:21 AM3/15/19
to Gremlin-users
Stephen, feel free to pull in any of my comments here and edit as you see best. I'm not expecting to have time to do so before the releases. Though I'm leaning toward Pieter's point that the term anti-pattern may be too strong. But given the structure of that page, it is probably the best place to put it. 

Perhaps in a couple of weeks I can submit a PR with this and other learnings. Seems like a safe place for me to make my first contribution to the TinkerPop code base. 

Thanks!

-Josh 

Stephen Mallette

unread,
Mar 15, 2019, 10:52:41 AM3/15/19
to gremli...@googlegroups.com
> Perhaps in a couple of weeks I can submit a PR with this and other learnings. Seems like a safe place for me to make my first contribution to the TinkerPop code base. 

If you're interested in writing up something on this topic (and possibly others) I won't try to stuff anything in there for release. the docs are pretty easy to work with as the are just all asciidocs:


Just yell if you have any questions - Thanks

Reply all
Reply to author
Forward
0 new messages