Refreshing Neo4j graphs from GraphML using Gremlin


Jim C

Jun 7, 2010, 11:36:35 AM
to Gremlin-users
I am a newbie -- please speak slowly and use small words.

We have a data problem which seems to lend itself to a graph database.
We will be receiving new data plus regular updates to existing data in
XML (or a format which can be easily converted to XML). I got pointed
to Neo4j (cool), found Gremlin (VERY cool), found out about GraphML
(YEESSH!), but then discovered (as documented by Gremlin) that Neo4j
ignores IDs in the GraphML file. So if one loads a GraphML file via
Gremlin with IDs that match IDs in an existing Neo4j graph, new graph
elements will be created.

What I would really like is what TinkerGraph does, which is to update
graph elements when IDs match vs. always creating new graph elements.
Is there a way to convince Gremlin/Neo4j to do this? For my
application, I would be willing to have a required property for every
node and relationship called (for example) "id." I would also be
willing to ensure "id" is unique across my data (or suffer the
consequences if it is not). Performance and capacity are not huge
concerns for me.

Given this, could the load functionality for Neo4j graphs be hacked
such that when loading a GraphML file it first tries to look up a node
with the same value for the "id" property for update, but if not found
creates a new one? Would this require a new "graph framework" using
Blueprints (per Gremlin documentation)? Or am I thinking about this
all wrong? Any vague direction you can provide would be helpful.

Marko Rodriguez

Jun 7, 2010, 12:42:58 PM
to gremli...@googlegroups.com
Hi Jim,

> I am a newbie -- please speak slowly and use small words.

Regardless of whether I write slowly or quickly, I believe that, given this channel of communication, it's up to you to read slowly.

> We have a data problem which seems to lend itself to a graph database.
> We will be receiving new data plus regular updates to existing data in
> XML (or a format which can be easily converted to XML). I got pointed
> to Neo4j (cool), found Gremlin (VERY cool), found out about GraphML
> (YEESSH!), but then discovered (as documented by Gremlin) that Neo4j
> ignores IDs in the GraphML file. So if one loads a GraphML file via
> Gremlin with IDs that match IDs in an existing Neo4j graph, new graph
> elements will be created.

Yes! You are correct. Unfortunately, there is no way to control the IDs in Neo4j. It is my opinion that Neo4j should not even expose the notion of an ID in its API. Users (such as yourself) would then build their own ID scheme on the properties of vertices and edges.

> What I would really like is what TinkerGraph does, which is to update
> graph elements when IDs match vs. always creating new graph elements.
> Is there a way to convince Gremlin/Neo4j to do this? For my
> application, I would be willing to have a required property for every
> node and relationship called (for example) "id." I would also be
> willing to ensure "id" is unique across my data (or suffer the
> consequences if it is not). Performance and capacity are not huge
> concerns for me.

It is very easy to write a GraphML reader/writer for your particular use case. Please see:

http://github.com/tinkerpop/blueprints/tree/master/src/main/java/com/tinkerpop/blueprints/pgm/parser/

You will note how I get around the Neo4j ID issue when reading by storing IDs in an ever-growing hashmap :(. As graphs grow in size, this will eventually cause an OutOfMemoryError, but I can't think of another way to do it... For your particular situation, simply adapt this code to search for a vertex with a particular "id" property.
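
A minimal sketch of the "search by id property, otherwise create" idea, written against hypothetical stand-in types rather than the real Blueprints or Neo4j classes. SketchVertex, SketchGraph, findByExternalId and upsertVertex are illustrative names that do not come from the thread, and the linear scan is an assumption made for brevity (the thread says performance is not a major concern):

```java
import java.util.LinkedHashMap;
import java.util.Map;

/** Hypothetical stand-in for a vertex; NOT the Blueprints Vertex interface. */
final class SketchVertex {
    final long internalId; // assigned by the store, analogous to a Neo4j node id
    final Map<String, Object> properties = new LinkedHashMap<>();

    SketchVertex(long internalId) {
        this.internalId = internalId;
    }
}

/** Hypothetical in-memory store that always assigns its own internal ids. */
final class SketchGraph {
    private final Map<Long, SketchVertex> vertices = new LinkedHashMap<>();
    private long nextId = 0;

    SketchVertex addVertex() {
        SketchVertex v = new SketchVertex(nextId++);
        vertices.put(v.internalId, v);
        return v;
    }

    /** Linear scan for a vertex whose "id" property equals externalId; null if none. */
    SketchVertex findByExternalId(String externalId) {
        for (SketchVertex v : vertices.values()) {
            if (externalId.equals(v.properties.get("id"))) {
                return v;
            }
        }
        return null;
    }

    int vertexCount() {
        return vertices.size();
    }
}

final class UpsertSketch {
    /** Updates the vertex carrying the given "id" property, creating it first if absent. */
    static SketchVertex upsertVertex(SketchGraph graph, String externalId,
                                     Map<String, Object> newProperties) {
        SketchVertex v = graph.findByExternalId(externalId);
        if (v == null) {
            v = graph.addVertex();
        }
        v.properties.putAll(newProperties);
        // Written last so the incoming data cannot overwrite the lookup key.
        v.properties.put("id", externalId);
        return v;
    }
}
```

In an adapted GraphML reader, a call like upsertVertex would take the place of the unconditional "create a new vertex" step each time a node element is parsed.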


> Given this, could the load functionality for Neo4j graphs be hacked
> such that when loading a GraphML file it first tries to look up a node
> with the same value for the "id" property for update, but if not found
> creates a new one? Would this require a new "graph framework" using
> Blueprints (per Gremlin documentation)? Or am I thinking about this
> all wrong? Any vague direction you can provide would be helpful.


As I said before, for your particular use case with an "id" property as a unique identifier, just adapt the Blueprints GraphML reader/writer code (in your project) to handle your situation. The code is small, simple (hopefully), and your solution will be tailored to your particular needs.

Finally, if your graph is on the order of ~1 million vertices and ~1 million edges, you can always use TinkerGraph as your "graph db". It will be in-memory, but it will be fast, and you can use the GraphML reader/writer as your serialization when you load/shut down your application...?

Good luck and please feel free to continue to ask questions on such matters...

Take care,
Marko.

http://markorodriguez.com
http://tinkerpop.com

Jim C

Jun 28, 2010, 1:15:20 PM
to Gremlin-users
Marko,

Thanks for the reply -- after some weeks of fooling with this (and
learning about Maven and doing other unrelated things) I have been
able to get this to work using your suggestion. I took advantage of
your hashmap and I now just read the whole Neo4j graph into the
hashmap before starting to parse the XML, and most of the rest of the
existing logic works. This is of course ugly as now the entire graph
is read into memory -- this will likely work for my application where
I am anticipating a small graph database on a big machine, but it does
not seem like a good general purpose solution. I also needed to do
something similar for edges, so I created another hashmap for the
edges and used similar logic. But a question: why isn't there a Graph
method called "getEdge(Object id)" just like there is a getVertex?
Because of the lack of this, my edges hashmap currently stores the
edges themselves vs. the id of the edges which is really ugly, but I
could not figure out how to easily get to an edge from its id. So it
works functionally, but there are memory concerns, and I have not yet
tried it on larger graphs.
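
The preloading approach described above could look roughly like the following. This is a sketch under assumptions, not Jim's actual code: IndexedElement stands in for a vertex or edge, the List stands in for the existing graph, and the "_id" key used in the usage note is a made-up name for the hidden property. As in the post, the map holds the elements themselves, so memory use grows with the size of the graph:

```java
import java.util.HashMap;
import java.util.List;
import java.util.Map;

/** Hypothetical stand-in for a vertex or an edge; only its properties matter here. */
final class IndexedElement {
    final Map<String, Object> properties = new HashMap<>();
}

final class PreloadIndexSketch {
    /** Reads every existing element once and indexes it by its external id property. */
    static Map<String, IndexedElement> buildIndex(Iterable<IndexedElement> existing,
                                                  String idKey) {
        Map<String, IndexedElement> index = new HashMap<>();
        for (IndexedElement element : existing) {
            Object externalId = element.properties.get(idKey);
            if (externalId != null) {
                index.put(externalId.toString(), element);
            }
        }
        return index;
    }

    /** Returns the indexed element for externalId, creating and registering it if absent. */
    static IndexedElement lookupOrCreate(Map<String, IndexedElement> index,
                                         List<IndexedElement> store,
                                         String idKey, String externalId) {
        IndexedElement element = index.get(externalId);
        if (element == null) {
            element = new IndexedElement();
            element.properties.put(idKey, externalId);
            store.add(element);             // stands in for creating it in the graph
            index.put(externalId, element); // keep the index in step with the store
        }
        return element;
    }
}
```

Under these assumptions, one index would be built for vertices and a second one for edges before parsing starts, for example buildIndex(existingEdges, "_id"), and the parser would then call lookupOrCreate for each element it reads.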

I also updated the GraphMLWriter.java to write out my ids using the
existing id fields for nodes and edges, i.e., I do not write out my
"hidden" property that stores my version of the id. This allows the
GraphML files (I think) to be used also with TinkerGraph and achieve
the same behavior as with Neo4j, and also allows the use of arbitrary
string ids for nodes and edges vs. just those that can be converted to
integers.
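
To illustrate the writer change described above, here is a small sketch that emits one GraphML node element, using the hidden property as the id attribute and leaving it out of the data elements. It is not the Blueprints GraphMLWriter; GraphMLNodeSketch, nodeElement and the hiddenIdKey parameter are hypothetical names, and the output omits the surrounding graphml/graph elements and key declarations:

```java
import java.util.Map;

final class GraphMLNodeSketch {
    /** Escapes the five characters that are special in XML text and attributes. */
    static String escape(String s) {
        return s.replace("&", "&amp;")
                .replace("<", "&lt;")
                .replace(">", "&gt;")
                .replace("\"", "&quot;")
                .replace("'", "&apos;");
    }

    /**
     * Builds a node element whose id attribute comes from the hidden id property.
     * The hidden property itself is not written as a data element.
     */
    static String nodeElement(Map<String, Object> properties, String hiddenIdKey) {
        Object externalId = properties.get(hiddenIdKey);
        if (externalId == null) {
            throw new IllegalArgumentException("missing property: " + hiddenIdKey);
        }
        StringBuilder xml = new StringBuilder();
        xml.append("<node id=\"").append(escape(externalId.toString())).append("\">");
        for (Map.Entry<String, Object> entry : properties.entrySet()) {
            if (entry.getKey().equals(hiddenIdKey)) {
                continue; // already written as the id attribute
            }
            xml.append("<data key=\"").append(escape(entry.getKey())).append("\">")
               .append(escape(String.valueOf(entry.getValue())))
               .append("</data>");
        }
        xml.append("</node>");
        return xml.toString();
    }
}
```

Because the id attribute is written as a plain string, arbitrary string ids pass through unchanged, which matches the motivation given in the post.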

I ran into a few problems in doing this which I posted under separate
threads since they seem to be independent from the problem I was
asking about in this thread. I appreciate your help and am encouraged
by my progress so far -- this is looking more and more like an answer
for our problem.

Thanks,
Jim