> I am a newbie -- please speak slowly and use small words.
Whether I write slowly or quickly, I believe that, given this channel of communication, it's up to you to read slowly.
> We have a data problem which seems to lend itself to a graph database.
> We will be receiving new data plus regular updates to existing data in
> XML (or a format which can be easily converted to XML). I got pointed
> to Neo4j (cool), found Gremlin (VERY cool), found out about GraphML
> (YEESSH!), but then discovered (as documented by Gremlin) that Neo4j
> ignores IDs in the GraphML file. So if one loads a GraphML file via
> Gremlin with IDs that match IDs in an existing Neo4j graph, new graph
> elements will be created.
Yes! You are correct. Unfortunately, there is no way to control the IDs in Neo4j. It is my opinion that Neo4j should not even expose the notion of an ID in its API. Users (such as yourself) would then make use of their own ID framework through the properties of vertices and edges.
> What I would really like is what TinkerGraph does, which is to update
> graph elements when IDs match vs. always creating new graph elements.
> Is there a way to convince Gremlin/Neo4j to do this? For my
> application, I would be willing to have a required property for every
> node and relationship called (for example) "id." I would also be
> willing to ensure "id" is unique across my data (or suffer the
> consequences if it is not). Performance and capacity are not huge
> concerns for me.
It is very easy to write a GraphML reader/writer for your particular use case. Please see:
You will note how I get around the Neo4j ID issue when reading by storing IDs in an ever-growing hashmap :(. As graphs grow in size, this will eventually cause an OutOfMemoryError, but I can't think of another way to do it. For your particular situation, simply adapt this code to search for the vertex that has a particular "id" property.
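To make the hashmap trick concrete, here is a rough sketch. It is not the actual Blueprints reader code: `StoreAssignedGraph`, `Vertex`, `Edge`, and `load` are hypothetical stand-ins I made up for illustration, using only the Java standard library. The store picks its own IDs (as Neo4j does), so the loader keeps a map from the IDs written in the file to the vertices the store created, and uses that map to wire up the edges.

```java
import java.util.ArrayList;
import java.util.HashMap;
import java.util.List;
import java.util.Map;

// Illustrative only: a toy store that assigns its own vertex IDs.
class IdMappingSketch {
    static class Vertex {
        final long storeId; // chosen by the store, not by the file
        Vertex(long storeId) { this.storeId = storeId; }
    }

    static class Edge {
        final Vertex source;
        final Vertex target;
        Edge(Vertex source, Vertex target) { this.source = source; this.target = target; }
    }

    static class StoreAssignedGraph {
        private long nextId = 1000; // arbitrary starting point for this sketch
        final List<Vertex> vertices = new ArrayList<>();
        final List<Edge> edges = new ArrayList<>();

        Vertex addVertex() {
            Vertex v = new Vertex(nextId++);
            vertices.add(v);
            return v;
        }

        Edge addEdge(Vertex source, Vertex target) {
            Edge e = new Edge(source, target);
            edges.add(e);
            return e;
        }
    }

    // vertexIds: IDs as written in the file.
    // edgePairs: {sourceFileId, targetFileId} for each edge in the file.
    static Map<String, Vertex> load(StoreAssignedGraph graph,
                                    String[] vertexIds, String[][] edgePairs) {
        // One entry per vertex read: this map is what grows without bound.
        Map<String, Vertex> fileIdToVertex = new HashMap<>();
        for (String fileId : vertexIds) {
            fileIdToVertex.put(fileId, graph.addVertex());
        }
        // Edges refer to file IDs, so resolve them through the map.
        for (String[] pair : edgePairs) {
            graph.addEdge(fileIdToVertex.get(pair[0]), fileIdToVertex.get(pair[1]));
        }
        return fileIdToVertex;
    }

    public static void main(String[] args) {
        StoreAssignedGraph graph = new StoreAssignedGraph();
        Map<String, Vertex> map = load(graph,
                new String[] {"a", "b"}, new String[][] {{"a", "b"}});
        System.out.println("file id a -> store id " + map.get("a").storeId);
        System.out.println("file id b -> store id " + map.get("b").storeId);
    }
}
```

The map is only needed while the file is being read, but it still holds one entry per vertex, which is where the memory pressure comes from on large graphs.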
> Given this, could the load functionality for Neo4j graphs be hacked
> such that when loading a GraphML file it first tries to look up a node
> with the same value for the "id" property for update, but if not found
> creates a new one? Would this require a new "graph framework" using
> Blueprints (per Gremlin documentation)? Or am I thinking about this
> all wrong? Any vague direction you can provide would be helpful.
As I said before, for your particular use case with an "id" property as a unique identifier, just adapt the Blueprints GraphML reader/writer code (in your project) to handle your situation. The code is small, simple (hopefully), and your solution will be tailored to your particular needs.
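As a rough idea of what that adaptation might look like, here is a minimal "update if found, otherwise create" sketch keyed on the "id" property. Everything in it (`SimpleGraph`, `Vertex`, `findById`, `upsert`) is a hypothetical stand-in written against the Java standard library only, not the Blueprints or Neo4j API. The lookup is a plain linear scan, which I am assuming is acceptable since performance is not a big concern here; in a real store you would want an index on "id".

```java
import java.util.ArrayList;
import java.util.HashMap;
import java.util.List;
import java.util.Map;

// Illustrative only: upsert vertices by a user-managed "id" property.
class UpsertByIdProperty {
    static class Vertex {
        final Map<String, Object> properties = new HashMap<>();
    }

    static class SimpleGraph {
        final List<Vertex> vertices = new ArrayList<>();

        Vertex addVertex() {
            Vertex v = new Vertex();
            vertices.add(v);
            return v;
        }
    }

    // Linear scan for the vertex whose "id" property equals the given value.
    static Vertex findById(SimpleGraph graph, String id) {
        for (Vertex v : graph.vertices) {
            if (id.equals(v.properties.get("id"))) {
                return v;
            }
        }
        return null; // no vertex with that "id" yet
    }

    // Update the matching vertex if one exists; otherwise create a new one.
    static Vertex upsert(SimpleGraph graph, String id, Map<String, Object> props) {
        Vertex v = findById(graph, id);
        if (v == null) {
            v = graph.addVertex();
        }
        v.properties.putAll(props);
        v.properties.put("id", id); // written last so props cannot clobber the key
        return v;
    }

    public static void main(String[] args) {
        SimpleGraph graph = new SimpleGraph();

        Map<String, Object> first = new HashMap<>();
        first.put("name", "marko");
        upsert(graph, "1", first);

        Map<String, Object> second = new HashMap<>();
        second.put("name", "marko a.");
        upsert(graph, "1", second); // same "id": updates instead of creating

        System.out.println("vertices: " + graph.vertices.size());
        System.out.println("name: " + findById(graph, "1").properties.get("name"));
    }
}
```

The same get-or-create step would apply to edges if they also carry an "id" property. If "id" values are not unique, this sketch silently updates whichever match it finds first.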
Finally, if your graph is on the order of 1 million vertices and 1 million edges, you could always use TinkerGraph as your "graph db". It will be in-memory, but it will be fast, and you could use the GraphML reader/writer as your serialization: read the file when your application starts up and write it back out when it shuts down.
Good luck and please feel free to continue to ask questions on such matters...
Take care,
Marko.