I would like to draw attention to a new utility for Blueprints which
was motivated as follows. Lately, I have been faced with the problem
of trying to synchronize graph-y data between a mobile phone and a
desktop application. This is hard not only because the data model I
had in mind, RDF, is complicated, but also because of some basic
requirements (on top of just getting the data to look the same on both
devices):
1) it should be possible to load only a portion of the data on the
phone, and to push and pull changes to that portion without corrupting
the overall graph
2) it should be easy to revert changes, and it would be nice to be
able to branch
3) collaborators should be able to contribute changes to the graph, as well
This morning, it occurred to me that we could have these features in
Blueprints if we just serialize graphs in a way which plays well with
Git. I then spent all day coding, and the result is GitGraph, a
persistent Graph implementation (currently layered on top of
TinkerGraph) which stores its data in a hierarchy of canonically
ordered, diff-friendly plain text files. You can check a GitGraph
directory into GitHub, fork, edit and merge it just as you would a
piece of software. Also cool:
1) you can load subdirectories of a GitGraph as standalone graphs, and
edit them independently of the rest of the graph
2) placing two or more GitGraphs in the same directory creates a
super-GitGraph which you can load as one graph. You can then create
edges which span the two graphs and create new top-level vertices.
You can go back to a view of the individual graphs at any time.
3) no additional API, apart from the GitGraph constructor
I have created a sandbox graph here:
https://github.com/tinkerpop/gitgraph-sandbox
Take a look at the README for a usage example. Pull requests are
welcome :-) At the moment, the GitGraph source is available in a
feature/gitgraph branch of Blueprints:
https://github.com/tinkerpop/blueprints/tree/feature/gitgraph
It will be merged into Blueprints proper or made into a separate
project after we have had some time to experiment with it.
Best,
Josh
Josh,
You crank it! This is a very cool approach to offline long running concurrent edits. If you then can define a baseline below which you transform things into a resulting performant graph from the files, it rocks.
On Thu, Apr 14, 2011 at 6:05 AM, Peter Neubauer
<neubaue...@gmail.com> wrote:
> Josh,
> You crank it! This is a very cool approach to offline long running
> concurrent edits.
Thanks!
> If you then can define a baseline below which you
> transform things into a resulting performant graph from the files, it rocks.
Right now, only TinkerGraph is supported as the actual graph
implementation. However, you'll notice a second GitGraph constructor
(currently private) which lets you pass in any IndexableGraph.
If/when GitGraph can be made to scale (e.g. by replacing its current
memory-intensive operations with disk-based ones), you will be able to
use another persistent graph such as Neo4jGraph for moment-to-moment
storage and only load or save via GitGraph when you want to pull or
push changes.
NOTE: I have moved the sandbox graph into Tinkubator, where I should
have put it in the first place:
https://github.com/tinkerpop/tinkubator/tree/master/gitgraph-sandbox
The other repo will be going away shortly.
Josh
/peter
Question -- What are you pushing -- GraphML or Java serialization of the TinkerGraph object? -- or your own format?
Marko.
Right now, GitGraph uses a plain text format with one vertex, edge, or
property definition per line. That makes it easy to put the lines of
each file in a well-defined order (by sorting) so as to keep diffs
neat. GraphML should also be possible (and would be nice), though more
complicated code-wise. We would need a couple of indices to provide
in-order (based on a Comparator which takes the element ID hierarchy
into account) traversals of vertices and edges (easy) and a customized
GraphMLWriter which orders and formats the XML tree in a deterministic
fashion (seems doable).
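
To illustrate why one definition per line plus sorting keeps diffs neat, here is a minimal sketch. It is not the actual GitGraph code: the class name CanonicalLines and the tab-separated V/E/P line layout are assumptions made up for this example, and the real file syntax may well differ. The point is only that the written file depends on the set of lines, not on the order in which elements were added.

```java
import java.util.ArrayList;
import java.util.Collections;
import java.util.List;

// Hypothetical sketch: CanonicalLines and the V/E/P line layout are
// illustrative assumptions, not the real GitGraph file format.
final class CanonicalLines {

    // One line per vertex definition.
    static String vertexLine(String id) {
        return "V\t" + id;
    }

    // One line per edge definition: id, tail vertex, label, head vertex.
    static String edgeLine(String id, String outV, String label, String inV) {
        return "E\t" + id + "\t" + outV + "\t" + label + "\t" + inV;
    }

    // One line per property definition on a vertex or edge.
    static String propertyLine(String elementId, String key, String value) {
        return "P\t" + elementId + "\t" + key + "\t" + value;
    }

    // Sort a copy of the lines lexicographically and join them, so that the
    // same set of definitions always produces byte-identical file content.
    static String canonicalize(List<String> lines) {
        List<String> sorted = new ArrayList<>(lines);
        Collections.sort(sorted);
        return String.join("\n", sorted) + "\n";
    }
}
```

With this kind of layout, adding a single property to a graph shows up as a single added line in a Git diff, however the in-memory graph happened to iterate its elements.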
Josh
Marko.
Could do that. Note that it wouldn't necessarily write to a single
file, though; it creates a directory tree based on the IDs of vertices
and edges. For example, an edge with the ID "42" would be defined in
the top-level directory, but it might reference a vertex "misc/13"
which is defined in a subdirectory called "misc". If you load the
subdirectory as a GitGraph, that vertex appears as "13" and the
top-level edge is not visible.
That way, an application which only cares about the graph data in
"misc" doesn't need to deal with a monolithic file for the entire
graph. When it is done making changes, it just writes back to "misc"
and its children, ignoring the rest of the tree.
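
The relationship between element IDs and directories described above can be sketched roughly as follows. This is an illustration only, not the GitGraph implementation: the class IdPaths and its methods are hypothetical names, and the sketch assumes that "/" is the only separator and that IDs are otherwise opaque strings.

```java
import java.util.Optional;

// Hypothetical sketch: IdPaths is an illustrative helper, not part of
// GitGraph or Blueprints. It assumes "/" separates directory levels in IDs.
final class IdPaths {

    // Directory in which an element is defined, relative to the graph root.
    // An ID without a slash, such as "42", lives in the top-level directory ("").
    static String directoryOf(String id) {
        int slash = id.lastIndexOf('/');
        return slash < 0 ? "" : id.substring(0, slash);
    }

    // The ID an element has when the given subdirectory is loaded as a
    // standalone graph, or empty if the element is defined outside that
    // subdirectory and is therefore not visible.
    static Optional<String> localId(String id, String subdir) {
        if (subdir.isEmpty()) {
            return Optional.of(id); // loading the root: IDs are unchanged
        }
        String prefix = subdir + "/";
        return id.startsWith(prefix)
                ? Optional.of(id.substring(prefix.length()))
                : Optional.empty();
    }
}
```

Under these assumptions, localId("misc/13", "misc") gives "13", while localId("42", "misc") is empty, matching the example of the top-level edge that is not visible from within "misc".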
Josh