Restructuring a Production Graph?

James Thornton

unread,

Sep 3, 2011, 9:57:39 AM9/3/11

to gremli...@googlegroups.com

What are the recommended methods for restructuring graphs in production?

For example, say you store raw input from users as nodes, and you run algorithms on the raw input to slice and dice the data, make inferences and associations based on your initial algorithms.

Over time you improve your algorithms and find better ways of structuring the relationships. What are the best ways to restructure the graph on a live system?

One approach would be to keep an event log for each user action and replay it to rebuild the graph (http://martinfowler.com/bliki/MemoryImage.html). You could keep two DBs and switch over to the new version.

Or is it better to just prune the induced graph?

- James

Daniel Quest

unread,

Sep 4, 2011, 7:07:49 PM9/4/11

to gremli...@googlegroups.com

Hey James,

This also interests me. Here are some ideas/topics that are related:
1) Security - who can read,write, and so on. User groups?
2) Users - is there a way to have nodes and edges owned by a user?
Algorithm 1 could be initiated by user 1, algo 2 by user 2 -- would
this solve your problem?
3) Enforce date stamps on every node/edge then just pick the most
recent... perhaps even cache the last n versions.
4) Defer all decisions to the underlying graph db.
5) Implement the idea of a 'multi-graph'. I will define a multigraph
as a set of graphs that can share nodes/edges. UserX,UserY, and UserZ
may have interest in different graphs than User1,User2, and User3.

Sorry I am not really answering your question directly, I just feel
that production use has many issues that you don't really see in a
proof of concept.

-Daniel

Joshua Shinavier

unread,

Sep 5, 2011, 12:29:07 PM9/5/11

to gremli...@googlegroups.com

Hi guys,

Graph DBs based on event logging are quite impressive. IIRC, Freebase
[1] uses such a DB and is able to recreate the momentary state of its
graph at any point in time using an exhaustive history of user
actions. These logs also happen to be very useful for social network
analysis. It certainly would be interesting to explore something like
this in Blueprints.

What we do have, at the present time, is the ability to export
individual graphs to version-controlled files, and to roll back to
previous versions of the graph by importing. Graph diffs and merges
are also possible (though some of the relevant code is currently in a
Blueprints branch [2]), so in James's example, you might be able to
remove the "inferred" elements while keeping any subsequent
manually-added elements... possibly even more straightforward than it
would be using event logs.

On Sun, Sep 4, 2011 at 7:07 PM, Daniel Quest <daniel...@gmail.com> wrote:
[...]

> This also interests me. Here are some ideas/topics that are related:
> 1) Security - who can read,write, and so on. User groups?

Yep. I would also like to see user authentication, at the Rexster
level. User identity could then be used for access control to
specific graphs, and in extension-specific ways.

> 2) Users - is there a way to have nodes and edges owned by a user?

That sounds like a whole lot of extra metadata. Graph-level metadata
makes more sense to me.

> Algorithm 1 could be initiated by user 1, algo 2 by user 2 -- would
> this solve your problem?
> 3) Enforce date stamps on every node/edge then just pick the most
> recent... perhaps even cache the last n versions.

That would be a nice (and quick) way to provide time-constrained views
of the graph structure. Of course, it wouldn't work for property
values.

> 4) Defer all decisions to the underlying graph db.
> 5) Implement the idea of a 'multi-graph'. I will define a multigraph
> as a set of graphs that can share nodes/edges. UserX,UserY, and UserZ
> may have interest in different graphs than User1,User2, and User3.

Yes. This goes back to the idea of graph views or derived/virtual
graphs. Again, I would tend to think in terms of graph-level metadata
and filters. For example, in a project of mine [4], I give each
vertex a "weight" and a "sharability" property value, which range from
0 to 1. If you're not interested in vertices below a given weight, or
you're not authorized to see vertices below a given sharability value,
then the excluded vertices are simply invisible to your application.
It would probably be worthwhile to start incorporating this sort of
logic into a Graph implementation (FilterGraph?). One other
possibility for a "multigraph" is a Graph impl which wraps multiple
subordinate graphs, providing unified iterators over their vertices
and edges and also joining elements using some sort of identity
criterion (e.g. having the same value for a specific property, like
"uri").

Josh

[1] http://www.freebase.com/
[2] https://github.com/tinkerpop/blueprints/tree/graphml-ids
[3] http://groups.google.com/group/gremlin-users/browse_thread/thread/1b8342d09f8502d4/68abefe30c07356d
[4] https://github.com/joshsh/myotherbrain

James Thornton

unread,

Sep 5, 2011, 1:06:08 PM9/5/11

to gremli...@googlegroups.com

> Graph diffs and merges
> are also possible (though some of the relevant code is currently in a
> Blueprints branch [2]), so in James's example, you might be able to
> remove the "inferred" elements while keeping any subsequent
> manually-added elements... possibly even more straightforward than it
> would be using event logs.

What if you had multiple graph versions? -- a base graph that users write to that only stores raw/manual inputs, and one or more algorithmically-enhanced graphs that users read from.

This would allow you to test different versions of your algorithms on a subset of users and switch over to new versions as your algorithms improve. For this to work, you would need two sequence sources -- one for the manual elements and one for the algorithmically-enhanced elements.

- James

Joshua Shinavier

unread,

Sep 5, 2011, 1:24:32 PM9/5/11

to gremli...@googlegroups.com

On Mon, Sep 5, 2011 at 1:06 PM, James Thornton <james.t...@gmail.com> wrote:
[...]

> What if you had multiple graph versions? -- a base graph that users write to
> that only stores raw/manual inputs, and one or more algorithmically-enhanced
> graphs that users read from.

Sure, that's another solution.

> This would allow you to test different versions of your algorithms on a
> subset of users and switch over to new versions as your algorithms improve.
> For this to work, you would need two sequence sources -- one for the manual
> elements and one for the algorithmically-enhanced elements.

What this suggests to me is a write-to-all API which wraps multiple
graphs, while individual graphs can still be read from or written to
individually. Do you agree?

Josh

> - James
>

James Thornton

unread,

Sep 5, 2011, 1:45:33 PM9/5/11

to gremli...@googlegroups.com

> What this suggests to me is a write-to-all API which wraps multiple
> graphs, while individual graphs can still be read from or written to
> individually. Do you agree?

Maybe, esp for a generic solution, or if you're using Neo4j, you could use the High Availability server:

"With this [High Availability] milestone, we are proud to for the first time release a version of the Highly Available Neo4j cluster to the community. This component turns Neo4j into a system with a Write-Master and a number of Read-Slaves. Upon mutating operations, at least the master and one slave are committing the transaction before propagating out the changes to the other slaves. In case of the write-master failing, the remaining slaves will automatically elect a new master. Also, the whole cluster infrastructure is transparent to the code that uses the Neo4j API, so that all slaves and masters can be written to and read from transparently. Your code does not need to be aware of the type of HA setup it is running on" (http://news.neo4j.org/2010/12/neo4j-12-milestone-5-reference-manual.html).

Marko and Stephen have recently added Neo4j High Availability support to Blueprints and Rexster so a Neo4j-only approach might be to configure read-only mirrors that have different algorithms/stored procedures.

- James

Peter Neubauer

unread,

Sep 5, 2011, 1:50:05 PM9/5/11

to gremli...@googlegroups.com

Guys,

that sounds like a very interesting experiment. Also, we are working on adding read-only slaves to the configs that cannot be written to, especially to support this kind of setup, where you have production parts of the cluster, and some other slaves that are used for OLAP etc.

Cheers,

/peter neubauer

GTalk: neubauer.peter
Skype    peter.neubauer
Phone    +46 704 106975
LinkedIn   http://www.linkedin.com/in/neubauer
Twitter http://twitter.com/peterneubauer

http://www.neo4j.org    - Your high performance graph database.
http://startupbootcamp.org/ - Öresund - Innovation happens HERE.
http://www.thoughtmade.com - Scandinavia's coolest Bring-a-Thing party.

James Thornton

unread,

Sep 5, 2011, 2:51:31 PM9/5/11

to gremli...@googlegroups.com

For either the write-to-all API or the HA approach to work, Rexster would need the ability to execute stored procedures and triggers (https://github.com/tinkerpop/rexster/issues/86).

Stephen, do you have ideas on how to implement triggers in Rexster?

- James

stephen mallette

unread,

Sep 5, 2011, 2:57:05 PM9/5/11

to gremli...@googlegroups.com

Making EventGraph available in rexster might help do the job of triggers. There's an issue out there for that.

Russell Jurney

unread,

Sep 5, 2011, 3:03:15 PM9/5/11

to gremli...@googlegroups.com

I store derived graphs as json in Voldedmort. Sometimes as objects in memcached.

Russell Jurney

twitter.com/rjurney

russell...@gmail.com

datasyndrome.com

James Thornton

unread,

Sep 5, 2011, 3:05:28 PM9/5/11

to gremli...@googlegroups.com

On Monday, September 5, 2011 12:24:32 PM UTC-5, Joshua Shinavier wrote:

What this suggests to me is a write-to-all API which wraps multiple
graphs, while individual graphs can still be read from or written to
individually. Do you agree?

A few weeks ago we were discussing using RabbitMQ to write to all instances simultaneously (https://groups.google.com/d/topic/gremlin-users/oNCiAGmRivA/discussion).

I would be down to working on that API if others want to go that route.

- James

Russell Jurney

unread,

Sep 5, 2011, 3:07:16 PM9/5/11

to gremli...@googlegroups.com

I do this, too - only write to graphs from an individual worker that owns that graph for writing. Reads changes from a queue. Works well.

Russell Jurney

twitter.com/rjurney

russell...@gmail.com

datasyndrome.com

Russell Jurney

unread,

Sep 5, 2011, 3:14:23 PM9/5/11

to gremli...@googlegroups.com

You point out the interesting questions in distributed graph
applications, especially regarding multigraphs. I think the answer is
commonly, that you have many representations of graphs flying around -
some per user, some subgraphs shared between users, some global
graphs. Some are updated in realtime, some are updated in batch. I
see global, batch generated, summarized graphs acting as indexes for
atomic level user graphs.

Think of a database with tables and indexes and joins. In their
place, I see atomic graphs and aggregate graphs and walks (between
graphs).

Russell Jurney
twitter.com/rjurney
russell...@gmail.com
datasyndrome.com

James Thornton

unread,

Sep 5, 2011, 7:12:31 PM9/5/11

to gremli...@googlegroups.com

On Monday, September 5, 2011 1:57:05 PM UTC-5, stephen mallette wrote:

Making EventGraph available in rexster might help do the job of triggers. There's an issue out there for that.

I added a ticket for triggers here: https://github.com/tinkerpop/rexster/issues/160

- James

Reply all

Reply to author

Forward