Hi, I'm interested in providing a persistent implementation of the underlying graph representation for networkx. I'm not talking about serializing a graph to and from storage. I'm talking about storing all the node/edge/property information in a storage implementation, transparent to networkx, such that when networkx algorithms traverse nodes, edges, etc. That information is retrieved from storage/cache using my dictionary implementation.
On Tue, Mar 20, 2012 at 8:55 AM, project2501 <darreng5...@gmail.com> wrote: > Hi, > I'm interested in providing a persistent implementation of the underlying > graph representation for networkx. > I'm not talking about serializing a graph to and from storage. I'm talking > about storing all the node/edge/property information in > a storage implementation, transparent to networkx, such that when networkx > algorithms traverse nodes, edges, etc. > That information is retrieved from storage/cache using my dictionary > implementation.
> Is it possible? Any tips appreciated!
It's definitely possible and we have discussed it before.
More generally it would make sense to use one of the graph databases. I've asked the community to suggest an appropriate one, but so far nobody has replied. I am willing to consider re-engineering parts the NetworkX core to make it work with a persistent database solution if that is necessary.
- Tinkerpop has a general interface to graph databases and Bulbs (bulbflow) is a python API to the REST interfaces of the graph server. However, this is my last suggestion since it involves a lot of moving parts. - Another option I thought was to use a memcached or memcachedb server with python client as the basis for a persistent dictionary used by networkx as its data structure. The graph is represented using adjacency matrices instead of direct references. - Separate from or underneath memcached can be mongodb, which autoshards across a network and autoroutes. wordnic has implemented graphing functions over mongodb. And mongodb scales horizontally.
It would be very interesting to see if any of these robust, auto-scaling mechanisms can underly networkx structures transparently such that retrieval from persistence and caching are handled automatically. Yet avoids bottlenecked performance issues (both memcached and mongo scale well).
This would be a relatively easy implementation I think. All of the above have python bindings.
On Tuesday, March 20, 2012 11:22:13 AM UTC-4, A Hagberg wrote:
> On Tue, Mar 20, 2012 at 8:55 AM, project2501 wrote: > > Hi, > > I'm interested in providing a persistent implementation of the > underlying > > graph representation for networkx. > > I'm not talking about serializing a graph to and from storage. I'm > talking > > about storing all the node/edge/property information in > > a storage implementation, transparent to networkx, such that when > networkx > > algorithms traverse nodes, edges, etc. > > That information is retrieved from storage/cache using my dictionary > > implementation.
> > Is it possible? Any tips appreciated!
> It's definitely possible and we have discussed it before.
> More generally it would make sense to use one of the graph databases. > I've asked the community to suggest an appropriate one, but so far > nobody has replied. I am willing to consider re-engineering parts the > NetworkX core to make it work with a persistent database solution if > that is necessary.
I've had great luck with SQLite and using it as storage for NetworkX. Obviously, I currently write the entire graph structure to and from SQLite and RAM and all the algorithms occur in RAM. I think it would be great to have a persistent implementation as discussed below.
On Tue, Mar 20, 2012 at 11:22 AM, Aric Hagberg <aric.hagb...@gmail.com> wrote: > On Tue, Mar 20, 2012 at 8:55 AM, project2501 <darreng5...@gmail.com> wrote: >> Hi, >> I'm interested in providing a persistent implementation of the underlying >> graph representation for networkx. >> I'm not talking about serializing a graph to and from storage. I'm talking >> about storing all the node/edge/property information in >> a storage implementation, transparent to networkx, such that when networkx >> algorithms traverse nodes, edges, etc. >> That information is retrieved from storage/cache using my dictionary >> implementation.
>> Is it possible? Any tips appreciated!
> It's definitely possible and we have discussed it before.
> More generally it would make sense to use one of the graph databases. > I've asked the community to suggest an appropriate one, but so far > nobody has replied. I am willing to consider re-engineering parts the > NetworkX core to make it work with a persistent database solution if > that is necessary.
> Aric
> -- > You received this message because you are subscribed to the Google Groups "networkx-discuss" group. > To post to this group, send email to networkx-discuss@googlegroups.com. > To unsubscribe from this group, send email to networkx-discuss+unsubscribe@googlegroups.com. > For more options, visit this group at http://groups.google.com/group/networkx-discuss?hl=en.
Yeah, because for some scales of graph, they won't fit in RAM anymore
or on a single machine.
That's the size I'm pursuing code for.
The general logic behind the data structure would go something like
this:
1. When a node/edge or whatever is to be de-referenced by its id, the
local memory cache is consulted.
2. If it doesn't reside in memory cache, then its queried from offline
storage (say mongodb). Mongo will take care of
identifying the server it actually resides on.
3. That node/edge is pulled into the cache where its properties are
available.
So consider a depth first search algorithm. It would only need the
starting node in the cache, along with
its out and in vertex information (edges). From there, as networkx
consults its internal matrix, the de-referencing
process causes lookups to occur transparently. This can mean
consulting a large DHT or routing to other machines,
i.e. crawling.
networkx only sees its usual internal data structure.
Which classes should I look at to implement this? I'm not all that
knowledgable on networkx code.
On Mar 20, 1:21 pm, Tim Howard <tghow...@gmail.com> wrote:
> I've had great luck with SQLite and using it as storage for NetworkX.
> Obviously, I currently write the entire graph structure to and from
> SQLite and RAM and all the algorithms occur in RAM. I think it would
> be great to have a persistent implementation as discussed below.
> Tim
> On Tue, Mar 20, 2012 at 11:22 AM, Aric Hagberg <aric.hagb...@gmail.com> wrote:
> > On Tue, Mar 20, 2012 at 8:55 AM, project2501 <darreng5...@gmail.com> wrote:
> >> Hi,
> >> I'm interested in providing a persistent implementation of the underlying
> >> graph representation for networkx.
> >> I'm not talking about serializing a graph to and from storage. I'm talking
> >> about storing all the node/edge/property information in
> >> a storage implementation, transparent to networkx, such that when networkx
> >> algorithms traverse nodes, edges, etc.
> >> That information is retrieved from storage/cache using my dictionary
> >> implementation.
> >> Is it possible? Any tips appreciated!
> > It's definitely possible and we have discussed it before.
> > More generally it would make sense to use one of the graph databases.
> > I've asked the community to suggest an appropriate one, but so far
> > nobody has replied. I am willing to consider re-engineering parts the
> > NetworkX core to make it work with a persistent database solution if
> > that is necessary.
> > Aric
> > --
> > You received this message because you are subscribed to the Google Groups "networkx-discuss" group.
> > To post to this group, send email to networkx-discuss@googlegroups.com.
> > To unsubscribe from this group, send email to networkx-discuss+unsubscribe@googlegroups.com.
> > For more options, visit this group athttp://groups.google.com/group/networkx-discuss?hl=en.
> Which classes should I look at to implement this? I'm not all that > knowledgable on networkx code.
The first place to look would be the Graph class: networkx/classes/graph.py Right now the API and the data storage are fairly interlaced for improved performance. We are experimenting with partial separations of the graph interface from the data structure, e.g. using a scipy sparse matrix representation instead of a dict-of-dict-of-dict.
> The general logic behind the data structure would go something like > this:
> 1. When a node/edge or whatever is to be de-referenced by its id, the > local memory cache is consulted. > 2. If it doesn't reside in memory cache, then its queried from offline > storage (say mongodb). Mongo will take care of > identifying the server it actually resides on. > 3. That node/edge is pulled into the cache where its properties are > available.
It sounds like you might want to replace the dicts used for adj lists with a subclass of dict that overrides __getitem__ with a method that checks the internal storage and if nothing is found goes to the database. Something like a dict-of-DB_adj_dict-of-DB_egdedata_dict
The idea of a persistent storage for networkx graphs is a good one, and has been brought up many times- for many people this will mean just writing a large graph into some format which is easy to store and transport to other machines or applications. SQLite would seem to be ideal for this job, and I'm sure several people have already written their own readers and writers for sqlite.
To change the backend of networkx from a dict of dicts to something else, which implements persistent storage and is scalable across multiple processors is another job altogether. It has been suggested to implement a native backend in for eg. boost or igraph or something, but it seems that you won't get much of a speedup doing that, unless you also rewrite the algorithms. Since the beauty of networkx is the python algorithms- you may as well then just use another package from the outset. I've been looking at neo4j (http://neo4j.org/) for very large graphs, and while it certainly scales very well, it is really missing good tools for network analysis. There are python bindings for neo4j ( http://docs.neo4j.org/chunked/snapshot/python-embedded.html) which need Jpype (http://jpype.sourceforge.net/) since neo4j is in java- it would be an easy job to put together readers and writers which would load and store networkx graphs in neo4j... probably a few have already done this?
I think it would also be possible to use neo4j as a backend for networkx- neo4j has Nodes and Relationships which would map nicely to nodes and edges in networkx. The Nodes and Relationships have property maps which could store any attributes of nodes/edges. Neo4j would handle the memory caching itself, so you would just send an access request (modify __getitem__ in graph.py) for the nodes/edges and let neo4j consult its stores and figure out how to return the data to you. While this might work for small graphs- its not really how the scalable graph databases are designed to work. For example: every write operation has to be wrapped in a transaction- this will kill your performance if you do it for every individual write, but is manageable if you batch the transactions together. This will likely require most of the algorithms to be reworked. The advantages of neo4j probably come from its indexes- you would have to figure out how to use these efficiently in the networkx algorithms. Many algorithms in networkx start with a list [] of nodes- this is not going to work either and would likely have to be modified to use the Traversal framework ( http://docs.neo4j.org/chunked/stable/tutorial-traversal-concepts.html). You might also look at Gremlin (https://github.com/tinkerpop/gremlin/wiki) to get an idea for an efficient way of doing network analysis on distributed graphs.
It would be nice to see how far you can get with a neo4j (or similar backend). Modifying the graph class in networkx to use a different backend should be straightforward enough, but ensuring all (or most) the algorithms still work, and work efficiently will be the hard part.
On Tue, Mar 20, 2012 at 5:57 PM, project2501 <darreng5...@gmail.com> wrote: > Yeah, because for some scales of graph, they won't fit in RAM anymore > or on a single machine. > That's the size I'm pursuing code for.
> The general logic behind the data structure would go something like > this:
> 1. When a node/edge or whatever is to be de-referenced by its id, the > local memory cache is consulted. > 2. If it doesn't reside in memory cache, then its queried from offline > storage (say mongodb). Mongo will take care of > identifying the server it actually resides on. > 3. That node/edge is pulled into the cache where its properties are > available.
> So consider a depth first search algorithm. It would only need the > starting node in the cache, along with > its out and in vertex information (edges). From there, as networkx > consults its internal matrix, the de-referencing > process causes lookups to occur transparently. This can mean > consulting a large DHT or routing to other machines, > i.e. crawling.
> networkx only sees its usual internal data structure.
> Which classes should I look at to implement this? I'm not all that > knowledgable on networkx code.
> On Mar 20, 1:21 pm, Tim Howard <tghow...@gmail.com> wrote: > > I've had great luck with SQLite and using it as storage for NetworkX. > > Obviously, I currently write the entire graph structure to and from > > SQLite and RAM and all the algorithms occur in RAM. I think it would > > be great to have a persistent implementation as discussed below.
> > Tim
> > On Tue, Mar 20, 2012 at 11:22 AM, Aric Hagberg <aric.hagb...@gmail.com> > wrote: > > > On Tue, Mar 20, 2012 at 8:55 AM, project2501 <darreng5...@gmail.com> > wrote: > > >> Hi, > > >> I'm interested in providing a persistent implementation of the > underlying > > >> graph representation for networkx. > > >> I'm not talking about serializing a graph to and from storage. I'm > talking > > >> about storing all the node/edge/property information in > > >> a storage implementation, transparent to networkx, such that when > networkx > > >> algorithms traverse nodes, edges, etc. > > >> That information is retrieved from storage/cache using my dictionary > > >> implementation.
> > >> Is it possible? Any tips appreciated!
> > > It's definitely possible and we have discussed it before.
> > > More generally it would make sense to use one of the graph databases. > > > I've asked the community to suggest an appropriate one, but so far > > > nobody has replied. I am willing to consider re-engineering parts the > > > NetworkX core to make it work with a persistent database solution if > > > that is necessary.
> > > Aric
> > > -- > > > You received this message because you are subscribed to the Google > Groups "networkx-discuss" group. > > > To post to this group, send email to networkx-discuss@googlegroups.com > . > > > To unsubscribe from this group, send email to > networkx-discuss+unsubscribe@googlegroups.com. > > > For more options, visit this group athttp:// > groups.google.com/group/networkx-discuss?hl=en.
> -- > You received this message because you are subscribed to the Google Groups > "networkx-discuss" group. > To post to this group, send email to networkx-discuss@googlegroups.com. > To unsubscribe from this group, send email to > networkx-discuss+unsubscribe@googlegroups.com. > For more options, visit this group at > http://groups.google.com/group/networkx-discuss?hl=en.
In fact, I came from the whole neo4j,orientdb graph realm before
considering this idea.
Couple observations:
1. They don't support native algorithms, so I would have to write my
own or patch a set of them over the database.
So that would be starting with the database and finding/writing the
algorithms (which I'm less adept at).
2. Any of those databases could be used as the long-term storage
backend for what I'm describing above.
Some study of mapping algorithm code to efficient database queries
would need to be done (outside my scope).
So, the inverse of that approach is starting with the algorithm
package and layering the persistence underneath.
networkx not only has algorithms I need, but its python, and also has
export to javascript and SVG. two critical pieces for me.
Since I'm fairly decent with persistence, it _seems_ easier for me to
add that piece into networkx.
Of course, it won't perform the same as memory, but only on first
calls as I intend to cache the data in local memory,
backed by memcache (which is distributed and fast), backed by mongo
(long-term storage). If I implement the Graph.py
class correctly, networkx algorithms will solve the problem of what
queries to issue, in what order and when to solve
graph theoretic problems over the persistence layers. Both mongodb and
memcache scale horizontally and perform
exceptionally well. So even though it won't compare to memory, when
you're operating on graphs of enormous size
in "ad hoc" ways. Which is to say, user-driven request/response modes.
Its not possible to have ALL the graph
data in memory waiting for them.
Having said all that, I think a good first step (for study or general
use) is to see how this approach works or not.
:)
On Mar 22, 10:20 am, tcb <thecolourblu...@gmail.com> wrote:
> The idea of a persistent storage for networkx graphs is a good one, and has
> been brought up many times- for many people this will mean just writing a
> large graph into some format which is easy to store and transport to other
> machines or applications. SQLite would seem to be ideal for this job, and
> I'm sure several people have already written their own readers and writers
> for sqlite.
> To change the backend of networkx from a dict of dicts to something else,
> which implements persistent storage and is scalable across multiple
> processors is another job altogether. It has been suggested to implement a
> native backend in for eg. boost or igraph or something, but it seems that
> you won't get much of a speedup doing that, unless you also rewrite the
> algorithms. Since the beauty of networkx is the python algorithms- you may
> as well then just use another package from the outset. I've been looking at
> neo4j (http://neo4j.org/) for very large graphs, and while it certainly
> scales very well, it is really missing good tools for network analysis.
> There are python bindings for neo4j (http://docs.neo4j.org/chunked/snapshot/python-embedded.html) which need
> Jpype (http://jpype.sourceforge.net/) since neo4j is in java- it would be
> an easy job to put together readers and writers which would load and store
> networkx graphs in neo4j... probably a few have already done this?
> I think it would also be possible to use neo4j as a backend for networkx-
> neo4j has Nodes and Relationships which would map nicely to nodes and edges
> in networkx. The Nodes and Relationships have property maps which could
> store any attributes of nodes/edges. Neo4j would handle the memory caching
> itself, so you would just send an access request (modify __getitem__ in
> graph.py) for the nodes/edges and let neo4j consult its stores and figure
> out how to return the data to you. While this might work for small graphs-
> its not really how the scalable graph databases are designed to work. For
> example: every write operation has to be wrapped in a transaction- this
> will kill your performance if you do it for every individual write, but is
> manageable if you batch the transactions together. This will likely require
> most of the algorithms to be reworked. The advantages of neo4j probably
> come from its indexes- you would have to figure out how to use these
> efficiently in the networkx algorithms. Many algorithms in networkx start
> with a list [] of nodes- this is not going to work either and would likely
> have to be modified to use the Traversal framework (http://docs.neo4j.org/chunked/stable/tutorial-traversal-concepts.html). You
> might also look at Gremlin (https://github.com/tinkerpop/gremlin/wiki) to
> get an idea for an efficient way of doing network analysis on distributed
> graphs.
> It would be nice to see how far you can get with a neo4j (or similar
> backend). Modifying the graph class in networkx to use a different backend
> should be straightforward enough, but ensuring all (or most) the algorithms
> still work, and work efficiently will be the hard part.
> regards,
> On Tue, Mar 20, 2012 at 5:57 PM, project2501 <darreng5...@gmail.com> wrote:
> > Yeah, because for some scales of graph, they won't fit in RAM anymore
> > or on a single machine.
> > That's the size I'm pursuing code for.
> > The general logic behind the data structure would go something like
> > this:
> > 1. When a node/edge or whatever is to be de-referenced by its id, the
> > local memory cache is consulted.
> > 2. If it doesn't reside in memory cache, then its queried from offline
> > storage (say mongodb). Mongo will take care of
> > identifying the server it actually resides on.
> > 3. That node/edge is pulled into the cache where its properties are
> > available.
> > So consider a depth first search algorithm. It would only need the
> > starting node in the cache, along with
> > its out and in vertex information (edges). From there, as networkx
> > consults its internal matrix, the de-referencing
> > process causes lookups to occur transparently. This can mean
> > consulting a large DHT or routing to other machines,
> > i.e. crawling.
> > networkx only sees its usual internal data structure.
> > Which classes should I look at to implement this? I'm not all that
> > knowledgable on networkx code.
> > On Mar 20, 1:21 pm, Tim Howard <tghow...@gmail.com> wrote:
> > > I've had great luck with SQLite and using it as storage for NetworkX.
> > > Obviously, I currently write the entire graph structure to and from
> > > SQLite and RAM and all the algorithms occur in RAM. I think it would
> > > be great to have a persistent implementation as discussed below.
> > > Tim
> > > On Tue, Mar 20, 2012 at 11:22 AM, Aric Hagberg <aric.hagb...@gmail.com>
> > wrote:
> > > > On Tue, Mar 20, 2012 at 8:55 AM, project2501 <darreng5...@gmail.com>
> > wrote:
> > > >> Hi,
> > > >> I'm interested in providing a persistent implementation of the
> > underlying
> > > >> graph representation for networkx.
> > > >> I'm not talking about serializing a graph to and from storage. I'm
> > talking
> > > >> about storing all the node/edge/property information in
> > > >> a storage implementation, transparent to networkx, such that when
> > networkx
> > > >> algorithms traverse nodes, edges, etc.
> > > >> That information is retrieved from storage/cache using my dictionary
> > > >> implementation.
> > > >> Is it possible? Any tips appreciated!
> > > > It's definitely possible and we have discussed it before.
> > > > More generally it would make sense to use one of the graph databases.
> > > > I've asked the community to suggest an appropriate one, but so far
> > > > nobody has replied. I am willing to consider re-engineering parts the
> > > > NetworkX core to make it work with a persistent database solution if
> > > > that is necessary.
> > > > Aric
> > > > --
> > > > You received this message because you are subscribed to the Google
> > Groups "networkx-discuss" group.
> > > > To post to this group, send email to networkx-discuss@googlegroups.com
> > .
> > > > To unsubscribe from this group, send email to
> > networkx-discuss+unsubscribe@googlegroups.com.
> > > > For more options, visit this group athttp://
> > groups.google.com/group/networkx-discuss?hl=en.
> > --
> > You received this message because you are subscribed to the Google Groups
> > "networkx-discuss" group.
> > To post to this group, send email to networkx-discuss@googlegroups.com.
> > To unsubscribe from this group, send email to
> > networkx-discuss+unsubscribe@googlegroups.com.
> > For more options, visit this group at
> >http://groups.google.com/group/networkx-discuss?hl=en.
> In fact, I came from the whole neo4j,orientdb graph realm before > considering this idea.
> Couple observations:
> 1. They don't support native algorithms, so I would have to write my > own or patch a set of them over the database. > So that would be starting with the database and finding/writing the > algorithms (which I'm less adept at). > 2. Any of those databases could be used as the long-term storage > backend for what I'm describing above. > Some study of mapping algorithm code to efficient database queries > would need to be done (outside my scope).
> So, the inverse of that approach is starting with the algorithm > package and layering the persistence underneath. > networkx not only has algorithms I need, but its python, and also has > export to javascript and SVG. two critical pieces for me.
> Since I'm fairly decent with persistence, it _seems_ easier for me to > add that piece into networkx.
I just want to highlight for you that, as Aric mentioned before, many of the algorithms access and modify the dictionary structure of the Graph class directly. This is done for performance gains but means that you cannot simply define a class that exposes the same interface but you actually need to mimic some of the internals as well.
To see where such access of internals occurs would actually be interesting, imho, so that possibly those code sections could be altered to use "public" methods and attributes only.
> Of course, it won't perform the same as memory, but only on first > calls as I intend to cache the data in local memory, > backed by memcache (which is distributed and fast), backed by mongo > (long-term storage). If I implement the Graph.py > class correctly, networkx algorithms will solve the problem of what > queries to issue, in what order and when to solve > graph theoretic problems over the persistence layers. Both mongodb and > memcache scale horizontally and perform > exceptionally well. So even though it won't compare to memory, when > you're operating on graphs of enormous size > in "ad hoc" ways. Which is to say, user-driven request/response modes. > Its not possible to have ALL the graph > data in memory waiting for them.
> Having said all that, I think a good first step (for study or general > use) is to see how this approach works or not.
> :)
> On Mar 22, 10:20 am, tcb<thecolourblu...@gmail.com> wrote: >> Hi,
>> Its perhaps worth separating the two ideas...
>> The idea of a persistent storage for networkx graphs is a good one, and has >> been brought up many times- for many people this will mean just writing a >> large graph into some format which is easy to store and transport to other >> machines or applications. SQLite would seem to be ideal for this job, and >> I'm sure several people have already written their own readers and writers >> for sqlite.
>> To change the backend of networkx from a dict of dicts to something else, >> which implements persistent storage and is scalable across multiple >> processors is another job altogether. It has been suggested to implement a >> native backend in for eg. boost or igraph or something, but it seems that >> you won't get much of a speedup doing that, unless you also rewrite the >> algorithms. Since the beauty of networkx is the python algorithms- you may >> as well then just use another package from the outset. I've been looking at >> neo4j (http://neo4j.org/) for very large graphs, and while it certainly >> scales very well, it is really missing good tools for network analysis. >> There are python bindings for neo4j (http://docs.neo4j.org/chunked/snapshot/python-embedded.html) which need >> Jpype (http://jpype.sourceforge.net/) since neo4j is in java- it would be >> an easy job to put together readers and writers which would load and store >> networkx graphs in neo4j... probably a few have already done this?
>> I think it would also be possible to use neo4j as a backend for networkx- >> neo4j has Nodes and Relationships which would map nicely to nodes and edges >> in networkx. The Nodes and Relationships have property maps which could >> store any attributes of nodes/edges. Neo4j would handle the memory caching >> itself, so you would just send an access request (modify __getitem__ in >> graph.py) for the nodes/edges and let neo4j consult its stores and figure >> out how to return the data to you. While this might work for small graphs- >> its not really how the scalable graph databases are designed to work. For >> example: every write operation has to be wrapped in a transaction- this >> will kill your performance if you do it for every individual write, but is >> manageable if you batch the transactions together. This will likely require >> most of the algorithms to be reworked. The advantages of neo4j probably >> come from its indexes- you would have to figure out how to use these >> efficiently in the networkx algorithms. Many algorithms in networkx start >> with a list [] of nodes- this is not going to work either and would likely >> have to be modified to use the Traversal framework (http://docs.neo4j.org/chunked/stable/tutorial-traversal-concepts.html). You >> might also look at Gremlin (https://github.com/tinkerpop/gremlin/wiki) to >> get an idea for an efficient way of doing network analysis on distributed >> graphs.
>> It would be nice to see how far you can get with a neo4j (or similar >> backend). Modifying the graph class in networkx to use a different backend >> should be straightforward enough, but ensuring all (or most) the algorithms >> still work, and work efficiently will be the hard part.
>> regards,
>> On Tue, Mar 20, 2012 at 5:57 PM, project2501<darreng5...@gmail.com> wrote: >>> Yeah, because for some scales of graph, they won't fit in RAM anymore >>> or on a single machine. >>> That's the size I'm pursuing code for. >>> The general logic behind the data structure would go something like >>> this: >>> 1. When a node/edge or whatever is to be de-referenced by its id, the >>> local memory cache is consulted. >>> 2. If it doesn't reside in memory cache, then its queried from offline >>> storage (say mongodb). Mongo will take care of >>> identifying the server it actually resides on. >>> 3. That node/edge is pulled into the cache where its properties are >>> available. >>> So consider a depth first search algorithm. It would only need the >>> starting node in the cache, along with >>> its out and in vertex information (edges). From there, as networkx >>> consults its internal matrix, the de-referencing >>> process causes lookups to occur transparently. This can mean >>> consulting a large DHT or routing to other machines, >>> i.e. crawling. >>> networkx only sees its usual internal data structure. >>> Which classes should I look at to implement this? I'm not all that >>> knowledgable on networkx code. >>> On Mar 20, 1:21 pm, Tim Howard<tghow...@gmail.com> wrote: >>>> I've had great luck with SQLite and using it as storage for NetworkX. >>>> Obviously, I currently write the entire graph structure to and from >>>> SQLite and RAM and all the algorithms occur in RAM. I think it would >>>> be great to have a persistent implementation as discussed below. >>>> Tim >>>> On Tue, Mar 20, 2012 at 11:22 AM, Aric Hagberg<aric.hagb...@gmail.com> >>> wrote: >>>>> On Tue, Mar 20, 2012 at 8:55 AM, project2501<darreng5...@gmail.com> >>> wrote: >>>>>> Hi, >>>>>> I'm interested in providing a persistent implementation of the >>> underlying >>>>>> graph representation for networkx. >>>>>> I'm not talking about serializing a graph to and from storage. I'm >>> talking >>>>>> about storing all the node/edge/property information in >>>>>> a storage implementation, transparent to networkx, such that when >>> networkx >>>>>> algorithms traverse nodes, edges, etc. >>>>>> That information is retrieved from storage/cache using my dictionary >>>>>> implementation. >>>>>> Is it possible? Any tips appreciated! >>>>> It's definitely possible and we have discussed it before. >>>>> There is a simple implementation using Python's shelve at >>>>> https://networkx.lanl.gov/trac/ticket/224 >>>>> It hasn't been updated too recently. >>>>> More generally it would make sense to use one of the graph databases. >>>>> I've asked the community to suggest an appropriate one, but so far >>>>> nobody has replied. I am willing to consider re-engineering parts the >>>>> NetworkX core to make it work with a persistent database solution if >>>>> that is necessary. >>>>> Aric >>>>> -- >>>>> You received this message because you are subscribed to the Google >>> Groups "networkx-discuss" group. >>>>> To post to this group, send email to networkx-discuss@googlegroups.com >>> . >>>>> To unsubscribe from this group, send email to >>> networkx-discuss+unsubscribe@googlegroups.com. >>>>> For more options, visit this group athttp:// >>> groups.google.com/group/networkx-discuss?hl=en. >>> -- >>> You received this message because you are subscribed to the Google Groups >>> "networkx-discuss" group. >>> To post to this group, send email to networkx-discuss@googlegroups.com. >>> To unsubscribe from this group, send email to >>> networkx-discuss+unsubscribe@googlegroups.com. >>> For more options, visit this group at >>> http://groups.google.com/group/networkx-discuss?hl=en.
Thanks. Yeah, I understand the issue there. I will study this some and
see if there's an efficient way to loosely couple
the internal representation.
Just offhand, it would seem that if the Graph class deferred lookups
to another class using a notional syntax for
storing and retrieving adjacent or referential sets of dicts from the
other class, then it could be implemented various ways
with only the additive cost of a method call.
On Mar 25, 5:10 pm, Moritz Emanuel Beber <moritz.be...@googlemail.com>
wrote:
> > In fact, I came from the whole neo4j,orientdb graph realm before
> > considering this idea.
> > Couple observations:
> > 1. They don't support native algorithms, so I would have to write my
> > own or patch a set of them over the database.
> > So that would be starting with the database and finding/writing the
> > algorithms (which I'm less adept at).
> > 2. Any of those databases could be used as the long-term storage
> > backend for what I'm describing above.
> > Some study of mapping algorithm code to efficient database queries
> > would need to be done (outside my scope).
> > So, the inverse of that approach is starting with the algorithm
> > package and layering the persistence underneath.
> > networkx not only has algorithms I need, but its python, and also has
> > export to javascript and SVG. two critical pieces for me.
> > Since I'm fairly decent with persistence, it _seems_ easier for me to
> > add that piece into networkx.
> I just want to highlight for you that, as Aric mentioned before, many of
> the algorithms access and modify the dictionary structure of the Graph
> class directly. This is done for performance gains but means that you
> cannot simply define a class that exposes the same interface but you
> actually need to mimic some of the internals as well.
> To see where such access of internals occurs would actually be
> interesting, imho, so that possibly those code sections could be altered
> to use "public" methods and attributes only.
> Good luck,
> Moritz
> > Of course, it won't perform the same as memory, but only on first
> > calls as I intend to cache the data in local memory,
> > backed by memcache (which is distributed and fast), backed by mongo
> > (long-term storage). If I implement the Graph.py
> > class correctly, networkx algorithms will solve the problem of what
> > queries to issue, in what order and when to solve
> > graph theoretic problems over the persistence layers. Both mongodb and
> > memcache scale horizontally and perform
> > exceptionally well. So even though it won't compare to memory, when
> > you're operating on graphs of enormous size
> > in "ad hoc" ways. Which is to say, user-driven request/response modes.
> > Its not possible to have ALL the graph
> > data in memory waiting for them.
> > Having said all that, I think a good first step (for study or general
> > use) is to see how this approach works or not.
> > :)
> > On Mar 22, 10:20 am, tcb<thecolourblu...@gmail.com> wrote:
> >> Hi,
> >> Its perhaps worth separating the two ideas...
> >> The idea of a persistent storage for networkx graphs is a good one, and has
> >> been brought up many times- for many people this will mean just writing a
> >> large graph into some format which is easy to store and transport to other
> >> machines or applications. SQLite would seem to be ideal for this job, and
> >> I'm sure several people have already written their own readers and writers
> >> for sqlite.
> >> To change the backend of networkx from a dict of dicts to something else,
> >> which implements persistent storage and is scalable across multiple
> >> processors is another job altogether. It has been suggested to implement a
> >> native backend in for eg. boost or igraph or something, but it seems that
> >> you won't get much of a speedup doing that, unless you also rewrite the
> >> algorithms. Since the beauty of networkx is the python algorithms- you may
> >> as well then just use another package from the outset. I've been looking at
> >> neo4j (http://neo4j.org/) for very large graphs, and while it certainly
> >> scales very well, it is really missing good tools for network analysis.
> >> There are python bindings for neo4j (http://docs.neo4j.org/chunked/snapshot/python-embedded.html) which need
> >> Jpype (http://jpype.sourceforge.net/) since neo4j is in java- it would be
> >> an easy job to put together readers and writers which would load and store
> >> networkx graphs in neo4j... probably a few have already done this?
> >> I think it would also be possible to use neo4j as a backend for networkx-
> >> neo4j has Nodes and Relationships which would map nicely to nodes and edges
> >> in networkx. The Nodes and Relationships have property maps which could
> >> store any attributes of nodes/edges. Neo4j would handle the memory caching
> >> itself, so you would just send an access request (modify __getitem__ in
> >> graph.py) for the nodes/edges and let neo4j consult its stores and figure
> >> out how to return the data to you. While this might work for small graphs-
> >> its not really how the scalable graph databases are designed to work. For
> >> example: every write operation has to be wrapped in a transaction- this
> >> will kill your performance if you do it for every individual write, but is
> >> manageable if you batch the transactions together. This will likely require
> >> most of the algorithms to be reworked. The advantages of neo4j probably
> >> come from its indexes- you would have to figure out how to use these
> >> efficiently in the networkx algorithms. Many algorithms in networkx start
> >> with a list [] of nodes- this is not going to work either and would likely
> >> have to be modified to use the Traversal framework (http://docs.neo4j.org/chunked/stable/tutorial-traversal-concepts.html). You
> >> might also look at Gremlin (https://github.com/tinkerpop/gremlin/wiki) to
> >> get an idea for an efficient way of doing network analysis on distributed
> >> graphs.
> >> It would be nice to see how far you can get with a neo4j (or similar
> >> backend). Modifying the graph class in networkx to use a different backend
> >> should be straightforward enough, but ensuring all (or most) the algorithms
> >> still work, and work efficiently will be the hard part.
> >> regards,
> >> On Tue, Mar 20, 2012 at 5:57 PM, project2501<darreng5...@gmail.com> wrote:
> >>> Yeah, because for some scales of graph, they won't fit in RAM anymore
> >>> or on a single machine.
> >>> That's the size I'm pursuing code for.
> >>> The general logic behind the data structure would go something like
> >>> this:
> >>> 1. When a node/edge or whatever is to be de-referenced by its id, the
> >>> local memory cache is consulted.
> >>> 2. If it doesn't reside in memory cache, then its queried from offline
> >>> storage (say mongodb). Mongo will take care of
> >>> identifying the server it actually resides on.
> >>> 3. That node/edge is pulled into the cache where its properties are
> >>> available.
> >>> So consider a depth first search algorithm. It would only need the
> >>> starting node in the cache, along with
> >>> its out and in vertex information (edges). From there, as networkx
> >>> consults its internal matrix, the de-referencing
> >>> process causes lookups to occur transparently. This can mean
> >>> consulting a large DHT or routing to other machines,
> >>> i.e. crawling.
> >>> networkx only sees its usual internal data structure.
> >>> Which classes should I look at to implement this? I'm not all that
> >>> knowledgable on networkx code.
> >>> On Mar 20, 1:21 pm, Tim Howard<tghow...@gmail.com> wrote:
> >>>> I've had great luck with SQLite and using it as storage for NetworkX.
> >>>> Obviously, I currently write the entire graph structure to and from
> >>>> SQLite and RAM and all the algorithms occur in RAM. I think it would
> >>>> be great to have a persistent implementation as discussed below.
> >>>> Tim
> >>>> On Tue, Mar 20, 2012 at 11:22 AM, Aric Hagberg<aric.hagb...@gmail.com>
> >>> wrote:
> >>>>> On Tue, Mar 20, 2012 at 8:55 AM, project2501<darreng5...@gmail.com>
> >>> wrote:
> >>>>>> Hi,
> >>>>>> I'm interested in providing a persistent implementation of the
> >>> underlying
> >>>>>> graph representation for networkx.
> >>>>>> I'm not talking about serializing a graph to and from storage. I'm
> >>> talking
> >>>>>> about storing all the node/edge/property information in
> >>>>>> a storage implementation, transparent to networkx, such that when
> >>> networkx
> >>>>>> algorithms traverse nodes, edges, etc.
> >>>>>> That information is retrieved from storage/cache using my dictionary
> >>>>>> implementation.
> >>>>>> Is it possible? Any tips appreciated!
> >>>>> It's definitely possible and we have discussed it before.
> >>>>> There is a simple implementation using Python's shelve at
> >>>>>https://networkx.lanl.gov/trac/ticket/224 > >>>>> It hasn't been updated too recently.
> >>>>> More generally it would make sense to use one of the graph databases.
> >>>>> I've asked the community to suggest an appropriate one, but so far
> >>>>> nobody has replied. I am willing to consider re-engineering parts the
> >>>>> NetworkX core to make it work with a persistent database solution if
> >>>>> that is necessary.
> >>>>> Aric
> >>>>> --
> >>>>> You received this message because you are subscribed to the Google
> >>> Groups "networkx-discuss" group.
> >>>>> To post to this group, send email to networkx-discuss@googlegroups.com
> >>> .
> >>>>> To unsubscribe from this group, send email to
> >>> networkx-discuss+unsubscribe@googlegroups.com.
> >>>>> For more options, visit this group
On Tue, Mar 20, 2012 at 10:22, Aric Hagberg <aric.hagb...@gmail.com> wrote: > On Tue, Mar 20, 2012 at 8:55 AM, project2501 <darreng5...@gmail.com> > wrote: > > Hi, > > I'm interested in providing a persistent implementation of the > underlying > > graph representation for networkx. > > I'm not talking about serializing a graph to and from storage. I'm > talking > > about storing all the node/edge/property information in > > a storage implementation, transparent to networkx, such that when > networkx > > algorithms traverse nodes, edges, etc. > > That information is retrieved from storage/cache using my dictionary > > implementation.
> > Is it possible? Any tips appreciated!
> It's definitely possible and we have discussed it before.
> More generally it would make sense to use one of the graph databases. > I've asked the community to suggest an appropriate one, but so far > nobody has replied. I am willing to consider re-engineering parts the > NetworkX core to make it work with a persistent database solution if > that is necessary.
> Aric
> -- > You received this message because you are subscribed to the Google Groups > "networkx-discuss" group. > To post to this group, send email to networkx-discuss@googlegroups.com. > To unsubscribe from this group, send email to > networkx-discuss+unsubscribe@googlegroups.com. > For more options, visit this group at > http://groups.google.com/group/networkx-discuss?hl=en.
On Sun, Mar 25, 2012 at 2:27 PM, project2501 <darreng5...@gmail.com> wrote: > In fact, I came from the whole neo4j,orientdb graph realm before > considering this idea.
> Couple observations:
> 1. They don't support native algorithms, so I would have to write my > own or patch a set of them over the database. > So that would be starting with the database and finding/writing the > algorithms (which I'm less adept at). > 2. Any of those databases could be used as the long-term storage > backend for what I'm describing above. > Some study of mapping algorithm code to efficient database queries > would need to be done (outside my scope).
That's about right.
You could also look at the previously mentioned cloudlight to see how a persistent backend is done with SQLite.
> So, the inverse of that approach is starting with the algorithm > package and layering the persistence underneath. > networkx not only has algorithms I need, but its python, and also has > export to javascript and SVG. two critical pieces for me.
> Since I'm fairly decent with persistence, it _seems_ easier for me to > add that piece into networkx.
> Of course, it won't perform the same as memory, but only on first > calls as I intend to cache the data in local memory, > backed by memcache (which is distributed and fast), backed by mongo > (long-term storage). If I implement the Graph.py > class correctly, networkx algorithms will solve the problem of what > queries to issue, in what order and when to solve > graph theoretic problems over the persistence layers. Both mongodb and > memcache scale horizontally and perform > exceptionally well. So even though it won't compare to memory, when > you're operating on graphs of enormous size > in "ad hoc" ways. Which is to say, user-driven request/response modes. > Its not possible to have ALL the graph > data in memory waiting for them.
Sure, but this is really the problem that the graph databases like neo4j are trying to solve. If you can pull all your data from storage into memory, then you have no problems running networkx algorithms directly. But the graph databases are designed for those situations where you have more data than you can fit in memory on a single machine. This is much more of a problem for graph algorithms because of the non locality of the data. For example, finding neighbours of a node might mean pulling data from several different machines- there is usually no simple way of partitioning your graph nicely among N processors so that each processor has just the nodes and edges it needs.
Perhaps as first step, and to see how it might pan out, you could ignore those issues and just implement the persistent storage. If the networkx algorithms manipulate the Graph data through a standard api then they should 'just work', but they will be dreadfully slow.
The next step then, would to take advantage of the ability to work with large datasets, and perhaps to execute some algorithms in parallel using the native abilities of the backend. A parallel search algorithm implemented on a scalable graph database backend would be a tremendously useful addition to networkx.
> Having said all that, I think a good first step (for study or general > use) is to see how this approach works or not.
> :)
> On Mar 22, 10:20 am, tcb <thecolourblu...@gmail.com> wrote: > > Hi,
> > Its perhaps worth separating the two ideas...
> > The idea of a persistent storage for networkx graphs is a good one, and > has > > been brought up many times- for many people this will mean just writing a > > large graph into some format which is easy to store and transport to > other > > machines or applications. SQLite would seem to be ideal for this job, and > > I'm sure several people have already written their own readers and > writers > > for sqlite.
> > To change the backend of networkx from a dict of dicts to something else, > > which implements persistent storage and is scalable across multiple > > processors is another job altogether. It has been suggested to implement > a > > native backend in for eg. boost or igraph or something, but it seems that > > you won't get much of a speedup doing that, unless you also rewrite the > > algorithms. Since the beauty of networkx is the python algorithms- you > may > > as well then just use another package from the outset. I've been looking > at > > neo4j (http://neo4j.org/) for very large graphs, and while it certainly > > scales very well, it is really missing good tools for network analysis. > > There are python bindings for neo4j ( > http://docs.neo4j.org/chunked/snapshot/python-embedded.html) which need > > Jpype (http://jpype.sourceforge.net/) since neo4j is in java- it would > be > > an easy job to put together readers and writers which would load and > store > > networkx graphs in neo4j... probably a few have already done this?
> > I think it would also be possible to use neo4j as a backend for networkx- > > neo4j has Nodes and Relationships which would map nicely to nodes and > edges > > in networkx. The Nodes and Relationships have property maps which could > > store any attributes of nodes/edges. Neo4j would handle the memory > caching > > itself, so you would just send an access request (modify __getitem__ in > > graph.py) for the nodes/edges and let neo4j consult its stores and figure > > out how to return the data to you. While this might work for small > graphs- > > its not really how the scalable graph databases are designed to work. For > > example: every write operation has to be wrapped in a transaction- this > > will kill your performance if you do it for every individual write, but > is > > manageable if you batch the transactions together. This will likely > require > > most of the algorithms to be reworked. The advantages of neo4j probably > > come from its indexes- you would have to figure out how to use these > > efficiently in the networkx algorithms. Many algorithms in networkx start > > with a list [] of nodes- this is not going to work either and would > likely > > have to be modified to use the Traversal framework ( > http://docs.neo4j.org/chunked/stable/tutorial-traversal-concepts.html). > You > > might also look at Gremlin (https://github.com/tinkerpop/gremlin/wiki) > to > > get an idea for an efficient way of doing network analysis on distributed > > graphs.
> > It would be nice to see how far you can get with a neo4j (or similar > > backend). Modifying the graph class in networkx to use a different > backend > > should be straightforward enough, but ensuring all (or most) the > algorithms > > still work, and work efficiently will be the hard part.
> > regards,
> > On Tue, Mar 20, 2012 at 5:57 PM, project2501 <darreng5...@gmail.com> > wrote: > > > Yeah, because for some scales of graph, they won't fit in RAM anymore > > > or on a single machine. > > > That's the size I'm pursuing code for.
> > > The general logic behind the data structure would go something like > > > this:
> > > 1. When a node/edge or whatever is to be de-referenced by its id, the > > > local memory cache is consulted. > > > 2. If it doesn't reside in memory cache, then its queried from offline > > > storage (say mongodb). Mongo will take care of > > > identifying the server it actually resides on. > > > 3. That node/edge is pulled into the cache where its properties are > > > available.
> > > So consider a depth first search algorithm. It would only need the > > > starting node in the cache, along with > > > its out and in vertex information (edges). From there, as networkx > > > consults its internal matrix, the de-referencing > > > process causes lookups to occur transparently. This can mean > > > consulting a large DHT or routing to other machines, > > > i.e. crawling.
> > > networkx only sees its usual internal data structure.
> > > Which classes should I look at to implement this? I'm not all that > > > knowledgable on networkx code.
> > > On Mar 20, 1:21 pm, Tim Howard <tghow...@gmail.com> wrote: > > > > I've had great luck with SQLite and using it as storage for NetworkX. > > > > Obviously, I currently write the entire graph structure to and from > > > > SQLite and RAM and all the algorithms occur in RAM. I think it would > > > > be great to have a persistent implementation as discussed below.
> > > > Tim
> > > > On Tue, Mar 20, 2012 at 11:22 AM, Aric Hagberg < > aric.hagb...@gmail.com> > > > wrote: > > > > > On Tue, Mar 20, 2012 at 8:55 AM, project2501 < > darreng5...@gmail.com> > > > wrote: > > > > >> Hi, > > > > >> I'm interested in providing a persistent implementation of the > > > underlying > > > > >> graph representation for networkx. > > > > >> I'm not talking about serializing a graph to and from storage. I'm > > > talking > > > > >> about storing all the node/edge/property information in > > > > >> a storage implementation, transparent to networkx, such that when > > > networkx > > > > >> algorithms traverse nodes, edges, etc. > > > > >> That information is retrieved from storage/cache using my > dictionary > > > > >> implementation.
> > > > >> Is it possible? Any tips appreciated!
> > > > > It's definitely possible and we have discussed it before.
> > > > > There is a simple implementation using Python's shelve at > > > > >https://networkx.lanl.gov/trac/ticket/224 > > > > > It hasn't been updated too recently.
> > > > > More generally it would make sense to use one of the graph > databases. > > > > > I've asked the community to suggest an appropriate one, but so far > > > > > nobody has replied. I am willing to consider re-engineering parts > the > > > > > NetworkX core to make it work with a persistent database solution > if > > > > > that is necessary.
> > > > > Aric
> > > > > -- > > > > > You received this message because you are subscribed to the Google > > > Groups "networkx-discuss" group. > > > > > To post to this group,
Yeah, I get that graph databases are optimized for graph storage and retrieval.
I would certainly use one in an implementation underneath networkx. The problem is, as I mention, that the graph databases do not come with the algorithms networkx comes with. Also, reading the required nodes into memory then doing networkx algorithms presents a problem. Which nodes to read into memory. And also, perhaps not all the required ones fit into memory.
Replacing the networkx in-memory data model with an interface to a memory or persistent model will solve both of these problems.
As with most engineering problems, there are tradeoffs. Using networkx to navigate a persistent graph database would not be 'dreadfully' slow in my opinion for many common use cases. Slow is relative. If done using a forward cache in memory or DHT type system, it would perform acceptably given the power it brings - scale - and access to much more information.
On Monday, March 26, 2012 9:57:31 AM UTC-4, tcb wrote:
> On Sun, Mar 25, 2012 at 2:27 PM, project2501 <darreng5...@gmail.com>wrote:
>> In fact, I came from the whole neo4j,orientdb graph realm before >> considering this idea.
>> Couple observations:
>> 1. They don't support native algorithms, so I would have to write my >> own or patch a set of them over the database. >> So that would be starting with the database and finding/writing the >> algorithms (which I'm less adept at). >> 2. Any of those databases could be used as the long-term storage >> backend for what I'm describing above. >> Some study of mapping algorithm code to efficient database queries >> would need to be done (outside my scope).
> That's about right.
> You could also look at the previously mentioned cloudlight to see how a > persistent backend is done with SQLite.
>> So, the inverse of that approach is starting with the algorithm >> package and layering the persistence underneath. >> networkx not only has algorithms I need, but its python, and also has >> export to javascript and SVG. two critical pieces for me.
>> Since I'm fairly decent with persistence, it _seems_ easier for me to >> add that piece into networkx.
>> Of course, it won't perform the same as memory, but only on first >> calls as I intend to cache the data in local memory, >> backed by memcache (which is distributed and fast), backed by mongo >> (long-term storage). If I implement the Graph.py >> class correctly, networkx algorithms will solve the problem of what >> queries to issue, in what order and when to solve >> graph theoretic problems over the persistence layers. Both mongodb and >> memcache scale horizontally and perform >> exceptionally well. So even though it won't compare to memory, when >> you're operating on graphs of enormous size >> in "ad hoc" ways. Which is to say, user-driven request/response modes. >> Its not possible to have ALL the graph >> data in memory waiting for them.
> Sure, but this is really the problem that the graph databases like neo4j > are trying to solve. If you can pull all your data from storage into > memory, then you have no problems running networkx algorithms directly. But > the graph databases are designed for those situations where you have more > data than you can fit in memory on a single machine. This is much more of a > problem for graph algorithms because of the non locality of the data. For > example, finding neighbours of a node might mean pulling data from several > different machines- there is usually no simple way of partitioning your > graph nicely among N processors so that each processor has just the nodes > and edges it needs.
> Perhaps as first step, and to see how it might pan out, you could ignore > those issues and just implement the persistent storage. If the networkx > algorithms manipulate the Graph data through a standard api then they > should 'just work', but they will be dreadfully slow.
> The next step then, would to take advantage of the ability to work with > large datasets, and perhaps to execute some algorithms in parallel using > the native abilities of the backend. A parallel search algorithm > implemented on a scalable graph database backend would be a tremendously > useful addition to networkx.
>> Having said all that, I think a good first step (for study or general >> use) is to see how this approach works or not.
>> :)
>> On Mar 22, 10:20 am, tcb <thecolourblu...@gmail.com> wrote: >> > Hi,
>> > Its perhaps worth separating the two ideas...
>> > The idea of a persistent storage for networkx graphs is a good one, and >> has >> > been brought up many times- for many people this will mean just writing >> a >> > large graph into some format which is easy to store and transport to >> other >> > machines or applications. SQLite would seem to be ideal for this job, >> and >> > I'm sure several people have already written their own readers and >> writers >> > for sqlite.
>> > To change the backend of networkx from a dict of dicts to something >> else, >> > which implements persistent storage and is scalable across multiple >> > processors is another job altogether. It has been suggested to >> implement a >> > native backend in for eg. boost or igraph or something, but it seems >> that >> > you won't get much of a speedup doing that, unless you also rewrite the >> > algorithms. Since the beauty of networkx is the python algorithms- you >> may >> > as well then just use another package from the outset. I've been >> looking at >> > neo4j (http://neo4j.org/) for very large graphs, and while it certainly >> > scales very well, it is really missing good tools for network analysis. >> > There are python bindings for neo4j ( >> http://docs.neo4j.org/chunked/snapshot/python-embedded.html) which need >> > Jpype (http://jpype.sourceforge.net/) since neo4j is in java- it would >> be >> > an easy job to put together readers and writers which would load and >> store >> > networkx graphs in neo4j... probably a few have already done this?
>> > I think it would also be possible to use neo4j as a backend for >> networkx- >> > neo4j has Nodes and Relationships which would map nicely to nodes and >> edges >> > in networkx. The Nodes and Relationships have property maps which could >> > store any attributes of nodes/edges. Neo4j would handle the memory >> caching >> > itself, so you would just send an access request (modify __getitem__ in >> > graph.py) for the nodes/edges and let neo4j consult its stores and >> figure >> > out how to return the data to you. While this might work for small >> graphs- >> > its not really how the scalable graph databases are designed to work. >> For >> > example: every write operation has to be wrapped in a transaction- this >> > will kill your performance if you do it for every individual write, but >> is >> > manageable if you batch the transactions together. This will likely >> require >> > most of the algorithms to be reworked. The advantages of neo4j probably >> > come from its indexes- you would have to figure out how to use these >> > efficiently in the networkx algorithms. Many algorithms in networkx >> start >> > with a list [] of nodes- this is not going to work either and would >> likely >> > have to be modified to use the Traversal framework ( >> http://docs.neo4j.org/chunked/stable/tutorial-traversal-concepts.html). >> You >> > might also look at Gremlin (https://github.com/tinkerpop/gremlin/wiki) >> to >> > get an idea for an efficient way of doing network analysis on >> distributed >> > graphs.
>> > It would be nice to see how far you can get with a neo4j (or similar >> > backend). Modifying the graph class in networkx to use a different >> backend >> > should be straightforward enough, but ensuring all (or most) the >> algorithms >> > still work, and work efficiently will be the hard part.
>> > regards,
>> > On Tue, Mar 20, 2012 at 5:57 PM, project2501 <darreng5...@gmail.com> >> wrote: >> > > Yeah, because for some scales of graph, they won't fit in RAM anymore >> > > or on a single machine. >> > > That's the size I'm pursuing code for.
>> > > The general logic behind the data structure would go something like >> > > this:
>> > > 1. When a node/edge or whatever is to be de-referenced by its id, the >> > > local memory cache is consulted. >> > > 2. If it doesn't reside in memory cache, then its queried from offline >> > > storage (say mongodb). Mongo will take care of >> > > identifying the server it actually resides on. >> > > 3. That node/edge is pulled into the cache where its properties are >> > > available.
>> > > So consider a depth first search algorithm. It would only need the >> > > starting node in the cache, along with >> > > its out and in vertex information (edges). From there, as networkx >> > > consults its internal matrix, the de-referencing >> > > process causes lookups to occur transparently. This can mean >> > > consulting a large DHT or routing to other machines, >> > > i.e. crawling.
>> > > networkx only sees its usual internal data structure.
>> > > Which classes should I look at to implement this? I'm not all that >> > > knowledgable on networkx code.
>> > > On Mar 20, 1:21 pm, Tim Howard <tghow...@gmail.com> wrote: >> > > > I've had great luck with SQLite and using it as storage for >> NetworkX. >> > > > Obviously, I currently write the entire graph structure to and from >> > > > SQLite and RAM and all the algorithms occur in RAM. I think it would >> > > > be great to have a persistent implementation as discussed below.
>> > > > Tim
>> > > > On Tue, Mar 20, 2012 at 11:22 AM, Aric Hagberg < >> aric.hagb...@gmail.com> >> > > wrote: >> > > > > On Tue, Mar 20, 2012 at 8:55 AM, project2501 < >> darreng5...@gmail.com> >> > > wrote: >> > > > >> Hi, >> > > > >> I'm interested in providing a persistent
For starters, inside Graph, instead of doing something like neighbors[a][b][c], make it a method call.
neighbors(a,b,c,...)
Let the default implementation use a dict of dicts inside of neighbors() if it wants. A graphdb implementation will construct the appropriate joined query and/or consult its immediate cache.
This should be done. Vastly useful in today's big data world.
On Tuesday, April 3, 2012 9:12:57 AM UTC-4, project2501 wrote:
> Yeah, I get that graph databases are optimized for graph storage and > retrieval.
> I would certainly use one in an implementation underneath networkx. The > problem is, > as I mention, that the graph databases do not come with the algorithms > networkx comes with. > Also, reading the required nodes into memory then doing networkx > algorithms presents a problem. > Which nodes to read into memory. And also, perhaps not all the required > ones fit into memory.
> Replacing the networkx in-memory data model with an interface to a memory > or persistent > model will solve both of these problems.
> As with most engineering problems, there are tradeoffs. Using networkx to > navigate a persistent > graph database would not be 'dreadfully' slow in my opinion for many > common use cases. Slow > is relative. If done using a forward cache in memory or DHT type system, > it would perform > acceptably given the power it brings - scale - and access to much more > information.
> On Monday, March 26, 2012 9:57:31 AM UTC-4, tcb wrote:
>> On Sun, Mar 25, 2012 at 2:27 PM, project2501 <darreng5...@gmail.com>wrote:
>>> In fact, I came from the whole neo4j,orientdb graph realm before >>> considering this idea.
>>> Couple observations:
>>> 1. They don't support native algorithms, so I would have to write my >>> own or patch a set of them over the database. >>> So that would be starting with the database and finding/writing the >>> algorithms (which I'm less adept at). >>> 2. Any of those databases could be used as the long-term storage >>> backend for what I'm describing above. >>> Some study of mapping algorithm code to efficient database queries >>> would need to be done (outside my scope).
>> That's about right.
>> You could also look at the previously mentioned cloudlight to see how a >> persistent backend is done with SQLite.
>>> So, the inverse of that approach is starting with the algorithm >>> package and layering the persistence underneath. >>> networkx not only has algorithms I need, but its python, and also has >>> export to javascript and SVG. two critical pieces for me.
>>> Since I'm fairly decent with persistence, it _seems_ easier for me to >>> add that piece into networkx.
>>> Of course, it won't perform the same as memory, but only on first >>> calls as I intend to cache the data in local memory, >>> backed by memcache (which is distributed and fast), backed by mongo >>> (long-term storage). If I implement the Graph.py >>> class correctly, networkx algorithms will solve the problem of what >>> queries to issue, in what order and when to solve >>> graph theoretic problems over the persistence layers. Both mongodb and >>> memcache scale horizontally and perform >>> exceptionally well. So even though it won't compare to memory, when >>> you're operating on graphs of enormous size >>> in "ad hoc" ways. Which is to say, user-driven request/response modes. >>> Its not possible to have ALL the graph >>> data in memory waiting for them.
>> Sure, but this is really the problem that the graph databases like neo4j >> are trying to solve. If you can pull all your data from storage into >> memory, then you have no problems running networkx algorithms directly. But >> the graph databases are designed for those situations where you have more >> data than you can fit in memory on a single machine. This is much more of a >> problem for graph algorithms because of the non locality of the data. For >> example, finding neighbours of a node might mean pulling data from several >> different machines- there is usually no simple way of partitioning your >> graph nicely among N processors so that each processor has just the nodes >> and edges it needs.
>> Perhaps as first step, and to see how it might pan out, you could ignore >> those issues and just implement the persistent storage. If the networkx >> algorithms manipulate the Graph data through a standard api then they >> should 'just work', but they will be dreadfully slow.
>> The next step then, would to take advantage of the ability to work with >> large datasets, and perhaps to execute some algorithms in parallel using >> the native abilities of the backend. A parallel search algorithm >> implemented on a scalable graph database backend would be a tremendously >> useful addition to networkx.
>>> Having said all that, I think a good first step (for study or general >>> use) is to see how this approach works or not.
>>> :)
>>> On Mar 22, 10:20 am, tcb <thecolourblu...@gmail.com> wrote: >>> > Hi,
>>> > Its perhaps worth separating the two ideas...
>>> > The idea of a persistent storage for networkx graphs is a good one, >>> and has >>> > been brought up many times- for many people this will mean just >>> writing a >>> > large graph into some format which is easy to store and transport to >>> other >>> > machines or applications. SQLite would seem to be ideal for this job, >>> and >>> > I'm sure several people have already written their own readers and >>> writers >>> > for sqlite.
>>> > To change the backend of networkx from a dict of dicts to something >>> else, >>> > which implements persistent storage and is scalable across multiple >>> > processors is another job altogether. It has been suggested to >>> implement a >>> > native backend in for eg. boost or igraph or something, but it seems >>> that >>> > you won't get much of a speedup doing that, unless you also rewrite the >>> > algorithms. Since the beauty of networkx is the python algorithms- you >>> may >>> > as well then just use another package from the outset. I've been >>> looking at >>> > neo4j (http://neo4j.org/) for very large graphs, and while it >>> certainly >>> > scales very well, it is really missing good tools for network analysis. >>> > There are python bindings for neo4j ( >>> http://docs.neo4j.org/chunked/snapshot/python-embedded.html) which need >>> > Jpype (http://jpype.sourceforge.net/) since neo4j is in java- it >>> would be >>> > an easy job to put together readers and writers which would load and >>> store >>> > networkx graphs in neo4j... probably a few have already done this?
>>> > I think it would also be possible to use neo4j as a backend for >>> networkx- >>> > neo4j has Nodes and Relationships which would map nicely to nodes and >>> edges >>> > in networkx. The Nodes and Relationships have property maps which could >>> > store any attributes of nodes/edges. Neo4j would handle the memory >>> caching >>> > itself, so you would just send an access request (modify __getitem__ in >>> > graph.py) for the nodes/edges and let neo4j consult its stores and >>> figure >>> > out how to return the data to you. While this might work for small >>> graphs- >>> > its not really how the scalable graph databases are designed to work. >>> For >>> > example: every write operation has to be wrapped in a transaction- this >>> > will kill your performance if you do it for every individual write, >>> but is >>> > manageable if you batch the transactions together. This will likely >>> require >>> > most of the algorithms to be reworked. The advantages of neo4j probably >>> > come from its indexes- you would have to figure out how to use these >>> > efficiently in the networkx algorithms. Many algorithms in networkx >>> start >>> > with a list [] of nodes- this is not going to work either and would >>> likely >>> > have to be modified to use the Traversal framework ( >>> http://docs.neo4j.org/chunked/stable/tutorial-traversal-concepts.html). >>> You >>> > might also look at Gremlin (https://github.com/tinkerpop/gremlin/wiki) >>> to >>> > get an idea for an efficient way of doing network analysis on >>> distributed >>> > graphs.
>>> > It would be nice to see how far you can get with a neo4j (or similar >>> > backend). Modifying the graph class in networkx to use a different >>> backend >>> > should be straightforward enough, but ensuring all (or most) the >>> algorithms >>> > still work, and work efficiently will be the hard part.
>>> > regards,
>>> > On Tue, Mar 20, 2012 at 5:57 PM, project2501 <darreng5...@gmail.com> >>> wrote: >>> > > Yeah, because for some scales of graph, they won't fit in RAM anymore >>> > > or on a single machine. >>> > > That's the size I'm pursuing code for.
>>> > > The general logic behind the data structure would go something like >>> > > this:
>>> > > 1. When a node/edge or whatever is to be de-referenced by its id, the >>> > > local memory cache is consulted. >>> > > 2. If it doesn't reside in memory cache, then its queried from >>> offline >>> > > storage (say mongodb). Mongo will take care of >>> > > identifying the server it actually resides on. >>> > > 3. That node/edge is pulled into the cache where its properties are >>> > > available.
>>> > > So consider a depth first search algorithm. It would only need the >>> > > starting node in the cache, along with >>> > > its out and in vertex information (edges). From there, as networkx >>> > > consults its internal matrix, the de-referencing >>> > > process causes lookups to occur transparently. This can mean >>> > > consulting a large DHT or routing to other machines, >>> > > i.e. crawling.
>>> > > networkx only sees its usual internal data structure.
>>> > > Which classes should I look at to implement this? I'm not all that >>> > > knowledgable
On Tue, Apr 3, 2012 at 2:12 PM, project2501 <darreng5...@gmail.com> wrote: > Yeah, I get that graph databases are optimized for graph storage and > retrieval.
> I would certainly use one in an implementation underneath networkx. The > problem is, > as I mention, that the graph databases do not come with the algorithms > networkx comes with. > Also, reading the required nodes into memory then doing networkx > algorithms presents a problem. > Which nodes to read into memory. And also, perhaps not all the required > ones fit into memory.
> Replacing the networkx in-memory data model with an interface to a memory > or persistent > model will solve both of these problems.
hmm, I'm not so sure. I wonder if you are underestimating the issues (or maybe I'm over estimating them?).
The reason graph databases don't come with the kind of algorithms networkx has is because they just don't work at that kind of scale. Replacing the storage model does not solve the in-memory data problem from the point of view of the algorithm. What the graph db's can do with reasonable efficiency is graph traversals. So any algorithm in networkx which relies on a traversal (and there are quite a few) could (possibly) be made to use the scalable graph traversal, and could (possibly) work reasonably well.
An example of the problems I would envisage are (at random) from core.py:core_number(G) - the first line in this algorithm does this:
degrees=G.degree()
that's a list of degrees of all nodes, and it makes another similar list for the cores which it proceeds to modify. If you have a large enough graph which requires distributed memory this just wont work (you just can't make a single list with size of number_of_nodes). You could try to make a new property on the node and update that property through the graph db backend- but in many cases to avoid this issue will involve substantial reworking of the algorithms.
If you can fit all your data in memory, it seems pointless to use a persistent storage backend- it will never compete with networkx's own Graph (neither in memory usage or speed). In this case, just use networkx directly, run the standard algorithms and at the end use the readwrite functions to write your graph back to persistent storage. So networkx should have the ability to read and write graphs from persistent storage, but it uses its own data model and algorithms.
If your graphs are too big for a single machine, then you need the distributed graph databases (neo4j as an example). These can do distributed storage and distributed graph traversal. But distributed computation is another field altogether and networkx's algorithms are just not designed to work for those problems. Maybe you can make some of the algorithms work acceptably well- I'd love to be proved wrong... ;)
> As with most engineering problems, there are tradeoffs. Using networkx to > navigate a persistent > graph database would not be 'dreadfully' slow in my opinion for many > common use cases. Slow > is relative. If done using a forward cache in memory or DHT type system, > it would perform > acceptably given the power it brings - scale - and access to much more > information.
> On Monday, March 26, 2012 9:57:31 AM UTC-4, tcb wrote:
>> On Sun, Mar 25, 2012 at 2:27 PM, project2501 <darreng5...@gmail.com>wrote:
>>> In fact, I came from the whole neo4j,orientdb graph realm before >>> considering this idea.
>>> Couple observations:
>>> 1. They don't support native algorithms, so I would have to write my >>> own or patch a set of them over the database. >>> So that would be starting with the database and finding/writing the >>> algorithms (which I'm less adept at). >>> 2. Any of those databases could be used as the long-term storage >>> backend for what I'm describing above. >>> Some study of mapping algorithm code to efficient database queries >>> would need to be done (outside my scope).
>> That's about right.
>> You could also look at the previously mentioned cloudlight to see how a >> persistent backend is done with SQLite.
>>> So, the inverse of that approach is starting with the algorithm >>> package and layering the persistence underneath. >>> networkx not only has algorithms I need, but its python, and also has >>> export to javascript and SVG. two critical pieces for me.
>>> Since I'm fairly decent with persistence, it _seems_ easier for me to >>> add that piece into networkx.
>>> Of course, it won't perform the same as memory, but only on first >>> calls as I intend to cache the data in local memory, >>> backed by memcache (which is distributed and fast), backed by mongo >>> (long-term storage). If I implement the Graph.py >>> class correctly, networkx algorithms will solve the problem of what >>> queries to issue, in what order and when to solve >>> graph theoretic problems over the persistence layers. Both mongodb and >>> memcache scale horizontally and perform >>> exceptionally well. So even though it won't compare to memory, when >>> you're operating on graphs of enormous size >>> in "ad hoc" ways. Which is to say, user-driven request/response modes. >>> Its not possible to have ALL the graph >>> data in memory waiting for them.
>> Sure, but this is really the problem that the graph databases like neo4j >> are trying to solve. If you can pull all your data from storage into >> memory, then you have no problems running networkx algorithms directly. But >> the graph databases are designed for those situations where you have more >> data than you can fit in memory on a single machine. This is much more of a >> problem for graph algorithms because of the non locality of the data. For >> example, finding neighbours of a node might mean pulling data from several >> different machines- there is usually no simple way of partitioning your >> graph nicely among N processors so that each processor has just the nodes >> and edges it needs.
>> Perhaps as first step, and to see how it might pan out, you could ignore >> those issues and just implement the persistent storage. If the networkx >> algorithms manipulate the Graph data through a standard api then they >> should 'just work', but they will be dreadfully slow.
>> The next step then, would to take advantage of the ability to work with >> large datasets, and perhaps to execute some algorithms in parallel using >> the native abilities of the backend. A parallel search algorithm >> implemented on a scalable graph database backend would be a tremendously >> useful addition to networkx.
>>> Having said all that, I think a good first step (for study or general >>> use) is to see how this approach works or not.
>>> :)
>>> On Mar 22, 10:20 am, tcb <thecolourblu...@gmail.com> wrote: >>> > Hi,
>>> > Its perhaps worth separating the two ideas...
>>> > The idea of a persistent storage for networkx graphs is a good one, >>> and has >>> > been brought up many times- for many people this will mean just >>> writing a >>> > large graph into some format which is easy to store and transport to >>> other >>> > machines or applications. SQLite would seem to be ideal for this job, >>> and >>> > I'm sure several people have already written their own readers and >>> writers >>> > for sqlite.
>>> > To change the backend of networkx from a dict of dicts to something >>> else, >>> > which implements persistent storage and is scalable across multiple >>> > processors is another job altogether. It has been suggested to >>> implement a >>> > native backend in for eg. boost or igraph or something, but it seems >>> that >>> > you won't get much of a speedup doing that, unless you also rewrite the >>> > algorithms. Since the beauty of networkx is the python algorithms- you >>> may >>> > as well then just use another package from the outset. I've been >>> looking at >>> > neo4j (http://neo4j.org/) for very large graphs, and while it >>> certainly >>> > scales very well, it is really missing good tools for network analysis. >>> > There are python bindings for neo4j (http://docs.neo4j.org/** >>> chunked/snapshot/python-**embedded.html<http://docs.neo4j.org/chunked/snapshot/python-embedded.html>) >>> which need >>> > Jpype (http://jpype.sourceforge.net/**) since neo4j is in java- it >>> would be >>> > an easy job to put together readers and writers which would load and >>> store >>> > networkx graphs in neo4j... probably a few have already done this?
>>> > I think it would also be possible to use neo4j as a backend for >>> networkx- >>> > neo4j has Nodes and Relationships which would map nicely to nodes and >>> edges >>> > in networkx. The Nodes and Relationships have property maps which could >>> > store any attributes of nodes/edges. Neo4j would handle the memory >>> caching >>> > itself, so you would just send an access request (modify __getitem__ in >>> > graph.py) for the nodes/edges and let neo4j consult its stores and >>> figure >>> > out how to return the data to you. While this might work for small >>> graphs- >>> > its not really how the scalable graph databases are designed to work. >>> For >>> > example: every write operation has to be wrapped in a transaction- this >>> > will kill your performance if you do it for every individual write, >>> but is >>> > manageable if you batch the transactions together. This will likely >>> require >>> > most of the algorithms to be reworked. The advantages of neo4j probably >>> > come from its indexes- you would have to figure out how to use these >>> > efficiently in the networkx algorithms. Many algorithms in networkx >>> start >>> > with a list [] of nodes- this is not going to work either and would >>> likely >>> > have to be modified to use the Traversal framework ( >>> http://docs.neo4j.org/**chunked/stable/tutorial-** >>> traversal-concepts.html<http://docs.neo4j.org/chunked/stable/tutorial-traversal-concepts.html>). >>> You >>> > might also look at Gremlin
tcb, You make some valid points. However, I think this space of graphs, graphdbs, algorithms and the like is accelerating quickly. It has evolved quite a bite from just a few years ago. So my request to the networkx team is to simply put a data access layer in the Graph class to invite innovation in. Bundle with a default MemoryDictOfDicts implementation, but make it easy for people in the graph world to push the envelope.
Now,, I don't know the ultimate performance metrics, but I'm hoping to find out what the tradeoffs are because, for me at least, its certain that A) the nodes won't fit all into memory, and B) this capability prevails over the performance for now. In other words, my customers would rather have the ability to perform insightful algorithms over vast quantities of distributed data at whatever performance level can be achieved, than to have nothing.
So let's open the doors on this! It seems a simple modification to Graph class to add data access object or pattern (for one of the authors that is). From there, I expect we all will be surprised at what the community does after that!
On Tuesday, April 3, 2012 12:19:08 PM UTC-4, tcb wrote:
> On Tue, Apr 3, 2012 at 2:12 PM, project2501 <darreng5...@gmail.com> wrote:
>> Yeah, I get that graph databases are optimized for graph storage and >> retrieval.
>> I would certainly use one in an implementation underneath networkx. The >> problem is, >> as I mention, that the graph databases do not come with the algorithms >> networkx comes with. >> Also, reading the required nodes into memory then doing networkx >> algorithms presents a problem. >> Which nodes to read into memory. And also, perhaps not all the required >> ones fit into memory.
>> Replacing the networkx in-memory data model with an interface to a memory >> or persistent >> model will solve both of these problems.
> hmm, I'm not so sure. I wonder if you are underestimating the issues (or > maybe I'm over estimating them?).
> The reason graph databases don't come with the kind of algorithms networkx > has is because they just don't work at that kind of scale. Replacing the > storage model does not solve the in-memory data problem from the point of > view of the algorithm. What the graph db's can do with reasonable > efficiency is graph traversals. So any algorithm in networkx which relies > on a traversal (and there are quite a few) could (possibly) be made to use > the scalable graph traversal, and could (possibly) work reasonably well.
> An example of the problems I would envisage are (at random) from > core.py:core_number(G) - the first line in this algorithm does this:
> degrees=G.degree()
> that's a list of degrees of all nodes, and it makes another similar list > for the cores which it proceeds to modify. If you have a large enough graph > which requires distributed memory this just wont work (you just can't make > a single list with size of number_of_nodes). You could try to make a new > property on the node and update that property through the graph db backend- > but in many cases to avoid this issue will involve substantial reworking of > the algorithms.
> If you can fit all your data in memory, it seems pointless to use a > persistent storage backend- it will never compete with networkx's own Graph > (neither in memory usage or speed). In this case, just use networkx > directly, run the standard algorithms and at the end use the readwrite > functions to write your graph back to persistent storage. So networkx > should have the ability to read and write graphs from persistent storage, > but it uses its own data model and algorithms.
> If your graphs are too big for a single machine, then you need the > distributed graph databases (neo4j as an example). These can do distributed > storage and distributed graph traversal. But distributed computation is > another field altogether and networkx's algorithms are just not designed to > work for those problems. Maybe you can make some of the algorithms work > acceptably well- I'd love to be proved wrong... ;)
> regards
>> As with most engineering problems, there are tradeoffs. Using networkx to >> navigate a persistent >> graph database would not be 'dreadfully' slow in my opinion for many >> common use cases. Slow >> is relative. If done using a forward cache in memory or DHT type system, >> it would perform >> acceptably given the power it brings - scale - and access to much more >> information.
>> On Monday, March 26, 2012 9:57:31 AM UTC-4, tcb wrote:
>>> On Sun, Mar 25, 2012 at 2:27 PM, project2501 <darreng5...@gmail.com>wrote:
>>>> In fact, I came from the whole neo4j,orientdb graph realm before >>>> considering this idea.
>>>> Couple observations:
>>>> 1. They don't support native algorithms, so I would have to write my >>>> own or patch a set of them over the database. >>>> So that would be starting with the database and finding/writing the >>>> algorithms (which I'm less adept at). >>>> 2. Any of those databases could be used as the long-term storage >>>> backend for what I'm describing above. >>>> Some study of mapping algorithm code to efficient database queries >>>> would need to be done (outside my scope).
>>> That's about right.
>>> You could also look at the previously mentioned cloudlight to see how a >>> persistent backend is done with SQLite.
>>>> So, the inverse of that approach is starting with the algorithm >>>> package and layering the persistence underneath. >>>> networkx not only has algorithms I need, but its python, and also has >>>> export to javascript and SVG. two critical pieces for me.
>>>> Since I'm fairly decent with persistence, it _seems_ easier for me to >>>> add that piece into networkx.
>>>> Of course, it won't perform the same as memory, but only on first >>>> calls as I intend to cache the data in local memory, >>>> backed by memcache (which is distributed and fast), backed by mongo >>>> (long-term storage). If I implement the Graph.py >>>> class correctly, networkx algorithms will solve the problem of what >>>> queries to issue, in what order and when to solve >>>> graph theoretic problems over the persistence layers. Both mongodb and >>>> memcache scale horizontally and perform >>>> exceptionally well. So even though it won't compare to memory, when >>>> you're operating on graphs of enormous size >>>> in "ad hoc" ways. Which is to say, user-driven request/response modes. >>>> Its not possible to have ALL the graph >>>> data in memory waiting for them.
>>> Sure, but this is really the problem that the graph databases like neo4j >>> are trying to solve. If you can pull all your data from storage into >>> memory, then you have no problems running networkx algorithms directly. But >>> the graph databases are designed for those situations where you have more >>> data than you can fit in memory on a single machine. This is much more of a >>> problem for graph algorithms because of the non locality of the data. For >>> example, finding neighbours of a node might mean pulling data from several >>> different machines- there is usually no simple way of partitioning your >>> graph nicely among N processors so that each processor has just the nodes >>> and edges it needs.
>>> Perhaps as first step, and to see how it might pan out, you could ignore >>> those issues and just implement the persistent storage. If the networkx >>> algorithms manipulate the Graph data through a standard api then they >>> should 'just work', but they will be dreadfully slow.
>>> The next step then, would to take advantage of the ability to work with >>> large datasets, and perhaps to execute some algorithms in parallel using >>> the native abilities of the backend. A parallel search algorithm >>> implemented on a scalable graph database backend would be a tremendously >>> useful addition to networkx.
>>>> Having said all that, I think a good first step (for study or general >>>> use) is to see how this approach works or not.
>>>> :)
>>>> On Mar 22, 10:20 am, tcb <thecolourblu...@gmail.com> wrote: >>>> > Hi,
>>>> > Its perhaps worth separating the two ideas...
>>>> > The idea of a persistent storage for networkx graphs is a good one, >>>> and has >>>> > been brought up many times- for many people this will mean just >>>> writing a >>>> > large graph into some format which is easy to store and transport to >>>> other >>>> > machines or applications. SQLite would seem to be ideal for this job, >>>> and >>>> > I'm sure several people have already written their own readers and >>>> writers >>>> > for sqlite.
>>>> > To change the backend of networkx from a dict of dicts to something >>>> else, >>>> > which implements persistent storage and is scalable across multiple >>>> > processors is another job altogether. It has been suggested to >>>> implement a >>>> > native backend in for eg. boost or igraph or something, but it seems >>>> that >>>> > you won't get much of a speedup doing that, unless you also rewrite >>>> the >>>> > algorithms. Since the beauty of networkx is the python algorithms- >>>> you may >>>> > as well then just use another package from the outset. I've been >>>> looking at >>>> > neo4j (http://neo4j.org/) for very large graphs, and while it >>>> certainly >>>> > scales very well, it is really missing good tools for network >>>> analysis. >>>> > There are python bindings for neo4j (http://docs.neo4j.org/** >>>> chunked/snapshot/python-**embedded.html<http://docs.neo4j.org/chunked/snapshot/python-embedded.html>) >>>> which need >>>> > Jpype (http://jpype.sourceforge.net/**) since neo4j is in java- it >>>> would be >>>> > an easy job to put together readers and writers which would load and >>>> store >>>> > networkx graphs in neo4j... probably a few have already
On Tue, Apr 3, 2012 at 4:13 PM, project2501 <darreng5...@gmail.com> wrote: > tcb, > You make some valid points. However, I think this space of graphs, > graphdbs, algorithms and the like is accelerating quickly. > It has evolved quite a bite from just a few years ago. So my request to the > networkx team is to simply put a data access layer in the Graph class > to invite innovation in. Bundle with a default MemoryDictOfDicts > implementation, but make it easy for people in the graph world to push the > envelope.
> Now,, I don't know the ultimate performance metrics, but I'm hoping to find > out what the tradeoffs are because, for me at least, its certain > that A) the nodes won't fit all into memory, and B) this capability prevails > over the performance for now. In other words, my customers would rather > have the ability to perform insightful algorithms over vast quantities of > distributed data at whatever performance level can be achieved, than > to have nothing.
> So let's open the doors on this! It seems a simple modification to Graph > class to add data access object or pattern (for one of the authors that is).
Do you have some suggestions on how this might be implemented? There are lots of engineering trade-offs that will come into play but if there was a clear design strategy that would help figure those out.
I need to study/trace the algorithm execution some more and will have some ideas after that.
But generally speaking, it would be something like this. Rather than loading all the nodes of a graph into memory before the algorithm runs, let the algorithm request the nodes from the underlying graph implementation that results in the graph implementation providing them. The default implementation may very well pre-load them into memory, but other implementations might fetch them for persistence if necessary - resulting in more of a crawling effect.
For example, to initiate a breadth first type search, I need a starting node only, not all nodes. Then, as the algorithm requests edges, adjacent nodes etc. from the graph, the graph retrieves them if they are not already loaded into memory.
If the algorithms operate on a list of lists residing in memory already, then this will require more work to separate and delegate to the Graph class.
On Tuesday, April 3, 2012 10:14:50 PM UTC-4, A Hagberg wrote:
> On Tue, Apr 3, 2012 at 4:13 PM, project2501 <darreng5...@gmail.com> wrote: > > tcb, > > You make some valid points. However, I think this space of graphs, > > graphdbs, algorithms and the like is accelerating quickly. > > It has evolved quite a bite from just a few years ago. So my request to > the > > networkx team is to simply put a data access layer in the Graph class > > to invite innovation in. Bundle with a default MemoryDictOfDicts > > implementation, but make it easy for people in the graph world to push > the > > envelope.
> > Now,, I don't know the ultimate performance metrics, but I'm hoping to > find > > out what the tradeoffs are because, for me at least, its certain > > that A) the nodes won't fit all into memory, and B) this capability > prevails > > over the performance for now. In other words, my customers would rather > > have the ability to perform insightful algorithms over vast quantities of > > distributed data at whatever performance level can be achieved, than > > to have nothing.
> > So let's open the doors on this! It seems a simple modification to Graph > > class to add data access object or pattern (for one of the authors that > is).
> Do you have some suggestions on how this might be implemented? There > are lots of engineering trade-offs that will come into play but if > there was a clear design strategy that would help figure those out.
On Wed, Apr 4, 2012 at 12:50 PM, project2501 <darreng5...@gmail.com> wrote: > I need to study/trace the algorithm execution some more and will have some > ideas after that.
> But generally speaking, it would be something like this. Rather than > loading all the nodes > of a graph into memory before the algorithm runs, let the algorithm > request the nodes from > the underlying graph implementation that results in the graph > implementation providing them. > The default implementation may very well pre-load them into memory, but > other implementations > might fetch them for persistence if necessary - resulting in more of a > crawling effect.
> For example, to initiate a breadth first type search, I need a starting > node only, not all nodes. > Then, as the algorithm requests edges, adjacent nodes etc. from the graph, > the graph retrieves > them if they are not already loaded into memory.
that's a good place to start- most of the persistent graph databases provide efficient traversal routines, and it should be possible to swap out the networkx traversals and replace them with the graph db traversals. There might still be some modifications to the algorithms, but hopefully nothing too serious.
You might be interested to compare with the approach taken by boost bgl to implement their library on top of mpi:
They manage to provide a fairly transparent interface to a distributed graph storage backend, but the algorithms still require fairly major surgery (although their focus is to make the algorithms run efficiently). Using an existing distributed graph db (like neo4j) means most of the backend stuff is done for you. Another key piece of the parallel bgl is a distributed map/list object. You could possibly design an object with a similar interface to a list/dict which uses the underlying distributed graph db for storage. There would then be minimal changes to the algorithms which would use the distributed lists/dicts when running in 'distributed' mode.
You might also look at other existing approaches to graph processing at scale, neatly summarised here
> If the algorithms operate on a list of lists residing in memory already, > then this will require > more work to separate and delegate to the Graph class.
> On Tuesday, April 3, 2012 10:14:50 PM UTC-4, A Hagberg wrote:
>> On Tue, Apr 3, 2012 at 4:13 PM, project2501 <darreng5...@gmail.com> >> wrote: >> > tcb, >> > You make some valid points. However, I think this space of graphs, >> > graphdbs, algorithms and the like is accelerating quickly. >> > It has evolved quite a bite from just a few years ago. So my request to >> the >> > networkx team is to simply put a data access layer in the Graph class >> > to invite innovation in. Bundle with a default MemoryDictOfDicts >> > implementation, but make it easy for people in the graph world to push >> the >> > envelope.
>> > Now,, I don't know the ultimate performance metrics, but I'm hoping to >> find >> > out what the tradeoffs are because, for me at least, its certain >> > that A) the nodes won't fit all into memory, and B) this capability >> prevails >> > over the performance for now. In other words, my customers would rather >> > have the ability to perform insightful algorithms over vast quantities >> of >> > distributed data at whatever performance level can be achieved, than >> > to have nothing.
>> > So let's open the doors on this! It seems a simple modification to Graph >> > class to add data access object or pattern (for one of the authors that >> is).
>> Do you have some suggestions on how this might be implemented? There >> are lots of engineering trade-offs that will come into play but if >> there was a clear design strategy that would help figure those out.
> To post to this group, send email to networkx-discuss@googlegroups.com. > To unsubscribe from this group, send email to > networkx-discuss+unsubscribe@googlegroups.com. > For more options, visit this group at > http://groups.google.com/group/networkx-discuss?hl=en.
tcb, Thanks for the links. However, the one about boost bgl seems dated and the page looks unsupported (images don't event load), but I will dig deeper.
I'm aware of Pregel, but its proprietary I believe. I use Neo4j already and would probably pursue that as the engine, but would appeal to networkx engineers to devise a good way to separate data access from the algorithms making the coupling easier.
Perhaps down the road for networkx something as this emerges...
On Wednesday, April 4, 2012 8:52:12 AM UTC-4, tcb wrote:
> On Wed, Apr 4, 2012 at 12:50 PM, project2501 <darreng5...@gmail.com>wrote:
>> I need to study/trace the algorithm execution some more and will have >> some ideas after that.
>> But generally speaking, it would be something like this. Rather than >> loading all the nodes >> of a graph into memory before the algorithm runs, let the algorithm >> request the nodes from >> the underlying graph implementation that results in the graph >> implementation providing them. >> The default implementation may very well pre-load them into memory, but >> other implementations >> might fetch them for persistence if necessary - resulting in more of a >> crawling effect.
>> For example, to initiate a breadth first type search, I need a starting >> node only, not all nodes. >> Then, as the algorithm requests edges, adjacent nodes etc. from the >> graph, the graph retrieves >> them if they are not already loaded into memory.
> that's a good place to start- most of the persistent graph databases > provide efficient traversal routines, and it should be possible to swap out > the networkx traversals and replace them with the graph db traversals. > There might still be some modifications to the algorithms, but hopefully > nothing too serious.
> You might be interested to compare with the approach taken by boost bgl to > implement their library on top of mpi:
> They manage to provide a fairly transparent interface to a distributed > graph storage backend, but the algorithms still require fairly major > surgery (although their focus is to make the algorithms run efficiently). > Using an existing distributed graph db (like neo4j) means most of the > backend stuff is done for you. Another key piece of the parallel bgl is a > distributed map/list object. You could possibly design an object with a > similar interface to a list/dict which uses the underlying distributed > graph db for storage. There would then be minimal changes to the algorithms > which would use the distributed lists/dicts when running in 'distributed' > mode.
> You might also look at other existing approaches to graph processing at > scale, neatly summarised here
>> If the algorithms operate on a list of lists residing in memory already, >> then this will require >> more work to separate and delegate to the Graph class.
>> On Tuesday, April 3, 2012 10:14:50 PM UTC-4, A Hagberg wrote:
>>> On Tue, Apr 3, 2012 at 4:13 PM, project2501 <darreng5...@gmail.com> >>> wrote: >>> > tcb, >>> > You make some valid points. However, I think this space of graphs, >>> > graphdbs, algorithms and the like is accelerating quickly. >>> > It has evolved quite a bite from just a few years ago. So my request >>> to the >>> > networkx team is to simply put a data access layer in the Graph class >>> > to invite innovation in. Bundle with a default MemoryDictOfDicts >>> > implementation, but make it easy for people in the graph world to push >>> the >>> > envelope.
>>> > Now,, I don't know the ultimate performance metrics, but I'm hoping to >>> find >>> > out what the tradeoffs are because, for me at least, its certain >>> > that A) the nodes won't fit all into memory, and B) this capability >>> prevails >>> > over the performance for now. In other words, my customers would rather >>> > have the ability to perform insightful algorithms over vast quantities >>> of >>> > distributed data at whatever performance level can be achieved, than >>> > to have nothing.
>>> > So let's open the doors on this! It seems a simple modification to >>> Graph >>> > class to add data access object or pattern (for one of the authors >>> that is).
>>> Do you have some suggestions on how this might be implemented? There >>> are lots of engineering trade-offs that will come into play but if >>> there was a clear design strategy that would help figure those out.
>> To post to this group, send email to networkx-discuss@googlegroups.com. >> To unsubscribe from this group, send email to >> networkx-discuss+unsubscribe@googlegroups.com. >> For more options, visit this group at >> http://groups.google.com/group/networkx-discuss?hl=en.
On Wed, Apr 4, 2012 at 1:38 PM, project2501 <darreng5...@gmail.com> wrote: > tcb, > Thanks for the links. However, the one about boost bgl seems dated and the > page looks unsupported (images don't event load), but I will dig deeper.
> I'm aware of Pregel, but its proprietary I believe. I use Neo4j already and > would probably pursue that as the engine, but would appeal to networkx > engineers to devise a good way to separate data access from the algorithms > making the coupling easier.
Can you be more specific about what would be required "to separate data access from the algorithms". We do have some separation and a (almost complete?) functional interface to get the data with object methods. In some places (algorithms) we access the data using Python built-in methods on the data dictionary because that is much faster.
Unfortunately the Neo4j licenses are not compatible with NetworkX's BSD. So we can't create a dependency on that package and keep the BSD. Are there other graph databases with BSD-compatible licenses?
Aric, The simple answer to this is to force the algorithms to use the Graph class methods exclusively. This would allow different Graph implementations to be used easily. Yes, accessing a native python dictionary with all the graph nodes will perform well, but of course, not scale. This is one tradeoff to consider.
However, it would be possible for a Graph implementation to return any kind of native dictionary with its node data. I thought also, this might work with a custom dictionary implementation but I tend to prefer doing this kind of abstraction at the Class/method level.
Yet another approach would be to associate the algorithms with a Graph class such that a Graph object can return instances of the algorithms. In this way, an algorithm implementation can be returned from a Graph object and both implementations can be encapsulated so they perform best for that implementation. This would allow a native python object/list implementation of a Graph to return algorithm instances that expect those data structures and performance would be the same as now.
For a Graph implementation that uses something like Neo4j, it can return an algorithm implementation that takes advantage of its implementation.
As for Neo4j, networkx need not bind to any of its code because the server exposes REST/HTTP interfaces. Other graphdb do this as well. There is a native/embedded python wrapper for Neo4j.
As for license incompatibilities, its a good point. But if a class wrapped neo4j and was interface compatible with networkx, but didn't load networkx, then an end-user could load that class and tell networkx to use it as a Graph implementation. Thus keeping things separate in the libraries.
Other graph databases include orientdb, infinitegraph, dex. I've only used neo4j and orientdb.
On Wednesday, April 4, 2012 5:44:25 PM UTC-4, A Hagberg wrote:
> On Wed, Apr 4, 2012 at 1:38 PM, project2501 <darreng5...@gmail.com> wrote: > > tcb, > > Thanks for the links. However, the one about boost bgl seems dated and > the > > page looks unsupported (images don't event load), but I will dig deeper.
> > I'm aware of Pregel, but its proprietary I believe. I use Neo4j already > and > > would probably pursue that as the engine, but would appeal to networkx > > engineers to devise a good way to separate data access from the > algorithms > > making the coupling easier.
> Can you be more specific about what would be required "to separate > data access from the algorithms". > We do have some separation and a (almost complete?) functional > interface to get the data with object methods. > In some places (algorithms) we access the data using Python built-in > methods on the data dictionary because that is much faster.
> Unfortunately the Neo4j licenses are not compatible with NetworkX's > BSD. So we can't create a dependency on that package and keep the > BSD. Are there other graph databases with BSD-compatible licenses?
On Thu, Apr 5, 2012 at 4:28 PM, project2501 <darreng5...@gmail.com> wrote: > Aric, > The simple answer to this is to force the algorithms to use the Graph > class methods exclusively. > This would allow different Graph implementations to be used easily. Yes, > accessing a native > python dictionary with all the graph nodes will perform well, but of course, > not scale. This is one tradeoff > to consider.
> However, it would be possible for a Graph implementation to return any kind > of native dictionary with > its node data. I thought also, this might work with a custom dictionary > implementation but I tend to prefer > doing this kind of abstraction at the Class/method level.
At this point its arguable whether the Python dict methods have become the de facto public interface. I don't think we've ever been very clear on that one way or another, and we have exposed the dictionary methods in some places to get (significant) performance gains.
So certainly one approach is to provide a dictionary-like alternative data structure. I have a SciPy sparse graph implementation (not ready for release) that uses that technique.
> Yet another approach would be to associate the algorithms with a Graph class > such that a Graph object > can return instances of the algorithms. In this way, an algorithm > implementation can be returned from > a Graph object and both implementations can be encapsulated so they perform > best for that implementation. > This would allow a native python object/list implementation of a Graph to > return algorithm instances that > expect those data structures and performance would be the same as now.
> For a Graph implementation that uses something like Neo4j, it can return an > algorithm implementation > that takes advantage of its implementation.
I've generally been resisting that approach (e.g. adding more algorithms as graph class methods) since there are so many varied uses of NetworkX. Some people only want the base classes and a few algorithms (or even none), etc.
But perhaps, considering distributed memory or database backends, there is a subset of core algorithms that most other algorithms can use and that provide high performance access, traversal, or other functions? If such a set existed then I can see including those as base class methods.
> As for Neo4j, networkx need not bind to any of its code because the server > exposes REST/HTTP interfaces. > Other graphdb do this as well. There is a native/embedded python wrapper for > Neo4j.
> As for license incompatibilities, its a good point. But if a class wrapped > neo4j and was interface compatible > with networkx, but didn't load networkx, then an end-user could load that > class and tell networkx to use it > as a Graph implementation. Thus keeping things separate in the libraries.
Yes, that would work. But it wouldn't be part of NetworkX.
> Other graph databases include orientdb, infinitegraph, dex. I've only used > neo4j and orientdb.
Thanks, I'll take a look. Anyone else have suggestions?
Some algorithms don't need much from the Graph class.
Shortest_path uses G[n] or G.successors/G.predecessors (bidirectional). I count G[n] being used for iterating over neighbors as "using the Graph class methods".
Algorithms that use the syntax G[u].items() or G[u][v] are counting on G[n] returning an adjacency dict of edge attribute dicts. That can be a performance burden on data structures that don't use dicts.
Perhaps we could adjust the algorithms to use a new (and fast) method __call__ like: G(u, attr=None) to iterate over 2-tuples (nbr, attr_dict) if attr is None; or (nbr, attr_value) otherwise. This reduces time creating dicts for data strorage that doesn't use dicts. If an algorithm need dicts, it can create one with dict(G (u,attr=weight).
I think the change in the algorithms would be something like:
>>> for nbr,wt in G(n,attr=weight): >>> ....
Instead of
>>> for nbr,edata in G[n].items(): >>> wt=edata.get(weight,1) >>> ....
To directly replace G[u][v][weight] the algorithms would need G(u)[v] [weight]
I guess what I'm trying to say is that your idea to "use the Graph class methods exclusively" doesn't mean we have to give up the speed of G[u] or G[u] [v] for our base classes. But there may be some corner cases I'm not taking into account.
On Sat, Apr 7, 2012 at 12:07 PM, Dan Schult <dsch...@colgate.edu> wrote: > Some algorithms don't need much from the Graph class.
> Shortest_path uses G[n] or G.successors/G.predecessors > (bidirectional). I count G[n] being used for iterating over neighbors > as "using the Graph class methods".
> Algorithms that use the syntax G[u].items() or G[u][v] are counting > on G[n] returning an adjacency dict of edge attribute dicts. That can > be a performance burden on data structures that don't use dicts.
> Perhaps we could adjust the algorithms to use a new (and fast) > method __call__ like: G(u, attr=None) to iterate over 2-tuples > (nbr, attr_dict) if attr is None; or (nbr, attr_value) otherwise. > This reduces time creating dicts for data strorage that doesn't use dicts. > If an algorithm need dicts, it can create one with dict(G(u,attr=weight).
> I think the change in the algorithms would be something like:
>>>> for nbr,wt in G(n,attr=weight): >>>> ....
> Instead of
>>>> for nbr,edata in G[n].items(): >>>> wt=edata.get(weight,1) >>>> ....
> To directly replace G[u][v][weight] the algorithms would need > G(u)[v][weight]
> I guess what I'm trying to say is that your idea to "use the Graph class > methods > exclusively" doesn't mean we have to give up the speed of G[u] or G[u][v] > for > our base classes. But there may be some corner cases I'm not taking into > account.
That could work.
But I think the missing piece is the design document that that describes what the (Python language) public interface is for the Graph class(es) and data objects. We have discussed aspects of this many times but right now the only documentation is the code.
If a clear specification what is holding us back from engineering something that can use alternative/distributed data representations then maybe it is time to address that. (But in a different thread....).