Custom/persistent graph implementation?

project2501

unread,

Mar 20, 2012, 10:55:39 AM3/20/12

to networkx...@googlegroups.com

Hi,

I'm interested in providing a persistent implementation of the underlying graph representation for networkx.

I'm not talking about serializing a graph to and from storage. I'm talking about storing all the node/edge/property information in

a storage implementation, transparent to networkx, such that when networkx algorithms traverse nodes, edges, etc.

That information is retrieved from storage/cache using my dictionary implementation.

Is it possible? Any tips appreciated!

Aric Hagberg

unread,

Mar 20, 2012, 11:22:13 AM3/20/12

to networkx...@googlegroups.com

It's definitely possible and we have discussed it before.

There is a simple implementation using Python's shelve at
https://networkx.lanl.gov/trac/ticket/224
It hasn't been updated too recently.

More generally it would make sense to use one of the graph databases.
I've asked the community to suggest an appropriate one, but so far
nobody has replied. I am willing to consider re-engineering parts the
NetworkX core to make it work with a persistent database solution if
that is necessary.

Aric

project2501

unread,

Mar 20, 2012, 12:59:18 PM3/20/12

to networkx...@googlegroups.com

A couple thoughts I had to make this work.

- Tinkerpop has a general interface to graph databases and Bulbs (bulbflow) is a python API to the REST interfaces of the graph server.

However, this is my last suggestion since it involves a lot of moving parts.

- Another option I thought was to use a memcached or memcachedb server with python client as the basis for a persistent dictionary used by networkx as its data structure.

The graph is represented using adjacency matrices instead of direct references.

- Separate from or underneath memcached can be mongodb, which autoshards across a network and autoroutes. wordnic has implemented graphing functions over mongodb.

And mongodb scales horizontally.

It would be very interesting to see if any of these robust, auto-scaling mechanisms can underly networkx structures transparently such that

retrieval from persistence and caching are handled automatically. Yet avoids bottlenecked performance issues (both memcached and mongo scale well).

This would be a relatively easy implementation I think. All of the above have python bindings.

On Tuesday, March 20, 2012 11:22:13 AM UTC-4, A Hagberg wrote:

Tim Howard

unread,

Mar 20, 2012, 1:21:15 PM3/20/12

to networkx...@googlegroups.com

I've had great luck with SQLite and using it as storage for NetworkX.
Obviously, I currently write the entire graph structure to and from
SQLite and RAM and all the algorithms occur in RAM. I think it would
be great to have a persistent implementation as discussed below.

Tim

> --
> You received this message because you are subscribed to the Google Groups "networkx-discuss" group.
> To post to this group, send email to networkx...@googlegroups.com.
> To unsubscribe from this group, send email to networkx-discu...@googlegroups.com.
> For more options, visit this group at http://groups.google.com/group/networkx-discuss?hl=en.
>

project2501

unread,

Mar 20, 2012, 1:57:39 PM3/20/12

to networkx-discuss

Yeah, because for some scales of graph, they won't fit in RAM anymore
or on a single machine.
That's the size I'm pursuing code for.

The general logic behind the data structure would go something like
this:

1. When a node/edge or whatever is to be de-referenced by its id, the
local memory cache is consulted.
2. If it doesn't reside in memory cache, then its queried from offline
storage (say mongodb). Mongo will take care of
identifying the server it actually resides on.
3. That node/edge is pulled into the cache where its properties are
available.

So consider a depth first search algorithm. It would only need the
starting node in the cache, along with
its out and in vertex information (edges). From there, as networkx
consults its internal matrix, the de-referencing
process causes lookups to occur transparently. This can mean
consulting a large DHT or routing to other machines,
i.e. crawling.

networkx only sees its usual internal data structure.

Which classes should I look at to implement this? I'm not all that
knowledgable on networkx code.

On Mar 20, 1:21 pm, Tim Howard <tghow...@gmail.com> wrote:
> I've had great luck with SQLite and using it as storage for NetworkX.
> Obviously, I currently write the entire graph structure to and from
> SQLite and RAM and all the algorithms occur in RAM. I think it would
> be great to have a persistent implementation as discussed below.
>
> Tim
>
>
>
>
>
>
>

> On Tue, Mar 20, 2012 at 11:22 AM, Aric Hagberg <aric.hagb...@gmail.com> wrote:

Dan Schult

unread,

Mar 22, 2012, 9:57:49 AM3/22/12

to networkx...@googlegroups.com

On Mar 20, 2012, at 1:57 PM, project2501 wrote:
>
> Which classes should I look at to implement this? I'm not all that
> knowledgable on networkx code.

The first place to look would be the Graph class: networkx/classes/graph.py
Right now the API and the data storage are fairly interlaced for
improved performance. We are experimenting with partial separations
of the graph interface from the data structure, e.g. using a scipy
sparse matrix representation instead of a dict-of-dict-of-dict.

> The general logic behind the data structure would go something like
> this:
>
> 1. When a node/edge or whatever is to be de-referenced by its id, the
> local memory cache is consulted.
> 2. If it doesn't reside in memory cache, then its queried from offline
> storage (say mongodb). Mongo will take care of
> identifying the server it actually resides on.
> 3. That node/edge is pulled into the cache where its properties are
> available.

It sounds like you might want to replace the dicts used for adj lists
with a subclass of dict that overrides __getitem__ with a method that
checks the internal storage and if nothing is found goes to the database.
Something like a dict-of-DB_adj_dict-of-DB_egdedata_dict

Dan

tcb

unread,

Mar 22, 2012, 10:20:08 AM3/22/12

to networkx...@googlegroups.com

Hi,

Its perhaps worth separating the two ideas...

The idea of a persistent storage for networkx graphs is a good one, and has been brought up many times- for many people this will mean just writing a large graph into some format which is easy to store and transport to other machines or applications. SQLite would seem to be ideal for this job, and I'm sure several people have already written their own readers and writers for sqlite.

To change the backend of networkx from a dict of dicts to something else, which implements persistent storage and is scalable across multiple processors is another job altogether. It has been suggested to implement a native backend in for eg. boost or igraph or something, but it seems that you won't get much of a speedup doing that, unless you also rewrite the algorithms. Since the beauty of networkx is the python algorithms- you may as well then just use another package from the outset. I've been looking at neo4j (http://neo4j.org/) for very large graphs, and while it certainly scales very well, it is really missing good tools for network analysis. There are python bindings for neo4j (http://docs.neo4j.org/chunked/snapshot/python-embedded.html) which need Jpype (http://jpype.sourceforge.net/) since neo4j is in java- it would be an easy job to put together readers and writers which would load and store networkx graphs in neo4j... probably a few have already done this?

I think it would also be possible to use neo4j as a backend for networkx- neo4j has Nodes and Relationships which would map nicely to nodes and edges in networkx. The Nodes and Relationships have property maps which could store any attributes of nodes/edges. Neo4j would handle the memory caching itself, so you would just send an access request (modify __getitem__ in graph.py) for the nodes/edges and let neo4j consult its stores and figure out how to return the data to you. While this might work for small graphs- its not really how the scalable graph databases are designed to work. For example: every write operation has to be wrapped in a transaction- this will kill your performance if you do it for every individual write, but is manageable if you batch the transactions together. This will likely require most of the algorithms to be reworked. The advantages of neo4j probably come from its indexes- you would have to figure out how to use these efficiently in the networkx algorithms. Many algorithms in networkx start with a list [] of nodes- this is not going to work either and would likely have to be modified to use the Traversal framework (http://docs.neo4j.org/chunked/stable/tutorial-traversal-concepts.html). You might also look at Gremlin (https://github.com/tinkerpop/gremlin/wiki) to get an idea for an efficient way of doing network analysis on distributed graphs.

It would be nice to see how far you can get with a neo4j (or similar backend). Modifying the graph class in networkx to use a different backend should be straightforward enough, but ensuring all (or most) the algorithms still work, and work efficiently will be the hard part.

regards,

project2501

unread,

Mar 25, 2012, 9:27:43 AM3/25/12

to networkx-discuss

In fact, I came from the whole neo4j,orientdb graph realm before
considering this idea.

Couple observations:

1. They don't support native algorithms, so I would have to write my
own or patch a set of them over the database.
So that would be starting with the database and finding/writing the
algorithms (which I'm less adept at).
2. Any of those databases could be used as the long-term storage
backend for what I'm describing above.
Some study of mapping algorithm code to efficient database queries
would need to be done (outside my scope).

So, the inverse of that approach is starting with the algorithm
package and layering the persistence underneath.
networkx not only has algorithms I need, but its python, and also has
export to javascript and SVG. two critical pieces for me.

Since I'm fairly decent with persistence, it _seems_ easier for me to
add that piece into networkx.

Of course, it won't perform the same as memory, but only on first
calls as I intend to cache the data in local memory,
backed by memcache (which is distributed and fast), backed by mongo
(long-term storage). If I implement the Graph.py
class correctly, networkx algorithms will solve the problem of what
queries to issue, in what order and when to solve
graph theoretic problems over the persistence layers. Both mongodb and
memcache scale horizontally and perform
exceptionally well. So even though it won't compare to memory, when
you're operating on graphs of enormous size
in "ad hoc" ways. Which is to say, user-driven request/response modes.
Its not possible to have ALL the graph
data in memory waiting for them.

Having said all that, I think a good first step (for study or general
use) is to see how this approach works or not.

:)

Moritz Emanuel Beber

unread,

Mar 25, 2012, 5:10:06 PM3/25/12

to networkx...@googlegroups.com

Hi,

On 03/25/2012 03:27 PM, project2501 wrote:
> In fact, I came from the whole neo4j,orientdb graph realm before
> considering this idea.
>
> Couple observations:
>
> 1. They don't support native algorithms, so I would have to write my
> own or patch a set of them over the database.
> So that would be starting with the database and finding/writing the
> algorithms (which I'm less adept at).
> 2. Any of those databases could be used as the long-term storage
> backend for what I'm describing above.
> Some study of mapping algorithm code to efficient database queries
> would need to be done (outside my scope).
>
> So, the inverse of that approach is starting with the algorithm
> package and layering the persistence underneath.
> networkx not only has algorithms I need, but its python, and also has
> export to javascript and SVG. two critical pieces for me.
>
> Since I'm fairly decent with persistence, it _seems_ easier for me to
> add that piece into networkx.

I just want to highlight for you that, as Aric mentioned before, many of
the algorithms access and modify the dictionary structure of the Graph
class directly. This is done for performance gains but means that you
cannot simply define a class that exposes the same interface but you
actually need to mimic some of the internals as well.

To see where such access of internals occurs would actually be
interesting, imho, so that possibly those code sections could be altered
to use "public" methods and attributes only.

Good luck,
Moritz

project2501

unread,

Mar 25, 2012, 6:09:55 PM3/25/12

to networkx-discuss

Thanks. Yeah, I understand the issue there. I will study this some and
see if there's an efficient way to loosely couple
the internal representation.

Just offhand, it would seem that if the Graph class deferred lookups
to another class using a notional syntax for
storing and retrieving adjacent or referential sets of dicts from the
other class, then it could be implemented various ways
with only the additive cost of a method call.

On Mar 25, 5:10 pm, Moritz Emanuel Beber <moritz.be...@googlemail.com>
wrote:

Pengkui Luo

unread,

Mar 25, 2012, 8:22:25 PM3/25/12

to networkx...@googlegroups.com

Hi Aric,

I came across a library called CloudLight which extends NetworkX to support in-disk graphs and uses SQLite3 for persistence.

http://www.assembla.com/spaces/cloudlight/wiki/Tutorial

I am not sure about the quality of this "wheel" though. What is your opinion on it?

~Pengkui

--
You received this message because you are subscribed to the Google Groups "networkx-discuss" group.

tcb

unread,

Mar 26, 2012, 9:57:31 AM3/26/12

to networkx...@googlegroups.com

On Sun, Mar 25, 2012 at 2:27 PM, project2501 <darre...@gmail.com> wrote:

In fact, I came from the whole neo4j,orientdb graph realm before
considering this idea.

Couple observations:

1. They don't support native algorithms, so I would have to write my
own or patch a set of them over the database.
So that would be starting with the database and finding/writing the
algorithms (which I'm less adept at).
2. Any of those databases could be used as the long-term storage
backend for what I'm describing above.
Some study of mapping algorithm code to efficient database queries
would need to be done (outside my scope).

That's about right.

You could also look at the previously mentioned cloudlight to see how a persistent backend is done with SQLite.

So, the inverse of that approach is starting with the algorithm
package and layering the persistence underneath.
networkx not only has algorithms I need, but its python, and also has
export to javascript and SVG. two critical pieces for me.

Since I'm fairly decent with persistence, it _seems_ easier for me to
add that piece into networkx.

Of course, it won't perform the same as memory, but only on first
calls as I intend to cache the data in local memory,
backed by memcache (which is distributed and fast), backed by mongo
(long-term storage). If I implement the Graph.py
class correctly, networkx algorithms will solve the problem of what
queries to issue, in what order and when to solve
graph theoretic problems over the persistence layers. Both mongodb and
memcache scale horizontally and perform
exceptionally well. So even though it won't compare to memory, when
you're operating on graphs of enormous size
in "ad hoc" ways. Which is to say, user-driven request/response modes.
Its not possible to have ALL the graph
data in memory waiting for them.

Sure, but this is really the problem that the graph databases like neo4j are trying to solve. If you can pull all your data from storage into memory, then you have no problems running networkx algorithms directly. But the graph databases are designed for those situations where you have more data than you can fit in memory on a single machine. This is much more of a problem for graph algorithms because of the non locality of the data. For example, finding neighbours of a node might mean pulling data from several different machines- there is usually no simple way of partitioning your graph nicely among N processors so that each processor has just the nodes and edges it needs.

Perhaps as first step, and to see how it might pan out, you could ignore those issues and just implement the persistent storage. If the networkx algorithms manipulate the Graph data through a standard api then they should 'just work', but they will be dreadfully slow.

The next step then, would to take advantage of the ability to work with large datasets, and perhaps to execute some algorithms in parallel using the native abilities of the backend. A parallel search algorithm implemented on a scalable graph database backend would be a tremendously useful addition to networkx.

project2501

unread,

Apr 3, 2012, 9:12:57 AM4/3/12

to networkx...@googlegroups.com

Yeah, I get that graph databases are optimized for graph storage and retrieval.

I would certainly use one in an implementation underneath networkx. The problem is,

as I mention, that the graph databases do not come with the algorithms networkx comes with.

Also, reading the required nodes into memory then doing networkx algorithms presents a problem.

Which nodes to read into memory. And also, perhaps not all the required ones fit into memory.

Replacing the networkx in-memory data model with an interface to a memory or persistent

model will solve both of these problems.

As with most engineering problems, there are tradeoffs. Using networkx to navigate a persistent

graph database would not be 'dreadfully' slow in my opinion for many common use cases. Slow

is relative. If done using a forward cache in memory or DHT type system, it would perform

acceptably given the power it brings - scale - and access to much more information.

> > > > To post to this group, send email to networkx-discuss@googlegroups.com

> > .
> > > > To unsubscribe from this group, send email to

> > networkx-discuss+unsubscribe@googlegroups.com.

> > > > For more options, visit this group athttp://
> > groups.google.com/group/networkx-discuss?hl=en.
>
> > --
> > You received this message because you are subscribed to the Google Groups
> > "networkx-discuss" group.

> > To post to this group, send email to networkx-discuss@googlegroups.com.

> > To unsubscribe from this group, send email to

> > networkx-discuss+unsubscribe@googlegroups.com.

> > For more options, visit this group at
> >http://groups.google.com/group/networkx-discuss?hl=en.

--
You received this message because you are subscribed to the Google Groups "networkx-discuss" group.

To post to this group, send email to networkx-discuss@googlegroups.com.
To unsubscribe from this group, send email to networkx-discuss+unsubscribe@googlegroups.com.

project2501

unread,

Apr 3, 2012, 9:34:27 AM4/3/12

to networkx...@googlegroups.com

For starters, inside Graph, instead of doing something like neighbors[a][b][c],

make it a method call.

neighbors(a,b,c,...)

Let the default implementation use a dict of dicts inside of neighbors() if it wants.

A graphdb implementation will construct the appropriate joined query and/or consult its immediate cache.

This should be done. Vastly useful in today's big data world.

tcb

unread,

Apr 3, 2012, 12:19:08 PM4/3/12

to networkx...@googlegroups.com

On Tue, Apr 3, 2012 at 2:12 PM, project2501 <darre...@gmail.com> wrote:

Yeah, I get that graph databases are optimized for graph storage and retrieval.

I would certainly use one in an implementation underneath networkx. The problem is,
as I mention, that the graph databases do not come with the algorithms networkx comes with.

Also, reading the required nodes into memory then doing networkx algorithms presents a problem.
Which nodes to read into memory. And also, perhaps not all the required ones fit into memory.

Replacing the networkx in-memory data model with an interface to a memory or persistent
model will solve both of these problems.

hmm, I'm not so sure. I wonder if you are underestimating the issues (or maybe I'm over estimating them?).

The reason graph databases don't come with the kind of algorithms networkx has is because they just don't work at that kind of scale. Replacing the storage model does not solve the in-memory data problem from the point of view of the algorithm. What the graph db's can do with reasonable efficiency is graph traversals. So any algorithm in networkx which relies on a traversal (and there are quite a few) could (possibly) be made to use the scalable graph traversal, and could (possibly) work reasonably well.

An example of the problems I would envisage are (at random) from core.py:core_number(G) - the first line in this algorithm does this:

degrees=G.degree()

that's a list of degrees of all nodes, and it makes another similar list for the cores which it proceeds to modify. If you have a large enough graph which requires distributed memory this just wont work (you just can't make a single list with size of number_of_nodes). You could try to make a new property on the node and update that property through the graph db backend- but in many cases to avoid this issue will involve substantial reworking of the algorithms.

If you can fit all your data in memory, it seems pointless to use a persistent storage backend- it will never compete with networkx's own Graph (neither in memory usage or speed). In this case, just use networkx directly, run the standard algorithms and at the end use the readwrite functions to write your graph back to persistent storage. So networkx should have the ability to read and write graphs from persistent storage, but it uses its own data model and algorithms.

If your graphs are too big for a single machine, then you need the distributed graph databases (neo4j as an example). These can do distributed storage and distributed graph traversal. But distributed computation is another field altogether and networkx's algorithms are just not designed to work for those problems. Maybe you can make some of the algorithms work acceptably well- I'd love to be proved wrong... ;)

regards

To view this discussion on the web visit https://groups.google.com/d/msg/networkx-discuss/-/99zKcKogLXcJ.

To post to this group, send email to networkx...@googlegroups.com.
To unsubscribe from this group, send email to networkx-discu...@googlegroups.com.

project2501

unread,

Apr 3, 2012, 6:13:39 PM4/3/12

to networkx...@googlegroups.com

tcb,

You make some valid points. However, I think this space of graphs, graphdbs, algorithms and the like is accelerating quickly.

It has evolved quite a bite from just a few years ago. So my request to the networkx team is to simply put a data access layer in the Graph class

to invite innovation in. Bundle with a default MemoryDictOfDicts implementation, but make it easy for people in the graph world to push the envelope.

Now,, I don't know the ultimate performance metrics, but I'm hoping to find out what the tradeoffs are because, for me at least, its certain

that A) the nodes won't fit all into memory, and B) this capability prevails over the performance for now. In other words, my customers would rather

have the ability to perform insightful algorithms over vast quantities of distributed data at whatever performance level can be achieved, than

to have nothing.

So let's open the doors on this! It seems a simple modification to Graph class to add data access object or pattern (for one of the authors that is). From there, I expect

we all will be surprised at what the community does after that!

Aric Hagberg

unread,

Apr 3, 2012, 10:14:50 PM4/3/12

to networkx...@googlegroups.com

On Tue, Apr 3, 2012 at 4:13 PM, project2501 <darre...@gmail.com> wrote:
> tcb,
> You make some valid points. However, I think this space of graphs,
> graphdbs, algorithms and the like is accelerating quickly.
> It has evolved quite a bite from just a few years ago. So my request to the
> networkx team is to simply put a data access layer in the Graph class
> to invite innovation in. Bundle with a default MemoryDictOfDicts
> implementation, but make it easy for people in the graph world to push the
> envelope.
>
> Now,, I don't know the ultimate performance metrics, but I'm hoping to find
> out what the tradeoffs are because, for me at least, its certain
> that A) the nodes won't fit all into memory, and B) this capability prevails
> over the performance for now. In other words, my customers would rather
> have the ability to perform insightful algorithms over vast quantities of
> distributed data at whatever performance level can be achieved, than
> to have nothing.
>
> So let's open the doors on this! It seems a simple modification to Graph
> class to add data access object or pattern (for one of the authors that is).

Do you have some suggestions on how this might be implemented? There
are lots of engineering trade-offs that will come into play but if
there was a clear design strategy that would help figure those out.

Aric

project2501

unread,

Apr 4, 2012, 7:50:34 AM4/4/12

to networkx...@googlegroups.com

I need to study/trace the algorithm execution some more and will have some ideas after that.

But generally speaking, it would be something like this. Rather than loading all the nodes

of a graph into memory before the algorithm runs, let the algorithm request the nodes from

the underlying graph implementation that results in the graph implementation providing them.

The default implementation may very well pre-load them into memory, but other implementations

might fetch them for persistence if necessary - resulting in more of a crawling effect.

For example, to initiate a breadth first type search, I need a starting node only, not all nodes.

Then, as the algorithm requests edges, adjacent nodes etc. from the graph, the graph retrieves

them if they are not already loaded into memory.

If the algorithms operate on a list of lists residing in memory already, then this will require

more work to separate and delegate to the Graph class.

tcb

unread,

Apr 4, 2012, 8:52:12 AM4/4/12

to networkx...@googlegroups.com

On Wed, Apr 4, 2012 at 12:50 PM, project2501 <darre...@gmail.com> wrote:

I need to study/trace the algorithm execution some more and will have some ideas after that.

But generally speaking, it would be something like this. Rather than loading all the nodes
of a graph into memory before the algorithm runs, let the algorithm request the nodes from
the underlying graph implementation that results in the graph implementation providing them.
The default implementation may very well pre-load them into memory, but other implementations

might fetch them for persistence if necessary - resulting in more of a crawling effect.

For example, to initiate a breadth first type search, I need a starting node only, not all nodes.

Then, as the algorithm requests edges, adjacent nodes etc. from the graph, the graph retrieves
them if they are not already loaded into memory.

that's a good place to start- most of the persistent graph databases provide efficient traversal routines, and it should be possible to swap out the networkx traversals and replace them with the graph db traversals. There might still be some modifications to the algorithms, but hopefully nothing too serious.

You might be interested to compare with the approach taken by boost bgl to implement their library on top of mpi:

http://osl.iu.edu/research/pbgl/documentation/overview.html

They manage to provide a fairly transparent interface to a distributed graph storage backend, but the algorithms still require fairly major surgery (although their focus is to make the algorithms run efficiently). Using an existing distributed graph db (like neo4j) means most of the backend stuff is done for you. Another key piece of the parallel bgl is a distributed map/list object. You could possibly design an object with a similar interface to a list/dict which uses the underlying distributed graph db for storage. There would then be minimal changes to the algorithms which would use the distributed lists/dicts when running in 'distributed' mode.

You might also look at other existing approaches to graph processing at scale, neatly summarised here

http://stackoverflow.com/questions/2986010/any-open-source-pregel-like-framework-for-distributed-processing-of-large-graphs

regards

--

You received this message because you are subscribed to the Google Groups "networkx-discuss" group.

To view this discussion on the web visit https://groups.google.com/d/msg/networkx-discuss/-/gE_uY--rcRIJ.

To post to this group, send email to networkx...@googlegroups.com.
To unsubscribe from this group, send email to networkx-discu...@googlegroups.com.

project2501

unread,

Apr 4, 2012, 3:38:41 PM4/4/12

to networkx...@googlegroups.com

tcb,

Thanks for the links. However, the one about boost bgl seems dated and the page looks unsupported (images don't event load), but I will dig deeper.

I'm aware of Pregel, but its proprietary I believe. I use Neo4j already and would probably pursue that as the engine, but would appeal to networkx

engineers to devise a good way to separate data access from the algorithms making the coupling easier.

Perhaps down the road for networkx something as this emerges...

regards

To post to this group, send email to networkx-discuss@googlegroups.com.
To unsubscribe from this group, send email to networkx-discuss+unsubscribe@googlegroups.com.

Aric Hagberg

unread,

Apr 4, 2012, 5:44:25 PM4/4/12

to networkx...@googlegroups.com

On Wed, Apr 4, 2012 at 1:38 PM, project2501 <darre...@gmail.com> wrote:
> tcb,

> Thanks for the links. However, the one about boost bgl seems dated and the
> page looks unsupported (images don't event load), but I will dig deeper.
>
> I'm aware of Pregel, but its proprietary I believe. I use Neo4j already and
> would probably pursue that as the engine, but would appeal to networkx
> engineers to devise a good way to separate data access from the algorithms
> making the coupling easier.

Can you be more specific about what would be required "to separate
data access from the algorithms".
We do have some separation and a (almost complete?) functional
interface to get the data with object methods.
In some places (algorithms) we access the data using Python built-in
methods on the data dictionary because that is much faster.

Unfortunately the Neo4j licenses are not compatible with NetworkX's
BSD. So we can't create a dependency on that package and keep the
BSD. Are there other graph databases with BSD-compatible licenses?

Aric

project2501

unread,

Apr 5, 2012, 6:28:44 PM4/5/12

to networkx...@googlegroups.com

Aric,

The simple answer to this is to force the algorithms to use the Graph class methods exclusively.

This would allow different Graph implementations to be used easily. Yes, accessing a native

python dictionary with all the graph nodes will perform well, but of course, not scale. This is one tradeoff

to consider.

However, it would be possible for a Graph implementation to return any kind of native dictionary with

its node data. I thought also, this might work with a custom dictionary implementation but I tend to prefer

doing this kind of abstraction at the Class/method level.

Yet another approach would be to associate the algorithms with a Graph class such that a Graph object

can return instances of the algorithms. In this way, an algorithm implementation can be returned from

a Graph object and both implementations can be encapsulated so they perform best for that implementation.

This would allow a native python object/list implementation of a Graph to return algorithm instances that

expect those data structures and performance would be the same as now.

For a Graph implementation that uses something like Neo4j, it can return an algorithm implementation

that takes advantage of its implementation.

As for Neo4j, networkx need not bind to any of its code because the server exposes REST/HTTP interfaces.

Other graphdb do this as well. There is a native/embedded python wrapper for Neo4j.

As for license incompatibilities, its a good point. But if a class wrapped neo4j and was interface compatible

with networkx, but didn't load networkx, then an end-user could load that class and tell networkx to use it

as a Graph implementation. Thus keeping things separate in the libraries.

Other graph databases include orientdb, infinitegraph, dex. I've only used neo4j and orientdb.

Aric Hagberg

unread,

Apr 7, 2012, 12:57:39 PM4/7/12

to networkx...@googlegroups.com

On Thu, Apr 5, 2012 at 4:28 PM, project2501 <darre...@gmail.com> wrote:
> Aric,
> The simple answer to this is to force the algorithms to use the Graph
> class methods exclusively.
> This would allow different Graph implementations to be used easily. Yes,
> accessing a native
> python dictionary with all the graph nodes will perform well, but of course,
> not scale. This is one tradeoff
> to consider.
>
> However, it would be possible for a Graph implementation to return any kind
> of native dictionary with
> its node data. I thought also, this might work with a custom dictionary
> implementation but I tend to prefer
> doing this kind of abstraction at the Class/method level.

At this point its arguable whether the Python dict methods have become
the de facto public interface. I don't think we've ever been very
clear on that one way or another, and we have exposed the dictionary
methods in some places to get (significant) performance gains.

So certainly one approach is to provide a dictionary-like alternative
data structure. I have a SciPy sparse graph implementation (not ready
for release) that uses that technique.

> Yet another approach would be to associate the algorithms with a Graph class
> such that a Graph object
> can return instances of the algorithms. In this way, an algorithm
> implementation can be returned from
> a Graph object and both implementations can be encapsulated so they perform
> best for that implementation.
> This would allow a native python object/list implementation of a Graph to
> return algorithm instances that
> expect those data structures and performance would be the same as now.
>
> For a Graph implementation that uses something like Neo4j, it can return an
> algorithm implementation
> that takes advantage of its implementation.

I've generally been resisting that approach (e.g. adding more
algorithms as graph class methods) since there are so many varied uses
of NetworkX. Some people only want the base classes and a few
algorithms (or even none), etc.

But perhaps, considering distributed memory or database backends,
there is a subset of core algorithms that most other algorithms can
use and that provide high performance access, traversal, or other
functions? If such a set existed then I can see including those as
base class methods.

> As for Neo4j, networkx need not bind to any of its code because the server
> exposes REST/HTTP interfaces.
> Other graphdb do this as well. There is a native/embedded python wrapper for
> Neo4j.
>
> As for license incompatibilities, its a good point. But if a class wrapped
> neo4j and was interface compatible
> with networkx, but didn't load networkx, then an end-user could load that
> class and tell networkx to use it
> as a Graph implementation. Thus keeping things separate in the libraries.

Yes, that would work. But it wouldn't be part of NetworkX.

> Other graph databases include orientdb, infinitegraph, dex. I've only used
> neo4j and orientdb.

Thanks, I'll take a look. Anyone else have suggestions?

Aric

Dan Schult

unread,

Apr 7, 2012, 2:07:49 PM4/7/12

to networkx...@googlegroups.com

Some algorithms don't need much from the Graph class.

Shortest_path uses G[n] or G.successors/G.predecessors
(bidirectional). I count G[n] being used for iterating over neighbors
as "using the Graph class methods".

Algorithms that use the syntax G[u].items() or G[u][v] are counting
on G[n] returning an adjacency dict of edge attribute dicts. That can
be a performance burden on data structures that don't use dicts.

Perhaps we could adjust the algorithms to use a new (and fast)
method __call__ like: G(u, attr=None) to iterate over 2-tuples
(nbr, attr_dict) if attr is None; or (nbr, attr_value) otherwise.
This reduces time creating dicts for data strorage that doesn't use
dicts.
If an algorithm need dicts, it can create one with dict(G
(u,attr=weight).

I think the change in the algorithms would be something like:

>>> for nbr,wt in G(n,attr=weight):
>>> ....

Instead of

>>> for nbr,edata in G[n].items():
>>> wt=edata.get(weight,1)
>>> ....

To directly replace G[u][v][weight] the algorithms would need G(u)[v]
[weight]

I guess what I'm trying to say is that your idea to "use the Graph
class methods
exclusively" doesn't mean we have to give up the speed of G[u] or G[u]
[v] for
our base classes. But there may be some corner cases I'm not taking
into account.

Thoughts?
Dan

Aric Hagberg

unread,

Apr 8, 2012, 11:09:59 AM4/8/12

to networkx...@googlegroups.com

That could work.

But I think the missing piece is the design document that that
describes what the (Python language) public interface is for the Graph
class(es) and data objects. We have discussed aspects of this many
times but right now the only documentation is the code.

If a clear specification what is holding us back from engineering
something that can use alternative/distributed data representations
then maybe it is time to address that. (But in a different
thread....).

Aric

project2501

unread,

Apr 10, 2012, 4:10:19 PM4/10/12

to networkx...@googlegroups.com

In my suggestion about the Graph class returning an implementation of algorithms,

I didn't mean to say that the algorithm methods would be on the Graph class, but rather,

a Graph implementation can act as a "factory" for an Algorithms implementation, where Algorithms is perhaps

a class of its own that hosts all the suitable methods in a "family" of implementations that, under

the covers, would agree with the Graph's implementation. It can look and behave the same

as now, but the Algorithms class and the Graph class would be coded in agreement behind the scenes.

For example, Neo4j comes with some built in algorithms. Only a handful. If I coded a Graph class

implementation for neo4j in networkx, I would also code a Algorithms implementation (returned by the Graph implementation),

that caters to the optimizations built into neo4j. On the outside, perhaps all the current algorithm methods

would be subsumed into an Algorithms class whereby only the ones supported by the Graph implementation

would actually be provided. So instead of package level algorithm methods, a concrete class under and interface of sorts

(since interfaces don't really exist in python).

Aric Hagberg

unread,

Dec 13, 2012, 9:19:30 AM12/13/12

to networkx...@googlegroups.com

On Wed, Dec 12, 2012 at 7:43 AM, Jon Ramvi <jonb...@gmail.com> wrote:
> So how did this end - any networkx implementation for app engine ready? :)

I don't think it ever started :-)
But there are certainly people still interested if you want to try
developing something.
Aric

Reply all

Reply to author

Forward