It's definitely possible, and we have discussed it before.
There is a simple implementation using Python's shelve at
https://networkx.lanl.gov/trac/ticket/224
though it hasn't been updated recently.
More generally, it would make sense to use one of the graph databases.
I've asked the community to suggest an appropriate one, but so far
nobody has replied. I am willing to consider re-engineering parts of the
NetworkX core to make it work with a persistent database solution if
that is necessary.
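To make the shelve idea concrete, here is a minimal sketch of persisting adjacency data in a shelf on disk instead of holding it in RAM. This is illustrative only, not the code from the ticket linked above; the `"adj:"` key scheme is an assumption.

```python
# Hypothetical sketch: store each node's adjacency dict in a shelve
# database so the graph survives between processes. Not the ticket's
# actual implementation; the key naming scheme is invented here.

import os
import shelve
import tempfile

path = os.path.join(tempfile.mkdtemp(), "graph.db")

# Write the graph's adjacency structure to disk.
with shelve.open(path) as shelf:
    shelf["adj:A"] = {"B": {"weight": 2}}   # adjacency dict for node A
    shelf["adj:B"] = {"A": {"weight": 2}}   # adjacency dict for node B

# A later process can reopen the shelf and read the graph back.
with shelve.open(path) as shelf:
    neighbors_of_a = shelf["adj:A"]
```

The trade-off is exactly the one discussed in this thread: every lookup is a disk access unless a caching layer sits in front of the shelf.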
Aric
Tim
> --
> You received this message because you are subscribed to the Google Groups "networkx-discuss" group.
> To post to this group, send email to networkx...@googlegroups.com.
> To unsubscribe from this group, send email to networkx-discu...@googlegroups.com.
> For more options, visit this group at http://groups.google.com/group/networkx-discuss?hl=en.
>
> The general logic behind the data structure would go something like
> this:
>
> 1. When a node/edge or whatever is to be dereferenced by its id, the
> local memory cache is consulted.
> 2. If it doesn't reside in the memory cache, then it's queried from offline
> storage (say MongoDB). Mongo will take care of
> identifying the server it actually resides on.
> 3. That node/edge is pulled into the cache, where its properties are
> available.
It sounds like you might want to replace the dicts used for adjacency lists
with a subclass of dict that overrides __getitem__ with a method that
checks the internal storage and, if nothing is found, goes to the database.
Something like a dict-of-DB_adj_dict-of-DB_edgedata_dict.
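A minimal sketch of that dict subclass, assuming a hypothetical `backend` object with a `fetch(key)` method (the backend interface is invented here for illustration):

```python
# Sketch of the suggestion above: a dict subclass whose lookups fall
# back to a backing store on a cache miss. `InMemoryBackend` is a
# stand-in for a real database client.

class DBAdjDict(dict):
    """Adjacency dict that lazily pulls missing entries from a backend."""

    def __init__(self, backend, *args, **kwargs):
        super().__init__(*args, **kwargs)
        self.backend = backend

    def __missing__(self, key):
        # dict.__getitem__ calls this only when `key` is not in memory.
        value = self.backend.fetch(key)
        self[key] = value  # cache locally for later lookups
        return value

class InMemoryBackend:
    """Stand-in backend; a real one would query MongoDB, Neo4j, etc."""

    def __init__(self, data):
        self.data = data

    def fetch(self, key):
        return self.data[key]

backend = InMemoryBackend({"A": {"B": {"weight": 2}}})
adj = DBAdjDict(backend)
row = adj["A"]   # cache miss: fetched from the backend, then cached
```

Overriding `__missing__` rather than `__getitem__` keeps the fast path (keys already in memory) at native dict speed, which matters given how heavily the algorithms hit these dicts.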
Dan
On 03/25/2012 03:27 PM, project2501 wrote:
> In fact, I came from the whole Neo4j/OrientDB graph realm before
> considering this idea.
>
> A couple of observations:
>
> 1. They don't support native algorithms, so I would have to write my
> own or patch a set of them over the database.
> So that would be starting with the database and finding/writing the
> algorithms (which I'm less adept at).
> 2. Any of those databases could be used as the long-term storage
> backend for what I'm describing above.
> Some study of mapping algorithm code to efficient database queries
> would need to be done (outside my scope).
>
> So, the inverse of that approach is starting with the algorithm
> package and layering the persistence underneath.
> NetworkX not only has the algorithms I need, but it's Python, and it also has
> export to JavaScript and SVG: two critical pieces for me.
>
> Since I'm fairly decent with persistence, it _seems_ easier for me to
> add that piece into NetworkX.
I just want to highlight that, as Aric mentioned before, many of
the algorithms access and modify the dictionary structure of the Graph
class directly. This is done for performance gains, but it means that you
cannot simply define a class that exposes the same interface: you
actually need to mimic some of the internals as well.
It would actually be interesting, imho, to see where such access of
internals occurs, so that possibly those code sections could be altered
to use "public" methods and attributes only.
Good luck,
Moritz
In fact, I came from the whole Neo4j/OrientDB graph realm before
considering this idea.
A couple of observations:
1. They don't support native algorithms, so I would have to write my
own or patch a set of them over the database.
So that would be starting with the database and finding/writing the
algorithms (which I'm less adept at).
2. Any of those databases could be used as the long-term storage
backend for what I'm describing above.
Some study of mapping algorithm code to efficient database queries
would need to be done (outside my scope).
So, the inverse of that approach is starting with the algorithm
package and layering the persistence underneath.
NetworkX not only has the algorithms I need, but it's Python, and it also has
export to JavaScript and SVG: two critical pieces for me.
Since I'm fairly decent with persistence, it _seems_ easier for me to
add that piece into NetworkX.
Of course, it won't perform the same as in-memory, but only on first
calls, since I intend to cache the data in local memory,
backed by memcache (which is distributed and fast), backed by MongoDB
(long-term storage). If I implement the Graph
class correctly, NetworkX algorithms will solve the problem of what
queries to issue, in what order, and when, to solve
graph-theoretic problems over the persistence layers. Both MongoDB and
memcache scale horizontally and perform
exceptionally well. So even though it won't compare to pure in-memory
performance, that matters less when you're operating on graphs of
enormous size in "ad hoc," user-driven request/response modes: it's not
possible to hold ALL the graph data in memory waiting for those requests.
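The tiered lookup described above (local memory, then memcache, then MongoDB) could be sketched roughly as follows. The `memcache_get`/`memcache_set`/`mongo_get` callables are stand-ins for real memcache and MongoDB client calls, invented here for illustration:

```python
# Hypothetical sketch of the three-tier lookup: in-process dict first,
# then a distributed cache, then long-term storage. Plain callables
# stand in for real memcache/MongoDB clients.

class TieredNodeStore:
    def __init__(self, memcache_get, memcache_set, mongo_get):
        self._local = {}            # tier 1: local process memory
        self._mc_get = memcache_get  # tier 2: distributed cache
        self._mc_set = memcache_set
        self._db_get = mongo_get     # tier 3: long-term storage

    def get(self, node_id):
        if node_id in self._local:           # 1. local memory hit
            return self._local[node_id]
        data = self._mc_get(node_id)         # 2. distributed cache
        if data is None:
            data = self._db_get(node_id)     # 3. long-term storage
            self._mc_set(node_id, data)      # promote into memcache
        self._local[node_id] = data          # promote into local memory
        return data

# Usage with dict-backed stand-ins for the two remote tiers:
mc = {}
db = {"n1": {"label": "x"}}
store = TieredNodeStore(mc.get, mc.__setitem__, db.get)
first = store.get("n1")   # falls through to "MongoDB", fills both caches
```

Only the first access to a node pays the storage round-trip; repeat accesses are served from the local dict, which is the property the message above relies on.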
Yeah, I get that graph databases are optimized for graph storage and
retrieval. I would certainly use one in an implementation underneath
NetworkX. The problem, as I mentioned, is that the graph databases do not
come with the algorithms NetworkX comes with. Also, reading the required
nodes into memory and then running NetworkX algorithms presents two
problems: which nodes to read into memory, and the possibility that not
all the required ones fit into memory. Replacing the NetworkX in-memory
data model with an interface to an in-memory or persistent model would
solve both of these problems.
degrees = G.degree()
Do you have some suggestions on how this might be implemented? There
are lots of engineering trade-offs that will come into play, but a
clear design strategy would help figure those out.
Aric
I need to study/trace the algorithm execution some more and will have
some ideas after that. But generally speaking, it would be something
like this. Rather than loading all the nodes of a graph into memory
before the algorithm runs, let the algorithm request the nodes from the
underlying graph implementation, which then provides them. The default
implementation may very well pre-load them into memory, but other
implementations might fetch them from persistence if necessary,
resulting in more of a crawling effect. For example, to initiate a
breadth-first type search, I need a starting node only, not all nodes.
Then, as the algorithm requests edges, adjacent nodes, etc. from the
graph, the graph retrieves them if they are not already loaded into
memory.
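The "crawling" breadth-first idea above can be sketched as a traversal that only ever asks the graph for the neighbors of nodes it actually reaches. The `neighbors_of` callable is an assumed interface: it could be backed by an in-memory dict or by a lazy persistent store.

```python
# Sketch of an on-demand BFS: adjacency data is requested node by node,
# so a persistent backend can fetch lazily instead of pre-loading the
# whole graph. `neighbors_of` is an assumed callable, not NetworkX API.

from collections import deque

def lazy_bfs(start, neighbors_of):
    """Breadth-first traversal that pulls adjacency data on demand."""
    seen = {start}
    order = []
    queue = deque([start])
    while queue:
        node = queue.popleft()
        order.append(node)
        for nbr in neighbors_of(node):  # backend fetch happens here
            if nbr not in seen:
                seen.add(nbr)
                queue.append(nbr)
    return order
```

Notice the algorithm never enumerates the full node set: only the start node and the adjacency of visited nodes are touched, which is exactly what makes a database-backed `neighbors_of` workable.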
regards
Can you be more specific about what would be required "to separate
data access from the algorithms"?
We do have some separation and an (almost complete?) functional
interface to get the data with object methods.
In some places (algorithms) we access the data using Python built-in
methods on the data dictionary because that is much faster.
Unfortunately the Neo4j licenses are not compatible with NetworkX's
BSD. So we can't create a dependency on that package and keep the
BSD. Are there other graph databases with BSD-compatible licenses?
Aric
At this point it's arguable whether the Python dict methods have become
the de facto public interface. I don't think we've ever been very
clear on that one way or the other, and we have exposed the dictionary
methods in some places to get (significant) performance gains.
So certainly one approach is to provide a dictionary-like alternative
data structure. I have a SciPy sparse graph implementation (not ready
for release) that uses that technique.
> Yet another approach would be to associate the algorithms with a Graph class
> such that a Graph object
> can return instances of the algorithms. In this way, an algorithm
> implementation can be returned from
> a Graph object and both implementations can be encapsulated so they perform
> best for that implementation.
> This would allow a native Python object/list implementation of a Graph to
> return algorithm instances that
> expect those data structures, and performance would be the same as now.
>
> For a Graph implementation that uses something like Neo4j, it can return an
> algorithm implementation
> that takes advantage of its implementation.
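The quoted proposal amounts to a strategy pattern: each Graph implementation hands back an algorithm object tuned to its own internals. A minimal sketch, with invented names (`DictGraph`, `shortest_path_impl`) that are not NetworkX API:

```python
# Hypothetical sketch of the proposal above: the graph object supplies
# an algorithm implementation that exploits its own data structures.
# A Neo4j-backed class could return a version issuing Cypher/REST
# queries instead.

from collections import deque

class DictGraph:
    """Illustrative in-memory graph that vends its own algorithms."""

    def __init__(self):
        self.adj = {}

    def add_edge(self, u, v):
        self.adj.setdefault(u, set()).add(v)
        self.adj.setdefault(v, set()).add(u)

    def shortest_path_impl(self):
        # Return a shortest-path routine tuned to this class's internals.
        def bfs_path(source, target):
            prev = {source: None}
            queue = deque([source])
            while queue:
                node = queue.popleft()
                if node == target:
                    path = []
                    while node is not None:
                        path.append(node)
                        node = prev[node]
                    return path[::-1]
                for nbr in self.adj.get(node, ()):
                    if nbr not in prev:
                        prev[nbr] = node
                        queue.append(nbr)
            return None  # target unreachable
        return bfs_path
```

Callers would write `sp = G.shortest_path_impl(); sp(u, v)`, staying ignorant of whether the work happens against dicts in memory or a remote database.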
I've generally been resisting that approach (e.g. adding more
algorithms as graph class methods) since there are so many varied uses
of NetworkX. Some people only want the base classes and a few
algorithms (or even none), etc.
But perhaps, considering distributed memory or database backends,
there is a subset of core algorithms that most other algorithms can
use and that provide high performance access, traversal, or other
functions? If such a set existed, then I could see including those as
base class methods.
> As for Neo4j, networkx need not bind to any of its code because the server
> exposes REST/HTTP interfaces.
> Other graphdb do this as well. There is a native/embedded python wrapper for
> Neo4j.
>
> As for license incompatibilities, it's a good point. But if a class wrapped
> Neo4j and was interface-compatible
> with networkx, but didn't load networkx, then an end-user could load that
> class and tell networkx to use it
> as a Graph implementation. Thus keeping things separate in the libraries.
Yes, that would work. But it wouldn't be part of NetworkX.
> Other graph databases include orientdb, infinitegraph, dex. I've only used
> neo4j and orientdb.
Thanks, I'll take a look. Anyone else have suggestions?
Aric
Shortest_path uses G[n] or G.successors/G.predecessors
(bidirectional). I count G[n] being used for iterating over neighbors
as "using the Graph class methods".
Algorithms that use the syntax G[u].items() or G[u][v] are counting
on G[n] returning an adjacency dict of edge attribute dicts. That can
be a performance burden on data structures that don't use dicts.
Perhaps we could adjust the algorithms to use a new (and fast)
method __call__ like: G(u, attr=None) to iterate over 2-tuples
(nbr, attr_dict) if attr is None, or (nbr, attr_value) otherwise.
This reduces time spent creating dicts for data storage that doesn't
use dicts.
If an algorithm needs dicts, it can create one with
dict(G(u, attr=weight)).
I think the change in the algorithms would be something like:
>>> for nbr, wt in G(n, attr=weight):
>>> ....
instead of
>>> for nbr, edata in G[n].items():
>>> wt = edata.get(weight, 1)
>>> ....
To directly replace G[u][v][weight], the algorithms would need
G(u)[v][weight].
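A minimal sketch of the proposed __call__ method, on a toy Graph whose `adj` attribute is a dict-of-dicts of edge attribute dicts (the class here is illustrative, not the actual NetworkX base class):

```python
# Hypothetical sketch of the __call__ proposal above. The point is that
# G(n, attr=...) yields 2-tuples directly, so a non-dict backend never
# has to materialize attribute dicts just to hand back one value.

class Graph:
    def __init__(self):
        self.adj = {}

    def add_edge(self, u, v, **attrs):
        self.adj.setdefault(u, {})[v] = attrs
        self.adj.setdefault(v, {})[u] = attrs

    def __getitem__(self, n):
        # Existing fast path: full adjacency dict for node n.
        return self.adj[n]

    def __call__(self, n, attr=None, default=1):
        # Proposed path: yield (nbr, attr_dict) when attr is None,
        # else (nbr, attr_value) with a default for missing attributes.
        for nbr, data in self.adj[n].items():
            if attr is None:
                yield nbr, data
            else:
                yield nbr, data.get(attr, default)
```

An algorithm would then loop with `for nbr, wt in G(n, attr='weight'):`, and `dict(G(n, attr='weight'))` recovers a dict when one is really needed, as described above.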
I guess what I'm trying to say is that your idea to "use the Graph
class methods exclusively" doesn't mean we have to give up the speed
of G[u] or G[u][v] for our base classes. But there may be some corner
cases I'm not taking into account.
Thoughts?
Dan
That could work.
But I think the missing piece is the design document that
describes what the (Python language) public interface is for the Graph
class(es) and data objects. We have discussed aspects of this many
times, but right now the only documentation is the code.
If the lack of a clear specification is what is holding us back from
engineering something that can use alternative/distributed data
representations, then maybe it is time to address that. (But in a
different thread....)
Aric