Suitability of NetworkX for large graphs (5-25M nodes, 100-200M edges)?

12,705 views
Skip to first unread message

VanL

unread,
Apr 2, 2010, 4:35:59 PM4/2/10
to networkx-discuss
I have a dataset that I am building to do graph analysis with. I would
love to use NetworkX, but I am concerned that my graph may be too big.
How suitable is NX for analysis of a 5-25M node graph? Specifically I
am interested in finding cliques and spreading activation or pagerank
from a node.

If NX is not suitable, any suggestions?

Thanks,

Van

Richard Careaga

unread,
Apr 2, 2010, 4:49:23 PM4/2/10
to networkx...@googlegroups.com
I've used it without difficulty at the lower end of your range.

Andrew Conway

unread,
Apr 2, 2010, 5:00:45 PM4/2/10
to networkx...@googlegroups.com
NX is certainly capable of handling graphs that large, however, performance will largely be a function of your hardware setup.  Aric will likely give a better answer, but NX loads graphs into memory at once, so in the ranges your are describing you will need a substantial amount of free memory for it to work.

An alternative for large graph manipulation is igraph (http://igraph.sourceforge.net/), but you will be substantially limited in the forms your graphs can take.

--
You received this message because you are subscribed to the Google Groups "networkx-discuss" group.
To post to this group, send email to networkx...@googlegroups.com.
To unsubscribe from this group, send email to networkx-discu...@googlegroups.com.
For more options, visit this group at http://groups.google.com/group/networkx-discuss?hl=en.



___________________________________
Drew Conway
Ph.D. Student
Department of Politics, New York University
@drewconway

VanL

unread,
Apr 2, 2010, 5:12:43 PM4/2/10
to networkx...@googlegroups.com
On 4/2/2010 4:00 PM, Andrew Conway wrote:
> NX is certainly capable of handling graphs that large, however,
> performance will largely be a function of your hardware setup. Aric
> will likely give a better answer, but NX loads graphs into memory at
> once, so in the ranges your are describing you will need a substantial
> amount of free memory for it to work.
>
> An alternative for large graph manipulation is igraph
> (http://igraph.sourceforge.net/), but you will be
> substantially limited in the forms your graphs can take.

One thought I had was to subclass the Graph class(es) and put in a
neo4j-based representation underneath. From an initial gander at the
source code, it seemed like graph objects were mostly opaque, and the
various algorithms didn't poke around the internals too much. Thoughts?

Dan Schult

unread,
Apr 2, 2010, 5:31:57 PM4/2/10
to networkx...@googlegroups.com
Doesn't neo4j store its graph on disk?
If so doesn't it take a large hit for speed?

Anyway, the base class methods are the only methods
that access the data structure directly so it should
be fairly straightforward to subclass it with another
underlying representation.

Dan

> --
> You received this message because you are subscribed to the Google
> Groups "networkx-discuss" group.

> To post to this group, send email to networkx-
> dis...@googlegroups.com.
> To unsubscribe from this group, send email to networkx-discuss
> +unsub...@googlegroups.com.

VanL

unread,
Apr 3, 2010, 2:32:34 PM4/3/10
to networkx...@googlegroups.com
This is again dealing with my large dataset that I am trying to analyze.
One of the things that I want to do is find communities/cliques within
the graph. Looking at the complexity of some of those algorithms,
though, it appears that it might be prohibitive.

Looking again at the structure of my data, though, I think there might
be a way to avoid some of the complexity if there is an iterative
algorithm that keeps the cliques organized as new nodes are added. Among
other things, my dataset represents changes over time, and I fully
expect that the cliques at time T1 will evolve and be slightly different
at times T2, T3, etc.

Is there a good algorithm that will keep things "sorted" -- clique-wise
-- while I add new nodes and edges over time? If possible, I would like
to avoid re-analyzing the whole dataset at each timestep.

Thanks,

Van

Richard Careaga

unread,
Apr 3, 2010, 3:30:03 PM4/3/10
to networkx...@googlegroups.com
Are your cliques cumulative or instantaneous? I'm currently working on
the Enron email dataset of 250K messages, which I've skinnied down to
new topic messages (30K). Initially, I'm working with the visuals, but
plan to go on to other metrics. You go get a very different sense of
cliques if you look at the entire time series than if you chop it up
into increments, where cliques form and dissolve from month-to-month.

VanL

unread,
Apr 3, 2010, 10:31:22 PM4/3/10
to networkx...@googlegroups.com
On 4/3/2010 2:30 PM, Richard Careaga wrote:
> Are your cliques cumulative or instantaneous? I'm currently working on
> the Enron email dataset of 250K messages, which I've skinnied down to
> new topic messages (30K). Initially, I'm working with the visuals, but
> plan to go on to other metrics. You go get a very different sense of
> cliques if you look at the entire time series than if you chop it up
> into increments, where cliques form and dissolve from month-to-month.

Cumulative. My dataset is the US Patent Database, which I have spent
some time getting into a tractable form -- data acquisition and
normalization has taken several months. Now I am want to examine
communities based upon a number of different types of connections. I am
starting with direct references, but have several other types of
references that I will eventually look at.

I expect that cliques will form around new technologies. These cliques
will gradually differentiate into subfields, and some may merge with
other cliques as cross-disciplinary approaches are tried.

I want to try it two ways: first, just adding new nodes as they come and
examining the cliques that develop, and second try aging the links
between nodes to see if that changes things.

Thanks,

Van

Richard Careaga

unread,
Apr 3, 2010, 11:54:36 PM4/3/10
to networkx...@googlegroups.com
Wow. That's an ambitious project!

So, at t0, you have N nodes and V edges, call the methods and you get C cliques, each of which has two or more nodes (or, maybe, to 'seed' you will allow single node cliques for this round). One structure for this is a list of lists, which is convenient because the clique objects are implemented as list types.  Call this LL0 and either pickle or, probably preferably, load in SQL. A copy of this, LL, will serve as the cumulative master list of lists of cliques. While we're at it, let's create the t0 list of clique members:

members0 = set(your_favorite_flatten(LL0)) #unique clique members

At t1, you repeat and get another list of lists, LL1 to go with LL0. The first part is easy:

discards = [item for item in LL1 if item in LL1] #already have these cliques

candidates = [item for item in LL1 if item not in discards] #new or changed cliques

Find the set of  members of these cliques

members1 = set(your_favorite_flatten(LL1)

and the difference and overlap

newkids = members1.difference(members0)
recidivists = members1.intersection(members0)

And here I lose my way because I'm cramping on coming up with a suitable way of abstracting your decision logic on how edges form, but I guess you could iterate over LL0 and see if each newkid  or recidivist should be appended to a new existing clique.. When that's accounted for, what's left should be the new cliques. In either case you'd update LL with the new or changed lists.

franck kalala

unread,
Apr 4, 2010, 7:21:11 AM4/4/10
to networkx...@googlegroups.com
Hey All,

Is there any routine in Networks that can plot the distribution of distances in a given graph?
I want to have in abscissa the distances and in ordinates the probabilities.
want to plot P(d) vs d where d is the distance between two given nodes in the network.


De : Richard Careaga <pub...@careaga.net>
À : networkx...@googlegroups.com
Envoyé le : Dim 4 avril 2010, 4 h 54 min 36 s
Objet : Re: [networkx-discuss] Iterative community-finding algorithms
--
You received this message because you are subscribed to the Google Groups "networkx-discuss" group.
To post to this group, send email to networkx...@googlegroups.com.
To unsubscribe from this group, send email to networkx-discu...@googlegroups.com.

franck kalala

unread,
Apr 4, 2010, 7:32:53 AM4/4/10
to networkx...@googlegroups.com

Richard Careaga

unread,
Apr 4, 2010, 11:18:04 AM4/4/10
to networkx...@googlegroups.com
nx has facilities, piggybacking on matplotlib, to draw graphs: http://networkx.lanl.gov/reference/drawing.html

but it is not a general purpose plotting facility, but you can find everything you need in the matplotlib package:

http://matplotlib.sourceforge.net/api/pyplot_api.html#matplotlib.pyplot.plot

franck kalala wrote:

Aric Hagberg

unread,
Apr 4, 2010, 12:19:10 PM4/4/10
to networkx...@googlegroups.com
On Sun, Apr 4, 2010 at 9:18 AM, Richard Careaga <leu...@gmail.com> wrote:
> nx has facilities, piggybacking on matplotlib, to draw graphs:
> http://networkx.lanl.gov/reference/drawing.html
>
> but it is not a general purpose plotting facility, but you can find
> everything you need in the matplotlib package:
>
> http://matplotlib.sourceforge.net/api/pyplot_api.html#matplotlib.pyplot.plot

You just need pylab.hist() and a little code:

import networkx
G=networkx.path_graph(5)
lengths=[] # all path lengths in graph, counted twice
for n in G:
p=networkx.single_source_shortest_path_length(G,n).values()
lengths.extend(p)

import pylab
pylab.hist(lengths,bins=max(lengths),align='left',rwidth=0.5)
pylab.show()

The histogram is of frequencies so you'll have to normalize
accordingly.

Aric

Aric Hagberg

unread,
Apr 4, 2010, 12:26:36 PM4/4/10
to networkx...@googlegroups.com

There is an example of a subclass at
http://networkx.lanl.gov/examples/subclass/printgraph.html

We have tried to not expose the underlying data structures anywhere
in the algorithms (but there is no test we run to demonstrate that is true).
We do rely on Python dictionary expressions like G[v] etc.

I also have experimented using Python shelve to store persistent graph objects
https://networkx.lanl.gov/trac/ticket/224


Aric

franck kalala

unread,
Apr 4, 2010, 12:48:49 PM4/4/10
to networkx...@googlegroups.com
Thanks Aric for you help.

I ment the kind of the plot in the attached file.

May be was not to precise about.

See the attached file

Cheers


----- Message d'origine ----
De : Aric Hagberg <ahag...@gmail.com>
À : networkx...@googlegroups.com
Envoyé le : Dim 4 avril 2010, 17 h 19 min 10 s
Objet : Re: [networkx-discuss] plottting the distribution of distances

Aric

--

plot.ppt

franck kalala

unread,
Apr 5, 2010, 8:42:50 AM4/5/10
to networkx...@googlegroups.com
Hey all,

Is there any routine that generates strongly regular graph?

http://en.wikipedia.org/wiki/Strongly_regular_graph

Cheers



Reply all
Reply to author
Forward
0 new messages