Generate/write/load huge graph


Martin

Nov 13, 2010, 6:32:08 AM
to networkx-discuss
Hello everybody,

I am trying to figure out the best way to generate a graph with
NetworkX and then write it out and reload it.

A few words on the background. I fetch data from a database and
generate a graph out of it. The graph has around 50k nodes and is
extremely dense (around 26 million edges). I already did a first
analysis with R/igraph, using an adjacency list as input (node1 node2
weight). Now I need more information in the graph (node and edge
labels) as input to my algorithm. To be more flexible, I decided to
use NetworkX and Python for further analysis.

First I tried to generate a DOT file from the database and load it
into a graph with read_dot. The DOT file is about 850 MB (due to the
labels and the large number of edges). I initially ran into memory
problems (Core i7, 12 GB RAM) but managed to get the file loaded. I
think the problem is with adding the edges to the graph: I started
the import 2 days ago and it is still running (Python takes 11.8 GB
of RAM plus an additional 4 GB of swap).

I figured that the DOT file is maybe not the best way to feed the
graph to my algorithm (the data changes on a regular basis, and I
still have to figure out the best parameters for my algorithm
(simulated annealing), so I will have to load the graph quite often).
I thought about generating the graph once and saving it with gpickle.
So in theory, loading the graph from the gpickle should be quite fast.
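To convince myself that the reload step is cheap, a minimal round trip can be sketched with the stdlib pickle module, which is essentially what write_gpickle/read_gpickle wrap; the tiny graph and temp file here are toy placeholders, not the real data:

```python
import os
import pickle
import tempfile

import networkx as nx

# toy stand-in for the real 50k-node graph
G = nx.Graph(name='G')
G.add_edge('a', 'b', weight=0.5)

path = os.path.join(tempfile.mkdtemp(), 'Full_Graph.gpickle')

# dump the fully built graph once...
with open(path, 'wb') as f:
    pickle.dump(G, f, protocol=pickle.HIGHEST_PROTOCOL)

# ...and reload it for every run of the algorithm
with open(path, 'rb') as f:
    H = pickle.load(f)

print(H.number_of_edges())  # 1
```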

I have the data stored in dicts (quite a lot of calculations and
combinations happen between the database and the dicts). I have a
separate dict for the nodes (node: labeldata), which I use to store
the node labels and to map the node names (which are big strings) to
integers, and one for the edges (key "node1 -- node2": labeldata as a
list). After that I try to generate the graph like this:

import networkx as nx

G = nx.Graph(name='G')

for key in node_dict:
    # attach the per-node label data as node attributes
    G.add_node(node_dict[key]['shortname'], **node_dict[key])

# adding the nodes finishes after a few seconds

for key in edge_dict:
    # the edges are stored as dict keys of the form "node1 -- node2"
    node1, node2 = key.split(' -- ')
    G.add_edge(node1, node2, labels=edge_dict[key])

# runs forever.. :)

nx.write_gpickle(G, "Full_Graph.gpickle")
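For what it's worth, the same edge loop can also be expressed as a single add_edges_from call over a generator of (node1, node2, data) tuples, which saves one Python method call per edge; the tiny edge_dict below is a made-up stand-in for the real one:

```python
import networkx as nx

# made-up stand-in for the real edge dict keyed "node1 -- node2"
edge_dict = {'a -- b': [1, 2], 'b -- c': [3]}

def edge_iter(d):
    # yield (node1, node2, attr_dict) tuples that add_edges_from understands
    for key, labels in d.items():
        node1, node2 = key.split(' -- ')
        yield node1, node2, {'labels': labels}

G = nx.Graph(name='G')
G.add_edges_from(edge_iter(edge_dict))
print(G.number_of_edges())  # 2
```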

After all that talk, my question. :) Is there a faster/better way to
do what I want to do? Adding the edges already takes forever (it has
been running for 8 hours now and is still going), and will I be able
to load the graph from the gpickle to run my algorithm on it? Is it
even possible to work with such a big and dense graph in NetworkX?

Thanks a lot for your help. :)

Greetings,

Martin

Aric Hagberg

Nov 13, 2010, 10:30:15 AM
to networkx...@googlegroups.com

As you discovered, the flexibility of storing labels and other
information with the graph nodes comes with some memory cost.

If the database has a standard interface, there is probably a Python
module to grab the data directly from it. That might remove some of
the steps of dumping the data to intermediate files and allow you to
load and update the graph faster, directly from the database.
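As a rough sketch of that, rows can be streamed from a database cursor straight into the graph with no intermediate file; the in-memory sqlite3 table and its schema below are made-up stand-ins for the real database:

```python
import sqlite3

import networkx as nx

# made-up schema standing in for the real database
conn = sqlite3.connect(':memory:')
conn.execute('CREATE TABLE edges (node1 TEXT, node2 TEXT, weight REAL)')
conn.executemany('INSERT INTO edges VALUES (?, ?, ?)',
                 [('a', 'b', 0.5), ('b', 'c', 1.5)])

G = nx.Graph()
# the cursor yields (node1, node2, weight) tuples lazily, so the
# graph is built directly from the query result
cur = conn.execute('SELECT node1, node2, weight FROM edges')
G.add_weighted_edges_from(cur)
print(G.number_of_edges())  # 2
```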

But I think the main problem you are having is running out of memory
when you load the graph. If you are swapping, loading is going to be
very slow. With a machine with more memory it should be possible.
For example, I just loaded a multigraph on my laptop with 5k nodes
and 4M edges; it took about one minute to read from a text file of
edges and occupied about 2 GB of memory. So you can roughly guess
that you'll need about 10 times that for your graph, depending on
exactly what you are storing in it.

We have explored using on-disk storage for graphs that are too big to
fit into memory. Obviously there would be some performance reduction
when accessing the data, but if you can't fit it in memory it might
be the best solution. Currently there is no code in NetworkX for
that, but I did make a demo implementation using Python shelve to
store the data (see https://networkx.lanl.gov/trac/ticket/224 and the
linked email discussion there). That code probably won't work with
the latest version of NetworkX, but it wouldn't be hard to update.
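The ticket has the actual demo; just to illustrate the idea, here is a minimal sketch of shelve-backed adjacency storage. The layout and the add_edge helper are simplifying assumptions of mine, not the ticket's code:

```python
import os
import shelve
import tempfile

# keep each node's adjacency dict on disk, keyed by node name
path = os.path.join(tempfile.mkdtemp(), 'graph_adj')

def add_edge(adj, u, v, data):
    # a shelve only persists on assignment, so read-modify-write each dict
    for a, b in ((u, v), (v, u)):
        nbrs = adj.get(a, {})
        nbrs[b] = data
        adj[a] = nbrs

with shelve.open(path) as adj:
    add_edge(adj, 'a', 'b', {'weight': 0.5})
    add_edge(adj, 'b', 'c', {'weight': 1.5})

# reopen: the adjacency data lives on disk, not in memory
with shelve.open(path) as adj:
    print(sorted(adj['b']))  # ['a', 'c']
```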

Aric
