Bipartite projection and write to CSV with NetworkX — how to speed up writing to handle large file

688 views
Skip to first unread message

Curtis Hampton

unread,
Jul 22, 2014, 7:46:37 PM7/22/14
to networkx...@googlegroups.com
I have a pretty big file (3 million lines) with each line being a person-to-event relationship.  Ultimate, I want to project this bipartite network onto a single-mode, weighted, network, and write it to a CSV file.  I'm using NetworkX, and I've tested my code on a much smaller sample dataset, and it works as it should.  However, when I scale up to my actual dataset, my computer just maxes out on memory and spins and spins, but doesn't make any progress.

I'm using an AWS EC2 machine with 32GB of memory. 

After some sample testing, I'm pretty sure things are getting hung up in the final step after the graph has been projected, and it is being written to a CSV file. I've tried breaking up the file into chunks, but then I have a problem with missing edged, or correctly adding edgeweights together.  But I think a better solution is going to be to find a way to speed up writing the projected graph to CSV.

Code using NetworkX to Project Bipartite Network and Write to CSV

    # import modules
    import time
    import csv
    import networkx as nx
    from networkx.algorithms import bipartite

    startTime = datetime.datetime.now()

    # rename files
    infile = 'bipartite_network.csv'
    name_outfile = infile.replace('.csv', '_nameFolded.csv.')
    print 'Files renamed at: ' + str(datetime.datetime.now() - startTime)

    # load CSV into a dict
    with open(infile, 'rb') as csv_file:
        rawData = list(csv.DictReader(csv_file))
    print 'Files loaded at: ' + str(datetime.datetime.now() - startTime)

    # create edgelist for Name -x- Event relationships
    edgelist = []
    for i in rawData:
        edgelist.append(
        (i['Event'],
         i['Name'])   
        )
    print 'Bipartite edgelist created at: ' + str(datetime.datetime.now() - startTime)

    # deduplicate edgelist
    edgelist = sorted(set(edgelist))
    print 'Bipartite edgelist deduplicated at: ' + str(datetime.datetime.now() - startTime)

    # create a unique list of Name and Event for nodes
    Event = sorted(set([i['Event'] for i in rawData]))
    Name = sorted(set([i['Name'] for i in rawData]))
    print 'Node entities deduplicated at: ' + str(datetime.datetime.now() - startTime)

    # add nodes and edges to a graph
    B = nx.Graph()
    B.add_nodes_from(Event, bipartite=0)
    B.add_nodes_from(Name, bipartite=1)
    B.add_edges_from(edgelist)
    print 'Bipartite graph created at: ' + str(datetime.datetime.now() - startTime)

    # create bipartite projection graph
    name_nodes, event_nodes = bipartite.sets(B)
    event_nodes = set(n for n,d in B.nodes(data=True) if d['bipartite']==0)
    name_nodes = set(B) - event_nodes
    name_graph = bipartite.weighted_projected_graph(B, name_nodes)
    print 'Single-mode projected graph created at: ' + str(datetime.datetime.now() - startTime)

    # write graph to CSV
    nx.write_weighted_edgelist(name_graph, name_outfile, delimiter=',')
    print 'Single-mode weighted edgelist to CSV: ' + str(datetime.datetime.now() -    startTime)

    endTime = datetime.datetime.now()
    print 'Run time: ' + str(endTime - startTime)


Using Pandas to Write the Projected Edgelist, but Missing Edge Weight?

I've thought about using `pandas` to write to `name_graph` to CSV, but that doesn't include the weight.  Any idea how to include the weight in the dataframe?  Would this be a good option for speeding up the writing to CSV part of the process?

    import pandas as pd
    df = pd.DataFrame(nx.edges(name_graph))
    df.to_csv('foldedNetwork.csv')

Moritz Beber

unread,
Jul 23, 2014, 5:30:57 AM7/23/14
to networkx...@googlegroups.com
Hi,


On Wed, Jul 23, 2014 at 1:46 AM, Curtis Hampton <curtisl...@gmail.com> wrote:
I have a pretty big file (3 million lines) with each line being a person-to-event relationship.  Ultimate, I want to project this bipartite network onto a single-mode, weighted, network, and write it to a CSV file.  I'm using NetworkX, and I've tested my code on a much smaller sample dataset, and it works as it should.  However, when I scale up to my actual dataset, my computer just maxes out on memory and spins and spins, but doesn't make any progress.

I'm using an AWS EC2 machine with 32GB of memory. 

After some sample testing, I'm pretty sure things are getting hung up in the final step after the graph has been projected, and it is being written to a CSV file. I've tried breaking up the file into chunks, but then I have a problem with missing edged, or correctly adding edgeweights together.  But I think a better solution is going to be to find a way to speed up writing the projected graph to CSV.

Are you sure it's the final part? Writing to CSV should not be a problem, really.
The nx.Graph class handles de-duplication of edges for you already, so why not use it directly?

for row in raw_data:
    B.add_node(row['Event'], bipartite=0)
    B.add_node(row['Name'], bipartite=1)
    B.add_edge(row['Event'], row['Name'])

You don't add any edge attribute here directly, so maybe something's missing? Or did you want to use a MultiGraph?
 

    # add nodes and edges to a graph
    B = nx.Graph()
    B.add_nodes_from(Event, bipartite=0)
    B.add_nodes_from(Name, bipartite=1)
    B.add_edges_from(edgelist)
    print 'Bipartite graph created at: ' + str(datetime.datetime.now() - startTime)

    # create bipartite projection graph
    name_nodes, event_nodes = bipartite.sets(B)
    event_nodes = set(n for n,d in B.nodes(data=True) if d['bipartite']==0)
    name_nodes = set(B) - event_nodes
    name_graph = bipartite.weighted_projected_graph(B, name_nodes)
    print 'Single-mode projected graph created at: ' + str(datetime.datetime.now() - startTime)

    # write graph to CSV
    nx.write_weighted_edgelist(name_graph, name_outfile, delimiter=',')
    print 'Single-mode weighted edgelist to CSV: ' + str(datetime.datetime.now() -    startTime)

    endTime = datetime.datetime.now()
    print 'Run time: ' + str(endTime - startTime)


Using Pandas to Write the Projected Edgelist, but Missing Edge Weight?

I've thought about using `pandas` to write to `name_graph` to CSV, but that doesn't include the weight.  Any idea how to include the weight in the dataframe?  Would this be a good option for speeding up the writing to CSV part of the process?

    import pandas as pd
    df = pd.DataFrame(nx.edges(name_graph))
    df.to_csv('foldedNetwork.csv')

--
You received this message because you are subscribed to the Google Groups "networkx-discuss" group.
To unsubscribe from this group and stop receiving emails from it, send an email to networkx-discu...@googlegroups.com.
To post to this group, send email to networkx...@googlegroups.com.
Visit this group at http://groups.google.com/group/networkx-discuss.
For more options, visit https://groups.google.com/d/optout.

Raf Guns

unread,
Jul 23, 2014, 5:46:22 AM7/23/14
to networkx...@googlegroups.com
Hi!

On Wed, Jul 23, 2014 at 1:46 AM, Curtis Hampton
<curtisl...@gmail.com> wrote:
> I've thought about using `pandas` to write to `name_graph` to CSV, but that
> doesn't include the weight. Any idea how to include the weight in the
> dataframe? Would this be a good option for speeding up the writing to CSV
> part of the process?
>
> import pandas as pd
> df = pd.DataFrame(nx.edges(name_graph))

I don't know if writing CSV is the bottleneck here, but pandas' CSV
reading and writing is optimized Cython, so I would expect it to be a
lot faster. You can include the weights by changing the above line:

df = pd.DataFrame([(u, v, d['weight']) for u, v, d in G.edges_iter(data=True)])


Raf

Curtis Hampton

unread,
Jul 23, 2014, 9:50:10 AM7/23/14
to networkx...@googlegroups.com
After further testing, it appears to be getting hung up on the projection stage.  Any ideas on what I can do to get past this part?

name_graph = bipartite.weighted_projected_graph(B, name_nodes)

Aric Hagberg

unread,
Jul 23, 2014, 10:37:54 AM7/23/14
to networkx...@googlegroups.com
You could just write the edges directly to a file instead of building
a new graph (the projected graph) first. e.g. steal the code from
bipartite.weighted_projected_graph()

def weighted_projected_graph(B, nodes, ratio=False):
if B.is_multigraph():
raise nx.NetworkXError("not defined for multigraphs")
if B.is_directed():
pred=B.pred
G=nx.DiGraph()
else:
pred=B.adj
G=nx.Graph()
G.graph.update(B.graph)
G.add_nodes_from((n,B.node[n]) for n in nodes)
n_top = float(len(B) - len(nodes))
for u in nodes:
unbrs = set(B[u])
nbrs2 = set((n for nbr in unbrs for n in B[nbr])) - set([u])
for v in nbrs2:
vnbrs = set(pred[v])
common = unbrs & vnbrs
if not ratio:
weight = len(common)
else:
weight = len(common) / n_top
#replace-> G.add_edge(u,v,weight=weight)
#with writer -> u,v,weight
return


Aric

Curtis Hampton

unread,
Jul 23, 2014, 6:37:10 PM7/23/14
to networkx...@googlegroups.com
Aric,

Thank you for your time and help.  Check out my code below and let me know if you can think of any additional suggestion to hopefully make this run. 


# project graph and write projected graph's edgelist to a list

if B.is_multigraph():
    raise nx.NetworkXError("not defined for multigraphs")
if B.is_directed():
    pred = B.pred
else:
    pred = B.adj
    G = nx.Graph()
G.graph.update(B.graph)
G.add_nodes_from((n,B.node[n]) for n in name_nodes)
n_top = float(len(B) - len(name_nodes))
folded = []
for u in name_nodes:

    unbrs = set(B[u])
    nbrs2 = set((n for nbr in unbrs for n in B[nbr])) - set([u])
    for v in nbrs2:
        vnbrs = set(pred[v])
        common = unbrs &
 vnbrs
        weight = len(common)
        row = u, v, weight
        folded.append(row)

# write projected graph list to CSV
df = pd.DataFrame(folded)
df.to_csv(name_outfile_pd)

Aric Hagberg

unread,
Jul 23, 2014, 7:23:03 PM7/23/14
to networkx...@googlegroups.com
I was thinking something like this:

import networkx as nx

B = nx.Graph()
B.add_edge('a',1)
B.add_edge('a',2)
B.add_edge('b',1)
B.add_edge('b',2)
B.add_edge('b',3)
B.add_edge('c',3)

nodes = ['a','b','c']
seen = set()
for u in nodes:
#    seen=set([u]) # print both u-v, and v-u
    seen.add(u) # don't print v-u
    unbrs = set(B[u])
    nbrs2 = set((n for nbr in unbrs for n in B[nbr])) - seen
    for v in nbrs2:
        vnbrs = set(B[v])
        common = unbrs & vnbrs
        weight = len(common)
        print("%s, %s, %d"%(u,v,weight))
Reply all
Reply to author
Forward
0 new messages