Two questions: Data affinity and adding/removing vertices/edges at runtime?


Brenton Partridge

Nov 22, 2013, 2:24:40 AM11/22/13
to graph...@googlegroups.com
Hi all,

New to the community, thanks in advance for reading! I'm considering using GraphLab for a large probabilistic inference problem on astronomical images, and I'm wondering about whether GraphLab has certain features that would be crucial. Our model can be represented by two "layers" of vertices, each distributed (relatively) uniformly over space and connected to "nearby" vertices in the other layer only. The top layer has S source nodes (corresponding to latent stellar objects), and the bottom layer has N observation nodes (pixel data, either represented as 1 node per pixel or 1 node per image patch?). However, because the linkages are limited in their "distance" in space, there are K<<SN edges. I'm relatively confident that our inference algorithm can be expressed as Gather-Apply-Scatter steps on these vertices/edges.
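To make the bipartite GAS framing above concrete, here is a minimal, library-free Python sketch of one gather-apply-scatter round (all names and the toy update rule are hypothetical; this illustrates the decomposition, not the GraphLab API):

```python
# Minimal sketch of one Gather-Apply-Scatter round over a bipartite graph:
# S source vertices on top, N observation vertices below, connected only by
# "nearby" edges, so there are K << S*N edges in total.

# Tiny example: 2 sources, 4 observations, sparse "nearby" edges.
sources = {0: 1.0, 1: 2.0}                        # latent value per source
observations = {0: 0.8, 1: 1.0, 2: 2.4, 3: 2.0}  # observed pixel values
edges = [(0, 0), (0, 1), (1, 2), (1, 3)]          # (source, observation) pairs

def gas_round(sources, observations, edges):
    """One GAS round: each source gathers residuals from its incident edges
    only (K terms total), applies an averaged correction, and the updated
    value is (implicitly) scattered to neighbors for the next round."""
    new_sources = {}
    for s, value in sources.items():
        nbrs = [o for (src, o) in edges if src == s]       # gather
        gathered = sum(observations[o] - value for o in nbrs)
        new_sources[s] = value + gathered / max(len(nbrs), 1)  # apply
    return new_sources

updated = gas_round(sources, observations, edges)
print(updated)  # {0: 0.9, 1: 2.2}: each source moves toward its neighborhood mean
```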

With that context, I'd like to pick your brains about two features that we'd need, since it's hard to tell from the tutorials and API whether they are supported in GraphLab - and if they are supported, what would I look for in the documentation?

1. Since N*data_per_observation_node is large (multiple terabytes), we'd like to minimize the amount of movement of data between machines. Is there a way to tell GraphLab to do this, i.e. give observation node n, and operations on its local neighborhood, an affinity for machine g, instead of distributing it to a random machine every time the scheduler decides to update n's local neighborhood? Can GraphLab deal with fault tolerance on machine g's associated vertices if machine g goes down?
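One way to picture the affinity being asked for (a purely illustrative sketch, not a GraphLab feature; tile size and function names are invented): replace random hash placement with a placement function keyed by spatial location, so a node and its local neighborhood consistently map to the same machine.

```python
# Illustrative sketch (not a GraphLab API): place observation nodes by
# spatial tile so a node and its local neighborhood land on one machine,
# instead of hash placement that scatters neighbors across machines.

NUM_MACHINES = 4
TILE = 100  # side length of a spatial tile, in pixels (hypothetical)

def hash_placement(node_id):
    """Baseline: placement ignores locality; neighbors scatter."""
    return hash(node_id) % NUM_MACHINES

def affinity_placement(x, y):
    """Affinity: all nodes in the same spatial tile map to one machine, so
    GAS updates on a local neighborhood mostly touch a single machine."""
    tile = (x // TILE, y // TILE)
    return hash(tile) % NUM_MACHINES

# Two nearby pixels land on the same machine under affinity placement.
assert affinity_placement(12, 34) == affinity_placement(15, 37)
```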

2. A priori, we do not know the number of sources S, and we'd like to vary this number using reversible jump Metropolis-Hastings. This would correspond to adding or removing source vertices and edges in the graph between iterations/during GAS rounds. Is it possible to do this while the system is running, i.e. update steps are occurring? Or would we need to do something like start with a pool of disabled source vertices, and enable from that? It might be very inefficient to do so since we would need to support all plausible edges between a new source vertex and the observation vertices.
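The "pool of disabled source vertices" workaround could look like the following hypothetical sketch (not GraphLab code; `S_MAX` and all names are assumptions): pre-allocate an upper bound of source vertices with all plausible edges, and let a reversible-jump birth/death move merely toggle an active flag.

```python
# Hypothetical sketch of the "pool of disabled source vertices" workaround:
# pre-allocate S_MAX sources (with all plausible edges already in the graph)
# and let RJ-MCMC birth/death moves toggle a flag instead of mutating the graph.
import random

S_MAX = 8  # assumed upper bound on the number of sources

pool = {s: {"active": False, "params": None} for s in range(S_MAX)}

def birth_move(pool, params):
    """Activate a disabled source from the pool (RJ-MCMC 'birth')."""
    inactive = [s for s, v in pool.items() if not v["active"]]
    if not inactive:
        return None  # pool exhausted; S_MAX was too small
    s = random.choice(inactive)
    pool[s] = {"active": True, "params": params}
    return s

def death_move(pool, s):
    """Deactivate a source (RJ-MCMC 'death'); its edges go dormant, not deleted."""
    pool[s] = {"active": False, "params": None}

s = birth_move(pool, params={"flux": 3.2})
assert pool[s]["active"]
death_move(pool, s)
assert not pool[s]["active"]
```

As the question notes, the cost of this approach is carrying every plausible source-observation edge from the start, even for sources that are disabled most of the time.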

Thanks for your help! GraphLab looks awesome, and I hope that we can use it!

Best,
Brenton Partridge
Harvard University

Danny Bickson

Nov 23, 2013, 1:18:25 AM11/23/13
to graph...@googlegroups.com
Hi Brenton!
Thanks for reaching out - your project sounds awesome!
I will try to give some initial advice, but since there is more than one possible solution, we may get some additional answers from the team.

First, we have a newer warp engine you can consider using: http://docs.graphlab.org/using_warp_graph_vertex_program.html
In some cases it results in simpler code than GAS.

Recently we added support for dynamic graph edges, namely edges that are added to the graph after the engine is started. However, I think we do not support removal of edges yet. My advice is to take a look at the kmeans code: https://github.com/graphlab-code/graphlab/blob/master/toolkits/clustering/kmeans.cpp which tests the distance of a point against every cluster center without forming explicit edges between all of them.
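The pattern Danny points to in kmeans.cpp, sketched here in Python (not the actual C++ toolkit code), is to keep the small side of the comparison as shared, replicated state instead of materializing point-to-center edges:

```python
# Python sketch of the pattern used in GraphLab's kmeans toolkit (not the
# actual C++ code): instead of an edge from every point to every cluster
# center, broadcast the small set of centers and let each point vertex
# scan them locally.

centers = [(0.0, 0.0), (5.0, 5.0)]               # small shared state, replicated
points = [(0.1, -0.2), (4.8, 5.3), (0.4, 0.1)]   # one vertex per point

def nearest_center(point, centers):
    """Assign a point to its closest center without any point-center edges."""
    def sqdist(a, b):
        return (a[0] - b[0]) ** 2 + (a[1] - b[1]) ** 2
    return min(range(len(centers)), key=lambda c: sqdist(point, centers[c]))

assignments = [nearest_center(p, centers) for p in points]
print(assignments)  # [0, 1, 0]
```

The same idea could apply to new source vertices: rather than adding all plausible edges up front, each observation could scan a small broadcast list of candidate sources.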

Regarding the local partitioning, I will answer in a separate email. 

Keep us posted with any questions. 


Danny Bickson
Co-Founder
GraphLab Inc.



Rong Chen

Nov 25, 2013, 1:11:16 AM11/25/13
to graph...@googlegroups.com
Hi Brenton,

The case you mentioned is very interesting, especially regarding data affinity.
Our group developed a new hybrid engine and partitioning for GraphLab, which applies different computation and partitioning strategies to different vertices.
(http://ipads.se.sjtu.edu.cn/projects/powerlyra.html)

Currently, we are focusing on extending the hybrid partitioning to better support bipartite graphs and algorithms.
The new bipartite partitioning applies different partitioning strategies to the vertices of the two groups.
All vertices in one group have no replicas (no duplicated vertex data).

I think our new partitioning, with some modification, can satisfy your data-affinity requirement:

- Vertices and edges are loaded in parallel from the local disk of each machine.
- All N (observation) vertices are fixed on the machine that loads them, without replication, and S (stellar) vertices are replicated for computation.

Is this what you wanted?
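A rough sketch of the placement rule described above (hypothetical names and data, not PowerLyra's actual implementation): observation vertices stay on their loading machine with no replicas, while each source vertex gets a replica on every machine that holds one of its edges.

```python
# Hypothetical sketch of the bipartite placement rule (not PowerLyra's code):
# observation (N) vertices are fixed on the machine that loaded them, with no
# replicas; source (S) vertices are replicated wherever their edges live.

# Observation vertex -> machine that loaded its data (fixed, no replicas).
obs_home = {"n0": 0, "n1": 0, "n2": 1, "n3": 1}

# Edges connect a source to a nearby observation.
edges = [("s0", "n0"), ("s0", "n2"), ("s1", "n2"), ("s1", "n3")]

def place(edges, obs_home):
    """Each edge lives on its observation's home machine; each source is
    replicated on every machine that holds one of its edges."""
    source_replicas = {}
    for s, n in edges:
        source_replicas.setdefault(s, set()).add(obs_home[n])
    return source_replicas

replicas = place(edges, obs_home)
print(replicas)  # {'s0': {0, 1}, 's1': {1}}: only sources are replicated
```

Because K << S*N and the edges are spatially local, each source should touch only a few machines, keeping the replication factor low.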

If you can provide more detailed information about your application, I'd be glad to add data affinity to our bipartite partitioning so that you can try it in your application.
(For example, could you provide some sample code and data for our development?)

Rong Chen
Institute of Parallel and Distributed Systems (IPADS)
Shanghai Jiao Tong University

Rong Chen

Dec 13, 2013, 8:14:27 AM12/13/13
to graph...@googlegroups.com
We have implemented a tentative graph partitioning to support data affinity for bipartite graphs.
Vertices in the favorite subset of the bipartite graph can be fixed on the machine storing their data files and will not be replicated on any other machine.

You can find detailed information at http://ipads.se.sjtu.edu.cn/projects/powerlyra.html


Rong Chen
Institute of Parallel and Distributed Systems (IPADS)
Shanghai Jiao Tong University

