B+ Tree


Matthias Broecheler

Sep 7, 2012, 2:13:16 PM
to galax...@googlegroups.com
Hey Ron,

thanks for answering my issue and pointing me to the mailing list. I am very interested in seeing how a B+ tree would perform on Galaxy in practice. Your theoretical analysis is compelling - an actual implementation would be even more so, as would a comparison to native distributed B+ tree implementations (e.g. http://www.systap.com/bigdata.htm).

I would be happy to help out with the implementation, but don't have enough time to take the lead on it.

The reason I am interested in a B+ tree implementation on Galaxy is that it could make a really nice storage backend for Titan (http://thinkaurelius.github.com/titan/). Titan is a distributed graph database with a pluggable storage backend model - currently it runs against Cassandra, HBase, and BerkeleyDB. For those storage backends, Titan handles the id assignment, which in turn determines the location and therefore the data locality - but that locality is static. Essentially, we make a good guess about where to put stuff and then have to live with it. However, as you pointed out in the article, that gives very predictable access performance. With a Galaxy B+ tree implementation, I would use the B+ tree to store each vertex's adjacency list and configure the id assignment so that it uses the ids returned by Galaxy. Hence, Galaxy would manage the data locality dynamically.
Bottom line: with a B+ tree implementation it should not take me long to finish a Titan storage backend for Galaxy, which would let us test your hypothesis that the Galaxy framework is a good fit for graph databases. Titan provides the graph database middleware, query language, webserver, etc., so you would have a fully powered Galaxy-driven graph database.

Keep up the good work and let me know if/where I can help,
Matthias

pron

Sep 8, 2012, 2:55:06 PM
to galax...@googlegroups.com
What I don't quite understand is why you need a B+ tree. Do you need to do range queries? Do range queries help with your graph representation?
It seems to me that we need to come up with a better data structure for graphs. The problem is, of course, partitioning. Galaxy lets you seamlessly transfer a data item from one machine to another (or share a data item among multiple nodes), but it enforces no policy as to when items should be migrated. The B+ tree implementation outlined in the blog post clusters items onto machines based on the relative closeness of their numeric/string keys.
For graphs, it would make sense to cluster items (say, graph nodes) based on the graph structure, so that nearby nodes are stored on the same machine. As the graph changes (i.e. as edges and nodes are added or removed), it would make sense to migrate nodes among machines based on some sensible partitioning scheme. So, actually, you can store Titan nodes on Galaxy even without the B+ tree, but without a good partitioning scheme you won't get the major performance boost over DHT solutions (like Cassandra). I would love to hear suggestions for a good partitioning algorithm.

bloudermilk

Sep 9, 2012, 12:40:55 PM
to galax...@googlegroups.com
Hey Matthias,

Funny to see you on this list, as I was posting on the Blueprints list just a few weeks ago asking some questions that eventually led me to Galaxy. One of my first thoughts on Titan was in fact "how do we keep it all in memory?", so it's exciting to see that you're investigating that and have ended up at Galaxy as well.

I've been considering a (potentially naive) method for graph sharding that I'd love to get feedback on from either or both of you. The basic premise is to apply a modified version of an existing visual layout algorithm to the graph. ForceAtlas2, for example, seems particularly well suited due to its use of the Barnes-Hut simulation, but it may need to be modified to be better balanced.

If you imagine the entire graph laid out in some two-dimensional space over time as vertices/edges are added and removed, you begin to see how this could lead to a distribution technique. By picking a point near the center of the space and drawing a circle around all vertices, we can establish our bounds. From there, we subdivide the circle into n same-sized, pie-shaped slices, which represent our shards. Our layout algorithm has already done the work of balancing the graph and grouping highly connected vertices, so we simply need to draw the lines that define our shards, probably with some consideration not to intersect large clusters.
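
To make the slicing concrete, here is a minimal sketch of the shard assignment (hypothetical names; it assumes the layout algorithm has already given each vertex an (x, y) position, and that (cx, cy) is the center point we picked):

    // Hypothetical sketch: assign a vertex to one of n pie-slice shards
    // based on its angle around the layout's center point (cx, cy).
    static int shardOf(double x, double y, double cx, double cy, int n) {
        double angle = Math.atan2(y - cy, x - cx);             // in [-pi, pi]
        double normalized = (angle + Math.PI) / (2 * Math.PI); // in [0, 1]
        return (int) Math.min(n - 1, Math.floor(normalized * n));
    }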

In Galaxy terms, each shard would own the edges and vertices in its slice of the graph. For redundancy and traversal optimization, we might be able to "bleed" the edges determined in the sharding step. For example, for each vertex on a given shard we could share any vertex which is within our known application-specific step count and in one of the two neighboring slices. 

An alternative (and seemingly more complicated) redundancy technique could be to modify the layout algorithm to use the point we drew in the very first step (the center of the circle) as a center of gravity. This point would repel any lonely vertices and attract highly-connected vertices or clusters. Then we would establish some threshold, a smaller circle with the same center as our first one. Anything within the smaller circle gets shared with all shards.

As you can probably tell, I've been thinking about this for some time. Unfortunately, due to a lack of time and experience, I haven't been able to actually test any of this. At the moment I'm practicing a few prerequisites in my spare time and hope to start tinkering with some real demos within a few weeks. If it seems like I'm going down the wrong path (surely there's a simpler solution), please let me know!

Thanks,
Brendan 

Matthias Broecheler

Sep 10, 2012, 2:27:25 PM
to galax...@googlegroups.com
> What I don't quite understand is why you need a B+ tree. Do you need to do range queries? Do range queries help with your graph representation?

Titan stores the graph in a special type of adjacency list. "Special" in the sense that it introduces an order on the adjacent edges so that it can execute vertex-centric queries very efficiently by translating them into range queries. See http://www.slideshare.net/slidarko/titan-the-rise-of-big-graph-data for more information.
With that, an efficient implementation on Galaxy would store the adjacency list for each vertex in a B+ tree. This is (kind of) how the implementation against BerkeleyDB works (with prefix keys and some other modifications).
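
To illustrate the idea (this is not Titan's actual key format, just a sketch): if each edge is stored under a composite key that sorts first by vertex id and then by the property that defines the edge order, a vertex-centric query becomes a single contiguous range scan over the tree's key space. A TreeMap stands in for the B+ tree here:

    // Illustrative sketch only - not Titan's actual key format. Edges
    // are keyed (vertexId, sortValue, edgeId) so that all edges of a
    // vertex, ordered by sortValue, are contiguous in the key space.
    import java.util.NavigableMap;
    import java.util.TreeMap;

    final class EdgeKey implements Comparable<EdgeKey> {
        final long vertexId, sortValue, edgeId;
        EdgeKey(long v, long s, long e) { vertexId = v; sortValue = s; edgeId = e; }
        public int compareTo(EdgeKey o) {
            int c = Long.compare(vertexId, o.vertexId);
            if (c != 0) return c;
            c = Long.compare(sortValue, o.sortValue);
            return c != 0 ? c : Long.compare(edgeId, o.edgeId);
        }
    }

    class AdjacencyIndex {
        private final NavigableMap<EdgeKey, byte[]> tree = new TreeMap<>();

        // "Vertex v's edges with sort value in [lo, hi)" is one range scan.
        NavigableMap<EdgeKey, byte[]> vertexCentricQuery(long v, long lo, long hi) {
            return tree.subMap(new EdgeKey(v, lo, Long.MIN_VALUE), true,
                               new EdgeKey(v, hi, Long.MIN_VALUE), false);
        }
    }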
 
> It seems to me that we need to come up with a better data structure for graphs. The problem is, of course, partitioning. Galaxy lets you seamlessly transfer a data item from one machine to another (or share a data item among multiple nodes), but it enforces no policy as to when items should be migrated. The B+ tree implementation outlined in the blog post clusters items onto machines based on the relative closeness of their numeric/string keys.

I might be misunderstanding you here, but I thought the whole point of Galaxy was that you don't have to assign each data block to a fixed machine, but that Galaxy would automatically migrate/replicate it to wherever it is written to/read?
 
> For graphs, it would make sense to cluster items (say, graph nodes) based on the graph structure, so that nearby nodes are stored on the same machine. As the graph changes (i.e. as edges and nodes are added or removed), it would make sense to migrate nodes among machines based on some sensible partitioning scheme. So, actually, you can store Titan nodes on Galaxy even without the B+ tree, but without a good partitioning scheme you won't get the major performance boost over DHT solutions (like Cassandra). I would love to hear suggestions for a good partitioning algorithm.

Part of my PhD research focused on effective/efficient graph partitioning for graph databases. See, for instance, the following publication:
This approach (and related ones) tries to place vertices such that clusters in the graph (as identified topologically) are co-located in the persistence infrastructure. The paper above defines how such co-location should work and gives a mathematical proof of the optimal partitioning (which is actually NOT based on the graph topology but on the query access paths).
This is the approach that Titan is taking right now and this has been implemented on top of Cassandra.
Again, I thought the idea behind Galaxy is that I would not have to do that, since Galaxy would migrate data based on access. I find it very interesting to compare vertex placement methods (such as the one above) with dynamic data migration (as in Galaxy, if I understand correctly). The former requires an upfront investment at insertion time and potentially suffers from poor statistics, whereas the latter has a potentially higher read cost since data items must first be located. Also, in the case of the former, we can forward queries to the machine hosting the data (which with a DHT is uniquely and easily identified), which is the approach taken in the paper above, whereas with Galaxy we would be pulling the data to the local machine. Again, it is not clear which approach is more efficient.

Matthias Broecheler

Sep 10, 2012, 2:38:59 PM
to galax...@googlegroups.com
Hey Brendan,

glad to hear that you are investigating graph partitioning/sharding algorithms. This is a very fascinating area of research full of surprises and interesting challenges.

Here are some observations from my own research:
1) If you want to shard graph data, you likely have a lot of it and therefore need a very efficient partitioning algorithm. This disqualifies all of the algorithms in the force-vector optimization group (like the ForceAtlas family, Fruchterman-Reingold, etc.). In the paper I just sent around I use a greedy force-vector approach as well as a greedy modularity maximization algorithm.
2) You need to be careful that you end up with balanced partitions. Graphs have highly skewed statistics (and I cannot emphasize the "highly" enough). Unless you are very careful, this leads to unbalanced partitions, and therefore to hot spots in your database that kill performance.
3) The impact of non-random partitioning is often overestimated, for two reasons: a) the negative impact of imbalance, and b) what matters is query access path locality, not graph locality.

Now, please ignore most of what I said and come up with a completely new way of thinking about the problem - we need it ;-)

Good luck,
Matthias

pron

Sep 11, 2012, 6:54:55 AM
to galax...@googlegroups.com
> I might be misunderstanding you here, but I thought the whole point of Galaxy was that you don't have to assign each data block to a fixed machine, but that Galaxy would automatically migrate/replicate it to wherever it is written to/read?

You are correct. You do not assign a fixed machine, and Galaxy does automatically share/migrate items based on access. However, you do have a choice of whether you want to migrate data to code (simply by accessing an item) or code to data (by sending a message to the owner of the data). This decision is best made by a layer sitting on top of Galaxy (i.e. a data structure implementation) that is tailored to the data and its access patterns.
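
For concreteness, roughly how the two options look in code (the method names follow Galaxy's Store/Messenger API as I remember it and may not be exact - treat this as pseudocode):

    // Rough sketch of "data to code" vs. "code to data" on Galaxy.
    // Method names are from memory and may not match the API exactly.
    import co.paralleluniverse.galaxy.Grid;
    import co.paralleluniverse.galaxy.Messenger;
    import co.paralleluniverse.galaxy.Store;

    class TwoWays {
        void demo(long itemId) throws Exception {
            Grid grid = Grid.getInstance();
            Store store = grid.store();

            // Option 1 - migrate data to code: just access the item;
            // Galaxy shares/migrates it to this node as a side effect.
            byte[] data = store.get(itemId);
            process(data);

            // Option 2 - migrate code to data: send a message to the
            // node that currently owns the item and do the work there.
            Messenger messenger = grid.messenger();
            messenger.sendToOwnerOf(itemId, "process", encodeRequest(itemId));
        }

        void process(byte[] data) { /* application logic */ }
        byte[] encodeRequest(long id) { return new byte[0]; /* placeholder */ }
    }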

> Titan stores the graph in a special type of adjacency list. "Special" in the sense that it introduces an order on the adjacent edges so that it can execute vertex-centric queries very efficiently by translating them into range queries. See http://www.slideshare.net/slidarko/titan-the-rise-of-big-graph-data for more information.

I see. So a B+ tree would, indeed, be better than no data structure at all. But wouldn't dynamically partitioning the graph and deciding where each graph node should be stored (again, dynamically) be better? Maybe using Kernighan-Lin/Fiduccia-Mattheyses on top of Galaxy?
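
For readers who haven't seen it, the heart of Kernighan-Lin is a gain computation: swap a vertex pair across the cut whenever the swap reduces the weight of cut edges. A minimal sketch, with a hypothetical Graph interface:

    // Minimal Kernighan-Lin gain sketch (hypothetical Graph interface).
    // D(v) = weight of v's edges crossing the cut minus weight of v's
    // edges inside its own partition. Swapping a (in A) with b (in B)
    // improves the cut by D(a) + D(b) - 2*w(a,b).
    import java.util.Set;

    interface Graph {
        Iterable<Long> neighbors(long v);
        int weight(long u, long v); // 0 if no edge
    }

    class KLGain {
        static int d(Graph g, long v, Set<Long> ownPartition) {
            int external = 0, internal = 0;
            for (long u : g.neighbors(v)) {
                if (ownPartition.contains(u)) internal += g.weight(v, u);
                else external += g.weight(v, u);
            }
            return external - internal;
        }

        static int swapGain(Graph g, long a, Set<Long> partA, long b, Set<Long> partB) {
            return d(g, a, partA) + d(g, b, partB) - 2 * g.weight(a, b);
        }
    }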

Matthias Broecheler

Sep 12, 2012, 12:33:15 AM
to galax...@googlegroups.com


On Tuesday, September 11, 2012 3:54:55 AM UTC-7, pron wrote:
> > I might be misunderstanding you here, but I thought the whole point of Galaxy was that you don't have to assign each data block to a fixed machine, but that Galaxy would automatically migrate/replicate it to wherever it is written to/read?

> You are correct. You do not assign a fixed machine, and Galaxy does automatically share/migrate items based on access. However, you do have a choice of whether you want to migrate data to code (simply by accessing an item) or code to data (by sending a message to the owner of the data). This decision is best made by a layer sitting on top of Galaxy (i.e. a data structure implementation) that is tailored to the data and its access patterns.


Titan already does the latter: it decomposes a query and sends the corresponding parts over the wire to the machines hosting the parts of the graph needed to answer them. We implemented this on top of standard DHT backends. But what we cannot do with DHT models is migrate data to code (we can request the data, but we would have to re-request it each time). This is where Galaxy could provide an improvement.

 
> > Titan stores the graph in a special type of adjacency list. "Special" in the sense that it introduces an order on the adjacent edges so that it can execute vertex-centric queries very efficiently by translating them into range queries. See http://www.slideshare.net/slidarko/titan-the-rise-of-big-graph-data for more information.

> I see. So a B+ tree would, indeed, be better than no data structure at all. But wouldn't dynamically partitioning the graph and deciding where each graph node should be stored (again, dynamically) be better? Maybe using Kernighan-Lin/Fiduccia-Mattheyses on top of Galaxy?

Where to store a vertex and how to store its adjacency list are two orthogonal questions. For the former we use linear-time greedy algorithms; for the latter we have found locality-preserving data structures like B+ trees to be best.
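
To illustrate what a linear-time greedy placement can look like (a generic heuristic sketch, not necessarily what Titan does): stream over the vertices once and put each vertex on the partition where most of its already-placed neighbors live, subject to a balance cap.

    // Generic one-pass greedy placement sketch - not necessarily
    // Titan's algorithm. Each vertex goes to the partition holding
    // most of its already-placed neighbors, unless that partition is
    // full; assumes at least one partition always has room.
    import java.util.Map;

    class GreedyPlacement {
        static int place(long v, Iterable<Long> neighbors,
                         Map<Long, Integer> placement, // vertexId -> partition
                         int[] counts, int capacity) { // vertices per partition
            int[] score = new int[counts.length];
            for (long u : neighbors) {
                Integer p = placement.get(u);
                if (p != null) score[p]++;
            }
            int best = 0;
            for (int p = 1; p < score.length; p++)
                if (counts[p] < capacity &&
                        (counts[best] >= capacity || score[p] > score[best]))
                    best = p;
            placement.put(v, best);
            counts[best]++;
            return best;
        }
    }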

pron

Sep 12, 2012, 5:58:16 PM
to galax...@googlegroups.com
> This is where Galaxy could provide an improvement.

Yes. Also, you could get a performance boost by running Titan code on Galaxy's nodes and accessing the data directly in RAM (this is how Galaxy is meant to be used). I've read some of Titan's documentation, and I see you access Cassandra/HBase remotely.

> Where to store a vertex and how to store its adjacency list are two orthogonal questions. For the former we use linear-time greedy algorithms; for the latter we have found locality-preserving data structures like B+ trees to be best.

But wouldn't it be optimal to keep a vertex's adjacency list alongside the vertex (i.e. on the same machine)? I don't quite understand how you see the data distributed across Galaxy nodes.

I was pondering something else, too (B+ trees aside). All the graph partitioning algorithms I've seen (and I've only seen a few) require full knowledge of the graph (or at least of the subgraph to be partitioned). But with Galaxy's peer-to-peer architecture, it would be cool if the decisions about which vertices to push to or pull from other machines were made locally on each machine, with only partial knowledge of the subgraphs stored on the other machines. Perhaps a multilevel approach could work, where each machine stores some high-level, "low-resolution" information about the subgraphs stored on other machines. This is essentially what is done with trees as well, with all nodes sharing the root.
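
As a strawman of such a purely local rule (hypothetical names, just to make the idea concrete): each machine could track, for each of its vertices, how many of that vertex's edges point at each remote machine, and push the vertex whenever some remote machine attracts more of its edges than the local machine holds.

    // Strawman sketch of a purely local migration decision. Only local
    // knowledge is needed: where each of the vertex's direct neighbors
    // currently lives.
    import java.util.Map;

    class LocalMigration {
        // edgesByMachine: machineId -> number of this vertex's edges
        // whose other endpoint lives on that machine.
        static int migrationTarget(int localMachine, Map<Integer, Integer> edgesByMachine) {
            int best = localMachine;
            int bestCount = edgesByMachine.getOrDefault(localMachine, 0);
            for (Map.Entry<Integer, Integer> e : edgesByMachine.entrySet()) {
                if (e.getValue() > bestCount) {
                    best = e.getKey();
                    bestCount = e.getValue();
                }
            }
            return best; // == localMachine means "stay put"
        }
    }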

bloudermilk

Sep 12, 2012, 6:04:51 PM
to galax...@googlegroups.com
Ron,

Correct me if I'm wrong, but I believe the "low-resolution" technique you're talking about can be implemented with the Barnes-Hut simulation, where "far-away" vertices are grouped into subgraphs. The way I planned on implementing my partitioning algorithm would require each node to know about only its own individual vertices/edges and the subgraphs of the other nodes.
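
For anyone following along, the relevant piece of Barnes-Hut is the opening criterion: a far-away cell of vertices is treated as a single aggregate body whenever cellSize / distance < theta, so each node only ever needs coarse summaries of remote regions. A sketch, with a hypothetical quadtree node type:

    // Sketch of the Barnes-Hut opening criterion (hypothetical quadtree
    // node type). A whole cell of far-away vertices is treated as one
    // aggregate body when size/distance < theta.
    class QuadNode {
        double mass, comX, comY; // aggregate mass and center of mass
        double size;             // side length of this cell
        QuadNode[] children;     // null for a leaf
    }

    class BarnesHut {
        static final double THETA = 0.5;

        // Accumulate into f[0], f[1] the repulsive force (as used in
        // force-directed layout) that the subtree at node exerts on the
        // body at (x, y).
        static void addForce(QuadNode node, double x, double y, double[] f) {
            double dx = x - node.comX, dy = y - node.comY;
            double dist = Math.sqrt(dx * dx + dy * dy) + 1e-9;
            if (node.children == null || node.size / dist < THETA) {
                double strength = node.mass / (dist * dist); // toy force law
                f[0] += strength * dx / dist;
                f[1] += strength * dy / dist;
            } else {
                for (QuadNode child : node.children)
                    if (child != null) addForce(child, x, y, f);
            }
        }
    }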

pron

Sep 12, 2012, 6:17:17 PM
to galax...@googlegroups.com
Probably. You're essentially transforming the graph partitioning problem into a spatial partitioning problem, which, incidentally, was the initial motivation behind Galaxy. But I'm not a graph expert, and I don't know how well such a spatial technique would perform. I don't think I've seen it offered as a solution for the graph partitioning problem before (then again, I'm not too familiar with the field). So, it might work, but I don't know if anyone has ever analyzed this solution. Matthias mentioned that any force-vector solution may not be efficient enough for large graphs.

bloudermilk

Sep 12, 2012, 6:26:13 PM
to galax...@googlegroups.com
I'm certainly looking forward to trying it out. I've got a test dataset (githubarchive.org) that I've been playing with on other platforms and that I plan to test the partitioning theory with. Working with a few million vertices/edges is definitely a good start, but ideally this would scale to tens of millions of vertices/edges. It seems that, by its nature, the algorithm couldn't scale linearly, though to be honest I'm not sure what the criteria are for being considered linearly scalable. At the very least, in a real-world scenario the graph would grow over time, not just appear out of nowhere, so only recently added/modified vertices would need to be repositioned with any considerable frequency. Once a vertex "settles" it should be fine to process it only once every couple of seconds.