How does Titan execute queries?


Paul Fleetwood

Dec 10, 2013, 12:16:50 PM
to aureliu...@googlegroups.com
Hello all,

I've been getting up to speed on graph databases as a potential solution at my company.  

In my reading, I've learned that Titan scales well both in the size of the graph and in the number of concurrent transactions.  I've also learned that Faunus is a way to run graph-wide analytics, and I have found some posts indicating that there is a certain graph size at which moving to Faunus makes sense.

I've also read up on graph and vertex-centric indexes.

In our use case, we will probably generate a graph starting at about 500 million nodes, and potentially growing to 500 billion nodes or more (hopefully).  I am thinking of backing Titan with HBase.

Here are my questions / concerns:

1) Does Titan execute a single query in parallel across all the nodes in its cluster?  Suppose that the query being run is something like g.V("indexedProperty", "value").out.count().  In that case, what will happen?

I (ignorantly) expect that Titan would look at an index on "indexedProperty", determine which nodes store the requested vertexes, and then issue the query to all of those nodes in parallel.  Each node in the Titan cluster would perform the requested traversal and return a partial answer, and then the results from each node would be aggregated before the final result was returned.  This is how I am used to things like Redshift (a distributed RDBMS) working when I perform an aggregating query.
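The scatter-gather I'm imagining looks something like this toy Python sketch (the node names, per-node data, and the local-count function are all made up for illustration; this is my mental model, not necessarily Titan's actual behavior):

```python
from concurrent.futures import ThreadPoolExecutor

# Hypothetical partitioned index: each "storage node" holds some of the
# vertices matching ("indexedProperty", "value"), here represented only
# by the out-degree of each matching vertex.
node_partitions = {
    "node-1": [2, 5, 1],   # out-degrees of matching vertices on node-1
    "node-2": [3, 0],
    "node-3": [4, 4, 2],
}

def partial_count(node):
    # Stand-in for "run g.V(...).out.count() locally on one storage node".
    return sum(node_partitions[node])

# Fan the query out to every node in parallel...
with ThreadPoolExecutor() as pool:
    partials = list(pool.map(partial_count, node_partitions))

# ...then aggregate the per-node answers into the final result.
total = sum(partials)
print(total)
```

Each "node" computes its partial answer in parallel and the coordinator just sums them, which is the Redshift-style execution model I have in mind.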

2) In one of Marko's YouTube videos, he mentions that Faunus cannot take advantage of indexes.  But, can Faunus be seeded with a sub-graph that is the result of a query that DOES make use of the indexes?  For instance, could g.V("indexedProperty", "value") be run against Titan to provide a data set for Faunus, or would Faunus have to visit every node in the graph to determine which ones to filter out?  

3) What is the physical constraint that requires moving to Faunus?  Is it memory?  I'm trying to understand how the query I write in Gremlin will exhaust (or won't) physical resources on a Titan node.  Is it because Titan performs traversals in parallel, and so has to keep all the traversal histories for all paths in memory at the same time?  Or maybe it does the traversals in series, but I need to worry about the size of the accumulated result?  (I saw a post somewhere that mentioned a concern about a groupCount where the key had very many distinct values).

The question above is driven by a requirement that we do analytics on a subset of the graph; we will rarely have "ego-centric" queries.  I anticipate we would index the handful of ways we would get to starting vertexes, but the subject of the index would be things like "type" and "category", not something like a specific user.  A query like the example above is likely to produce a fraction of the graph, but a fraction of a large graph is still a large graph.

In addition, we have the requirement that the queries not take "too long" (on the order of a few seconds, maybe up to 15).  I'm concerned about Faunus' reliance on Hadoop, which tends to have significant spin-up overhead, and which (in my experience) I wouldn't expect to return in less than a few minutes, even for small datasets.  I am hoping that, with a good understanding of what would break "native" Titan queries, I can avoid those pitfalls, use indexes, and deliver results more quickly than I could with the more scalable Faunus.

Oh, one other detail.  Our graph is likely to look like a whole bunch of moderately connected networks, with just a few links between those networks (if any).  It's more like we have many tiny graphs than one large one.  This gives me hope that we will largely avoid cross-node communication with the right sharding heuristic; my understanding is that Titan allows for the provision of a strategy to assign vertexes to nodes.

Thanks in advance!

Matthias Broecheler

Dec 11, 2013, 4:21:50 PM
to aureliu...@googlegroups.com
Hi Paul,

500 billion vertices is a lot - I hope you have some time to ramp up before you need to get there ;-)

1) Titan, as an OLTP database, does not parallelize queries automatically since that may reduce throughput in highly concurrent systems (which is of no concern for OLAP systems like Redshift). 
However, it does provide a feature called MultiQuery which allows you to group queries together and execute them as one.

This will either batch all the queries into one request or execute them in parallel, depending on what is better for the storage backend in use. So, in that sense, it's a meta feature. This is currently not supported in Gremlin, which is why you have to do it manually for now. We are working on that.
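To illustrate the batching idea (this is a toy Python sketch of the pattern, not Titan's actual MultiQuery API): instead of paying one storage round trip per vertex, you issue a single batched request that answers all of them at once.

```python
# Toy adjacency data and a counter of "backend round trips".
graph = {1: [2, 3], 2: [3], 3: [], 4: [1]}
backend_calls = 0

def fetch_neighbors(vertex):
    # The naive pattern: one storage-backend round trip per vertex.
    global backend_calls
    backend_calls += 1
    return graph[vertex]

def multi_fetch_neighbors(vertices):
    # The batched pattern: one round trip answers every vertex in the
    # group, which is the idea behind grouping queries with MultiQuery.
    global backend_calls
    backend_calls += 1
    return {v: graph[v] for v in vertices}

result = multi_fetch_neighbors([1, 2, 4])
print(backend_calls, result[1])
```

With high-latency distributed backends like HBase, collapsing many small requests into one batch is usually where the win comes from.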

2) General rule of thumb: If your index call is selective (i.e. returns relatively few vertices compared to the size of the graph) you are better off doing an index call first through Titan and then using that as input to Faunus. If not, just use Faunus.
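As a toy sketch of that rule of thumb (the 1% threshold here is an arbitrary illustration I picked, not a Titan constant):

```python
def choose_engine(num_matches, graph_size, threshold=0.01):
    # A selective index call (few matches relative to graph size) is
    # worth running through Titan first to seed Faunus; otherwise just
    # let Faunus scan the whole graph.
    return ("titan-index-then-faunus"
            if num_matches / graph_size < threshold
            else "faunus-full-scan")

print(choose_engine(1_000_000, 500_000_000_000))        # highly selective
print(choose_engine(200_000_000_000, 500_000_000_000))  # not selective
```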

3) It's not about Titan or Faunus. The two work together. Titan is a real-time graph database (like a scalable MySQL for graphs) and Faunus is a compute framework that can extract the graph from Titan and do massive computations on it (like Apache Hive for graphs). So, your graph would live in Titan (updated and queried in real time) and you would use Faunus for elaborate/OLAP-style queries.

4) Yes, Faunus does have the overhead cost of Hadoop. For sub-minute response times on very large graphs you should stick with Titan and build vertex-centric indexes that facilitate your query.

5) You can implement your own IDPlacementStrategy and enable graph partitioning to get "smart" data partitioning. We are working on supporting this better for the 0.5.0 release.

HTH,
Matthias


--
Matthias Broecheler
http://www.matthiasb.com

Paul Fleetwood

Dec 16, 2013, 11:11:34 AM
to aureliu...@googlegroups.com
Thanks for your response, Matthias.  It does help.

I've got just a couple quick follow ups:

1) I see.  Do I have it right, then, that having a distributed backend for Titan (e.g., HBase) allows for the horizontal scaling of storage and concurrent requests, but NOT compute for a single query?  In other words, adding more machines to a Titan cluster does not decrease the response times or increase the allowed complexity of the queries the cluster is able to execute.

A related question: Does Titan automatically take care of data locality?  In other words, how does Titan decide which of its nodes should handle a query?  (My guess would be fairly simple load balancing, with locality not considered.)

4) "For sub-minute response times on very large graphs you should stick with Titan and build vertex centric indexes that facilitate your query."  This is what I was thinking.  But, my problem is that I don't have very ego-centric queries to perform - they are more like OLAP.  Relating back to my comments above, I was hoping that if we spread our graph "thin enough" over enough Titan nodes, then we could execute a query that touched as many vertices as we needed and get a response back quickly.

Thanks again,
Paul

Matthias Broecheler

Dec 17, 2013, 1:43:31 AM
to aureliu...@googlegroups.com
Hi Paul,


> 1) I see.  Do I have it right, then, that having a distributed backend for Titan (e.g., HBase) allows for the horizontal scaling of storage and concurrent requests, but NOT compute for a single query?  In other words, adding more machines to a Titan cluster does not decrease the response times or increase the allowed complexity of the queries the cluster is able to execute.
No, you can also execute more complex queries by parallelizing the request - most graph traversals are trivially parallelizable.

 

> A related question: Does Titan automatically take care of data locality?  In other words, how does Titan decide which of its nodes should handle a query?  (My guess would be a fairly simple load balancing, and that locality isn't considered).

The default is random partitioning; however, you can also configure partitioning strategies in Titan. This isn't documented heavily, but it is used with our customers, since it's a requirement for scaling graphs beyond tens of billions of edges.
 

> 4) "For sub-minute response times on very large graphs you should stick with Titan and build vertex centric indexes that facilitate your query."  This is what I was thinking.  But, my problem is that I don't have very ego-centric queries to perform - they are more like OLAP.  Relating back to my comments above, I was hoping that if we spread our graph "thin enough" over enough Titan nodes, then we could execute a query that touched as many vertices as we needed and get a response back quickly.

It depends. You can distribute the workload across multiple machines, but your response times will certainly depend on the type of query you are running and the size of the graph. If you are trying to run an OLAP query that touches a large part of a 500 billion vertex graph, you are hitting physical limitations that won't allow second-scale response times, since you are touching hundreds of terabytes of data in the worst case.
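A quick back-of-envelope illustrates this (the 400 bytes per vertex and the 100 GB/s aggregate scan rate are rough assumptions for the sake of the arithmetic):

```python
vertices = 500e9
bytes_per_vertex = 400           # assumption: adjacency + properties
total_bytes = vertices * bytes_per_vertex

terabytes = total_bytes / 1e12
print(round(terabytes))          # hundreds of terabytes

# Even at an assumed aggregate scan rate of 100 GB/s across the whole
# cluster, one full pass over that data takes on the order of half an
# hour - nowhere near a few seconds.
seconds = total_bytes / 100e9
print(round(seconds))
```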

Hope that helps,
Matthias

Paul Fleetwood

Dec 17, 2013, 2:48:01 PM
to aureliu...@googlegroups.com
"No, you can also execute more complex queries by parallelizing the request - most graph traversals are trivially parallelizable."

Ok - if I understand, you are saying that I can manually break my query into a set of smaller ones, and then load balance those across the cluster and aggregate the results myself.  Correct?

I think an example would help me.  

Suppose that I am tracking people and have created an index on "country".  If I wanted to start traversals at all vertices with a country of "USA", I would naively write a gremlin query g.V('country', 'USA')..., but that would get executed on a single Titan node.  Is the following a reasonable way to "shard" the query?
  • I co-locate HBase and Titan
  • I create a compound index on 'country' and 'personaId', such that I can lexically sort and use a range index
  • I use a partitioning strategy to store these people vertices by their "personaId", such that personaId [0, 1 million) is on node 1, [1 million, 2 million) is on node 2, etc.
  • I then construct a set of gremlin queries, where each query limits the vertices considered to those on the node receiving the query. For example,
    •  g.query().has('personaId_country', T.gte, '0000000_USA').has('personaId_country', T.lt, '1000000_USA') for the query sent to node 1
    •  g.query().has('personaId_country', T.gte, '1000000_USA').has('personaId_country', T.lt, '2000000_USA') for the query sent to node 2
  • Finally, I take all the returned results and aggregate them myself.
What I would want is for Titan to look up vertices by index and to limit the set to those for which I have data locality.
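To make the fan-out concrete, here's a toy Python sketch of what I mean (the partition contents and the local query function are stand-ins; in reality each local query would be one of the g.query() range calls above, sent to the node that owns that id range):

```python
from concurrent.futures import ThreadPoolExecutor

# Hypothetical per-partition data: personaId -> country, range-partitioned
# by personaId (partition i holds ids in [i * 1_000_000, (i + 1) * 1_000_000)).
partitions = [
    {10: "USA", 20: "UK"},                  # node 1: ids < 1M
    {1_000_005: "USA", 1_500_000: "USA"},   # node 2: ids in [1M, 2M)
]

def local_query(shard, country):
    # Stand-in for the per-node range query run where the data lives.
    return [pid for pid, c in shard.items() if c == country]

# Send one query per partition in parallel...
with ThreadPoolExecutor() as pool:
    results = pool.map(local_query, partitions, ["USA"] * len(partitions))

# ...and aggregate the partial results myself.
matches = sorted(pid for part in results for pid in part)
print(len(matches))
```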

Cheers,
Paul

Matthias Broecheler

Dec 21, 2013, 4:09:57 PM
to aureliu...@googlegroups.com
> Ok - if I understand, you are saying that I can manually break my query into a set of smaller ones, and then load balance those across the cluster and aggregate the results myself.  Correct?
>
> I think an example would help me.
>
> Suppose that I am tracking people and have created an index on "country".  If I wanted to start traversals at all vertices with a country of "USA", I would naively write a gremlin query g.V('country', 'USA')..., but that would get executed on a single Titan node.  Is the following a reasonable way to "shard" the query?
>   • I co-locate HBase and Titan
>   • I create a compound index on 'country' and 'personaId', such that I can lexically sort and use a range index
>   • I use a partitioning strategy to store these people vertices by their "personaId", such that personaId [0, 1 million) is on node 1, [1 million, 2 million) is on node 2, etc.
>   • I then construct a set of gremlin queries, where each query limits the vertices considered to those on the node receiving the query. For example,
>     •  g.query().has('personaId_country', T.gte, '0000000_USA').has('personaId_country', T.lt, '1000000_USA') for the query sent to node 1
>     •  g.query().has('personaId_country', T.gte, '1000000_USA').has('personaId_country', T.lt, '2000000_USA') for the query sent to node 2
>   • Finally, I take all the returned results and aggregate them myself.
> What I would want is for Titan to look up vertices by index and to limit the set to those for which I have data locality.

That's a very interesting use case. We are currently working on a prototype that would do this automatically for you - i.e., shard vertices across all instances in a cluster so that you can essentially do what you are proposing, but with Titan taking care of all of it automatically. Currently, you would have to implement the protocol above manually:
- Create one "USA" vertex per partition and make sure that it only connects to local edges. Then, for your query, you would first look up all the USA vertices (a handful) and then distribute the query out as you wrote.
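As a toy Python sketch of that manual protocol (the seed-vertex names and adjacency data are made up; in Titan the USA seeds would be real vertices found via an index lookup):

```python
# Hypothetical layout: one "USA" seed vertex per partition, connected
# only to the USA person-vertices stored on that same partition.
seed_edges = {
    "USA@partition1": [10, 20],      # local USA vertices on partition 1
    "USA@partition2": [1_000_005],   # local USA vertices on partition 2
}

def lookup_usa_seeds():
    # Stand-in for the index lookup that returns the handful of seeds.
    return list(seed_edges)

# Distribute the traversal out from each seed; because each seed only
# connects to co-located vertices, every hop stays partition-local.
people = [pid for seed in lookup_usa_seeds() for pid in seed_edges[seed]]
print(sorted(people))
```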