Hello all,
I've been getting up to speed on graph databases as a potential solution at my company.
In my reading, I've learned that Titan scales well both in the size of the graph and in the number of concurrent transactions. I've also learned that Faunus is a way to run graph-wide analytics. I have found some posts indicating that there is a certain graph size beyond which moving to Faunus makes sense.
I've also read up on graph and vertex-centric indexes.
In our use case, we will probably generate a graph starting at about 500 million nodes, and potentially growing to 500 billion nodes or more (hopefully). I am thinking of backing Titan with HBase.
Here are my questions / concerns:
1) Does Titan execute a single query in parallel across all the nodes in its cluster? Suppose that the query being run is something like g.V("indexedProperty", "value").out.count(). In that case, what will happen?
I (ignorantly) expect that Titan would look at an index on "indexedProperty", determine which nodes store the requested vertices, and then issue the query to all of those nodes in parallel. Each node in the Titan cluster would perform the requested traversal and return a partial answer, and the results from each node would be aggregated before the final result was returned. This is how I am used to things like Redshift (a distributed RDBMS) working when I perform an aggregating query.
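To make the execution model I'm imagining concrete, here is a minimal sketch of that scatter-gather pattern. This is purely illustrative (a fake index and fake partitions), not a claim about Titan's actual internals:

```python
# Sketch of the scatter-gather model I'm imagining for
# g.V("indexedProperty", "value").out.count() -- NOT Titan internals.

from concurrent.futures import ThreadPoolExecutor

# Each "node" holds a partition of the graph: vertex -> out-neighbors.
partitions = [
    {"a": ["b", "c"], "b": ["c"]},       # node 0
    {"d": ["e"], "e": ["a", "b", "c"]},  # node 1
]

# Fake index: property value -> (partition id, vertex id) pairs.
index = {"value": [(0, "a"), (1, "e")]}

def partial_out_count(partition_id, vertex):
    """Each node counts out-edges only for the vertices it owns."""
    return len(partitions[partition_id].get(vertex, []))

def parallel_out_count(prop_value):
    hits = index[prop_value]
    # Scatter: issue the traversal to every owning node in parallel.
    with ThreadPoolExecutor() as pool:
        partials = list(pool.map(lambda hit: partial_out_count(*hit), hits))
    # Gather: aggregate the partial answers into the final count.
    return sum(partials)

print(parallel_out_count("value"))  # "a" has 2 out-edges, "e" has 3 -> 5
```

My question is whether Titan's query execution actually resembles this, or whether the traversal is driven from a single coordinating node.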
2) In one of Marko's YouTube videos, he mentions that Faunus cannot take advantage of indexes. But, can Faunus be seeded with a sub-graph that is the result of a query that DOES make use of the indexes? For instance, could g.V("indexedProperty", "value") be run against Titan to provide a data set for Faunus, or would Faunus have to visit every node in the graph to determine which ones to filter out?
3) What is the physical constraint that requires moving to Faunus? Is it memory? I'm trying to understand how the query I write in Gremlin will exhaust (or won't) physical resources on a Titan node. Is it because Titan performs traversals in parallel, and so has to keep all the traversal histories for all paths in memory at the same time? Or maybe it does the traversals in series, but I need to worry about the size of the accumulated result? (I saw a post somewhere that mentioned a concern about a groupCount where the key had very many distinct values).
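Regarding that groupCount concern, here is a small sketch of why I think such an aggregation could exhaust memory even if the traversal itself runs in series: the side-effect map has to hold one counter per distinct key, so its size tracks key cardinality rather than traversal strategy. (Again, this is my mental model, not Titan's implementation.)

```python
# Why a groupCount-style aggregation grows with key cardinality:
# the accumulated map holds one counter per distinct key, so memory
# is proportional to the number of distinct values, regardless of
# whether the traversal stream is processed in parallel or in series.

from collections import Counter

def group_count(keys):
    counts = Counter()
    for k in keys:      # the stream can be consumed one element at a
        counts[k] += 1  # time, but the result map still grows with
    return counts       # the number of DISTINCT keys seen

# Few distinct keys -> tiny map, no matter how long the stream is.
small = group_count(["user", "item", "user"] * 1_000)
assert len(small) == 2

# Many distinct keys -> the map itself becomes the memory problem.
big = group_count(str(i) for i in range(100_000))
assert len(big) == 100_000
```

Is this the kind of accumulated-result growth that pushes people to Faunus, or is it something else entirely (e.g. traversal path histories)?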
The question above is driven by a requirement that we do analytics on a subset of the graph; we will rarely have "ego-centric" queries. I anticipate we would index the handful of ways we would get to starting vertices, but the subject of the index would be things like "type" and "category", not something like a specific user. A query like the example above is likely to produce a fraction of the graph, but a fraction of a large graph is still a large graph.
In addition, we have the requirement that queries not take "too long" (on the order of a few seconds, maybe up to 15). I'm concerned about Faunus' reliance on Hadoop, which tends to have significant spin-up overhead; in my experience, I wouldn't expect a job to return in less than a few minutes, even for small datasets. I am hoping that, with a good understanding of what would break "native" Titan queries, I can avoid those pitfalls, use indexes, and deliver results more quickly than I could with the more scalable Faunus.
Oh, one other detail. Our graph is likely to look like a whole bunch of moderately connected networks, with just a few links between those networks (if any). It's more like we have many tiny graphs than one large one. This gives me hope that we will largely avoid cross-node communication with the right sharding heuristic; my understanding is that Titan allows you to provide a strategy for assigning vertices to nodes.
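For what it's worth, the placement heuristic I have in mind is roughly "keep each connected component on one node". A hypothetical sketch using union-find (this is just the heuristic, not Titan's actual placement-strategy API):

```python
# Hypothetical placement heuristic for a "many tiny graphs" shape:
# co-locate every vertex of a connected component on the same node,
# so most traversals never cross machines. Union-find groups vertices
# by component; components are then spread round-robin over nodes.
# Illustrative only -- not Titan's placement-strategy interface.

def find(parent, v):
    while parent[v] != v:
        parent[v] = parent[parent[v]]  # path halving
        v = parent[v]
    return v

def assign_partitions(vertices, edges, num_nodes):
    parent = {v: v for v in vertices}
    for u, w in edges:                       # union the endpoints
        parent[find(parent, u)] = find(parent, w)
    component_node = {}
    assignment = {}
    for v in vertices:
        root = find(parent, v)
        if root not in component_node:       # first time we see this
            component_node[root] = len(component_node) % num_nodes
        assignment[v] = component_node[root]
    return assignment

# Two small components plus an isolated vertex, spread over 2 nodes.
vertices = ["a", "b", "c", "d", "e"]
edges = [("a", "b"), ("c", "d")]
placement = assign_partitions(vertices, edges, num_nodes=2)
assert placement["a"] == placement["b"]  # same component, same node
assert placement["c"] == placement["d"]
```

Does Titan's partitioning support expressing something like this, or is placement limited to hash-based strategies?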
Thanks in advance!