Read/write ratio while importing a large graph in Titan


Ümit Akkuş

Nov 23, 2016, 1:42:53 PM
to Aurelius
We're evaluating Titan to see if it will satisfy our needs. As part of that, we're importing a large graph into Titan backed by Google BigTable. We're using Titan 1.0.0. The graph contains 800M+ vertices and 25B+ edges. In terms of schema, I think the only relevant part is that the vertices are keyed by a non-long value in the original store, so we had to define an index on the key property to find vertices efficiently. Vertex and edge properties are pretty minimal schema-wise: 3-5 properties with small values on both vertices and edges.
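
For context, the key index is defined roughly like this (a sketch only; the property and index names "externalId"/"byExternalId" are placeholders, not our real schema):

import com.thinkaurelius.titan.core.PropertyKey;
import com.thinkaurelius.titan.core.TitanGraph;
import com.thinkaurelius.titan.core.schema.TitanManagement;
import org.apache.tinkerpop.gremlin.structure.Vertex;

public class SchemaSetup {
    // A composite index on the external key lets Titan resolve a vertex
    // lookup without a full scan; the names here are placeholders.
    public static void defineKeyIndex(TitanGraph graph) {
        TitanManagement mgmt = graph.openManagement();
        PropertyKey externalId = mgmt.makePropertyKey("externalId")
                .dataType(String.class).make();
        mgmt.buildIndex("byExternalId", Vertex.class)
                .addKey(externalId)
                .unique()              // one vertex per external key
                .buildCompositeIndex();
        mgmt.commit();
    }
}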

We've gone through http://s3.thinkaurelius.com/docs/titan/1.0.0/bulk-loading.html and enabled storage.batch-loading. We haven't played with the ID allocation parameters or buffer sizes. We're using 20 machines to import the graph. 
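
For reference, batch loading is enabled roughly as below (a sketch; the BigTable connection settings are omitted, and the ID allocation and buffer settings are left at their defaults):

import com.thinkaurelius.titan.core.TitanFactory;
import com.thinkaurelius.titan.core.TitanGraph;

public class OpenGraph {
    // storage.batch-loading disables the consistency checks Titan would
    // otherwise perform on every write; backend connection settings for
    // BigTable are left out of this sketch.
    public static TitanGraph open() {
        return TitanFactory.build()
                .set("storage.batch-loading", true)
                .open();
    }
}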

We were able to import the vertices in a relatively short amount of time (less than a day). However, when we started importing edges, we saw a much slower pace (roughly 100 times slower). Some slowdown is expected because for edges we need to find the vertices first, but a slowdown of this magnitude is a deal breaker.
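
To illustrate where the per-edge reads come from, each edge insert looks roughly like this (a sketch with placeholder property and label names; in practice we commit in batches rather than per edge):

import com.thinkaurelius.titan.core.TitanGraph;
import org.apache.tinkerpop.gremlin.process.traversal.dsl.graph.GraphTraversalSource;
import org.apache.tinkerpop.gremlin.structure.Vertex;

public class EdgeImport {
    // Every edge costs two index lookups (reads) to resolve its endpoints
    // before the single edge write, which is part of why the read/write
    // ratio climbs during edge import.
    public static void addEdge(TitanGraph graph, String srcKey, String dstKey) {
        GraphTraversalSource g = graph.traversal();
        Vertex src = g.V().has("externalId", srcKey).next();
        Vertex dst = g.V().has("externalId", dstKey).next();
        src.addEdge("link", dst);
        graph.tx().commit();   // batched commits in the real importer
    }
}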

How can we investigate where the slowdown is occurring? 

We've also observed (per the BigTable metrics in the Google console) that during vertex import we see about 800 write requests/s versus 22K read requests/s. During edge import, however, we see about 300 write requests/s versus 85K read requests/s. It is not clear why the read requests are so high compared to the write requests; during edge import the ratio approaches 300:1. Any idea why this might be happening?

Thanks

PS: We have tried using BulkLoaderVertexProgram, but we couldn't get it to work due to dependency conflicts between the Titan, TinkerPop, and Google BigTable libraries. If anyone has successfully achieved this, please let me know.



HadoopMarc

Nov 23, 2016, 4:03:34 PM
to Aurelius

Hi Ümit,

I have no answer to your questions, but would like to comment anyway:

  - maybe you have some way to cache returned vertex objects, so that you can reuse them for the edge imports (see the sketch below)? Indeed, this will require some pre-arrangement of vertices and edges (graph partitioning), because otherwise you will run out of memory very quickly.

  - dependency problems are very common when you use SparkGraphComputer on a Spark/Yarn cluster (especially if it uses the spark-assembly jar). This has not been solved satisfactorily yet (waiting for Java 9?). Maybe also see the TinkerPop team's Spark performance threads about running on a pure Spark cluster without Yarn:
https://groups.google.com/forum/#!searchin/gremlin-users/performance$20Spark%7Csort:relevance/gremlin-users/j7lDGg5pIo8/Fgyl6wzwBgAJ
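
A minimal sketch of the caching idea from the first point above (assuming vertices are looked up by an indexed property, here called "externalId" as a placeholder):

import java.util.HashMap;
import java.util.Map;
import org.apache.tinkerpop.gremlin.process.traversal.dsl.graph.GraphTraversalSource;
import org.apache.tinkerpop.gremlin.structure.Vertex;

public class VertexCache {
    // Keeps looked-up vertices keyed by their external id so repeated edge
    // endpoints skip the index read; only feasible for one partition's
    // worth of vertices at a time.
    private final Map<String, Vertex> cache = new HashMap<>();
    private final GraphTraversalSource g;

    public VertexCache(GraphTraversalSource g) {
        this.g = g;
    }

    public Vertex get(String externalId) {
        return cache.computeIfAbsent(externalId,
                k -> g.V().has("externalId", k).next());
    }
}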

Cheers,     Marc

On Wednesday, November 23, 2016 at 7:42:53 PM UTC+1, Ümit Akkuş wrote:

Ümit Akkuş

Nov 23, 2016, 7:11:18 PM
to Aurelius
Thanks Marc,

While pre-arranging is an option, we purposely avoided it so that we understand the limits we would face when updating the graph with a continuous stream of changes. What we're seeing is that most of the time, the reads are going to the transaction logs in the Titan database. Does that give a hint?