Using Titan/Hadoop to mutate a large Cassandra-backed graph?

Alex DeBurie

Jun 15, 2016, 5:08:11 PM
to Aurelius

My need is relatively simple: as quickly as possible, add a property to each vertex in a Titan/Cassandra graph. For some vertices, also add an edge that skips over a couple of existing edges (sort of like a grandfather edge). While we can do this with regular Gremlin scripts (threaded to eke out more performance), it is still rather slow (i.e. more than 24 hours for our graph).

 

Titan Hadoop looks promising, but it appears I must either rely on a separate Titan graph connection to do the mutations or run a series of Hadoop jobs...

 

The ETL workflow would, as I understand it, require a minimum of three Hadoop jobs: one to extract from Cassandra to an HDFS store, a second to manipulate the data within HDFS, and a third to load the data into the (possibly the same, but cleared) Cassandra keyspace. Note that I've been unable to change any data during the extract or load phases; hence the need for the second job to do the mutations. Oh, and the examples from the docs don’t seem to work: g.V.as('x').out('father').out('father').linkIn('grandfather','x')
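
To make the intent of that traversal concrete, here is a minimal in-memory sketch in plain Python. It is not Titan or Faunus code: the edge-triple representation and the function name add_grandfather_edges are made up for illustration, and it assumes the new edge should run from the starting vertex x to the vertex reached after two 'father' hops.

```python
from collections import defaultdict


def add_grandfather_edges(edges):
    """Return a copy of `edges` with derived 'grandfather' edges added.

    `edges` is a set of (source, label, target) triples. This is a
    hypothetical in-memory stand-in for the graph, not a Titan API.
    """
    # Index the 'father' edges by source vertex for fast two-hop lookups.
    father = defaultdict(set)
    for src, label, dst in edges:
        if label == "father":
            father[src].add(dst)

    result = set(edges)
    for x, fathers in father.items():
        for f in fathers:
            # .get() avoids inserting new keys while we iterate over `father`.
            for grandfather in father.get(f, ()):
                # Assumption: the shortcut edge points from x to its grandfather.
                result.add((x, "grandfather", grandfather))
    return result


if __name__ == "__main__":
    graph = {
        ("hercules", "father", "jupiter"),
        ("jupiter", "father", "saturn"),
    }
    for edge in sorted(add_grandfather_edges(graph)):
        print(edge)
```

The same two-hop lookup is what the mutation job in the middle of the ETL pipeline would have to perform for every vertex that has a 'father' edge.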

 

Live Update uses Hadoop to quickly read all the vertices from the graph and call a map() method for each. That method could hold a separate, 'regular' connection to Titan to perform mutations. This is very close to our multi-threaded Gremlin scripts today; the key difference is that Hadoop would distribute the execution across multiple machines and multiple threads. One fear is that the pressure on Cassandra would be enormous.
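
One way to keep that pressure bounded is to cap the number of concurrent writers and commit in fixed-size batches. The sketch below shows only the shape of that idea in plain Python: the dict-of-dicts store stands in for the graph, and update_batch, update_all, batch_size and max_workers are hypothetical names, not part of Titan or Hadoop. In a real job each batch would presumably open its own transaction and commit once at the end.

```python
from concurrent.futures import ThreadPoolExecutor


def batches(items, size):
    """Yield consecutive slices of `items`, each with at most `size` elements."""
    for start in range(0, len(items), size):
        yield items[start:start + size]


def update_batch(store, vertex_ids, key, value_fn):
    """Set one property on every vertex in the batch.

    Stand-in for: open a transaction, mutate the vertices, commit once.
    """
    for vertex_id in vertex_ids:
        store[vertex_id][key] = value_fn(vertex_id)
    return len(vertex_ids)


def update_all(store, key, value_fn, batch_size=1000, max_workers=4):
    """Apply the update to every vertex with at most `max_workers` writers."""
    vertex_ids = sorted(store)
    with ThreadPoolExecutor(max_workers=max_workers) as pool:
        # Batches are disjoint, so no two workers touch the same vertex.
        counts = pool.map(
            lambda batch: update_batch(store, batch, key, value_fn),
            batches(vertex_ids, batch_size),
        )
        return sum(counts)


if __name__ == "__main__":
    graph = {vertex_id: {} for vertex_id in range(10)}
    updated = update_all(graph, "score", lambda v: v * 2, batch_size=3, max_workers=2)
    print(updated, graph[7])
```

Tuning max_workers and batch_size is then the knob for trading total run time against load on the storage backend.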

 

Is there a method for updating a Titan/Cassandra graph other than the three above (Gremlin scripts, ETL, or Live Update)?


Thanks,

Alex


HadoopMarc

Jun 16, 2016, 2:59:03 PM
to Aurelius
Hi Alex

I second the question.

For titan-0.5.x people use(d) the HBaseInputFormat or CassandraInputFormat for Faunus, e.g. see

https://groups.google.com/forum/#!search/faunus$20/aureliusgraphs/enmRwi-2Sxs/sSQ9X2OlU3MJ

So, indeed you have separate connections to read from and write to the same graph database.

For titan-1.x the HBaseInputFormat and CassandraInputFormat still exist for use with HadoopGraph, but I have not yet seen working examples on the forum or in the reference docs.

Is this supposed to work?

Cheers,    Marc