GraphSON - Titan to Faunus?

David

Dec 12, 2012, 8:00:53 PM
to aureliu...@googlegroups.com
Hi,

I have a toy graph in a Titan database backed by Berkeley DB: 100K vertices and 1.3 million edges.
I used Berkeley DB because the global V operator apparently still works with it in Titan, which
let me use GraphSONWriter.outputGraph(theGraph, out) to write the graph out to a GraphSON
file.

I wanted to load this GraphSON file into Faunus running on a separate machine backed
by a pseudo-distributed Hadoop installation. However, I discovered that Faunus only accepts
what looks like a rather different "GraphSON" format.

Short of attempting a clustered environment as mentioned here:
http://thinkaurelius.com/2012/10/17/deploying-the-aurelius-graph-cluster/
what are the best options for getting the existing Titan graph into a Faunus environment?

Thanks!


Marko Rodriguez

Dec 12, 2012, 8:12:49 PM
to aureliu...@googlegroups.com
Hi,

You can suck (MapReduce) the graph out of Titan directly into Faunus. There is no need to generate the intermediate GraphSON representation.

Simply point Faunus to your Titan instance and run:

g._()

...in Faunus to generate a SequenceFile that is a binary representation of your graph (much more efficient than the GraphSON representation).

There are examples in the post you mention as well as on the Faunus site.
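
In REPL terms, that would look something like the sketch below (the properties file name here is hypothetical, and the graph output format configured in it determines where the SequenceFile lands):

gremlin> g = FaunusFactory.open('titan.properties')
gremlin> g._()    // identity pipeline: every vertex streams through MapReduce and is written back out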

...ah, via BerkeleyDB. Yeah, the GraphSON format for Faunus is an "adjacency format" representation (one vertex per line, with its edges inlined), not an "edgelist format."
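
Concretely, the two look roughly like this (illustrative values; field layout from memory, so check against the Faunus sample files):

// Blueprints GraphSONWriter output: one JSON document for the whole graph
{"vertices":[{"_id":4,"name":"saturn"}], "edges":[{"_id":12,"_label":"father","_outV":4,"_inV":8}]}

// Faunus GraphSON input: one vertex per line, adjacency style
{"name":"saturn","_id":4,"_outE":[{"_label":"father","_id":12,"_inV":8}]}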

You can use the GraphSONUtility class (com.thinkaurelius.faunus.formats.graphson.GraphSONUtility) to help you if you like.

Basically, you could do:

File file = new File('myjson.json')
for (Vertex v : g.getVertices()) {
    // one adjacency-format JSON line per vertex; Groovy's File.append
    // opens and closes the file on each call, so no close() is needed
    file.append(GraphSONUtility.toJSON(v).toString() + '\n')
}
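
Run that in the Gremlin REPL against the BerkeleyDB-backed graph; the resulting myjson.json can then be copied into HDFS (the Faunus Gremlin shell has an hdfs.copyFromLocal helper) and read with GraphSONInputFormat.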

HTH,
Marko.

David Robinson

Dec 14, 2012, 10:27:48 AM
to aureliu...@googlegroups.com
Hi Marko,

Thank you for the help.

The code snippet you pasted produces the following:

Exception in thread "main" java.lang.ClassCastException: com.thinkaurelius.titan.graphdb.vertices.RemovableRelationIterable incompatible with java.util.List
    at com.thinkaurelius.faunus.formats.graphson.GraphSONUtility.toJSON(GraphSONUtility.java:98)
    at com.thinkaurelius.faunus.formats.graphson.GraphSONUtility$toJSON.call(Unknown Source)
    at org.codehaus.groovy.runtime.callsite.CallSiteArray.defaultCall(CallSiteArray.java:42)
    at org.codehaus.groovy.runtime.callsite.AbstractCallSite.call(AbstractCallSite.java:108)
    at org.codehaus.groovy.runtime.callsite.AbstractCallSite.call(AbstractCallSite.java:116)
    at com.xxxx.graphson.output.TitanGraphsonWriter.writeGraphson2(TitanGraphsonWriter.groovy:30)
    at com.xxxx.graphson.converter.JsonToTitanToGraphson.writeGraphToGraphson(JsonToTitanToGraphson.java:146)
    at com.xxxx.graphson.converter.JsonToTitanToGraphson.main(JsonToTitanToGraphson.java:56)

Looking at the code, perhaps this line in GraphSONUtility.java should treat the result as an Iterable rather than casting it to a List?
            List<Edge> edges = (List<Edge>) vertex.getEdges(OUT);
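
i.e., something like (a sketch of the suggested change, not a committed fix):

            // Blueprints only guarantees an Iterable from getEdges(); Titan returns a
            // RemovableRelationIterable, which is not a List, hence the ClassCastException
            Iterable<Edge> edges = vertex.getEdges(OUT);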


I am now going down the road of setting up a clustered configuration with Hadoop/Faunus and HBase/Titan to give that a try,
since my existing Titan/Berkeley DB install doesn't offer an easy path into Faunus.






David

Dec 14, 2012, 3:43:11 PM
to aureliu...@googlegroups.com
Hi Marko,

I put HBase-backed Titan and Hadoop-backed Faunus on the same system, loaded the Graph of the Gods into Titan, then went over to Faunus, tried to run g.V.count(), and received the following exception:

java.lang.NullPointerException
    at com.thinkaurelius.titan.graphdb.database.StandardTitanGraph.loadRelations(StandardTitanGraph.java:293)
    at com.thinkaurelius.faunus.formats.titan.hbase.FaunusTitanHBaseGraph.readFaunusVertex(FaunusTitanHBaseGraph.java:39)
    at com.thinkaurelius.faunus.formats.titan.hbase.TitanHBaseRecordReader.nextKeyValue(TitanHBaseRecordReader.java:38)
    at org.apache.hadoop.mapred.MapTask$NewTrackingRecordReader.nextKeyValue(MapTask.java:532)
    at org.apache.hadoop.mapreduce.MapContext.nextKeyValue(MapContext.java:67)
    at org.apache.hadoop.mapreduce.Mapper.run(Mapper.java:143)
    at org.apache.hadoop.mapred.MapTask.runNewMapper(MapTask.java:764)
    at org.apache.hadoop.mapred.MapTask.run(MapTask.java:370)
    at org.apache.hadoop.mapred.Child$4.run(Child.java:255)
    at java.security.AccessController.doPrivileged(Native Method)
    at javax.security.auth.Subject.doAs(Subject.java:396)
    at org.apache.hadoop.security.UserGroupInformation.doAs(UserGroupInformation.java:1121)
    at org.apache.hadoop.mapred.Child.main(Child.java:249)


Here are a few details:
In Titan:
gremlin> conf = new BaseConfiguration()
==>org.apache.commons.configuration.BaseConfiguration@24d37b87
gremlin> conf.setProperty('storage.backend', 'hbase')
==>null
gremlin> conf.setProperty('storage.tablename', 'gods')
==>null
gremlin> conf.setProperty('storage.hostname', '127.0.0.1')
==>null
gremlin> g=TitanFactory.open(conf)
12/12/14 15:34:17 INFO zookeeper.ZooKeeper: Client environment:zookeeper.version=3.4.3-1240972, built on 02/06/2012 10:48 GMT
.....
==>titangraph[hbase:127.0.0.1]
gremlin> saturn = g.V('name','saturn').next()
==>v[20]
gremlin> saturn.map()
==>name=saturn
==>type=titan

That all looks OK. Now, going over to the Faunus Gremlin shell on the same machine:

titan-hbase.properties looks like this:

faunus.graph.input.format=com.thinkaurelius.faunus.formats.titan.hbase.TitanHBaseInputFormat
hbase.zookeeper.quorum=9.44.132.142
# hbase.zookeeper.property.clientPort=9160
# hbase.mapreduce.inputtable=titan
hbase.mapreduce.inputtable=gods
hbase.mapreduce.scan.cachedrows=1000

# output data (graph or statistic) parameters
faunus.graph.output.format=com.thinkaurelius.faunus.formats.noop.NoOpOutputFormat
faunus.sideeffect.output.format=org.apache.hadoop.mapreduce.lib.output.TextOutputFormat
faunus.output.location=output
faunus.output.location.overwrite=true

and once in the Faunus Gremlin shell I do:

gremlin> g = FaunusFactory.open('titan-hbase.properties')
==>faunusgraph[titanhbaseinputformat]
gremlin> g.V.count()
12/12/14 15:39:16 INFO mapred.JobClient: Running job: job_201212140909_0013
12/12/14 15:39:17 INFO mapred.JobClient:  map 0% reduce 0%
...
and then the exception pasted above appears.

Is Faunus actually "seeing" the graph written by Titan?

Marko Rodriguez

Dec 14, 2012, 4:09:34 PM
to aureliu...@googlegroups.com
Hi,

Weird. I've never seen that before. What version of Titan are you using?

Marko.

David

Dec 17, 2012, 4:43:55 PM
to aureliu...@googlegroups.com
Hi Marko,

I am using titan-0.1.0, but I believe I found the cause of the NullPointerException.

TitanHBaseInputFormat.java defaults its hbase.input.table parameter to "titan".

If "titan" is used as the table name, all is well. However, because my table name is different, I needed to supply hbase.input.table=gods
(in this case) in Faunus's titan-hbase.properties, in addition to the hbase.mapreduce.inputtable=gods parameter...
and then things work.

TitanHBaseInputFormat.java doesn't actually read the hbase.mapreduce.inputtable parameter... so I guess both need to be set to "gods".
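
In other words, the properties file needs both settings:

# TitanHBaseInputFormat reads hbase.input.table (default "titan"), not
# hbase.mapreduce.inputtable, so set both when the table name differs
hbase.input.table=gods
hbase.mapreduce.inputtable=gods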

...and with that working:
If I should start a new thread for this next question, please let me know, but now that I can get the data from Titan/HBase into Faunus/Hadoop,
I wanted to write a simple Java/Groovy program to perform a g.V.count().

This question may just reflect my lack of experience with Hadoop so far, but how do I get the data back out to my Java/Groovy program?
I can see the answer for the gods case - 12 - manually in the Hadoop result files, but
a) I am not sure how my program is supposed to "know" the name of the result file after issuing the FaunusGremlin g.V.count()...
b) I am not sure whether I need to write code to locate, open, and read the results file when using FaunusGraph/Gremlin, or whether there is some other way to do it with Faunus?

Thanks,

David

Dec 17, 2012, 5:03:31 PM
to aureliu...@googlegroups.com
I was also a bit hasty in asking how to read results back into my Groovy program from a Faunus MapReduce job, because I don't even know how to generate the results from a Groovy program
in the first place.

The following code snippet:

    // must load gremlin first before any gremlin will work
    static
    {
        FaunusGremlin.load();
    }
   
    // first method called in the groovy object after it is instantiated
    public void initCountVertices(FaunusGraph g)
    {
        g.V.count();
    }
   
doesn't start a MapReduce job (there don't appear to be any output files in HDFS), and
double vertexCount = g.V.count() throws an exception.

Marko Rodriguez

Dec 18, 2012, 7:45:01 PM
to aureliu...@googlegroups.com
Hi,

> TitanHBaseInputFormat.java defaults its hbase.input.table parameter to "titan".
> If "titan" is used as the table name, all is well. However, because my table name is different, I needed to supply hbase.input.table=gods
> (in this case) in Faunus's titan-hbase.properties, in addition to the hbase.mapreduce.inputtable=gods parameter...
> and then things work.

Ah. If you want to provide a pull request to titan-hbase.properties that demonstrates that with a commented-out line, that would be great. It would probably help others.

> TitanHBaseInputFormat.java doesn't actually read the hbase.mapreduce.inputtable parameter... so I guess both need to be set to "gods".
> ...and with that working...

Again, please provide a pull request.

> If I should start a new thread for this next question, please let me know, but now that I can get the data from Titan/HBase into Faunus/Hadoop,
> I wanted to write a simple Java/Groovy program to perform a g.V.count().
>
> This question may just reflect my lack of experience with Hadoop so far, but how do I get the data back out to my Java/Groovy program?
> I can see the answer for the gods case - 12 - manually in the Hadoop result files, but
> a) I am not sure how my program is supposed to "know" the name of the result file after issuing the FaunusGremlin g.V.count()...
> b) I am not sure whether I need to write code to locate, open, and read the results file when using FaunusGraph/Gremlin, or whether there is some other way to do it with Faunus?

Your result is saved in the sideeffect folder. Odd; I couldn't find that in the documentation. You can read more here:


In short,

gremlin> hdfs.head('output/job-*/sideeffect*')
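
If you want the value programmatically instead of via the REPL, a Groovy sketch along these lines should work (it assumes the Hadoop client libraries are on the classpath and mirrors the glob pattern above):

import org.apache.hadoop.conf.Configuration
import org.apache.hadoop.fs.FileSystem
import org.apache.hadoop.fs.Path

// scan every side-effect file the job produced and print its lines;
// for g.V.count() there is a single line holding the count
FileSystem fs = FileSystem.get(new Configuration())
fs.globStatus(new Path('output/job-*/sideeffect*')).each { status ->
    fs.open(status.getPath()).withReader { reader ->
        reader.eachLine { line -> println line }
    }
}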


HTH,
Marko.



Marko Rodriguez

Dec 18, 2012, 7:57:26 PM
to aureliu...@googlegroups.com
Hi,

Hmmm... I had never thought of doing that. With Hadoop, you need to send your job to the JobTracker. This is what FaunusCompiler does:


So, from a FaunusPipeline, you need to submit that pipeline to the FaunusCompiler, which will package the jar and send it to Hadoop.

This is the way to do it for Faunus 0.1-alpha and the master branch, though it is probably not going to be stable (API-wise):

FaunusGremlin.load()
g.V.count().done().submit()
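
Applied to your earlier snippet, that would look something like this (a sketch; the same API-stability caveat applies):

    static
    {
        // load Gremlin before any Gremlin expressions will work
        FaunusGremlin.load();
    }

    public void initCountVertices(FaunusGraph g)
    {
        // done() closes the pipeline and submit() packages the job jar and
        // sends it to the JobTracker; without submit() no MapReduce job ever
        // starts, which is why no output files appeared in HDFS
        g.V.count().done().submit();
    }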

Again, I haven't thought much about people automating this; I've come from the perspective of "guy in a REPL throwin' down MapReduce fatties and rippin' the _science_." As such, please share your thoughts as you go down that road: make a ticket and say how you would like it to look.

Thanks Dave,
Marko.
