Titan/HBase + SparkGraphComputer Working


David

Jan 27, 2016, 6:07:25 PM
to Aurelius
I have a Titan 1.0+, TinkerPop 3.1.0-incubating build working with Hadoop 2.7.1, HBase 1.1.1, and SparkGraphComputer, located here:

https://github.com/graben1437/titan1withtp3.1

Someone else has verified the build in their environment, so it works in at least two places, relatively easily.
Spark bulk loading through SparkGraphComputer works.
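For reference, the bulk load follows the standard TinkerPop pattern - roughly the sketch below (file names are illustrative; adjust to your setup):

readGraph = GraphFactory.open('conf/hadoop-graph/hadoop-load.properties')
blvp = BulkLoaderVertexProgram.build().writeGraph('conf/titan-hbase.properties').create(readGraph)
readGraph.compute(SparkGraphComputer).program(blvp).submit().get()
readGraph.close()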

                 \,,,/
                (o o)
-----oOOo-(3)-oOOo-----
plugin activated: aurelius.titan
plugin activated: tinkerpop.server
plugin activated: tinkerpop.utilities
plugin activated: tinkerpop.hadoop
plugin activated: tinkerpop.spark
plugin activated: tinkerpop.tinkergraph
gremlin> graph=GraphFactory.open('./conf/hadoop-graph/read-hbase-spark.properties')
==>hadoopgraph[hbaseinputformat->gryooutputformat]
gremlin> g=graph.traversal(computer(SparkGraphComputer))
==>graphtraversalsource[hadoopgraph[hbaseinputformat->gryooutputformat], sparkgraphcomputer]
gremlin> g.V().count()
==>808
gremlin>
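For the curious, read-hbase-spark.properties boils down to something like this sketch (hostnames, table name, and Spark settings are placeholders - see the repo linked above for the exact file):

gremlin.graph=org.apache.tinkerpop.gremlin.hadoop.structure.HadoopGraph
gremlin.hadoop.graphInputFormat=com.thinkaurelius.titan.hadoop.formats.hbase.HBaseInputFormat
gremlin.hadoop.graphOutputFormat=org.apache.tinkerpop.gremlin.hadoop.structure.io.gryo.GryoOutputFormat
gremlin.hadoop.jarsInDistributedCache=true
gremlin.hadoop.inputLocation=none
gremlin.hadoop.outputLocation=output
# Titan HBase InputFormat configuration
titanmr.ioformat.conf.storage.backend=hbase
titanmr.ioformat.conf.storage.hostname=your.zookeeper.host
titanmr.ioformat.conf.storage.hbase.table=titan
# Spark
spark.master=local[4]
spark.executor.memory=1g
spark.serializer=org.apache.spark.serializer.KryoSerializer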


The build has base Titan code changes in at least 4 places and a few build changes that are not in the base Titan builds.
I played with "shading" guava more than is healthy and decided the shading route is not the way to go. This build
just says no to more guava shading.

There is a long list of "gotchas" in setting this up.
One key is matching version numbers very carefully.
You must use Spark 1.5.1 with this build because that is what TinkerPop specifies in its hadoop client code.
If you use something else, even 1.5.2, you will see serialization errors.

Make sure HADOOP_GREMLIN_LIBS is set correctly to the lib dir of Titan.
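For example (path is illustrative - point it at your actual Titan lib directory):

export HADOOP_GREMLIN_LIBS=/opt/titan-1.1.0-SNAPSHOT/lib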
Make sure your Spark Classpath is set correctly in spark-env.sh. I pasted my classpath below as an example.

Giraph still has not worked... but I think it is getting close; the mapper runs to 75%....

If you try to alter the build to use 3.1.1-SNAPSHOT of TP...it doesn't work...something changed.

Let me know if you try this build and how it goes for you.


spark-env.sh classpath:

SPARK_CLASSPATH=$HBASECONF:$TITANLIB/jersey-server-1.9.jar:$TITANLIB/titan-core-1.1.0-SNAPSHOT.jar:$TITANLIB/gremlin-console-3.1.0-incubating.jar:$TITANLIB/gremlin-core-3.1.0-incubating.jar:$TITANLIB/gremlin-driver-3.1.0-incubating.jar:$TITANLIB/gremlin-groovy-3.1.0-incubating.jar:$TITANLIB/gremlin-server-3.1.0-incubating.jar:$TITANLIB/hadoop-gremlin-3.1.0-incubating.jar:$TITANLIB/spark-gremlin-3.1.0-incubating.jar:$TITANLIB/gremlin-shaded-3.1.0-incubating.jar:$TITANLIB/javatuples-1.2.jar:$TITANLIB/titan-hbase-1.1.0-SNAPSHOT.jar:$TITANLIB/htrace-core-3.1.0-incubating.jar:$TITANLIB/tinkergraph-gremlin-3.1.0-incubating.jar:$TITANLIB/reflections-0.9.9-RC1.jar:$TITANLIB/hppc-0.7.1.jar:$TITANLIB/high-scale-lib-1.1.2.jar:$TITANLIB/titan-hadoop-1.1.0-SNAPSHOT.jar:$TITANLIB/hbase-server-1.1.1.jar:$MEGADIR/*


Other spark-env.sh settings:

HADOOPCLIENTLIB=/usr/hdp/2.3.0.0-2557/hadoop/client

HBASELIB=/usr/hdp/2.3.0.0-2557/hbase/lib

TITANLIB=/wherever your Titan lib is located

HBASECONF=/usr/hdp/2.3.0.0-2557/hbase/conf
HADOOP_CONF_DIR=/usr/hdp/2.3.0.0-2557/hadoop/conf

MEGADIR=/to be lazy for this test, I copied all the HBase (and... I forget... maybe hbase-server?) jars into one directory. You can do better ;-)

Dylan Bethune-Waddell

Jan 27, 2016, 10:47:13 PM
to Aurelius
Nice David! I have been trying to get 1.1.0-SNAPSHOT/3.1.1-SNAPSHOT to work with my Titan/Cassandra cluster, and I was wondering what kind of result you got when trying to get traversal(computer(SparkGraphComputer)) to work, and what the benefit of the StopwatchTitan class is with respect to getting SparkGraphComputer to work?

Cheers,
Dylan

David

Jan 28, 2016, 9:32:26 AM
to Aurelius
Hi Dylan,


>>>> I was wondering what kind of result you got when trying to get traversal(computer(SparkGraphComputer))

With <tinkerpop.version>3.1.0-incubating</tinkerpop.version>  of course, "everything" works, which is what I posted about.

But changing nothing other than the tinkerpop version to <tinkerpop.version>3.1.1-SNAPSHOT</tinkerpop.version>,
and making slight adjustments to the classpath because the TP jar file names change, causes things to break.

After rebuilding with <tinkerpop.version>3.1.1-SNAPSHOT</tinkerpop.version> and bumping the running Spark up to version 1.5.2,
as required by TP 3.1.1 dependencies, here is what I see:

gremlin> g.V().count()
07:51:04 WARN  org.apache.spark.metrics.MetricsSystem  - Using default name DAGScheduler for source because spark.app.id is not set.
java.util.NoSuchElementException: No value present
Display stack trace? [yN] y
java.lang.IllegalStateException: java.util.NoSuchElementException: No value present
    at org.apache.tinkerpop.gremlin.process.computer.traversal.step.map.ComputerResultStep.processNextStart(ComputerResultStep.java:82)
    at org.apache.tinkerpop.gremlin.process.traversal.step.util.AbstractStep.hasNext(AbstractStep.java:140)
    at org.apache.tinkerpop.gremlin.process.traversal.util.DefaultTraversal.hasNext(DefaultTraversal.java:147)
    at org.codehaus.groovy.vmplugin.v7.IndyInterface.selectMethod(IndyInterface.java:218)
    at org.apache.tinkerpop.gremlin.console.Console$_closure3.doCall(Console.groovy:205)
    at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method)
    at sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:62)
    ......

I have not spent time debugging this.  I am hoping it is a simple setup mistake on my part, but
honestly, I get a sinking feeling TP 3.1.1 broke something.

>>>> StopwatchTitan

I will stick to the very (very) short version.  I moved:

titan-hbase-parent/titan-hbase-core/src/main/java/com/google/common/base/Stopwatch.java

from where Dan L previously had it in the build, to here:

titan-core/src/main/java/com/google/common/base/StopwatchTitan.java

and renamed the file so there is absolutely no question that it is not coming from guava.
The renaming is no big deal.  Moving it is huge.  I am tired of Stopwatch dependency errors  ;-)

>>>> Titan/Cassandra cluster
I plan to also test with Cassandra, just have not gotten there yet.

This particular build upgraded Thrift to a later version and has code changes in place
to reflect that.

Dylan Bethune-Waddell

Jan 28, 2016, 1:35:57 PM
to Aurelius
Hi David,

>>>> java.util.NoSuchElementException: No value present

I get the same error - I am not sure why BulkLoaderVertexProgram is "working" for me with 3.1.1 over Spark 1.5.2 while TraversalVertexProgram isn't, but I haven't tried debugging TraversalVertexProgram either, as I am still dealing with the BulkLoader, which has been a stickler in these ways:
1) Doesn't like "g.V().hasId(<SomeLong>).next()" or "g.V(<SomeLong>).next()", only graph.vertices(id).next(), when calling getVertexById(ID, graph, g) - see the sketch after this list.
2) Doesn't like it when you don't mutate the graph in the vertex loading phase, and doesn't seem to go ahead with adding edges in that case.
3) Sometimes throws a FastNoSuchElementException in the edge loading phase due to a null TitanID in the (bulkLoader.vertex.id, TitanID) message tuples passed to the outVs of all inEs during the vertex loading phase.
4) In the vertex loading phase, the sourceVertex Spark has in memory is "getOrCreate"-ed against Titan and then has its "bulkLoader.vertex.id" replaced with the Titan long ID (in memory) - this sometimes causes a call to "outVId = sourceVertex.property(BULK_LOADER_DEFAULT_ID)" in the edge loading phase to throw an exception saying there is no value associated with that property or the property doesn't exist on the vertex.
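To illustrate 1), here is a rough sketch of the lookup that behaves versus the ones that don't (the helper name is mine, not BulkLoaderVertexProgram's actual code):

// works: a direct Graph.vertices(id) lookup
def getVertexByIdWorkaround(graph, id) {
    def iter = graph.vertices(id)
    return iter.hasNext() ? iter.next() : null
}
// by contrast, g.V(id).next() and g.V().hasId(id).next() blow up inside the vertex program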

I have been running these bulk loading jobs on anywhere from 1-20 nodes at a time, and I still can't figure out if this occurs when I set the number of workers too high or if it is related to the input data, Titan ID allocation, or SparkGraphComputer-land. I also haven't tried it on a single node Titan/Cassandra cluster or Titan/BerkeleyDB yet to see if this stuff only happens under "eventually consistent" conditions. I have been meaning to reproduce this behaviour on test data at DEBUG log levels and cover some of those things so that I can submit meaningful issues, but I thought I would just give you the bullet points in case something jumps out at you and you have an interest in getting 3.1.1-SNAPSHOT going.

By putting some null checks and property existence checks in place in BulkLoaderVertexProgram, particularly in the edge loading phase, I have managed to more or less sidestep all those issues, although BulkLoading is blowing the collective ES heap and freezing all the nodes in some cases.
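The guards are along these lines (a sketch of the kind of check, not my exact patch; the property name comes from BulkLoaderVertexProgram, the logger is illustrative):

// edge loading phase: bail out instead of throwing when the bulk loader id is missing
def idProp = sourceVertex.property('bulkLoader.vertex.id')
if (!idProp.isPresent() || idProp.value() == null) {
    log.warn("skipping sourceVertex with no bulkLoader.vertex.id: {}", sourceVertex)
    return
}
def outVId = idProp.value()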

>>>> I am tired of Stopwatch dependency errors  ;-)

Now if I see these, I'll know what to do - thanks!

>>>> This particular build upgraded Thrift to a later version and has code changes in place to reflect that.

That's good to know, I didn't realize that at first glance - I may give that a shot this afternoon and see if it helps my bulk loading issues.

Cheers,
Dylan

Dylan Bethune-Waddell

Feb 3, 2016, 4:27:16 AM
to Aurelius
Hi David,

I managed to get SparkGraphComputer from 3.1.1/3.2.0-SNAPSHOT working with CassandraInputFormat, with the hack I described here:


It seems like it's just the fact that gremlin.hadoop.inputLocation=none throws InputFormatRDD off, which expects a file path and not some random text. Hopefully this works for HBaseInputFormat too.
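In other words, the guard amounts to only treating gremlin.hadoop.inputLocation as a file path when the configured InputFormat actually reads files. Paraphrasing the shape of it (names are illustrative, not the verbatim patch):

// inside InputFormatRDD: only hand the location to Hadoop for file-based formats
if (FileInputFormat.class.isAssignableFrom(inputFormatClass)) {
    FileInputFormat.setInputPaths(job, new Path(inputLocation))
}
// HBase/Cassandra input formats skip the path entirely, so 'none' no longer trips them up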

Cheers,
Dylan


David

Feb 3, 2016, 12:00:44 PM
to Aurelius
Hey Dylan,

Nice work! Let me look at what you did and see if I can reproduce it.

In the meantime, I am getting ready to push final changes to the build I linked earlier for GiraphGraphComputer, which
is now working in my build, at least for basic stuff.

Drawing up a "support matrix", together we now have:

HBase     -> SparkGraphComputer    Yes
HBase     -> GiraphGraphComputer   Yes
Cassandra -> SparkGraphComputer    Yes
Cassandra -> GiraphGraphComputer   ??

I haven't jumped to 3.1.1-SNAPSHOT (soon to be incubating) or 3.2.0 yet.
Will look at your fix and see if that helps.
If so, I will branch my build to 3.2.0 and hopefully everything just keeps working.
Lots more testing (including VertexComputing) and test case fixing to do....

Note that during Giraph testing a bug in TinkerPop (still there in 3.2.0) was uncovered...no Jira
or fix yet...but that is coming.  You can work around the problem by setting things in the properties
file in Titan.

David

Feb 3, 2016, 1:19:29 PM
to Aurelius
Dylan,

Confirmed your suggested TP fix in InputFormatRDD also works for me using 3.2.0 (a.k.a. 3.1.1-incubating when voted).

Marko Rodriguez

Feb 3, 2016, 2:57:32 PM
to aureliu...@googlegroups.com
Hi David,

If there is a glaring bug in TinkerPop 3.1.1, note that we are still in "code freeze" and haven't put it up for VOTE. If you have a test case and a solution, please provide it and we can update the source.

However, if the bug has always been there (since TinkerPop 3.1.0), then we can just push the fix to 3.1.2 as it seems you all have an easy workaround.

Thoughts?
Marko.

David

Feb 3, 2016, 3:34:13 PM
to Aurelius
Hi Marko,

Looks like there are two TinkerPop issues:

#1:
Dylan's fix mentioned here makes 3.2.0 work:

https://issues.apache.org/jira/browse/TINKERPOP-1117

This problem, or at least the symptoms, didn't exist in 3.1.0.

#2:
hadoop-gremlin/src/main/java/org/apache/tinkerpop/gremlin/hadoop/structure/io/gryo/GryoRecordReader.java

- this.splitLength = split.getLength() - (seekToHeader(this.inputStream, start) - start);
+ this.splitLength = split.getLength();
+ if (this.splitLength > 0) this.splitLength -= (seekToHeader(this.inputStream, start) - start);


The second item was just found and the fix verified with some basic testing.
If we could get both of these in 3.1.1, that would be helpful...otherwise Titan 1.1 + 3.1.1 won't work.

Marko Rodriguez

Feb 3, 2016, 3:36:56 PM
to aureliu...@googlegroups.com
Hi,

#2:
hadoop-gremlin/src/main/java/org/apache/tinkerpop/gremlin/hadoop/structure/io/gryo/GryoRecordReader.java

- this.splitLength = split.getLength() - (seekToHeader(this.inputStream, start) - start);
+ this.splitLength = split.getLength();
+ if (this.splitLength > 0) this.splitLength -= (seekToHeader(this.inputStream, start) - start);

This is odd. We haven't touched GryoRecordReader for some time. I have no test cases that show any issue either. What is the specific problem? Is it only with Titan, and if so, was this a problem in Titan 1.0/TinkerPop 3.0?

Thanks,
Marko.


Marko Rodriguez

Feb 3, 2016, 4:27:46 PM
to aureliu...@googlegroups.com
Hi David,

Also note that GryoRecordReader is used by GryoInputFormat and thus is not related to Titan in any way. Perhaps your Gryo file is corrupted and that is why you are getting the error? Please try with another Gryo file to be certain.

Marko.

Dylan Bethune-Waddell

Feb 3, 2016, 7:52:50 PM
to Aurelius
Hi David and Marko,

Glad we have SparkGraphComputer going for Titan Cassandra/HBase, especially given the recent flurry of optimizations. David, I will see if I can complete that support matrix for GiraphGraphComputer :)

>>>> #2: hadoop-gremlin/src/main/java/org/apache/tinkerpop/gremlin/hadoop/structure/io/gryo/GryoRecordReader.java

I am also curious to know what specifically this change solved - I made this edit but haven't noticed a difference while manually messing around or bulk loading with SparkGraphComputer just yet.

>>>> If there is a glaring bug in TinkerPop 3.1.1, note that we are still in "code freeze" and haven't put it up for VOTE. If you have a test case and a solution, please provide it and we can update the source.

I am having some issues with BulkLoaderVertexProgram that the workaround I had going in 3.1.0-incubating no longer solves straight up. I will try to get solid enough on what is going on to file a ticket with a test case and at least the workaround, but the gist is this: in the edge loading phase, (DUMMY_ID, TITAN_ID) messages are sometimes coming in with null TITAN_IDs, and that has been the case since 3.1.0. Now, in 3.1.1, errors were being thrown about sourceVertex(s) in the edge loading phase missing a bulkLoader.vertex.id property. I started checking for that, ignoring those vertices, and logging a warning, and it looks like a whole extra iteration through every source/StarVertex is taking place in which none of them have the bulkLoader.vertex.id property - and this happens after all the edges were loaded into the graph properly in the first place. The log statements I added to the vertex loading phase have never seen a null or negative ID being put into the message tuple (I read something in the code about negative IDs being temporary ID allocations from Titan, so I decided to check for those too). It also seems limited to datasets that have edges between different vertices - a few datasets I'm loading have exclusively self-edges, and those don't seem to trigger any warnings about a missing bulkLoader.vertex.id on the vertices in this "ghost iteration".

Just thought I would try providing a general explanation in case something makes more sense to you than it does to me.

Thanks,
Dylan

Daniel Kuppitz

Feb 3, 2016, 9:43:56 PM
to aureliu...@googlegroups.com
Hi Dylan,

Do you provide in- and out-edges for all vertices? Is there a chance that you have some in-vertices that never appear on the out-vertex side?
These are common issues that lead to the described error (id properties not found).

Cheers,
Daniel


David

Feb 3, 2016, 10:36:00 PM
to Aurelius
I just pushed final changes and the build that I linked works with Cassandra as well as HBase to "complete" the matrix
for Graph Computers.

I am discussing #2 with Marko and owe him more information, which I will provide tomorrow in detail, but the teaser is that #2 fixes this potential exception in Giraph jobs:

2016-02-03 16:08:46,220 WARN [main] org.apache.hadoop.mapred.YarnChild: Exception running child : java.io.EOFException
    at java.io.DataInputStream.readByte(DataInputStream.java:267)
    at org.apache.tinkerpop.gremlin.hadoop.structure.io.gryo.GryoRecordReader.seekToHeader(GryoRecordReader.java:82)
    at org.apache.tinkerpop.gremlin.hadoop.structure.io.gryo.GryoRecordReader.initialize(GryoRecordReader.java:74)
    at org.apache.tinkerpop.gremlin.hadoop.structure.io.gryo.GryoInputFormat.createRecordReader(GryoInputFormat.java:39)
    at org.apache.hadoop.mapred.MapTask$NewTrackingRecordReader.<init>(MapTask.java:515)
    at org.apache.hadoop.mapred.MapTask.runNewMapper(MapTask.java:758)
    at org.apache.hadoop.mapred.MapTask.run(MapTask.java:341)
    at org.apache.hadoop.mapred.YarnChild$2.run(YarnChild.java:164)
    at java.security.AccessController.doPrivileged(Native Method)
    at javax.security.auth.Subject.doAs(Subject.java:422)
    at org.apache.hadoop.security.UserGroupInformation.doAs(UserGroupInformation.java:1657)
    at org.apache.hadoop.mapred.YarnChild.main(YarnChild.java:158)

Dylan Bethune-Waddell

Feb 4, 2016, 12:06:04 AM
to Aurelius
Hi Daniel,

Thanks for the response. The data does go through a few hoops before ending up in adjacency list format with both in- and out-edges, so I will double check.

David - nice, I have never even tried running Giraph so thanks for putting your build out there.

Cheers,
Dylan

David

Feb 4, 2016, 9:09:56 AM
to Aurelius
Hi Marko,

Here are the details for this fix:
#2:
hadoop-gremlin/src/main/java/org/apache/tinkerpop/gremlin/hadoop/structure/io/gryo/GryoRecordReader.java

- this.splitLength = split.getLength() - (seekToHeader(this.inputStream, start) - start);
+ this.splitLength = split.getLength();
+ if (this.splitLength > 0) this.splitLength -= (seekToHeader(this.inputStream, start) - start);


Giraph supports configuring the number of workers:

giraph.minWorkers=2


For a smaller Titan graph, the way splits are occurring is that 1 worker gets all of the data
and the other workers get none.  This can be seen by listing HDFS after the first Giraph
map job runs and before the second one starts.  Note the 0 file length in the second
part file:

$ hdfs dfs -ls -R output
drwxr-xr-x - graphie hdfs    0 2016-02-03 22:07 output/~g
-rw-r--r-- 3 graphie hdfs    0 2016-02-03 22:07 output/~g/_SUCCESS
-rw-r--r-- 3 graphie hdfs 1348 2016-02-03 22:07 output/~g/part-m-00001
-rw-r--r-- 3 graphie hdfs    0 2016-02-03 22:07 output/~g/part-m-00002
drwxr-xr-x - graphie hdfs    0 2016-02-03 22:08 output/~reducing
drwxr-xr-x - graphie hdfs    0 2016-02-03 22:08 output/~reducing/_temporary
drwxr-xr-x - graphie hdfs    0 2016-02-03 22:08 output/~reducing/_temporary/1


When the second stage of mapping (reducing) starts in Giraph, the same number of configured
workers starts - and the exception I pasted earlier occurs in TinkerPop because of the 0-length
"split" it attempts to read.  This exception stops (fails) the Giraph job.

The first part of that exception again in GryoRecordReader:

2016-02-03 16:08:46,220 WARN [main] org.apache.hadoop.mapred.YarnChild: Exception running child : java.io.EOFException
    at java.io.DataInputStream.readByte(DataInputStream.java:267)
    at org.apache.tinkerpop.gremlin.hadoop.structure.io.gryo.GryoRecordReader.seekToHeader(GryoRecordReader.java:82)


When the edge-case check for a 0-length file is added to Gryo, the second Giraph worker exits without throwing an exception,
and the overall Giraph job runs successfully to completion.

There are other things in play here too, but the TP fix makes things more robust.
Plurad and I are discussing how to turn this into an automated test case for TP...not sure right now.
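Something along these lines might be a starting point (a rough sketch only - setup is illustrative, and a real test would live in hadoop-gremlin next to GryoRecordReader):

import org.apache.hadoop.conf.Configuration
import org.apache.hadoop.fs.Path
import org.apache.hadoop.mapreduce.TaskAttemptID
import org.apache.hadoop.mapreduce.lib.input.FileSplit
import org.apache.hadoop.mapreduce.task.TaskAttemptContextImpl
import org.apache.tinkerpop.gremlin.hadoop.structure.io.gryo.GryoRecordReader

def emptyFile = File.createTempFile('empty', '.kryo')   // a 0-byte Gryo "part" file
def split = new FileSplit(new Path(emptyFile.absolutePath), 0L, 0L, null)
def reader = new GryoRecordReader()
reader.initialize(split, new TaskAttemptContextImpl(new Configuration(), new TaskAttemptID()))
assert !reader.nextKeyValue()   // should report no records instead of throwing EOFException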

Marko Rodriguez

Feb 4, 2016, 9:33:35 AM
to aureliu...@googlegroups.com
Hello David,

I understand. It comes about in the corner case of a 0-byte split.

I made the change you suggested, tested it against our Gryo test suite, and pushed to tp31/.

Thanks for the find,
Marko.

David

Feb 4, 2016, 9:39:09 AM
to Aurelius
I should also mention how "Gryo" comes into the picture.
The config file contains this:

gremlin.hadoop.graphOutputFormat=org.apache.tinkerpop.gremlin.hadoop.structure.io.gryo.GryoOutputFormat

The "part" files written to hdfs are in Gryo format in between the map/reduce phases and the GryoRecordReader
is needed to read the part files in the second stage.

David

Feb 4, 2016, 9:40:10 AM
to Aurelius
Marko,

Thank you very much...as always.

David

Feb 4, 2016, 2:06:02 PM
to Aurelius
Once SparkGraphComputer works, it appears to be a breeze to drop in the canned Vertex Programs to run across Titan.

Cool stuff Marko.


Page Rank:

gremlin> graph=TitanFactory.open('./conf/titan-hbase.properties')
hbaseVersion is: 1.1.1
==>standardtitangraph[hbase:[10.114.223.88:2181]]
gremlin> pagerank = PageRankVertexProgram.build().create()
==>PageRankVertexProgram[alpha=0.85,iterations=30]
gremlin>  result = graph.compute().program(pagerank).submit().get()
==>result[standardtitantx[0x510ebf20],memory[size:0]]
gremlin> g = result.graph().traversal(standard())
==>graphtraversalsource[standardtitantx[0x510ebf20], standard]
gremlin> g.V().valueMap('name',PageRankVertexProgram.PAGE_RANK)
12:47:46 WARN  com.thinkaurelius.titan.graphdb.transaction.StandardTitanTx  - Query requires iterating over all vertices [()]. For better performance, use indexes
==>[gremlin.pageRankVertexProgram.pageRank:[0.23778575206741645], name:[MOJO]]
==>[gremlin.pageRankVertexProgram.pageRank:[0.44304809560502834], name:[FUNICULI FUNICULA]]
==>[gremlin.pageRankVertexProgram.pageRank:[0.15000000000000002], name:[BALLAD OF FRANKIE LEE AND JUDAS PRIEST]]
==>[gremlin.pageRankVertexProgram.pageRank:[0.8957592326860998], name:[BROKEDOWN PALACE]]
==>[gremlin.pageRankVertexProgram.pageRank:[0.15000000000000002], name:[C.C.RIDER]]
==>[gremlin.pageRankVertexProgram.pageRank:[0.185511044285037], name:[BLACKBIRD]]
==>[gremlin.pageRankVertexProgram.pageRank:[0.20619851262546196], name:[BAD MOON RISING]]
==>[gremlin.pageRankVertexProgram.pageRank:[0.21375000000000002], name:[Peter_Krug]]
==>[gremlin.pageRankVertexProgram.pageRank:[0.6126282801267651], name:[BIG BOSS MAN]]
==>[gremlin.pageRankVertexProgram.pageRank:[0.8514128915531739], name:[JAM]]
==>[gremlin.pageRankVertexProgram.pageRank:[0.2195579121656338], name:[CHIMES OF FREEDOM]]
==>[gremlin.pageRankVertexProgram.pageRank:[0.950180703456302], name:[LAZY LIGHTNING]]
==>[gremlin.pageRankVertexProgram.pageRank:[0.1894148972396984], name:[SAGE AND SPIRIT]]
==>[gremlin.pageRankVertexProgram.pageRank:[0.36908654360123677], name:[Garcia_Lesh]]
==>[gremlin.pageRankVertexProgram.pageRank:[0.3398798396572379], name:[ARE YOU LONELY FOR ME]]
==>[gremlin.pageRankVertexProgram.pageRank:[0.1698007877242691], name:[Hardin_Petty]]
==>[gremlin.pageRankVertexProgram.pageRank:[0.6096803885091338], name:[CHINA DOLL]]
==>[gremlin.pageRankVertexProgram.pageRank:[0.20015092659776262], name:[TANGLED UP IN BLUE]]
.......

Peer Pressure:

gremlin> result = graph.compute().program(PeerPressureVertexProgram.build().create()).
gremlin>                     mapReduce(ClusterPopulationMapReduce.build().create()).
gremlin>                     mapReduce(ClusterCountMapReduce.build().create()).submit().get()
12:58:44 WARN  com.thinkaurelius.titan.graphdb.olap.computer.FulgoraGraphComputer  - Property key [gremlin.peerPressureVertexProgram.voteStrength] is not part of the schema and will be created. It is advised to initialize all keys.
12:58:45 WARN  com.thinkaurelius.titan.graphdb.olap.computer.FulgoraGraphComputer  - Property key [gremlin.peerPressureVertexProgram.cluster] is not part of the schema and will be created. It is advised to initialize all keys.
==>result[standardtitantx[0x35ef803c],memory[size:3]]
gremlin> result.memory().clusterPopulation
==>20504=1
==>12312=1
==>81944=1
==>106520=2
==>114712=2
==>127000=1
==>102424=9
==>139288=1
==>143384=1
==>151576=2
==>159768=2
==>163864=2
==>167960=1
==>237592=1
==>217112=1
...
gremlin> result.memory().clusterCount
==>303

Marko Rodriguez

Feb 4, 2016, 2:29:04 PM
to aureliu...@googlegroups.com
Hi David,

What you did there would make a great blog post.

Note that in TinkerPop 3.2.x, all the TinkerPop provided VertexPrograms will be accessible via GraphTraversal. For instance, we will be able to do:

g.V().hasLabel('person').
  pageRank(0.85).by(out('knows')).
    order().by('pageRank').limit(10)

The tickets for this feature are here if you are interested in participating:


Take care,
Marko.

David

Feb 4, 2016, 2:55:59 PM
to Aurelius
Example running the PageRankVertexProgram across a Titan graph with Giraph:

gremlin> graph=GraphFactory.open('./conf/hadoop/read-hbase-giraph.properties')
==>hadoopgraph[hbaseinputformat->gryooutputformat]

gremlin> pagerank = PageRankVertexProgram.build().create()
==>PageRankVertexProgram[alpha=0.85,iterations=30]
gremlin> result = graph.compute(GiraphGraphComputer).program(pagerank).submit().get()
13:43:27 INFO  org.apache.hadoop.mapreduce.Job  - The url to track the job: http://xxx.xx.cloudy.yepper.com:8088/proxy/application_1452543424291_0022/
13:43:55 INFO  org.apache.hadoop.mapreduce.Job  - Running job: job_1452543424291_0022
13:43:56 INFO  org.apache.hadoop.mapreduce.Job  - Job job_1452543424291_0022 running in uber mode : false
13:43:56 INFO  org.apache.hadoop.mapreduce.Job  -  map 100% reduce 0%
13:45:28 INFO  org.apache.hadoop.mapreduce.Job  - Job job_1452543424291_0022 completed successfully
13:45:28 INFO  org.apache.hadoop.mapreduce.Job  - Counters: 80
    File System Counters
        FILE: Number of bytes read=0
        FILE: Number of bytes written=574416
        FILE: Number of read operations=0
        FILE: Number of large read operations=0
        FILE: Number of write operations=0
        HDFS: Number of bytes read=88
        HDFS: Number of bytes written=444288
        HDFS: Number of read operations=6
        HDFS: Number of large read operations=0
        HDFS: Number of write operations=4
    Job Counters
        Launched map tasks=2
        Other local map tasks=2
        Total time spent by all maps in occupied slots (ms)=205541
        Total time spent by all reduces in occupied slots (ms)=0
        Total time spent by all map tasks (ms)=205541
        Total vcore-seconds taken by all map tasks=205541
        Total megabyte-seconds taken by all map tasks=841895936
    Map-Reduce Framework
        Map input records=2
        Map output records=0
        Input split bytes=88
        Spilled Records=0
        Failed Shuffles=0
        Merged Map outputs=0
        GC time elapsed (ms)=1633
        CPU time spent (ms)=62870
        Physical memory (bytes) snapshot=1295695872
        Virtual memory (bytes) snapshot=11062444032
        Total committed heap usage (bytes)=1112539136
    Giraph Stats
        Aggregate edges=0
        Aggregate finished vertices=0
        Aggregate sent message message bytes=15128724
        Aggregate sent messages=249426
        Aggregate vertices=808
        Current master task partition=0
        Current workers=1
        Last checkpointed superstep=0
        Sent message bytes=489315
        Sent messages=8046
        Superstep=31
    Giraph Timers
        Initialize (ms)=1369
        Input superstep (ms)=20684
        Setup (ms)=2034
        Shutdown (ms)=9328
        Superstep 0 GiraphComputation (ms)=1407
        Superstep 1 GiraphComputation (ms)=2023
        Superstep 10 GiraphComputation (ms)=2002
        Superstep 11 GiraphComputation (ms)=1998
        Superstep 12 GiraphComputation (ms)=2005
        Superstep 13 GiraphComputation (ms)=2001
        Superstep 14 GiraphComputation (ms)=2001
        Superstep 15 GiraphComputation (ms)=2001
        Superstep 16 GiraphComputation (ms)=2004
        Superstep 17 GiraphComputation (ms)=2001
        Superstep 18 GiraphComputation (ms)=2002
        Superstep 19 GiraphComputation (ms)=2003
        Superstep 2 GiraphComputation (ms)=2006
        Superstep 20 GiraphComputation (ms)=2003
        Superstep 21 GiraphComputation (ms)=2017
        Superstep 22 GiraphComputation (ms)=1988
        Superstep 23 GiraphComputation (ms)=2006
        Superstep 24 GiraphComputation (ms)=2037
        Superstep 25 GiraphComputation (ms)=2002
        Superstep 26 GiraphComputation (ms)=2000
        Superstep 27 GiraphComputation (ms)=2003
        Superstep 28 GiraphComputation (ms)=2003
        Superstep 29 GiraphComputation (ms)=2002
        Superstep 3 GiraphComputation (ms)=2001
        Superstep 30 GiraphComputation (ms)=2322
        Superstep 4 GiraphComputation (ms)=2004
        Superstep 5 GiraphComputation (ms)=2000
        Superstep 6 GiraphComputation (ms)=2002
        Superstep 7 GiraphComputation (ms)=2002
        Superstep 8 GiraphComputation (ms)=2002
        Superstep 9 GiraphComputation (ms)=2006
        Total (ms)=93911
    Zookeeper base path
        /_hadoopBsp/job_1452543424291_0022=0
    Zookeeper halt node
        /_hadoopBsp/job_1452543424291_0022/_haltComputation=0
    Zookeeper server:port
        10.114.23.55:2181=0
    File Input Format Counters
        Bytes Read=0
    File Output Format Counters
        Bytes Written=0
==>result[hadoopgraph[gryoinputformat->gryooutputformat],memory[size:0]]
gremlin> g = result.graph().traversal(standard())
==>graphtraversalsource[hadoopgraph[gryoinputformat->gryooutputformat], standard]
gremlin>  g.V().valueMap('name',PageRankVertexProgram.PAGE_RANK)

==>[gremlin.pageRankVertexProgram.pageRank:[0.3398798396572379], name:[ARE YOU LONELY FOR ME]]
==>[gremlin.pageRankVertexProgram.pageRank:[0.3690865436012367], name:[Garcia_Lesh]]

==>[gremlin.pageRankVertexProgram.pageRank:[0.1894148972396984], name:[SAGE AND SPIRIT]]
==>[gremlin.pageRankVertexProgram.pageRank:[0.950180703456302], name:[LAZY LIGHTNING]]
==>[gremlin.pageRankVertexProgram.pageRank:[0.2195579121656338], name:[CHIMES OF FREEDOM]]
......

zuoz...@huawei.com

Nov 3, 2016, 9:23:56 AM
to Aurelius
Hi David, can you show me your 'read-hbase-spark.properties' file? I also want to read Titan (HBase) via HadoopGraph, but there is no example at http://titan.thinkaurelius.com/. Mine is below, but it doesn't work.

gremlin.graph=org.apache.tinkerpop.gremlin.hadoop.structure.HadoopGraph
gremlin.hadoop.graphInputFormat=com.thinkaurelius.titan.hadoop.formats.hbase.HBaseInputFormat
gremlin.hadoop.graphOutputFormat=org.apache.tinkerpop.gremlin.hadoop.structure.io.graphson.GraphSONOutputFormat
gremlin.hadoop.memoryOutputFormat=org.apache.hadoop.mapreduce.lib.output.SequenceFileOutputFormat

gremlin.hadoop.deriveMemory=false
gremlin.hadoop.jarsInDistributedCache=true
gremlin.hadoop.inputLocation=none
gremlin.hadoop.outputLocation=/zzb/output2

#
# Titan Cassandra InputFormat configuration
#
titanmr.ioformat.conf.storage.backend=hbase
titanmr.ioformat.conf.storage.hostname=8-5-144-2
titanmr.ioformat.conf.storage.port=24002
titanmr.ioformat.conf.storage.hbase.titan.table=zzbtest18json
#titanmr.ioformat.conf.storage.cassandra.keyspace=titan



Jason Plurad

Nov 3, 2016, 9:40:16 AM
to Aurelius

Imri Hecht

Dec 19, 2016, 5:23:33 PM
to Aurelius
You can use Mizo -- it is an implementation of a Spark RDD for Titan on HBase that bypasses HBase's main API and parses the internal data files used by HBase (called HFiles).

I have tested it on a pretty large Titan graph -- about 25 TB of storage, hundreds of billions of elements.