Load large files into Titan 1.0.0


Laxmikant Patil

Feb 4, 2016, 19:27:45
to Aurelius
Hi Daniel/Jason/Marko/Stephen, 

I have large CSV files in the following format.


recordID, person_name, Address, phone
------------------------------------------------------------------

(1,John, WA, 456781911)
(2, Anna, WA, 626762657)
(3, Peter,MA, 872783873)
.
.
.

I want to create a large network graph (there should be only one node for each distinct name/address/phone). I have to load this big file into Titan 1.0.0 (I have Cassandra as the backend).

I just want to know whether this data can be loaded directly with Titan's BulkLoaderVertexProgram, or whether it has to be transformed into a separate edge list first.

Thanks.

Daniel Kuppitz

Feb 4, 2016, 20:37:36
to aureliu...@googlegroups.com
How did you plan to structure your graph? What kind of vertices? What kind of edges? Which properties on which vertices/edges?

Cheers,
Daniel



Laxmikant Patil

Feb 4, 2016, 22:21:35
to Aurelius


The structure of the graph will be as follows:

Each attribute will be one vertex. 

Each name will be associated with one or more record IDs, as different people can have the same name. So each recordID will be linked to a name by a "has name" edge.

The structure will be similar for Phone and Address, i.e. Phone, Name and Address will each have a link to a RecordID. There will not be any links between phone, name and address themselves.

I have millions of records, e.g.:

(1,John, WA, 456781911)  
(2, Anna, WA, 626762657)
(3, Peter,MA, 872783873)
(4, John, WA, 99920280)


In the above example, there will be only one vertex for John and one for WA, and both will be linked to the vertices for recordID 1 and recordID 4.

456781911
|
|
1-----------John-------------4-------999920280
 |                                | 
 |                                |
WA-----------------------------


How do I go about building this kind of structure with only this CSV file, which has millions of records? Do I have to keep track of all the record IDs in memory?
Do all these operations have to happen in one transaction? Can you give a hint about writing the code in Java?

Thanks.                         

Daniel Kuppitz

Feb 5, 2016, 09:55:59
to aureliu...@googlegroups.com
I took your sample ...

daniel@cube /tmp $ cat /tmp/lax.txt
1,John,WA,456781911
2,Anna,WA,626762657
3,Peter,MA,872783873
4,John,WA,99920280

... ran a Spark job to bring that file into a better format ...

val textFile = sc.textFile("/tmp/lax.txt")
val maps = textFile.map(line => line.split(",")).
                    map(x => Map("recordId" -> x(0), "name" -> x(1), "address" -> x(2), "phone" -> x(3))).cache()

val records = maps.map(m => Array("recordId", m("recordId"), m("name"), m("phone"), m("address")).mkString("\t"))
val names = maps.map(m => (m("name"), Array(m("recordId")))).keyBy(_._1).mapValues(x => x._2).
                 reduceByKey((x, y) => x++y).map(x => Array("name", x._1, x._2.mkString(",")).mkString("\t"))

val addresses = maps.map(m => (m("address"), Array(m("recordId")))).keyBy(_._1).mapValues(x => x._2).
                     reduceByKey((x, y) => x++y).map(x => Array("address", x._1, x._2.mkString(",")).mkString("\t"))

val phones = maps.map(m => (m("phone"), Array(m("recordId")))).keyBy(_._1).mapValues(x => x._2).
                  reduceByKey((x, y) => x++y).map(x => Array("phone", x._1, x._2.mkString(",")).mkString("\t"))


records.union(names).union(addresses).union(phones).saveAsTextFile("/tmp/patil")

... and ended up with a file that could be used as a bulk loader input:

daniel@cube /tmp $ cat /tmp/patil/part-0000*
recordId    1    John    456781911    WA
recordId    2    Anna    626762657    WA
recordId    3    Peter   872783873    MA
recordId    4    John    99920280     WA
name     Anna     2
name     Peter    3
name     John     1,4
address  MA       3
address  WA       1,2,4
phone    456781911   1
phone    99920280    4
phone    626762657   2
phone    872783873   3

The key is to have 1 unique vertex per line together with all its edges (incoming and outgoing).

Now all you need is a proper Groovy script to parse the input file (which IMO should be pretty easy with the given structure).
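
To give a rough, untested idea of what that script could look like for the TSV above -- this assumes TinkerPop's ScriptInputFormat (gremlin.hadoop.graphInputFormat set to ScriptInputFormat and gremlin.hadoop.scriptInputFormat.script pointing at the script file); the "hasName"/"hasPhone"/"hasAddress" edge labels and the prefixed vertex ids are made up, and the exact ScriptElementFactory API should be double-checked against the TinkerPop 3.0.1 docs:

def parse(line, factory) {
    def fields = line.split('\t')
    def type = fields[0]
    if (type == 'recordId') {
        // recordId <TAB> id <TAB> name <TAB> phone <TAB> address
        def record = factory.vertex('record:' + fields[1], 'record')
        record.property('recordId', fields[1])
        factory.edge(record, factory.vertex('name:' + fields[2], 'name'), 'hasName')
        factory.edge(record, factory.vertex('phone:' + fields[3], 'phone'), 'hasPhone')
        factory.edge(record, factory.vertex('address:' + fields[4], 'address'), 'hasAddress')
        return record
    } else {
        // name|address|phone <TAB> value <TAB> recordId[,recordId,...]
        def v = factory.vertex(type + ':' + fields[1], type)
        v.property(type, fields[1])
        fields[2].split(',').each { recId ->
            // same edges as above, this time from the record stubs into this vertex
            factory.edge(factory.vertex('record:' + recId, 'record'), v, 'has' + type.capitalize())
        }
        return v
    }
}

Note that every label and property key used here has to exist in your Titan schema (or automatic schema creation has to be enabled).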

Cheers,
Daniel



Jason Plurad

Feb 5, 2016, 13:47:49
to Aurelius
Hi Laxmikant,

If you're looking to use Java code, check out Alex's and Matthew's Marvel graph example:

https://github.com/awslabs/dynamodb-titan-storage-backend/blob/1.0.0/src/main/java/com/amazon/titan/example/MarvelGraphFactory.java

It creates a Titan schema, parses a CSV, and then uses basic Gremlin addVertex() and addEdge() to build the graph. You'll notice that the TitanGraph isn't instantiated in the factory itself, so even though it is inside a Titan-DynamoDB example, you can use this with any Titan backend (Cassandra, HBase, Berkeley).
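
To give you a rough idea of that pattern applied to your CSV, here is an untested Gremlin console sketch -- the 'record'/'name'/'address'/'phone' labels and the 'has*' edge labels are just made up, and for millions of rows you'd want composite indexes on the lookup keys plus batched commits:

// assumes 'graph' is an open TitanGraph, e.g. graph = TitanFactory.open('conf/titan-cassandra.properties')
getOrCreate = { label, key, value ->
    def t = graph.traversal().V().hasLabel(label).has(key, value)
    t.hasNext() ? t.next() : graph.addVertex(T.label, label, key, value)
}

new File('records.csv').eachLine { line ->
    def (id, name, address, phone) = line.split(',')*.trim()
    def record = graph.addVertex(T.label, 'record', 'recordId', id)
    record.addEdge('hasName',    getOrCreate('name',    'name',    name))
    record.addEdge('hasAddress', getOrCreate('address', 'address', address))
    record.addEdge('hasPhone',   getOrCreate('phone',   'phone',   phone))
}
graph.tx().commit()  // for millions of rows, commit every few thousand lines instead of once at the end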

If your graph data is in the low millions, you could use a Titan-BerkeleyJE graph on your own machine, which might be an easier backend to use at first rather than a Cassandra cluster. I'd recommend that you do not get too caught up on loading a lot of data initially -- get comfortable with how to use Titan and TinkerPop with OLTP first and then move into OLAP approaches.

Enjoy!


-- Jason

Laxmikant Patil

Feb 5, 2016, 18:42:07
to Aurelius
Thanks Daniel. That is really helpful. I will surely try this.

Thanks again.

Stephen Mallette

Feb 8, 2016, 06:57:54
to Aurelius
nice ducati jason :D

Laxmikant Patil

Feb 16, 2016, 04:04:59
to Aurelius
Hi Daniel,

Are there any dependency issues with SparkGraphComputer?

Check out this issue: SparkGraphComputerLoading.
The SparkGraphComputer.workers(1) method does not exist in TinkerPop 3.0.1, and when I run the SparkGraphComputer code from the Titan 1.0.0 page it throws this error:

 The given graph instance does not allow concurrent access.

How do I resolve this conflict?



Daniel Kuppitz

Feb 16, 2016, 07:45:56
to aureliu...@googlegroups.com
Titan does allow concurrent access. It looks like you're trying to load your data into TinkerGraph, which will only work in TinkerPop 3.1+.
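
In other words, the properties file you pass to BulkLoaderVertexProgram.build().writeGraph(...) has to open Titan, not TinkerGraph -- roughly like this (hostname and keyspace are placeholders for your setup):

gremlin.graph=com.thinkaurelius.titan.core.TitanFactory
storage.backend=cassandrathrift
storage.hostname=127.0.0.1
storage.cassandra.keyspace=titan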

Cheers,
Daniel


Laxmikant Patil

Feb 16, 2016, 12:08:35
to Aurelius
Then how do I resolve the concurrent-access error with TitanGraph? I am bulk loading data from a CSV into TitanGraph, but it gives me this error.

Daniel Kuppitz

Feb 16, 2016, 13:02:59
to aureliu...@googlegroups.com
Can you show the whole code and the properties files you're using?

Cheers,
Daniel


Laxmikant Patil

Feb 16, 2016, 13:28:52
to Aurelius
I am trying to execute the sample code on Titan 1.0.0 first so that I get an idea of how it works, but it does not seem to work in my case. (I am using the default Titan 1.0.0 package and have not modified any TinkerPop versions.)

# hadoop-load.properties

#
# Hadoop Graph Configuration
#
gremlin.graph=org.apache.tinkerpop.gremlin.hadoop.structure.HadoopGraph
gremlin.hadoop.graphInputFormat=org.apache.tinkerpop.gremlin.hadoop.structure.io.gryo.GryoInputFormat
gremlin.hadoop.graphOutputFormat=org.apache.hadoop.mapreduce.lib.output.NullOutputFormat
gremlin.hadoop.memoryOutputFormat=org.apache.hadoop.mapreduce.lib.output.SequenceFileOutputFormat
gremlin.hadoop.inputLocation=./data/grateful-dead.kryo
gremlin.hadoop.outputLocation=output
gremlin.hadoop.deriveMemory=false
gremlin.hadoop.jarsInDistributedCache=true

#
# GiraphGraphComputer Configuration
#
giraph.minWorkers=2
giraph.maxWorkers=2
giraph.useOutOfCoreGraph=true
giraph.useOutOfCoreMessages=true
mapred.map.child.java.opts=-Xmx1024m
mapred.reduce.child.java.opts=-Xmx1024m
giraph.numInputThreads=4
giraph.numComputeThreads=4
giraph.maxMessagesInMemory=100000

#
# SparkGraphComputer Configuration
#
spark.master=local[*]
spark.executor.memory=1g
spark.serializer=org.apache.spark.serializer.KryoSerializer

// titan-schema-grateful-dead.groovy

def defineGratefulDeadSchema(titanGraph) {
    m = titanGraph.openManagement()
    // vertex labels
    artist = m.makeVertexLabel("artist").make()
    song   = m.makeVertexLabel("song").make()
    // edge labels
    sungBy     = m.makeEdgeLabel("sungBy").make()
    writtenBy  = m.makeEdgeLabel("writtenBy").make()
    followedBy = m.makeEdgeLabel("followedBy").make()
    // vertex and edge properties
    blid         = m.makePropertyKey("bulkLoader.vertex.id").dataType(Long.class).make()
    name         = m.makePropertyKey("name").dataType(String.class).make()
    songType     = m.makePropertyKey("songType").dataType(String.class).make()
    performances = m.makePropertyKey("performances").dataType(Integer.class).make()
    weight       = m.makePropertyKey("weight").dataType(Integer.class).make()
    // global indices
    m.buildIndex("byBulkLoaderVertexId", Vertex.class).addKey(blid).buildCompositeIndex()
    m.buildIndex("artistsByName", Vertex.class).addKey(name).indexOnly(artist).buildCompositeIndex()
    m.buildIndex("songsByName", Vertex.class).addKey(name).indexOnly(song).buildCompositeIndex()
    // vertex centric indices
    m.buildEdgeIndex(followedBy, "followedByWeight", Direction.BOTH, Order.decr, weight)
    m.commit()
}

gremlin> :load data/grateful-dead-titan-schema.groovy
==>true
==>true
gremlin> graph = TitanFactory.open('conf/titan-cassandra.properties')
==>standardtitangraph[cassandrathrift:[127.0.0.1]]
gremlin> defineGratefulDeadSchema(graph)
==>null
gremlin> graph.close()
==>null
gremlin> hdfs.copyFromLocal('data/grateful-dead.kryo','data/grateful-dead.kryo')
==>null
gremlin> graph = GraphFactory.open('conf/hadoop-graph/hadoop-load.properties')
==>hadoopgraph[gryoinputformat->nulloutputformat]
gremlin> blvp = BulkLoaderVertexProgram.build().writeGraph('conf/titan-cassandra.properties').create(graph)
==>BulkLoaderVertexProgram[bulkLoader=IncrementalBulkLoader,vertexIdProperty=bulkLoader.vertex.id,userSuppliedIds=false,keepOriginalIds=true,batchSize=0]
gremlin> graph.compute(SparkGraphComputer).program(blvp).submit().get()

On the last statement I get the error,

11:21:05 WARN  org.apache.tinkerpop.gremlin.hadoop.process.computer.spark.SparkGraphComputer  - class org.apache.hadoop.mapreduce.lib.output.NullOutputFormat does not implement PersistResultGraphAware and thus, persistence options are unknown -- assuming all options are possible
11:21:20 WARN  org.apache.hadoop.io.compress.snappy.LoadSnappy  - Snappy native library not loaded
11:21:21 ERROR org.apache.spark.executor.Executor  - Exception in task 0.0 in stage 0.0 (TID 0)
java.io.EOFException
at java.io.DataInputStream.readByte(DataInputStream.java:267)
at org.apache.tinkerpop.gremlin.hadoop.structure.io.gryo.GryoRecordReader.seekToHeader(GryoRecordReader.java:82)
at org.apache.tinkerpop.gremlin.hadoop.structure.io.gryo.GryoRecordReader.initialize(GryoRecordReader.java:74)
at org.apache.tinkerpop.gremlin.hadoop.structure.io.gryo.GryoInputFormat.createRecordReader(GryoInputFormat.java:39)
at org.apache.spark.rdd.NewHadoopRDD$$anon$1.<init>(NewHadoopRDD.scala:133)
at org.apache.spark.rdd.NewHadoopRDD.compute(NewHadoopRDD.scala:107)
at org.apache.spark.rdd.NewHadoopRDD.compute(NewHadoopRDD.scala:69)
at org.apache.spark.rdd.RDD.computeOrReadCheckpoint(RDD.scala:280)
at org.apache.spark.rdd.RDD.iterator(RDD.scala:247)
at org.apache.spark.rdd.MappedRDD.compute(MappedRDD.scala:31)
at org.apache.spark.rdd.RDD.computeOrReadCheckpoint(RDD.scala:280)
at org.apache.spark.rdd.RDD.iterator(RDD.scala:247)
at org.apache.spark.scheduler.ShuffleMapTask.runTask(ShuffleMapTask.scala:68)
at org.apache.spark.scheduler.ShuffleMapTask.runTask(ShuffleMapTask.scala:41)
at org.apache.spark.scheduler.Task.run(Task.scala:56)
at org.apache.spark.executor.Executor$TaskRunner.run(Executor.scala:200)
at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1142)
at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:617)
at java.lang.Thread.run(Thread.java:745)
11:21:21 WARN  org.apache.spark.scheduler.TaskSetManager  - Lost task 0.0 in stage 0.0 (TID 0, localhost): java.io.EOFException
at java.io.DataInputStream.readByte(DataInputStream.java:267)
at org.apache.tinkerpop.gremlin.hadoop.structure.io.gryo.GryoRecordReader.seekToHeader(GryoRecordReader.java:82)
at org.apache.tinkerpop.gremlin.hadoop.structure.io.gryo.GryoRecordReader.initialize(GryoRecordReader.java:74)
at org.apache.tinkerpop.gremlin.hadoop.structure.io.gryo.GryoInputFormat.createRecordReader(GryoInputFormat.java:39)
at org.apache.spark.rdd.NewHadoopRDD$$anon$1.<init>(NewHadoopRDD.scala:133)
at org.apache.spark.rdd.NewHadoopRDD.compute(NewHadoopRDD.scala:107)
at org.apache.spark.rdd.NewHadoopRDD.compute(NewHadoopRDD.scala:69)
at org.apache.spark.rdd.RDD.computeOrReadCheckpoint(RDD.scala:280)
at org.apache.spark.rdd.RDD.iterator(RDD.scala:247)
at org.apache.spark.rdd.MappedRDD.compute(MappedRDD.scala:31)
at org.apache.spark.rdd.RDD.computeOrReadCheckpoint(RDD.scala:280)
at org.apache.spark.rdd.RDD.iterator(RDD.scala:247)
at org.apache.spark.scheduler.ShuffleMapTask.runTask(ShuffleMapTask.scala:68)
at org.apache.spark.scheduler.ShuffleMapTask.runTask(ShuffleMapTask.scala:41)
at org.apache.spark.scheduler.Task.run(Task.scala:56)
at org.apache.spark.executor.Executor$TaskRunner.run(Executor.scala:200)
at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1142)
at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:617)
at java.lang.Thread.run(Thread.java:745)

11:21:21 ERROR org.apache.spark.scheduler.TaskSetManager  - Task 0 in stage 0.0 failed 1 times; aborting job
org.apache.spark.SparkException: Job aborted due to stage failure: Task 0 in stage 0.0 failed 1 times, most recent failure: Lost task 0.0 in stage 0.0 (TID 0, localhost): java.io.EOFException
at java.io.DataInputStream.readByte(DataInputStream.java:267)
at org.apache.tinkerpop.gremlin.hadoop.structure.io.gryo.GryoRecordReader.seekToHeader(GryoRecordReader.java:82)
at org.apache.tinkerpop.gremlin.hadoop.structure.io.gryo.GryoRecordReader.initialize(GryoRecordReader.java:74)
at org.apache.tinkerpop.gremlin.hadoop.structure.io.gryo.GryoInputFormat.createRecordReader(GryoInputFormat.java:39)
at org.apache.spark.rdd.NewHadoopRDD$$anon$1.<init>(NewHadoopRDD.scala:133)
at org.apache.spark.rdd.NewHadoopRDD.compute(NewHadoopRDD.scala:107)
at org.apache.spark.rdd.NewHadoopRDD.compute(NewHadoopRDD.scala:69)
at org.apache.spark.rdd.RDD.computeOrReadCheckpoint(RDD.scala:280)
at org.apache.spark.rdd.RDD.iterator(RDD.scala:247)
at org.apache.spark.rdd.MappedRDD.compute(MappedRDD.scala:31)
at org.apache.spark.rdd.RDD.computeOrReadCheckpoint(RDD.scala:280)
at org.apache.spark.rdd.RDD.iterator(RDD.scala:247)
at org.apache.spark.scheduler.ShuffleMapTask.runTask(ShuffleMapTask.scala:68)
at org.apache.spark.scheduler.ShuffleMapTask.runTask(ShuffleMapTask.scala:41)
at org.apache.spark.scheduler.Task.run(Task.scala:56)
at org.apache.spark.executor.Executor$TaskRunner.run(Executor.scala:200)
at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1142)
at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:617)
at java.lang.Thread.run(Thread.java:745)

Driver stacktrace:
Display stack trace? [yN] y
java.util.concurrent.ExecutionException: org.apache.spark.SparkException: Job aborted due to stage failure: Task 0 in stage 0.0 failed 1 times, most recent failure: Lost task 0.0 in stage 0.0 (TID 0, localhost): java.io.EOFException
at java.io.DataInputStream.readByte(DataInputStream.java:267)
at org.apache.tinkerpop.gremlin.hadoop.structure.io.gryo.GryoRecordReader.seekToHeader(GryoRecordReader.java:82)
at org.apache.tinkerpop.gremlin.hadoop.structure.io.gryo.GryoRecordReader.initialize(GryoRecordReader.java:74)
at org.apache.tinkerpop.gremlin.hadoop.structure.io.gryo.GryoInputFormat.createRecordReader(GryoInputFormat.java:39)
at org.apache.spark.rdd.NewHadoopRDD$$anon$1.<init>(NewHadoopRDD.scala:133)
at org.apache.spark.rdd.NewHadoopRDD.compute(NewHadoopRDD.scala:107)
at org.apache.spark.rdd.NewHadoopRDD.compute(NewHadoopRDD.scala:69)
at org.apache.spark.rdd.RDD.computeOrReadCheckpoint(RDD.scala:280)
at org.apache.spark.rdd.RDD.iterator(RDD.scala:247)
at org.apache.spark.rdd.MappedRDD.compute(MappedRDD.scala:31)
at org.apache.spark.rdd.RDD.computeOrReadCheckpoint(RDD.scala:280)
at org.apache.spark.rdd.RDD.iterator(RDD.scala:247)
at org.apache.spark.scheduler.ShuffleMapTask.runTask(ShuffleMapTask.scala:68)
at org.apache.spark.scheduler.ShuffleMapTask.runTask(ShuffleMapTask.scala:41)
at org.apache.spark.scheduler.Task.run(Task.scala:56)
at org.apache.spark.executor.Executor$TaskRunner.run(Executor.scala:200)
at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1142)
at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:617)
at java.lang.Thread.run(Thread.java:745)

Driver stacktrace:
at java.util.concurrent.CompletableFuture.reportGet(CompletableFuture.java:357)
at java.util.concurrent.CompletableFuture.get(CompletableFuture.java:1895)
at java_util_concurrent_Future$get.call(Unknown Source)
at org.codehaus.groovy.runtime.callsite.CallSiteArray.defaultCall(CallSiteArray.java:45)
at org.codehaus.groovy.runtime.callsite.AbstractCallSite.call(AbstractCallSite.java:110)
at org.codehaus.groovy.runtime.callsite.AbstractCallSite.call(AbstractCallSite.java:114)
at groovysh_evaluate.run(groovysh_evaluate:3)
at org.codehaus.groovy.vmplugin.v7.IndyInterface.selectMethod(IndyInterface.java:215)
at org.codehaus.groovy.tools.shell.Interpreter.evaluate(Interpreter.groovy:69)
at org.codehaus.groovy.tools.shell.Groovysh.execute(Groovysh.groovy:185)
at org.codehaus.groovy.tools.shell.Shell.leftShift(Shell.groovy:119)
at org.codehaus.groovy.tools.shell.InteractiveShellRunner.super$2$work(InteractiveShellRunner.groovy)
at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method)
at sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:62)
at sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43)
at java.lang.reflect.Method.invoke(Method.java:497)
at org.codehaus.groovy.reflection.CachedMethod.invoke(CachedMethod.java:90)
at groovy.lang.MetaMethod.doMethodInvoke(MetaMethod.java:324)
at groovy.lang.MetaClassImpl.invokeMethod(MetaClassImpl.java:1207)
at org.codehaus.groovy.runtime.ScriptBytecodeAdapter.invokeMethodOnSuperN(ScriptBytecodeAdapter.java:130)
at org.codehaus.groovy.runtime.ScriptBytecodeAdapter.invokeMethodOnSuper0(ScriptBytecodeAdapter.java:150)
at org.codehaus.groovy.tools.shell.ShellRunner.run(ShellRunner.groovy:58)
at org.codehaus.groovy.tools.shell.InteractiveShellRunner.super$2$run(InteractiveShellRunner.groovy)
at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method)
at sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:62)
at sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43)
at java.lang.reflect.Method.invoke(Method.java:497)
at org.codehaus.groovy.reflection.CachedMethod.invoke(CachedMethod.java:90)
at groovy.lang.MetaMethod.doMethodInvoke(MetaMethod.java:324)
at groovy.lang.MetaClassImpl.invokeMethod(MetaClassImpl.java:1207)
at org.codehaus.groovy.runtime.ScriptBytecodeAdapter.invokeMethodOnSuperN(ScriptBytecodeAdapter.java:130)
at org.codehaus.groovy.runtime.ScriptBytecodeAdapter.invokeMethodOnSuper0(ScriptBytecodeAdapter.java:150)
at org.codehaus.groovy.tools.shell.InteractiveShellRunner.run(InteractiveShellRunner.groovy:82)
at org.codehaus.groovy.vmplugin.v7.IndyInterface.selectMethod(IndyInterface.java:215)
at org.apache.tinkerpop.gremlin.console.Console.<init>(Console.groovy:144)
at org.codehaus.groovy.vmplugin.v7.IndyInterface.selectMethod(IndyInterface.java:215)
at org.apache.tinkerpop.gremlin.console.Console.main(Console.groovy:303)
Caused by: org.apache.spark.SparkException: Job aborted due to stage failure: Task 0 in stage 0.0 failed 1 times, most recent failure: Lost task 0.0 in stage 0.0 (TID 0, localhost): java.io.EOFException
at java.io.DataInputStream.readByte(DataInputStream.java:267)
at org.apache.tinkerpop.gremlin.hadoop.structure.io.gryo.GryoRecordReader.seekToHeader(GryoRecordReader.java:82)
at org.apache.tinkerpop.gremlin.hadoop.structure.io.gryo.GryoRecordReader.initialize(GryoRecordReader.java:74)
at org.apache.tinkerpop.gremlin.hadoop.structure.io.gryo.GryoInputFormat.createRecordReader(GryoInputFormat.java:39)
at org.apache.spark.rdd.NewHadoopRDD$$anon$1.<init>(NewHadoopRDD.scala:133)
at org.apache.spark.rdd.NewHadoopRDD.compute(NewHadoopRDD.scala:107)
at org.apache.spark.rdd.NewHadoopRDD.compute(NewHadoopRDD.scala:69)
at org.apache.spark.rdd.RDD.computeOrReadCheckpoint(RDD.scala:280)
at org.apache.spark.rdd.RDD.iterator(RDD.scala:247)
at org.apache.spark.rdd.MappedRDD.compute(MappedRDD.scala:31)
at org.apache.spark.rdd.RDD.computeOrReadCheckpoint(RDD.scala:280)
at org.apache.spark.rdd.RDD.iterator(RDD.scala:247)
at org.apache.spark.scheduler.ShuffleMapTask.runTask(ShuffleMapTask.scala:68)
at org.apache.spark.scheduler.ShuffleMapTask.runTask(ShuffleMapTask.scala:41)
at org.apache.spark.scheduler.Task.run(Task.scala:56)
at org.apache.spark.executor.Executor$TaskRunner.run(Executor.scala:200)
at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1142)
at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:617)
at java.lang.Thread.run(Thread.java:745)

Driver stacktrace:
at org.apache.spark.scheduler.DAGScheduler.org$apache$spark$scheduler$DAGScheduler$$failJobAndIndependentStages(DAGScheduler.scala:1214)
at org.apache.spark.scheduler.DAGScheduler$$anonfun$abortStage$1.apply(DAGScheduler.scala:1203)
at org.apache.spark.scheduler.DAGScheduler$$anonfun$abortStage$1.apply(DAGScheduler.scala:1202)
at scala.collection.mutable.ResizableArray$class.foreach(ResizableArray.scala:59)
at scala.collection.mutable.ArrayBuffer.foreach(ArrayBuffer.scala:47)
at org.apache.spark.scheduler.DAGScheduler.abortStage(DAGScheduler.scala:1202)
at org.apache.spark.scheduler.DAGScheduler$$anonfun$handleTaskSetFailed$1.apply(DAGScheduler.scala:696)
at org.apache.spark.scheduler.DAGScheduler$$anonfun$handleTaskSetFailed$1.apply(DAGScheduler.scala:696)
at scala.Option.foreach(Option.scala:236)
at org.apache.spark.scheduler.DAGScheduler.handleTaskSetFailed(DAGScheduler.scala:696)
at org.apache.spark.scheduler.DAGSchedulerEventProcessActor$$anonfun$receive$2.applyOrElse(DAGScheduler.scala:1420)
at akka.actor.Actor$class.aroundReceive(Actor.scala:465)
at org.apache.spark.scheduler.DAGSchedulerEventProcessActor.aroundReceive(DAGScheduler.scala:1375)
at akka.actor.ActorCell.receiveMessage(ActorCell.scala:516)
at akka.actor.ActorCell.invoke(ActorCell.scala:487)
at akka.dispatch.Mailbox.processMailbox(Mailbox.scala:238)
at akka.dispatch.Mailbox.run(Mailbox.scala:220)
at akka.dispatch.ForkJoinExecutorConfigurator$AkkaForkJoinTask.exec(AbstractDispatcher.scala:393)
at scala.concurrent.forkjoin.ForkJoinTask.doExec(ForkJoinTask.java:260)
at scala.concurrent.forkjoin.ForkJoinPool$WorkQueue.runTask(ForkJoinPool.java:1339)
at scala.concurrent.forkjoin.ForkJoinPool.runWorker(ForkJoinPool.java:1979)
at scala.concurrent.forkjoin.ForkJoinWorkerThread.run(ForkJoinWorkerThread.java:107)






Marko Rodriguez

Feb 16, 2016, 14:00:34
to aureliu...@googlegroups.com
Hi,

I suspect your inputLocation is bad.

gremlin.hadoop.inputLocation=./data/grateful-dead.kryo

It's probably reading an empty location and then trying to "seekToHeader" and failing. Can you verify that the input file is actually being read? Perhaps do a PageRankVertexProgram (and thus get Titan out of the mix).

Marko.

Laxmikant Patil

Feb 16, 2016, 14:36:57
to Aurelius
Hi Marko,

Thanks for your suggestion.
Can you specify how to check whether the file exists in HDFS?

I cannot do "hadoop fs -ls /" since Hadoop is started internally by the Gremlin query, so how exactly do I check it through a Gremlin query?


Marko Rodriguez

Feb 16, 2016, 14:59:03
to aureliu...@googlegroups.com
Hello,

Perhaps do a PageRankVertexProgram (and thus, get Titan out of the mix).

Meaning, don't try to do a BulkLoad. If that fails, then we know it's something with how the file is being read. Then I would probably try data/grateful-dead.kryo instead of ./data/grateful-dead.kryo. In short, twiddle and figure out why your reference to the file is not showing up. You can also try "gremlin> hdfs.ls()" .. perhaps you are not pointing to the cluster, but to your local machine.
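
Off the top of my head (untested -- double-check the PageRankVertexProgram builder call against the TinkerPop 3.0.1 docs), something like this tells you whether the file can be read at all, without Titan in the picture:

gremlin> hdfs.ls('data')
gremlin> graph = GraphFactory.open('conf/hadoop-graph/hadoop-load.properties')
gremlin> graph.compute(SparkGraphComputer).program(PageRankVertexProgram.build().create()).submit().get()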

Marko.

Laxmikant Patil

Feb 16, 2016, 16:10:47
to Aurelius
PageRankVertexProgram works fine. Even this code works fine with tinkerpop-modern.kryo when I copy tinkerpop-modern.kryo into HDFS.

So the issue is with the grateful-dead.kryo file provided with the Titan package: when I copy "grateful-dead.kryo" into HDFS, the size of the file suddenly becomes zero, and maybe that is why it gives the EOFException.

Is it possible for you to try the same code with grateful-dead.kryo on your machine to see whether it works correctly? If not, it would be helpful for correcting the grateful-dead.kryo file in the package and docs.

The main purpose of this is to learn how I can use SparkGraphComputer to create a graph and later do some processing with Spark on the graph stored in Titan. And I think this is the only SparkGraphComputer example available for Titan 1.0.0.

Many thanks Marko for replying to my queries. Thanks a lot! :)

Daniel Kuppitz

Feb 16, 2016, 18:05:25
to aureliu...@googlegroups.com
Try to change the input location:

gremlin.hadoop.inputLocation=grateful-dead.kryo

Then copy the file like this into HDFS

hdfs.copyFromLocal('data/grateful-dead.kryo','grateful-dead.kryo')

and try again.

Cheers,
Daniel


Laxmikant Patil

Feb 16, 2016, 20:19:04
to Aurelius
Hi Daniel,

Thanks for the reply.

I tried again after changing the input location as you suggested.

I still get the error:

18:16:21 ERROR org.apache.spark.executor.Executor  - Exception in task 0.0 in stage 0.0 (TID 0)
java.io.EOFException
at java.io.DataInputStream.readByte(DataInputStream.java:267)
at org.apache.tinkerpop.gremlin.hadoop.structure.io.gryo.GryoRecordReader.seekToHeader(GryoRecordReader.java:82)
at org.apache.tinkerpop.gremlin.hadoop.structure.io.gryo.GryoRecordReader.initialize(GryoRecordReader.java:74)
at org.apache.tinkerpop.gremlin.hadoop.structure.io.gryo.GryoInputFormat.createRecordReader(GryoInputFormat.java:39)
at org.apache.spark.rdd.NewHadoopRDD$$anon$1.<init>(NewHadoopRDD.scala:133)
at org.apache.spark.rdd.NewHadoopRDD.compute(NewHadoopRDD.scala:107)
at org.apache.spark.rdd.NewHadoopRDD.compute(NewHadoopRDD.scala:69)
at org.apache.spark.rdd.RDD.computeOrReadCheckpoint(RDD.scala:280)
at org.apache.spark.rdd.RDD.iterator(RDD.scala:247)
at org.apache.spark.rdd.MappedRDD.compute(MappedRDD.scala:31)
at org.apache.spark.rdd.RDD.computeOrReadCheckpoint(RDD.scala:280)
at org.apache.spark.rdd.RDD.iterator(RDD.scala:247)
at org.apache.spark.scheduler.ShuffleMapTask.runTask(ShuffleMapTask.scala:68)
at org.apache.spark.scheduler.ShuffleMapTask.runTask(ShuffleMapTask.scala:41)
at org.apache.spark.scheduler.Task.run(Task.scala:56)
at org.apache.spark.executor.Executor$TaskRunner.run(Executor.scala:200)
at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1142)
at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:617)
at java.lang.Thread.run(Thread.java:745)
18:16:21 WARN  org.apache.spark.scheduler.TaskSetManager  - Lost task 0.0 in stage 0.0 (TID 0, localhost): java.io.EOFException
at java.io.DataInputStream.readByte(DataInputStream.java:267)
at org.apache.tinkerpop.gremlin.hadoop.structure.io.gryo.GryoRecordReader.seekToHeader(GryoRecordReader.java:82)
at org.apache.tinkerpop.gremlin.hadoop.structure.io.gryo.GryoRecordReader.initialize(GryoRecordReader.java:74)
at org.apache.tinkerpop.gremlin.hadoop.structure.io.gryo.GryoInputFormat.createRecordReader(GryoInputFormat.java:39)
at org.apache.spark.rdd.NewHadoopRDD$$anon$1.<init>(NewHadoopRDD.scala:133)
at org.apache.spark.rdd.NewHadoopRDD.compute(NewHadoopRDD.scala:107)
at org.apache.spark.rdd.NewHadoopRDD.compute(NewHadoopRDD.scala:69)
at org.apache.spark.rdd.RDD.computeOrReadCheckpoint(RDD.scala:280)
at org.apache.spark.rdd.RDD.iterator(RDD.scala:247)
at org.apache.spark.rdd.MappedRDD.compute(MappedRDD.scala:31)
at org.apache.spark.rdd.RDD.computeOrReadCheckpoint(RDD.scala:280)
at org.apache.spark.rdd.RDD.iterator(RDD.scala:247)
at org.apache.spark.scheduler.ShuffleMapTask.runTask(ShuffleMapTask.scala:68)
at org.apache.spark.scheduler.ShuffleMapTask.runTask(ShuffleMapTask.scala:41)
at org.apache.spark.scheduler.Task.run(Task.scala:56)
at org.apache.spark.executor.Executor$TaskRunner.run(Executor.scala:200)
at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1142)
at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:617)
at java.lang.Thread.run(Thread.java:745)

18:16:21 ERROR org.apache.spark.scheduler.TaskSetManager  - Task 0 in stage 0.0 failed 1 times; aborting job
18:16:22 WARN  org.eclipse.jetty.util.thread.QueuedThreadPool  - 1 threads could not be stopped
org.apache.spark.SparkException: Job aborted due to stage failure: Task 0 in stage 0.0 failed 1 times, most recent failure: Lost task 0.0 in stage 0.0 (TID 0, localhost): java.io.EOFException
at java.io.DataInputStream.readByte(DataInputStream.java:267)
at org.apache.tinkerpop.gremlin.hadoop.structure.io.gryo.GryoRecordReader.seekToHeader(GryoRecordReader.java:82)
at org.apache.tinkerpop.gremlin.hadoop.structure.io.gryo.GryoRecordReader.initialize(GryoRecordReader.java:74)
at org.apache.tinkerpop.gremlin.hadoop.structure.io.gryo.GryoInputFormat.createRecordReader(GryoInputFormat.java:39)
at org.apache.spark.rdd.NewHadoopRDD$$anon$1.<init>(NewHadoopRDD.scala:133)
at org.apache.spark.rdd.NewHadoopRDD.compute(NewHadoopRDD.scala:107)
at org.apache.spark.rdd.NewHadoopRDD.compute(NewHadoopRDD.scala:69)
at org.apache.spark.rdd.RDD.computeOrReadCheckpoint(RDD.scala:280)
at org.apache.spark.rdd.RDD.iterator(RDD.scala:247)
at org.apache.spark.rdd.MappedRDD.compute(MappedRDD.scala:31)
at org.apache.spark.rdd.RDD.computeOrReadCheckpoint(RDD.scala:280)
at org.apache.spark.rdd.RDD.iterator(RDD.scala:247)
at org.apache.spark.scheduler.ShuffleMapTask.runTask(ShuffleMapTask.scala:68)
at org.apache.spark.scheduler.ShuffleMapTask.runTask(ShuffleMapTask.scala:41)
at org.apache.spark.scheduler.Task.run(Task.scala:56)
at org.apache.spark.executor.Executor$TaskRunner.run(Executor.scala:200)
at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1142)
at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:617)
at java.lang.Thread.run(Thread.java:745)

Do all the steps work on your machine?

Thanks.

Marko Rodriguez

Feb 16, 2016, 21:49:15
to aureliu...@googlegroups.com
Hi,

I suspect you have a corrupt file. Perhaps try another .kryo file.

Marko.

Jason Plurad

Feb 18, 2016, 10:57:29
to Aurelius
Hi Laxmikant,

I have confirmed that the BLVP example works with titan-1.0.0-hadoop1, hadoop-1.2.1, cassandra-2.1.9, spark-1.2.1-hadoop1. Running on Ubuntu 14.04 with OpenJDK 1.8.0_72-internal.

What you mentioned is a bit troubling: 'when I copy "grateful-dead.kyro" into hdfs the size of the file becomes suddenly zero.' I don't have any idea why that's happening.

Here's what I see on file upload to HDFS:

$ export JAVA_HOME=/usr/lib/jvm/java-8-openjdk-amd64
$ export HADOOP_PREFIX=/usr/lib/hadoop-1.2.1
$ export CLASSPATH=$HADOOP_PREFIX/conf
$ ./bin/gremlin.sh
gremlin> Titan.version()
==>1.0.0
gremlin> Gremlin.version()
==>3.0.1-incubating
gremlin> hdfs.ls()
==>rwxr-xr-x graphie hadoop 0 (D) _bsp
==>rwxr-xr-x graphie hadoop 0 (D) data
==>rwxr-xr-x graphie hadoop 0 (D) hadoop-gremlin-libs
gremlin> hdfs.copyFromLocal('data/grateful-dead.kryo', 'data/grateful-dead.kryo')
==>null
gremlin> hdfs.ls('data/grateful-dead.kryo')
==>rw-r--r-- graphie hadoop 332226 grateful-dead.kryo

Here's the writeGraph (titan-cassandra.properties) I used:

gremlin.graph=com.thinkaurelius.titan.core.TitanFactory
storage.backend=cassandrathrift
storage.hostname=u1401.ambari.apache.org,u1402.ambari.apache.org,u1403.ambari.apache.org
storage.cassandra.keyspace=titan
storage.cassandra.replication-factor=3

And the readGraph (hadoop-load.properties). It worked with local[*] and with a standalone Spark cluster.

gremlin.graph=org.apache.tinkerpop.gremlin.hadoop.structure.HadoopGraph
gremlin.hadoop.graphInputFormat=org.apache.tinkerpop.gremlin.hadoop.structure.io.gryo.GryoInputFormat
gremlin.hadoop.graphOutputFormat=org.apache.hadoop.mapreduce.lib.output.NullOutputFormat
gremlin.hadoop.inputLocation=./data/grateful-dead.kryo
gremlin.hadoop.outputLocation=output
gremlin.hadoop.jarsInDistributedCache=true
spark.master=local[*]
#spark.master=spark://u1401:7077
spark.executor.memory=1g
spark.serializer=org.apache.spark.serializer.KryoSerializer

-- Jason