Loading 6.2 million vertices and 5 million edges into Cassandra


ajay krishnan

Sep 26, 2016, 2:48:20 PM
to Aurelius
Hello,
I am using Titan with Cassandra as the backend (single node).
I have a file containing graph data in GraphSON format. The file size is around 9 GB.
There are 6.2 million vertices and around 5 million edges.
Edges have a HashMap as a property value. I used the serializer found here: https://github.com/pluradj/titan-attribute-serializer

It took me 40 hours to load this data into Cassandra. I hit out-of-memory errors a few times and was finally able to run the load to completion with around 200 GB of memory (fortunately, for this experiment I had a machine with lots of RAM).

I looked at a lot of posts that talk about BulkLoaderVertexProgram, but I could not find an example configuration that resembles my case.
Most of the posts talk about KryoSerializer, but in my case I have a GraphSON file that is the output of another system, and I cannot change that. Going forward, I will not be able to use the machine with lots of RAM.

The application is written in Java. I use Titan embedded (i.e., I use Titan via the jar and don't have a separate Titan installation).

I currently use TitanGraph.io(IoCore.graphson()) to read the file. Does this construct the entire graph in memory before persisting it, or does it persist incrementally? If it persists incrementally, I should not need a lot of RAM, right?

I did try setting ids.block-size to 100000.
Titan does not seem to accept user-supplied IDs. I set "graph.set-vertex-id" to "true", but an exception is thrown.

Titan 1.0.0
Cassandra (DataStax version) 3.7.0
TinkerPop 3.0.1-incubating (since Titan runs only with this version)

I can use Spark, but I am not allowed to install Hadoop.

I wish to load this data into Cassandra using Titan.

Could someone please help me reduce the load time and RAM consumption?

Thank you
Regards
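
For reference, the Titan settings named in the question above can be collected into one configuration, sketched here with plain java.util.Properties so that it runs without Titan on the classpath. The class name TitanCassandraConfig and the host 127.0.0.1 are hypothetical. storage.batch-loading is not mentioned in the post; it is included as an assumption because it is the Titan option usually paired with ids.block-size for bulk loads. graph.set-vertex-id is left out because the post reports that it throws an exception.

```java
import java.io.IOException;
import java.io.StringWriter;
import java.util.Properties;

/** Sketch: assemble a Titan-on-Cassandra configuration for a bulk load. */
final class TitanCassandraConfig {

    /** Builds the properties; hostname and blockSize are supplied by the caller. */
    static Properties build(String hostname, int blockSize) {
        Properties p = new Properties();
        // Tells TinkerPop's GraphFactory which factory opens this graph.
        p.setProperty("gremlin.graph", "com.thinkaurelius.titan.core.TitanFactory");
        p.setProperty("storage.backend", "cassandrathrift");
        p.setProperty("storage.hostname", hostname);
        // Assumption (not from the thread): batch loading for the initial import.
        p.setProperty("storage.batch-loading", "true");
        // From the thread: the poster tried ids.block-size = 100000.
        p.setProperty("ids.block-size", Integer.toString(blockSize));
        return p;
    }

    public static void main(String[] args) throws IOException {
        // Print the configuration in .properties syntax.
        StringWriter out = new StringWriter();
        build("127.0.0.1", 100000).store(out, "titan-cassandra.properties (sketch)");
        System.out.print(out);
    }
}
```

The printed text can be saved as a .properties file, or the same keys can be passed to Titan directly from the embedding Java application.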

Daniel Kuppitz

Sep 26, 2016, 3:01:48 PM
to aureliu...@googlegroups.com
You probably haven't seen the OpenFlights migration demo..? Since you already have a GraphSON input file, you can ignore the first part and start reading at "Migrating to Titan 1.0.0". It should all work without Hadoop; instead of HDFS you can also use the local file system (I would say that's even better on a single machine).

Cheers,
Daniel


--
You received this message because you are subscribed to the Google Groups "Aurelius" group.
To unsubscribe from this group and stop receiving emails from it, send an email to aureliusgraphs+unsubscribe@googlegroups.com.
To view this discussion on the web visit https://groups.google.com/d/msgid/aureliusgraphs/b2b4e487-88e2-4953-a90a-cf510aef5aa2%40googlegroups.com.
For more options, visit https://groups.google.com/d/optout.

ajay krishnan

Sep 26, 2016, 3:27:08 PM
to Aurelius
Hi Daniel,

Thank you for the quick reply.

I am using Java for this application. The link that you shared talks about a Groovy script for the ScriptInputFormat. Is there a way I can replace that with a Java class? If yes, can you please give me an example?

In openflights-tp3.properties, gremlin.hadoop.outputLocation=output is specified. Since the data must go into Cassandra, what should I configure here?

Regards
Ajay

ajay krishnan

Sep 26, 2016, 3:30:26 PM
to Aurelius
Just to add, the GraphSON file, which is my input, is Tinkerpop 3.x compliant

Daniel Kuppitz

Sep 27, 2016, 5:44:10 AM
to aureliu...@googlegroups.com
> Just to add, the GraphSON file, which is my input, is Tinkerpop 3.x compliant

In this case you won't need ScriptInputFormat, and thus you also won't need the Groovy script file (which, by the way, could easily be used by your Java application).

> In the openflights-tp3.properties, gremlin.hadoop.outputLocation=output is specified. Since the data must go into cassandra, what should i configure here?

The output location only specifies where side-effects get written.

Cheers,
Daniel
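
To make the two answers above concrete, here is a minimal sketch of the Hadoop graph configuration with GraphSONInputFormat in place of ScriptInputFormat, a local input file, and a local Spark master. It is built with plain java.util.Properties so it runs without TinkerPop on the classpath. The class name HadoopGraphConfig and the path data/graph.json are hypothetical, and the key names follow the TinkerPop 3.0.x hadoop-gremlin conventions, so they should be checked against openflights-tp3.properties from the demo. Note that gremlin.hadoop.outputLocation stays a plain directory for side-effects; the Titan/Cassandra target is not set here but given to BulkLoaderVertexProgram as a separate write-graph configuration.

```java
import java.io.IOException;
import java.io.StringWriter;
import java.util.Properties;

/** Sketch: HadoopGraph settings for reading a local GraphSON file with Spark. */
final class HadoopGraphConfig {

    /** inputLocation is a local path; sparkMaster is e.g. "local[4]". */
    static Properties build(String inputLocation, String sparkMaster) {
        Properties p = new Properties();
        p.setProperty("gremlin.graph",
                "org.apache.tinkerpop.gremlin.hadoop.structure.HadoopGraph");
        // GraphSON input, so no ScriptInputFormat and no Groovy script.
        p.setProperty("gremlin.hadoop.graphInputFormat",
                "org.apache.tinkerpop.gremlin.hadoop.structure.io.graphson.GraphSONInputFormat");
        // The loaded graph is not written back out as a file.
        p.setProperty("gremlin.hadoop.graphOutputFormat",
                "org.apache.hadoop.mapreduce.lib.output.NullOutputFormat");
        p.setProperty("gremlin.hadoop.inputLocation", inputLocation);
        // Only side-effects are written here, not the graph itself.
        p.setProperty("gremlin.hadoop.outputLocation", "output");
        // Spark running inside the local JVM; no Hadoop installation assumed.
        p.setProperty("spark.master", sparkMaster);
        p.setProperty("spark.serializer", "org.apache.spark.serializer.KryoSerializer");
        return p;
    }

    public static void main(String[] args) throws IOException {
        // Print the configuration in .properties syntax.
        StringWriter out = new StringWriter();
        build("data/graph.json", "local[4]").store(out, "hadoop-graphson.properties (sketch)");
        System.out.print(out);
    }
}
```

The spark.serializer entry is probably the KryoSerializer that the earlier posts refer to: it configures how Spark serializes data internally and does not determine the input file format.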

