Hello All,
I'm using Titan 0.5.4 with Cassandra 2.0.14 and trying to run MapReduce jobs on it using Cloudera CDH5. It's a user:user and user:interest graph with close to 16M users and about 1,000 interests; some interests can have as many as 2M users, and these numbers will grow. The graph itself isn't huge: with a replication factor of 3, the 4-node Cassandra cluster holds about 28GB on disk in total.
I have three YARN NodeManagers, configured with the following limits:
yarn-site.xml
     <property>
        <name>yarn.nodemanager.resource.memory-mb</name>
        <value>16000</value>
    </property>
    <property>
        <name>yarn.scheduler.minimum-allocation-mb</name>
        <value>1000</value>
    </property>
mapred-site.xml
    <property>
        <name>mapreduce.map.memory.mb</name>
        <value>6000</value>
    </property>
    <property>
        <name>mapreduce.reduce.memory.mb</name>
        <value>6000</value>
    </property>
    <property>
        <name>mapreduce.map.java.opts</name>
        <value>-Xmx4096m</value>
    </property>
    <property>
        <name>mapreduce.reduce.java.opts</name>
        <value>-Xmx4096m</value>
    </property>
and on the client I've set the following properties:
mapreduce.map.memory.mb=6000
mapreduce.reduce.memory.mb=6000
mapred.map.child.java.opts=-Xmx4096m
mapred.reduce.child.java.opts=-Xmx4096m
mapred.max.split.size=5242880
mapred.job.reuse.jvm.num.tasks=-1
titan.hadoop.input.format=com.thinkaurelius.titan.hadoop.formats.cassandra.TitanCassandraInputFormat
titan.hadoop.input.conf.storage.backend=cassandrathrift
titan.hadoop.input.conf.storage.hostname=lp1,lp3
titan.hadoop.input.conf.storage.port=9160
titan.hadoop.input.conf.storage.cassandra.keyspace=lgpgelsgraph
cassandra.input.partitioner.class=org.apache.cassandra.dht.Murmur3Partitioner
cassandra.input.split.size=16384
cassandra.thrift.framed.size_mb=499
cassandra.thrift.message.max_size_mb=500
titan.hadoop.sideeffect.format=org.apache.hadoop.mapreduce.lib.output.TextOutputFormat
titan.hadoop.output.format=com.thinkaurelius.titan.hadoop.formats.noop.NoOpOutputFormat
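For reference, this is how I load those properties and kick off a job from the Gremlin shell (the file name below is just where I keep them locally):

// load the titan-hadoop properties above into a HadoopGraph
g = HadoopFactory.open('conf/titan-cassandra-input.properties')
// a simple global traversal; even jobs like this hit the GC errors below
g.V.count()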
When running simple Gremlin queries I quite often get errors like this:
10:11:26 INFO  org.apache.hadoop.mapreduce.Job  - Task Id : attempt_1430217400643_0041_m_000179_2, Status : FAILED
Error: java.lang.RuntimeException: java.lang.OutOfMemoryError: GC overhead limit exceeded
    at org.apache.hadoop.mapreduce.lib.chain.Chain.joinAllThreads(Chain.java:526)
    at org.apache.hadoop.mapreduce.lib.chain.ChainMapper.run(ChainMapper.java:169)
    at org.apache.hadoop.mapred.MapTask.runNewMapper(MapTask.java:787)
    at org.apache.hadoop.mapred.MapTask.run(MapTask.java:341)
    at org.apache.hadoop.mapred.YarnChild$2.run(YarnChild.java:163)
    at java.security.AccessController.doPrivileged(Native Method)
    at javax.security.auth.Subject.doAs(Subject.java:415)
    at org.apache.hadoop.security.UserGroupInformation.doAs(UserGroupInformation.java:1671)
    at org.apache.hadoop.mapred.YarnChild.main(YarnChild.java:158)
Caused by: java.lang.OutOfMemoryError: GC overhead limit exceeded
    at com.google.common.collect.Sets.newHashSetWithExpectedSize(Sets.java:194)
    at com.google.common.collect.HashMultimap.createCollection(HashMultimap.java:114)
    at com.google.common.collect.HashMultimap.createCollection(HashMultimap.java:49)
    at com.google.common.collect.AbstractMultimap.createCollection(AbstractMultimap.java:156)
    at com.google.common.collect.AbstractMultimap.getOrCreateCollection(AbstractMultimap.java:214)
    at com.google.common.collect.AbstractMultimap.put(AbstractMultimap.java:201)
    at com.google.common.collect.AbstractSetMultimap.put(AbstractSetMultimap.java:117)
    at com.google.common.collect.HashMultimap.put(HashMultimap.java:49)
    at com.thinkaurelius.titan.hadoop.FaunusSerializer.readEdges(FaunusSerializer.java:252)
    at com.thinkaurelius.titan.hadoop.FaunusSerializer.readElement(FaunusSerializer.java:143)
    at com.thinkaurelius.titan.hadoop.FaunusSerializer.readPathElement(FaunusSerializer.java:119)
    at com.thinkaurelius.titan.hadoop.FaunusSerializer.readEdges(FaunusSerializer.java:218)
    at com.thinkaurelius.titan.hadoop.FaunusSerializer.readVertex(FaunusSerializer.java:76)
    at com.thinkaurelius.titan.hadoop.FaunusVertex.readFields(FaunusVertex.java:336)
    at org.apache.hadoop.io.serializer.WritableSerialization$WritableDeserializer.deserialize(WritableSerialization.java:71)
    at org.apache.hadoop.io.serializer.WritableSerialization$WritableDeserializer.deserialize(WritableSerialization.java:42)
    at org.apache.hadoop.util.ReflectionUtils.copy(ReflectionUtils.java:296)
    at org.apache.hadoop.mapreduce.lib.chain.Chain$ChainRecordWriter.writeToQueue(Chain.java:264)
    at org.apache.hadoop.mapreduce.lib.chain.Chain$ChainRecordWriter.write(Chain.java:252)
    at org.apache.hadoop.mapreduce.lib.chain.ChainMapContextImpl.write(ChainMapContextImpl.java:110)
    at org.apache.hadoop.mapreduce.lib.map.WrappedMapper$Context.write(WrappedMapper.java:112)
    at com.thinkaurelius.titan.hadoop.mapreduce.transform.VerticesMap$Map.map(VerticesMap.java:59)
    at com.thinkaurelius.titan.hadoop.mapreduce.transform.VerticesMap$Map.map(VerticesMap.java:36)
    at org.apache.hadoop.mapreduce.Mapper.run(Mapper.java:145)
    at org.apache.hadoop.mapreduce.lib.chain.Chain$MapRunner.run(Chain.java:321)
I also think that running an identity mapper (g._) once to move the data to HDFS, and then running subsequent jobs directly from HDFS, would be a good idea, but so far the identity mapper has never run successfully; it always dies with GC overhead exceptions. What I was planning is sketched below.
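Roughly this (a sketch: the properties path is mine, and the SequenceFile re-read settings are my reading of the titan-hadoop docs, which I haven't been able to verify since the first step never completes):

// step 1: identity-map the whole graph out of Cassandra into the job's HDFS output
g = HadoopFactory.open('conf/titan-cassandra-input.properties')
g._()
// step 2: point later jobs at that output instead of Cassandra, e.g. by swapping
// the input settings in the properties file:
//   titan.hadoop.input.format=org.apache.hadoop.mapreduce.lib.input.SequenceFileInputFormat
//   titan.hadoop.input.location=output/job-0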
Sometimes I also get errors like:
10:21:29 INFO  org.apache.hadoop.mapreduce.Job  - Task Id : attempt_1430217400643_0041_m_000554_0, Status : FAILED
Error: java.lang.IllegalArgumentException: Could not instantiate implementation: com.thinkaurelius.titan.hadoop.formats.util.input.current.TitanHadoopSetupImpl
    at com.thinkaurelius.titan.util.system.ConfigurationUtil.instantiate(ConfigurationUtil.java:55)
    at com.thinkaurelius.titan.hadoop.formats.util.TitanInputFormat.getGraphSetup(TitanInputFormat.java:49)
    at com.thinkaurelius.titan.hadoop.formats.cassandra.TitanCassandraRecordReader.initialize(TitanCassandraRecordReader.java:44)
    at org.apache.hadoop.mapred.MapTask$NewTrackingRecordReader.initialize(MapTask.java:548)
    at org.apache.hadoop.mapred.MapTask.runNewMapper(MapTask.java:786)
    at org.apache.hadoop.mapred.MapTask.run(MapTask.java:341)
    at org.apache.hadoop.mapred.YarnChild$2.run(YarnChild.java:163)
    at java.security.AccessController.doPrivileged(Native Method)
    at javax.security.auth.Subject.doAs(Subject.java:415)
    at org.apache.hadoop.security.UserGroupInformation.doAs(UserGroupInformation.java:1671)
    at org.apache.hadoop.mapred.YarnChild.main(YarnChild.java:158)
Caused by: java.lang.reflect.InvocationTargetException
    at sun.reflect.NativeConstructorAccessorImpl.newInstance0(Native Method)
    at sun.reflect.NativeConstructorAccessorImpl.newInstance(NativeConstructorAccessorImpl.java:57)
    at sun.reflect.DelegatingConstructorAccessorImpl.newInstance(DelegatingConstructorAccessorImpl.java:45)
    at java.lang.reflect.Constructor.newInstance(Constructor.java:526)
    at com.thinkaurelius.titan.util.system.ConfigurationUtil.instantiate(ConfigurationUtil.java:44)
    ... 10 more
Caused by: com.thinkaurelius.titan.core.TitanException: A Titan graph with the same instance id [ac14151c32106-nmc-lp31] is already open. Might required forced shutdown.
    at com.thinkaurelius.titan.graphdb.database.StandardTitanGraph.<init>(StandardTitanGraph.java:133)
    at com.thinkaurelius.titan.core.TitanFactory.open(TitanFactory.java:93)
    at com.thinkaurelius.titan.core.TitanFactory.open(TitanFactory.java:83)
    at com.thinkaurelius.titan.hadoop.formats.util.input.current.TitanHadoopSetupImpl.<init>(TitanHadoopSetupImpl.java:39)
    ... 15 more
I've exited the Gremlin shell without calling g.shutdown() during some of the older runs; could that be the issue? Is there a way to find all open graph instances and shut them down? The sketch below is what I was planning to try.
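This assumes the management calls that, as far as I know, were added in the 0.5.x line; the properties path is just my local one, and the instance id is the one from the stack trace above:

// open a regular (non-Hadoop) handle to the same keyspace
g = TitanFactory.open('conf/titan-cassandra.properties')
mgmt = g.getManagementSystem()
// list every instance id Titan has registered; the current one is marked '(current)'
mgmt.getOpenInstances()
// evict the stale registration left behind by a shell that died without g.shutdown()
mgmt.forceCloseInstance('ac14151c32106-nmc-lp31')
mgmt.commit()
g.shutdown()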
Thanks & Regards,
Apoorva