# input graph parameters
faunus.graph.input.format=com.thinkaurelius.faunus.formats.edgelist.rdf.RDFInputFormat
faunus.graph.input.rdf.format=n-triples
faunus.graph.input.rdf.as-properties=http://www.w3.org/1999/02/22-rdf-syntax-ns#type,http://www.w3.org/2000/01/rdf-schema#label
faunus.graph.input.rdf.use-localname=true
faunus.graph.input.rdf.literal-as-property=true
faunus.input.location=LargeGraph.nt

# output data parameters
faunus.graph.output.format=com.thinkaurelius.faunus.formats.titan.hbase.TitanHBaseOutputFormat
faunus.graph.output.titan.storage.backend=hbase
faunus.graph.output.titan.storage.hostname=Zookeeper1, Zookeeper2
faunus.graph.output.titan.storage.port=2181
faunus.graph.output.titan.storage.tablename=titan
faunus.graph.output.titan.storage.batch-loading=true
faunus.graph.output.titan.ids.block-size=1000000
# long
faunus.graph.output.titan.storage.idauthority-wait-time=1000
# long
faunus.graph.output.titan.ids.renew-timeout=21474836400000
# int
faunus.graph.output.titan.ids.idauthority-retries=2147483640
faunus.graph.output.titan.infer-schema=true
faunus.graph.output.titan.ids.num-partitions=7
faunus.graph.output.titan.ids.partition=true
faunus.output.location=output
faunus.output.location.overwrite=true

mapreduce.linerecordreader.maxlength=5242880
mapreduce.input.fileinputformat.split.maxsize=5242880
mapreduce.map.memory.mb=4098
mapreduce.reduce.memory.mb=81920
mapreduce.map.java.opts=-Xmx4G
mapreduce.reduce.java.opts=-Xmx80G
mapreduce.job.reuse.jvm.num.tasks=-1
mapreduce.job.maxtaskfailures.per.tracker=256
mapreduce.map.maxattempts=128
mapreduce.reduce.maxattempts=128
mapreduce.job.reduces=70
mapreduce.job.maps=120
mapreduce.task.timeout=54000000
17:02:53 INFO mapreduce.Job: Task Id : attempt_1407466272407_0027_r_000006_0, Status : FAILED
Error: java.lang.RuntimeException: java.lang.OutOfMemoryError: Requested array size exceeds VM limit
    at org.apache.hadoop.mapreduce.lib.chain.Chain.joinAllThreads(Chain.java:526)
    at org.apache.hadoop.mapreduce.lib.chain.ChainReducer.run(ChainReducer.java:218)
    at org.apache.hadoop.mapred.ReduceTask.runNewReducer(ReduceTask.java:627)
    at org.apache.hadoop.mapred.ReduceTask.run(ReduceTask.java:389)
    at org.apache.hadoop.mapred.YarnChild$2.run(YarnChild.java:167)
    at java.security.AccessController.doPrivileged(Native Method)
    at javax.security.auth.Subject.doAs(Subject.java:422)
    at org.apache.hadoop.security.UserGroupInformation.doAs(UserGroupInformation.java:1557)
    at org.apache.hadoop.mapred.YarnChild.main(YarnChild.java:162)
Caused by: java.lang.OutOfMemoryError: Requested array size exceeds VM limit
    at java.util.Arrays.copyOf(Arrays.java:3230)
    at java.io.ByteArrayOutputStream.grow(ByteArrayOutputStream.java:113)
    at java.io.ByteArrayOutputStream.ensureCapacity(ByteArrayOutputStream.java:93)
    at java.io.ByteArrayOutputStream.write(ByteArrayOutputStream.java:140)
    at java.io.DataOutputStream.write(DataOutputStream.java:107)
    at java.io.FilterOutputStream.write(FilterOutputStream.java:97)
    at com.thinkaurelius.faunus.FaunusElement$ElementProperties.write(FaunusElement.java:248)
    at com.thinkaurelius.faunus.FaunusElement.write(FaunusElement.java:217)
    at com.thinkaurelius.faunus.FaunusEdge.writeCompressed(FaunusEdge.java:108)
    at com.thinkaurelius.faunus.FaunusVertex$EdgeMap.write(FaunusVertex.java:396)
    at com.thinkaurelius.faunus.FaunusVertex.write(FaunusVertex.java:281)
    at org.apache.hadoop.io.serializer.WritableSerialization$WritableSerializer.serialize(WritableSerialization.java:98)
    at org.apache.hadoop.io.serializer.WritableSerialization$WritableSerializer.serialize(WritableSerialization.java:82)
    at org.apache.hadoop.util.ReflectionUtils.copy(ReflectionUtils.java:292)
    at org.apache.hadoop.mapreduce.lib.chain.Chain$ChainRecordWriter.writeToQueue(Chain.java:264)
    at org.apache.hadoop.mapreduce.lib.chain.Chain$ChainRecordWriter.write(Chain.java:252)
    at org.apache.hadoop.mapreduce.lib.chain.ChainReduceContextImpl.write(ChainReduceContextImpl.java:103)
    at org.apache.hadoop.mapreduce.lib.reduce.WrappedReducer$Context.write(WrappedReducer.java:105)
    at com.thinkaurelius.faunus.formats.edgelist.EdgeListInputMapReduce$Reduce.reduce(EdgeListInputMapReduce.java:127)
    at com.thinkaurelius.faunus.formats.edgelist.EdgeListInputMapReduce$Reduce.reduce(EdgeListInputMapReduce.java:112)
    at org.apache.hadoop.mapreduce.Reducer.run(Reducer.java:171)
    at org.apache.hadoop.mapreduce.lib.chain.Chain$ReduceRunner.run(Chain.java:349)
mapreduce.reduce.memory.mb=81920
mapreduce.reduce.java.opts=-Xmx80G
Hi Dan,
Thanks for your help.
>40 million edges consuming 80 GB of RAM implies a worst
>case of 2 GB / edge, which is insane.
Correct me if I am wrong, but I don't think that is the case. I believe we are simply limited by the maximum size of a Java array (~2 GB): 40 million edges consume more than 2 GB in total, so the array-backed DataOutput is unable to allocate an array large enough to hold them.
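As a sanity check on that claim, the per-edge budget before a single array overflows is tiny (my own back-of-the-envelope arithmetic, not measured from Faunus):

```java
// Back-of-the-envelope arithmetic (my numbers, not from Faunus): the largest
// Java array holds Integer.MAX_VALUE bytes, so a single serialized vertex
// with 40 million edges gets only ~53 bytes per edge before allocation fails.
public class EdgeBudget {
    public static void main(String[] args) {
        long maxArrayBytes = Integer.MAX_VALUE;    // ~2 GB array ceiling
        long edges = 40_000_000L;                  // edges on the super-node
        long bytesPerEdge = maxArrayBytes / edges; // budget per edge
        System.out.println(bytesPerEdge);          // 53
    }
}
```

Even with tiny properties, 53 bytes per serialized edge is easy to exceed once ids, labels and a ~100-character URI name are written out.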
IMHO, when FaunusVertex tries to write its data, it writes to a DataOutput that is backed by a byte array and is therefore limited to 2 GB, and it throws as soon as it tries to save more than 2 GB of data.
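To make that failure mode concrete, here is a small sketch of the doubling strategy an array-backed buffer uses (modeled on the JDK's ByteArrayOutputStream.grow(); this is my own illustration, not the actual Faunus code, and the 3 GB payload size is an assumption):

```java
// Sketch: an array-backed output buffer doubles its capacity on overflow.
// Once the requested capacity crosses the maximum Java array size
// (Integer.MAX_VALUE), the JVM throws "Requested array size exceeds VM limit".
public class ArrayLimitSketch {

    // Returns the first requested capacity that exceeds the array ceiling
    // while growing toward `target` bytes, or -1 if `target` fits.
    static long firstFailingCapacity(long target) {
        long capacity = 32;                       // default initial buffer size
        while (capacity < target) {
            capacity = Math.max(capacity << 1, capacity + 1); // double on overflow
            if (capacity > Integer.MAX_VALUE) {
                return capacity;                  // -> OutOfMemoryError in practice
            }
        }
        return -1;                                // target fits in one array
    }

    public static void main(String[] args) {
        long superNodeBytes = 3L << 30;           // hypothetical ~3 GB serialized vertex
        System.out.println(firstFailingCapacity(superNodeBytes)); // 2147483648
    }
}
```

Note that the failure happens far below the 80 GB heap: the heap has plenty of room, but no single array can grow past ~2 GB.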
If I look at the process, its private working set never gets anywhere close to 80 GB (I'll keep looking, but I did not see any Java process consuming that much memory in this last phase of the MapReduce job). (Btw, the figure of 40 million edges was obtained by modifying the Faunus code and adding extra logging statements.) There are only two properties on the super-node, about 30 characters each. The edges should not have many properties either: just a name of no more than 100 characters (as it is a URI).
> Silly question: did you configure YARN's resource settings to limit MR to
> only one reducer task at a time?
No, but since I've allowed Hadoop to allocate an insane amount of RAM to the reducer by specifying
mapreduce.reduce.memory.mb=81920
mapreduce.reduce.java.opts=-Xmx80G
per my configuration above, YARN does not start more than a single reducer at a time on a node. If I remove these config values, I see multiple reducers. I don't have any other specific configuration settings for YARN.
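That behaviour follows from the per-node arithmetic: a node whose NodeManager advertises less than 2 × 81920 MB can only ever fit one such reducer container at a time. If one wanted to make this explicit rather than implicit, the standard YARN properties would look something like the following (the property names are standard YARN; the 96 GB node size is purely illustrative):

```xml
<!-- Illustrative values only: a node advertising 96 GB of usable memory can
     host exactly one 81920 MB reducer container, matching the behaviour above. -->
<property>
  <name>yarn.nodemanager.resource.memory-mb</name>
  <value>98304</value> <!-- 96 GB usable per node -->
</property>
<property>
  <name>yarn.scheduler.maximum-allocation-mb</name>
  <value>81920</value> <!-- allow the 80 GB reducer container -->
</property>
```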