KiteSDK CLI throwing 'java.lang.OutOfMemoryError: GC overhead limit exceeded' when loading JSON data

Sree Pratheep

Jun 1, 2015, 2:56:42 AM
to cdk...@cloudera.org
We are trying to import JSON data with around 200,000 entries from a file into a Hive dataset using the following command, but we are getting an OutOfMemoryError:
./kite-dataset json-import abc.txt abc

It works when we try to load around 100,000 entries. We couldn't find how to increase the Java heap size. Can someone tell us how to increase the heap size when running the kite-dataset command?

We get the following OutOfMemoryError:
bash-4.1# ./kite-dataset json-import abc.txt abc     
SLF4J: Class path contains multiple SLF4J bindings.
SLF4J: Found binding in [jar:file:/usr/hdp/2.2.0.0-2041/hadoop/lib/slf4j-log4j12-1.7.5.jar!/org/slf4j/impl/StaticLoggerBinder.class]
SLF4J: Found binding in [jar:file:/usr/hdp/2.2.0.0-2041/hive/lib/hive-jdbc-0.14.0.2.2.0.0-2041-standalone.jar!/org/slf4j/impl/StaticLoggerBinder.class]
SLF4J: Found binding in [jar:file:/usr/hdp/2.2.0.0-2041/zookeeper/lib/slf4j-log4j12-1.6.1.jar!/org/slf4j/impl/StaticLoggerBinder.class]
SLF4J: Actual binding is of type [org.slf4j.impl.Log4jLoggerFactory]
Exception in thread "main" java.lang.OutOfMemoryError: GC overhead limit exceeded
at com.fasterxml.jackson.databind.node.TextNode.valueOf(TextNode.java:43)
at com.fasterxml.jackson.databind.node.JsonNodeFactory.textNode(JsonNodeFactory.java:273)
at com.fasterxml.jackson.databind.deser.std.BaseNodeDeserializer.deserializeObject(JsonNodeDeserializer.java:210)
at com.fasterxml.jackson.databind.deser.std.JsonNodeDeserializer.deserialize(JsonNodeDeserializer.java:59)
at com.fasterxml.jackson.databind.deser.std.JsonNodeDeserializer.deserialize(JsonNodeDeserializer.java:15)
at com.fasterxml.jackson.databind.MappingIterator.nextValue(MappingIterator.java:189)
at com.fasterxml.jackson.databind.MappingIterator.next(MappingIterator.java:120)
at org.kitesdk.shaded.com.google.common.collect.Iterators$8.next(Iterators.java:811)
at org.kitesdk.data.spi.filesystem.JSONFileReader.next(JSONFileReader.java:121)
at org.kitesdk.shaded.com.google.common.collect.Iterators$7.computeNext(Iterators.java:648)
at org.kitesdk.shaded.com.google.common.collect.AbstractIterator.tryToComputeNext(AbstractIterator.java:143)
at org.kitesdk.shaded.com.google.common.collect.AbstractIterator.hasNext(AbstractIterator.java:138)
at org.kitesdk.data.spi.filesystem.MultiFileDatasetReader.hasNext(MultiFileDatasetReader.java:125)
at com.google.common.collect.Lists.newArrayList(Lists.java:138)
at com.google.common.collect.ImmutableList.copyOf(ImmutableList.java:256)
at com.google.common.collect.ImmutableList.copyOf(ImmutableList.java:217)
at org.apache.crunch.impl.mem.collect.MemCollection.<init>(MemCollection.java:76)
at org.apache.crunch.impl.mem.MemPipeline.read(MemPipeline.java:151)
at org.kitesdk.tools.TransformTask.run(TransformTask.java:135)
at org.kitesdk.cli.commands.JSONImportCommand.run(JSONImportCommand.java:144)
at org.kitesdk.cli.Main.run(Main.java:178)
at org.apache.hadoop.util.ToolRunner.run(ToolRunner.java:70)
at org.kitesdk.cli.Main.main(Main.java:256)
at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method)
at sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:57)
at sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43)
at java.lang.reflect.Method.invoke(Method.java:606)
at org.apache.hadoop.util.RunJar.run(RunJar.java:221)
at org.apache.hadoop.util.RunJar.main(RunJar.java:136)

Thanks,
Sree Pratheep

Joey Echeverria

Jun 1, 2015, 9:36:09 AM
to Sree Pratheep, cdk...@cloudera.org
Hi Sree,

You can set JVM flags by setting the flags environment variable before
running the CLI. For example:

export flags="-Xmx2048m"
kite-dataset ...

- or -

flags="-Xmx2048m" kite-dataset ...

The environment variables you can use to configure the CLI are documented here:

http://kitesdk.org/docs/1.0.0/cli-reference.html#general

-Joey

--
Joey Echeverria
Senior Infrastructure Engineer

Ryan Blue

Jun 1, 2015, 11:56:56 AM
to Sree Pratheep, cdk...@cloudera.org
Joey's fix is a good one if you have the memory for it, but another
work-around is to put the file you're importing in HDFS. Then we will
use an MR job that doesn't have the memory problem.
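
For example, something like this (a sketch; the /tmp path is just a
placeholder, use whatever HDFS location suits your cluster):

hdfs dfs -put abc.txt /tmp/abc.txt
./kite-dataset json-import hdfs:/tmp/abc.txt abc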

The cause of this problem is that we were using Crunch's MemPipeline for
local files, which will only run one stage at a time and will keep
everything in memory. So it will do the conversion, keeping all records
in memory, and then write them to disk. This is CDK-898 [1].

We're fixing this in 1.1.0 and using the LocalJobRunner rather than a
MemPipeline. That will run copy or import tasks from local data as they
would run on a cluster, which uses much less memory.

rb

[1]: https://issues.cloudera.org/browse/CDK-898
--
Ryan Blue
Software Engineer
Cloudera, Inc.

ஸ்ரீ பிரதீப்

Jun 2, 2015, 1:50:14 AM
to cdk...@cloudera.org
Thanks, Joey, for the reply. We tried to set the flags environment variable, but it is not working. We got the following error:

bash-4.1# export flags="-Xmx2048m"                    
bash-4.1# ./kite-dataset json-import abc.txt abc
Exception in thread "main" java.lang.ClassNotFoundException: -Xmx2048m
at java.net.URLClassLoader$1.run(URLClassLoader.java:366)
at java.net.URLClassLoader$1.run(URLClassLoader.java:355)
at java.security.AccessController.doPrivileged(Native Method)
at java.net.URLClassLoader.findClass(URLClassLoader.java:354)
at java.lang.ClassLoader.loadClass(ClassLoader.java:425)
at java.lang.ClassLoader.loadClass(ClassLoader.java:358)
at java.lang.Class.forName0(Native Method)
at java.lang.Class.forName(Class.java:270)
at org.apache.hadoop.util.RunJar.run(RunJar.java:214)
at org.apache.hadoop.util.RunJar.main(RunJar.java:136)
bash-4.1#

We also tried setting HADOOP_CLIENT_OPTS='-Xmx4096m'. That still throws the same out-of-memory error; the Java process still runs with a 1024m heap.

-Sree Pratheep

Ryan Blue

Jun 2, 2015, 12:10:29 PM
to ஸ்ரீ பிரதீப், cdk...@cloudera.org
Hi Sree,

Looks like there's something wrong with the "flags" variable that we
need to fix. Sorry about that.

Did you try running with the file in HDFS instead of on local disk? I
think that is another way to fix this.

rb

Rafi Syed

Jun 4, 2015, 1:58:13 AM
to cdk...@cloudera.org
Hi Ryan,
I'm also facing the same issue. I've tried using data in HDFS, but I'm getting the following error. Please help.


bash-4.1# ./kite-dataset json-import hdfs:/tmp hungry
SLF4J: Class path contains multiple SLF4J bindings.
SLF4J: Found binding in [jar:file:/usr/hdp/2.2.0.0-2041/hadoop/lib/slf4j-log4j12-1.7.5.jar!/org/slf4j/impl/StaticLoggerBinder.class]
SLF4J: Found binding in [jar:file:/usr/hdp/2.2.0.0-2041/hive/lib/hive-jdbc-0.14.0.2.2.0.0-2041-standalone.jar!/org/slf4j/impl/StaticLoggerBinder.class]
SLF4J: Found binding in [jar:file:/usr/hdp/2.2.0.0-2041/zookeeper/lib/slf4j-log4j12-1.6.1.jar!/org/slf4j/impl/StaticLoggerBinder.class]
SLF4J: Actual binding is of type [org.slf4j.impl.Log4jLoggerFactory]
IO error: Cannot add jar path to distributed cache: /usr/hdp/2.2.0.0-2041/hive/lib

Ryan Blue

Jun 4, 2015, 12:38:45 PM
to Rafi Syed, cdk...@cloudera.org
Rafi,

Can you run that command with the verbose flag, -v (just after
kite-dataset), to get the full error message? It looks like it might
be a permissions problem.
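
For example:

./kite-dataset -v json-import hdfs:/tmp hungry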

rb

Rafi Syed

Jun 5, 2015, 2:10:11 AM
to cdk...@cloudera.org
Hi Ryan,
Please find the logs below:

bash-4.1# ./kite-dataset -v json-import hdfs:/tmp hungry    
SLF4J: Class path contains multiple SLF4J bindings.
SLF4J: Found binding in [jar:file:/usr/hdp/2.2.0.0-2041/hadoop/lib/slf4j-log4j12-1.7.5.jar!/org/slf4j/impl/StaticLoggerBinder.class]
SLF4J: Found binding in [jar:file:/usr/hdp/2.2.0.0-2041/hive/lib/hive-jdbc-0.14.0.2.2.0.0-2041-standalone.jar!/org/slf4j/impl/StaticLoggerBinder.class]
SLF4J: Found binding in [jar:file:/usr/hdp/2.2.0.0-2041/zookeeper/lib/slf4j-log4j12-1.6.1.jar!/org/slf4j/impl/StaticLoggerBinder.class]
SLF4J: Actual binding is of type [org.slf4j.impl.Log4jLoggerFactory]
IO error
org.kitesdk.data.DatasetIOException: Cannot add jar path to distributed cache: /usr/hdp/2.2.0.0-2041/hive/lib
at org.kitesdk.tools.TaskUtil$ConfigBuilder.addJarPathForClass(TaskUtil.java:129)
at org.kitesdk.tools.TransformTask.run(TransformTask.java:108)
at org.kitesdk.cli.commands.JSONImportCommand.run(JSONImportCommand.java:144)
at org.kitesdk.cli.Main.run(Main.java:178)
at org.apache.hadoop.util.ToolRunner.run(ToolRunner.java:70)
at org.kitesdk.cli.Main.main(Main.java:256)
at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method)
at sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:57)
at sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43)
at java.lang.reflect.Method.invoke(Method.java:606)
at org.apache.hadoop.util.RunJar.run(RunJar.java:221)
at org.apache.hadoop.util.RunJar.main(RunJar.java:136)
Caused by: java.io.IOException: Jar file: /usr/hdp/2.2.0.0-2041/hive/lib/ojdbc6.jar does not exist.
at org.apache.crunch.util.DistCache.addJarToDistributedCache(DistCache.java:115)
at org.apache.crunch.util.DistCache.addJarDirToDistributedCache(DistCache.java:208)
at org.apache.crunch.util.DistCache.addJarDirToDistributedCache(DistCache.java:229)
at org.kitesdk.tools.TaskUtil$ConfigBuilder.addJarPathForClass(TaskUtil.java:127)
... 11 more

Erin Dogan

Jun 5, 2015, 12:10:54 PM
to cdk...@cloudera.org
Sree,

I would verify that the ojdbc jar is actually in that location. I ran into this same issue and the jar was not there. I fixed this by downloading the jar from Oracle and putting it in the expected location. This, however, didn't resolve my issues, as I then ran into the CopyTask job failing:

 job failure(s) occurred:
org.kitesdk.tools.CopyTask: Kite(dataset:hdfs://sandbox.hortonworks.com:8020/tmp/d08d... ID=1 (1/1)(1): Job failed!


logs:
2015-06-05 05:44:04,865 INFO  jobhistory.JobSummary (HistoryFileManager.java:moveToDone(372)) - jobId=job_1433477092849_0001,submitTime=1433482990017,launchTime=1433483001006,firstMapTaskLaunchTime=1433483003858,firstReduceTaskLaunchTime=0,finishTime=1433483027484,resourcesPerMap=250,resourcesPerReduce=250,numMaps=1,numReduces=1,user=root,queue=default,status=FAILED,mapSlotSeconds=17,reduceSlotSeconds=0,jobName=org.kitesdk.tools.CopyTask: Kite(dataset:hdfs://sandbox.hortonworks.com:8020/tmp/21f5... ID\=1 (1/1)

Doesn't really tell me why it failed.

Ryan Blue

Jun 5, 2015, 12:43:36 PM
to Rafi Syed, cdk...@cloudera.org
Rafi,

It looks like /usr/hdp/2.2.0.0-2041/hive/lib/ojdbc6.jar is probably a
broken symlink. How else would a file you can list not exist, right?

I'd look into that file more. Kite adds Hive to the distributed cache by
adding everything in the Hive lib directory. If it finds a broken
symlink, then it makes sense that it would fail. I think it should work
without ojdbc6.jar so you might be able to simply remove the symlink.
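
For example, a quick check (a sketch, using the path from your stack
trace):

ls -l /usr/hdp/2.2.0.0-2041/hive/lib/ojdbc6.jar
# if the link target doesn't exist, the symlink is broken; removing it
# should be safe as long as nothing else needs the Oracle JDBC driver
rm /usr/hdp/2.2.0.0-2041/hive/lib/ojdbc6.jar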

The problem with that approach is that a broken symlink indicates some
other issue that you should also look into. Maybe you need another
package installed that provides it, or maybe the Hive package you're
using has a bug. I'd contact your Hadoop vendor to find out, and please
let us know on this list what you find so others can get past this problem.

Thanks!

rb

Rafi Syed

Jun 16, 2015, 6:58:07 AM
to cdk...@cloudera.org, rafis...@gmail.com
FYI, I still couldn't load the data into Hive using json-import, even from HDFS. I removed the symbolic link, and now I am getting the same error that Erin is getting. Following is the full output:

bash-4.1# ./kite-dataset -v json-import hdfs://integration.mycorp.kom:8020/tmp/hungry.txt hungry
SLF4J: Class path contains multiple SLF4J bindings.
SLF4J: Found binding in [jar:file:/usr/hdp/2.2.0.0-2041/hadoop/lib/slf4j-log4j12-1.7.5.jar!/org/slf4j/impl/StaticLoggerBinder.class]
SLF4J: Found binding in [jar:file:/usr/hdp/2.2.0.0-2041/hive/lib/hive-jdbc-0.14.0.2.2.0.0-2041-standalone.jar!/org/slf4j/impl/StaticLoggerBinder.class]
SLF4J: Found binding in [jar:file:/usr/hdp/2.2.0.0-2041/zookeeper/lib/slf4j-log4j12-1.6.1.jar!/org/slf4j/impl/StaticLoggerBinder.class]
SLF4J: Actual binding is of type [org.slf4j.impl.Log4jLoggerFactory]
1 job failure(s) occurred:
org.kitesdk.tools.CopyTask: Kite(dataset:hdfs://integration.mycorp.kom:8020/tmp/defau... ID=1 (1/1)(1): Job failed!


I'm getting the following errors in the MapReduce job logs:
2015-06-16 06:46:53,947 INFO [Socket Reader #1 for port 43038] SecurityLogger.org.apache.hadoop.ipc.Server: Auth successful for job_1434446597569_0003 (auth:SIMPLE)
2015-06-16 06:46:53,958 INFO [IPC Server handler 2 on 43038] org.apache.hadoop.mapred.TaskAttemptListenerImpl: JVM with ID : jvm_1434446597569_0003_m_000004 asked for a task
2015-06-16 06:46:53,958 INFO [IPC Server handler 2 on 43038] org.apache.hadoop.mapred.TaskAttemptListenerImpl: JVM with ID: jvm_1434446597569_0003_m_000004 given task: attempt_1434446597569_0003_m_000000_2
2015-06-16 06:46:55,680 FATAL [IPC Server handler 0 on 43038] org.apache.hadoop.mapred.TaskAttemptListenerImpl: Task: attempt_1434446597569_0003_m_000000_2 - exited : com.fasterxml.jackson.core.JsonFactory.requiresPropertyOrdering()Z
2015-06-16 06:46:55,680 INFO [IPC Server handler 0 on 43038] org.apache.hadoop.mapred.TaskAttemptListenerImpl: Diagnostics report from attempt_1434446597569_0003_m_000000_2: Error: com.fasterxml.jackson.core.JsonFactory.requiresPropertyOrdering()Z
2015-06-16 06:46:55,681 INFO [AsyncDispatcher event handler] org.apache.hadoop.mapreduce.v2.app.job.impl.TaskAttemptImpl: Diagnostics report from attempt_1434446597569_0003_m_000000_2: Error: com.fasterxml.jackson.core.JsonFactory.requiresPropertyOrdering()Z
2015-06-16 06:46:55,681 INFO [AsyncDispatcher event handler] org.apache.hadoop.mapreduce.v2.app.job.impl.TaskAttemptImpl: attempt_1434446597569_0003_m_000000_2 TaskAttempt Transitioned from RUNNING to FAIL_CONTAINER_CLEANUP

Thanks,
Rafi

Liam Mooney

Jun 16, 2015, 7:57:52 AM
to cdk...@cloudera.org
Hi Syed,

Can you try:
  • export HADOOP_OPTS=-Xmx2g
It seems to work for me; I have HDP 2.2 and KiteSDK 1.0.0 installed locally.
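
For example, with the same import command as before (a sketch; adjust
the heap size to whatever your data needs):

export HADOOP_OPTS=-Xmx2g
./kite-dataset json-import abc.txt abc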

Thanks,
Liam

Sree Pratheep

Jun 16, 2015, 9:11:22 AM
to cdk...@cloudera.org, sreepr...@gmail.com
Hi Ryan,

Will this be part of the 1.1.0 release? FYI, I ran the binary built locally on my machine from the latest code from https://github.com/kite-sdk/kite and got the following exception:
bash-4.1# ./kite-dataset -v json-import /usr/local/src/hungry.txt hungry
SLF4J: Class path contains multiple SLF4J bindings.
SLF4J: Found binding in [jar:file:/usr/hdp/2.2.0.0-2041/hadoop/lib/slf4j-log4j12-1.7.5.jar!/org/slf4j/impl/StaticLoggerBinder.class]
SLF4J: Found binding in [jar:file:/usr/hdp/2.2.0.0-2041/hive/lib/hive-jdbc-0.14.0.2.2.0.0-2041-standalone.jar!/org/slf4j/impl/StaticLoggerBinder.class]
SLF4J: Found binding in [jar:file:/usr/hdp/2.2.0.0-2041/zookeeper/lib/slf4j-log4j12-1.6.1.jar!/org/slf4j/impl/StaticLoggerBinder.class]
SLF4J: Actual binding is of type [org.slf4j.impl.Log4jLoggerFactory]
1 job failure(s) occurred:
org.kitesdk.tools.CopyTask: Kite(dataset:file:/tmp/default/.temp/7470a17f-2006-42f7-a... ID=1 (1/1)(1): java.io.FileNotFoundException: File file:/hdp/apps/2.2.0.0-2041/mapreduce/mapreduce.tar.gz does not exist
        at org.apache.hadoop.fs.RawLocalFileSystem.deprecatedGetFileStatus(RawLocalFileSystem.java:534)
        at org.apache.hadoop.fs.RawLocalFileSystem.getFileLinkStatusInternal(RawLocalFileSystem.java:747)
        at org.apache.hadoop.fs.RawLocalFileSystem.getFileStatus(RawLocalFileSystem.java:524)
        at org.apache.hadoop.fs.DelegateToFileSystem.getFileStatus(DelegateToFileSystem.java:111)
        at org.apache.hadoop.fs.AbstractFileSystem.resolvePath(AbstractFileSystem.java:460)
        at org.apache.hadoop.fs.FilterFs.resolvePath(FilterFs.java:157)
        at org.apache.hadoop.fs.FileContext$24.next(FileContext.java:2137)
        at org.apache.hadoop.fs.FileContext$24.next(FileContext.java:2133)
        at org.apache.hadoop.fs.FSLinkResolver.resolve(FSLinkResolver.java:90)
        at org.apache.hadoop.fs.FileContext.resolve(FileContext.java:2133)
        at org.apache.hadoop.fs.FileContext.resolvePath(FileContext.java:595)
        at org.apache.hadoop.mapreduce.JobSubmitter.addMRFrameworkToDistributedCache(JobSubmitter.java:753)
        at org.apache.hadoop.mapreduce.JobSubmitter.submitJobInternal(JobSubmitter.java:435)
        at org.apache.hadoop.mapreduce.Job$10.run(Job.java:1296)
        at org.apache.hadoop.mapreduce.Job$10.run(Job.java:1293)
        at java.security.AccessController.doPrivileged(Native Method)
        at javax.security.auth.Subject.doAs(Subject.java:415)
        at org.apache.hadoop.security.UserGroupInformation.doAs(UserGroupInformation.java:1628)
        at org.apache.hadoop.mapreduce.Job.submit(Job.java:1293)
        at org.apache.crunch.hadoop.mapreduce.lib.jobcontrol.CrunchControlledJob.submit(CrunchControlledJob.java:329)
        at org.apache.crunch.hadoop.mapreduce.lib.jobcontrol.CrunchJobControl.startReadyJobs(CrunchJobControl.java:204)
        at org.apache.crunch.hadoop.mapreduce.lib.jobcontrol.CrunchJobControl.pollJobStatusAndStartNewOnes(CrunchJobControl.java:238)
        at org.apache.crunch.impl.mr.exec.MRExecutor.monitorLoop(MRExecutor.java:112)
        at org.apache.crunch.impl.mr.exec.MRExecutor.access$000(MRExecutor.java:55)
        at org.apache.crunch.impl.mr.exec.MRExecutor$1.run(MRExecutor.java:83)
        at java.lang.Thread.run(Thread.java:745)
I am running this against Hadoop in a sequenceiq/ambari Docker image. Let me know if you need any more information.

Thanks,
Sree

Ryan Blue

Jun 16, 2015, 12:22:59 PM
to Sree Pratheep, cdk...@cloudera.org
Sree,

Does file:/hdp/apps/2.2.0.0-2041/mapreduce/mapreduce.tar.gz exist?
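
You can check with:

ls -l /hdp/apps/2.2.0.0-2041/mapreduce/mapreduce.tar.gz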

I'm not sure what's happening with your setup, but I think you might
have a problem with your install like Rafi. I don't think these files
should be missing.

And thanks to Liam for chiming in with help!

rb

Satyam Singh Chandel

Oct 12, 2015, 9:37:42 AM
to CDK Development, sreepr...@gmail.com
Hi,

This thread helped me a lot in fixing issues while importing JSON data into HDFS using the Kite dataset CLI.

Now I am facing an error when executing the below command:

bash-4.1# ./kite-dataset json-import /vagrant/kite/sample.json dataset:hdfs://integcorp.kom:8020/user/falcon/dataset/hgrw


SLF4J: Class path contains multiple SLF4J bindings.
SLF4J: Found binding in [jar:file:/usr/hdp/2.2.0.0-2041/hadoop/lib/slf4j-log4j12-1.7.5.jar!/org/slf4j/impl/StaticLoggerBinder.class]
SLF4J: Found binding in [jar:file:/usr/hdp/2.2.0.0-2041/hive/lib/hive-jdbc-0.14.0.2.2.0.0-2041-standalone.jar!/org/slf4j/impl/StaticLoggerBinder.class]
SLF4J: Found binding in [jar:file:/usr/hdp/2.2.0.0-2041/zookeeper/lib/slf4j-log4j12-1.6.1.jar!/org/slf4j/impl/StaticLoggerBinder.class]
SLF4J: See http://www.slf4j.org/codes.html#multiple_bindings for an explanation.
SLF4J: Actual binding is of type [org.slf4j.impl.Log4jLoggerFactory]
1 job failure(s) occurred:
org.kitesdk.tools.CopyTask: Kite(dataset:file:/tmp/dataset/.temp/1d5a3984-d762-4b16-a... ID=1 (1/1)(1): java.io.FileNotFoundException: File does not exist: hdfs://integcorp.kom:8020/tmp/crunch-2009144800/p1/REDUCE
    at org.apache.hadoop.hdfs.DistributedFileSystem$18.doCall(DistributedFileSystem.java:1122)
    at org.apache.hadoop.hdfs.DistributedFileSystem$18.doCall(DistributedFileSystem.java:1114)
    at org.apache.hadoop.fs.FileSystemLinkResolver.resolve(FileSystemLinkResolver.java:81)
    at org.apache.hadoop.hdfs.DistributedFileSystem.getFileStatus(DistributedFileSystem.java:1114)
    at org.apache.hadoop.fs.FileSystem.resolvePath(FileSystem.java:750)
    at org.apache.hadoop.mapreduce.v2.util.MRApps.parseDistributedCacheArtifacts(MRApps.java:568)
    at org.apache.hadoop.mapreduce.v2.util.MRApps.setupDistributedCache(MRApps.java:460)
    at org.apache.hadoop.mapred.LocalDistributedCacheManager.setup(LocalDistributedCacheManager.java:93)
    at org.apache.hadoop.mapred.LocalJobRunner$Job.<init>(LocalJobRunner.java:163)
    at org.apache.hadoop.mapred.LocalJobRunner.submitJob(LocalJobRunner.java:731)
    at org.apache.hadoop.mapreduce.JobSubmitter.submitJobInternal(JobSubmitter.java:536)

    at org.apache.hadoop.mapreduce.Job$10.run(Job.java:1296)
    at org.apache.hadoop.mapreduce.Job$10.run(Job.java:1293)
    at java.security.AccessController.doPrivileged(Native Method)
    at javax.security.auth.Subject.doAs(Subject.java:415)
    at org.apache.hadoop.security.UserGroupInformation.doAs(UserGroupInformation.java:1628)
    at org.apache.hadoop.mapreduce.Job.submit(Job.java:1293)
    at org.apache.crunch.hadoop.mapreduce.lib.jobcontrol.CrunchControlledJob.submit(CrunchControlledJob.java:329)
    at org.apache.crunch.hadoop.mapreduce.lib.jobcontrol.CrunchJobControl.startReadyJobs(CrunchJobControl.java:204)
    at org.apache.crunch.hadoop.mapreduce.lib.jobcontrol.CrunchJobControl.pollJobStatusAndStartNewOnes(CrunchJobControl.java:238)
    at org.apache.crunch.impl.mr.exec.MRExecutor.monitorLoop(MRExecutor.java:112)
    at org.apache.crunch.impl.mr.exec.MRExecutor.access$000(MRExecutor.java:55)
    at org.apache.crunch.impl.mr.exec.MRExecutor$1.run(MRExecutor.java:83)
    at java.lang.Thread.run(Thread.java:745)

Kindly help me out.

Regards,

Ryan Blue

Oct 12, 2015, 12:27:12 PM
to Satyam Singh Chandel, CDK Development, sreepr...@gmail.com
Hi Satyam,

I'm not sure what's going on there. It looks like a problem with the
LocalJobRunner's setup. Could you try loading the source file into HDFS
and re-running the command?
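
For example (a sketch, reusing the paths from your command):

hdfs dfs -put /vagrant/kite/sample.json /tmp/sample.json
./kite-dataset json-import hdfs:/tmp/sample.json dataset:hdfs://integcorp.kom:8020/user/falcon/dataset/hgrw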

rb