Exception in thread "main" java.lang.RuntimeException: Error running job
	at com.amazon.elasticmapreduce.s3distcp.S3DistCp.run(S3DistCp.java:724)
	at com.amazon.elasticmapreduce.s3distcp.S3DistCp.run(S3DistCp.java:549)
	at org.apache.hadoop.util.ToolRunner.run(ToolRunner.java:65)
	at org.apache.hadoop.util.ToolRunner.run(ToolRunner.java:79)
	at com.amazon.elasticmapreduce.s3distcp.Main.main(Main.java:13)
	at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method)
	at sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:57)
	at sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43)
	at java.lang.reflect.Method.invoke(Method.java:606)
	at org.apache.hadoop.util.RunJar.main(RunJar.java:187)
Caused by: org.apache.hadoop.mapred.InvalidInputException: Input path does not exist: hdfs:/tmp/87405325-f08a-49c5-ad7f-d0b4f83fb0ec/files
	at org.apache.hadoop.mapred.FileInputFormat.listStatus(FileInputFormat.java:197)
	at org.apache.hadoop.mapred.SequenceFileInputFormat.listStatus(SequenceFileInputFormat.java:40)
	at org.apache.hadoop.mapred.FileInputFormat.getSplits(FileInputFormat.java:208)
	at org.apache.hadoop.mapred.JobClient.writeOldSplits(JobClient.java:1044)
	at org.apache.hadoop.mapred.JobClient.writeSplits(JobClient.java:1036)
	at org.apache.hadoop.mapred.JobClient.access$700(JobClient.java:174)
	at org.apache.hadoop.mapred.JobClient$2.run(JobClient.java:952)
	at org.apache.hadoop.mapred.JobClient$2.run(JobClient.java:905)
	at java.security.AccessController.doPrivileged(Native Method)
	at javax.security.auth.Subject.doAs(Subject.java:415)
	at org.apache.hadoop.security.UserGroupInformation.doAs(UserGroupInformation.java:1132)
	at org.apache.hadoop.mapred.JobClient.submitJobInternal(JobClient.java:905)
	at org.apache.hadoop.mapred.JobClient.submitJob(JobClient.java:879)
	at org.apache.hadoop.mapred.JobClient.runJob(JobClient.java:1316)
	at com.amazon.elasticmapreduce.s3distcp.S3DistCp.run(S3DistCp.java:706)
	... 9 more
Both of these issues were a result of having empty processing buckets.
I'm receiving this error as well; however, the staging process succeeded and I have logs in my processing bucket.
Included in the second link are several suggested things to check, and I've done those:
1. s3n://<< bucket >>/processing/ definitely exists, is readable by your AWS creds, and is definitely in us-east-1?
- My processing bucket exists and contains logs
- My processing bucket is definitely readable by my AWS credentials as it was able to be written to during staging
- The bucket reports a region of "US Standard", which, according to the documentation, is essentially an alias for us-east-1
2. Your << keypair >> keypair was created in the correct region - us-east-1?

I've tried two different availability zones, setting :placement: first to us-east-1a and then to us-east-1b, as my other EC2 instances use that zone and function properly. I didn't really expect this to have any effect, so I wasn't terribly disappointed when it didn't.
I've also tried different AMI versions, setting :continue_on_unexpected_error: to false, and commenting out the error buckets, all with no success.

The S3DistCp error I receive differs from the one described in the link.
I receive the following:
2014-08-27T14:22:27.879Z INFO Fetching jar file.
2014-08-27T14:22:34.910Z INFO Working dir /mnt/var/lib/hadoop/steps/1
2014-08-27T14:22:34.910Z INFO Executing /usr/java/latest/bin/java -cp /home/hadoop/conf:/usr/java/latest/lib/tools.jar:/home/hadoop:/home/hadoop/hadoop-tools.jar:/home/hadoop/hadoop-tools-1.0.3.jar:/home/hadoop/hadoop-core-1.0.3.jar:/home/hadoop/hadoop-core.jar:/home/hadoop/lib/*:/home/hadoop/lib/jetty-ext/* -Xmx1000m -Dhadoop.log.dir=/mnt/var/log/hadoop/steps/1 -Dhadoop.log.file=syslog -Dhadoop.home.dir=/home/hadoop -Dhadoop.id.str=hadoop -Dhadoop.root.logger=INFO,DRFA -Djava.io.tmpdir=/mnt/var/lib/hadoop/steps/1/tmp -Djava.library.path=/home/hadoop/native/Linux-amd64-64 org.apache.hadoop.util.RunJar /home/hadoop/lib/emr-s3distcp-1.0.jar --src s3://my-company-name-processing-bucket/processing/ --dest hdfs:///local/snowplow/raw-events/ --groupBy .*\.([0-9]+-[0-9]+-[0-9]+)-[0-9]+\..* --targetSize 128 --outputCodec lzo --s3Endpoint s3.amazonaws.com
2014-08-27T14:22:46.250Z INFO Execution ended with ret val 1
2014-08-27T14:22:46.251Z WARN Step failed with bad retval
2014-08-27T14:22:52.585Z INFO Step created jobs:
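For reference, the --groupBy pattern in the Executing line above groups files by the date embedded in the file name. A quick sketch of what it captures, assuming a typical CloudFront-style log name (the example name below is hypothetical):

```python
import re

# The --groupBy pattern from the S3DistCp step above.
group_by = re.compile(r".*\.([0-9]+-[0-9]+-[0-9]+)-[0-9]+\..*")

# Hypothetical CloudFront access log name: <distribution>.<YYYY-MM-DD-HH>.<id>.gz
name = "E2ABCDE.2014-08-27-14.a1b2c3.gz"

match = group_by.match(name)
print(match.group(1))  # the date the files are grouped on: 2014-08-27
```

If the file names under processing/ don't fit this shape, they won't match the group pattern, though that alone wouldn't explain S3DistCp finding nothing at all to copy.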
I've been very thorough in checking my config.yml for errors, but to cover all of my bases, I'll include a copy below.
I've changed my bucket names as they contain my company name, however the directory structure is correct.
:logging:
  :level: DEBUG # You can optionally switch to INFO for production
:aws:
  :access_key_id: XXXXXXXXXXXXXXXXXXX
  :secret_access_key: XXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXX
:s3:
  :region: us-east-1
  :buckets:
    :assets: s3://snowplow-hosted-assets # DO NOT CHANGE unless you are hosting the jarfiles etc yourself in your own bucket
    :log: s3://my-company-name-logging-bucket/enrichment_logs
    :raw:
      :in: s3://my-company-name-logging-bucket/logs
      :processing: s3://my-company-name-processing-bucket/processing
      :archive: s3://my-company-name-archive-bucket/raw # e.g. s3://my-company-name-archive-bucket/raw
    :enriched:
      :good: s3://my-company-name-out-bucket/enriched/good # e.g. s3://my-company-name-out-bucket/enriched/good
      :bad: s3://my-company-name-out-bucket/enriched/bad # e.g. s3://my-company-name-out-bucket/enriched/bad
      :errors: s3://s3:my-company-name-out-bucket/enriched/errors # Leave blank unless :continue_on_unexpected_error: set to true below
    :shredded:
      :good: s3://my-company-name-out-bucket/shredded/good # e.g. s3://my-company-name-out-bucket/shredded/good
      :bad: s3://my-company-name-out-bucket/shredded/bad # e.g. s3://my-company-name-out-bucket/shredded/bad
      :errors: s3://my-company-name-out-bucket/shredded/errors # Leave blank unless :continue_on_unexpected_error: set to true below
:emr:
  :ami_version: 2.4.2 # Choose as per http://docs.aws.amazon.com/ElasticMapReduce/latest/DeveloperGuide/emr-plan-ami.html
  :region: us-east-1 # Always set this
  :placement: us-east-1a # Set this if not running in VPC. Leave blank otherwise
  :ec2_subnet_id: # Set this if running in VPC. Leave blank otherwise
  :ec2_key_name: My Company Name SnowPlow Key
  :software:
    :hbase: # To launch on cluster, provide version, "0.92.0", keep quotes
    :lingual: # To launch on cluster, provide version, "1.1", keep quotes
  # Adjust your Hadoop cluster below
  :jobflow:
    :master_instance_type: m1.small
    :core_instance_count: 2
    :core_instance_type: m1.small
    :task_instance_count: 0 # Increase to use spot instances
    :task_instance_type: m1.small
    :task_instance_bid: 0.015 # In USD. Adjust bid, or leave blank for non-spot-priced (i.e. on-demand) task instances
:etl:
  :job_name: My Company Name ETL # Give your job a name
  :versions:
    :hadoop_enrich: 0.6.0 # Version of the Hadoop Enrichment process
    :hadoop_shred: 0.2.0 # Version of the Hadoop Shredding process
  :collector_format: cloudfront # Or 'clj-tomcat' for the Clojure Collector
  :continue_on_unexpected_error: true # Set to 'true' (and set :out_errors: above) if you don't want any exceptions thrown from ETL
:iglu:
  :schema: iglu:com.snowplowanalytics.iglu/resolver-config/jsonschema/1-0-0
  :data:
    :cache_size: 500
    :repositories:
      - :name: "Iglu Central"
        :priority: 0
        :vendor_prefixes:
          - com.snowplowanalytics
        :connection:
          :http:
            :uri: http://iglucentral.com
I'm unsure how to debug this. What are your thoughts/suggestions?
Thanks.
Let us know how you get on,
Alex
--
You received this message because you are subscribed to the Google Groups "Snowplow" group.
To unsubscribe from this group and stop receiving emails from it, send an email to snowplow-use...@googlegroups.com.
For more options, visit https://groups.google.com/d/optout.
2014-08-28 17:22:19,397 INFO com.amazon.elasticmapreduce.s3distcp.S3DistCp (main): Skipping key 'processing/' because it ends with '/'
2014-08-28 17:22:20,307 INFO com.amazon.elasticmapreduce.s3distcp.S3DistCp (main): Created 0 files to copy 0 files
2014-08-28 17:22:20,685 INFO org.apache.hadoop.mapred.JobClient (main): Default number of map tasks: null
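That "Created 0 files" line appears to be the crux: S3DistCp skips keys ending in '/', so if the only keys it sees under the prefix are directory markers, the copy manifest it writes to HDFS stays empty and the follow-on job's input path (the hdfs:/tmp/.../files path in the first stack trace) never materializes. A minimal sketch of that filtering, using a hypothetical key listing:

```python
# Hypothetical listing of keys under the processing/ prefix: only a
# directory-marker key, no actual log files.
keys_under_prefix = ["processing/"]

# S3DistCp-style filtering: keys ending in '/' are skipped,
# matching the "Skipping key 'processing/'" log line above.
manifest = [key for key in keys_under_prefix if not key.endswith("/")]

print(f"Created {len(manifest)} files to copy {len(manifest)} files")
```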
However, when viewing the processing bucket in the console, I do see many logs as staging finished successfully.
2014-08-28T17:21:21.955Z INFO Fetching jar file.
2014-08-28T17:21:29.456Z INFO Working dir /mnt/var/lib/hadoop/steps/1
2014-08-28T17:21:29.457Z INFO Executing /usr/java/latest/bin/java -cp /home/hadoop/conf:/usr/java/latest/lib/tools.jar:/home/hadoop:/home/hadoop/hadoop-tools.jar:/home/hadoop/hadoop-tools-1.0.3.jar:/home/hadoop/hadoop-core-1.0.3.jar:/home/hadoop/hadoop-core.jar:/home/hadoop/lib/*:/home/hadoop/lib/jetty-ext/* -Xmx1000m -Dhadoop.log.dir=/mnt/var/log/hadoop/steps/1 -Dhadoop.log.file=syslog -Dhadoop.home.dir=/home/hadoop -Dhadoop.id.str=hadoop -Dhadoop.root.logger=INFO,DRFA -Djava.io.tmpdir=/mnt/var/lib/hadoop/steps/1/tmp -Djava.library.path=/home/hadoop/native/Linux-amd64-64 org.apache.hadoop.util.RunJar /mnt/var/lib/hadoop/steps/1/script-runner.jar s3://elasticmapreduce/libs/state-pusher/0.1/fetch
2014-08-28T17:21:46.224Z INFO Execution ended with ret val 0
2014-08-28T17:21:59.894Z INFO Step created jobs:
2014-08-28T17:21:59.895Z INFO Step succeeded
syslog:
2014-08-28 17:21:37,567 INFO org.apache.hadoop.fs.s3native.NativeS3FileSystem (main): Opening '/libs/state-pusher/0.1/fetch' for reading
stderr:
+ /etc/init.d/hadoop-state-pusher-control stop
+ PID_FILE=/mnt/var/run/hadoop-state-pusher/hadoop-state-pusher.pid
+ LOG_FILE=/mnt/var/log/hadoop-state-pusher/hadoop-state-pusher.out
+ SVC_FILE=/mnt/var/lib/hadoop-state-pusher/run-hadoop-state-pusher
+ case $1 in
+ stop
+ echo 0
/etc/init.d/hadoop-state-pusher-control: line 35: /mnt/var/lib/hadoop-state-pusher/run-hadoop-state-pusher: No such file or directory
+ /etc/init.d/hadoop-state-pusher-control start
+ PID_FILE=/mnt/var/run/hadoop-state-pusher/hadoop-state-pusher.pid
+ LOG_FILE=/mnt/var/log/hadoop-state-pusher/hadoop-state-pusher.out
+ SVC_FILE=/mnt/var/lib/hadoop-state-pusher/run-hadoop-state-pusher
+ case $1 in
+ start
++ dirname /mnt/var/lib/hadoop-state-pusher/run-hadoop-state-pusher
+ sudo -u hadoop mkdir -p /mnt/var/lib/hadoop-state-pusher
+ echo 1
++ dirname /mnt/var/run/hadoop-state-pusher/hadoop-state-pusher.pid
+ sudo -u hadoop mkdir -p /mnt/var/run/hadoop-state-pusher
++ dirname /mnt/var/log/hadoop-state-pusher/hadoop-state-pusher.out
+ sudo -u hadoop mkdir -p /mnt/var/log/hadoop-state-pusher
+ disown %1
+ sleep 5
+ sudo -u hadoop /usr/bin/hadoop-state-pusher -server --pidfile /mnt/var/run/hadoop-state-pusher/hadoop-state-pusher.pid
+ exit 0
Command exiting with ret '0'
Downloading 's3://elasticmapreduce/libs/state-pusher/0.1/fetch' to '/mnt/var/lib/hadoop/steps/1/.'
On Thursday, August 28, 2014 4:17:31 PM UTC-4, Jake Williamson wrote:
Make sure that :in: s3://my-company-name-logging-bucket/logs is pointing to wherever CloudFront is storing your CloudFront access logs.
:in: s3://my-company-name-logging-bucket

Not as you have now:

:in: s3://my-company-name-logging-bucket/logs
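The distinction matters because staging only picks up keys under the exact :in: prefix. A small sketch of the mismatch, assuming (hypothetically) that CloudFront is writing its access logs to the bucket root rather than under logs/:

```python
# Hypothetical bucket listing: CloudFront writing access logs to the bucket root.
bucket_keys = ["E2ABCDE.2014-08-27-14.a1b2c3.gz"]

def keys_under(prefix):
    """Return the keys a given :in: prefix would actually stage."""
    return [key for key in bucket_keys if key.startswith(prefix)]

print(keys_under("logs/"))  # :in: .../logs sees nothing: []
print(keys_under(""))       # :in: at the bucket root sees the logs
```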