Dear members,
I am trying to reproduce the Common Crawl infrastructure for crawling just a few sites on a weekly/nightly basis using AWS EMR and S3. I am using the CC fork of Nutch at https://github.com/Aloisius/nutch (cc branch).
I use the Crawl job and pass S3 bucket paths for all of its inputs and outputs. The inject and fetch steps work perfectly, but it fails at the ParseSegment step (see the stack trace below). I have tried the s3, s3n and s3a schemes.
org.apache.nutch.crawl.Crawl s3a://some-bucket/urls -dir s3a://some-bucket/crawl -depth 2 -topN 5
|
V
org.apache.nutch.parse.ParseSegment s3a://some-bucket/crawl/segments/20151022105922

Exception in thread "main" java.lang.IllegalArgumentException: Wrong FS: s3a://some-bucket/crawl/segments/20151022105922/crawl_parse, expected: hdfs://ip-152-71-19-40.eu-west-1.compute.internal:8020
at org.apache.hadoop.fs.FileSystem.checkPath(FileSystem.java:649)
at org.apache.hadoop.hdfs.DistributedFileSystem.getPathName(DistributedFileSystem.java:193)
at org.apache.hadoop.hdfs.DistributedFileSystem.access$000(DistributedFileSystem.java:105)
at org.apache.hadoop.hdfs.DistributedFileSystem$18.doCall(DistributedFileSystem.java:1118)
at org.apache.hadoop.hdfs.DistributedFileSystem$18.doCall(DistributedFileSystem.java:1114)
at org.apache.hadoop.fs.FileSystemLinkResolver.resolve(FileSystemLinkResolver.java:81)
at org.apache.hadoop.hdfs.DistributedFileSystem.getFileStatus(DistributedFileSystem.java:1114)
at org.apache.hadoop.fs.FileSystem.exists(FileSystem.java:1404)
at org.apache.nutch.parse.ParseOutputFormat.checkOutputSpecs(ParseOutputFormat.java:88)
at org.apache.hadoop.mapreduce.JobSubmitter.checkSpecs(JobSubmitter.java:564)
at org.apache.hadoop.mapreduce.JobSubmitter.submitJobInternal(JobSubmitter.java:432)
at org.apache.hadoop.mapreduce.Job$10.run(Job.java:1296)
at org.apache.hadoop.mapreduce.Job$10.run(Job.java:1293)
at java.security.AccessController.doPrivileged(Native Method)
at javax.security.auth.Subject.doAs(Subject.java:415)
at org.apache.hadoop.security.UserGroupInformation.doAs(UserGroupInformation.java:1628)
at org.apache.hadoop.mapreduce.Job.submit(Job.java:1293)
at org.apache.hadoop.mapred.JobClient$1.run(JobClient.java:562)
at org.apache.hadoop.mapred.JobClient$1.run(JobClient.java:557)
at java.security.AccessController.doPrivileged(Native Method)
at javax.security.auth.Subject.doAs(Subject.java:415)
at org.apache.hadoop.security.UserGroupInformation.doAs(UserGroupInformation.java:1628)
at org.apache.hadoop.mapred.JobClient.submitJobInternal(JobClient.java:557)
at org.apache.hadoop.mapred.JobClient.submitJob(JobClient.java:548)
at org.apache.hadoop.mapred.JobClient.runJob(JobClient.java:833)
at org.apache.nutch.parse.ParseSegment.parse(ParseSegment.java:224)
at org.apache.nutch.parse.ParseSegment.run(ParseSegment.java:258)
at org.apache.hadoop.util.ToolRunner.run(ToolRunner.java:70)
at org.apache.nutch.parse.ParseSegment.main(ParseSegment.java:231)
at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method)
at sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:57)
at sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43)
at java.lang.reflect.Method.invoke(Method.java:606)
at org.apache.hadoop.util.RunJar.run(RunJar.java:221)
at org.apache.hadoop.util.RunJar.main(RunJar.java:136)
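Looking at the trace, checkOutputSpecs seems to test the s3a:// output path against the cluster's default (HDFS) filesystem instead of the filesystem that belongs to the path's own scheme. Here is a minimal sketch, outside of Nutch, of the difference I mean (the class name is made up, the path is taken from the trace, and I have not checked that this is exactly what ParseOutputFormat does):

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.Path;

public class WrongFsSketch {
  public static void main(String[] args) throws Exception {
    Configuration conf = new Configuration();        // on the EMR cluster, fs.defaultFS points at hdfs://...
    Path out = new Path("s3a://some-bucket/crawl/segments/20151022105922/crawl_parse");

    FileSystem defaultFs = FileSystem.get(conf);     // filesystem for fs.defaultFS, i.e. HDFS
    try {
      defaultFs.exists(out);                         // throws IllegalArgumentException: Wrong FS, as above
    } catch (IllegalArgumentException expected) {
      System.out.println(expected.getMessage());
    }

    FileSystem s3 = out.getFileSystem(conf);         // filesystem resolved from the path's own scheme
    System.out.println(s3.exists(out));              // checks the path against s3a:// as intended
  }
}

If that is really the cause, is there a configuration setting I am missing, or would it need a patch in ParseOutputFormat?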
Is working entirely against S3 fully supported by all of the crawl steps? Any clue about what the problem is?
I guess I could use HDFS for the segments, but I would like to avoid that: I want to terminate the cluster as soon as the crawl finishes and keep the data in S3 for the next crawls.
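(If I do have to go through HDFS, I assume I could copy the segments out before terminating the cluster with something like

hadoop distcp hdfs:///user/hadoop/crawl s3a://some-bucket/crawl

where the HDFS path is just a placeholder, but writing to S3 directly would be simpler.)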
Thank you so much,
Christian