Can the CC fork of Nutch work with S3 only?


Christian Pérez-Llamas

Oct 22, 2015, 10:34:33 AM
to Common Crawl
Dear members,

I am trying to reproduce the Common Crawl infrastructure for crawling just a few sites on a weekly/nightly basis using AWS, EMR and S3. I am using the CC fork of Nutch at https://github.com/Aloisius/nutch (cc branch).

I use the Crawl job and pass S3 URIs for all the paths. The inject and fetch steps work perfectly, but it fails at the ParseSegment step (see the stack trace below). I have tried the s3, s3n and s3a schemes.

org.apache.nutch.crawl.Crawl s3a://some-bucket/urls -dir s3a://some-bucket/crawl -depth 2 -topN 5
  |
  V
org.apache.nutch.parse.ParseSegment s3a://some-bucket/crawl/segments/20151022105922
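
For reference, I launch the jobs from the EMR master node roughly like this (the job jar name below is just a placeholder for my own build). I have also been wondering whether overriding the default filesystem with Hadoop's generic -D option would get past the check, but I have not verified that it actually works:

hadoop jar apache-nutch-1.x.job org.apache.nutch.crawl.Crawl \
    -D fs.defaultFS=s3a://some-bucket \
    s3a://some-bucket/urls -dir s3a://some-bucket/crawl -depth 2 -topN 5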


Exception in thread "main" java.lang.IllegalArgumentException: Wrong FS: s3a://some-bucket/crawl/segments/20151022105922/crawl_parse, expected: hdfs://ip-152-71-19-40.eu-west-1.compute.internal:8020
	at org.apache.hadoop.fs.FileSystem.checkPath(FileSystem.java:649)
	at org.apache.hadoop.hdfs.DistributedFileSystem.getPathName(DistributedFileSystem.java:193)
	at org.apache.hadoop.hdfs.DistributedFileSystem.access$000(DistributedFileSystem.java:105)
	at org.apache.hadoop.hdfs.DistributedFileSystem$18.doCall(DistributedFileSystem.java:1118)
	at org.apache.hadoop.hdfs.DistributedFileSystem$18.doCall(DistributedFileSystem.java:1114)
	at org.apache.hadoop.fs.FileSystemLinkResolver.resolve(FileSystemLinkResolver.java:81)
	at org.apache.hadoop.hdfs.DistributedFileSystem.getFileStatus(DistributedFileSystem.java:1114)
	at org.apache.hadoop.fs.FileSystem.exists(FileSystem.java:1404)
	at org.apache.nutch.parse.ParseOutputFormat.checkOutputSpecs(ParseOutputFormat.java:88)
	at org.apache.hadoop.mapreduce.JobSubmitter.checkSpecs(JobSubmitter.java:564)
	at org.apache.hadoop.mapreduce.JobSubmitter.submitJobInternal(JobSubmitter.java:432)
	at org.apache.hadoop.mapreduce.Job$10.run(Job.java:1296)
	at org.apache.hadoop.mapreduce.Job$10.run(Job.java:1293)
	at java.security.AccessController.doPrivileged(Native Method)
	at javax.security.auth.Subject.doAs(Subject.java:415)
	at org.apache.hadoop.security.UserGroupInformation.doAs(UserGroupInformation.java:1628)
	at org.apache.hadoop.mapreduce.Job.submit(Job.java:1293)
	at org.apache.hadoop.mapred.JobClient$1.run(JobClient.java:562)
	at org.apache.hadoop.mapred.JobClient$1.run(JobClient.java:557)
	at java.security.AccessController.doPrivileged(Native Method)
	at javax.security.auth.Subject.doAs(Subject.java:415)
	at org.apache.hadoop.security.UserGroupInformation.doAs(UserGroupInformation.java:1628)
	at org.apache.hadoop.mapred.JobClient.submitJobInternal(JobClient.java:557)
	at org.apache.hadoop.mapred.JobClient.submitJob(JobClient.java:548)
	at org.apache.hadoop.mapred.JobClient.runJob(JobClient.java:833)
	at org.apache.nutch.parse.ParseSegment.parse(ParseSegment.java:224)
	at org.apache.nutch.parse.ParseSegment.run(ParseSegment.java:258)
	at org.apache.hadoop.util.ToolRunner.run(ToolRunner.java:70)
	at org.apache.nutch.parse.ParseSegment.main(ParseSegment.java:231)
	at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method)
	at sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:57)
	at sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43)
	at java.lang.reflect.Method.invoke(Method.java:606)
	at org.apache.hadoop.util.RunJar.run(RunJar.java:221)
	at org.apache.hadoop.util.RunJar.main(RunJar.java:136)
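
Looking at the trace, my guess is that ParseOutputFormat.checkOutputSpecs asks for the default filesystem (HDFS on the EMR master) and then calls exists() on the s3a output path, instead of resolving the filesystem from the path itself. I have not checked the actual Nutch source, so this is only a guess, but a tiny standalone demo of what I mean (class name and paths are made up) would be:

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.Path;

public class WrongFsDemo {
  public static void main(String[] args) throws Exception {
    Configuration conf = new Configuration();  // on EMR, fs.defaultFS points at hdfs://...
    Path out = new Path("s3a://some-bucket/crawl/segments/20151022105922");

    // What the stack trace suggests happens: the *default* filesystem
    // (DistributedFileSystem) is asked about an s3a:// path and rejects it.
    FileSystem defaultFs = FileSystem.get(conf);
    try {
      defaultFs.exists(new Path(out, "crawl_parse"));
    } catch (IllegalArgumentException e) {
      System.out.println("Wrong FS, as in my trace: " + e.getMessage());
    }

    // Resolving the filesystem from the path itself returns the S3A filesystem
    // and the check works (assuming s3a credentials are configured).
    FileSystem pathFs = out.getFileSystem(conf);
    System.out.println("crawl_parse exists: " + pathFs.exists(new Path(out, "crawl_parse")));
  }
}

If that is the case, it would explain why inject and fetch run fine while ParseSegment does not.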

Is working with S3 only fully supported by all the crawling steps? Any clue what the problem is?

I guess I could use HDFS for the segments, but I would like to avoid that: I want to terminate the cluster as soon as the crawl finishes and keep the data in S3 for the next crawls.

Thank you so much,
Christian

Tom Morris

Oct 22, 2015, 12:27:36 PM
to common...@googlegroups.com
On Thu, Oct 22, 2015 at 10:34 AM, Christian Pérez-Llamas <chr...@gmail.com> wrote:

I am trying to reproduce the Common Crawl infrastructure for crawling just a few sites on a weekly/nightly basis using AWS, EMR and S3. I am using the CC fork of Nutch at https://github.com/Aloisius/nutch (cc branch).

It doesn't help with your question, but I think the CC fork is actually at: https://github.com/commoncrawl/nutch/

(although the differences are minimal).

Tom

Christian Pérez-Llamas

Oct 28, 2015, 10:19:52 AM
to Common Crawl
You are right, Tom, thank you so much.

Regarding my problem, I guess the only way to go is to back up/restore the crawl state from HDFS to S3 (and back) each time I launch a crawling cluster. Does anyone know whether the segments are required for the AdaptiveFetchSchedule to work properly? If not, I would only back up the crawldb and linkdb.
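
In case it is useful, the back up/restore I have in mind is just plain distcp between the cluster and the bucket, roughly like this (paths are placeholders; on EMR, s3-dist-cp could be used instead, I have not benchmarked either):

# when the cluster comes up: restore the previous crawl state from S3
hadoop distcp s3a://some-bucket/crawl/crawldb hdfs:///crawl/crawldb
hadoop distcp s3a://some-bucket/crawl/linkdb hdfs:///crawl/linkdb

# ... run the crawl against hdfs:///crawl ...

# before terminating the cluster: push the updated state back to S3
hadoop distcp -overwrite hdfs:///crawl/crawldb s3a://some-bucket/crawl/crawldb
hadoop distcp -overwrite hdfs:///crawl/linkdb s3a://some-bucket/crawl/linkdb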

I also read on the following wiki page that using S3 as a replacement for HDFS is discouraged:

https://wiki.apache.org/hadoop/AmazonS3