Common Crawl enhancements to Nutch


Peter Dietz

Mar 10, 2015, 2:39:02 PM
to common...@googlegroups.com
I've noticed that the Common Crawl blog mentions that you use Nutch to power the crawler. Is there a list of the enhancements (e.g. plugins) you've made to Nutch to help steer the crawler in the right direction? A different Common Crawl blog post says "We use shingling and simhash to do fuzzy deduping of the content we download"; I wasn't sure whether that intelligent processing of the content is combined with Nutch, or whether it happens as a post-processing step.
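For reference, here is a minimal Python sketch of the shingling + simhash technique quoted above. It is a generic illustration of the approach, not Common Crawl's actual code: hash overlapping word shingles, combine per-bit votes into a 64-bit fingerprint, and treat a small Hamming distance between fingerprints as a near-duplicate.

```python
import hashlib

def shingles(text, k=4):
    """Yield overlapping k-word shingles from the text."""
    words = text.split()
    for i in range(max(len(words) - k + 1, 1)):
        yield " ".join(words[i:i + k])

def simhash(text, bits=64):
    """Combine per-bit votes from shingle hashes into one fingerprint."""
    votes = [0] * bits
    for sh in shingles(text):
        # Take 8 bytes of the digest as a 64-bit shingle hash.
        h = int.from_bytes(hashlib.md5(sh.encode()).digest()[:8], "big")
        for b in range(bits):
            votes[b] += 1 if (h >> b) & 1 else -1
    return sum(1 << b for b in range(bits) if votes[b] > 0)

def hamming_distance(a, b):
    return bin(a ^ b).count("1")

# Near-duplicate texts yield fingerprints that differ in only a few bits.
a = "the quick brown fox jumps over the lazy dog"
b = "the quick brown fox leaps over the lazy dog"
print(hamming_distance(simhash(a), simhash(b)))
```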

Thanks

Stephen Merity

Mar 12, 2015, 5:26:46 AM
to common...@googlegroups.com
Hi Peter,

The Common Crawl modifications are all publicly viewable in our Nutch repository on GitHub. Unfortunately there isn't a clean list of the features we've added. Whilst the modifications were done by Jordan Mendelson, I'll do my best to list some of the major additions and changes.

+ HTTPS support (since implemented in mainline Nutch)
+ The Crawl List Generator adds support for sorting by a numeric ranking, so each crawl list prioritizes the highest-ranked pages first
+ Avoiding renames, since renaming is expensive on S3 (S3 implements it as a copy): updating the crawl database, for example, produces a new copy identified by timestamp rather than replacing a "current" folder (see the first sketch below)
+ WARC export from the internal Nutch format (see the WARC example below)
+ Adaptive crawl delay for hosts that experience issues (see the backoff sketch below)
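To make the rename-avoidance point concrete, here is a minimal Python sketch of the pattern. The bucket layout and function names are hypothetical, not taken from the actual Nutch code: each crawl database update lands in a fresh timestamped prefix, and readers resolve "current" by picking the newest one instead of relying on a rename.

```python
import time

def next_crawldb_prefix(base="s3://example-bucket/crawldb"):
    """Return a new, timestamped output prefix for this crawldb update."""
    return f"{base}/{int(time.time())}"

def latest_crawldb_prefix(prefixes):
    """The newest timestamped prefix plays the role of 'current'."""
    return max(prefixes, key=lambda p: int(p.rsplit("/", 1)[1]))

# A job writes to next_crawldb_prefix(); readers find the live copy by
# largest timestamp, so no S3 rename (copy + delete) is ever needed.
prefixes = ["s3://example-bucket/crawldb/1425960000",
            "s3://example-bucket/crawldb/1426046400"]
print(latest_crawldb_prefix(prefixes))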
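To illustrate the WARC export, here is a small Python example that writes a fetched page as a WARC response record using the warcio library. It shows the record format only; the actual exporter lives in the Java Nutch codebase.

```python
from io import BytesIO
from warcio.warcwriter import WARCWriter
from warcio.statusandheaders import StatusAndHeaders

with open("example.warc.gz", "wb") as out:
    writer = WARCWriter(out, gzip=True)
    # Reconstruct the HTTP response headers seen at fetch time.
    http_headers = StatusAndHeaders(
        "200 OK", [("Content-Type", "text/html")], protocol="HTTP/1.1"
    )
    # One 'response' record per fetched URL, payload is the raw body.
    record = writer.create_warc_record(
        "http://example.com/",
        "response",
        payload=BytesIO(b"<html><body>Hello</body></html>"),
        http_headers=http_headers,
    )
    writer.write_record(record)
```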
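Finally, a minimal sketch of the adaptive crawl delay idea, again illustrative rather than the actual patch: back off exponentially per host when it returns errors or times out, and decay back toward the base delay once it recovers.

```python
class HostDelay:
    """Per-host politeness delay that adapts to the host's health."""

    def __init__(self, base_delay=1.0, max_delay=60.0):
        self.base = base_delay
        self.max = max_delay
        self.delay = base_delay

    def on_failure(self):
        """Double the delay (up to a cap) after a timeout or 5xx."""
        self.delay = min(self.delay * 2, self.max)

    def on_success(self):
        """Ease back toward the base delay after a healthy response."""
        self.delay = max(self.delay * 0.5, self.base)

host = HostDelay()
host.on_failure()   # e.g. HTTP 503 -> wait 2.0s before the next fetch
host.on_success()   # recovered    -> back down to 1.0s
print(host.delay)
```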

The fuzzy deduping is no longer being done; that was part of the older custom platform used before the move to Nutch.

I hope that helps give you a broad overview of the kind of changes!

--
Regards,
Stephen Merity
Data Scientist @ Common Crawl

Tom Morris

Mar 12, 2015, 10:07:07 AM
to common...@googlegroups.com
It looks like it's been a couple of years since the fork was synced with the Nutch mainline. Have they diverged permanently at this point?

Tom