Hi Peter,
The Common Crawl modifications are all publicly viewable at our
nutch repository on GitHub. Unfortunately there isn't a clean list of the features we've added. The modifications were done by Jordan Mendelson, but I'll do my best to list some of the major additions and changes.
+ HTTPS support (since implemented on the mainline Nutch branch)
+ The Crawl List Generator adds support for sorting by a numeric ranking (each crawl list prioritizes the highest-ranked pages first)
+ Avoiding renames, which are expensive on S3 (S3 implements a rename as a copy followed by a delete); for example, updating the crawl database produces a new copy identified by timestamp rather than replacing the "current" folder
+ WARC export from the internal Nutch format
+ Adaptive crawl delay for hosts that experience issues
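To illustrate the rank-sorted crawl lists: a minimal sketch of the idea, where the field names ("url", "rank") are illustrative and not the actual Nutch schema:

```python
def prioritize(crawl_list):
    """Order a crawl list so the highest-ranked pages are fetched first."""
    # Sort descending by the numeric rank; ties keep their original order.
    return [entry["url"] for entry in
            sorted(crawl_list, key=lambda e: e["rank"], reverse=True)]

pages = [
    {"url": "http://example.com/a", "rank": 0.2},
    {"url": "http://example.com/b", "rank": 0.9},
    {"url": "http://example.com/c", "rank": 0.5},
]
print(prioritize(pages))  # highest-ranked URL comes first
```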
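The rename-avoidance point can be sketched as follows; the path layout and helper names here are my own illustration, not the actual repository code:

```python
import time

def versioned_output_path(base, timestamp=None):
    """Write each crawl-db update to a fresh timestamped folder instead of
    renaming it over a "current" folder (an S3 rename is a copy + delete)."""
    ts = timestamp if timestamp is not None else int(time.time())
    return f"{base}/crawldb-{ts}"

def latest(paths):
    """The newest timestamped copy serves as the logical "current" db.
    Works lexicographically because the timestamps have a fixed width."""
    return max(paths)
```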
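And for the adaptive crawl delay, the general shape is multiplicative backoff per host; the specific factors and bounds below are assumptions for illustration, not Common Crawl's actual tuning:

```python
def adapt_delay(current_delay, ok, min_delay=1.0, max_delay=60.0):
    """Per-host crawl delay: back off quickly when a host has issues,
    recover gradually once fetches succeed again."""
    if ok:
        # Successful fetch: ease the delay back down toward the floor.
        return max(min_delay, current_delay * 0.9)
    # Failed or throttled fetch: double the delay, capped at the ceiling.
    return min(max_delay, current_delay * 2.0)
```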
The fuzzy deduping is no longer done, as it was part of the older custom platform we used before moving to Nutch.
I hope that helps give you a broad overview of the kind of changes!