The actual Common Crawl crawler is publicly available.
In particular, https://github.com/commoncrawl/commoncrawl-crawler/blob/master/src/org/commoncrawl/mapred/ec2/parser/ParserOutputFormat.java looks like a good place to start.