Unfortunately, although I would really like to, I am not allowed to share our code base.
In general, we run large Hadoop jobs on Amazon's EMR that generate lots of NLP stats from the extracted text (all written in Clojure). We used to do all of this in-house (crawling, extracting, storing the data, etc.), so finding out about Common Crawl was a real blessing. However, we have not yet had a chance to make a full run on Common Crawl; so far we have only been adapting our algorithms to the CC layout and making small test runs.
Thanks for a wonderful job,
Shlomi