I have a set of hand-categorized domains (over 4k).
For each domain, I need to visit every page of that domain in the Common Crawl archive and grab its metadata.
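
As a rough illustration of the per-domain step (not the actual project spec), the sketch below builds a query against the public Common Crawl URL index, filters the returned records, and pulls metadata out of a page's HTML. The crawl ID `CC-MAIN-2024-10` is a placeholder, the helper names are hypothetical, and it assumes "metadata" means the `<title>` and `<meta>` tags; the real spec may define it differently.

```python
import json
from html.parser import HTMLParser
from urllib.parse import urlencode

INDEX_HOST = "https://index.commoncrawl.org"
CRAWL_ID = "CC-MAIN-2024-10"  # assumption: placeholder, swap in whichever crawl you need


def index_query_url(domain, crawl_id=CRAWL_ID):
    """Build an index query matching every captured page under a domain."""
    params = urlencode({"url": f"{domain}/*", "output": "json"})
    return f"{INDEX_HOST}/{crawl_id}-index?{params}"


def parse_index_lines(body):
    """The index answers with one JSON object per line; keep HTTP 200 captures."""
    records = [json.loads(line) for line in body.splitlines() if line.strip()]
    return [r for r in records if r.get("status") == "200"]


class MetaExtractor(HTMLParser):
    """Collect <title> text and <meta name/property ... content=...> pairs."""

    def __init__(self):
        super().__init__()
        self.meta = {}
        self._in_title = False

    def handle_starttag(self, tag, attrs):
        attrs = dict(attrs)
        if tag == "title":
            self._in_title = True
        elif tag == "meta":
            key = attrs.get("name") or attrs.get("property")
            if key and attrs.get("content") is not None:
                self.meta[key.lower()] = attrs["content"]

    def handle_endtag(self, tag):
        if tag == "title":
            self._in_title = False

    def handle_data(self, data):
        if self._in_title:
            self.meta["title"] = self.meta.get("title", "") + data.strip()


def extract_metadata(html):
    """Return a dict of metadata found in one HTML document."""
    parser = MetaExtractor()
    parser.feed(html)
    return parser.meta
```

Each index record carries `filename`, `offset`, and `length` fields that locate the capture inside a WARC file, so the page itself can be fetched with a ranged HTTP request and passed to `extract_metadata`. Looping `index_query_url` over the 4k domains and paging through the results is left out here.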
I already have the project spec'd out by a developer who has worked with Common Crawl. This will give me training data for my app, and if you need language data classified by interest or news category, it will be useful to you as well.