I'm writing a new crawler based on Python's newish async/await
syntax. It's intended to be modular enough that it will do a good job
on a variety of crawl tasks, from "crawl this one site" to "crawl the
frontpage of 10 million sites" to (eventually) bulk crawling billions
of pages.
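The core of it is the usual asyncio fetch pattern -- roughly the sketch
below, assuming aiohttp (the real code has a lot more machinery around
queues, politeness, and error handling):

    import asyncio
    import aiohttp

    async def fetch(session, url):
        # Fetch one page; a real crawler also handles robots.txt,
        # timeouts, retries, and per-host politeness.
        async with session.get(url) as resp:
            return url, resp.status, await resp.text()

    async def crawl(urls):
        async with aiohttp.ClientSession() as session:
            return await asyncio.gather(*(fetch(session, u) for u in urls))

    results = asyncio.get_event_loop().run_until_complete(
        crawl(['http://example.com/']))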
Along the way, I thought it would be fun to produce some kind of
dataset on a monthly basis. Common Crawl already has a news crawl,
so... how about a monthly crawl of the frontpages of the top million
Alexa sites? You can see the kind of information I could collect at
http://builtwith.com/ -- a lot of this stuff is relatively easy to
collect, like "meta generator" or the presence of Open Graph or
Twitter cards markup.
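To give a rough idea of that kind of extraction, here's a sketch using
BeautifulSoup (just for illustration -- not necessarily what the crawler
itself will end up using):

    from bs4 import BeautifulSoup

    def extract_markers(html):
        # Pull out a few easy signals: the "meta generator" tag and
        # whether Open Graph / Twitter cards markup is present.
        soup = BeautifulSoup(html, 'html.parser')
        gen = soup.find('meta', attrs={'name': 'generator'})
        og = soup.find('meta', attrs={'property':
                                      lambda p: p and p.startswith('og:')})
        tw = soup.find('meta', attrs={'name':
                                      lambda n: n and n.startswith('twitter:')})
        return {
            'generator': gen.get('content') if gen else None,
            'open_graph': og is not None,
            'twitter_cards': tw is not None,
        }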
I don't have WARC output yet, but eventually I'd like to add the frontpage
HTML, robots.txt, and response headers to the dataset.
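If and when I do, something along the lines of the warcio library would
work -- a minimal sketch (names and details are just illustrative):

    import io
    from warcio.warcwriter import WARCWriter
    from warcio.statusandheaders import StatusAndHeaders

    def write_response(writer, url, status_line, headers, body_bytes):
        # Write one fetched frontpage as a WARC response record.
        http_headers = StatusAndHeaders(status_line, headers,
                                        protocol='HTTP/1.1')
        record = writer.create_warc_record(url, 'response',
                                           payload=io.BytesIO(body_bytes),
                                           http_headers=http_headers)
        writer.write_record(record)

    with open('frontpages.warc.gz', 'wb') as out:
        writer = WARCWriter(out, gzip=True)
        write_response(writer, 'http://example.com/', '200 OK',
                       [('Content-Type', 'text/html')], b'<html>...</html>')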
One thing I know I'm not very good at is detecting "one-page" websites --
does anyone have tips, or other suggestions?
The code is at:
https://github.com/cocrawler/cocrawler
-- greg