Hi all,
The October 2016 crawl archive is now available. It contains 3.25 billion web pages. Details on
how to access and use the data can be found on our blog [1].
As in the September crawl, we used sitemaps [2] of popular hosts/domains (popularity ranks from
Common Search [3] and Alexa [4]) to improve the coverage and freshness of the crawl.
We are grateful to webxtrakt [5] for donating a list of 14 million verified, DNS-resolvable domain
names of European country-code TLDs (.eu, .fr, .be, .de, .ch, .nl, .pl). We included these domains
in the October crawl and we hope for an ongoing partnership with webxtrakt.
Please note that we plan to combine the upcoming November and December crawls. We hope to further
increase the size of the crawl and to add more improvements. We expect to release the combined crawl
in mid-December.
Best,
Sebastian
[1] http://commoncrawl.org/2016/11/october-2016-crawl-archive-now-available/
[2] http://www.sitemaps.org/
[3] https://about.commonsearch.org/2016/07/our-first-public-datasets-host-level-webgraph-and-pagerank/
[4] https://support.alexa.com/hc/en-us/articles/200461990-Can-I-get-a-list-of-top-sites-from-an-API-
[5] http://webxtrakt.com/