October 2016 crawl archive now available


Sebastian Nagel

Nov 7, 2016, 4:31:00 PM
to common...@googlegroups.com
Hi all,

The October 2016 crawl archive is now available. It contains 3.25 billion web pages. Details on
how to access and use the data can be found on our blog [1].

As in the September crawl, we used sitemaps [2] of popular hosts and domains (popularity ranks from
Common Search [3] and Alexa [4]) to improve the coverage and freshness of the crawl.
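
For readers unfamiliar with the sitemaps format [2], here is a minimal sketch of reading the URLs
and last-modification dates from a sitemap file. It is an illustration only, not the crawler's
actual code: the function name parse_sitemap and the sample document are hypothetical, and it
handles only a plain <urlset>, not sitemap index files.

```python
import xml.etree.ElementTree as ET

# XML namespace defined by the sitemaps.org protocol.
SITEMAP_NS = {"sm": "http://www.sitemaps.org/schemas/sitemap/0.9"}

# Hypothetical sample document, for illustration only.
EXAMPLE = """<urlset xmlns="http://www.sitemaps.org/schemas/sitemap/0.9">
  <url><loc>http://example.com/</loc><lastmod>2016-10-01</lastmod></url>
  <url><loc>http://example.com/about</loc></url>
</urlset>"""


def parse_sitemap(xml_text):
    """Return a list of (loc, lastmod) pairs from a sitemap <urlset> document.

    lastmod is None when the entry does not declare one.
    """
    root = ET.fromstring(xml_text)
    entries = []
    for url in root.findall("sm:url", SITEMAP_NS):
        loc = url.findtext("sm:loc", default="", namespaces=SITEMAP_NS).strip()
        lastmod = url.findtext("sm:lastmod", default=None, namespaces=SITEMAP_NS)
        if loc:
            entries.append((loc, lastmod.strip() if lastmod else None))
    return entries


if __name__ == "__main__":
    for loc, lastmod in parse_sitemap(EXAMPLE):
        print(loc, lastmod)
```

The lastmod values are what makes sitemaps useful for freshness: a crawler can prefer URLs that
changed recently.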

We are grateful to webxtrakt [5] for donating a list of 14 million verified, DNS-resolvable domain
names of European country-code TLDs (.eu, .fr, .be, .de, .ch, .nl, .pl). We included these domains
in the October crawl and we hope for an ongoing partnership with webxtrakt.

Please note that we plan to combine the upcoming November and December crawls. We hope to further
increase the size of the crawl and to add more improvements. We expect to release it in
mid-December.

Best,
Sebastian


[1] http://commoncrawl.org/2016/11/october-2016-crawl-archive-now-available/
[2] http://www.sitemaps.org/
[3] https://about.commonsearch.org/2016/07/our-first-public-datasets-host-level-webgraph-and-pagerank/
[4] https://support.alexa.com/hc/en-us/articles/200461990-Can-I-get-a-list-of-top-sites-from-an-API-
[5] http://webxtrakt.com/

Robert Meusel

Nov 9, 2016, 2:30:49 AM
to Common Crawl
Great! Thank you for this comprehensive crawl and for describing how you gathered the data. Is the list of domains from webxtrakt available?

Best,
Robert

Sebastian Nagel

Nov 9, 2016, 4:19:32 AM
to common...@googlegroups.com
Hi Robert,

I'm sorry, this list is not available.

However, for a couple of months now we have provided counts and statistics about crawled domains
(also hosts, TLDs, etc.) in
s3://commoncrawl/crawl-analysis/
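
As a rough sketch of how one might browse that location from Python: the helper names
split_s3_uri and list_prefix are hypothetical, and the listing part assumes that the boto3
package is installed and that AWS credentials are configured. The layout of the files under
the prefix is not described in this thread, so the code only lists what is there.

```python
from urllib.parse import urlparse


def split_s3_uri(uri):
    """Split an s3://bucket/prefix URI into a (bucket, prefix) pair."""
    parsed = urlparse(uri)
    if parsed.scheme != "s3" or not parsed.netloc:
        raise ValueError(f"not an S3 URI: {uri!r}")
    return parsed.netloc, parsed.path.lstrip("/")


def list_prefix(uri):
    """Yield the sub-prefixes and object keys directly under an S3 URI.

    Assumption: boto3 is installed and can find AWS credentials.
    """
    import boto3  # imported lazily so split_s3_uri works without boto3

    bucket, prefix = split_s3_uri(uri)
    paginator = boto3.client("s3").get_paginator("list_objects_v2")
    # Delimiter="/" groups deeper keys into "directory"-like common prefixes.
    for page in paginator.paginate(Bucket=bucket, Prefix=prefix, Delimiter="/"):
        for sub in page.get("CommonPrefixes", []):
            yield sub["Prefix"]
        for obj in page.get("Contents", []):
            yield obj["Key"]


if __name__ == "__main__":
    for name in list_prefix("s3://commoncrawl/crawl-analysis/"):
        print(name)
```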

The breadth of the crawls in terms of domains covered has increased significantly in recent
crawls, mostly by adding seeds from moz.com, Common Search and webxtrakt:
https://github.com/commoncrawl/cc-crawl-statistics/blob/master/plots/crawlsize_domain.png

Best,
Sebastian
