Can't get proper indexes from Common Crawl

Tal Golan

Dec 5, 2016, 1:12:58 PM

to Common Crawl, Nir Keren

Hey,

I'm new to Common Crawl, and I'm trying to use it to gather information about company websites.

I started by inspecting the discoverorg.com website and came across an issue.

When I query the following URLs, I get rich information about this site (about 900 indexed pages):

http://index.commoncrawl.org/CC-MAIN-2016-44-index?url=*.discoverorg.com%2F*&output=json

http://index.commoncrawl.org/CC-MAIN-2016-40-index?url=*.discoverorg.com%2F*&output=json

But for the indexes from 2016-36 back to 2013-20, I get suspiciously partial results (around 18 indexed pages).

Examples:

http://index.commoncrawl.org/CC-MAIN-2016-36-index?url=*.discoverorg.com%2F*&output=json

http://index.commoncrawl.org/CC-MAIN-2016-30-index?url=*.discoverorg.com%2F*&output=json

http://index.commoncrawl.org/CC-MAIN-2013-20-index?url=*.discoverorg.com%2F*&output=json

Am I missing something?

Thanks.

Sebastian Nagel

Dec 7, 2016, 7:26:14 AM

to common...@googlegroups.com

Hi,

> Am I missing something?

No, the numbers are likely to be correct.

Common Crawl is not able to crawl the web exhaustively in every monthly crawl.
Only a sample snapshot is crawled - a subset of sites/hosts/domains and also only
a subset of the pages per site.

In the past the crawler relied on donations of clean seed lists (mostly free of duplicates
and spam). This did not always work well, and because donations were missing or too small
we were not able to keep the crawls fresh. This has improved starting with the September crawl,
which is why the number of pages for discoverorg.com has increased.
However, because of sampling, the number of pages from a particular domain may go up and down
over time.
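
For anyone who wants to compare capture counts across crawls, here is a minimal Python sketch (not an official client). It builds the same query URLs as in the question and counts the records in a response body, assuming that output=json returns one JSON object per line. The function names are made up for illustration, and the sketch ignores any paging of large result sets as well as error responses.

```python
import json
from urllib.parse import urlencode
from urllib.request import urlopen

# Host taken from the URLs in the question.
INDEX_HOST = "http://index.commoncrawl.org"


def index_query_url(crawl_id, url_pattern):
    """Build an index query URL such as the ones quoted in the question.

    crawl_id is e.g. "CC-MAIN-2016-44"; url_pattern is e.g. "*.discoverorg.com/*".
    """
    # safe="*" keeps the wildcard readable; "/" is still escaped as %2F.
    query = urlencode({"url": url_pattern, "output": "json"}, safe="*")
    return f"{INDEX_HOST}/{crawl_id}-index?{query}"


def count_records(body):
    """Count records in a response body (assumed: one JSON object per line)."""
    count = 0
    for line in body.splitlines():
        line = line.strip()
        if not line:
            continue  # skip blank lines
        if isinstance(json.loads(line), dict):
            count += 1
    return count


def fetch_count(crawl_id, url_pattern):
    """Fetch one index response and count its records (needs network access)."""
    with urlopen(index_query_url(crawl_id, url_pattern)) as resp:
        return count_records(resp.read().decode("utf-8"))
```

Calling fetch_count("CC-MAIN-2016-44", "*.discoverorg.com/*") and then the same for "CC-MAIN-2016-36" should make the difference between crawls visible side by side.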

Best,
Sebastian
