Can't get proper indexes from Common Crawl


Tal Golan

Dec 5, 2016, 1:12:58 PM
to Common Crawl, Nir Keren
Hey,

I'm new to Common Crawl, and I'm trying to use it to get information about companies' websites.

I started by inspecting the discoverorg.com website and came across some issues.

When I call the following URLs, I get rich information about this site (about 900 indexed pages).


But from the index 2016-36 down to 2013-20, I get suspiciously partial results (around 18 indexed pages).
Examples:

.
.
.
.
.


Am I missing something?

Thanks.

Sebastian Nagel

Dec 7, 2016, 7:26:14 AM
to common...@googlegroups.com
Hi,

> Am I missing something?

No, the numbers are likely to be correct.

Common Crawl is not able to crawl the web exhaustively in every monthly crawl.
Only a sample snapshot is crawled - a subset of sites/hosts/domains and also only
a subset of the pages per site.

In the past the crawler relied on donations of clean seed lists (mostly free of duplicates
and spam). This did not always work well, and because donations were missing or too small
we weren't able to keep the crawls fresh. This has improved starting with the September crawl,
which is why the number of pages for discoverorg.com has increased.
However, because of sampling, the number of pages from a particular domain may still go up
and down over time.
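
To see how the sampling affects a single domain, one could count the distinct URLs the index returns for each crawl and compare the numbers. The following is only a rough sketch, not an official client: it assumes the public index server at https://index.commoncrawl.org, crawl identifiers of the form CC-MAIN-2016-36, and the CDX query parameters `url`, `matchType` and `output=json`, none of which are spelled out in this thread. The helper names are made up for illustration, and paginated results for large domains are not handled.

```python
import json
from urllib.parse import urlencode
from urllib.request import urlopen

# Assumption: public Common Crawl index server (not named in this thread).
INDEX_SERVER = "https://index.commoncrawl.org"


def build_query_url(crawl_id, domain):
    """Build a CDX query URL for all captures of a domain in one crawl.

    crawl_id is assumed to look like "CC-MAIN-2016-36".
    """
    params = urlencode({"url": domain, "matchType": "domain", "output": "json"})
    return f"{INDEX_SERVER}/{crawl_id}-index?{params}"


def count_unique_urls(body):
    """Count distinct URLs in a response body with one JSON record per line."""
    urls = set()
    for line in body.splitlines():
        line = line.strip()
        if not line:
            continue  # skip blank lines
        record = json.loads(line)
        urls.add(record["url"])
    return len(urls)


def pages_for_crawl(crawl_id, domain):
    """Query the index server (network access required) and count unique URLs."""
    with urlopen(build_query_url(crawl_id, domain)) as resp:
        return count_unique_urls(resp.read().decode("utf-8"))
```

Calling `pages_for_crawl("CC-MAIN-2016-36", "discoverorg.com")` and repeating the call with other crawl identifiers would give a per-crawl count to compare. A page can be captured more than once in a crawl, which is why the sketch counts unique URLs rather than raw index lines.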

Best,
Sebastian