Questions about Common Crawl

Karthik Shyamsunder

Mar 13, 2018, 4:01:24 PM
to Common Crawl

We want to use Common Crawl's data. We have a few questions:

  • We understand that the Common Crawl Foundation only crawls a subset of the Internet’s webpages on a monthly basis.  What is that subset and how is it determined? 
    • Are certain parts of the internet excluded and, if so, for what reasons? 
    • Does the subset change from month to month or is it static?  
  • Are the crawls performed only by Common Crawl Foundation or also by one or more third parties who do it for you? 
  • One of the restrictions in the terms of use applicable to the crawled data is that it is not permitted to use “the communication systems provided by the Site for any commercial solicitation purposes.” What communication systems is this provision referring to?
  • Can we extract data and use it for commercial purposes?

Karthik Shyamsunder

Mar 15, 2018, 2:13:15 PM
to Common Crawl
I see a lot of folks looking at the questions. Could someone from Common Crawl please answer them? I would really appreciate it.

Sebastian Nagel

Mar 15, 2018, 2:59:16 PM
to common...@googlegroups.com
Hi,

> What is that subset and how is it determined?

Please have a look at the monthly crawl announcements on
http://commoncrawl.org/connect/blog/
and the crawl statistics
https://commoncrawl.github.io/cc-crawl-statistics/

> Are certain parts of the internet excluded and, if so, for what reasons?

Yes. Content disallowed by the robots.txt rules is not crawled. We also
exclude some sites that asked to be excluded but are not able to set up
a robots.txt on their host(s).
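
For readers who want to see how a robots.txt rule excludes a page, here is a minimal sketch using Python's standard `urllib.robotparser`. It is not Common Crawl's crawler code. The user-agent token `CCBot`, the `is_allowed` helper and the example rules are assumptions made for illustration only.

```python
from urllib.robotparser import RobotFileParser


def is_allowed(robots_txt: str, url: str, user_agent: str = "CCBot") -> bool:
    """Return True if the given robots.txt text permits user_agent to fetch url."""
    parser = RobotFileParser()
    # parse() takes the file content as a list of lines; no network access is needed.
    parser.parse(robots_txt.splitlines())
    return parser.can_fetch(user_agent, url)


# Hypothetical rules: the crawler is kept out of /private/ only.
example_rules = """\
User-agent: CCBot
Disallow: /private/
"""

blocked = is_allowed(example_rules, "https://example.com/private/page.html")
permitted = is_allowed(example_rules, "https://example.com/public/page.html")
print(blocked, permitted)
```

A site owner could adapt such a check to verify that a rule has the intended effect before publishing it.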

> Does the subset change from month to month or is it static?

It changes every month.

> * Are the crawls performed only by Common Crawl Foundation or also by one or more third parties
> who do it for you?

The crawl is performed by the foundation. But we accept seed donations
(verified, almost spam-free URL lists) and are open to cooperation
on tools, research and software. The crawler software is published
on https://github.com/commoncrawl/.

> * One of the restrictions in the terms of use applicable to the crawled data is that it is not
> permitted to use “the communication systems provided by the Site for any commercial solicitation
> purposes.” What communications systems is this provision referring to?

The "Site" is defined in the terms of use [1] as "commoncrawl.com website".
For example, I would count any comment functionality or this group (mailing list)
as a provided communication system. But if in doubt, you should ask a lawyer.

> * Can we extract data and use it for commercial purposes?

Commercial use is not excluded. Of course, you should follow the terms of use [1,2] and
"don’t break" any law.

Best,
Sebastian

[1] http://commoncrawl.org/terms-of-use/full/
[2] http://commoncrawl.org/terms-of-use/



Karthik Shyamsunder

Mar 15, 2018, 7:00:20 PM
to Common Crawl
Sebastian,

Thank you so much for the answers. I was hoping that you would respond, since you know the most about the crawl.

Awesome project! Awesome work!

Sincerely,

Karthik



Sameer Thakar

May 3, 2018, 5:09:46 PM
to Common Crawl
> The crawl is performed by the foundation. But we accept seed donations (verified, almost spam-free URL lists) and are open to cooperation on tools, research and software. The crawler software is published on https://github.com/commoncrawl/


Sebastian

Thank you so much for your previous answers. I work with Karthik and have some follow-up questions
about the seed donations of URLs that you accept.

1. Are there criteria for accepting them? I mean, to make sure that they comply with your policies, have followed robots.txt, have honored exclusion requests, etc.
If yes, would you be able to share them?

2. When you say almost spam-free, do you mean you filter/parse the URLs to make sure that they are non-spam on a best-effort basis?
Could you please share any info about that?


Best
Sameer

Sebastian Nagel

May 7, 2018, 10:49:19 AM
to common...@googlegroups.com
Hi Sameer,

> 1. Are there criteria for accepting them? I mean, to make sure that they comply with your policies,
> have followed robots.txt, have honored exclusion requests, etc.
> If yes, would you be able to share them?

We have no formal criteria for seed donations. The criteria you mentioned are, of course,
important: it wouldn't make sense to add a large list of seeds that only increases
the number of "forbidden" (and consequently skipped) fetch attempts. But they are not a strict
requirement, as we always check the robots.txt (it may change at any time) and also maintain
our own exclusion rules. In addition,
- the donation should be a representative collection of URLs/links
  (which implies that the amount of spam should be limited)
- we need to be confident that the donation contains only publicly visible links,
  and no private or secret links.

> 2. When you say almost spam-free, do you mean you filter/parse the URLs to make sure that they are
> non-spam on a best-effort basis?
> Could you please share any info about that?

We have now blacklisted about 350,000 domains that were classified as hosting link spam
using the webgraph datasets (cf. [1]). The list also includes domain parking sites, yellow
pages and phone books with a large number of subdomains. But we have no ready-to-use tool
to classify spam; manual verification is still required.
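
As an illustration of how a domain-level blacklist can be applied to a seed list, here is a minimal sketch. It is not Common Crawl's actual tooling. The helper names `is_blocked` and `filter_seeds` and the example domains are hypothetical. The sketch assumes that blocking a domain also blocks all of its subdomains.

```python
from urllib.parse import urlsplit


def is_blocked(url: str, blocked_domains: set[str]) -> bool:
    """Return True if the URL's host, or any parent domain of it, is blocked."""
    host = (urlsplit(url).hostname or "").lower()
    labels = host.split(".")
    # Test "a.b.example", then "b.example", then "example".
    return any(".".join(labels[i:]) in blocked_domains for i in range(len(labels)))


def filter_seeds(urls: list[str], blocked_domains: set[str]) -> list[str]:
    """Keep only the seed URLs whose host is not covered by the blocklist."""
    return [url for url in urls if not is_blocked(url, blocked_domains)]


# Hypothetical blocklist and seed list.
blocked_domains = {"spam.example"}
seeds = [
    "https://a.spam.example/x",
    "https://example.org/",
    "http://spam.example/",
]
kept = filter_seeds(seeds, blocked_domains)
print(kept)
```

Matching on parent domains matters for lists like the one described above, because spam and parking sites often spread over a large number of subdomains.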

Best,
Sebastian



[1] http://commoncrawl.org/2018/05/webgraphs-feb-mar-apr-2018/

Sameer Thakar

May 8, 2018, 4:37:23 PM
to Common Crawl
Thank you so much for the answers, Sebastian.

Sameer Thakar

May 18, 2018, 5:50:37 PM
to Common Crawl
Sebastian

Are there any plans to analyze (if you haven't already) the impact of GDPR on your data collection?

Thanks
Sameer

Sameer Thakar

May 25, 2018, 6:22:07 AM
to Common Crawl
Sebastian

Any luck with this?

Thanks
Sameer

Sara Crouse

Jun 5, 2018, 11:56:11 AM
to Common Crawl
Hi Sameer, 

Apologies for the delay in replying to your question. We do not have plans at this time to conduct a "formal" analysis of the impact of GDPR on Common Crawl. In the coming month I will gather opinions from a few advisors, review the new regulations in depth, and share any findings of interest in this group. Also, feel free to follow up with me directly anytime; all thoughts, ideas and opinions on this complex topic are welcome!

Best,

Sara

Sameer Thakar

Jun 9, 2018, 9:02:22 PM
to Common Crawl
Thanks.

Best
Sameer