Hi,
> What is that subset and how is it determined?
Please have a look at the monthly crawl announcements on
http://commoncrawl.org/connect/blog/
and the crawl statistics
https://commoncrawl.github.io/cc-crawl-statistics/
> Are certain parts of the internet excluded and, if so, for what reasons?
Yes. Content disallowed by robots.txt rules is not crawled. There are
also some sites that asked to be excluded but are not able to set up
a robots.txt on their host(s).
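
If you want to check how a given robots.txt would be interpreted, here is a
minimal sketch using Python's standard urllib.robotparser. Two assumptions
that are not stated above: the user-agent token "CCBot", and the sample
robots.txt content, which is purely hypothetical. This is not the crawler's
actual implementation, only an illustration of the rule matching.

```python
from urllib.robotparser import RobotFileParser

# Hypothetical robots.txt content, for illustration only.
ROBOTS_TXT = """\
User-agent: CCBot
Disallow: /private/

User-agent: *
Disallow:
"""


def is_allowed(robots_txt: str, user_agent: str, url: str) -> bool:
    """Return True if the given robots.txt text permits user_agent to fetch url."""
    parser = RobotFileParser()
    # parse() takes the file content as a list of lines; no network access needed.
    parser.parse(robots_txt.splitlines())
    return parser.can_fetch(user_agent, url)


# "CCBot" is assumed here as the crawler's user-agent token.
print(is_allowed(ROBOTS_TXT, "CCBot", "https://example.com/private/page.html"))
print(is_allowed(ROBOTS_TXT, "CCBot", "https://example.com/index.html"))
```

With the sample rules, the first URL is disallowed and the second is allowed.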
> Does the subset change from month to month or is it static?
It changes every month.
> * Are the crawls performed only by Common Crawl Foundation or also by one or more third parties
> who do it for you?
The crawl is performed by the foundation. But we accept seed donations
(verified, almost spam-free URL lists) and are open to cooperation
regarding tools, research and software. The crawler software is published
on
https://github.com/commoncrawl/.
> * One of the restrictions in the terms of use applicable to the crawled data is that is not
> permitted to use “the communication systems provided by the Site for any commercial solicitation
> purposes.” What communications systems is this provision referring to?
The "Site" is defined in the terms of use [1] as "commoncrawl.com website".
For example, I would count any comment functionality or this group (mailing list)
as a communication system provided by the Site. If in doubt, you should ask a lawyer.
> * Can we extract data and use for commercial purposes?
Commercial use is not excluded. Of course, you should follow the terms of use [1,2] and
"don’t break" any law.
Best,
Sebastian
[1] http://commoncrawl.org/terms-of-use/full/
[2] http://commoncrawl.org/terms-of-use/
On 03/13/2018 09:01 PM, Karthik Shyamsunder wrote:
> We want to use common crawls data. We have a few questions:
>
>
>
> * We understand that the Common Crawl Foundation only crawls a subset of the Internet’s webpages
> on a monthly basis. What is that subset and how is it determined?
> o Are certain parts of the internet excluded and, if so, for what reasons?
> o Does the subset change from month to month or is it static?
> * Are the crawls performed only by Common Crawl Foundation or also by one or more third parties
> who do it for you?
> * One of the restrictions in the terms of use applicable to the crawled data is that is not
> permitted to use “the communication systems provided by the Site for any commercial solicitation
> purposes.” What communications systems is this provision referring to?
> * Can we extract data and use for commercial purposes?