Subject: Inquiry Regarding Web Crawling Practices and Access Permissions

Manmohan Nayak

May 15, 2025, 5:31:42 PM
to common...@googlegroups.com

Dear Members,

I hope this message finds you well.

I am currently researching large-scale web crawling practices and have two questions that I hope you can shed some light on:

  1. How does Common Crawl handle content from paywalled or subscription-based websites?

  2. Are there any organizations or companies that provide blanket approvals or permissions to access websites specifically for large-scale web crawling or data collection purposes?

Any guidance, references, or resources you could share on this topic would be greatly appreciated.

Thank you for your time and assistance.

Best regards,
Manmohan Nayak


Greg Lindahl

May 15, 2025, 5:50:47 PM
to common...@googlegroups.com
Manmohan,

The most important thing to know is that our crawler (CCBot) always obeys robots.txt. A paywalled or subscription-only website ought to have one.

Also, our crawler doesn't do anything to try to log in to websites -- we never send cookies, for example. Some websites have paywalls that aren't really paywalls if you don't execute JavaScript. If those websites have not blocked us in robots.txt, our crawler will have no idea that there is a paywall.
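
To illustrate the robots.txt point, here is a minimal sketch of how a site operator might check whether a given set of rules blocks the CCBot user agent. It uses Python's standard urllib.robotparser module and is not Common Crawl's actual crawler code. The robots.txt content, the example.com URLs, the "OtherBot" agent, and the is_allowed helper are all hypothetical, chosen only for illustration.

```python
from urllib.robotparser import RobotFileParser

# Hypothetical robots.txt for an imaginary subscription site (an assumption,
# not taken from any real website): CCBot is blocked everywhere, and all
# other crawlers are blocked only from the subscriber area.
ROBOTS_TXT = """\
User-agent: CCBot
Disallow: /

User-agent: *
Disallow: /subscribers/
"""


def is_allowed(robots_txt: str, user_agent: str, url: str) -> bool:
    """Return True if the given rules permit user_agent to fetch url."""
    parser = RobotFileParser()
    # parse() takes an iterable of lines, so no network request is made.
    parser.parse(robots_txt.splitlines())
    return parser.can_fetch(user_agent, url)


if __name__ == "__main__":
    for agent in ("CCBot", "OtherBot"):
        for url in ("https://example.com/article",
                    "https://example.com/subscribers/page"):
            print(agent, url, is_allowed(ROBOTS_TXT, agent, url))
```

With rules like these, a crawler that obeys robots.txt would skip the whole site when identifying as CCBot. Without the CCBot entry, only the /subscribers/ path would be off limits, and a paywall enforced purely in the browser would be invisible to it.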

As for your second question, a good example is the European Union's text and data mining (TDM) exception.

-- greg


