
Inquiry Regarding Missing Websites in Common Crawl


Manmohan Nayak

Mar 31, 2025, 4:25:02 PM
to common...@googlegroups.com

Dear Common Crawl Team,

We have identified a list of websites that appear to be missing or were not crawled as expected. We are trying to understand the scope and reasons for these omissions.

To further our investigation, we would like to ask:

  1. Is there any way to determine which websites Common Crawl does not crawl and, if so, since when? Specifically, we would like to know whether any historical data exists on websites excluded from crawls.

  2. Are there alternative sources of data on websites that are consistently excluded from Common Crawl? Does the Common Crawl team recommend any other datasets or resources that might provide this information?

Some examples of missing URLs include:

Any insight or guidance the Common Crawl team can provide would be greatly appreciated.

Thank you for your time and assistance.

Sincerely,

Greg Lindahl

Apr 1, 2025, 12:32:55 AM
to common...@googlegroups.com
CCBot doesn't crawl websites that block it in robots.txt. You also can't directly see which publishers have sent us legal letters asking us to stop crawling.
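
For readers who want to test the robots.txt case themselves, here is a minimal sketch using Python's standard `urllib.robotparser`. It assumes you already have the robots.txt text in hand; the helper name `ccbot_allowed` and the sample rules are made up for illustration and are not part of any Common Crawl tooling.

```python
from urllib.robotparser import RobotFileParser


def ccbot_allowed(robots_txt: str, url: str, agent: str = "CCBot") -> bool:
    """Return True if the given robots.txt text lets `agent` fetch `url`."""
    parser = RobotFileParser()
    # parse() takes an iterable of lines, so no network access is needed.
    parser.parse(robots_txt.splitlines())
    return parser.can_fetch(agent, url)


# Hypothetical robots.txt that blocks CCBot but allows every other agent.
SAMPLE = """\
User-agent: CCBot
Disallow: /

User-agent: *
Allow: /
"""

if __name__ == "__main__":
    print(ccbot_allowed(SAMPLE, "https://example.com/page"))
    print(ccbot_allowed(SAMPLE, "https://example.com/page", agent="SomeOtherBot"))
```

A site that shows up as blocked here would explain its absence from the crawl; a site that is allowed may still be missing for other reasons.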

Hope this helps!

-- greg



Tom Morris

Apr 1, 2025, 4:30:16 PM
to common...@googlegroups.com
To expand on the "what was blocked when" piece of the question: since 2016, the robots.txt files for each crawl have been saved, so you can check how the blocking changed over time.
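
As a rough sketch of that kind of check, the snippet below takes robots.txt snapshots for one site in chronological order and reports the points where CCBot's blocked status flips. How you extract the snapshots from the archives is left out; the crawl labels, the sample rules, and the function names are hypothetical.

```python
from urllib.robotparser import RobotFileParser


def blocks_agent(robots_txt: str, agent: str = "CCBot", path: str = "/") -> bool:
    """Return True if this robots.txt text forbids `agent` from fetching `path`."""
    parser = RobotFileParser()
    parser.parse(robots_txt.splitlines())
    return not parser.can_fetch(agent, path)


def blocking_changes(snapshots, agent: str = "CCBot"):
    """Given (label, robots_txt) pairs in chronological order, return the
    (label, blocked) pairs at which the blocked status first appears or flips."""
    changes = []
    previous = None
    for label, text in snapshots:
        blocked = blocks_agent(text, agent)
        if blocked != previous:
            changes.append((label, blocked))
            previous = blocked
    return changes


# Hypothetical snapshots; the labels are placeholders, not real crawl IDs.
SNAPSHOTS = [
    ("crawl-A", "User-agent: *\nDisallow:\n"),
    ("crawl-B", "User-agent: CCBot\nDisallow: /\n"),
    ("crawl-C", "User-agent: CCBot\nDisallow: /\n"),
]

if __name__ == "__main__":
    for label, blocked in blocking_changes(SNAPSHOTS):
        print(label, "blocked" if blocked else "allowed")
```

This only checks the site root; a site can also block CCBot from some paths and not others, so pass a specific `path` if that matters for your list.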


Tom

Manmohan Nayak

Apr 3, 2025, 6:49:51 PM
to common...@googlegroups.com
Thank you for your response. I've downloaded some of the archived files, but they all appear to be landing pages. Is there a way to download the files that are linked from these landing pages? In web crawling terms, I mean the ability to "follow the links" and download the content available through those links as well.

Regards
  



--
Best Regards

Tom Morris

Apr 3, 2025, 9:47:57 PM
to common...@googlegroups.com
On Thu, Apr 3, 2025 at 6:49 PM Manmohan Nayak <manmoha...@gmail.com> wrote:
> Thank you for your response. I’ve downloaded some of the archived files, but they all appear to be landing pages. Is there a way to download the files that are linked from these landing pages?

The crawls include more than just home pages, but they're not guaranteed to be a complete crawl of any domain (or even all the links from a crawled page). 

The web is vast and only a small sample is crawled.
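
One way to see which pages of a domain a given crawl actually captured is to query the URL index. The sketch below only builds the query URL and parses a response you have already fetched. It assumes the public index server at index.commoncrawl.org accepts `url` and `output=json` parameters and returns one JSON object per line with a `url` field; the crawl ID and sample records are placeholders to replace with real ones.

```python
import json
from urllib.parse import urlencode

# Assumed base address of the public Common Crawl URL index server.
INDEX_HOST = "https://index.commoncrawl.org"


def build_index_query(crawl_id: str, domain: str) -> str:
    """Build an index query URL for every captured page under `domain`."""
    params = urlencode({"url": f"{domain}/*", "output": "json"})
    return f"{INDEX_HOST}/{crawl_id}-index?{params}"


def captured_urls(json_lines: str):
    """Return the sorted, de-duplicated URLs found in a JSON-lines response."""
    urls = set()
    for line in json_lines.splitlines():
        line = line.strip()
        if not line:
            continue
        urls.add(json.loads(line)["url"])
    return sorted(urls)


# Made-up response body, shaped like the assumed index output.
SAMPLE_RESPONSE = """\
{"url": "https://example.com/", "status": "200"}
{"url": "https://example.com/about", "status": "200"}
{"url": "https://example.com/", "status": "200"}
"""

if __name__ == "__main__":
    # "CC-MAIN-2025-13" is used only as an example crawl ID.
    print(build_index_query("CC-MAIN-2025-13", "example.com"))
    print(captured_urls(SAMPLE_RESPONSE))
```

Comparing the returned list against the pages you expected gives a concrete picture of how partial the coverage of a domain is in that crawl.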

Tom