Inquiry Regarding Missing Websites in Common Crawl

Manmohan Nayak

Mar 31, 2025, 4:25:02 PM

to common...@googlegroups.com

Dear Common Crawl Team,

We have identified a list of websites that appear to be missing or were not crawled as expected. We are trying to understand the scope and reasons for these omissions.

To further our investigation, we would like to ask:

Is there any way to determine which websites Common Crawl does not crawl, and if so, since when? Specifically, we are interested in whether any historical data exists regarding websites excluded from crawls.
Are there any alternative sources for data on websites that are consistently excluded from Common Crawl? Does the Common Crawl team have any recommendations for other datasets or resources that might provide this information?

Some examples of missing URLs include:

Any insight or guidance the Common Crawl team can provide would be greatly appreciated.

Thank you for your time and assistance.

Sincerely,

Greg Lindahl

Apr 1, 2025, 12:32:55 AM

to common...@googlegroups.com

CCBot doesn't crawl websites that block it in robots.txt. There is also no direct way for you to see which publishers have sent us legal letters asking us to stop crawling.
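
For anyone who wants to test a site's robots.txt against this rule, here is a minimal sketch using Python's standard `urllib.robotparser`. The helper name `ccbot_allowed` and the sample robots.txt are made up for illustration, and the standard-library parser may not interpret every rule exactly the way CCBot does.

```python
from urllib.robotparser import RobotFileParser


def ccbot_allowed(robots_txt: str, url: str, agent: str = "CCBot") -> bool:
    """Return True if `agent` may fetch `url` under the given robots.txt text."""
    parser = RobotFileParser()
    # parse() takes an iterable of lines, so no network access is needed here.
    parser.parse(robots_txt.splitlines())
    return parser.can_fetch(agent, url)


# Made-up sample: blocks CCBot everywhere, allows every other agent.
sample = """\
User-agent: CCBot
Disallow: /

User-agent: *
Allow: /
"""

print(ccbot_allowed(sample, "https://example.com/page"))
print(ccbot_allowed(sample, "https://example.com/page", agent="OtherBot"))
```

This only covers the robots.txt case; sites excluded after a legal request would not show up this way.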

Hope this helps!

-- greg

--
You received this message because you are subscribed to the Google Groups "Common Crawl" group.
To unsubscribe from this group and stop receiving emails from it, send an email to common-crawl...@googlegroups.com.
To view this discussion visit https://groups.google.com/d/msgid/common-crawl/CAHsyto%3DdXx4aAL-KvNJJM03JPU0Bxw%3DU4ZpoEP1YOc7jUt5qGA%40mail.gmail.com.

Tom Morris

Apr 1, 2025, 4:30:16 PM

to common...@googlegroups.com

To expand on the "what was blocked when" piece of the question, since 2016 the robots.txt files for each crawl have been saved, so you can check to see how the blocking changed over time.

https://commoncrawl.org/blog/robotstxt-and-404-redirect-data-sets
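
As a rough sketch of how one might track blocking over time once the robots.txt bodies for a site have been extracted from those archives: the snippet below assumes you already have one robots.txt text per crawl, in chronological order. The crawl labels, the sample bodies, and the function names are hypothetical, and extracting the bodies from the archive files is not shown.

```python
from urllib.robotparser import RobotFileParser


def is_allowed(robots_txt: str, url: str, agent: str = "CCBot") -> bool:
    """Check one robots.txt snapshot with the standard-library parser."""
    parser = RobotFileParser()
    parser.parse(robots_txt.splitlines())
    return parser.can_fetch(agent, url)


def blocking_changes(snapshots, url, agent="CCBot"):
    """Return (label, allowed) for the first snapshot and for every
    later snapshot whose status differs from the one before it.

    `snapshots` is a list of (label, robots_txt) in chronological order.
    """
    changes = []
    previous = None
    for label, body in snapshots:
        allowed = is_allowed(body, url, agent)
        if allowed != previous:
            changes.append((label, allowed))
        previous = allowed
    return changes


# Hypothetical snapshots; labels and contents are invented for the example.
snapshots = [
    ("crawl-A", "User-agent: *\nAllow: /\n"),
    ("crawl-B", "User-agent: *\nAllow: /\n"),
    ("crawl-C", "User-agent: CCBot\nDisallow: /\n"),
]

print(blocking_changes(snapshots, "https://example.com/"))
```

The first entry in the result is the initial status; each later entry marks a crawl where the site started or stopped blocking the agent.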

Tom


Manmohan Nayak

Apr 3, 2025, 6:49:51 PM

to common...@googlegroups.com

Thank you for your response. I’ve downloaded some of the archived files, but they all appear to be landing pages. Is there a way to download the files that are linked from these landing pages? In web crawling terms, I’m referring to the ability to "follow the links" and download the content available through those links as well.

Regards



Tom Morris

Apr 3, 2025, 9:47:57 PM

to common...@googlegroups.com

On Thu, Apr 3, 2025 at 6:49 PM Manmohan Nayak <manmoha...@gmail.com> wrote:

Thank you for your response. I’ve downloaded some of the archived files, but they all appear to be landing pages. Is there a way to download the files that are linked from these landing pages?

The crawls include more than just home pages, but they're not guaranteed to be a complete crawl of any domain (or even all the links from a crawled page).
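
To see which pages of a domain a given crawl actually captured, one option is the Common Crawl URL index. The sketch below only builds the query URL and parses a response; it does not make the request. The crawl identifier `CC-MAIN-2025-13` is used as a placeholder, the sample response is invented, and the helper names are hypothetical. It assumes the index returns one JSON object per line with a `url` field when `output=json` is requested.

```python
import json
from urllib.parse import urlencode

INDEX_HOST = "https://index.commoncrawl.org"


def build_index_query(crawl_id: str, domain: str) -> str:
    """Build an index query URL for every captured page under `domain`."""
    params = urlencode({"url": f"{domain}/*", "output": "json"})
    return f"{INDEX_HOST}/{crawl_id}-index?{params}"


def captured_urls(response_text: str) -> set[str]:
    """Collect the distinct URLs from a line-delimited JSON response."""
    urls = set()
    for line in response_text.splitlines():
        line = line.strip()
        if line:  # skip blank lines
            urls.add(json.loads(line)["url"])
    return urls


# Invented sample response, standing in for what the index might return.
sample_response = (
    '{"url": "https://example.com/", "status": "200"}\n'
    '{"url": "https://example.com/about", "status": "200"}\n'
)

print(build_index_query("CC-MAIN-2025-13", "example.com"))
print(sorted(captured_urls(sample_response)))
```

Comparing the captured URLs against the links on a page you care about shows which linked pages made it into that crawl and which did not.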