Dear Common Crawl Team,
We have identified a list of websites that appear to be missing from Common Crawl or were not crawled as expected. We are trying to understand the scope of these omissions and the reasons for them.
To further our investigation, we would like to ask:
Is there any way to determine which websites Common Crawl does not crawl and, if so, since when they have been excluded? Specifically, we are interested in whether any historical data exists regarding websites excluded from crawls.
Are there any alternative sources for obtaining data related to websites that are consistently excluded from Common Crawl? Does the Common Crawl team have any recommendations for other datasets or resources that might provide this information?
Some examples of missing URLs include:
Any insight or guidance the Common Crawl team can provide would be greatly appreciated.
Thank you for your time and assistance.
Sincerely,
--
You received this message because you are subscribed to the Google Groups "Common Crawl" group.
To unsubscribe from this group and stop receiving emails from it, send an email to common-crawl...@googlegroups.com.
To view this discussion visit https://groups.google.com/d/msgid/common-crawl/CAHsyto%3DdXx4aAL-KvNJJM03JPU0Bxw%3DU4ZpoEP1YOc7jUt5qGA%40mail.gmail.com.
Thank you for your response. I’ve downloaded some of the archived files, but they all appear to be landing pages. Is there a way to download the files that are linked from these landing pages?