Geographical location of the crawler


david r

Nov 5, 2021, 12:04:01 PM
to Common Crawl
Hi everyone,


First of all, many thanks for creating the Common Crawl dataset.


From which geographical region does the crawling take place? This can be relevant as sites may change web page content based on the geographical region.


The optional warcinfo ip field defined in the WARC specification does not seem to be present in the Common Crawl .warc files. The FAQ on commoncrawl.org lists the old IP range used by a previous version of the crawler and states that the current version crawls from Amazon AWS, but it does not give a geographical location.
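
For readers who want to check this themselves, here is a minimal sketch of how the body of a warcinfo record could be inspected for an ip field. It assumes the body uses the "name: value" line layout of application/warc-fields and that it has already been extracted from the WARC file as text. The helper name parse_warc_fields and the sample payload are hypothetical and made up for illustration; the payload is not copied from a real Common Crawl file.

```python
def parse_warc_fields(payload):
    """Parse a warcinfo body ("name: value" lines) into a dict.

    Field names are lowercased. Lines starting with whitespace are
    treated as continuations of the previous field (a simplification).
    """
    fields = {}
    last_name = None
    for line in payload.splitlines():
        if not line.strip():
            continue  # skip blank lines
        if line[0] in " \t" and last_name is not None:
            # Folded continuation line: append to the previous value.
            fields[last_name] += " " + line.strip()
            continue
        name, sep, value = line.partition(":")
        if not sep:
            continue  # not a "name: value" line, ignore it
        last_name = name.strip().lower()
        fields[last_name] = value.strip()
    return fields


# Hypothetical warcinfo payload, invented for this example only.
SAMPLE_WARCINFO = (
    "software: ExampleCrawler/1.0\r\n"
    "format: WARC File Format 1.1\r\n"
    "operator: Example Operator\r\n"
)

fields = parse_warc_fields(SAMPLE_WARCINFO)
has_ip = "ip" in fields
print("ip field present:", has_ip)
```

With a real file, the payload would come from the first record of the .warc file (the warcinfo record) rather than from a hard-coded string.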


Given that the Common Crawl data are stored in the AWS region us-east-1, is it correct to assume that the crawler also crawls from that region in the US? Or can the crawling happen from any AWS data center around the world?

Thanks,

David

Sebastian Nagel

Nov 7, 2021, 8:59:57 AM
to common...@googlegroups.com
Hi David,

yes, the crawler is run from Northern Virginia (AWS region us-east-1).
And yes, it's true, the crawler may see different content when run
from a different location or via a proxy.

Because we use short-lived spot instances and there's a shuffle step
between the fetching and the WARC writing, it does not really make sense
to add the IP address to the warcinfo record. The IP address of the WARC
writer task is not that of the fetcher tasks.

Best,
Sebastian

david r

Nov 8, 2021, 9:41:36 AM
to Common Crawl
Hi Sebastian,

Thank you for the answer! I now understand why it makes little sense to add an IP address to the warcinfo record. Since the crawler is run from a single AWS region (without a proxy), the geographical location of the fetching is known, which is specific enough, at least for my question. Thank you for confirming!

Best,
David