Does every snapshot take only updates?

hany elshafey

Aug 3, 2022, 5:58:46 PM
to Common Crawl
Hello CC,
First of all, I want to thank everyone involved in building useful open source projects like Common Crawl.
Thanks, CC team.

As I understand it, each Common Crawl snapshot scrapes all internet sites. Is that right?
If I want to crawl a particular site:
Does every snapshot take only the updates, or does it take the whole site again?
Is the retrieved data a sample, or the whole site's data?


Hany elshafey
spatial data scientist

Sebastian Nagel

Aug 4, 2022, 1:26:41 AM
to common...@googlegroups.com
Hi Hany,

> all internet sites

Every snapshot includes many web sites but definitely not all of them.
Recent snapshot crawls include around 45 million sites (unique host
names) or 35 million registered domains.

> Does every snapshot take only the updates, or does it take the whole
> site again? Is the retrieved data a sample, or the whole site's data?

Web pages (or URLs) are sampled. Newly discovered URLs/links have a
higher probability of being selected during sampling, but known pages
are also revisited after some time.
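
One way to see the sampling and revisiting for a particular site is to
compare which of its URLs were captured in two different snapshots. The
following is only a minimal sketch: it assumes the public CDX index
server at index.commoncrawl.org and its `url` and `output=json` query
parameters, and the crawl ID and site used in the usage note are
placeholders. The helper names are made up for this illustration, and
the sketch ignores result paging, which matters for large sites.

```python
import json
from urllib.parse import urlencode
from urllib.request import urlopen

# Assumption: the public Common Crawl CDX index server.
INDEX_SERVER = "https://index.commoncrawl.org"


def cdx_query_url(crawl_id, url_pattern):
    """Build the index query URL for one snapshot (crawl)."""
    params = urlencode({"url": url_pattern, "output": "json"})
    return f"{INDEX_SERVER}/{crawl_id}-index?{params}"


def parse_captured_urls(cdx_json_lines):
    """Collect the captured URLs from a response with one JSON record per line."""
    urls = set()
    for line in cdx_json_lines.splitlines():
        line = line.strip()
        if line:
            urls.add(json.loads(line)["url"])
    return urls


def fetch_captured_urls(crawl_id, url_pattern):
    """Query the index over the network (first result page only).

    urlopen raises an HTTPError if the server answers with an error
    status, which is not handled here.
    """
    with urlopen(cdx_query_url(crawl_id, url_pattern)) as resp:
        return parse_captured_urls(resp.read().decode("utf-8"))


def compare_snapshots(older, newer):
    """Split the URLs of two snapshots into revisited, new and not recaptured."""
    return {
        "revisited": older & newer,
        "new": newer - older,
        "not_recaptured": older - newer,
    }
```

Calling `fetch_captured_urls` with a crawl ID such as "CC-MAIN-2022-27"
and a pattern such as "example.com/*" for two snapshots, then passing
both sets to `compare_snapshots`, shows how much of the site was
revisited and how much is newly sampled.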

Best,
Sebastian