Hi,
> 1. Does the bot avoid crawling the exact same URL it has crawled
> before? (If so, does the bot check whether it has been crawled before
> through an internal CC database that stores timestamp data?)
URLs may be revisited. The revisit schedule depends on the score
associated with that URL, but also on the response status, the detected
MIME type, whether the page content has changed in previous fetches,
and a random factor.
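
To make this concrete, here is a rough sketch in Python of how such
factors could be combined into a single refetch priority. It is purely
illustrative -- the function name, weights and thresholds are made up
and do not reflect the actual crawler code:

  import random

  def revisit_priority(url_score, http_status, mime_type, content_changed):
      # Start from the score assigned to the URL.
      priority = url_score
      # Captures that came back with an error status are revisited less often.
      if http_status >= 400:
          priority *= 0.1
      # Non-HTML captures are refetched less frequently.
      if mime_type not in ("text/html", "application/xhtml+xml"):
          priority *= 0.5
      # Content that did not change in previous fetches lowers the priority.
      if not content_changed:
          priority *= 0.7
      # A random factor spreads revisits out over time.
      priority *= random.uniform(0.8, 1.2)
      return priority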
> 2. Let's say the bot is guaranteed not to crawl the exact same
> URL
This is not guaranteed, nor is it desired. For example, the homepage of
a site might get fetched anew from time to time. In recent crawls,
about 60% of the page captures are revisits.
> Which means, cc itself does not do any deduplication process?
Not directly. A page capture is not discarded because its content is a
duplicate of another capture. However, if duplicates are detected, the
likelihood that one of the duplicates is fetched again becomes lower.
At present, only exact duplicates are detected.
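
Conceptually, exact-duplicate detection works like the sketch below
(again just an illustration; the digest algorithm and the demotion
factor are assumptions, not the actual implementation):

  import hashlib

  # content digest -> first URL seen with that payload
  seen_digests = {}

  def adjust_priority(url, payload, priority):
      # Exact duplicates share the same content digest.
      digest = hashlib.sha1(payload).hexdigest()
      if digest in seen_digests and seen_digests[digest] != url:
          # The capture is kept, but this duplicate becomes less likely
          # to be fetched again.
          return priority * 0.5
      seen_digests.setdefault(digest, url)
      return priority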
Best,
Sebastian
On 5/8/23 06:44, Soyeon Kim wrote:
> Hi,
>
> I have a question about when the CC bot crawls a website.
> I see it crawls periodically (link
> <https://skeptric.com/common-crawl-time-ranges/>) and there are several
> major domains where the CC bot crawls from (link
> <https://commoncrawl.github.io/cc-crawl-statistics/plots/domains>).