Question about the process of crawling


Soyeon Kim

May 8, 2023, 12:44:17 AM
to Common Crawl
Hi, 

I have a question about how the CC bot crawls websites.
I see it crawls periodically (link) and there are several major domains the CC bot crawls from (link).

My questions are:
1. Does the bot avoid crawling the exact same URL it has crawled before? (If so, does it check this against an internal CC database that stores timestamp data?)
2. Let's say the bot is guaranteed not to crawl the exact same URL. Is there still a possibility of getting the exact same document when the web author publishes the same document at a different URL? In other words, does CC itself not do any deduplication?

Thank you in advance!


Sebastian Nagel

May 8, 2023, 9:02:34 AM
to common...@googlegroups.com
Hi,

> 1. Does the bot avoid crawling the exact same URL it has crawled
> before? (If so, does it check this against an internal CC database
> that stores timestamp data?)

URLs may be revisited. The revisit schedule depends on the score
associated with that URL but also on the response status, the detected
MIME type, whether the page content has changed in previous fetches,
and a random factor.
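
To give a rough picture (this is not the actual crawler code, which is
based on a modified Apache Nutch; all names, weights and thresholds
below are invented for illustration), the scheduling idea could be
sketched in Python like this:

```python
import random

# Invented bounds -- purely illustrative.
MIN_INTERVAL_DAYS = 30
MAX_INTERVAL_DAYS = 365

def next_fetch_interval(prev_interval_days, url_score, http_status,
                        mime_type, content_changed):
    """Sketch of an adaptive revisit schedule driven by the factors above."""
    interval = prev_interval_days

    if http_status in (404, 410):
        interval *= 4.0            # page is gone: back off strongly
    elif http_status == 304 or not content_changed:
        interval *= 1.5            # unchanged content: revisit less often
    else:
        interval *= 0.75           # content changed: revisit sooner

    if mime_type != "text/html":
        interval *= 2.0            # non-HTML content is revisited less

    # higher-scored URLs (e.g. homepages) are refetched more frequently
    interval /= max(url_score, 0.1)

    # random jitter spreads revisits across crawls
    interval *= random.uniform(0.8, 1.2)

    return min(max(interval, MIN_INTERVAL_DAYS), MAX_INTERVAL_DAYS)
```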

> 2. Let's say the bot is guaranteed not to crawl the exact same
> URL

This is not guaranteed, nor is it desired. For example, the homepage of
a site might get fetched anew from time to time. In recent crawls,
about 60% of the page captures are revisits.

> In other words, does CC itself not do any deduplication?

Not directly. A page capture is not discarded because its content is a
duplicate of another capture. However, if duplicates are detected, the
likelihood that one of the duplicates is fetched again becomes lower.
So far, only exact duplicates are detected.
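
If you need exact deduplication on your side, one option is to skip
captures whose payload digest you have already seen. A minimal sketch
using the warcio Python library (the WARC file name below is just a
placeholder):

```python
from warcio.archiveiterator import ArchiveIterator

def unique_responses(warc_path):
    """Yield (url, digest) only for the first capture of each payload digest.

    This mirrors the 'exact duplicates only' detection: near-duplicates
    with small differences will not be caught.
    """
    seen = set()
    with open(warc_path, "rb") as stream:
        for record in ArchiveIterator(stream):
            if record.rec_type != "response":
                continue
            digest = record.rec_headers.get_header("WARC-Payload-Digest")
            url = record.rec_headers.get_header("WARC-Target-URI")
            if digest in seen:
                continue   # exact duplicate of an earlier capture
            seen.add(digest)
            yield url, digest

# usage (file name is a placeholder):
# for url, digest in unique_responses("CC-MAIN-example.warc.gz"):
#     print(url, digest)
```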

Best,
Sebastian


On 5/8/23 06:44, Soyeon Kim wrote:
> Hi,
>
> I have a question about how the CC bot crawls websites.
> I see it crawls periodically (link
> <https://skeptric.com/common-crawl-time-ranges/>) and there are several
> major domains the CC bot crawls from (link
> <https://commoncrawl.github.io/cc-crawl-statistics/plots/domains>).

Soyeon Kim

May 8, 2023, 9:49:53 AM
to Common Crawl
First of all, thank you for the detailed explanation! :)

To summarize:
> 2. Let's say the bot is guaranteed not to crawl the exact same
> URL

This is not guaranteed, nor is it desired. For example, the homepage of
a site might get fetched anew from time to time. In recent crawls,
about 60% of the page captures are revisits.
-> Even if visiting the exact same URL is not desired, the same page (i.e. one with the exact same or nearly similar content) might still be crawled,
and this happens quite often, considering the 60% mentioned.

> In other words, does CC itself not do any deduplication?

Not directly. A page capture is not discarded because its content is a
duplicate of another capture. However, if duplicates are detected, the
likelihood that one of the duplicates is fetched again becomes lower.
So far, only exact duplicates are detected.

-> I actually don't get the answer "if duplicates are detected, the likelihood that one of the duplicates is fetched again becomes lower."
Also, you mentioned in the first sentence that a page capture is not discarded, but the last line states that "only exact duplicates are detected", which tells me that an exact duplicate document will be discarded after all. Could you please clarify?

Thank you so much for handling all the issues & Q&A.

Cheers
soyeon

Sebastian Nagel

May 8, 2023, 10:36:06 AM
to common...@googlegroups.com
Hi,

> -> I actually don't get the answer "if duplicates are detected the
> likelihood that one of the duplicates is fetched again becomes lower."

There are two heuristics to achieve this:

(a) if the crawler detects that a revisit of a page resulted in a
duplicate (same content or HTTP 304 Not Modified), the revisit interval
of that URL is increased.

(b) if there are two or more URLs leading to the same content, only one
of them is kept as "non-duplicate". The others are marked as "duplicate"
with a significantly increased revisit schedule.
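
In pseudocode-like Python (again, not the actual crawler code; the
state fields and factors are invented just to illustrate the two
heuristics):

```python
from __future__ import annotations
from dataclasses import dataclass

BACKOFF = 1.5            # invented factor for heuristic (a)
DUPLICATE_PENALTY = 4.0  # invented factor for heuristic (b)

@dataclass
class UrlState:
    url: str
    interval_days: float = 30.0
    last_digest: str = ""
    is_duplicate: bool = False

def after_revisit(state: UrlState, http_status: int, digest: str) -> None:
    """Heuristic (a): a revisit that yields the same content (or HTTP 304)
    stretches the revisit interval of that URL."""
    if http_status == 304 or digest == state.last_digest:
        state.interval_days *= BACKOFF
    state.last_digest = digest

def mark_duplicates(states: list[UrlState]) -> None:
    """Heuristic (b): among URLs sharing one content digest, keep the first
    as 'non-duplicate' and push the others far into the future."""
    keeper_by_digest: dict[str, UrlState] = {}
    for s in states:
        keeper = keeper_by_digest.setdefault(s.last_digest, s)
        if keeper is not s:
            s.is_duplicate = True
            s.interval_days *= DUPLICATE_PENALTY
```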

> which tells me that an exact duplicate document will be discarded
> after all.

No capture is discarded. If it is made, it is written to the web
archives, even if it is a duplicate of another capture. Maybe the
strategy is better described as "do not make the same decision again
which led to a duplicate".

Best,
Sebastian