Questions on Deduplication in CommonCrawl's Data Processing Pipeline


ertertsdgd sdfsdfsd fsdfsdfsd

May 25, 2025, 8:49:13 AM
to Common Crawl
Dear Common Crawl Team,

I am writing to seek your expertise regarding the handling of Common Crawl data. I am researching data preprocessing strategies for large-scale web corpora and have a few specific questions about Common Crawl's built-in deduplication mechanisms.

My questions are as follows:

1. Does Common Crawl perform any deduplication on crawled pages before releasing its datasets?
If so, what methods are used?

2. If deduplication is applied, does it consider the temporal dimension?
For example, if the same URL is crawled in different Common Crawl snapshots (e.g., 2023-01 and 2023-06) and the content is exactly identical, how are the duplicates handled?
Is only one version retained (e.g., the most recent, or one chosen at random), or are all historical versions preserved?

I would greatly appreciate any insights or documentation you could share on this topic. Thank you for your time and for maintaining such a valuable public resource!

Greg Lindahl

May 25, 2025, 9:22:43 AM
to common...@googlegroups.com
Love the keyboard-smash username!

We don't do any deduplication. If we did, the standard for WARCs is that we would create revisit records, which record the time the page was revisited and indicate that the content was the same as a previous fetch.

This is a subtle but important point -- in the web archiving world, the name of a webpage is its URL plus the time it was fetched.
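
For anyone who wants to look for these, here's a rough sketch of scanning a WARC file for revisit records, assuming Python with the warcio library (the filename example.warc.gz is just a placeholder, not part of any Common Crawl tooling):

    from warcio.archiveiterator import ArchiveIterator

    # Walk every record in a (gzipped) WARC file and print the
    # URL-plus-fetch-time identity of each revisit record.
    with open('example.warc.gz', 'rb') as stream:
        for record in ArchiveIterator(stream):
            if record.rec_type == 'revisit':
                uri = record.rec_headers.get_header('WARC-Target-URI')
                fetched = record.rec_headers.get_header('WARC-Date')
                # WARC-Refers-To-Date names the earlier fetch whose
                # content this revisit duplicates.
                prior = record.rec_headers.get_header('WARC-Refers-To-Date')
                print(uri, fetched, '-> same content as fetch at', prior)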



Greg Lindahl

May 27, 2025, 12:25:50 AM
to common...@googlegroups.com
... and to correct an error I made in my last message:

Our Nutch crawler does send If-Modified-Since headers and will spit out a WARC revisit record if the remote server replies 304.
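
Roughly, that conditional-fetch logic looks like this -- a sketch in plain Python with the requests library, not the actual Nutch code, and the URL and timestamp are made up:

    import requests

    url = 'https://example.com/page'
    # Timestamp of our previous fetch of this URL, in HTTP date format.
    previous_fetch = 'Sun, 25 May 2025 08:49:13 GMT'

    resp = requests.get(url, headers={'If-Modified-Since': previous_fetch})
    if resp.status_code == 304:
        # Server says the content hasn't changed since the last fetch:
        # the crawler writes a WARC revisit record instead of a full
        # response record.
        print('unchanged -> revisit record')
    else:
        print('changed -> full response record')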

-- greg
