Dear Common Crawl Team,
I am writing to seek your expertise on how Common Crawl data is handled. I am researching data preprocessing strategies for large-scale web corpora and have a few specific questions about Common Crawl's deduplication mechanisms.
My questions are as follows:
1. Does Common Crawl perform any deduplication on crawled pages before releasing its datasets? If so, what methods are used?
2. If deduplication is applied, does it take the temporal dimension into account? For example, if the same URL is crawled in multiple snapshots (e.g., 2023-01 and 2023-06) and the content is exactly identical, how are those duplicates handled? Is only one version retained (e.g., the most recent, or an arbitrary one), or are all historical versions preserved?
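For context, here is a minimal sketch of how I have been probing this myself via the public CDX index at index.commoncrawl.org. The snapshot IDs and the example URL are placeholders, and comparing the CDX "digest" field is my own assumption about how to detect byte-identical captures, not a description of your internal pipeline:

import json
import urllib.parse
import urllib.request

def cdx_lookup(snapshot, url):
    # Fetch all captures of `url` in one snapshot from the public CDX index;
    # each line of the response body is one JSON record.
    endpoint = "https://index.commoncrawl.org/%s-index" % snapshot
    query = urllib.parse.urlencode({"url": url, "output": "json"})
    with urllib.request.urlopen(endpoint + "?" + query) as resp:
        return [json.loads(line) for line in resp.read().splitlines() if line]

# Illustrative snapshot IDs (early and mid 2023) and an illustrative URL.
early = cdx_lookup("CC-MAIN-2023-06", "commoncrawl.org/")
later = cdx_lookup("CC-MAIN-2023-23", "commoncrawl.org/")

# The 'digest' field is a hash of the archived payload, so a digest that
# appears in both snapshots means the URL was captured with identical
# content in each.
shared = {r["digest"] for r in early} & {r["digest"] for r in later}
print("identical captures present in both snapshots:", len(shared))

My question is essentially what, if anything, your pipeline does with such cases before a release is published.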
I would greatly appreciate any insights or documentation you could share on this topic. Thank you for your time and for maintaining such a valuable public resource!