to Common Crawl
Hello everybody!
I am using the cdx_toolkit tool to download all WARC documents from a specific website. I have a Python script similar to the example in the GitHub repository. I need to download all documents for this website from all crawls, but I suspect some documents are repeated from crawl to crawl. How can I get only the latest version of each repeated document?
Thanks in advance!
Sebastian Nagel
Dec 20, 2024, 5:36:44 AM
to common...@googlegroups.com
Hi Bárbara,
Generally, the URL indexes include the capture time of a web page:
- CDX index: field "timestamp"
- columnar index [1]: column "fetch_time"
So it's possible to pick the newest of all captures of the same URL.
I'm not aware that cdx_toolkit provides this functionality out of the
box; you'd need to implement it yourself. Sorry about that. :(
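For illustration, here is a minimal sketch of such a deduplication pass with cdx_toolkit. It assumes the capture objects expose "url" and "timestamp" fields, as in the README example; the URL pattern and the from_ts value are placeholders you'd adapt:

```python
import cdx_toolkit

cdx = cdx_toolkit.CDXFetcher(source='cc')

# Map each URL to its capture with the latest CDX timestamp.
newest = {}

# from_ts is a placeholder to reach back over all crawls;
# without it, cdx_toolkit defaults to roughly the last year.
for obj in cdx.iter('example.com/*', from_ts='2008'):
    url = obj['url']
    # CDX timestamps are YYYYMMDDhhmmss strings, so plain string
    # comparison orders them chronologically.
    if url not in newest or obj['timestamp'] > newest[url]['timestamp']:
        newest[url] = obj

for url, obj in newest.items():
    print(obj['timestamp'], url)
    # obj.content would fetch the WARC record payload here
```

Note that this keeps one capture object per URL in memory, which should be fine for a single website but would need a different approach for very large URL sets.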
In case the columnar index is an option for you, selecting the newest
capture could be done using a SQL window function [2]. We provide one
SQL example where this is done [3]. The idea is to enumerate the rows of
the same URL by fetch time in descending order, then select the first
enumerated record (last line in the SQL query: `WHERE allrobots.n = 1`).
Please let us know if you need an example more tailored to your
specific use case. The robots.txt example might be too complex as a
starting point.