Help getting latest version of repeated documents

Bárbara Castro

Dec 18, 2024, 11:46:30 PM
to Common Crawl
Hello everybody!

I am using cdx_toolkit to download all WARC records for a specific website. I have a Python script similar to the example shown in the GitHub repository. I need to download all the documents from this website, from all the crawls, but I guess some documents are repeated from crawl to crawl. How can I get only the latest version of the repeated documents?

Thanks in advance!


Sebastian Nagel

Dec 20, 2024, 5:36:44 AM
to common...@googlegroups.com
Hi Bárbara,

generally, the URL indexes include the capture time of a web page
- CDX index: field "timestamp"
- columnar index [1]: column "fetch_time"

So it's possible to pick the newest capture among all captures
of the same URL.

I'm not aware that cdx_toolkit provides this functionality out of the
box. You'd need to implement it yourself. Sorry about that - :(
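
A minimal sketch of doing this by hand, assuming each capture is available as a dict-like object with "url" and "timestamp" fields (the function name `latest_captures` and the sample data are made up for illustration, this is not part of cdx_toolkit):

```python
def latest_captures(captures):
    """Return one capture per URL, keeping the one with the newest timestamp."""
    newest = {}
    for cap in captures:
        url = cap["url"]
        # CDX timestamps are 14-digit strings (YYYYMMDDhhmmss), so plain
        # string comparison orders them chronologically.
        if url not in newest or cap["timestamp"] > newest[url]["timestamp"]:
            newest[url] = cap
    return list(newest.values())


# Toy data standing in for index records; the values are invented.
sample = [
    {"url": "https://example.com/a", "timestamp": "20230101000000"},
    {"url": "https://example.com/a", "timestamp": "20240615120000"},
    {"url": "https://example.com/b", "timestamp": "20220303030303"},
]

for cap in latest_captures(sample):
    print(cap["url"], cap["timestamp"])
```

The idea would be to first collect the index records for the site, reduce them with a function like this, and only then fetch the WARC records for the captures that remain. Note that this keeps one entry per distinct URL in memory, which may matter for very large sites.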

In case the columnar index is an option for you, selecting the newest
capture could be done using a SQL window function [2]. We provide one
SQL example where this is done [3]. The idea is to enumerate rows of the
same URL by fetch time in descending order, then select the first
enumerated record (last line in the SQL query: `WHERE allrobots.n = 1`).
Please let us know if you need an example more tailored to your
specific use case. The robots.txt example might be too complex for
a start.
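
To illustrate the window-function idea in isolation, here is a small sketch that runs the same pattern against an in-memory SQLite table from Python. It is not the query from [3]: the table name `ccindex`, the toy rows, and the reduced set of columns are assumptions for the demo, and it needs a SQLite build with window-function support (3.25 or newer). On the real columnar index you would additionally restrict the query to the site and crawls of interest.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
# Toy stand-in for the columnar index, with only three of its columns.
conn.execute(
    "CREATE TABLE ccindex (url TEXT, fetch_time TEXT, warc_filename TEXT)"
)
conn.executemany(
    "INSERT INTO ccindex VALUES (?, ?, ?)",
    [
        ("https://example.com/a", "2023-01-01 00:00:00", "old-a.warc.gz"),
        ("https://example.com/a", "2024-06-15 12:00:00", "new-a.warc.gz"),
        ("https://example.com/b", "2022-03-03 03:03:03", "only-b.warc.gz"),
    ],
)

# Number the captures of each URL from newest to oldest,
# then keep only the first one (n = 1).
QUERY = """
SELECT url, fetch_time, warc_filename
FROM (
    SELECT url, fetch_time, warc_filename,
           ROW_NUMBER() OVER (PARTITION BY url
                              ORDER BY fetch_time DESC) AS n
    FROM ccindex
) AS ranked
WHERE ranked.n = 1
ORDER BY url
"""

rows = conn.execute(QUERY).fetchall()
for row in rows:
    print(row)
```

Each URL appears exactly once in the result, paired with the WARC file of its most recent capture.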

Best,
Sebastian


[1] https://commoncrawl.org/blog/index-to-warc-files-and-urls-in-columnar-format
[2] https://en.wikipedia.org/wiki/Window_function_(SQL)
[3] https://github.com/commoncrawl/cc-index-table/blob/08d441c3716b62270c959a54ba5514d2e20ae2d1/src/sql/examples/cc-index/get-records-robotstxt.sql#L35
