to Common Crawl
Hello everybody!
I am using the cdx_toolkit tool to download all WARC documents from a specific website. I have a Python script similar to the example in the GitHub repository. I need to download all documents for this website from all crawls, but I suspect some documents are repeated from crawl to crawl. How can I get only the latest version of each repeated document?
Thanks in advance!
Sebastian Nagel
Dec 20, 2024, 5:36:44 AM
to common...@googlegroups.com
Hi Bárbara,
Generally, the URL indexes include the capture time of a web page:
- CDX index: field "timestamp"
- columnar index [1]: column "fetch_time"
So it's possible to pick the newest of all captures of the same URL.
I'm not aware that cdx_toolkit provides this functionality out of the
box; you'd need to implement it yourself. Sorry about that. :(
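For illustration, here is a minimal sketch of such a deduplication pass with cdx_toolkit. It assumes the capture objects expose "url" and "timestamp" fields, as in the README example; the URL pattern and the from_ts value are placeholders you'd adapt:

```python
import cdx_toolkit

cdx = cdx_toolkit.CDXFetcher(source='cc')

# Map each URL to its capture with the latest CDX timestamp.
newest = {}

# from_ts is a placeholder to reach back over all crawls;
# without it, cdx_toolkit defaults to roughly the last year.
for obj in cdx.iter('example.com/*', from_ts='2008'):
    url = obj['url']
    # CDX timestamps are YYYYMMDDhhmmss strings, so plain string
    # comparison orders them chronologically.
    if url not in newest or obj['timestamp'] > newest[url]['timestamp']:
        newest[url] = obj

for url, obj in newest.items():
    print(obj['timestamp'], url)
    # obj.content would fetch the WARC record payload here
```

Note that this keeps one capture object per URL in memory, which should be fine for a single website but would need a different approach for very large URL sets.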
In case the columnar index is an option for you, selecting the newest
capture could be done using a SQL window function [2]. We provide one
SQL example where this is done [3]. The idea is to enumerate the rows of
the same URL by fetch time in descending order, then select the first
enumerated record (last line in the SQL query: `WHERE allrobots.n = 1`).
Please let us know if you need an example more tailored to your
specific use case. The robots.txt example might be too complex as a
starting point.