Hi Emiliyan,
thanks for the question!
Actually, approach 1 does scale very well. I've tried it with up to 10
million records in a single **JOIN** query.
1. create a list of your URLs and upload it to S3
Recommendations:
- one list per crawl
- restrict to crawl and `subset = 'warc'`
- if there is a second column that lets you restrict the query
  efficiently (for example the domain name or TLD), do the join
  on two or more columns
2. do the join (a Python sketch covering steps 1 and 2 follows after step 3):
...
FROM ccindex.ccindex AS cc
INNER JOIN myjoin.urls AS my
ON my.url = cc.url
...
3. fetch and process the WARC records by WARC filename, offset, and length
To minimize network latency, I strongly recommend doing this
on AWS in the us-east-1 region - same for the Athena queries.
(A short fetch sketch follows further below.)
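To make steps 1 and 2 concrete, here is a minimal sketch in Python with
boto3. The staging bucket, file names, the Athena database "myjoin", the
result location, and the crawl label are placeholders you'd adapt to your
setup; only the ccindex table and its columns (url, warc_filename,
warc_record_offset, warc_record_length, crawl, subset) are as in the
real index.

```python
import time

import boto3

# Placeholders - adapt to your own setup. The "myjoin" database is
# assumed to exist already (e.g. created via CREATE DATABASE myjoin).
REGION = "us-east-1"
STAGING_BUCKET = "my-staging-bucket"              # your own bucket
URL_LIST_PREFIX = "url-lists/CC-MAIN-2025-21/"    # one list per crawl
ATHENA_OUTPUT = f"s3://{STAGING_BUCKET}/athena-results/"

s3 = boto3.client("s3", region_name=REGION)
athena = boto3.client("athena", region_name=REGION)

# Step 1: upload the URL list (one URL per line) to S3.
s3.upload_file("urls.txt", STAGING_BUCKET, URL_LIST_PREFIX + "urls.txt")


def run_query(sql, database="myjoin"):
    """Start an Athena query and wait for it to finish."""
    qid = athena.start_query_execution(
        QueryString=sql,
        QueryExecutionContext={"Database": database},
        ResultConfiguration={"OutputLocation": ATHENA_OUTPUT},
    )["QueryExecutionId"]
    while True:
        status = athena.get_query_execution(QueryExecutionId=qid)
        state = status["QueryExecution"]["Status"]["State"]
        if state in ("SUCCEEDED", "FAILED", "CANCELLED"):
            return qid, state
        time.sleep(5)


# Register the uploaded list as a one-column external table.
run_query(f"""
    CREATE EXTERNAL TABLE IF NOT EXISTS myjoin.urls (url string)
    ROW FORMAT DELIMITED FIELDS TERMINATED BY '\\t'
    LOCATION 's3://{STAGING_BUCKET}/{URL_LIST_PREFIX}'
""")

# Step 2: the join, restricted to one crawl and subset = 'warc'.
qid, state = run_query("""
    SELECT cc.url,
           cc.warc_filename,
           cc.warc_record_offset,
           cc.warc_record_length
    FROM ccindex.ccindex AS cc
    INNER JOIN myjoin.urls AS my
            ON my.url = cc.url
    WHERE cc.crawl = 'CC-MAIN-2025-21'
      AND cc.subset = 'warc'
""")
# The result is written as CSV below ATHENA_OUTPUT; download it from
# there or page through it with athena.get_query_results(...).
```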
Because you already know the exact URL from the WET records,
using the "url" column gives you 100% recall. Otherwise,
you might use the "url_surtkey" column, which allows for minor
variations in the URL (for example http:// instead of https://).
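If you go the url_surtkey route, the `surt` package (pip install surt)
generates SURT-style keys from plain URLs. A tiny sketch - whether its
default canonicalization matches the index's url_surtkey in every corner
case is an assumption on my side, so spot-check a few rows first:

```python
import surt

# SURT form drops the scheme, reverses the host, and sorts query
# arguments, so the http:// and https:// variants below should both
# come out as 'com,example)/path?a=1&b=2'.
print(surt.surt("http://example.com/path?b=2&a=1"))
print(surt.surt("https://example.com/path?a=1&b=2"))
```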
[1,2] give examples of how to do this using Athena.
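And for step 3, a minimal sketch of the range request plus warcio
parsing, again assuming you run it with AWS credentials in us-east-1.
The filename, offset, and length values are placeholders standing in
for one row of the Athena result:

```python
import io

import boto3
from warcio.archiveiterator import ArchiveIterator

# The crawl data lives in the public "commoncrawl" bucket in us-east-1.
s3 = boto3.client("s3", region_name="us-east-1")


def fetch_warc_record(warc_filename, offset, length):
    """Range-read one gzipped WARC record and return (url, payload)."""
    resp = s3.get_object(
        Bucket="commoncrawl",
        Key=warc_filename,
        Range=f"bytes={offset}-{offset + length - 1}",
    )
    data = resp["Body"].read()
    # Every record in the Common Crawl WARC files is its own gzip
    # member, so the byte range can be parsed in isolation.
    for record in ArchiveIterator(io.BytesIO(data)):
        url = record.rec_headers.get_header("WARC-Target-URI")
        payload = record.content_stream().read()
        return url, payload
    return None, None


# Placeholder values - take them from one row of the Athena result.
url, payload = fetch_warc_record(
    "crawl-data/CC-MAIN-2025-21/segments/.../warc/....warc.gz",
    12345678,   # warc_record_offset
    23456,      # warc_record_length
)
```

For millions of records you would group the lookups by warc_filename
and parallelize the range requests, but the access pattern stays the
same.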
We are in the process of generalizing this procedure to select subsets
of the data by "annotations". You'll find this project on [3] (it's
still at an early stage).
Best,
Sebastian
[1]
https://github.com/commoncrawl/cc-notebooks/blob/main/cc-index-table/bulk-url-lookups-by-table-joins.ipynb
[2]
https://github.com/commoncrawl/robotstxt-experiments/blob/main/src/jupyter/data-preparation-top-k-sample.ipynb
[3]
https://github.com/commoncrawl/cc-index-annotations
On 5/28/25 12:22, Emiliyan Pavlov wrote:
> Hi everyone,
>
> I'm working with a large-scale corpus of WET records and am looking for
> an efficient way to retrieve the corresponding WARC records at scale.
>
> So far, I’ve explored two main approaches:
>
> 1. *Using AWS Athena* to query the Common Crawl indexes (like
>    cc-index-table) to map WET records to their WARC counterparts
>    based on url, warc_filename, warc_offset, and warc_length.
>
> 2. *Mapping WET segment records to WARC records manually* by parsing