Retrieving WARC Records from WET Corpus at Scale


Emiliyan Pavlov

May 28, 2025, 6:26:34 AM
to Common Crawl

Hi everyone,

I'm working with a large-scale corpus of WET records and am looking for an efficient way to retrieve the corresponding WARC records at scale.

So far, I’ve explored two main approaches:

  1. Using AWS Athena to query the Common Crawl indexes (like cc-index-table) to map WET records to their WARC counterparts based on url, warc_filename, warc_offset, and warc_length.

  2. Mapping WET segment records to WARC records manually by parsing the WET file, identifying the target URLs, and looking them up in the Common Crawl Index to obtain the necessary byte offset and length for extraction from the original WARC files.

Both approaches seem viable on a small scale, but I’m unsure how well they would hold up operationally at very large scale (i.e., millions of records per dump). Ideally, I’d like a solution that is cost-effective.

Has anyone in the group tackled this type of WET-to-WARC mapping before at scale? Also open to tooling or open-source projects that could help.

Thanks in advance!


Sebastian Nagel

May 28, 2025, 7:04:43 AM
to common...@googlegroups.com
Hi Emiliyan,

thanks for the question!

Actually, approach 1 does scale very well. I've tried it with up to
10 million records in a single JOIN query.

1. create a list of your URLs and upload it to S3
Recommendations:
- one list per crawl
- restrict the query to that crawl and `subset = 'warc'`
- if there is a second column that lets you restrict the
  query efficiently (for example the domain name or TLD),
  join on two or more columns

2. do the join (a combined sketch of steps 1 and 2 is shown after step 3):
...
FROM ccindex.ccindex AS cc
INNER JOIN myjoin.urls AS my
ON my.url = cc.url
...

3. fetch and process the WARC records per WARC filename and offsets
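
Here is a minimal sketch of steps 1 and 2 using boto3 (a sketch, not an
official procedure; the bucket name, database/table names, crawl label
and file names are placeholders, and it assumes the ccindex.ccindex table
is already set up in your Athena account as described in the
cc-index-table documentation):

import boto3

# Placeholders - replace with your own bucket, prefix and crawl.
BUCKET = "my-bucket"
PREFIX = "url-lists/CC-MAIN-2025-18/"
CRAWL = "CC-MAIN-2025-18"

s3 = boto3.client("s3", region_name="us-east-1")
athena = boto3.client("athena", region_name="us-east-1")

# Step 1: upload the URL list (one URL per line) to S3.
s3.upload_file("urls.txt", BUCKET, PREFIX + "urls.txt")

def run_query(sql):
    """Submit a query to Athena and return its execution id
    (no polling for completion or results in this sketch)."""
    return athena.start_query_execution(
        QueryString=sql,
        ResultConfiguration={"OutputLocation": f"s3://{BUCKET}/athena-results/"},
    )["QueryExecutionId"]

# Register the uploaded list as an external table with a single 'url' column.
run_query("CREATE DATABASE IF NOT EXISTS myjoin")
run_query(f"""
    CREATE EXTERNAL TABLE IF NOT EXISTS myjoin.urls (url STRING)
    ROW FORMAT DELIMITED
    LOCATION 's3://{BUCKET}/{PREFIX}'
""")

# Step 2: the join, restricted by the partition columns crawl and subset.
run_query(f"""
    SELECT cc.url, cc.warc_filename, cc.warc_record_offset, cc.warc_record_length
    FROM ccindex.ccindex AS cc
    INNER JOIN myjoin.urls AS my
    ON my.url = cc.url
    WHERE cc.crawl = '{CRAWL}'
      AND cc.subset = 'warc'
""")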

To minimize network latency, I strongly recommend doing this
on AWS in the us-east-1 region - the same goes for the Athena queries.
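
Regarding step 3, here is a minimal fetch sketch (again only an
assumption about your setup: it uses boto3 and warcio, takes
warc_filename, offset and length from the query results above, and reads
from the public commoncrawl bucket with a ranged GET):

from io import BytesIO

import boto3
from warcio.archiveiterator import ArchiveIterator

s3 = boto3.client("s3", region_name="us-east-1")

def fetch_warc_record(warc_filename, offset, length):
    """Fetch one WARC record via a ranged GET and return its payload (e.g. the HTML)."""
    byte_range = f"bytes={offset}-{offset + length - 1}"
    resp = s3.get_object(Bucket="commoncrawl", Key=warc_filename, Range=byte_range)
    # Each record is an independently gzipped member, so the fetched slice
    # can be decompressed and parsed on its own.
    for record in ArchiveIterator(BytesIO(resp["Body"].read())):
        return record.content_stream().read()

Grouping the lookups by warc_filename and sorting them by offset, as
suggested in step 3, keeps reads of the same file together.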


Because you already know the exact URL from the WET records,
using the "url" column gives you 100% recall. Otherwise,
you might use the column "url_surtkey" which allows for minor
variations in the URL (for example http:// instead of https://).
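
If you go the url_surtkey route, your own URLs need to be canonicalized
the same way. The Python surt package is one way to compute such a key
(an assumption on my side: its canonicalization may not match the index's
url_surtkey in every corner case, so spot-check a few URLs first):

from surt import surt  # pip install surt

# Scheme is dropped, the host is reversed and query parameters are sorted,
# so minor URL variations map to the same key.
print(surt("http://example.com/path/?b=2&a=1"))
# e.g. com,example)/path?a=1&b=2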

[1,2] give examples of how to do this using Athena.

We are in the process of generalizing this procedure to select subsets
of the data by "annotations". You'll find this project at [3] (it's
still at an early stage).

Best,
Sebastian

[1]
https://github.com/commoncrawl/cc-notebooks/blob/main/cc-index-table/bulk-url-lookups-by-table-joins.ipynb
[2]
https://github.com/commoncrawl/robotstxt-experiments/blob/main/src/jupyter/data-preparation-top-k-sample.ipynb
[3] https://github.com/commoncrawl/cc-index-annotations

Emiliyan Pavlov

May 29, 2025, 10:18:17 AM
to Common Crawl

Hi Sebastian,

Thank you for the quick and very helpful reply - this worked like a charm. I am also very pleasantly surprised by the speed and final cost of fetching the offsets and lengths at this scale.

I also think the subset selection generalization is a very nice idea - good luck with it!

Best regards,
Emiliyan

Sebastian Nagel

May 29, 2025, 3:40:59 PM
to common...@googlegroups.com
Hi Emiliyan,

thanks for reporting back!

> I am also very pleasantly surprised by the speed and final cost of
> fetching the offsets and lengths at this scale.

Indeed, if you restrict the query by partition columns and select only
the columns you really need, Athena is great - both fast and with
negligible costs.

> I also think the subset selection generalization is a very nice idea -
> good luck with it!

Thanks!

Best,
Sebastian