Hi everyone,
I'm working with a large corpus of WET records and am looking for an efficient way to retrieve the corresponding WARC records at scale.
So far, I’ve explored two main approaches:
Using AWS Athena to query the Common Crawl columnar index (cc-index-table) to map WET records to their WARC counterparts via the url, warc_filename, warc_record_offset, and warc_record_length columns (see the first sketch below).
Mapping WET records to WARC records manually: parsing the WET file, extracting the target URLs, and looking each one up in the Common Crawl CDX index to obtain the byte offset and length needed to extract the record from the original WARC file (see the second sketch below).
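
For the Athena route, here's a minimal sketch of what I have in mind, using boto3. The `ccindex.ccindex` table name assumes the standard cc-index-table registration; the crawl label, URL, and S3 output location are all placeholders:

```python
import boto3

athena = boto3.client("athena", region_name="us-east-1")

# Map a URL to its WARC record location in the columnar index.
query = """
SELECT url, warc_filename, warc_record_offset, warc_record_length
FROM ccindex.ccindex
WHERE crawl = 'CC-MAIN-2024-33'      -- placeholder crawl
  AND subset = 'warc'
  AND url = 'https://example.com/'   -- placeholder URL
"""

resp = athena.start_query_execution(
    QueryString=query,
    ResultConfiguration={"OutputLocation": "s3://your-bucket/athena-results/"},  # placeholder bucket
)
print(resp["QueryExecutionId"])
```

For millions of URLs I assume a giant IN list won't fly, so my plan would be to upload the URL list to S3 as its own table and JOIN against it instead.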
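And for the manual route, a sketch of the full pipeline with warcio and requests: pull the target URIs out of a WET file, look each one up in the CDX API, then fetch just the record bytes with an HTTP Range request. The crawl label in `CDX_API` and the WET file path are placeholders:

```python
import io
import json

import requests
from warcio.archiveiterator import ArchiveIterator

CDX_API = "https://index.commoncrawl.org/CC-MAIN-2024-33-index"  # placeholder crawl

def target_urls(wet_path):
    """Yield the WARC-Target-URI of every conversion record in a WET file."""
    with open(wet_path, "rb") as fh:
        for record in ArchiveIterator(fh):
            if record.rec_type == "conversion":
                yield record.rec_headers.get_header("WARC-Target-URI")

def fetch_warc_record(url):
    """Look up a URL in the CDX index and range-fetch its WARC record."""
    resp = requests.get(CDX_API, params={"url": url, "output": "json", "limit": "1"})
    resp.raise_for_status()
    hit = json.loads(resp.text.splitlines()[0])
    offset, length = int(hit["offset"]), int(hit["length"])
    warc_url = "https://data.commoncrawl.org/" + hit["filename"]
    # Each WARC record is its own gzip member, so a byte-range request
    # returns a complete, independently decompressible record.
    r = requests.get(warc_url,
                     headers={"Range": f"bytes={offset}-{offset + length - 1}"})
    r.raise_for_status()
    record = next(iter(ArchiveIterator(io.BytesIO(r.content))))
    return record.rec_headers, record.content_stream().read()

for url in target_urls("example.warc.wet.gz"):  # placeholder WET file
    headers, body = fetch_warc_record(url)
    print(url, headers.get_header("WARC-Type"), len(body))
```

My worry is that at millions of lookups this hammers index.commoncrawl.org, which is partly why I suspect the Athena approach scales better.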
Both approaches seem viable at small scale, but I'm unsure how well they hold up operationally at very large scale (e.g., millions of records per crawl). Ideally, I'd also like a solution that is cost-effective.
Has anyone here tackled this kind of WET-to-WARC mapping at scale? I'm also open to tooling or open-source projects that could help.
Thanks in advance!