This was originally posted to Stack Overflow; I removed it from there since this is a more appropriate forum. I've also stripped a bunch of context down to this simple question:
Say I want all the HTML files for apple.com — what is the most efficient way to get them (on a personal computer, not in the cloud)? I assume I should query the index for apple.com, find the corresponding files, and then use the right record numbers. But the index points to WARC files, not WET files, and WET files are much easier to work with.
--- Original SO message ---
Common Crawl provides WARC files, which contain the full crawl data, and WET files (extracted plain text), which are MUCH smaller and are what's relevant for my purpose.
I downloaded the parquet index, which I can query using SQL. Say I'm looking for all of apple.com: I can query just the right rows, and the index tells me which WARC files I need to parse (saving me, literally, terabytes of downloads).
Given a WARC filename, I can find the corresponding WET file just by doing some text replacement on the path. However, even a single WARC/WET file contains tens of thousands of URLs.
The index provides offsets and record lengths, but they are for the WARC files, not the WET files. Is there a correspondence? Is there a way to jump to just the URL I'm interested in, rather than having to scan through the whole WET file?
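For the WARC side, at least, the offset/length pair is enough to grab a single record without downloading the whole file: each record in a Common Crawl `.warc.gz` is an independent gzip member, so the byte slice `[offset, offset + length)` decompresses on its own. A stdlib-only sketch (the `fetch_warc_record` helper and the particular prefix constant are mine; the data host is Common Crawl's public HTTPS endpoint):

```python
import gzip
import urllib.request

CC_PREFIX = "https://data.commoncrawl.org/"

def range_header(offset: int, length: int) -> str:
    """Build the HTTP Range header for one WARC record's byte span."""
    return f"bytes={offset}-{offset + length - 1}"

def fetch_warc_record(warc_filename: str, offset: int, length: int) -> bytes:
    """Fetch and decompress a single WARC record via an HTTP Range request.

    Works because each record in a .warc.gz is its own gzip member,
    so this byte slice is a complete, standalone gzip stream.
    """
    req = urllib.request.Request(
        CC_PREFIX + warc_filename,
        headers={"Range": range_header(offset, length)},
    )
    with urllib.request.urlopen(req) as resp:
        return gzip.decompress(resp.read())
```

Whether the same trick is available for WET files depends on having per-record offsets for them, which is exactly what the columnar index doesn't seem to provide — hence the question.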