This was originally posted to Stack Overflow; I removed it from there since this is a more appropriate forum. I've also stripped a bunch of context down to this simple question:
Say I want all the HTML files for apple.com — what is the most efficient way to get them (on a personal computer, not in the cloud)? I assume I should query the index for apple.com, find the corresponding files, and then use the right record numbers. But the index points to WARC files, not WET files, and WET files are much easier to work with.
--- Original SO message ---
Common Crawl provides WARC files, which contain the full crawl data, and WET files (extracted plain text), which are MUCH smaller and are what's relevant for my purpose.
I downloaded the parquet index, which I can query using SQL. Say I'm looking for all of apple.com: I can query just the right rows, and the index tells me which WARC files I need to parse (saving me, literally, terabytes of downloads).
Given a WARC filename, I can find the corresponding WET file just by doing some text replacement on the path. However, even a single WARC/WET file contains tens of thousands of URLs.
The index provides offsets and record lengths, but they are for the WARC files, not the WET files. Is there a correspondence? Is there a way to jump to just the URL I'm interested in, rather than having to scan through the whole WET file?
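For the WARC side, at least, the offset/length pair is enough to grab a single record without downloading the whole file: each record in a Common Crawl `.warc.gz` is an independent gzip member, so the byte slice `[offset, offset + length)` decompresses on its own. A stdlib-only sketch (the `fetch_warc_record` helper and the particular prefix constant are mine; the data host is Common Crawl's public HTTPS endpoint):

```python
import gzip
import urllib.request

CC_PREFIX = "https://data.commoncrawl.org/"

def range_header(offset: int, length: int) -> str:
    """Build the HTTP Range header for one WARC record's byte span."""
    return f"bytes={offset}-{offset + length - 1}"

def fetch_warc_record(warc_filename: str, offset: int, length: int) -> bytes:
    """Fetch and decompress a single WARC record via an HTTP Range request.

    Works because each record in a .warc.gz is its own gzip member,
    so this byte slice is a complete, standalone gzip stream.
    """
    req = urllib.request.Request(
        CC_PREFIX + warc_filename,
        headers={"Range": range_header(offset, length)},
    )
    with urllib.request.urlopen(req) as resp:
        return gzip.decompress(resp.read())
```

Whether the same trick is available for WET files depends on having per-record offsets for them, which is exactly what the columnar index doesn't seem to provide — hence the question.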