downloading WET files for a list of URLs


Wesam Al-Nabki

May 29, 2023, 4:07:38 PM
to Common Crawl
Hi all,

I'm trying to get the WET files for a list of URLs. 

My current approach is working, but it is very slow:
1- Query the Common Crawl index for each URL, i.e. send a GET request to the index, e.g.: "https://index.commoncrawl.org/CC-MAIN-2019-39-index?url=https://www.patizonet.com/"

The response looks like this:

com,patizonet)/ 20190919123400 {"url": "https://patizonet.com/", "mime": "text/html", "mime-detected": "text/html", "status": "302", "digest": "2V5U3BEPSDZBGAORUF7CNVHRSXNYH3QS", "length": "1182", "offset": "18290096", "filename": "crawl-data/CC-MAIN-2019-39/segments/1568514573519.72/crawldiagnostics/CC-MAIN-20190919122032-20190919144032-00410.warc.gz"}
com,patizonet)/ 20190919123401 {"url": "https://www.patizonet.com/", "mime": "text/html", "mime-detected": "text/html", "status": "200", "digest": "47ZXP2V2EQBL3P54IMLHZVGFAZOGJYC2", "length": "7841", "offset": "977366041", "filename": "crawl-data/CC-MAIN-2019-39/segments/1568514573519.72/warc/CC-MAIN-20190919122032-20190919144032-00513.warc.gz", "languages": "eng", "encoding": "UTF-8"}
com,patizonet)/ 20190920201906 {"url": "https://www.patizonet.com/", "mime": "text/html", "mime-detected": "text/html", "status": "200", "digest": "LFKVWZYQAFAY3WWPJTN2SW5V52L3IFYK", "length": "8013", "offset": "968622132", "filename": "crawl-data/CC-MAIN-2019-39/segments/1568514574077.39/warc/CC-MAIN-20190920200607-20190920222607-00513.warc.gz", "languages": "eng", "encoding": "UTF-8"}



2- Then, I parse the response to get the WARC file path and replace "warc" with "wet" in it to get the WET file path.

3- Download the WET file and extract the needed text.
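The three steps above can be sketched roughly as follows (a minimal sketch using only the standard library; the index name matches the request above, and the `data.commoncrawl.org` download host is an assumption on my part):

```python
import json
import urllib.parse
import urllib.request

INDEX = "https://index.commoncrawl.org/CC-MAIN-2019-39-index"

def lookup(url):
    """Step 1: query the CDX index; output=json returns one JSON object per line."""
    query = urllib.parse.urlencode({"url": url, "output": "json"})
    with urllib.request.urlopen(f"{INDEX}?{query}") as resp:
        body = resp.read().decode("utf-8")
    return [json.loads(line) for line in body.splitlines()]

def warc_to_wet_path(warc_path):
    """Step 2: WET files sit next to the WARC files in the same segment:
    .../warc/XXX.warc.gz -> .../wet/XXX.warc.wet.gz"""
    return warc_path.replace("/warc/", "/wet/").replace(".warc.gz", ".warc.wet.gz")

def wet_url(record):
    """Step 3: build the download URL for the whole WET file."""
    return "https://data.commoncrawl.org/" + warc_to_wet_path(record["filename"])
```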

This process is extremely slow, especially when I want to get several snapshots of the same website, i.e. different crawls of the same website.

Can you suggest a faster approach?

I really appreciate any help you can provide. 
Wesam

Sebastian Nagel

May 30, 2023, 10:31:14 AM
to common...@googlegroups.com
Hi Wesam,

> This process is extremely slow, especially when I want to get several
> snapshots of the same website, i.e. different crawls of the same website.

Unfortunately, the index does not include offsets into the WET files.
As a consequence, you need to download an entire WET file to extract just
a single record or a few records from it.

It's much more efficient to fetch the single WARC record by sending an
HTTP range request and parse the HTML to extract the textual content.
See
https://commoncrawl.org/access-the-data/
or
https://groups.google.com/g/common-crawl/c/phYQfJh_M0A/m/JsRwH62-BQAJ
and
https://groups.google.com/g/common-crawl/c/tAO6VaAw3WA/m/Untz_qt6EwAJ
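A rough sketch of the range-request approach (stdlib only; the `offset` and `length` values come straight from the index response shown earlier, and the `data.commoncrawl.org` host is an assumption):

```python
import gzip
import urllib.request

def byte_range(offset, length):
    # HTTP Range headers are inclusive on both ends
    return f"bytes={offset}-{offset + length - 1}"

def fetch_warc_record(filename, offset, length):
    """Fetch a single gzipped WARC record via an HTTP range request
    and decompress it. Each record is an independent gzip member,
    so it can be decompressed on its own."""
    url = "https://data.commoncrawl.org/" + filename
    req = urllib.request.Request(
        url, headers={"Range": byte_range(int(offset), int(length))}
    )
    with urllib.request.urlopen(req) as resp:
        data = resp.read()
    return gzip.decompress(data).decode("utf-8", errors="replace")
```

The decompressed record contains the WARC headers, the HTTP headers, and the HTML payload; the text can then be extracted with any HTML parser.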

Let us know if you need more information!

Best,
Sebastian