Hi all,
I'm trying to get the WET files for a list of URLs.
My current approach works, but it is very slow:
1- First, I query the Common Crawl index for each URL. The response is something like:
com,patizonet)/ 20190919123400 {"url": "https://patizonet.com/", "mime": "text/html", "mime-detected": "text/html", "status": "302", "digest": "2V5U3BEPSDZBGAORUF7CNVHRSXNYH3QS", "length": "1182", "offset": "18290096", "filename": "crawl-data/CC-MAIN-2019-39/segments/1568514573519.72/crawldiagnostics/CC-MAIN-20190919122032-20190919144032-00410.warc.gz"}
com,patizonet)/ 20190919123401 {"url": "https://www.patizonet.com/", "mime": "text/html", "mime-detected": "text/html", "status": "200", "digest": "47ZXP2V2EQBL3P54IMLHZVGFAZOGJYC2", "length": "7841", "offset": "977366041", "filename": "crawl-data/CC-MAIN-2019-39/segments/1568514573519.72/warc/CC-MAIN-20190919122032-20190919144032-00513.warc.gz", "languages": "eng", "encoding": "UTF-8"}
com,patizonet)/ 20190920201906 {"url": "https://www.patizonet.com/", "mime": "text/html", "mime-detected": "text/html", "status": "200", "digest": "LFKVWZYQAFAY3WWPJTN2SW5V52L3IFYK", "length": "8013", "offset": "968622132", "filename": "crawl-data/CC-MAIN-2019-39/segments/1568514574077.39/warc/CC-MAIN-20190920200607-20190920222607-00513.warc.gz", "languages": "eng", "encoding": "UTF-8"}
2- Then, I parse the response to get the WARC file path and rewrite it to point at the corresponding WET file.
3- Finally, I download the WET file and extract the text I need.
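In case it helps, the three steps above can be sketched roughly like this (the function names are mine; I'm assuming the public index server at index.commoncrawl.org and the data host at data.commoncrawl.org, and using the usual /warc/ → /wet/ path rewrite):

```python
import gzip
import json
import urllib.parse
import urllib.request


def wet_path(warc_path: str) -> str:
    """Step 2: rewrite a WARC path from the index into the matching WET path."""
    return warc_path.replace("/warc/", "/wet/").replace(".warc.gz", ".warc.wet.gz")


def query_index(url: str, collection: str = "CC-MAIN-2019-39"):
    """Step 1: ask the index server for capture records of `url` (JSON lines)."""
    api = (
        f"https://index.commoncrawl.org/{collection}-index?"
        + urllib.parse.urlencode({"url": url, "output": "json"})
    )
    with urllib.request.urlopen(api) as resp:
        return [json.loads(line) for line in resp.read().decode().splitlines()]


def fetch_wet(warc_path: str) -> str:
    """Step 3: download the whole WET file and return its decompressed text."""
    data_url = "https://data.commoncrawl.org/" + wet_path(warc_path)
    with urllib.request.urlopen(data_url) as resp:
        return gzip.decompress(resp.read()).decode("utf-8", errors="replace")


if __name__ == "__main__":
    # Example URL from the index response above.
    for record in query_index("patizonet.com/"):
        if record.get("status") == "200":
            text = fetch_wet(record["filename"])
            # ... scan `text` for the record matching the target URL here
```

Note that `fetch_wet` downloads an entire multi-hundred-MB WET file per lookup, which is where most of the time goes.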
This process is extremely slow, especially when I want to get several snapshots of the same website, i.e. captures of the same site from different crawls.
Can anyone suggest a faster approach?
I really appreciate any help you can provide.
Wesam