{"urlkey": "com,cnn)/2015/01/01/world/asia/airasia-disaster/index.html?hpt=hp_t2&nbd=5_things", "timestamp": "20180119215103", "digest": "BQWPW2IOX5L6MEWRL5TKT2P6QWZWX2W3", "length": "52129", "mime": "text/html", "offset": "417354319", "status": "200", "mime-detected": "text/html", "url": "http://www.cnn.com/2015/01/01/world/asia/airasia-disaster/index.html?hpt=hp_t2&nbd=5_things", "filename": "crawl-data/CC-MAIN-2018-05/segments/1516084888135.38/warc/CC-MAIN-20180119204427-20180119224427-00509.warc.gz"}
If I understand correctly, this index record points to a chunk of data in Common Crawl: it lives in the file given by "filename", starts at byte "offset", and is "length" bytes long, right?
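To make my reading of "offset" and "length" concrete, here is a minimal sketch of the byte range I would request for this record. The helper name byte_range_header is my own (hypothetical), and I am assuming the crawl data is served over HTTP with support for Range requests; the sketch only builds the header and does not contact any server.

```python
def byte_range_header(offset: str, length: str) -> dict:
    """Build an HTTP Range header covering one record.

    The index stores offset and length as strings, so convert first.
    HTTP byte ranges are inclusive, hence the -1 on the end position.
    """
    start = int(offset)
    end = start + int(length) - 1
    return {"Range": f"bytes={start}-{end}"}


# Values copied from the index record above.
headers = byte_range_header("417354319", "52129")
print(headers)
```

If that is right, the bytes returned for this range should be a single gzip member holding just this one WARC record.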
Now, if I want to get the corresponding WET data for this search result, I know I can use the following filename:
crawl-data/CC-MAIN-2018-05/segments/1516084888135.38/wet/CC-MAIN-20180119204427-20180119224427-00509.warc.wet.gz
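The path rewrite above can be sketched as follows. This is only my assumption about the naming pattern, inferred from the two filenames in this post (the "/warc/" directory becomes "/wet/" and the ".warc.gz" suffix becomes ".warc.wet.gz"); warc_to_wet_path is a hypothetical helper name.

```python
def warc_to_wet_path(warc_path: str) -> str:
    """Derive the WET path from a WARC path by pattern substitution.

    Assumes the layout seen in this post: .../warc/<name>.warc.gz
    maps to .../wet/<name>.warc.wet.gz.
    """
    head, sep, name = warc_path.rpartition("/warc/")
    suffix = ".warc.gz"
    if not sep or not name.endswith(suffix):
        raise ValueError(f"not a recognised WARC path: {warc_path}")
    return f"{head}/wet/{name[:-len(suffix)]}.warc.wet.gz"


warc = (
    "crawl-data/CC-MAIN-2018-05/segments/1516084888135.38/warc/"
    "CC-MAIN-20180119204427-20180119224427-00509.warc.gz"
)
print(warc_to_wet_path(warc))
```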
But what about the offset and the length? Are they the same as in the WARC file? If not, how can I locate the corresponding record in the WET file?
Thanks!
Yuheng