Offset and Length of warc segment

Yuheng Du

Mar 28, 2018, 2:48:45 PM

to Common Crawl

Hi,

I am searching the 2018-05 index (CC-MAIN-2018-05), and an example record looks like this:

{"urlkey": "com,cnn)/2015/01/01/world/asia/airasia-disaster/index.html?hpt=hp_t2&nbd=5_things", "timestamp": "20180119215103", "digest": "BQWPW2IOX5L6MEWRL5TKT2P6QWZWX2W3", "length": "52129", "mime": "text/html", "offset": "417354319", "status": "200", "mime-detected": "text/html", "url": "http://www.cnn.com/2015/01/01/world/asia/airasia-disaster/index.html?hpt=hp_t2&nbd=5_things", "filename": "crawl-data/CC-MAIN-2018-05/segments/1516084888135.38/warc/CC-MAIN-20180119204427-20180119224427-00509.warc.gz"}

It points to a chunk of data in Common Crawl: the file given by "filename", starting at byte "offset" and spanning "length" bytes, right?

Now if I want to get the corresponding WET data for this search result, I know I can use the following filename:

crawl-data/CC-MAIN-2018-05/segments/1516084888135.38/wet/CC-MAIN-20180119204427-20180119224427-00509.warc.wet.gz
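
The WET path above differs from the WARC path in the index record in two places: the "warc" directory becomes "wet", and the ".warc.gz" suffix becomes ".warc.wet.gz". A minimal sketch of that substitution, assuming the naming pattern seen in these two paths holds for the crawl in question (warc_to_wet is a hypothetical helper name, not part of any Common Crawl tooling):

```shell
# Hypothetical helper: derive a WET path from a WARC path, assuming the
# naming pattern of the two CC-MAIN-2018-05 paths quoted in this thread.
warc_to_wet() {
  # 1st expression: swap the "warc" directory for "wet" (first match only)
  # 2nd expression: turn the trailing ".warc.gz" into ".warc.wet.gz"
  printf '%s\n' "$1" | sed -e 's|/warc/|/wet/|' -e 's|\.warc\.gz$|.warc.wet.gz|'
}

warc_to_wet "crawl-data/CC-MAIN-2018-05/segments/1516084888135.38/warc/CC-MAIN-20180119204427-20180119224427-00509.warc.gz"
```

This only maps the filename. It says nothing about where the record sits inside the WET file.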

But what about the offset and the length? Are they the same as for the WARC file? How can I get the corresponding wet file then?

Thanks!

Yuheng

Sebastian Nagel

Mar 28, 2018, 3:24:06 PM

to common...@googlegroups.com

Hi,

> How can I get the corresponding wet file then?

WET files are not indexed, so there is no way to get offset and length for the corresponding record
in the WET file.

If it's only about a few records, the easiest way is probably to fetch the WARC record and parse it
yourself, e.g.:

curl -s -r417354319-$((417354319+52129-1)) \
  "https://commoncrawl.s3.amazonaws.com/crawl-data/CC-MAIN-2018-05/segments/1516084888135.38/warc/CC-MAIN-20180119204427-20180119224427-00509.warc.gz" \
| gzip -cd \
| perl -lne 'if (/^\r?$/) { $body = 1 if $httpheader; $httpheader = 1; next } print if $body' \
| lynx -dump -stdin -width=200

Cf. https://groups.google.com/d/msg/common-crawl/pQ34q-_EARU/FLFtvTfXAwAJ
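
The byte range passed to curl above follows from the "offset" and "length" fields of the index record: the range starts at the offset and ends, inclusively, at offset + length - 1. A minimal sketch of that arithmetic, assuming a POSIX shell (byte_range is a hypothetical helper name used here for illustration):

```shell
# Hypothetical helper: print "START-END" for curl's -r option.
# HTTP byte ranges are inclusive on both ends, hence the "- 1".
byte_range() {
  # $1 = offset, $2 = length, both taken from the index record
  printf '%s-%s\n' "$1" "$(( $1 + $2 - 1 ))"
}

# Values from the index record quoted in this thread.
byte_range 417354319 52129

# Possible use (needs network access, so left as a comment):
#   curl -s -r "$(byte_range 417354319 52129)" "$url" | gzip -cd
# where $url stands for the WARC file URL used in the command above.
```

Each record in the WARC file is compressed as its own gzip member, which is why the fetched slice can be decompressed on its own with gzip -cd.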

Best,
Sebastian

