$ dd if=CC-MAIN-20160205193912-00264-ip-10-236-182-209.ec2.internal.warc.gz bs=1 skip=$((480161397-1)) count=9051 of=chunk.gz
$ gunzip --keep chunk.gz
gunzip: chunk.gz: not in gzip format
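(A likely cause, assuming 480161397 is the `offset` taken from the CDX record: CDX offsets are 0-based byte offsets, and each record in a Common Crawl WARC is its own gzip member, so `skip=$((480161397-1))` starts one byte early and misses the gzip magic bytes. A small Python sketch of that failure mode, using made-up record contents:)

```python
import gzip

# Two records gzipped independently and concatenated, like a Common Crawl WARC
first = gzip.compress(b"WARC record one")
second = gzip.compress(b"WARC record two")
warc = first + second

offset, length = len(first), len(second)  # 0-based offset of the second member

# Slicing at the exact offset yields a complete, valid gzip member
assert gzip.decompress(warc[offset:offset + length]) == b"WARC record two"

# Starting one byte early misses the gzip magic bytes (0x1f 0x8b) ...
try:
    gzip.decompress(warc[offset - 1:offset - 1 + length])
except gzip.BadGzipFile as exc:
    print(exc)  # ... which is exactly the "not in gzip format" failure mode
```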
I'd like to download all pages from the www.ipc.com domain in a WARC archive file (or several files). So I do the following:
$ ./cdx-index-client.py -c CC-MAIN-2015-06 http://www.ipc.com/
$ cat www.ipc.com-0
com,ipc)/ 20150127054500 {"url": "http://www.ipc.com/", "digest": "2WIVV4MGIEL27MAOOREEEKCIATEK43GM", "length": "9953", "offset": "768421563", "filename": "crawl-data/CC-MAIN-2015-06/segments/1422115861027.55/warc/CC-MAIN-20150124161101-00006-ip-10-180-212-252.ec2.internal.warc.gz"}
[...]
$ wget https://commoncrawl.s3.amazonaws.com:/crawl-data/CC-MAIN-2015-06/segments/1422115861027.55/warc/CC-MAIN-20150124161101-00006-ip-10-180-212-252.ec2.internal.warc.gz
$ gunzip -k CC-MAIN-20150124161101-00006-ip-10-180-212-252.ec2.internal.warc.gz
$ cat CC-MAIN-20150124161101-00006-ip-10-180-212-252.ec2.internal.warc | tail -c +768421563 | head -c 9953 >segment1.warc
Here, I would expect to get some WARC records for www.ipc.com, but instead I get a "random" chunk of the input file.
wumpus answered:
The offset is a byte offset into the compressed WARC; each record is its own gzip member, so you don't have to download the whole WARC to access just the one page.
curl -r 768421563-768431515 "https://commoncrawl.s3.amazonaws.com:/crawl-data/CC-MAIN-2015-06/segments/1422115861027.55/warc/CC-MAIN-20150124161101-00006-ip-10-180-212-252.ec2.internal.warc.gz" | zless
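The same range fetch can be scripted. A minimal Python sketch using the `offset`, `length`, and `filename` fields from the CDX record above (note HTTP byte ranges are inclusive, so the range ends at offset + length - 1):

```python
import gzip
import urllib.request

def byte_range(offset, length):
    # HTTP byte ranges are inclusive, so the last byte is offset + length - 1
    return "bytes={}-{}".format(offset, offset + length - 1)

def fetch_record(warc_url, offset, length):
    """Fetch one gzipped WARC record via an HTTP Range request and decompress it."""
    req = urllib.request.Request(warc_url,
                                 headers={"Range": byte_range(offset, length)})
    with urllib.request.urlopen(req) as resp:
        # The ranged response body is a complete gzip member == one WARC record
        return gzip.decompress(resp.read())

# For the record above:
# fetch_record("https://commoncrawl.s3.amazonaws.com/crawl-data/CC-MAIN-2015-06/"
#              "segments/1422115861027.55/warc/"
#              "CC-MAIN-20150124161101-00006-ip-10-180-212-252.ec2.internal.warc.gz",
#              768421563, 9953)
```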