get a WARC archive with all files from a domain

David Portabella

Sep 8, 2016, 5:47:38 AM
to Common Crawl
Continuing here the discussion from: https://github.com/ikreymer/cdx-index-client/issues/3

So how do I get the uncompressed WARC chunk from there? I tried as follows:
$ dd if=CC-MAIN-20160205193912-00264-ip-10-236-182-209.ec2.internal.warc.gz bs=1 skip=$((480161397-1)) count=9051 of=chunk.gz
$ gunzip --keep chunk.gz
gunzip: chunk.gz: not in gzip format



---
David wrote:

I'd like to download all pages from the www.ipc.com domain as a WARC archive file (or several files), so I do the following:

$ ./cdx-index-client.py -c CC-MAIN-2015-06 http://www.ipc.com/
$ cat www.ipc.com-0
com,ipc)/ 20150127054500 {"url": "http://www.ipc.com/", "digest": "2WIVV4MGIEL27MAOOREEEKCIATEK43GM", "length": "9953", "offset": "768421563", "filename": "crawl-data/CC-MAIN-2015-06/segments/1422115861027.55/warc/CC-MAIN-20150124161101-00006-ip-10-180-212-252.ec2.internal.warc.gz"}
[...]

$ wget https://commoncrawl.s3.amazonaws.com:/crawl-data/CC-MAIN-2015-06/segments/1422115861027.55/warc/CC-MAIN-20150124161101-00006-ip-10-180-212-252.ec2.internal.warc.gz
$ gunzip -k CC-MAIN-20150124161101-00006-ip-10-180-212-252.ec2.internal.warc.gz
$ cat CC-MAIN-20150124161101-00006-ip-10-180-212-252.ec2.internal.warc | tail -c +768421563 | head -c 9953 >segment1.warc

Here, I would expect to get some WARC entries for www.ipc.com, but I get a "random" chunk of the input file.


wumpus answered:

the offset is an offset into the compressed WARC. This is so you don't have to download the whole WARC to access just the one page.


zbagz

Sep 8, 2016, 11:58:20 AM
to Common Crawl
You should be using ranged requests to avoid downloading the whole archive. Try this:

curl -r 768421563-768431515 "https://commoncrawl.s3.amazonaws.com:/crawl-data/CC-MAIN-2015-06/segments/1422115861027.55/warc/CC-MAIN-20150124161101-00006-ip-10-180-212-252.ec2.internal.warc.gz" | zless
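(The end of the byte range is the index offset plus the length, minus one, because HTTP ranges are inclusive: 768421563 + 9953 - 1 = 768431515.)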


zbagz

Sep 8, 2016, 12:02:15 PM
to Common Crawl
Also, if you are using some other library, you will have to add the Range header yourself. E.g.:

"Range: bytes=768421563-768431515"

David Portabella

Sep 8, 2016, 12:50:06 PM
to Common Crawl
Thanks, this works!

Thanks for the curl -r trick; that was my next step.

The problem I had with `dd` is that I messed up the offset while testing on different versions of the dataset, sorry (and I was skipping one byte less).
So, just for completeness, this also works:
dd if=CC-MAIN-20150124161101-00006-ip-10-180-212-252.ec2.internal.warc.gz bs=1 skip=768421563 count=9953 of=chunk.gz
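(dd skip=N with bs=1 skips the first N bytes, so skip=768421563 lines up with the 0-based offset from the index; the earlier tail -c +768421563 started one byte too early because tail -c +N counts from 1.)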

Mantas Zimnickas

Sep 25, 2016, 6:19:24 AM
to Common Crawl
2016-09-08 18:58:20 UTC+3, zbagz wrote:
> You should be using ranged requests to avoid downloading the whole archive. Try this:
>
>     curl -r 768421563-768431515 "https://commoncrawl.s3.amazonaws.com:/crawl-data/CC-MAIN-2015-06/segments/1422115861027.55/warc/CC-MAIN-20150124161101-00006-ip-10-180-212-252.ec2.internal.warc.gz" | zless

Nice!

Currently index.commoncrawl.org returns only offsets for WARC files; is it possible to somehow get offsets in the WET files?

Christian Lund

Sep 25, 2016, 8:53:14 AM
to Common Crawl
If possible, also the offset for WAT files (same question posted here: https://groups.google.com/forum/?hl=en#!topic/common-crawl/cw4D__LO890).

Sebastian Nagel

Sep 27, 2016, 11:39:27 AM
to common...@googlegroups.com
Hi Christian, hi Mantas,

> Currently index.commoncrawl.org returns only offsets for WARC
> files, is it possible to somehow get offsets in the WET files?

> If possible also the offset for WAT files (same question posted
> here https://groups.google.com/forum/?hl=en#!topic/common-crawl/cw4D__LO890).

Thanks for the suggestions. We would have to implement this in
https://github.com/commoncrawl/webarchive-indexing
ideally in a way that does not require re-indexing the WARCs.

There is a 1:1 correspondence between WARC and WAT/WET files. So at the very least, it's possible to find the
right WAT/WET file and estimate the offset.
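For example, assuming the usual naming convention of the crawl (an assumption here, not something stated in this thread), the corresponding WAT/WET path can be derived from the WARC path by substitution:

# Sketch, assuming the usual Common Crawl layout:
#   .../warc/<name>.warc.gz -> .../wat/<name>.warc.wat.gz and .../wet/<name>.warc.wet.gz
warc = ("crawl-data/CC-MAIN-2015-06/segments/1422115861027.55/warc/"
        "CC-MAIN-20150124161101-00006-ip-10-180-212-252.ec2.internal.warc.gz")
wat = warc.replace("/warc/", "/wat/").replace(".warc.gz", ".warc.wat.gz")
wet = warc.replace("/warc/", "/wet/").replace(".warc.gz", ".warc.wet.gz")
# The byte offset within the WAT/WET file still has to be found or estimated,
# since the records compress to different sizes.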

Best,
Sebastian

