Partially download a warc file? I need only a few websites off it.


May 24, 2021, 2:49:17 PM
to Common Crawl
Hello guys! I have read the info on, which talks about what I am trying to achieve; however, I cannot find these parameters for the WARC file I need to partially download.

I am working on a personal project based off the My Opera archive project.

Particularly, from this warc:
I have found a few blogs in the .cdx file that I would like to extract.

I googled and tried with curl and warcat, but didn't have any success.

The format on the .cdx file is the following:

com,myopera,files)/06acetil/albums/7756502/thumbs/img0094a.jpg_thumb.jpg 20140219145255 image/jpeg 200 6F73FESWVN3XIMQYYJVIWWT7NCKHUKJV - - 14663 43128250718 archiveteam_myopera_20140221001359/myopera_20140221001359.megawarc.warc.gz

Also, how do I download all pages saved from  ?

Sorry for all the questions. I'm a WARC noob and have been googling around all day, but I kind of got stuck here.

Thanks for any help and have a nice day!

Sebastian Nagel

May 24, 2021, 3:34:15 PM

It's basically the same procedure: take the filename, offset and length from
the CDX record:

curl --verbose --location -r43128250718-$((43128250718+14663-1)) \ \

One tiny but important point: redirects to an actual wayback server,
so you need to add `--location` to make curl follow the redirect.
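
For anyone who would rather script this than call curl, here is a minimal Python sketch of the same range request. The function names, the `url` argument and the `dest` path are placeholders of mine, not anything from the archive; you have to supply the download URL of the megawarc yourself. The byte range is inclusive, hence the `- 1`, exactly as in the curl call above. `urllib` follows redirects by default, which plays the role of `--location`.

```python
import urllib.request


def range_header(offset: int, length: int) -> str:
    """Build an inclusive HTTP Range value from a CDX offset and length."""
    return f"bytes={offset}-{offset + length - 1}"


def fetch_record(url: str, offset: int, length: int, dest: str) -> None:
    """Download a single WARC record (one gzip member) into dest.

    url is an assumption: pass the download URL of the .warc.gz file.
    """
    request = urllib.request.Request(
        url, headers={"Range": range_header(offset, length)}
    )
    # urlopen follows HTTP redirects on its own.
    with urllib.request.urlopen(request) as response, open(dest, "wb") as out:
        out.write(response.read())


# Offset and length taken from the CDX line quoted in the question.
print(range_header(43128250718, 14663))
```

Calling `fetch_record(url, 43128250718, 14663, "/tmp/my.warc.gz")` with the real URL should leave you with the same file the curl command writes.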

Then you can extract the image from /tmp/my.warc.gz, e.g. using warcio [1]

warcio extract --payload /tmp/my.warc.gz 0 >/tmp/my.jpg
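
If warcio is not at hand, the same step can be roughly approximated with the Python standard library alone. This is only a sketch under assumptions: it expects the file to hold a single response record, and it does not undo chunked transfer encoding or HTTP content compression, which warcio's `--payload` option takes care of. The function name `extract_payload` is mine.

```python
import gzip


def extract_payload(warc_gz_bytes: bytes) -> bytes:
    """Return the HTTP body of the first record in a gzipped WARC slice."""
    # Each record in a .warc.gz is its own gzip member, so a slice
    # covering exactly one record decompresses on its own.
    record = gzip.decompress(warc_gz_bytes)

    # WARC headers end at the first blank line.
    warc_headers, _, block = record.partition(b"\r\n\r\n")

    # The WARC Content-Length gives the size of the record block.
    block_length = None
    for line in warc_headers.split(b"\r\n"):
        name, _, value = line.partition(b":")
        if name.strip().lower() == b"content-length":
            block_length = int(value.strip())
    if block_length is not None:
        block = block[:block_length]

    # The block is the HTTP response: status line and headers,
    # a blank line, then the body (here, the image bytes).
    _http_headers, _, body = block.partition(b"\r\n\r\n")
    return body
```

Usage would be along the lines of `open("/tmp/my.jpg", "wb").write(extract_payload(open("/tmp/my.warc.gz", "rb").read()))`.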

> Also, how do I do to download all pages saved from ?

First, query the CDX server with the query (URL prefix)*

Second, iterate over all returned CDX records...
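
A minimal sketch of that loop in Python, assuming the CDX server returns space-separated lines laid out like the one quoted in the question, with length, offset and filename as the last three fields. `CdxHit` and `iter_cdx_hits` are names made up for this example. Each hit can then be fetched with a range request just like the single image above.

```python
from typing import Iterable, Iterator, NamedTuple


class CdxHit(NamedTuple):
    filename: str
    offset: int
    length: int


def iter_cdx_hits(lines: Iterable[str]) -> Iterator[CdxHit]:
    """Yield filename, offset and length for every usable CDX line."""
    for line in lines:
        fields = line.split()
        if len(fields) < 3:
            continue  # blank or truncated line
        length, offset, filename = fields[-3:]
        if not (length.isdigit() and offset.isdigit()):
            continue  # header line or record without a position
        yield CdxHit(filename, int(offset), int(length))


sample = (
    "com,myopera,files)/06acetil/albums/7756502/thumbs/img0094a.jpg_thumb.jpg "
    "20140219145255 image/jpeg 200 6F73FESWVN3XIMQYYJVIWWT7NCKHUKJV - - "
    "14663 43128250718 "
    "archiveteam_myopera_20140221001359/myopera_20140221001359.megawarc.warc.gz"
)
for hit in iter_cdx_hits([sample]):
    print(hit.filename, hit.offset, hit.length)
```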


May 24, 2021, 5:55:13 PM
to Common Crawl
Thank you Sebastian! I will try these steps.

It is much appreciated!
