Partially download a warc file? I need only a few websites off it.

iPodClassic

May 24, 2021, 2:49:17 PM
to Common Crawl
Hello guys! I have read the thread at https://groups.google.com/g/common-crawl/c/iZVW5ai9jQI, which covers what I am trying to achieve. However, I can't find the corresponding parameters for the WARC file I need to partially download.

I am working on a personal project based off the My Opera archive project.

Particularly, from this warc:
I have found a few blogs in the .cdx file that I would like to extract.

I googled and tried with curl and warcat, but didn't have any success.

The format of the .cdx file is as follows:

com,myopera,files)/06acetil/albums/7756502/thumbs/img0094a.jpg_thumb.jpg 20140219145255 http://files.myopera.com/06acetil/albums/7756502/thumbs/IMG0094A.jpg_thumb.jpg image/jpeg 200 6F73FESWVN3XIMQYYJVIWWT7NCKHUKJV - - 14663 43128250718 archiveteam_myopera_20140221001359/myopera_20140221001359.megawarc.warc.gz



Also, how do I download all pages saved from http://files.myopera.com/06acetil/albums ?

Sorry for all the questions. I'm a WARC noob and have been googling around all day, but I kind of got stuck here.

Thanks for any help and have a nice day!

Sebastian Nagel

May 24, 2021, 3:34:15 PM
to common...@googlegroups.com
Hi,

It's basically the same procedure: take the filename, offset and length from
the CDX record (in your example the last three fields: length 14663, offset
43128250718, then the filename):

curl --verbose --location -r43128250718-$((43128250718+14663-1)) \
https://archive.org/download/archiveteam_myopera_20140221001359/myopera_20140221001359.megawarc.warc.gz \
>/tmp/my.warc.gz

One tiny but important point: archive.org redirects to an actual wayback server,
so you need to add `--location` to make curl follow the redirect.
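
As a rough illustration of the range arithmetic above, here is a small Python sketch that derives the byte range and download URL from one CDX line. It assumes the field layout shown in this thread, with length, offset and filename as the last three space-separated fields. The names `cdx_to_request` and `fetch_record` are made up for this example, and `fetch_record` needs network access.

```python
import urllib.request

# Base URL taken from the curl command in this thread.
DOWNLOAD_BASE = "https://archive.org/download/"


def cdx_to_request(cdx_line):
    """Return (url, range_header) for one CDX line.

    Assumes the last three space-separated fields are
    length, offset and filename, as in the example record.
    """
    fields = cdx_line.split()
    length, offset, filename = int(fields[-3]), int(fields[-2]), fields[-1]
    # HTTP byte ranges are inclusive, hence the -1 on the end position.
    last_byte = offset + length - 1
    return DOWNLOAD_BASE + filename, f"bytes={offset}-{last_byte}"


def fetch_record(cdx_line, out_path):
    """Download one gzipped WARC record to out_path (hypothetical helper)."""
    url, byte_range = cdx_to_request(cdx_line)
    request = urllib.request.Request(url, headers={"Range": byte_range})
    # urllib follows HTTP redirects by default, similar to curl --location.
    with urllib.request.urlopen(request) as response, open(out_path, "wb") as out:
        out.write(response.read())


# The CDX record quoted in the question.
EXAMPLE = (
    "com,myopera,files)/06acetil/albums/7756502/thumbs/img0094a.jpg_thumb.jpg "
    "20140219145255 "
    "http://files.myopera.com/06acetil/albums/7756502/thumbs/IMG0094A.jpg_thumb.jpg "
    "image/jpeg 200 6F73FESWVN3XIMQYYJVIWWT7NCKHUKJV - - 14663 43128250718 "
    "archiveteam_myopera_20140221001359/myopera_20140221001359.megawarc.warc.gz"
)

if __name__ == "__main__":
    print(cdx_to_request(EXAMPLE))
```

Calling `fetch_record(EXAMPLE, "/tmp/my.warc.gz")` should then be roughly equivalent to the curl command, although that has not been checked against archive.org here.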

Then you can extract the image from /tmp/my.warc.gz, e.g. using warcio [1]:

warcio extract --payload /tmp/my.warc.gz 0 >/tmp/my.jpg
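
If warcio is not at hand, the same idea can be sketched with the Python standard library alone. This is a simplified stand-in, not how warcio works internally: it assumes the file saved by curl holds exactly one gzipped `response` record, that the WARC header carries a `Content-Length`, and that the HTTP body is neither chunked nor content-encoded. The function name `extract_payload` is made up for this example.

```python
import gzip


def extract_payload(record_gz):
    """Return the HTTP body of a single gzipped WARC response record.

    Simplified: no handling of chunked transfer or content encodings.
    """
    raw = gzip.decompress(record_gz)
    # The WARC header block ends at the first blank line.
    warc_head, _, rest = raw.partition(b"\r\n\r\n")
    block_length = None
    for line in warc_head.split(b"\r\n")[1:]:  # skip the "WARC/1.0" line
        name, _, value = line.partition(b":")
        if name.strip().lower() == b"content-length":
            block_length = int(value.strip())
    if block_length is None:
        raise ValueError("WARC header has no Content-Length")
    # The record block is the HTTP response: status line, headers, body.
    block = rest[:block_length]
    _http_head, _, body = block.partition(b"\r\n\r\n")
    return body


if __name__ == "__main__":
    # Paths as used in this thread.
    with open("/tmp/my.warc.gz", "rb") as src:
        payload = extract_payload(src.read())
    with open("/tmp/my.jpg", "wb") as dst:
        dst.write(payload)
```

For anything beyond a quick check, warcio is the safer choice because it handles the cases this sketch ignores.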

> Also, how do I download all pages saved from http://files.myopera.com/06acetil/albums ?

First, query the CDX server with the URL prefix
files.myopera.com/06acetil/albums/*

Second, iterate over all returned CDX records and repeat the two steps above
(ranged download, then extraction) for each one.
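
A minimal sketch of that iteration, under one loudly stated assumption: instead of querying a CDX server, it filters the lines of a local .cdx file like the one quoted in the question. The conversion to the key format of the first CDX field (`com,myopera,files)/...`) is deliberately simplified and ignores ports, query strings and other canonicalization rules. The names `surt_prefix` and `matching_records` are made up for this example.

```python
def surt_prefix(url_prefix):
    """Turn 'files.myopera.com/06acetil/albums/*' into the key prefix
    'com,myopera,files)/06acetil/albums/' (simplified conversion).
    """
    host, _, path = url_prefix.rstrip("*").partition("/")
    reversed_host = ",".join(reversed(host.lower().split(".")))
    return reversed_host + ")/" + path.lower()


def matching_records(cdx_lines, url_prefix):
    """Yield (filename, offset, length) for CDX lines under url_prefix.

    Assumes length, offset and filename are the last three fields.
    """
    prefix = surt_prefix(url_prefix)
    for line in cdx_lines:
        fields = line.split()
        # Skip blank lines, header lines and records outside the prefix.
        if len(fields) < 3 or not fields[0].startswith(prefix):
            continue
        yield fields[-1], int(fields[-2]), int(fields[-3])


if __name__ == "__main__":
    # "myopera.cdx" is a placeholder file name, not taken from the thread.
    with open("myopera.cdx", encoding="utf-8", errors="replace") as cdx:
        for filename, offset, length in matching_records(
            cdx, "files.myopera.com/06acetil/albums/*"
        ):
            print(filename, offset, offset + length - 1)
```

Each printed triple gives the file name and the inclusive byte range for one ranged request like the curl call earlier in this reply.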

Best,
Sebastian

iPodClassic

May 24, 2021, 5:55:13 PM
to Common Crawl
Thank you Sebastian! I will try these steps.

It is much appreciated!


Regards.