Hi Eduard,
the WARC files are gzip-compressed per record, and it's possible to decompress single WARC records:

curl -s -r 914482005-914528236 \
  https://commoncrawl.s3.amazonaws.com/crawl-data/CC-MAIN-2020-40/segments/1600400192887.19/warc/CC-MAIN-20200919204805-20200919234805-00275.warc.gz \
  | gzip -dc
You then only need to split the WARC and HTTP headers from the payload (HTML).
There's no need to download the entire file.
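In Node.js the same steps might look like the sketch below (the helper names are my own, not from any library): the Range end byte is inclusive, so a record of warc_record_length bytes starting at warc_record_offset spans offset through offset + length - 1, and the decompressed record is three blocks separated by blank lines (WARC headers, HTTP headers, payload).

```javascript
// Sketch only: rangeHeader and splitWarcRecord are hypothetical helpers.

// The Range end byte is inclusive, so a record of `length` bytes
// starting at `offset` spans offset .. offset + length - 1.
function rangeHeader(offset, length) {
  return `bytes=${offset}-${offset + length - 1}`;
}

// A decompressed WARC response record is three blocks separated by
// blank lines: WARC headers, HTTP headers, then the payload (HTML).
function splitWarcRecord(text) {
  const [warcHeaders, httpHeaders, ...payload] = text.split('\r\n\r\n');
  return { warcHeaders, httpHeaders, payload: payload.join('\r\n\r\n') };
}

console.log(rangeHeader(914482005, 46232)); // bytes=914482005-914528236
```

That header value matches the `-r 914482005-914528236` range in the curl example above.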
Note: the gzip spec [1] allows multiple gzip-compressed files or chunks to be concatenated
into a single file. If you know the offsets, you can start decompressing from a given position.
Best,
Sebastian
[1]
https://www.gnu.org/software/gzip/manual/gzip.html#Advanced-usage
On 11/20/20 6:44 PM, Eduard K wrote:
> Hi guys,
>
> I'm afraid I am unable to recreate the solution above successfully anymore.
>
> Example -
> I'm making a GET request for the following file using axios on Node.js:
>
> https://commoncrawl.s3.amazonaws.com/crawl-data/CC-MAIN-2020-40/segments/1600400192887.19/warc/CC-MAIN-20200919204805-20200919234805-00275.warc.gz
>
> I'm providing the following headers:
> headers: { 'Range': 'bytes=914482005-914528236', 'Content-Type':'text/html' }
>
> The range is based on an index query for www.imdb.com using Athena, whereby I received back the aforementioned URL + warc_record_offset +
> warc_record_length.
>
> Yet the result I am getting is not the WARC WET record I was expecting but a long string of scrambled letters:
> Looks like this
> ...
> �R♦☺�☺�☼�E�*�"Sɏt�A►߶�`j��<
> ...
>
> I suspect the issue is that the *partial result (206)* I am getting is just fine, but the file is _GZ compressed_.
> The last time I tackled this issue, I can almost swear that I did in fact get back HTML and not zipped gibberish...
> Anyhow, after hours lost searching the web for a solution, it seems that to unzip a GZ-compressed file, I must download the entire file.
> The whole idea of the exercise is to find a way to only download the one WARC WET record, fast and without having to download GB's of data.
>
> Does anybody else have a working solution?
> Has anyone else encountered this issue?
> Am I going about this all wrong somehow?
>
> Hoping to get some help :)
>
> Thanks, guys!
> Ed.
>
> On Friday, 6 November 2020 at 17:24:59 UTC+2 Eduard K wrote:
>
> Thanks Tom!
>
> Thanks for pointing me in the right direction, I have it working now.
> All the best !!!
> Was not aware of the ability to pass byte-range via header, cool feature!! :-D
>
> Ed.
> On Friday, 6 November 2020 at 16:30:53 UTC+2 tfmo...@gmail.com wrote:
>
> You need to do a byte-range request by sending a Range header. Node should be able to do that. You could also use the AWS SDK for
> Node, although I'm not sure how much value it adds for such a simple task:
> https://aws.amazon.com/sdk-for-node-js/
>
> Tom
>
> On Fri, Nov 6, 2020 at 9:00 AM Eddie Kleiner <edikl...@gmail.com> wrote:
>
> Hi guys,
>
> I kindly ask for your advice.
> I am currently using AWS Athena to find WARC records.
>
> _The problem - _
> The file containing the WARC record is massive and contains not only the requested item but also many others.
>
> _What I want to achieve - _
> I know that there are *offset & length* parameters to fetch the specific website I am looking for in the WARC record; however, I
> don't want to download the entire WARC file, unzip it, then fetch the website by offset & length. Instead, */I am trying to
> somehow download just the required snippet of WARC containing the specific website and nothing else./*
>
> _Technology - _
> I am working in Node.js and would really prefer a Node.js method, but if that's not possible, I can also work with
> Python or other technologies.
>
> *Is this even possible?* :)
>
> _Example -_
> /Let's assume I want to download just the WARC record for the following website./
> /If I download the file (warc_filename), I need to download XXX, as that file contains the target website and countless more.
> How would I download just the one record in node.js?/
> *URL:* www.academyforlife.va
> *path*: /content/pav/en/news/2019/the-future-of-the-family-in-africa.html
> *warc_filename: *crawl-data/CC-MAIN-2020-40/segments/1600400210996.32/warc/CC-MAIN-20200923113029-20200923143029-00608.warc.gz
> *warc_record_offset*: 138475000
> *warc_record_length*: 6647
> *warc_segment: *1600400210996.32
>
> --
> You received this message because you are subscribed to the Google Groups "Common Crawl" group.
> To unsubscribe from this group and stop receiving emails from it, send an email to common-crawl...@googlegroups.com.
> To view this discussion on the web visit
> https://groups.google.com/d/msgid/common-crawl/af0935b5-92f0-4035-ab6c-97461a85ca56n%40googlegroups.com