Fetching only one website - partial WARC fetching (single website)


Eddie Kleiner

Nov 6, 2020, 9:00:12 AM
to Common Crawl
Hi guys, 

I kindly ask for your advice.
I am currently using AWS Athena to find WARC records. 

The problem - 
The WARC file containing the record is massive and contains not only the requested item but also many others. 

What I want to achieve - 
I know there are offset and length parameters that locate the specific website I am looking for inside the WARC file. However, I don't want to download the entire WARC file, unzip it, and then extract the website by offset and length. Instead, I am trying to download just the required snippet of the WARC file containing the specific website and nothing else.

Technology - 
I am working in Node.js and would really prefer a method using Node.js, but if that's not possible, I can also work with Python or other technologies. 

Is this even possible? :)

Example -
Let's assume I want to download just the WARC record for the following website.
If I download the whole file (warc_filename), I need to download XXX, as that file contains the target website and countless more. How would I download just the one record in Node.js?

URL: www.academyforlife.va
path: /content/pav/en/news/2019/the-future-of-the-family-in-africa.html
warc_filename: crawl-data/CC-MAIN-2020-40/segments/1600400210996.32/warc/CC-MAIN-20200923113029-20200923143029-00608.warc.gz 
warc_record_offset: 138475000
warc_record_length:  6647
warc_segment: 1600400210996.32



Many thanks in advance for your time,
Ed.

Tom Morris

Nov 6, 2020, 9:30:53 AM
to common...@googlegroups.com
You need to do a byte range request by sending a Range header. Node should be able to do that. You could also use the AWS SDK for Node, although I'm not sure how much value it adds for such a simple task. https://aws.amazon.com/sdk-for-node-js/
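
A minimal sketch of such a request in Node.js, using only the built-in `https` module. The helper names `rangeHeader` and `fetchRecord` are made up for illustration, and the base URL is an assumption (the `commoncrawl.s3.amazonaws.com` endpoint used elsewhere in this thread). HTTP byte ranges are inclusive on both ends, so the last byte is offset + length - 1:

```javascript
const https = require('https');

// Build an HTTP Range header value from the index columns
// warc_record_offset and warc_record_length. Byte ranges are inclusive,
// so the last byte is offset + length - 1.
function rangeHeader(offset, length) {
  return `bytes=${offset}-${offset + length - 1}`;
}

// Hypothetical helper: fetch the bytes of one record and resolve with a Buffer.
// baseUrl is an assumption; adjust it to wherever you read the crawl data from.
function fetchRecord(warcFilename, offset, length,
                     baseUrl = 'https://commoncrawl.s3.amazonaws.com/') {
  return new Promise((resolve, reject) => {
    const options = { headers: { Range: rangeHeader(offset, length) } };
    https.get(baseUrl + warcFilename, options, (res) => {
      if (res.statusCode !== 206) {
        res.resume(); // discard the body
        reject(new Error(`expected 206 Partial Content, got ${res.statusCode}`));
        return;
      }
      const chunks = [];
      res.on('data', (chunk) => chunks.push(chunk));
      res.on('end', () => resolve(Buffer.concat(chunks)));
    }).on('error', reject);
  });
}

// Range for the example record from the first message.
console.log(rangeHeader(138475000, 6647)); // bytes=138475000-138481646
```

The response body should be kept as a Buffer rather than converted to a string, since the bytes of a `.warc.gz` file are binary.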

Tom

--
You received this message because you are subscribed to the Google Groups "Common Crawl" group.
To unsubscribe from this group and stop receiving emails from it, send an email to common-crawl...@googlegroups.com.
To view this discussion on the web visit https://groups.google.com/d/msgid/common-crawl/078da0d4-edaa-41ab-8249-f47c80de0d90n%40googlegroups.com.

Eduard K

Nov 6, 2020, 10:24:59 AM
to Common Crawl
Thanks Tom!

Thanks for pointing me in the right direction, I have it working now.
All the best !!!
Was not aware of the ability to pass byte-range via header, cool feature!! :-D

Ed.

Eduard K

Nov 20, 2020, 12:44:02 PM
to Common Crawl
Hi guys,

I'm afraid I am unable to recreate the solution above successfully anymore. 

Example - 
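I'm making a GET request for the following file using Axios on Node.js:
https://commoncrawl.s3.amazonaws.com/crawl-data/CC-MAIN-2020-40/segments/1600400192887.19/warc/CC-MAIN-20200919204805-20200919234805-00275.warc.gz

I'm providing the following headers:
headers: { 'Range': 'bytes=914482005-914528236', 'Content-Type': 'text/html' }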

The range is based on an index query for www.imdb.com using Athena, whereby I received back the aforementioned URL + warc_record_offset + warc_record_length.

Yet the result I am getting is not the WARC record I was expecting but a long string of scrambled characters, which looks like this:
 ...
�R♦☺�☺�☼�E�*�"Sɏt�A►߶�`j��<
...

I suspect the partial result (206) I am getting is fine, but the file is gzip-compressed.
The last time I tackled this issue, I could almost swear that I did in fact get back HTML and not zipped gibberish...
Anyhow, after losing hours searching the web for a solution, it seems that to unzip a gzip-compressed file I must download the entire file.
The whole idea of the exercise is to find a way to download only the one WARC record, fast and without having to download GBs of data.

Does anybody else have a working solution? 
Has anyone else encountered this issue? 
Am I going about this all wrong somehow? 

Hoping to get some help :)

Thanks, guys!
Ed.

Sebastian Nagel

Nov 20, 2020, 2:12:55 PM
to Common Crawl
Hi Eduard,

The WARC files are gzip-compressed per record, so it is possible to decompress a single WARC record:

curl -s -r 914482005-914528236 \
  https://commoncrawl.s3.amazonaws.com/crawl-data/CC-MAIN-2020-40/segments/1600400192887.19/warc/CC-MAIN-20200919204805-20200919234805-00275.warc.gz \
  | gzip -dc

You then only need to split the WARC and HTTP headers from the payload (HTML).
There's no need to download the entire file.

Note: the gzip spec [1] allows concatenating multiple gzip-compressed files or chunks
into a single file. If you know the offsets, you can start decompressing from a given position.
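
In Node.js the same idea might look like the sketch below, using the built-in `zlib` module. The function names and the tiny hand-made records are illustrative assumptions, not real crawl data; they are simplified to show only the layout (WARC headers, empty line, HTTP headers, empty line, payload). Two records are gzipped separately and concatenated to mimic a `.warc.gz` file, then one of them is decompressed using nothing but its offset and length:

```javascript
const zlib = require('zlib');

// Decompress one gzip member cut out of a larger buffer.
function gunzipRecord(buffer, offset, length) {
  return zlib.gunzipSync(buffer.subarray(offset, offset + length));
}

// Split a decompressed response record into WARC headers, HTTP headers
// and payload. Each header block ends with an empty line (CRLF CRLF).
function splitRecord(record) {
  const sep = Buffer.from('\r\n\r\n');
  const warcEnd = record.indexOf(sep);
  const httpEnd = record.indexOf(sep, warcEnd + sep.length);
  return {
    warcHeaders: record.subarray(0, warcEnd).toString('utf8'),
    httpHeaders: record.subarray(warcEnd + sep.length, httpEnd).toString('utf8'),
    // Everything after the HTTP headers, including the record's trailing CRLFs.
    payload: record.subarray(httpEnd + sep.length),
  };
}

// Simplified, made-up record for illustration only.
function fakeRecord(html) {
  return Buffer.from(
    'WARC/1.0\r\nWARC-Type: response\r\n\r\n' +
    'HTTP/1.1 200 OK\r\nContent-Type: text/html\r\n\r\n' +
    html + '\r\n\r\n');
}

// Each record is its own gzip member; the "file" is their concatenation.
const first = zlib.gzipSync(fakeRecord('<html>one</html>'));
const second = zlib.gzipSync(fakeRecord('<html>two</html>'));
const warcFile = Buffer.concat([first, second]);

// Offset and length of the second record, as an index would report them.
const parts = splitRecord(gunzipRecord(warcFile, first.length, second.length));
const html = parts.payload.toString('utf8').trimEnd();
console.log(html); // <html>two</html>
```

With a real record, the Buffer returned by the ranged request takes the place of the slice of `warcFile`; a stricter parser would use the Content-Length header of the WARC record instead of trimming trailing whitespace.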

Best,
Sebastian

[1] https://www.gnu.org/software/gzip/manual/gzip.html#Advanced-usage



Eduard K

Nov 20, 2020, 2:55:52 PM
to Common Crawl
Thanks, Sebastian,

I'll go ahead and try that, thanks for clearing it up!

Ed.
