How to obtain the text content from WET files for specific Homepages?


trader...@gmail.com

Jan 28, 2019, 1:49:38 PM1/28/19
to Common Crawl
I am really curious about getting the text from the WET files. Everything I found is just about the WARC files.

If I use http://index.commoncrawl.org, the filenames also refer only to WARC files. How can I get the corresponding WET file?

Many thanks for your help!

yo...@yossi.at

Jan 28, 2019, 4:13:24 PM1/28/19
to common...@googlegroups.com

Hi,


Have you looked at http://commoncrawl.org/the-data/get-started/? It has links to a page per monthly crawl, which contains a link to the WET files.

The above page also explains how to build the URL to the WET index from the YYYY-WW prefix of a specific crawl.


                Yossi.

--
You received this message because you are subscribed to the Google Groups "Common Crawl" group.
To unsubscribe from this group and stop receiving emails from it, send an email to common-crawl...@googlegroups.com.
To post to this group, send email to common...@googlegroups.com.
Visit this group at https://groups.google.com/group/common-crawl.
For more options, visit https://groups.google.com/d/optout.

trader...@gmail.com

Jan 29, 2019, 4:34:31 AM1/29/19
to Common Crawl
Thanks. Is the following the right way to find the proper WET file for a specific homepage?

Looking at http://index.commoncrawl.org, you get e.g. a filename like:

crawl-data/CC-MAIN-2018-51/segments/1544376823318.33/warc/CC-MAIN-20181210055518-20181210081018-00484.warc.gz

Now I use the part CC-MAIN-20181210055518-20181210081018-00484 and look within wet.paths to obtain the proper WET file?

This is a bit unclear and I cannot find anything about it on the homepage.

Sebastian Nagel

Jan 29, 2019, 4:43:25 AM1/29/19
to common...@googlegroups.com
Hi,

> Now I use the part CC-MAIN-20181210055518-20181210081018-00484 and look within wet.paths
> to obtain the proper WET file?

Yes, this will work.

Alternatively, you could replace
/warc/ -> /wet/
.warc.gz -> .warc.wet.gz

and you get:
crawl-data/CC-MAIN-2018-51/segments/1544376823318.33/wet/CC-MAIN-20181210055518-20181210081018-00484.warc.wet.gz

Best,
Sebastian
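The two replacements above can be sketched in Python; the path is the one from the question:

```python
# Derive the WET path from a WARC path via two string replacements.
warc_path = ("crawl-data/CC-MAIN-2018-51/segments/1544376823318.33/warc/"
             "CC-MAIN-20181210055518-20181210081018-00484.warc.gz")

wet_path = (warc_path
            .replace("/warc/", "/wet/")
            .replace(".warc.gz", ".warc.wet.gz"))

print(wet_path)
# crawl-data/CC-MAIN-2018-51/segments/1544376823318.33/wet/CC-MAIN-20181210055518-20181210081018-00484.warc.wet.gz
```

The same pattern works for WAT files (/wat/ and .warc.wat.gz).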


trader...@gmail.com

Jan 29, 2019, 9:20:27 AM1/29/19
to Common Crawl
Thanks. So one .wet.gz file contains many homepages? That's why I need to filter for my specific pages. But using 'WARC-Target-URI' to filter for my pages is not possible, as this key is not available:

print(record.header['WARC-Target-URI'])

KeyError: 'warc-target-uri'

Sebastian Nagel

Jan 29, 2019, 10:09:35 AM1/29/19
to common...@googlegroups.com
Hi,

Which Python module is used for the WARC (or WET) parsing?

In case of "warc" (https://pypi.org/project/warc/) it should be

record['WARC-Target-URI']

or for "warcio" (https://pypi.org/project/warcio/)

record.rec_headers.get_header('WARC-Target-URI')

Other modules may provide different methods to get the field 'WARC-Target-URI'.

Best,
Sebastian

trader...@gmail.com

Jan 29, 2019, 10:45:30 AM1/29/19
to Common Crawl
I just installed warc via pip and still get:

 File "word_count.py", line 8, in process_record
    print(record['WARC-Target-URI'])
  File "C:\Users\chris1\Anaconda3\envs\crawl\lib\site-packages\warc\warc.py", line 199, in __getitem__
    return self.header[name]
  File "C:\Users\chris1\Anaconda3\envs\crawl\lib\site-packages\warc\utils.py", line 34, in __getitem__
    return self._d[name.lower()]
KeyError: 'warc-target-uri'


I followed the instructions from here: https://github.com/commoncrawl/cc-mrjob

jay patel

Jan 29, 2019, 10:17:05 PM1/29/19
to Common Crawl
I suggest you use another fork of the warc package, such as this one: https://github.com/erroneousboat/warc3

Just download the zip file, unzip it, cd into the directory, and run:

pip install .

to install the package. Make sure you have uninstalled the older package before installing this one, though.

Sebastian Nagel

Jan 30, 2019, 4:50:23 AM1/30/19
to common...@googlegroups.com
Hi,

I think I got it: the first record of a WARC, WAT or WET file
is the "warcinfo" record, which does not have a "WARC-Target-URI".

There is a check in word_count.py which skips over the warcinfo
record, accepting only records with "Content-Type" == "text/plain".
The error should go away if the "WARC-Target-URI" is only accessed
after this check (see lines 7-8 in [1]).


> I suggest you use another fork of warc package such as this one
> https://github.com/erroneousboat/warc3


@Jay: Does "warc3" work seamlessly on both Python 2 and 3?
I haven't used it so far, but it could be an easy solution for [2].


Best and thanks,
Sebastian



[1] https://github.com/commoncrawl/cc-mrjob/blob/master/word_count.py#L7
[2] https://github.com/commoncrawl/cc-mrjob/issues/11
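The guard described above can be sketched as follows; the dict records are only a stand-in for whatever record objects your WARC library yields:

```python
def target_uri(record):
    """Return the target URI, or None for records (like warcinfo) without one."""
    # The leading warcinfo record has Content-Type "application/warc-fields";
    # the text records of a WET file have "text/plain".
    if record.get("Content-Type") != "text/plain":
        return None
    return record["WARC-Target-URI"]

records = [
    {"WARC-Type": "warcinfo", "Content-Type": "application/warc-fields"},
    {"WARC-Type": "conversion", "Content-Type": "text/plain",
     "WARC-Target-URI": "http://example.com/"},
]
uris = [u for u in map(target_uri, records) if u]
print(uris)  # ['http://example.com/']
```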

trader...@gmail.com

Jan 30, 2019, 5:39:28 AM1/30/19
to Common Crawl
Thanks. I am still trying to find a way to extract the proper parts of the WET files.

I tried the following, using the offset and length from the WARC index to request the WET file, but it says that the range is not satisfiable. First, to get the WET file, I did two replaces on the string:

offset, length = int(record['offset']), int(record['length'])
offset_end = offset + length - 1

resp = requests.get(prefix + record['filename'].replace("/warc/", "/wet/").replace(".warc.gz", ".warc.wet.gz"),
                    headers={'Range': 'bytes={}-{}'.format(offset, offset_end)})


gives 

b'<?xml version="1.0" encoding="UTF-8"?>\n<Error><Code>InvalidRange</Code><Message>The requested range is not satisfiable</Message><RangeRequested>bytes=847111823-847128797</RangeRequested><ActualObjectSize>144936160</ActualObjectSize><RequestId>B2FACC3B4B986CB1</RequestId><HostId>rhOJvMJX/tay0JK953e5KFdOK9TJcWLeN6Z677/jdJrpHFGGXE15ijxn8S7GdmKIx1vlHuG4joU=</HostId></Error>'

Can I instead use the WARC-Target-URI inside the header information?

Sebastian Nagel

Jan 30, 2019, 6:04:34 AM1/30/19
to common...@googlegroups.com
Hi,

the offsets in the URL index are offsets to the WARC files and cannot be used for WET files.

As suggested by Jay Patel in the other thread "Can you obtain specific parts (urls) from requesting
the single WET file?", using the WARC file and parsing the HTML is the easiest way to go. Parsing
the HTML content requires only a few lines in Python, see
https://github.com/commoncrawl/cc-pyspark/blob/master/cc_index_word_count.py#L38

Best,
Sebastian
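For comparison, this is how the index fields are applied correctly, i.e. against the WARC file. A sketch only, with the actual HTTP fetch left out; the download prefix is an assumption, and the offset/length are reconstructed from the error message above:

```python
# Build a ranged request for one record from the *WARC* file -- the offsets
# in the URL index refer to this file, not to the WET file.
prefix = "https://data.commoncrawl.org/"  # assumed download prefix
record = {  # fields as returned by index.commoncrawl.org
    "filename": "crawl-data/CC-MAIN-2018-51/segments/1544376823318.33/warc/"
                "CC-MAIN-20181210055518-20181210081018-00484.warc.gz",
    "offset": "847111823",
    "length": "16975",
}
offset, length = int(record["offset"]), int(record["length"])
url = prefix + record["filename"]  # keep /warc/ -- do not rewrite to /wet/
range_header = {"Range": "bytes={}-{}".format(offset, offset + length - 1)}
print(range_header)  # {'Range': 'bytes=847111823-847128797'}
# e.g. requests.get(url, headers=range_header) then returns a single gzip
# member that decompresses to exactly one WARC record.
```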



trader...@gmail.com

Jan 30, 2019, 6:08:57 AM1/30/19
to Common Crawl
But what is actually returned by requesting the WET file? Does it contain some headers, or is it JSON where you can retrieve specific keys, or where the header has the correct URI?


trader...@gmail.com

Jan 30, 2019, 6:15:11 AM1/30/19
to Common Crawl
Are those parts before the text content the header info described in http://commoncrawl.org/2014/04/navigating-the-warc-file-format/, such that I can retrieve a specific page by calling something like:

    resp = requests.get(prefix + record['filename'].replace("/warc/", "/wet/").replace(".warc.gz", ".warc.wet.gz"),
                        headers={'WARC-Target-URI': 'http://advocatehealth.com/condell/emergencyservices3'})

jay patel

Jan 30, 2019, 6:55:01 AM1/30/19
to Common Crawl

> @Jay: Does "warc3" work seamlessly on both Python 2 and 3?
> I haven't used it so far, but it could be an easy solution for [2].

Unfortunately, I have not tested it with Python 2, but it works fine with Python 3.6.

Sebastian Nagel

Jan 30, 2019, 7:30:50 AM1/30/19
to common...@googlegroups.com
Hi,

the WARC-Target-URI is part of the capture metadata and cannot be used to selectively fetch a part
of the WARC (WAT or WET) file.

It might be confusing because WARC (WAT and WET) also use the HTTP-like
"Key: Value" fields to encode the capture metadata and also re-use some header (Key) names, e.g.
"Content-Type". But these are different layers:

3. the HTTP requests and responses to fetch WARC files or records from Amazon S3,
see https://docs.aws.amazon.com/AmazonS3/latest/API/RESTObjectGET.html

2. the capture metadata (WARC-Target-URI, etc.) which is defined in the WARC standard
http://iipc.github.io/warc-specifications/specifications/warc-format/warc-1.1/

1. the HTTP request and response headers "mirrored" in the WARC files

0. the archived payload, i.e. the HTML content


Best,
Sebastian
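To illustrate layer 2, a WET "conversion" record looks roughly like this (all values invented for illustration):

    WARC/1.0
    WARC-Type: conversion
    WARC-Target-URI: http://example.com/
    WARC-Date: 2018-12-10T06:00:00Z
    WARC-Record-ID: <urn:uuid:00000000-0000-0000-0000-000000000000>
    Content-Type: text/plain
    Content-Length: 12

    Hello world!

Only layer 2 (the WARC header fields) and layer 0 (the payload, here already reduced to plain text) are present in a WET file; the mirrored HTTP headers of layer 1 exist only in WARC files.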

Sebastian Nagel

Jan 30, 2019, 7:41:40 AM1/30/19
to common...@googlegroups.com
Thanks, Jay!


trader...@gmail.com

Jan 30, 2019, 7:53:01 AM1/30/19
to Common Crawl
Thanks, Sebastian. But can you show me a way to read out the specific text from my WET files? It is very confusing.

Also confusing, because this post (https://spark-in.me/post/parsing-common-crawl-in-two-simple-commands) says:

> But one more problem arises, CC also provides us with offsets for WARC files, but not for WET files. Turns out it was also not a problem. WET file essentially is a WARC file without tags. So you can use this library.

Now it seems possible to use the offset?


trader...@gmail.com

Jan 30, 2019, 7:53:42 AM1/30/19
to Common Crawl
I don't want to parse the HTML from the WARC file.

Sebastian Nagel

Jan 30, 2019, 8:29:46 AM1/30/19
to common...@googlegroups.com
I'm sorry, but at present there are only these two options:
(1) use the WET file and read it from the beginning until the WARC-Target-URI is found
(2) use the WARC file together with offset and length and parse the HTML

If it's only about a "few" pages (up to several million), I would recommend option (2).

Best,
Sebastian
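Option (1) can be sketched with only the standard library. The parsing below is a minimal reading of the WARC format, not a full parser; for real files, prefer a library like warcio and wrap the file in gzip.open:

```python
import io

def iter_wet_records(stream):
    """Yield (headers, payload) for each record of an uncompressed WET stream.

    Minimal sketch: header lines up to a blank line, then Content-Length
    bytes of payload. For a real file use stream = gzip.open(path, "rb").
    """
    while True:
        line = stream.readline()
        if not line:
            return
        if not line.startswith(b"WARC/"):
            continue  # skip the blank lines separating records
        headers = {}
        while True:
            line = stream.readline()
            if line.strip() == b"":
                break
            key, _, value = line.decode("utf-8").partition(":")
            headers[key.strip()] = value.strip()
        yield headers, stream.read(int(headers.get("Content-Length", 0)))

# Two synthetic records: the leading warcinfo record and one text record.
sample = (
    b"WARC/1.0\r\nWARC-Type: warcinfo\r\nContent-Length: 4\r\n\r\ninfo\r\n\r\n"
    b"WARC/1.0\r\nWARC-Type: conversion\r\n"
    b"WARC-Target-URI: http://example.com/\r\n"
    b"Content-Type: text/plain\r\nContent-Length: 11\r\n\r\nHello world\r\n\r\n"
)

wanted = "http://example.com/"
for headers, payload in iter_wet_records(io.BytesIO(sample)):
    if headers.get("WARC-Target-URI") == wanted:
        print(payload.decode("utf-8"))  # Hello world
        break
```

Because the file is read as a stream, this never holds more than one record in memory at a time.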


TradPhy

Jan 30, 2019, 8:44:11 AM1/30/19
to Common Crawl
Thanks! I will parse...

And it works. Of course there is a bunch of garbage, like links or custom page tags within divs, that you cannot remove unless you want to look at every page's HTML individually...

How can I remove the first part before the text content, like:

['WARC/1.0', 'WARC-Type: response', 'WARC-Date: 2018-12-12T22:00:33Z', 'WARC-Record-ID:  ', 'Content-Length: 132928', 'Content-Type: application/http; msgtype=response', 'WARC-Warcinfo-ID:  ]

I hand over the content part from the request:
data = resp.content

TradPhy

Jan 30, 2019, 8:48:27 AM1/30/19
to Common Crawl
> (1) use the WET file and read it from the beginning until the WARC-Target-URI is found

Do you mean to first read the whole file into memory, or is there really a way to read it step by step?

TradPhy

Jan 30, 2019, 11:59:38 AM1/30/19
to Common Crawl
Besides removing the header information, I want to ignore requests with HTTP 404 or 302 if the page is not available. How can I manage that?


Many thanks!

Sebastian Nagel

Jan 31, 2019, 3:25:05 AM1/31/19
to common...@googlegroups.com
Hi,

> removing the header information

You need to pass the fetched WARC record first to a WARC parser.
Then you can get the payload (the HTML document) and pass it
to an HTML parser.
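A sketch of the second step, extracting text from the HTML payload, using only the standard library (real code typically uses BeautifulSoup or lxml, as in the cc-pyspark example linked earlier):

```python
from html.parser import HTMLParser

class TextExtractor(HTMLParser):
    """Collect text nodes, ignoring script and style contents."""
    def __init__(self):
        super().__init__()
        self.parts = []
        self._skip = 0
    def handle_starttag(self, tag, attrs):
        if tag in ("script", "style"):
            self._skip += 1
    def handle_endtag(self, tag):
        if tag in ("script", "style") and self._skip:
            self._skip -= 1
    def handle_data(self, data):
        if not self._skip and data.strip():
            self.parts.append(data.strip())

html_payload = "<html><body><h1>Hello</h1><script>x=1</script><p>world</p></body></html>"
extractor = TextExtractor()
extractor.feed(html_payload)
print(" ".join(extractor.parts))  # Hello world
```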

> ignore reuqest with HTTP 404 or 302

The HTTP status is contained in the index. The easiest way is to filter
on the "status" field and process only page captures with status "200".

Best,
Sebastian
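Filtering on the status field can be sketched like this; the two JSON lines are invented, abridged stand-ins for what index.commoncrawl.org returns:

```python
import json

index_lines = [
    '{"url": "http://example.com/", "status": "200", "mime": "text/html"}',
    '{"url": "http://example.com/gone", "status": "404", "mime": "text/html"}',
]

# Keep only successfully fetched page captures.
captures = [json.loads(line) for line in index_lines]
ok = [c["url"] for c in captures if c["status"] == "200"]
print(ok)  # ['http://example.com/']
```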

TradPhy

Jan 31, 2019, 4:05:07 AM1/31/19
to Common Crawl
Could you give an example?

