How to Download WET Files

Dakila

Oct 1, 2017, 9:47:25 PM

to Common Crawl

Hello,

I'm running a Python script I found on the Internet to download WET files, but it fails with an error. Here is a sample run, including the error. How can I fix it? Thank you for the help.

>>> import warc
>>> import requests
>>> from contextlib import closing
>>> from StringIO import StringIO
>>>
>>> def get_partial_warc_file(url, num_bytes=1024 * 10):
...     """
...     Download the first part of a WARC file and return a warc.WARCFile instance.
...
...     url: the url of a gzipped WARC file
...     num_bytes: the number of bytes to download. Default is 10KB
...
...     return: warc.WARCFile instance
...     """
...     with closing(requests.get(url, stream=True)) as r:
...         buf = StringIO(r.raw.read(num_bytes))
...     return warc.WARCFile(fileobj=buf, compress=True)
...
>>> urls = {
...     'warc': 'https://aws-publicdatasets.s3.amazonaws.com/common-crawl/crawl-data/CC-MAIN-2016-07/segments/1454701145519.33/warc/CC-MAIN-20160205193905-00000-ip-10-236-182-209.ec2.internal.warc.gz',
...     'wat': 'https://aws-publicdatasets.s3.amazonaws.com/common-crawl/crawl-data/CC-MAIN-2016-07/segments/1454701145519.33/wat/CC-MAIN-20160205193905-00000-ip-10-236-182-209.ec2.internal.warc.wat.gz',
...     'wet': 'https://aws-publicdatasets.s3.amazonaws.com/common-crawl/crawl-data/CC-MAIN-2016-07/segments/1454701145519.33/wet/CC-MAIN-20160205193905-00000-ip-10-236-182-209.ec2.internal.warc.wet.gz'
... }
>>>
>>> files = {file_type: get_partial_warc_file(url=url) for file_type, url in urls.items()}
>>> # this line can be used if you want to download the whole file
... # files = {file_type: warc.open(url) for file_type, url in urls.items()}
...
>>> def get_record_with_header(warc_file, header, value):
...     for record, _, _ in warc_file.browse():
...         if record.header.get(header) == value:
...             return record
...
>>> warc_record = get_record_with_header(
...     files['warc'],
...     header='WARC-Type',
...     value='response'
... )

Traceback (most recent call last):
  File "<stdin>", line 4, in <module>
  File "<stdin>", line 2, in get_record_with_header
  File "/Library/Python/2.7/site-packages/warc/warc.py", line 295, in browse
    for record in self.reader:
  File "/Library/Python/2.7/site-packages/warc/warc.py", line 390, in __iter__
    record = self.read_record()
  File "/Library/Python/2.7/site-packages/warc/warc.py", line 367, in read_record
    fileobj = self.fileobj.read_member()
  File "/Library/Python/2.7/site-packages/warc/gzip2.py", line 104, in read_member
    BaseGzipFile._read(self, 1)
  File "/System/Library/Frameworks/Python.framework/Versions/2.7/lib/python2.7/gzip.py", line 303, in _read
    self._read_gzip_header()
  File "/System/Library/Frameworks/Python.framework/Versions/2.7/lib/python2.7/gzip.py", line 197, in _read_gzip_header
    raise IOError, 'Not a gzipped file'
IOError: Not a gzipped file <<< - - - Error here

>>> wat_record = get_record_with_header(
...     files['wat'],
...     header='WARC-Refers-To',
...     value=warc_record.header['WARC-Record-ID']
... )
Traceback (most recent call last):
  File "<stdin>", line 4, in <module>
NameError: name 'warc_record' is not defined
>>>
>>> wet_record = get_record_with_header(
...     files['wet'],
...     header='WARC-Refers-To',
...     value=warc_record.header['WARC-Record-ID']
... )
Traceback (most recent call last):
  File "<stdin>", line 4, in <module>
NameError: name 'warc_record' is not defined
>>>

- - - - - -

Section of gzip.py that throws the error

def _read_gzip_header(self):
    magic = self.fileobj.read(2)
    if magic != '\037\213':
        raise IOError, 'Not a gzipped file'
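
The excerpt above shows that gzip.py only compares the first two bytes of the stream with '\037\213' (octal for 0x1f 0x8b), so the error means the downloaded bytes did not start with the gzip magic number. A minimal sketch of the same check, which could be run on the downloaded bytes before handing them to warc; the names GZIP_MAGIC and looks_like_gzip and the sample byte strings are hypothetical and used only for illustration:

```python
# Hypothetical helper: mirrors the two-byte comparison in the gzip.py excerpt.
GZIP_MAGIC = b"\x1f\x8b"  # the same bytes as '\037\213' written in octal


def looks_like_gzip(data):
    """Return True if the byte string starts with the gzip magic number."""
    return data[:2] == GZIP_MAGIC


# Made-up sample inputs: the start of a gzip stream and a non-gzip body.
gzip_start = b"\x1f\x8b\x08\x00"
error_page = b"<?xml version=\"1.0\"?><Error>"

print(looks_like_gzip(gzip_start))
print(looks_like_gzip(error_page))
```

If the check fails, printing the first few bytes of the response usually shows what the server actually returned instead of gzip data.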

Sebastian Nagel

Oct 2, 2017, 4:48:57 AM

to common...@googlegroups.com

Hi,

The URLs point to the old data location, which changed more than a year ago, see [1].
The data is now located at
s3://commoncrawl/
or, over HTTPS, at
https://commoncrawl.s3.amazonaws.com/ (without "common-crawl/")

Accordingly, the first of the sample URLs should be

https://commoncrawl.s3.amazonaws.com/crawl-data/CC-MAIN-2016-07/segments/1454701145519.33/warc/CC-MAIN-20160205193905-00000-ip-10-236-182-209.ec2.internal.warc.gz
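
The same rewrite can be sketched in Python for all three sample URLs. This is only an illustration: the names OLD_PREFIX, NEW_PREFIX and to_new_location are hypothetical, and it assumes the WAT and WET files moved in the same way as the WARC file shown above.

```python
# Hypothetical helper: swap the old bucket prefix for the new one.
OLD_PREFIX = "https://aws-publicdatasets.s3.amazonaws.com/common-crawl/"
NEW_PREFIX = "https://commoncrawl.s3.amazonaws.com/"


def to_new_location(url):
    """Rewrite an old-location URL to the new bucket; leave other URLs as-is."""
    if url.startswith(OLD_PREFIX):
        return NEW_PREFIX + url[len(OLD_PREFIX):]
    return url


# The first sample URL from the original script.
old_warc = (
    "https://aws-publicdatasets.s3.amazonaws.com/common-crawl/crawl-data/"
    "CC-MAIN-2016-07/segments/1454701145519.33/warc/"
    "CC-MAIN-20160205193905-00000-ip-10-236-182-209.ec2.internal.warc.gz"
)
new_warc = to_new_location(old_warc)
print(new_warc)
```

In the original script this would amount to passing each value of the urls dict through to_new_location before downloading.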

Best,
Sebastian

[1] https://groups.google.com/d/msg/common-crawl/nKuQK68rebo/x0TEUAaYCQAJ
[2] http://commoncrawl.org/the-data/get-started/
[3] http://commoncrawl.org/the-data/examples/

> --
> You received this message because you are subscribed to the Google Groups "Common Crawl" group.
> To unsubscribe from this group and stop receiving emails from it, send an email to
> common-crawl...@googlegroups.com <mailto:common-crawl...@googlegroups.com>.
> To post to this group, send email to common...@googlegroups.com
> <mailto:common...@googlegroups.com>.
> Visit this group at https://groups.google.com/group/common-crawl.
> For more options, visit https://groups.google.com/d/optout.

Dakila

Oct 2, 2017, 12:38:00 PM

to Common Crawl

Sir Sebastian, thank you very much for the reply!
