How to Download WET Files


Dakila

Oct 1, 2017, 9:47:25 PM
to Common Crawl
Hello,

I'm running a Python script I found on the Internet to download WET files, but it fails with an error. Below is a sample run with the error. How can I fix it? Thank you for the help.


>>> import warc
>>> import requests
>>> from contextlib import closing
>>> from StringIO import StringIO
>>>
>>> def get_partial_warc_file(url, num_bytes=1024 * 10):
...     """
...     Download the first part of a WARC file and return a warc.WARCFile instance.
...
...     url: the url of a gzipped WARC file
...     num_bytes: the number of bytes to download. Default is 10KB
...
...     return: warc.WARCFile instance
...     """
...     with closing(requests.get(url, stream=True)) as r:
...         buf = StringIO(r.raw.read(num_bytes))
...     return warc.WARCFile(fileobj=buf, compress=True)
...
>>> urls = {
... }
>>>
>>> files = {file_type: get_partial_warc_file(url=url) for file_type, url in urls.items()}
>>> # this line can be used if you want to download the whole file
... # files = {file_type: warc.open(url) for file_type, url in urls.items()}
...
>>> def get_record_with_header(warc_file, header, value):
...     for record, _, _ in warc_file.browse():
...         if record.header.get(header) == value:
...             return record
...
>>> warc_record = get_record_with_header(
...     files['warc'],
...     header='WARC-Type',
...     value='response'
... )
Traceback (most recent call last):
  File "<stdin>", line 4, in <module>
  File "<stdin>", line 2, in get_record_with_header
  File "/Library/Python/2.7/site-packages/warc/warc.py", line 295, in browse
    for record in self.reader:
  File "/Library/Python/2.7/site-packages/warc/warc.py", line 390, in __iter__
    record = self.read_record()
  File "/Library/Python/2.7/site-packages/warc/warc.py", line 367, in read_record
    fileobj = self.fileobj.read_member()
  File "/Library/Python/2.7/site-packages/warc/gzip2.py", line 104, in read_member
    BaseGzipFile._read(self, 1)
  File "/System/Library/Frameworks/Python.framework/Versions/2.7/lib/python2.7/gzip.py", line 303, in _read
    self._read_gzip_header()
  File "/System/Library/Frameworks/Python.framework/Versions/2.7/lib/python2.7/gzip.py", line 197, in _read_gzip_header
    raise IOError, 'Not a gzipped file'
IOError: Not a gzipped file                     <<< - - -  Error here
>>> wat_record = get_record_with_header(
...     files['wat'],
...     header='WARC-Refers-To',
...     value=warc_record.header['WARC-Record-ID']
... )
Traceback (most recent call last):
  File "<stdin>", line 4, in <module>
NameError: name 'warc_record' is not defined
>>>
>>> wet_record = get_record_with_header(
...     files['wet'],
...     header='WARC-Refers-To',
...     value=warc_record.header['WARC-Record-ID']
... )
Traceback (most recent call last):
  File "<stdin>", line 4, in <module>
NameError: name 'warc_record' is not defined
>>>

- - - - - - 

The section of gzip.py that raises the error:

def _read_gzip_header(self):
        magic = self.fileobj.read(2)
        if magic != '\037\213':
            raise IOError, 'Not a gzipped file'
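
The check above compares the first two bytes of the stream with the gzip magic number. Here is a minimal Python 3 sketch of the same check. It can help to inspect what the server actually returned before passing the buffer to the warc library. Note that `looks_like_gzip` is a hypothetical helper name (not part of warc or gzip), and the error body is made up for illustration:

```python
import gzip

GZIP_MAGIC = b"\x1f\x8b"  # the same two bytes as '\037\213' in gzip.py


def looks_like_gzip(data: bytes) -> bool:
    """Return True if the buffer starts with the gzip magic number."""
    return data[:2] == GZIP_MAGIC


# A real gzip member passes the check ...
good = gzip.compress(b"WARC/1.0\r\n")
# ... while a non-gzip response body (hypothetical example) does not.
bad = b"<?xml version='1.0'?><Error>example error body</Error>"

print(looks_like_gzip(good))
print(looks_like_gzip(bad))
```

If the check fails on the downloaded bytes, printing the first few hundred bytes of the response usually shows whether the server sent an error page instead of the file.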

Sebastian Nagel

Oct 2, 2017, 4:48:57 AM
to common...@googlegroups.com
Hi,

the URLs point to the old data location, which was changed more than a year ago, see [1].
The data is now located at
s3://commoncrawl/
or, over HTTPS, at
https://commoncrawl.s3.amazonaws.com/ (without "common-crawl/")

Accordingly, the first of the sample URLs should be

https://commoncrawl.s3.amazonaws.com/crawl-data/CC-MAIN-2016-07/segments/1454701145519.33/warc/CC-MAIN-20160205193905-00000-ip-10-236-182-209.ec2.internal.warc.gz
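
A rough Python sketch of that rewrite, assuming the old URLs contain a "common-crawl/" path component followed by the same "crawl-data/..." path. The helper name `to_new_location`, the old host, and the file name in the example are hypothetical:

```python
NEW_BASE = "https://commoncrawl.s3.amazonaws.com/"


def to_new_location(old_url: str, marker: str = "common-crawl/") -> str:
    """Map a URL under the old location to the new bucket.

    Assumption: everything after the first `marker` is unchanged.
    """
    _, found, path = old_url.partition(marker)
    if not found:
        raise ValueError("URL does not contain %r" % marker)
    return NEW_BASE + path


# Hypothetical old-style URL, for illustration only.
old = "https://old-host.example/common-crawl/crawl-data/CC-MAIN-2016-07/example.warc.gz"
print(to_new_location(old))
```

URLs that already point to the new location do not contain the marker, so the helper raises ValueError for them rather than silently changing them.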

Best,
Sebastian

[1] https://groups.google.com/d/msg/common-crawl/nKuQK68rebo/x0TEUAaYCQAJ
[2] http://commoncrawl.org/the-data/get-started/
[3] http://commoncrawl.org/the-data/examples/

Dakila

Oct 2, 2017, 12:38:00 PM
to Common Crawl
Sir Sebastian, thank you very much for the reply!