Iterate through warc.gz without downloading it


Bogdan Metea

Apr 5, 2018, 11:23:20 AM
to Common Crawl
Hi guys,

I'm trying to iterate through the warc records contained by crawl-data/CC-NEWS/2018/04/CC-NEWS-20180405091124-00174.warc.gz

I know how to download the file and then iterate through the records.

But I'd like to open the file straight from S3. I tried using boto and smart_open, but I couldn't get it to work.

Here's what I've tried:

import boto
import smart_open

key = boto.connect_s3().get_bucket('commoncrawl').get_key(
    'crawl-data/CC-NEWS/2018/04/CC-NEWS-20180405091124-00174.warc.gz')

with smart_open.smart_open(key) as fin:
    for record in fin:
        if record['Warc-type'] == 'warcinfo':
            pass
        else:
            print(record['warc-target-uri'])


Is anyone aware if this is possible?



Sebastian Nagel

Apr 5, 2018, 12:02:15 PM
to common...@googlegroups.com
Hi Bogdan,

smart_open will iterate over lines, not WARC records. You need to pass the stream
to a WARC parser, e.g.:


import smart_open

from warcio.archiveiterator import ArchiveIterator

# Open the WARC file on S3 as a byte stream
warc_input = smart_open.smart_open(
    's3://commoncrawl/crawl-data/CC-NEWS/2018/04/CC-NEWS-20180405091124-00174.warc.gz')

# Let warcio split the stream into WARC records
for record in ArchiveIterator(warc_input):
    if record.rec_type == 'response':
        print(record.rec_headers.get_header('WARC-Target-URI'))


However, buffering the download in a local temporary file (cf. [1]) avoids stale downloads
or HTTP connections. In my experience, those can happen if you process the data more
slowly than the S3 endpoint streams it.
I would also recommend using warcio [2] (in place of warc [3]) and boto3. It looks like
recent versions of smart_open already use boto3.
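
A minimal sketch of the temporary-file buffering, for illustration only and not the code from [1]. It assumes boto3 and warcio are installed and that AWS credentials are configured. The helper names buffer_s3_object and print_target_uris are made up for this example; download_fileobj is the boto3 S3 client call used to fetch the object.

```python
import tempfile


def buffer_s3_object(s3_client, bucket, key):
    """Download an S3 object into a local temporary file and rewind it.

    s3_client is any object with a boto3-style
    download_fileobj(bucket, key, fileobj) method.
    """
    tmp = tempfile.TemporaryFile(mode='w+b')
    s3_client.download_fileobj(bucket, key, tmp)
    tmp.seek(0)  # rewind so the parser reads from the start
    return tmp


def print_target_uris(bucket, key):
    """Print the target URI of every response record (hypothetical helper)."""
    # Imported here so the buffering helper above needs only the stdlib.
    import boto3
    from warcio.archiveiterator import ArchiveIterator

    # Assumption: credentials are available to boto3 in the usual way.
    s3 = boto3.client('s3')
    # The temporary file is deleted automatically when it is closed.
    with buffer_s3_object(s3, bucket, key) as warc_input:
        for record in ArchiveIterator(warc_input):
            if record.rec_type == 'response':
                print(record.rec_headers.get_header('WARC-Target-URI'))
```

Calling print_target_uris('commoncrawl', 'crawl-data/CC-NEWS/2018/04/CC-NEWS-20180405091124-00174.warc.gz') would then parse the local copy, so slow record processing no longer holds the S3 connection open.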

Best,
Sebastian

[1] https://github.com/commoncrawl/cc-pyspark/blob/master/sparkcc.py#L169
[2] https://pypi.python.org/pypi/warcio
[3] https://pypi.python.org/pypi/warc

Bogdan Metea

Apr 5, 2018, 12:44:14 PM
to Common Crawl
Hi Sebastian,

I am honestly shocked how quick and good your replies are. Thank you for doing all of this!

I'll try both examples and I'll come back with an update.

Thanks,
Bogdan
