Iterate through warc.gz without downloading it

Bogdan Metea

Apr 5, 2018, 11:23:20 AM

to Common Crawl

Hi guys,

I'm trying to iterate through the WARC records contained in crawl-data/CC-NEWS/2018/04/CC-NEWS-20180405091124-00174.warc.gz

I know how to download the file and then iterate through the records, but I'd like to open the file straight from S3. I tried using boto and smart_open, but I couldn't get it to work.

Here's what I've tried:

key = boto.connect_s3().get_bucket('commoncrawl').get_key(
    'crawl-data/CC-NEWS/2018/04/CC-NEWS-20180405091124-00174.warc.gz')

with smart_open.smart_open(key) as fin:
    for record in fin:
        if record['Warc-type'] == 'warcinfo':
            pass
        else:
            print(record['warc-target-uri'])

Is anyone aware if this is possible?

Sebastian Nagel

Apr 5, 2018, 12:02:15 PM

to common...@googlegroups.com

Hi Bogdan,

smart_open will iterate over lines, not WARC records. You need to pass the stream
to a WARC parser, e.g.:

import smart_open

from warcio.archiveiterator import ArchiveIterator

warc_input = smart_open.smart_open(
    's3://commoncrawl/crawl-data/CC-NEWS/2018/04/CC-NEWS-20180405091124-00174.warc.gz')

for record in ArchiveIterator(warc_input):
    if record.rec_type == 'response':
        print(record.rec_headers.get_header('WARC-Target-URI'))

However, buffering in a local temporary file (cf. [1]) avoids stale downloads or
HTTP connections. In my experience, at least, those can happen if you process the
data more slowly than the S3 endpoint streams it.
I would also recommend using warcio [2] (in place of warc [3]) and boto3. It looks
like recent versions of smart_open already use boto3.
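
For reference, the buffering approach mentioned above could look roughly like the sketch below. It is not the code from [1], only an illustration under a few assumptions: boto3 and warcio are installed, AWS credentials are available to the default boto3 client, and the helper names buffer_s3_object, iter_response_uris and print_response_uris are made up for this example.

```python
import tempfile


def buffer_s3_object(s3_client, bucket, key):
    """Download an S3 object into a local temporary file and rewind it.

    s3_client is expected to be a boto3 S3 client; any object offering
    download_fileobj(bucket, key, fileobj) works as well.
    """
    tmp = tempfile.TemporaryFile()  # binary mode, removed automatically on close
    try:
        s3_client.download_fileobj(bucket, key, tmp)
    except Exception:
        tmp.close()
        raise
    tmp.seek(0)  # rewind so the WARC parser starts at the first record
    return tmp


def iter_response_uris(stream):
    """Yield the WARC-Target-URI of every response record in the stream."""
    # Imported lazily so the buffering helper has no third-party dependency.
    from warcio.archiveiterator import ArchiveIterator

    for record in ArchiveIterator(stream):
        if record.rec_type == 'response':
            yield record.rec_headers.get_header('WARC-Target-URI')


def print_response_uris(bucket, key):
    """Buffer one WARC file locally, then print its response URIs."""
    import boto3  # assumption: credentials are configured for boto3

    with buffer_s3_object(boto3.client('s3'), bucket, key) as warc_file:
        for uri in iter_response_uris(warc_file):
            print(uri)
```

Calling print_response_uris('commoncrawl', 'crawl-data/CC-NEWS/2018/04/CC-NEWS-20180405091124-00174.warc.gz') would download the whole file first and then parse it from local disk, so slow processing no longer keeps an S3 connection open.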

Best,
Sebastian

[1] https://github.com/commoncrawl/cc-pyspark/blob/master/sparkcc.py#L169
[2] https://pypi.python.org/pypi/warcio
[3] https://pypi.python.org/pypi/warc

Bogdan Metea

Apr 5, 2018, 12:44:14 PM

to Common Crawl

Hi Sebastian,

I am honestly shocked how quick and good your replies are. Thank you for doing all of this!

I'll try both examples and come back with an update.

Thanks,

Bogdan