Having Trouble working on Common Crawl Data

Gopika Bhardwaj

Jun 28, 2017, 7:06:33 AM

to Common Crawl

Could someone share Python code for extracting plain-text data from WET files?

Until now, I have manually downloaded a wet.paths.gz file and then downloaded a .wet file by appending a path from wet.paths.gz to https://commoncrawl.s3.amazonaws.com/...
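The manual step described above could be scripted roughly as follows. This is only a sketch: the helper name `wet_urls` is made up for illustration, the base URL is the one mentioned in this post, and the listing is assumed to hold one relative path per line.

```python
import gzip

# Base URL as given in the post; each path from the listing is appended to it.
BASE_URL = "https://commoncrawl.s3.amazonaws.com/"


def wet_urls(paths_file):
    """Yield one download URL per non-empty line of a gzipped paths listing."""
    with gzip.open(paths_file, "rt", encoding="utf-8") as listing:
        for line in listing:
            path = line.strip()
            if path:
                yield BASE_URL + path
```

Calling `list(wet_urls("wet.paths.gz"))` would then give the full URLs, which can be passed to any download tool.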

After collecting the data, I intend to apply a topic model, Latent Dirichlet Allocation (LDA), to it.
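LDA works on per-document word counts, so the extracted text would first need to be turned into a bag of words. A minimal sketch of that preprocessing step, assuming a very simple regex tokenizer (the name `bag_of_words` is hypothetical, and this tokenizer will not segment languages such as Japanese that are written without spaces):

```python
import re
from collections import Counter


def bag_of_words(text):
    """Lowercase the text, split it into word tokens and count each token."""
    tokens = re.findall(r"\w+", text.lower())
    return Counter(tokens)
```

The resulting counters, one per document, are the kind of input a topic-modeling library would expect after being mapped to its own matrix or corpus format.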

I am unable to write a Python script that reads its input from a file.
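Reading text from a file might look like the sketch below. It assumes a UTF-8 text file with one piece of text per line; `read_documents` is a hypothetical helper name, not part of any library.

```python
def read_documents(path):
    """Return the non-empty lines of a UTF-8 text file as a list of strings."""
    # errors="replace" keeps the script running if a byte sequence is not valid UTF-8.
    with open(path, encoding="utf-8", errors="replace") as handle:
        return [line.strip() for line in handle if line.strip()]
```

For example, `read_documents("output.txt")` would return the lines of a file named output.txt, where that file name is only a placeholder.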

Can someone help me with that as well?

Thanks

Sebastian Nagel

Jun 28, 2017, 7:34:14 AM

to common...@googlegroups.com

Hi,

here is a minimal solution based on warcio [1], adapted from [2]:

import sys
from warcio.archiveiterator import ArchiveIterator, ArchiveLoadFailed

# Every command-line argument is treated as the path of a local WET file.
for filename in sys.argv[1:]:
    try:
        stream = open(filename, 'rb')
    except IOError as exception:
        sys.stderr.write('Failed to read {}: {}\n'.format(filename, exception))
        continue
    try:
        for record in ArchiveIterator(stream):
            # WET files store the extracted plain text in "conversion" records.
            if (record.rec_type == 'conversion' and
                    record.content_type == 'text/plain'):
                text = record.content_stream().read().decode('utf-8')
                print(text)
    except ArchiveLoadFailed as exception:
        sys.stderr.write('Failed to process WARC file {}: {}\n'.format(
            filename, exception))

% python3 wet2text.py crawl-data/CC-MAIN-2017-13/segments/1490218186353.38/wet/CC-MAIN-20170322212946-00000-ip-10-233-31-227.ec2.internal.warc.wet.gz | head -3
～たまに撮れる瞬間を求めて～ SL
～たまに撮れる瞬間を求めて～
塩分控えめ…

Best,
Sebastian

[1] https://pypi.python.org/pypi/warcio
[2] https://github.com/commoncrawl/cc-pyspark/blob/master/sparkcc.py

