Hi,
here is a minimal solution based on warcio [1], adapted from [2]:
import sys

from warcio.archiveiterator import ArchiveIterator, ArchiveLoadFailed

for filename in sys.argv[1:]:
    try:
        stream = open(filename, 'rb')
    except IOError as exception:
        sys.stderr.write('Failed to read {}: {}\n'.format(filename, exception))
        continue
    try:
        for record in ArchiveIterator(stream):
            if (record.rec_type == 'conversion' and
                    record.content_type == 'text/plain'):
                text = record.content_stream().read().decode('utf-8')
                print(text)
    except ArchiveLoadFailed as exception:
        sys.stderr.write('Failed to process WARC file {}: {}\n'.format(
            filename, exception))
% python3 wet2text.py crawl-data/CC-MAIN-2017-13/segments/1490218186353.38/wet/CC-MAIN-20170322212946-00000-ip-10-233-31-227.ec2.internal.warc.wet.gz | head -3
~たまに撮れる瞬間を求めて~ SL
~たまに撮れる瞬間を求めて~
塩分控えめ…
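Regarding the download step in the question quoted below: the entries in wet.paths.gz are relative paths, so a full download URL is just the S3 base URL plus the path. A minimal sketch (the function name paths_to_urls is only illustrative):

```python
import gzip

BASE_URL = 'https://commoncrawl.s3.amazonaws.com/'

def paths_to_urls(paths_file):
    """Yield full download URLs for each relative path
    listed in a (gzipped) wet.paths.gz file."""
    with gzip.open(paths_file, 'rt') as f:
        for line in f:
            line = line.strip()
            if line:  # skip blank lines
                yield BASE_URL + line
```

The resulting URLs can then be fetched with any HTTP client (wget, curl, requests) and fed to the script above.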
Best,
Sebastian
[1] https://pypi.python.org/pypi/warcio
[2] https://github.com/commoncrawl/cc-pyspark/blob/master/sparkcc.py
On 06/28/2017 01:06 PM, Gopika Bhardwaj wrote:
> Could someone share Python code for extracting plain-text data from the WET files?
>
> Until now, I manually downloaded a wet.paths.gz document and downloaded the .wet file by appending a
> path from the wet.paths.gz document to https://commoncrawl.s3.amazonaws.com/.
>
> After collecting the data, I intend to apply a topic model, the Latent Dirichlet Allocation on the data.
> I am unable to write a Python script that reads its input from a file.
> Can someone help me with that as well?
>
> Thanks
>