Unfortunately, we do not (yet) provide text extracts of the CC-NEWS archives.
But it's easy to run the WET extractor on the WARC files, see:
Here is what you have to do:
# download the WARC files and place them in a directory "warc/"
# create sibling folders wat and wet
# |-- warc/
# | |-- CC-NEWS-20161001224340-00008.warc.gz
# | |-- CC-NEWS-20161017145313-00000.warc.gz
# | `-- ...
# |-- wat/
# `-- wet/
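git clone https://github.com/commoncrawl/ia-web-commons
# assumption: both projects build with Maven; install the dependency first
(cd ia-web-commons && mvn install)
git clone https://github.com/commoncrawl/ia-hadoop-tools
# build the tools jar, then run the extractor from inside the checkout
cd ia-hadoop-tools && mvn package
java -jar $PWD/target/ia-hadoop-tools-jar-with-dependencies.jar WEATGenerator \
  -strictMode -skipExisting batch-id-xyz .../warc/*.warc.gz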
The folders wat/ and wet/ will then contain the exports.
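To check the result, something like the following might help. It is only a minimal sketch: it assumes the generated files in wet/ end in ".warc.wet.gz" and that each extracted text record carries the header "WARC-Type: conversion". The function name count_wet_records is made up for this example.

```shell
#!/bin/sh
# Sketch: count the text records in each WET file of a directory.
# Assumptions (not confirmed above): WET files are named *.warc.wet.gz
# and every plain-text record has the header "WARC-Type: conversion".
count_wet_records() {
    dir="${1:-wet}"                      # directory to scan, defaults to wet/
    for f in "$dir"/*.warc.wet.gz; do
        [ -e "$f" ] || continue          # skip if the glob matched nothing
        # gzip -dc also handles the concatenated gzip members used by WARC
        n=$(gzip -dc "$f" | grep -c '^WARC-Type: conversion')
        printf '%s\t%s\n' "$n" "$f"      # prints: <record count> TAB <file>
    done
}
# Usage: count_wet_records wet
```

A count of zero for a file would suggest that the extraction did not produce any text records for the corresponding WARC file.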
On 07/05/2017 08:43 AM, Spider99 wrote:
> Hi Sebastian,
> Actually, I am looking for WET files, i.e. the text version of the news data.
> For example: I have all the WARC paths of the news data
> (crawl-data/CC-NEWS/2016/08/CC-NEWS-20160826124520-00000.warc.gz), but each path downloads a WARC file
> with HTML content. What I actually need are WET files, or paths to WET files, so that I can work
> with only the text version of the news data.
> Hope this clarifies.
> On Tuesday, July 4, 2017 at 2:51:42 PM UTC+5:30, Sebastian Nagel wrote:
> could you specify what exactly you are looking for
> and provide some examples or references?
> On 07/04/2017 10:53 AM, Spider99 wrote:
> > Hi,
> > Where can I find WET files/paths for the news archive crawled to date?
> > Please help on this.
> > Thanks.
> > --
> > You received this message because you are subscribed to the Google Groups "Common Crawl" group.
> > To unsubscribe from this group and stop receiving emails from it, send an email to
> > common-crawl...@googlegroups.com
> > To post to this group, send email to common...@googlegroups.com
> > For more options, visit https://groups.google.com/d/optout