WET file for news-archive


Spider99

Jul 4, 2017, 4:53:51 AM
to Common Crawl
Hi,

Where can I find WET files/paths for the news archive crawled to date?
Please help with this.

Thanks.

Sebastian Nagel

Jul 4, 2017, 5:21:42 AM
to common...@googlegroups.com
Hi,

could you specify what exactly you are looking for
and provide some examples or references?

Thanks,
Sebastian
> --
> You received this message because you are subscribed to the Google Groups "Common Crawl" group.
> To unsubscribe from this group and stop receiving emails from it, send an email to
> common-crawl...@googlegroups.com <mailto:common-crawl...@googlegroups.com>.
> To post to this group, send email to common...@googlegroups.com
> <mailto:common...@googlegroups.com>.
> Visit this group at https://groups.google.com/group/common-crawl.
> For more options, visit https://groups.google.com/d/optout.

Spider99

Jul 5, 2017, 2:43:01 AM
to Common Crawl
Hi Sebastian,

Actually, I am looking for WET files, i.e. the text version of the news data.

For example, I have all the WARC paths of the news data (e.g. crawl-data/CC-NEWS/2016/08/CC-NEWS-20160826124520-00000.warc.gz). Each path downloads a WARC file with HTML content, but what I actually need are WET files, or paths to WET files, so that I can work with only the text version of the news data.

Hope this clarifies. 

Thanks, 


Sebastian Nagel

Jul 5, 2017, 3:54:01 AM
to common...@googlegroups.com
Hi,

unfortunately, we do not (yet) provide text extracts of the CC-NEWS archives.

But it's easy to run the WET extractor on the WARC files, see:
https://groups.google.com/d/topic/common-crawl/imv4hlLob4s/discussion
https://groups.google.com/d/topic/common-crawl/b6yDG7EmnhM/discussion

That's what you have to do:


# download the WARC files and place them in a directory "warc/"
# create sibling folders wat and wet
# |
# |-- warc/
# | |-- CC-NEWS-20161001224340-00008.warc.gz
# | |-- CC-NEWS-20161017145313-00000.warc.gz
# | `-- ...
# |
# |-- wat/
# |
# `-- wet/

git clone https://github.com/commoncrawl/ia-web-commons
cd ia-web-commons
mvn install

cd ..
git clone https://github.com/commoncrawl/ia-hadoop-tools
cd ia-hadoop-tools
mvn package

java -jar $PWD/target/ia-hadoop-tools-jar-with-dependencies.jar WEATGenerator \
-strictMode -skipExisting batch-id-xyz .../warc/*.warc.gz

The folders wat/ and wet/ will then contain the exports.

Best,
Sebastian
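
For anyone who prefers to script the last step, here is a minimal Python sketch under stated assumptions: it creates the sibling wat/ and wet/ folders next to warc/ and assembles the same java call as an argument list. The function name build_weat_command and the paths passed to it are hypothetical; only the jar arguments (WEATGenerator, -strictMode, -skipExisting, the batch id) come from the commands above.

```python
import glob
import os


def build_weat_command(jar_path, batch_id, warc_dir):
    """Assemble the WEATGenerator call from the post as an argument list."""
    # The post asks for sibling folders wat/ and wet/ next to warc/.
    parent = os.path.dirname(os.path.abspath(warc_dir))
    for name in ("wat", "wet"):
        os.makedirs(os.path.join(parent, name), exist_ok=True)

    # Expand *.warc.gz ourselves, since no shell does it for us here.
    warc_files = sorted(glob.glob(os.path.join(warc_dir, "*.warc.gz")))
    if not warc_files:
        raise FileNotFoundError(f"no *.warc.gz files in {warc_dir}")

    return ["java", "-jar", jar_path, "WEATGenerator",
            "-strictMode", "-skipExisting", batch_id, *warc_files]
```

The returned list can be handed to subprocess.run(cmd, check=True) once the jar has been built with the Maven steps above.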


Spider99

Jul 28, 2017, 2:27:38 AM
to Common Crawl

Thanks Sebastian

olaf.c...@gmail.com

May 3, 2018, 8:55:14 PM
to Common Crawl
Hello, are the WET files still not available? I am afraid I do not know how to run the WET extractor on my local machine...
Many thanks for your help

O

Sebastian Nagel

May 7, 2018, 9:18:40 AM
to Common Crawl
No, there are no WET files for the news data.

With the Java 8 SDK, Git and Maven installed, it takes only about 10 commands
on the command line to produce the WET files.

Best,
Sebastian

olaf.c...@gmail.com

May 7, 2018, 9:23:47 AM
to Common Crawl
Thank you, Sebastian, but I am afraid that a newbie like me does not know how to do that... Is there a tutorial somewhere? Can't I do that with a regular Python or R program?
Thanks again
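
On the Python question, here is a rough sketch using only the standard library. It is not the WET extractor from ia-hadoop-tools and will not reproduce its output. It assumes the HTTP payloads in the WARC file are uncompressed, UTF-8 compatible HTML, and all names in it (TextExtractor, read_records, extract_texts) are made up for illustration.

```python
import gzip
from html.parser import HTMLParser


class TextExtractor(HTMLParser):
    """Collect visible text, skipping <script> and <style> content."""

    def __init__(self):
        super().__init__()
        self.parts = []
        self._skip = 0

    def handle_starttag(self, tag, attrs):
        if tag in ("script", "style"):
            self._skip += 1

    def handle_endtag(self, tag):
        if tag in ("script", "style") and self._skip:
            self._skip -= 1

    def handle_data(self, data):
        if not self._skip and data.strip():
            self.parts.append(data.strip())


def read_records(stream):
    """Yield (headers, body) for each WARC record in a binary stream."""
    while True:
        line = stream.readline()
        if not line:
            return
        if not line.startswith(b"WARC/"):
            continue  # blank separator lines between records
        headers = {}
        for raw in iter(stream.readline, b""):
            if raw in (b"\r\n", b"\n"):
                break
            name, _, value = raw.decode("utf-8", "replace").partition(":")
            headers[name.strip()] = value.strip()
        # The record block is exactly Content-Length bytes long.
        body = stream.read(int(headers.get("Content-Length", 0)))
        yield headers, body


def html_to_text(payload):
    """Drop the HTTP header block, then extract text from the HTML body."""
    _, _, html = payload.partition(b"\r\n\r\n")
    parser = TextExtractor()
    parser.feed(html.decode("utf-8", "replace"))
    parser.close()
    return "\n".join(parser.parts)


def extract_texts(path):
    """Yield (url, text) for every response record in a .warc.gz file."""
    with gzip.open(path, "rb") as stream:
        for headers, body in read_records(stream):
            if headers.get("WARC-Type") == "response":
                yield headers.get("WARC-Target-URI"), html_to_text(body)
```

Calling extract_texts("CC-NEWS-20161017145313-00000.warc.gz") on a downloaded file would then yield one (URL, text) pair per captured page. Boilerplate removal and charset detection are left out on purpose.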

Gordon V. Cormack

Dec 12, 2019, 2:07:26 PM
to Common Crawl
It seems that, as of several months ago, this software no longer works for the news crawl.  It crashes on "request" records that were introduced to the WARC files.

Thoughts on a workaround?

thanks,
Gordon

Sebastian Nagel

Dec 12, 2019, 2:55:36 PM
to common...@googlegroups.com
Hi Gordon,

ok, I see the HTTP request message in the CC-NEWS WARC files lacks the HTTP version:
GET /
instead of
GET / HTTP/1.1

Well, that's a bug:
- I'll fix it in the WARC module of StormCrawler
- a work-around / quick fix is available for ia-web-commons:
just upgrade to the latest master or see [1]
- eventually the WARC files will get fixed

Thanks,
Sebastian

[1] https://github.com/commoncrawl/ia-web-commons/commit/428022bc9a2638337ecb9f08fea5f89d155e2443
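
To illustrate the malformed request line described above, here is a small Python sketch. It is not what the commit in [1] does. The helper names and the default version string are assumptions for illustration only.

```python
import re

# An HTTP/1.x request line: method, request target, protocol version.
REQUEST_LINE = re.compile(r"^[A-Z]+ \S+ HTTP/\d\.\d$")


def has_http_version(request_line):
    """Return True if the request line ends with an HTTP version token."""
    return REQUEST_LINE.match(request_line.strip()) is not None


def add_missing_version(request_line, version="HTTP/1.1"):
    """Append a version to a line such as 'GET /' (the default is assumed)."""
    line = request_line.strip()
    return line if has_http_version(line) else f"{line} {version}"
```

A check like has_http_version("GET /") returns False for the affected records, which is the case a strict parser trips over.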


Gordon V. Cormack

Dec 12, 2019, 3:04:23 PM
to common...@googlegroups.com
thanks!


Alex Xue

Jun 9, 2020, 5:00:23 PM
to Common Crawl
Hi. I've followed all of the steps listed, but when I run

java -jar $PWD/target/ia-hadoop-tools-jar-with-dependencies.jar WEATGenerator \
   -strictMode -skipExisting batch-id-xyz ../warc/*.warc.gz

I get the error: java.io.IOException: java.io.IOException: No Envelope.WARC-Header-Metadata.WARC-Filename found.

I am performing this with this WARC file: CC-NEWS-20160826124520-00000.warc.gz .

I have also tried using [1] as you stated (taking out the lines in httprequestparser.jar).

Any help is appreciated, thanks!

Sebastian Nagel

Jun 9, 2020, 5:46:03 PM
to common...@googlegroups.com
Hi Alex,

the WARC files written by the news crawler during the first weeks (61 WARC files)
lack the header "WARC-Filename" in the warcinfo record (the first record).
That's causing the error when generating the WET files.

I'll try to fix the WAT/WET generator tomorrow. Fixing the WARC files
might take longer.

The first CC-NEWS WARC file with a complete warcinfo record dates
to Oct 2016:
s3://commoncrawl/crawl-data/CC-NEWS/2016/10/CC-NEWS-20161017145313-00000.warc.gz

Maybe you can start for now from this file? I'll let you know when the WET generator
is fixed.

Thanks for the notice and your patience!

Best,
Sebastian
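
To find out whether a given file is one of the affected ones before running the generator, a check along these lines may help. It is a sketch that reads only the first record's headers with the standard library. The function name warcinfo_has_filename is hypothetical.

```python
import gzip


def warcinfo_has_filename(path):
    """Check that the first record is a warcinfo record with WARC-Filename."""
    headers = {}
    with gzip.open(path, "rb") as stream:
        first = stream.readline()
        if not first.startswith(b"WARC/"):
            return False  # not a WARC record at all
        # Read header lines up to the blank line that ends the header block.
        for raw in iter(stream.readline, b""):
            if raw in (b"\r\n", b"\n"):
                break
            name, _, value = raw.decode("utf-8", "replace").partition(":")
            headers[name.strip().lower()] = value.strip()
    return headers.get("warc-type") == "warcinfo" and "warc-filename" in headers
```

Files for which this returns False are candidates for the "No Envelope.WARC-Header-Metadata.WARC-Filename found" error quoted above.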

Alex Xue

Jun 9, 2020, 6:25:44 PM
to Common Crawl
Hi Sebastian,

Thank you for the quick response! 

I'm extracting the latest CC-NEWS crawl right now as per your recommendation, and it seems to be running fine, as you said. One more thing, if I may: while I was looking into this "WARC-Filename" field on the getting started page (https://commoncrawl.org/the-data/get-started/), I noticed that it is also missing from the "full WARC extract" file (https://gist.github.com/Smerity/e750f0ef0ab9aa366558#file-bbc-warc). I've just begun my research into Common Crawl / News Crawl, so I apologize if I'm not referring to the right file, but shouldn't that sample extract also include "WARC-Filename" (as long as we are referring to the most recent crawl)? It might be worth updating it in case someone runs into a similar issue, where the fields differ slightly depending on how old the crawl is.

Thanks again for the quick response, and keep me posted if you manage to get that WAT/WET generator fix going.

On a side note, I'd love to learn how to contribute to this repo if possible. I'll be sending you a message from my personal email!

Alex

Sebastian Nagel

Jun 11, 2020, 10:25:45 AM
to common...@googlegroups.com
Hi Alex,

please see
https://github.com/commoncrawl/ia-web-commons/issues/23
and for a fix
https://github.com/commoncrawl/ia-web-commons/pull/24

I'll merge the fix soon, but I'm waiting for feedback on the corresponding
upstream issue
https://github.com/iipc/webarchive-commons/issues/88

Let me know if you need help applying the fix to your installation.

Best
Sebastian

Miguel Arana

Mar 15, 2021, 5:34:36 PM
to Common Crawl
Hi,

I found the following problem in the installation. I'm sharing it here in case anybody has had it before and has more information:

Thanks!

Miguel Arana Catania

Mar 16, 2021, 8:00:42 AM
to common...@googlegroups.com
And it was fixed extremely fast! :)

Sharing here in case anybody needs the fix:
https://github.com/iipc/webarchive-commons/pull/91
