News Dataset Available


Sebastian Nagel

Oct 5, 2016, 3:43:56 AM
to common...@googlegroups.com
Hi all,

We've released a new dataset containing news articles from news sites all over the world.

More details and instructions on how to access the dataset can be found on our blog at
http://commoncrawl.org/2016/10/news-dataset-available/

We are releasing the dataset at an early stage:

- there are still a couple of things to do; see https://github.com/commoncrawl/news-crawl/issues

- as of today we crawl news articles from over 1,000 news sites: about 50,000 articles per day,
or 1 GB of compressed content in WARC files. We hope to increase the coverage soon.
A quick fetch example is shown after this list.

- seeds are currently mostly RSS feeds mined from dmoz.org,
see https://github.com/commoncrawl/news-crawl/issues/8
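
For a quick impression of how to fetch a single WARC file over HTTP, here is a small
sketch (Python 2, matching the examples below; the bucket URL and path layout are
illustrative, the blog post has the authoritative file listing):

import urllib

# Illustrative path: news WARC files are grouped by year and month
# under crawl-data/CC-NEWS/ in the public commoncrawl bucket.
base = "https://commoncrawl.s3.amazonaws.com/"
path = "crawl-data/CC-NEWS/2016/10/CC-NEWS-20161001224340-00008.warc.gz"
urllib.urlretrieve(base + path, "CC-NEWS-20161001224340-00008.warc.gz")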

The source code of the crawler is open, including scripts to run it in a container or on AWS.
Feel free to use it, or adapt it to build a focused crawler for your domain of interest.
Help us improve it by reporting issues, providing bug fixes, or sharing improvements. We
appreciate it!

Our special thanks go to Julien Nioche, who had the initial idea for the news crawl project and
volunteered to support the initial crawler setup and testing.


Best,
Sebastian

Spider99

Oct 13, 2016, 6:27:09 AM
to Common Crawl
Hi Sebastian,
Thanks for the news WARC files. When I try to extract the HTML from the WARC files using the
Python warc package, I get the error below. Can you help me with this? Thanks.

IOError: Expected '\r\n', found 'WARC/1.0\r\n'

Sebastian Nagel

Oct 13, 2016, 7:06:23 AM
to common...@googlegroups.com
Hi,

The news crawler writes its WARC files in a different way, so it's possible that something is
wrong or, at least, not 100% correct. Could you share more details: which package/module and
which version you are using, which WARC file causes the problem, etc.? If possible, please open
an issue at
https://github.com/commoncrawl/news-crawl/issues
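
In the meantime, a possible workaround (a rough sketch, not tested against these exact files):
the warcio library is more tolerant about record framing than the warc package and may cope
with these files:

from warcio.archiveiterator import ArchiveIterator

with open('CC-NEWS-20161001224340-00008.warc.gz', 'rb') as stream:
    for record in ArchiveIterator(stream):
        # only HTTP response records carry the page content
        if record.rec_type == 'response':
            content_type = record.http_headers.get_header('Content-Type') or ''
            if 'text/html' in content_type:
                body = record.content_stream().read()  # raw HTML bytes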

Thanks,
Sebastian


Spider99

Oct 13, 2016, 7:49:37 AM
to Common Crawl
Thanks Sebastian,

Basically, I am trying to iterate over the records of a news WARC file to get the HTML content
and process it. I am using the Python warc package. Snippet to read a WARC file:
import warc

f = warc.open("CC-NEWS-20161001224340-00008.warc")
for record in f:
    if record['Content-Type'] == 'application/http; msgtype=response':
        payload = record.payload.read()
        headers, body = payload.split('\r\n\r\n', 1)
        if 'Content-Type: text/html' in headers:
            pass  # do my processing with the HTML content (body)



But when I run this I get this error:
Traceback (most recent call last):
  File "warc_process.py", line 69, in <module>
    read_entire_warc("CC-NEWS-20160926211809-00000.warc")
  File "warc_process.py", line 54, in read_entire_warc
    for record in f:
  File "anaconda2/lib/python2.7/site-packages/warc/warc.py", line 393, in __iter__
    record = self.read_record()
  File "anaconda2/lib/python2.7/site-packages/warc/warc.py", line 364, in read_record
    self.finish_reading_current_record()
  File "anaconda2/lib/python2.7/site-packages/warc/warc.py", line 360, in finish_reading_current_record
    self.expect(self.current_payload.fileobj, "\r\n")
  File "anaconda2/lib/python2.7/site-packages/warc/warc.py", line 352, in expect
    raise IOError(message)
IOError: Expected '\r\n', found 'WARC/1.0\r\n'



Sample WARC files I am facing issues with:
CC-NEWS-20160926211809-00000.warc
CC-NEWS-20161001122244-00007.warc.gz
CC-NEWS-20161001224340-00008.warc.gz
CC-NEWS-20161002224346-00009.warc.gz
CC-NEWS-20161003130443-00010.warc.gz
CC-NEWS-20161004130444-00011.warc.gz
CC-NEWS-20161005130450-00012.warc.gz
CC-NEWS-20161005152607-00013.warc.gz
CC-NEWS-20161006152607-00014.warc.gz
CC-NEWS-20161006191324-00015.warc.gz
CC-NEWS-20161007191326-00016.warc.gz
CC-NEWS-20161008015559-00017.warc.gz
CC-NEWS-20161009015614-00018.warc.gz
CC-NEWS-20161010001731-00019.warc.gz
