Location of the News Dataset


Marc

Feb 11, 2022, 10:52:08 AM
to Common Crawl
Hi community,

First of all: a big thank you to Sebastian and the whole Common Crawl team for their dedication and great work. Lately I have started to get an overview of the steps needed to read Common Crawl WARC files, and began by using the warcio Python library, which works fine for small local experiments. So as a first test, I downloaded the file


which is given as an example on the CC website, and I was able to read the full HTML content of all 60,288 webpages included in that file. So far so good.

But now I have some comprehension questions:

Where can I find all the other WARC files from the News dataset? The CC website says that the News data is available on AWS S3 in the commoncrawl bucket at /crawl-data/CC-NEWS/ and that it can be accessed in the same way as the WARC files from the Main dataset. But how exactly? I have tried different things, but it seems I am missing something obvious or have misunderstood something. Say I want the first package of the first News dataset from 2021; I would assume the path to be something like:


But how would I know that timestamp (if it even is a timestamp) and the number of files/sub-packages for that day (the range 00000-?????)?

And shouldn't the full package of a News crawl (which then includes all the 00000-????? sub-packages) be accessible via something like:


For the Main dataset, you offer such a list of past crawls at commoncrawl.org/the-data/get-started/, together with links to the download pages.

But for the News dataset?

Confused but thankful greetings

Marc

Sebastian Nagel

Feb 11, 2022, 11:21:13 AM
to common...@googlegroups.com
Hi Marc,

see the instructions on
https://commoncrawl.org/2016/10/news-dataset-available/

In short:

- install the AWS CLI (https://aws.amazon.com/cli/)

- and run
aws --no-sign-request s3 ls \
s3://commoncrawl/crawl-data/CC-NEWS/2021/01/

Note: the option --no-sign-request is required if you haven't set up
an AWS account.

- you could also list all WARC files of a specific day:
aws --no-sign-request s3 ls \
s3://commoncrawl/crawl-data/CC-NEWS/2021/01/CC-NEWS-2021013

- the option --recursive allows listing even an entire year:
aws ... s3 ls --recursive s3://commoncrawl/crawl-data/CC-NEWS/2021/
(or even all files)
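The file names returned by these listings follow the pattern CC-NEWS-<timestamp>-<serial>.warc.gz — a 14-digit capture timestamp (the value Marc asked about) plus a five-digit counter. A small stdlib sketch to pull both parts out of a listed key; the example key below is hypothetical, shaped like the prefix in the command above:

```python
import re
from datetime import datetime

# Pattern assumed from the CC-NEWS prefix shown in the listing command above:
# CC-NEWS-<14-digit capture timestamp>-<5-digit serial>.warc.gz
NAME_RE = re.compile(r'CC-NEWS-(\d{14})-(\d{5})\.warc\.gz$')

def parse_warc_name(key):
    """Return (capture_time, serial) for a CC-NEWS WARC key, else None."""
    m = NAME_RE.search(key)
    if not m:
        return None
    ts, serial = m.groups()
    return datetime.strptime(ts, '%Y%m%d%H%M%S'), int(serial)

# Hypothetical key, as it might appear in the `aws s3 ls` output.
when, serial = parse_warc_name(
    'crawl-data/CC-NEWS/2021/01/CC-NEWS-20210101023445-01259.warc.gz')
print(when.date(), serial)  # 2021-01-01 1259
```

So the timestamp is not something you guess in advance; you list the prefix first and parse the names you get back.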

> And shouldn't the full package of a News crawl (which then includes
> all the 00000-????? sub-packages) be accessible via something like:
>
https://commoncrawl.s3.amazonaws.com/crawl-data/CC-NEWS/CC-NEWS-2021-01-01/warc.paths.gz

The news dataset is continuously growing and is not released as a closed set. So it does not really make sense to provide a frequently changing list of all WARC files. The S3 API is far more flexible. There are also SDKs for a multitude of programming languages, see
https://aws.amazon.com/tools/

Best,
Sebastian

Sebastian Nagel

Mar 24, 2022, 6:27:18 PM
to common...@googlegroups.com
Hi,

a short update on this question:

1. We now provide WARC file listings for the News dataset,
see the updated instructions on
https://commoncrawl.org/2016/10/news-dataset-available/
and
https://data.commoncrawl.org/crawl-data/CC-NEWS/index.html

2. Starting April 4th, users without an AWS account must rely on
the provided listings. Please see

https://commoncrawl.org/2022/03/introducing-cloudfront-access-to-common-crawl-data/
https://commoncrawl.org/access-the-data/
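The listing files mentioned in point 1 are plain gzip-compressed text, one WARC path per line, so the standard library is enough to consume them. A sketch: the fetch helper needs network access and is not invoked here, and the two sample paths compressed below are hypothetical stand-ins for a real warc.paths.gz payload.

```python
import gzip
from urllib.request import urlopen

def fetch_paths(url):
    """Download a warc.paths.gz listing and return its paths (needs network)."""
    with urlopen(url) as resp:
        return parse_paths(resp.read())

def parse_paths(gz_bytes):
    """Decompress a warc.paths.gz payload into a list of WARC paths."""
    return gzip.decompress(gz_bytes).decode('utf-8').splitlines()

# Self-contained demo: compress two hypothetical paths and parse them back.
sample = gzip.compress(
    b'crawl-data/CC-NEWS/2021/01/CC-NEWS-20210101023445-01259.warc.gz\n'
    b'crawl-data/CC-NEWS/2021/01/CC-NEWS-20210101043012-01260.warc.gz\n')
paths = parse_paths(sample)
print(len(paths))  # 2
```

Each returned path can then be appended to the https://data.commoncrawl.org/ base to download the corresponding WARC file over HTTPS.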

Best,
Sebastian

Marc

Apr 10, 2022, 4:45:22 PM
to Common Crawl
Thanks a lot Sebastian.