Location of the News Dataset



Feb 11, 2022, 10:52:08 AM
to Common Crawl
Hi community,

First of all: a big thank you to Sebastian and the whole Common Crawl team for their dedication and great work. I have recently started to get an overview of the steps needed to read Common Crawl WARC files, and began with the warcio Python library, which works fine for small local experiments. So as a first test, I downloaded the file

which is given as an example on the CC website, and I was able to read the full HTML content of all 60,288 web pages included in that file. So far, so good.
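For anyone following along, here is a minimal warcio sketch of the kind of loop I used (the file name in the usage comment is a placeholder; the record-type check and header name follow warcio's documented API):

```python
def iter_html_pages(warc_path):
    """Yield (url, raw_html_bytes) for each HTTP response record in a WARC file."""
    # warcio is a third-party library: pip install warcio
    from warcio.archiveiterator import ArchiveIterator

    with open(warc_path, "rb") as stream:
        # ArchiveIterator transparently handles .warc.gz compression.
        for record in ArchiveIterator(stream):
            # Only 'response' records carry the fetched page content.
            if record.rec_type == "response":
                url = record.rec_headers.get_header("WARC-Target-URI")
                yield url, record.content_stream().read()

# Usage (with a locally downloaded WARC file, name is a placeholder):
# for url, html in iter_html_pages("CC-NEWS-20210101023455-00123.warc.gz"):
#     print(url, len(html))
```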

But now I have some comprehension questions:

Where can I find all the other WARC files from the News dataset? The CC website says that the News data is available on AWS S3 in the commoncrawl bucket at /crawl-data/CC-NEWS/ and that it can be accessed in the same way as the WARC files from the Main dataset. But how exactly? I tried different things, but it seems I am missing something obvious here, or I have misunderstood something. Say I want the first package of the first News dataset from 2021; I would assume the path to be something like:

But how would I know that timestamp (if it even is a timestamp) and the number of files/sub-packages for that day (the range 00000-?????)?

And shouldn't the full package of a News crawl (which would then include all the 00000-????? sub-packages) be accessible via something like:

For the Main dataset, you offer such a list of past crawls on commoncrawl.org/the-data/get-started/, together with links to the download pages.

But for the News dataset?

Confused but thankful greetings


Sebastian Nagel

Feb 11, 2022, 11:21:13 AM
to common...@googlegroups.com
Hi Marc,

see the instructions on

In short:

- install the AWS CLI (https://aws.amazon.com/cli/)

- and run
aws --no-sign-request s3 ls s3://commoncrawl/crawl-data/CC-NEWS/

Note: the --no-sign-request option is required if you haven't set up
an AWS account.

- you could also list all WARC files of a specific day:
aws --no-sign-request s3 ls s3://commoncrawl/crawl-data/CC-NEWS/2021/01/CC-NEWS-20210101

- the --recursive option lets you list even an entire year:
aws --no-sign-request s3 ls --recursive s3://commoncrawl/crawl-data/CC-NEWS/2021/
(or even all files)
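To the earlier question about the timestamp and the 00000-????? range: the file names returned by such listings follow the pattern CC-NEWS-YYYYMMDDHHMMSS-NNNNN.warc.gz, a fetch timestamp plus a serial number (my reading of the naming scheme, not an official specification). A small stdlib-only sketch that picks a key apart:

```python
import re
from datetime import datetime

# Assumed naming scheme for CC-NEWS WARC keys, e.g.
#   crawl-data/CC-NEWS/2021/01/CC-NEWS-20210101023455-00123.warc.gz
WARC_NAME = re.compile(r"CC-NEWS-(\d{14})-(\d{5})\.warc\.gz$")

def parse_warc_key(key):
    """Return (timestamp, serial) parsed from a CC-NEWS WARC key, or None."""
    m = WARC_NAME.search(key)
    if not m:
        return None
    ts = datetime.strptime(m.group(1), "%Y%m%d%H%M%S")
    return ts, int(m.group(2))

# Example (hypothetical key):
# parse_warc_key("crawl-data/CC-NEWS/2021/01/CC-NEWS-20210101023455-00123.warc.gz")
# -> (datetime(2021, 1, 1, 2, 34, 55), 123)
```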

> And shouldn't the full package of a News crawl (which then includes
> all the 00000-????? sub-packages) be accessible via something like:

The News dataset is continuously growing and is not released as a closed
set, so it does not really make sense to provide a frequently changing
list of all WARC files. The S3 API is far more flexible. There are
also SDKs for a multitude of programming languages, see

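As one example of the SDK route, here is a Python sketch using boto3 (a third-party AWS SDK; the bucket and prefix come from the CLI commands above, and the unsigned-client configuration is boto3's standard equivalent of --no-sign-request):

```python
def list_cc_news_keys(prefix="crawl-data/CC-NEWS/2021/01/"):
    """Yield the S3 keys of CC-NEWS WARC files under the given prefix."""
    # boto3/botocore are third-party: pip install boto3
    import boto3
    from botocore import UNSIGNED
    from botocore.config import Config

    # Unsigned requests: no AWS account or credentials needed.
    s3 = boto3.client("s3", config=Config(signature_version=UNSIGNED))
    # list_objects_v2 returns at most 1000 keys per call; the
    # paginator handles the continuation tokens for us.
    paginator = s3.get_paginator("list_objects_v2")
    for page in paginator.paginate(Bucket="commoncrawl", Prefix=prefix):
        for obj in page.get("Contents", []):
            yield obj["Key"]

# Usage:
# for key in list_cc_news_keys():
#     print(key)
```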

Sebastian Nagel

Mar 24, 2022, 6:27:18 PM
to common...@googlegroups.com

a short update on this question:

1. We now provide WARC file listings for the News dataset;
see the updated instructions on

2. Starting April 4, users without an AWS account must rely on
the provided listings. Please see

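For readers going the no-account route: keys from the listings are typically fetched over plain HTTPS by prepending the public download host (data.commoncrawl.org is my assumption based on current Common Crawl documentation, not something stated in this thread):

```python
# Assumed public HTTPS endpoint; not stated in this thread.
CC_HTTPS_HOST = "https://data.commoncrawl.org"

def warc_url(key):
    """Turn an S3 key from a CC-NEWS listing into a plain-HTTPS download URL."""
    return f"{CC_HTTPS_HOST}/{key.lstrip('/')}"

# Example (hypothetical key):
# warc_url("crawl-data/CC-NEWS/2021/01/CC-NEWS-20210101023455-00123.warc.gz")
# -> "https://data.commoncrawl.org/crawl-data/CC-NEWS/2021/01/CC-NEWS-20210101023455-00123.warc.gz"
```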



Apr 10, 2022, 4:45:22 PM
to Common Crawl
Thanks a lot Sebastian.