Location of the News Dataset



Feb 11, 2022, 10:52:08 AM
to Common Crawl
Hi community,

First of all: a big thank you to Sebastian and the whole Common Crawl team for their dedication and great work. I have recently started to get an overview of the steps needed to read Common Crawl WARC files, and began with the warcio Python library, which works fine for small local experiments. So as a first test, I downloaded the file

which is given as an example on the CC website, and I was able to read the full HTML content of all 60,288 web pages included in that file. So far, so good.
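For anyone following along, here is a minimal warcio sketch of the kind of loop I used (the file name in the usage comment is a placeholder; the record-type check and header name follow warcio's documented API):

```python
def iter_html_pages(warc_path):
    """Yield (url, raw_html_bytes) for each HTTP response record in a WARC file."""
    # warcio is a third-party library: pip install warcio
    from warcio.archiveiterator import ArchiveIterator

    with open(warc_path, "rb") as stream:
        # ArchiveIterator transparently handles .warc.gz compression.
        for record in ArchiveIterator(stream):
            # Only 'response' records carry the fetched page content.
            if record.rec_type == "response":
                url = record.rec_headers.get_header("WARC-Target-URI")
                yield url, record.content_stream().read()

# Usage (with a locally downloaded WARC file, name is a placeholder):
# for url, html in iter_html_pages("CC-NEWS-20210101023455-00123.warc.gz"):
#     print(url, len(html))
```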

But now I have some comprehension questions:

Where can I find all the other WARC files from the News dataset? The CC website says that the News data is available on AWS S3 in the commoncrawl bucket at /crawl-data/CC-NEWS/ and that it can be accessed in the same way as the WARC files from the Main dataset. But how exactly? I tried different things, but it seems I am missing something obvious here, or I have misunderstood something. Say I want the first package of the first News dataset from 2021; I would assume the path to be something like:

But how would I know that timestamp (if it even is a timestamp) and the number of files/sub-packages for that day (the range 00000-?????)?

And shouldn't the full package of a News crawl (which would then include all the 00000-????? sub-packages) be accessible via something like:

For the Main dataset, you offer such a list of past crawls on commoncrawl.org/the-data/get-started/, together with links to the download pages.

But for the News dataset?

Confused but thankful greetings


Sebastian Nagel

Feb 11, 2022, 11:21:13 AM
to common...@googlegroups.com
Hi Marc,

see the instructions on

In short:

- install the AWS CLI (https://aws.amazon.com/cli/)

- and run
aws --no-sign-request s3 ls s3://commoncrawl/crawl-data/CC-NEWS/

Note: the --no-sign-request option is required if you haven't set up
an AWS account.

- you could also list all WARC files of a specific day:
aws --no-sign-request s3 ls s3://commoncrawl/crawl-data/CC-NEWS/2021/01/CC-NEWS-20210101

- the --recursive option lets you list even an entire year:
aws --no-sign-request s3 ls --recursive s3://commoncrawl/crawl-data/CC-NEWS/2021/
(or even all files)
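To the earlier question about the timestamp and the 00000-????? range: the file names returned by such listings follow the pattern CC-NEWS-YYYYMMDDHHMMSS-NNNNN.warc.gz, a fetch timestamp plus a serial number (my reading of the naming scheme, not an official specification). A small stdlib-only sketch that picks a key apart:

```python
import re
from datetime import datetime

# Assumed naming scheme for CC-NEWS WARC keys, e.g.
#   crawl-data/CC-NEWS/2021/01/CC-NEWS-20210101023455-00123.warc.gz
WARC_NAME = re.compile(r"CC-NEWS-(\d{14})-(\d{5})\.warc\.gz$")

def parse_warc_key(key):
    """Return (timestamp, serial) parsed from a CC-NEWS WARC key, or None."""
    m = WARC_NAME.search(key)
    if not m:
        return None
    ts = datetime.strptime(m.group(1), "%Y%m%d%H%M%S")
    return ts, int(m.group(2))

# Example (hypothetical key):
# parse_warc_key("crawl-data/CC-NEWS/2021/01/CC-NEWS-20210101023455-00123.warc.gz")
# -> (datetime(2021, 1, 1, 2, 34, 55), 123)
```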

> And shouldn't the full package of a News crawl (which then includes
> all the 00000-????? sub-packages) be accessible via something like:

The News dataset is continuously growing and is not released as a closed
set, so it does not really make sense to provide a frequently changing
list of all WARC files. The S3 API is far more flexible. There are
also SDKs for a multitude of programming languages, see

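As one example of the SDK route, here is a Python sketch using boto3 (a third-party AWS SDK; the bucket and prefix come from the CLI commands above, and the unsigned-client configuration is boto3's standard equivalent of --no-sign-request):

```python
def list_cc_news_keys(prefix="crawl-data/CC-NEWS/2021/01/"):
    """Yield the S3 keys of CC-NEWS WARC files under the given prefix."""
    # boto3/botocore are third-party: pip install boto3
    import boto3
    from botocore import UNSIGNED
    from botocore.config import Config

    # Unsigned requests: no AWS account or credentials needed.
    s3 = boto3.client("s3", config=Config(signature_version=UNSIGNED))
    # list_objects_v2 returns at most 1000 keys per call; the
    # paginator handles the continuation tokens for us.
    paginator = s3.get_paginator("list_objects_v2")
    for page in paginator.paginate(Bucket="commoncrawl", Prefix=prefix):
        for obj in page.get("Contents", []):
            yield obj["Key"]

# Usage:
# for key in list_cc_news_keys():
#     print(key)
```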

Sebastian Nagel

Mar 24, 2022, 6:27:18 PM
to common...@googlegroups.com

a short update on this question:

1. We now provide WARC file listings for the News dataset;
see the updated instructions on

2. Starting April 4, users without an AWS account must rely on
the provided listings. Please see

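For readers going the no-account route: keys from the listings are typically fetched over plain HTTPS by prepending the public download host (data.commoncrawl.org is my assumption based on current Common Crawl documentation, not something stated in this thread):

```python
# Assumed public HTTPS endpoint; not stated in this thread.
CC_HTTPS_HOST = "https://data.commoncrawl.org"

def warc_url(key):
    """Turn an S3 key from a CC-NEWS listing into a plain-HTTPS download URL."""
    return f"{CC_HTTPS_HOST}/{key.lstrip('/')}"

# Example (hypothetical key):
# warc_url("crawl-data/CC-NEWS/2021/01/CC-NEWS-20210101023455-00123.warc.gz")
# -> "https://data.commoncrawl.org/crawl-data/CC-NEWS/2021/01/CC-NEWS-20210101023455-00123.warc.gz"
```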



Apr 10, 2022, 4:45:22 PM
to Common Crawl
Thanks a lot Sebastian.