Hi Julius,
> This returned only URLs to blogs, and some of the pages could not be
> found/were no longer accessible.
Did you browse through all result pages?
If I run the query and extract the host names from all sampled URLs,
there is a broad variety of host names. It is true that the first result
page shown in the Athena console includes only sites from blogspot.com,
but this is a coincidence - sampling does not necessarily imply
shuffling.
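
For reference, a minimal sketch of how the host names could be tallied once the result is downloaded. It assumes the URLs are available as plain strings; the `sample_urls` list and the `count_hosts` helper are made up for illustration and are not part of the query.

```python
from collections import Counter
from urllib.parse import urlsplit


def count_hosts(urls):
    """Count how often each host name occurs in an iterable of URLs."""
    hosts = (urlsplit(u).hostname for u in urls)
    # Skip entries without a parsable host name (hostname is None).
    return Counter(h for h in hosts if h)


# Hypothetical sample; in practice these would be the URLs
# taken from the downloaded Athena result.
sample_urls = [
    "https://example.blogspot.com/2020/08/post.html",
    "https://example.blogspot.com/2020/07/other.html",
    "http://www.example.org/about",
]

host_counts = count_hosts(sample_urls)
for host, n in host_counts.most_common():
    print(host, n)
```

Counting over the whole result, rather than looking at the first page, shows whether one host really dominates the sample.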
Note: you may want to adjust the sampling rate - 0.5% of 3 billion
captures (successful fetches *and* 404s, redirects, etc.) still amounts
to 15 million captures.
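
The arithmetic behind that number, and the reverse calculation for picking a rate, as a small sketch. The 3 billion total is the rough figure from above; the target of 100,000 rows and both function names are only examples.

```python
def expected_sample_size(total_captures, rate_percent):
    """Expected number of rows when sampling rate_percent of all captures."""
    return round(total_captures * rate_percent / 100)


def rate_for_target(total_captures, target_rows):
    """Sampling rate (in percent) that yields roughly target_rows rows."""
    return 100 * target_rows / total_captures


TOTAL_CAPTURES = 3_000_000_000  # rough size of one crawl, as stated above

# 0.5% of 3 billion captures
print(expected_sample_size(TOTAL_CAPTURES, 0.5))

# Rate needed for a hypothetical target of about 100,000 rows
print(rate_for_target(TOTAL_CAPTURES, 100_000))
```

Since the sample is random, the actual row count will vary somewhat around the expected value.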
> The latter I assume is because this crawl is two years old.
Yes, of course. Just replace 'CC-MAIN-2020-34' with 'CC-MAIN-2022-27'
or the newest crawl. For a list of crawls, see
https://data.commoncrawl.org/crawl-data/index.html
or run the Athena query
SHOW PARTITIONS ccindex;
This will show all available partitions as <crawl,subset> combinations.
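
If you save the output of that query, the newest crawl can be picked out with a few lines. This is only a sketch: it assumes each row has the usual Hive-style form `crawl=.../subset=...`, and that crawl names follow the zero-padded CC-MAIN-YYYY-WW pattern so that plain string comparison is chronological. The sample rows and helper names are made up for illustration.

```python
def parse_partition(line):
    """Split 'crawl=.../subset=...' into a (crawl, subset) tuple."""
    fields = dict(part.split("=", 1) for part in line.strip().split("/"))
    return fields["crawl"], fields["subset"]


def newest_crawl(lines):
    """Return the newest crawl name found in the partition rows."""
    crawls = {parse_partition(line)[0] for line in lines if line.strip()}
    # Assumption: CC-MAIN-YYYY-WW names sort chronologically as strings.
    return max(crawls)


# Hypothetical rows in the assumed output format
partition_rows = [
    "crawl=CC-MAIN-2020-34/subset=warc",
    "crawl=CC-MAIN-2022-27/subset=robotstxt",
    "crawl=CC-MAIN-2022-27/subset=warc",
]

print(newest_crawl(partition_rows))
```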
Best,
Sebastian