May 2022 crawl archive now available

Sebastian Nagel

Jun 2, 2022, 6:47:33 AM
Hi all,

the crawl archives of May 2022 are now available.
The data was crawled May 16 – 29 and contains 3.45 billion web
pages or 420 TiB of uncompressed content. It includes page captures
of 1.35 billion new URLs, not visited in any of our prior crawls.

As usual, more details about the crawl and information on how to access
and use the data can be found on the Common Crawl blog [1].



Bpm Tips

Jun 15, 2022, 2:11:28 PM
to Common Crawl
We have downloaded the Common Crawl columnar index locally and are in the process of screenshotting all 45 million websites, to build a front-page-only search engine of the entire web, which displays websites and shows each one's front page/home page. If you need to run particular Spark SQL queries on the columnar index, let us know and we can publicly post the query results in CSV format.

For example, the list of all domains with the number of URLs per domain is available at the following link.

Query used:

// Count URLs per host name, largest domains first.
val sqlDF = sqlContext.sql("SELECT url_host_name AS domain, COUNT(*) AS size FROM urls GROUP BY url_host_name ORDER BY size DESC")
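
To make clear what a per-domain URL count produces, here is a minimal sketch using plain Scala collections rather than Spark. It is only an illustration: the sample URLs, the object name DomainCounts and the helper countByHost are hypothetical, and the host is taken with java.net.URI, which may not match exactly how url_host_name is derived in the columnar index.

```scala
import java.net.URI

object DomainCounts {
  // Hypothetical sample URLs standing in for rows of the columnar index.
  val sampleUrls: Seq[String] = Seq(
    "https://example.com/",
    "https://example.com/about",
    "https://example.org/",
    "https://example.com/contact",
    "https://example.org/news",
    "https://example.net/"
  )

  // Count URLs per host and sort by count, largest first.
  // This is the collections analogue of grouping by host and
  // ordering by the per-host count in descending order.
  def countByHost(urls: Seq[String]): Seq[(String, Int)] =
    urls
      .map(u => new URI(u).getHost)                  // extract the host name
      .groupBy(identity)                             // one group per host
      .map { case (host, hits) => (host, hits.size) }
      .toSeq
      .sortBy { case (host, n) => (-n, host) }       // ties broken by host name

  def main(args: Array[String]): Unit =
    // Print one "domain,size" line per host, CSV style.
    countByHost(sampleUrls).foreach { case (host, n) => println(s"$host,$n") }
}
```

The point of the sketch is that each host appears exactly once in the result, with its own count, which is what a grouped count over the index should return.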
