May 2022 crawl archive now available

Sebastian Nagel

Jun 2, 2022, 6:47:33 AM

to common...@googlegroups.com

Hi all,

the crawl archives of May 2022 are now available.
The data was crawled May 16 – 29 and contains 3.45 billion web
pages or 420 TiB of uncompressed content. It includes page captures
of 1.35 billion new URLs, not visited in any of our prior crawls.

As usual, more details about the crawl and information on how to access
and use the data can be found on the Common Crawl blog [1].

Best,
Sebastian

[1] https://commoncrawl.org/2022/06/may-2022-crawl-archive-now-available/

Bpm Tips

Jun 15, 2022, 2:11:28 PM

to Common Crawl

We have downloaded the Common Crawl columnar index locally and are in the process of screenshotting all 45 million websites to build a front-page-only search engine for the entire web, available at https://front-page.com, which displays each website's front page (home page). If you need to run particular Spark SQL queries on the columnar index, let us know and we can publicly post the query results in CSV format.

For example, the list of all domains with the number of URLs for each is available at the link below.

Query used:

// Count rows per host name; GROUP BY is required because count(*) is an aggregate.
val sqlDF = sqlContext.sql("SELECT url_host_name AS domain, count(*) AS size FROM urls GROUP BY url_host_name ORDER BY size DESC")

https://4.ipv6.systems/data/domainsize.csv/part-00000-48a10a18-3491-4935-89d3-18801c18530e-c000.csv
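
For anyone who wants to try the per-host count on a small scale first, here is a minimal sketch, assuming Spark is on the classpath. The object name `DomainSizeSketch`, the helper `domainSizes`, and the six example host names are made up for illustration; the tiny in-memory table only stands in for the real columnar index, which would have to be registered as a view named `urls` instead. The extra `domain ASC` sort key is my addition so that ties come out in a stable order.

```scala
import org.apache.spark.sql.SparkSession

object DomainSizeSketch {
  // Count rows per host name, largest first. Assumes a view named `urls`
  // with a `url_host_name` column; ties are broken alphabetically.
  val query: String =
    """SELECT url_host_name AS domain, count(*) AS size
      |FROM urls
      |GROUP BY url_host_name
      |ORDER BY size DESC, domain ASC""".stripMargin

  // Run the query and collect the (domain, size) pairs to the driver.
  // Only sensible for small results; write large results to CSV instead.
  def domainSizes(spark: SparkSession): Seq[(String, Long)] =
    spark.sql(query).collect().toSeq.map(r => (r.getString(0), r.getLong(1)))

  def main(args: Array[String]): Unit = {
    val spark = SparkSession.builder()
      .appName("domain-size-sketch")
      .master("local[*]") // local mode, for experimentation only
      .getOrCreate()
    import spark.implicits._

    // Hypothetical stand-in rows, one per captured URL.
    Seq("example.com", "example.com", "example.org",
        "example.com", "example.net", "example.org")
      .toDF("url_host_name")
      .createOrReplaceTempView("urls")

    domainSizes(spark).foreach { case (domain, size) => println(s"$domain,$size") }
    spark.stop()
  }
}
```

On the full index the result is far too large to collect, so writing the DataFrame out with `spark.sql(query).write.csv(...)` is the more practical route.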
