Let me take this opportunity to personally thank this listserv, the Common Crawl Foundation, and Sebastian Nagel for all their help and support over the years. I mentioned it in the acknowledgements section, but I wanted to reiterate it here.
I can honestly say that without Common Crawl datasets, it would've been almost impossible to write this book. We used Common Crawl datasets extensively to process web crawl data at scale and develop an email database (like
hunter.io), a website similarity database (like Alexa's), a technology profiler (something like
builtwith.com), domain authority and ranking metrics (like Moz, Ahrefs, etc.), and a page-level web graph / backlinks database.
The great thing about Common Crawl datasets is that for each use case outlined here, we could present clean example code simply by using preprocessed files such as WAT files (for graph-level examples), WET files (for text similarity), and so on.
Instead of showing a toy example of working with Parquet files, we demonstrated it with the Common Crawl index files; so even the rather mundane portions of the Common Crawl dataset gave us a teachable opportunity.
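To give a flavor of why the WET files make for such clean examples, here is a minimal sketch of pulling the target URI and extracted text out of a single WET record. The sample record below is hypothetical, and real WET files are gzipped concatenations of many such records, usually read with a library like warcio rather than by hand:

```python
# A hypothetical WET record: WARC-style headers, a blank line, then the
# plain text that Common Crawl extracted from the HTML page.
SAMPLE_WET_RECORD = """\
WARC/1.0
WARC-Type: conversion
WARC-Target-URI: http://example.com/
Content-Type: text/plain
Content-Length: 45

Example Domain. This domain is for examples.
"""

def parse_wet_record(record: str):
    """Split one WET record into a header dict and the extracted text."""
    header_part, _, body = record.partition("\n\n")
    headers = {}
    for line in header_part.splitlines()[1:]:  # skip the WARC/1.0 version line
        key, _, value = line.partition(": ")
        headers[key] = value
    return headers, body.strip()

headers, text = parse_wet_record(SAMPLE_WET_RECORD)
print(headers["WARC-Target-URI"])  # the crawled page's URL
print(text)                        # the extracted plain text
```

Because the boilerplate of fetching and rendering HTML is already done for you, a text-similarity example can start from `text` directly, which is exactly what made these files so convenient for the book.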