Let me take this opportunity to personally thank this listserv, the Common Crawl Foundation, and Sebastian Nagel for all their help and support over the years. I mentioned it in the acknowledgements section, but I wanted to reiterate it here.
I can honestly say that without Common Crawl datasets, it would've been almost impossible to write this book. We used Common Crawl datasets extensively to process web crawl data at scale and develop an email database (like
hunter.io), a website similarity database (like Alexa's), a technology profiler (something like
builtwith.com), domain authority and ranking metrics (like Moz, Ahrefs, etc.), and a page-level web graph / backlinks database.
The great thing about Common Crawl datasets is that for each use case outlined here, we could present clean example code simply by using preprocessed files such as WAT files (for graph-level examples), WET files (for text similarity), and so on.
Instead of showing a toy example of working with Parquet files, we demonstrated it with the Common Crawl index files; so even the rather mundane portions of the Common Crawl dataset gave us a teachable opportunity.
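To give a flavor of why the WET files make for such clean examples, here is a minimal sketch of pulling the target URI and extracted text out of a single WET record. The sample record below is hypothetical, and real WET files are gzipped concatenations of many such records, usually read with a library like warcio rather than by hand:

```python
# A hypothetical WET record: WARC-style headers, a blank line, then the
# plain text that Common Crawl extracted from the HTML page.
SAMPLE_WET_RECORD = """\
WARC/1.0
WARC-Type: conversion
WARC-Target-URI: http://example.com/
Content-Type: text/plain
Content-Length: 45

Example Domain. This domain is for examples.
"""

def parse_wet_record(record: str):
    """Split one WET record into a header dict and the extracted text."""
    header_part, _, body = record.partition("\n\n")
    headers = {}
    for line in header_part.splitlines()[1:]:  # skip the WARC/1.0 version line
        key, _, value = line.partition(": ")
        headers[key] = value
    return headers, body.strip()

headers, text = parse_wet_record(SAMPLE_WET_RECORD)
print(headers["WARC-Target-URI"])  # the crawled page's URL
print(text)                        # the extracted plain text
```

Because the boilerplate of fetching and rendering HTML is already done for you, a text-similarity example can start from `text` directly, which is exactly what made these files so convenient for the book.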