Finding a subset of Common Crawl by Keyword

Peter Bleackley

Jun 30, 2021, 10:37:06 AM

to Common Crawl

I'm thinking of using Common Crawl in a project, and I'd like to find a sample of documents that are relevant to the project, i.e. pages that contain certain keywords. I've been looking at the example projects, but I haven't seen a way of doing this. Can anyone suggest one?

Amirouche Boubekki

Jun 30, 2021, 12:14:19 PM

to common...@googlegroups.com

I have something like that. Do you need the output as WARC files or something else?

Jay Patel

Jun 30, 2021, 12:21:32 PM

to common...@googlegroups.com

I think it's a pretty typical use case for the Common Crawl datasets.

Briefly, you have two options. You can process all ~60,000 WET files from one monthly crawl, which contain the text extracted from web pages, and search that text for the keywords you want.
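As a rough illustration of the WET option, here is a minimal sketch in Python using only the standard library. It assumes the usual WET layout (WARC-style headers followed by a body of `Content-Length` bytes, with the extracted text in records of type `conversion`). The function names `iter_wet_records`, `find_keyword_pages` and `scan_wet_file` are made up for this example, and the match is a plain case-insensitive substring test rather than anything smarter.

```python
import gzip
import io


def iter_wet_records(stream):
    """Yield (url, text) for each 'conversion' record in a binary WET stream."""
    while True:
        line = stream.readline()
        if not line:
            return  # end of file
        if not line.startswith(b"WARC/"):
            continue  # blank separator lines between records
        headers = {}
        while True:
            line = stream.readline()
            if not line or line in (b"\r\n", b"\n"):
                break  # blank line ends the header block
            name, _, value = line.decode("utf-8", "replace").partition(":")
            headers[name.strip().lower()] = value.strip()
        # The body is exactly Content-Length bytes long.
        body = stream.read(int(headers.get("content-length", 0)))
        if headers.get("warc-type") == "conversion":
            yield headers.get("warc-target-uri", ""), body.decode("utf-8", "replace")


def find_keyword_pages(stream, keywords):
    """Yield (url, text) for pages whose text contains any keyword (case-insensitive)."""
    wanted = [k.lower() for k in keywords]
    for url, text in iter_wet_records(stream):
        lowered = text.lower()
        if any(k in lowered for k in wanted):
            yield url, text


def scan_wet_file(path, keywords):
    """Scan one gzip-compressed WET file on local disk (path is supplied by the caller)."""
    with gzip.open(path, "rb") as stream:
        yield from find_keyword_pages(stream, keywords)
```

Something like `for url, text in scan_wet_file("some-file.warc.wet.gz", ["beekeeping"]): print(url)` would then list matching pages from one downloaded file (the file name here is a placeholder). Running this over a whole monthly crawl means repeating it for every WET file, which is where a cluster comes in.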

Alternatively, you can filter web pages on the basis of domain rank etc. using the Common Crawl index, and then process the WARC files, which contain the HTML source of the web pages.
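The index route can be sketched roughly as follows, again in Python with only the standard library. This assumes the public CDX-style index server at index.commoncrawl.org and downloads from data.commoncrawl.org. The crawl ID and URL pattern in the usage note are example values, and the helper names are invented for this sketch. It only covers looking up captures by URL pattern and fetching one record by byte range; ranking domains is a separate step not shown here.

```python
import gzip
import json
import urllib.parse
import urllib.request

INDEX_SERVER = "https://index.commoncrawl.org"
DATA_SERVER = "https://data.commoncrawl.org"


def build_index_query(crawl_id, url_pattern):
    """Build the index query URL for one crawl and one URL pattern."""
    query = urllib.parse.urlencode({"url": url_pattern, "output": "json"})
    return f"{INDEX_SERVER}/{crawl_id}-index?{query}"


def parse_index_lines(text):
    """The index answers with one JSON object per line; parse them into dicts."""
    return [json.loads(line) for line in text.splitlines() if line.strip()]


def byte_range(record):
    """HTTP Range header value covering exactly one record inside a WARC file."""
    start = int(record["offset"])
    end = start + int(record["length"]) - 1
    return f"bytes={start}-{end}"


def fetch_warc_record(record):
    """Download and decompress a single WARC record (performs a network request)."""
    request = urllib.request.Request(
        f"{DATA_SERVER}/{record['filename']}",
        headers={"Range": byte_range(record)},
    )
    with urllib.request.urlopen(request) as response:
        return gzip.decompress(response.read())
```

For example, `build_index_query("CC-MAIN-2021-25", "example.com/*")` gives a URL whose response can be passed to `parse_index_lines`, and each resulting record can be handed to `fetch_warc_record` to get the raw WARC record, including the HTML, for keyword matching.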

I discuss both of these in chapters 6 and 7 of my book. On the implementation side, I prefer running these with my own custom code on a cluster of c5.2xlarge AWS EC2 servers, but I am sure you can find a lot of other options by searching through this forum.

Jay


