Best way to get index for a large set of specific pages?

41 views
Skip to first unread message

Bob

unread,
Jul 27, 2021, 9:09:15 PMJul 27
to Common Crawl
I am trying to pull the index files for just over 2,000,000 pages (specific urls, I only want the most recently scraped instance with a 200 response of each). What is the preferred method for doing this?

I'd prefer to not use Athena as this is a personal project and I'm somewhat sure that the Athena route will bankrupt me. I've been playing around with a clone of the cc-index-server, but since this is a one time usage I figure its best asking before I set up the network infrastructure to handle the concurrency needed to do this (hopefully) within a few hours.

Any suggestions or links are greatly appreciated! Happy to clarify where needed!

Bob

unread,
Jul 27, 2021, 9:24:29 PMJul 27
to Common Crawl
Clarification on the first sentence: By "Index files" I mean the preferred way to get specific index of each individual page. 

e.x.:
1.) example.com/path/to/page1
2.) potato.com/news/tasty-potatoes
...
2,000,000.) notpotato.com/iamapotato


Sebastian Nagel

unread,
Jul 28, 2021, 12:53:24 PMJul 28
to common...@googlegroups.com
Hi Bob,

> I'd prefer to not use Athena as this is a personal project
> and I'm somewhat sure that the Athena route will bankrupt me.

If done right, it should cost you not more than $0.50 to intersect
the list of URLs with one monthly crawl using Athena.

The idea is to create a second table which holds your URL list
in the form of SURT URLs
(https://example.com/path/page.html -> com,example)/path/page.html)
and then do a join on the SURT columns of the two tables.

The SURT URL is used to avoid mismatches on the URL by common variations:
https vs http, a trailing slash, swapped URL query params.

I'm currently doing similar work - compare the URL database of a
focused crawl with the CC index to compare the two crawls and evaluate
the coverage. I'll share how to do this but I'd need a couple of days
to write it down.

A table join on the domain column is described here:
https://github.com/commoncrawl/cc-index-table/blob/master/src/sql/examples/cc-index/count-domains-alexa-top-1m.sql

Let me know whether this drafted solution sounds good to you and ping me in case
you wait for a quick response.

Best,
Sebastian

Bob

unread,
Jul 28, 2021, 9:12:31 PMJul 28
to Common Crawl
Thank you Sir! This is exactly what I was looking for. I am going to give it a try this weekend and I'll gladly report back with results!  
Reply all
Reply to author
Forward
0 new messages