Inconsistency between cluster.idx and index search


Александр Великий

Jan 28, 2023, 1:50:24 PM
to Common Crawl
I'm trying to find URLs on a given domain. As I understand it, there are two options:
* Querying https://index.commoncrawl.org/<index>-index?url=example.com
* Reading https://data.commoncrawl.org/cc-index/collections/<index>/indexes/cluster.idx

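However, I found the following inconsistencies:

The response from https://index.commoncrawl.org/CC-MAIN-2022-49-index?url=example.com contains 144 entries for example.com (all for the / path). However, https://data.commoncrawl.org/cc-index/collections/CC-MAIN-2022-49/indexes/cluster.idx contains only two entries for example.com (both with non-/ paths).

Additionally, the response from https://index.commoncrawl.org/CC-MAIN-2021-43-index?url=api.remitano.com&output=json returns a single entry for api.remitano.com. However, https://data.commoncrawl.org/cc-index/collections/CC-MAIN-2021-43/indexes/cluster.idx does not contain any entries for api.remitano.com.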

The idea to look into cluster.idx came to me after reading https://groups.google.com/g/common-crawl/c/3QmQjFA_3y4/m/vTbhGqIBBQAJ

Can you please explain why the responses are inconsistent, and whether looking into the cluster.idx file is a good way to find URLs for a given domain?

Sebastian Nagel

Jan 28, 2023, 3:29:58 PM
to common...@googlegroups.com
Hi Alexander,

> The idea to look into cluster.idx came to me after reading
> https://groups.google.com/g/common-crawl/c/3QmQjFA_3y4/m/vTbhGqIBBQAJ

It's explained further down in that thread, and also here:
https://github.com/webrecorder/pywb/wiki/CDX-Index-Format#zipnum-sharded-cdx

In short: a binary search in the cluster.idx tells you in which block(s)
of which cdx-*.gz to look further. PyWB (the software running
index.commoncrawl.org) performs this second lookup fetching the block(s)
from s3://commoncrawl/.
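
A minimal sketch of that first lookup step, for illustration only. It assumes each cluster.idx line is tab-separated as "SURT key and timestamp, cdx file name, offset, length, sequence number", and that the line holds the first key of its block; the sample lines, the `parse_line` and `find_blocks` helpers, and the hard-coded SURT prefix are hypothetical and not taken from PyWB.

```python
import bisect

# Hypothetical sample lines in the assumed cluster.idx layout:
# "<SURT key> <timestamp>\t<cdx file>\t<offset>\t<length>\t<sequence>"
SAMPLE = [
    "com,dog)/ 20221126000000\tcdx-00001.gz\t0\t100\t1",
    "com,example)/a 20221126000000\tcdx-00001.gz\t100\t120\t2",
    "com,example)/z 20221127000000\tcdx-00001.gz\t220\t90\t3",
    "com,fox)/ 20221127000000\tcdx-00001.gz\t310\t80\t4",
]


def parse_line(line):
    """Split one cluster.idx line into (key, cdx_file, offset, length)."""
    key, cdx_file, offset, length, _seq = line.rstrip("\n").split("\t")
    return key, cdx_file, int(offset), int(length)


def find_blocks(entries, prefix):
    """Return the blocks that may hold keys starting with `prefix`.

    `entries` must be sorted by key, as cluster.idx is.
    """
    keys = [entry[0] for entry in entries]
    lo = bisect.bisect_left(keys, prefix)
    hi = bisect.bisect_right(keys, prefix + "\uffff")
    # Each entry names only the first key of its block, so the block
    # just before `lo` may still contain matching records.
    start = max(lo - 1, 0)
    return entries[start:hi]


entries = [parse_line(line) for line in SAMPLE]

for key, cdx_file, offset, length in find_blocks(entries, "com,example)/"):
    print(cdx_file, offset, length, key)
```

Note that a prefix with no line of its own in cluster.idx (for example `com,eel)/` in the sample above) still yields one candidate block, the one that starts before it. That block has to be fetched and scanned before anything can be said about the domain.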

Because only one in every 3000 URLs is included in the cluster.idx,
the absence of a domain name there only means that there are *fewer
than 3000* captures of that domain, possibly none at all.
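
To illustrate the sampling effect with made-up data: the sketch below builds a sorted list of hypothetical SURT keys, keeps every 3000th one (assuming the first line of each block is the one that ends up in cluster.idx), and checks whether a domain with 144 captures shows up. The domain names, the counts for the neighbouring domains, and the `sampled_keys` helper are invented for the example.

```python
BLOCK_SIZE = 3000  # lines per block, per the figure given above


def sampled_keys(sorted_keys, block_size=BLOCK_SIZE):
    """Keep only the first key of each block, as a stand-in for cluster.idx."""
    return sorted_keys[::block_size]


# Hypothetical full index: two large domains surrounding a small one.
keys = sorted(
    [f"com,aaa)/page{i:05d}" for i in range(4000)]
    + [f"com,example)/page{i:03d}" for i in range(144)]
    + [f"com,zzz)/page{i:05d}" for i in range(4000)]
)

idx = sampled_keys(keys)

in_full_index = any(k.startswith("com,example)/") for k in keys)
in_cluster_idx = any(k.startswith("com,example)/") for k in idx)

print("captures in full index:", in_full_index)
print("visible in sampled index:", in_cluster_idx)
```

Here the 144 captures fall between two block boundaries, so the sampled list never mentions the domain even though the full index contains it.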

Best,
Sebastian

On 1/28/23 19:50, Александр Великий wrote:
> I'm trying to find URLs on a given domain. As I understand there are two
> options:
> * Querying https://index.commoncrawl.org/<index>-index?url=example.com
> * Reading
> https://data.commoncrawl.org/cc-index/collections/<index>/indexes/cluster.idx
>
> However I found following inconsistencies:
> Response from
> https://index.commoncrawl.org/CC-MAIN-2022-49-index?url=example.com
> contains 144 entries for example.com (all for / path)
> However,
> https://data.commoncrawl.org/cc-index/collections/CC-MAIN-2022-49/indexes/cluster.idx
> contains only two entries for example.com (both with non / paths)
>
> Additionally, response from
> https://index.commoncrawl.org/CC-MAIN-2021-43-index?url=api.remitano.com&output=json
> returns a single entry for api.remitano.com
> However,
> https://data.commoncrawl.org/cc-index/collections/CC-MAIN-2021-43/indexes/cluster.idx
> does not contain entries for api.remitano.com
>
> The idea to look into cluster.idx came to me after reading
> https://groups.google.com/g/common-crawl/c/3QmQjFA_3y4/m/vTbhGqIBBQAJ
>
> Can you please explain why there is an inconsistency in the responses
> and is looking into cluster.idx file a good idea to look for URLs for
> a given domain?
>