Inconsistency between cluster.idx and index search

Александр Великий

Jan 28, 2023, 1:50:24 PM

to Common Crawl

I'm trying to find URLs on a given domain. As I understand it, there are two options:

* Querying https://index.commoncrawl.org/<index>-index?url=example.com

* Reading https://data.commoncrawl.org/cc-index/collections/<index>/indexes/cluster.idx

However, I found the following inconsistencies:

The response from https://index.commoncrawl.org/CC-MAIN-2022-49-index?url=example.com contains 144 entries for example.com (all for the / path).
However, https://data.commoncrawl.org/cc-index/collections/CC-MAIN-2022-49/indexes/cluster.idx contains only two entries for example.com (both with non-/ paths).

Additionally, the response from https://index.commoncrawl.org/CC-MAIN-2021-43-index?url=api.remitano.com&output=json returns a single entry for api.remitano.com.
However, https://data.commoncrawl.org/cc-index/collections/CC-MAIN-2021-43/indexes/cluster.idx does not contain any entries for api.remitano.com.

The idea to look into cluster.idx came to me after reading https://groups.google.com/g/common-crawl/c/3QmQjFA_3y4/m/vTbhGqIBBQAJ

Can you please explain why the responses are inconsistent, and whether looking into the cluster.idx file is a good way to find URLs for a given domain?

Sebastian Nagel

Jan 28, 2023, 3:29:58 PM

to common...@googlegroups.com

Hi Alexander,

> The idea to look into cluster.idx came to me after reading
> https://groups.google.com/g/common-crawl/c/3QmQjFA_3y4/m/vTbhGqIBBQAJ

It's explained if you read further down that thread, or in
https://github.com/webrecorder/pywb/wiki/CDX-Index-Format#zipnum-sharded-cdx

In short: a binary search in the cluster.idx tells you in which block(s)
of which cdx-*.gz to look further. PyWB (the software running
index.commoncrawl.org) performs this second lookup fetching the block(s)
from s3://commoncrawl/.

Because only one out of every 3000 index lines (the first line of each
block) is included, the absence of a domain name in the cluster.idx only
means that there are *fewer than 3000* captures of that domain, maybe no
capture at all.
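
To illustrate why a domain can be missing from cluster.idx and still be found, here is a rough sketch of the second lookup. It assumes each block is an independent gzip member inside the cdx-*.gz file, so a byte slice at the given offset and length can be decompressed on its own. The shard is built in memory from made-up lines with placeholder JSON; in practice the slice would come from an HTTP Range request or an S3 ranged read. `read_block` and `captures_for` are hypothetical helper names.

```python
import gzip

def read_block(shard_bytes, offset, length):
    """Decompress one block (a single gzip member) out of a shard."""
    member = shard_bytes[offset:offset + length]
    return gzip.decompress(member).decode("utf-8").splitlines()

def captures_for(cdx_lines, surt_prefix):
    """Keep only the CDX lines whose SURT key starts with surt_prefix."""
    return [line for line in cdx_lines if line.startswith(surt_prefix)]

# Two synthetic blocks; only the FIRST line of each would be in cluster.idx.
block_a = (
    "com,foo)/ 20211020000000 {}\n"
    "com,remitano,api)/ 20211021000000 {}\n"
)
block_b = "com,zzz)/ 20211022000000 {}\n"
gz_a = gzip.compress(block_a.encode("utf-8"))
gz_b = gzip.compress(block_b.encode("utf-8"))
shard = gz_a + gz_b  # concatenated gzip members, like a cdx-*.gz shard

# cluster.idx would point at offset 0 / len(gz_a) for the com,foo)/ block.
lines = read_block(shard, 0, len(gz_a))
hits = captures_for(lines, "com,remitano,api)/")
print(hits)
```

Here com,remitano,api)/ is not the first line of any block, so it would never show up in cluster.idx, yet the capture is found once the block is read.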

Best,
Sebastian
