results limit in index search

Premraj Narkhede

Oct 28, 2016, 8:50:44 AM

to Common Crawl

Hi guys,

Thanks for the awesome work you are doing at Common Crawl.

I have a question about index search results: is there a limit on the number of hits that an index search returns?

For example, I was searching for Bloomberg articles in all indexes.

I see that one particular result, which I obtained by searching for its URL specifically,
{"urlkey": "com,bloomberg)/news/articles/2015-01-21/discovery-said-to-fight-murdoch-s-sky-for-english-soccer-rights", "timestamp": "20160727110640", "status": "200", "url": "http://www.bloomberg.com/news/articles/2015-01-21/discovery-said-to-fight-murdoch-s-sky-for-english-soccer-rights", "filename": "crawl-data/CC-MAIN-2016-30/segments/1469257826759.85/warc/CC-MAIN-20160723071026-00225-ip-10-185-27-174.ec2.internal.warc.gz", "length": "35263", "mime": "text/html", "offset": "326446213", "digest": "ZV5RXZEJQCSLCYNYEQCCKGP4FTXLKSMT"}

does not appear in the search results when I search for *.bloomberg.com in the index.

Premraj

Sebastian Nagel

Oct 28, 2016, 9:12:26 AM

to common...@googlegroups.com

Hi Premraj,

Have a look at the pagination request parameters:
https://github.com/ikreymer/pywb/wiki/CDX-Server-API#pagination-api

You start, e.g., with the query
http://index.commoncrawl.org/CC-MAIN-2016-40-index?url=bloomberg.com&matchType=domain&output=json&pageSize=1&showNumPages=true

which returns
{"blocks": 247, "pages": 247, "pageSize": 1}

Then send requests page by page, from page 0

http://index.commoncrawl.org/CC-MAIN-2016-40-index?url=bloomberg.com&matchType=domain&output=json&pageSize=1&page=0

to page (pages-1), here 246:

http://index.commoncrawl.org/CC-MAIN-2016-40-index?url=bloomberg.com&matchType=domain&output=json&pageSize=1&page=246

*Note* that "pages" is not the number of result lines/records: a single page can contain many
records. It is better to choose a small value for the pageSize parameter.
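The page-by-page loop described above could be sketched in Python roughly as follows. This is only an illustration, not an official client: the function names (`build_query`, `http_get`, `iter_records`) are made up for this example, it assumes that `output=json` returns one JSON object per line, and it does no throttling or retrying.

```python
import json
from urllib.parse import urlencode
from urllib.request import urlopen

# Index endpoint taken from the example URLs in this thread.
INDEX_URL = "http://index.commoncrawl.org/CC-MAIN-2016-40-index"


def build_query(domain, page=None, page_size=1):
    """Build a query URL; without `page`, ask only for the page count."""
    params = {
        "url": domain,
        "matchType": "domain",
        "output": "json",
        "pageSize": page_size,
    }
    if page is None:
        params["showNumPages"] = "true"
    else:
        params["page"] = page
    return INDEX_URL + "?" + urlencode(params)


def http_get(url):
    """Fetch a URL and return the body as text (no retries, no throttling)."""
    with urlopen(url) as response:
        return response.read().decode("utf-8")


def iter_records(domain, fetch=http_get, page_size=1):
    """Yield every index record for `domain`, walking pages 0 .. pages-1."""
    info = json.loads(fetch(build_query(domain, page_size=page_size)))
    for page in range(info["pages"]):
        body = fetch(build_query(domain, page=page, page_size=page_size))
        # Assumption: one JSON object per non-empty line.
        for line in body.splitlines():
            if line.strip():
                yield json.loads(line)
```

Calling `iter_records("bloomberg.com")` would then issue one request for the page count and one request per page. The `fetch` argument is there only so the loop can be tried against canned responses before hitting the real server.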

If you want to look up a large number of domains, it may be more efficient to read the index
files sequentially. They are available on S3, 300 files per monthly crawl:
s3://commoncrawl/cc-index/collections/CC-MAIN-2016-40/indexes/CC-MAIN-2016-40/cdx-*.gz
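Reading one of those files sequentially might look like the sketch below. It rests on assumptions not stated in this thread: that a cdx-*.gz file has already been downloaded locally, and that each line has the form "urlkey timestamp {JSON}" with the urlkey in the same reversed-host form as in the record above (e.g. com,bloomberg). The helper names are hypothetical.

```python
import gzip
import json


def parse_cdx_line(line):
    """Split an assumed 'urlkey timestamp {json}' line into one dict."""
    urlkey, timestamp, payload = line.rstrip("\n").split(" ", 2)
    record = json.loads(payload)
    record["urlkey"] = urlkey
    record["timestamp"] = timestamp
    return record


def matches_domain(urlkey, surt_domain):
    """True if the urlkey's host is the domain itself or one of its subdomains."""
    host = urlkey.split(")", 1)[0]  # e.g. 'com,bloomberg' or 'com,bloomberg,www'
    return host == surt_domain or host.startswith(surt_domain + ",")


def scan_cdx_file(path, surt_domains):
    """Yield the records in a local cdx-*.gz file that belong to any wanted domain."""
    with gzip.open(path, "rt", encoding="utf-8") as lines:
        for line in lines:
            urlkey = line.split(" ", 1)[0]
            if any(matches_domain(urlkey, d) for d in surt_domains):
                yield parse_cdx_line(line)
```

For example, `scan_cdx_file("cdx-00000.gz", ["com,bloomberg"])` would yield only the bloomberg.com records from that file. The host check is deliberately simple and ignores details such as port numbers in the key.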

Best,
Sebastian

Premraj Narkhede

Oct 28, 2016, 9:21:31 AM

to Common Crawl

Got it! Thanks for the super quick reply!

What would you recommend if I am looking for about 100 domains: read the index files sequentially, or use the index search? How widely distributed would the data from a single domain be?

Premraj

Sebastian Nagel

Oct 28, 2016, 9:39:49 AM

to common...@googlegroups.com

Looking up 100 domains isn't a problem. If we assume 100 pages per domain on average, that's 10,000
requests: definitely no problem for the index server, and probably processed within 2-3 hours.
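As a quick check of that estimate, the arithmetic can be written out as below. The per-request time is only what the 2-3 hour figure implies if the requests are sent one after another, which is an assumption; `estimate_lookup` is a name made up for this example.

```python
def estimate_lookup(domains, pages_per_domain, total_hours):
    """Return (number of requests, implied average seconds per request)."""
    requests = domains * pages_per_domain
    return requests, total_hours * 3600 / requests


for hours in (2, 3):
    requests, seconds = estimate_lookup(100, 100, hours)
    print(f"{requests} requests in {hours} h -> {seconds:.2f} s per request")
# prints:
# 10000 requests in 2 h -> 0.72 s per request
# 10000 requests in 3 h -> 1.08 s per request
```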
