Missing data in index search results


Bahar Zafer

Mar 9, 2025, 4:26:02 AM
to Common Crawl

Dear CommonCrawl Team,

I am reaching out to report an issue with the index search functionality. It appears that index.commoncrawl.org might be truncating search results when a URL pattern matches a large number of records.

In my analysis, I processed the full WARC files from CC-MAIN-2017-04 by iterating over all records and matching URL patterns, then compared these results with the URLs returned by the index query. For URL patterns matching fewer than 10,000 records, the index search performed as expected. However, for one URL pattern that matched 96,112 URLs in the full dataset, the index query returned only 14,165 URLs. After merging and sorting the URLs alphabetically, it seems that only the first 14,165 entries are being returned. I observed the same behavior with all five URL patterns that matched between 25K and 420K URLs—the index query consistently returned fewer than 15,000 results.
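The comparison described above can be sketched roughly as follows. This is only an illustration under assumptions: the helper names `missing_from_index` and `is_sorted_prefix` are hypothetical, and the two inputs are assumed to be plain lists of URL strings already extracted from the WARC files and from the index response.

```python
def missing_from_index(full_urls, index_urls):
    """Return the URLs seen in the WARC files but absent from the index results."""
    return sorted(set(full_urls) - set(index_urls))


def is_sorted_prefix(full_urls, index_urls):
    """Check whether the index results are exactly the first N URLs
    of the alphabetically sorted, de-duplicated full list."""
    full_sorted = sorted(set(full_urls))
    index_sorted = sorted(set(index_urls))
    return index_sorted == full_sorted[:len(index_sorted)]
```

If `is_sorted_prefix` returns True while `missing_from_index` is non-empty, the index response stops early rather than dropping records at random, which is consistent with the truncation described here.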

I tested the query using both Python’s requests package and curl, and both methods yielded identical results. Would you recommend a different approach or tool to handle such long responses?

I can provide additional details and share my files if needed.

Thank you for your assistance.

Best regards,

Bahar

Thom Vaughan

Mar 9, 2025, 6:25:50 PM
to Common Crawl
Hi Bahar,

The CDX index server paginates results, which may explain why you're seeing fewer results than expected in a single query. You can use the showNumPages=true URL parameter to get the total number of available pages, and then use page=N to iterate over them.

This is documented here: https://github.com/webrecorder/pywb/wiki/CDX-Server-API#pagination-api
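As a minimal sketch of that pagination loop (not an official client): the function names below are hypothetical, the standard library's urllib is used here instead of requests so the example has no third-party dependency, and it assumes that output=json returns one JSON object per line. The `fetch` argument is injectable so the loop can be tried without network access.

```python
import json
from urllib.parse import urlencode
from urllib.request import urlopen

INDEX_HOST = "https://index.commoncrawl.org"


def build_query_url(crawl, params):
    """Build the index query URL for one crawl, e.g. crawl='CC-MAIN-2017-04'."""
    return f"{INDEX_HOST}/{crawl}-index?{urlencode(params)}"


def fetch_text(url):
    """Default fetcher: download the response body as text."""
    with urlopen(url) as resp:
        return resp.read().decode("utf-8")


def fetch_all_records(crawl, url_key, fetch=fetch_text):
    """Ask for the page count first, then collect the records of every page."""
    info_url = build_query_url(crawl, {"url": url_key, "showNumPages": "true"})
    num_pages = json.loads(fetch(info_url))["pages"]

    records = []
    for page in range(num_pages):  # pages are requested one by one via page=N
        page_url = build_query_url(
            crawl, {"url": url_key, "output": "json", "page": page}
        )
        body = fetch(page_url)
        # Assumed response shape: one JSON object per non-empty line.
        records.extend(json.loads(line) for line in body.splitlines() if line.strip())
    return records
```

A call such as `fetch_all_records("CC-MAIN-2017-04", "example.com/*")` would issue one request per page; for large patterns it may be worth adding a pause between requests.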

Let us know if you need further assistance.

TV

Jason Grey

Mar 17, 2025, 1:11:04 AM
to common...@googlegroups.com, Crawl Common
It'd be helpful to have your queries. If you don't want to post them here, you can send them to me directly and I can have a look.


Bahar Zafer

Mar 17, 2025, 5:54:10 PM
to common...@googlegroups.com
Hi Thom and Jason, 

Many thanks for your replies and assistance. Iterating over pages worked well. 

Based on the few responses I have checked, it seems that the index query returns the results of page=0 when the page parameter is not specified. Could you confirm whether this is always the case? Is page=0 the default value? Would you recommend skipping page=0 and iterating over the remaining pages to complete my index search?

A simplified version of my code below: 

import requests

# Ask the index server how many result pages exist for this URL pattern.
resp = requests.get(f'https://index.commoncrawl.org/{crawl}-index', params={'url': url_key, 'showNumPages': 'true'})
cont = resp.json()
numpages = cont['pages']
for pagenum in range(numpages):
    .....

Thank you for taking the time to help.
Best, 
Bahar 

Thom Vaughan

Mar 17, 2025, 6:30:32 PM
to Common Crawl
Hi Bahar,

Glad you've made progress.  The page numbers are zero-indexed, so page=0 is the first page.
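To illustrate the zero-based numbering, here is a small sketch that builds the query parameters for every page. The helper name `page_queries` is hypothetical; the parameter names `url`, `output` and `page` are the ones used elsewhere in this thread.

```python
def page_queries(url_key, num_pages):
    """Return one parameter dict per page, numbered 0 .. num_pages - 1."""
    return [
        {"url": url_key, "output": "json", "page": page}
        for page in range(num_pages)
    ]
```

With `num_pages = 3`, this yields queries for pages 0, 1 and 2, so starting the loop at 1 would skip the first page of results.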

Best,
TV

Bahar Zafer

Mar 18, 2025, 5:06:54 AM
to common...@googlegroups.com
Hi Thom, 

I am still confused about this. Should response1 and response2 as defined below for some url key and crawl return the same output? 

response1 = requests.get(f'https://index.commoncrawl.org/{crawl}-index?', 
params={'url': url_key, 'output': 'json', 'page'=0} )
response2 = requests.get(f'https://index.commoncrawl.org/{crawl}-index?', 
params={'url': url_key, 'output': 'json'})

Many thanks for your help.

Best regards,
Bahar

Jason Grey

Mar 18, 2025, 11:31:41 AM
to common...@googlegroups.com
Yes, those should return the same results.

However, the params dictionary you specified for response1 has incorrect syntax (note the use of "=" rather than ":" after 'page'), so be sure to fix that.
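The syntax point can be checked without sending any request. The sketch below uses Python's built-in `compile` to test whether a line parses; the helper name `is_valid_python` and the two constants are hypothetical, and `url_key` is only a placeholder name that is never evaluated.

```python
def is_valid_python(source):
    """Return True if the source parses as Python, False on a SyntaxError."""
    try:
        compile(source, "<example>", "exec")
    except SyntaxError:
        return False
    return True


# The dictionary as posted, with "=" after the 'page' key.
ORIGINAL = "params = {'url': url_key, 'output': 'json', 'page'=0}"

# The same dictionary with ":" separating key and value.
CORRECTED = "params = {'url': url_key, 'output': 'json', 'page': 0}"
```

Here `is_valid_python(ORIGINAL)` is False and `is_valid_python(CORRECTED)` is True, so only the second form can be passed to `requests.get`.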

Bahar Zafer

Mar 19, 2025, 6:22:23 AM
to common...@googlegroups.com