Dear CommonCrawl Team,
I am reaching out to report an issue with the index search functionality. It appears that index.commoncrawl.org might be truncating search results when a URL pattern matches a large number of records.
In my analysis, I processed the full WARC files from CC-MAIN-2017-04 by iterating over all records and matching URL patterns, then compared these results with the URLs returned by the index query. For URL patterns matching fewer than 10,000 records, the index search performed as expected. However, for one URL pattern that matched 96,112 URLs in the full dataset, the index query returned only 14,165 URLs. After merging the two sets and sorting the URLs alphabetically, it appears that the index returns only the first 14,165 entries of the sorted list. I observed the same behavior with all five URL patterns that matched between 25K and 420K URLs: in each case the index query returned fewer than 15,000 results.
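The comparison described above could be sketched roughly as follows. This is only an illustration under assumptions: `compare_url_sets`, `warc_urls`, and `index_urls` are hypothetical names, and the sort here is Python's plain string ordering, which may differ from the key ordering the index server uses internally.

```python
def compare_url_sets(warc_urls, index_urls):
    """Summarise how the index results relate to the URLs found in the WARC files."""
    full = sorted(set(warc_urls))   # every matching URL from the full scan, sorted
    returned = set(index_urls)      # URLs the index query actually gave back
    missing = [u for u in full if u not in returned]
    # True when the index returned exactly the first len(returned) URLs
    # of the sorted full list, i.e. the result looks cut off at some point.
    is_prefix = set(full[:len(returned)]) == returned
    return {
        "full": len(full),
        "returned": len(returned),
        "missing": len(missing),
        "is_prefix": is_prefix,
    }


if __name__ == "__main__":
    # Toy data only; real inputs would be the two URL lists being compared.
    print(compare_url_sets(["b", "a", "c", "d"], ["a", "b"]))
```

If `is_prefix` comes back true while `missing` is large, that would be consistent with the result set being cut off rather than with individual records being absent from the index.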
I tested the query using both Python’s requests package and curl, and both methods yielded identical results. Would you recommend a different approach or tool to handle such long responses?
I can provide additional details and share my files if needed.
Thank you for your assistance.
Best regards,
Bahar
On Mar 9, 2025, at 3:26 AM, Bahar Zafer <bahar...@gmail.com> wrote:
--
You received this message because you are subscribed to the Google Groups "Common Crawl" group.
To unsubscribe from this group and stop receiving emails from it, send an email to common-crawl...@googlegroups.com.
To view this discussion visit https://groups.google.com/d/msgid/common-crawl/c64ba5ad-56de-4653-b4e2-fdaa11499fb8n%40googlegroups.com.
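import json
import requests

# Ask the index server how many result pages this query spans.
resp = requests.get(f'https://index.commoncrawl.org/{crawl}-index', params={'url': url_key, 'showNumPages': 'true'})
cont = json.loads(resp.content)
numpages = cont['pages']
# Request each page in turn; a single request returns only one page of results.
for pagenum in range(numpages):
    response = requests.get(f'https://index.commoncrawl.org/{crawl}-index', params={'url': url_key, 'output': 'json', 'page': pagenum})
    .....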
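For completeness, here is one way the paging loop above might be turned into a self-contained helper that also parses the results. It is a sketch, not a definitive implementation: `iter_index_records` and `http_get_text` are hypothetical names, the standard library's `urllib` stands in for `requests`, and it assumes that `output=json` yields one JSON object per line. Retries, rate limiting, and error handling are left out.

```python
import json
from urllib.parse import urlencode
from urllib.request import urlopen


def http_get_text(url):
    """Default fetcher: return the body of `url` decoded as UTF-8 text."""
    with urlopen(url) as resp:
        return resp.read().decode("utf-8")


def iter_index_records(crawl, url_key, fetch=http_get_text):
    """Yield every index record for `url_key`, walking the result pages in order."""
    base = f"https://index.commoncrawl.org/{crawl}-index"
    # First ask how many pages the result set spans.
    info_query = urlencode({"url": url_key, "showNumPages": "true"})
    info = json.loads(fetch(base + "?" + info_query))
    for page in range(info["pages"]):
        query = urlencode({"url": url_key, "output": "json", "page": page})
        body = fetch(base + "?" + query)
        # Assumed format: one JSON object per non-empty line.
        for line in body.splitlines():
            if line.strip():
                yield json.loads(line)
```

Passing the fetcher in as a parameter is just a convenience so the paging logic can be exercised without network access; something like `urls = [rec["url"] for rec in iter_index_records(crawl, url_key)]` would then collect the URLs for comparison.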
params={'url': url_key, 'output': 'json', 'page': 0})
Many thanks for your help.
params={'url': url_key, 'output': 'json'})