Hi everyone,
First, thank you for the amazing work you’ve all done to make the Common Crawl corpus queryable at scale. The index structure and Athena access have been essential to our project.
We’ve been querying the ccindex.urls table successfully for the past few months, including partitioned queries against crawl='CC-MAIN-2023-50' using projection settings and filtering down to specific URL patterns. This worked flawlessly until yesterday.
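For concreteness, the affected queries are of roughly this shape (the filter pattern is illustrative, and the non-partition column names reflect our own Glue catalog entry, so they may differ from the canonical schema):

    SELECT url_host_registered_domain,
           warc_record_length_median
    FROM ccindex.urls
    WHERE crawl = 'CC-MAIN-2023-50'                        -- partition (projection) key
      AND url_host_registered_domain LIKE '%.example.org'  -- illustrative pattern
    LIMIT 100;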
Suddenly, queries that had previously returned thousands of well-formed results are either:
- returning zero rows, or
- throwing HIVE_BAD_DATA errors, particularly schema mismatches on fields like warc_record_length_median.
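In case it helps triage: HIVE_BAD_DATA schema mismatches typically surface while reading a specific Parquet object, so one way to localize them is Athena’s documented "$path" pseudo-column, along these lines:

    -- Lists the objects behind the partition without deserializing the
    -- suspect field, so it succeeds even when that field is unreadable.
    SELECT "$path" AS object, count(*) AS num_rows
    FROM ccindex.urls
    WHERE crawl = 'CC-MAIN-2023-50'
    GROUP BY 1;

    -- Probing objects one at a time then surfaces the file(s) whose
    -- Parquet type no longer matches the Glue schema.
    SELECT warc_record_length_median
    FROM ccindex.urls
    WHERE crawl = 'CC-MAIN-2023-50'
      AND "$path" = '<object from the previous query>'
    LIMIT 1;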
No changes were made on our end to the underlying Athena table definition, partition projection configuration, or crawl ID. We’re using a stable Glue catalog entry with projection.enabled and filtering against known-good fields. This change in behavior has blocked the downstream stages of our pipeline, which depend on WARC extraction driven by the Athena output.
We’ve carefully reviewed the schema and even tested re-adding partitions manually, but the behavior has shifted from known-good to broken with no apparent config drift on our end.
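For reference, the manual re-add was of the standard ALTER TABLE form, sketched below with the S3 location elided. (Since Athena ignores metastore partitions while projection.enabled is true, this only takes effect with projection switched off.)

    ALTER TABLE ccindex.urls ADD IF NOT EXISTS
      PARTITION (crawl = 'CC-MAIN-2023-50')
      -- actual S3 prefix elided
      LOCATION 's3://<index-prefix>/crawl=CC-MAIN-2023-50/';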
We’d deeply appreciate any insight into:
- whether any recent schema corrections or index updates may have altered the behavior of ccindex.urls;
- whether this is isolated to CC-MAIN-2023-50 or systemic; and
- whether this may be linked to how warc_record_length_median or similar fields are stored and interpreted by Athena.
Thank you in advance for your time, and again for everything you’ve made possible. We are fully aligned with your mission and building directly on your work.
Best regards,
Mark Ciccarelli
Hi Sebastian,
Thank you so much for the clear and timely explanation, and for confirming that the issue is on the index side and not a fault in our pipeline.
Just a quick note: the GitHub issue link you shared (https://github.com/commoncrawl/cc-host-index-builder/issues/1) currently returns a 404 (“This is not the web page you are looking for”).
Not sure if the repo is private or if the link was posted ahead of publication, but just wanted to flag it in case there’s a different location where the issue is being tracked.
We’re happy to hold off on querying CC-MAIN-2023-50 until the fix is published. And again, thank you for all the work you and the team are doing; we’re grateful to build on top of it.
Best regards,
Mark Ciccarelli
Hi Greg,
Thanks for asking, and great question.
Yes, we are intentionally using the Host Index (the new, experimental index that emits one row per registered web host), not the older URL-level Columnar Index.
Sebastian interpreted this correctly: our queries rely on fields such as warc_record_length_median and nutch_fetched_pct, which are only present in the Host Index.
We’re building a structured WARC parsing engine that relies on prequalifying domains, not specific pages. Our pipeline uses the Host Index as an upstream signal source to:
- rank domains for crawl readiness,
- pre-classify anti-bot resistance levels, and
- feed a downstream spider-assignment system (Crawler Engine).
A sketch of the first step is below.
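Roughly, the ranking step looks like this (the thresholds are illustrative, and the host column name is an assumption on my part that may differ in the published Host Index schema):

    SELECT url_host_registered_domain AS host,  -- assumed column name
           nutch_fetched_pct,
           warc_record_length_median
    FROM ccindex.urls
    WHERE crawl = 'CC-MAIN-2023-50'
      AND nutch_fetched_pct >= 90               -- mostly fetchable hosts
      AND warc_record_length_median >= 10000    -- non-trivial median record size
    ORDER BY nutch_fetched_pct DESC, warc_record_length_median DESC
    LIMIT 1000;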
That said, some of the language in my original message referred to “pages” and “URLs” too loosely. Thanks for catching that; I’ll be more precise going forward.
Best,
Mark
> We’re happy to hold off on querying CC-MAIN-2023-50
Ok. We'd also need to verify that no other partitions or columns are affected.