Subject: Sudden Query Failure in Host Index — Seeking Insight

Mark Ciccarelli

Jun 4, 2025, 6:55:59 AM
to Common Crawl

Hi everyone,

First, thank you for the amazing work you’ve all done to make the Common Crawl corpus queryable at scale. The index structure and Athena access have been essential to our project.  

We’ve been querying the ccindex.urls table successfully for the past few months, including partitioned queries against crawl='CC-MAIN-2023-50' using projection settings and filtering down to specific URL patterns. This has been working flawlessly—until yesterday.

Suddenly, queries that had previously returned thousands of well-formed results are either:

  • Returning zero rows, or

  • Throwing HIVE_BAD_DATA errors, particularly related to schema mismatches on fields like warc_record_length_median

No changes were made to the underlying Athena table definition, partition projection configuration, or crawl ID. We’re using a stable Glue catalog entry with projection.enabled and filtering against known-good fields. This change in behavior has blocked downstream stages of our pipeline, which depends on WARC extraction based on Athena output.

We’ve carefully reviewed the schema and even tested re-adding partitions manually, but the behavior has shifted from known-good to broken with no apparent config drift on our end.

We’d deeply appreciate any insight into:

  • Whether any recent schema corrections or index updates may have altered the behavior of ccindex.urls

  • Whether this is isolated to CC-MAIN-2023-50 or systemic

  • If this may be linked to how warc_record_length_median or similar fields are stored and interpreted by Athena

Thank you in advance for your time—and again, for everything you’ve made possible. We are fully aligned with your mission and building directly on the shoulders of your work.

Best regards,
Mark Ciccarelli

Sebastian Nagel

Jun 4, 2025, 8:05:40 AM
to common...@googlegroups.com
Hi Mark,

thanks for the report! Apologies for any inconvenience.

You didn't do anything wrong. The problem is on our side:
- the column "warc_record_length_median" is declared to be an INT
in the schema [1]
- but for the partition crawl='CC-MAIN-2023-50' it is a string
("BINARY" in Parquet terminology)
- I haven't

We hope to fix this soon and will make sure that the published schema is
followed. Eventually we will release a new version "v3".
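
For readers who want to check their own partitions, here is a minimal sketch in plain Python of comparing a declared column type with the physical type found in the Parquet files. The type mapping, the function name `find_mismatches`, and the second column are illustrative assumptions, not part of the host index tooling. In practice the observed types would have to be read from the Parquet file footers.

```python
# Simplified assumption: which Parquet physical type normally backs
# each Athena/Hive column type. Not an exhaustive mapping.
EXPECTED_PHYSICAL = {
    "int": {"INT32"},
    "bigint": {"INT64"},
    "double": {"DOUBLE"},
    "string": {"BINARY"},
}


def find_mismatches(declared, observed):
    """Return (column, declared_type, physical_type) for every column
    whose observed Parquet physical type does not fit the declaration.
    Declared types missing from EXPECTED_PHYSICAL are always reported."""
    mismatches = []
    for column, declared_type in declared.items():
        physical = observed.get(column)
        if physical is None:
            continue  # column not present in this partition's files
        allowed = EXPECTED_PHYSICAL.get(declared_type.lower(), set())
        if physical not in allowed:
            mismatches.append((column, declared_type, physical))
    return mismatches


# Illustrative values only. The first column mirrors the case described
# above (declared INT, stored as BINARY); the second one is made up.
declared = {"warc_record_length_median": "INT", "example_other_column": "STRING"}
observed = {"warc_record_length_median": "BINARY", "example_other_column": "BINARY"}

for column, declared_type, physical in find_mismatches(declared, observed):
    print(f"{column}: declared {declared_type}, stored as {physical}")
```

Running the same check once per partition would show whether a mismatch is limited to a single crawl.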

Just as a note and caveat: the host index [2] is still in the "public
test" phase.

It's less stable than the columnar index [3], which has a stable
schema: its schema is checked for compatibility with the data in older
partitions via schema updates, aka schema merging [4,5,6].
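
As a rough illustration of the idea behind such a compatibility check, the sketch below merges per-partition schemas and reports columns whose type differs between partitions. It is a simplified stand-in, not the procedure used for the columnar index: the function name, partition names and column names are hypothetical, and real engines may tolerate some type differences that this sketch flags.

```python
def merge_partition_schemas(partition_schemas):
    """Merge per-partition schemas and collect type conflicts.

    partition_schemas maps a partition name to {column: type}.
    Returns (merged, conflicts). merged keeps the first type seen for
    each column. conflicts maps a column to {type: [partitions]} for
    every column on which the partitions disagree.
    """
    merged = {}
    seen = {}
    for partition, schema in partition_schemas.items():
        for column, col_type in schema.items():
            merged.setdefault(column, col_type)
            seen.setdefault(column, {}).setdefault(col_type, []).append(partition)
    conflicts = {col: types for col, types in seen.items() if len(types) > 1}
    return merged, conflicts


# Hypothetical partitions and columns, for illustration only.
schemas = {
    "partition-a": {"host": "string", "some_median": "int"},
    "partition-b": {"host": "string", "some_median": "string"},
}

merged, conflicts = merge_partition_schemas(schemas)
for column, types in conflicts.items():
    print(f"{column}: {types}")
```

A column listed in `conflicts` is one where a single table definition cannot match every partition without a fix to the data or the schema.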

The issue is tracked in [7].

Thanks and best,
Sebastian

[1]
https://github.com/commoncrawl/cc-host-index/blob/1fee08f273735bfe3e36ebd646a0e63d1d267c7b/athena_schema.v2.sql
[2] https://commoncrawl.org/blog/introducing-the-host-index
[3]
https://commoncrawl.org/blog/index-to-warc-files-and-urls-in-columnar-format

[4]
https://docs.aws.amazon.com/athena/latest/ug/handling-schema-updates-chapter.html
[5]
https://docs.aws.amazon.com/athena/latest/ug/updates-changing-column-type.html
[6]
https://spark.apache.org/docs/latest/sql-data-sources-parquet.html#schema-merging

[7] https://github.com/commoncrawl/cc-host-index-builder/issues/1

Mark Ciccarelli

Jun 4, 2025, 8:50:06 AM
to Common Crawl

Hi Sebastian,

Thank you so much for the clear and timely explanation—and for confirming that the issue is on the index side and not a fault in our pipeline.

Just a quick note: the GitHub issue link you shared
(https://github.com/commoncrawl/cc-host-index-builder/issues/1)
currently returns a 404 (“This is not the web page you are looking for”).

Not sure if the repo is private or if the link was posted ahead of publication, but just wanted to flag it in case there’s a different location where the issue is being tracked.

We’re happy to hold off on querying CC-MAIN-2023-50 until the fix is published—and again, thank you for all the work you and the team are doing. We’re grateful to build on top of it.

Best regards,
Mark Ciccarelli

Sebastian Nagel

Jun 4, 2025, 10:44:48 AM
to common...@googlegroups.com
Hi Mark,

> if the repo is private

Yes, sorry. The repo is a private one.

> We’re happy to hold off on querying CC-MAIN-2023-50

Ok. We'd also need to verify that no other partition or other columns
are affected.

~Sebastian

Greg Lindahl

Jun 4, 2025, 11:13:05 AM
to common...@googlegroups.com
Mark, are you intending to use the columnar index (which is 1 row per
url) or the very new, experimental host index (1 row per web host)?

I ask because your initial note says many things about urls, and there
are no urls in the host index. Sebastian guessed that you were talking
about the host index because the field name
"warc_record_length_median" is only in the host index.

-- greg

Mark Ciccarelli

Jun 4, 2025, 2:59:32 PM
to Common Crawl

Hi Greg,

Thanks for asking—and great question.

Yes, we are intentionally using the Host Index (the very new, experimental one that emits 1 row per registered web host), not the older URL-level Columnar Index.

Sebastian was correct in interpreting this, as our query leverages fields like warc_record_length_median and nutch_fetched_pct, which are only present in the Host Index.

We’re building a structured WARC parsing engine that relies on prequalifying domains, not specific pages. Our pipeline uses the Host Index as an upstream signal source to:

  • Rank domains for crawl readiness

  • Pre-classify anti-bot resistance levels

  • Feed into a downstream spider assignment system (Crawler Engine)

That said—some of my original message language referenced “pages” and “URLs” too loosely. Thanks for catching that—I’ll be more precise going forward.

Best,
Mark

Tom Morris

Jun 4, 2025, 5:12:34 PM
to common...@googlegroups.com
On Wed, Jun 4, 2025 at 10:44 AM Sebastian Nagel <seba...@commoncrawl.org> wrote:

> > We’re happy to hold off on querying CC-MAIN-2023-50
>
> Ok. We'd also need to verify that no other partition or other columns
> are affected.

It doesn't appear to be restricted to that one partition. In addition to the warc_record_length_median column, it also appears that warc_record_length_av is affected.

It looks like all the backing Parquet files were regenerated on May 24, so it's possible that's when the problem was introduced.

Tom

Greg Lindahl

Jun 4, 2025, 5:14:50 PM
to common...@googlegroups.com
No, it was worse before May 24, in other ways. I didn’t expect anyone to have found it before I announced it.

--
You received this message because you are subscribed to the Google Groups "Common Crawl" group.
To unsubscribe from this group and stop receiving emails from it, send an email to common-crawl...@googlegroups.com.
To view this discussion visit https://groups.google.com/d/msgid/common-crawl/CAE9vqEFvrdUtNoVshxMAA127%2BJa-G9Ns_xfS%3DwJMyxRctSbDnQ%40mail.gmail.com.

Greg Lindahl

Jun 4, 2025, 8:34:44 PM
to common...@googlegroups.com
Mark,

Well, I'm very surprised! That is exactly what the host index wants to facilitate! But it's nowhere near production quality yet.

I'll contact you directly to discuss how we can make sure what you're using continues to work, and how we should move forward.

Everyone else, if this new data product sounds interesting, please take a look at https://github.com/commoncrawl/cc-host-index and pay explicit attention to what it says will change in the future and what the bugs are.

I have a call in with my AWS open dataset account rep to see how I can exclude testing datasets from what the glue crawler returns. Looks like I should have named the folder starting with a '_'. Live and learn.

-- greg


