A Critique of Iceberg REST Catalog: A Classic Case of Why Semantic Spec Fails


Polovina, Simon (BTE)

Jan 9, 2026, 5:55:33 AM
to ontolog-forum

Hi all.

This article raises interesting issues regarding the dimensions and feasibility of semantic specifications.

May interest 🤔

Simon

 

---------- Forwarded message ---------
From: Ananth Packkildurai from Data Engineering Weekly <dataengine...@substack.com>



 

A Critique of Iceberg REST Catalog: A Classic Case of Why Semantic Spec Fails

How a Semantically Correct API Becomes Operationally Unreliable at Scale

Jan 9


“Latency is not just a performance characteristic; it is a fundamental part of correctness.” (Designing Data-Intensive Applications)

In Designing Data-Intensive Applications, Martin Kleppmann makes a subtle but critical point: the CAP theorem omits latency, yet in real systems, latency often determines whether a system is usable at all. A system that is correct but slow is, in practice, incorrect.

This observation is directly applicable to the Apache Iceberg REST Catalog specification. While the specification achieves semantic clarity, it fails to define the operational realities that enable distributed systems to remain predictable at scale. The result is a standard that is formally correct, yet operationally fragile.


Semantic Interoperability Without Predictability

Over the past two years, the Iceberg REST Catalog specification has emerged as the de facto standard for metadata access in the Iceberg ecosystem. We have even seen a catalog war break out around the REST spec. It promises a universal interface that allows engines such as Trino, Spark, Flink, and StarRocks to interact with Iceberg tables via a common REST abstraction, independent of the underlying catalog implementation.

At the semantic level, this promise largely holds. The specification rigorously defines metadata structures: tables, schemas, snapshots, and namespace operations. A LoadTable or CreateNamespace request looks identical across implementations. This semantic interoperability has been critical to Iceberg’s rapid ecosystem adoption.

However, semantic interoperability alone is insufficient. The specification defines what metadata operations mean, but it avoids specifying how they must behave in real-world conditions, such as concurrency, latency sensitivity, and cross-catalog synchronization.

This gap—between semantic interoperability and operational interoperability—is where systems begin to fail in production.


The Core Problem: No Operational SLA, No Predictability

The Iceberg REST Catalog specification is intentionally silent on performance guarantees. There are no latency expectations, no throughput baselines, and no service-level objectives. While this flexibility lowers the barrier to implementation, it creates an ecosystem where:

·         Two catalogs can both be “compliant” yet differ by orders of magnitude in response time.

·         Clients cannot reason about metadata latency during query planning.

·         Synchronization behavior across catalogs becomes unpredictable.

In distributed data systems, predictability matters more than raw performance. Without a strict operational SLA—or at least defined behavioral constraints—clients are forced into defensive, retry-heavy designs that amplify load and increase tail latency.


The “List Tables” Problem: Cross-Catalog Sync Failure

The ListTables endpoint (GET /v1/namespaces/{namespace}/tables) is semantically straightforward. It allows clients to enumerate tables within a namespace and supports pagination through pageSize and pageToken.
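
As a client-side sketch, the pagination contract looks simple. Here fetch_page is a stand-in for the actual HTTP call, and the response keys (identifiers, next-page-token) follow the shape the REST spec uses for listing responses; the in-memory fake_fetch server is purely hypothetical:

```python
def list_all_tables(fetch_page, namespace, page_size=100):
    """Drain a paginated listing by following next-page-token until exhausted."""
    tables, token = [], None
    while True:
        resp = fetch_page(namespace, page_size, token)
        tables.extend(resp["identifiers"])
        token = resp.get("next-page-token")
        if token is None:
            return tables

# Hypothetical in-memory "catalog" holding 250 tables, for illustration only.
def fake_fetch(namespace, page_size, token):
    start = int(token or 0)
    end = min(start + page_size, 250)
    resp = {"identifiers": [f"{namespace}.t{i}" for i in range(start, end)]}
    if end < 250:
        resp["next-page-token"] = str(end)
    return resp
```

Note that nothing in this loop can bound how long each fetch_page call takes, which is exactly the gap discussed below.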

The primary issue is not pagination itself. The real failure emerges when the same Iceberg tables are registered in multiple catalogs, a pattern that is increasingly common in hybrid and multi-platform deployments.

A Realistic Scenario

·         An Iceberg table is registered in Catalog A and Catalog B.

·         Both catalogs point to the same underlying metadata and object storage.

·         One catalog is used by ingestion and streaming workloads.

·         Analytics engines or BI tools use the other.

The Sync Pathology

When a client connects to Catalog B and issues a metadata discovery operation—such as listing tables or syncing namespace state—the catalog must:

1.       Enumerate all tables.

2.       Resolve metadata pointers.

3.       Validate access permissions.

4.       Reconcile the state with the underlying storage.

Because the REST specification defines no operational expectations:

·         There is no SLA for how long this sync should take.

·         There is no distinction between a “lightweight” listing and a fully validated listing.

·         There is no mechanism to express intent (e.g., names only, no ACL validation).

As table counts grow into the tens of thousands, synchronization latency grows non-linearly. In practice, sync operations can take minutes—or fail—causing engines to stall, time out, or repeatedly retry.
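
A back-of-the-envelope model (all constants illustrative, not taken from the spec or any implementation) shows why per-table validation turns discovery into minutes:

```python
def naive_sync_seconds(table_count, calls_per_table=3, rtt_ms=5.0):
    """Model a serial catalog sync that resolves the metadata pointer,
    validates ACLs, and reconciles storage state for every table,
    one round trip each (illustrative constants)."""
    return table_count * calls_per_table * rtt_ms / 1000.0

# 50,000 tables * 3 calls * 5 ms = 750 s, i.e. 12.5 minutes of serial work
```

Parallelism and caching can shave this down, but with no spec-level way to request a lightweight listing, every client pays some multiple of the per-table cost.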

The result is not merely slow metadata access. It is system-wide unpredictability. Query engines cannot determine whether a delay is transient, systemic, or catastrophic.


Latency Is Treated as an Implementation Detail—But It Is a Contract

The REST Catalog specification implicitly treats latency as an implementation concern. From a standards perspective, this is understandable. But in data-intensive systems, latency is part of the correctness contract.

The specification does not define:

·         Upper bounds on metadata retrieval latency

·         Maximum metadata payload sizes

·         Limits on metadata fan-out operations

·         The number of round trips required to plan a query

As a result, a compliant catalog may require megabytes of JSON metadata and dozens of HTTP calls just to validate a single query plan. Engines appear slow and unstable, even though the root cause lies in an underspecified protocol.

This is precisely the class of problem Kleppmann warns about: correctness without latency guarantees is operationally meaningless.


Commit Semantics Under Contention: Undefined and Unfair

Iceberg relies on optimistic concurrency control. When multiple writers attempt to commit simultaneously, conflicts are expected and resolved through retries.

The REST specification defines the 409 Conflict response, but stops there. It does not define:

·         Backoff expectations

·         Retry fairness

·         Starvation prevention

In a multi-engine environment, this creates asymmetric outcomes. A high-frequency streaming writer with aggressive retries can permanently starve batch compaction jobs that follow conservative retry policies. Over time, table health degrades due to file explosion and unbounded metadata growth.
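
Because the spec is silent here, each client invents its own policy. One widely used compromise is capped exponential backoff with full jitter; randomizing the delay decorrelates retries so an aggressive writer is less likely to monopolize the commit window. A minimal sketch (the constants are illustrative, not spec-mandated):

```python
import random

def backoff_delay(attempt, base=0.2, cap=30.0):
    """Capped exponential backoff with full jitter: sleep a random amount
    in [0, min(cap, base * 2**attempt)] before retrying a 409 Conflict.
    The base and cap here are illustrative defaults, not from the spec."""
    return random.uniform(0.0, min(cap, base * (2 ** attempt)))
```

The point of the critique is that unless the spec standardizes something like this, a streaming writer using base=0.01 and a compaction job using base=5.0 are both "compliant" yet systematically unfair to each other.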

Once again, the issue is not semantic correctness. It is the absence of operational guarantees.


Caching Without a Freshness Model

While HTTP caching is permitted, it is not part of the correctness model. Support for conditional requests, ETags, or freshness validation is optional.

This forces clients into a pessimistic stance: always re-fetch, always revalidate, always assume staleness. The REST protocol degenerates into a chatty, high-latency control plane that negates its own architectural benefits.

Without a standardized freshness contract, caching becomes a gamble rather than a reliability tool.
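
A conditional-request sketch shows what a standardized freshness contract could buy: a 304 Not Modified lets the client reuse its cached metadata instead of re-downloading it. Here http_get and fake_get are stand-ins; nothing in the REST spec currently obliges a server to honor If-None-Match:

```python
def fetch_metadata(http_get, url, cache):
    """Conditional GET: send If-None-Match with the cached ETag; on 304,
    reuse the cached body instead of re-fetching the full metadata."""
    headers = {}
    if url in cache:
        headers["If-None-Match"] = cache[url][0]
    status, etag, body = http_get(url, headers)
    if status == 304:
        return cache[url][1]
    cache[url] = (etag, body)
    return body

# Hypothetical server holding one metadata document at ETag "v1".
def fake_get(url, headers):
    if headers.get("If-None-Match") == '"v1"':
        return 304, '"v1"', None
    return 200, '"v1"', '{"table": "metadata"}'
```

With the validation path optional, a careful client must assume the second call costs as much as the first.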


Behavioral Conformance Is Missing

The Iceberg ecosystem has strong conformance testing for table formats. It lacks an equivalent for catalog behavior.

Today, “REST Catalog compliant” means:

·         The endpoints exist.

·         The JSON schema is correct.

·         The happy path works.

It does not mean:

·         Predictable latency under load

·         Stable pagination during concurrent updates

·         Graceful overload signaling

·         Bounded retry amplification

Without behavioral conformance tests, compliance guarantees syntax, not operability.
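
A behavioral conformance check would look less like schema validation and more like a latency harness. A minimal sketch, where the measured operation and any budget are supplied by the test suite (all names here are hypothetical):

```python
import time

def p99_latency(op, samples=200):
    """Measure an operation's p99 latency in seconds over repeated calls."""
    timings = []
    for _ in range(samples):
        t0 = time.perf_counter()
        op()
        timings.append(time.perf_counter() - t0)
    timings.sort()
    return timings[int(0.99 * (samples - 1))]

# A behavioral conformance suite could then assert, for example:
#   assert p99_latency(lambda: client.load_table("db.t")) < budget
```

Pairing checks like this with concurrent-update pagination tests and overload-signaling tests is what would move "compliant" from syntax toward operability.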


Underspecification Is Still a Design Decision

The absence of operational constraints is not accidental. It reflects a deliberate choice to prioritize adoption and flexibility.

However, in distributed systems, underspecification pushes complexity downstream. It burdens clients, operators, and platform teams with the need to implement compensating logic. As Iceberg becomes core infrastructure rather than experimental tooling, this trade-off increasingly limits its reliability.

Semantic agreement without behavioral agreement leads to fragile systems.


Toward Operational Interoperability

Operational interoperability does not require rigid SLAs or centralized control. It requires acknowledging that latency, retries, and fairness are part of the interface.

Concrete improvements could include:

·         Defined operational profiles with minimum latency and concurrency expectations

·         Lightweight metadata views to avoid synchronization amplification

·         Standardized retry and backoff semantics for conflict scenarios

·         Explicit freshness and caching contracts

Semantic interoperability enabled Iceberg’s success. Operational interoperability will determine whether it remains dependable at scale.

Until then, the Iceberg REST Catalog remains a textbook example of why semantic specifications alone are not enough.


All rights reserved, Dewpeche Private Limited. I have provided links for informational purposes and do not suggest endorsement. All views expressed in this newsletter are my own and do not represent current, former, or future employers’ opinions.

 


© 2026 Ananth Packkildurai
San Francisco

Mike Peters

Jan 15, 2026, 6:19:43 PM
to ontolog-forum
Hi Simon

I read the article last week as well. He makes some valid points. It would be great if the next version of the Iceberg REST Catalog fixed this.

Mike Peters
-----------------------------------
Ajabbi
 
PO Box 902
Invercargill 9840
New Zealand
 
M +64 22 600 5006
Email mi...@redworks.co.nz
Meetings https://calendly.com/mike-ajabbi
 
Conservation Project www.mtchocolate.com  
Film Art www.redworks.co.nz
Software Engineering www.blog.ajabbi.com
------------------------------------------