Recap of your meeting with Eto Labs


Fathom

Feb 26, 2026, 12:44:38 PM
to Lance Format Devlist
Meeting Purpose

Sync on Lance community topics, including performance, caching, and metadata.

Key Takeaways

- Commit Performance: A new "hint file" optimization was benchmarked, cutting manifest load time on S3 Express from linear to ~40ms. The spec change requires a community vote.
- Multi-Tenant Caching: The current session-based caching model is flawed, causing cross-tenant data contamination and potential security bypasses. A redesign is needed.
- Metadata Tables: Exposing metadata as a table is a popular idea but risks poor performance on large datasets. The group favors exposing it via functions that return Arrow tables.
- Release: RC2 was created to include recent fixes and a DataFusion upgrade. A versioning issue was noted where 4.0-beta and 3.0 branches both contain the same "breaking" change.

Topics

Multi-Tenant Caching Flaws

- Problem: The session-based caching model creates issues in multi-tenant environments.
  - Data Contamination: Tenants can access each other's cached data, causing access issues.
  - Security Bypass: A tenant could read cached data from another tenant who has access, bypassing their own object store permissions.
  - Resource Waste: Caching the same data n times for n tenants is inefficient.
- Root Cause: The index and metadata caches may hold references to object store pointers, which are not keyed by tenant credentials.
- Solution Path:
  - Bug Fix: Investigate and fix the cross-tenant data contamination bug.
  - Refactor: Decouple object store pointers from index/metadata objects to ensure caches are tenant-isolated.
  - Permissioning: Consider a namespace-level permissioning system to enforce access control before caching.

Metadata Tables

- Goal: Expose structured metadata (e.g., versions, files) as a queryable table.
- Performance Risk: A SELECT * on a metadata table could be very slow on tables with many versions, leading to user frustration.
- Proposed Solution: Expose metadata via functions that return Arrow tables, rather than as a formal table provider.
  - Rationale: This provides the desired data access without setting false performance expectations.
  - Precedent: The existing all_files() function provides a similar view.
- Optimization Idea: Leverage cloud-provider inventory tables (e.g., S3 Inventory) for faster, cheaper file listing.
  - Trade-off: Data may be slightly outdated but is sufficient for periodic cleanup tasks.

Commit Performance Optimization

- Goal: Improve performance for tables with many commits, especially on S3 Express.
- Problem: S3 Express listing is unordered, requiring a full scan of all manifests to find the latest version, which scales poorly.
- Solution: "Hint File" Optimization
  - A small hint file is written at commit time, pointing to the latest version.
  - Parallel Resolution: On load, the system simultaneously:
    1. Performs a full manifest listing (backup).
    2. Reads the hint file and performs a HEAD request to check for newer versions.
  - Outcome: Returns the result from whichever process finishes first.
- Benchmark Results (20k sequential commits):
  - S3 Express: Drastically reduced load time from linear to a flat ~40ms.
  - S3 Standard: Showed a smaller improvement (~10ms) but was already performing well.
- Implementation Note: The JSON-based hint file is preferred over using file size for versioning, as it avoids creating large, empty files.

Release & Versioning

- RC2 Created: Includes recent fixes and a DataFusion upgrade.
- Versioning Issue: The DataFusion upgrade was merged into main and backported to the 3.0 branch.
  - Result: Both 4.0-beta and 3.0 branches contain the same "breaking" change.
  - Impact: Not a major issue, but creates an awkward versioning state.

Next Steps

- Jack:
  - Organize the community vote for the hint file spec change.
  - Investigate and fix the multi-tenant caching bug.
- Will:
  - Verify the RC2 release.
  - Refactor index/metadata caches to decouple object store pointers.
- Kevin:
  - Continue research on metadata tables, focusing on the function-based approach.
- All:
  - Test the hint file optimization under high-contention scenarios.
  - Benchmark performance on Azure Blob Storage, including premium tiers.
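
For readers who were not on the call, the hint-file resolution described under Commit Performance Optimization can be sketched roughly as below. This is an illustration only, not the Lance implementation: FakeStore, the key names (manifest_key, HINT_KEY), the JSON layout of the hint, and the forward-probing HEAD loop are all assumptions made for the example. It races a full listing against a hint read and returns whichever succeeds first.

```python
import json
import time
from concurrent.futures import ThreadPoolExecutor, as_completed


class FakeStore:
    """In-memory stand-in for an object store (hypothetical, not a Lance API)."""

    def __init__(self, list_delay=0.0):
        self.objects = {}
        self.list_delay = list_delay  # simulates a slow, unordered listing

    def put(self, key, data):
        self.objects[key] = data

    def get(self, key):
        return self.objects[key]  # raises KeyError if the object is missing

    def head(self, key):
        return key in self.objects

    def list_keys(self, prefix):
        time.sleep(self.list_delay)
        return [k for k in list(self.objects) if k.startswith(prefix)]


HINT_KEY = "_versions/latest_hint.json"  # assumed name, for illustration


def manifest_key(version):
    return f"_versions/{version}.manifest"  # assumed layout, for illustration


def commit(store, version):
    """Write the manifest, then a small JSON hint pointing at it."""
    store.put(manifest_key(version), b"")
    store.put(HINT_KEY, json.dumps({"version": version}).encode())


def latest_by_listing(store):
    """Backup path: scan every manifest and take the maximum version."""
    keys = store.list_keys("_versions/")
    return max(int(k.split("/")[1].split(".")[0])
               for k in keys if k.endswith(".manifest"))


def latest_by_hint(store):
    """Fast path: read the hint, then HEAD forward in case it is stale."""
    version = json.loads(store.get(HINT_KEY))["version"]
    while store.head(manifest_key(version + 1)):
        version += 1
    return version


def resolve_latest(store):
    """Run both paths in parallel; return the first one that succeeds."""
    pool = ThreadPoolExecutor(max_workers=2)
    try:
        futures = [pool.submit(latest_by_listing, store),
                   pool.submit(latest_by_hint, store)]
        for future in as_completed(futures):
            try:
                return future.result()
            except Exception:
                continue  # e.g. missing hint file: wait for the other path
        raise RuntimeError("could not resolve the latest version")
    finally:
        pool.shutdown(wait=False)  # do not block on the slower path
```

Because the listing stays in the race, a missing or stale hint only costs latency rather than correctness in this sketch. How the real design behaves under concurrent writers is one of the open items above.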
Meeting with Eto Labs
Lance Community Sync
February 26, 2026, 43 mins
Action Items

- Verify RC2; announce to community (Will Jones)
- Run hint-file benchmarks: high-contention; Azure Premium vs Standard (Jack Ye)
- Organize vote on hint-file spec change (Jack Ye)
- Investigate multi-tenant cache cross-contamination; fix if confirmed (Jack Ye)
- Audit index/metadata caches for embedded object-store refs; refactor to remove (Will Jones)
- Research cloud inventory tables (S3 Inventory, Azure blob inventory); assess for metadata tables (Kevin Liu)
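
One possible shape for the cache refactor listed above, offered as a sketch under assumptions rather than an agreed design: cached entries hold plain metadata with no object-store handle, and a permission check against the caller's own store runs before any cached entry is returned. All names here (Backend, TenantStore, SharedMetadataCache, ManifestInfo, the prefix-based can_read) are hypothetical and are not Lance APIs.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class ManifestInfo:
    """Plain cached metadata: no object-store handle or credentials inside."""
    path: str
    version: int


class Backend:
    """Hypothetical shared storage; counts reads so cache hits are visible."""

    def __init__(self, versions):
        self.versions = versions
        self.reads = 0


class TenantStore:
    """Hypothetical per-tenant store handle carrying that tenant's permissions."""

    def __init__(self, tenant, readable_prefixes, backend):
        self.tenant = tenant
        self.readable_prefixes = readable_prefixes
        self.backend = backend

    def can_read(self, path):
        # Stand-in for a namespace-level permission check.
        return any(path.startswith(p) for p in self.readable_prefixes)

    def fetch(self, path):
        if not self.can_read(path):
            raise PermissionError(f"{self.tenant} may not read {path}")
        self.backend.reads += 1
        return ManifestInfo(path, self.backend.versions[path])


class SharedMetadataCache:
    """One copy per path, shared by tenants, guarded by a per-call check."""

    def __init__(self):
        self._entries = {}

    def get(self, store, path):
        # Enforce access control before touching the cache, so a cached
        # entry never bypasses the caller's own permissions.
        if not store.can_read(path):
            raise PermissionError(f"{store.tenant} may not read {path}")
        if path not in self._entries:
            self._entries[path] = store.fetch(path)
        return self._entries[path]
```

In this sketch two tenants with access to the same namespace share a single cached entry, which addresses the n-copies concern, while a tenant without access is rejected even when the entry is already cached. Keying the cache by tenant credentials instead would also isolate tenants, at the cost of duplicate entries.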