Meeting Purpose
Sync on Lance community topics, including performance, caching, and metadata.
Key Takeaways
- Commit Performance: A new "hint file" optimization was benchmarked. On S3 Express, manifest load time went from growing linearly with the number of commits to a flat ~40ms. The spec change requires a community vote.
- Multi-Tenant Caching: The current session-based caching model is flawed, causing cross-tenant data contamination and potential security bypasses. A redesign is needed.
- Metadata Tables: Exposing metadata as a table is a popular idea but risks poor performance on large datasets. The group favors exposing it via functions that return Arrow tables.
- Release: RC2 was created to include recent fixes and a DataFusion upgrade. A versioning issue was noted where 4.0-beta and 3.0 branches both contain the same "breaking" change.
Topics
Multi-Tenant Caching Flaws
- Problem: The session-based caching model creates issues in multi-tenant environments.
  - Data Contamination: Tenants can be served each other's cached data, causing access issues.
  - Security Bypass: A tenant could read data cached by another tenant who has access, bypassing their own object store permissions.
  - Resource Waste: Caching the same data n times for n tenants is inefficient.
- Root Cause: The index and metadata caches may hold object store pointers that are not keyed by tenant credentials.
- Solution Path:
  - Bug Fix: Investigate and fix the cross-tenant data contamination bug.
  - Refactor: Decouple object store pointers from index/metadata objects to ensure caches are tenant-isolated.
  - Permissioning: Consider a namespace-level permissioning system to enforce access control before caching.
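
The permissioning idea above can be illustrated with a minimal sketch. It is not Lance's cache API: the class names, the grants mapping, and the loader callback are all hypothetical. It assumes access is checked per namespace before every cache lookup, so one shared entry can serve every authorized tenant without letting an unauthorized tenant read a cache hit.

```python
from typing import Callable, Dict, Set, Tuple


class NamespacePermissions:
    """Hypothetical namespace-level ACL: tenant -> namespaces it may read."""

    def __init__(self, grants: Dict[str, Set[str]]):
        self._grants = grants

    def can_read(self, tenant: str, namespace: str) -> bool:
        return namespace in self._grants.get(tenant, set())


class SharedMetadataCache:
    """Toy shared cache that authorizes the caller before every lookup."""

    def __init__(self, permissions: NamespacePermissions,
                 loader: Callable[[str, str], bytes]):
        self._permissions = permissions
        self._loader = loader  # stand-in for reading from the object store
        self._entries: Dict[Tuple[str, str], bytes] = {}
        self.loads = 0  # counts object store reads, for illustration only

    def get(self, tenant: str, namespace: str, key: str) -> bytes:
        # The check runs on hits as well as misses, so a cached entry
        # cannot be used to bypass the caller's own permissions.
        if not self._permissions.can_read(tenant, namespace):
            raise PermissionError(f"{tenant} may not read {namespace}")
        cache_key = (namespace, key)
        if cache_key not in self._entries:
            self._entries[cache_key] = self._loader(namespace, key)
            self.loads += 1
        return self._entries[cache_key]
```

In this sketch the cache key carries no tenant identity, which is only safe because authorization happens first. Keying entries by tenant credentials instead would also isolate tenants, at the cost of the n-fold duplication noted above.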
Metadata Tables
- Goal: Expose structured metadata (e.g., versions, files) as a queryable table.
- Performance Risk: A SELECT * on a metadata table could be very slow on tables with many versions, leading to user frustration.
- Proposed Solution: Expose metadata via functions that return Arrow tables, rather than as a formal table provider.
  - Rationale: This provides the desired data access without setting false performance expectations.
  - Precedent: The existing all_files() function provides a similar view.
- Optimization Idea: Leverage cloud-provider inventory tables (e.g., S3 Inventory) for faster, cheaper file listing.
  - Trade-off: Data may be slightly outdated but is sufficient for periodic cleanup tasks.
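
As a rough illustration of the function-based approach, the sketch below builds one row per version and converts the rows to an Arrow table on request. The function names, the input shape, and the column names (version, timestamp, num_files) are assumptions made for this example, not the schema discussed in the meeting. The conversion uses pyarrow's `Table.from_pylist`.

```python
from typing import Dict, Iterable, List


def list_versions(manifests: Iterable[Dict]) -> List[Dict]:
    """Flatten manifest summaries into one row per version (hypothetical schema)."""
    rows = [
        {
            "version": m["version"],
            "timestamp": m["timestamp"],
            "num_files": len(m["files"]),
        }
        for m in manifests
    ]
    return sorted(rows, key=lambda row: row["version"])


def versions_table(manifests: Iterable[Dict]):
    """Return the same rows as an Arrow table; requires pyarrow."""
    import pyarrow as pa  # imported lazily so list_versions works without it

    return pa.Table.from_pylist(list_versions(manifests))
```

Because the caller invokes a function explicitly, the cost of walking every version is visible at the call site, which matches the rationale above about not setting false performance expectations.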
Commit Performance Optimization
- Goal: Improve performance for tables with many commits, especially on S3 Express.
- Problem: S3 Express listing is unordered, so finding the latest version requires a full scan of all manifests, which scales poorly.
- Solution: "Hint File" Optimization
  - A small hint file is written at commit time, pointing to the latest version.
  - Parallel Resolution: On load, the system simultaneously:
    1. Performs a full manifest listing (backup).
    2. Reads the hint file and performs a HEAD request to check for newer versions.
  - Outcome: Returns the result from whichever process finishes first.
- Benchmark Results (20k sequential commits):
  - S3 Express: Load time dropped from growing linearly with the number of commits to a flat ~40ms.
  - S3 Standard: Showed a smaller improvement (~10ms) but was already performing well.
- Implementation Note: The JSON-based hint file is preferred over encoding the version in the hint file's size, as it avoids creating large, empty files.
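
The parallel resolution described above can be sketched against an in-memory dict standing in for the object store. Everything here is illustrative: the key layout, the hint file name, and its JSON shape are assumptions, and the membership test stands in for a HEAD request. The sketch also assumes the hint path probes successive version numbers until one is missing, which the notes do not spell out.

```python
import json
from concurrent.futures import ThreadPoolExecutor, as_completed
from typing import Dict

HINT_KEY = "_versions/latest_hint.json"  # hypothetical hint file name


def manifest_key(version: int) -> str:
    return f"_versions/{version}.manifest"  # hypothetical key layout


def commit(store: Dict[str, bytes], version: int) -> None:
    """Write the manifest, then update the hint to point at it."""
    store[manifest_key(version)] = b"manifest"
    store[HINT_KEY] = json.dumps({"version": version}).encode()


def latest_by_listing(store: Dict[str, bytes]) -> int:
    """Backup path: scan every key, since the listing is unordered."""
    return max(
        int(key.split("/")[1].split(".")[0])
        for key in list(store)
        if key.endswith(".manifest")
    )


def latest_by_hint(store: Dict[str, bytes]) -> int:
    """Fast path: read the hint, then probe for newer versions."""
    version = json.loads(store[HINT_KEY])["version"]
    while manifest_key(version + 1) in store:  # stands in for a HEAD request
        version += 1
    return version


def resolve_latest(store: Dict[str, bytes]) -> int:
    """Run both strategies and return the first one that succeeds."""
    with ThreadPoolExecutor(max_workers=2) as pool:
        futures = [
            pool.submit(latest_by_listing, store),
            pool.submit(latest_by_hint, store),
        ]
        for future in as_completed(futures):
            try:
                return future.result()
            except (KeyError, ValueError):
                continue  # e.g. hint file missing; fall back to the other path
    raise FileNotFoundError("no manifests found")
```

This toy version waits for both tasks before returning because the executor is closed on exit; a real implementation would presumably cancel or abandon the slower path.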
Release & Versioning
- RC2 Created: Includes recent fixes and a DataFusion upgrade.
- Versioning Issue: The DataFusion upgrade was merged into main and backported to the 3.0 branch.
  - Result: Both 4.0-beta and 3.0 branches contain the same "breaking" change.
  - Impact: Not a major issue, but creates an awkward versioning state.
Next Steps
- Jack:
  - Organize the community vote for the hint file spec change.
  - Investigate and fix the multi-tenant caching bug.
- Will:
  - Verify the RC2 release.
  - Refactor index/metadata caches to decouple object store pointers.
- Kevin:
  - Continue research on metadata tables, focusing on the function-based approach.
- All:
  - Test the hint file optimization under high-contention scenarios.
  - Benchmark performance on Azure Blob Storage, including premium tiers.