Meeting Purpose
Sync on Lance community contributions, releases, and key technical discussions.
Key Takeaways
- New Release Process: Adopt a DataFusion-style model where minor releases (e.g., v2.x) are cut from a major release branch (e.g., branch-2.0). This enables faster, safer bug fixes and features for stable versions while main progresses toward the next major release.
- Manifest Scaling Strategy: Defer a spec change for manifest scaling. The immediate solution is to use a "composite table" pattern (a meta-table querying many small tables), which requires no format changes. A benchmark analysis will first define the problem's breaking point.
- Type System Consolidation: Consolidate Lance's type system by adopting Substrait's logical/physical model. This simplifies the user experience by abstracting away Arrow's concrete types (e.g., String vs. LargeString) and aligns with other major databases.
- Standardized Versioning: Create a formal proposal to standardize how users specify versions, tags, and branches. This is critical for consistent time travel across all integrations (e.g., Spark, DuckDB) and will prevent the fragmentation seen in other formats like Iceberg.
Topics
Release Cycle & Process
- Problem: The current release process is slow (2 weeks per release) and lacks a clear strategy for delivering urgent bug fixes to stable versions.
- Solution: Adopt a DataFusion-style release model.
  - When a major release is cut (e.g., v2.0), a dedicated branch is created (branch-2.0).
  - All subsequent minor releases for that version (e.g., v2.1, v2.2) are cut from this branch via cherry-picked PRs.
  - This allows main to progress toward the next major version (e.g., v3.0) without blocking stable-version maintenance.
- Status: The v2.0.1 RC is blocked by integration test failures in an internal environment, which are being resolved.
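As a rough illustration of the branch mapping above, the sketch below derives the maintenance branch for a release tag. The function name `maintenance_branch` and the accepted tag pattern are assumptions made for illustration; only the v2.x and branch-2.0 naming comes from these notes.

```python
import re

# Accepts tags such as "v2.0", "v2.1", or "v2.0.1" (the pattern is an assumption).
_TAG = re.compile(r"v(\d+)\.(\d+)(?:\.(\d+))?")


def maintenance_branch(tag: str) -> str:
    """Return the branch that a release in this major series would be cut from."""
    match = _TAG.fullmatch(tag)
    if match is None:
        raise ValueError(f"unrecognized release tag: {tag!r}")
    major = match.group(1)
    # Every release in the series shares the branch created when vN.0 was cut.
    return f"branch-{major}.0"
```

Under this sketch, v2.1 and v2.2 both resolve to branch-2.0, while work for v3.0 stays on main until that release is cut.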
Dataset Column Statistics
- Status: The write-path MVP is complete and in PR review.
- Plan: Merge the PR, marking the feature as "experimental" to allow for future breaking changes to the manifest format.
- Read-Path Use Cases:
  - Query Engines: Provide statistics to Spark and Trino for query planning and predicate pushdown.
  - Scanner Optimization: Use statistics for filter simplification when no secondary index is available.
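To make the filter-simplification use case concrete, here is a minimal sketch that assumes the statistics include a per-column minimum and maximum and that the column has no nulls. The notes do not say which statistics are collected, and `MinMax`, `Outcome`, and `simplify_greater_than` are hypothetical names, not Lance APIs.

```python
from dataclasses import dataclass
from enum import Enum


class Outcome(Enum):
    ALWAYS_TRUE = "always_true"      # every row matches; the filter can be dropped
    ALWAYS_FALSE = "always_false"    # no row can match; the scan can be skipped
    MUST_EVALUATE = "must_evaluate"  # statistics are inconclusive


@dataclass(frozen=True)
class MinMax:
    """Hypothetical min/max statistics for one integer column (nulls ignored)."""
    min_value: int
    max_value: int


def simplify_greater_than(stats: MinMax, threshold: int) -> Outcome:
    """Simplify the predicate `column > threshold` using only min/max."""
    if stats.max_value <= threshold:
        return Outcome.ALWAYS_FALSE
    if stats.min_value > threshold:
        return Outcome.ALWAYS_TRUE
    return Outcome.MUST_EVALUATE
```

The same comparison could serve predicate pushdown in a query engine, where an ALWAYS_FALSE result lets the planner skip the data entirely.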
Manifest Size for Large Tables
- Problem: The current single-file manifest will become a bottleneck for tables with millions of fragments, impacting performance for operations like opening the table.
- Proposed Solution: Implement a two-level manifest structure, similar to Iceberg.
- Decision: Defer a spec change.
- Rationale: The problem is not yet well-defined. The immediate solution is a "composite table" pattern, which requires no format changes.
- Action: Create a benchmark analysis with a 1M-fragment dataset to identify performance bottlenecks and define the problem's scope.
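The "composite table" pattern can be pictured roughly as follows. This is a sketch under stated assumptions, not a Lance API: `SmallTable` and `CompositeTable` are hypothetical names, rows are plain dictionaries, and routing by a string partition key is one possible design among several.

```python
from dataclasses import dataclass, field
from typing import Iterator, Optional

Row = dict


@dataclass
class SmallTable:
    """Stand-in for one small table; a real one would be opened lazily."""
    name: str
    rows: list[Row]


@dataclass
class CompositeTable:
    """Meta-table that maps a partition key to the small table holding it."""
    children: dict[str, SmallTable] = field(default_factory=dict)

    def add(self, key: str, table: SmallTable) -> None:
        self.children[key] = table

    def scan(self, keys: Optional[list[str]] = None) -> Iterator[Row]:
        """Yield rows from the selected children, or from all when keys is None."""
        wanted = list(self.children) if keys is None else keys
        for key in wanted:
            table = self.children.get(key)
            if table is not None:  # unknown keys are skipped silently
                yield from table.rows
```

Because each child keeps its own small manifest, a query that names a few keys never touches the metadata of the other children, which is why the pattern needs no format change.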
Type System Consolidation
- Problem: Lance exposes Arrow's concrete types (e.g., String, LargeString), creating user confusion and requiring complex logic in integrations.
- Solution: Consolidate the type system by adopting Substrait's logical/physical model.
  - Logical Type: A single, high-level type (e.g., String).
  - Physical Type: The underlying Arrow concrete type (e.g., String or LargeString).
- Rationale: This simplifies the user experience by mirroring other major databases (Postgres, Snowflake) and avoids introducing a new, custom type system.
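A minimal sketch of the logical/physical split, assuming physical types are identified by pyarrow-style names such as `string` and `large_string`. The `LogicalType` enum, the lookup table, and the binary entries are illustrative assumptions, not part of Lance or Substrait.

```python
from enum import Enum


class LogicalType(Enum):
    STRING = "string"
    BINARY = "binary"


# Several physical layouts collapse into one user-facing logical type.
PHYSICAL_TO_LOGICAL = {
    "string": LogicalType.STRING,
    "large_string": LogicalType.STRING,
    "binary": LogicalType.BINARY,
    "large_binary": LogicalType.BINARY,
}


def logical_type(physical_name: str) -> LogicalType:
    """Map a physical type name to the logical type a user would see."""
    try:
        return PHYSICAL_TO_LOGICAL[physical_name]
    except KeyError:
        raise ValueError(f"no logical type for {physical_name!r}") from None


def same_logical_type(a: str, b: str) -> bool:
    """True when two physical types are interchangeable at the logical level."""
    return logical_type(a) is logical_type(b)
```

With a table like this, an integration could compare schemas at the logical level and leave the choice of physical layout to the storage layer.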
Standardizing Version, Tag, and Branch References
- Problem: There is no standard way for users to specify versions, tags, or branches, leading to inconsistent time travel implementations across integrations (e.g., Spark, DuckDB).
- Goal: Define a single, consistent reference specification in the Lance core library to prevent fragmentation.
- Proposed Approaches:
  - Xuanwo: Use a simple heuristic: numbers are versions, non-numbers are tags/branches. This requires enforcing unique names across these reference types.
  - Jack: Use a prefix (e.g., ref/) to explicitly distinguish references from version numbers.
- Action: Xuanwo will create a formal proposal to drive this discussion.
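The two proposed approaches can be compared with a small sketch. All names here (`VersionRef`, `NamedRef`, `parse_heuristic`, `parse_prefixed`) are hypothetical, and rejecting unprefixed non-numeric input in the prefix variant is an assumption, since the notes do not specify that case.

```python
from dataclasses import dataclass
from typing import Union


@dataclass(frozen=True)
class VersionRef:
    number: int


@dataclass(frozen=True)
class NamedRef:
    name: str  # a tag or a branch


Ref = Union[VersionRef, NamedRef]


def _is_version(text: str) -> bool:
    # Restrict to ASCII digits so that int() cannot fail on other digit characters.
    return text.isascii() and text.isdigit()


def parse_heuristic(text: str) -> Ref:
    """Heuristic approach: numbers are versions, anything else is a tag/branch."""
    if _is_version(text):
        return VersionRef(int(text))
    return NamedRef(text)


def parse_prefixed(text: str, prefix: str = "ref/") -> Ref:
    """Prefix approach: named references must carry an explicit prefix."""
    if text.startswith(prefix):
        return NamedRef(text[len(prefix):])
    if _is_version(text):
        return VersionRef(int(text))
    raise ValueError(f"expected a version number or a {prefix!r}-prefixed name: {text!r}")
```

The sketch shows the trade-off: with the heuristic, a tag named 42 cannot be addressed because it always parses as a version, while the prefix form can express it as ref/42 at the cost of more verbose input.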
Next Steps
- Jack:
  - Create a GitHub discussion to formalize the new release process.
  - Reply to the "external manifest store" thread to clarify its interaction with the versioning API discussion.
- Weston:
  - Clean up and merge the column statistics PR, adding "experimental" warnings.
  - Document the read-path follow-up plan for column statistics.
  - Add a comment to the type system discussion linking to the Substrait model.
- Xuanwo:
  - Create a formal proposal for standardizing version, tag, and branch references.
- All:
  - Create a GitHub issue to track the benchmark analysis for manifest scaling.