[Proposal] More flexible dimension types and indexing
=============
Currently, Druid assumes that all dimensions have string values and are associated with a bitmap index.
There has been interest in loosening these constraints to support use cases that blur the existing separation between dimensions and metrics, e.g., filtering on numeric columns or aggregating dimensions at query time.
A recent discussion on these topics can be found here:
This proposal calls for two major changes/features:
-----
1.) Remove the assumption that dimensions always have string values.
This change is a path towards reducing the distinction between dimensions and metrics.
This would involve changes to:
- IncrementalIndex, IndexMerger, etc. (ingestion)
- StorageAdapters (querying)
- On-disk format of segments (storage)
- Ingestion specs, to allow users to specify dimension types (e.g., String, Long, Double)
(Perhaps this is also a good time to redesign spatial dimensions to be less of a "special case"?)
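
To make the typed-dimension idea a bit more concrete, here is a rough sketch of how a declared dimension type could drive value coercion at ingestion time. Everything in it is an assumption for illustration: `DimensionType`, `TypedDimensionParser`, `coerce`, and `parseRow` are hypothetical names and not existing Druid classes or methods.

```java
import java.util.LinkedHashMap;
import java.util.Map;

// Hypothetical: the set of dimension types a user could declare in an ingestion spec.
enum DimensionType {
    STRING, LONG, DOUBLE
}

// Hypothetical helper, not part of Druid: coerces raw input values to the declared type.
class TypedDimensionParser {

    // Convert one raw value to the Java type that matches the declared dimension type.
    // Numbers are narrowed or widened as needed; anything else is parsed from its string form.
    static Object coerce(DimensionType type, Object raw) {
        if (raw == null) {
            return null;
        }
        switch (type) {
            case LONG:
                return raw instanceof Number
                        ? ((Number) raw).longValue()
                        : Long.parseLong(raw.toString().trim());
            case DOUBLE:
                return raw instanceof Number
                        ? ((Number) raw).doubleValue()
                        : Double.parseDouble(raw.toString().trim());
            default:
                // STRING keeps today's behavior: every value is stored as a string.
                return raw.toString();
        }
    }

    // Apply the schema to a whole input row; columns missing from the row become null.
    static Map<String, Object> parseRow(Map<String, DimensionType> schema,
                                        Map<String, Object> rawRow) {
        Map<String, Object> typed = new LinkedHashMap<>();
        for (Map.Entry<String, DimensionType> e : schema.entrySet()) {
            typed.put(e.getKey(), coerce(e.getValue(), rawRow.get(e.getKey())));
        }
        return typed;
    }
}
```

With a schema declaring `userId` as `LONG`, the raw value `"42"` would be stored as a `Long` rather than as the string `"42"`, which is what would let later stages compare and filter it numerically.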
-----
2.) Allow users to choose index strategies on a per-column basis.
Druid could support a wider range of index types beyond bitmaps. Giving users control over which indexes are used for each column could make Druid more powerful and efficient.
For example, if a dimension is expected to have high cardinality and range filters applied to it, the user may want to choose a tree-based index instead of bitmaps.
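
As a rough illustration of why a tree-based index suits range filters, the sketch below keeps the distinct values of a numeric dimension in sorted order, so a range filter only visits the values inside the range instead of checking every distinct value. `SortedRangeIndex` and its methods are hypothetical names, and `java.util.BitSet` merely stands in for whatever row-set representation a real index would use.

```java
import java.util.BitSet;
import java.util.TreeMap;

// Hypothetical sketch of a tree-based index over a numeric (long) dimension.
class SortedRangeIndex {

    // Distinct dimension values in sorted order, each mapped to the rows containing it.
    private final TreeMap<Long, BitSet> rowsByValue = new TreeMap<>();

    // Record that the given row holds the given value.
    void add(int rowId, long value) {
        rowsByValue.computeIfAbsent(value, v -> new BitSet()).set(rowId);
    }

    // Return the rows whose value lies in [lo, hi], both bounds inclusive.
    BitSet rowsInRange(long lo, long hi) {
        BitSet result = new BitSet();
        if (lo > hi) {
            return result; // empty range; also avoids subMap's IllegalArgumentException
        }
        // subMap walks only the keys inside the range, thanks to the sorted order.
        for (BitSet rows : rowsByValue.subMap(lo, true, hi, true).values()) {
            result.or(rows);
        }
        return result;
    }
}
```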
As another example, trie indexes could be used to better support text search on dimension values.
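
A minimal sketch of the trie idea, limited to prefix matching over dimension values, might look like the following. `PrefixTrie` is a hypothetical name; a real index would presumably map each matched value to its rows (for example via dictionary ids) rather than return the strings themselves.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.Map;
import java.util.TreeMap;

// Hypothetical sketch: a character trie over the distinct values of a string dimension.
class PrefixTrie {

    private static final class Node {
        // TreeMap keeps children sorted, so matches come back in lexicographic order.
        final Map<Character, Node> children = new TreeMap<>();
        boolean terminal; // true if a stored value ends at this node
    }

    private final Node root = new Node();

    // Add one dimension value to the trie.
    void insert(String value) {
        Node node = root;
        for (char c : value.toCharArray()) {
            node = node.children.computeIfAbsent(c, k -> new Node());
        }
        node.terminal = true;
    }

    // Return every stored value that starts with the given prefix.
    List<String> valuesWithPrefix(String prefix) {
        List<String> out = new ArrayList<>();
        Node node = root;
        for (char c : prefix.toCharArray()) {
            node = node.children.get(c);
            if (node == null) {
                return out; // no stored value has this prefix
            }
        }
        collect(node, new StringBuilder(prefix), out);
        return out;
    }

    // Depth-first walk below a node, rebuilding each stored value along the way.
    private static void collect(Node node, StringBuilder path, List<String> out) {
        if (node.terminal) {
            out.add(path.toString());
        }
        for (Map.Entry<Character, Node> e : node.children.entrySet()) {
            path.append(e.getKey().charValue());
            collect(e.getValue(), path, out);
            path.deleteCharAt(path.length() - 1);
        }
    }
}
```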
The existing ColumnCapabilities class could be used to describe what indexes are supported for a column.
This would involve substantial changes to:
- query-related components
- on-disk storage format
- ingestion specs
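
One possible shape for per-column index selection is sketched below, purely as an assumption to anchor discussion. `IndexStrategy` and `ColumnIndexConfig` are hypothetical names; this does not reflect the actual ColumnCapabilities API, only the kind of information it (or something like it) would need to expose to the query layer.

```java
import java.util.HashMap;
import java.util.Map;

// Hypothetical: index types a user could pick for a column.
enum IndexStrategy {
    BITMAP, SORTED_TREE, TRIE, NONE
}

// Hypothetical per-datasource configuration of index strategies.
class ColumnIndexConfig {

    private final Map<String, IndexStrategy> overrides = new HashMap<>();

    // Choose a strategy for one column; returns this to allow chaining.
    ColumnIndexConfig with(String column, IndexStrategy strategy) {
        overrides.put(column, strategy);
        return this;
    }

    // Columns without an explicit choice keep the current behavior: a bitmap index.
    IndexStrategy strategyFor(String column) {
        return overrides.getOrDefault(column, IndexStrategy.BITMAP);
    }

    // The kind of question the query layer might ask before planning a range filter.
    boolean supportsEfficientRangeFilter(String column) {
        return strategyFor(column) == IndexStrategy.SORTED_TREE;
    }
}
```

Defaulting to `BITMAP` is a deliberate choice in this sketch: existing ingestion specs that say nothing about indexes would behave exactly as they do today.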
-----
These two changes are conceptually separate and could probably be implemented independently (with some later merging/adjustment), although each would be more beneficial with the other in place.
However, implementing the non-String dimensions first would probably be cleaner and cause less churn.
A roadmap for this proposal could be:
1. Replace IncrementalIndex with a new Index that accepts non-String dimensions
2. Adjust disk storage format and ingestion specs
3. Update querying logic to be aware of non-String dimensions
4. Implement per-column index type selection
5. Add a new index type that better supports range queries (these appear to be the most common use case for numeric dimensions?)
6. Add other index types as needed
If this plan sounds sensible, shall we go ahead and create GitHub issues for the roadmap items? I am thinking of working on the first item.
Thanks,
Jon